All Best Practices
To make your data available using standard and open software tools you should:
- Use standard language and terms to clearly communicate to others that your data are available for reuse and that you expect ethical and appropriate use of your data
- Use an open source datacasting (RSS or other type) service that enables you to advertise your data and the options for others to obtain access to it (RSS, GeoRSS, DatacastingRSS)
File names should reflect the contents of the file and include enough information to uniquely identify the data file. File names may contain information such as project acronym, study title, location, investigator, year(s) of study, data type, version number, and file type.
When choosing a file name, check for any database management limitations on file name length and use of special characters. Also, in general, lower-case names are less software and platform dependent. Avoid using...
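As a minimal sketch of the naming advice above, the following assembles a lower-case file name from descriptive parts. The function name `build_file_name`, the choice of parts, the separators, and the example values are all hypothetical. The excerpt is cut off before it says which characters to avoid, so the rule of keeping only letters, digits, hyphens, and underscores is an assumption, not the source's list.

```python
import re

def build_file_name(project, location, year, data_type, version, extension):
    """Join descriptive parts into one lower-case file name (illustrative only)."""
    parts = [project, location, str(year), data_type, f"v{version}"]
    cleaned = []
    for part in parts:
        # Assumption: replace any run of characters other than a-z and 0-9 with "-".
        slug = re.sub(r"[^a-z0-9]+", "-", part.lower()).strip("-")
        if slug:
            cleaned.append(slug)
    # Underscores separate the parts; the extension is lower-cased as well.
    return "_".join(cleaned) + "." + extension.lower().lstrip(".")

print(build_file_name("ABC", "Bear Creek", 2011, "soil temp", 2, "CSV"))
# prints: abc_bear-creek_2011_soil-temp_v2.csv
```

Before adopting a pattern like this, check the resulting length against any limits imposed by your database or file system.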
To avoid accidental loss of data you should:
- Back up your data at regular intervals
- When you complete your data collection activity
- After you make edits to your data
- Streaming data should be backed up at regularly scheduled points in the collection process
- High-value data should be backed up daily or more often
- Automation simplifies frequent backups
- Backup strategies (e.g., full, incremental, differential,...
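One way to automate a simple backup step is sketched below: it makes a full copy of a single file under a timestamped name and confirms the copy with a checksum. The function names `sha256_of` and `back_up_file`, the naming pattern, and the use of SHA-256 are assumptions for illustration. Incremental or differential strategies would need more than this.

```python
import hashlib
import shutil
from datetime import datetime, timezone
from pathlib import Path

def sha256_of(path):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

def back_up_file(source, backup_dir):
    """Copy source into backup_dir under a timestamped name and verify the copy."""
    source = Path(source)
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    # A UTC timestamp in the name keeps earlier backups from being overwritten.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    target = backup_dir / f"{source.stem}_{stamp}{source.suffix}"
    shutil.copy2(source, target)
    # Compare checksums so a corrupted copy is detected immediately.
    if sha256_of(source) != sha256_of(target):
        raise OSError(f"backup of {source} failed verification")
    return target
```

A scheduler such as cron could call `back_up_file` after each collection run or editing session.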
Terms and phrases that are used to represent categorical data values or for creating content in metadata records should reflect appropriate and accepted vocabularies in your community or institution. Methods used to identify and select the proper terminology include:
- Identify the relevant descriptive terms used as categorical values in your community prior to the start of the project (e.g., standard terms describing soil horizons, plant taxonomy, sampling methodology or equipment, etc...
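Once a vocabulary has been agreed on, categorical values can be checked against it. The sketch below assumes a hypothetical `soil_horizon` column and an illustrative set of horizon codes. The real list should come from your community's accepted vocabulary.

```python
# Hypothetical controlled vocabulary for a "soil_horizon" column.
SOIL_HORIZONS = {"O", "A", "E", "B", "C", "R"}

def find_unrecognized(values, vocabulary):
    """Return the sorted distinct values that are not in the vocabulary."""
    return sorted(set(values) - set(vocabulary))

observed = ["A", "B", "b", "topsoil", "C", "A"]
print(find_unrecognized(observed, SOIL_HORIZONS))
# prints: ['b', 'topsoil']
```

The check is case-sensitive by design, so inconsistent capitalization shows up as an unrecognized term.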
Information about quality control and quality assurance is an important component of the metadata:
- Qualify (flag) data that have been identified as questionable by including a flagging_column next to the column of data values. The two columns should be properly associated through a naming convention such as Temperature, flag_Temperature.
- Describe the quality control methods applied and their assumptions in the metadata. Describe any software used when performing the...
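The flagging convention can be sketched as follows, using the Temperature, flag_Temperature pairing from the text. The function name `flag_out_of_range`, the flag code "Q", and the range limits are assumptions. A real project would document its own flag codes and checks in the metadata.

```python
def flag_out_of_range(rows, column, low, high):
    """Return copies of rows with a flag_<column> entry; "Q" marks questionable values."""
    flag_name = f"flag_{column}"
    flagged = []
    for row in rows:
        new_row = dict(row)  # leave the original record untouched
        value = new_row[column]
        # An empty flag means the value passed this check.
        new_row[flag_name] = "" if low <= value <= high else "Q"
        flagged.append(new_row)
    return flagged

rows = [{"Temperature": 12.4}, {"Temperature": 85.0}, {"Temperature": -3.1}]
for row in flag_out_of_range(rows, "Temperature", -40.0, 50.0):
    print(row)
```

Only the second record is flagged here. The data value itself is kept rather than deleted, so later users can judge it for themselves.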
To ensure that metadata correctly describes what is actually in a data file, visual inspection or analysis should be done by someone not otherwise familiar with the data and its format. This helps confirm that the metadata is sufficient to describe the data. For example, statistical software can be used to summarize data contents to make sure that data types, ranges, and (for categorical data) the values found are as described in the documentation/metadata.
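A summary of this kind can be produced with a few lines of code. The sketch below assumes the data are already loaded as a dictionary of column lists. The function name `summarize_column` and the example columns are hypothetical.

```python
def summarize_column(values):
    """Summarize a column: numeric range, or distinct values for anything else."""
    present = [v for v in values if v is not None]
    is_numeric = bool(present) and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    )
    if is_numeric:
        return {"type": "numeric", "min": min(present), "max": max(present),
                "count": len(present)}
    return {"type": "categorical", "values": sorted({str(v) for v in present}),
            "count": len(present)}

table = {"depth_cm": [5, 10, 20], "site": ["north", "south", "north"]}
for name, column in table.items():
    print(name, summarize_column(column))
```

A reviewer can then compare each printed summary with the type, range, and allowed values stated in the metadata.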
The integration of multiple data sets from different sources requires that they be compatible. Methods used to create the data should be considered early in the process, to avoid problems later during attempts to integrate data sets. Note that just because data can be integrated does not necessarily mean that they should be, or that the final product can meet the needs of the study. Where possible, clearly state situations or conditions where it is and is not appropriate to use your data,...
A data dictionary provides a detailed description for each element or variable in your dataset and data model. Data dictionaries are used to document important and useful information such as a descriptive name, the data type, allowed values, units, and text description. A data dictionary provides a concise guide to understanding and using the data.
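As an illustration, a small data dictionary can be written down in machine-readable form and used to check records. The variables, allowed values, and the `check_record` helper below are hypothetical. They show one possible layout, not a prescribed format.

```python
# Hypothetical data dictionary: one entry per variable in the dataset.
DATA_DICTIONARY = {
    "site": {
        "name": "Sampling site",
        "type": str,
        "allowed": {"north", "south"},
        "units": None,
        "description": "Plot where the sample was collected",
    },
    "depth_cm": {
        "name": "Sampling depth",
        "type": float,
        "allowed": None,
        "units": "cm",
        "description": "Depth below the surface at which the sample was taken",
    },
}

def check_record(record, dictionary):
    """Return a list of problems found when comparing a record to the dictionary."""
    problems = []
    for field, spec in dictionary.items():
        if field not in record:
            problems.append(f"{field}: missing")
            continue
        value = record[field]
        if not isinstance(value, spec["type"]):
            problems.append(f"{field}: expected {spec['type'].__name__}")
        elif spec["allowed"] is not None and value not in spec["allowed"]:
            problems.append(f"{field}: value {value!r} not allowed")
    # Fields that appear in the data but not in the dictionary are undocumented.
    for field in record:
        if field not in dictionary:
            problems.append(f"{field}: not documented")
    return problems

print(check_record({"site": "north", "depth_cm": 5.0}, DATA_DICTIONARY))
print(check_record({"site": "east", "depth_cm": "5"}, DATA_DICTIONARY))
```

The first record passes with an empty list. The second reports a disallowed site and a depth stored as text.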
A backup policy helps manage users' expectations and provides specific guidance on the "who, what, when, and how" of the data backup and restore process. There are several benefits to documenting your data backup policy:
- Helps clarify the policies, procedures, and responsibilities
- Allows you to dictate:
- where backups are located
- who can access backups and how they can be contacted
- how often data should be backed up...
Data files should be managed to avoid disorder. To facilitate access to files, all storage devices, locations and access accounts should be documented and accessible to team members. Use appropriate tools, such as version control tools, to keep track of the history of the data files. This will help with maintaining files in different locations, such as at multiple off-site backup locations or servers.
Data sets that result in many files structured in a file directory can be...
The process of science generates a variety of products that are worthy of preservation. Researchers should consider all elements of the scientific process in deciding what to preserve:
- Raw data
- Tables and databases of raw or cleaned observation records and measurements
- Intermediate products, such as partly summarized or coded data that are the input to the next step in an analysis
- Documentation of the protocols used
- Software or algorithms...
In the planning process, researchers should carefully consider what data will be produced in the course of their project.
Consider the following:
- What types of data will be collected (e.g., spatial, temporal, instrument-generated, models, simulations, images, video)?
- How many data files of each type are likely to be generated during the project? What size will they be?
- For each type of data file, what are the variables that are expected to be...
In addition to the primary researcher(s), there might be others involved in the research process who take part in aspects of data management. By clearly defining the roles and responsibilities of the parties involved, data are more likely to be available for use by the primary researchers and anyone re-using the data. Roles and responsibilities should be clearly defined, rather than assumed; this is especially important for collaborative projects that involve many researchers, institutions...
A data model documents and organizes data, how it is stored and accessed, and the relationships among different types of data. The model may be abstract or concrete.
Use these guidelines to create a data model:
- Identify the different data components (these are called entities); consider raw and processed data, as well as associated metadata
- Identify the relationships between the different data components (these are called associations)
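The two steps above can be sketched with a concrete model. The entities `Site`, `Sample`, and `Measurement` and their fields are hypothetical. The associations are expressed as one-to-many lists (a site has samples, a sample has measurements), which is only one of several reasonable designs.

```python
from dataclasses import dataclass, field

@dataclass
class Measurement:
    """Entity: one measured value for a named parameter."""
    parameter: str
    value: float
    units: str

@dataclass
class Sample:
    """Entity: one sampling event; associated with many measurements."""
    sample_id: str
    collected_on: str
    measurements: list = field(default_factory=list)

@dataclass
class Site:
    """Entity: a sampling location; associated with many samples."""
    site_id: str
    name: str
    samples: list = field(default_factory=list)

site = Site("S1", "North plot")
sample = Sample("S1-001", "2011-06-01")
sample.measurements.append(Measurement("Temperature", 12.4, "degC"))
site.samples.append(sample)
print(site)
```

The same entities and associations could equally be expressed as tables with foreign keys in a relational database.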
The parameters reported in the data set need to have names that clearly describe the contents. Ideally, the names should be standardized across files, data sets, and projects, so that others can readily use the information.
The documentation should contain a full description of the parameter, including the parameter name, how it was measured, the units, and the abbreviation used in the data file.
A missing value code should also be defined. Use the same notation for...
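A defined missing value code makes it possible to separate absent values from real measurements when a file is read. In the sketch below, the code "-9999" and the function name `parse_values` are assumptions chosen for illustration. Use whatever code your documentation defines.

```python
MISSING_VALUE_CODE = "-9999"  # hypothetical code; must match the documentation

def parse_values(raw_values, missing_code=MISSING_VALUE_CODE):
    """Convert text values to floats, mapping the missing value code to None."""
    return [None if raw.strip() == missing_code else float(raw) for raw in raw_values]

print(parse_values(["12.4", "-9999", "13.0"]))
# prints: [12.4, None, 13.0]
```

Converting the code to `None` on input keeps it from being averaged or plotted as if it were a measurement.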