File names should reflect the contents of the file and include enough information to uniquely identify the data file. File names may contain information such as project acronym, study title, location, investigator, year(s) of study, data type, version number, and file type.
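As a minimal sketch of the advice above, the hypothetical helper below assembles a file name from a few of the listed components. The component order, the underscore separator, and the sample values are assumptions chosen for illustration, not a prescribed convention.

```python
import re


def build_file_name(project, location, data_type, years, version, extension):
    """Join descriptive components into a single file name.

    The field order and the underscore separator are illustrative choices.
    """
    parts = [project, location, data_type, years, f"v{version}"]
    # Replace spaces and other awkward characters so the name stays portable.
    cleaned = [re.sub(r"[^A-Za-z0-9-]+", "-", str(p)).strip("-") for p in parts]
    return "_".join(cleaned) + "." + extension.lstrip(".")


name = build_file_name("XYZ", "Site A", "soil temp", "2010-2011", 2, "csv")
print(name)  # XYZ_Site-A_soil-temp_2010-2011_v2.csv
```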
Terms and phrases that are used to represent categorical data values or for creating content in metadata records should reflect appropriate and accepted vocabularies in your community or institution. Methods used to identify and select the proper terminology include:
To ensure that metadata correctly describes what is actually in a data file, visual inspection or analysis should be done by someone not otherwise familiar with the data and its format. This helps confirm that the metadata is sufficient to describe the data. For example, statistical software can be used to summarize data contents to make sure that data types, ranges, and, for categorical data, the values found are as described in the documentation/metadata.
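The kind of check described above could be sketched as follows, using only the Python standard library. The sample data, the column names, and the `DOCUMENTED` structure are hypothetical; a real check would read the documented types, ranges, and categories from the actual metadata record.

```python
import csv
import io

# Hypothetical data file contents and documented expectations.
DATA = """site,temp_c,cover
A,12.5,forest
B,14.1,grass
A,13.0,forest
"""
DOCUMENTED = {
    "temp_c": {"type": float, "min": -40.0, "max": 50.0},
    "cover": {"type": str, "allowed": {"forest", "grass", "water"}},
}


def check_against_metadata(text, documented):
    """Return a list of mismatches between the data and its documentation."""
    problems = []
    rows = list(csv.DictReader(io.StringIO(text)))
    for column, spec in documented.items():
        for line_no, row in enumerate(rows, start=2):  # line 1 is the header
            raw = row[column]
            try:
                value = spec["type"](raw)
            except ValueError:
                problems.append(f"line {line_no}: {column}={raw!r} has wrong type")
                continue
            if "min" in spec and not spec["min"] <= value <= spec["max"]:
                problems.append(f"line {line_no}: {column}={value} out of range")
            if "allowed" in spec and value not in spec["allowed"]:
                problems.append(f"line {line_no}: {column}={value!r} not documented")
    return problems


print(check_against_metadata(DATA, DOCUMENTED))  # [] when data match the metadata
```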
A data dictionary provides a detailed description for each element or variable in your dataset and data model. Data dictionaries are used to document important and useful information such as a descriptive name, the data type, allowed values, units, and text description. A data dictionary provides a concise guide to understanding and using the data.
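One possible way to keep a data dictionary machine-readable is sketched below. The variables, values, and the `missing_fields` helper are hypothetical; the five fields simply mirror the items named in the paragraph above.

```python
REQUIRED_FIELDS = (
    "descriptive_name", "data_type", "allowed_values", "units", "description",
)

# Each entry describes one variable; names and values are made up for illustration.
data_dictionary = {
    "site": {
        "descriptive_name": "Study site code",
        "data_type": "text",
        "allowed_values": "A, B",
        "units": "not applicable",
        "description": "Code identifying the plot where the sample was taken",
    },
    "temp_c": {
        "descriptive_name": "Air temperature",
        "data_type": "float",
        "allowed_values": "-40.0 to 50.0",
        "units": "degrees Celsius",
        "description": "Air temperature measured 2 meters above the ground",
    },
}


def missing_fields(dictionary):
    """Map each variable to the required fields that its entry lacks."""
    result = {}
    for variable, entry in dictionary.items():
        missing = [field for field in REQUIRED_FIELDS if not entry.get(field)]
        if missing:
            result[variable] = missing
    return result


print(missing_fields(data_dictionary))  # {} when every entry is complete
```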
A data model documents and organizes data, how it is stored and accessed, and the relationships among different types of data. The model may be abstract or concrete.
Use these guidelines to create a data model:
The parameters reported in the data set need to have names that clearly describe the contents. Ideally, the names should be standardized across files, data sets, and projects, in order that others can readily use the information.
The documentation should contain a full description of the parameter, including the parameter name, how it was measured, the units, and the abbreviation used in the data file.
Spatial coordinates should be reported in decimal degrees to at least 4 (preferably 5 or 6) decimal places. A precision of +/- 0.00001 degrees corresponds to about 1.11 meters at the equator. This does not include uncertainty introduced by a GPS instrument.
Provide latitude and longitude with south latitude and west longitude recorded as negative values, e.g., 80 30' 00" W longitude is -80.5000.
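The sign convention above can be illustrated with a short conversion sketch. The function name `dms_to_decimal` and its argument layout are assumptions made for this example.

```python
def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert degrees/minutes/seconds to signed decimal degrees.

    South latitudes and west longitudes are returned as negative values.
    """
    hemisphere = hemisphere.upper()
    if hemisphere not in {"N", "S", "E", "W"}:
        raise ValueError(f"unknown hemisphere: {hemisphere!r}")
    value = degrees + minutes / 60 + seconds / 3600
    if hemisphere in {"S", "W"}:
        value = -value
    return value


lon = dms_to_decimal(80, 30, 0, "W")
print(f"{lon:.4f}")  # -80.5000
```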
For date, always include four digit year and use numbers for months. For example, the date format yyyy-mm-dd would appear as 2011-03-15 (March 15, 2011).
If Julian day (day of the year) is used, make sure the year field is also supplied. For example, mmm.yyyy would appear as 122.2011, where mmm is the Julian day.
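The two date conventions above can be sketched with Python's standard `datetime` module. This sketch assumes that "Julian day" means the day of the year (1 to 366); the helper names are hypothetical.

```python
from datetime import date, datetime

# Four-digit year and numeric month, written as yyyy-mm-dd.
sample = date(2011, 3, 15)
iso_text = sample.isoformat()
print(iso_text)  # 2011-03-15


def day_of_year_with_year(d):
    """Format a date as day-of-year plus year, e.g. 122.2011."""
    return f"{d.timetuple().tm_yday:03d}.{d.year}"


def parse_day_of_year(text):
    """Recover the calendar date from a day-of-year.year string."""
    return datetime.strptime(text, "%j.%Y").date()


print(parse_day_of_year("122.2011"))  # 2011-05-02
```

Without the year, a value such as 122 cannot be converted back to a calendar date, which is why the year field needs to travel with it.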
If the date is not completely known (e.g. day not known), separate the date into columns for the parts that are known (e.g. separate columns for year and month). Don't invent a day just because the database date format requires one.
When describing the process for creating derived data products, the following information should be included in the data documentation or the companion metadata file:
A description of the contents of the data file should contain the following:
- Define the parameters and the units of each parameter
- Explain the formats for dates, time, geographic coordinates, and other parameters
- Define any coded values
- Describe quality flags or qualifying values
- Define missing values
Data sets or collections are often composed of multiple files that are related. Files may have come from (or still be stored in) a relational database, and the relationships among the data tables or other entities are important if the data are to be reused. These relationships should be documented for a repository.
The research project description should contain the following information:
- Who: project personnel (principal investigator, researchers, technicians, others)
- Where: location and description of study site or sites
- When: range of dates for the project
- Why: rationale for the project (abstract)
- How: description of project methods
Other useful information might include the project title, the overarching project (if any), institution(s) involved, and source of funding.
The spatial extent of your data set or collection as a whole should be described. The minimum acceptable description would be a bounding box describing the northernmost, southernmost, westernmost, and easternmost limits of the data.
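A bounding box of this kind can be derived from the coordinates themselves. The sketch below assumes points are given as (latitude, longitude) pairs in signed decimal degrees and that the data do not cross the 180 degree meridian, where a simple minimum and maximum would give the wrong box. The function name and sample points are hypothetical.

```python
def bounding_box(points):
    """Return the north, south, west, and east limits of (lat, lon) points."""
    points = list(points)
    if not points:
        raise ValueError("at least one point is required")
    lats = [lat for lat, _ in points]
    lons = [lon for _, lon in points]
    # Simple min/max; does not handle data spanning the 180 degree meridian.
    return {
        "north": max(lats),
        "south": min(lats),
        "west": min(lons),
        "east": max(lons),
    }


sites = [(35.9606, -83.9207), (36.0104, -84.2696), (35.5951, -82.5515)]
print(bounding_box(sites))
```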
The temporal extent over which the data within your dataset or collection was acquired or collected should be described. Normally this is done by providing
The units of reported parameters need to be explicitly stated in the data file and in the documentation. We recommend SI units (The International System of Units) but recognize that each discipline has its own commonly used units of measure. The critical aspect here is that the units be defined so that others understand what is reported.
Do not use abbreviations when describing the units. For example, the units for respiration should be written out as moles of carbon dioxide per meter squared per year.
Different types of new data may be created in the course of a project, for instance visualizations, plots, statistical outputs, a new dataset created by integrating multiple datasets, etc. Whenever possible, document your workflow (the process used to clean, analyze and visualize data) noting what data products are created at each step. Depending on the nature of the project, this might be as a computer script, or it may be notes in a text file documenting the process you used (i.e. process metadata).
Identification of any species represented in the data set should be as complete as possible.
- Use a standard taxonomy whenever possible
- Provide the full taxonomic tree to the most specific level available
- Source of taxonomy should accompany taxonomic tree (if available)
- References used for taxonomic identification should be provided, if appropriate (e.g. technical document, journal article, book, database, person, etc.)
Examples of standardized identification systems:
The following are strategies for effective data organization:
In order for a large dataset to be effectively used by a variety of end users, the following procedures for preparing a virtual dataset are recommended:
Many times significant overlap exists among metadata content standards. You should identify those standards that include the fields needed to describe your data. In order to describe your data, you need to decide what information is required for data users to discover, use, and understand your data. The who, what, when, where, how, why, and a description of quality should be considered. The description should provide enough information so that users know what can and cannot be done with your data.
Choose the right data type and precision for data in each column. As examples: (1) use date fields for dates; and (2) use numeric fields with an appropriate number of decimal places for numbers. Comments and explanations should not be included in a column that is meant to include numeric values only. Comments should be included in a separate column that is designed for text. This allows users to take advantage of specialized search and computing functionality and improves data quality.
For appropriate attribution and provenance of a dataset, the following information should be included in the data documentation or the companion metadata file:
People have different perspectives on what data means to them, and how it can be used and interpreted in different contexts. Data users ranging from community participants to researchers in different domains can provide unique and valuable insights into data through the use of annotation and tagging. The community-generated notes and tags should be discoverable through the data search engine to enhance discovery and use.
When providing capabilities for community tagging and annotations, you should consider the following:
In order to ensure replicable data access:
- Choose a broadly utilized Data Identification Standard based on specific user community practices or preferences
- Consistently apply the standard
- Maintain the linkage
- Participate in implementing infrastructure for consistent access to the resources referenced by the Identifier
A separate column should be used for data qualifiers, descriptions, and flags, otherwise there is the potential for problems to develop during analyses. Potential entries in the descriptor column:
- Potential sources of error
- Missing value justification (e.g. sensor off line, human error, data rejected outside of range, data not recorded)
- Flags for values outside of expected range, questionable etc.
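A minimal sketch of a table that keeps qualifiers in their own column is shown below. The readings, the column names, and the "NA" missing-value code are assumptions for illustration; whatever missing-value code is used should be defined in the documentation.

```python
import csv
import io

# Hypothetical readings: (timestamp, value or None, flag text).
readings = [
    ("2011-03-15T10:00", 12.5, ""),
    ("2011-03-15T11:00", None, "missing: sensor off line"),
    ("2011-03-15T12:00", 87.0, "questionable: outside expected range"),
]


def to_csv(rows, missing_code="NA"):
    """Write readings with the flag in a separate text column."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["timestamp", "temp_c", "flag"])
    for timestamp, value, flag in rows:
        # The value column holds only a number or the missing-value code.
        writer.writerow([timestamp, missing_code if value is None else value, flag])
    return buffer.getvalue()


print(to_csv(readings))
```

Because explanations never share a cell with a measurement, the value column can be read as numeric without special parsing.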
Delimit the columns within a data table using commas or tabs; these are listed in order of preference. Semicolons are used in many systems as line end delimiters and may cause problems if data are imported into those systems (e.g. SAS, PHP scripts). Avoid delimiters that also occur in the data fields. If this cannot be avoided, enclose data fields that also contain a delimiter in single or double quotes.
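The quoting rule above can be sketched with Python's standard `csv` module, which quotes only those fields that contain the delimiter. The sample rows are hypothetical.

```python
import csv
import io

rows = [
    ["site", "species", "notes"],
    ["A", "Quercus alba", "leaf litter, partly decomposed"],
]

buffer = io.StringIO()
writer = csv.writer(
    buffer,
    delimiter=",",
    quotechar='"',
    quoting=csv.QUOTE_MINIMAL,  # quote only fields that need it
    lineterminator="\n",
)
writer.writerows(rows)
text = buffer.getvalue()
print(text)
# site,species,notes
# A,Quercus alba,"leaf litter, partly decomposed"

# Reading the text back recovers the original fields, comma included.
parsed = list(csv.reader(io.StringIO(text)))
```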
An example of a consistently delimited data file with a header row:
Be consistent in the use of codes to indicate categorical variables, for example species names, sites, or land cover types. Codes should always be the same within one data set. Pay particular attention to spelling and case; the most frequent problems are with abbreviations for species names and sites.
Consistent codes can be achieved most easily by defining standard categorical variables (codes) and using drop-down lists (Excel, database). Frequently a code is needed for ‘none of the above’ or ‘unknown’ or ‘other’ to avoid imprecise code assignment.
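When drop-down lists are not available, existing entries can be checked against the agreed list of codes. The sketch below is one way to do that; the species codes and the function name are hypothetical.

```python
# Hypothetical code list, including explicit codes for unknown and other.
ALLOWED_CODES = {"QUAL", "ACRU", "UNKNOWN", "OTHER"}


def find_inconsistent_codes(values, allowed):
    """Return entries that do not exactly match an allowed code.

    Each entry maps to the allowed code it matches once case and
    surrounding whitespace are ignored, or to None if there is no match.
    """
    by_normalized = {code.upper(): code for code in allowed}
    issues = {}
    for value in values:
        if value in allowed:
            continue
        issues[value] = by_normalized.get(value.strip().upper())
    return issues


entries = ["QUAL", "qual", "Acru ", "PIST"]
print(find_inconsistent_codes(entries, ALLOWED_CODES))
# {'qual': 'QUAL', 'Acru ': 'ACRU', 'PIST': None}
```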