All Best Practices
Document the steps used to integrate disparate datasets.
- Ideally, one would adopt mechanisms to systematically capture the integration process, e.g. in an executable form such as a script or workflow, so that it can be reproduced
- In lieu of a scientific workflow system, document the process, scripts, or queries used to perform the integration of data in documentation that will accompany the data (metadata)
- Provide a conceptual model that describes the...
The following are strategies for effective data organization:
- Sparse matrix: Optimal data models for storing data avoid sparse matrices, i.e. if many data points within a matrix are empty, a data table with one column for parameters and one column for values may be more appropriate.
- Repetitive information in a wide matrix: repeated categorical information is best handled in separate tables to reduce redundancy in the data table. In database design this is called...
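As a rough illustration of the sparse-matrix point above, the following sketch reshapes a wide, mostly empty table into a parameter/value table. The field names (`site`, `date`, `temp_c`, `ph`, `nitrate_mg_l`) and the helper `wide_to_long` are hypothetical and chosen only for this example.

```python
def wide_to_long(rows, id_fields):
    """Turn wide records into parameter/value records, skipping empty cells."""
    long_rows = []
    for row in rows:
        # Identifier columns are copied onto every output record.
        ids = {key: row[key] for key in id_fields}
        for key, value in row.items():
            if key in id_fields or value in (None, ""):
                continue  # skip identifiers and empty cells
            long_rows.append({**ids, "parameter": key, "value": value})
    return long_rows


# Hypothetical wide table in which most cells are empty.
wide = [
    {"site": "A", "date": "2020-01-01", "temp_c": 4.2, "ph": None, "nitrate_mg_l": None},
    {"site": "B", "date": "2020-01-01", "temp_c": None, "ph": 7.1, "nitrate_mg_l": None},
]

long_rows = wide_to_long(wide, id_fields=("site", "date"))
for record in long_rows:
    print(record)
```

Six measurement cells in the wide form, four of them empty, become two records in the long form.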
Ensuring accuracy of your data is critical to any analysis that follows.
When transcribing data from paper records to a digital representation, have at least two, and preferably more, people transcribe the same data, and compare the resulting digital files. At a minimum, someone other than the person who originally entered the data should compare the paper records to the digital file. Disagreements can then be flagged and resolved.
In addition to transcription accuracy, data...
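The double-entry comparison described above could be sketched as follows. This assumes each transcription has already been loaded as a list of dictionaries, one per row, in the same row order; the function name `compare_transcriptions` and the sample fields are hypothetical.

```python
def compare_transcriptions(entry_a, entry_b):
    """List every cell where two independent transcriptions disagree."""
    mismatches = []
    if len(entry_a) != len(entry_b):
        # A differing row count is itself a disagreement worth flagging.
        mismatches.append(("row count", None, len(entry_a), len(entry_b)))
    for row_number, (row_a, row_b) in enumerate(zip(entry_a, entry_b), start=1):
        for column in sorted(set(row_a) | set(row_b)):
            if row_a.get(column) != row_b.get(column):
                mismatches.append(
                    (row_number, column, row_a.get(column), row_b.get(column))
                )
    return mismatches


# Two hypothetical transcriptions of the same paper record.
first_entry = [{"plot": "1", "height_cm": "12.5"}, {"plot": "2", "height_cm": "17.0"}]
second_entry = [{"plot": "1", "height_cm": "12.5"}, {"plot": "2", "height_cm": "11.0"}]

flagged = compare_transcriptions(first_entry, second_entry)
print(flagged)
```

Each flagged cell can then be checked against the paper record and resolved.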
Quality control practices are specific to the type of data being collected, but some generalities exist:
- Data collected by instruments:
- Values recorded by instruments should be checked to ensure they are within the sensible range of the instrument and the property being measured. Example: Concentrations cannot be < 0, and wind speed cannot exceed the maximum speed that the anemometer can record.
- Analytical results:
- Values measured in...
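A range check of the kind described for instrument data might look like the sketch below. The limits and sample readings are made up for illustration; in particular `ANEMOMETER_MAX` is a hypothetical instrument limit, not a value from any specific sensor.

```python
def flag_out_of_range(values, lower, upper):
    """Return (index, value) pairs that fall outside [lower, upper]."""
    # NaN fails every comparison, so NaN readings are flagged as well.
    return [(i, v) for i, v in enumerate(values) if not (lower <= v <= upper)]


ANEMOMETER_MAX = 60.0  # hypothetical maximum recordable wind speed, m/s

concentrations = [0.12, 0.0, -0.03, 1.4]  # made-up readings
wind_speeds = [3.2, 61.0, 12.5]           # made-up readings

bad_concentrations = flag_out_of_range(concentrations, 0.0, float("inf"))
bad_wind_speeds = flag_out_of_range(wind_speeds, 0.0, ANEMOMETER_MAX)

print(bad_concentrations)
print(bad_wind_speeds)
```

Flagged values are candidates for review rather than automatic deletion.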
When searching for data, whether locally on one's machine or in external repositories, one may use a variety of search terms. In addition, data are often housed in databases or clearinghouses where a query is required in order to access data. To reproduce the search and obtain similar, if not identical, results, it is necessary to document which terms and queries were used.
- Note the location of the originating data set
- Document which...
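One possible way to keep such a record is a small search log written alongside the data. The sketch below is only an assumption about what that log could contain; the field names, the repository URL, and the query string are hypothetical.

```python
import json
from datetime import datetime, timezone


def record_search(source, query, result_count, log):
    """Append one search record (where, what, when, how many hits) to a log."""
    entry = {
        "source": source,              # location of the originating data set
        "query": query,                # exact terms or query used
        "result_count": result_count,  # helps judge whether a rerun matches
        "searched_at": datetime.now(timezone.utc).isoformat(),
    }
    log.append(entry)
    return entry


search_log = []
record_search("https://repository.example.org", "soil moisture AND 2015", 42, search_log)

# Serialize the log so it can accompany the data as documentation.
log_text = json.dumps(search_log, indent=2)
print(log_text)
```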
In order for a large dataset to be effectively used by a variety of end users, the following procedures for preparing a virtual dataset are recommended:
- Identify data service users
- Define data access capabilities needed by community(s) of users. For example:
- Spatial subsetting
- Temporal subsetting
- Parameter subsetting
- Coordinate transformation
- Statistical characterization
- Define service interfaces based...
For successful data replication and backup:
- Users should ensure that backup copies have the same content as the original data file.
- Calculate a checksum for both the original and the backup copies and compare them; if they differ, back up the file again. MD5 is one algorithm for computing a checksum: http://en.wikipedia.org/wiki/MD5
- Compare files to ensure that there are no differences
- Document all procedures (e...
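The checksum comparison above can be sketched with Python's standard `hashlib` module. The helper names `file_checksum` and `backup_matches` are hypothetical. MD5 is used because it is the algorithm named above; passing `algorithm="sha256"` is an alternative where a stronger hash is preferred.

```python
import hashlib


def file_checksum(path, algorithm="md5", chunk_size=1 << 20):
    """Compute a hex digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_matches(original_path, backup_path, algorithm="md5"):
    """Return True when the original and the backup have identical checksums."""
    return file_checksum(original_path, algorithm) == file_checksum(backup_path, algorithm)
```

If `backup_matches(...)` returns `False`, the file should be backed up again and the comparison repeated.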
All storage media, whether hard drives, discs or data tapes, will wear out over time, rendering your data files inaccessible. To ensure ongoing access to both your active data files and your data archives, it is important to continually monitor the condition of your storage media and track its age. Older storage media and media that show signs of wear should be replaced immediately. Use the following guidelines to ensure the ongoing integrity and accessibility of your data:
Many times significant overlap exists among metadata content standards. You should identify those standards that include the fields needed to describe your data. In order to describe your data, you need to decide what information is required for data users to discover, use, and understand your data. The who, what, when, where, how, why, and a description of quality should be considered. The description should provide enough information so that users know what can and cannot be done with...
Steps for the identification of the sensitivity of data and the determination of the appropriate security or privacy level are:
- Determine if the data has any confidentiality concerns
- Can an unauthorized individual use the information to do limited, serious, or severe harm to individuals, assets or an organization’s operations as a result of data disclosure?
- Would unauthorized disclosure or dissemination of elements of the data violate laws, executive orders...
As part of the data life cycle, research data will be contributed to a repository to support preservation and discovery. A research project may generate many different iterations of the same dataset - for example, the raw data from the instruments, as well as datasets which already include computational transformations of the data.
In order to focus resources and attention on these core datasets, the project team should define these core data assets as early in the process as...
Missing values should be handled carefully to avoid their affecting analyses. The content and structure of data tables are best maintained when consistent codes are used to indicate that a value is missing in a data field. Commonly used approaches for coding missing values include:
- Use a missing value code that matches the reporting format for the specific parameter. For example, use "-999.99" when the reporting format is a FORTRAN-like F7.2.
- For character fields, it...
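To show why a consistent missing value code matters, the sketch below converts the "-999.99" code from the example above into `None` before averaging, so the sentinel never enters the calculation. The function names and sample readings are hypothetical.

```python
MISSING_CODE = "-999.99"  # the F7.2-style code from the example above


def parse_value(text, missing_code=MISSING_CODE):
    """Convert a text field to float, mapping the missing value code to None."""
    stripped = text.strip()
    if stripped == missing_code:
        return None
    return float(stripped)


def mean_ignoring_missing(raw_values):
    """Average only the values that are actually present."""
    present = [v for v in map(parse_value, raw_values) if v is not None]
    return sum(present) / len(present) if present else None


readings = ["  10.00", "-999.99", "  20.00"]  # made-up F7.2-formatted fields
print(mean_ignoring_missing(readings))
```

Treating "-999.99" as a real number here would give a mean of about -323.33 instead of 15.0.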
Follow the steps below to choose the most appropriate software to meet your needs.
- Identify what you want to achieve (discover data, analyze data, write a paper, etc.)
- Identify the necessary software features for your project (i.e. functional requirements)
- Identify logistics features of the software that are required, such as licensing, cost, time constraints, user expertise, etc. (i.e. non-functional requirements)
- Determine what software has...
Outliers may not be the result of actual observations, but rather the result of errors in data collection, data recording, or other parts of the data life cycle. The following can be used to identify outliers for closer examination:
- Outliers may be detected by using Dixon's test, Grubbs' test, or the Tietjen-Moore test.
- Box plots are useful for indicating outliers
- Scatter plots help...
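As a numerical counterpart to the box plot, the sketch below applies the common 1.5 × IQR whisker rule to flag points for closer examination. This is a substitute illustration, not an implementation of the Dixon, Grubbs, or Tietjen-Moore tests; the function name and sample values are hypothetical.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR beyond the first or third quartile."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    lower, upper = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < lower or v > upper]


measurements = [10, 12, 11, 13, 12, 11, 50]  # made-up sample
print(iqr_outliers(measurements))  # [50]
```

A flagged value is only a candidate: it still has to be checked against the collection and recording history before it is treated as an error.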
Shaping the data management plan towards a specific desired repository will increase the likelihood that the data will be accepted into that repository and increase the discoverability of the data within the desired repository. When beginning a data management plan:
- Look to the data management guidelines of the project/grant for a required repository
- Ask colleagues what repositories are used in the community
- Determine if your local institution has a...