Data Observation Network for Earth


DataONEpedia

A three-day, NSF-funded Informatics Education Planning Workshop was held in Santa Fe, New Mexico, on June 28-30, 2010. The project was supported through the NSF INTEROP project, and the results informed the new NSF DataNet projects, especially DataONE, which focuses on the biological, ecological, and environmental sciences. Two products of this workshop were a Best Practices database, which describes to scientists and students how best to perform a data or information management task such as naming a file or labeling columns in a spreadsheet, and a Tools database, which gives a brief description of each tool recommended for use by scientists and students.

The goals for these two databases are:

  • A place where the data can be collected, managed, and updated by appropriate individuals.
  • Ensure that the collected data can be reused to some degree and potentially presented in multiple different ways
    • Best practices and Tools need to be searchable
    • Defined fields and tags that make sense
  • Cross-referential Best Practices

Best Practices

Featured Best Practice

Documenting Data

A description of the contents of the data file should contain the following:

  • Define the parameters and their units
  • Explain the formats for dates, time, geographic coordinates, and other parameters
  • Define any coded values
  • Describe quality flags or qualifying values
  • Define missing values

Documenting Data

When describing the process for creating derived data products, the following information should be included in the data documentation or the companion metadata file:

  • Description of primary input data and derived data
  • Why processing is required
  • Data processing steps and assumptions
    • Assumptions about primary input data
    • Additional input data requirements
    • Processing algorithm (e.g., volts to mol fraction, averaging)
    • Assumptions and limitations of algorithm
    • Describe how algorithm is applied (e.g., manually, using R, IDL)
  • How outcome of processing is evaluated
    • How problems are identified and rectified
    • Tools used to assess outcome
    • Conditions under which reprocessing is required
  • How uncertainty in processing is assessed
    • Provide a numeric estimate of uncertainty
  • How processing technique changes over time, if applicable

Data Management Planning

A Data Management Plan should include the following information:

  • Types of data to be produced and their volume
    • Who will produce the data
  • Standards that will be applied
    • File formats and organization, parameter names and units, spatial and temporal resolution, metadata content, etc.
  • Methods for preserving the data and maintaining data integrity
    • What hardware / software resources are required to store the data
    • How will the data be stored and backed up
    • Describe the method for periodically checking the integrity of the data
  • Access and security policies
    • What access requirements does your sponsor have
    • Are there any privacy / confidentiality / intellectual property requirements
    • Who can access the data:
      • During active data collection
      • When data are being analyzed and incorporated into publications
      • When data have been published
      • After the project ends
    • How should the data be cited and the data collectors acknowledged
  • Plans for eventual transition of the data to an archive after the project ends
    • Identify a suitable data center within your discipline
    • Establish an agreement for archival
    • Understand the data center's requirements for submission and incorporate into data management plan

Data Files and File Management

For date, always include four digit year and use numbers for months. For example, the date format yyyy-mm-dd would appear as 2011-03-15 (March 15, 2011).

If Julian day (day of year) is used, make sure the year field is also supplied. For example, ddd.yyyy would appear as 122.2011, where ddd is the day of the year.

If the date is not completely known (e.g., the day is not known), separate the columns into parts that do exist (e.g., separate columns for year and month). Do not invent a day merely because the database date format requires one.

For time, use 24-hour notation (13:30 hrs instead of 1:30 p.m. and 04:30 instead of 4:30 a.m.). Report in both local time and Coordinated Universal Time (UTC). Include local time zone in a separate field. As appropriate, both the begin time and end time should be reported in both local and UTC time. Because UTC and local time may be on different days, we suggest that dates be given for each time reported.

Be consistent in date and time formats within one data set.
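As an illustration of the date and time conventions above, the following Python sketch splits one observation time into separate local and UTC fields. The function name `timestamp_fields`, the field names, and the fixed UTC-7 offset are assumptions chosen for this example; they are not prescribed by the best practice.

```python
from datetime import datetime, timedelta, timezone


def timestamp_fields(local_dt):
    """Return date and time fields for one timezone-aware observation time."""
    utc_dt = local_dt.astimezone(timezone.utc)
    return {
        "local_date": local_dt.strftime("%Y-%m-%d"),  # four-digit year, numeric month
        "local_time": local_dt.strftime("%H:%M"),     # 24-hour notation
        "utc_offset": local_dt.strftime("%z"),        # local time zone in its own field
        "utc_date": utc_dt.strftime("%Y-%m-%d"),      # UTC may fall on a different day
        "utc_time": utc_dt.strftime("%H:%M"),
        "day_of_year": local_dt.strftime("%j.%Y"),    # day of year, with the year supplied
    }


# Hypothetical observation: 19:30 local time at a site seven hours behind UTC.
observation = datetime(2011, 3, 15, 19, 30, tzinfo=timezone(timedelta(hours=-7)))
fields = timestamp_fields(observation)
for name, value in fields.items():
    print(name, value)
```

In this example the local date is 2011-03-15 while the UTC date is 2011-03-16, which is why a date is reported alongside each time.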

Documenting Data

Spatial coordinates should be reported in decimal degrees to at least 4 (preferably 5 or 6) digits past the decimal point. A precision of +/- 0.00001 degrees corresponds to about 1.11 meters at the equator. This does not include uncertainty introduced by a GPS instrument.

Provide latitude and longitude with south latitude and west longitude recorded as negative values, e.g., 80° 30' 00" W longitude is -80.5000.
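The sign convention above can be sketched as a small conversion from degrees, minutes, and seconds to signed decimal degrees. The function name `dms_to_decimal` and the choice to round to six decimal places are assumptions made for this illustration.

```python
def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert degrees/minutes/seconds plus N, S, E, or W to signed decimal degrees."""
    hemisphere = hemisphere.upper()
    if hemisphere not in ("N", "S", "E", "W"):
        raise ValueError("hemisphere must be one of N, S, E, W")
    value = degrees + minutes / 60 + seconds / 3600
    # South latitude and west longitude are recorded as negative values.
    if hemisphere in ("S", "W"):
        value = -value
    return round(value, 6)


print(f"{dms_to_decimal(80, 30, 0, 'W'):.4f}")  # prints -80.5000
```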

Make sure that all location information in a file uses the same coordinate system, including coordinate type, datum, and spheroid. Document all three of these characteristics (e.g., Lat/Long decimal degrees, NAD83 (North American Datum of 1983), WGS84 (World Geodetic System of 1984)). Mixing coordinate systems [e.g., NAD83 and NAD27 (North American Datum of 1927)] will cause errors in any geographic analysis of the data.

If locating field sites is more convenient using the Universal Transverse Mercator (UTM) coordinate system, be sure to record the datum and UTM zone (e.g., NAD83 and Zone 15N), and the easting and northing coordinate pair in meters, to ensure that UTM coordinates can be converted to latitude and longitude.

To assure the quality of the geospatial data, plot the locations on a map and visually check the location.

Data Files and File Management

The units of reported parameters need to be explicitly stated in the data file and in the documentation. We recommend SI units (The International System of Units) but recognize that each discipline has its own commonly used units of measure. The critical aspect here is that the units be defined so that others understand what is reported.

Do not use abbreviations when describing the units. For example, the units for respiration are moles of carbon dioxide per meter squared per year.

Data Files and File Management

The parameters reported in the data set need to have names that clearly describe the contents. Ideally, the names should be standardized across files, data sets, and projects, in order that others can readily use the information.

The documentation should contain a full description of the parameter, including the parameter name, how it was measured, the units, and the abbreviation used in the data file.

A missing value code should also be defined. Use the same notation for each missing value in the data set. Use an extreme value (-9999) and do not use character codes in a numeric field. Supply a flag or a tag in a separate field to define briefly the reason for the missing data.

Within the data file, use commonly accepted abbreviations for parameter names, for example, Temp for temperature, Precip for precipitation, and Lat and Long for latitude and longitude. See the references in the Bibliography for additional examples. Some systems still limit the length of column names (e.g., 13 characters in ArcGIS), and lower-case column names are generally more transferable between systems. Spaces and special characters should not be used in attribute names; only numbers, letters, and underscores (“_”) transfer easily between systems.

Also, be sure to use consistent capitalization (not temp, Temp, and TEMP in the same file).
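A minimal sketch of how these naming rules might be checked before a file is shared. The function `check_column_names` and its messages are hypothetical, and the default limit of 13 characters is simply the ArcGIS figure quoted above; adjust it for the systems you actually target.

```python
import re

# Only letters, digits, and underscores are accepted.
VALID_NAME = re.compile(r"[A-Za-z0-9_]+")


def check_column_names(names, max_length=13):
    """Return (name, problem) pairs for column names that may not transfer well."""
    problems = []
    seen = {}  # lower-cased name -> first spelling encountered
    for name in names:
        if len(name) > max_length:
            problems.append((name, "longer than max_length"))
        if not VALID_NAME.fullmatch(name):
            problems.append((name, "contains spaces or special characters"))
        key = name.lower()
        if key in seen and seen[key] != name:
            problems.append((name, f"capitalization differs from {seen[key]}"))
        seen.setdefault(key, name)
    return problems


print(check_column_names(["Temp", "Precip", "Lat", "Long"]))  # prints []
print(check_column_names(["temp", "Temp", "Avg Temperature"]))
```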

Data Management Planning

Data should not be entered with higher precision than that with which they were collected (e.g., if a device collects data to 2 decimal places, an Excel file should not present them to 5 decimal places). If the system stores data at higher precision, take care when exporting to ASCII. For example, calculations in Excel are carried out at the highest precision the system supports, which is unrelated to the precision of the original data.

Data Management Planning

The following are strategies for effective data organization:

  • Sparse matrix: Optimal data models for storing data avoid sparse matrices, i.e. if many data points within a matrix are empty a data table with a column for parameters and a column for values may be more appropriate.
  • Repetitive information in a wide matrix: repeated categorical information is best handled in separate tables to reduce redundancy in the data table. In database design this is called normalization of data.
  • Column name is a value or repeating group: If the column name contains variable information, e.g. date or species name, the parameter/value organization of data is recommended as well for storage. Although the wide matrix is needed for statistical analysis and graphing it cannot be queried or subset in that format.
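The parameter/value organization described above can be sketched as a conversion from a wide table to a long one. The function `wide_to_long`, the column names, and the species counts are hypothetical; the sketch assumes an empty cell means that no value was recorded.

```python
def wide_to_long(rows, id_column):
    """Turn wide rows (one column per parameter) into (id, parameter, value) records.

    Empty cells are skipped, so a sparse wide matrix becomes a compact long table.
    """
    records = []
    for row in rows:
        for column, value in row.items():
            if column == id_column or value in ("", None):
                continue
            records.append({id_column: row[id_column], "parameter": column, "value": value})
    return records


# Hypothetical counts per site; the column names are species, which are really values.
wide = [
    {"site": "A", "quercus_alba": 4, "acer_rubrum": "", "pinus_taeda": ""},
    {"site": "B", "quercus_alba": "", "acer_rubrum": 7, "pinus_taeda": 2},
]
records = wide_to_long(wide, "site")
for record in records:
    print(record)
```

The long form can be queried or subset by parameter, and it can be pivoted back to a wide matrix when one is needed for analysis or graphing.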
Data Files and File Management

Data tables should ideally include values that were acquired in a consistent fashion. However, sometimes instruments fail and gaps appear in the records. For example, a data table representing a series of temperature measurements collected over time from a single sensor may include gaps due to power loss, sensor drift, or other factors. In such cases, it is important to document that a particular record was missing and replaced with an estimated or gap-filled value.

Specifically, whenever an original value is not available or is incorrect and is substituted with an estimated value, the method for arriving at the estimate needs to be documented at the record level. This is best done in a qualifier flag field. An example data table including a header row follows:

Day, Avg Temperature, Flag
1, 31.2, actual
2, 32.3, actual
3, 33.4, estimated
4, 35.8, actual
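One way such a flag column might be produced is sketched below. Linear interpolation between the nearest measured neighbours is an assumption made for illustration; the table above does not state how its estimate was derived, so the value computed here will differ from it. The function name `fill_gaps` and the flag labels are likewise hypothetical.

```python
def fill_gaps(values):
    """Fill interior gaps (None) by linear interpolation and flag every record."""
    filled = []
    for i, value in enumerate(values):
        if value is not None:
            filled.append((value, "actual"))
            continue
        # Find the nearest measured neighbours on each side of the gap.
        before = next((j for j in range(i - 1, -1, -1) if values[j] is not None), None)
        after = next((j for j in range(i + 1, len(values)) if values[j] is not None), None)
        if before is None or after is None:
            filled.append((None, "missing"))  # cannot interpolate at the edges
            continue
        fraction = (i - before) / (after - before)
        estimate = values[before] + fraction * (values[after] - values[before])
        filled.append((round(estimate, 1), "estimated"))
    return filled


readings = [31.2, 32.3, None, 35.8]  # hypothetical: the day 3 reading was lost
print("Day, Avg Temperature, Flag")
for day, (value, flag) in enumerate(fill_gaps(readings), start=1):
    print(f"{day}, {value}, {flag}")
```

In practice the flag, or the accompanying documentation, should also name the estimation method used.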

Data Files and File Management

Delimit the columns within a data table using commas or tabs; these are listed in order of preference. Semicolons are used in many systems as line end delimiters and may cause problems if data are imported into those systems (e.g. SAS, PHP scripts). Avoid delimiters that also occur in the data fields. If this cannot be avoided, enclose data fields that also contain a delimiter in single or double quotes.

An example of a consistently delimited data file with a header row:

Date, Avg Temperature, Precipitation
01Jan2010, 32.3, 0.0
02Jan2010, 34.1, 0.5
03Jan2010, 31.4, 2.5
04Jan2010, 33.2, 0.0
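The quoting rule above is handled automatically by many CSV libraries. As a sketch, Python's standard `csv` module quotes only those fields that contain the delimiter or a quote character; the site names and notes below are made up for the example.

```python
import csv
import io

rows = [
    ["site", "temp", "notes"],
    ["Santa Fe, NM", 32.3, "clear"],      # field contains the delimiter
    ["Albuquerque", 34.1, 'sensor "B"'],  # field contains a quote character
]

# Write comma-delimited text, quoting only the fields that need it.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter=",", quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
writer.writerows(rows)
text = buffer.getvalue()
print(text)

# Reading the text back recovers the original fields (as strings).
parsed = list(csv.reader(io.StringIO(text)))
print(parsed[1][0])  # prints Santa Fe, NM
```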

Documenting Data

Data measurement descriptions should:

  • Describe data collection methods or protocols (can include diagrams, images, schematics, etc.)
    • How the data were collected
    • Measurement frequency and regularity
  • Describe instrumentation
    • Include manufacturer, model number, dates in use
    • Maintenance/repair history
    • Malfunction history
    • Calibration methods, scale, detection limits, and history
  • Document measurement uncertainty, including accuracy, precision, and reproducibility. Provide values in the context of the measurements, e.g., standard error, standard deviation, confidence limits.

Data Management Planning

In order to preserve the raw data for future use:

  • Do not make any changes / corrections to the original raw data file
  • Use a scripting language (e.g., R) or a programming language that can be documented (e.g., C, Java, Python) to perform analysis or make corrections, and save that information in a separate file
    • The code, along with appropriate documentation will be a record of the changes
    • The code can be modified and rerun, using the raw data file as input, if needed
  • Consider making your original data file read-only, so it cannot be inadvertently altered.
  • Avoid spreadsheet software and other Graphical User Interface-based software. They may seem convenient, but changes are made without a clear record of what was done or why. Spreadsheets provide incredible freedom and power for manipulating data, but if used inappropriately can create tremendous problems. For this reason special attention needs to be paid to adhering to best practices in organizing data in spreadsheets. Particularly important best practices that are also highlighted elsewhere are:
    • Data should be organized in columns with each column representing only a single type of data (number, date, character string). An exception to this is that sometimes a header line containing column names (sometimes called variable or field names) may be placed at the top of a column.
    • Each data line should be complete, that is, each line of the data should contain data for each column. Sometimes in spreadsheets, to promote human readability, values will be provided only when they change. However, if the data are sorted, the relationships would become scrambled. An exception to this rule is that if a data item is really missing (and not just omitted for human readability), a missing value code might be used.

    Additional best practices regarding consistent use of codes for categorical variables and informative field names also apply, but keeping the data in consistent and complete columns is the most important.

    A key test is whether the data from a spreadsheet can be exported as a delimited text file, such as a comma-separated-value (.csv) file that can be read by other software. If columns are not consistent the resulting data may cause software such as relational databases (e.g., MySQL, Oracle, Access) or statistical software (e.g., R, SAS, SPSS) to record errors or even crash.

    As a general rule, spreadsheets make a poor archival data format. Standards for spreadsheet file formats change frequently or go out of fashion. Even within a single software package (e.g., Excel) there is no guarantee that future versions of the software will read older file versions. For this reason, and as specified in other best practices, generic (e.g., text) formats such as comma-separated-value files are preferred.

    Sometimes it is the formulae embedded in a spreadsheet, rather than the data values themselves that are important. In this case, the spreadsheet itself may need to be archived. The danger of the spreadsheet being rendered obsolete or uninterpretable may be reduced by exporting the spreadsheet in a variety of forms (e.g., both as .xls and as .xlsx formats). However the long-term utility of the spreadsheet may still depend on periodic opening of the archived spreadsheet and saving it into new forms.

    Upgrades and new versions of software applications often perform conversions or modifications to data files produced in older versions, in many cases without notifying the user of the internal change(s).

    Many contemporary software applications that advertise forward compatibility for older files actually perform significant modifications to both visible and internal file contents. While this is often not a problem, there are cases where important elements, like numerical formulas in a spreadsheet, are changed significantly when they are converted to become compatible with a current software package. The following practices will help ensure that your data files maintain their original fidelity in the face of application updates and new releases:

    • Where practical, continue using the version of the software that was originally used to create the data file to view and manipulate the file contents (For example, if Excel 97 was used to create a spreadsheet that contains formulas and formatting, continue using Excel 97 to access those data files as long as possible).
    • When forced to use a newer version of a software package to open files created with an older version of the application, first save a copy of the original file as a safeguard against irretrievable modification or corruption.
    • Carefully inspect older files that have been opened/converted to be compatible with newer versions of an application to ensure data fidelity has been carried forward. Where possible, compare the converted files to copies of the original files to ensure there have been no data modifications during conversion.

Data Management Planning

Steps for the identification of the sensitivity of data and the determination of the appropriate security or privacy level are:

  • Determine if the data has any confidentiality concerns
    • Can an unauthorized individual use the information to do limited, serious, or severe harm to individuals, assets or an organization’s operations as a result of data disclosure?
    • Would unauthorized disclosure or dissemination of elements of the data violate laws, executive orders, or agency regulations (e.g., HIPAA or privacy laws)?
  • Does the data have any integrity concerns?
    • What would be the impact of unauthorized modification or destruction of the data?
    • Would it reduce public confidence in the originating organization?
    • Would it create confusion or controversy in the user community?
    • Could a potentially life-threatening decision be made based on the data or analysis of the data?
  • Are there any availability concerns about the data?
    • Is the information time-critical? Will another individual or system be relying on the data to make a time-sensitive decision (e.g., sensing data for earthquakes, floods, etc.)?
  • Document data concerns identified and determine overall sensitivity (Low, Moderate, High)
    • Low criticality would result in a limited adverse effect to an organization as a result of the loss of confidentiality, integrity, or availability of the data. It might mean degradation in mission capability or result in minor harm to individuals.
    • Moderate criticality would result in a serious adverse effect to an organization as a result of the loss of confidentiality, integrity, or availability of the data. It might mean a severe degradation or loss of mission capability or result in significant harm to individuals that does not involve loss of life or serious life-threatening injuries.
    • High criticality would result in a severe or catastrophic adverse effect as a result of the loss of confidentiality, integrity, or availability of the data. It might cause a severe degradation in or loss of mission capability or result in severe or catastrophic harm to individuals involving loss of life or serious life-threatening injuries.
  • Develop data access and dissemination policies and procedures based on sensitivity of the data and need-to-know.
  • Develop data protection policies, procedures and mechanisms based on sensitivity of the data.

Data Preservation and Backup

In order for a large dataset to be effectively used by a variety of end users, the following procedures for preparing a virtual dataset are recommended:

  • Identify data service users
  • Define data access capabilities needed by community(s) of users. For example:
    • Spatial subsetting
    • Temporal subsetting
    • Parameter subsetting
    • Coordinate transformation
    • Statistical characterization
  • Define service interfaces based upon Open Standards. For example:
    • Open Geospatial Consortium (OGC WMS, WFS, WCS)
    • W3C (SOAP)
    • IETF (REST – derived from Hypertext Transfer Protocol [HTTP])
  • Publish service metadata for published services based upon Open Standards. For example:
    • Web Services Definition Language (WSDL)
    • RSS/Atom (see Service Casting reference below for an example of a model for publishing service metadata for a variety of service types)

Quality Assurance and Quality Control

Quality control practices are specific to the type of data being collected, but some generalities exist:

  • Data collected by instruments:
    • Values recorded by instruments should be checked to ensure they are within the sensible range of the instrument and the property being measured. Example: Concentrations cannot be < 0, and wind speed cannot exceed the maximum speed that the anemometer can record.
  • Analytical results:
    • Values measured in the laboratory should be checked to ensure that they are within the detection limit of the analytical method and are valid for what is being measured. If values are below the detection limit, they should be properly coded and qualified.
    • Any ancillary data used to assess data quality should be described and stored. Example: data used to compare instrument readings against known standards.
  • Observations (such as bird counts or plant cover):
    • Range checks and comparisons with historic maxima will help identify anomalous values that require further investigation.
    • Comparing current and past measurements helps identify highly unlikely events. For example, it is unlikely that the girth of a tree will decrease from one year to the next.
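A range check of the kind described above can be sketched as follows. The function name `range_check`, the flag labels, and the assumed anemometer limits of 0 to 60 m/s are hypothetical; real limits come from the instrument documentation and the property being measured.

```python
def range_check(values, minimum, maximum):
    """Return a QC flag for each value: 'ok', 'out_of_range', or 'missing'."""
    flags = []
    for value in values:
        if value is None:
            flags.append("missing")
        elif minimum <= value <= maximum:
            flags.append("ok")
        else:
            flags.append("out_of_range")
    return flags


# Hypothetical wind speeds (m/s) from an anemometer assumed to record 0 to 60 m/s.
wind_speed = [3.2, 0.0, -1.5, 75.0, None]
print(range_check(wind_speed, 0.0, 60.0))
# prints ['ok', 'ok', 'out_of_range', 'out_of_range', 'missing']
```

Flagged values are candidates for further investigation, not automatic deletions.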

Codes should be used to indicate quality of data.

  • Codes should be checked against the list of allowed values to validate code entries
  • When coded data are digitized, they should be re-checked against the original source. Double data entry, or having another person check and validate the data entered, is a good mechanism for identifying data entry errors.
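Both checks above lend themselves to simple scripts. The sketch below validates entries against a list of allowed codes and compares two independent keyings of the same records; the code list, function names, and sample entries are hypothetical.

```python
ALLOWED_FLAGS = {"actual", "estimated", "missing"}  # hypothetical code list


def invalid_codes(entries, allowed):
    """Return (row_number, entry) pairs for entries not in the allowed code list."""
    return [(row, entry) for row, entry in enumerate(entries, start=1) if entry not in allowed]


def double_entry_mismatches(first_pass, second_pass):
    """Compare two independent keyings of the same records; return differing row numbers."""
    if len(first_pass) != len(second_pass):
        raise ValueError("both passes must contain the same number of records")
    return [row for row, (a, b) in enumerate(zip(first_pass, second_pass), start=1) if a != b]


print(invalid_codes(["actual", "Actual", "estimted"], ALLOWED_FLAGS))
# prints [(2, 'Actual'), (3, 'estimted')]
print(double_entry_mismatches(["31.2", "32.3", "33.4"], ["31.2", "23.3", "33.4"]))
# prints [2]
```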

Dates and times:

  • Ensure that dates and times are valid
  • Time zones should be clearly indicated (UTC or local)

Data Types:

  • Values should be consistent with the data type (integer, character, datetime) of the column in which they are entered. Example: 12-20-2000A should not be entered in a column of dates.
  • Use consistent data types in your data files. A database, for instance, will prevent entry of a string into a column identified as having integer data.

Geographic coordinates:

  • Map coordinates to detect errors

Quality Assurance and Quality Control

Outliers may not be the result of actual observations, but rather the result of errors in data collection, data recording, or other parts of the data life cycle. The following can be used to identify outliers for closer examination:

Statistical determination:

  • Outliers may be detected by using Dixon’s test, Grubbs’ test, or the Tietjen-Moore test.

Visual determination:

  • Box plots are useful for indicating outliers
  • Scatter plots help identify outliers when there is an expected pattern, such as a daily cycle

Comparison to related observations:

  • Difference plots for co-located data streams can show unreasonable variation between data sources. Example: Difference plots from weather stations in close proximity or from redundant sensors can be constructed.
  • Comparisons of two parameters that should covary can indicate data contamination. Example: Declining soil moisture and increasing temperature are likely to result in decreasing evapotranspiration.

No outliers should be removed without careful consideration and verification that they are not representing true phenomena.
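As a numerical counterpart to the box plot, the sketch below flags values that lie more than 1.5 interquartile ranges beyond the quartiles, which is the usual whisker convention. This is not Dixon's, Grubbs', or the Tietjen-Moore test; it is a simpler screening rule chosen for illustration, and the function name and sample data are hypothetical.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return values lying more than k interquartile ranges beyond the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


# Hypothetical daily readings with one suspicious value.
readings = [10, 11, 12, 11, 10, 12, 11, 50]
print(iqr_outliers(readings))  # prints [50]
```

Values returned by such a screen are candidates for closer examination only; whether they are removed is a separate decision.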

Documenting Data

Terms and phrases that are used to represent categorical data values or for creating content in metadata records should reflect appropriate and accepted vocabularies in your community or institution. Methods used to identify and select the proper terminology include:

  • Identify the relevant descriptive terms used as categorical values in your community prior to start of the project (ex: standard terms describing soil horizons, plant taxonomy, sampling methodology or equipment, etc.)
  • Identify locations in metadata where standardized terminology should be used and sources for the terms. Terminology should reflect both data type/content and access methods.
  • Review existing thesauri, ontologies, and keyword lists before creating new terms. Potential sources include: Semantic Web for Earth and Environmental Terminology (SWEET), Planetary Ontologies, and NASA Global Change Master Directory (GCMD)
  • Enforce use of standard terminology in your workflow, including:
    • Use of lookup tables in data-entry forms
    • Use of field-level constraints in databases (restrict data import to match accepted domain values)
    • Use of XML validation
    • Manual review
  • Publish metadata using Open Standards, for example:
    • Z39.50
    • OGC Catalog Services for Web (CSW)
    • Web Accessible Directory (WAD)

    If you must use an unconventional or unique vocabulary, it should be identified in the metadata and fully defined in the data documentation (attribute name, values, and definitions).

Documenting Data

For appropriate attribution and provenance of a dataset, the following information should be included in the data documentation or the companion metadata file:

  • Name the people responsible for the dataset throughout the lifetime of the dataset, including for each person:
    • Name
    • Contact information
    • Role (e.g., principal investigator, technician, data manager)

According to the International Polar Year Data and Information Service, an author is the individual(s) whose intellectual work, such as a particular field experiment or algorithm, led to the creation of the dataset. People responsible for the data can include: individuals, groups, compilers or editors.

  • Description of the context of the dataset with respect to a larger project or study (include links and related documentation), if applicable.
  • Revision history, including additions of new data and error corrections.
  • Links to source data, if the data in one dataset were derived from data in another dataset.
  • List of project support (e.g., funding agencies, collaborators, material support).
  • Describe how to properly cite the dataset. The data citation should include:
    • All contributors
    • Date of dataset publication
    • Title of dataset
    • Media or URL
    • Data publisher
    • Identifier (Digital Object Identifier)

Quality Assurance and Quality Control

    Information about quality control and quality assurance is an important component of the metadata:

    • Qualify (flag) data that have been identified as questionable by including a flagging_column next to the column of data values. The two columns should be properly associated through a naming convention such as Temperature, flag_Temperature.
    • Describe the quality control methods applied and their assumptions in the metadata. Describe any software used when performing the quality analysis, including code where practical. Include in the metadata who did the quality control analysis, when it was done, and what changes were made to the dataset.
    • Describe standards or test data used for the quality analysis. For instance, include, when practical, the data used to make a calibration curve.
    • If data with qualifier flags are summarized to create a derived data set, include the percent flagged data and percent missing data in the metadata of the derived data file. High frequency observations are often downsampled, and it is critical to know how much of the data were rejected in the primary data.
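The percentages mentioned in the last item can be computed directly from the flag column. In this sketch the function name `flag_summary` and the flag labels ("actual", "missing", and anything else counted as flagged) are assumptions; substitute the codes defined in your own metadata.

```python
def flag_summary(flags, good_flag="actual", missing_flag="missing"):
    """Return the percent of records that are flagged and the percent that are missing."""
    total = len(flags)
    if total == 0:
        raise ValueError("no records to summarize")
    missing = sum(1 for flag in flags if flag == missing_flag)
    # Anything that is neither good nor missing counts as flagged.
    flagged = sum(1 for flag in flags if flag not in (good_flag, missing_flag))
    return {
        "percent_flagged": round(100 * flagged / total, 1),
        "percent_missing": round(100 * missing / total, 1),
    }


# Hypothetical flag column from ten primary records.
flags = ["actual"] * 6 + ["estimated", "questionable"] + ["missing"] * 2
print(flag_summary(flags))
# prints {'percent_flagged': 20.0, 'percent_missing': 20.0}
```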
    Documenting Data

    Identification of any species represented in the data set should be as complete as possible.

    • Use a standard taxonomy whenever possible
    • Full taxonomic tree to most specific level available
    • Source of taxonomy should accompany taxonomic tree (if available)
    • References used for taxonomic identification should be provided, if appropriate (e.g. technical document, journal article, book, database, person, etc.)

    Examples of standardized identification systems:

    Data Files and File Management

    Missing values should be handled carefully to avoid their affecting analyses. The content and structure of data tables are best maintained when consistent codes are used to indicate that a value is missing in a data field. Commonly used approaches for coding missing values include:

    • Use a missing value code that matches the reporting format for the specific parameter. For example, use "-999.99" when the reporting format is a FORTRAN-like F7.2.
    • For character fields, it may be appropriate to use "Not applicable" or "None" depending upon the organization of the data file.
    • It might be useful to use a placeholder value such as "Pending assignment" when compiling draft information to facilitate returning to incomplete fields.
    • Do not use character codes in an otherwise numeric field.

    Whatever missing value code is chosen, it should be used consistently throughout all associated data files and identified in the metadata and/or data description files.
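The sketch below shows why a declared missing value code matters during analysis: the code is converted to an explicit "no value" before any statistic is computed. The code "-9999", the column names, and the sample data are assumptions for this example.

```python
import csv
import io

MISSING_CODE = "-9999"  # assumed code; use whatever the metadata declares

data = """day,temp
1,31.2
2,-9999
3,33.4
"""


def read_temps(text, missing_code=MISSING_CODE):
    """Read the temp column, converting the missing value code to None."""
    temps = []
    for row in csv.DictReader(io.StringIO(text)):
        raw = row["temp"].strip()
        temps.append(None if raw == missing_code else float(raw))
    return temps


temps = read_temps(data)
valid = [t for t in temps if t is not None]
mean = sum(valid) / len(valid)
print(temps)           # prints [31.2, None, 33.4]
print(f"{mean:.1f}")   # prints 32.3
```

Had the code been left in place as a number, it would have been averaged in with the real measurements.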

    Data Files and File Management

    In order to ensure replicable data access:

    • Choose a broadly utilized Data Identification Standard based on specific user community practices or preferences
      • DOI
      • OIDs
      • ARKs
      • LSIDs
      • XRIs
      • URNs/URIs/URLs
      • UUIDs
    • Consistently apply the standard
    • Maintain the linkage
    • Participate in implementing infrastructure for consistent access to the resources referenced by the Identifier

    Data Files and File Management

    Items to consider when versioning data products:

    • Develop definition of what constitutes a new version of the data, for example:
      • New processing algorithms
      • Additions or removal of data points
      • Time or date range
      • Included parameters
      • Data format
      • Immutability of versions
    • Develop standard naming convention for versions with associated descriptive information
    • Associate metadata with each version including the description of what differentiates this version from another version

    Data Files and File Management

    A separate column should be used for data qualifiers, descriptions, and flags, otherwise there is the potential for problems to develop during analyses. Potential entries in the descriptor column:

    • Potential sources of error
    • Missing value justification (e.g., sensor off line, human error, data rejected outside of range, data not recorded)
    • Flags for values outside of expected range, questionable etc.

    Data Files and File Management

    Choose the right data type and precision for data in each column. As examples: (1) use date fields for dates; and (2) use numeric fields with an appropriate number of decimal places. Comments and explanations should not be included in a column that is meant to include numeric values only. Comments should be included in a separate column that is designed for text. This allows users to take advantage of specialized search and computing functionality and improves data quality. If a particular spreadsheet or software system does not support data typing, it is still recommended that one keep the data type consistent within a column and not mix numbers, dates and text.

    Data Files and File Management

    Be consistent in the use of codes to indicate categorical variables, for example species names, sites, or land cover types. Codes should always be the same within one data set. Pay particular attention to spelling and case; the most frequent problems are with abbreviations for species names and sites.

    Consistent codes can be achieved most easily by defining standard categorical variables (codes) and using drop-down lists (e.g., in Excel or a database). Frequently a code is needed for ‘none of the above’ or ‘unknown’ or ‘other’ to avoid imprecise code assignment.

    Data Preservation and Backup

    File formats are important for understanding how data can be used and possibly integrated. The following issues need to be documented:

    • Does the file format of the data adhere to one or more standards?
    • Is that file standard an open (i.e. open source) or closed (i.e. proprietary) format?
    • Is a particular software package required to read and work with the data file? If so, the software package, version, and operating system platform should be cited in the metadata
    • Do multiple files comprise the data file structure? If so, that should be specified in the metadata

    When choosing a file format, data collectors should select a consistent format that can be read well into the future and is independent of changes in applications.

    • Appropriate file types include:
      • Non-proprietary: Open, documented standard
      • Common usage by research community: Standard representation (ASCII, Unicode)
      • Unencrypted
      • Uncompressed
    • ASCII formatted files will be readable into the future
      • Use ASCII (comma-separated) for tabular data
    • For geospatial (raster) data the following provide a stable format:
      • GeoTIFF/TIFF
      • ASCII Grid
      • Binary image files
      • NetCDF
      • HDF or HDF-EOS
    • For image (Vector) data use the following file formats (these are mostly proprietary data formats; please be sure to document the Software Package, Version, Vendor, and native platform):
      • ArcView software -- please store components of an ArcView shape file (*.shp, *.sbx, *.sbn, *.prj, and *.dbf files)
      • ENVI -- *.evf (ENVI vector file)
      • ESRI Arc/Info export file (.e00)

    Data Preservation and Backup

    For successful data replication and backup:

    • Users should ensure that backup copies have the same content as the original data file.
      • Calculate a checksum for both the original and the backup copies and compare them; if they differ, back up the file again. MD5 is one algorithm for computing a checksum: http://en.wikipedia.org/wiki/MD5
      • Compare files to ensure that there are no differences
    • Document all procedures (e.g., compression / decompression process) to ensure a successful recovery from a backup copy
    • To check the integrity of the backup file, periodically retrieve your backup file, open it on a separate system, and compare to the original file.
    • A data backup is only valuable if it is accessible. When access to a data backup is required, the owner of the backup may not be available. It is important that others know how to access the backup, otherwise the data may not be accessible for recovery. It is important to know the "who, what, when, where, and how" of the backups:
      • Have contact information available for the person responsible for the data
      • Ensure that those who need access to backups have proper access
      • Communicate what data is being backed up
      • Note how often the data is backed up and where that particular backup is located including
        • physical location (machine, office, company)
        • file system location
      • Be aware that there may be different backup procedures for different data sets:
        • Not all backups may be located in the same location
        • Depending upon the backup schedule, each iteration of the backup may be located in different locations (for example, more recent backups may be located on-site and older backups may be located off-site)
      • Have instructions and training available so that others know how to pull the backup and access the necessary data in case you are unavailable
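
    The checksum comparison described above can be sketched with Python's standard `hashlib` module. The function names `file_checksum` and `backup_matches` are illustrative; MD5 is used because it is the algorithm named above, and any other algorithm supported by `hashlib` could be passed instead.

```python
import hashlib


def file_checksum(path, algorithm="md5", chunk_size=1 << 20):
    """Return the hex digest of a file, reading it in chunks so large files fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_matches(original_path, backup_path, algorithm="md5"):
    """True when the original and the backup copy have the same checksum."""
    return file_checksum(original_path, algorithm) == file_checksum(backup_path, algorithm)
```

    If `backup_matches` returns False, the backup should be made again, as recommended above. Recording the checksum alongside the backup also allows later integrity checks without access to the original file.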
    Data Files and File Management

    File names should reflect the contents of the file and include enough information to uniquely identify the data file. File names may contain information such as project acronym, study title, location, investigator, year(s) of study, data type, version number, and file type.

    When choosing a file name, check for any database management limitations on file name length and use of special characters. Also, in general, lower-case names are less software and platform dependent. Avoid using spaces and special characters in file names, directory paths and field names, because automated processing, URLs and other systems often use spaces and special characters to parse text strings. Instead, consider using underscores ( _ ) or dashes ( - ) to separate meaningful parts of file names. Avoid $ % ^ & # | : and similar.

    If versioning is desired, a date string within the file name is recommended to indicate the version.

    Avoid using file names such as mydata.dat or 1998.dat.
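
    These naming rules can be applied automatically. The following Python sketch shows one way to do so; the helper names `build_file_name` and `is_safe_file_name`, the YYYYMMDD version string, and the choice of dashes within a part and underscores between parts are assumptions made for this example.

```python
import re
from datetime import date

SAFE_NAME = re.compile(r"[a-z0-9_.-]+")


def build_file_name(parts, version_date, extension):
    """Join descriptive parts with underscores and append a YYYYMMDD version string."""
    cleaned = []
    for part in parts:
        part = part.strip().lower()
        part = re.sub(r"\s+", "-", part)        # spaces become dashes
        part = re.sub(r"[^a-z0-9-]", "", part)  # drop special characters
        if part:
            cleaned.append(part)
    stamp = version_date.strftime("%Y%m%d")
    return "_".join(cleaned) + "_" + stamp + "." + extension.lower()


def is_safe_file_name(name):
    """True if the name uses only lower-case letters, digits, '_', '-' and '.'."""
    return SAFE_NAME.fullmatch(name) is not None
```

    For a hypothetical project, `build_file_name(["Example Project", "Soil Temp", "Plot 3"], date(2010, 6, 28), "csv")` yields `example-project_soil-temp_plot-3_20100628.csv`. Note that `is_safe_file_name` only checks the characters used; whether a name is descriptive enough still requires human judgment.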

    Data Preservation and Backup

    To avoid accidental loss of data you should:

    • Back up your data at regular intervals
      • When you complete your data collection activity
      • After you make edits to your data
    • Streaming data should be backed up at regularly scheduled points in the collection process
      • High-value data should be backed up daily or more often
      • Automation simplifies frequent backups
    • Backup strategies (e.g., full, incremental, differential) should be optimized for the data collection process
    • Create, at a minimum, 2 copies of your data
    • Place one copy at an “off-site” and “trusted” location
      • Commercial storage facility
      • Campus file-server
      • Cloud file-server (e.g., Amazon S3, Carbonite)
    • Use a reliable device when making backups
      • External USB drive (avoid the use of “light-weight” devices e.g., floppy disks, USB stick-drive; avoid network drives that are intermittently accessible)
      • Managed network drive
      • Managed cloud file-server (e.g., Amazon S3, Carbonite)
    • Ensure backup copies are identical to the original copy
      • Perform differential checks
      • Perform “checksum” check
    • Document all procedures to ensure a successful recovery from a backup copy
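
    Automating the copy and the verification together helps make frequent backups routine. The sketch below uses only the Python standard library; the function name `back_up`, the date-stamped naming scheme and the directory layout are assumptions for illustration, and the byte-by-byte comparison stands in for the "differential check" mentioned above.

```python
import filecmp
import shutil
from datetime import date
from pathlib import Path


def back_up(source, backup_dir, today=None):
    """Copy source into backup_dir under a date-stamped name and verify the copy."""
    source = Path(source)
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    stamp = (today or date.today()).strftime("%Y%m%d")
    target = backup_dir / f"{source.stem}_{stamp}{source.suffix}"
    shutil.copy2(source, target)  # copy2 also preserves timestamps where possible
    # shallow=False forces a comparison of file contents, not just size and timestamp
    if not filecmp.cmp(source, target, shallow=False):
        raise IOError(f"backup {target} differs from {source}")
    return target
```

    A function like this could be run by a scheduler after each collection session. It creates only one copy, so a second copy at an off-site location is still needed to follow the recommendations above.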

    Data Management Planning

    As a best practice, one must first acknowledge that the process of managing data will incur costs. Researchers should plan to address these costs and the allocation of resources in the early planning phases of the project. This best practice focuses on data management costs during the life cycle of the project, and does not aim to address costs of data beyond the end of the project.

    Budgeting and costing for your project is dependent upon institutional resources, services, and policies. We recommend that you check with your sponsored project office, your office of research, tech transfer resources, and other appropriate entities at your institution to understand the resources available to you.

    There are a variety of approaches to budgeting for data management costs. All approaches should address the following costs in each phase:

    • short-term costs
    • long-term costs
    • internal/external costs
    • equipment/services (i.e., compute cycles, storage, software, and hardware) costs
    • overhead costs
    • time costs
    • human resource costs

    Methods for Managing Costs

    • In-sourced costs: items that are managed directly within the research group.
    • Out-sourced costs: items that are contracted or managed outside of the research group.

    Phases of the Data Life Cycle (see Primer on Data Management on the DataONE website for a description of the life cycle)

    • Collect - Likely both in-sourced and out-sourced costs. Coordinate with central IT services or community storage resources to ensure appropriate data storage environment and associated costs during this phase or throughout the life of the project.
    • Assure - Likely in-sourced costs. This phase is primarily focused on quality assurance/control, and costs will primarily be incurred around time and personnel.
    • Describe - Likely in-sourced costs. This phase includes initial and ongoing documentation as well as continuous development of metadata. Documentation captures the entire structure of the project, all configurations/parameters, as well as all processes during the course of the entire project. See the Documentation and Metadata best practices for more detail on what should be addressed.
    • Deposit - Likely both in-sourced and out-sourced costs.
    • Preserve - Likely both in-sourced and out-sourced costs. Coordinate with central IT services or community repository environments that are equipped to provide preservation services. This phase will be tied closely to the costs of the collection phase.
    • Discover - Likely in-sourced costs. Coordinate with librarians, IT service providers, or repository providers to identify and access data sources.
    • Integrate - Likely in-sourced costs. Coordinate with IT service providers or other service groups to merge and prepare data sources for analysis phase.
    • Analyze - Likely in-sourced costs. Coordinate with central IT services or other workspace providers to connect data sources with appropriate analysis and visualization software.
    Data Preservation and Backup

    The process of science generates a variety of products that are worthy of preservation. Researchers should consider all elements of the scientific process in deciding what to preserve:

    • Raw data
    • Tables and databases of raw or cleaned observation records and measurements
    • Intermediate products, such as partly summarized or coded data that are the input to the next step in an analysis
    • Documentation of the protocols used
    • Software or algorithms developed to prepare data (cleaning scripts) or perform analyses
    • Results of an analysis, which can themselves be starting points or ingredients in future analyses, e.g. distribution maps, population trends, mean measurements
    • Any data sets obtained from others that were used in data processing
    • Multimedia: documented procedures, or standalone data

    When deciding on what data products to preserve, researchers should consider the costs of preserving data:

    • Raw data are usually worth preserving
    • Consider space requirements when deciding on whether to preserve data
    • If data can be easily or automatically re-created from raw data, consider not preserving them. Conversely, if data have undergone quality control processes and were analyzed, consider preserving them, since reproduction might be costly
    • Algorithms and software source code cost very little to preserve
    • Results of analyses may be particularly valuable for future discovery and cost very little to preserve

    Researchers should consider the following goals and benefits of preservation:

    • Enabling re-analysis of the same products to determine whether the same conclusions are reached
    • Enabling re-use of the products for new analysis and discovery
    • Enabling restoration of original products in the case that working datasets are lost
    Data Management Planning

    In the planning process, researchers should carefully consider what data will be produced in the course of their project.

    Consider the following:

    • What types of data will be collected? E.g. Spatial, temporal, instrument-generated, models, simulations, images, video etc.
    • How many data files of each type are likely to be generated during the project? What size will they be?
    • For each type of data file, what are the variables that are expected to be included?
    • What software programs will be used to generate the data?
    • How will the files be organized in a directory structure on a file system or in some other system?
    • Will metadata information be stored separately from the data during the project?
    • What is the relationship between the different types of data?
    • Which of the data products are of primary importance and should be preserved for the long-term, and which are intermediate working versions not of long-term interest?

    When preparing a data management plan, defining the types of data that will be generated helps in planning for short-term organization, the analyses to be conducted, and long-term data storage.

    Data Management Planning

    All research requires the sharing of information and data. The general philosophy is that data are freely and openly shared. However, funding organizations and institutions may require that their investigators cite the impact of their work, including shared data. Creating a usage rights statement and including it in data documentation makes clear to users of your data what the conditions of use are and how to acknowledge the data source.

    Include a statement describing the "usage rights" management, or reference a service that provides the information. Rights information encompasses Intellectual Property Rights (IPR), copyright, cost, or various Property Rights. For data, rights might include requirements for use, requirements for attribution, or other requirements the owner would like to impose. If there are no requirements for re-use, this should be stated.

    Usage rights statements should state what the appropriate data uses are, how to contact the data creators, and how to acknowledge the data source. Researchers should be aware of legal and policy considerations that affect the use and reuse of their data. It is important to provide the most comprehensive access possible with the fewest barriers or restrictions.

    There are three primary areas that need to be addressed when producing sharable data:

    1. Privacy and confidentiality: Adhere to your institution's policy
    2. Copyright and intellectual property (IP): Data is not copyrightable. Ensure that you have the appropriate permissions when using data that has multiple owners or copyright layers. Keep in mind that information documenting the context of data collection may be under copyright.
    3. Licensing: Data can be licensed. The manner in which you license your data can determine its ability to be consumed by other scholars. For example the Creative Commons Zero License provides for very broad access.

    If your data falls under any of the categories below there are additional considerations regarding sharing:

    • Rare, threatened or endangered species
    • Cultural items returned to their country of origin
    • Native American and Native Hawaiian human remains and objects
    • Any research involving human subjects

    If you use data from other sources, you should review your rights to use the data and be sure you have the appropriate licenses and permissions.

    Data Management Planning

    A data model documents and organizes data, how it is stored and accessed, and the relationships among different types of data. The model may be abstract or concrete.

    Use these guidelines to create a data model:

    1. Identify the different data components; consider raw and processed data, as well as associated metadata (these are called entities)
    2. Identify the relationships between the different data components (these are called associations)
    3. Identify anticipated uses of the data (these are called requirements), with recognition that data may be most valuable in the future for unanticipated uses
    4. Identify the strengths and constraints of the technology (hardware and software) that you plan to use during your project (this is called a technology assessment phase)
    5. Build a draft model of the entities and their relations, attempting to keep the model independent from any specific uses or technology constraints.
    6. Incorporate intended usage and technology constraints as needed to derive the simplest, most general model possible
    7. Test the model with different scenarios, including best- and worst-case (worst-case includes problems such as invalid raw data, user mistakes, failing algorithms, etc)
    8. Repeat these steps to optimize the model
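
    To make the terms entity and association concrete, a draft model can be written down in code before any database is chosen. The Python sketch below is a hypothetical two-entity model (`Site` and `Observation` are invented for this example); it also includes a small check for one worst-case scenario from step 7, an observation that refers to a site that does not exist.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Site:
    """Entity: a place where data are collected."""
    site_id: str
    name: str


@dataclass(frozen=True)
class Observation:
    """Entity: one measurement, associated with exactly one Site through site_id."""
    site_id: str
    variable: str
    value: float


def orphan_observations(sites, observations):
    """Return observations whose site_id matches no known Site (a broken association)."""
    known = {site.site_id for site in sites}
    return [obs for obs in observations if obs.site_id not in known]
```

    Running `orphan_observations` against test data with deliberately invalid records is one way to exercise the model before real data arrive.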
    Data Discovery

    To make your data available using standard and open software tools you should:

    • Use standard language and terms to clearly communicate to others that your data are available for reuse and that you expect ethical and appropriate use of your data
    • Use an open source datacasting (RSS or other type) service that enables you to advertise your data and the options for others to obtain access to it (RSS, GeoRSS, DatacastingRSS)
    Data Management Planning

    In addition to the primary researcher(s), there might be others involved in the research process that take part in aspects of data management. By clearly defining the roles and responsibilities of the parties involved, data are more likely to be available for use by the primary researchers and anyone re-using the data. Roles and responsibilities should be clearly defined, rather than assumed; this is especially important for collaborative projects that involve many researchers, institutions, and/or groups.

    Examples of roles in data management:

    • data collector
    • metadata generator
    • data analyzer
    • project director
    • data model and/or database designer
    • computing staff responsible for backup and/or storage
    • staff responsible for running instruments
    • administrative support staff responsible for grant submission
    • specialized skills as defined in the plan (GIS, relational database design/implementation, computer programming of sensors/input forms, etc)
    • external data center or archive

    Steps for assigning data management responsibilities:

    1. For each task identified in your data management plan, identify the skills needed to perform the task
    2. Match skills needed to available staff and identify gaps
    3. Develop training/hiring plan
    4. Develop staffing/training budget and incorporate into project budget
    5. Assign responsible parties and monitor results
    Data Preservation and Backup

    A backup policy helps manage users' expectations and provides specific guidance on the "who, what, when, and how" of the data backup and restore process. There are several benefits to documenting your data backup policy:

    • Helps clarify the policies, procedures, and responsibilities
    • Allows you to dictate:
      • where backups are located
      • who can access backups and how they can be contacted
      • how often data should be backed up
      • what kind of backups are performed and
      • what hardware and software are recommended for performing backups
    • Identifies any other policies or procedures that may already exist (such as contingency plans) or which ones may supersede the policy
    • Has a well-defined schedule for performing backups
    • Identifies who is responsible for performing the backups and their contact information. This should include more than one person, in case the primary person responsible is unavailable
    • Identifies who is responsible for checking that the backups have been performed successfully, and how and when they will do so
    • Ensures data can be completely restored
    • Has training for those responsible for performing the backups and for the users who may need to access the backups
    • Is partially, if not fully automated
    • Ensures that more than one copy of the backup exists and that it is not located in same location as the originating data
    • Ensures that a variety of media are used to back up data, as each media type has its own inherent reliability issues
    • Ensures the structure of the data being backed up mirrors the originating data
    • Notes whether or not the data will be archived

    If this information is located in one place, it makes it easier for anyone needing the information to access it. In addition, if a backup policy is in place, anyone new to the project or office can be given the documentation which will help inform them and provide guidance.

    Data Management Planning

    A data management plan should be created at the conceptual stage of the project. It should be considered a living document and a road map for the project, and should be closely followed. Any changes to the data management plan should be made deliberately, and the plan should be updated throughout the data life cycle.

    Data management planning provides crucial guidance to all stages of the data life cycle. It provides continuity for operations within the research group. The data management plan will define roles for all project participants and workflows for data collection, quality assurance, description, and deposit for preservation and access. The data management plan is a tool to communicate requirements and restrictions to all members of the project team, including researchers, archivists, librarians, IT staff and repository managers. The plan governs the active research phase of the project life cycle and makes provisions for the hand-off to a repository for preservation and data delivery.

    Funding agencies and institutions require data management plans for project funding and approval.

    Data Integration

    Different types of new data may be created in the course of a project, for instance visualizations, plots, statistical outputs, a new dataset created by integrating multiple datasets, etc. Whenever possible, document your workflow (the process used to clean, analyze and visualize data) noting what data products are created at each step. Depending on the nature of the project, this might be as a computer script, or it may be notes in a text file documenting the process you used (i.e. process metadata). If workflows are preserved along with data products, they can be executed and enable the data product to be reproduced.
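
    When the workflow is a script, the process metadata can be produced by the script itself. The following Python sketch illustrates the idea; `run_workflow` and the format of its log entries are assumptions for this example rather than an established convention.

```python
def run_workflow(data, steps):
    """Apply each (name, function) step in order and record what each step produced."""
    log = []
    for name, function in steps:
        records_in = len(data)
        data = function(data)
        log.append(f"{name}: {records_in} records in, {len(data)} records out")
    return data, log


def drop_missing(rows):
    """Cleaning step: remove records with no value."""
    return [row for row in rows if row is not None]


def fahrenheit_to_celsius(rows):
    """Conversion step: convert each reading from degrees F to degrees C."""
    return [round((row - 32) * 5 / 9, 1) for row in rows]
```

    Calling `run_workflow(readings, [("drop missing", drop_missing), ("convert to celsius", fahrenheit_to_celsius)])` returns both the derived data and a list of log lines that can be saved next to the data product as process metadata.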

    Quality Assurance and Quality Control

    To assure that metadata correctly describes what is actually in a data file, visual inspection or analysis should be done by someone not otherwise familiar with the data and its format. This will assure that the metadata is sufficient to describe the data. For example, statistical software can be used to summarize data contents to make sure that data types, ranges and, for categorical data, values found, are as described in the documentation/metadata.
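
    A check of this kind does not require dedicated statistical software. As a rough sketch, the Python below compares a comma-separated table with a description of its columns. The structure of the `metadata` dictionary (an `allowed` set for categorical columns, `min` and `max` for numeric ones) is an assumption made for this example and does not follow any particular metadata standard.

```python
import csv
import io


def check_against_metadata(csv_text, metadata):
    """Return a list of mismatches between the data and their documented description."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    # The header is line 1, so the first data row is line 2
    for line_no, row in enumerate(reader, start=2):
        for column, rule in metadata.items():
            value = row.get(column)
            if value is None:
                problems.append(f"line {line_no}: column {column} is missing")
            elif "allowed" in rule:
                if value not in rule["allowed"]:
                    problems.append(f"line {line_no}: {column}={value!r} is not a documented value")
            else:
                try:
                    number = float(value)
                except ValueError:
                    problems.append(f"line {line_no}: {column}={value!r} is not numeric")
                    continue
                if not rule["min"] <= number <= rule["max"]:
                    problems.append(
                        f"line {line_no}: {column}={value!r} is outside {rule['min']}..{rule['max']}"
                    )
    return problems
```

    An empty list means the file agrees with its description as far as these simple rules go; any entries point either to an error in the data or to documentation that needs updating.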

    Quality Assurance and Quality Control

    Ensuring accuracy of your data is critical to any analysis that follows.

    When transcribing data from paper records to digital representation, have at least two, but preferably more, people transcribe the same data, and compare the resulting digital files. At a minimum, someone other than the person who originally entered the data should compare the paper records to the digital file. Disagreements can then be flagged and resolved.
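
    Comparing two independent transcriptions is easy to automate. The sketch below assumes both transcriptions are comma-separated tables with rows in the same order; the function name `compare_transcriptions` is illustrative.

```python
import csv
import io
from itertools import zip_longest


def compare_transcriptions(first_csv, second_csv):
    """Return (line, column, first value, second value) for every cell that differs."""
    first = list(csv.reader(io.StringIO(first_csv)))
    second = list(csv.reader(io.StringIO(second_csv)))
    differences = []
    # zip_longest pads the shorter file, so missing rows or cells also show up as differences
    for line_no, (row_a, row_b) in enumerate(zip_longest(first, second, fillvalue=[]), start=1):
        for col_no, (a, b) in enumerate(zip_longest(row_a, row_b, fillvalue=""), start=1):
            if a.strip() != b.strip():
                differences.append((line_no, col_no, a, b))
    return differences
```

    Each reported cell can then be checked against the paper record to decide which transcription, if either, is correct.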

    In addition to transcription accuracy, data compiled from multiple sources may need review or evaluation. For instance, citizen science records such as bird photographs may have taxonomic identification that an expert may need to review and potentially revise.

    Data Management Planning

    Follow the steps below to choose the most appropriate software to meet your needs.

    1. Identify what you want to achieve (discover data, analyze data, write a paper, etc.)
    2. Identify the necessary software features for your project (i.e. functional requirements)
    3. Identify logistics features of the software that are required, such as licensing, cost, time constraints, user expertise, etc. (i.e. non-functional requirements)
    4. Determine what software has been used by others with similar requirements
      • Ask around (yes, really); find out what people like
      • Find out what software your institution has licensed
      • Search the web (e.g. directory services, open source sites, forums)
      • Follow-up with independent assessment
    5. Generate a list of software candidates
    6. Evaluate the list; iterate back to Step 1 as needed
    7. As feasible, try a few software candidates that seem promising
    Documenting Data

    The spatial extent of your data set or collection as a whole should be described. The minimum acceptable description would be a bounding box describing the northernmost, southernmost, westernmost, and easternmost limits of the data.

    • If the entire collection is from a single location, use the same values for northerly/southerly limits and easterly/westerly values.
    • Be sure to specify in the metadata what units you choose to describe your spatial extent.
    • Use the following guidelines for quality control:
      • If the collection spans the north pole, the northerly limit should be 90.0 degrees
      • If the collection spans the south pole, the southerly limit should be -90.0 degrees
      • If the collection crosses the date line, the westerly limit should be greater than the easterly limit
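
    The quality control guidelines above can be turned into a simple automated check. The Python sketch below assumes that the limits are given in decimal degrees, with latitudes from -90 to 90 and longitudes from -180 to 180; other units or conventions would need different bounds. The function names are illustrative.

```python
def check_bounding_box(north, south, east, west):
    """Return a list of problems with a bounding box given in decimal degrees."""
    problems = []
    if not -90.0 <= south <= north <= 90.0:
        problems.append("latitudes must satisfy -90 <= south <= north <= 90")
    if not (-180.0 <= west <= 180.0 and -180.0 <= east <= 180.0):
        problems.append("longitudes must lie between -180 and 180")
    return problems


def crosses_date_line(east, west):
    """A westerly limit greater than the easterly limit means the box spans the date line."""
    return west > east
```

    A collection from a single location passes the check with identical northerly/southerly and easterly/westerly values, and a box with a westerly limit of 170 and an easterly limit of -170 is recognized as crossing the date line rather than rejected as an error.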

    If your data collection or dataset as a whole contains data acquired over a range of spatial locations during each collection period, it is important to document the spatial resolution of your dataset. Many metadata standards have standard terminology for describing data spacing or resolution (e.g. every half degree, 250 m resolution, etc.), but it may be necessary to describe complex data acquisition schemes textually.

    Data Integration

    Understand the input geospatial data parameters, including scale, map projection, geographic datum, and resolution, when integrating data from multiple sources. Care should be taken to ensure that the geospatial parameters of the source datasets can be legitimately combined. If working with raster data, consider the data type of the raster cell values as well as if the raster data represent discrete or continuous values. If working with vector data, consider feature representation (e.g., points, polygons, lines). It may be necessary to re-project your source data into one common projection appropriate to your intended analysis. Data product quality degradation or loss of data product utility can result when combining geospatial data that contain incompatible geospatial parameters. Spatial analysis of a dataset created from combining data having considerably different scales or map projections may result in erroneous results.

    Document the geospatial parameters of any output dataset derived from combining multiple data products. Include this information in the final data product's metadata as part of the product's provenance or origin.
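
    Before combining sources, it can help to list where their documented geospatial parameters disagree. The Python sketch below only compares parameter values recorded as text or numbers; it does not read geospatial files or perform any re-projection, and the dictionary layout and function name are assumptions for this example.

```python
def mismatched_parameters(datasets, keys=("scale", "projection", "datum", "resolution")):
    """Return {parameter: {dataset name: value}} for parameters that differ between datasets."""
    mismatches = {}
    for key in keys:
        values = {name: params.get(key) for name, params in datasets.items()}
        if len(set(values.values())) > 1:
            mismatches[key] = values
    return mismatches
```

    An empty result means the recorded parameters agree; any entry identifies a parameter that should be reconciled, for example by re-projecting one source, before the data are combined.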

    Data Management Planning

    As part of the data life cycle, research data will be contributed to a repository to support preservation and discovery. A research project may generate many different iterations of the same dataset - for example, the raw data from the instruments, as well as datasets which already include computational transformations of the data.

    In order to focus resources and attention on these core datasets, the project team should define these core data assets as early in the process as possible, preferably at the conceptual stage and in the data management plan. It may be helpful to speak with your local data archivist or librarian in order to determine which datasets (or iterations of datasets) should be considered core, and which datasets should be discarded. These core datasets will be the basis for publications, and require thorough documentation and description.

    • Only the datasets which have significant long-term value should be contributed to a repository, requiring decisions about which datasets need to be kept.
    • If data cannot be recreated or it is costly to reproduce, it should be saved.
    • Four different categories of potential data to save are observational, experimental, simulation, and derived (or compiled).
    • Your funder or institution may have requirements and policies governing contribution to repositories.

    Given the amount of data produced by scientific research, keeping everything is neither practical nor economically feasible.

    Quality Assurance and Quality Control

    Just as data checking and review are important components of data management, so is the step of documenting how these tasks were accomplished. Creating a plan for how to review the data before it is collected or compiled allows a researcher to think systematically about the kinds of errors, conflicts, and other data problems they are likely to encounter in a given data set. When associated with the resulting data and metadata, these documented quality control procedures help provide a complete picture of the content of the dataset. A helpful approach to documenting data checking and review (often called Quality Assurance, Quality Control, or QA/QC) is to list the actions taken to evaluate the data, how decisions were made regarding problem resolution, and what actions were taken to resolve the problems at each step in the data life cycle. Quality control and assurance should include:

    • how to identify potentially erroneous data
    • how to deal with erroneous data
    • how problematic data will be marked (i.e. flagged)

    For instance, a researcher may graph a list of particular observations and look for outliers, return to the original data source to confirm suspicions about certain values, and then make a change to the live dataset. In another dataset, researchers may wish to compare data streams from remote sensors, finding discrepant data and choosing or dropping data sources accordingly. Recording how these steps were done can be invaluable for later understanding of the dataset, even by the original investigator.

    Datasets that contain similar and consistent data can be used as baselines against each other for comparison.

    • Obtain data using similar techniques, processes, environments to ensure similar outcome between datasets.
    • Provide mechanisms to compare data sets against each other that provide a measurable means to alert one of differences if they do indeed arise. These differences can indicate a possible error condition since one or more data sets are not exhibiting the expected outcome exemplified by similar data sets.

    One efficient way to document data QA/QC as it is being performed is to use automation such as a script, macro, or stand-alone program. In addition to providing built-in documentation, automation makes error-checking and review highly repeatable, which is helpful for researchers collecting similar data through time.

    The QA/QC plan should be reviewed by others to make sure it is comprehensive.
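
    As an example of such automation, the Python sketch below compares two data streams that are expected to agree, such as readings from paired sensors, and assigns a flag to each pair. The flag names ("ok", "missing", "discrepant"), the use of `None` for missing readings and the tolerance value are assumptions for illustration; a real project would document its own flags in the metadata.

```python
def flag_discrepancies(stream_a, stream_b, tolerance):
    """Return one (index, a, b, flag) tuple per paired reading from two data streams."""
    report = []
    for index, (a, b) in enumerate(zip(stream_a, stream_b)):
        if a is None or b is None:
            flag = "missing"
        elif abs(a - b) > tolerance:
            flag = "discrepant"
        else:
            flag = "ok"
        report.append((index, a, b, flag))
    return report
```

    Because the rule and the tolerance live in the script, the script itself documents how discrepant data were identified, and it can be rerun unchanged on data collected later.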

    Data Discovery

    People have different perspectives on what data means to them, and how it can be used and interpreted in different contexts. Data users ranging from community participants to researchers in different domains can provide unique and valuable insights into data through the use of annotation and tagging. The community-generated notes and tags should be discoverable through the data search engine to enhance discovery and use.

    When providing capabilities for community tagging and annotations, you should consider the following:

    • Differentiate between the metadata developed by the creator and additional tags or annotations to the data or metadata
    • Allow for community tags and annotations to be indexed as part of the terms or text that is indexed in a search
    • Provide easy-to-understand examples of the kinds of tagging or annotation that will promote the discovery of your data
    • Consider whether or not a review process for community tagging is needed
    • Consider whether controlled vocabularies will be used for tags
    • Provide clear guidelines for the addition of tags and construction of annotations
    • Make tags accessible via an application programming interface (API)
    Documenting Data

    Data sets or collections are often composed of multiple files that are related. Files may have come from (or still be stored in) a relational database, and the relationships among the data tables or other entities are important if the data are to be reused. These relationships should be documented for a repository.

    Describe the overall organization of your data set or collection. Often, a data set or collection contains a large number of files, perhaps organized into a number of directories or database tables. By describing and documenting this organization, files and data can be easily located and used.

    At a minimum, the organization and relationships between the directories and files, or database tables and other supporting materials, need to be fully described. Use a description of the data set or collection (e.g., an abstract) to describe what tables contain, where the supporting material, metadata, or other documentation are located, and/or descriptions of directory contents. Consider describing the logical relationships between data entities using an entity relationship diagram (ERD).

    Associated specimens: if specimens (e.g., taxonomic vouchers, DNA samples) were collected with the data, include the name of the repository in which these specimens reside.

    Data Integration

    The integration of multiple data sets from different sources requires that they be compatible. Methods used to create the data should be considered early in the process, to avoid problems later during attempts to integrate data sets. Note that just because data can be integrated does not necessarily mean that they should be, or that the final product can meet the needs of the study. Where possible, clearly state situations or conditions where it is and is not appropriate to use your data, and provide information (such as software used and good metadata) to make integration easier.

    Data Integration

    Document the steps used to integrate disparate datasets.

    • Ideally, one would adopt mechanisms to systematically capture the integration process, e.g. in an executable form such as a script or workflow, so that it can be reproduced
    • In lieu of a scientific workflow system, document the process, scripts, or queries used to perform the integration of data in documentation that will accompany the data (metadata)
    • Provide a conceptual model that describes the relationships among datasets from different sources
    • Use unique identifiers in the data records to maintain data integrity by reducing duplication
    • Identify foreign key fields in the data records which support the relationship between the data sources
    • When you use datasets and data elements from within those datasets as a source for new datasets, it is important to identify and document those data within the documentation of the new/derived dataset. This is known as dataset provenance; provenance describes the origin or source of something. Just as you would cite papers that are sources for your research paper, it is critical to identify the sources of the data used within your own datasets. This will allow for:
      • tracing the chain of use of datasets and data elements
      • credit and attribution to accrue to the creators of the original datasets
      • the impact on your new datasets and their interpretation to be traced if errors or new information about the original datasets or data elements come to light
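
    Capturing the integration in executable form might look like the following. This is a minimal sketch under stated assumptions: the two tables, their field names, and the layout of the provenance record are hypothetical. It checks that the identifier is unique, joins on the foreign key, and records which sources fed the derived dataset.

```python
# Hypothetical source tables; site_id is the unique identifier in `sites`
# and the foreign key in `measurements`.
sites = [
    {"site_id": "S1", "site_name": "North Meadow"},
    {"site_id": "S2", "site_name": "River Bend"},
]
measurements = [
    {"measurement_id": "M1", "site_id": "S1", "temp_c": 12.5},
    {"measurement_id": "M2", "site_id": "S2", "temp_c": 14.1},
    {"measurement_id": "M3", "site_id": "S9", "temp_c": 9.8},  # no matching site
]


def integrate(sites, measurements):
    """Join measurements to sites on site_id and report unmatched keys."""
    ids = [s["site_id"] for s in sites]
    if len(ids) != len(set(ids)):
        raise ValueError("site_id must be unique in the sites table")
    site_by_id = {s["site_id"]: s for s in sites}

    joined, unmatched = [], []
    for row in measurements:
        site = site_by_id.get(row["site_id"])
        if site is None:
            unmatched.append(row["measurement_id"])
        else:
            joined.append({**row, "site_name": site["site_name"]})
    return joined, unmatched


joined, unmatched = integrate(sites, measurements)

# Provenance record to accompany the derived dataset (field names are illustrative).
provenance = {
    "derived_dataset": "site_temperatures",
    "sources": ["sites", "measurements"],
    "join_key": "site_id",
    "rows_joined": len(joined),
    "unmatched_measurement_ids": unmatched,
}
print(provenance)
```

    Keeping the script and the provenance record with the derived dataset lets others rerun the integration and see which source records were left out.
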
    Data Files and File Management

    A data dictionary provides a detailed description for each element or variable in your dataset and data model. Data dictionaries are used to document important and useful information such as a descriptive name, the data type, allowed values, units, and text description. A data dictionary provides a concise guide to understanding and using the data.
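
    As an illustration, a data dictionary can also be kept in machine-readable form and used to check records. The sketch below assumes a hypothetical two-variable dataset; the field names, allowed values, and the `check_record` helper are invented for this example.

```python
# Hypothetical data dictionary for a small stream-temperature dataset.
DATA_DICTIONARY = {
    "site_id": {
        "name": "Site identifier",
        "type": str,
        "units": None,
        "allowed": {"S1", "S2"},
        "description": "Code for the sampling site",
    },
    "temp_c": {
        "name": "Water temperature",
        "type": float,
        "units": "degrees Celsius",
        "allowed": None,
        "description": "Temperature measured at 10 cm depth",
    },
}


def check_record(record, dictionary=DATA_DICTIONARY):
    """Return a list of problems found when comparing a record to the dictionary."""
    problems = []
    for field, spec in dictionary.items():
        if field not in record:
            problems.append(f"{field}: missing")
            continue
        value = record[field]
        if not isinstance(value, spec["type"]):
            problems.append(f"{field}: expected {spec['type'].__name__}")
        elif spec["allowed"] is not None and value not in spec["allowed"]:
            problems.append(f"{field}: value {value!r} not allowed")
    for field in record:
        if field not in dictionary:
            problems.append(f"{field}: not described in the data dictionary")
    return problems


print(check_record({"site_id": "S1", "temp_c": 11.2}))
print(check_record({"site_id": "S7", "temp_c": "11.2", "depth": 10}))
```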

    Data Management Planning

    Multimedia data present unique challenges for data discovery, accessibility, and metadata formatting and should be thoughtfully managed. Researchers should establish their own requirements for management of multimedia during and after a research project using the following guidelines. Multimedia data includes still images, moving images, and sound. The Library of Congress has a set of web pages discussing many of the issues to be considered when creating and working with multimedia data. Researchers should consider quality, functionality and formats for multimedia data. Transcriptions and captioning are particularly important for improving discovery and accessibility.

    Storage of images solely on local hard drives or servers is not recommended. Unaltered images should be preserved at the highest resolution possible. Store original images in separate locations to limit the chance of overwriting and losing the original image.

    Ensure that the policies of the multimedia repository are consistent with your general data management plan.

    There are a number of options for metadata for multimedia data, with many MPEG standards (http://mpeg.chiariglione.org/), and other standards such as PBCore (http://pbcore.org/).

    The following web pages have sections describing considerations for quality and functionality and formats for each of still images, sound (audio) and moving images (video).

    Sustainability of Digital Formats Planning for Library of Congress Collections:

    Online, generic multimedia repositories and tools (e.g. YouTube, Vimeo, Flickr, Picasa)

    • are low-cost (can be free)
    • are open to all
    • may provide community commenting and tagging
    • some provide support for explicit licenses and re-use
    • provide some options for valuable metadata such as geolocation
    • potential for large-scale dissemination
    • optimize usability and low barrier for participation
    • rely on commercial business models for sustainability
    • may have limits on file size or resolution
    • may have unclear access, backup, and reliability policies, so ensure you are aware of them before you rely upon them

    Specialized multimedia repositories (e.g. MorphBank, Macaulay Library, LIFE)

    • provide domain-specific metadata fields and controlled vocabularies customized for expert users
    • are highly discoverable for those in the same domain
    • can provide assistance in curating metadata
    • optimize scientific use cases such as vouchering, image analysis
    • rely on research or institutional/federal funding
    • may require high-quality multimedia, completeness of metadata, or restrict manipulation
    • may not be open to all
    • may provide APIs for sharing or re-use for other projects
    • are recognized as high-quality, scientific repositories
    • may migrate multimedia to new formats (e.g. analog to digital)
    • may have restrictions on bandwidth usage

    Some institutions or projects maintain digital asset management systems, content management systems, or other collections management software (e.g. Specify, KE Emu) which can manage multimedia along with other kinds of data

    • projects or institutions should provide assistance
    • may be mandated by institution
    • may be more convenient, e.g. when multiple data types result from a project
    • may not be optimized for discovery, access, or re-use
    • usually not domain-specific
    • may or may not be suitable for long-term preservation
    Analysis and Visualization

    To maximize usability of your data or outputs, ensure that those with impairments or disabilities will still be able to access and understand them. The Web Accessibility Initiative, from the W3C, suggests that those producing content for others consider the following (text from their website):

    Make your outputs perceivable

    • Provide text alternatives for non-text content.
    • Provide captions and other alternatives for multimedia.
    • Create content that can be presented in different ways, including by assistive technologies, without losing meaning.
    • Make it easier for users to see and hear content.

    Make your outputs operable

    • Make all functionality available from a keyboard.
    • Give users enough time to read and use content.
    • Do not use content that causes seizures.
    • Help users navigate and find content.

    Make your outputs understandable

    • Make text readable and understandable.
    • Make content appear and operate in predictable ways.
    • Help users avoid and correct mistakes.

    Make your outputs robust

    • Maximize compatibility with current and future user tools.
    Data Files and File Management

    Data files should be managed to avoid disorder. To facilitate access to files, all storage devices, locations and access accounts should be documented and accessible to team members. Use appropriate tools, such as version control tools, to keep track of the history of the data files. This will help with maintaining files in different locations, such as at multiple off-site backup locations or servers.

    Data sets that result in many files structured in a file directory can be difficult to decipher. Organize files logically to represent the structure of the research/data. Include human readable "readme" files at critical levels of the directory tree. A "readme" file might include such things as explanations of naming conventions and how the structure of the directory relates to the structure of the data.

    Data Management Planning

    Shaping the data management plan towards a specific desired repository will increase the likelihood that the data will be accepted into that repository and increase the discoverability of the data within the desired repository. When beginning a data management plan:

    • Look to the data management guidelines of the project/grant for a required repository
    • Ask colleagues what repositories are used in the community
    • Determine if your local institution has a repository that would be appropriate (and might be required) for depositing your data
    • Check the DataONE website for a list of potential repositories.
    Data Preservation and Backup

    All storage media, whether hard drives, discs or data tapes, will wear out over time, rendering your data files inaccessible. To ensure ongoing access to both your active data files and your data archives, it is important to continually monitor the condition of your storage media and track its age. Older storage media and media that show signs of wear should be replaced immediately. Use the following guidelines to ensure the ongoing integrity and accessibility of your data:

    • Test Your Storage Media Regularly: As noted in the “Backup Your Data” best practice, it is important to routinely perform test retrievals or restorations of data you are storing for extended periods on hard drives, discs or tapes. It is recommended that storage media that is used infrequently be tested at least once a year to ensure the data is accessible.
    • Beware of Early Hardware Failures: A certain percentage of storage media will fail early due to manufacturing defects. In particular, hard drives, thumb drives and data tapes that have electronic or moving parts can be susceptible to early failure. When putting a new drive or tape into service, it is advisable to maintain a redundant copy of your data for 30 days until the new device “settles in.”
    • Determine the Life of Your Hard Drives: When purchasing a new drive unit, note the Mean Time Between Failures (MTBF) of the device, which should be listed on its specifications sheet (device specifications are usually packaged with the unit, or available online). The MTBF is expressed as the average number of hours that a device can be used before it is expected to fail. Use the MTBF to calculate how long the device can be used before it needs to be replaced, and note that date on your calendar. For example, if the MTBF of a new hard drive is 2,500 hours and you anticipate having the unit powered on for 8 hours a day during the work week (40 hours a week), the device will reach 2,500 powered-on hours after about 62 weeks, or a little over 1 year.
    • Routinely Inspect and Replace Data Discs: Contemporary CD and DVD discs are generally robust storage media that will fail more often from mishandling and improper storage than from deterioration. However, lower-quality discs can suffer from delamination (separation of the disc layers) or oxidation. It is advisable to inspect discs every year to detect early signs of wear. Immediately copy the data off discs that appear to be warping or discolored. Data tapes are susceptible both to physical wear and poor environmental storage conditions. In general, it is advisable to move data stored on discs and tapes to new media every 2-5 years (specific estimates on media longevity are available on the web).
    • Handle and Store Your Media With Care: All storage media types are susceptible to damage from dust and dirt exposure, temperature extremes, exposure to intense light, water penetration (more so for tapes and drives than discs), and physical shock. To help prolong its operational life, store your media in a dry environment with a comfortable and stable room temperature. Encapsulate all media in plastic during transportation. Provide cases or plastic sheaths for discs, and avoid handling them excessively.
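
    The replacement-date calculation described under "Determine the Life of Your Hard Drives" can be sketched as follows. The function name and the in-service date are hypothetical, and the sketch assumes steady weekly use; the MTBF and usage figures are the ones from the example in the text.

```python
from datetime import date, timedelta


def replacement_date(mtbf_hours, hours_per_day, days_per_week, in_service):
    """Estimate when powered-on time reaches the MTBF, assuming steady weekly use."""
    hours_per_week = hours_per_day * days_per_week
    weeks = mtbf_hours / hours_per_week
    # Adding a timedelta to a date ignores any fractional day.
    return weeks, in_service + timedelta(weeks=weeks)


# 2,500-hour MTBF, 8 hours a day, 5 days a week, put into service on a hypothetical date.
weeks, replace_by = replacement_date(2500, 8, 5, date(2024, 1, 1))
print(f"{weeks:.1f} weeks, about {weeks / 52:.1f} years; replace by {replace_by}")
```
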
    Data Management Planning

    When creating the data management plan, review all who may have a stake in the data so future users of the data can easily track who may need to give permission. Possible stakeholders include but are not limited to:

    • Funding body
    • Host institution for the project
    • Home institution of contributing researchers
    • Repository where the data are deposited

      It is considered a matter of professional ethics to acknowledge the work of other scientists and provide appropriate citation and acknowledgment for subsequent distribution or publication of any work derived from stakeholder datasets. Data users are encouraged to consider consultation, collaboration, or co-authorship with original investigators.

    Documenting Data

    Many times significant overlap exists among metadata content standards. You should identify those standards that include the fields needed to describe your data. In order to describe your data, you need to decide what information is required for data users to discover, use, and understand your data. The who, what, when, where, how, why, and a description of quality should be considered. The description should provide enough information so that users know what can and cannot be done with your data.

    • Who: The person and/or organization responsible for collecting and processing the data. Who should be contacted if there are questions about your data?
    • What: What parameters were measured or observed? What are the units of your measurements or results?
    • When: A description of the temporal characteristics of your data (e.g., time support, spacing, and extent).
    • Where: A description of the spatial characteristics of your data (e.g., spatial support, spacing, and extent). What is the geographic location at which the data were collected? What are the details of your field sensor deployment?
    • How: What methods were used (e.g., sensors, analytical instruments)? Did you collect physical samples or specimens? What analytical methods did you use to measure the properties of your samples/specimens? Is your result a field or laboratory result? Is your result an observation or a model simulation?
    • Why: What is the purpose of the study or the data collection? This can help others determine whether your data is fit for their particular purpose or not.
    • Quality: Describe the quality of the data, which will help others determine whether your data is fit for their purpose or not.

      Considering a number of metadata content standards may help you fine-tune your metadata content needs. There may be content details or elements from multiple standards that can be added to your requirements to help users understand your data or methods. You wouldn't know this unless you consider multiple content standards.

      • If the project or grant requirements define a particular metadata standard, incorporate it into the data management plan
      • If the community has a recommended or most commonly used metadata standard, use it
      • Consider using a metadata standard that is interoperable with many systems, repositories, and harvesters
      • If the community’s preferred metadata standard is not widely interoperable, consider creating metadata using a simple but interoperable standard, e.g. Dublin Core, in addition to the main standard.

      Useful Definitions:

      • Metadata Content Standard: A Standard that defines elements users can expect to find in metadata and the names and meaning of those elements.
      • Metadata Format Standard: A Standard that defines the structures and formats used to represent or encode elements from a content standard.
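
    As a rough illustration of pairing a content standard with a format, the sketch below encodes answers to the who, what, when, where, and how questions as Dublin Core elements in XML. The values are hypothetical, the `record` wrapper element is an assumption rather than part of any standard, and the mapping of questions to elements is only one possible choice.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

# Hypothetical answers to the who/what/when/where/how questions.
description = {
    "creator": "Jane Example, Example Field Station",        # who
    "subject": "stream temperature (degrees Celsius)",        # what
    "date": "2009-06-01/2009-09-30",                          # when
    "coverage": "Example Creek, 35.0 N, 106.0 W",             # where
    "description": "Hourly readings from submerged loggers",  # how
}


def to_dublin_core(fields):
    """Build a simple XML record with one Dublin Core element per field."""
    record = ET.Element("record")  # wrapper element name is an assumption
    for name, value in fields.items():
        element = ET.SubElement(record, f"{{{DC_NS}}}{name}")
        element.text = value
    return record


record = to_dublin_core(description)
print(ET.tostring(record, encoding="unicode"))
```
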
    Documenting Data

    The temporal extent over which the data within your dataset or collection was acquired or collected should be described. Normally this is done by providing

    • the earliest date of data acquisition
    • the date that the last data in the collection was acquired

    Year, month, day, and time should be included in the description. If data collection is still ongoing, the end date can be omitted, though some statement about this should be placed in the dataset abstract. The status of the data set should indicate that data collection is still ongoing if the metadata standard being used supports this type of documentation.

    Describe the temporal resolution of your dataset or collection. The temporal resolution of your dataset is the frequency with which data are collected or acquired. While many metadata standards provide standard nomenclature for describing simple temporal resolutions (e.g., daily or monthly), more complex temporal collection patterns may need to be described textually.
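
    The begin date, end date, and a first estimate of temporal resolution can often be derived from the data themselves. The following is a minimal sketch, assuming hypothetical timestamps in ISO 8601 format; the function name `temporal_summary` is invented for this example.

```python
from datetime import datetime

# Hypothetical observation timestamps in ISO 8601 format.
timestamps = [
    "2009-06-01T00:00:00",
    "2009-06-01T01:00:00",
    "2009-06-01T02:00:00",
    "2009-06-01T04:00:00",  # one reading is missing before this one
]


def temporal_summary(stamps):
    """Return the earliest time, latest time, and most common sampling interval."""
    times = sorted(datetime.fromisoformat(s) for s in stamps)
    intervals = [later - earlier for earlier, later in zip(times, times[1:])]
    typical = max(set(intervals), key=intervals.count) if intervals else None
    return times[0], times[-1], typical


begin, end, resolution = temporal_summary(timestamps)
print("begin:", begin.isoformat())
print("end:", end.isoformat())
print("typical interval:", resolution)
```

    Irregular patterns, such as the gap in this example, are the kind of detail that may need a textual description in the metadata.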

    Documenting Data

    The research project description should contain the following information:

    • Who: project personnel (principal investigator, researchers, technicians, others)
    • Where: location and description of study site or sites
    • When: range of dates for the project
    • Why: rationale for the project (abstract)
    • How: description of project methods

    Other useful information might include the project title, the overarching project (if any), institution(s) involved, and source of funding.

    Documenting Data

    If your project uses a sensor network, you should describe and document that network and the instruments it uses. This information is essential to understanding and interpreting the data you use, and should be included as a part of the metadata generated for your project's data.

    • Describe the basic set-up of the sensor network installation, including such details as mount, power source, enclosures, wiring protection, etc.
    • Describe instrumentation, cameras and samplers (See "Describe measurement techniques" Best Practice in DataONEpedia)
    • Describe data loggers used by the network. Include the following:
      • Manufacturer, model, serial number, dates in use
      • Maintenance/repair history
      • Malfunction history
      • Deployment history
      • Replacement history
    • Ensure localization and time synchronization across data logger arrays
    • Archive copies of any custom scripts, software, or programs used. Scripts and programs should be accompanied by documentation that includes any information pertinent to their use (metadata).
    • As part of metadata, create a human-readable document that describes sampling frequency and data processing performed by the data logger
    Quality Assurance and Quality Control

    As part of any review or quality assurance of data, potential problems can be categorized systematically. For example data can be labeled as 0 for unexamined, -1 for potential problems and 1 for "good data." Some research communities have developed standard protocols; check with others in your discipline to determine if standards for data flagging already exist.

    The marine community has many examples of quality control flags that can be found on the web. There do not yet seem to be standards across the marine or terrestrial communities.
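
    The 0, -1, and 1 labeling scheme described above might be applied as in the sketch below. The readings, the field names, and the plausible range used for the check are hypothetical; a simple range check stands in for whatever review procedure a project actually uses.

```python
# Flag values follow the labeling example in the text above.
UNEXAMINED, POTENTIAL_PROBLEM, GOOD = 0, -1, 1


def range_check(value, low, high):
    """Return GOOD if the value lies within [low, high], else POTENTIAL_PROBLEM."""
    return GOOD if low <= value <= high else POTENTIAL_PROBLEM


# Hypothetical water temperature readings; every record starts unexamined.
readings = [
    {"temp_c": 12.4, "flag": UNEXAMINED},
    {"temp_c": 13.1, "flag": UNEXAMINED},
    {"temp_c": 85.0, "flag": UNEXAMINED},
    {"temp_c": 12.9, "flag": UNEXAMINED},
]

# Review only the first three records; the last one keeps its unexamined flag.
for row in readings[:3]:
    row["flag"] = range_check(row["temp_c"], low=-5.0, high=40.0)

print([row["flag"] for row in readings])
```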

    Data Preservation and Backup

    When searching for data, whether locally on one's machine or in external repositories, one may use a variety of search terms. In addition, data are often housed in databases or clearinghouses where a query is required in order to access data. To reproduce a search and obtain similar, if not identical, results, it is necessary to document which terms and queries were used.


    • Note the location of the originating data set
    • Document which search terms were used
    • Document any additional parameters that were used, such as any controls that were used (pull-down boxes, radio buttons, text entry forms)
    • Document the query term that was used, where possible
    • Note the database version and/or date, so you can identify any data sets added since the query was last performed
    • Note the name of the website and URL, if applicable
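
    One way to keep this documentation is a small structured log entry saved alongside the retrieved data. The sketch below is illustrative only: the function name, field names, repository name, URL, and query string are all hypothetical.

```python
import json
from datetime import datetime, timezone


def record_search(source_name, url, search_terms, parameters, query, database_version):
    """Build a JSON-serializable record of how a data search was performed."""
    return {
        "source_name": source_name,
        "url": url,
        "search_terms": list(search_terms),
        "parameters": dict(parameters),
        "query": query,
        "database_version": database_version,
        "searched_at": datetime.now(timezone.utc).isoformat(),
    }


# All values below are placeholders for illustration.
entry = record_search(
    source_name="Example Data Clearinghouse",
    url="https://data.example.org/search",
    search_terms=["stream temperature", "New Mexico"],
    parameters={"date_range": "2009-2010", "format": "csv"},
    query="keyword:'stream temperature' AND state:'NM'",
    database_version="2010-06-28",
)
print(json.dumps(entry, indent=2))
```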

    Software Tools

    Featured Tool

    Data and Metadata Management

    37 Signals offers a suite of web-based collaboration applications. All applications are mobile-optimized.

    Basecamp - Project Management

    • Project organization, task management
    • Milestone tracking
    • File storage/sharing

    Highrise - Contact Management

    • Tracking of proposals/contracts/drafts
    • Contact list management
    • Task management

    Backpack - File/Information Sharing

    • Document management, file storage
    • Management of documentation
    • Logistical coordination/organization

    Campfire - Collaborative Chat Room Space

    • Chat room space with history tracking
    • Conference calling capability
    • Image/graphic sharing during collaboration
    Free


    Exploration, Visualization, and Analysis

    3D World Studio is a modeling program useful for visualizing real world data utilizing tools developed within computer gaming environments. The program allows you to create buildings and terrain and export your visualization into a variety of formats.

    Cost-basis


    Exploration, Visualization, and Analysis

    Professional standard software for creating original vector-based graphics. Includes powerful drawing tools and brushes.

    The ai file format is a common vector format for exchange, and its feature set allows creation of complex vector artwork. Illustrator imports over two dozen formats (including PDF and SVG). Of particular use to data visualization is the import of SVG (Scalable Vector Graphics), which is a W3C recommendation and is often exported from other programs.

    Cost-basis


    Exploration, Visualization, and Analysis

    Photoshop is a comprehensive photo editing tool produced by Adobe Systems. Users can manipulate photos, graphics, and other raster images using a variety of tools and predefined filters. Photoshop also allows users to record specific photo editing steps, which allows for automated batch processing. Photoshop is available as a stand-alone product, but is also part of Adobe's "Creative Suite" family of products. Photoshop Extended is an enhancement to Photoshop, and provides for enhanced 3D creation and editing.

    Cost-basis


    Exploration, Visualization, and Analysis

    "Amber" refers to two things: a set of molecular mechanical force fields for the simulation of biomolecules (Amber) and a package of molecular simulation programs that includes source code and demos (AmberTools).

    Cost-basis


    Data and Metadata Management

    Velocity is a Java-based template engine. Its template language references objects defined in Java code. When Velocity is used for web development, Web designers can work in parallel with Java programmers to develop web sites.

    Velocity has broader uses, such as generation of SQL, PostScript and XML from templates. It can be used either as a standalone utility for generating source code and reports, or as an integrated component of other systems.

    Free


    Data and Metadata Management

    ArcCatalog is a geobase administration application within ESRI's ArcGIS suite whose primary role is to maintain geospatial data and the corresponding metadata. ArcCatalog provides an integrated and unified view of all the data files, databases, and ArcGIS documents, integrating information that exists in many forms, including relational databases, files, ArcGIS documents, and remote GIS web services. In ArcGIS 10, ArcCatalog is folded into ArcMap - the mapping and analysis application. Users have several choices of how metadata is displayed and metadata editing is easily performed as the Federal Geographic Data Committee (FGDC) metadata requirements are enforced in the application. Metadata is stored in an XML file that 'stays' with the geospatial dataset. ArcCatalog also allows users to connect to other geospatial data servers, create address locators, and create connections to other databases (ODBC) and spatial databases (SDE).

    Cost-basis


    Exploration, Visualization, and Analysis

    ArcGIS Desktop is a collection of software products, produced by Esri, for building complete geographic information systems (GIS). ArcGIS Desktop 9 provides an integrated GIS, combining object-oriented and traditional file-based data models with a set of tools to create and work with geographic data. The following three applications comprise the ArcGIS Desktop software suite:

    • ArcMap (mapping and data manipulation): ArcMap is a map-authoring application
    • ArcCatalog (data management): shared ArcGIS application that allows you to organize and access all GIS information (e.g., maps, globes, datasets, models, metadata, and services). Includes tools for browsing and finding geographic information; recording, viewing, and managing metadata; viewing datasets; and defining the schema structure of the object-based geographic datasets.
    • ArcToolbox (data conversion, modeling, and spatial analysis): includes tools to do geographic feature overlay, feature selection and analysis, topology processing, and data conversion resulting in an output dataset. The geoprocessing framework allows you to use each geoprocessing function in a variety of ways. The tools can be used directly from a dialog, executed via command line, combined with other processes in visual models using Model Builder, or used in advanced scripts.
    Cost-basis


    Data and Metadata Management

    Archivematica is a comprehensive digital preservation system. Archivematica uses a micro-services design pattern to provide an integrated suite of tools that allows users to process digital objects from ingest to access in compliance with the ISO-OAIS functional model.

    Users monitor and control the micro-services via a web-based dashboard. Archivematica uses a range of best practices and metadata standards. Archivematica implements media type preservation plans based on an analysis of the significant characteristics of file formats.

    Free


    Data Deposition, Citation, Curation and Preservation

    The Archivists' Toolkit is an open-source data management system that provides support for the management of archives. This tool is aimed at archival repositories that store various kinds of data, especially text and image data. The tool supports accessioning and describing archival materials, establishing names and subjects associated with archives, managing locations of the materials, and exporting EAD finding aids, MARCXML records, and METS, MODS and Dublin Core records.

    Free


    Data Deposition, Citation, Curation and Preservation

    Archon is an open source, Web-based archive management system for archivists and manuscript curators that automatically publishes archival descriptive information and digital archival objects.

    Archon users do not need to encode a finding aid, input a catalog record, or program a stylesheet. Archon operates through scripts that automatically make data elements in the system searchable and browsable on a repository website. Information can be input or edited using simple web forms. Archon automatically uploads the information, publishes the website, and generates EAD and MARC records.

    Free


    Exploration, Visualization, and Analysis

    Asymptote is a vector graphics language that can be used for technical drawing.

    Being a language, it gives ultimate control to the user. Typesetting of labels and equations is done by LaTeX, which produces high-quality PostScript output. It provides a portable standard for typesetting mathematical figures and generates output in PostScript, PDF, SVG, or 3D PRC vector graphics.

    Free


    Exploration, Visualization, and Analysis

    Bayesian Evolutionary Analysis Sampling Trees (BEAST) is a program for Bayesian Markov Chain Monte Carlo (MCMC) analysis of molecular sequences. It is oriented towards rooted, time-measured phylogenies inferred using strict or relaxed molecular clock models. It can be used as a method of reconstructing phylogenies but is also a framework for testing evolutionary hypotheses without conditioning on a single tree topology. BEAST averages over tree space, so that each tree is weighted proportional to its posterior probability.

    Free


    Data Deposition, Citation, Curation and Preservation

    BibDesk allows the user to edit and manage bibliographies. It helps keep track of both the bibliographic information and the associated files or web links.

    BibDesk helps simplify using a bibliography in other applications and is suited for LaTeX users.

    Free


    Data and Metadata Management

    Bitbucket is a free hosting site for open-source computer program source code. Bitbucket supports version control, offsite backup, and access by multiple authenticated users.

    Free


    Exploration, Visualization, and Analysis

    The Basic Local Alignment Search Tool (BLAST) finds regions of local similarity between genetic sequences. It compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. BLAST can be used to infer functional and evolutionary relationships between sequences as well as help identify members of gene families. It is used to compare a novel sequence with those contained in nucleotide and protein databases by aligning the novel sequence with previously characterized genes. Alternative implementations include AB-BLAST (formerly known as WU-BLAST), FSA-BLAST, and ScalaBLAST.

    Free


    Data and Metadata Management

    Box.net allows you to store and share content online. Files and folders can be shared as web links and synced from the desktop. This means that files can be automatically backed up from multiple computers/devices and stored on the Box.net server. It provides searching tools and the ability to view files without downloading.

    Box.net supports standard web browsers and mobile devices such as Android, iPhone and iPad. It can be automatically accessed through a variety of other mobile apps, and it integrates with other collaboration software such as Google Docs, Gmail, Microsoft Sharepoint, etc. Box.net allows free use of up to 5 GB of storage, and has pricing plans for enterprise capabilities, larger storage use and some additional features such as versioning, encrypted storage, etc.

    Free


    Data and Metadata Management

    CatMDEdit is a metadata editor tool that facilitates the documentation of resources, focusing on the description of geographic information resources. The metadata conforms to Dublin Core and ISO 19115 (Geographic Information) standards. It supports automatic metadata generation for some common geospatial data file formats, including Shapefile, DGN, ECW, FICC, GeoTiff, GIF/GFW, JPG/JGW, and PNG/PGW. CatMDEdit allows the automatic creation of metadata for collections of related resources, in particular spatial series that arise as a result of the fragmentation of geometric resources into datasets of manageable size and similar scale.

    There are Spanish, English, French, Polish, Portuguese and Czech versions. CatMDEdit is an initiative of the National Geographic Institute of Spain (IGN), which is the result of a scientific and technical collaboration between IGN and the Advanced Information Systems Group (IAAA) of the University of Zaragoza with the technical support of GeoSpatiumLab (GSL).

    Free


    Discovery Tools

    CiteBank is an open access repository to aggregate citations for biodiversity publications and deliver access to biodiversity related articles. It provides search and browse capabilities to biodiversity publications stored in multiple international repositories. There is a storage platform for articles and documents that are digitized, but not yet online. It also provides a common system for scholars to share their specialist bibliographies. Users can upload, edit, and share their own personal lists of references and citations. CiteBank indexes the Biodiversity Heritage Library (BHL).

    Free


    Discovery Tools

    CiteULike is a free online web-based bibliography manager. It allows you to post, view, and organize scientific papers. Several journal services have one-click linking to CiteULike for saving references. This application also allows you to post links on a variety of social networking sites. Users can also search this site for publications that others have pulled into the site, and share reference lists publicly.

    Groups can be established within this site to share publications of interest.

    Free


    Exploration, Visualization, and Analysis

    ClustalW2, ClustalW, and ClustalX are general purpose, multiple sequence alignment tools. Multiple alignments of protein sequences can identify conserved sequence regions. This is useful in designing experiments to test and modify the function of specific proteins, in predicting the function and structure of proteins and in identifying new members of protein families. Clustal is a general purpose multiple sequence alignment program for DNA or proteins. ClustalW is the command line version and ClustalX is the graphical version of Clustal. The current version is ClustalW2. It produces biologically meaningful multiple sequence alignments of divergent sequences by calculating the best match for the selected sequences and lining them up so that the identities, similarities and differences can be seen. Evolutionary relationships can be seen by viewing Cladograms or Phylograms.
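
    To illustrate what "seeing the identities" in an alignment means, the following minimal Python sketch marks the columns of an already aligned set of sequences in which every sequence carries the same residue. It does not perform an alignment and is not part of Clustal; the function name and the sample sequences are hypothetical, and the "*" marker simply follows the convention Clustal output uses for fully conserved columns.

```python
def conservation_line(aligned):
    """Mark columns where every sequence has the same non-gap residue with '*'."""
    if not aligned:
        return ""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must have equal length")
    marks = []
    for column in zip(*aligned):
        # A column is conserved only if all residues match and none is a gap.
        identical = len(set(column)) == 1 and column[0] != "-"
        marks.append("*" if identical else " ")
    return "".join(marks)


# Hypothetical, already aligned sequences ('-' denotes a gap).
alignment = ["ACG-TA", "ACGTTA", "ACC-TA"]
marks = conservation_line(alignment)
```

    Printing the sequences followed by the marker line gives a quick visual summary of conserved positions.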

    Free


    Data Deposition, Citation, Curation and Preservation

    ColdFusion is both a platform and a language (ColdFusion Markup Language [CFML]) for enabling developers to build, deploy, and maintain Internet applications. ColdFusion is an Adobe product. ColdFusion is specifically designed to make it easier to connect HTML pages to a database and thereby create dynamically-generated web pages. Website content is managed in connection with a relational database and so can be generated on the fly. Updates can occur on multiple web pages by changing data in the database.

    Cost-basis


    Discovery Tools

    Collection casting is a process for advertising a data set by creating a structured Atom news feed so people and computer systems can find your data. The "Collection Caster" tool is a web-based application that creates a "cast" (an eXtensible Markup Language/XML file) for your data set. You then place the XML file on your web server and create links to it from wherever you would like to advertise your data.
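
    As a rough idea of what such a "cast" could contain, the sketch below builds a very small Atom feed with Python's standard library. It is an illustration only: the element selection is a generic Atom subset, not the actual schema produced by the Collection Caster tool, and the function name, identifiers, and URLs are hypothetical.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"


def build_cast(title, feed_id, updated, dataset_url, summary):
    """Return a minimal Atom feed (as a string) advertising one data set."""
    # Serialize Atom elements without a prefix (default namespace).
    ET.register_namespace("", ATOM_NS)
    feed = ET.Element(f"{{{ATOM_NS}}}feed")
    ET.SubElement(feed, f"{{{ATOM_NS}}}title").text = title
    ET.SubElement(feed, f"{{{ATOM_NS}}}id").text = feed_id
    ET.SubElement(feed, f"{{{ATOM_NS}}}updated").text = updated

    # One entry per advertised data set; here there is only one.
    entry = ET.SubElement(feed, f"{{{ATOM_NS}}}entry")
    ET.SubElement(entry, f"{{{ATOM_NS}}}title").text = title
    ET.SubElement(entry, f"{{{ATOM_NS}}}id").text = dataset_url
    ET.SubElement(entry, f"{{{ATOM_NS}}}updated").text = updated
    ET.SubElement(entry, f"{{{ATOM_NS}}}summary").text = summary
    ET.SubElement(entry, f"{{{ATOM_NS}}}link", href=dataset_url, rel="alternate")
    return ET.tostring(feed, encoding="unicode")


# Hypothetical values for illustration.
cast_xml = build_cast(
    "Example data set",
    "urn:example:feed",
    "2011-01-01T00:00:00Z",
    "http://example.org/data/1",
    "Hypothetical summary of the data set.",
)
```

    The resulting string could be saved as an XML file and placed on a web server, as described above.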

    Free


    Data and Metadata Management

    Confluence is a commercial wiki product used by many universities, open source software efforts, etc. It is a product of Atlassian, and provides rich and flexible editing capabilities and a plugin environment to extend the features of the wiki. There is an extensive range of plugins. Many organizations use it for documentation, group collaboration, project or course sites, knowledge management, internal web sites, etc. It supports a range of access control options for supporting anything from private to group to open-to-the-world access for viewing and editing. It also supports a range of export options that make it easy to get information out of the wiki in a form that can be easily re-purposed.

    Cost-basis


    Exploration, Visualization, and Analysis

    CrowdLabs is a system that supports community sharing, visualization and analysis of workflows using a philosophy employed by social web sites.

    CrowdLabs features the ability to interact with the workflows through a Web browser. Workflow provenance is captured allowing scientists to publish their process as well as their results. For the latter, CrowdLabs generates links to the process that can be embedded in Wiki and HTML pages, as well as LaTeX documents.

    Free


    Discovery Tools

    CyberTracker is a software tool that allows users to collect field data with handheld computers or PDAs. It can also be used to create digital field guides because it allows rich content to be displayed in conjunction with data capture fields.

    The CyberTracker Species Identification Filter consists of a sequence of screens, each with a checklist of characteristic features of a species. Once data has been filtered it can be exported to Microsoft Excel, Comma Separated Values, XML or HTML formats. Creating data elements for each screen automatically creates a structured database. CyberTracker provides some templates.

    CyberTracker software can be used on smart phones and handheld computers with GPS to record observations. The design allows users to display icons, text or both, which makes data collection faster. It also allows field data collection by non-literate users and school children. CyberTracker Conservation is a non-profit organization whose vision is to promote the development of a worldwide environmental monitoring network.

    Free


    Data Deposition, Citation, Curation and Preservation

    The Data Asset Framework (DAF) provides a toolkit for organizations to identify their digital assets and assess how they are managed. Previously known as the Data Audit Framework, this tool guides the user through a DAF assessment. It is primarily useful for institutions, departments or research groups starting to think about data management, and who need to prepare a register of their data assets.

    Free


    Discovery Tools

    Data Turbine (DT) is middleware for streaming sensor data based at environmental observatories. It provides reliable data transport for a wide range of sensors and a comprehensive suite of services for data management, and real time data visualization. It manages data sources and data sinks, data routing, scheduling, and security.
    In the simplest application, one or more of DT's configurable 'on-ramps' reads a data stream from one or more sensors (e.g. a file deposited by a data logger or the instrument itself), making the data accessible to other applications by holding them in its memory. The data are then routed to one or more 'off-ramps' for permanent storage, e.g. a database. While the data are in memory, visualization applications may access them in near real time for monitoring, and quality control routines may be applied.
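
    The on-ramp, in-memory buffer, and off-ramp pattern described above can be sketched in a few lines of Python. This is a conceptual illustration only and does not use the DataTurbine API; the class, method names, and sample readings are all hypothetical.

```python
from collections import deque


class RingBufferHub:
    """Hypothetical in-memory hub: sources push samples in, sinks receive them."""

    def __init__(self, capacity):
        # A bounded deque drops the oldest sample once capacity is reached.
        self._buffer = deque(maxlen=capacity)
        self._sinks = []

    def add_sink(self, sink):
        """Register an off-ramp: any callable that accepts one sample."""
        self._sinks.append(sink)

    def on_ramp(self, sample):
        """Accept one sample from a sensor or logger file and route it."""
        self._buffer.append(sample)
        for sink in self._sinks:
            sink(sample)

    def latest(self, n):
        """Return up to n of the most recent samples for near-real-time viewing."""
        if n <= 0:
            return []
        return list(self._buffer)[-n:]


archive = []                       # stands in for a database off-ramp
hub = RingBufferHub(capacity=3)
hub.add_sink(archive.append)
for reading in [20.1, 20.4, 20.9, 21.3]:
    hub.on_ramp(reading)
```

    In this sketch the archive receives every reading, while the buffer keeps only the three most recent ones for monitoring or quality control.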

    Some examples of applications accessing the data in DT are a DataTurbine actor which makes the data available to the Kepler workflow system, the real time visualization application Real Time (or Remote) Data Viewer (RDV) and Google Earth.

    Free


    Data and Metadata Management

    DBDesigner 4 is an open source visual data design application that includes functionality for database design and data modeling. It is primarily designed for use with the Open Source database platform MySQL. It includes specific functionality for database and data design including documentation, Structured Query Language (SQL), and reverse database engineering for any ODBC-compatible database. For the MySQL database platform, DBDesigner has largely been succeeded by MySQL Workbench, which is an integrated development environment (IDE).

    Free


    Exploration, Visualization, and Analysis

    DesktopGarp is a software package for predicting and analyzing wild species distributions; it is a reengineered version of the original GARP that runs on personal computers and workstations. The acronym GARP stands for Genetic Algorithm for Rule-set Production. It is a genetic algorithm that creates models describing environmental conditions under which the species should be able to maintain populations. For input, GARP uses a set of point localities where the species is known to occur and a set of geographic layers representing the environmental parameters that might limit the species' capabilities to survive. GARP is available to run as part of workflows in Kepler.

    Free


    Data and Metadata Management

    A web-based tool for creating, maintaining and exporting data management plans. Developed initially for a UK audience, but can be adapted to other needs.

    Free


    Data and Metadata Management

    To be released September 2011, this data management tool will be based on the requirements of US funding agencies and the institutions conducting the research, and will build upon the DMP Online tool developed by the UK Digital Curation Centre. It will allow researchers to quickly initiate a data management plan online. The tool will help researchers answer various data management questions relating to their project, such as how data will be documented and made available for public or secondary uses, how data quality will be assured, backup procedures, and preservation plans. It will also aid institutions in identifying costs associated with data management, and help with forward resource planning.

    This tool will:

    • Be web-based
    • Allow users to generate and export data management plans
    • Connect users with information about funder requirements, related resources, and institutional resources/services
    • Initially offer data management plan templates for NSF guidelines
    • Offer local account access as well as some authentication methods to affiliate users with institutional resources
    Free


    Data and Metadata Management

    DotNetNuke is a popular web content management system (or CMS) for Microsoft ASP.NET. The flexible DotNetNuke open source CMS platform also functions as a web application development framework. It is available in both a free Community Edition as well as cost basis Professional and Enterprise editions.

    Free


    Data Deposition, Citation, Curation and Preservation

    DRAMBORA is the Digital Repository Audit Method Based On Risk Assessment. The methodology takes a risk-focused approach to assessing the status of repositories and other digital asset collections and supports repository managers or collection custodians in managing the risks associated with their collections, from direct physical risk (such as fire and flood) to less tangible risks such as reputational damage.

    In a similar way to the Data Asset Framework, the DRAMBORA tool supports users engaged in carrying out a DRAMBORA audit and allows the sharing and maintenance of their repository audit paperwork.

    DRAMBORA has both web-based and downloadable versions of their risk assessment tool.

    Free


    Data and Metadata Management

    Dropbox is an on-line file storage and sharing service. 2GB of Dropbox is available for free, with subscriptions up to 100GB available. Shared folders allow people to work together on the same projects and documents.

    Dropbox files are also available off-line, and folders can be synced between multiple computers and mobile devices. Dropbox therefore can be used as a backup mechanism for important files, although it is by no means a complete solution.

    Free


    Data and Metadata Management

    Drupal is an open source CMS (Content Management System) for websites. Drupal enables webmasters to create professional websites with a minimal amount of specialized coding or systems support.

    Drupal's architecture consists of a core platform that can be customized with user-supplied profiles, modules, themes, and languages. Drupal separates content from presentation, and -- when combined with Drupal's "codeless" module building -- allows high flexibility, while maintaining relative simplicity. This yields high productivity in designing and deploying a wide variety of websites, thus driving Drupal's wide user base.

    Code (when required) is written in PHP and the content is stored in a MySQL database. In general, Drupal is fairly approachable for someone with basic programming and web authoring skills. It is a generally low-overhead solution for web page construction that enables multiple contributors. Many university departments support Drupal for their investigators and will help with Drupal administration.

    Free


    Discovery Tools

    NASA's Earth Observing System (EOS) Clearinghouse (ECHO) is a metadata registry and order broker that allows query and access to data from a large number of repositories, primarily NASA repositories, though any repository can request to have its metadata included in the ECHO database. ECHO stores metadata from a variety of science disciplines and domains, including Climate Variability and Change, Carbon Cycle and Ecosystems, Earth Surface and Interior, Atmospheric Composition, Weather, and Water and Energy Cycle. The primary way for interactively searching and ordering data within ECHO is through the WIST client, though a variety of web service APIs are available for those who wish to include WISTFUL data within their own clients.

    Free


    Data Deposition, Citation, Curation and Preservation

    EndNote is software for managing bibliographic citations. It integrates with Web of Science to allow for quick entry of citations. EndNote Web is also available to licensed users, which allows them to use EndNote when they are away from their own computer. Citations can be output in a large number of citation styles. Plugins exist for word processors and web browsers to enable quick citation.

    Cost-basis


    Data and Metadata Management

    Enterprise Architect is a modeling, visualization, and design platform. It can be used in software design, data modeling, and database design and is useful for creating and analyzing UML diagrams. It has a built-in data modeling profile that extends UML to provide a mapping from the database concepts of tables and relationships onto the UML concepts of classes and associations. Enterprise Architect supports modeling of database schema for many popular relational database management systems (RDBMS). It can be used to capture and trace formal requirements for designing, building, and deploying software and databases. Enterprise Architect also supports generation and reverse engineering of source code for a variety of programming languages. It has a built-in source code editor that lets you navigate from a visual model to source code in the same interface.

    Cost-basis


    Exploration, Visualization, and Analysis

    ENVI is software for processing and analyzing geospatial imagery. ENVI handles hyperspectral, LiDAR, and other remotely sensed data sets, both through wizard-based approaches and by allowing users to program operations. The main benefit of using ENVI is for the analysis and visualization of spectral and hyperspectral data. Currently, ENVI is developing comprehensive GIS tools to integrate within the ESRI software family. ENVI has internal workflows and allows users to customize procedures with the IDL programming language.

    Cost-basis


    Exploration, Visualization, and Analysis

    Erdas Imagine is an image processing software package that allows users to process both geospatial and other imagery as well as vector data. Erdas can also handle hyperspectral imagery and LiDAR from various sensors. Erdas also offers a 3D viewing module (VirtualGIS) and a vector module for modeling. The native programming language is EML (Erdas Macro Language). Erdas is integrated within other GIS and remote sensing applications and the storage format for the imagery can be read in many other applications (*.img files). Leica Geosystems also purchased ER Mapper to add to their mapping software. Imagine is tightly woven into the GIS fabric more than other image processing software packages and that is the advantage of this package.

    Cost-basis


    Modeling

    CA ERwin Data Modeler (or ERwin for short) is a data modeling and database design tool that is used to create conceptual, logical, and physical data models. ERwin can create the actual database from the physical model, and create different physical implementations from a single logical model. ERwin can also reverse-engineer existing databases into a data model diagram. ERwin works with many database management systems (DBMS). Outputs from the tool include entity-relationship (ER) diagrams and standard or custom reports on all objects in the design (tables, fields, relationships).

    While users are charged for the full version of ERwin, there is a free "Community Edition" available for students and others new to modeling to try the functionality of the software on a small dataset. The Community Edition has a limit on the number of objects (25) that can be created in the data model.

    Cost-basis


    Exploration, Visualization, and Analysis

    ArcGIS Explorer is a geographical information system (GIS) viewer to explore, visualize, and share GIS information. It provides a freely-distributable way to share products produced by ESRI's commercial products.

    There are two versions: one for the desktop, the other on-line. The on-line version includes support for time-enabled maps.

    Free


    Scientific Workflows

    ESRI's ArcGIS Desktop software contains ModelBuilder, which is a workflow tool that enables the creation and execution of consistent, repeatable models that are composed of one or more processing steps. ModelBuilder can be used to ensure the integrity of a particular model or set of analytical processes through modeling, storing, and publishing complex operations and workflows. ModelBuilder workflows can be created and executed on both the desktop and over the web. Within ModelBuilder, a model consists of processes and the connections between them. Parameters can be defined that will be filled into a pop-up form at runtime. Most of the geoprocessing tools available within ArcGIS can be used as processes within ModelBuilder as part of a workflow. Model workflows can also be rerun with different data or inputs for evaluating scenarios. ModelBuilder is included in all license levels of ArcGIS Desktop. Models created can also be exported as scripts in Python and other programming languages.

    Cost-basis


    Data Deposition, Citation, Curation and Preservation

    ArcGIS Online is a system created by ESRI aimed at providing a common platform to find, share, and organize geographic content and to build GIS applications. The web front end to ArcGIS Online is ArcGIS.com. Through ArcGIS Online you can access maps, applications, and tools published by ESRI and other geographic information systems (GIS) users.

    ArcGIS Online provides web services that the user can use as data sources in ESRI products such as ArcMap, and for geoprocessing and analysis.

    You can also upload and share your own content through ArcGIS Online. ArcGIS Online enables you to control who can access the resources that you upload. You can create one or more user groups, within which you can share resources and collaborate. ESRI provides services that enable you to host your own content in ArcGIS Online, and you can publish your content as part of a community basemap that users can access freely through ArcGIS Online as a map service. ArcGIS Online also includes tools for creating online maps and for viewing maps created by others.

    Free


    Exploration, Visualization, and Analysis

    ArcGIS Server is a component of the ESRI suite of commercial software focused on the ability to create, manage, and distribute GIS services over the Web to support desktop, mobile and Web mapping applications.

    The user can choose from Basic, Standard, or Advanced editions of ArcGIS Server. Costs vary depending on the level of service required.

    Cost-basis


    Exploration, Visualization, and Analysis

    ESRI provides three application program interfaces (API) for embedding Internet mapping capabilities into websites. The JavaScript API requires only a text editor, while the Silverlight/WPF and Flex versions may require additional licensing/software (see additional information below). These applications can be used with many online GIS web services, including several offered free of charge at ArcGIS.com.

    Free


    Discovery Tools

    ArcMap is the map display and editing workhorse for the ESRI ArcGIS Geographical Information System (GIS) software package. It is most widely used for map creation, but also has broad capabilities for editing and analysis. The "Toolbox" available in ArcMap provides an encyclopedic array of GIS data manipulation and analysis functions for almost any application.

    ArcMap is included in three versions of ArcGIS Desktop, each with an increasing array of capabilities. "ArcView" provides basic mapping functionality. "ArcEditor" adds the ability to create and edit data and to do interconversions between raster and vector data formats. Finally, "ArcInfo" contains the full array of mapping, editing and analysis features.

    ArcMap is sold with a number of "extensions" that, if purchased, extend the capabilities to allow manipulation and analysis of additional forms of data or add additional capabilities. For example, the "Spatial Analyst" extension provides capabilities for the manipulation and analysis of raster data such as satellite images and digital elevation model (DEM) data, including image processing and hydrologic analysis functions.

    Scripting of functions is provided in several languages, with built-in support for Python programming within the ArcMap interface.

    Cost-basis


    Data Deposition, Citation, Curation and Preservation

    ESRI Geoportal Server is a free open source product that enables discovery and use of geospatial resources including datasets, rasters, and Web services. It can help organizations manage and publish metadata for their geospatial resources and provides access to users. The Geoportal Server supports standards-based clearinghouse and metadata discovery applications. There are four key features: cataloging, geoportal administration, data publishing, and data discovery.

    The geoportal host environment requires an operating system, a database, a full Java JDK, a web application server, and access to ArcGIS Server services (ArcGIS Server map, locator, and geometry services for the geoportal search map and place finder). The geoportal connects to an organization's LDAP structure, and thus needs access to a directory server.

    ESRI Geoportal Server is a free, open source product that is available for download at http://sourceforge.net/projects/geoportal/. It is a stand-alone product that does not require ArcGIS Server or an ArcGIS Server license. It has been released under the Apache 2.0 open source license.

    Free


    Data and Metadata Management

    Esri2EML and BDP2EML are very closely related programs for translating metadata from ESRI or FGDC and from Biological Data Profile (BDP) to Ecological Metadata Language (EML), version 2.0.1. Generally, these kinds of programs are called "crosswalks".

    The Esri2EML XSLT stylesheet will allow you to create an Ecological Metadata Language (EML) document out of an FGDC XML file generated by ESRI ArcGIS products. Note that there is a document explaining how to convert ArcGIS v10.0 metadata to EML, since ESRI changed their XML metadata structure at Version 10 of ArcGIS.

    Biological Data Profile <-> Ecological Metadata Language Crosswalk:
    BDP is an extension of FGDC that addresses certain biological features, such as taxonomy. The stylesheets translate XML both ways: EML->BDP and BDP->EML. BDP is used by NBII.
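
    The crosswalks themselves are XSLT stylesheets. Purely to illustrate the idea of a crosswalk, the following Python sketch copies two fields from a simplified FGDC-style record into a simplified, EML-like dataset element. It is not the Esri2EML or BDP2EML logic: the mapping table, function name, and sample record are hypothetical, and the output is not a valid EML document.

```python
import xml.etree.ElementTree as ET

# Source path in an FGDC-style record -> target element name in the output.
# Only two fields are mapped here; a real crosswalk covers many more.
FIELD_MAP = {
    "idinfo/citation/citeinfo/title": "title",
    "idinfo/descript/abstract": "abstract",
}


def fgdc_to_simple_eml(fgdc_xml):
    """Copy mapped FGDC fields into a simplified, EML-like <dataset> element."""
    source = ET.fromstring(fgdc_xml)
    dataset = ET.Element("dataset")
    for source_path, target_name in FIELD_MAP.items():
        node = source.find(source_path)
        # Skip fields that are missing or empty in the source record.
        if node is not None and node.text:
            ET.SubElement(dataset, target_name).text = node.text.strip()
    return dataset


# Hypothetical, heavily abbreviated source record.
sample = """<metadata><idinfo>
  <citation><citeinfo><title>Stream temperature</title></citeinfo></citation>
  <descript><abstract>Hourly readings.</abstract></descript>
</idinfo></metadata>"""
result = fgdc_to_simple_eml(sample)
```

    Keeping the mapping in a table, separate from the copying logic, mirrors how a crosswalk pairs each source element with its target.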

    The content standard for Digital Geospatial Metadata and Biological Data Profile can be found here: http://www.fgdc.gov/standards/projects/FGDC-standards-projects/metadata/...

    The Esri2EML stylesheet: http://intranet.lternet.edu/im/project/Esri2Eml/docs

    The BDP<->EML stylesheets:
    https://svn.lternet.edu/websvn/filedetails.php?repname=EML&path=%2Ftrunk...

    Both sets of stylesheets are part of the EML code in the Ecoinformatics repository:
    https://code.ecoinformatics.org

    Free


    Discovery Tools

    EtherPad is a collaborative, web-based text editor that supports concurrent document changes, versioning, and built-in formatting for a group of people. Etherpad makes it easy for people to simultaneously type, in real-time, on one document. Each author enters their name and selects a color - as they type, text shows up as a color. A collaborative "pad" can be created by anyone, and each pad has its own URL. It is possible to set up password protected pads. It is free and open source software that can be installed on a web server, and there are public websites that provide the EtherPad application as a service.

    Free


    EVO
    Discovery Tools

    The EVO (Enabling Virtual Organizations) system provides broad support for video conferencing and related services. In addition to standard video conferencing, EVO supports instant messaging, private or group chats, session recording and playback, shared files, whiteboard functions, and encryption of all media. Users must register to get a login name and password.

    EVO is based on Java Webstart, and so needs a Java Runtime Environment (JRE) on the client computer. The first time you run EVO, the JRE can be installed. A client program named Koala then runs on the user's machine and connects to the server. For questions about firewalls, registration, etc. see the FAQ on the EVO site, below.

    Free


    Data and Metadata Management

    eXist is an open source database management system built on XML (extensible markup language) technology. eXist stores information (data or metadata) encoded in XML. The database is queried using XQuery (XML query language), and follows many other W3C XML standards, including XPath and XSLT.
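
    For readers unfamiliar with path-based XML queries, the short Python sketch below shows the kind of selection that such a query expresses. It uses the limited XPath subset in Python's standard library rather than eXist or XQuery, and the catalog document and its element names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML document of the sort that might be stored in an XML database.
CATALOG = """<catalog>
  <dataset id="d1"><title>Soil moisture</title><year>2009</year></dataset>
  <dataset id="d2"><title>Leaf area index</title><year>2010</year></dataset>
</catalog>"""

root = ET.fromstring(CATALOG)

# Path expression: titles of all datasets whose <year> child equals 2010.
titles = [t.text for t in root.findall("./dataset[year='2010']/title")]

# Path expression: the dataset element with a given id attribute.
first = root.find("./dataset[@id='d1']")
```

    A full XQuery processor supports far richer expressions (joins, sorting, functions) than this subset.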

    eXist includes a query editor and debugger. There is a large library of example data, code and applications that can be adapted.

    eXist typically runs as a Java web application under Tomcat, and also comes with a desktop application which is useful for uploading documents in batch-mode.

    Free


    Discovery Tools

    The eXtensible Text Framework (XTF) is an open source platform for providing access to digital content. Developed and maintained by the California Digital Library (CDL), XTF functions as the primary access technology for the CDL's digital collections and other digital projects worldwide.

    XTF is widely used in the digital library community and there is extensive help, tutorials, and a community of users for assistance in using the tool.

    Free


    Data and Metadata Management

    EZID allows users to create and manage unique, persistent identifiers. EZID is a service from the University of California Curation Center (UC3) at the California Digital Library that makes it simple for digital object producers (researchers and others) to obtain and manage long-term identifiers for their digital content. You can assign identifiers to anything: scientific datasets, technical reports, audio files, digital photographs, and non-digital objects as well.

    Using the EZID service users can do the following:

    • Create identifiers for any kind of entity including physical, digital (images, data, text, etc.), abstracts, etc. For identifiers of objects on the web, use EZID to maintain their current locations so that people who click on the identifiers are correctly forwarded.
    • Store citation metadata with identifiers to aid in interpreting and maintaining them. Several citation formats are possible.

    EZID supports a number of persistent identifier technologies. Currently supported technologies include ARKs (Archival Resource Keys) and DOIs (Digital Object Identifiers), and other persistent identifier schemes will be added. Users can use the EZID service either through a user-interface or through an API.

    Cost-basis


    Exploration, Visualization, and Analysis

    The Forest Sector Carbon Calculator is a tool to help users learn about how carbon stores in the forest change over time.

    The Forest Sector Carbon Calculator integrates a number of kinds of software to gather information from users, process, and then output results. The foundation for the Calculator is a model called LANDCARB that is designed to simulate the dynamics of living and dead pools of carbon in forest stands and landscapes. It also includes a submodel that estimates how harvested carbon is manufactured into forest products, as well as how these are used, and disposed.

    This web interface allows users to control scenarios by selecting different regions, integrating past histories of disturbance and management, and choosing alternative futures. Calculations can be done for a single stand or for an entire landscape. Reports and time trend graphs on stores in the forest, in wood products (including bioenergy), and disposal can be generated.

    Free


    Exploration, Visualization, and Analysis

    FRAGSTATS is a computer software program designed to work with geospatial data to help the user categorize landscape patterns and metrics, and is useful in identifying areas where land use activities have resulted in fragmentation of the landscape. The current release is version 3.3.

    The program is currently undergoing another major revamping, which will result in the release of version 4.0 sometime in 2011.

    Free


    Discovery Tools

    Fusion is a LiDAR viewing and analysis software tool developed by the Silviculture and Forest Models Team, Research Branch of the US Forest Service. Fusion also works with IFSAR and terrain data sets. LiDAR uses a laser sensor composed of a transmitter and receiver, a geodetic-quality Global Positioning System (GPS) receiver and an Inertial Navigation System (INS) unit. The laser sensor is mounted to the underside of an aircraft. Once airborne, the sensor emits rapid pulses of infrared laser light, which are used to determine ranges to points on the terrain below. For more information: http://forsys.cfr.washington.edu/fusion/fusion_overview.html

    Free


    Discovery Tools

    Gallery is a web-based image management system characterized as a "photo album organizer." Typical uses are to display collections of photographs on a web page. Index pages allow users to view small "thumbnail" versions of images, with the ability to zoom in to see images. Image upload capabilities and searching are also supported.

    Gallery is written in the PHP programming language. Several versions (1, 2, and 3) are available and still undergoing development. They differ slightly in their capabilities and in the type of database used: Gallery 1 uses an integrated file-based database system, while Gallery 2 and 3 use an external database (e.g., MySQL).

    Free


    Discovery Tools

    The Global Biodiversity Information Facility Integrated Publishing Toolkit (GBIF IPT) is an open source, Java-based web application that connects and serves three types of biodiversity data: taxon primary occurrence data, taxon checklists and general resource metadata. The data registered in a GBIF IPT instance is connected to the GBIF distributed network and made available for public consultation and use.

    The IPT tool can be used in two forms:

    1) As a cloud user, through one of its instances (at GBIF http://ipt.gbif.org/ or at the National Biological Information Infrastructure at: http://nbii-ipt.ornl.gov/ipt)

    2) As a locally deployed service, which requires some technical expertise (a WAR file deploys as a service and requires a servlet container such as Tomcat) and a series of other requirements.

    Free


    Data and Metadata Management

    Developed by NASA Global Change Master Directory (GCMD), DocBUILDER is a metadata development tool that can be used to develop collection-level metadata for deposition in the GCMD repository. Any organization or individual can use the tool, although registering an account is a prerequisite. Tool descriptions and other documentation are password protected.

    Free


    Exploration, Visualization, and Analysis

    The Geospatial Data Abstraction Library (GDAL) is a C/C++ geospatial data format translation programming library and associated set of utility programs built using the library. Within the GDAL library are two components: the GDAL component, which supports the reading/writing/translation of numerous raster formats, and the OGR component, which supports reading/writing/translation of numerous vector data formats. The GDAL/OGR library is integrated into a wide variety of Open Source and commercial products as a core data access library for reading and writing geospatial data in the supported data formats. The GDAL/OGR Application Programming Interface (API) has also been implemented in a number of other programming languages for programmatic processing of geospatial data, including
    • Perl
    • Python
    • VB6 Bindings (not using SWIG)
    • GDAL Bindings into R by Timothy H. Keitt.
    • Ruby
    • Java
    • C# / .Net

    Free


    Exploration, Visualization, and Analysis

    GenePattern is a genomic analysis platform that provides access to more than 150 tools for gene expression analysis, proteomics, SNP analysis, flow cytometry, RNA-seq analysis, and common data processing tasks. A web-based interface provides access to these tools and allows the creation of multi-step analysis pipelines that enable reproducible in silico research.

    Free


    Discovery Tools

    GeoNetwork OpenSource is an open source geospatial data catalog service host, metadata creation and management system, and basic web mapping platform that is in broad use by a number of international organizations in deploying their Spatial Data Infrastructure (SDI). GeoNetwork implements a set of key international geospatial standards, including catalog service (OGC Catalog Services for Web [CSW], Z39.50), metadata (FGDC, ISO19139, Dublin Core [for general documents]), and map visualization (OGC Web Map Services [WMS]) standards that allow for integration of GeoNetwork services into national (e.g. NBII, Geospatial OneStop) and international (e.g. GEOSS) geospatial data discovery and access programs.

    Free


    Exploration, Visualization, and Analysis

    GeoServer is a Java-based Open Source software map server that enables the sharing and editing of geospatial data on the Internet via open standards. It implements the Open Geospatial Consortium (OGC) Web Map Service (WMS), Web Feature Service (WFS), and Web Coverage Service (WCS) standards. It can serve data that can be displayed in common mapping client applications such as ArcMap, Google Maps, Google Earth, Yahoo Maps, and Microsoft Virtual Earth as well as Open Source applications such as OpenLayers and MapServer. It can also read a variety of open and proprietary data formats including PostGIS, ESRI ArcSDE, shapefiles, MySQL, GeoTIFF, and JPEG2000, to name a few.

    Free


    Exploration, Visualization, and Analysis

    GIMP is a free, open source alternative to Photoshop. It is intended for use with raster graphics, and can be used for creating new images, retouching photographic images and converting files to different file types. GIMP supports most image file types as well as the file types of other graphical image manipulation programs. It is also possible to create custom scripts to automate a variety of tasks.

    Free


    Git
    Data and Metadata Management

    Git is a distributed version control system (DVCS). It supports distributed development by giving each developer or user a local copy of a repository, including the entire revision history. Changes are copied from one repository to another, and branching and merging are easy to do. Because users do not depend on network access or a central server, Git is very fast and scales well to large projects. It provides cryptographic authentication of history and offers tools for both easy human usage and easy scripting to perform clever operations.
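
    The cryptographic authentication of history rests on content hashing. As a minimal sketch (not Git's own code), the following Python computes the object ID that Git has traditionally assigned to a file's contents (a "blob") using SHA-1. The function name git_blob_id is hypothetical, and repositories configured for a different hash algorithm would produce different IDs.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object ID Git assigns to a blob with this content."""
    # Git hashes a short header ("blob <size>\0") followed by the raw bytes,
    # so identical content always maps to the same ID.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


if __name__ == "__main__":
    # Compare with the output of `git hash-object` on a file with the same bytes.
    print(git_blob_id(b"hello world\n"))
```

    Because every commit records the IDs of the content and of its parent commits, changing any earlier file or commit changes all later IDs, which is what makes tampering detectable.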

    Free


    Discovery Tools

    The Global Change Master Directory (GCMD) contains collection-level (data set level) metadata from repositories around the world that hold Earth or environmental science data. The GCMD website offers faceted search and browse for data sets by spatiotemporal extent, resolution, data center, location, instrument, platform, and project. In addition to discovery tools, the GCMD provides a variety of on-line and web-accessible tools for developing collection-level metadata that is NASA DIF, FGDC CSDGM, and ISO 19115 compliant.

    Free


    Data and Metadata Management

    GNU RCS (Revision Control System) is an open source revision control system for text files, source code, programs, graphics, and other documentation. It stores, tracks, logs, identifies and merges versions. Development continues as a volunteer effort under the Free Software Foundation.

    Free


    Exploration, Visualization, and Analysis

    Google Charts is a combination of two application programming interfaces (APIs), Google Chart API and Google Visualization API. Google Chart API creates static visualizations of data and embeds them into webpages. Some HTML programming experience is recommended. Available visualization types include standard scatter, line, bar, pie, and box charts as well as Venn diagrams, dynamic icons or callouts, formulas, and connectivity graphs. In addition, maps can be made and embedded into webpages.
    The Google Visualization API creates dynamic visualizations that allow for user interaction within a webpage. These visualizations are created with JavaScript, and Google provides links to JavaScript tutorials as well as source or example code in a Java library. In addition to the visualization types available with the Google Chart API, the Google Visualization API includes timelines, heat maps, tree maps, word or term clouds, filters for other visualizations, and interactive Google Maps.
    Data sources that can be used for both APIs include any file that can be imported as a two-dimensional table, including text files, spreadsheets, and database tables. The data must be retrieved using a retrieval protocol and then structured accordingly. Google provides Java, Python, and Google Web Toolkit (GWT) libraries for data retrieval, as well as an API to retrieve data from Google Spreadsheets.
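
    To illustrate what structuring a two-dimensional table can look like, the sketch below converts CSV text into a dictionary with a list of typed columns ("cols") and a list of rows ("rows"), the general shape of the JSON table literal used by the Google Visualization API. The helper name csv_to_datatable, the type-guessing rule, and the sample data are assumptions made for illustration; this is not one of the Google-provided libraries.

```python
import csv
import io
import json


def csv_to_datatable(text: str) -> dict:
    """Convert CSV text (header row first) into a cols/rows table dictionary."""
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]

    def is_number(value: str) -> bool:
        try:
            float(value)
            return True
        except ValueError:
            return False

    # Treat a column as numeric only if every value in it parses as a number.
    numeric = [all(is_number(r[i]) for r in body) for i in range(len(header))]
    cols = [
        {"id": name, "label": name, "type": "number" if num else "string"}
        for name, num in zip(header, numeric)
    ]
    # Each row is a list of cells ("c"), and each cell carries a value ("v").
    table_rows = [
        {"c": [{"v": float(v) if num else v} for v, num in zip(r, numeric)]}
        for r in body
    ]
    return {"cols": cols, "rows": table_rows}


if __name__ == "__main__":
    sample = "site,temperature_c\nA,12.5\nB,14.0\n"  # made-up example data
    print(json.dumps(csv_to_datatable(sample), indent=2))
```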

    Free


    Data and Metadata Management

    Google Docs provides for web-based creation, editing and management of:

    • spreadsheets
    • textual documents
    • presentations
    • forms
    • graphics

    Users can upload and download documents in different formats, and Google Docs can be used as a basic migration tool between them. The tool allows collaborative real-time editing by multiple users of the same document. It also allows users to share collections of documents with others. Different permission levels can be assigned to documents, restricting access to individuals, groups, or open to the public.

    Free


    Discovery Tools

    Google Earth is a flexible mapping and display tool that is installed on a local computer and accesses on-line data sources provided by Google. These data sources include aerial imagery at a variety of scales and additional data hosted by Google. The user needs internet access to view the data provided by Google. Once Google Earth is installed, it allows the user to "zoom in" and view satellite and photographic images of any place in the world. The user can also change the perspective of the view to panoramic and "fly" from one area of the globe to another as if in an airplane. It includes 3-D representations of topography and of buildings (for selected areas), and provides street view photography for some locations. It includes links to a large number of external services that provide additional information, such as photos of points of interest. Time controls allow you to select imagery from a number of different years for analyses of landscape change.

    If you download or are given a file in KML or KMZ format, you can set your local computer preferences to automatically start Google Earth when the file is "clicked".

    In addition to its display capabilities, Google Earth can also be used to create data through on-screen digitizing. It produces KML/KMZ files that can then be used with other GIS products such as Google Maps, ArcGlobe and ArcMap.
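
    KML is an XML format, so a file like those described above can also be written by a short script. The following is a minimal sketch using only the Python standard library; the function name make_placemark_kml and the sample coordinates are illustrative assumptions, and it covers only a single point placemark rather than the full range of what Google Earth can export.

```python
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"


def make_placemark_kml(name: str, lon: float, lat: float) -> str:
    """Return a KML document (as text) containing a single point placemark."""
    # Emit KML as the default namespace instead of an "ns0:" prefix.
    ET.register_namespace("", KML_NS)
    kml = ET.Element(f"{{{KML_NS}}}kml")
    document = ET.SubElement(kml, f"{{{KML_NS}}}Document")
    placemark = ET.SubElement(document, f"{{{KML_NS}}}Placemark")
    ET.SubElement(placemark, f"{{{KML_NS}}}name").text = name
    point = ET.SubElement(placemark, f"{{{KML_NS}}}Point")
    # KML lists coordinates as longitude,latitude (optionally followed by altitude).
    ET.SubElement(point, f"{{{KML_NS}}}coordinates").text = f"{lon},{lat}"
    return ET.tostring(kml, encoding="unicode")


if __name__ == "__main__":
    # Example values only; save the text with a .kml extension to open it.
    print(make_placemark_kml("Sample site", -105.94, 35.69))
```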

    A limited version of Google Earth is available for mobile devices such as cell phones.

    Google Earth is free, but professional versions that include additional capabilities are available for purchase.

    Free


    Data and Metadata Management

    Google Fusion Tables is a free Google Labs application for managing, integrating, collaborating on, and visualizing data online. It allows uploading and sharing data, merging data from multiple tables into interesting derived tables, and seeing the most up-to-date data from all sources. The Google Fusion Tables Application Programming Interface (API) lets the user query the data, insert rows, update data, and delete rows. It also provides authorization functionality so that data in Google Fusion Tables can be made accessible to applications as well as individuals.

    Free


    Discovery Tools

    Google Groups supports the creation of discussion forums for virtual communities to share information via the internet. It is a free mailing list service and can provide open access. Groups can be open or closed. Users can be anonymous. Posts can be made through the web browser or by sending email. It also provides a variety of group management functions. Google Groups archives past posts/emails.

    Free


    Exploration, Visualization, and Analysis

    Google Maps is a web-based application for creating web-enabled maps. Google Maps includes a suite of supporting products including:


    • Google Map API, which provides for embedding maps in web pages
    • Google Mobile, which allows maps to run on mobile devices
    • Google Transit, which provides information for public transportation routes

    The Google Maps product itself is a website that allows users to map locations, obtain directions, and view georeferenced images, satellite imagery, roads, and associated items such as traffic conditions. You can also overlay a wide variety of georeferenced data layers on top of Google Maps, such as mapped sites, coverages, and photographs. A simple visualization can be accomplished by attaching a web-accessible KML file to the Google Maps URL with the "q=http://web.accessible/file.kml" construct.
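
    As a small sketch of the "q=" construct, the Python below appends a percent-encoded KML address to a Google Maps URL. The base address https://maps.google.com/maps and the function name kml_overlay_url are assumptions (the text above only refers to "the Google Maps URL"), and whether the service accepts a given KML link depends on Google Maps itself.

```python
from urllib.parse import urlencode

# Assumed base address; the description only says "the Google Maps URL".
MAPS_BASE_URL = "https://maps.google.com/maps"


def kml_overlay_url(kml_url: str) -> str:
    """Build a link that passes a web-accessible KML file in the q parameter."""
    # urlencode percent-encodes the KML address so it survives as one parameter.
    return MAPS_BASE_URL + "?" + urlencode({"q": kml_url})


if __name__ == "__main__":
    print(kml_overlay_url("http://web.accessible/file.kml"))
```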

    Google Maps is a different product than Google Earth, which is a stand-alone application that users run from their desktop. KML files which are prepared with or for Google Earth do not always "behave" the same way in Google Maps.

    Google Maps requires that JavaScript be enabled in any browser that uses this product and runs on a variety of platforms and browsers.

    There are many sources of example code for using the Google Maps API, making this a very approachable method for creating web pages with included maps. A license "key" is required for your web server.

    Free


    Exploration, Visualization, and Analysis

    Google Public Data Explorer provides an interface for exploring, visualizing and sharing large datasets. Its interactive visualization tools enable changes to be tracked over time. Although primarily intended to enable non-specialists to interrogate public datasets, users can also upload their own data after describing them in Dataset Publishing Language (DSPL). The tool is in beta form.

    Free


    Data and Metadata Management

    Google Sites is a software technology created by Google that enables you to quickly create a collaborative website. Multiple people can work together on a Google Site to add file attachments and new, free-form pages and content. Google Sites uses an editor for creating content that is very much like editing a document. Creators of a Google Site have control over who has access (via Google accounts), or a Google Site can be published so that it is accessible to the public.

    Google Sites are hosted by Google, so you do not need a server or specific information technology (IT) expertise. You do not need to know how to code HTML, but there is still a lot of flexibility for you to control the look, feel, and content of your site. Other supported features include uploading files and attachments. Google Sites is also integrated with other Google products, so you can insert videos from YouTube; documents, spreadsheets, and presentations from Google Docs; images from Picasa; and calendars from Google Calendar. You can also search across Google Sites pages and content using Google search.

    Free


    Discovery Tools

    GRASS (Geographic Resources Analysis Support System) is an Open Source Geographic Information System (GIS) that supports 2D and 3D raster (gridded) and vector (point/line/polygon) data processing, analysis, and modeling. Through its use of several Open Source geospatial libraries (GDAL, OGR, PROJ4), GRASS supports dozens of raster and vector data formats for import and export, and may also connect to external GeoDatabases, depending upon the database drivers that are installed on a particular system. GRASS includes both a Graphical User Interface (GUI) and a command line mode for interaction with the system. The command line capabilities of GRASS are also accessible through a variety of scripting languages (e.g. Python, shell scripting) for automating geo-processing for repeated analyses and automated visualization or data processing. As an Open Source GIS platform, a variety of other tools (e.g. the R statistical programming language, QGIS desktop mapping application) can seamlessly access and interact with GRASS data.

    Free


    Exploration, Visualization, and Analysis

    HDFView is a visual tool for browsing, viewing, managing and editing HDF4 (Hierarchical Data Format) and HDF5 binary data files. HDF files are designed to contain large amounts of numerical or other data.

    The tool allows you to view the hierarchical file structure, create and edit new files, groups, datasets, dataset contents, and attributes of the data.

    Free


    Exploration, Visualization, and Analysis

    HUBzero is an open source platform that allows the creation of active web sites that support scientific collaboration and educational activities. It supports creating a group, inviting other users to join it, and delegating various group management roles. Researchers can upload files, tools, presentations, data, etc. It provides wiki and blog services. It supports social networking features such as content tagging, ratings, comments, citations, etc. It has news and event calendaring features. Software tools can be enabled to run interactively within the web browser. Many tools with a graphical user interface can be uploaded, installed and deployed with a small amount of work, and tools without a graphical interface can be adapted by using HUBzero's associated Rappture toolkit. HUBzero can also provide a variety of usage metrics.

    Free


    Scientific Workflows

    Hydrant is a web-based scientific workflow application that is designed to interact with the open source scientific workflow tool Kepler, enabling efficient, user-friendly scientific workflow processing. Hydrant allows scientists to: discover, view and load Kepler workflows; view and edit properties of Kepler Actors; execute workflows; and share workflows and results.

    Free


    Exploration, Visualization, and Analysis

    HydroDesktop is a free and open source desktop application developed in C# .NET that serves as a client for CUAHSI HIS WaterOneFlow web services data and includes data discovery, download, visualization, editing, and integration with other analysis and modeling tools.

    HydroDesktop is intended to solve the problem of how to obtain, organize, and manage hydrologic data on a user’s computer to support analysis and modeling. HydroDesktop is a platform for the integration of hydrologic data, which can be used in analysis applications such as R, MATLAB, and Excel, or in custom code developed by the end user. The HydroDesktop design paradigm includes the use of a plug-in architecture and data abstraction layer that will allow extension of the core functionality. HydroDesktop provides local access to data obtained from distributed data services that are part of the Internet-based service-oriented architecture (SOA) that the CUAHSI HIS project has developed for the sharing of hydrologic data.

    HydroDesktop is designed to be useful for a number of different groups of users with a wide variety of needs and skill levels including: university faculty, graduate and undergraduate students, K-12 students, engineering and scientific consultants, and others. HydroDesktop is for users primarily interested in discovering and retrieving observational data from the HIS system for use within HydroDesktop itself, or in other analytical and modeling applications installed on their local computer.

    Free


    Data and Metadata Management

    IBM's DB2 is a comprehensive relational database management system (RDBMS). Application versions are available for both desktops and servers and run on a variety of platforms. Unsupported open source versions are available.

    Cost-basis


    Data and Metadata Management

    IBM InfoSphere Data Architect is an enterprise data modeling application built on the Eclipse Integrated Development Environment (IDE) platform. Data Architect enables information designers to create both logical and physical data model diagrams, which can be used to describe a variety of applications and systems. For example, this tool can document a SQL database application, a complex website, a multi-server application platform, or a networked workflow process. Built into Data Architect are the technical specifications of a variety of popular IT platforms and services, which enables the designer to not only specify a data connection between two entities, but also save the technical requirements for making the connection functional. A good example of this feature is shown in the online Data Architect demo (see below for link) which shows a connection being made between an Oracle and an IBM DB2 database. The data model encapsulates all the information a systems engineer will need to actually build a connection between these two platforms. In addition, Data Architect provides a variety of output methods for its data models. For example a web designer can print out a site architecture report, while a database designer can actually output the SQL script necessary to build the database they just designed.

    Cost-basis


    Discovery Tools

    iMacros was designed to automate the most repetitious tasks on the web. With iMacros, you can quickly fill out web forms, remember passwords, create a webmail notifier, download information from other sites, scrape the Web (get data from multiple sites), and more. You can keep the macros on your computer for your own use, or share them with others by embedding them on your homepage, blog, company Intranet or any social bookmarking service.

    Web professionals can use iMacros for functional, performance, and regression testing of web applications. The built-in STOPWATCH command captures precise web page response times. iMacros also includes support for many AJAX elements.

    iMacros can be combined with other extensions such as Greasemonkey, Web Developer, Firebug, Stylish, Download Statusbar, NoScript, PDF Download, Foxmarks, Fasterfox, All-in-One Sidebar, Megaupload, Foxyproxy, Flashblock and Adblock.

    Free


    Exploration, Visualization, and Analysis

    ImageJ is an open-source, Java-based image processing and display tool. It can read and write images in GIF, JPEG, BMP, PNG, PGM, FITS, ASCII and TIFF formats. Editing capabilities include image enhancement (e.g., smoothing, sharpening, edge detection, median filtering and thresholding), image manipulation (e.g., crop, scale, resize, rotate and flip) and even analyses (e.g., area measurement, mean brightness, standard deviation, min and max brightness and measurement of lengths and angles). A "macro" feature allows automation of processing tasks and a library of previously-created macros is available. Similarly, a library of "plugins" allows additional capabilities to be added.
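
    The measurement and thresholding operations named above are simple to state precisely. As a rough illustration in plain Python (not ImageJ's own code or its macro language), the sketch below computes brightness statistics and a binary threshold for a grayscale image held as a nested list. The function names and pixel values are made up, and the choice of population standard deviation is an assumption.

```python
import statistics


def brightness_stats(pixels):
    """Summarize a grayscale image given as a list of rows of pixel values."""
    values = [v for row in pixels for v in row]
    return {
        "mean": statistics.fmean(values),
        # Population standard deviation; a tool may use the sample form instead.
        "stdev": statistics.pstdev(values),
        "min": min(values),
        "max": max(values),
    }


def threshold(pixels, level):
    """Return a binary image: 255 where a pixel is >= level, otherwise 0."""
    return [[255 if v >= level else 0 for v in row] for row in pixels]


if __name__ == "__main__":
    image = [[0, 50], [100, 250]]  # made-up 2x2 grayscale image
    print(brightness_stats(image))
    print(threshold(image, 100))
```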

    Free


    Data and Metadata Management

    ImageMagick is a collection of command-line tools for manipulating image data. It is capable of working with over 100 different image formats and has interfaces that allow it to be used from within programs as well. Capabilities include the ability to resize, flip, mirror, rotate, distort, shear and transform images. You can also adjust image colors, apply various special effects, or draw text, lines, polygons, ellipses and Bézier curves. It also can be used to mosaic or stack images. It has minimal image-display capabilities, but is a powerful tool for automating image processing tasks.

    Free


    Exploration, Visualization, and Analysis

    The IMSL Numerical Libraries provide a wide variety of mathematical and statistical algorithms written in various programming languages for easy incorporation by programmers. There are libraries for C, Fortran, Java, .NET, and Python (through wrappers). These algorithms are not only useful for desktop applications, but also can be applied to High Performance Computing (HPC) and High Throughput Computing (HTC). IMSL provides a comprehensive set of mathematical and statistical functions that programmers can include into the software applications they are developing. The statistical functions include time series, correlation, data mining, regression, neural networks and many more. The mathematical functions include matrix operations, linear algebra, nonlinear equations, optimization, genetic algorithms and many more.

    Cost-basis


    Exploration, Visualization, and Analysis

    Integrated Data Viewer (IDV) is a Java-based software framework for analyzing and visualizing geoscience data. IDV includes a software library and a reference application made from that software. It uses the VisAD (http://www.ssec.wisc.edu/~billh/visad.html) and netCDF-Java (http://www.unidata.ucar.edu/software/netcdf-java) libraries and other Java-based utility packages.

    The IDV "reference application" is a geoscience display and analysis software system with many of the standard data displays that other Unidata software (e.g. GEMPAK and McIDAS) provide. It brings together the ability to display and work with satellite imagery, gridded data (for example, numerical weather prediction model output), surface observations, balloon soundings, NWS WSR-88D Level II and Level III RADAR data, and NOAA National Profiler Network data, all within a unified interface. It also provides 3-D views of the earth system and allows users to interactively slice, dice, and probe the data, creating cross-sections, profiles, animations and value read-outs of multi-dimensional data sets. The IDV can display any Earth-located data if it is provided in a known format (see Data Sources).

    Free


    Exploration, Visualization, and Analysis

    Interactive Data Language (IDL) is a high-level language for data manipulation, visualization and analysis. IDL has strong signal and image processing capabilities and extensive math and statistical functions. There is extensive web support with hundreds of freely available applications from a large userbase. IDL includes mapping tools and direct access to standard databases. The IDL development environment requires minimal programming skills.

    Cost-basis


    Data Deposition, Citation, Curation and Preservation

    JHOVE2 is open source software for characterization of digital objects. Characterization captures the information about a digital object that describes that object's significant technical properties. For example, for a digital image file, JHOVE2 can identify the precise file format, as well as the salient technical properties of the file, such as resolution, bit-depth, and color-space. Capturing this information supports digital preservation analysis and decision making.

    JHOVE2 analyzes digital objects with these questions:

    • What is it? (Identification)
      Identification is the process of determining the format of a digital object on the basis of both internal (e.g. magic number) and external (e.g. file extension) information.
    • What about it? (Feature extraction)
      Feature extraction is the process of reporting the properties of a digital object which are significant to preservation planning and action.
    • What is it, really? (Validation)
      Validation is the process of determining the level of conformance of a digital object to the rules defined by the authoritative specification of the object's format.
    • So what? (Assessment)
      Assessment is the process of determining the level of acceptability of a digital object for a specific purpose on the basis of locally-defined policy rules.
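
    The identification step can be illustrated with a few lines of Python. This is a simplified sketch, not JHOVE2's implementation: it compares an internal signal (a file's leading "magic number" bytes) with an external one (the file extension) for a handful of formats. The function name identify, the lookup tables, and the returned field names are assumptions chosen for the example.

```python
import os

# A few well-known signatures ("magic numbers"); a real tool recognizes many more.
MAGIC_NUMBERS = [
    (b"%PDF-", "PDF"),
    (b"PK\x03\x04", "Zip"),
    (b"II*\x00", "TIFF"),  # little-endian TIFF
    (b"MM\x00*", "TIFF"),  # big-endian TIFF
]

EXTENSIONS = {".pdf": "PDF", ".zip": "Zip", ".tif": "TIFF", ".tiff": "TIFF"}


def identify(filename: str, first_bytes: bytes) -> dict:
    """Guess a format from leading bytes and from the file extension."""
    internal = next(
        (fmt for magic, fmt in MAGIC_NUMBERS if first_bytes.startswith(magic)),
        None,
    )
    external = EXTENSIONS.get(os.path.splitext(filename)[1].lower())
    # Disagreement (or no match at all) marks a file that needs a closer look.
    agree = internal is not None and internal == external
    return {"internal": internal, "external": external, "agree": agree}


if __name__ == "__main__":
    print(identify("report.pdf", b"%PDF-1.4\n"))
    print(identify("scan.tif", b"PK\x03\x04rest-of-file"))
```

    A mismatch such as a .tif name on Zip content is the kind of case where validation against the format specification, the next question in the list, becomes important.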

    JHOVE2 supports the validation and feature extraction of the following format families and specific format subtypes:

    • ICC color profile
    • JPEG 2000
      • JP2 (ISO/IEC 15444-1) and JPX (ISO/IEC 15444-2) profiles
    • PDF
      • PDF 1.0 - 1.7, ISO 32000-1, PDF/X-1 (ISO 15930-1), PDF/X-1a (ISO 15930-4), PDF/X-2 (ISO 15930-5), PDF/X-3 (ISO 15930-6), PDF/A-1 (ISO 19005-1)
    • SGML
    • Shapefile
    • TIFF
      • TIFF 4 - 6, Class B, F, G, P, R, and Y, TIFF/EP (ISO 12234-2), TIFF-FX, TIFF/IT (ISO 12639), Exif (JEITA CP-3451), GeoTIFF, Digital Negative (DNG), RFC 1314
    • UTF-8 encoded text
      • ASCII (ANSI X3.4)
    • WAVE audio
      • Broadcast Wave Format (EBU N22-1997)
    • XML
    • Zip

    JHOVE2 is run from the command line. There are also mechanisms for extending the number of supported file types.

    Free


    JMP
    Exploration, Visualization, and Analysis

    JMP is a desktop software package designed by SAS for dynamic data visualization and statistical data exploration. JMP includes an interactive graph builder that supports a wide variety of two- and three-dimensional graph types, and statistical reports are displayed along with plots for assessment and interpretation. Data can be loaded into JMP from common desktop file formats (e.g. text and spreadsheet files), as well as from a database or SAS server, and reports and visualizations can be exported in HTML, PDF and Adobe Flash formats for displaying and sharing results. JMP also integrates with the full SAS statistical software package to support more comprehensive analyses.

    Cost-basis


    Data and Metadata Management

    Joomla! is an open source content management system (CMS). Joomla! provides a structured website that enables users to create and edit various types of web content without requiring in depth technical knowledge of web authoring or programming languages. Novice users can create web pages and add basic text and graphics to them with only a minimal introduction to the system; they can also take advantage of Joomla! “components,” which are preformatted data templates that provide added features and functionality to web pages. Specific components available include Contacts, Weblinks and News Feeds. Intermediate and advanced users can link Joomla! sites to databases, publish digital audio and video clips, create forums, surveys and other types of Web 2.0 collaborative content.

    While Joomla! does not require technical skills to use once it is installed, it is a server application, so the initial implementation needs to be performed by a systems administrator with at least intermediate skills in server installation and management.

    Free


    Scientific Workflows

    Kepler is a scientific workflow application that enables scientists, engineers, analysts, and computer programmers to create, execute, and share models and analyses. Kepler is a Java-based application that can operate on data stored in a variety of formats, locally and over the internet, and is an effective environment for integrating disparate software components, such as merging "R" scripts with compiled "C" code, or facilitating remote, distributed execution of models. Using Kepler's graphical user interface, users select and then connect pertinent analytical components and data sources to create a "scientific workflow", an executable representation of the steps required to generate results. The Kepler software helps users share and reuse data, workflows, and components developed by the scientific community to address common needs.

    Free


    Exploration, Visualization, and Analysis

    Keynote is Apple's presentation software. Presentation software is primarily used for composing "slides" for presentations, with an emphasis on graphics and animation. Keynote files can be saved in PowerPoint format.

    Cost-basis


    Discovery Tools

    LifeDesks is intended for teams of taxonomists to facilitate easy and fast data sharing. Individuals or teams can create taxon-based checklists and taxon pages. Taxon pages may include descriptions, photographs, maps and bibliographies. The software will display the taxonomic hierarchy on the taxon page. There are options to tag objects with intellectual property rights using the Creative Commons licenses. The information can be readily shared with the Encyclopedia of Life to generate their species pages.

    Free


    Exploration, Visualization, and Analysis

    Many Eyes is a data visualization service that lets users upload their data to the website and then choose the types of visualizations they would like to generate. Choices include text analysis, comparisons, trend tracking, mapping, and relationships among data points.

    Many Eyes was developed by IBM's Collaborative User Experience (CUE) Research group in the Visual Communication Lab.

    Free


    Exploration, Visualization, and Analysis

    Maple is a software application for symbolic and numeric mathematical analysis, mathematical modeling and visualization. The software provides a comprehensive computer algebra system and an interactive graphical environment for editing and solving both symbolic and numeric mathematical equations and performing calculations. Equations can be entered and displayed using conventional symbolic notation, making this application ideal for educational settings and classroom exercises. A dynamically-typed, imperative-style programming language is also included for advanced analyses, and Maple can be interfaced with computer languages including C, Fortran, Java, MATLAB and Visual Basic.

    Operating Systems: Microsoft Windows, Apple Macintosh, Linux, Sun Solaris

    Cost-basis


    Exploration, Visualization, and Analysis

    MapServer is an open source tool for publishing spatial data and interactive mapping applications to the web. It allows users to create “geographic image maps”, that is, maps that can direct users to content. MapServer supports the Web Feature Service (WFS), which serves features and attributes to users, and the Web Map Service (WMS), which serves an image of how the symbolized layers (GIS datasets) look on the server for the requested extent. Users can work with MapServer from Python, Java, Perl, or .NET programming languages.
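
    A WMS image request is an ordinary URL whose parameters name the layers, the extent (bounding box), and the image size. The sketch below assembles such a request using standard WMS 1.1.1 GetMap parameter names; the function name wms_getmap_url, the server address, and the layer name are hypothetical placeholders, and a real MapServer deployment may need additional parameters such as a map file reference.

```python
from urllib.parse import urlencode


def wms_getmap_url(base_url, layers, bbox, width, height,
                   srs="EPSG:4326", image_format="image/png"):
    """Build a WMS 1.1.1 GetMap URL for the given layers and extent."""
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.1.1",
        "REQUEST": "GetMap",
        "LAYERS": ",".join(layers),
        "STYLES": "",  # empty value asks for each layer's default style
        "SRS": srs,
        "BBOX": ",".join(str(v) for v in bbox),  # minx,miny,maxx,maxy
        "WIDTH": width,
        "HEIGHT": height,
        "FORMAT": image_format,
    }
    # Some server addresses already carry a query string.
    separator = "&" if "?" in base_url else "?"
    return base_url + separator + urlencode(params)


if __name__ == "__main__":
    # Hypothetical server and layer, used only to show the shape of the URL.
    print(wms_getmap_url("http://example.org/cgi-bin/mapserv", ["rivers"],
                         (-109.05, 31.33, -103.0, 37.0), 600, 400))
```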

    Free


    Exploration, Visualization, and Analysis

    MapWindow GIS is a free desktop geographic information system (GIS) application with an extensible plug-in architecture. It also includes a GIS ActiveX control and a GIS application programmer’s interface (API) written entirely in C#, called DotSpatial, that can be used by software developers. MapWindow can be used as an open-source alternative to many commercial desktop GIS software packages, to distribute data to others who may not have a GIS, and to develop and distribute custom geospatial data analysis tools, or plug-ins.

    MapWindow includes standard GIS data visualization features as well as DBF attribute table editing, shapefile editing, and data converters. Many standard GIS data formats are supported including shapefiles, GeoTIFF, and ESRI ASCII and ArcGrid formats.

    MapWindow is free to use and redistribute to others. The source code is also available, and MapWindow can be modified to fit user needs. It can also be embedded in other software programs.

    Free


    Exploration, Visualization, and Analysis

    Mathematica is a computational platform used by scientists, engineers and mathematicians. Mathematica has support for equation solving, numerical analysis, as well as graphing and visualization. Mathematica has import and export filters for tabular data, images, video, sound, CAD, GIS documents and biomedical formats. There is support for data mining tools such as cluster analysis, sequence alignment and pattern matching as well as text mining support. The programming feature supports functional, procedural, and object oriented styles of programming.

    Cost-basis


    Exploration, Visualization, and Analysis

    MATLAB is an interactive data analysis and visualization environment that can be used to perform computationally-intense operations on large data sets efficiently. MATLAB also provides a high level programming language that supports rapid development of work-flow scripts and Graphical User Interface applications to automate repetitive tasks. A wide variety of discipline-specific software libraries, called toolboxes, are available from the publisher or user communities to extend the capabilities of the base program (e.g. statistics, curve fitting, image analysis and mapping). MATLAB programs can also leverage existing code written in Fortran, Java or other languages and source code is provided for most functions, allowing end-users to extend or customize routines for specialized analyses.

    Cost-basis


    Data and Metadata Management

    MATT is a Metadata AuThoring Tool that runs from within a web browser and guides you through the writing of metadata using pull-down lists and keywords. The tool has been written using a combination of XHTML and client-side JavaScript. It can be used over the internet, across a network, or offline. When you have finished describing your data, your entries are converted into a machine-readable XML format. This XML can then be submitted for incorporation into a metadata directory and made available for others to view. All metadata produced by the tool can be straightforwardly transformed into the internationally recognized DIF metadata format (currently adopted by GLOBEC and IOC). By default, however, MATT produces metadata in a format (MATT XML) that can be submitted into the SADC Regional Metadata Directory.

    Free


    Discovery Tools

    MDNR (Minnesota Department of Natural Resources) Garmin GPS is an ArcView extension built to provide Garmin handheld GPS receiver users with the ability to directly transfer data between GPS receivers and various GIS software packages. A user can use point features (graphics or shapefile) and upload them to the GPS as waypoints. Line and Polygon Graphics or shapes can be uploaded to the GPS as Track Logs or Routes. Conversely, Waypoints, Track Logs, and Routes collected using the GPS can be transferred directly to ArcView/ArcMap/Google Earth/Landview and saved as Graphics or Shapefiles.

    Free


    Data and Metadata Management

    MediaWiki is a free web-based software application written in PHP with a backend database. It was developed by the Wikimedia Foundation and is the wiki software used for Wikipedia; it also runs projects such as Wikispecies and Wikimedia Commons. There are numerous extensions available for adding capabilities to MediaWiki (see http://www.mediawiki.org/wiki/Category:Extensions).

    Free


    Data and Metadata Management

    Mendeley is a bibliographic management tool. An optional local client, Mendeley Desktop, synchronizes with a standard Web-accessible interface. Users can enter new references, generate references from hard drive directories, or find references that others have already entered. References can be tagged as required. The emphasis is on social collaboration and sharing. Users can associate with each other, and groups can be formed to develop shared collections. A browser plug-in allows capture of reference information from just about any web page, and Mendeley has an API and plug-ins for word processors such as Open Office. PDFs can be uploaded from your local system, and citations can be generated and lists exported. Many online journals have a one-click function to add a reference to your Mendeley library. There are limits on how many groups can be managed in the free version.

    Free


    Data and Metadata Management

    Mercurial is a free, distributed source control management tool and is used for version control of files. Mercurial is distributed, giving each developer a local copy of the entire development history.

    • It works independently of network access or a central server.
    • Committing, branching and merging are fast and cheap.
    • You can generate diffs between revisions or jump back in time within seconds, which makes Mercurial suitable for large projects.
    • Mercurial is platform independent. Most of Mercurial is written in Python, with a small part in portable C for performance reasons.
    • The functionality of Mercurial can be expanded with extensions, which can change the workings of the basic commands, add new commands and access all the core functions of Mercurial.
    • The basic interface is easy to use, easy to learn and hard to break.
    Free


    Data and Metadata Management

    Mercury is a web-based system to search for metadata and retrieve associated data sets. The Mercury Metadata Editor creates a subset of standard FGDC metadata, along with data documentation that can be used in the Mercury search tool. The Mercury Metadata Editor:

    • Allows users to enter contextual metadata that is specifically designed for use in the Mercury search tool.
    • Creates a standard format for metadata (XML)
    • Has picklists for standard terms

    Using the on-line Mercury Metadata Editor tool to create and edit contextual metadata requires no programming expertise.
    Free


    Discovery Tools

    Mercury is a web-based system to search for metadata and retrieve associated data. Mercury provides a single portal to information contained in disparate data management systems. It collects metadata and key data from contributing project servers distributed around the world and builds a centralized index. The Mercury search interfaces then allow the users to perform simple, fielded, spatial and temporal searches across these metadata sources. Mercury supports various metadata standards including XML, Z39.50, FGDC, Dublin-Core, Darwin-Core, EML, and ISO-19115.

    Free


    Data and Metadata Management

    The Metadata Enterprise Resource Management Aid (MERMAid) is a tool to develop, validate, manage and publish metadata records via secure internet access. It allows users and data providers to establish unlimited metadata databases to organize their metadata records as they choose (i.e. by program, project, data type, personnel). Some of the key features in MERMAid include (1) user-defined roles and permissions at the metadata management and database levels; (2) change tracking; and (3) enhanced validation. Also, your existing FGDC compliant metadata (in XML format) can be ingested into and managed through MERMAid. MERMAid was developed by the National Coastal Data Development Center (NCDDC).

    Free


    Data and Metadata Management

    Merritt is a repository service and curation environment for storage and preservation of digital objects, provided by the California Digital Library. Merritt can be used to manage, archive, and share content. It can provide significant features for a digital object:

    • permanent storage
    • access via persistent URLs
    • tools for long term management
    • easy-to-use interface for deposits and updates

    Merritt is built upon micro-services, which are independent yet interoperable sets of functions that together form the technical infrastructure of a digital preservation repository. These micro-services are meant to be small and modular, and easy to develop, deploy, maintain, and, if needed, replace. The complex function of a digital repository emerges from the interaction of the micro-services. More background, and specifications of individual micro-services, can be found at the Curation wiki:
    https://confluence.ucop.edu/display/Curation/Home

    The Merritt Repository Service is provided by the University of California Curation Center (UC3), part of the California Digital Library. The primary audience for the service is the students, staff, and faculty members of the University of California (UC), but it is also available to researchers and organizations outside of UC. UC researchers are using Merritt to fulfill data management and sharing requirements for NSF and NIH grants.

    Cost-basis


    Exploration, Visualization, and Analysis

    Mesquite is a modular software system for evolutionary analysis, designed to help biologists analyze comparative data about organisms. Although its emphasis is on phylogenetic analysis, some of its modules concern population genetics, while others do non-phylogenetic multivariate analysis. The analyses include:

    • Reconstruction of ancestral states (parsimony, likelihood)
    • Tests of process of character evolution, including correlation
    • Analysis of speciation and extinction rates
    • Simulation of character evolution (categorical, DNA, continuous)
    • Parametric bootstrapping (integration with programs such as PAUP* and NONA)
    • Morphometrics (PCA, CVA, geometric morphometrics)
    • Coalescence (simulations, other calculations)
    • Tree comparisons and simulations (tree similarity, Markov speciation models)

    Mesquite is not primarily designed to infer phylogenetic trees, but rather for diverse analyses using already inferred trees.

    Free


    Data and Metadata Management

    Metacat is a flexible, open source metadata catalog and data repository that targets scientific data, particularly from ecology and environmental science. Metacat accepts XML as a common syntax for representing the large number of metadata content standards that are relevant to ecology and other sciences. Thus, Metacat is a generic XML database that allows storage, query, and retrieval of arbitrary XML documents without prior knowledge of the XML schema.

    Metacat is designed and implemented as a Java servlet application that utilizes a relational database management system to store XML and associated meta-level information. Installation of Metacat recommends the use of Apache Tomcat for servlet management and PostgreSQL as the underlying RDBMS, although other configurations are possible. Metacat provides a rich client Application Programming Interface (API) and supports a variety of languages, including Java, Python, and Perl.

    Metacat is being used extensively throughout the world to manage environmental data. It is a key infrastructure component for the NCEAS data catalog, the Knowledge Network for Biocomplexity (KNB) data catalog, and for the DataONE system, among others.

    Free


    Data and Metadata Management

    Metadoor is a metadata entry tool whose output conforms to the Content Standards for Digital Geospatial Metadata devised by the Federal Geographic Data Committee (FGDC). The developer stopped active development in 2007, but the code is well documented, open source, and available for download.

    Metadoor uses pick lists to ease the metadata entry process. It also has collapsible form fields, providing the user a way to segment the documentation task without scrolling or shuffling through forms. Since it is not a desktop tool, it has no platform or additional software requirements. The metadata editor allows the user to download the final output as XML.

    Free


    Exploration, Visualization, and Analysis

    MetaMorph is an industry standard image analysis suite and capture platform.

    The MetaMorph software suite supports a wide array of microscopes, cameras, and precision stages used in bioresearch. The software provides acquisition, processing, and analysis features that allow researchers to build custom imaging systems for solving experimental problems in cellular imaging.

    Cost-basis


    Data and Metadata Management

    Metavist is a software tool for the metadata archivist, and is used to create metadata compliant with the Content Standards for Digital Geospatial Metadata devised by the Federal Geographic Data Committee (FGDC).

    Free


    Data and Metadata Management

    Microsoft Access is a desktop database package that includes a relational database engine, database query tools, a report building module, and a forms builder. For intermediate and advanced users it also includes the Microsoft Visual Basic for Applications platform and an IDE (Integrated Development Environment). Microsoft Access can serve as a standalone database tool, or it can be connected to other database server platforms and be used as a database server client.

    Some programming/database skills may be required in order to perform complex queries.

    Cost-basis


    Data and Metadata Management

    Microsoft Excel is a software package, included in the Microsoft Office Suite, that enables the creation of spreadsheets or forms, provides simple data comparison and analysis tools, and creates graphs. Data are captured in workbooks, which can be composed of a single or several sheets. Simple sort and filtering tools allow data to be queried. QA/QC can be performed using built-in tools that can find values and replace them with other values, remove duplicates, find missing values, characterize column data types, etc. Built-in or user-defined formulas can be used for calculations or transformations. Excel can also utilize Visual Basic for Applications (VBA) or .NET framework programming. Excel can also be used to create tables and visualizations. Other objects, such as photos and other images, text boxes, and clip art can be inserted into a spreadsheet.

    Cost-basis


    Exploration, Visualization, and Analysis

    PowerPoint is commercial presentation software that is part of the Microsoft Office suite. PowerPoint presentations consist of a number of individual pages or "slides". Slides may contain text, graphics, sound, animation, movies, and other objects. The presentation can be printed (with or without speaker notes), displayed or projected live by computer, or exported as an animated movie. Slides can also form the basis of webcasts.

    In order to view PowerPoint files, the user either must have PowerPoint installed or have the free PowerPoint Viewer installed.

    Cost-basis


    Exploration, Visualization, and Analysis

    Microsoft SQL Server Analysis Services (SSAS) is part of Microsoft SQL Server, which is a relational database management system (RDBMS). SSAS contains online analytical processing (OLAP) and data mining functionality for business intelligence applications. Many of the business intelligence and data mining functions within SSAS are applicable to environmental datasets. Mining historical data using SSAS can provide new insights and form a basis for forecasting, and may be particularly interesting for analysis of environmental time series data.

    SSAS supports OLAP by assisting you in designing, creating, and managing multidimensional structures that contain data aggregated from other sources, such as relational databases. OLAP data cubes constructed using SSAS can be accessed using Microsoft Excel, which can be a powerful way to present data to potential users. In data mining applications, SSAS can assist you in designing, creating, and visualizing data mining models that are constructed from data sources using a wide variety of industry-standard data mining algorithms.

    Cost-basis


    Data and Metadata Management

    A diagramming program with support for a wide variety of modeling languages, such as Unified Modeling Language (UML), Deployment Diagrams, Network Diagrams and many others. It is also extensible to accommodate custom diagrams. Common uses might include Entity Relationship Diagrams for databases, Class Models for object oriented languages and the creation of workflows for documenting business processes.

    Diagrams and models can be output in a variety of formats and are generally of high quality. There is support for integrating with Microsoft products and programming languages; for example, Entity Relationship Models can directly create Microsoft SQL Server databases, and Class Diagrams can create classes in C#. Support can be extended to other languages such as Java or C++, but they do not enjoy the same level of support.

    Cost-basis


    Exploration, Visualization, and Analysis

    Minitab 16 is commercial software for data analysis, graphing, and statistics. It is interactive and menu-driven, and users are guided through the data analysis process by "assistant" dialog boxes. The software can be used to run basic statistics including parametric regression and analysis of variance, survival analysis, and a limited number of multivariate analyses. Users can also graph data and statistical models, analyze experimental designs and do power analysis, and store and manipulate data.

    Minitab 16 is marketed to commercial businesses, although the vendor also offers unspecified discounts for "qualified academic users."

    Cost-basis


    Discovery Tools

    The MODIS Land Product Subsets tools provide summaries of selected MODIS Land Products for the community to use for validation of models and remote-sensing products and to characterize field sites. Users have several options to select the area of interest. The tools deliver spatial subsets of MODIS Land Data Products for small areas (<200 x 200 km) from 2000 to present. Output files contain pixel values of MODIS land products in text format and in GeoTIFF format. In addition, data visualizations (time series plots and grids showing single composite periods) are available.

    Free


    Data and Metadata Management

    Morpho is a program that can be used to enter metadata, which are then stored in a file that conforms to the Ecological Metadata Language (EML) specification. Information about people, sites, research methods, and data attributes are among the metadata collected. Data can be stored with the metadata in the same file.
    Morpho allows the user to create a local catalog of data and metadata that can be queried, edited and viewed. Morpho also interfaces with the Knowledge Network for Biocomplexity (KNB) Metacat server, which allows scientists to upload, download, store, query and view public metadata and data.

    Free


    Data and Metadata Management

    mp (metadata parser) is a compiler to parse formal metadata, checking the syntax against the FGDC Content Standard for Digital Geospatial Metadata and generating output suitable for viewing with a web browser or text editor. mp generates a textual report indicating errors in the metadata, primarily in the structure but also in the values of some of the scalar elements (that is, those whose values are restricted by the standard). mp can read XML or indented text (plain text conforming to a strict FGDC-style format). As a standalone tool, it can be operated at the DOS Shell prompt.

    Free


    Exploration, Visualization, and Analysis

    MrBayes is a program for doing Bayesian phylogenetic analysis including phylogenetic reconstruction. Bayesian inference of phylogeny is based on the posterior probability distribution of trees, which is the probability of a tree conditioned on the observations. The conditioning is accomplished using Bayes's theorem. The posterior probability distribution of trees is often impossible to calculate analytically; instead, MrBayes uses a simulation technique called Markov Chain Monte Carlo (MCMC) to approximate the posterior probabilities of trees.
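    The idea of approximating a posterior distribution by simulation can be illustrated on a much simpler problem than phylogeny. The sketch below is not MrBayes's algorithm (which samples tree topologies and model parameters with far more elaborate proposals); it is a toy random-walk Metropolis sampler, written under the assumption of a coin with unknown bias and a uniform prior, chosen because the exact posterior mean is known and can be compared with the MCMC estimate. All function names are illustrative.

```python
import math
import random

def log_posterior(p, heads, tails):
    """Unnormalized log posterior of a coin's bias p under a uniform prior."""
    if p <= 0.0 or p >= 1.0:
        return float("-inf")  # zero posterior probability outside (0, 1)
    return heads * math.log(p) + tails * math.log(1.0 - p)

def metropolis(heads, tails, n_steps=20000, step=0.1, seed=1):
    """Draw samples of p with a random-walk Metropolis sampler."""
    rng = random.Random(seed)
    current = 0.5
    current_lp = log_posterior(current, heads, tails)
    samples = []
    for _ in range(n_steps):
        proposal = current + rng.gauss(0.0, step)  # symmetric proposal
        proposal_lp = log_posterior(proposal, heads, tails)
        # Accept with probability min(1, posterior ratio).
        accept_prob = math.exp(min(0.0, proposal_lp - current_lp))
        if rng.random() < accept_prob:
            current, current_lp = proposal, proposal_lp
        samples.append(current)
    return samples[n_steps // 10:]  # discard the first 10% as burn-in

samples = metropolis(heads=7, tails=3)
estimate = sum(samples) / len(samples)
exact = (7 + 1) / (7 + 3 + 2)  # mean of the exact Beta(8, 4) posterior
print(f"MCMC estimate {estimate:.3f}, exact posterior mean {exact:.3f}")
```

    The sampler only ever needs the ratio of posterior values, so the normalizing constant that makes the posterior hard to calculate analytically never has to be computed.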

    Free


    Scientific Workflows

    myExperiment is a collaborative environment where scientists can safely publish their workflows and experiment plans, share them with groups and find those of others. It allows workflows, other digital objects and bundles (called Packs) to be swapped, sorted and searched like photos and videos on the Web. myExperiment makes it easy for the next generation of scientists to contribute to a pool of scientific methods, build communities and form relationships - reducing time-to-experiment, sharing expertise and avoiding reinvention.

    Free


    Data and Metadata Management

    MySQL is a relational database system (RDBMS) that runs as a server providing multi-user access to several databases. All major programming languages include libraries to access MySQL. It comes with a command line tool, with many third party graphical user interfaces also available. Although not considered a full enterprise RDBMS like Oracle or PostgreSQL, it supports most RDBMS technologies like foreign keys, triggers, views, indexing, and backup. MySQL is easy to use, sufficient for most environmental data management applications and the choice of many web server hosting companies.
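    The relational features named above (foreign keys, views, indexing) can be sketched in a few lines. To keep the example runnable without a database server, it uses Python's built-in sqlite3 module as a stand-in for MySQL; MySQL's data types and some syntax differ in detail, and the table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE site (
        site_id INTEGER PRIMARY KEY,
        name    TEXT NOT NULL
    );
    CREATE TABLE observation (
        obs_id  INTEGER PRIMARY KEY,
        site_id INTEGER NOT NULL REFERENCES site(site_id),
        temp_c  REAL
    );
    CREATE INDEX idx_observation_site ON observation(site_id);
    CREATE VIEW site_mean_temp AS
        SELECT s.name AS name, AVG(o.temp_c) AS mean_temp_c
        FROM site s JOIN observation o ON o.site_id = s.site_id
        GROUP BY s.site_id;
""")

conn.execute("INSERT INTO site VALUES (1, 'Upper Creek')")
conn.executemany(
    "INSERT INTO observation (site_id, temp_c) VALUES (?, ?)",
    [(1, 11.5), (1, 12.5)],
)
rows = conn.execute("SELECT name, mean_temp_c FROM site_mean_temp").fetchall()
print(rows)

# The foreign key rejects an observation that points at a missing site.
try:
    conn.execute("INSERT INTO observation (site_id, temp_c) VALUES (99, 5.0)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print("orphan observation rejected:", rejected)
```

    The foreign key keeps observations tied to a known site, the index speeds up lookups by site, and the view gives users a ready-made summary without duplicating data.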

    MySQL is included in several third party installation bundles, providing easy installation and configuration for the most popular combinations of programs for web development. Examples include LAMP and WAMP, which combine the Apache web server, MySQL, and PHP for Linux and Windows, respectively.

    Free


    Data and Metadata Management

    MySQL Workbench is an open source, visual based tool for MySQL database design, creation and administration. It is separate from, but connects to MySQL, which is database software built on a version of Structured Query Language or SQL (see MySQL tool description in the DataONEpedia).

    SQL Development
    • Create and manage connections to database servers.
    • Enables the user to configure connection parameters.
    • Capability to execute SQL queries on the database connections using the built-in SQL Editor.

    Data Modeling
    • Enables you to create models of your database schema graphically.
    • Forward engineer, or turn diagrams into MySQL databases.
    • Reverse engineer, or download existing MySQL databases and represent them as diagrams.
    • Edit Tables, Columns, Indexes, Triggers, Partitioning, Options, Inserts and Privileges, Routines and Views.

    Database Administration
    • Enables you to create and administer server instances.
    • Manage users and user permissions.
    Free


    Data and Metadata Management

    National Instruments LabVIEW is a sophisticated application for the creation and management of engineering and scientific measurement, test, data collection and control systems. LabVIEW includes a graphical user interface that allows external hardware devices such as mechanical or electronic sensors to be configured and operated using "point-and-click" methods. Networks of sensors and processing devices can be joined together using flowchart-like "wire" connectors. Both physical and virtual (software-based) devices are supported. Complex processes, virtual devices and workflows can be developed using the LabVIEW programming language instead of the GUI. In addition to device management, LabVIEW provides an extensive set of on-board analysis libraries that enable data feeds to be aggregated, evaluated and manipulated. Raw and processed data can then be routed to a remote database platform or other repository for storage. LabVIEW also provides numerous plug-ins that enable both live and stored data to be visualized in various types of charts, graphs and tables, and includes a technical reporting module that allows data output to be formatted for print or online distribution.

    Cost-basis


    Data and Metadata Management

    ncISO is a package of tools that facilitates the generation of ISO 19115-2 metadata from data in NetCDF (Network Common Data Form).

    These data must be included in a Thematic Realtime Environmental Distributed Data Service (THREDDS) data server catalog. ncISO is based on the Unidata Attribute Convention for Data Discovery (http://www.unidata.ucar.edu/software/netcdf-java/formats/DataDiscoveryAt...).

    There are currently two tools available: a command line utility that can be run on your local desktop or workstation, and a THREDDS server extension library. This utility is also included in the THREDDS Data Server.

    Inputs to this tool are THREDDS catalogs.

    Outputs from the tool are ISO metadata in XML format.

    Free


    Exploration, Visualization, and Analysis

    NodeXL is a free, open-source template for Excel 2007 and 2010 that lets you enter a network edge list, click a button, and see the network graph, all in the Excel window.

    You can customize the graph’s appearance; zoom, scale and pan the graph; dynamically filter vertices and edges; alter the graph’s layout; find clusters of related vertices; and calculate graph metrics. Networks can be imported from and exported to a variety of file formats, and there are built-in connections for getting networks from Twitter, Flickr, and YouTube.
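    To make "edge list" and "graph metrics" concrete, the sketch below computes two simple metrics, vertex degree and graph density, from a small edge list. It does not use NodeXL or reproduce its calculations; the vertex names are made up, and the graph is assumed to be undirected with no repeated edges.

```python
from collections import defaultdict

# Each pair is one edge between two vertices (hypothetical data).
edges = [("ann", "bob"), ("bob", "cat"), ("cat", "ann"), ("cat", "dan")]

def degrees(edge_list):
    """Count how many edges touch each vertex."""
    deg = defaultdict(int)
    for u, v in edge_list:
        deg[u] += 1
        deg[v] += 1
    return dict(deg)

def density(edge_list):
    """Fraction of all possible undirected edges that are present."""
    vertices = {v for edge in edge_list for v in edge}
    n = len(vertices)
    if n < 2:
        return 0.0
    return 2 * len(edge_list) / (n * (n - 1))

vertex_degrees = degrees(edges)
graph_density = density(edges)
print(vertex_degrees)
print(round(graph_density, 3))
```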

    Free


    Data and Metadata Management

    The NPS Metadata Tools & Editor is a custom software application developed for the National Park Service (NPS) for authoring and editing NPS metadata. It extends the basic functionality of ESRI's ArcCatalog for managing geospatial metadata and also provides a stand-alone version for creating and manipulating non-spatial metadata outside of ArcCatalog. While the tool was not specifically designed for users external to the NPS, the Tools and Editor features powerful metadata editing capabilities that a metadata author is likely to find useful.

    Free


    Discovery Tools

    OAIster is a freely accessible search engine for open access web resources, available from OCLC. OAIster uses the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) to harvest records from websites. OAIster contains over 25 million records from all disciplines and subjects contributed by over 1,000 libraries, archives, and repositories. The records harvested by OAIster use the unqualified Dublin Core metadata format. Repository managers running OAI-compliant repositories can contribute records for open access web resources. OAIster records can be searched separately at the website, and are incorporated into OCLC WorldCat.
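    A harvest of the kind described above boils down to an HTTP request with a few standard OAI-PMH parameters, answered with XML that wraps Dublin Core records. The sketch below builds a ListRecords request URL and pulls titles out of a response. The verb, the oai_dc metadata prefix, and the XML namespaces come from the OAI-PMH and Dublin Core specifications; the repository URL and the sample response are invented for illustration, and this is not OAIster's own harvesting code.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

DC_NS = "{http://purl.org/dc/elements/1.1/}"

def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec  # optional: restrict the harvest to one set
    return base_url + "?" + urlencode(params)

def extract_titles(xml_text):
    """Return every Dublin Core title found in an OAI-PMH response."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.iter(DC_NS + "title")]

# Hypothetical, heavily trimmed response with a single record.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Stream temperature, 2009</dc:title>
          <dc:creator>Example, A.</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""

request_url = list_records_url("http://example.org/oai")  # hypothetical repository
titles = extract_titles(SAMPLE_RESPONSE)
print(request_url)
print(titles)
```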

    Free


    Exploration, Visualization, and Analysis

    Ocean Data View (ODV) is desktop software for analysis and visualization of oceanographic, atmospheric and other geo-referenced profile or time-series data.

    Basic features:

    • Input format is basic spreadsheet-styled data tables
    • Users can customize their configurations with high resolution bathymetry, coastlines, and other reference material
    • Data and configuration files are platform-independent and can be exchanged between different systems

    ODV is particularly useful for:

    • Plots of properties at selected stations,
    • Cross-sections along cruise tracks, and
    • Color distributions on general isosurfaces

    ODV was developed by the Alfred Wegener Institute (http://www.awi.de), under the SeaDataNet program. There are licensing restrictions for uses other than scientific research.

    Free


    Exploration, Visualization, and Analysis

    GNU Octave is a high-level language, primarily intended for numerical computations. It provides a command line interface for solving linear and nonlinear problems numerically, and for performing other numerical experiments using a language that is mostly compatible with MATLAB. It may also be used as a batch-oriented language.

    Octave has extensive tools for solving common numerical linear algebra problems, finding the roots of nonlinear equations, integrating ordinary functions, manipulating polynomials, and integrating ordinary differential and differential-algebraic equations. It is easily extensible and customizable via user-defined functions written in Octave's own language, or using dynamically loaded modules written in C++, C, Fortran, or other languages.

    Free


    Exploration, Visualization, and Analysis

    OpenBUGS is software for running Markov Chain Monte Carlo (MCMC) simulations following Bayesian statistical theory. It is one of two software packages created for Bayesian Inference Using Gibbs Sampling, or BUGS. OpenBUGS is so named because it runs on multiple operating systems; the WinBUGS software can be used with Windows operating systems (see WinBUGS tool in the DataONEpedia for details).

    Bayesian inference is built on specified probabilities of models and evaluated using MCMC simulation including error components. OpenBUGS implements these simulations and "samples" them according to user-defined criteria. OpenBUGS can be used as a stand-alone application but can also be integrated with R statistical software.

    OpenBUGS requires thorough knowledge of Bayesian statistics to create and evaluate models appropriately.

    Free


    Exploration, Visualization, and Analysis

    OpenLayers is an open-source JavaScript library that provides an application programming interface (API) for incorporating maps and geospatial data within web pages. OpenLayers has no server-side dependencies and works with most modern web browsers. It offers basic panning and zooming functionality for data exploration and discovery in a "slippy-map" format similar to Google Maps. It can display geospatial data from many sources including web map services (WMS), web feature services (WFS), Google Maps, and other proprietary and open-source map servers such as GeoServer and MapServer.

    Free


    Modeling

    OpenMI provides users with a standard interface that allows the construction of modeling workflows. OpenMI allows models to exchange data with each other and other modeling tools as they run, facilitating the modeling of process interactions. Models may come from many different sources, represent processes from different scientific domains, have different spatial and temporal resolutions, and have different spatial domains/representations. The OpenMI standard is defined by a set of software interfaces that a compliant model must implement. These interfaces enable models to communicate with each other, with the possibility of two-way links between models where the involved models mutually depend on calculation results from each other. The OpenMI interfaces are available in both C# and Java. Models may run asynchronously with respect to timesteps.

    As the OpenMI standard is a software component interface definition for the computational core (the engine) of the computational models, model components that comply with the OpenMI standards can, without any programming, be configured to exchange data during computation. Once developed, OpenMI models can be reused in many different applications and configurations. Most existing applications of OpenMI, and subsequently most of the available OpenMI compliant models, have been developed within the water resources domain.

    Free


    Data and Metadata Management

    OpenOffice Base is an open source desktop relational database application that is part of the OpenOffice suite of productivity tools offered free on the web by Oracle Corporation. OpenOffice Base is SQL compliant and provides a core database platform, a set of query tools, a report building module, and a forms editor. Each major component in OpenOffice Base provides both novice and expert modes (for example, new users can build a database table using a wizard, and advanced users can jump directly to an advanced table editor). OpenOffice Base also includes macro building tools, and the ability to connect to external database servers (including Access, MySQL, PostgreSQL, and other ODBC sources). Base does not allow for multi-user access or database sharing capabilities.

    Free


    Data and Metadata Management

    The Oracle Database is a proprietary relational database management system (RDBMS). There are various editions available depending on technical requirements. All editions are built using the same common code base which can scale from small, single-processor servers to clusters of multi-processor servers without changing the code. Oracle runs on various operating systems including: Apple Mac OS X Server, HP UNIX, HP OpenVMS, IBM AIX, IBM z/OS, Linux, Microsoft Windows and Sun Solaris.

    Cost-basis


    Exploration, Visualization, and Analysis

    Oriana is a tool for calculating statistics for circular or radial data (angles or directions measured in degrees, time of day, day of week, month of year, etc.). It can be used for orientation data (direction taken from a point), for describing and comparing species temporal distributions and ranges, and other types of data that are not directly handled in most statistics packages. It provides basic statistics such as mean vector and confidence limits, single sample distribution tests (Rayleigh's), and also pairwise and multisample tests such as Watson-Williams F-Test and chi-squared test, and correlations. Oriana can graph your circular data in a variety of ways, including rose diagrams, circular histograms or wind roses.
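    Circular data need their own statistics because ordinary averages break down at the 0°/360° boundary. The sketch below shows the standard calculation of the mean vector (mean direction and mean resultant length) and the Rayleigh z statistic for a handful of made-up bearings. It illustrates the general method only and is not Oriana's implementation; the function name is hypothetical.

```python
import math

def circular_summary(angles_deg):
    """Return (mean direction in degrees, mean resultant length R, Rayleigh z)."""
    n = len(angles_deg)
    # Treat each angle as a unit vector and average the components.
    c = sum(math.cos(math.radians(a)) for a in angles_deg) / n
    s = sum(math.sin(math.radians(a)) for a in angles_deg) / n
    r = math.hypot(c, s)  # 0 = no preferred direction, 1 = all identical
    mean_deg = math.degrees(math.atan2(s, c)) % 360.0
    rayleigh_z = n * r * r  # larger z = stronger evidence of directedness
    return mean_deg, r, rayleigh_z

bearings = [350.0, 0.0, 10.0, 20.0]  # hypothetical bearings clustered near north
mean_deg, r, z = circular_summary(bearings)
arithmetic_mean = sum(bearings) / len(bearings)
print(f"circular mean {mean_deg:.1f} deg, R = {r:.3f}, Rayleigh z = {z:.2f}")
print(f"naive arithmetic mean {arithmetic_mean:.1f} deg")
```

    For these bearings the arithmetic mean is 95°, which points roughly east even though every observation lies near north; the circular mean recovers the direction about which the bearings are actually clustered.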

    Cost-basis


    Exploration, Visualization, and Analysis

    OriginPro is an expanded version of Origin, both of which are software for data management, statistics, and graphics. OriginPro is point-and-click interactive software and uses multiple windows to manage data and run analyses. A variety of graphics can be created using a graph editor and exported for incorporation with the Microsoft Office suite. Data management is done through worksheets bundled into project management files. There are a limited number of statistical analyses available including basic descriptive statistics, linear regression and analysis of variance, survival analysis and non-parametric tests. Signal processing tools such as Fast Fourier Transform (FFT), and peak analysis tools are also available in OriginPro. Analyses can be scripted using the custom programming languages (LabTalk and Origin C).

    OriginPro is sold at a discount to students for personal use.

    Cost-basis


    Data and Metadata Management

    The OutWit suite is a set of Firefox extensions that allow you to harvest materials on the web. The suite currently includes OutWit Hub, OutWit Images, and OutWit Docs.

    OutWit Hub allows you to automatically browse through pages, collect and format data and information on the web. It will automatically explore series of Web pages or search engine results for you and extract contacts, links, images, data, news, etc. You can use the Hub's default scraping utilities to help you extract data with explicit structures in the HTML source code of the page, such as lists or tables. Also, you can build scripts that will navigate from page to page in sequences of results and automatically extract quantities of information objects. Results can be exported into MS Excel.

    OutWit Images is a Firefox extension for simple image browsing. As you explore the web for pictures, or let the extension automatically explore Web pages or search engine results for them, you can create, save, and share collections. OutWit Images can also create full-screen slideshows for image collections.

    Free


    Data and Metadata Management

    Oxygen is an XML editor that provides XML document validation and includes an SVN client for collaboration and a text editor, the Oxygen Author. It supports all XML technologies, including editors for XSLT, XPath, XQuery, and XML schema and DTDs. It provides intelligent XML editing with autocomplete features and content-sensitive XML assistance. Management support for relational databases and native XML databases is provided. Oxygen can be used as a standalone application, as a plugin for Eclipse, or as a Java Web Start application. This is a commercial application; however, it is offered for free to non-profit entities working in certain domains, including Ecology, through their 'Support Life' program.

    Cost-basis


    Exploration, Visualization, and Analysis

    Panoply is a cross-platform application which plots geo-gridded arrays from netCDF, HDF and GRIB datasets. It supports the following operations:

    • Slice and plot specific latitude-longitude, latitude-vertical, longitude-vertical, or time-latitude arrays from larger multidimensional variables.
    • Combine two arrays in one plot by differencing, summing or averaging.
    • Plot lon-lat data on a global or regional map (using any of over 75 map projections) or make a zonal average lineplot.
    • Overlay continent outlines or masks on lon-lat plots.
    • Use any ACT, CPT, GGR, or PAL color table for scale colorbar.
    • Save plots to disk as GIF, JPEG, PNG or TIFF bitmap images or as PDF or PostScript graphics files.
    • Export lon-lat map plots in KMZ format.
    • Export animations as AVI or MOV video or as a collection of individual frame images.
    • Explore remote THREDDS and OpenDAP catalogs and open datasets served there.

    To be plotted by Panoply, dataset variables must be tagged with metadata information using a convention such as CF.
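
    Two of the operations listed above, differencing two arrays and computing a zonal average, reduce to simple arithmetic on a latitude-by-longitude grid. The sketch below only illustrates that arithmetic with plain Python lists; it is not Panoply code, and the grids and function names are hypothetical.

```python
# Hypothetical 3 x 4 grids: rows are latitudes, columns are longitudes.
grid_a = [[1.0, 2.0, 3.0, 4.0],
          [2.0, 2.0, 2.0, 2.0],
          [0.0, 1.0, 0.0, 1.0]]
grid_b = [[0.5, 1.0, 1.5, 2.0],
          [1.0, 1.0, 1.0, 1.0],
          [0.0, 0.0, 0.0, 0.0]]

def difference(a, b):
    """Cell-by-cell difference of two grids with identical shape."""
    return [[x - y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]

def zonal_mean(grid):
    """Average each latitude row over all longitudes."""
    return [sum(row) / len(row) for row in grid]

diff = difference(grid_a, grid_b)
zonal = zonal_mean(diff)
print(zonal)  # one value per latitude row
```

    A real netCDF variable would also carry missing-value flags and, for an area-weighted global mean, latitude weights; both are left out of this sketch.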

    Free


    Data Deposition, Citation, Curation and Preservation

    Papers is a desktop electronic library and document organizer. The tool helps users organize documents in many formats, including PDF, MS Word, spreadsheets, presentations, posters, and scanned text. Papers includes dedicated space for users' articles and conference-related materials, such as travel documents and posters. It includes a citation tool and a built-in search engine. The search engine allows lexical and categorical searching using both free and fee-based search engines, and the subsequent importation of documents into the user's library.

    Cost-basis


    Exploration, Visualization, and Analysis

    PAUP* (Phylogenetic Analysis Using Parsimony *and other methods) is a program for phylogenetic analysis using parsimony, maximum likelihood, and distance methods. The program has a selection of analysis options and model choices, and accommodates DNA, RNA, protein and general data types. It has options for dealing with phylogenetic trees including importing, combining, comparing, constraining, rooting and testing hypotheses.

    Cost-basis


    Scientific Workflows

    Pegasus encompasses a set of technologies that help workflow-based applications execute in a number of different environments including desktops, campus clusters, grids, and now clouds. Scientific workflows allow users to easily express multi-step computations, for example retrieve data from a database, reformat the data, and run an analysis. Once an application is formalized as a workflow the Pegasus Workflow Management Service can map it onto available compute resources and execute the steps in appropriate order.
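
    The idea of executing steps "in appropriate order" can be sketched with Python's standard library. This is not the Pegasus API; it is a toy illustration, assuming Python 3.9 or later, in which the step names and functions are invented and the retrieve, reformat, analyze chain from the description is ordered by its dependencies.

```python
from graphlib import TopologicalSorter

# Hypothetical steps; each receives the previous step's output.
def retrieve(_):
    return [" 3", "1 ", "2"]          # stand-in for a database query

def reformat(rows):
    return [int(row.strip()) for row in rows]

def analyze(values):
    return sum(values) / len(values)  # simple mean

steps = {"retrieve": retrieve, "reformat": reformat, "analyze": analyze}

# Map each step to the set of steps that must finish before it can run.
dependencies = {
    "analyze": {"reformat"},
    "reformat": {"retrieve"},
    "retrieve": set(),
}

# Derive a valid execution order from the dependency graph.
order = list(TopologicalSorter(dependencies).static_order())

data = None
for name in order:
    data = steps[name](data)

print(order, data)
```

    A workflow manager adds what this sketch omits: mapping steps onto remote resources, staging data, and recovering from failures.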

    Free


    Data and Metadata Management

    pgAdmin is a design and management interface for the PostgreSQL database (open source object-relational database system). It is an open source administration and development platform.

    The most useful features for non-programmers are:

    1. creating simple SQL queries with a syntax-highlighting SQL editor and code editor
    2. manual inserts and editing of database tables using the spreadsheet-like interface

    For database administrators, pgAdmin provides a graphical interface to all the PostgreSQL features.

    The program supports multiple versions of the PostgreSQL database.

    Free


    Exploration, Visualization, and Analysis

    Photofiltre is a simple, free image editing application that offers the standard adjustment functions (Brightness, contrast, dyed, saturation, gamma correction), layers, and also artistic filters (watercolor, pastels, Indian ink, pointillism, puzzle effect). Photofiltre can also use Photoshop plug-ins.

    Free


    Data and Metadata Management

    phpMyAdmin is an open-source tool that provides for easy management of MySQL databases through a web-based user interface. Processes that can be completed through this tool include:

    • database management (creation and management of users, permissions, etc.)
    • create and execute queries
    • create and view tables, database rows and fields
    • execute stored procedures and triggers
    • import and export data

     

    Free


    Exploration, Visualization, and Analysis

    PHYLIP (the PHYLogeny Inference Package) is a package of programs for inferring phylogenies (evolutionary trees). Methods that are available in the package include: parsimony, distance matrix, and likelihood methods, including bootstrapping and consensus trees. Input data types include molecular sequences, gene frequencies, restriction sites and fragments, distance matrices, and discrete characters.

    Free


    Scientific Workflows

    Platform LSF is a workload manager designed for use in large, high-performance computing environments. This commercial tool can be used to schedule complex scientific workflows and manage very large (up to petaFLOP scale) compute resources. It provides application support across distributed and heterogeneous platforms.

    Cost-basis


    Data Deposition, Citation, Curation and Preservation

    PLATO is a planning and decision support tool that implements a solid preservation planning process and integrates services for content characterisation, preservation action and automatic object comparison in a service-oriented architecture to provide maximum support for preservation planning endeavours.

    Free


    Discovery Tools

    The Polar Information Commons (PIC) Rights Badging Tool allows you to use the Creative Commons tools to create a graphic badge. This badge asserts that digital content is available in the Polar Information Commons (PIC) with minimal restrictions and in adherence with community guidelines or norms of behavior for ethical data sharing. Once created, a badge may be placed on a website describing your data set or within the use constraints field of its metadata.

    Free


    Data and Metadata Management

    The combination of PostgreSQL and PostGIS provides a robust database platform that supports the integrated management of both geospatial data and attributes associated with those data in a database system that is supported by a large number of client applications, including GIS and mapping applications. PostgreSQL is an open source object-relational database server that implements the Structured Query Language (SQL) for database design, management, and use. PostGIS is an implementation of the Open Geospatial Consortium's "Simple Features Specification for SQL" standard which defines data types and functions that may be implemented in a SQL database for the storage and management of geospatial data within the database.

    Free


    Modeling

    PowerDesigner is a tool for creating business-process models, and conceptual, logical, and physical data models for database design, including relational and dimensional models. PowerDesigner can coordinate the business process model with the database design, ensuring that the process steps that create data have data representations in the logical model. PowerDesigner can create the actual database from the physical model, and create different physical implementations from a single logical model. PowerDesigner can also reverse-engineer existing databases into a model diagram. PowerDesigner works with many database management systems (DBMS). Major outputs from the tool include entity-relationship (ER) diagrams, impact analysis reports on design changes, and standard or custom reports on all objects in the design (tables, fields, relationships).

    Cost-basis


    Data Deposition, Citation, Curation and Preservation

    ProCite is a tool for creating citations in the users' preferred citation standard.

    ProCite features include:

    • Collecting and managing your references
    • Creating bibliographies
    • Formatting citations and bibliographies for many different journal styles
    • Sharing references with other users

    Cost-basis


    Scientific Workflows

    Project Trident is a scientific workflow workbench that allows users to author workflows visually by using a catalog of existing activities and complete workflows. The workflow workbench provides a tiered library that hides the complexity of different workflow activities and services for ease of use. Trident supports analysis and visualization workflows; composing, running, and cataloging experiments as workflows; and capturing provenance information. Workflows can be scheduled over high performance clusters or cloud computing resources.

    Free


    Data Deposition, Citation, Curation and Preservation

    PRONOM is an online registry of technical information about file formats, maintained by The National Archives (UK). The PRONOM database contains information about the properties of over 600 file formats, and is used by repository managers to understand, document and manage file formats stored in repositories. Information in the database includes extensions associated with file types, software required to render files, version histories of file types, signature types and compression information.

    In addition to searching the registry via the web interface, PRONOM provides two important services related to file type identification and metadata extraction. The DROID tool provides both a command line and GUI interface to the PRONOM registry, allowing for easy documentation of file types. The PRONOM Unique Identifier (PUID) tool allows unambiguous reference to data in the PRONOM database.
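
    Signature-based identification, one of the kinds of information the registry records, can be sketched as a lookup of leading "magic" bytes. The table below is a tiny hand-made example and is not taken from the PRONOM database; it does not use DROID, and the function name `identify` is hypothetical.

```python
# Hand-made signature table (illustrative only, not PRONOM data):
# each entry pairs the leading bytes of a file with a format name.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"%PDF-", "PDF document"),
    (b"GIF89a", "GIF image (89a)"),
]

def identify(header: bytes) -> str:
    """Return the first format whose signature matches the start of the data."""
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return name
    return "unknown"

print(identify(b"%PDF-1.4\n%..."))
print(identify(b"plain text, no signature"))
```

    In practice the first few bytes of a file would be read with `open(path, "rb").read(16)` and passed to `identify`. Relying on signatures rather than file extensions matters because extensions can be missing or wrong.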

    Free


    Data and Metadata Management

    Protege is an open source ontology editor. An ontology is similar to a taxonomy in that it presents a controlled vocabulary for a given area of knowledge. However the relationships between the different objects can be far more complex and richly described.

    It allows users to create ontologies in both the Frames and Web Ontology Language (OWL) frameworks. Protege allows users to

    • Import, edit and save existing ontologies written in OWL or RDF (Resource Description Framework).
    • Create new ontologies.
    • Save ontologies in several formats, including XML expressions of RDF and OWL
    • Visualize ontologies in graphical form, showing the functional relationships between classes.
    • Populate ontologies with concrete instances of classes.
    • Execute reasoners that can perform inferences on an ontology (i.e. classify instances based on their properties)

    Intended audience: Protege is designed for those in the field of ontology and knowledge modeling, since some degree of knowledge about the underlying axioms is nearly always required. Some plugins are available that shield a user from these to some degree.

    For working scientists, the most useful plugins and views will be those that present complex knowledge models graphically. In addition, it might be helpful for those wishing to use the RDF/XML expression of Dublin Core to annotate their data with metadata. There are numerous plugins written by other projects. There are many thousands of registered users and a wiki.

    Protege can be used to edit simpler vocabulary systems such as Simple Knowledge Organization System (SKOS), but generally, its power is overkill for this use.

    Free


    Exploration, Visualization, and Analysis

    PSPP is a program for statistical analysis of sampled data, and is a free replacement for the proprietary program SPSS. PSPP can perform descriptive statistics, T-tests, linear regression and non-parametric tests. Its back-end is designed to perform its analyses as fast as possible, regardless of the size of the input data. You can use PSPP with its graphical interface or the more traditional syntax commands. Some benefits are that PSPP uses SPSS files and is compatible with OpenOffice and can support 1 billion data observations.

    Free


    Exploration, Visualization, and Analysis

    Quantum GIS (QGIS) is an open source Geographic Information System (GIS) that implements a large number of geospatial data access, visualization, processing, and analysis functions. It can access vector data stored in a wide variety of formats, including file-based (e.g. ESRI Shapefiles, KML, GML), geodatabases (e.g. PostgreSQL/PostGIS, ODBC, ESRI Personal GeoDatabase, SQLite), and network protocols (OPeNDAP, GeoJSON); raster data in one of over 40 formats supported by the underlying GDAL raster library (including NetCDF, HDF5, GeoTIFF, GRIB, and JPEG-2000); and Open Geospatial Consortium visualization and data access services (Web Map and Web Feature Services [WMS and WFS, respectively]). Depending upon the host system configuration, QGIS can also act as an alternative Graphical User Interface for the large collection of GRASS GIS geospatial processing functions. QGIS includes a "plug-in" architecture in which extensions to the core functionality of the application may be developed and used, with current plug-ins including support for GPS integration, interaction with the OpenStreetMap data servers, and data transformation tools.

    Free


    R
    Exploration, Visualization, and Analysis

    R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows and MacOS. R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, ...) and graphical techniques. One of R's strengths is the ease with which well-designed publication-quality plots can be produced, including mathematical symbols and formulae where needed. R is highly extensible and has many user-submitted packages for specific functions or specific areas of study such as bioinformatics, ecological models, population dynamics, analysis of spatial data, and phylogenetics. Packages can be browsed by CRAN Task Views. These views may be of interest: Analysis of Ecological and Environmental Data, Statistical Genetics, Phylogenetics, Especially Comparative Methods, and Analysis of Spatial Data. There are several graphical user interface (GUI) packages that can simplify the use of R, including Tinn-R, R Commander, and SciViews.

    Free


    Data and Metadata Management

    RAMADDA is a web-based application framework that provides a broad suite of services for content and data management, publishing and collaboration. RAMADDA brings together a number of concepts and technologies to provide an easy to use but powerful system for publishing, organizing, discovering, and accessing data and other holdings.
    RAMADDA is a freely available web application that runs on your own server. Java is necessary for operation.

    RAMADDA provides the following features:

    • Data file ingest, organization, meta-data creation and access control
    • Search and browse capabilities
    • Catalog and RSS feeds
    • Data services including OpenDAP, subsetting and point data access
    • Wiki facilities

    Free


    Modeling

    Rational Rose is a software development environment for using model descriptions and pattern languages to drive code development. Rational Rose is an IBM product. The emphasis of the "rational" development environment is to design major software engineering components at an abstract modeling level, initially unhindered by the challenges of implementing components and relationships as code. Common technologies driving rational development are APIs (Application Programming Interfaces) to encapsulate interfaces from implementation, UML (Unified Modeling Language) to express abstract entities and their relations, and IDEs (Integrated Development Environments) to coordinate the various modeling and coding artifacts.

    Cost-basis


    Scientific Workflows

    RefWorks is an easy-to-use web-based product that allows you to import references. RefWorks is a subscription-based service available in many higher education institutions. Subscribers can import references from their institution's online library catalog and from many electronic databases to which the institution subscribes. It is designed to help researchers easily gather, manage, store and share all types of information, as well as generate citations and bibliographies.

    More specifically, RefWorks enables you to...

    • Download citations from databases and put them into a personal RefWorks research database
    • Format bibliographies and citations automatically in over 400 styles including APA, MLA, etc.
    • Create a bibliography of citations in a Word document using a RefWorks "as-you-write-it" add-in.
    • Access your RefWorks account from any computer, anywhere
    • Create a database and share it with colleagues around the world

    Cost-basis


    Exploration, Visualization, and Analysis

    S-PLUS is a commercial implementation of the S statistical programming language with a publication-quality graphics package and a matrix-based programming language. It provides the ability to analyze gigabyte class data sets on the desktop, and a package system for deployment of analytics.

    The R programming language is an open-source implementation of the S statistical programming language.

    Cost-basis


    SAS
    Exploration, Visualization, and Analysis

    SAS is an integrated system of software that enables everything from data access across multiple sources to complex manipulations of data files to performance of sophisticated statistical analyses and data visualizations. Three of SAS' most popular software products that are commonly used by ecologists are Base SAS, SAS/STAT, and SAS/GRAPH. SAS is available for Windows and UNIX platforms. GUIs make SAS accessible to novice users and the command-line interface facilitates development of programs for complex data processing.

    Cost-basis


    Discovery Tools

    SAS Enterprise Miner streamlines the data mining process to create predictive and descriptive models based on analysis of large amounts of data. Data can be accessed from local files or from remote database connections. SAS data mining software uses a point-and-click interactive interface to create workflows and analysis diagrams, and then execute them. SAS Miner can transform and manipulate data using filters and statistical analyses to extract desired data from large datasets.

    Cost-basis


    Scientific Workflows

    Science Pipes is a free, online resource that enables people to access publicly available biodiversity data, build scientific workflows, and share the workflows and results with other Science Pipes users. Produced by the Cornell Lab of Ornithology, the service is based on Kepler (a professional, open-source software package for building scientific workflows for data analysis and modeling), but repurposed to allow intuitive visual programming, using drag and drop components. Workflows are openly available to Science Pipes users for viewing, modifying, and commenting. Workflows can be searched by tags and titles. Users can view workflow parameters, documentation, and results. In addition, workflows can be copied, then modified to run new analyses and/or change output parameters. Science Pipes members can rate and comment on workflows, creating the potential for a rich collaborative working environment.

    Free


    Exploration, Visualization, and Analysis

    Scientific Python (SciPy) is an interactive programming environment for mathematics, science and engineering based on the open source Python programming language. SciPy builds on NumPy, a Python library that provides convenient and fast N-dimensional array manipulation, and includes many user-friendly and efficient routines for numerical integration and optimization, data analysis, and plotting. Tutorials and recipes are provided for common data analysis scenarios, and support is provided by an active community of scientific end-users.

    Free


    Exploration, Visualization, and Analysis

    SPYDER is a free software environment for visualization, numerical calculation, and data analysis. It provides a graphical development environment for the Python programming language and leverages many scientific and engineering packages including Matplotlib, NumPy and others. It is available on Windows, Mac OS X, and GNU/Linux.

    Free


    Exploration, Visualization, and Analysis

    Scratchpads is a social networking application for biodiversity research. It can be used by researchers to manage, share and publish taxonomic data online. The software has tools for managing:

    • phylogenies
    • classifications
    • bibliographies
    • documents
    • images
    • spreadsheets
    • specimen records
    • maps

    Records that are added to a Scratchpad are classified and grouped around a taxonomy. This taxonomy can be supplied by the users or imported from the Encyclopedia of Life. The Scratchpads can link to databases including GenBank, Morphbank, GBIF, Wikipedia, etc.

    It provides social networking features, image uploading, free-form page creation, a template for species description (summary data), taxonomic hierarchy editing, importing of images from Flickr and Google Images, a grid-editor for tabular data, mapping functionality, bibliographic data management, and sharing of data with Encyclopedia of Life (http://www.eol.org).

    One module allows users to generate publishable manuscripts for taxonomic description or revision in the Zookeys journal. Hosting can be provided by the ViBRANT project or you can host your site yourself. In addition to taxon-based research groups, it has been used by scientific societies and journals to establish a web-presence.

    Scratchpads is built using the Drupal content management system.

    Free


    Exploration, Visualization, and Analysis

    SigmaPlot is a commercial software package primarily used for data analysis and publication-quality visualization. Data can be input directly into a table or imported from basic ASCII or Microsoft Excel files. Data summarization (e.g., mean, sum) and analysis tools (e.g., parametric and non-parametric statistics, regression, and correlation) operate on the data at a click of the mouse. Graphs or charts in 2-D or 3-D (e.g., line, bar, and pie charts, histograms, heat maps, surfaces) are created via a wizard. Batches of files can be analyzed and graphed automatically. SigmaPlot can also create default or user-defined reports. Specialized functionality of SigmaPlot includes instrument calibration, medical test analysis, and molecular biology tools.

    Cost-basis


    Exploration, Visualization, and Analysis

    Simulink is an add-on package for MATLAB that supports simulation and model-based design using a graphical block-programming scheme. An interactive graphical editor is provided for building models and simulations based on an extensive library of customizable program blocks and custom code. A Model Explorer application supports inspection and editing of models, signals, parameters and generated code. Simulink provides full access to the MATLAB environment for analyzing and visualizing results, customizing the modeling environment, and defining signal, parameter, and test data.

    Cost-basis


    Scientific Workflows

    Skype is VoIP and instant messaging software that allows voice and chat communication between computers and phone systems via the internet. Direct computer-to-computer communication via the internet is free of charge, and computer-to-phone connections via the internet carry a relatively small fee. Additional features include video conferencing and file transfer. Skype was acquired by Microsoft in May 2011.

    Free


    Data and Metadata Management

    SpatiaLite is an extension to the SQLite database that enables it to support spatial data.

    SpatiaLite is conformant to OpenGIS specifications. It has the following features:

    • supports standard WKT and WKB formats
    • implements SQL spatial functions such as AsText(), GeomFromText(), Area(), PointN() and the like
    • supports the complete set of OpenGIS functions via GEOS, including sophisticated spatial analysis functions such as Overlaps(), Touches(), Union(), Buffer() ...
    • supports full Spatial metadata along the OpenGIS specifications
    • supports importing and exporting from / to shapefiles
    • supports coordinate reprojection via PROJ.4 and EPSG geodetic parameters dataset
    • supports locale charsets via GNU libiconv
    • implements a true Spatial Index based on the SQLite's RTree extension
    • the VirtualShape extension enables SQLite to access shapefiles as VIRTUAL TABLEs

    Free


    Data and Metadata Management

    Specify 6 is a metadata management and collections holdings system that allows you to track specimen and tissue transactions. It manages specimen data such as descriptions of collecting locations, participants and determination histories as well as information about collections transactions such as loans, exchanges, accessions and gifts.

    Specify 6 supports georeferencing with GEOLocate, label and report printing, and importing and exporting. You can manage all institutional collections within a single database for simplified administration.

    Specify comes in two versions for each desktop operating system. The full version of Specify 6 requires the installation of MySQL database manager and the Java Runtime Environment (JRE). A lighter version of Specify 6, Specify EZDB does not require MySQL installation.

    Specify also has an off-line spreadsheet version: Specify Mobile WorkBench. Specify Mobile WorkBench allows scientists to enter data off-line while in the field and then upload it to their online Specify database.

    Free


    Exploration, Visualization, and Analysis

    Spotfire is a software package for data analysis and visualization. It is an interactive visualization environment for interpreting, capturing and sharing analyses of large amounts of data from disparate and often incompatible sources. Users can merge data from both spreadsheets and databases and analyze the data using statistical operations such as data pre-processing and normalization, cluster analysis, t-tests/ANOVA, and principal component analysis.

    Cost-basis


    Exploration, Visualization, and Analysis

    Spotfire is a data analysis and visualization tool. It allows users to perform ad-hoc analysis and build custom analytic applications. It supports data imports from spreadsheets and relational databases, as well as real-time and event-driven data. Besides visualization, Spotfire also incorporates statistics functions.

    Cost-basis


    Discovery Tools

    Spotfire Miner is software for data mining of large datasets. It is sold commercially by TIBCO.

    Users can connect to remote or local datasets, apply statistical and methodological filters, clean and transform the data, and finally apply a model to produce the desired mined data. Statistical models include clustering, regression analysis, and principal components analysis. Models based on historical data can then be used to predict future results based on newly mined data.

    Cost-basis


    Exploration, Visualization, and Analysis

    SPSS is a desktop statistical software package that is centered around modeling and statistics. SPSS can access data from many different proprietary and open source data sets and has decent graphing and very good statistical modeling capabilities. One weakness (up to version 17) is the presentation quality of graphs; other packages do a much better job at data presentation.

    Cost-basis


    Exploration, Visualization, and Analysis

    IBM SPSS Amos is a tool used for structural equation modeling. It features drag-and-drop drawing tools and produces graphics of final models for presentation.

    Amos uses standard methods – including regression, factor analysis, correlation and analysis of variance. It can be used to create models to test hypotheses and confirm relationships amongst variables.

    Cost-basis


    Data and Metadata Management

    The SQL Server is a relational model database server produced by Microsoft that provides a high performance database platform that’s reliable, scalable, and easy to manage. Its primary query languages are T-SQL and ANSI SQL. There are several Editions of the Server available, which differ depending on the services they provide.

    Cost-basis


    Data and Metadata Management

    SQLite is a software library that implements a self-contained SQL database engine. SQLite can be used as a database underlying a website, or as a substitute for a Relational Database Management System (RDBMS).

    SQLite supports atomic, consistent, isolated, and durable (ACID) transactions and has easy setup and administration. A complete SQLite database is stored in a single disk file. SQLite supports large databases (up to 1 TB in size). It supports a relatively simple application programmer’s interface (API) and has no external dependencies. SQLite comes with a stand-alone command line interface client that can be used to administer SQLite databases.
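
    The atomicity described above can be demonstrated with the `sqlite3` module from Python's standard library. This is a minimal sketch with an invented table and values; it uses an in-memory database for convenience, whereas passing a file name to `connect()` would store the whole database in that single file.

```python
import sqlite3

# In-memory database for the demo; a file path would persist it on disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obs (site TEXT NOT NULL, temp_c REAL NOT NULL)")
conn.execute("INSERT INTO obs VALUES ('A', 12.5)")
conn.commit()

try:
    # The connection as a context manager commits on success
    # and rolls back the whole transaction if an exception is raised.
    with conn:
        conn.execute("INSERT INTO obs VALUES ('B', 14.0)")
        conn.execute("INSERT INTO obs VALUES ('C', NULL)")  # violates NOT NULL
except sqlite3.IntegrityError:
    pass  # the failed transaction has been rolled back

# Only the row committed before the failed transaction remains.
count = conn.execute("SELECT COUNT(*) FROM obs").fetchone()[0]
sites = [row[0] for row in conn.execute("SELECT site FROM obs")]
print(count, sites)
conn.close()
```

    Because the second insert fails, the first insert in the same transaction is undone as well, so the table is never left in a half-updated state.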

    An additional extension for SQLite called SpatiaLite exists for adding support for spatial data to SQLite databases. SpatiaLite is conformant with OpenGIS specifications. See the SpatiaLite tool entry for additional information.

    The source code for SQLite is in the public domain and implements most of the SQL Standard. Ongoing development and maintenance of SQLite is sponsored by the SQLite Consortium, which includes Mozilla, Bloomberg, Oracle, Nokia, and Adobe. Because of its small size and low overhead, SQLite is suitable for use in mobile devices such as cellular phones.

    Free


    Exploration, Visualization, and Analysis

    Stata 11 is software for data management, statistics, and graphics. Stata uses point-and-click interaction and help to guide users through tasks. Logs can be created and stored as repeatable scripts, so that data management and analysis are completely documented. Users can perform statistical analyses ranging from basic statistical summaries and linear regression models to multilevel mixed-effects modeling, generalized linear modeling, resampling and simulation, and many multivariate analyses. A graph editor allows users to produce figures based on the data and statistical models. Stata also includes a custom programming language (Mata) for programming customizations. At this time, Stata 11 is the latest version.

    Stata comes in four different application "packages" which vary based on size of dataset and processing need. A "Small Stata" is available only to educational purchasers including students, with a limited number of variables and observations permitted in the dataset.

    Many local as well as national users groups for Stata exist and hold regular meetings in addition to creating online support communities.

    Cost-basis


    Exploration, Visualization, and Analysis

    STATISTICA is a proprietary analytical software package developed by StatSoft that includes data visualization, data analysis, data management, and data mining tools. It is primarily a graphical user interface (GUI) application.

    Cost-basis


    Modeling

    STELLA (Systems Thinking for Education and Research; from isee Systems) is a modeling software package that uses diagrams, charts, and animation to help visual learners discover relationships between variables and to simplify model building. STELLA handles time series, sensitivity, and simulation models well and has a 'drag and drop' modeling interface. Users can download a free trial that has significant features.

    Cost-basis


    Data Deposition, Citation, Curation and Preservation

    SVN (an abbreviation for "Subversion") is an open source version control package of the Apache Software Foundation. Version control is a process whereby: 1) versions of a document are saved for later retrieval, even if the document is later deleted; 2) versions of a document may be compared for differences; 3) multiple authors may edit and build the document version chain, with software support for avoiding, managing, and resolving collisions; 4) catastrophic failure recovery mechanisms are in place to maintain document and version integrity across a wide class of possible threats.
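
    Point 2 above (comparing versions of a document for differences) can be illustrated with a line-by-line diff. The sketch below uses Python's standard difflib module; it is not how SVN itself is implemented, and the file name and revision labels are hypothetical.

```python
import difflib


def compare_versions(old_text, new_text, name="data_notes.txt"):
    """Return a unified diff (as a list of lines) between two versions of a text."""
    old_lines = old_text.splitlines(keepends=True)
    new_lines = new_text.splitlines(keepends=True)
    return list(
        difflib.unified_diff(
            old_lines,
            new_lines,
            fromfile=name + " (r1)",
            tofile=name + " (r2)",
        )
    )


if __name__ == "__main__":
    revision_1 = "site,temp_c\nplot_a,12.5\nplot_b,14.0\n"
    revision_2 = "site,temp_c\nplot_a,12.5\nplot_b,14.2\n"
    # Lines starting with '-' were removed and lines starting with '+' were added.
    print("".join(compare_versions(revision_1, revision_2)))
```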

    Version control systems such as SVN differ significantly from the version comparison features common in word processors. SVN and similar systems (e.g., CVS, Git) are focused on programming language source code and related text-based documents; they are not optimized for binary documents.

    Version control systems' emphasis on audit trails and catastrophe recovery makes them common platforms for backups and managing data integrity. SVN is often implemented in a host/client architecture, whereby the document repository is physically distinct from the development environment. In this model, users install a separate SVN client or use a web client to interact with the system. Version control is a best practice for software development.

    Free


    Exploration, Visualization, and Analysis

    Tableau supports the analysis of tabular data from spreadsheets and relational databases. The tool provides a visual interface that allows users to import data and interactively explore the data through visualizations. These visualizations are created through a graphical user interface that allows users to build queries by dragging and dropping attribute names from tables and spreadsheets.

    Tableau also offers Tableau Public, free visualization software whose visualizations can be published to the web.

    Cost-basis


    Scientific Workflows

    Taverna is an open source family of tools for designing and executing workflows, created by the myGrid project. Written in Java, the family consists of the Taverna Engine (the workhorse), and the Taverna Workbench (desktop client) and Taverna Server (remote workflow execution server) that sit on top of the Engine.

    Taverna allows for the automation of experimental methods through the use of a number of different services (such as Web services) from a very diverse set of domains – from biology, chemistry and medicine to music, meteorology and social sciences. Effectively, Taverna allows a scientist with limited computing background and limited technical resources and support to construct highly complex analyses over public and private data and computational resources.

    Taverna Workbench 2.1.2 supports: copy/paste, shortcuts, undo/redo, drag and drop; animated workflow diagram; remembers added/removed services; secure Web services support; secure access to resources on the Web; up-to-date R support; intermediate values during workflow runs; myExperiment integration; and Excel and csv spreadsheet support.

    Free


    Data and Metadata Management

    TemaTres is an open-source, web-based thesaurus management package. Features include a simple, functional user interface for editing and browsing keywords, sophisticated search capabilities, and the ability to import or export all or part of the thesaurus in a number of standardized forms. TemaTres features a rich set of web services that provide searching and retrieval capabilities for external programs. As a result a number of tools such as "Visual Vocabulary," "ThesaurusWebPublishers" and "TemaTres View" use TemaTres as their backend. There are also capabilities for linking vocabularies in different languages to facilitate the creation of multilingual controlled vocabularies.

    (TemaTres is developed in Argentina, and consequently not all documentation is available in English.)

    Free


    Data and Metadata Management

    The Taiwan Forestry Research Institute has produced a set of web-accessible tools that use Ecological Metadata Language (EML) documents to produce maps and statistical products. They are available at: http://metacat.tfri.gov.tw/modules.

    One tool creates an R statistical language program from an EML document that can then be edited online. Another produces a Google Map displaying locations from the metadata document. A final module ingests an EML document and the associated data and performs quality control checks and simple analyses using a graphical user interface.

    Free


    Discovery Tools

    The THREDDS Data Server (TDS) is a web server that provides metadata and data access for scientific datasets, using OPeNDAP, OGC WMS and WCS, HTTP, and other remote data access protocols. Its features include:

    • THREDDS Dataset Inventory Catalogs are used to provide virtual directories of available data and their associated metadata. These catalogs can be generated dynamically or statically.
    • The Netcdf-Java/CDM library reads NetCDF, OPeNDAP, and HDF5 datasets, as well as other binary formats such as GRIB and NEXRAD, into a Common Data Model (CDM), essentially an (extended) netCDF view of the data. Datasets that can be read through the Netcdf-Java library are called CDM datasets.
    • TDS can use the NetCDF Markup Language (NcML) to modify and create virtual aggregations of CDM datasets.
    • An integrated server provides OPeNDAP access to any CDM dataset. OPeNDAP is a widely used, subsetting data access method extending the HTTP protocol.
    • An integrated server provides bulk file access through the HTTP protocol.
    • An integrated server provides data access through the Open Geospatial Consortium (OGC) Web Coverage Service (WCS) protocol, for any "gridded" dataset whose coordinate system information is complete.
    • An integrated server provides data access through the Open Geospatial Consortium (OGC) Web Map Service (WMS) protocol, for any "gridded" dataset whose coordinate system information is complete. This software was developed by Jon Blower (University of Reading E-Science Center, UK) as part of the ESSC Web Map Service for environmental data (aka Godiva2).
    • The integrated ncISO server provides automated metadata analysis and ISO metadata generation.
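
    As an illustration of the WMS access listed above, the sketch below assembles a GetMap request URL using the standard OGC WMS 1.3.0 query parameters. The server address, dataset path, and layer name are hypothetical placeholders, not a real TDS endpoint, and the helper function is only an assumption about how a client might build such a request.

```python
from urllib.parse import urlencode


def wms_getmap_url(base_url, layer, bbox, width=512, height=256,
                   crs="CRS:84", image_format="image/png"):
    """Build a WMS 1.3.0 GetMap URL; bbox is (min_x, min_y, max_x, max_y)."""
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.3.0",
        "REQUEST": "GetMap",
        "LAYERS": layer,
        "STYLES": "",  # empty value requests the server's default style
        "CRS": crs,    # CRS:84 uses longitude, latitude axis order
        "BBOX": ",".join(str(v) for v in bbox),
        "WIDTH": width,
        "HEIGHT": height,
        "FORMAT": image_format,
    }
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    # Hypothetical endpoint and layer name, for illustration only.
    url = wms_getmap_url(
        "https://example.org/thredds/wms/some/dataset.nc",
        layer="sea_surface_temperature",
        bbox=(-180, -90, 180, 90),
    )
    print(url)
```
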
    Free


    Data Deposition, Citation, Curation and Preservation

    Tika is a Java class library available from the Apache Software Foundation. It supports media type detection based on file type signatures, metadata extraction, and text parsing and extraction.
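
    The idea of detecting a media type from a file type signature can be sketched in a few lines. The example below is not Tika's API or implementation; it is a simplified illustration in Python that checks the leading bytes of a file against a small, hand-picked table of well-known signatures.

```python
# A few well-known leading-byte signatures ("magic numbers").
SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    # ZIP is also the container for formats such as OpenDocument and EPUB,
    # so a real detector would need to look inside the archive as well.
    (b"PK\x03\x04", "application/zip"),
]


def detect_media_type(data, default="application/octet-stream"):
    """Guess a media type from the first bytes of a file's content."""
    for magic, media_type in SIGNATURES:
        if data.startswith(magic):
            return media_type
    return default


if __name__ == "__main__":
    print(detect_media_type(b"%PDF-1.7\n"))          # application/pdf
    print(detect_media_type(b"plain text content"))  # application/octet-stream
```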

    Supported Document Formats:

  • HyperText Markup Language
  • XML and derived formats
  • Microsoft Office document formats
  • OpenDocument Format
  • Apple iWork formats
  • Portable Document Format
  • Electronic Publication Format
  • Rich Text Format
  • Compression and packaging formats
  • Text formats
  • Audio formats
  • Image formats
  • Video formats
  • Java class files and archives
  • Mail formats
  • The DWG (AutoCAD) format
  • Font formats
  • Scientific formats

    The Tika application can be run in either command-line mode or graphical user interface (GUI) mode. Tika is written in Java, and the class library can be used directly in other programs where needed.

    Those with advanced programming skills can extend Tika to meet specific project or analysis needs not covered by the basic release. It is an open source project at the Apache Software Foundation and available under the Apache License version 2.0 (ALv2).

    Free


    Data and Metadata Management

    Tkme is a program for creating and modifying formal metadata that conforms to the Content Standard for Digital Geospatial Metadata devised by the Federal Geographic Data Committee (FGDC). Tkme is related to its progenitor, Xtme, which ran exclusively on Unix systems. Tkme is closely allied with mp, a compiler for formal metadata, whose purpose is to verify that the syntactical structure of a file containing formal metadata conforms to the FGDC standard, and to re-express the metadata in various useful formats. The editor is intended to simplify the process of creating metadata that conform to the standard.

    Tkme is an improved version of Xtme, built with Tcl/Tk instead of direct calls to the X Window System. This enables Tkme to run on Microsoft Windows as well as Unix systems.

    Free


    Exploration, Visualization, and Analysis

    TMI-Orion is a manufacturer of data sensors and data loggers. They have custom software called QLEVER to configure, test, record data and perform basic statistics on the data streams resulting from the sensors. The software can manage the sensors (in some cases remotely), and evaluate sensor battery life and technical performance.

    Cost-basis


    Exploration, Visualization, and Analysis

    Triana is an open source problem solving environment that combines an intuitive visual interface with powerful data analysis tools. It can be used for a range of tasks, such as signal, text, and image processing. Triana includes a large library of pre-written analysis tools and allows users to integrate their own tools. Recently, a custom writer was attached to the Triana GUI, allowing Triana to generate Pegasus/Condor input files for the GriPhyN project.

    Free


    Exploration, Visualization, and Analysis

    UCINET is a comprehensive package for the analysis of social network data as well as other 1-mode and 2-mode data. Social network analysis methods include centrality measures, subgroup identification, role analysis, elementary graph theory, and permutation-based statistical analysis. In addition, the package has strong matrix analysis routines, such as matrix algebra and multivariate statistics.
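
    To give a sense of what a centrality measure computes, the sketch below calculates normalized degree centrality (the share of other actors each actor is directly tied to) for a small undirected network. It is plain Python, unrelated to UCINET's own routines, and the edge list is hypothetical.

```python
def degree_centrality(edges):
    """Normalized degree centrality for an undirected network given as (a, b) pairs.

    Assumes no self-loops; duplicate ties are counted once.
    """
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    # Each node can be tied to at most (n - 1) other nodes.
    denominator = max(len(neighbors) - 1, 1)
    return {node: len(adj) / denominator for node, adj in neighbors.items()}


if __name__ == "__main__":
    # Hypothetical friendship ties: "ana" is connected to everyone else.
    ties = [("ana", "ben"), ("ana", "cho"), ("ana", "dev"), ("ben", "cho")]
    for actor, score in sorted(degree_centrality(ties).items()):
        print(actor, round(score, 2))
```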

    Integrated with UCINET is the NetDraw program for drawing diagrams of social networks.

    Cost-basis


    Exploration, Visualization, and Analysis

    uDIG (User-friendly Desktop Internet GIS) is an Open Source GIS framework and application for desktop GIS data access, editing, and viewing. uDIG is based upon the Eclipse Rich Client Platform (RCP) Java framework and is extensible through the addition of plugins developed using the RCP framework. In its base configuration it supports a variety of data access methods, including file-based raster and vector data, geodatabases, and Open Geospatial Consortium services (Web Map and Web Feature Services [WMS and WFS, respectively]). Plug-ins have already been developed for geospatial processing and analysis, enhanced cartographic capabilities, OGC Web Processing Service interaction, and data creation.

    Free


    Exploration, Visualization, and Analysis

    VisTrails is an open-source scientific workflow and provenance management system developed at the University of Utah that provides support for data exploration and visualization. Workflows have traditionally been used to automate repetitive tasks, but in applications that are exploratory in nature, such as simulations, data analysis, and visualization, very little is repeated; change is the norm. As an engineer or scientist generates and evaluates hypotheses about data under study, a series of different, albeit related, workflows is created as each workflow is adjusted in an interactive process. VisTrails was designed to manage these rapidly evolving workflows.

    Free


    Data and Metadata Management

    A collection of Extensible Stylesheet Language Transformations (XSLT) for transforming between various metadata standards and views, and a tool for applying those transforms to metadata records stored in Web Accessible Folders (WAF, https://geo-ide.noaa.gov/wiki/index.php?title=Web_Accessible_Folder). The focus is on translating FGDC metadata to ISO and on translating that ISO into various other standards and views.

    Free


    Exploration, Visualization, and Analysis

    WebEx is a proprietary web collaboration and meeting environment. WebEx allows users to host and join web video- and tele-conferences. WebEx requires client-side Java, a browser, and a plugin, but once these are installed, users can host and join meetings with a browser and no additional software. WebEx web conferencing allows any user to become the "presenter" and share an application or their entire desktop over the web with other meeting participants. The presenter can give mouse control to other users, thereby allowing people to engage interactively in the conference and directly edit documents on another's computer. WebEx is often used for presentations using programs such as PowerPoint in distributed meetings.

    Cost-basis


    Exploration, Visualization, and Analysis

    WEKA is a data mining tool. It is a collection of standard machine learning algorithms organized and presented to the user as a workbench. The algorithms can be applied directly to a dataset from the workbench or called from Java code. New classifiers, filters, etc. can be added through the GUI.

    WEKA is written in Java and runs on platforms that support Java. It is available under the GNU General Public License (GPL).

    Free


    Exploration, Visualization, and Analysis

    WinBUGS is software for running Markov Chain Monte Carlo (MCMC) simulations following Bayesian statistical theory. It is one of two software packages created for Bayesian Inference Using Gibbs Sampling, or BUGS. WinBUGS is so named because it runs on Windows operating systems; the OpenBUGS software can be used on other operating systems (see OpenBUGS entry).

    Bayesian inference is built on specified probabilities of models and evaluated using MCMC simulation including error components. WinBUGS implements these simulations and "samples" them according to user-defined criteria. WinBUGS can be used as a stand-alone application but can also be integrated with R statistical software using the R2WinBUGS package in R.
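
    For readers unfamiliar with Gibbs sampling, the sketch below shows the basic idea on a textbook target: a standard bivariate normal distribution with correlation rho, where each variable is drawn in turn from its conditional distribution given the other. It is plain Python and is not WinBUGS code or the BUGS language; the function name and parameter choices are hypothetical.

```python
import math
import random


def gibbs_bivariate_normal(rho, n_samples, burn_in=500, seed=0):
    """Gibbs sampler for a standard bivariate normal with correlation rho.

    Uses the known conditionals: x | y ~ N(rho * y, 1 - rho^2), and likewise for y | x.
    """
    rng = random.Random(seed)
    conditional_sd = math.sqrt(1.0 - rho ** 2)
    x, y = 0.0, 0.0

    # Burn-in: discard early draws so the chain can move away from its start.
    for _ in range(burn_in):
        x = rng.gauss(rho * y, conditional_sd)
        y = rng.gauss(rho * x, conditional_sd)

    samples = []
    for _ in range(n_samples):
        x = rng.gauss(rho * y, conditional_sd)  # draw x given the current y
        y = rng.gauss(rho * x, conditional_sd)  # draw y given the new x
        samples.append((x, y))
    return samples


if __name__ == "__main__":
    draws = gibbs_bivariate_normal(rho=0.8, n_samples=5000)
    mean_x = sum(x for x, _ in draws) / len(draws)
    print("sample mean of x:", mean_x)
```

    In practice, tools such as WinBUGS derive the sampling steps from a model specification, so users do not write the conditional distributions by hand as is done here.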

    WinBUGS requires thorough knowledge of Bayesian statistics to create and evaluate models appropriately.

    Free


    Discovery Tools

    WordPress is an open source content management system (CMS). It provides a structured website that enables users to create and edit various types of web content without requiring in-depth technical knowledge of web authoring or programming languages.

    Novice users can create web pages and add basic text and graphics to them with only a minimal introduction to the system; they can also take advantage of WordPress plugins, widgets, and themes. WordPress can also be used simply as a blogging tool.

    Free


    Modeling

    xCase is a data modeling and database design tool that is used to create logical and physical data models. xCase can create the actual database from the physical model, and create different physical implementations from a single logical model. xCase can also reverse-engineer existing databases into a model diagram. xCase works with many database management systems (DBMS). Major outputs from the tool include entity-relationship (ER) diagrams and standard or custom reports on all objects in the design (tables, fields, relationships).

    Cost-basis


    Data and Metadata Management

    XMLSpy is an advanced XML editor for modeling, editing, transforming, and debugging XML-related technologies. XMLSpy allows developers to create XML-based and Web services applications using technologies such as XML, XML Schema, XSLT, XPath, XQuery, WSDL, and SOAP. XMLSpy is also available as a plug-in for Microsoft Visual Studio and Eclipse.

    Cost-basis


    Data and Metadata Management

    Xtme is a program for creating and modifying formal metadata that conforms to the Content Standard for Digital Geospatial Metadata devised by the Federal Geographic Data Committee (FGDC). Xtme has a number of command line options and shortcuts to ease navigation and editing among the information placeholders. Xtme is closely allied with mp, a compiler for formal metadata, whose purpose is to verify that the syntactical structure of a file containing formal metadata conforms to the FGDC standard, and to re-express the metadata in various useful formats. The editor is intended to simplify the process of creating metadata that conform to the standard.

    Free


    Data Deposition, Citation, Curation and Preservation

    Zotero is open source reference management software that manages bibliographic data and related research materials (such as PDFs). On many websites, such as library catalogs, PubMed, Google Scholar, Google Books, Amazon.com, Wikipedia, and publishers' websites, Zotero shows an icon when a book, article, or other resource is being viewed. By clicking this icon, the user can save the full reference information to the Zotero library. Zotero can also save a copy of the webpage, or, in the case of scientific articles, a copy of the full text PDF. Users can then add notes, tags, attachments, and their own metadata. Selections of the local reference library data can later be exported as formatted bibliographies. Furthermore, all entries, including bibliographic information and user-created rich-text memos of the selected articles, can be summarized into an HTML report. Other notable features include web browser integration, online syncing, generation of in-text citations, footnotes, and bibliographies, as well as integration with the word processors Microsoft Word, LibreOffice, and OpenOffice.org Writer.

    Free