Users of DataONE can now access detailed metrics on their uploaded data through our new data profiles.
January marked the launch of version 2 of the DataONE systems supporting our network of 31 Member Nodes. Designed in response to community input, version 2 supports data profiles; sign-in through ORCID, Google, or university affiliations; streamlined access for client tools such as R; and a host of new technical features that make it easier for Member Node repositories to manage their data.
Data profiles let researchers know their data are being accessed. By signing in to DataONE Search, users can now access detailed metrics on datasets that have been uploaded into repositories within the DataONE network of Member Nodes. For example, the new data profile page for researcher Jennifer Balch provides a summary of her individual records and contributions, the total number of data and metadata downloads, temporal trends in downloads, counts of data and metadata uploads, summaries of file formats used, and a graphical representation of the time range over which data were collected. These features are provided within the DataONE Search interface, making it easy to move between reviewing metrics on one's own data and discovering other data within the network. The same information can be aggregated and viewed for entire Member Node repositories (such as the KNB; https://search.dataone.org/#profile/KNB), and for the whole DataONE network (https://search.dataone.org/#profile). An overview of these features, and of the sign-in process, is provided in two of our newly released DataONE Search screencast tutorials at https://www.dataone.org/dataone-search-screencasts.
DataONE supports ORCIDs! When signing in to DataONE Search using ORCID, users can connect their publicly available ORCID data with their DataONE profile. ORCID (Open Researcher and Contributor ID) is a platform-independent, persistent digital identifier that provides consistent researcher identification for scientific and academic scholarship over time.
We’ve made signing in to DataONE much simpler for people using DataONE-enabled tools such as the DataONE R package, the R client that provides read/write access to DataONE Member Node repositories. Now, users only need to copy their authentication token from within their DataONE profile page to sign in and upload data from within R. This functionality will soon be extended to users of Matlab. From their profile page, users can also create ‘groups’ to manage access and permissions to their data, enabling collaboration on a private data set before publication, or to share editing and publishing privileges across multiple users.
For more detailed technical information on version 2 of the DataONE service infrastructure, including information on series identifiers and metadata control, please read below.
Information about DataONE Search and other DataONE tools can be found at: https://www.dataone.org/investigator-toolkit
To search for data or access your data profile go to: https://search.dataone.org/
For screencast tutorials see: https://www.dataone.org/dataone-search-screencasts
Version 2.0 Features
Version 2.0 of the DataONE service infrastructure was released over the course of December 2015, followed by a number of updates in January 2016. This major upgrade to DataONE services implements many behind-the-scenes changes and lays the foundation for a host of new features that the DataONE team expects to release on a regular basis.
The Version 2.0 services represent an evolutionary improvement on Version 1 and are fully backwards compatible. Existing infrastructure, such as Member Nodes and Investigator Tools, does not need to be upgraded immediately and may continue operating with Version 1.
Most of the new functionality offered by Version 2 is a direct result of feedback provided by the community of DataONE users, including both Member Nodes and investigators, over several years of production operations. Two such major changes in Version 2 are support for mutable content through the use of "Series Identifiers" and the transition of authoritative system metadata control from Coordinating Nodes back to the Member Nodes.
All content in DataONE is immutable and uniquely identified by a Persistent Identifier (PID) in order to support repeatable analysis use cases.
In Version 2, we have introduced the Series Identifier (SID) to support a new use case: retrieving the latest revision of a data set directly by its identifier. In Version 1, it was necessary to query the system for the available revisions of each object in order to locate the most recent revision of a data set or its components. The availability of SIDs in Version 2.0 makes it more efficient for users to find the most recent version of a data set.
The availability of both SIDs and PIDs for identifying content provides more flexibility for content providers and consumers and aligns well with typical use patterns. The most important distinction is that a PID will always refer to an exact version of an object whereas a SID will always refer to the latest revision of a series.
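The distinction between the two identifier types can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: the `Catalog` class, its method names, and the example identifiers are hypothetical and are not part of the DataONE API. It shows a PID always returning one exact object while a SID returns whichever revision was added to the series most recently.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Catalog:
    """Toy catalog (hypothetical, not the DataONE implementation)."""
    objects: Dict[str, bytes] = field(default_factory=dict)     # PID -> immutable content
    series: Dict[str, List[str]] = field(default_factory=dict)  # SID -> PIDs, oldest first

    def add_revision(self, pid: str, content: bytes, sid: Optional[str] = None) -> None:
        # A PID identifies exactly one object, so it can never be reused.
        if pid in self.objects:
            raise ValueError(f"PID {pid!r} already exists; content is immutable")
        self.objects[pid] = content
        if sid is not None:
            self.series.setdefault(sid, []).append(pid)

    def resolve(self, identifier: str) -> str:
        """Return the PID that an identifier (PID or SID) currently points to."""
        if identifier in self.series:
            return self.series[identifier][-1]  # SID: latest revision in the series
        if identifier in self.objects:
            return identifier                   # PID: always this exact object
        raise KeyError(identifier)

    def get(self, identifier: str) -> bytes:
        return self.objects[self.resolve(identifier)]


catalog = Catalog()
catalog.add_revision("pid-1", b"first revision", sid="series-a")
catalog.add_revision("pid-2", b"second revision", sid="series-a")
```

In this model, `catalog.get("pid-1")` keeps returning the first revision for repeatable analyses, while `catalog.get("series-a")` follows the series to the newest revision without a separate query for available revisions.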
System Metadata Control
In Version 1 of the infrastructure, system metadata (access control rules, replica information, and other details) was created at the Member Nodes, but always managed by Coordinating Nodes. This simple method of operation meant that Coordinating Nodes always held the most up-to-date information about any object in the DataONE federation. However, it also meant that any changes to system metadata, such as an update to access control rules, would always need to be done at a Coordinating Node. This would sometimes cause an undesirable latency in returning updates to a Member Node.
In Version 2, the Member Node holds the authoritative copy of system metadata. Changes to system metadata, such as access control rules, can be created, distributed, and reflected quickly at Member Nodes. Coordinating Nodes are notified of such edits, and prioritize updates of the system metadata to themselves and to other Member Nodes holding replicas. The overall benefit of this change is a significant reduction in latency for common operations that alter system metadata, and a generally more responsive environment for content curators.
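The Version 2 update flow described above might be modeled roughly as follows. This is a simplified sketch, not DataONE code: the `Node` and `CoordinatingNode` classes, the `notify_changed` method, and the dictionary layout of system metadata are all hypothetical names chosen for illustration.

```python
import copy
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    """A node holding a copy of system metadata per PID (hypothetical model)."""
    name: str
    sysmeta: Dict[str, dict] = field(default_factory=dict)


@dataclass
class CoordinatingNode(Node):
    # PID -> Member Nodes that hold replicas of the object
    replicas: Dict[str, List[Node]] = field(default_factory=dict)

    def notify_changed(self, pid: str, source: Node) -> None:
        """Copy the authoritative metadata from the source, then refresh replica holders."""
        self.sysmeta[pid] = copy.deepcopy(source.sysmeta[pid])
        for holder in self.replicas.get(pid, []):
            holder.sysmeta[pid] = copy.deepcopy(source.sysmeta[pid])


def update_access_rules(member: Node, cn: CoordinatingNode, pid: str, rules: List[str]) -> None:
    # The change is applied at the authoritative Member Node first,
    # so it takes effect there without a round trip to a Coordinating Node.
    member.sysmeta[pid]["access"] = list(rules)
    # The Coordinating Node is then notified and propagates the change.
    cn.notify_changed(pid, member)


origin = Node("mn-origin", {"pid-1": {"access": ["owner"]}})
replica = Node("mn-replica", {"pid-1": {"access": ["owner"]}})
cn = CoordinatingNode("cn", {"pid-1": {"access": ["owner"]}}, {"pid-1": [replica]})
update_access_rules(origin, cn, "pid-1", ["owner", "public"])
```

The point of the design is the ordering: the edit is visible at the Member Node immediately, and synchronization to the Coordinating Node and replica holders follows as a separate step.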
Other Significant Changes
DataONE infrastructure has always proven reliable (with >99.999% uptime since starting production operations in July 2012). There is, however, always room for improvement. Version 2.0 also includes numerous internal adjustments to improve the efficiency and reliability of the overall infrastructure:
- Solr 5 and Zookeeper for distributed indexing
- Full support for suggested and actual content file names and media type
- Improved usage log aggregation with COUNTER support
- New bearer token authentication mechanism that operates in parallel with the existing certificate-based authentication process
- Support for authentication by ORCID in addition to CILogon and the InCommon Federation of identity providers
- Numerous bug fixes and performance improvements
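As an illustration of the bearer token item in the list above, a client might attach a token copied from the profile page to an HTTP request as sketched here. The assumptions are significant: the token is assumed to travel in a standard HTTP `Authorization: Bearer` header, and the base URL, the `/object/` path, and the function name `build_authenticated_request` are placeholders for illustration rather than documented DataONE endpoints. The request is only constructed, never sent.

```python
import urllib.parse
import urllib.request


def build_authenticated_request(base_url: str, identifier: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a GET request carrying a bearer token.

    base_url and the '/object/' path are hypothetical placeholders.
    """
    # Identifiers may contain ':' or '/', so percent-encode them for use in a URL path.
    encoded_id = urllib.parse.quote(identifier, safe="")
    url = f"{base_url.rstrip('/')}/object/{encoded_id}"
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


request = build_authenticated_request(
    "https://example.org/v2/",   # placeholder service root
    "doi:10.1/abc",              # made-up identifier
    "TOKEN-FROM-PROFILE-PAGE",   # placeholder token value
)
```

Compared with certificate-based authentication, this approach only requires the client to hold a string, which is why copying a token into a tool such as the R client is enough to sign in.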
Technical documentation of the services offered in Version 2.0 is available for review at: https://purl.dataone.org/architecture-dev
DataONE enables universal access to data and helps researchers meet their data management needs and provide secure and permanent access to their data. DataONE offers the scientific community a suite of tools and training materials that cover all aspects of the data life cycle, from data collection to management, analysis, and publication.