Investigator's Tool Kit (ITK) Developer Guide

This guide is meant to orient tool developers looking to interact with DataONE services. Successful interaction includes encoding requests and decoding responses properly, understanding DataONE's authorization system, and learning the major interactions with DataONE. DataONE also provides a few code libraries that can spare developers the mechanical details.

ITK Products

One of DataONE’s goals is to promote tool reuse by making it easier for existing tools to interact with DataONE services and content. Accordingly, we sought to reduce developer effort by releasing certain code libraries as distinct packages. We currently have libraries built in Java and Python that are released as products in their own right.

Use of these products can greatly simplify interaction with DataONE services, and should be considered before embarking on an integration project.

d1_common

These packages contain the language-specific DataONE CN and MN API interfaces and DataONE datatypes. They also contain common utilities needed by both clients and services for working with these datatypes.

packages: d1_common_java, d1_common_python

d1_libclient

These packages contain the DataONE CN and MN API method implementations and handle all of the HTTP/S request and response handling, including any encoding and serialization that needs to be done. Additionally, each library is set up to work with CILogon certificates. Last but not least, each also has methods for creating well-formed Resource Maps.

packages: d1_libclient_java, d1_libclient_python

Client Authorization

ITK tools will mostly be interested in searching, reading, and submitting content. For searching and reading content, it is important to note that results for both types of methods are dependent on the identity of the client. Anonymous clients will only get search results containing public content, and will only be allowed to download or get information on public content. Users of your application will not want to be limited to only public content, so your application will have to be aware of the end-user’s identity, and be able to provide the user’s identity when interacting with DataONE services.

Similarly, client applications that submit content cannot be anonymous, and will also need to provide the user’s identity.

DataONE uses SSL for secure connections, with short-lived X.509 certificates used in the connection handshake. ITK tools are not expected to manage the downloading of client certificates for end-users, but are expected to know where to find the certificate, and how to interrogate it to determine the Subject string to be used for content submission. The system metadata of all submitted content must have an accurate value in the “rightsHolder” field.

CILogon

DataONE does not manage any accounts associated with submitted content, and it does not issue certificates or maintain usernames and passwords. Instead, we have partnered with CILogon to be the end-user certificate issuer, and DataONE Member and Coordinating Nodes trust CILogon-issued certificates. CILogon also inserts into the certificates extra information provided by DataONE that Member and Coordinating Nodes use to navigate group memberships and equivalent Subjects (user account identities) to make authorization decisions.

CILogon certificates are short-lived (18 hours), so a person working a regular schedule only has to download a certificate once per day, usually at the first interaction with DataONE services, which makes the renewal predictable.

Because the CILogon certificate is downloaded to the client machine, and is not owned by any one client application, it can be used by all DataONE-enabled tools the end-user may happen to be using. Therefore, careful consideration should be taken before manipulating the current certificate through your application, as the effects will be felt beyond the scope of your application.

Expired Certificates

DataONE services are designed to accept certificateless (anonymous) connections, but will not convert a connection using an expired certificate into an anonymous connection, so your application should be prepared to handle SSL connection errors in order to advise the end-user to download a new certificate. The d1_libclient ITK libraries similarly do not quietly convert expired-certificate situations into anonymous connections, to avoid unexpected responses such as users suddenly not having access to their own non-public content.

Best practice is to alert the end user that their certificate is expired and perhaps direct them to the CILogon certificate download site. It is also reasonable, but certainly not required, to provide the end user the path to the certificate so they can remove it if they want to continue as an anonymous client. Ambitious application developers can also provide end-users with a view of the certificate and its status (its expiration date), so they are aware of who they are from the perspective of your application and, by extension, of DataONE services.
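The handling described above can be sketched in Python with the standard library alone. This is a minimal illustration, not the d1_libclient behaviour: the function names and the advisory text are hypothetical, and the sketch assumes the certificate is a PEM file that holds both the certificate and its private key.

```python
import ssl
import urllib.error
import urllib.request


def open_with_certificate(url, cert_path=None, timeout=30):
    """Open url, presenting the client certificate at cert_path if given.

    With cert_path=None the connection is anonymous.
    """
    context = ssl.create_default_context()
    if cert_path is not None:
        # Assumption: one PEM file containing certificate and private key.
        context.load_cert_chain(cert_path)
    return urllib.request.urlopen(url, context=context, timeout=timeout)


def certificate_advice(exc):
    """Return a message for the end user if exc is an SSL failure, else None."""
    # urllib wraps connection failures in URLError; the cause is in .reason.
    reason = getattr(exc, "reason", exc)
    if isinstance(reason, ssl.SSLError):
        return ("The secure connection failed; your certificate may have "
                "expired. Download a new certificate, or remove the old one "
                "to continue as an anonymous client.")
    return None
```

An application would call open_with_certificate inside a try block that catches ssl.SSLError and urllib.error.URLError, and show the result of certificate_advice to the user rather than silently retrying without the certificate.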

DataONE Portal

Persons and Groups

While DataONE does not maintain user accounts or certificates, it does allow users to map equivalent identities to each other, so that private content accessible to one account can be accessed using the CILogon certificate issued for the other. This mapping is included in the certificate CILogon provides, so your application does not need to do anything special to handle this situation. If you want your application to get involved, direct your users to the DataONE Portal to set up these mappings. (This is a stable URL.) You can find the mapping information by calling CN_Identity.listAccounts and looking at the person record for the subject of the certificate.

The Portal also allows CILogon-authorized users to set up groups which can be used for assigning access control to submitted objects. Group membership can be edited by the group owner, but is limited to only users who have identified their CILogon account(s) through the Portal.

Note: While users don’t have to identify their CILogon account through the Portal in order to access content, it is a good practice, as it allows the mapping of equivalent identities and participation in Groups.

Non-CILogon accounts / subjects

DataONE recognizes that the accounts a Member Node uses for assigning access rights to an object may not be CILogon-compatible accounts. In those cases, users should contact the organization where the content is stored and request that its system administrators work with DataONE to map this non-CILogon subject.

Searching

Data discovery - finding out what’s in DataONE - is done through the CN Solr search endpoint. DataONE also provides the GUI-driven ONEMercury, which may better serve the needs of your users. ONEMercury provides more of a package-level view of the data, rather than a per-object view.

For a thorough discussion of how to use the CN Solr search endpoint directly from your application (by creating your own queries), see the detailed page on Content Discovery.
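As a rough sketch of what building such a query looks like, the helper below assembles a Solr query URL. The base URL is the one used for the node list in this guide; the "/query/solr/" path and the field names ("id", "title") are assumptions to verify against the Content Discovery page. Only the q, fl, and rows parameters are standard Solr.

```python
from urllib.parse import quote, urlencode

# Assumption: the search endpoint lives under this base at /query/solr/.
CN_BASE = "https://cn.dataone.org/cn/v1"


def solr_query_url(query, fields=("id", "title"), rows=10, base=CN_BASE):
    """Return a GET URL for a Solr query with properly encoded parameters."""
    params = {"q": query, "fl": ",".join(fields), "rows": rows}
    # quote_via=quote encodes spaces as %20 instead of "+".
    return base + "/query/solr/?" + urlencode(params, quote_via=quote)
```

Remember that the results depend on the identity presented with the request: an anonymous request returns public content only.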

Retrieval

Once a user has an identifier for something of interest, retrieval is the natural next step. However, as a developer, you have the opportunity to streamline the user experience by asking a few questions before downloading the bytes: How big is the object? Is it the latest version of the object? Asking those questions before the download can save your end user time and effort. The CN_Read API provides a few methods to help determine how to proceed.

CN_Read.describe(<identifier>): returns the object size and modification date (of the systemMetadata) efficiently with an HTTP HEAD request. This can be useful to know whether or not to reload the system metadata, or to find out the file size. It is lighter-weight than getSystemMetadata(...), but some may prefer to simply call getSystemMetadata(...) itself for the information.

CN_Read.getSystemMetadata(<identifier>): returns administrative details about the object, including whether it is obsoleted by a later version or not, the file size, and the format identifier.

Most users want the latest version of something, so checking this before downloading, along with the appropriate transparency to the user, is a good practice.

Note: Both of the above methods are also available under the MN_Read API, but a Member Node can only answer for the objects it holds. For example, getting a Not Found from one Member Node doesn’t mean the object can’t be found on another Member Node.
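The pre-download checks above might be sketched as follows, using only the Python standard library. The REST paths and the XML element names (size, formatId, obsoletedBy) are assumptions based on the field names mentioned in this guide, and the function names are illustrative; the parser matches on local element names so it does not depend on a particular namespace.

```python
import urllib.request
import xml.etree.ElementTree as ET
from urllib.parse import quote

CN_BASE = "https://cn.dataone.org/cn/v1"


def describe_request(pid, base=CN_BASE):
    """Build (but do not send) a HEAD request for an object."""
    # Assumption: describe maps to HEAD on /object/<identifier>.
    url = base + "/object/" + quote(pid, safe="")
    return urllib.request.Request(url, method="HEAD")


def _local(tag):
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def summarize_system_metadata(xml_text):
    """Extract the fields needed to decide whether to download."""
    root = ET.fromstring(xml_text)
    found = {_local(child.tag): (child.text or "").strip() for child in root}
    return {
        "size": int(found["size"]) if "size" in found else None,
        "format_id": found.get("formatId"),
        # None means no later version has been recorded.
        "obsoleted_by": found.get("obsoletedBy"),
    }
```

If obsoleted_by is set, the application can tell the user that a newer version exists and offer to retrieve that one instead.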

CN_Read.resolve(<identifier>)

When it is time to download the item, use the CN_Read.resolve method instead of either MN_Read.get or CN_Read.get. Resolve returns an ObjectLocationList along with an HTTP status code of 303 (redirect). The returned ObjectLocationList contains a prioritized list of the Member Nodes from which the object can be retrieved with an MN_Read.get(...). If a Member Node returns a Service Failure, the next Member Node on the list can be called.

Browsers and clients that can follow the redirect automatically will instead get the object from one of the Member Nodes on the list.
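For clients that handle the list themselves, the failover could look like the sketch below. It starts from the XML text of the ObjectLocationList (many HTTP clients follow the 303 automatically, so obtaining the body may require disabling redirects). The element names objectLocation and url are assumptions about the serialized list, and fetch is a caller-supplied function, so the retrieval mechanism is left open.

```python
import xml.etree.ElementTree as ET


def _local(tag):
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def parse_locations(xml_text):
    """Return the retrieval URLs from an ObjectLocationList, in listed order."""
    urls = []
    for element in ET.fromstring(xml_text).iter():
        if _local(element.tag) != "objectLocation":
            continue
        for child in element:
            if _local(child.tag) == "url" and child.text:
                urls.append(child.text.strip())
    return urls


def fetch_from_first_available(urls, fetch):
    """Try each URL in order and return the first successful result."""
    last_error = None
    for url in urls:
        try:
            return fetch(url)
        except OSError as exc:  # urllib.error.URLError is an OSError subclass
            last_error = exc
    raise RuntimeError("no listed location could serve the object") from last_error
```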

Package Retrieval

Content in DataONE is usually stored in the form of a package of closely related objects - the data objects and the metadata objects that describe them, as well as a resource map that is the inventory of both types and how they are related. Once you have the resource map, downloading all of the objects or latest versions of each object is straightforward. But, if you have only the identifier of a metadata object or data object, you will need some help to navigate to the resource map.

Package Navigation

If you have a metadata document you are familiar with and are able to read, you may be able to locate the data objects from information provided in the metadata document, and skip the step of locating the resource map. DataONE accommodates the storage of many metadata formats, so it does not rely on the metadata to do this (it actually doesn’t know how to). Instead, it extracts the relationships from the associated resource map and stores them in the Solr index, as a field available for search. Navigating from a data identifier or metadata identifier to a resource map is not conclusive, however, since either could be in several resource maps. In this case, you will need to present a list of choices to the end-user.

The SOLR query for finding the Resource Maps of any object is:

TODO: (fill in the query)

The metadata associated with each package can help the end-user determine which resource map to use.

Content Submission & Curation

From the perspective of ITK tools, content is submitted to DataONE through the Member Node Storage API. Not all Member Nodes implement the MN_Storage API, and some implement it but limit who is allowed to submit content.

To get a list of nodes that implement this API, use the CN.listNodes method call, and look for the MN_Storage API being set to true in the services listing. In general, end-users already know what Member Node they will be submitting content to, but the returned NodeList contains the Name, Description, Node ID, and BaseURL for all of the nodes. If your application plans on caching the preferred Member Node as a preference, it is best to cache the Node ID, as it is stable, while the BaseURL is subject to change over time.

The list of all current nodes: https://cn.dataone.org/cn/v1/node
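Filtering the node list for storage-capable Member Nodes might look like the following sketch. It assumes the NodeList XML contains node elements with identifier and baseURL children and service elements carrying name and available attributes. The default service name "MNStorage" is also an assumption about how the MN_Storage API is spelled in the listing, which is why it is a parameter.

```python
import xml.etree.ElementTree as ET


def _local(tag):
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def nodes_offering(xml_text, service_name="MNStorage"):
    """Map Node ID to BaseURL for nodes listing service_name as available."""
    result = {}
    for node in ET.fromstring(xml_text).iter():
        if _local(node.tag) != "node":
            continue
        fields = {_local(c.tag): (c.text or "").strip() for c in node}
        offers = any(
            _local(s.tag) == "service"
            and s.get("name") == service_name
            and s.get("available") == "true"
            for s in node.iter()
        )
        if offers:
            # Key on the Node ID: it is stable, while the BaseURL may change.
            result[fields.get("identifier")] = fields.get("baseURL")
    return result
```

An application that stores the preferred Member Node as a preference would save the key (the Node ID) and look up the current BaseURL from a fresh node list when needed.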

Generating Identifiers

All content must be submitted with a unique (and persistent) identifier. Some Member Nodes follow a naming convention, so before submission some users may need to first generate an identifier using the MN_Storage.generateIdentifier method. An identifier can also be generated by the corresponding CN_Core method, but the result may not conform to Member Node requirements.

Generating an identifier does not guarantee that it is unique or that it will be accepted by DataONE when the content is submitted. To secure an identifier, use CN_Core.reserveIdentifier. If the name of the identifier is important, it may be wise to reserve all of the identifiers for a package upfront - before submitting the package objects.

For further discussion see: http://mule1.dataone.org/ArchitectureDocs-current/design/PIDs.html

Generating Checksums

DataONE needs checksums that it can recalculate. The list of checksum algorithms DataONE can use is returned by CN_Core.listChecksumAlgorithms(). The checksum of the object being submitted should be calculated using one of the algorithms from the list, and included in its associated system metadata.
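A checksum can be computed without loading the whole object into memory, as in this sketch using Python's hashlib. The algorithm names shown ("MD5", "SHA-1") are assumed spellings; use whatever names the algorithm list actually returns, provided hashlib supports them.

```python
import hashlib


def read_chunks(stream, chunk_size=1024 * 1024):
    """Yield successive byte chunks from a binary file-like object."""
    return iter(lambda: stream.read(chunk_size), b"")


def compute_checksum(chunks, algorithm="SHA-1"):
    """Return the hex digest of the concatenated chunks."""
    # Names such as "SHA-1" are normalized to hashlib names such as "sha1".
    digest = hashlib.new(algorithm.replace("-", "").lower())
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()
```

For a file on disk, open it in binary mode and pass read_chunks(file_object) as the first argument. Record the algorithm name alongside the digest in the system metadata so the value can be recalculated.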

Controlling Access

A primary consideration when submitting new content is setting an Access Policy. Objects without an Access Policy are only visible to the rights-holder, which is usually not what is intended, so it is a good practice to set up an Access Policy at the time of submission. It is also very important to make sure the rights-holder field is properly set, so that the end user can manage the object later.

Access control can also be changed after content submission, using the CN_Authorization.setAccessPolicy method. This method does not amend the existing Access Policy - it completely replaces the old policy with the new one. To add a rule, one must retrieve the current policy, modify it accordingly, and submit the modified version.
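Because the policy is replaced wholesale, adding a rule is a read-modify-write operation. The sketch below shows only the modify step on an in-memory XML element; the element names (accessPolicy, allow, subject, permission) and the permission value are assumptions about the serialized policy, and retrieving and submitting the policy are left to the client library.

```python
import copy
import xml.etree.ElementTree as ET


def with_added_rule(policy, subject, permission):
    """Return a copy of an accessPolicy element with one more allow rule.

    The original element is left untouched, so the caller can compare
    or fall back to it if submission fails.
    """
    updated = copy.deepcopy(policy)
    allow = ET.SubElement(updated, "allow")
    ET.SubElement(allow, "subject").text = subject
    ET.SubElement(allow, "permission").text = permission
    return updated
```

The same copy-then-extend approach applies when carrying an existing object's policy over to a new version of that object.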

Submitter vs. Rights-holder

DataONE provides for the delegation of data curation by recording the submitter field as well as rights-holder field in the system metadata. Note, however, that the submitter - always the Subject of the client - does not get any access rights automatically assigned to it. Rather, the rights-holder does. For situations involving delegation of continued management of the object(s), it is best to set up a DataONE Group with the curator’s subject in it, and include that group in the access policy of the submitted objects.

For more information on how to set access for an object, see: http://mule1.dataone.org/ArchitectureDocs-current/design/Authorization.html#object-access-control

MN_Storage.create

The create method is used to submit new content. If the content is intended to replace an existing object, use MN_Storage.update instead. A System Metadata document must be filled out and submitted along with the object.

MN_Storage.update

Updates are used to submit a new version of an existing object. They must take place on the Authoritative Member Node of the object being succeeded and obsoleted. The process is almost the same as create, except that the identifier of the object being obsoleted is included in the call. The obsoletes field in the system metadata should also be filled out with the same identifier.

Upon successful update, the system metadata of the obsoleted object is updated: the obsoletedBy field is filled out with the identifier of the new object, and the isArchived field is set to signal the CNs to remove the item from the search index. This follows DataONE’s preservation strategy of preserving past versions.

Also note that new versions of objects do not automatically inherit the Access Policies of their predecessors, so care should be taken to copy the access policy of the existing object into the system metadata of the newer version.

MN_Storage.archive

When users wish to ‘remove’ an object, the archive method is used. Following DataONE’s preservation strategy, the object is still available by identifier, but it is removed from the CN search index, to disable de novo discovery.

Package Submission

DataONE strongly encourages the submission of metadata alongside any data submissions, as well as resource maps to help others navigate to all members of the set. It is common for submitters to have one metadata document describe a handful of data objects, with a resource map containing references to the metadata and data objects. d1_libclient has tools to greatly simplify the creation of these resource maps.

To submit a package, it is best to submit the data objects first, to make sure the identifiers are unique and accepted, followed by the metadata which may reference the data identifiers, and finally the resource map itself.
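The recommended ordering can be enforced with a small sort before submission, as in this sketch. The kind labels ("data", "metadata", "resource_map") are hypothetical tags an application might attach to its own queue of objects; they are not DataONE values.

```python
# Lower rank is submitted earlier.
SUBMISSION_RANK = {"data": 0, "metadata": 1, "resource_map": 2}


def submission_order(objects):
    """Sort (identifier, kind) pairs: data, then metadata, then resource map.

    sorted() is stable, so objects of the same kind keep their given order.
    """
    return sorted(objects, key=lambda item: SUBMISSION_RANK[item[1]])
```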

Nuts and Bolts

Request transmission

DataONE uses a RESTful API design to allow for service stability over time. HTTP POST and PUT requests are transmitted as mime-multipart-mixed, while the API parameters of HTTP GET requests are contained in the URL and need to be properly URL-encoded according to RFC 3986. All responses are logically one object, so they are returned as serialized objects. For more information, see: http://mule1.dataone.org/ArchitectureDocs-current/apis/REST_overview.html