The 2017 DataONE Summer Internship Program is now OPEN for applications

The Data Observation Network for Earth (DataONE) is a virtual organization dedicated to providing open, persistent, robust, and secure access to biodiversity and environmental data, supported by the U.S. National Science Foundation. DataONE is pleased to announce the availability of summer research internships for undergraduates, graduate students and recent postgraduates.

Program Information

Interns undertake a nine-week program of work centered on one of the projects listed below. Each intern will be paired with one primary mentor and, in some cases, secondary and tertiary mentors. Interns need not necessarily be at the same location or institution as their mentor(s). Interns and mentors are expected to have a face-to-face meeting at the beginning of the summer and to maintain frequent communication throughout the program. Interns are required to work in an open notebook environment.


February 13 - Application period opens
March 17 - Deadline for receipt of applications at midnight Mountain time
April 3 - Notification of acceptance and scheduling of face-to-face meetings (schedules permitting)
May 22 - Program begins*
June 19 - Midterm evaluations
July 21 - Program concludes**
* Some allowance will be made for students who are unavailable during these dates due to their school calendar.
** Program may not extend beyond August 11, 2017.


The program is open to undergraduate students, graduate students, and postgraduates who have received their degree within the past five years. Given the broad range of projects, there are no restrictions on academic background or field of study. Interns must be at least 18 years of age by the program start date, must be currently enrolled or employed at a U.S. university or other research institution, and must currently reside in, and be eligible to work in, the United States. Interns are expected to be available approximately 40 hours/week during the internship period (noted above), with significant availability during normal business hours. Interns from previous years are eligible to participate.

Financial Support

Interns will receive a stipend of $5,000 for participation, paid in two installments (one at the midterm and one at the conclusion of the program). In addition, required travel expenses will be borne by DataONE. Participation in the program after the mid-term is contingent on satisfactory performance. The University of New Mexico will administer funds. Interns will need to supply their own computing equipment and internet connection. For students who are not US citizens or permanent residents, complete visa information will be required, and it may be necessary for the funds to be paid through the student’s university or research institution. In such cases, the student will need to provide the necessary contact information for their organization.


Projects will cover a range of topic areas and vary in the extent and type of prior background required of the intern. Not all projects are guaranteed funding and the interests and expertise of the applicants will, in part, determine which projects will be selected for the program. The titles and descriptions of this year’s projects are posted below.

2017 Project Titles

Project 1: Markdown-based Semantic Annotation of Workflow Scripts
Project 2: DataONE Messaging: Creating Marketing for DataONE Stakeholder Communities
Project 3: Prospective and Retrospective Provenance Queries Using YesWorkflow, RDF, and SPARQL
Project 4: Exploration of Search Logs, Metadata Quality and Data Discovery
Project 5: Improving DataONE’s Search Capabilities Through Controlled Vocabularies
Project 6: Development of an Open Source Units of Measure Knowledge Graph

Other (non-DataONE funded) internship positions are advertised HERE.

Project Details

Project 1: Markdown-based Semantic Annotation of Workflow Scripts

Primary Mentor(s): Deborah McGuinness
Secondary Mentor(s): John Erickson (RPI)
Additional Mentor(s): Bertram Ludaescher, Tim McPhillips

Necessary Prerequisites:

  • Some experience writing data analytics scripts in best practice environments (R, MATLAB and/or Python (R preferred))
  • Demonstrated coding experience using Python, C++, Java and/or Haskell
  • Familiarity with the Semantic Web "stack," including RDF, OWL, linked data principles
  • Summary: Successful completion of this project will require reverse engineering of the 'knitr' R package, the knitr extension process and the pandoc document converter. Some forking of the knitr and/or pandoc code base may be required. At a minimum, the successful candidate must have coding skills and be able to bootstrap in languages they encounter.

Desirable Skills / Qualifications:

  • Advanced coursework or equivalent project experience creating data analytics scripts in R
  • Experience using Markdown in R and/or Python
  • Experience using knitr and Pandoc for Markdown-based document generation
  • Experience with IPython/Jupyter Notebooks and/or R Notebooks
  • Experience with programmatic linked data generation and application
  • Experience with Haskell (due to possible Pandoc extensions)
  • Summary: Successful completion of this project will require rapid learning of multiple technologies. The student will have the opportunity to learn the component technologies on the fly, but previous experience in any of these will help accelerate the process.

Expected Outcomes:

  • Student will create an RStudio extension (via knitr and Pandoc) to produce RDF and HTML+RDFa documents containing semantic descriptions of data analytics scripts
  • Student and mentors will propose prototype best practices for semantic markdown

Project Description:

The proposed work will result in an extension to the RStudio environment enabling data analysts to directly publish RDF that richly describes the semantics of their scripts. This work will also include draft best practices that guide practitioners in proper embedding of appropriate concepts and vocabulary from established ontologies (including ProvONE and domain ontologies).

In detail, this work entails an exploration of extending markdown syntax (esp. R Markdown) in concert with knitr to directly produce workflow markup in a human-readable way. A specific example of what this means: when "knitting" a markdown rendition, instead of generating (e.g.) PDF or HTML, the anticipated tool will generate RDF (TTL or JSON-LD) or HTML+RDFa. By "human-readable," we mean that markdown best practices will be developed that are reasonable for a data analyst to use; methods (possibly based on templates) must be developed that do not require the user to "know" RDF. Today we can create cumbersome R Markdown (Rmd) files that produce HTML+RDFa outputs with correct embedded workflow semantics, but the user must be an HTML and RDFa hacker to understand them. Workflow reproducibility requires tools that data analysts will actually use.

With the right skillset, the intern may develop methods for semi-automatically extracting function and package semantics and encoding these into the resulting graph. By this we mean, in addition to capturing explicit semantics expressed via markdown syntax, the intern may develop a way to further capture "meaning" based on the use of functions in scripts, without requiring users to artificially wrap standard functions; this might be done through "wrapper" functions in R or some other means.

This work will advance the semantic workflow work inspired by YesWorkflow and leverages an approach using standard practices for R extensions, markdown and publication, creating a direct path for DataONE analysts to get their workflows represented in knowledge graphs. This approach broadens the potential DataONE user base by helping to ensure that their workflows and results are easier to discover and conceptually easier to understand, and are therefore more likely to be cited, reused and reproduced.

Project 2: DataONE Messaging: Creating Marketing for DataONE Stakeholder Communities

Primary Mentor(s): Amber Budden
Secondary Mentor(s): Trisha Cruse, Suzie Allard

Necessary Prerequisites:
A creative outlook and a desire to develop the marketing profile for an established organization. Experience with graphic design and Adobe products (e.g. InDesign, Photoshop). An interest in communicating scientific / technical content to broad audiences. A talent for creating engaging visuals and materials. Experience with science communication, advertising, PR or journalism.

Desirable Skills / Qualifications:
Would be useful for the intern to have access to a computer with Adobe Creative Suite license, but not critical. Familiarity or interest with one or more of the fields related to DataONE (e.g. environmental science, information science, computer science).

Expected Outcomes:
At the end of this project you, the intern, will have (1) worked directly with a “client” (DataONE) to create brand-based marketing materials; (2) created a creative brief which can be used in your portfolio; (3) designed and created materials based on audience segments (what we call stakeholder communities); and (4) produced a set of complementary marketing materials to add to your portfolio. These materials will become part of the DataONE marketing suite. The materials developed will include:

  • A customized PowerPoint presentation theme
  • Marketing pieces for download and print, designed for four different audience segments
  • Adobe templates for use in future material development
  • A suite of graphical elements to identify the DataONE “brand” for use in future material development

Project Description:
You will be joining the DataONE team to help us create materials that increase awareness and appreciation for DataONE -- the premier global environmental data partner. We are excited to provide this opportunity to someone within media / communication / journalism / marketing or a related field to work with our globally distributed science infrastructure project. DataONE has already identified our broad community, and specific stakeholder needs and interests, which are addressed by DataONE tools and services. You would be helping us create targeted marketing materials that are engaging and meaningful for each audience segment. During this project, you will have the opportunity to design a suite of materials that will be implemented across multiple formats: online, printed, presentation, etc. The materials will be designed in consultation with the Director for Community Engagement and Outreach and members of the DataONE Sustainability and Governance committee, who will provide all the information you need for your creative brief so that the materials design can appropriately reflect the mission and vision of the organization, as well as accurately describe products and services. As with any “agency” work, the design process will require multiple ideas / proofs to be created and reviewed before final concepts are agreed upon for implementation.

Project 3: Prospective and Retrospective Provenance Queries Using YesWorkflow, RDF, and SPARQL

Primary Mentor(s): Bertram Ludäscher
Secondary Mentor(s): Timothy McPhillips
Additional Mentor(s): Paolo Missier

Necessary Prerequisites:

  • Good working knowledge of RDF, SPARQL, and commonly used tools supporting both.

Desirable Skills / Qualifications:

  • Data modeling experience (e.g. ER modeling, SQL schema design, object-relational mapping) very desirable.
  • Java programming experience.
  • Datalog and/or Prolog programming experience a plus.

Expected Outcomes:

  • A general approach for representing YW annotations, YW workflow models, and YW-reconstructed retrospective provenance information in RDF.
  • A set of SPARQL queries providing the same capabilities as existing Prolog/Datalog queries operating on facts exported by YesWorkflow.
  • A specification for a new YW capability for exporting RDF.

Project Description:
YesWorkflow (YW) defines a set of annotations for declaring the expected dataflow patterns in scripts written in any text-based programming language. The YW toolkit extracts these annotations from comments in source code and builds a ProvONE-compatible workflow model of the script which can then be rendered graphically. YW also enables the user to export its representation of the workflow model as a set of Prolog or Datalog facts which can then easily be queried and used to create ad hoc visualizations of all or part of the model. Further, YW can reconstruct key runtime events and even data values that occurred during a run of the script by joining the YW model (prospective provenance) with observations made either during or after the completion of the script run, e.g. the values of metadata embedded in file names and directories created by the script. YW can export this retrospective and reconstructed provenance information as Prolog facts as well. Finally, the prospective and retrospective provenance facts can be queried together, enabling even more useful, hybrid provenance queries and visualizations that are of immediate use to the researcher reviewing the results of a script run or reporting their results to others.

The goal of the current project is to enable all of the provenance information that can be collected by YesWorkflow and exported to Prolog facts to be exported alternatively to an RDF representation. The goal is to produce RDF documents that are both easy to read directly and easy to query using SPARQL. We hypothesize that all of the queries that we have previously demonstrated with Prolog/Datalog can also be implemented in SPARQL 1.1. The challenge will be finding an intuitive way of representing prospective and retrospective provenance in RDF that also facilitates scientifically meaningful queries about the derivation of particular script products via the computational steps in the script and the dataflows between them. The above work, to be carried out by the intern, will not entail any modification of the YW tool itself. Rather, the intern will speculatively author RDF documents representing YW workflow models and provenance, query these documents with SPARQL, and iteratively improve both the RDF representations and the SPARQL queries until as many as possible of the desired queries are supported. YesWorkflow will subsequently be updated to automatically generate the final version of the RDF representation designed in this project.
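As a rough illustration of the kind of mapping the intern might explore, the sketch below pulls YW-style annotations out of script comments, turns them into subject-predicate-object triples, and answers one simple prospective provenance question ("which blocks feed into this output?"). The `yw:` predicate names, the helper functions `triples_from_script` and `upstream_blocks`, and the sample script are all hypothetical; they are not YesWorkflow's export format or an agreed RDF vocabulary. A real solution would use an RDF library and SPARQL rather than plain Python tuples.

```python
import re

# Matches YW-style annotations such as "# @begin load" inside comment lines.
ANNOTATION = re.compile(r"#\s*@(begin|end|in|out)\s+(\S+)")

# Hypothetical annotated script used only for this illustration.
SAMPLE = """
# @begin load
# @in raw_file
# @out table
# @end load
# @begin summarize
# @in table
# @out report
# @end summarize
# @begin unrelated
# @in other
# @out log
# @end unrelated
"""


def triples_from_script(source):
    """Return (subject, predicate, object) triples for annotated blocks.

    The "yw:" predicate names are placeholders, not an established vocabulary.
    """
    triples = []
    stack = []  # names of the currently open @begin blocks
    for line in source.splitlines():
        match = ANNOTATION.search(line)
        if not match:
            continue
        keyword, name = match.groups()
        if keyword == "begin":
            if stack:
                triples.append((stack[-1], "yw:hasSubBlock", name))
            stack.append(name)
        elif keyword == "end":
            stack.pop()
        elif keyword == "in":
            triples.append((stack[-1], "yw:reads", name))
        elif keyword == "out":
            triples.append((stack[-1], "yw:writes", name))
    return triples


def upstream_blocks(triples, data_name):
    """Find every block that contributes, directly or indirectly, to data_name."""
    found = set()
    seen_data = set()
    pending = [data_name]
    while pending:
        data = pending.pop()
        if data in seen_data:
            continue
        seen_data.add(data)
        for block, predicate, obj in triples:
            if predicate == "yw:writes" and obj == data and block not in found:
                found.add(block)
                # Follow the block's inputs one step further upstream.
                pending.extend(
                    o for b, p, o in triples if b == block and p == "yw:reads"
                )
    return found


if __name__ == "__main__":
    print(sorted(upstream_blocks(triples_from_script(SAMPLE), "report")))
```

In SPARQL 1.1 the transitive step performed by `upstream_blocks` could plausibly be written with a property path over the chosen predicates, which is one reason the shape of the RDF representation matters for query readability.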

Project 4: Exploration of Search Logs, Metadata Quality and Data Discovery

Primary Mentor(s): Lauren Walker, Amber Budden
Secondary Mentor(s): Jesse Goldstein, Jeanette Clark, Matt Jones
Additional Mentor(s): Dave Vieglais, Rebecca Koskela

Necessary Prerequisites:
Statistical analysis, information systems, scripted analysis skills (e.g., R)

Desirable Skills / Qualifications:
Additional scripted analysis skills (e.g. Python)

Expected Outcomes:
This project will provide an overview of search patterns within search.dataone.org; a report providing an analysis of metadata quality metrics associated with downloaded data; and a draft publication detailing the methodology, statistical analysis and results from a manipulative study of the effect of metadata quality on data discovery.

Project Description:
The ability to successfully search for and discover data held within a given repository is related to both the capabilities of the search engine and also the quality of the metadata describing the data set. Extensive variation exists in the amount and quality of metadata that an author might provide for a shared data set. At a minimum, contributors might provide details listing the data authors, the study system, and the date, time and place of data collection. A rich metadata record would provide everything that would be required to download and synthesize or reanalyze the data, including a comprehensive abstract describing the work, details of the structure and contents of the data, and full methodological information to properly interpret data values. It is on these metadata terms that the search query operates.

This project will be three-fold.

  • First, the project will examine trends in DataONE search logs from 2012 to 2017 to explore patterns. What search terms were used? How much metadata was queried (how many search terms were used)? How many results were returned for various searches? Did a search result in data download from member repositories? If so, a single repository or multiple?
  • Second, taking advantage of the Quality Report feature available in one of the Member Nodes (the Arctic Data Center), the project will explore the metadata characteristics of downloaded data in comparison with the population of data available within that repository.
  • Finally, in an attempt to directly test the importance of metadata quality, a series of standardized searches will be conducted against a replica Arctic Data Center catalog. These metadata will then be manipulated before repeating the search queries to evaluate the impact on discovery and its relationship with the metadata quality reports.

The results of the project will be prepared for presentation and/or publication.

Project 5: Improving DataONE’s Search Capabilities Through Controlled Vocabularies

Primary Mentor(s): Mark Schildhauer
Secondary Mentor(s): Julien Brun, Pier Paolo Buttigieg

Necessary Prerequisites:
Basic proficiency in formal logic or capacity to quickly learn such; familiarity creating and manipulating data structures; interest in linguistics, logic, and programming.

Desirable Skills / Qualifications:
Proficiency with some programming language; experience working with and analysing ecological/environmental data; familiarity with ecological, earth science, and conservation concepts and measurements a strong plus.

Expected Outcomes:

  • Major increase in the scope and volume of well-defined terms available for researcher use
  • Valuable user testing and feedback on DataONE’s annotation and search interfaces
  • Data products of NCEAS’ Synthesis Working Group activities (see below) will be well-described and contributed to a DataONE MN

Project Description:
NCEAS is currently the host site for a number of critical and compelling environmental, ecological, and conservation/human well-being research topics, all undertaken within a framework of “Synthesis”, using existing data that must be collated, documented, and robustly archived into a DataONE compatible data repository. This project offers a prime opportunity for a DataONE intern to work with data from these synthesis activities and to augment DataONE’s own controlled vocabulary about ecosystem concepts, the “Ecosystems Ontology” (ECSO). ECSO is constructed using World Wide Web standards that enable accessing and associating terms in the vocabulary with features of the data objects found in DataONE. This will enable researchers to improve the precision of their searches, as well as enhance interpretation of the data for re-use in synthesis investigations.

The intern’s work will involve:

  • Identifying relevant “external” vocabularies containing well-constructed terms to use for describing DataONE data, and investigating the best methods for importing/referencing these terms within DataONE’s framework
  • Identifying relevant vocabularies that are not well-constructed, and incorporating these into DataONE’s framework
  • Developing new terms as needed, to augment the library of measurements in DataONE’s ECSO ontology
  • User-testing and feedback of DataONE’s annotation and search tools and features, when archiving Synthesis data products in a DataONE MN.

The outcomes of this internship are intended to be practical, supplementing the catalogue of well-defined measurements available for researchers (and machine-assisted mechanisms) to use not only for annotation but also for data discovery and interpretation. The numerous ongoing synthesis activities at NCEAS include: Long Term Ecological Research Synthesis Working Groups (https://www.nceas.ucsb.edu/lter-network-communications-office); Science for Nature and People Partnership Working Groups (https://www.nceas.ucsb.edu/science/snap#); Arctic Data Center Working Groups (https://arcticdata.io); and the State of Alaska Salmon and People Working Groups (https://alaskasalmonandpeople.org/). These efforts collectively provide access to a rich set of heterogeneous environmental data that will be archived in some DataONE MNs. Specific targets of data enrichment and vocabulary development activities will be prioritized and focused through discussion among the Mentors and the various project PIs.

Project 6: Development of an Open Source Units of Measure Knowledge Graph

Primary Mentor(s): Deborah McGuinness
Secondary Mentor(s): James McCusker (RPI), Mark Schildhauer (NCEAS)

Necessary Prerequisites:
Knowledge of Python web frameworks, ontologies, RDF/Linked Data, and web service concepts

Desirable Skills / Qualifications:
Knowledge of units of measure standards and practices, knowledge graphs

Expected Outcomes:
A published knowledge graph that is building towards a comprehensive body of knowledge about units of measure, how they relate to each other, and the ability to resolve units to URIs for them and to convert between units.

Project Description:
Units of measure have clear, mathematically defined relationships that can be expressed as a knowledge graph. Users would be able to resolve units of measurement labels, symbols, and external units of measure (UoM) ontology URIs to UnitsKG URIs, and convert values from unit to unit as appropriate. Currently, the foundational ontologies of DataONE (ECSO and OBO-E) provide some, but not all, of the units of measure that are expressed in DataONE datasets. This effort will initiate a comprehensive collection of units of measure so that they can easily be interconverted for future data alignment. It is expected that the resulting knowledge graph will be used within DataONE, will be available for broad reuse in a wide range of efforts, and will be made open source.
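To make the idea of units as a graph concrete, here is a minimal sketch in which units are nodes, conversion factors are edges, and a value is converted by searching for a chain of edges between two units. The unit names, the tiny edge list, and the `build_graph` and `convert` helpers are illustrative assumptions, not part of any existing UnitsKG design. The sketch also handles only purely multiplicative conversions, so offset-based scales such as Celsius to Fahrenheit would need a richer edge model.

```python
from collections import deque

# Each edge says: multiply a value in `source` by `factor` to get `target`.
# The unit names and this small edge list are illustrative only.
EDGES = [
    ("kilometre", "metre", 1000.0),
    ("metre", "centimetre", 100.0),
    ("inch", "centimetre", 2.54),
]


def build_graph(edges):
    """Build an adjacency map that also contains the inverse of every edge."""
    graph = {}
    for source, target, factor in edges:
        graph.setdefault(source, []).append((target, factor))
        graph.setdefault(target, []).append((source, 1.0 / factor))
    return graph


def convert(value, source, target, graph):
    """Convert `value` by breadth-first search for a chain of conversions."""
    queue = deque([(source, value)])
    visited = {source}
    while queue:
        unit, current = queue.popleft()
        if unit == target:
            return current
        for neighbour, factor in graph.get(unit, []):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append((neighbour, current * factor))
    raise ValueError(f"no conversion path from {source} to {target}")


if __name__ == "__main__":
    graph = build_graph(EDGES)
    print(convert(1.0, "kilometre", "inch", graph))
```

Storing only a few direct relationships and deriving the rest by traversal is one way a knowledge graph could stay small while still supporting interconversion between any connected pair of units.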

To Apply

Full details of the application process, and links to forms, will be available when the application period opens.
Required application materials include: 1) a resume that includes educational history, current position, any publications or honors, and full contact information (including phone number, e-mail address, and mailing address); 2) a cover letter identifying the project you are interested in, the contributions you expect to make to the project, relevant background, value of the internship program to your career objectives and your approach to meeting the project deliverables; and 3) a letter of reference.

Applications must be completed by 11:59 PM (Mountain time) on March 17th. Links to the application forms are provided below. Applicants should also provide a letter of reference. The letter of reference should be sent directly by its author to internship@dataone.org by the application deadline.

  1. The cover letter should address the following questions:
    • Which DataONE Summer Internship project(s) are you most interested in and why?
    • What contributions do you expect to be able to make to the project(s)?
    • What background do you have which is relevant to the project(s)?
    • What do you expect to learn and/or achieve by participating?
    • What are your thoughts and ideas about the project, including particular suggestions for ways of achieving the project objectives?
    • How will participation in this program help you achieve your educational and career objectives?
    • Are there any factors that would affect your ability to participate, including other summer employment, university schedules, and other commitments?
  2. The resume should include the applicant’s educational history, current position, any publications or honors, and full contact information (including phone number, e-mail address, and mailing address).
  3. The letter of reference should be sent directly to internship@dataone.org and should be from a professor, supervisor, or mentor.

Online application forms
U.S. citizens should complete the application form HERE
Non-U.S. citizens with a valid visa should complete the application form HERE

Evaluation of applications

Applications will be judged by the following criteria:

  • The academic and technical qualifications of the applicant.
  • Evidence of strong written and oral communication skills.
  • The extent to which the applicant can provide substantive contributions to one or more projects, including the applicant’s ideas for project implementation.
  • The extent to which the internship would be of value to the career development of the applicant.
  • The availability of the applicant during the period of the internship.

Intellectual Property

DataONE is predicated on openness and universal access. Software is developed under one of several open source licenses, and copyrightable content produced during the course of the project will be made available under a Creative Commons (CC-BY 3.0) license. Where appropriate, projects may result in published articles and conference presentations, on which the intern is expected to make a substantive contribution and receive credit for that contribution.

Funding acknowledgement

Previous Summer Internships were supported by a National Science Foundation Award (NSF Award 0830944): "DataNetONE (Observation Network for Earth)". Current Summer Internships are supported by National Science Foundation Award #1430508.

For more information

If you have questions about, or problems with, the application process or the internship program in general, please e-mail internship@dataone.org.