The 2013 DataONE Summer Internship Program is now CLOSED for applications.
We will begin to notify applicants of our decision on April 3rd.
The Data Observation Network for Earth (DataONE) is a virtual organization dedicated to providing open, persistent, robust, and secure access to biodiversity and environmental data, supported by the U.S. National Science Foundation. DataONE is pleased to announce the availability of summer research internships for undergraduates, graduate students and recent postgraduates.
Program Structure
Up to eight interns will be accepted in 2013, each paired with one primary mentor and, in some cases, secondary and tertiary mentors. Interns need not necessarily be at the same location or institution as their mentor(s). Interns and mentors are expected to have a face-to-face meeting at the beginning of the summer, and interns are encouraged to attend the DataONE All-Hands Meeting in the fall to present the results of their work. DataONE will pay all necessary travel expenses.
Provisional Schedule
- February 20 - Application period opens
- March 21 - Deadline for receipt of applications at midnight Mountain time
- April 3 - Notification of acceptance. Scheduling of face-to-face kickoff meetings based on availability of interns and mentors
- May 27 - Program begins*
- June 24 - Midterm evaluations
- July 26 - Program concludes
- October 22-24 - DataONE All-Hands Meeting, New Mexico (attendance encouraged)
* Allowance will be made for students who are unavailable during these dates due to their school calendar.
Eligibility
The program is open to undergraduate students, graduate students, and postgraduates who have received their degree within the past five years. Given the broad range of projects, there are no restrictions on academic backgrounds or field of study. Interns must be at least 18 years of age by the program start date, must be currently enrolled or employed at a U.S. university or other research institution and must currently reside in, and be eligible to work in, the United States. Interns are expected to be available approximately 40 hours/week during the internship period (noted above) with significant availability during normal business hours. Interns from previous years are eligible to participate.
Financial Support
Interns will receive a stipend of $5,000 for participation, paid in two installments (one at the midterm and one at the conclusion of the program). In addition, required travel expenses will be borne by DataONE. Participation in the program after the mid-term is contingent on satisfactory performance. The University of New Mexico will administer funds. Interns will need to supply their own computing equipment and Internet connection. For students who are not US citizens or permanent residents, complete visa information will be required, and it may be necessary for the funds to be paid through the student’s university or research institution. In such cases, the student will need to provide the necessary contact information for their organization.
Project Ideas
Projects cover a range of topic areas and vary in the extent and type of prior background required of the intern. The interests and expertise of the applicants will, in part, determine which projects will be selected for the program. The titles and descriptions of this year’s projects will be posted below when the application period opens.
2013 Project Titles
- Next Generation Data Environment: Semantically-Enabling the DataONE Metadata Environment
- Ontology Mappings in the Earth and Environmental Sciences
- Evaluation of Ontology Coverage for Curation
- Integrating Data Stories into DataONE Education and Community Engagement Products
- Data Policies for Public Participation in Scientific Research
- Bi-level Metadata Registry Development
- PBase: Provenance as a First-class Citizen in DataONE
- Build Fundamental Components for Provenance-aware Model Exploration, Evaluation, and Benchmarking Cyber-infrastructure Prototype
- A Visualization Tool for Provenance in DataONE
2013 Project Descriptions
1) Next Generation Data Environment: Semantically-Enabling the DataONE Metadata Environment
The summer intern will develop a web-based interface that makes it easy for any user to semantically enable data and metadata. Within the application workflow, the user will be able to link their data to selections of ontology concepts from established community ontologies, like OBO-E (https://marinemetadata.org/references/oboeontology), leveraging backend vocabulary services developed by Patrice Seyed (post-doc for DataONE semantics and interoperability working group). The interface will leverage formal reasoning to assist a user in making selections constrained by their previous selections of classes and properties, based on how these objects are defined in their respective ontologies, while at the same time assisting the user in verifying the set of inferences that follow from all selections. Within the design, the user will be enabled to identify implicit domain entities (e.g., when a measurement data record refers to multiple samples or organisms as opposed to one), useful in scenarios where this is only clearly understood by the data table creator, and flexibly encode their representation within the transformed data. The project serves as an extension to previous semantic data enablement projects across Rensselaer Polytechnic Institute (RPI) and the National Center for Ecological Analysis and Synthesis (NCEAS), including CSV2RDF4LOD from RPI, which converts tabular data into RDF statements based on user-provided configurations, and Morpho from NCEAS’s Semtools project, which annotates tabular data by applying the OBO-E ontology model of scientific observation. Researchers involved in these projects are mentors for this proposal and are available for guidance. The resulting transformed data will include linkages back to the original data and its source using provenance-centric ontologies (PROV-O, PML3), and will be available for discovery and granular search of datasets described through DataONE’s metadata environment.
Primary Mentors: Patrice Seyed, Deborah McGuinness
Secondary Mentor: Mark Schildhauer
Additional Mentors: Ben Leinfelder, Tim Lebo, Margaret O’Brien, Matt Jones, Evan Patton
Necessary Prerequisites: Software Engineering
Desirable Skills / Qualifications: Javascript, UI Interface Design
Expected Outcomes: A data conversion/annotation environment that makes it easier for domain scientists to annotate their data and to connect it into the linked data and DataONE environments. The environment will enable a user (e.g., data manager, scientist, layperson) to leverage community-based biomedical, environmental, and ecological ontologies in semantically-enabling their data using web ontology standards. The products of this environment include explicit, uniform representations of data and metadata at the level of scientific observations, and metadata at the level of encompassing scientific studies. The primary work for the student will be to develop a graphical interface that serves as a frontend to the RDF conversion software tool CSV2RDF4LOD. The outcome will be a software environment that supports annotation and facilitates the discovery and granular search of data based on information that is typically implicit and thus typically hard to find using conventional search. The other outcome is a co-authored document for submission for publication.
2) Ontology Mappings in the Earth and Environmental Sciences
Numerous Earth and Environmental Science ontologies exist in various repositories that are useful to DataONE, for data access and delivery, and for data sharing. These ontologies can be used to enhance metadata annotations of each dataset, thus improving metadata quality overall. However, Earth and Environmental Science ontologies have very different degrees of quality and curation. As DataONE is poised as the main point of access to earth and environmental data and practices and is schema agnostic, semantic descriptions of these datasets and practices are crucial to discovery across schemas. One way to ascertain this degree of quality is to locate terms with similar semantics between two or more ontologies and, based on their annotations and surrounding concepts in the ontologies, have domain users assess the comparative quality. The scope of this task includes providing backend mappings to be used by automated assistance to the users in the form of semantically similar terms from different ontologies for the same domain. The ideal candidate will have a background in computer or information science and should be familiar with ontological concepts and possibly the application of algorithms to provide mappings. Expected outcomes may include the development of a software prototype, a final report, or material for publication of results at a conference on earth and environmental sciences.
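As a rough illustration of what locating similar terms between ontologies could look like at its simplest, the sketch below compares term labels as strings. It is not the project's method: the labels are invented, the 0.85 threshold is an arbitrary assumption, and plain string similarity ignores the annotations and surrounding concepts that the description says should also be considered.

```python
from difflib import SequenceMatcher


def normalize(label: str) -> str:
    """Lowercase a label and treat '_' and '-' as spaces."""
    return " ".join(label.lower().replace("_", " ").replace("-", " ").split())


def candidate_mappings(terms_a, terms_b, threshold=0.85):
    """Return (term_a, term_b, score) for label pairs at or above the threshold."""
    pairs = []
    for a in terms_a:
        for b in terms_b:
            score = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
            if score >= threshold:
                pairs.append((a, b, round(score, 2)))
    # Best-scoring candidates first, for review by a domain user.
    return sorted(pairs, key=lambda pair: -pair[2])


# Hypothetical labels, not taken from any real ontology.
ontology_a = ["Air_Temperature", "Soil Moisture", "Precipitation"]
ontology_b = ["air temperature", "soil-moisture content", "wind speed"]
mappings = candidate_mappings(ontology_a, ontology_b)
```

Candidates produced this way would still need review by domain users, as the project description proposes.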
Primary Mentor: Line Pouchard, Oak Ridge National Laboratory
Secondary Mentors: Natasha Noy, Stanford University; Mike Huhns, University of South Carolina
Necessary Prerequisites: Background in computer sciences and/or information retrieval
Desirable Skills / Qualifications: Programming skills suitable to perform the described task
Expected Outcomes: A software prototype, or material suitable for a publication
3) Evaluation of Ontology Coverage for Curation
As the use of ontologies in the Earth and Environmental Sciences domain increases, there is a need to evaluate existing ontologies and their quality in order to provide a degree of curation for the collection. Many criteria have been proposed in the literature for evaluating ontologies, and ontologies need to be evaluated along many dimensions. In particular, the coverage of the ontologies should be evaluated for relevance to the community. This is particularly important to the DataONE federation, as ontologies and semantic descriptions of domain vocabularies enhance dataset discovery and ensure disambiguation of domain knowledge. We propose developing methods for automatic evaluation using Natural Language Processing methods. The ideal candidate will have a background in Computer Science and be familiar with ontologies or NLP techniques. Expected outcomes include a prototype on the evaluation results or material for publication.
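To make the idea of coverage concrete, the sketch below computes one very simple measure: the share of distinct words in a piece of community text that also occur in an ontology's labels. This is an assumed, simplified metric for illustration only. The sample text, labels, and stopword list are hypothetical, and a real evaluation would likely use richer NLP than word matching.

```python
import re


def tokenize(text):
    """Split text into lowercase alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def coverage(corpus_text, ontology_labels, stopwords=frozenset()):
    """Return (fraction of corpus words found in labels, set of uncovered words)."""
    corpus_terms = set(tokenize(corpus_text)) - stopwords
    label_terms = {word for label in ontology_labels for word in tokenize(label)}
    if not corpus_terms:
        return 0.0, set()
    covered = corpus_terms & label_terms
    return len(covered) / len(corpus_terms), corpus_terms - covered


# Hypothetical inputs for illustration.
text = "Soil moisture and air temperature were measured daily"
labels = ["Soil Moisture", "Air Temperature", "Precipitation"]
score, missing = coverage(text, labels, stopwords=frozenset({"and", "were"}))
```

The uncovered words point to vocabulary the community uses that the ontology does not yet describe.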
Primary Mentor: Line Pouchard, Oak Ridge National Laboratory
Secondary Mentors: Natasha Noy, Stanford University; Mike Huhns, University of South Carolina
Necessary Prerequisites: Background in computer sciences and/or information retrieval
Desirable Skills / Qualifications: Programming skills suitable to perform the described task
Expected Outcomes: A software prototype, or material suitable for a publication
4) Integrating Data Stories into DataONE Education and Community Engagement Products
Tensions around sharing scientific data have received international attention in recent years (for example, in 2009’s “climategate”), and the scientific community is actively working toward creating a healthier dialogue around data management and sharing. This project aims to integrate success stories and cautionary tales from researchers related to their experiences with managing and sharing scientific research data into DataONE education and community engagement products. The Data Stories project, which is focused on collecting such stories through structured interviews and/or focus groups, is currently underway. By the beginning of Summer 2013, we expect to have a number of narratives based on these interviews posted online on the DataONE Data Stories blog (https://notebooks.dataone.org/data-stories/). The summer intern will assist with preparing and posting any stories that have not yet been posted, but will focus primarily on integrating these narratives into DataONE education products such as the Data Management Education Modules (http://www.dataone.org/education-modules). The intern will assist with publicizing the existence of these new resources to support data management and sharing, provide periodic project updates in the form of research blog posts, and assist with preparation of a manuscript summarizing key findings of the Data Stories project.
Primary Mentor: Stephanie Hampton, NCEAS
Secondary Mentor: Stacy Rebich Hespanha, NCEAS
Additional Mentors: Members of the DataONE CEE working group
Necessary Prerequisites: Bachelor’s degree (outstanding juniors and seniors will also be considered); demonstrated strong writing and communication skills; ability to work independently and meet deadlines.
Desirable Skills / Qualifications: Some knowledge of best practices for data management; experience working with data from human subjects; experience with teaching, curriculum development, learning theory, or assessment; experience writing for a public audience (e.g., newspaper or blogging).
Expected Outcomes: Data management and sharing stories (in the form of blog posts) published on the web for use by the data management education community; improvements to the DataONE education modules in the form of additional illustrative stories; a manuscript summarizing ethnographic analysis of the culture of scientific data management and sharing as revealed by the stories we collect.
5) Data Policies for Public Participation in Scientific Research
Developing sound policies for using and sharing data in projects that involve the public in scientific research is a complex undertaking. Currently no formal guidelines are available for selecting and implementing data policies that are suited to the needs of citizen science project coordinators.
The initial goal of this project is to develop a curated set of exemplar data policies for delivery through the citizen science project development toolkit on www.citizenscience.org.
A guide to data policies for practitioners will be developed for delivery along with the examples. There is also potential for extending these initial deliverables to include development of an interactive “Data Policy Planning Tool.” There may be additional opportunities for collaboration with PPSR Working Group members on ongoing related research.
The successful candidate will have opportunities to develop extensive understanding of data policies related to scientific data sharing, deep familiarity with the growing phenomenon of citizen science, and practical experience in resource selection and curation. If the candidate is able to work out of Ithaca, NY, s/he will have exceptional access to world leaders in citizen science practice and research.
Primary Mentor: Andrea Wiggins, University of New Mexico & Cornell University
Secondary Mentor: Rob Stevenson, University of Massachusetts Boston
Necessary Prerequisites: Experience with standard office software. Excellent writing skills are also required, with experience in writing for a general (non-academic) audience strongly preferred.
Desirable Skills / Qualifications: Prior experience collecting research data from the Internet. Graduate or postgraduate standing with studies in information science/studies, library science, or science & technology studies. Familiarity with any of the following: citizen science, science and research policy, law and intellectual policy related to data sharing. Experience with Wordpress for implementation of online project deliverables.
Expected Outcomes: A curated set of exemplar data policies for delivery through the citizen science project development toolkit on www.citizenscience.org.
6) Bi-level Metadata Registry Development
The goal of the proposed summer internship is to prototype a metadata registry framework in two parts: a vernacular part consisting of evolving, freely contributed terms and a lightly supervised canonical part consisting of stable terms that crowd-sourced, reputation-based methods have brought to prominence. Leveraging social technologies while benefiting from expert moderation, this bi-level mechanism can be used in any subject domain to create highly relevant metadata registries that avoid the inefficient and unresponsive maintenance pattern plaguing almost every mature registry.
The intern will work with the DataONE PAMWG (Preservation and Metadata Working Group) to begin populating a registry instance that emphasizes, but is not limited to, earth and environmental sciences. As per working group goals, the instance will feature a low barrier for contributions, transparency in review processes, and support for balanced discussion and lightweight moderation by elders (experts). Stack Overflow, Hacker News, and Wikipedia have proven, through a range of reputation-based approaches, that quality can be achieved by drawing the best from user communities. Pooling resources across sciences will reduce duplicate efforts and spending, and support greater interoperability within DataONE and among other scientific data initiatives.
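One way to picture the two levels is a single store in which any term starts as vernacular and becomes canonical only after it gathers enough community votes and a moderator approves it. The sketch below is purely illustrative: the class names, the vote threshold, and the promotion rule are assumptions, not the working group's design.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    name: str
    definition: str
    votes: int = 0
    canonical: bool = False  # False means the term is still vernacular


@dataclass
class Registry:
    promote_at: int = 3  # hypothetical vote threshold for promotion
    terms: dict = field(default_factory=dict)

    def contribute(self, name, definition):
        """Anyone may add a term; an existing term is left unchanged."""
        return self.terms.setdefault(name, Term(name, definition))

    def vote(self, name, weight=1):
        """Record community support; weight could reflect voter reputation."""
        term = self.terms[name]
        term.votes += weight
        return term

    def promote(self, name, moderator_approved):
        """Mark a term canonical only with enough votes and moderator approval."""
        term = self.terms[name]
        if moderator_approved and term.votes >= self.promote_at:
            term.canonical = True
        return term.canonical

    def canonical_terms(self):
        return sorted(t.name for t in self.terms.values() if t.canonical)


registry = Registry()
registry.contribute("leaf area index", "One-sided leaf area per unit ground area")
registry.contribute("canopy gap", "Opening in the forest canopy")
for _ in range(3):
    registry.vote("leaf area index")
registry.vote("canopy gap")
promoted = registry.promote("leaf area index", moderator_approved=True)
rejected = registry.promote("canopy gap", moderator_approved=True)
```

Keeping both levels in one store means a promoted term retains its contribution and voting history, which fits the stated goal of transparency in review.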
Primary Mentors: Jane Greenberg (SILS Metadata Research Center), John Kunze (University of California Curation Center, California Digital Library)
Secondary Mentors: Jim Regetz (NCEAS) and members of the DataONE Preservation and Metadata Working Group.
Necessary Prerequisites: Strong ability to code in Python, Perl, or PHP, and use the basic LAMP stack. Knowledge of distributed revision control systems (Git, Mercurial). Experience with social technologies and metadata vocabularies.
Desirable Skills / Qualifications: Background in information and library science or computer science. Interest in or background in scientific topics covered in the DataONE domain. Awareness of metadata schemes used in the sciences.
Expected Outcomes: A prototype metadata registry instance sufficiently mature to release for user testing, and one or more publications on the experience.
7) PBase: Provenance as a First-class Citizen in DataONE
The goal of this project is to develop a feature-rich provenance management architecture, which we call PBase, that integrates with the core DataONE architecture. To achieve this, we will combine two strands of work that the Provenance WG has been pursuing for the past two years. The first, Golden-Trail: A Provenance Repository For Storing And Retrieving Data Lineage Information (2010) [1], focused on the realization of a common provenance model (D-PROV), a provenance repository, and an interactive user interface (Golden-Trail). The second effort (2012) has been centered on using the member nodes’ Data Packaging features in combination with provenance-aware workflow execution.
The intern will develop a prototype of PBase by building upon this prior work. The prototype will demonstrate the benefits of an architectural stack that includes advanced query and analytics capabilities over a corpus of provenance traces, which are associated with data stored in Data Packages within member nodes. It will also enable the composition of provenance fragments produced separately by workflows that are independent and yet share some of their data, a natural occurrence in e-science [2]. At the same time, we will retain the advantages of using provenance terms for data discovery, which we have demonstrated in our most recent prototype, as well as the storage of workflows, their data, and the provenance into self-contained packages.
Workflows may come from different systems. Thus, we aim to show interoperability of the provenance traces collected from those systems, by means of our unified D-PROV provenance data model.
The project will be carried out in close collaboration with the EVA WG. Their climate analysis workflows, which we use in our current demo, will form the basis for the next iteration of case studies to be used in this project.
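The composition of provenance fragments from independent workflows that share data can be sketched in a few lines. The example below is a deliberate simplification and does not use the D-PROV model: each trace is assumed to be a list of (input, process, output) triples, and the file and process names are hypothetical.

```python
from collections import defaultdict


def compose(*traces):
    """Merge traces into a map: data item -> set of (process, input) that produced it."""
    produced_by = defaultdict(set)
    for trace in traces:
        for source, process, result in trace:
            produced_by[result].add((process, source))
    return produced_by


def lineage(produced_by, item):
    """Return every upstream data item of `item`, following edges across traces."""
    seen, stack = set(), [item]
    while stack:
        current = stack.pop()
        for _process, source in produced_by.get(current, ()):
            if source not in seen:
                seen.add(source)
                stack.append(source)
    return seen


# Two hypothetical workflows run separately; they share the item "clean.csv".
workflow_a = [("raw.csv", "clean", "clean.csv")]
workflow_b = [
    ("clean.csv", "regrid", "grid.nc"),
    ("mask.nc", "regrid", "grid.nc"),
]
separate = lineage(compose(workflow_b), "grid.nc")
combined = lineage(compose(workflow_a, workflow_b), "grid.nc")
```

Queried alone, the second trace stops at `clean.csv`; once the traces are composed through the shared item, the lineage reaches back to `raw.csv`.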
[1] Missier, Paolo, Bertram Ludascher, Shawn Bowers, Ilkay Altintas, Saumen Dey, and Michael Agun. “Golden Trail: Retrieving the Data History That Matters from a Comprehensive Provenance Repository.” International Journal of Digital Curation 7, no. 1 (2012). http://www.dcc.ac.uk/events/idcc11.
[2] Missier, Paolo, Bertram Ludascher, Shawn Bowers, Manish Kumar Anand, Ilkay Altintas, Saumen Dey, Anandarup Sarkar, Biva Shrestha, and Carole Goble. “Linking Multiple Workflow Provenance Traces for Interoperable Collaborative Science.” In Proc. 5th Workshop on Workflows in Support of Large-Scale Science (WORKS), 2010.
Primary Mentors: Bertram Ludaescher (UC Davis), Paolo Missier (Newcastle University, UK)
Necessary Prerequisites: Java programming, database programming skills.
Desirable Skills / Qualifications: Python programming skills, understanding of provenance concepts.
Expected Outcomes: A prototype of PBase, a provenance repository integrated with the DataONE architecture, and associated publications.
8) Build Fundamental Components for Provenance-aware Model Exploration, Evaluation, and Benchmarking Cyber-infrastructure Prototype
Earth System Modeling is a primary approach to advance our understanding of the Earth’s biogeochemical cycles, including their interaction with humans, and further to advance our understanding of climate change. There have been a variety of Earth system models developed with different approaches to address different components of the Earth’s biogeochemical cycle. Even though the findings of modeling efforts are promising, there are still many uncertainties associated with the results.
Model-data intercomparison is an important approach to diagnose and improve model processes and parameterizations by comparing differences between models and differences between models and observations. However, there are challenges, including 1) heterogeneous model output and observation data with different formats, spatial/temporal scales, etc.; 2) lack of tools that address the specific needs of data analysis and visualization for model-data intercomparison; 3) lack of mechanisms to reproduce and trace back to the origins of analyzed data and visualizations.
As an effort to tackle the above challenges, the DataONE EVA working group proposes to build a prototype of Provenance-aware Model Exploration, Evaluation, and Benchmarking Cyber-infrastructure on top of VisTrails and UV-CDAT, which are open source workflow-based scientific analysis and visualization frameworks, as described in Figure 1. This infrastructure has the capability to integrate distributed data resources from DataONE, Earth System Grid (ESG), or any user-provided model and observation data repositories through Brokers. The core component of the infrastructure contains libraries of standard modules and workflows for data analysis and visualization. Interfaces will be provided for different types of users and guide them to customize workflows for their specific model-data intercomparison needs. The infrastructure is linked together with provenance-aware tools so that VisTrails workflows can be converted to standards-based provenance representations and indexed through the DataONE indexing mechanism. Provenance-based data discovery, customizations, and reproductions can then be achieved. The analyzed results, together with associated provenance information, can be packaged and contributed back to DataONE.
Figure 1. Provenance-aware Model Exploration, Evaluation, and Benchmarking Cyber-infrastructure
This summer intern project will focus on several fundamental components of the above cyber-infrastructure and the collaboration with the intern project proposed by the DataONE Provenance Working Group (ProvWG). In particular, the intern will work closely with various Earth system modeling scientists to 1) build selected core VisTrails modules for data analysis and visualization by wrapping existing Python libraries or from scratch; 2) construct scientific workflows to implement selected common model-data intercomparison scenarios; 3) work together with the intern in ProvWG to integrate provenance-aware tools into the cyber-infrastructure to enable provenance extraction, query, and tracing.
Primary Mentor: Bob Cook (Oak Ridge National Laboratory)
Secondary Mentor: Yaxing Wei (Oak Ridge National Laboratory)
Necessary Prerequisites: B.S. in computer science or statistics, strong programming background.
Desirable Skills / Qualifications: Familiarity with VisTrails, UV-CDAT, and Python programming; training and interest in environmental and climate modeling sciences; interest in analyzing and visualizing spatial data; strong communication skills.
Expected Outcomes: The outcome of this project will be a library of VisTrails-based core data analysis and visualization modules and workflows for climate model-data intercomparisons and a demonstration of workflow-provenance-DataONE integration. The intern is expected to receive guidance and build experience in climate science, Earth system modeling data analysis, and Web-based application development. In turn, the internship will benefit DataONE by providing a proof-of-concept solution for the integration of DataONE Cyber-Infrastructure (CI), scientific Exploration Visualization and Analysis (EVA) workflows from the EVA working group, and components from the provenance working group.
9) A Visualization Tool for Provenance in DataONE
Provenance (information about how a result was generated) is an important component for data like that stored in DataONE. When data is the result of a computation or some collection process, it is important to be able to find exactly which tools or algorithms were used and what the various settings were. There are now tools that support capturing provenance and repositories, like DataONE, that support storing it, but it is still difficult for users to understand and use it. Visual representations that show the data involved and processes used can help, but visualization techniques for provenance are currently limited to standard graph visualization techniques, which are usually non-interactive and overload users with too much information.
In this project, we propose to create techniques that address the challenges of provenance visualization and apply them to DataONE provenance, like the D-PROV traces the provenance working group has generated. We believe that there are some important aspects to address: (1) using details-on-demand to display an overview of provenance while allowing users to drill down to specifics; (2) integrating query capabilities to restrict views so users can better analyze the traces; (3) showing the full provenance, even when it involves multiple traces; (4) linking directly to DataONE resources, so that nodes allow a user to view the metadata or download the data. Ideally, this tool would be Web-based, and would allow, for example, users searching DataONE to locate data inputs more efficiently. The techniques developed could also be used in the PBase project and ProvEx tool.
Primary Mentors: Juliana Freire, NYU-Poly, David Koop, NYU-Poly
Additional Mentors: Bertram Ludaescher, UC-Davis
Necessary Prerequisites: Programming background, information visualization.
Desirable Skills / Qualifications: Python and/or Java experience, understanding of provenance, graph visualization, web visualization frameworks.
Expected Outcomes: A tool that displays interactive visual representations of provenance as well as associated publications.
To Apply
To apply, please complete the form here.
Applications must be completed by 11:59 PM (Mountain time) on March 21st. You will be asked to upload a cover letter and resume, both in PDF format. Applicants should also provide a letter of reference. The letter of reference should be sent directly by its author to internship@dataone.org.
- The cover letter should address the following questions:
  - What DataONE Summer Internship projects are you most interested in and why?
  - What contributions do you expect to be able to make to the project(s)?
  - What background do you have which is relevant to the project(s)?
  - What do you expect to learn and/or achieve by participating?
  - What are your thoughts and ideas about the project, including particular suggestions for ways of achieving the project objectives?
  - How will participation in this program help you achieve your educational and career objectives?
  - Are there any factors that would affect your ability to participate, including other summer employment, university schedules, and other commitments?
- The resume should include the applicant’s educational history, current position, any publications or honors, and full contact information (including phone number, e-mail address, and mailing address).
- The letter of reference should be sent directly to internship@dataone.org and should be from a professor, supervisor, or mentor.
Evaluation of applications
Applications will be judged by the following criteria:
- The academic and technical qualifications of the applicant.
- Evidence of strong written and oral communication skills.
- The extent to which the applicant can provide substantive contributions to one or more projects, including the applicant’s ideas for project implementation.
- The extent to which the internship would be of value to the career development of the applicant.
- The availability of the applicant during the period of the internship.
Intellectual Property
DataONE is predicated on openness and universal access. Software is developed under one of several open source licenses, and copyrightable content produced during the course of the project will be made available under a Creative Commons (CC-BY 3.0) license. Where appropriate, projects may result in published articles and conference presentations, on which the intern is expected to make a substantive contribution, and receive credit for that contribution.
Funding acknowledgement
The Summer Internships are supported by a National Science Foundation Award (NSF Award 0830944): "DataNet Full Proposal: DataNetONE (Observation Network for Earth)".
For more information
If you have questions or problems about the application process or internship program in general, please send e-mail to internship@dataone.org.

