The 2016 DataONE Summer Internship Program is now CLOSED for applications
The Data Observation Network for Earth (DataONE) is a virtual organization dedicated to providing open, persistent, robust, and secure access to biodiversity and environmental data, supported by the U.S. National Science Foundation. DataONE is pleased to announce the availability of summer research internships for undergraduates, graduate students and recent postgraduates.
Interns undertake a 9-week program of work centered on one of the projects listed below. Each intern will be paired with one primary mentor and, in some cases, secondary and tertiary mentors. Interns need not be at the same location or institution as their mentor(s). Interns and mentors are expected to have a face-to-face meeting at the beginning of the summer and to maintain frequent communication throughout the program. Interns are required to work in an open notebook environment.
February 16 - Application period opens
March 14 - Deadline for receipt of applications at midnight Mountain time
April 4 - Notification of acceptance and scheduling of face-to-face meetings (schedules permitting)
May 23 - Program begins*
June 20 - Midterm evaluations
July 22 - Program concludes**
* Some allowance will be made for students who are unavailable during these dates due to their school calendar.
** The program may not extend beyond August 12, 2016.
The program is open to undergraduate students, graduate students, and postgraduates who have received their degree within the past five years. Given the broad range of projects, there are no restrictions on academic background or field of study. Interns must be at least 18 years of age by the program start date, must be currently enrolled or employed at a U.S. university or other research institution, and must currently reside in, and be eligible to work in, the United States. Interns are expected to be available approximately 40 hours/week during the internship period (noted above), with significant availability during normal business hours. Interns from previous years are eligible to participate.
Interns will receive a stipend of $5,000 for participation, paid in two installments (one at the midterm and one at the conclusion of the program). In addition, required travel expenses will be borne by DataONE. Participation in the program after the midterm is contingent on satisfactory performance. The University of New Mexico will administer funds. Interns will need to supply their own computing equipment and internet connection. For students who are not U.S. citizens or permanent residents, complete visa information will be required, and it may be necessary for the funds to be paid through the student’s university or research institution. In such cases, the student will need to provide the necessary contact information for their organization.
Projects will cover a range of topic areas and vary in the extent and type of prior background required of the intern. Not all projects are guaranteed funding and the interests and expertise of the applicants will, in part, determine which projects will be selected for the program. The titles and descriptions of this year’s projects are posted below.
2016 Project Titles
1) Exploring the Impact of DataONE: Data Publication and Access Metrics
2) Semantic Entity Extraction and Linking for Annotation and Ontology Evolution
3) Developing a Survey Instrument for Evaluation of Teaching Materials
4) Emerging Research Communities: Fulfilling the Potential of Open Access Earth Science Data
5) Reproducibility of Script-Based Workflows: A Case Study and Demonstration
Project 1) Exploring the Impact of DataONE: Data Publication and Access Metrics
- Primary Mentor(s): Amber Budden, Heather Soyka
Secondary Mentor(s): Mark Schildhauer
Additional Mentor(s): Dave Vieglais
Necessary Prerequisites: Experience with experimental design and statistical analysis, experience with and access to statistical analysis software
Desirable Skills / Qualifications: Experience with survey design
Expected Outcomes: There are three expected deliverables: 1) a summary literature review; 2) an analysis of data metrics across time; and 3) a preliminary survey instrument and a list of survey recipients. These three deliverables would form the basis of a publication, which would also include data from the deployed survey. Survey deployment and analysis may occur outside the duration of this internship.
The initial DataONE infrastructure was released in 2012 with the goal of enabling new science and knowledge creation through universal access to data about life on earth and the environment that sustains it. By federating across data repositories, or Member Nodes, DataONE aims to enhance search and discovery of data. This internship will explore the extent to which DataONE has contributed to increased data sharing and reuse of data held within existing data repositories.
As part of a larger project exploring practices and perceptions of researchers around data sharing and reuse, this intern will first conduct a literature review that looks at research communities, data sharing, and the impact of the decision to make research data available for reuse. This literature review will inform the direction of the other two phases, taking into account publications about DataONE, and will be outlined and discussed in greater detail at the start of the internship.
Second, the intern will use statistical methods and quantitative data to explore the effects of participation in the DataONE federation. Regression analyses will be conducted on pre- and post-DataONE metrics for data uploads and downloads.
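As a rough illustration of what a pre/post regression can look like, the sketch below fits a single 0/1 "after joining" indicator by ordinary least squares. It is not the project's actual method: the function name `fit_pre_post`, the `release_index` parameter, and the download counts are all made up for illustration, and a real analysis would also need to account for time trends and differences between repositories.

```python
from statistics import mean

def fit_pre_post(values, release_index):
    """Least-squares fit of y = a + b * post, where post is 0 before
    release_index and 1 from release_index onward. With a single 0/1
    predictor, a is the pre-period mean and b is the post-minus-pre
    difference in means."""
    post = [1 if i >= release_index else 0 for i in range(len(values))]
    x_bar, y_bar = mean(post), mean(values)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(post, values))
    sxx = sum((x - x_bar) ** 2 for x in post)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Made-up monthly download counts for one hypothetical repository;
# the repository is assumed to join the federation at index 4.
downloads = [100, 110, 90, 100, 150, 160, 140, 150]
intercept, slope = fit_pre_post(downloads, release_index=4)
print(f"pre-period mean: {intercept:.1f}, estimated change afterwards: {slope:.1f}")
```

With these invented numbers the fit reports a pre-period mean of 100.0 and an estimated change of 50.0, which is simply the difference between the two period means.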
Project 2) Semantic Entity Extraction and Linking for Annotation and Ontology Evolution
- Primary Mentor: Deborah L. McGuinness
Secondary Mentor: Jim McCusker
Additional Mentors: Matt Jones, Mark Schildhauer, Margaret O’Brien (need to reconfirm)
Necessary Prerequisites: Computer Science major
Desirable Skills / Qualifications: Demonstrated experience in Semantic Web technologies. Programming skills to work independently and meet deadlines. Some experience with ontologies and entity extraction/linking is useful.
Expected Outcomes: A Web service that provides ontology-aware term extraction from unstructured (or semi-structured) text and links the extracted terms to ontology terms. The service provides suggested annotations for a given input, leveraging the ontology-aware extraction. Components already exist to do portions of this service. The summer project will include putting the pieces together and improving individual portions. Additionally, a document describing the service will be an output (and, if the student desires, this could be evolved into a draft of a submission for publication).
A number of entity linking tools exist that take unstructured (or sometimes semi-structured) text, extract terms (often noun phrases), and then link those extracted entities to entities in a knowledge base. We have a tool (currently named Linkipedia) that addresses this challenge. Its use in one setting is described at http://nlp.cs.rpi.edu/paper/bioel.pdf. In the setting of DataONE, our updated toolset leverages existing knowledge sources, including DBpedia and a number of ontologies relevant to earth science, to support entity linking from descriptions of data. We are using this tool in the DataONE project to link portions of textual descriptions to ontology terms and then use those linking results to provide automatic annotation. While we have promising results, the annotation accuracy could be improved. Additionally, our tool suite includes a number of components, including a noun phrase extractor. The linking component can take text and propose appropriate links to knowledge base and ontology items. With the noun phrase extractor, it can take a description and identify noun phrases that do not link to any known ontologies, which is one way of identifying gaps in the ontology.
This project will package existing components into a web service for automatic annotation and ontology gap analysis. It will also attempt to improve on the suggested annotations and links.
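The linking and gap-analysis idea can be sketched in a few lines. This is a toy illustration only, not Linkipedia or its API: the `ONTOLOGY` dictionary, its `EX:` identifiers, and the `link_phrases` function are hypothetical, matching is by exact normalized label, and the candidate noun phrases are supplied by hand rather than produced by a noun phrase extractor.

```python
import re

# Hypothetical miniature "ontology": label -> term identifier (identifiers are invented).
ONTOLOGY = {
    "soil moisture": "EX:0001",
    "air temperature": "EX:0002",
    "precipitation": "EX:0003",
}

def link_phrases(candidate_phrases, ontology):
    """Split candidate phrases into (phrase, term id) links and unlinked phrases."""
    linked, unlinked = [], []
    for phrase in candidate_phrases:
        # Normalize case and whitespace before the exact-label lookup.
        key = re.sub(r"\s+", " ", phrase.strip().lower())
        if key in ontology:
            linked.append((phrase, ontology[key]))
        else:
            unlinked.append(phrase)
    return linked, unlinked

# Hand-written candidate noun phrases standing in for extractor output.
phrases = ["Soil moisture", "air  temperature", "leaf litter depth"]
linked, unlinked = link_phrases(phrases, ONTOLOGY)
print("suggested annotations:", linked)
print("possible ontology gaps:", unlinked)
```

Phrases that find a label become suggested annotations, while phrases with no match (here "leaf litter depth") are candidates for ontology gap analysis.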
Project 3) Developing a Survey Instrument for Evaluation of Teaching Materials
- Primary Mentor: Heather Soyka
Secondary Mentor: Viv Hutchison
Additional Mentor(s): Amber Budden
Necessary Prerequisites: This intern should have some experience in teaching or creating teaching materials and/or survey instruments, and an interest in educating users, managers, and creators of reusable research data and metadata.
Desirable Skills / Qualifications: Prior experience with online professional education modules, particularly in the area of scientific research data management or related topics.
Expected Outcomes: A literature review outlining best practices for soliciting user feedback on educational resources; the development of an assessment tool for evaluating current DataONE educational resources; and the identification of target audiences for deployment of the tool.
Education and outreach focused on topics such as data management, data sharing, and writing quality metadata for reuse are of vital importance to DataONE users and to the broader scientific, research, and academic communities. Responsive educational design that remains sensitive to community needs depends on evaluation, assessment, and thoughtful development of new resources. This project would be centered on the development of a survey instrument for site-wide evaluation of DataONE teaching materials and resources (modules and screencasts). Building the evaluation instrument, constructing a workflow for deployment, and compiling considerations for putting together future educational resources are all outcomes of this project.
For this project, the summer intern will work closely with the Director of Community Outreach and the CEO Postdoc to create an evaluation tool for current DataONE resources and teaching materials. This project would involve three parts: 1) a literature review that looks at the best practices for soliciting and receiving user feedback on teaching and educational resources, and places this in context with the development of new outreach decisions; 2) the development of an assessment tool for evaluating current educational resources; and 3) identification of target audiences and recommendations for deployment of the tool. Additionally, it is anticipated that this tool will be shared with the larger data management community.
Project 4) Emerging Research Communities: Fulfilling the Potential of Open Access Earth Science Data
- Primary Mentor: Suzie Allard
Secondary Mentor: Mike Frame
Additional Mentor(s): Carol Tenopir
Necessary Prerequisites: Knowledge of social science methods and a strong interest in interdisciplinary teams.
Desirable Skills / Qualifications: Experience with conducting interviews; knowledge about climate change science.
Expected Outcomes: There are three expected deliverables: 1) a white paper from a comprehensive review of data roles in emergent data communities; 2) a summary of the pilot interviews conducted; and 3) a brief with questions and strategies to reach emerging communities.
DataONE, through a federated set of repositories, offers a wide range of well-described, openly available scientific data, creating the potential for new discoveries. While researchers have worked on understanding the benefits of these repositories and the motivations for data sharing, less is known about how collaborative research is happening in this environment. However, research groups are emerging that are interdisciplinary, data-focused, highly collaborative, and sometimes funded by programs that cross traditional boundaries. The questions are compelling: What roles are researchers taking in emerging data research communities? How are the roles of data curation, data access, and analysis being managed within these groups? How do these communities communicate? How do these communities enable new science? Are they changing the culture of science?
For this project, the summer intern will work closely with the usability and assessment mentor team to explore the roles and influence of emerging communities related to open access data. Three Grand Challenges of Climate Change set the context for the exploration: sea level rise, water availability, and linking extreme events to climate change. Emerging communities may include (but are not limited to) data managers, data communicators, data scientists, policy makers, farmers, and entrepreneurs. The project includes three parts: 1) conducting a comprehensive review of data roles in emergent data communities, including reviewing a range of academic sources, grant information, graduate programs, and informal media to see how team roles are being defined; 2) developing draft questions and conducting pilot interviews with representatives of key emerging communities; and 3) refining questions and identifying strategies to reach these communities more broadly.
Project 5) Reproducibility of Script-Based Workflows: A Case Study and Demonstration
- Primary Mentor: Bertram Ludäscher
Secondary Mentor: Tim McPhillips
Additional Mentors: Bruce Wilson, Paolo Missier
Necessary Prerequisites: Good programming skills in Python, R, or Java.
Desirable Skills / Qualifications: Experience with SQL databases, scripting languages (e.g., Python, bash, tcsh, or similar), and modern software development tools and practices (e.g., version control with git or Mercurial, test-driven and agile development, software deployment via Docker containers) is a plus.
Expected Outcomes: 1) Reports of experiments/case study with reproducibility-enhancing tools and technologies. 2) End-to-end walkthrough of a prototypical usage of a sequence of technologies that enable full reproducibility of a script-based scientific workflow. 3) Online demonstration. 4) Comparison of practices with respect to Data Carpentry.
What does it take to reproduce a script-based scientific workflow?
For example, if the Python or R scripts implementing a workflow are available through an open source repository such as GitHub, are we all set? Not so fast! A user might fail to run the scripts or replicate the results for any of a number of reasons. For starters, the installation may fail due to complex software and version dependencies, or the user may fail to properly run, adapt, or understand the scripts due to a lack of documentation.
In this project we will experiment with a number of technologies and tools that can improve the reproducibility of script-based workflows. For example, the YesWorkflow (YW) toolkit allows authors to annotate scripts to model and export prospective provenance, i.e., the workflow structure otherwise latent in the script. YW can also be used to reconstruct retrospective provenance or to query other sources of provenance information, e.g., runtime provenance logged directly by the script author or recorded by the DataONE MATLAB tool, the NCEAS recordr, or the noWorkflow system (for capturing Python execution provenance). To manage platform and software dependencies of script-based workflows, Docker containers can be used. Last but not least, active elements can be embedded in PDF files to support interactive exploration of published results.
Using one or more example scripts, we will apply these different technologies and study their benefits and limitations. The overall goal is to deploy a prototypical example of a “highly reproducible” script-based workflow using a combination of the above-mentioned technologies.
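To give a feel for script annotation, here is a minimal sketch of a Python script marked up with comment-based workflow annotations. It assumes YesWorkflow's `@begin`, `@in`, `@out`, and `@end` keywords; the block and port names (`clean_and_summarize`, `drop_missing`, `compute_mean`, `raw_values`, `clean_values`, `summary`) are invented for illustration and are not taken from any DataONE example script.

```python
# @begin clean_and_summarize
# @in raw_values
# @out summary

def clean_and_summarize(raw_values):
    # @begin drop_missing
    # @in raw_values
    # @out clean_values
    clean_values = [v for v in raw_values if v is not None]
    # @end drop_missing

    # @begin compute_mean
    # @in clean_values
    # @out summary
    summary = sum(clean_values) / len(clean_values)
    # @end compute_mean
    return summary

# @end clean_and_summarize

print(clean_and_summarize([1.0, None, 3.0]))
```

The annotations live entirely in comments, so the script runs unchanged; they only declare the steps and the data flowing between them, which is the workflow structure that would otherwise stay latent in the code.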
Evaluation of applications
Applications will be judged by the following criteria:
- The academic and technical qualifications of the applicant.
- Evidence of strong written and oral communication skills.
- The extent to which the applicant can provide substantive contributions to one or more projects, including the applicant’s ideas for project implementation.
- The extent to which the internship would be of value to the career development of the applicant.
- The availability of the applicant during the period of the internship.
DataONE is predicated on openness and universal access. Software is developed under one of several open source licenses, and copyrightable content produced during the course of the project will be made available under a Creative Commons (CC-BY 3.0) license. Where appropriate, projects may result in published articles and conference presentations, to which the intern is expected to make a substantive contribution and for which the intern will receive credit.
Previous Summer Internships were supported by a National Science Foundation Award (NSF Award 0830944): "DataNetONE (Observation Network for Earth)". Current Summer Internships are supported by National Science Foundation Award #1430508.
For more information
If you have questions about, or problems with, the application process or the internship program in general, please e-mail firstname.lastname@example.org.