Nicholas DiGiuseppe is a software engineering PhD student at the University of California, Irvine. His research focuses on automating the debugging process by leveraging natural language. In his free time he enjoys going for walks with his daughter, playing board games with his wife, and Iaijutsu.
As the use of ontologies in the Earth and Environmental Sciences domain increases, existing ontologies need to be evaluated for quality so that the collection can be curated. Many criteria have been proposed in the literature for evaluating ontologies, and ontologies need to be evaluated along many dimensions. In particular, the coverage of the ontologies should be evaluated for relevance to the community. This is particularly important to the DataONE federation, as ontologies and semantic descriptions of domain vocabularies enhance dataset discovery and ensure disambiguation of domain knowledge. We propose developing methods for automatic evaluation using Natural Language Processing methods. The ideal candidate will have a background in Computer Science and be familiar with ontologies or NLP techniques. Expected outcomes include a prototype, evaluation results, or material for publication.
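One way to picture an automatic coverage check is sketched below. It is only an illustration, not the project's method: it assumes coverage can be approximated as the share of frequent words in a community text corpus that also appear as ontology labels. The function names, the minimum word length, and the sample data are hypothetical.

```python
import re
from collections import Counter


def top_terms(documents, n=10, min_length=4):
    """Return the n most frequent words (of at least min_length letters)."""
    counts = Counter()
    for doc in documents:
        words = re.findall(r"[a-z]+", doc.lower())
        counts.update(w for w in words if len(w) >= min_length)
    return [word for word, _ in counts.most_common(n)]


def coverage(ontology_labels, documents, n=10):
    """Fraction of the corpus's top terms that match an ontology label."""
    terms = top_terms(documents, n)
    if not terms:
        return 0.0
    labels = {label.lower() for label in ontology_labels}
    return sum(term in labels for term in terms) / len(terms)


# Hypothetical sample data for illustration only.
docs = ["soil moisture soil carbon", "carbon flux"]
score = coverage(["Soil", "Carbon"], docs)
```

A real evaluation would likely need stemming, multi-word terms, and synonym handling, which this sketch leaves out.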
Primary Mentor: Line Pouchard
Secondary Mentor: Natasha Noy
Christopher is a new master's student in computer science at the University of California, Davis. His academic interests and job experience have been broad to date, ranging from computer vision to networks. In graduate school, he'll focus on the theoretical foundations of computer networks. His current work for DataONE involves designing and implementing a crowd-sourced, online dictionary for metadata terms.
The goal of the proposed summer internship is to prototype a metadata registry framework in two parts: a vernacular part consisting of evolving, freely contributed terms and a lightly supervised canonical part consisting of stable terms that crowd-sourced, reputation-based methods have brought to prominence. Leveraging social technologies while benefiting from expert moderation, this bi-level mechanism can be used in any subject domain to create highly relevant metadata registries that avoid the inefficient and unresponsive maintenance pattern plaguing almost every mature registry.
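The bi-level idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the registry's actual design: the vote threshold, the class names, and the rule that promotion needs both community votes and a moderator's approval are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical threshold: net votes a vernacular term needs before promotion.
PROMOTION_THRESHOLD = 3


@dataclass
class Term:
    name: str
    definition: str
    votes: int = 0           # net community votes (up minus down)
    canonical: bool = False  # True once promoted to the stable tier


@dataclass
class Registry:
    terms: dict = field(default_factory=dict)

    def contribute(self, name, definition):
        """Add a freely contributed term to the vernacular tier."""
        self.terms[name] = Term(name, definition)
        return self.terms[name]

    def vote(self, name, delta):
        """Record an up-vote (+1) or down-vote (-1) for a term."""
        self.terms[name].votes += delta

    def promote(self, name, moderator_approved):
        """Promote a term only if it has enough votes and a moderator agrees."""
        term = self.terms[name]
        if term.votes >= PROMOTION_THRESHOLD and moderator_approved:
            term.canonical = True
        return term.canonical

    def canonical_terms(self):
        """List the names of all terms in the canonical tier."""
        return sorted(t.name for t in self.terms.values() if t.canonical)
```

Keeping both tiers in one store, separated only by a flag, means a promoted term keeps its history of votes and discussion.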
The intern will work with the DataONE PAMWG (Preservation and Metadata Working Group) to begin populating a registry instance that emphasizes, but is not limited to, earth and environmental sciences. As per working group goals, the instance will feature a low barrier for contributions, transparency in review processes, and support for balanced discussion and lightweight moderation by elders (experts). Stack Overflow, Hacker News, and Wikipedia have proven, through a range of reputation-based approaches, that quality can be achieved by drawing the best from user communities. Pooling resources across sciences will reduce duplicate efforts and spending, and support greater interoperability within DataONE and among other scientific data initiatives.
Primary Mentors: Jane Greenberg, John Kunze
Fei Du is a PhD student in Geographic Information Science (GIS) at the University of Wisconsin-Madison. He has an MS degree in Computer Sciences from UW-Madison and an MS degree in GIS from the Chinese Academy of Sciences. His research interests are in geospatial predictive modeling, trajectory data analysis, spatial data mining and geovisual analytics.
Earth System Modeling is a primary approach to advancing our understanding of the Earth’s biogeochemical cycles, including their interactions with humans, and of climate change. A variety of Earth system models have been developed, using different approaches to address different components of the Earth’s biogeochemical cycles. Even though the findings of modeling efforts are promising, there are still many uncertainties associated with the results.
Model-data intercomparison is an important approach to diagnosing and improving model processes and parameterizations by examining differences among models and differences between models and observations. However, there are challenges, including 1) heterogeneous model output and observation data with different formats, spatial/temporal scales, etc.; 2) lack of tools that address the specific needs of data analysis and visualization for model-data intercomparison; 3) lack of mechanisms to reproduce and trace back to the origins of analyzed data and visualizations.
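As a small illustration of the first challenge, the sketch below brings a model series and an observation series onto a common time step before comparing them. It is an assumption-laden example, not part of the proposed infrastructure: the function names and the choice of mean bias and RMSE as comparison metrics are hypothetical.

```python
import math
from collections import defaultdict


def aggregate_by_key(samples):
    """Average (key, value) samples that share a key, e.g. a month label."""
    groups = defaultdict(list)
    for key, value in samples:
        groups[key].append(value)
    return {key: sum(vals) / len(vals) for key, vals in groups.items()}


def compare(model, observed):
    """Return (mean bias, RMSE) over the time steps present in both series."""
    shared = sorted(set(model) & set(observed))
    if not shared:
        raise ValueError("no overlapping time steps")
    diffs = [model[key] - observed[key] for key in shared]
    bias = sum(diffs) / len(diffs)
    rmse = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    return bias, rmse


# Hypothetical values: finer-grained model output, monthly observations.
model_samples = [("2000-01", 1.0), ("2000-01", 3.0), ("2000-02", 4.0)]
observations = {"2000-01": 1.0, "2000-02": 4.0}
bias, rmse = compare(aggregate_by_key(model_samples), observations)
```

Spatial regridding and unit conversion would need the same kind of alignment step and are omitted here.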
To tackle the above challenges, the DataONE EVA working group proposes to build a prototype of Provenance-aware Model Exploration, Evaluation, and Benchmarking Cyber-infrastructure on top of VisTrails and UV-CDAT, which are open source workflow-based scientific analysis and visualization frameworks, as described in Figure 1. This infrastructure can integrate distributed data resources from DataONE, Earth System Grid (ESG), or any user-provided model and observation data repositories through Brokers. The core component of the infrastructure contains libraries of standard modules and workflows for data analysis and visualization. Interfaces will be provided for different types of users and will guide them in customizing workflows for their specific model-data intercomparison needs. The infrastructure is linked together with provenance-aware tools so that VisTrails workflows can be converted to standards-based provenance representations and indexed through the DataONE indexing mechanism. Provenance-based data discovery, customization, and reproduction can then be achieved. The analyzed results, together with associated provenance information, can be packaged and contributed back to DataONE.
Primary Mentor: Bob Cook
Secondary Mentor: Yaxing Wei
Anne Bowser is a PhD student at the University of Maryland iSchool and Human-Computer Interaction Lab working with the Biotracker research team (www.biotrackers.net). Her research focus involves identifying how the motivational affordances of games can be used to engage volunteers with different citizen science campaigns. Anne is currently collaborating with Project Budburst to design Floracaching, a mobile geocaching game that will help gather plant phenology data. Her work at DataONE involves surveying the data policies of different citizen science campaigns in order to create a practitioner guide to data policy.
Developing sound policies for using and sharing data in projects that involve the public in scientific research is a complex undertaking. Currently no formal guidelines are available for selecting and implementing data policies that are suited to the needs of citizen science project coordinators.
The initial goal of this project is to develop a curated set of exemplar data policies for delivery through the citizen science project development toolkit on www.citizenscience.org.
A guide to data policies for practitioners will be developed for delivery along with the examples. There is also potential for extending these initial deliverables to include development of an interactive “Data Policy Planning Tool.” There may be additional opportunities for collaboration with PPSR Working Group members on ongoing related research.
The successful candidate will have opportunities to develop extensive understanding of data policies related to scientific data sharing, deep familiarity with the growing phenomenon of citizen science, and practical experience in resource selection and curation. If the candidate is able to work out of Ithaca, NY, s/he will have exceptional access to world leaders in citizen science practice and research.
Primary Mentor: Andrea Wiggins
Secondary Mentor: Robert Steveson
Sarah Menz is currently pursuing an MA in Sustainability at Chatham University. With a diverse background in English Literature, Creative Writing, Business, and Environmental Studies, she is interested in communicating the wonders of our natural environment and the importance of safeguarding it. In addition to reading and writing on a variety of subjects, she enjoys experimenting in the kitchen, biking around Pittsburgh, and snapping photos of backyard wildlife.
Tensions around sharing scientific data have received international attention in recent years (for example, in 2009’s “climategate”), and the scientific community is actively working toward creating a healthier dialogue around data management and sharing. This project aims to integrate researchers' success stories and cautionary tales about managing and sharing scientific research data into DataONE education and community engagement products. The Data Stories project, which is focused on collecting such stories through structured interviews and/or focus groups, is currently underway. By the beginning of Summer 2013, we expect to have a number of narratives based on these interviews posted online on the DataONE Data Stories blog. The summer intern will assist with preparing and posting any stories that have not yet been posted, but will focus primarily on integrating these narratives into DataONE education products such as the Data Management Education Modules. The intern will assist with publicizing the existence of these new resources to support data management and sharing, provide periodic project updates in the form of research blog posts, and assist with preparation of a manuscript summarizing key findings of the Data Stories project.
Primary Mentor: Stephanie Hampton
Secondary Mentor: Stacy Rebich-Hespanha
Katie just finished her first year in the Information Technology and Web Science Masters program at Rensselaer Polytechnic in upstate New York. Her upbringing in northern Virginia, just outside of Washington, DC, meant that she had seen snow prior to this winter, but New York was still quite a change weather-wise from where she completed her undergraduate work, in St. Petersburg, Florida. There, at Eckerd College, she earned a degree in Computer Science, as well as in East Asian Studies, and Modern Languages. This summer, she is working on additional features for the Next Generation Data Environment, a browser-based tool for converting comma-separated data tables into linked data format. In her free time, she enjoys good science fiction; bad puns; watching hockey and baseball; and embarking on cooking adventures.
The summer intern will work to develop a web-based interface that lets any user immediately begin to semantically enable data and metadata. Within the application workflow, the user will be able to link their data to selections of ontology concepts from established community ontologies, like OBO-E (https://marinemetadata.org/references/oboeontology), leveraging backend vocabulary services developed by Patrice Seyed (post-doc for the DataONE semantics and interoperability working group). The interface will leverage formal reasoning to assist a user in making selections constrained by their previous selections of classes and properties, based on how these objects are defined in their respective ontologies, while at the same time assisting the user in verifying the set of inferences that follow from all selections. The design will let the user identify implicit domain entities (e.g., when a measurement data record refers to multiple samples or organisms as opposed to one), which is useful in scenarios where this is clearly understood only by the data table creator, and flexibly encode their representation within the transformed data. The project serves as an extension to previous semantic data enablement projects across Rensselaer Polytechnic Institute (RPI) and the National Center for Ecological Analysis and Synthesis (NCEAS), including CSV2RDF4LOD from RPI, which converts tabular data into RDF statements based on user-provided configurations, and Morpho from NCEAS’s Semtools project, which annotates tabular data by applying the OBO-E ontology model of scientific observation. Researchers involved in these projects are mentors for this proposal and available for guidance. The resulting transformed data will include linkages back to the original data and its source using provenance-centric ontologies (PROV-O, PML3), and will be available for discovery and granular search of datasets described through DataONE’s metadata environment.
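To make the table-to-linked-data step concrete, here is a minimal sketch that turns CSV rows into N-Triples lines from a user-supplied column-to-property mapping. It does not reproduce CSV2RDF4LOD or the proposed interface: the `http://example.org/` namespace, the function names, and the decision to emit every cell as a plain string literal are assumptions made for illustration.

```python
import csv
import io
from urllib.parse import quote

BASE = "http://example.org/"  # hypothetical namespace, not a DataONE URI


def escape_literal(text):
    """Escape characters that N-Triples does not allow raw inside a literal."""
    return (text.replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n")
                .replace("\r", "\\r"))


def csv_to_ntriples(csv_text, column_to_property, id_column):
    """Emit one triple per mapped cell; the id_column value names the subject."""
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = f"<{BASE}record/{quote(row[id_column], safe='')}>"
        for column, prop in column_to_property.items():
            literal = escape_literal(row[column])
            lines.append(f'{subject} <{prop}> "{literal}" .')
    return lines


# Hypothetical table and mapping for illustration only.
table = "id,species\n1,Quercus alba\n"
mapping = {"species": BASE + "vocab/species"}
triples = csv_to_ntriples(table, mapping, id_column="id")
```

In the workflow described above, the mapping would come from the user's ontology selections rather than a hand-written dictionary.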
Primary Mentors: Patrice Seyed, Deborah McGuinness
Nikhil Kapoor is presently pursuing an MS in Computer Science and Engineering at the University of South Carolina. His current research revolves around mapping and managing ontologies in the Earth and Environmental Sciences. He has an extensive background in the use of different virtualization tools, and in working with Web Services and Version Control Systems. He has also taught undergraduate courses at the university. In his spare time Nikhil likes writing poems, pursuing adventure sports or working out in the gym.
Numerous Earth and Environmental Science ontologies exist in various repositories that are useful to DataONE for data access, delivery, and sharing. These ontologies can be used to enhance metadata annotations of each dataset, thus improving metadata quality overall. However, Earth and Environmental Science ontologies vary widely in quality and curation. As DataONE is poised to be the main point of access to earth and environmental data and practices and is schema agnostic, semantic descriptions of these datasets and practices are crucial to discovery across schemas. One way to ascertain an ontology's quality is to locate terms with similar semantics in two or more ontologies and, based on their annotations and surrounding concepts in the ontologies, have domain users assess the comparative quality. The scope of this task includes providing backend mappings that support automated assistance to users, in the form of semantically similar terms from different ontologies for the same domain. The ideal candidate will have a background in computer or information science and should be familiar with ontological concepts and possibly the application of algorithms to provide mappings. Expected outcomes may include the development of a software prototype, a final report, or material for publication of results at a conference on earth and environmental sciences.
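A backend mapping step of this kind could start from something as simple as the sketch below, which proposes candidate term pairs for users to review. It swaps in plain string similarity of term labels (Python's `difflib.SequenceMatcher`) for true semantic similarity, which is an assumption of this example; the function names, the threshold, and the sample labels are hypothetical.

```python
from difflib import SequenceMatcher


def normalize(label):
    """Lowercase a label and treat underscores and hyphens as spaces."""
    cleaned = label.lower().replace("_", " ").replace("-", " ")
    return " ".join(cleaned.split())


def candidate_mappings(labels_a, labels_b, threshold=0.8):
    """Return (label_a, label_b, score) pairs whose labels look alike."""
    pairs = []
    for a in labels_a:
        for b in labels_b:
            score = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
            if score >= threshold:
                pairs.append((a, b, round(score, 2)))
    # Highest-scoring candidates first, so reviewers see the best matches.
    return sorted(pairs, key=lambda pair: -pair[2])


# Hypothetical labels from two ontologies.
matches = candidate_mappings(["Air_Temperature", "Soil"],
                             ["air temperature", "Precipitation"])
```

Label matching misses synonyms and cannot tell homonyms apart, so a fuller approach would also compare annotations and neighboring concepts, as the description above suggests.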
Primary Mentor: Line Pouchard
Secondary Mentor: Natasha Noy
Parisa Kianmajd is a PhD student in Computer Science at the University of California, Davis. Prior to joining the PhD program, she received her master's degree in Information Security. Her research interests include, but are not limited to, applied cryptography, database security, privacy, and privacy-aware provenance.
The goal of this project is to develop a feature-rich provenance management architecture, which we call PBase, that integrates with the core DataONE architecture. To achieve this, we will combine two strands of work that the Provenance WG has been pursuing for the past two years. The first, Golden-Trail: A Provenance Repository For Storing And Retrieving Data Lineage Information (2010), focused on the realization of a common provenance model (D-PROV), a provenance repository, and an interactive user interface (Golden-Trail). The second effort (2012) has been centered on using the member nodes’ Data Packaging features in combination with provenance-aware workflow execution.
The intern will develop a prototype of PBase by building upon this prior work. The prototype will demonstrate the benefits of an architectural stack that includes advanced query and analytics capabilities over a corpus of provenance traces, which are associated with data stored in Data Packages within member nodes. It will also enable the composition of provenance fragments produced separately by workflows that are independent and yet share some of their data, a natural occurrence in e-science. At the same time, we will retain the advantages of using provenance terms for data discovery, which we have demonstrated in our most recent prototype, as well as the storage of workflows, their data, and the provenance into self-contained packages.
Workflows may come from different systems. Thus, we aim to show interoperability of the provenance traces collected from those systems, by means of our unified D-PROV provenance data model.
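The composition of provenance fragments can be illustrated with a toy example. The sketch below is not D-PROV or PBase: it assumes each trace is reduced to a mapping from a data item to the items it was derived from, and that shared data items carry the same identifier in both traces. The function names and file names are hypothetical.

```python
def compose(*traces):
    """Union several traces; each maps an item to the items it derives from."""
    merged = {}
    for trace in traces:
        for item, sources in trace.items():
            merged.setdefault(item, set()).update(sources)
    return merged


def lineage(trace, item):
    """Return every upstream item reachable from the given item."""
    seen = set()
    stack = [item]
    while stack:
        for source in trace.get(stack.pop(), ()):
            if source not in seen:
                seen.add(source)
                stack.append(source)
    return seen


# Two independent workflows that share the intermediate file clean.csv.
cleaning_run = {"clean.csv": {"raw.csv"}}
plotting_run = {"plot.png": {"clean.csv"}}
combined = compose(cleaning_run, plotting_run)
upstream = lineage(combined, "plot.png")
```

On its own, the plotting trace only reaches `clean.csv`; after composition, the lineage query follows the shared item back to `raw.csv`.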
Primary Mentor: Bertram Ludaescher
Secondary Mentor: Paolo Missier