Chung-Yi (Sophie) Hou recently graduated from the Master of Science in Library and Information Science program, with a specialization in Data Curation, at the University of Illinois, Urbana-Champaign. Since 2014, Sophie has been working on DataONE's activities relating to semantics as well as collaborating with the Research Data Archive and the Library at the National Center for Atmospheric Research on several data management/curation projects. Sophie's previous Bachelor and Master of Science degrees and professional work experience were in the field of electrical engineering. Outside of work and research, Sophie tries to make time for good reads, walks, and cups of tea.
Education and outreach focused on topics such as data management, data sharing, and writing quality metadata for reuse are of vital importance to DataONE users and to the broader scientific, research, and academic communities. Responsive educational design that remains sensitive to community needs depends on evaluation, assessment, and thoughtful development of new resources. This project would center on the development of a survey instrument for site-wide evaluation of DataONE teaching materials and resources (modules and screencasts). Building the evaluation instrument, constructing a workflow for deployment, and compiling considerations for putting together future educational resources are all outcomes of this project.
For this project, the summer intern will work closely with the Director of Community Outreach and the CEO Postdoc to create an evaluation tool for current DataONE resources and teaching materials. This project would involve three parts: 1) a literature review that looks at the best practices for soliciting and receiving user feedback on teaching and educational resources, and places this in context with the development of new outreach decisions; 2) the development of an assessment tool for evaluating current educational resources; and 3) identification of target audiences and recommendations for deployment of the tool. Additionally, it is anticipated that this tool will be shared with the larger data management community.
Primary Mentor: Heather Soyka
Secondary Mentor: Viv Hutchison, Amber Budden
An Yan is a PhD student in Information Science at the University of Washington. She received her B.E. in Remote Sensing from Wuhan University, China, and her M.S. in Ecology from the Center for Earth System Science, Tsinghua University, China. Her research interests revolve around data curation, cyberinfrastructure, and reproducible research in the geosciences.
DataONE, through a federated set of repositories, offers a wide range of well-described, openly available scientific data, creating the potential for new discoveries. While researchers have worked on understanding the benefits of these repositories and the motivations for data sharing, less is known about how collaborative research is happening in this environment. However, research groups are emerging that are interdisciplinary, data-focused, highly collaborative, and sometimes funded by programs that cross traditional boundaries. The questions are compelling: What roles are researchers taking in emerging data research communities? How are the roles of data curation, data access, and analysis being managed within these groups? How do these communities communicate? How do these communities enable new science? Are they changing the culture of science?
For this project, the summer intern will work closely with the usability and assessment mentor team to explore the roles and influence of emerging communities related to open access data. Three Grand Challenges of Climate Change set the context for the exploration: sea level rise, water availability, and linking extreme events to climate change. Emerging communities may include (but are not limited to) data managers, data communicators, data scientists, policy makers, farmers, and entrepreneurs. The project includes three parts: 1) conducting a comprehensive review of data roles in emergent data communities, including reviewing a range of academic sources, grant information, graduate programs, and informal media to see how team roles are being defined; 2) developing draft questions and conducting pilot interviews with representatives of key emerging communities; and 3) refining the questions and identifying strategies to reach these communities more broadly.
Primary Mentor: Suzie Allard
Secondary Mentor: Mike Frame, Carol Tenopir
After graduating this spring from USC, Erika will be pursuing a Master's in Environmental Science and Management this fall at the Bren School at UC-SB. Her interests include using big data and analytics to tackle environmental problems, especially environmental justice issues. A self-described tea-lover, Erika enjoys exploring National Parks and listening to NPR in her spare time.
The initial DataONE infrastructure was released in 2012 with the goal of enabling new science and knowledge creation through universal access to data about life on earth and the environment that sustains it. By federating across data repositories, or Member Nodes, DataONE aims to enhance search and discovery of data. This internship will explore the extent to which DataONE has contributed to increased data sharing and reuse of data held within existing data repositories.
As part of a larger project exploring practices and perceptions of researchers around data sharing and reuse, this intern will first conduct a literature review that looks at research communities, data sharing, and the impact of the decision to make research data available for reuse. This literature review will inform the direction of the other two phases, taking into account publications about DataONE, and will be outlined and discussed in greater detail at the start of the internship.
Second, the internship will employ statistical methods and use quantitative data to explore the effects of participation in the DataONE federation. Regression analyses will be conducted on pre- and post-DataONE metrics on data uploads and downloads.
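As a rough illustration of what a pre/post regression could look like, the sketch below fits an ordinary least squares line of monthly download counts on a 0/1 indicator for whether the repository had joined the federation. The counts, the variable names, and the choice of a single-indicator model are all assumptions made for illustration; the project description does not specify the model or the data.

```python
from statistics import mean


def ols_simple(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical monthly download counts for one repository (illustrative only,
# not real DataONE metrics).
pre_downloads = [100, 110, 90, 100]
post_downloads = [130, 150, 140, 140]

# Indicator variable: 0 = month before joining, 1 = month after joining.
joined = [0] * len(pre_downloads) + [1] * len(post_downloads)
downloads = pre_downloads + post_downloads

# With a single 0/1 predictor, the intercept is the pre-period mean and the
# slope is the difference between the post- and pre-period means.
baseline, effect = ols_simple(joined, downloads)
print(f"pre-period mean: {baseline:.1f}, estimated change after joining: {effect:.1f}")
```

A real analysis would likely add a time trend, repository-level controls, and uncertainty estimates; the single indicator is only the simplest form of the comparison.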
Primary Mentor: Amber Budden, Heather Soyka
Secondary Mentor: Mark Schildhauer, Dave Vieglais
Duc is a first-year PhD student in Electrical and Computer Engineering at the University of Illinois at Chicago. Before his PhD study, he received an M.S. degree in Electrical and Computer Engineering from the University of Illinois at Chicago. His research interests include machine learning, data mining, algorithms, and signal processing. In his free time, Duc enjoys cooking and collecting stamps.
What does it take to reproduce a script-based scientific workflow?
For example, if the Python or R scripts implementing a workflow are available through an open source repository such as GitHub, are we all set? Not so fast! A user might fail to run the scripts or replicate the results for any of a number of reasons: for starters, the installation may fail due to complex software and version dependencies, or the user may fail to properly run, adapt, or understand the scripts due to a lack of documentation.
In this project we will experiment with a number of technologies and tools that can improve the reproducibility of script-based workflows. For example, the YesWorkflow (YW) toolkit allows authors to annotate scripts to model and export prospective provenance, i.e., the workflow structure otherwise latent in the script. YW can also be used to reconstruct retrospective provenance or to query other sources of provenance information, e.g., runtime provenance logged directly by the script author or recorded by the DataONE MATLAB tool, the NCEAS recordr, or the noWorkflow system (for capturing Python execution provenance). To manage platform and software dependencies of script-based workflows, Docker containers can be used. Last but not least, active elements can be embedded in PDF files to support interactive exploration of published results.
Using one or more example scripts, we will apply these different technologies and study their benefits and limitations. The overall goal is to deploy a prototypical example of a “highly reproducible” script-based workflow using a combination of the above-mentioned technologies.
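To make the idea of annotating a script concrete, here is a minimal sketch of a Python script marked up with YesWorkflow-style comments. The workflow itself (dropping missing readings and averaging the rest) and all block and variable names are hypothetical; the `@begin`, `@end`, `@in`, `@out`, and `@desc` keywords are used as we understand them from the YesWorkflow documentation, so check them against the toolkit before relying on this.

```python
# @begin summarize_readings @desc Hypothetical two-step workflow.
# @in raw_readings
# @out mean_reading

# Illustrative input values; None marks a reading that was not recorded.
raw_readings = [12.5, None, 14.0, 13.5, None]

# @begin drop_missing @desc Remove readings that were not recorded.
# @in raw_readings
# @out clean_readings
clean_readings = [r for r in raw_readings if r is not None]
# @end drop_missing

# @begin compute_mean @desc Average the remaining readings.
# @in clean_readings
# @out mean_reading
mean_reading = sum(clean_readings) / len(clean_readings)
# @end compute_mean

# @end summarize_readings

print(mean_reading)
```

Because the annotations live in comments, the script runs unchanged under a plain Python interpreter, while a tool that reads the comments can recover the two steps and the data that flows between them.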
Primary Mentor: Bertram Ludäscher
Secondary Mentor: Tim McPhillips, Bruce Wilson, Paolo Missier
Sabita is a PhD student in Computer Science at the University of Illinois at Chicago. Her research interests include Natural Language Processing, Data Mining, and Semantic Web. In her free time, she loves hiking, travelling, and reading mystery-detective novels.
A number of entity linking tools exist to take unstructured text (or sometimes semi-structured text), extract terms (often noun phrases), and then link those extracted entities to entities in a knowledge base. We have a tool (currently named Linkipedia) that addresses this challenge. Its use is described in one setting at http://nlp.cs.rpi.edu/paper/bioel.pdf. In the setting of DataONE, our updated toolset leverages existing knowledge sources, including DBpedia and a number of ontologies relevant to earth science, to support entity linking from descriptions of data. We are using this tool in the DataONE project to link portions of textual descriptions to ontology terms and then using those linking results to provide automatic annotation. While we have promising results, the annotation accuracy could be improved. Additionally, our tool suite includes a number of components, including a noun phrase extractor. The linking component can take text and propose appropriate links to knowledge base and ontology items. With the noun phrase extractor, it can take a description and identify noun phrases that do not link to any known ontologies, which is one way of identifying gaps in the ontology.
This project will package existing components into a web service for automatic annotation and ontology gap analysis. It will also attempt to improve on the suggested annotations and links.
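The gap analysis described above can be sketched in a few lines, assuming noun phrases have already been extracted and that linking is a simple normalized label lookup. This is not Linkipedia's implementation or API: the function names, the label dictionary, and the `EX:` identifiers are all hypothetical, and a real linker would rank candidate terms rather than require exact matches.

```python
def normalize(phrase):
    """Lowercase a phrase and collapse runs of whitespace."""
    return " ".join(phrase.lower().split())


def link_phrases(noun_phrases, ontology):
    """Split phrases into those with a matching ontology label and those without."""
    links, gaps = {}, []
    for phrase in noun_phrases:
        key = normalize(phrase)
        if key in ontology:
            links[phrase] = ontology[key]
        else:
            gaps.append(phrase)  # no known term: a candidate ontology gap
    return links, gaps


# Hypothetical ontology labels mapped to made-up term identifiers.
ontology_labels = {
    "soil moisture": "EX:0001",
    "air temperature": "EX:0002",
}

# Hypothetical noun phrases, as if extracted from a dataset description.
noun_phrases = ["Soil  Moisture", "air temperature", "leaf litter depth"]

links, gaps = link_phrases(noun_phrases, ontology_labels)
print("linked:", links)
print("possible ontology gaps:", gaps)
```

Phrases that end up in `gaps` are only candidates: some will be genuine gaps in the ontology, while others will be synonyms or spelling variants of terms that already exist.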
Primary Mentor: Deborah L. McGuinness
Secondary Mentor: Jim McCusker, Matt Jones, Mark Schildhauer