
2017 Interns

Xiaoliang Jiang

Bio:

Xiaoliang Jiang is a second-year master's student in Library and Information Science at the University of Illinois at Urbana-Champaign (UIUC). He received his B.E. in Information Security from the University of Science and Technology Beijing. He worked as an analyst intern in the Office of the Vice Chancellor for Institutional Advancement at UIUC, and plans to apply to PhD programs next year. His research interests revolve around data analytics, data mining, and social and information networks. In his free time, he likes reading and playing board games.


Project Description:

The proposed work will result in an extension to the RStudio environment enabling data analysts to directly publish RDF that richly describes the semantics of their scripts. This work will also include draft best practices that guide practitioners in proper embedding of appropriate concepts and vocabulary from established ontologies (including ProvONE and domain ontologies).

In detail, this work entails an exploration of extending markdown syntax (esp. R Markdown) in concert with knitr to directly produce workflow markup, in a human-compatible way. A specific example of what this means: when "knitting" a markdown rendition, instead of generating (e.g.) PDF or HTML, the anticipated tool will generate RDF (TTL or JSON-LD) or HTML+RDFa. By "human readable," we mean that markdown best practices will be developed that are reasonable for a data analyst to use; methods (possibly based on templates) must be developed that do not require the user to "know" RDF. Today we can create cumbersome R Markdown (Rmd) files that produce HTML+RDFa outputs with correct embedded workflow semantics, but the user must be an HTML and RDFa hacker to understand them. Workflow reproducibility requires tools that data analysts will actually use.
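As a purely hypothetical sketch of the direction described above (none of this syntax exists yet; the `rdf_document` output format and the `wf.in`/`wf.out` chunk options are invented here for illustration), a "semantic" R Markdown file might look like:

````markdown
---
title: "Stream temperature analysis"
output: rdf_document        # hypothetical output format emitting TTL/JSON-LD
---

```{r fetch_data, wf.in="stations.csv", wf.out="readings.csv"}
readings <- read.csv("stations.csv")
```
````

Knitting such a file would then emit, alongside the usual rendition, triples stating that a step named `fetch_data` consumed `stations.csv` and produced `readings.csv`, without the analyst ever writing RDF by hand.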

With the right skillset, the intern may develop methods for semi-automatically extracting function and package semantics and encoding these into the resulting graph. By this we mean, in addition to capturing explicit semantics expressed via markdown syntax, the intern may develop a way to further capture "meaning" based on the use of functions in scripts, without requiring users to artificially wrap standard functions; this might be done through "wrapper" functions in R or some other means.

This work advances the semantic workflow effort inspired by YesWorkflow, and leverages standard practices for R extensions, markdown, and publication, creating a direct path for DataONE analysts to have their workflows represented in knowledge graphs. This approach broadens the potential DataONE user base by helping to ensure that workflows and results are easier to discover and conceptually easier to understand, thereby increasing the likelihood that they will be cited, reused, and reproduced.


Primary Mentor: Deborah McGuinness
Secondary Mentor: John Erickson (RPI)

Read more about the project

Hui Lyu

Bio:

Hui Lyu recently obtained a master's degree in Library and Information Science from the School of Information Sciences at UIUC. She will pursue a Ph.D. in Computer and Information Science at the University of Pennsylvania in Fall 2017. Hui received her bachelor's degree in Electronic Engineering from Beihang University (BUAA) in China. Her research interests include data provenance in databases, scientific workflows, and distributed systems. In her spare time, Hui enjoys listening to emotional and soulful songs, dancing, and watching drama.


Project Description:

YesWorkflow (YW) defines a set of annotations for declaring the expected dataflow patterns in scripts written in any text-based programming language. The YW toolkit extracts these annotations from comments in source code and builds a ProvONE-compatible workflow model of the script which can then be rendered graphically. YW also enables the user to export its representation of the workflow model as a set of Prolog or Datalog facts which can then easily be queried and used to create ad hoc visualizations of all or part of the model. Further, YW can reconstruct key runtime events and even data values that occurred during a run of the script by joining the YW model (prospective provenance) with observations made either during or after the completion of the script run, e.g. the values of metadata embedded in file names and directories created by the script. YW can export this retrospective and reconstructed provenance information as Prolog facts as well. Finally, the prospective and retrospective provenance facts can be queried together, enabling even more useful, hybrid provenance queries and visualizations that are of immediate use to the researcher reviewing the results of a script run or reporting their results to others.
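The annotation-extraction step can be sketched in a few lines of Python. This is a simplification for illustration, not the actual YW toolkit (whose annotation grammar is richer and language-aware); the regex and example script are assumptions:

```python
import re

# Minimal sketch of pulling YW-style annotations (@begin, @in, @out,
# @end, @param) out of comment lines in a script.  Not the real YW
# implementation -- just the core idea of comment-embedded markup.
ANNOTATION = re.compile(r"@(begin|end|in|out|param)\s+(\S+)")

def extract_annotations(script: str):
    """Return (tag, argument) pairs found in comment-only lines."""
    found = []
    for line in script.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):          # only look inside comments
            found.extend(ANNOTATION.findall(stripped))
    return found

example = """\
# @begin fetch_data
# @in station_list.csv
# @out raw_readings.csv
raw = read_csv('station_list.csv')
# @end fetch_data
"""

print(extract_annotations(example))
```

From pairs like these, YW assembles the nested blocks and ports of its ProvONE-compatible workflow model; the non-comment code lines are never parsed.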

The goal of the current project is to enable all of the provenance information that YesWorkflow collects and exports as Prolog facts to be exported alternatively as an RDF representation. The aim is to produce RDF documents that are both easy to read directly and easy to query using SPARQL. We hypothesize that all of the queries we have previously demonstrated with Prolog/Datalog can also be implemented in SPARQL 1.1. The challenge will be finding an intuitive way of representing prospective and retrospective provenance in RDF that also facilitates scientifically meaningful queries about the derivation of particular script products via the computational steps in the script and the dataflows between them. The above work, to be carried out by the intern, will not entail any modification of the YW tool itself. Rather, the intern will speculatively author RDF documents representing YW workflow models and provenance, query these documents with SPARQL, and iteratively improve both the RDF representations and the SPARQL queries until as many of the desired queries as possible are supported. YesWorkflow will subsequently be updated to automatically generate the final version of the RDF representation designed in this project.
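To make the design space concrete, a hand-authored fragment of the kind the intern might experiment with could look like the Turtle below. The PROV-O and ProvONE prefixes are real vocabularies, but the `yw:` namespace and all instance IRIs are hypothetical placeholders, not an agreed representation:

```turtle
@prefix prov:    <http://www.w3.org/ns/prov#> .
@prefix provone: <http://purl.dataone.org/provone/2015/01/15/ontology#> .
@prefix yw:      <http://example.org/yesworkflow#> .   # hypothetical namespace

# Prospective provenance: a program block and its declared dataflow ports.
yw:fetch_data a provone:Program ;
    provone:hasInputPort  yw:fetch_data_in ;
    provone:hasOutputPort yw:fetch_data_out .

# Retrospective provenance: one observed execution of that block.
yw:run1 a provone:Execution ;
    prov:wasAssociatedWith yw:fetch_data ;
    prov:used              yw:station_list_csv .
```

A hybrid query then joins the two halves, e.g. asking which data each observed execution of `yw:fetch_data` actually consumed via a pattern like `?run prov:wasAssociatedWith yw:fetch_data ; prov:used ?data`.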


Primary Mentor: Bertram Ludäscher
Secondary Mentor: Timothy McPhillips

Read more about the project

Linh Hoang

Bio:

Linh Hoang is a first-year PhD student at the School of Information Sciences at the University of Illinois Urbana-Champaign. Her research interests include information management, knowledge discovery, and data analytics. She is motivated to build smarter information systems that help people gain insights from data and make important decisions without the laborious work of collecting and disambiguating knowledge. Outside of work and research, Linh enjoys spending time with family, cooking, and baking.


Project Description:

YesWorkflow (YW) defines a set of annotations for declaring the expected dataflow patterns in scripts written in any text-based programming language. The YW toolkit extracts these annotations from comments in source code and builds a ProvONE-compatible workflow model of the script which can then be rendered graphically. YW also enables the user to export its representation of the workflow model as a set of Prolog or Datalog facts which can then easily be queried and used to create ad hoc visualizations of all or part of the model. Further, YW can reconstruct key runtime events and even data values that occurred during a run of the script by joining the YW model (prospective provenance) with observations made either during or after the completion of the script run, e.g. the values of metadata embedded in file names and directories created by the script. YW can export this retrospective and reconstructed provenance information as Prolog facts as well. Finally, the prospective and retrospective provenance facts can be queried together, enabling even more useful, hybrid provenance queries and visualizations that are of immediate use to the researcher reviewing the results of a script run or reporting their results to others.

The goal of the current project is to enable all of the provenance information that YesWorkflow collects and exports as Prolog facts to be exported alternatively as an RDF representation. The aim is to produce RDF documents that are both easy to read directly and easy to query using SPARQL. We hypothesize that all of the queries we have previously demonstrated with Prolog/Datalog can also be implemented in SPARQL 1.1. The challenge will be finding an intuitive way of representing prospective and retrospective provenance in RDF that also facilitates scientifically meaningful queries about the derivation of particular script products via the computational steps in the script and the dataflows between them. The above work, to be carried out by the intern, will not entail any modification of the YW tool itself. Rather, the intern will speculatively author RDF documents representing YW workflow models and provenance, query these documents with SPARQL, and iteratively improve both the RDF representations and the SPARQL queries until as many of the desired queries as possible are supported. YesWorkflow will subsequently be updated to automatically generate the final version of the RDF representation designed in this project.

This internship is supported by the Ludäscher Lab at UIUC.


Primary Mentor: Bertram Ludäscher
Secondary Mentor: Timothy McPhillips

Read more about the project

Megan Mach

Bio:

Megan Mach is a Postdoctoral Research Fellow at Hopkins Marine Station in Monterey, California. Her current research is focused on quantifying how intertidal and kelp forest ecosystems in Monterey Bay have changed over the last 150 years. Though marine ecology has long been her focus, she is working at DataONE because she believes in the inherent value of visuals as a tool to increase interest and buy-in in the sciences. Megan holds a bachelor's degree in biology from the University of Washington, a master's degree in marine biology from Boston University, and a doctorate in Resource Management and Environmental Studies from the University of British Columbia.


Project Description:

You will be joining the DataONE team to help us create materials that increase awareness and appreciation of DataONE, the premier global environmental data partner. We are excited to provide this opportunity to someone in media, communication, journalism, marketing, or a related field to work with our globally distributed science infrastructure project. DataONE has already identified our broad community and the specific stakeholder needs and interests that are addressed by DataONE tools and services. You would be helping us create targeted marketing materials that are engaging and meaningful for each audience segment. During this project, you will have the opportunity to design a suite of materials that will be implemented across multiple formats: online, print, presentation, etc. The materials will be designed in consultation with the Director for Community Engagement and Outreach and members of the DataONE Sustainability and Governance committee, who will provide all the information you need for your creative brief so that the materials can appropriately reflect the mission and vision of the organization, as well as accurately describe its products and services. As with any “agency” work, the design process will require multiple ideas and proofs to be created and reviewed before final concepts are agreed upon for implementation.


Primary Mentor: Amber Budden
Secondary Mentor: Trisha Cruse and Suzie Allard

Read more about the project

Ed Flathers

Bio:

Edward is a PhD candidate in the College of Natural Resources at the University of Idaho. His research focus is on the design of ecoinformatics software systems to support collaboration and open science, data sharing and re-use, and practices of reproducible research. His background includes a master's degree in statistical sciences and many years of experience as a software developer. When he has the opportunity, Edward enjoys travel, especially when it includes fishing or hiking.


Project Description:

The ability to successfully search for and discover data held within a given repository depends both on the capabilities of the search engine and on the quality of the metadata describing the data set. Extensive variation exists in the amount and quality of metadata that an author might provide for a shared data set. At a minimum, contributors might provide details listing the data authors, the study system, and the date, time, and place of data collection. A rich metadata record would provide everything required to download and synthesize or reanalyze the data, including a comprehensive abstract describing the work, details of the structure and contents of the data, and full methodological information needed to properly interpret data values. It is on these metadata terms that the search query operates.

This project will be three-fold:

  • First, the project will examine DataONE search logs from 2012 to 2017 to explore patterns. What search terms were used? How much of the metadata was queried (i.e., how many search terms were used)? How many results were returned for various searches? Did a search result in data downloads from member repositories, and if so, from a single repository or from several?
  • Second, taking advantage of the Quality Report feature available in one of the Member Nodes (the Arctic Data Center), the project will compare the metadata characteristics of downloaded data with the population of data available within that repository.
  • Finally, to directly test the importance of metadata quality, a series of standardized searches will be conducted against a replica of the Arctic Data Center catalog. The metadata will then be manipulated before the search queries are repeated, to evaluate the impact on discovery and its relationship with the metadata quality reports.

The results of the project will be prepared for presentation and/or publication.
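A first pass over the search logs could be as simple as tokenizing the logged query strings and tallying term usage. The tab-separated log format below is invented for illustration; the real DataONE logs will differ:

```python
from collections import Counter

# Hypothetical log lines: "timestamp<TAB>query string<TAB>result count".
log_lines = [
    "2016-03-01T10:02:11\tsoil carbon alaska\t42",
    "2016-03-01T10:05:37\tsoil moisture\t7",
    "2016-03-02T09:14:02\tkelp forest monterey\t3",
]

term_counts = Counter()      # how often each search term appears overall
terms_per_query = []         # number of terms in each individual query
for line in log_lines:
    _, query, _ = line.split("\t")
    terms = query.split()
    terms_per_query.append(len(terms))
    term_counts.update(terms)

print(term_counts.most_common(1))                    # most frequent term
print(sum(terms_per_query) / len(terms_per_query))   # mean terms per query
```

Cross-tabulating counts like these against result counts and downstream download events would answer the questions in the first bullet above.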


Primary Mentor: Lauren Walker and Amber Budden
Secondary Mentor: Jesse Goldstein, Jeanette Clark, and Matt Jones

Read more about the project

Elizabeth Olson

Bio:

Elizabeth Olson is a doctoral student at Northern Illinois University who specializes in geochemistry. Her research focuses on paleoclimate reconstruction using geochemical proxies. Currently she is developing a record of Holocene water availability in the Atacama Desert of northern Chile from the oxygen isotope values of tree rings. She received her master's degree in Quaternary and Climate Studies from the University of Maine and her BA in Geology and Anthropology from the University of Florida.


Project Description:

NCEAS is currently the host site for a number of critical and compelling environmental, ecological, and conservation/human well-being research topics, all undertaken within a framework of “synthesis,” using existing data that must be collated, documented, and robustly archived in a DataONE-compatible data repository. This project offers a prime opportunity for a DataONE intern to work with data from these synthesis activities and to augment DataONE’s own controlled vocabulary of ecosystem concepts, the “Ecosystems Ontology” (ECSO). ECSO is constructed using World Wide Web standards that enable accessing terms in the vocabulary and associating them with features of the data objects found in DataONE. This will enable researchers to improve the precision of their searches, as well as enhance interpretation of the data for re-use in synthesis investigations.

The intern’s work will involve:

  • Identifying relevant “external” vocabularies containing well-constructed terms to use for describing DataONE data, and investigating the best methods for importing/referencing these terms within DataONE’s framework
  • Identifying relevant vocabularies that are not well-constructed, and incorporating these into DataONE’s framework
  • Developing new terms as needed, to augment the library of measurements in DataONE’s ECSO ontology
  • User-testing of, and feedback on, DataONE’s annotation and search tools and features when archiving synthesis data products in a DataONE MN.

The outcomes of this internship are intended to be practical, supplementing the catalogue of well-defined measurements available for researchers (and machine-assisted mechanisms) to use for annotation, as well as for data discovery and interpretation. The numerous ongoing synthesis activities at NCEAS include: Long Term Ecological Research Synthesis Working Groups; Science for Nature and People Partnership Working Groups; Arctic Data Center Working Groups; and the State of Alaska Salmon and People Working Groups. These efforts collectively provide access to a rich set of heterogeneous environmental data that will be archived in DataONE MNs. Specific targets of data enrichment and vocabulary development activities will be prioritized and focused through discussion between the mentors and the various project PIs.
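For illustration only, a semantic annotation tying a dataset column to a measurement concept might look roughly like the Turtle below. The OBOE prefix is a real vocabulary used in ecological annotation, but the `ecso:` prefix form, the term ID, and the `ds:` instance IRIs are hypothetical placeholders, not real ECSO identifiers:

```turtle
@prefix oboe: <http://ecoinformatics.org/oboe/oboe.1.2/oboe-core.owl#> .
@prefix ecso: <http://example.org/ECSO_> .      # hypothetical prefix form
@prefix ds:   <http://example.org/dataset1#> .  # hypothetical dataset

# The column "temp_c" in a dataset is declared to carry measurements
# of a water-temperature characteristic defined in the ontology.
ds:temp_c a oboe:Measurement ;
    oboe:ofCharacteristic ecso:0000512 .        # hypothetical term ID
```

Once data attributes are annotated this way, a search for the concept (rather than for the free-text column name) can retrieve every dataset carrying that kind of measurement.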

This project is supported by the Arctic Data Center.


Primary Mentor: Mark Schildhauer
Secondary Mentor: Julien Brun and Pier Paolo Buttigieg

Read more about the project