
2018 Interns

Pratik Shrivastava

Bio:

I am a second-year master's student pursuing Information Management at the University of Illinois at Urbana-Champaign (UIUC). I received my B.Tech in Information and Communication Technology from DA-IICT, Gandhinagar, Gujarat in 2008. Before pursuing my master's degree, I worked as a Senior Software Developer with Oracle for 4 years, and I have more than 8 years of professional experience in the software industry. My research interests revolve around data analytics, data provenance, and data mining. In my free time, I like playing soccer and cricket, and I also enjoy cooking.


Project Description:

Reliable determination of file formats is necessary to help ensure that appropriate processing can be applied to a file. This is especially important when files are intended to be reused in the future, since any knowledge of the producing system may be lost. There are many subtle variations in file formats that have significant implications for consumers. For example, many metadata standards are serialized as XML (text/xml or application/xml media type), but more detail is required for actual processing of the metadata. This information is usually available through a combination of the namespace(s) and schema(s) referenced by the XML. Manual interpretation of this information is relatively straightforward, but it is error-prone because of the subtle differences that may be present.
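As a rough illustration of the namespace and schema information mentioned above, the following Python sketch reads the root element's namespace and any xsi:schemaLocation pairs from an XML document. The function name describe_xml and the example.org namespaces are hypothetical and not part of the project; the sketch only shows how two documents that both report as text/xml can still differ in the detail a consumer needs.

```python
import xml.etree.ElementTree as ET

# Standard namespace of the XML Schema instance attributes (xsi:schemaLocation).
XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"


def describe_xml(xml_text: str) -> dict:
    """Return the root element name, its namespace, and referenced schemas.

    Hypothetical helper for illustration only.
    """
    root = ET.fromstring(xml_text)
    # ElementTree stores qualified names as "{namespace}localname".
    if root.tag.startswith("{"):
        namespace, local_name = root.tag[1:].split("}", 1)
    else:
        namespace, local_name = None, root.tag
    # xsi:schemaLocation holds whitespace-separated (namespace, schema URL) pairs.
    tokens = root.get(f"{{{XSI_NS}}}schemaLocation", "").split()
    schemas = dict(zip(tokens[0::2], tokens[1::2]))
    return {"root": local_name, "namespace": namespace, "schemas": schemas}


# Hypothetical document: the namespace and schema URL are made up.
SAMPLE = (
    '<m:record xmlns:m="http://example.org/metadata/v2" '
    'xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" '
    'xsi:schemaLocation="http://example.org/metadata/v2 '
    'http://example.org/metadata/v2/record.xsd">'
    "<m:title>Example</m:title></m:record>"
)

info = describe_xml(SAMPLE)
print(info["root"], info["namespace"])
```

A difference as small as v1 versus v2 in the namespace is easy to overlook when reading by eye, which is the kind of subtle variation that motivates automatic identification.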

The goal of this project is to extend the capabilities of the Linux file command (or its equivalents on OS X and Windows) to allow automatic identification of common science metadata and data formats. Two main activities are anticipated to achieve this goal: 1) supporting additional file formats by extending or adding to the existing "magic" configuration files used by the file command, which contain rules that identify files by matching patterns within them; and 2) providing a simple REST service that accepts a file (or a portion thereof) and returns a JSON-encoded response containing the identification of the file as provided by the file command.
- https://github.com/threatstack/libmagic
- http://jhove.openpreservation.org/
- https://linux.die.net/man/1/file
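The two activities described above can be sketched in Python as follows. This is a simplified stand-in, not libmagic or the file command itself: the Rule structure, the rule list, the media type labels, and the JSON field names are all assumptions made for illustration. It shows the general idea of matching a byte pattern at an offset and wrapping the result in a JSON response; a real service would delegate identification to the file command and its magic files.

```python
import json
from typing import NamedTuple, Optional


class Rule(NamedTuple):
    """A simplified magic-style rule: bytes expected at a given offset."""
    offset: int
    pattern: bytes
    media_type: str   # assumed label, not taken from an official registry
    description: str


# Illustrative rules only; real magic files are far richer than this.
RULES = [
    Rule(0, b"\x89HDF\r\n\x1a\n", "application/x-hdf5", "HDF5 data"),
    Rule(0, b"CDF\x01", "application/x-netcdf", "NetCDF classic data"),
    Rule(0, b"<?xml", "text/xml", "XML document"),
]


def identify(data: bytes) -> Optional[Rule]:
    """Return the first rule whose pattern appears at its offset, if any."""
    for rule in RULES:
        end = rule.offset + len(rule.pattern)
        if data[rule.offset:end] == rule.pattern:
            return rule
    return None


def identification_response(name: str, data: bytes) -> str:
    """Build the JSON body a REST endpoint might return (hypothetical fields)."""
    rule = identify(data)
    body = {
        "name": name,
        "media_type": rule.media_type if rule else "application/octet-stream",
        "description": rule.description if rule else "data",
    }
    return json.dumps(body, sort_keys=True)


# Only the leading bytes are needed, matching the "portion thereof" idea.
print(identification_response("example.nc", b"CDF\x01\x00\x00\x00\x00"))
```

Because the rules inspect only a few leading bytes, a client could send just the start of a large file rather than uploading it whole.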


Primary Mentor: Dave Vieglais


Seokki Lee

Bio:

Seokki Lee is a fourth-year PhD student in the database group at Illinois Institute of Technology. His research focuses on data provenance and missing answers, specifically on unifying provenance and missing answers, together with approximate summaries, for queries with negation within a single system. His background includes master's degrees in Computer Science and Engineering from Hanyang University and in Engineering Management from California State University, Northridge. He holds a bachelor's degree in Computer Engineering from Dankook University. Outside of work and research, he enjoys playing tennis and spending time with his family.


Project Description:

Reproducible research is the essence of empirical science, yet common practices fall short of producing results that are fully transparent and reproducible. This summer internship will focus on building a collaboration between the intern and working groups conducting ecological synthesis at NCEAS, with the aim of storing the results of these computational syntheses in a fully reproducible and transparent manner using provenance tools and standards from DataONE. The intern will work with researchers to understand and conduct computational analysis in a reproducible manner, and then use the Whole Tale system to document and archive “tales” in DataONE. In Whole Tale, a tale represents a set of scientific results, such as modeling output, figures, tables, and derived data, along with the documentation needed to understand those outputs and their linkages to the computational processes that generated them. This provenance information includes a full manifest of the data inputs, the computational code and processes used, the outputs, and the execution environment from which the results were generated. The Whole Tale system can be used to generate this package of reproducible research results. We expect the intern to identify one or two existing working groups at NCEAS that are conducting synthesis, help those groups produce fully reproducible tales describing their results, and publish these with a DOI in DataONE.


Primary Mentor: Bertram Ludaescher


Rob Crystal-Ornelas

Bio:

Rob Crystal-Ornelas is a fourth-year PhD candidate in Ecology and Evolution at Rutgers University. He is also a visiting graduate researcher at the UC Davis Bodega Marine Lab. Rob uses meta-analysis and systematic review to explore research biases in invasion ecology. In his free time, Rob hosts a podcast called Science in Progress (https://scienceinprogress.netlify.com/), makes art, and enjoys running.


Project Description:

Obtaining metrics on the use of DataONE in the development of published research is challenging. Products of synthesis research appropriately cite the data objects and the repositories in which they are hosted, but not the method through which the data were discovered.

This project will conduct a systematic review of published synthesis papers in Earth and environmental science. From this set of papers, we will identify the data used and the repositories in which they are located, and explore whether those data are currently exposed within DataONE. This will enable us to demonstrate the percentage of synthesis papers that could have been completed using data currently found in DataONE, and also to identify additional data repositories that DataONE might seek to include as part of the network.
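The percentage described above could be computed along the following lines once the review data are collected. This Python sketch is only an illustration: the paper identifiers and repository names are placeholders, not results, and treating a paper as covered only when every repository it cites is exposed in DataONE is an assumed rule rather than the project's stated method.

```python
def coverage_summary(paper_repos, dataone_repos):
    """Summarize how many papers rely only on repositories exposed in DataONE.

    paper_repos maps a paper identifier to the set of repositories it cites.
    Returns (percent of papers fully covered, sorted repositories not covered).
    """
    covered = [
        paper for paper, repos in paper_repos.items()
        if repos and repos <= dataone_repos  # every cited repository is covered
    ]
    missing = set().union(*paper_repos.values()) - dataone_repos
    percent = 100.0 * len(covered) / len(paper_repos) if paper_repos else 0.0
    return percent, sorted(missing)


# Placeholder data for illustration only; not findings from the review.
papers = {
    "paper-1": {"RepoA"},
    "paper-2": {"RepoA", "RepoB"},
    "paper-3": {"RepoC"},
    "paper-4": {"RepoB"},
}
in_dataone = {"RepoA", "RepoB"}

percent, candidates = coverage_summary(papers, in_dataone)
print(f"{percent:.1f}% of papers covered; candidate repositories: {candidates}")
```

The list of repositories not covered corresponds to the additional repositories that DataONE might consider adding to the network.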


Primary Mentor: Megan Mach
Secondary Mentor: Amber Budden


Giancarlo Sadoti

Bio:

Giancarlo Sadoti is a postdoctoral researcher at the University of Nevada, Reno (UNR) with a background in biogeography and avian ecology. He currently researches the ecology of birds in the southwestern U.S. and the climatology and ecology of Alaska. His interest in interning with DataONE follows nearly a decade of interest in the utility of long-term, public, museum, and citizen-science data sets for investigating environmental change. Giancarlo holds a BA from Prescott College, an MS in Environmental Science and Wildlife Resources from the University of Idaho, and a PhD in Geography from UNR.


Project Description:

Obtaining metrics on the use of DataONE in the development of published research is challenging. Products of synthesis research appropriately cite the data objects and the repositories in which they are hosted, but not the method through which the data were discovered.

This project will conduct a systematic review of published synthesis papers in Earth and environmental science. From this set of papers, we will identify the data used and the repositories in which they are located, and explore whether those data are currently exposed within DataONE. This will enable us to demonstrate the percentage of synthesis papers that could have been completed using data currently found in DataONE, and also to identify additional data repositories that DataONE might seek to include as part of the network.


Primary Mentor: Megan Mach
Secondary Mentor: Amber Budden
