
2011 Interns

Shan Huang

Bio: Ms. Huang is a PhD student at the Odum School of Ecology, University of Georgia. Her research uses broad-scale data available through public online resources to explore global patterns of biodiversity, including phylogenetically informed analyses of the factors that influence the diversity of mammals and their parasites. She is very excited to be part of the DataONE summer project and looks forward to creating a module that teaches undergraduate students how to access and analyze environmental and ecological data.
Project Description: A graduate student intern will create an educational module for use in undergraduate classrooms. The module will be designed to teach basic principles in ecology or environmental science using data that are publicly available through the DataONE network. The student will work with mentors to choose appropriate data sets, questions, and analyses, and will create a simple program to access and analyze the data in R. The student will also create documentation that accompanies the exercise, potentially in multimedia formats, to train instructors to use the exercise in classrooms.
Primary Mentor: Stephanie Hampton
Secondary Mentors: Carly Strasser and Amber Budden

Read more about the project

Michelle Chang

Bio: Ms. Chang graduated from the University of California, Irvine with a bachelor's degree in Ecology and Evolutionary Biological Sciences and subsequently worked for two years as a field technician with Dr. Katharine Suding at the University of California, Berkeley. Currently, she is a master's student at the Bren School of Environmental Science & Management at the University of California, Santa Barbara. She has a strong interest in restoration, natural resources management, and conservation planning.
Project Description: No one is certain how much ecological data exists, or how this amount compares to the volume of data currently housed in repositories such as the Knowledge Network for Biocomplexity (KNB). An estimate would be useful for designing infrastructure, and also as a call to arms for ecologists to start sharing this “dark data”. For this project, we will develop a method for estimating the amount of ecological data being generated, with a focus on “small science” projects. Initially, the project will involve brainstorming about the best way to estimate such a complex figure; the intern will then be tasked with producing the estimate using the decided-upon methods. Potential estimation methods include sampling publications, surveying scientists, and exploring existing databases. We anticipate that results from this project will be widely cited, since such an estimate informs discussions about data sharing, data reuse, and repository development in ecology.
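One of the estimation strategies mentioned above, sampling publications, can be sketched roughly as follows. This is a hypothetical illustration only; the function, sample sizes, and publication count are all invented for the example:

```python
# Hypothetical sketch of a publication-sampling estimator for the total
# volume of "small science" ecological data. All numbers are made up.

def estimate_total_data(sample_sizes_mb, total_publications):
    """Extrapolate from a random sample of publications, each annotated
    with the size (in MB) of the data underlying it, to the literature
    as a whole."""
    mean_mb = sum(sample_sizes_mb) / len(sample_sizes_mb)
    return mean_mb * total_publications

# Data sizes (MB) recorded for a hypothetical sample of 10 papers:
sample = [2.5, 0.8, 14.0, 1.2, 0.3, 5.5, 2.0, 0.9, 7.1, 3.7]
total = estimate_total_data(sample, total_publications=90_000)
print(f"Estimated total: {total / 1000:.0f} GB")
```

A real estimate would of course need stratification by subfield and repository, and confidence intervals around the mean; this sketch only shows the basic extrapolation step.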
Primary Mentor: Carly Strasser
Secondary Mentor: Stephanie Hampton

Read more about the project

Aida Gandara

Bio: Aida Gandara is a PhD student in the Computer Science Department at the University of Texas at El Paso, and a Cyber-ShARE research student in the final year of her dissertation research. Her research focuses on collaborative scientific systems, helping scientific teams describe and discuss their research in order to share it over the Semantic Web.
Project Description: The Linked Data conventions describe four principles that allow data of any kind and from any online source to form a global interconnected web of data: i) name every "thing" that has some data or information associated with it; ii) use HTTP URIs to do so; iii) provide useful information or data in Resource Description Framework (RDF) format to someone looking up such URIs; and iv) within information provided this way, link to other common "things", such as points or axes of reference, and use common vocabularies to attach meaning to links wherever possible. The idea of this project is to develop an exploratory prototype, and practical recommendations resulting from it, for how the heterogeneous and loosely structured data held in non-specialized DataONE member nodes can be exposed to the Linked (Open) Data (LOD) cloud. The approach consists of obtaining a sufficiently representative sample of data sets from DataONE's initial three member nodes (Dryad, KNB, and ORNL-DAAC) and using them as instance data for which to define the RDF predicate vocabularies, domain ontologies, resource URIs, and conversion mechanisms necessary to create a LOD representation of these data. This representation can then be uploaded to, navigated, and queried in one of the web-based LOD browsers (such as URIburner) or, for example, in a local installation of OpenLink Virtuoso.
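The four principles can be illustrated with a minimal, dependency-free sketch that names a hypothetical dataset with an HTTP URI, describes it with common vocabularies (Dublin Core terms and DCMI types), and links it to an external resource. The dataset and person URIs here are invented for illustration:

```python
# Minimal sketch of Linked Data triples for a hypothetical dataset
# record. Only the vocabulary namespaces are real; the dataset, person,
# and place URIs are examples.

DCTERMS = "http://purl.org/dc/terms/"
DCMITYPE = "http://purl.org/dc/dcmitype/"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Principles (i) and (ii): name the "thing" with an HTTP URI.
dataset = "http://example.org/datasets/knb.123.1"

# Principle (iii): information about that URI, expressed as RDF triples.
# Principle (iv): link to other things and reuse common vocabularies.
triples = [
    (dataset, RDF + "type", DCMITYPE + "Dataset"),
    (dataset, DCTERMS + "title", '"Plot-level biomass measurements"'),
    (dataset, DCTERMS + "creator", "http://example.org/people/jdoe"),
    (dataset, DCTERMS + "spatial", "http://example.org/places/site-42"),
]

def to_ntriples(triples):
    """Serialize (s, p, o) tuples as N-Triples; literals are pre-quoted."""
    lines = []
    for s, p, o in triples:
        obj = o if o.startswith('"') else f"<{o}>"
        lines.append(f"<{s}> <{p}> {obj} .")
    return "\n".join(lines)

print(to_ntriples(triples))
```

In practice one would use an RDF library (such as rdflib) rather than hand-rolled serialization, but the shape of the output is the same: each statement names a resource, a predicate from a shared vocabulary, and either a literal or a link to another resource.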
Primary Mentor: Hilmar Lapp

Read more about the project

Melody Basham

Bio: Ms. Basham is pursuing her doctorate at Arizona State University in Educational Leadership and Innovation. Her current research involves assessing the impact of citizen science as an integrated ESL curriculum for Hispanic immigrant adult learners based in Phoenix, Arizona. Her background is largely in the fields of anthropology, archaeology, and the life sciences, with work experience in media and educational publishing.
Project Description: DataONE is developing online learning modules designed to educate DataONE users in various aspects of the data lifecycle. This project involves: 1) researching and acquiring software that can produce high-quality online learning; 2) developing online learning modules using PowerPoint slides prepared by the DataONE Community Engagement and Education Working Group; 3) adding content about data management; and 4) participating in a workshop hosted by DataONE to refine and add content to the educational modules (July 2011).
Primary Mentor: Viv Hutchison
Secondary Mentor: Carly Strasser

Read more about the project

Saumen Dey

Bio: Mr. Dey is a PhD student in the Department of Computer Science. He holds an MBM in Systems and Operations Research from the University of Calcutta, India, and a B.Sc. in Mathematics from Jadavpur University, India.
Research Interests: privacy-aware provenance publication, data-intensive web applications, and cloud computing.
Publications:
Dey, S., Zinn, D., Ludäscher, B.: PROPUB: Towards a Declarative Approach for Publishing Customized, Policy-Aware Provenance. In: Scientific and Statistical Database Management Conference, 2011 (to appear).
Dey, S., Zinn, D., Ludäscher, B.: Publishing Privacy-Aware Provenance by Inventing Anonymous Nodes. In: Resource Discovery (RED) Workshop, part of the Extended Semantic Web Conference, 2011.
Missier, P., Ludäscher, B., Bowers, S., Dey, S., Sarkar, A., Shrestha, B., Altintas, I., Anand, M., Goble, C.: Linking multiple workflow provenance traces for interoperable collaborative science. In: Workflows in Support of Large-Scale Science (WORKS), IEEE, 2010, pp. 1–8.
Project Description: Scientific workflow systems are increasingly used to automate scientific computations and data analysis and visualization pipelines. An important feature of scientific workflow systems is their ability to record and subsequently query and visualize provenance information. Provenance includes the processing history and lineage of data and can be used, e.g., to validate or invalidate outputs, debug workflows, and document authorship and attribution chains, thus facilitating “reproducible science”. We aim to develop (1) a provenance repository system for publishing and sharing data provenance collected from runs of a number of scientific workflow systems (Kepler, Taverna, VisTrails), together with (2) a provenance trace publication system that allows scientists to interactively and graphically select relevant fragments of a provenance trace for publishing. The selection may be driven by the need to protect private information, and thus includes hiding, abstracting, or anonymizing irrelevant or sensitive parts. Part (1) will be based on a DataONE extension of the Open Provenance Model (D1-OPM) and will leverage an earlier Summer of Code project; in particular, the provenance toolkit includes an API for managing workflow provenance (i.e., uploading into and retrieving from a data storage back-end). Part (2) will implement a new policy-aware approach to publishing provenance, which aims at reconciling a user’s (selective) provenance publication requests with agreed-upon provenance integrity constraints. On top of an existing rule-based backend, a graphical user environment will be developed that lets users select, abstract, hide, and anonymize provenance graph fragments prior to publication.
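The anonymization step in part (2) can be sketched in miniature as follows, assuming a provenance trace represented simply as a list of dependency edges. The function and all node names are hypothetical, not part of the actual PROPUB system:

```python
# Hypothetical sketch: anonymize selected nodes of a provenance trace
# while keeping the lineage structure intact, in the spirit of the
# policy-aware publication step described above.

def anonymize(edges, sensitive):
    """Replace each sensitive node with a stable anonymous label;
    every edge (and thus the lineage) is preserved."""
    mapping = {}
    def rename(node):
        if node in sensitive:
            mapping.setdefault(node, f"anon_{len(mapping) + 1}")
            return mapping[node]
        return node
    return [(rename(src), rename(dst)) for src, dst in edges]

# A toy trace: raw_data -> clean -> stats -> figure
trace = [("raw_data", "clean"), ("clean", "stats"), ("stats", "figure")]
published = anonymize(trace, sensitive={"raw_data", "clean"})
print(published)  # raw_data and clean replaced by anon_1 / anon_2
```

The real system must additionally check the user's selections against provenance integrity constraints (e.g., that no published node is left with a dangling, unexplained input), which this toy version omits.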
Primary Mentor: Bertram Ludäscher
Secondary Mentor: Paolo Missier

Read more about the project

Michael Agun

Bio: Mr. Agun is from the Seattle, WA area and will be a senior at Gonzaga University this fall, majoring in computer science. He is also involved in ham radio and emergency communications.
Project Description: Scientific workflow systems are increasingly used to automate scientific computations and data analysis and visualization pipelines. An important feature of scientific workflow systems is their ability to record and subsequently query and visualize provenance information. Provenance includes the processing history and lineage of data and can be used, e.g., to validate or invalidate outputs, debug workflows, and document authorship and attribution chains, thus facilitating “reproducible science”. We aim to develop (1) a provenance repository system for publishing and sharing data provenance collected from runs of a number of scientific workflow systems (Kepler, Taverna, VisTrails), together with (2) a provenance trace publication system that allows scientists to interactively and graphically select relevant fragments of a provenance trace for publishing. The selection may be driven by the need to protect private information, and thus includes hiding, abstracting, or anonymizing irrelevant or sensitive parts. Part (1) will be based on a DataONE extension of the Open Provenance Model (D1-OPM) and will leverage an earlier Summer of Code project; in particular, the provenance toolkit includes an API for managing workflow provenance (i.e., uploading into and retrieving from a data storage back-end). Part (2) will implement a new policy-aware approach to publishing provenance, which aims at reconciling a user’s (selective) provenance publication requests with agreed-upon provenance integrity constraints. On top of an existing rule-based backend, a graphical user environment will be developed that lets users select, abstract, hide, and anonymize provenance graph fragments prior to publication.
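The query side of the provenance repository in part (1) can likewise be illustrated with a toy lineage lookup over a trace stored as dependency edges. The representation and all names below are invented for illustration; the real toolkit exposes this through its provenance API:

```python
# Hypothetical sketch of a basic provenance query: find every upstream
# artifact (the full lineage) of a given output, with the trace stored
# as (source -> derived) dependency edges.

def lineage(edges, target):
    """Return the set of all ancestors of `target` by walking the
    dependency edges backwards."""
    parents = {}
    for src, dst in edges:
        parents.setdefault(dst, set()).add(src)
    seen, stack = set(), [target]
    while stack:
        for p in parents.get(stack.pop(), ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

trace = [("raw", "filtered"), ("filtered", "model"),
         ("params", "model"), ("model", "plot")]
print(sorted(lineage(trace, "plot")))  # ['filtered', 'model', 'params', 'raw']
```

Queries like this are what make provenance useful for validating outputs and debugging workflows: the full set of inputs and intermediate steps behind any result can be recovered on demand.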
Primary Mentor: Bertram Ludäscher
Secondary Mentors: Paolo Missier and Shawn Bowers

Read more about the project

Jonathan Carlson

Bio: Mr. Carlson is currently pursuing a Master of Arts in Library and Information Studies at the University of Wisconsin-Madison. He has a strong interest in the sciences, as evidenced by his Master of Science in Forest Ecology and Management from Michigan Technological University and his Bachelor of Arts in Geology and Environmental Studies from Gustavus Adolphus College. In his free time, Mr. Carlson enjoys reading, vegetarian cooking, and getting outside to hike and bike.
Project Description: We believe that openly archiving raw data facilitates valuable reuse. Can we measure this? What contribution does data reuse make to the published literature? Who reanalyzes data? For what? Does this vary across disciplines and repositories? These questions are the focus of an exploratory study, "Tracking data reuse: Following one thousand datasets from public repositories into the published literature." In this internship you'll work directly with Heather to collect, extract, annotate, and analyze data to explore these important questions. See http://bit.ly/cPsek0 for more info on the project.
Primary Mentor: Heather Piwowar
Secondary Mentor: Todd Vision

Read more about the project

Richard Littauer

Bio: Mr. Littauer grew up in Hartford, Connecticut. He received an MA in Linguistics from the University of Edinburgh in 2011. He plans to continue studying linguistics and genetics in graduate school.
Project Description: Scientists use a wide variety of tools and techniques to manage and analyze data. However, to our knowledge no one has taken a systematic look at how scientists actually do this work. In this project, we will examine a large number of existing scientific workflows. We will develop a way of categorizing workflows based on their complexity, the types of processing steps employed, and other factors. The goal is to develop a new and significant understanding of the scientific process and how it is enabled by scientific workflows.
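A very simple starting point for the categorization described above is to count step types and use the total step count as a crude complexity score. The workflow encoding, step types, and thresholds here are all hypothetical, chosen only to show the shape of such a scheme:

```python
# Hypothetical sketch: categorize a workflow by the kinds of steps it
# contains plus a crude complexity score (total number of steps).

from collections import Counter

def categorize(workflow):
    """`workflow` is a list of (step_name, step_type) pairs."""
    type_counts = Counter(step_type for _, step_type in workflow)
    complexity = ("simple" if len(workflow) <= 3 else
                  "moderate" if len(workflow) <= 10 else "complex")
    return {"steps": len(workflow),
            "by_type": dict(type_counts),
            "complexity": complexity}

wf = [("fetch_csv", "data_access"), ("drop_na", "cleaning"),
      ("fit_glm", "analysis"), ("plot_fit", "visualization")]
print(categorize(wf))
```

A real classification would also need to account for workflow structure (branching, loops, nesting) rather than step counts alone, which is part of what the project sets out to determine.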
Primary Mentor: Bill Michener
Secondary Mentors: Rebecca Koskela and Bertram Ludäscher

Read more about the project