Understanding How Scientists Analyze Data
Richard Littauer
Bio: Mr. Littauer grew up in Hartford, Connecticut. He received an MA in Linguistics from the University of Edinburgh in 2011. He plans to continue studying Linguistics and Genetics in graduate school.
Project Description: Scientists use a wide variety of tools and techniques to manage and analyze data. However, to our knowledge no one has taken a systematic look at how scientists do their work. In this project, we will examine a large number of the scientific workflows that have been constructed. We will develop a way of categorizing workflows based on their complexity, types of processing steps employed, and other factors. The goal is to develop new and significant understanding of the scientific process and how it is being enabled by science workflows.
Primary Mentor: Bill Michener, Secondary Mentors: Rebecca Koskela and Bertram Ludaescher
How Much Ecological Data is Out There?
Michelle Chang
Bio: Ms. Chang graduated from the University of California Irvine with a bachelor's degree in Ecology and Evolutionary Biological Sciences and subsequently worked for two years as a field technician with Dr. Katharine Suding at the University of California Berkeley. She is currently a Master's student at the Bren School of Environmental Sciences & Management at the University of California Santa Barbara. She has a strong interest in restoration, natural resources management, and conservation planning.
Project Description: No one is certain how much ecological data exists, or how this amount compares to the volume of data currently housed in repositories such as the Knowledge Network for Biocomplexity (KNB). Determining this would be useful for designing infrastructure, and also as a call to arms for ecologists to start sharing this “dark data”. For this project, we will develop a method for estimating the amount of ecological data being generated, with a focus on “small science” projects. Initially this project will involve brainstorming about the best way to estimate such a complex figure; the intern will then be tasked with producing the estimate using the agreed-upon methods. Potential methods for estimation include sampling publications, surveying scientists, or exploring existing databases. We foresee that results from this project will be highly cited, since such an estimate is useful for discussions about data sharing, data reuse, and repository development in ecology.
Primary Mentor: Carly Strasser, Secondary Mentor: Stephanie Hampton
Tracking the Reuse of 1000 Datasets
Jonathan Carlson
Bio: Mr. Carlson is currently pursuing a Master of Arts in Library and Information Studies at the University of Wisconsin-Madison. He has a strong interest in the sciences, as evidenced by his Master of Science in Forest Ecology and Management from Michigan Technological University and Bachelor of Arts in Geology and Environmental Studies from Gustavus Adolphus College. In his free time Mr. Carlson enjoys reading, vegetarian cooking, and getting outside to hike and bike.
Project Description: We believe that openly archiving raw data facilitates valuable reuse. Can we measure this? What contribution does data reuse make to the published literature? Who reanalyzes data? For what? Does this vary across disciplines and repositories? These questions are the focus of an exploratory study, "Tracking data reuse: Following one thousand datasets from public repositories into the published literature." In this internship you'll work directly with Heather to collect, extract, annotate, and analyze data to explore these important questions. See http://bit.ly/cPsek0 for more info on the project.
Primary Mentor: Heather Piwowar, Secondary Mentor: Todd Vision
Scientific Workflow Provenance Repository and Publishing Toolkit
Saumen Dey
Bio: Ph.D. student, Dept. of Computer Science; MBM Systems and Operations Research, University of Calcutta, India; B.Sc, Mathematics, Jadavpur University, India
Research Interests: Privacy-Aware Provenance Publication, Data-Intensive Web Applications, and Cloud Computing.
Publications: Dey, S., Zinn, D., Ludäscher, B.: PROPUB: Towards a Declarative Approach for Publishing Customized, Policy-Aware Provenance. In: Scientific and Statistical Database Management Conference (to appear). (2011); Dey, S., Zinn, D., Ludäscher, B.: Publishing Privacy-Aware Provenance by Inventing Anonymous Nodes. Resource Discovery (RED) 2011 Workshop (part of Extended Semantic Web Conference 2011).; Missier, P., Ludäscher, B., Bowers, S., Dey, S., Sarkar, A., Shrestha, B., Altintas, I., Anand, M., Goble, C.: Linking multiple workflow provenance traces for interoperable collaborative science. In: Workflows in Support of Large-Scale Science (WORKS), 2010, IEEE 1–8
Project Description: Scientific workflow systems are increasingly used to automate scientific computations and data analysis and visualization pipelines. An important feature of these systems is their ability to record, and subsequently query and visualize, provenance information. Provenance includes the processing history and lineage of data and can be used, for example, to validate or invalidate outputs, debug workflows, and document authorship and attribution chains, thereby facilitating “reproducible science”. We aim to develop (1) a provenance repository system for publishing and sharing data provenance collected from runs of several scientific workflow systems (Kepler, Taverna, VisTrails), together with (2) a provenance trace publication system that allows scientists to interactively and graphically select relevant fragments of a provenance trace for publishing. The selection may be driven by the need to protect private information, and may therefore involve hiding, abstracting, or anonymizing irrelevant or sensitive parts. Part (1) will be based on a DataONE extension of the Open Provenance Model (D1-OPM) and will leverage an earlier Summer of Code project. In particular, the provenance toolkit includes an API for managing workflow provenance (i.e., uploading into and retrieving from a data storage back-end). Part (2) will implement a new policy-aware approach to publishing provenance, which aims to reconcile a user’s (selective) provenance publication requests with agreed-upon provenance integrity constraints. For an existing rule-based back-end, a graphical user environment needs to be developed that lets users select, abstract, hide, and anonymize provenance graph fragments prior to publication.
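To make the ideas of lineage and anonymization in the description above concrete, here is a minimal sketch in Python. It is not the project's toolkit or the D1-OPM API: the graph layout, the function names (`lineage`, `anonymize`), and the sample trace are all hypothetical and exist only for illustration. A provenance trace is modelled as a mapping from each item to the items it was directly derived from.

```python
# Hypothetical provenance trace: each key was derived from the items in its list.
# This layout is an assumption for illustration, not the D1-OPM format.
trace = {
    "figure.png": ["plot_step"],
    "plot_step": ["clean.csv"],
    "clean.csv": ["clean_step"],
    "clean_step": ["raw.csv"],
}

def lineage(derived_from, item):
    """Return every node that `item` transitively depends on."""
    seen = set()
    stack = [item]
    while stack:
        node = stack.pop()
        for parent in derived_from.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def anonymize(derived_from, sensitive):
    """Replace sensitive node names with opaque labels, keeping the graph shape."""
    labels = {name: f"anon_{i}" for i, name in enumerate(sorted(sensitive))}

    def rename(node):
        return labels.get(node, node)

    return {rename(child): [rename(p) for p in parents]
            for child, parents in derived_from.items()}

# Everything the figure depends on, directly or indirectly.
print(sorted(lineage(trace, "figure.png")))

# The same trace with the raw input file hidden behind an opaque label.
print(anonymize(trace, {"raw.csv"}))
```

Renaming rather than deleting a sensitive node keeps the dependency chain intact, which is one simple way to publish a trace without exposing a private input. A real system would also have to check the result against integrity constraints, which this sketch does not attempt.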
Primary Mentor: Bertram Ludaescher, Secondary Mentor: Paolo Missier
Scientific Workflow Provenance Repository and Publishing Toolkit
Michael Agun
Bio: Mr. Agun is from the Seattle, WA area and will be a senior at Gonzaga University this fall, majoring in computer science. He is also involved in Ham Radio and emergency communications.
Project Description: Scientific workflow systems are increasingly used to automate scientific computations and data analysis and visualization pipelines. An important feature of these systems is their ability to record, and subsequently query and visualize, provenance information. Provenance includes the processing history and lineage of data and can be used, for example, to validate or invalidate outputs, debug workflows, and document authorship and attribution chains, thereby facilitating “reproducible science”. We aim to develop (1) a provenance repository system for publishing and sharing data provenance collected from runs of several scientific workflow systems (Kepler, Taverna, VisTrails), together with (2) a provenance trace publication system that allows scientists to interactively and graphically select relevant fragments of a provenance trace for publishing. The selection may be driven by the need to protect private information, and may therefore involve hiding, abstracting, or anonymizing irrelevant or sensitive parts. Part (1) will be based on a DataONE extension of the Open Provenance Model (D1-OPM) and will leverage an earlier Summer of Code project. In particular, the provenance toolkit includes an API for managing workflow provenance (i.e., uploading into and retrieving from a data storage back-end). Part (2) will implement a new policy-aware approach to publishing provenance, which aims to reconcile a user’s (selective) provenance publication requests with agreed-upon provenance integrity constraints. For an existing rule-based back-end, a graphical user environment needs to be developed that lets users select, abstract, hide, and anonymize provenance graph fragments prior to publication.
Primary Mentor: Bertram Ludaescher, Secondary Mentors: Paolo Missier and Shawn Bowers
Integrating Loosely Structured Data Into the Linked Open Data Cloud
Aida Gandara
Bio: Aida Gandara is a PhD student in the Computer Science Department at the University of Texas at El Paso. She is a Cyber-ShARE research student in her final year of dissertation research. Her research focuses on collaborative scientific systems, in particular on helping scientific teams describe and discuss their research so that they can share it over the Semantic Web.
Project Description: The Linked Data conventions describe four principles that allow data of any kind and from any online source to form a global interconnected web of data: i) name every "thing" that has some data or information associated with it; ii) use HTTP URIs to do so; iii) provide useful information or data in Resource Description Framework (RDF) format to someone looking up such URIs; and iv) within information provided this way, link to other common "things", such as points or axes of reference, and use common vocabularies to attach meaning to links wherever possible. The idea of this project is to develop an exploratory prototype, and practical recommendations resulting from it, for how the heterogeneous and loosely structured data held in non-specialized DataONE member nodes can be exposed to the Linked (Open) Data cloud. The approach would consist of obtaining a sufficiently representative sample of data sets from DataONE's initial three member nodes (Dryad, KNB, and ORNL-DAAC), and using them as instance data for which to define the RDF predicate vocabularies, domain ontologies, resource URIs, and conversion mechanisms that are necessary to create a LOD representation of these data. This representation can then be uploaded to, navigated in, and queried in either one of the web-based LOD browsers (such as URIburner) or, for example, a local installation of OpenLink Virtuoso.
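As a rough illustration of the four principles listed above, the following Python sketch turns one flat metadata record into N-Triples, a line-based RDF serialization. It is only a sketch under stated assumptions: the base URI `http://example.org/dataset/`, the sample record, and the function name `to_ntriples` are hypothetical and do not come from DataONE or its member nodes. The predicates reuse the Dublin Core Terms vocabulary as an example of a common vocabulary.

```python
# Hypothetical base URI for naming datasets (principles i and ii).
BASE = "http://example.org/dataset/"
# Dublin Core Terms namespace, used here as a shared vocabulary (principle iv).
DCTERMS = "http://purl.org/dc/terms/"

def to_ntriples(record):
    """Serialize one flat record as a list of N-Triples lines (principle iii)."""
    subject = f"<{BASE}{record['id']}>"
    lines = []
    for key, value in record.items():
        if key == "id":
            continue
        predicate = f"<{DCTERMS}{key}>"
        if value.startswith(("http://", "https://")):
            # A URI value becomes a link to another "thing" (principle iv).
            obj = f"<{value}>"
        else:
            # Simplified literal escaping: backslashes and double quotes only.
            escaped = value.replace("\\", "\\\\").replace('"', '\\"')
            obj = f'"{escaped}"'
        lines.append(f"{subject} {predicate} {obj} .")
    return lines

# Hypothetical sample record; the values are made up for illustration.
record = {
    "id": "42",
    "title": "Soil moisture, plot A",
    "creator": "http://example.org/person/jdoe",
}

for line in to_ntriples(record):
    print(line)
```

Loosely structured data rarely maps this cleanly onto an existing vocabulary, which is why the project description calls for defining predicate vocabularies and conversion mechanisms from a representative sample first.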
Primary Mentor: Hilmar Lapp



