Michael Yoo has just finished his third year as a Computer Engineering major at the University of Illinois at Urbana-Champaign. His interests include data mining and analysis, algorithms, and distributed systems. In his free time, Michael enjoys break-dancing, cooking, and competing at hackathons.
YesWorkflow (YW) is a software toolkit that provides some of the benefits of a scientific workflow system without requiring users to rewrite scripts and other scientific software. Rather than reimplementing code so that it can be executed and managed by a workflow engine, a YW user simply adds special comments to existing scripts to declare, step by step, how the script uses data and produces results. YW uses these comments to create a rendering of the script as a workflow. A YW graphing module currently produces static graphical views (in Graphviz DOT format) of the resulting workflow model of the script.
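To make the idea of annotating a script concrete, the following is a minimal sketch of how declarative comments could be collected into a list of workflow steps. The tag names (`@begin`, `@in`, `@out`, `@end`), the sample script, and the `extract_steps` helper are illustrative assumptions; they are not specified in this project description and the sketch is not the YW implementation.

```python
import re

# Hypothetical annotated script; the tag names are assumptions for illustration.
SCRIPT = """\
# @begin clean_data
# @in raw_table
# @out clean_table
clean_table = [row for row in raw_table if row]
# @end clean_data

# @begin summarize
# @in clean_table
# @out summary
summary = len(clean_table)
# @end summarize
"""

# Matches comment lines such as "# @in raw_table".
TAG = re.compile(r"^\s*#\s*@(begin|in|out|end)\s+(\w+)")


def extract_steps(script_text):
    """Collect each annotated step with its declared inputs and outputs."""
    steps = []
    current = None
    for line in script_text.splitlines():
        match = TAG.match(line)
        if not match:
            continue  # ordinary code line: skipped, never executed
        tag, name = match.groups()
        if tag == "begin":
            current = {"name": name, "in": [], "out": []}
        elif tag == "end":
            if current is not None:
                steps.append(current)
            current = None
        elif current is not None:
            current[tag].append(name)  # tag is "in" or "out"
    return steps


steps = extract_steps(SCRIPT)
```

Only the comments are read here; the script itself is never run, which is the point of describing a workflow without reimplementing the code.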
The static graphs produced by YW can be large and complex. We propose to develop an interactive viewer for YW graphical output that will make these graphs easier to explore and interpret. For example, clicking on a data item in the workflow view will optionally highlight the (prospective) direct and indirect data dependencies for that data item (the data from which it will be derived when the script is run). Features for expanding and collapsing nested subworkflows will also facilitate exploration of these graphs.
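The dependency highlighting described above amounts to collecting everything upstream of the selected data item. A minimal sketch, assuming the workflow is available as a mapping from each data item to the items it is derived from (the `DERIVED_FROM` data and the `upstream_dependencies` name are hypothetical):

```python
from collections import deque

# Hypothetical workflow: each data item maps to the items it is derived from.
DERIVED_FROM = {
    "plot": ["summary"],
    "summary": ["clean_table"],
    "clean_table": ["raw_table", "filter_rules"],
    "raw_table": [],
    "filter_rules": [],
}


def upstream_dependencies(item, derived_from):
    """Return every data item that `item` depends on, directly or indirectly."""
    seen = set()
    queue = deque(derived_from.get(item, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # already visited via another path
        seen.add(current)
        queue.extend(derived_from.get(current, []))
    return seen


# Items a viewer might highlight when the user clicks on "summary".
highlighted = upstream_dependencies("summary", DERIVED_FROM)
```

A breadth-first traversal is used so that shared ancestors are visited once even when the graph has several paths to them.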
The interactive graph tool will also serve as an entry point to discovering and exploring the original scripts and relating them to the workflow graphs. For example, clicking on a node in the YW view will allow the user to inspect the source code behind that node. Similarly, users of data products can view both the YW representation of the script and the original data manipulation code corresponding to blocks in the workflow graphs.
Last but not least, the interactive graph will facilitate the use of YesWorkflow as a design tool when developing new scripts (or even before a script is written) via live-update features. Given a set of script files, the live-graph feature will monitor these files for changes and update the chosen graphical view automatically. Users of this feature will still be able to use their favorite text editor or IDE for developing their scripts.
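One simple way to monitor script files for changes is to poll their modification times. The sketch below assumes polling is acceptable and uses hypothetical helper names (`snapshot`, `changed_files`, `watch`); the actual live-graph feature may well rely on a different mechanism, such as operating system file notifications.

```python
import os
import time


def snapshot(paths):
    """Record the last-modified time of each file that currently exists."""
    return {p: os.stat(p).st_mtime_ns for p in paths if os.path.exists(p)}


def changed_files(previous, current):
    """List files that were added, removed, or modified between two snapshots."""
    all_paths = set(previous) | set(current)
    return sorted(p for p in all_paths if previous.get(p) != current.get(p))


def watch(paths, on_change, interval=1.0, max_polls=None):
    """Poll `paths` and call `on_change(changed)` whenever something differs.

    `max_polls` bounds the loop; None means poll until interrupted.
    """
    previous = snapshot(paths)
    polls = 0
    while max_polls is None or polls < max_polls:
        time.sleep(interval)
        current = snapshot(paths)
        changed = changed_files(previous, current)
        if changed:
            on_change(changed)  # e.g. regenerate the chosen graph view
        previous = current
        polls += 1
```

Because the watcher only looks at files on disk, it works regardless of which editor or IDE saved them.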
Primary Mentor: Bertram Ludäscher
Secondary Mentor: Paolo Missier
Mark Anthony Freeman
Mark has just finished a Master's in Information Science at the University of Tennessee, Knoxville. He studied Economics and Accounting a long, long time ago in Lancaster, England, and started his career in teaching before moving into software and web technologies.
DataONE has a unique mission: it creates a foundation for supporting existing science and for enabling new science. Normal evaluation metrics provide a means for measuring this foundation. However, to our knowledge, there are no agreed-upon metrics for identifying how access to data has created new interconnections, or for predicting the potential for new types of science that may only now be emerging and whose potential is as yet unrealized. Nevertheless, embarking upon the development of both quantitative and qualitative indicators can and should lead to useful metrics that will prove of value to DataONE, other DataNets, and the broader community.
The student will conduct a comprehensive literature review of how impact is currently defined and evaluated, followed by an environmental scan to determine what metrics currently exist for measuring the interconnections, enabled by data infrastructures, that lead to new science. A comprehensive review of related projects, their associated metrics, and outcomes is critical in identifying these potential new metrics. Based on these scans and in consultation with the mentors, the student will then suggest potential new metrics and their associated methodologies that could be adopted by DataONE.
Primary Mentor: Suzie Allard
Secondary Mentor: Mike Frame
Booma Sowkarthiga Balasubramani
Booma is a PhD student in Computer Science at the University of Illinois at Chicago. She holds an M.S. in Software Systems from Birla Institute of Technology and Science, India. Before stepping into her PhD, she worked as a Software Engineering Senior Analyst for a leading software service provider. Her research interests include Semantic Web, Information Retrieval and Data Mining. She is an avid reader and enjoys science fiction.
The Earth Science Ontology Repository (ESOR) portal contains many vocabularies that are important and useful for sharing earth science data. It serves a function similar to that of BioPortal, which contains a collection of ontologies that support biology, health, and life sciences research, but with a focus on the Earth Science domain. DataONE will benefit from a collection of ontologies with well-defined terms that are used in earth science data, so that earth science data may be integrated in a correct and consistent manner and so that search services may be enhanced. Search over the Earth Science Ontology Repository is “smart” in that its implementation is not based only on keyword search; semantic techniques are also involved so that the search functionality can actually “understand” the meaning of terms. ESOR can be used as the backend knowledge base for multiple applications, for example semi-automatic or automatic entity matching.
In order to turn the Earth Science Ontology Repository into a product, we need to create unit tests, that is, a separate standalone testing capability using JUnit, so that we can be confident the repository can handle its different use cases in different situations. This test suite will allow automatic testing of updates, so that the repository can grow with minimal human effort and a level of consistency can be guaranteed.
In order for this repository to be sustainable, we also need simple and automatic (or at least semi-automatic) methods for enhancing the content. Right now, deploying a new ontology to the Earth Science Ontology Repository involves 14 manual steps (see the details of the 14 steps here). To speed up the process, it is necessary to explore possibilities for automation. In this project, robust automatic upload processes will be designed, tested, and deployed.
Meanwhile, for the Earth Science Ontology to be more broadly reused, it needs to conform to the principles of RESTful services, and its API needs to have easily understandable documentation for developers. The summer intern will work with the postdoctoral fellow who has created the ontology and a professor who is a leading expert in ontology environments to complete the project.
Primary Mentor: Deborah L. McGuinness
Secondary Mentor: Xixi Luo
Yue Zhang is a first-year PhD student in Information Studies at Drexel University. Currently she works as a research assistant with Dr. Jane Greenberg at the Metadata Research Center of Drexel University (http://cci.drexel.edu/mrc/). Before her PhD study, she received an M.S. degree in Scientific Computing from New York University and a B.A. degree in Mathematics and Economics from DePauw University. Her research interests include data mining, social network analysis, machine learning, and linked data.
The intern will work with the Director of Community Engagement and Outreach and the co-lead of the DataONE Sustainability and Governance Working Group to critically evaluate the currently engaged DataONE community. Information from mailing lists, social media accounts, webinar participation, and website activity will be used to identify the outer range of the DataONE community. Calibration of user profiles across accounts will enable a more parsimonious estimate of the community, and the intern will then work with these data to construct a network visualization, breaking down clusters by stakeholder group and mode of engagement with DataONE. These data will help inform the project with respect to target markets, future product development, and resource allocation.
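As a rough illustration of why calibrating profiles across accounts gives a smaller estimate of the community, the sketch below merges records from several channels by a normalized identifier. The sample records, the use of an email address as the shared identifier, and the helper names are all assumptions made for illustration; the project description does not say how profiles will be matched.

```python
from collections import defaultdict

# Hypothetical engagement records: (channel, identifier).
RECORDS = [
    ("mailing_list", "Ana@Example.org"),
    ("webinar", "ana@example.org "),
    ("social_media", "ben@example.org"),
    ("mailing_list", "ben@example.org"),
    ("website", "chris@example.org"),
]


def normalize(identifier):
    """Reduce trivial differences (case, stray whitespace) between accounts."""
    return identifier.strip().lower()


def merge_profiles(records):
    """Map each normalized person key to the set of channels they appear in."""
    profiles = defaultdict(set)
    for channel, identifier in records:
        profiles[normalize(identifier)].add(channel)
    return dict(profiles)


profiles = merge_profiles(RECORDS)
raw_count = len(RECORDS)        # one entry per account record
community_size = len(profiles)  # one entry per distinct person

# Person-to-channel edges, a possible input for a network visualization.
edges = sorted(
    (person, channel)
    for person, channels in profiles.items()
    for channel in channels
)
```

The resulting edge list links each person to the modes of engagement they use, which is one possible starting point for clustering by stakeholder group.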
Primary Mentor: Amber Budden
Secondary Mentor: Patricia Cruse, Yiwei Wang