

Markdown-based Semantic Annotation of Workflow Scripts

Xiaoliang Jiang

Xiaoliang Jiang is a second-year master's student in Library and Information Science at the University of Illinois at Urbana-Champaign (UIUC). He received his B.E. in Information Security from the University of Science and Technology Beijing. He worked as an analyst intern in the Office of the Vice Chancellor for Institutional Advancement at UIUC and plans to apply to PhD programs next year. His research interests revolve around data analytics, data mining, and social and information networks. In his free time, he likes reading and playing board games.

Project Description: 

The proposed work will result in an extension to the RStudio environment enabling data analysts to directly publish RDF that richly describes the semantics of their scripts. This work will also include draft best practices that guide practitioners in proper embedding of appropriate concepts and vocabulary from established ontologies (including ProvONE and domain ontologies).

In detail, this work entails an exploration of extending markdown syntax (esp. R Markdown) in concert with knitr to directly produce workflow markup in a human-readable way. A specific example of what this means: when "knitting" a markdown rendition, instead of generating (e.g.) PDF or HTML, the anticipated tool will generate RDF (TTL or JSON-LD) or HTML+RDFa. By "human-readable," we mean that markdown best practices will be developed that are reasonable for a data analyst to use; methods (possibly based on templates) must be developed that do not require the user to "know" RDF. Today we can create cumbersome R Markdown (Rmd) files that produce HTML+RDFa outputs with correct embedded workflow semantics, but the user must be an HTML and RDFa hacker to understand them. Workflow reproducibility requires tools that data analysts will actually use.
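
To make the idea of template-style annotations more concrete, here is a minimal sketch in Python (chosen only for illustration; the anticipated tool would live in R and knitr). It assumes a hypothetical `#@ key: value` comment syntax inside a code chunk and turns those lines into a small JSON-LD description of one workflow step. The annotation syntax, the `chunk_to_jsonld` function, and the `http://example.org/workflow/` base IRI are assumptions made for this sketch, not part of the project. The ProvONE terms and namespace IRI used here should be checked against the ProvONE specification before any real use.

```python
import json
import re

# Hypothetical annotation syntax (an assumption for this sketch):
# a comment line of the form "#@ key: value" inside a code chunk.
ANNOTATION = re.compile(r"^#@\s*(\w+)\s*:\s*(.+?)\s*$")

# Namespace IRIs; verify the ProvONE one against the ProvONE specification.
PROVONE = "http://purl.dataone.org/provone/2015/01/15/ontology#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"


def chunk_to_jsonld(chunk_text, base_iri="http://example.org/workflow/"):
    """Build a JSON-LD dict for one annotated chunk (illustrative only)."""
    # Collect every "#@ key: value" line; a key may appear more than once.
    fields = {}
    for line in chunk_text.splitlines():
        match = ANNOTATION.match(line.strip())
        if match:
            key, value = match.groups()
            fields.setdefault(key, []).append(value)

    step = fields.get("step", ["unnamed"])[0]
    step_iri = base_iri + step

    def ports(kind, names):
        # One port node per declared input or output name.
        return [
            {
                "@id": f"{step_iri}/{kind}/{name}",
                "@type": "provone:Port",
                "rdfs:label": name,
            }
            for name in names
        ]

    return {
        "@context": {"provone": PROVONE, "rdfs": RDFS},
        "@id": step_iri,
        "@type": "provone:Program",
        "rdfs:label": step,
        "provone:hasInPort": ports("in", fields.get("input", [])),
        "provone:hasOutPort": ports("out", fields.get("output", [])),
    }


# An R chunk as the analyst might write it: three annotation lines, then code.
chunk = """\
#@ step: clean_data
#@ input: raw_table
#@ output: clean_table
clean_table <- na.omit(raw_table)
"""

print(json.dumps(chunk_to_jsonld(chunk), indent=2))
```

The point of the sketch is that the analyst writes only short key-value lines; the mapping to RDF vocabulary lives in the tool, so the user does not need to "know" RDF.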

With the right skillset, the intern may develop methods for semi-automatically extracting function and package semantics and encoding these into the resulting graph. By this we mean, in addition to capturing explicit semantics expressed via markdown syntax, the intern may develop a way to further capture "meaning" based on the use of functions in scripts, without requiring users to artificially wrap standard functions; this might be done through "wrapper" functions in R or some other means.
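
One possible reading of "some other means" is a static scan of the script text. The following Python sketch finds function calls in R source with a regular expression and looks them up in a concept table. Everything here is an assumption for illustration: the `CONCEPTS` table, the `ex:` concept names, and the helper functions are hypothetical, and a regex is only a rough stand-in for a real R parser (it ignores `#` inside string literals, for example).

```python
import re

# Matches an optional "pkg::" prefix followed by a function name and "(".
CALL = re.compile(
    r"(?:([A-Za-z][A-Za-z0-9.]*)::)?([A-Za-z.][A-Za-z0-9._]*)\s*\("
)

# R keywords that look like calls syntactically but are not functions of interest.
KEYWORDS = {"if", "for", "while", "function", "repeat"}

# Hypothetical lookup table from function names to concept identifiers.
CONCEPTS = {
    "read.csv": "ex:DataImport",
    "lm": "ex:LinearModelFitting",
    "ggplot": "ex:Visualization",
}


def extract_calls(r_source):
    """Return (package_or_None, function_name) pairs found in R source text."""
    calls = []
    for line in r_source.splitlines():
        # Naive comment stripping: cuts at the first '#', even inside strings.
        code = line.split("#", 1)[0]
        for pkg, fn in CALL.findall(code):
            if fn not in KEYWORDS:
                calls.append((pkg or None, fn))
    return calls


def annotate(r_source):
    """Pair each recognized function call with its concept from CONCEPTS."""
    return [
        (fn, CONCEPTS[fn])
        for _, fn in extract_calls(r_source)
        if fn in CONCEPTS
    ]


script = (
    'd <- read.csv("data.csv")  # load the data\n'
    "fit <- lm(y ~ x, data = d)\n"
    "ggplot2::ggplot(d, aes(x, y))\n"
)

for fn, concept in annotate(script):
    print(fn, "->", concept)
```

Because the scan works on unmodified scripts, the analyst keeps calling standard functions directly; the open question such a sketch leaves is where a trustworthy function-to-concept table would come from.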

This work advances the semantic workflow work inspired by YesWorkflow and leverages standard practices for R extensions, markdown, and publication, creating a direct path for DataONE analysts to get their workflows represented in knowledge graphs. This approach broadens the potential DataONE user base by helping to ensure that their workflows and results are easier to discover and conceptually easier to understand, thereby increasing the likelihood that they will be cited, reused, and reproduced.

Primary Mentor: 
Deborah McGuinness
Secondary Mentor: 
John Erickson (RPI)