Scientific Workflows and Provenance Working Group

Provenance in Scientific Workflows (ProvWG)
The DataONE ProvWG is developing an open and extensible provenance management architecture for scientific data processing systems (e.g., workflows and scripting languages such as R). In particular, the working group will:

  • Develop a provenance (data) model (D-OPM) and corresponding provenance repository implementation
  • Develop a dedicated provenance query language for D-OPM and a corresponding query processing engine
  • Ensure interoperability with current provenance standardization efforts (notably the W3C)
  • Integrate the provenance model and tools into the DataONE preservation architecture

Working Group Leaders

  • Bertram Ludaescher, University of California Davis
  • Paolo Missier, Newcastle University, United Kingdom

Current Projects: Querying Scientific Workflow Provenance
The goal of this project is to implement a special-purpose query language for provenance and workflow graphs. It builds on prior work by the WG and on state-of-the-art techniques from graph-based and declarative query languages. In particular, the system will allow the user to express a provenance query as a path expression or "graph pattern". The query is then translated to a lower-level representation, which in turn is executed on an existing database engine. The resulting prototype will form a starting point for supporting provenance analytics in the DataONE cyberinfrastructure.
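The translation step described above can be illustrated with a minimal sketch. It is not the working group's language or implementation: the path syntax (an edge label with an optional trailing "+" for transitive closure), the edges(src, label, dst) table, and the function names are all assumptions made for this example, with SQLite standing in for the "existing database engine".

```python
import sqlite3


def compile_path_query(expr: str) -> tuple[str, str]:
    """Translate a hypothetical path expression into SQL.

    'label' means one step along edges with that label;
    'label+' means one or more steps (transitive closure).
    Returns the SQL text and the bare edge label.
    """
    transitive = expr.endswith("+")
    label = expr.rstrip("+")
    if not transitive:
        sql = ("SELECT dst FROM edges "
               "WHERE src = :start AND label = :label ORDER BY dst")
    else:
        # A recursive common table expression computes reachability.
        # UNION (not UNION ALL) removes duplicates, so cycles terminate.
        sql = ("WITH RECURSIVE reach(node) AS ("
               " SELECT dst FROM edges WHERE src = :start AND label = :label"
               " UNION"
               " SELECT e.dst FROM edges e JOIN reach r ON e.src = r.node"
               " WHERE e.label = :label"
               ") SELECT node FROM reach ORDER BY node")
    return sql, label


def run_path_query(conn: sqlite3.Connection, start: str, expr: str) -> list[str]:
    """Compile the path expression and execute it on the database engine."""
    sql, label = compile_path_query(expr)
    rows = conn.execute(sql, {"start": start, "label": label}).fetchall()
    return [row[0] for row in rows]


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory provenance graph with made-up sample edges."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE edges (src TEXT, label TEXT, dst TEXT)")
    conn.executemany(
        "INSERT INTO edges VALUES (?, ?, ?)",
        [
            ("result.csv", "derived_from", "clean.csv"),
            ("clean.csv", "derived_from", "raw.csv"),
            ("result.csv", "generated_by", "run42"),
        ],
    )
    return conn


if __name__ == "__main__":
    db = build_demo_db()
    print(run_path_query(db, "result.csv", "derived_from+"))
```

In this sketch the user-facing expression stays short ("derived_from+"), while the recursion needed to follow dependency chains of arbitrary length is handled by the generated SQL.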

Previous Projects:

  • Summer Project 2011: “Golden Trail”. Developed an architectural prototype for a repository dedicated to storing provenance traces and selectively retrieving “Golden Trails” from it. A Golden Trail may be composed of multiple provenance traces, which must be “stitched” together so that they collectively provide a complete account of the discovery of a valuable dataset (the “Golden Data”) [MLB+11].
  • Summer Project 2010: “Data Tree of Life” (DToL). “Stitched” together provenance traces from different workflow systems, facilitated by a shared model of provenance [MLB+10].
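The idea of "stitching" traces can be sketched roughly as follows. This is only an illustration, not the method of [MLB+10] or [MLB+11]: it assumes each trace is a list of (derived item, source item) pairs, that traces from different systems refer to shared data items by the same identifier, and the file names and function names are hypothetical.

```python
from collections import defaultdict


def stitch(*traces):
    """Merge several traces into one map: item -> set of direct sources.

    Traces become connected wherever they mention the same data identifier.
    """
    parents = defaultdict(set)
    for trace in traces:
        for derived, source in trace:
            parents[derived].add(source)
    return parents


def golden_trail(parents, golden_data):
    """Return every item that golden_data transitively depends on."""
    seen, stack = set(), [golden_data]
    while stack:
        node = stack.pop()
        for source in parents.get(node, ()):
            if source not in seen:  # skip items already visited
                seen.add(source)
                stack.append(source)
    return seen


if __name__ == "__main__":
    # Two made-up traces, as if recorded by two different workflow systems.
    trace_a = [("aligned.fasta", "sequences.fasta")]
    trace_b = [("tree.nwk", "aligned.fasta")]
    print(sorted(golden_trail(stitch(trace_a, trace_b), "tree.nwk")))
```

Neither trace alone accounts for the full history of tree.nwk; only the stitched graph links it back to sequences.fasta through the shared item aligned.fasta.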


  • [MLB+10] Missier, Paolo, Bertram Ludaescher, Shawn Bowers, Manish Kumar Anand, Ilkay Altintas, Saumen Dey, Anandarup Sarkar, Biva Shrestha, and Carole Goble. “Linking Multiple Workflow Provenance Traces for Interoperable Collaborative Science.” In Procs. 5th Workshop on Workflows in Support of Large-Scale Science (WORKS), 2010.
  • [MLB+11] Missier, Paolo, Bertram Ludaescher, Shawn Bowers, Ilkay Altintas, Saumen Dey, and Michael Agun. “Golden Trail: Retrieving the Data History that Matters from a Comprehensive Provenance Repository.” In Procs. 7th International Digital Curation Conference. Bristol, UK, 2011. http://www.dcc.ac.uk/events/idcc11.