Meeting Agenda (long version)
Monday Jul 24th
|0830||Session 1*: Welcome to the DUG, Introduction to DataONE and Update||Amber Budden, Bill Michener||University Club|
|0910||Session 2*: DUG Introductions and Business Meeting (e-pad notes)||Felimon Gayanilo||University Club|
|0940||Session 3*: DataONE Member Node Federation||Dave Vieglais||University Club|
|1030||Session 4*: Breakout Sessions|
|Breakout A: DOIs, Services and Member Benefits (e-pad notes)||Dave Vieglais, Bill Michener, Felimon Gayanilo||Sassafras|
|Breakout B: Linked Data (e-pad notes)||Adam Shepherd||Persimmon|
|1330||Session 5*: Breakout Sessions|
|Breakout C: Education Clearinghouse (e-pad notes)||Nancy Hoebelheinrich, Sophie Hou||Persimmon|
|Breakout D: Provenance (e-pad notes)||Dave Vieglais, Bertram Ludäscher||Sassafras|
|1550||Session 6: Report Back from Break Outs||Mod: Felimon Gayanilo||University Club|
|1730||Session 7: Reception and Poster Session||State Room East|
Tuesday Jul 25th
|0830||Opening Remarks, Announcements||Amber Budden||University Club|
|0850||Session 8: Community Oral Presentations||University Club|
|Bridging a data repository to the DataONE Federation||Monica Ihli|
|Configuring Generic Member Nodes (GMN) for DataONE using Terraform and Chef||Dayne Broderson|
|Reproducibility and The Open Science Framework||Natalie Meyers|
|Archonnex at ICPSR - Data Science Management||Kilsang Kim|
|1030||Session 9: Birds of a Feather**|
|Complexity of information, integration, linked data (e-pad notes)||University Club|
|Organizational structure, roles knowledge sharing, networking (e-pad notes)||Persimmon|
|Topic C (e-pad notes)||Sassafras|
|1140||Session 10: Report Back and Discussion||Mod: Amber Budden||University Club|
|1330||Session 11: Community Oral Presentations||University Club|
|Make Data Count||Dave Vieglais|
|Leveraging Github for a New Community Education Platform||Amber Budden|
|Enabling FAIR Data||Shelley Stall|
|Pursuing sustainability of seed projects: insights from the Community for Data Integration (CDI)||Leslie Hsu|
|1450||Session 12*: Moderated Discussion "Revisiting the Past and Looking forward for the DUG" (e-pad notes)||Felimon Gayanilo, Amber Budden, Incoming Chair(s)||University Club|
|1915||Optional Social Event: See below|
**Potential Topics: Data Center Certification, DUG Community, MN Onboarding, Trust/authenticity, Value provided by DataONE
Optional Social Event
Following the success and theme of the last DUG excursion, we have arranged for a private tour of Cardinal Spirits (http://cardinalspirits.com/tours/) on Tuesday Jul 25th at 7.15pm. Tours last approximately 45 minutes and include a full tasting and cocktail. The cost is $10 per person, payable on the day, provided we have sufficient numbers. If you are registered to attend the DUG meeting and wish to join this event, please indicate your interest by filling out a quick form here.
Breakout A: DOIs, Services and Member Benefits
DataONE provides infrastructure that facilitates interoperability across diverse science data repositories, providing benefits to repository operators and users. The DataONE infrastructure provides a stable foundation on which additional capabilities may be built to serve the needs of the community. Beyond technical characteristics, an important aspect of any new feature is a consideration of its cost versus its benefit to the DataONE community. For example, Digital Object Identifiers (DOIs) are seen as an important mechanism for identifying resources such as datasets; however, creation of DOIs incurs tangible costs that must be covered to ensure a reliable, long-term service can be offered to DataONE participants. This session will provide an overview of several new capabilities that may be offered to participants in the DataONE federation and will encourage discussion and feedback from participants about how these capabilities might best be tailored to serve the community.
Breakout B: Linked Data
Linked Open Data underpins a number of the best practices set forth by the recent W3C Recommendation on Best Practices for Data on the Web (https://www.w3.org/TR/dwbp/). The recommended best practice of reusing vocabularies for encoding data spans five of the eight stated benefits of adoption. However, achieving reuse of vocabularies is a sociotechnical challenge that requires groups of data publishers to agree on which vocabularies to use. This session seeks to foster agreement for achieving vocabulary reuse by identifying the vocabularies employed by data publishers and beginning conversations for vocabulary alignment where needed.
Breakout C: Education Clearinghouse
The purpose of this breakout is to support the enhancement of the recently launched Data Management Training (DMT) Clearinghouse, a project following from an initial collaboration between ESIP, USGS, and DataONE. The newly formalized Data Management Training Working Group is soliciting feedback to accomplish three goals:
(1) Discuss the targeted uses of the recently launched Data Management Training (DMT) Clearinghouse and the selection criteria for identifying and including educational resources that could be used by researchers and/or data professionals to train themselves or others on best practices for managing research data;
(2) Using the selection criteria, add to the compilation of educational resources in the Data Management Training (DMT) Clearinghouse;
(3) Record user feedback from participants who use the built-in form and help guides to submit educational resources, or who search and browse the existing inventory of resources.
Breakout D: Provenance
Provenance information, a form of metadata describing the lineage and processing history of data and knowledge artifacts, plays an important role in many scientific applications and use cases. For example, an ecologist might want to combine different datasets for a study, but needs to know how the candidate datasets were derived to determine their fitness for the task at hand. A climate scientist might need to document the processing history of climate model outputs to facilitate reproducibility. A natural history collection manager may want to apply automated data curation tools to specimen collection data, but needs to understand proposed “repairs” before executing them. Provenance information plays a crucial role in these and many similar cases. In this session, we will provide an overview of provenance information and describe how it is captured, preserved, and exposed within the DataONE federation.
Oral Presentation Abstracts
Bridging a data repository to the DataONE Federation
DataONE is a community-driven project supporting enhanced search and discovery of Earth and environmental data. Existing institutional repositories have much to gain by bringing their scientific metadata into the DataONE federation, such as exposing their repository's contents to a much wider audience through DataONE's sophisticated discovery interface. This presentation will provide a brief introduction to the DataONE architecture, followed by a high-level overview of how a repository's native data can be bridged into the federation using an implementation solution known as the Generic Member Node.
Broderson, DB; Fisher, WH; Raymond, VL
Configuring Generic Member Nodes (GMN) for DataONE using Terraform and Chef
The Geographic Information Network of Alaska’s (GINA) research objective is to automate configuration and resource allocation for standing up DataONE member nodes using Terraform and Chef tools in Amazon Web Services (AWS). This automation allows a small DevOps team managing multiple projects with large and diverse datasets to easily create, test, destroy, and ultimately deploy member nodes for DataONE. In addition to its own vast research data assets, GINA’s project portfolio includes developing and managing data catalogs and portals for several research entities, including the North Slope Science Initiative (NSSI), the Alaska National Science Foundation Established Program to Stimulate Competitive Research (AK NSF EPSCoR), and others. GINA believes in the research significance of its project partners’ data. As such, the research mission is to make project data findable beyond its direct project participants and accessible beyond the project life-cycle. This white paper explores the power and potential of these automation tools for testing and standing up DataONE member nodes for different data providers.
Archonnex at ICPSR - Data Science Management
ICPSR has been developing a new Digital Assets Management System (DAMS) based upon a new architecture called Archonnex@ICPSR. Archonnex@ICPSR is an architecture that meets the core and emerging business needs of the organization. It provides a digital technology platform that leverages ICPSR expertise and open-source technologies that are proven and well supported by open-source communities. It is not a single software product, but a collection of interconnected systems and services. Archonnex@ICPSR enables upfront integration with the researcher, allowing better-managed data collection, dissemination, and management during research. It follows the data workflow from ingestion of data into the repository through curation, archiving, publication, and re-use of the research data, including citation and bibliography management along the way. The Archonnex architecture at ICPSR is strengthening our data services to researchers as well as data discovery, re-use, and reproducibility.
Reproducibility and The Open Science Framework
Open, transparent and reproducible science is stronger science. Sharing scientific materials – and being transparent about the research process and its contributors – is desirable but not incentivized or facilitated enough. Publishing norms incentivize novel, positive results over complete reporting. Researchers produce a variety of materials during their research process: data, code, and other materials essential to reproducibility that may never actually appear or be made fully accessible in research publications. The Center for Open Science (COS) seeks to both facilitate and incentivize better practices by fostering communities, and by conducting metascience research on the overall process as well as through building infrastructure that makes it easier to conduct open science. This talk focuses on the infrastructure of reproducibility and The Open Science Framework (OSF; http://osf.io), a free, open-source web application. The OSF connects information across all phases of the research lifecycle and enhances transparency in the process with features to support content management, collaboration, file storage, version control, and sharing within both private and public workflows, as well as interoperability with other systems.
Make Data Count
Research data are fundamental to the success of the academic enterprise. The Making Data Count (MDC) project will enable measuring the impact of research data much as is currently done for publications, the primary vehicle for scholarly credit and accountability. The MDC team (including the California Digital Library, COUNTER, DataCite, and DataONE) is working to publish a new COUNTER recommendation on data usage statistics; launch a DataCite-hosted MDC service for aggregated data-level metrics (DLM) based on the open-source Lagotto platform; and build tools that allow data repository and discovery services to easily integrate with the new MDC service. In providing DLM, the MDC project augments existing measures of scholarly success and so offers an important incentive promoting open data principles and quality research data through adoption of research data management best practices.
Leveraging Github for a new Community Education Platform
DataONE has invested time and resources in the development of education and training resources focused on data management for open and reproducible science. Among these are the DataONE Education Modules, designed for trainers or students to download and use for instruction or self-learning. Domain agnostic in nature, these materials are intended to be customized by the user for their specific needs and as such, they are CC0 and in the public domain. Best practices in data management are both fundamental and evolving as new technologies and policies emerge. To ensure relevance, our materials need to be updated to reflect changes in the landscape. We recently undertook a period of external peer review and update of the Education Modules, and the process led us to explore alternate models for community contribution to the materials. In this talk we will showcase the development of a new platform that exposes the DataONE Education Materials and plans for supporting community engagement around the materials.
Leslie Hsu and Madison Langseth
Pursuing sustainability of seed projects: insights from the Community for Data Integration (CDI)
The USGS Community for Data Integration is a voluntary community of practice with diverse members who come together to identify solutions in scientific data management and integration. Since 2010, the CDI has supported innovative and collaborative ideas through working groups and seed funding for projects. Through these efforts, the CDI has produced many data integration tools, web applications, data integration and management frameworks, and educational resources. However, project leads often wrestle with the options (or lack thereof) for sustaining these seed projects. We look at the 70+ projects funded by CDI and report on patterns, successes, and challenges in sustainability.
Poster Presentation Abstracts
Discover Keywords and Associated Data Through the GCMD Keyword Viewer
The beta version of the keyword viewer is a tool that helps users navigate the GCMD keyword hierarchies and view keyword definitions and related keywords. Users can also connect to the metadata in the CMR associated with those keywords. The poster will convey the purpose and goals of the keyword viewer and highlight its major functionality. We will also invite the community to tire-kick the keyword viewer and provide feedback. A live demo of the keyword viewer will also be given.
Donaldson, DR; Martin, S
Understanding Perspectives on Sharing Neutron Data
Even though the importance of sharing data is frequently discussed, data sharing appears to be limited to a few fields, and practices within those fields are not well understood. This study examines perspectives on sharing neutron data collected at Oak Ridge National Laboratory’s neutron sources. Operation at user facilities has traditionally focused on making data accessible to those who create them. The recent emphasis on open data is shifting the focus to ensure that the data produced are reusable by others. This mixed methods research study included a series of surveys and focus group interviews in which 13 data consumers, data managers, and data producers answered questions about their perspectives on sharing neutron data. Data consumers reported interest in reusing neutron data for comparison/verification of results against their own measurements and testing new theories using existing data. They also stressed the importance of establishing context for data, including how data are produced, how samples are prepared, units of measurement, and how temperatures are determined. Data managers expressed reservations about reusing others’ data because they were not always sure if they could trust whether the people responsible for interpreting data did so correctly. Data producers described concerns about their data being misused, competing with other users, and over-reliance on data producers to understand data. We present the Consumers Managers Producers (CMP) Model for understanding the interplay of each group regarding data sharing. We conclude with policy and system recommendations and discuss directions for future research.
Disciplinary, Individual, and Data Factors Affecting Scientists' Data Reuse Practices
Data sharing and reuse are increasingly seen as important to advancing science. Accordingly, a number of studies have examined factors that motivate or deter scientists' data sharing behaviors. However, data reuse is not yet well studied. This research seeks to understand the factors influencing scientists' data reuse behaviors. I conducted a total of 46 individual semi-structured surveys to examine scientists' current data reuse practices. Results showed that scientists' data reuse practices are influenced by disciplinary factors (e.g., discipline norms and resources), individual motivations (e.g., usefulness, concerns, and efforts involved in data reuse), and data factors (e.g., data quality and document quality).
Eschenfelder, K; Shankar, K; Williams, R; Langam, A
Data archive business models: A historical analysis of change over time in social science data archives
Concerns for the ongoing sustainability of data archives raise the question of what archive business models can support sustainability over time. In order to examine the history of different business models, we examined social science data archives (SSDA), one of the most longstanding sets of digital data archives. Some SSDA predate the internet, and many have 30-plus years of history managing shifts in technology, government funding patterns, scientific trends, research regulations, and changing user expectations. In this project we trace the changes that three SSDA made in their business models over a 30-50 year period: ICPSR, the UK Data Archive (now part of the UK Data Service), and the LIS Cross National Data Center at NYU and Luxembourg. By identifying major business model changes made by these data archives, we hope that contemporary data projects can learn what types of changes they might expect in the future. The first part of the project clarifies what the term “business model” means in a data archive context. The second part identifies and explains major business model changes made by SSDA over time.
Ge Peng, Nancy Ritchey, Anna Milan, Sonny Zinn, and Kenneth S. Casey
Towards Consistent and Citable Data Quality Descriptive Information for End-Users
Curating quality descriptive information and metadata for individual datasets is a necessary step toward meeting the transparency requirement and helping establish the credibility and trustworthiness of data products. However, this has been a difficult challenge for the data management community because of the lack of a consistent assessment framework, process, and workflow. Furthermore, developing and implementing these require multi-domain knowledge and close cross-disciplinary collaboration.
This presentation will first introduce a data stewardship maturity matrix (DSMM) as a reference framework for assessing stewardship maturity of individual digital datasets. Using the DSMM as an example, this presentation will then demonstrate that it is possible to consistently and systematically curate and publish data quality descriptive information both as citable documents and within ISO metadata records for human and machine end-users. These consistent and citable documents and metadata records can be readily integrated into or linked by other systems and tools, for example, to be used for enhanced data discoverability and usability.
This presentation will also outline the progress made under the auspices of the NOAA OneStop project in consistently curating, publishing, integrating, and displaying data quality information.
Provenance in the Archives: Lessons for Data Repositories
Imagine a researcher in 2067 examining objects in a digital archive that were deposited in the early 21st century. How can the researcher be assured that those digital materials – whether born-digital or digital surrogates of analog materials – have not been accidentally or maliciously altered sometime during the intervening decades? Cultural heritage institutions - such as libraries and collecting archives - hold the primary source data that supports many forms of humanistic research: not only original historical research but also literary studies, sociology, human geography, archaeology, psychology, and anthropology. Corporate / institutional archives, in addition to supporting humanistic and social science research, may also be consulted in criminal and other legal proceedings as well as genealogical research. Fourteen archival workers – archivists, metadata specialists, digitization managers, and preservation technologists – participated in semi-structured interviews about how their current recordkeeping practices contribute to the creation of a comprehensive provenance (or chain-of-custody) trail that ensures the authenticity and trustworthiness of born-digital files and digital surrogates for future users of the archive. This poster presents preliminary results about how archives are adapting to emergent needs for processing digital materials at scale, ensuring that workflows capture entities, activities, and agents (provenance information), and ensuring that current systems support migration of provenance information to successor systems. Current practices across institutions vary from insufficient to exemplary, revealing opportunities for workforce development and for development of standards and ontologies to represent - and make available to the users of the archives - these provenance trails.
Christine Laney, Sarah Elmendorf, Thomas Harris, Claire Lunch, Tom Gulbransen
Recent advances in NEON data volume, quality and discovery
The National Ecological Observatory Network (NEON) is a continental-scale ecological observation facility, sponsored by the National Science Foundation and operated by Battelle, that gathers and synthesizes data on the impacts of environmental change, land use change, and invasive species on natural resources and ecosystems. By the end of NEON’s construction period (early 2018), the observatory will be delivering a full suite of observational, instrumented, and airborne datasets from 81 field sites across the U.S. Initial lessons learned and community feedback on data collection, ingest, processing, quality checking, and publication efforts have fueled numerous enhancements over the past year. Improvements include digital data collection applications for mobile devices, re-architected data processing and publication pipelines, and user-friendly documentation and code to enhance understanding and usability of NEON data. In addition, data delivery systems now include a public API, a re-architected data portal, and sharing of certain specialized data products to be hosted by partnering organizations’ data portals. As we continue to improve the quality and usability of NEON data, we are exploring data versioning and DOIs, providing full sample/specimen metadata, and enhancing our DataONE contributions. We continue to solicit advice and broader engagement from the informatics and ecological research communities.
EarthCube Council of Data Facilities Registry: A Report on Goals and Status
The EarthCube Council of Data Facilities (CDF) formed the Registry Working Group to review alignment of existing approaches to research facility description and discovery. The group is formalizing a set of repository parameters of interest to CDF members and reviewing the alignment of those parameters with re3data and COPDESS with the goal of schema mapping and extension. The plan is to leverage the re3data schema by means of a community profile: developing strategies for a more structured common approach for CDF members to express/expose this information; developing a means to encode this schema in a machine-readable format and bringing forward a possible implementation for publishing and subscribing to this data; demonstrating the use of schema.org for publishing and accessing this data, exploring gaps in this approach, and recommending solutions; and collaborating with re3data as a reference implementation for collecting and exposing this data. By providing a common pattern for facilities to expose links to existing work already completed in service and resource description documents (SWAGGER, OGC, THREDDS, etc.), a broader audience can be reached. Additionally, a lower barrier to entry and a minimal maintenance burden are achieved through use of existing web-based architecture and best practices. This work is done by the working group with the reference EarthCube architecture in mind, and is viewed as a contribution along the path to addressing the resource discovery and assessment goals of the architecture. The poster will provide an update on the status of the working group and provide an opportunity for community input.
Joo, S; Peters, C
User Needs for Data Services in Academic Libraries
This poster presents preliminary findings from the 2016/17 IMLS-funded planning project Planning Research Data Services in Academic Libraries: Designing a Conceptual Services Model Based on Patron Needs Assessment. This project comprehensively investigates the status of, and user needs for, research data services in academic libraries. In this project, research data services refers to a range of library services that assist researchers to collect, manage, analyze, present, visualize, and distribute data in their research activities. Multiple methods are used to collect data, i.e., case studies and surveys of potential users. For the case studies, we identified 17 specific types of data services, such as data management plan consultations, data analysis and presentation support, and assistance with data preservation. A coding book was then utilized to analyze the current services offered at academic libraries in the United States. In addition, a survey of users on the University of Kentucky campus who might potentially benefit from these services was distributed. The instrument measures the diverse data management needs of users by exploring 11 dimensions of research data services for users across disciplines. Findings from the user survey indicate the data management activities for which users need support and highlight the specific types of services and resources that are most useful for users in different disciplines. Preliminary findings from both the case studies and the user survey will be presented in the poster.
Cragin, M; Plale, B; Kouper, I; Minor, J
Midwest Big Data Hub: Developing effective cross-sector, data-enabled networks to solve shared problems of regional and societal interest
The MBDH, a regional innovation hub sponsored by the NSF, is working to cultivate communities, reduce friction in data-to-decision systems, and build capacity in data science and data literacy. Creating policies around data sharing and developing templates for access arrangements are top priorities among the four innovation hubs. We wish to present the efforts of the Midwest hub in these areas as well as gain insight from the DataONE community's experiences with data sharing agreements.