Data governance workshop

From Creative Commons
Revision as of 21:42, 6 May 2012 by CCID-tvol (talk | contribs)

PDF version available here.

Workshop on Data Governance: Final Report

  • Arlington, VA
  • December 14-15, 2011
  • Supported by NSF #0753138 and #0830944

Abstract

The Internet and related technologies have created new opportunities to advance scientific research, in part by sharing research data sooner and more widely. The ability to discover, access and reuse existing research data has the potential both to improve the reproducibility of research and to enable new research that builds on prior results in novel ways. Because of this potential, there is increased interest from across the research enterprise (researchers, universities, funders, societies, publishers, etc.) in data sharing and related issues. This applies to all types of research, but particularly to data-intensive or “big science” research and to cases where data is expensive to produce or is not reproducible. However, our understanding of the legal, regulatory and policy environment surrounding research data lags behind that of other research outputs like publications or conference proceedings. This lack of shared understanding is hindering our ability to develop good policy and improve data sharing and reusability, but it is not yet clear who should take the lead in this area and create the framework for data governance that we currently lack. This workshop was a first attempt to define the issues of data governance, identify short-term activities to clarify and improve the situation, and suggest a long-term research agenda that would allow the research enterprise to create the vision of a truly scalable and interoperable “Web of data” that we believe can take scientific progress to new heights.


Introduction

Data governance is the system of decision rights and responsibilities that describe who can take what actions with what data, when, under what circumstances, and using what methods. It includes laws and policies associated with data, as well as strategies for data quality control and management in the context of an organization. It includes the processes that ensure important data are formally managed throughout an organization, including business processes and risk management. Organizations managing data include both traditional, well-defined ones (e.g. universities) and cultural or virtual ones (e.g. scientific disciplines or large, international research collaborations). Data governance ensures that data can be trusted and that people are made accountable for actions affecting the data.

Sharing and integrating scientific research data are common requirements for international and interdisciplinary data-intensive research collaborations but are often difficult for a variety of technical, cultural, policy and legal reasons. For example, the NSF’s INTEROP and DataNet programs are addressing many of the technical and cultural issues through their funded projects, including DataONE, but the legal and policy issues surrounding data are conspicuously missing from that work. The ultimate success of programs like DataNet depends on scalable data sharing that includes data governance.

Reproducing research – a core scientific principle – also depends on effective sharing of research data along with documentation on its production, processing and analysis workflow (i.e. its provenance) and its formatting and structure. Without access to the supporting data and the means to interpret and compare it, scientific research is not entirely credible and trustworthy, and this access again depends on data governance.

The research community recognizes that data governance issues, such as legal licensing and the related technical issue of attribution of Web-based resources, would benefit from wider community discussion. The Data Governance Workshop was convened to discuss:

  • Legal/policy issues (e.g. copyrights, sui generis database rights, confidentiality restrictions, licensing and contracts for data);
  • Attribution and/or citation requirements (e.g. as required by legal license or desired by researchers);
  • Repositories and Preservation (e.g. persistence of data and its citability, persistence of identifiers for data and data creators);
  • Discovery and provenance metadata, including its governance (e.g. licenses for metadata);
  • Schema/ontology discovery and sharing, including governance (e.g. licenses for ontologies).

The primary goal of the workshop was to develop a better shared understanding of the topic, and a set of recommendations to research sponsors and the broader community of scientific stakeholders for useful activities to be undertaken. In particular, the workshop discussed how NSF OCI (e.g. DataNet) projects might address these data governance questions as part of a sound data management plan, as mandated by the current NSF grant proposal guidelines. Additional goals for the workshop were to define useful short-term actions and a long-term strategic and research agenda.

Workshop participants are listed in appendix II and included scientists and researchers from the life, physical and social sciences, and representatives of data archives, research universities and libraries, research funding agencies and foundations, legal and advocacy organizations, scholarly publishing companies, and scholarly societies. This cross-section of the research community brought diverse perspectives to the discussion that informed and enriched the resulting recommendations.


Legal Landscape

The workshop commenced with a review of the current legal landscape surrounding data. Copyright law, while complex and nuanced, is largely harmonized world-wide, unlike other types of intellectual property law (e.g., sui generis database rights or patent rights). The law limits copyright protection for some types of data (e.g. facts and ideas are never protected), and the legal distinction between facts or collections of facts and protected “databases” is murky. Furthermore, different legal jurisdictions distinguish various types of data (like “factual” versus creative products) with different protections. For example, a database of factual sensor readings that is automatically in the public domain in one country may fall under intellectual property control in another, making it difficult to combine data produced by researchers in both countries without complex legal negotiation or development of a customized contract to harmonize the different laws for the purposes of the research project. Another nuance of research data is the distinction in many jurisdictions between a database and its contents: the former is often copyrightable while the latter may or may not be, depending on what it is and where it came from. While some approaches are more straightforward than others (as described below), the mere existence of these legal differences can make it necessary to involve legal counsel in establishing research project data sharing norms.

Privacy and/or confidentiality law is another important part of the legal landscape for data produced by medical research, and in the social, behavioral, and health sciences. These laws and regulations impose restrictions on storage, dissemination, exchange, and use of data, and are even more fragmented and diverse than in the area of intellectual property. In addition, institutions release this data with ad hoc, custom contracts (usage agreements) which are often incompatible with restrictions from other institutions using the same regulatory framework.

The overview covered copyrights, sui generis database rights, and the public domain as they apply to various types of research data, and the current legal tools and remedies to protect and share data: contracts, public licenses, and waivers. The merits of, and problems with, each approach were discussed, along with the merits of an open, commons-based approach to data sharing.

The complexities surrounding research data make it difficult to answer questions like “who has the right to decide which legal approach to take for a given dataset?” or “is it allowable to combine datasets that were released under completely different contractual ‘terms of use’, each requiring that its terms and conditions continue to apply to the data in the resulting derivative dataset?” Many researchers rely on scientific norms or conventional wisdom to resolve these questions since they lack resources to help them with any other approach, and this leads to behavior that may or may not be legally defensible and has questionable side effects for research reproducibility and data reuse.

Certainly the laws affecting data are not sufficient to ensure that the norms of scientific research are followed. For example, there is an important distinction between releasing data at all (i.e. just making it accessible to other researchers) and making it effectively reusable or re-purposable for new research, with only the latter supporting research strategies that require combining multiple existing datasets. So part of data governance that exceeds the reach of law is specifying how data is to be shared so that it supports follow-on research and is not merely findable, if sought. Ensuring data reusability requires additional policy to cover data quality and metadata provision, and separate mechanisms for policy enforcement such as contractual agreement (e.g. as a condition of funding) or dependence on scientific social norms of practice.