Data governance workshop
Workshop on Data Governance: Final Report
- Arlington, VA
- December 14-15, 2011
- Supported by NSF #0753138 and #0830944
The Internet and related technologies have created new opportunities to advance scientific research, in part by sharing research data sooner and more widely. The ability to discover, access and reuse existing research data has the potential both to improve the reproducibility of research and to enable new research that builds on prior results in novel ways. Because of this potential there is increased interest from across the research enterprise (researchers, universities, funders, societies, publishers, etc.) in data sharing and related issues. This applies to all types of research, but particularly to data-intensive or “big science” research, and to research where data is expensive to produce or is not reproducible. However, our understanding of the legal, regulatory and policy environment surrounding research data lags behind that of other research outputs like publications or conference proceedings. This lack of shared understanding is hindering our ability to develop good policy and improve data sharing and reusability, but it is not yet clear who should take the lead in this area and create the framework for data governance that we currently lack. This workshop was a first attempt to define the issues of data governance, identify short-term activities to clarify and improve the situation, and suggest a long-term research agenda that would allow the research enterprise to realize the vision of a truly scalable and interoperable “Web of data” that we believe can take scientific progress to new heights.
Data governance is the system of decision rights and responsibilities that describes who can take what actions with what data, when, under what circumstances, and using what methods. It includes laws and policies associated with data, as well as strategies for data quality control and management in the context of an organization. It includes the processes that ensure important data are formally managed throughout an organization, including business processes and risk management. Organizations managing data may be traditional and well-defined (e.g. universities) or cultural or virtual (e.g. scientific disciplines or large, international research collaborations). Data governance ensures that data can be trusted and that people are made accountable for actions affecting the data.
Sharing and integrating scientific research data are common requirements for international and interdisciplinary data-intensive research collaborations but are often difficult for a variety of technical, cultural, policy and legal reasons. For example, the NSF’s INTEROP and DataNet programs are addressing many of the technical and cultural issues through their funded projects, including DataONE, but the legal and policy issues surrounding data are conspicuously missing from that work. The ultimate success of programs like DataNet depends on scalable data sharing that includes data governance.
Reproducing research – a core scientific principle – also depends on effective sharing of research data along with documentation on its production, processing and analysis workflow (i.e. its provenance) and its formatting and structure. Without access to the supporting data and the means to interpret and compare it, scientific research is not entirely credible and trustworthy, and this access again depends on data governance.
The research community recognizes that data governance issues, such as legal licensing and the related technical issue of attribution of Web-based resources, would benefit from wider community discussion. The Data Governance Workshop was convened to discuss:
- Legal/policy issues (e.g. copyrights, sui generis database rights, confidentiality restrictions, licensing and contracts for data);
- Attribution and/or citation requirements (e.g. as required by legal license or desired by researchers);
- Repositories and Preservation (e.g. persistence of data and its citability, persistence of identifiers for data and data creators);
- Discovery and provenance metadata, including its governance (e.g. licenses for metadata);
- Schema/ontology discovery and sharing, including governance (e.g. licenses for ontologies).
The primary goal of the workshop was to develop a better shared understanding of the topic, and a set of recommendations to research sponsors and the broader community of scientific stakeholders for useful activities to be undertaken. In particular, the workshop discussed how NSF OCI (e.g. DataNet) projects might address these data governance questions as part of a sound data management plan, as mandated by the current NSF grant proposal guidelines. Additional goals for the workshop were to define useful short-term actions and a long-term strategic and research agenda.
Workshop participants are listed in appendix II and included scientists and researchers from the life, physical and social sciences, and representatives of data archives, research universities and libraries, research funding agencies and foundations, legal and advocacy organizations, scholarly publishing companies, and scholarly societies. This cross-section of the research community brought diverse perspectives to the discussion that informed and enriched the resulting recommendations.
The workshop commenced with a review of the current legal landscape surrounding data. Copyright law, while complex and nuanced, is largely harmonized world-wide, unlike other types of intellectual property law (e.g., sui generis database rights or patent rights). The law limits copyright protection for some types of data (e.g. facts and ideas are never protected) and the legal distinction between facts or collections of facts and protected “databases” is murky. Furthermore, different legal jurisdictions distinguish various types of data (like “factual” versus creative products) with different protections. For example, a database of factual sensor readings that is automatically in the public domain in one country may fall under intellectual property control in another, making it difficult to combine data produced by researchers in both countries without complex legal negotiation or development of a customized contract to harmonize the different laws for the purposes of the research project. Another nuance of research data is the distinction in many jurisdictions between a database and its contents – the former is often copyrightable while the latter may or may not be, depending on what it is and where it came from. While some approaches are more straightforward than others (as described below), the mere existence of these legal differences can make it necessary to involve legal counsel in establishing research project data sharing norms.
Privacy and/or confidentiality law is another important part of the legal landscape for data produced by medical research, and in the social, behavioral, and health sciences. These laws and regulations impose restrictions on storage, dissemination, exchange, and use of data, and are even more fragmented and diverse than those governing intellectual property. In addition, institutions release this data with ad hoc, custom contracts (usage agreements) which are often incompatible with restrictions from other institutions using the same regulatory framework.
The overview covered copyrights, sui generis database rights, and the public domain as they apply to various types of research data, and the current legal tools and remedies to protect and share data: contracts, public licenses, and waivers. The merits of, and problems with, each approach were discussed, along with the merits of an open, commons-based approach to data sharing.
Certainly the laws affecting data are not sufficient to ensure that the norms of scientific research are followed. For example, there is an important distinction between releasing data at all (i.e. just making it accessible to other researchers) and making it effectively reusable or re-purposable for new research, with only the latter supporting research strategies that require combining multiple existing datasets. So part of data governance that exceeds the reach of law is specifying how data is to be shared so that it supports follow-on research and is not merely findable, if sought. Ensuring data reusability requires additional policy to cover data quality and metadata provision, and separate mechanisms for policy enforcement such as contractual agreement (e.g. as a condition of funding) or dependence on scientific social norms of practice.
An issue that raises considerable concern among scientists and others in the community is the meaning of “open” and some of the subtler points like the meaning of “non-commercial use” in various copyright licenses. While advocacy organizations like the Open Knowledge Foundation have published widely recognized definitions of “open” [1], these do not allow for limitations on the reuse of data for commercial purposes, e.g., by pharmaceutical companies or software start-ups. Many scientists desire protection from that type of commercial reuse, but would agree that imposing any legal limitations on reuse of their data might also create a barrier for colleagues wanting to use the data for non-commercial purposes. Since there is no standard or legal definition of “non-commercial” use [2], how such a condition will be enforced is uncertain. For example, if research results that drew from multiple reused datasets generate a patent or are included in a new textbook, is that a violation of the non-commercial terms of the license?
Following the legal overview, the group reviewed the technological landscape for data sharing as it relates to governance. To make data sharing effective at Web-scale and thereby enable international e-science or network science we need a way to support automated, machine-processable information on what can be done with data, as well as its properties and quality. The technologies involved include media types and formats, metadata (for description and provenance), identifiers, persistence strategies, and software required to use the data. What is needed is nothing less than a new layer of the Web architecture to support research and scholarship at the social and legal level, and with minimal process – “ungovernance”. This layer would then support the functional needs of data discoverability, accessibility, interpretability, reproducibility, and reusability.
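As a rough illustration of what “machine-processable information on what can be done with data” might look like, the following Python sketch models a rights statement and two checks against it. The field names, the action labels, and the “most restrictive term wins” rule for combined datasets are all assumptions made for this example; they are not drawn from the workshop or from any published standard, and real license compatibility is considerably more subtle.

```python
from dataclasses import dataclass
from typing import Sequence


# Hypothetical vocabulary: field names and action labels are illustrative
# assumptions, not taken from any existing rights-expression standard.
@dataclass(frozen=True)
class RightsStatement:
    dataset_id: str             # persistent identifier for the dataset
    license_name: str           # human-readable label for the terms
    allows_commercial: bool     # may the data be reused commercially?
    allows_derivatives: bool    # may new datasets be derived from it?
    requires_attribution: bool  # must reusers credit the source?


def permitted(rights: RightsStatement, action: str, commercial: bool) -> bool:
    """Return True if the intended reuse is compatible with the stated terms."""
    if action not in {"read", "derive"}:
        return False  # unknown actions are refused by default
    if commercial and not rights.allows_commercial:
        return False
    if action == "derive" and not rights.allows_derivatives:
        return False
    return True


def combine(new_id: str, sources: Sequence[RightsStatement]) -> RightsStatement:
    """Simplifying assumption: a combined dataset inherits the most
    restrictive permission and every obligation of its sources."""
    return RightsStatement(
        dataset_id=new_id,
        license_name="derived; see source terms",
        allows_commercial=all(s.allows_commercial for s in sources),
        allows_derivatives=all(s.allows_derivatives for s in sources),
        requires_attribution=any(s.requires_attribution for s in sources),
    )
```

Under these assumptions, merging a dataset released under a waiver with one carrying attribution and non-commercial terms yields a combined dataset that requires attribution and cannot be reused commercially, which is the kind of outcome a researcher would otherwise have to work out by reading both sets of terms.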
The metadata required to support data governance includes both discovery and provenance metadata. Descriptive metadata supports the ability to learn about and locate the data on the Web, and provides basic information about its purpose, type, creator, funding source, etc. Descriptive metadata will include one or more identifiers that link to the data so that it can be accessed and cited. Provenance metadata includes information about the source of the data and the workflow and methodology used to generate it. It is necessary for the interpretability, reproducibility and reusability goals of data sharing, as well as for determining the data’s quality. There are many challenges associated with metadata for data (sometimes called “paradata”), among which is its transitive nature. Datasets not only evolve over time, but can be combined, derived from or built upon. In order to properly manage some types of data, we can imagine using something like the GitHub software development infrastructure to allow data to transition in multiple directions, while still retaining the original core in a retrievable form. If done properly, having this metadata would be a great boon for scientists, since it would reduce the need to redo experimental and other data-production work. However, producing this metadata is usually difficult and time consuming, since it is not integrated into the research process and we lack tools to make it easy and standards to store and share it. It would be useful to identify case studies of when this was done, how, and with what benefit to researchers.
Another technological aspect of data governance relates to the software used to generate, process, analyze, or visualize the data – i.e., any software needed to interpret or reproduce the data. All information in digital formats, including research data, requires software to intermediate it and is otherwise a meaningless collection of bits. So for a researcher to validate research results by recreating or re-processing data, the software used for the original research is usually necessary. And software that is integral to re-creating or re-processing data should be openly available (e.g. as open source software, ideally platform-independent) and publicly shared to support future uses of the data. Therefore the software must also have metadata (descriptive and provenance, including versioning) so it can be discovered and its quality assessed, and it must be preserved just as carefully as the data associated with it.
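The transitive nature of provenance can be made concrete with a small sketch. The Python below is a minimal illustration, assuming a hypothetical record structure that carries a few descriptive fields alongside a list of the identifiers a dataset was derived from; the field names and the example catalog are invented for this purpose and do not reflect any particular metadata standard.

```python
from dataclasses import dataclass, field
from typing import Dict, List


# Hypothetical record layout: descriptive fields plus minimal provenance.
@dataclass
class DatasetRecord:
    identifier: str   # identifier used to access and cite the data
    title: str
    creator: str
    funder: str
    derived_from: List[str] = field(default_factory=list)  # upstream identifiers
    method: str = ""  # short note on the workflow that produced the data


def lineage(identifier: str, catalog: Dict[str, DatasetRecord]) -> List[str]:
    """Return every upstream identifier, nearest ancestors first."""
    seen: List[str] = []
    queue = list(catalog[identifier].derived_from)
    while queue:
        current = queue.pop(0)
        if current in seen:
            continue  # a source reached by two routes is reported once
        seen.append(current)
        if current in catalog:  # sources outside the catalog end the trail
            queue.extend(catalog[current].derived_from)
    return seen


# Invented example: raw readings are cleaned, then merged with a survey.
example_catalog = {
    "raw": DatasetRecord("raw", "Sensor readings", "Lab A", "Funder X"),
    "survey": DatasetRecord("survey", "Field survey", "Lab B", "Funder Y"),
    "cleaned": DatasetRecord("cleaned", "Cleaned readings", "Lab A", "Funder X",
                             derived_from=["raw"], method="outlier removal"),
    "merged": DatasetRecord("merged", "Merged dataset", "Lab C", "Funder Z",
                            derived_from=["cleaned", "survey"], method="join on site"),
}
```

Calling `lineage("merged", example_catalog)` walks back through both the cleaned readings and the survey to the raw readings, which is the information a reuser would need in order to judge quality or to give credit to every upstream contributor.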
Data Management Plans
The overview of the current landscape concluded with a review of data management and archiving. Many disciplines are moving to systematize data archiving, either in large, centralized repositories (e.g. GBIF, Dryad) or in institutionally-supported repositories (e.g. DSpace or EPrints instances). In a few cases, journals mandate data archiving (e.g. the Journal Data Archiving Policy, or JDAP [5], imposed by the majority of evolutionary biology journals, or Nature’s policy [6] on depositing genomic data into GenBank prior to publication). Increasingly, research funding agencies are also requiring data archiving and open sharing as a condition of funding. These range from blanket policies (e.g. the Wellcome Trust [7]) to proposal guidelines from the NIH [8] and NSF [9], among others [10]. While not all agencies mandate sharing in all cases, and could not in some cases (e.g. where privacy laws apply), their intent is to encourage that behavior.
To take a recent example, data management plans for NSF proposals require description of the types of data to be created or used, the standards in which the data will be stored and preserved, and policies for ensuring access to the data and under what terms and conditions. The requirement was created to protect the agency’s investment in the research’s outputs and optimize their value. In addition to their benefits of achieving research goals and supporting research reproducibility, the NSF believes that data management plans and their evaluation by review panels will evolve over time to become a more influential part of both the broader impact and merit review criteria, so adding them as a proposal requirement was just the first step. Review panels currently take direction from program officers and review plans inconsistently, but already there are knowledgeable PIs using their plans as a competitive advantage and including references to them in the proposal narrative. But the NSF is aware that the ultimate success of their plans requires community-driven guidelines, norms, and expectations.
To improve the quality of data management plans, tools like the DMPTool [11] are emerging to help researchers understand the issues and options, and create credible plans. Such tools are popular and present an opportunity to guide outcomes and behavior in good directions. But it will be important for tools and templates to be developed by the research community and not just librarians and research administrators.
A gap in both the data management plan guidelines and the emerging tools to create them is in the area of policies related to data sharing, particularly copyrights, database rights, and other applicable intellectual property policies. There is a lack of expertise to guide researchers and research administrators, and uncertainty about who controls policy affecting the distribution and archiving of data (which have serious legal aspects and long-term costs). There are also uncertainties about how to create policies that are discipline-agnostic, and how to centralize policy in a time of rapid change. And this is exacerbated by a general unwillingness among all parties concerned to tackle these issues.
Section I: Current Conventions for Data Sharing and Reuse
A recent survey [12] of the research community undertaken by the DataONE project showed that 80% of scientists are willing to share their data with others in the research and education community. But the question was never raised to them as to how, legally, they might include a statement in their data about that willingness to share, and under what terms, so that their expectations are clear to others who are interested in using the data. The consensus view was that most researchers have simply never thought about these issues beyond their obligations in relation to human subjects and IRB regulations.
A research discipline that has already worked through many issues of data sharing and governance is the social sciences, and particularly for census and survey data. They have the legal and technical tools to archive and share data, and well-established behavioral norms. But there are still weaknesses, especially with regard to personal privacy and subject consent agreements. And the legal contracts (“terms and conditions”) for access are often very complex and incompatible across data repositories.
The main incentives for any research activity are recognition and credit for the work, with a secondary incentive being improved efficiency, quality, and impact of the research (e.g. avoiding replication of effort, or ability to better verify results). Compliance with funder, institutional and publisher mandates is also a consideration but is insufficient by itself to ensure good behavior unless enforced. Data publishing and citation standards and practices are needed to support better credit allocation and reward mechanisms for good data sharing behavior. And in the short-term there are simple measures that would increase awareness and begin to build expectations – proposal questions such as “how has your research data been reused by others in the past” or asking researchers who download datasets to publish their own data and link it back to the source data. Even with proper credit, researchers express uncertainty about what data they should share, when, with whom, in what form, for what purposes, etc., and lack the resources or expertise (or awareness that they lack expertise) to do what is necessary or even get help to find out what options are available to them.
From the publishers’ perspective, there is renewed interest in the relationship between data and the publications that capture the research results from the data. Tighter integration of the data and publications is desirable for a variety of reasons, from making it easier to give credit to data providers to enabling “enhanced publications” [13] that simplify the mechanism of locating available data on particular topics. Publishers are increasingly interested in making sure that supporting data is available, often in advance of the article, but are uncertain of their own role in making that possible. Some are developing archiving solutions, some are partnering with institutions to link data to publications, and some are setting policy but remaining silent on implementation. At a minimum, publications should cite data in a similar manner to related publications, and could include statements about the data’s availability, metadata for where to access it, and its terms and conditions.
On the technology side, software used by researchers often makes subsequent data sharing and reuse difficult, since the “one tool per lab” [14] phenomenon is still common and there are few standards for structuring or encoding data to make it useful beyond its creators and the software they used. In other words, even if the data is successfully shared, without the software that produced, processed, analyzed or visualized it, the data is often not understandable by itself.
Whatever the current policies and intentions of funding agencies and publishers, unless researchers have access to appropriate infrastructure (repositories with long-term preservation capability, means of creating identifiers and metadata, or “paradata”, for datasets, etc.), the policies and goals for improved sharing of research data will not succeed. Where the infrastructure and support services exist, recruiting data from research to comply with policies and scientific norms is easier, although the infrastructure itself is not sufficient to ensure good compliance.
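To show how a publication might carry a data citation together with availability and terms information, here is a minimal Python sketch. The set of fields and the layout of the formatted string are assumptions chosen for illustration; they do not follow any specific citation standard, and the example values are placeholders rather than a real dataset.

```python
from dataclasses import dataclass


# Hypothetical citation record: the fields are illustrative assumptions.
@dataclass(frozen=True)
class DataCitation:
    creators: str    # people or organization credited for the data
    year: int        # year the dataset was published or deposited
    title: str
    repository: str  # archive where the data can be accessed
    identifier: str  # persistent identifier assigned by the archive
    terms: str       # short statement of the terms and conditions


def format_citation(c: DataCitation) -> str:
    """Render the citation as a single reference-list entry."""
    return (
        f"{c.creators} ({c.year}). {c.title} [dataset]. "
        f"{c.repository}. {c.identifier}. Terms of use: {c.terms}."
    )


# Placeholder values only; this does not describe a real dataset.
example = DataCitation(
    creators="Example Lab",
    year=2011,
    title="Stream temperature readings",
    repository="Example Data Archive",
    identifier="example-id-0001",
    terms="attribution required",
)
```

Keeping the identifier and the terms as separate fields, rather than free text, is what would let a publisher or repository link the article to the data and display the reuse conditions automatically.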
In order to achieve a good level of appropriate and effective data sharing, several things are needed:
- Clear and consistent statements of policy and enforcement practices by funders, publishers, institutions, societies and other research stakeholders;
- Easy-to-use and trustworthy infrastructure to accommodate the data and associated metadata;
- Credit mechanisms to reward researchers for the effort of sharing their data;
- Better clarity around the researchers’ (and other stakeholders’) rights in and responsibilities for the data, including privacy/confidentiality regulations and copyright status;
- Harmonization of Data Usage Agreements, including privacy restrictions.
A further issue is that, even with all of these pieces in place, researchers shared concerns about the possibility of misuse of their data – of it being reused without regard to the expressed conditions (e.g. citation, commercial use restrictions), of reuse becoming a support burden, and of other less tangible fears. The counterweight for these concerns is the understanding that data underpins scientific research reproducibility and can help advance scientific progress (both core values of science).
Barriers to sharing include the inverse of the main incentives – i.e., lack of recognition or credit for the work required to share data effectively. Additional barriers are uncertainty about what to share, when, with whom, in what form, etc., and lack of resources or expertise to do what is necessary. Finally, some researchers fear unwanted exposure from providing full access to their research data and tools, leaving them open to criticism that would be difficult without such access.
Reusing data has its own challenges, since researchers are often uncertain of the provenance of a given dataset and whether it can be trusted, and are also often faced with significant effort to reformat the data for integration with other data or use by a different tool than the original research used. Legal issues play into this, since researchers rarely understand what rights adhere to their data and who holds those rights, i.e., themselves, their institution (the grantee), their funder, or no one (i.e. public domain). And even if the determination is made, what contract, license or waiver to apply to the data is another source of confusion. International, interdisciplinary and cross-sector collaborations raise further questions, for some of which there may not be clear legal answers. Each stakeholder community wants a degree of control over policy affecting research data – researchers, funders, institutions/grantees, data archives, publishers, etc. There is recognition that policy needs to be cognizant of the research discipline it affects, but at the same time work across disciplines to support interdisciplinary research and achieve economies of scale. A general framework with room for discipline-specific detailed policies may be necessary to achieve everyone’s goals.
Data governance affects all stakeholders in the research enterprise: funders, institutions, government, legislatures, disciplines, publishers, data centers, standards bodies, researchers, libraries, consumers, and the public. First we should consider the activities of data governance, including creating, enabling, enforcing, promoting, educating, and managing policies over time. Thinking about roles in relation to activities, we can begin to delineate a few:
- Institutions ensure compliance with funder policies, for which they need incentives;
- Libraries can provide services for education and advising, in addition to data curation activities;
- Governments and research disciplines (e.g. societies) create policies according to the discipline’s and the public’s interests, on both national and international levels;
- Publishers create policy for data related to publications to ensure research validity and reproducibility, and manage that policy over time.