Sharing study materials in health and medical research
  1. Nicholas J DeVito (1),
  2. Caroline Morton (1),
  3. Aidan Gregory Cashin (2,3),
  4. Georgia C Richards (4),
  5. Hopin Lee (5,6)
  1. Bennett Institute for Applied Data Science, Nuffield Department of Primary Care Health Sciences, University of Oxford, Oxford, Oxfordshire, UK
  2. School of Health Sciences, University of New South Wales, Sydney, New South Wales, Australia
  3. Centre for Pain IMPACT, Neuroscience Research Australia, Randwick, New South Wales, Australia
  4. Centre for Evidence Based Medicine, Nuffield Department of Primary Care Health Sciences, University of Oxford, Oxford, Oxfordshire, UK
  5. Centre for Statistics in Medicine & Rehabilitation Research in Oxford, Nuffield Department of Orthopaedics, Rheumatology and Musculoskeletal Sciences (NDORMS), University of Oxford, Oxford, Oxfordshire, UK
  6. School of Medicine and Public Health, The University of Newcastle, Callaghan, New South Wales, Australia
Correspondence to Nicholas J DeVito, Primary Care Health Sciences, University of Oxford, Oxford, Oxfordshire, UK; nicholas.devito{at}phc.ox.ac.uk

Abstract

Making study materials available allows for a more comprehensive understanding of the scientific literature. Sharing can take many forms and include a wide variety of outputs including code and data. Biomedical research can benefit from increased transparency but faces unique challenges for sharing, for instance, confidentiality concerns around participants’ medical data. Both general and specialised repositories exist to aid in sharing most study materials. Sharing may also require skills and resources to ensure that it is done safely and effectively. Educating researchers on how to best share their materials, and properly rewarding these practices, requires action from a variety of stakeholders including journals, funders and research institutions.

  • ethics
  • methods
  • policy

Data availability statement

Data sharing not applicable as no datasets generated and/or analysed for this study.

WHAT IS ALREADY KNOWN ON THIS TOPIC

  • Sharing study materials can have numerous advantages; however, in practice, sharing is inconsistent and understudied.

WHAT THIS STUDY ADDS

  • This paper introduces the basic arguments in favour of increased sharing, provides guidance on how to share safely and effectively, and examines barriers that may impact sharing.

HOW THIS STUDY MIGHT AFFECT RESEARCH, PRACTICE OR POLICY

  • Increased awareness and education about sharing practices allows researchers to better consider how, where and when they might aim to share their research materials.

Introduction

Clear and accurate descriptions of methods are essential to the scientific endeavour. However, studies can range broadly in their complexity and the subsequent level of detail required to fully understand the methods. Relying solely on space-limited narrative methods sections, which some journals relegate to the end of the manuscript, can leave gaps for readers that make it difficult to assess what occurred in practice, and if there were any potential issues in the execution of a given study. This leads to difficulties in properly assessing a study’s merits, reproducing its findings and confidently building on its conclusions.1–3

Large-scale investigations of transparency practices in the biomedical literature, including sharing, have shown improvements over time but with room for further growth.4 Sharing of data, code and protocols lags ‘significantly behind’ other areas of transparency practice including disclosures of conflicts of interest and funding in the literature.5 Investigations of the data-sharing standards in health and medical journals have also shown clear gaps,6 7 and there might be a mismatch between the willingness to share and actual sharing rates.8 9 Examinations of the data-sharing policies of commercial and non-commercial research sponsors, as well as journals, show high variability in the extent that data sharing is encouraged, supported or required.10–15 Moreover, problems with reproducibility have been identified in clinical trials16, observational research17 and systematic reviews.18 This matches the low overall computational reproducibility seen in the broader literature.19

To ensure a sufficient level of detail is made available, many have called for the sharing of study materials to become more routine, expected and rewarded throughout science.20 Starting in 2023, plans for sharing and management of data will be a technical aspect of grant assessments at the US National Institutes of Health, the largest health funder in the world, and the US Office of Science and Technology Policy recently announced broad reforms in the availability of data originating from government-funded research; other major funders, like the UK National Institute for Health and Care Research, promote the sharing of funded research outputs.21–23 What exactly should be shared, and how to accomplish this increased transparency, will vary considerably both across and within disciplines.24

This article addresses sharing in the context of general health and medical research. We focus on the sharing of study data and analysis code, and touch on other study-specific materials; however, the general advice will cover most materials regardless of study-level idiosyncrasies.

What does ‘sharing’ mean?

When discussing ‘sharing’ in the context of academic investigations, we mean making relevant materials, key to understanding the conduct and analysis of a study, available to interested parties to the extent ethically and legally possible (box 1). Fundamental to sharing of study materials is that they are made available as part of the timely and complete dissemination of results. While we focus on sharing as it relates to data and materials being made available either before or alongside results, biases in the availability of results will undermine any efforts for more open and transparent science. For many, sharing in the context of research means making their raw data available. Sharing data allows for external review and reuse to support future research.25 In 2018, the International Committee of Medical Journal Editors began requiring that articles reporting clinical trials in member journals must contain a data-sharing statement.26 Notably, this is not a requirement to share data, but rather an attempt to be transparent about the extent or conditions around data sharing. However, there is no guarantee these statements will be implemented or honoured, or that they will lead to increased sharing.27 28

Box 1

Summary of sharing practices

What: Sharing data, code and other study materials like protocols, documentation and guides.

Why: Increased research transparency supports critical appraisal, reproducibility and secondary analysis that increase confidence in results and extend their value.

When: Sharing can occur at any time throughout a study with some expected at the start (eg, protocols), some potentially during a study (eg, preliminary data, guides) and some in a timely manner after completion of the study (eg, final data and analysis code).

Who: Investigators usually bear the final responsibility for their data, but institutions, funders and journals, including both editors and peer reviewers, can aid in ensuring sharing occurs efficiently, effectively and safely.

How: Sharing happens in a number of ways specific to a given study but various platforms, technologies and repositories can handle a wide range of study materials and aid in making them discoverable and appropriately accessible.

While data sharing is important, it is only one aspect of the study methods that can be made available.29 A wide array of documentation, data collection sheets and instruments, analysis plans and code, and various other forms of information can be shared to provide insights into how a study was conducted (box 2). The Transparency and Openness Promotion Guidelines grade journal policies across eight standards, including the sharing of data, analytic code, materials and documentation (ie, protocols and analysis plans), and each area is assessed across four levels ranging from ‘not implemented’ to having an active requirement and verification of sharing.30

Box 2

Case study in sharing materials in health and medicine

Eklund and colleagues published an analysis of common software packages for functional MRI (fMRI) research. By running tests on an fMRI data set, they found issues with the implementation of these programmes and concluded that there was a need to validate “the statistical methods being used in the field of neuroimaging.”

The study used an open fMRI data set* and the group shared all their processing scripts on GitHub.** This meant that when the article’s findings generated attention within the field, others were able to use these scripts and data to contextualise, replicate, verify and extend the original claims. Furthermore, because some of the fMRI statistical software is maintained as an open-source software project, the teams were able to find and fix bugs that could impact future analyses.

Eklund and colleagues end their response to comments on their original piece by noting that “together, these examples show the importance of data sharing, open-source software, code sharing, and reproducibility.”52

*https://www.openfmri.org/faq/

**https://github.com/wanderine/ParametricMultisubjectfMRI

Why should study materials be shared?

It is common practice in manuscripts with data-sharing statements to make even public, non-sensitive data, or other study materials, available only by ‘reasonable requests’ to an author, which can pose barriers between potential users and study materials.31 32 Email requests for access can go to defunct addresses, get lost in busy inboxes, or simply be ignored or rejected.27 33 Moreover, decisions to share are reliant on an individual’s subjective evaluation of what is ‘reasonable’ unless clear criteria for requests are articulated and consistently applied.34 If data can legally and safely be shared, open sharing should be the default to allow for proactive transparency and ease of use.

The first article in this series covered pre-registration, which can be considered a specific type of sharing of analysis plans and protocols prior to study commencement.35 Making these available allows for assessments of potential biases and provides details for future researchers. Making the analytic code used to clean, process and analyse study data in programs like R, Stata or Python available can help catch errors, facilitate future research and provide insights into how a study protocol was practically implemented.3 Observational health researchers working with large datasets can share the codelists (eg, ICD-10 codes) that are used to extract analysis populations and define variables of interest in order to avoid duplicating efforts and ensure consistency in analyses. Those involved in qualitative studies can make study materials, like interview guides, available to provide insights into how data were collected. Repositories, such as the National Heart, Lung and Blood Institute Biologic Specimen and Data Repository, facilitate access to biospecimens and other clinical and epidemiologic data to benefit other researchers.
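As an illustration of how a shared codelist can be reused, the following is a minimal sketch in Python. The codelist contents, the record layout and the function names (`load_codelist`, `select_patients`) are assumptions made for this example and do not come from any particular study or repository. It uses exact code matching only; real codelists often need hierarchy-aware matching and clinical review.

```python
import csv
import io

# Hypothetical codelist, as it might be shared in a repository as a CSV file.
CODELIST_CSV = """code,term
E11,Type 2 diabetes mellitus
E11.9,Type 2 diabetes mellitus without complications
"""


def load_codelist(text):
    """Return the set of codes from a CSV codelist that has a 'code' column."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["code"].strip().upper() for row in reader}


def select_patients(records, codes):
    """Return sorted IDs of patients with at least one record whose code is in the codelist."""
    return sorted(
        {r["patient_id"] for r in records if r["code"].strip().upper() in codes}
    )


# Made-up diagnosis records, purely for illustration.
records = [
    {"patient_id": "p1", "code": "E11.9"},
    {"patient_id": "p2", "code": "I10"},
    {"patient_id": "p3", "code": "e11"},
    {"patient_id": "p1", "code": "I10"},
]

codes = load_codelist(CODELIST_CSV)
cohort = select_patients(records, codes)
print(cohort)
```

Because the codelist lives in a plain file separate from the analysis code, another team could apply the same definition to their own data without re-deriving it.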

These are just some common examples of how sharing may occur and aid in promoting reproducibility, secondary analyses and increased utility of publicly funded research outputs. There is even evidence that more open sharing practices lead to higher citation rates of a given work.36 For any study, we believe it is valuable to reflect on what materials were key to its conduct and whether or not they can, and should, be shared. It may also be prudent to consider sharing to the extent possible prior to publication, rather than after publication, to ensure that reviewers, editors and other interested parties can review study details for issues, errors or biases before they enter the literature.37

How can materials be shared?

Proponents of better data sharing have developed the FAIR standards for “Findability, Accessibility, Interoperability, and Reuse.”38 The details of FAIR are specific to making data available and usable, and have recently been expanded to research software,39 but provide a useful framework for thinking about what makes sharing successful more generally:

  • Findability and accessibility: Sharing in obscure or unindexed locations impacts whether others can discover and use the information. In addition, one-off sharing on personal, or other non-persistent, repositories does not guarantee interested parties will be able to access the information into the future. Where relevant, ensuring appropriate metadata is attached to your information will also allow for more efficient discovery by others.

  • Interoperability and reuse: Posting information in a repository without any documentation or instructions can mean these materials are indecipherable to others. While sharing a protocol and statistical analysis plan that explains the methods and analysis in detail is a minimum standard, further clarifications (possibly including a data dictionary, commented and documented code, or instructions or guides about data extraction) can help the inspection and reuse of data. Some journals offer researchers the opportunity to publish formal descriptions of available data sets to increase their visibility and document their contents and use (eg, Nature Scientific Data). Properly presenting and documenting data or code will also make their future use more effective and efficient.40–42 It can also be useful to share information in standard formats that are easily usable across a variety of platforms. For instance, data in a .CSV file is more accessible than a program-specific data file (eg, a Stata .DTA file). Some industries or research areas will have niche standards for interoperable, open data formats (eg, Clinical Data Interchange Standards Consortium standards) that may be relevant to certain researchers. When sharing in non-standard or proprietary formats is the best option, it should be clearly documented how and what is necessary to access and use the information.
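To make the points about open formats and documentation concrete, here is a minimal sketch in Python that writes a small data set as CSV text alongside a machine-readable data dictionary, and checks that every column is documented. The column names, the dictionary layout and the helper names (`to_csv_text`, `undocumented_columns`) are assumptions for illustration only; they do not follow any specific standard. The data are made up, and the output is kept in memory rather than written to files so the example stays self-contained.

```python
import csv
import io
import json

# Made-up rows standing in for a cleaned, shareable data set.
rows = [
    {"participant_id": "001", "age_years": 54, "treatment_arm": "A"},
    {"participant_id": "002", "age_years": 61, "treatment_arm": "B"},
]

# Hypothetical data dictionary: one entry per column, describing its meaning.
data_dictionary = {
    "participant_id": {"type": "string", "description": "Study-assigned identifier"},
    "age_years": {"type": "integer", "description": "Age at enrolment", "unit": "years"},
    "treatment_arm": {
        "type": "string",
        "description": "Allocated arm",
        "allowed_values": ["A", "B"],
    },
}


def to_csv_text(rows, fieldnames):
    """Serialise rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def undocumented_columns(fieldnames, dictionary):
    """Return the columns that have no entry in the data dictionary."""
    return [name for name in fieldnames if name not in dictionary]


fieldnames = list(rows[0].keys())
csv_text = to_csv_text(rows, fieldnames)
dictionary_text = json.dumps(data_dictionary, indent=2)
missing = undocumented_columns(fieldnames, data_dictionary)

print(csv_text)
print("Undocumented columns:", missing)
```

In practice the CSV and the dictionary would be deposited together, so that a reader opening the data in any software can interpret each column without contacting the authors.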

Specific methods for sharing will also depend on what is being shared. The simplest form of sharing would involve making any study materials available as supplements to the manuscript. This has the clear benefit of being immediately available to readers; however, this practice may have other limitations. Links to files can break in journal website redesigns and there may be restrictions as to how information can be shared, accessed and reused. For instance, sharing a large data set as a PDF would compromise its usability by others. This method also makes sharing contingent on journal publication.

Alternatively, information can be shared on services and repositories created to persistently house academic outputs. Preprint servers can host additional study materials, with few limitations, alongside an open-access version of the manuscript. The Open Science Framework, Figshare and Zenodo are some examples of more general-purpose repositories that offer generous free storage of nearly any type of file, structured metadata, version control and digital identifiers (eg, DOIs) for easy citation and linkage, and they often integrate with one another (table 1). More specialised repositories, like GitHub for code, or ClinicalTrials.gov for trial registrations, offer similar functionality and can be valuable tools for sharing. Using these repositories makes sharing as easy as putting a persistent URL or DOI into a manuscript that points to all available materials. Some types of studies may require more sophisticated sharing methods that ensure reproducible computational environments.43

Table 1

Examples of sharing platforms

Some platforms are specifically designed for securely warehousing sensitive data, particularly relevant in health and medical research. Vivli and YODA are two examples of repositories that help securely store and manage access to individual patient data from clinical research.44 There is also increased investment in Trusted Research Environments in which data providers offer approved researchers firewalled access to conduct analysis in which the raw data would never leave the environment.45 By sharing code designed for analysis within a given Trusted Research Environment, any researcher with access should be able to replicate, verify or expand on a finding. It may also be worthwhile to check the requirements and recommended resources or repositories from the researcher’s institution or funders for depositing and managing access to sensitive information.

Barriers and precautions

While increased sharing may be ideal, numerous barriers exist, especially for health and medical research.46 The biggest concern with sharing information, especially data, from health and medical studies is the potential to expose sensitive or confidential information. Legal regimes, such as the General Data Protection Regulation in Europe, must also be considered when sharing sensitive data derived from research. A key component of sharing is that it must be done thoughtfully and responsibly.47 As discussed previously, there are specialised repositories which can help manage access to sensitive data.

For some studies, it may be prudent to make data sharing part of the initial participant consent process. Using this model, study participants would understand exactly what data they are providing and the ways in which these could be used or shared. If the risks of re-identification are well described, understood and fully consented to, this may allow for more effective sharing of data.48 Even when sharing data in accepted or responsible ways, care must be taken to ensure information is not overly disclosive or the anonymity of the data is overemphasised. Pseudonymisation is a common practice in which some identifiable data are removed; however, this does not mean that the data are not disclosive.49 Furthermore, thought should be given to whether existing datasets could be combined with your open data to de-anonymise participants.50 Data leakage and vulnerabilities may even exist in unexpected places. For instance, SVG files, a common image format requested by journals, embed the data underlying the image directly in the file and could be disclosive.
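One simple way to explore whether pseudonymised data might still be disclosive is to count how many records share each combination of indirectly identifying fields. The sketch below is an assumption-laden illustration of that idea (often discussed under the name k-anonymity, a term not used in this article); the field names, the threshold `k` and the function names are hypothetical, the records are made up, and a check like this is not sufficient on its own to declare data safe to share.

```python
from collections import Counter


def group_sizes(records, quasi_identifiers):
    """Count records for each combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)


def smallest_group_size(records, quasi_identifiers):
    """Return the size of the rarest combination (0 if there are no records)."""
    return min(group_sizes(records, quasi_identifiers).values(), default=0)


def risky_groups(records, quasi_identifiers, k):
    """Return combinations shared by fewer than k records."""
    sizes = group_sizes(records, quasi_identifiers)
    return {combo: n for combo, n in sizes.items() if n < k}


# Made-up pseudonymised records: direct identifiers removed, but
# combinations of the remaining fields may still single someone out.
records = [
    {"age_band": "40-49", "sex": "F", "postcode_area": "OX1"},
    {"age_band": "40-49", "sex": "F", "postcode_area": "OX1"},
    {"age_band": "40-49", "sex": "F", "postcode_area": "OX1"},
    {"age_band": "80-89", "sex": "M", "postcode_area": "OX2"},
]

quasi_identifiers = ["age_band", "sex", "postcode_area"]
smallest = smallest_group_size(records, quasi_identifiers)
flagged = risky_groups(records, quasi_identifiers, k=3)

print("Smallest group size:", smallest)
print("Combinations below threshold:", flagged)
```

A combination held by a single record, as in the last row here, is the kind of case where linkage with another dataset could plausibly identify a participant, so it may warrant coarser categories, suppression or controlled access.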

Researchers should also consider whether they have ownership or permission to legally share data. For instance, one might do a study using the UK Clinical Practice Research Datalink dataset which is built from primary care electronic health record data, or with proprietary data purchased from a commercial data supplier such as those detailing insurance claims or drug sales. If sharing is not possible, be transparent about the reasons why, state who owns and manages the data, and detail how the dataset was used to the extent possible. Additional materials (eg, the protocol and analytic code) can often still be shared. When these considerations are not relevant, information can be shared under appropriately permissive licensing like Creative Commons or The MIT Licence for software.

Lastly, properly sharing data requires both human and financial resources to build data-sharing capacity, access platforms, respond to inquiries, and meet data security specifications. This creates a need for investment in skills and resources alongside systems that promote and reward openness and transparency.51

Conclusions

Increased transparency into the methods underlying published research is growing in importance. Concerns about rigour, reproducibility and research integrity cannot be addressed if the key ingredients that underlie published analyses are unavailable. Institutions throughout biomedical research need to invest in educating researchers about sharing while establishing infrastructure, incentives and requirements that support making study materials as accessible as possible.

Ethics statements

Patient consent for publication

Ethics approval

Not applicable.

Footnotes

  • Twitter @AidanCashin, @Richards_G_C

  • Contributors NJD wrote the first draft of the piece. All authors contributed to the conceptualisation of the manuscript and critically reviewed and provided feedback on its content. NJD is guarantor.

  • Funding The authors have not declared a specific grant for this research from any funding agency in the public, commercial or not-for-profit sectors.

  • Competing interests GCR was financially supported by the NHS, National Institute of Health Research (NIHR), School for Primary Care Research (SPCR), the Naji Foundation and the Rotary Foundation to study for a Doctor of Philosophy (DPhil) at the University of Oxford (2017–2020), but no longer has interests to declare. GCR is an Associate Editor of BMJ Evidence-Based Medicine.

  • Patient and public involvement Patients and/or the public were not involved in the design, or conduct, or reporting, or dissemination plans of this research.

  • Provenance and peer review Not commissioned; externally peer reviewed.