Original Article
Machine learning reduced workload with minimal risk of missing studies: development and evaluation of a randomized controlled trial classifier for Cochrane Reviews

https://doi.org/10.1016/j.jclinepi.2020.11.003
Under a Creative Commons license
open access

Highlights

  • Systematic review processes need to become more efficient.

  • Machine learning is sufficiently mature for real-world use.

  • A machine learning classifier was built using data from Cochrane Crowd.

  • It was calibrated to achieve very high recall.

  • It is now live and in use in Cochrane review production systems.

Abstract

Objectives

This study developed, calibrated, and evaluated a machine learning classifier designed to reduce the study identification workload involved in producing Cochrane systematic reviews.

Methods

A machine learning classifier for retrieving randomized controlled trials (RCTs) was developed (the “Cochrane RCT Classifier”), with the algorithm trained using a data set of title–abstract records from Embase, manually labeled by the Cochrane Crowd. The classifier was then calibrated using a further data set of similar records manually labeled by the Clinical Hedges team, aiming for 99% recall. Finally, the recall of the calibrated classifier was evaluated using records of RCTs included in Cochrane Reviews that had abstracts of sufficient length to allow machine classification.
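The calibration step described above, choosing a score cutoff so that a target share of known RCTs is retained, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `calibrate_threshold` and `recall_and_precision`, the score scale, and the toy data are all hypothetical, and the sketch assumes only that the classifier emits a numeric score per record, with higher scores meaning "more likely an RCT".

```python
import math


def calibrate_threshold(scores, labels, target_recall=0.99):
    """Return the highest cutoff that still retrieves at least
    `target_recall` of the positively labeled records.

    Hypothetical helper: records with score >= cutoff are kept as
    possible RCTs; labels are 1 (RCT) or 0 (not an RCT).
    """
    positive_scores = sorted(
        (s for s, y in zip(scores, labels) if y == 1), reverse=True
    )
    if not positive_scores:
        raise ValueError("need at least one positive record to calibrate")
    # Smallest number of positives that must be kept to reach the target.
    needed = math.ceil(target_recall * len(positive_scores))
    # The score of the `needed`-th best positive is the highest usable cutoff.
    return positive_scores[needed - 1]


def recall_and_precision(scores, labels, threshold):
    """Compute recall and precision when keeping records scoring >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision


# Toy calibration set (made-up scores and labels, for illustration only).
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.2, 0.1]
toy_labels = [1, 1, 0, 1, 0, 1, 0]
toy_cutoff = calibrate_threshold(toy_scores, toy_labels, target_recall=0.75)
toy_recall, toy_precision = recall_and_precision(toy_scores, toy_labels, toy_cutoff)
```

Pushing the target recall toward 1.0 lowers the cutoff, so more non-RCT records pass through. This trade-off is why a classifier tuned for very high recall typically shows low precision.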

Results

The Cochrane RCT Classifier was trained using 280,620 records (20,454 of which reported RCTs). A classification threshold was set using 49,025 calibration records (1,587 of which reported RCTs), and our bootstrap validation found that the classifier had a recall of 0.99 (95% confidence interval 0.98–0.99) and a precision of 0.08 (95% confidence interval 0.06–0.12) in this data set. The final, calibrated RCT classifier correctly retrieved 43,783 (99.5%) of 44,007 RCTs included in Cochrane Reviews but missed 224 (0.5%). Older records were more likely to be missed than those more recently published.
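The recall figure for RCTs included in Cochrane Reviews follows directly from the reported counts, and a confidence interval of the kind quoted for the calibration set can be approximated with a percentile bootstrap. The sketch below is illustrative only: the abstract does not specify the bootstrap procedure, so `bootstrap_recall_ci` is a hypothetical helper that resamples per-record hit/miss flags, which may differ from what the authors did.

```python
import random

# Counts reported in the abstract for RCTs included in Cochrane Reviews.
total_rcts = 44_007
retrieved = 43_783
missed = total_rcts - retrieved   # records the classifier failed to retrieve
recall = retrieved / total_rcts   # share of included RCTs correctly retrieved


def bootstrap_recall_ci(hits, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for recall.

    Hypothetical helper: `hits` holds one flag per true RCT
    (1 = retrieved, 0 = missed). The flags are resampled with
    replacement and the recall of each resample is recorded.
    """
    if not hits:
        raise ValueError("need at least one record to resample")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(hits)
    estimates = sorted(
        sum(rng.choices(hits, k=n)) / n for _ in range(n_resamples)
    )
    lower_index = int((alpha / 2) * n_resamples)
    upper_index = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return estimates[lower_index], estimates[upper_index]
```

With real data, `hits` would hold one flag per RCT in the evaluation or calibration set. The interval narrows as the number of records grows.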

Conclusions

The Cochrane RCT Classifier can reduce manual study identification workload for Cochrane Reviews, with a very low and acceptable risk of missing eligible RCTs. This classifier now forms part of the Evidence Pipeline, an integrated workflow deployed within Cochrane to help improve the efficiency of the study identification processes that support systematic review production.

Keywords

Machine learning
Study classifiers
Searching
Information retrieval
Methods/methodology
Randomized controlled trials
Systematic reviews
Automation
Crowdsourcing
Cochrane Library

Funding: This work received funding from Cochrane via Project Transform, the Australian National Health & Medical Research Council (Partnership Project grant APP1114605), and the U.S. National Library of Medicine (Award 2R01-LM012086-05). I.J.M. was supported by a Medical Research Council (UK) fellowship (MR/N015185/1). A portion of James Thomas's time was supported by the National Institute for Health Research (NIHR) Collaboration for Leadership in Applied Health Research and Care North Thames at Barts Health NHS Trust. The views expressed are those of the authors and not necessarily those of the NHS, the NIHR, or the Department of Health.

Ethics approval and consent to participate: Not applicable: study does not involve human subjects.

Conflict of interest: The authors declare that they have no competing interests.

Authors’ contributions: J.T., A.N.S., S.M., and I.J.M. designed the study. J.T. and I.J.M. built the classifiers and calibration models. A.N.S. and S.M. worked on evaluation data sets. C.M. provided overall Cochrane direction and governance. I.S. and J.E. provided methodological input throughout. All authors read and approved the final article.

Consent for publication: Not applicable: no participant data presented.

Availability of data and materials: Hyperlinks to source code repositories are supplied in the text. Cochrane's CENTRAL database is referenced and is available at https://www.cochranelibrary.com/central. All Cochrane Crowd labels are open and available at http://crowd.cochrane.org/DownloadData.php.