18 Methodology and reporting quality of studies using machine learning models for medical diagnosis: a methodological systematic review
  1. Mohamed Yusuf (1)
  2. Ignacio Atal (2,3,4)
  3. Jacques Li (2,3)
  4. Phil Smith (5)
  5. Philippe Ravaud (2,3,6)
  6. Michael Callaghan (1)
  7. James Selfe (1)

Affiliations:
  1. Manchester Metropolitan University, Manchester, UK
  2. INSERM, Methods, Université Paris Descartes, Paris, France
  3. Cochrane France, Paris, France
  4. Centre for Research and Interdisciplinarity (CRI), Université Paris Descartes, Paris, France
  5. Manchester Metropolitan University, Manchester, UK
  6. EQUATOR-Network, Paris, France


Objectives Over the past decade, owing to access to large volumes of clinical data and the development of new machine learning (ML) techniques, there has been a rise in the application of ML methods to medicine. Because of their potential to facilitate and promote timely and objective clinical decision-making, ML methods have been applied to gain insights into clinical diagnosis. Despite their popularity, these promising techniques have pitfalls. Studies using ML for diagnosis may contain errors in both design and execution. Additionally, it appears that ML researchers are not familiar with the agreed practices of the medical research community.

Our systematic review assessed the reporting quality of studies developing or validating ML models for clinical diagnosis, with a specific focus on the reporting of information concerning the participants on whom the diagnostic task was evaluated.

Method We searched Core Clinical Journals for studies published between 2015 and 2018. Two reviewers independently screened the retrieved articles, applying eligibility criteria to the title, abstract and full text accordingly; a third reviewer resolved discrepancies. Two reviewers independently extracted, from the eligible articles, information regarding the participants on whom the ML tests were evaluated. Additional reviewers checked and verified the extracted data.

The following items were extracted: data source, method for partitioning the evaluation set from the training data, evaluation-set eligibility criteria, diagnostic criteria, baseline characteristics, participant flow diagram, distribution of disease severity in diagnosed patients, distribution of alternative diagnoses for non-diagnosed patients and use of a reporting guideline. Additionally, we extracted information on the time difference between the conduct of the ML test and the reference standard, as well as whether the evaluation group corresponded to the setting in which the ML test would be applied.
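One of the extracted items, the method for partitioning the evaluation set from the training data, can be illustrated with a minimal sketch. A common safeguard is to split at the patient level, so that no patient contributes records to both partitions. The `records` structure, its field names and the split fraction below are hypothetical illustrations, not details taken from any of the reviewed studies:

```python
import random

def partition_by_patient(records, eval_fraction=0.2, seed=42):
    """Split records into training and evaluation sets at the
    patient level, so no patient appears in both partitions.

    `records` is assumed to be a list of dicts, each carrying a
    "patient_id" key (a hypothetical structure for illustration).
    """
    # Collect the unique patient identifiers and shuffle them
    # reproducibly, then reserve a fraction for evaluation.
    patient_ids = sorted({r["patient_id"] for r in records})
    rng = random.Random(seed)
    rng.shuffle(patient_ids)
    n_eval = max(1, int(len(patient_ids) * eval_fraction))
    eval_ids = set(patient_ids[:n_eval])

    training = [r for r in records if r["patient_id"] not in eval_ids]
    evaluation = [r for r in records if r["patient_id"] in eval_ids]
    return training, evaluation
```

Splitting on patient identifiers rather than on individual records avoids the leakage that occurs when repeated measurements from the same patient land on both sides of the partition, one of the design choices the extracted item is intended to capture.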

Results The search yielded 161 papers, of which 28 met the eligibility criteria. Details of the data source were reported in 86% of the papers, while all of the papers reported the method for partitioning the evaluation set from a larger dataset. Eligibility criteria for the evaluation set were reported in 82% of studies. Information on the diagnostic/non-diagnostic classification was reported well (82%). The least reported items were use of a reporting guideline (0%), distribution of disease severity (29%), participant flow diagram (34%), distribution of alternative diagnoses (36%) and baseline characteristics (64%). A large proportion of studies (82%) had a delay between the conduct of the reference standard and the ML test, while 4% did not and 14% were unclear. For 54% of the studies, it was unclear whether the evaluation group corresponded to the setting in which the ML test would be applied.

Conclusions We found that none of the eligible studies in this review used a reporting guideline, and a large proportion of the studies lacked adequate detail to replicate, assess and interpret the study findings.
