Evidence-based medicine and machine learning: a partnership with a common purpose
  1. Ian Scott1,2,
  2. David Cook3,
  3. Enrico Coiera4
  1. 1 Internal Medicine and Clinical Epidemiology, Princess Alexandra Hospital, Woolloongabba, Queensland, Australia
  2. 2 School of Clinical Medicine, The University of Queensland, Woolloongabba, Queensland, Australia
  3. 3 Intensive Care, Princess Alexandra Hospital, Woolloongabba, Queensland, Australia
  4. 4 Australian Institute of Health Innovation, Macquarie University, Sydney, New South Wales, Australia
  1. Correspondence to Professor Ian Scott, Internal Medicine and Clinical Epidemiology, Princess Alexandra Hospital, Woolloongabba, QLD 4102, Australia; ian.scott{at}health.qld.gov.au

Abstract

From its origins in epidemiology, evidence-based medicine has promulgated a rigorous approach to assessing the validity, impact and applicability of hypothesis-driven empirical research used to evaluate the utility of diagnostic tests, prognostic tools and therapeutic interventions. Machine learning, a subset of artificial intelligence, uses computer programs to discover patterns and associations within huge datasets which are then incorporated into algorithms used to assist diagnoses and predict future outcomes, including response to therapies. How do these two fields relate to one another? What are their similarities and differences, their strengths and weaknesses? Can each learn from, and complement, the other in rendering clinical decision-making more informed and effective?

  • health informatics
  • information technology
  • general medicine
  • statistics & research methods


Introduction

In 1996, the evidence-based medicine (EBM) movement stated its mission as ‘…integrating individual clinical expertise and best external evidence’1 in making clinical decisions. EBM has promulgated rigorous assessment of the validity, impact and applicability of results of research studies of diagnostic tests, therapies and prognostic tools.2 Machine learning (ML), a subset of artificial intelligence, has developed independently and uses advanced computer programs to identify patterns and associations within large digitised datasets with minimal human instruction, which are then encoded into algorithms that assist diagnosis and predict future outcomes, including response to therapies. In this analysis, we explore the commonalities and differences between EBM and ML and note the ways in which one can complement the other in the quest for scientific truth.

Origins and aims

Both endeavours aim to inform clinical decision-making, but they differ in their epistemological methods3–5 (see table 1). EBM originated from clinical epidemiology and uses empirical research to make inferences, while ML originated from data and computer science and uses data mining methods to recognise patterns and associations. ML seeks to use observational data to develop algorithms that diagnose disease more accurately than conventional cross-sectional studies and generate prognoses more accurately than cohort regression analyses.

Table 1

Characteristics of evidence-based medicine (EBM) and machine learning (ML)

Traditional EBM methods struggle to meet the demands of making diagnoses or generating predictions within large, heterogeneous populations. Many past studies have relied on analysing aggregated, study-level data derived from narrowly defined, highly selected populations in whom a limited number of predictor and outcome variables can be reliably studied. In contrast, ML can analyse huge datasets involving diverse ‘real-world’ populations from whom massive amounts of information may have been gathered on a vast array of phenotypic, genotypic and environmental variables.

Historically, EBM has given preference to the randomised controlled trial (RCT) in evaluating the efficacy of therapeutic interventions because of its ability to balance, by randomisation, prognostic factors between experimental and control (or standard care) groups. However, under the Grading of Recommendations Assessment, Development and Evaluation (GRADE) system,5 observational studies of interventions have been included in evidence hierarchies, although ranked lower in terms of freedom from bias. Observational data and ML are useful when prospective research studies, especially RCTs, are not feasible because of ethical concerns, logistical barriers, limited timespans, cost or inability to recruit patients and/or clinicians.6 ML offers a means for defining and analysing a personalised ‘virtual cohort’ of individuals whose collective recorded clinical destiny may be at least as predictive of intervention outcomes for a given patient as any RCT.7 EBM recognises the value of high-quality observational studies of interventions where investigators have attempted to rigorously adjust results for selection bias and confounding by indication by using methods such as propensity matching and instrumental variables.6
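
To make the idea of adjusting for confounding by indication concrete, the sketch below is illustrative only: it is written in Python with scikit-learn on entirely synthetic data, and all variable names and effect sizes are assumptions for the example rather than anything reported in the literature cited here. It shows one simple form of propensity score matching, in which treatment propensity is modelled from baseline covariates, treated patients are matched to untreated patients with similar scores, and outcomes are compared within matched pairs.

```python
# Minimal sketch of propensity score matching on synthetic observational data.
# All variable names, coefficients and the simulated treatment effect are
# illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000

# Baseline covariates that influence both treatment choice and outcome
# (confounding by indication).
age = rng.normal(65, 10, n)
severity = rng.normal(0, 1, n)

# Clinicians are more likely to treat older, sicker patients.
p_treat = 1 / (1 + np.exp(-(0.03 * (age - 65) + 0.8 * severity)))
treated = rng.binomial(1, p_treat)

# Outcome depends on the covariates and a simulated treatment effect of -0.3.
outcome = 0.02 * age + 0.5 * severity - 0.3 * treated + rng.normal(0, 1, n)

# 1. Estimate propensity scores from baseline covariates only.
X = np.column_stack([age, severity])
ps = LogisticRegression().fit(X, treated).predict_proba(X)[:, 1]

# 2. Match each treated patient to the untreated patient with the closest
#    propensity score (1:1 nearest neighbour, with replacement).
treated_idx = np.where(treated == 1)[0]
control_idx = np.where(treated == 0)[0]
matches = control_idx[
    np.abs(ps[control_idx][None, :] - ps[treated_idx][:, None]).argmin(axis=1)
]

# 3. Compare outcomes: the crude difference is confounded, whereas the
#    matched estimate should lie closer to the simulated effect of -0.3.
naive_effect = outcome[treated == 1].mean() - outcome[treated == 0].mean()
matched_effect = (outcome[treated_idx] - outcome[matches]).mean()
print(f"Naive difference:   {naive_effect:.2f}")
print(f"Matched difference: {matched_effect:.2f}")
```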

Several ML applications have now been approved by regulatory authorities in the USA for routine clinical use, most of which relate to diagnostic imaging (see box 1).8–10 Potentially, ML offers greater precision and personalisation of risk prediction and prognostication, and may be superior in identifying novel risk factors.11 Currently, ML can guide the care of patients with rare or highly complex maladies that are not amenable to RCTs.12 While some ML prediction models are essentially equivalent to conventional statistical EBM models, others such as deep learning approximate the latter but make no assumptions about how data are statistically distributed. As a result, these more complex ML models may detect hypothesis-generating treatment-outcome associations within specific subpopulations or clinical contexts which can inform aims and design of future trials.

Box 1

Examples of approved machine learning (ML) applications in diagnosis

  • IDx-DR analyses images of the eye taken with a retinal camera called the Topcon NW400. A doctor uploads the digital images of the patient’s retinas to a cloud server on which IDx-DR software is installed. If the images are of sufficient quality, the software provides the doctor with one of two results: (1) more than mild diabetic retinopathy detected: refer to an eye care professional or (2) negative for more than mild diabetic retinopathy; rescreen in 12 months. If a positive result is detected, patients should see an eye specialist for further diagnostic evaluation and possible treatment as soon as possible.8

  • OsteoDetect analyses two-dimensional X-ray images for signs of distal radius fracture, a common type of wrist fracture that is often missed by human observers. The algorithm identifies and highlights regions suggestive of fracture during the review of posterior–anterior and medial–lateral images of adult wrists.9

  • Viz.AI Contact analyses CT images of the brain and sends a text notification to a neurovascular specialist if a suspected large vessel blockage has been identified in patients presenting with a stroke-like syndrome.10 This task potentially involves the specialist sooner than would be the case under usual care in which the first-line provider must wait for a radiologist to review CT images before notifying the specialist. This more rapid notification of a possible stroke can be sent to a mobile device, such as a smartphone or tablet, although the specialist still needs to review the images on a clinical workstation.

In the final analysis, ML is the latest in a series of methodologies that can be applied to meet the goals of EBM. Moreover, the scientific rigour of EBM in assessing the quality and clinical utility of studies of diagnostic tests, therapies and prognostic tools applies equally to ML applications, as exemplified in a recent users’ guide for reading ML studies focused on diagnostic imaging applications.13 Existing EBM standards, such as the Prediction model Risk Of Bias ASsessment Tool (PROBAST) for assessing risk of bias in prediction models, can aid developers of ML models in selecting appropriate training datasets and predictor variables.14

Methodological and implementation challenges

In clinical practice, lessons learnt during EBM’s evolution can help refine and improve the implementation of ML methods. Equally, ML can mitigate some of the persistent limitations and evidence-poor zones of EBM where studies are underpowered, of limited generalisability, or report short-term outcomes. In the following sections, we consider the advantages and limitations of ML across selected themes important to any new methodology that seeks to meet EBM goals of informing clinical decision-making reliably and meaningfully.

Data quality

ML facilitates rapid and reproducible analyses of large quantities of complex, routinely collected digitised data (RCDD) comprising administrative data, investigation results, imaging data, clinical observations and patient events contained within electronic health records (EHRs). Compared with clinical trial data, RCDD offers massive scale, enhanced representativeness of populations and settings, longer observation periods, larger numbers of events and continual updating.15 With more data and more complex models, ML aims to improve the precision of predictions or pattern matching. However, errors can be introduced when data are poorly organised or standardised, incomplete, inaccurate, mislabelled, unrepresentative of the diseases or populations of interest, or of small volume (‘garbage in = garbage out’). More importantly, systematic biases within digital documentation of clinical encounters,16 such as differentials in the ethnic or socioeconomic mix of patient populations17 or variations in standards of care or clinical policies, will faithfully propagate into ML models (‘bias in = bias out’). A particular risk of observational studies is selection bias arising from clinicians choosing to treat patients differently (confounding by indication), which has produced falsely inflated estimates of intervention benefit.18
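
As an illustration of how such data problems might be surfaced before any modelling begins, the sketch below uses Python with pandas; the column names, thresholds and toy data are assumptions chosen for the example, not a recommended standard. It checks completeness, plausibility and subgroup outcome prevalence in a small, routinely collected extract.

```python
# Minimal sketch of pre-modelling data quality checks on routinely collected
# data. Column names (age, sbp, ethnicity, outcome) and thresholds are
# illustrative assumptions.
import pandas as pd

def quality_report(df: pd.DataFrame, outcome_col: str) -> None:
    # 1. Completeness: proportion of missing values per column.
    print("Missingness per column:")
    print(df.isna().mean().round(3))

    # 2. Plausibility: flag physiologically impossible values.
    implausible = (df["age"] < 0) | (df["age"] > 120) | (df["sbp"] < 40) | (df["sbp"] > 300)
    print(f"\nRows with implausible age/systolic BP: {implausible.sum()}")

    # 3. Representativeness: outcome prevalence by subgroup, to surface
    #    systematic differences that a model would otherwise absorb.
    print("\nOutcome prevalence by ethnicity:")
    print(df.groupby("ethnicity")[outcome_col].mean().round(3))

# Example usage with a tiny illustrative dataset.
df = pd.DataFrame({
    "age": [72, 65, None, 130, 58],
    "sbp": [145, 120, 110, 135, 30],
    "ethnicity": ["A", "B", "A", "B", "A"],
    "outcome": [1, 0, 0, 1, 0],
})
quality_report(df, "outcome")
```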

EBM methods mitigate such bias by mandating well-defined study protocols which prespecify: participant inclusion criteria with verifiable diagnostic criteria, intervention mode and outcome measures; randomisation (for therapeutic trials) or random or purposive sampling (for cross-sectional and cohort studies); and intention to screen/test/treat analysis of independently adjudicated outcomes. Data are gathered prospectively from well-defined primary sources by trained researchers who ensure data accuracy and completeness. Data are imported into standardised statistical packages for analysis, with results interpreted impartially and study limitations made explicit.

ML may struggle in achieving accurate, bias-free data for several reasons. Considerable effort, cost and resources are often required to first acquire, clean, curate, format and integrate high-dimensional data that, in most cases, has been collected retrospectively from many sources whose primary purposes were unrelated to research. Manipulating free text within EHRs using natural language processing (NLP), and integrating data from disparate sources using interface software, such as Fast Interoperable Healthcare Resources, are making data collection and curation easier to perform. However, NLP may fail to deliver appropriate semantic and contextual interpretation of unstructured text. Even with automated extraction of anonymised data into curated warehouses, development of clinically sensible models requires robust metadata catalogues, understanding of clinical context and synthesis of temporal sequences of clinical decision points and events. Model inputs and outputs (such as predicting a diagnosis or outcome event) must be also validated against reliable reference standards. In keeping with EBM, ML must define the primary objective, the target population, the required input data and its source, desired or expected outputs, model properties, optimisation strategies, and any assumptions, bias or limitations inherent to its models.

Data quantity

EBM stresses the need for sample size calculations to ensure a study has sufficient power to reliably evaluate investigations or interventions, and to avoid type 2 (false-negative) errors. With ML, methods are emerging for determining a priori the amount of data required for any specific model,19 although these are not yet widely applied in a standardised fashion. This task is even more challenging for deep learning networks, in which the number of layers, and the number of filters at each layer, vary according to task complexity, making the required amount of input data difficult to predict. In general, the more complex the ML model, the more data required. In practice, if adding more data continues to improve model performance, more data are desirable.
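
One pragmatic approach is the empirical learning curve: model performance is re-estimated at increasing training set sizes to judge whether additional data would still help. The sketch below uses Python with scikit-learn on a synthetic classification task; the task, model and scoring choice are assumptions for the example.

```python
# Minimal sketch of an empirical learning curve: validation performance is
# plotted against training set size to judge whether more data would help.
# The synthetic classification task is an illustrative assumption.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=5000, n_features=30, n_informative=10, random_state=0)

train_sizes, train_scores, valid_scores = learning_curve(
    LogisticRegression(max_iter=1000),
    X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5,
    scoring="roc_auc",
)

# If the validation AUC is still rising at the largest training size,
# collecting more data is likely worthwhile; if it has plateaued, added
# data buys little.
for n, auc in zip(train_sizes, valid_scores.mean(axis=1)):
    print(f"n={n:>5}: validation AUC = {auc:.3f}")
```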

Complexity

The variables that need to be measured in evaluating a test, treatment or prognostic tool are usually defined a priori based on a causal or prediction model of expected relationships. Their finite scope will, however, constrain the complexity of analysis and the precision of results. With ML, datasets may contain thousands of variables, some of which are non-numerical, and a large number of potential relationships (high dimensionality). To avoid finding spurious relationships, or models that appear to work simply by chance, good ML practice discourages hypothesis-free data mining and instead defines input features that are thought to be salient and causal to decisions at the output, and excludes irrelevant features which may degrade model performance. For models whose outputs inform optimal use of investigations and treatments, model features need to be restricted to baseline characteristics of patients measured before clinicians have decided what care patients subsequently receive. These processes of feature selection and engineering20 need input from clinical experts who can provide rationales for selecting and transforming features, and who can detect potential bias and errors during model development.
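
The sketch below illustrates this principle using Python with scikit-learn and synthetic data; all feature names are hypothetical. Features measured after the care decision are excluded to avoid leakage of clinicians' decisions into the model, and the remaining baseline candidates are screened with a simple filter, on the understanding that clinical review would still vet the result.

```python
# Minimal sketch of restricting model inputs to baseline features measured
# before any treatment decision, then screening them with a simple filter.
# All feature names and the synthetic data are illustrative assumptions.
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, mutual_info_classif

rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    # Baseline characteristics, known before the care decision.
    "age": rng.normal(70, 12, n),
    "creatinine": rng.normal(90, 25, n),
    "systolic_bp": rng.normal(130, 20, n),
    "prior_admissions": rng.poisson(1.0, n),
    # Measured after treatment began; excluded to avoid leaking the
    # clinicians' decisions into the model.
    "icu_transfer": rng.binomial(1, 0.1, n),
})
df["outcome"] = rng.binomial(
    1, 1 / (1 + np.exp(-(0.04 * (df["age"] - 70) + 0.01 * (df["creatinine"] - 90))))
)

baseline_features = ["age", "creatinine", "systolic_bp", "prior_admissions"]
X, y = df[baseline_features], df["outcome"]

# Simple filter-based screening; clinical review should still confirm that
# retained features are plausible and not proxies for care already delivered.
selector = SelectKBest(mutual_info_classif, k=2).fit(X, y)
print("Retained:", [f for f, keep in zip(baseline_features, selector.get_support()) if keep])
```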

Precision and performance

EBM has sought to precisely define benefits and harms for individuals by employing subgroup analyses of aggregated data from large mega-trials or, even better, individual patient data meta-analyses from multitrialist collaborations. These have been constrained by the limited number of variables included in most trial databases. ML seeks to move beyond representative or ‘average patient’ outcome probabilities and to guide care decisions for individuals using patterns of association across many more variables pertaining to their specific clinical, biological and behavioural characteristics. However, because of their complexity, ML models are more prone to systematic but potentially hidden errors, especially those relating to patient subgroups under-represented in the model training datasets.

In assessing the performance of predictive tools, measures of discrimination (area under the receiver operating characteristic curve (AUC)) and calibration are relevant to ML applications, and enable comparisons of the accuracy of different models directed at the same prediction task. ML models do not necessarily demonstrate better AUCs than models based on more traditional logistic regression.21 Moreover, in determining optimal clinical decision thresholds, accurate calibration is often more important, especially for predictions in low-risk populations.22
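
A minimal sketch of computing both metrics is shown below, in Python with scikit-learn on a synthetic, class-imbalanced task; the data and model are assumptions for the example. Discrimination is summarised by the AUC, while calibration is assessed by comparing predicted risks with observed event rates across risk bands.

```python
# Minimal sketch of assessing discrimination (AUC) and calibration for a
# prediction model; the synthetic data and model are illustrative assumptions.
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=4000, n_features=20, weights=[0.9], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
p = model.predict_proba(X_test)[:, 1]

# Discrimination: how well predictions separate events from non-events.
print(f"AUC: {roc_auc_score(y_test, p):.3f}")

# Calibration: how closely predicted risks match observed event rates, which
# matters most when choosing decision thresholds in low-risk populations.
print(f"Brier score: {brier_score_loss(y_test, p):.3f}")
obs, pred = calibration_curve(y_test, p, n_bins=5)
for o, q in zip(obs, pred):
    print(f"predicted risk {q:.2f} -> observed rate {o:.2f}")
```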

Reproducibility

Systematic reviews and meta-analyses in EBM use detailed evidence tables and metaregression techniques to identify and explain heterogeneity between studies of the same topic that arises from differences in study design, population, intervention or analytical methods. Similarly, the increasing number of ML models and applications applied to the same clinical scenarios calls for evaluation of their reproducibility, in both form and output. This requires syntheses of the findings of separate analyses of the same or similar models.23 Developers of ML models need to disclose their datasets, data structures, programming code and model outputs so that others can interrogate and understand why different models applied to similar datasets yield disparate results, usually because of differences in the type of model used or their training and testing procedures.24
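
The sketch below, in Python with scikit-learn, shows the kind of provenance record, covering model type, hyperparameters, random seeds, a data fingerprint and library versions, that makes such comparisons possible; the fields recorded are illustrative assumptions rather than any published reporting standard.

```python
# Minimal sketch of a provenance record for a trained model; the chosen
# fields are illustrative assumptions, not a formal reporting standard.
import hashlib
import json
import platform

import numpy as np
import sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

model = RandomForestClassifier(n_estimators=200, random_state=42)
auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()

record = {
    "model": type(model).__name__,
    "hyperparameters": model.get_params(),
    "random_state": 42,
    # A hash of the training data lets others confirm they are analysing
    # the same dataset (or an identically prepared one).
    "data_fingerprint": hashlib.sha256(np.ascontiguousarray(X).tobytes()).hexdigest()[:16],
    "sklearn_version": sklearn.__version__,
    "python_version": platform.python_version(),
    "cv_auc": round(float(auc), 3),
}
print(json.dumps(record, indent=2, default=str))
```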

Generalisability

EBM has long appreciated the trade-off between internal and external validity, whereby trial results may not generalise to situations dissimilar to those encompassed in RCT protocols. EBM has responded by promoting multisite, pragmatic trials with broader participant inclusion criteria and more flexible, discretionary study protocols responsive to different clinical contexts and to the different preferences of clinicians and patients for cointerventions. Similarly, ML models designed to perform narrowly defined tasks require validation within broader clinical contexts (different patient groups, wider spectrum of clinical care), external to the original laboratory conditions and training datasets. ML models configured too closely (or overfitted) to the training data will behave inconsistently when applied to datasets from different populations and settings. Model predictions may change according to differences in patient populations (eg, age, disease subtype) or data source (eg, different imaging instruments), or even with subtle changes in input data, such as slight, indiscernible changes in images. While many models use regularisation (techniques that penalise excessive model complexity and smooth out the influence of outliers and data irregularities) to correct for such differences, its effects on model performance are not always well understood.
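
The sketch below illustrates the point using Python with scikit-learn and two synthetic ‘sites’ whose case mix differs; all parameters are assumptions for the example. Internal performance can flatter an overfitted model, while external validation exposes the drop in performance and stronger regularisation tempers it.

```python
# Minimal sketch of overfitting, regularisation and external validation on
# two synthetic "sites"; all parameters are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def make_site(n, shift):
    # Same underlying relationship, but the covariate distribution shifts
    # between sites (different case mix).
    X = rng.normal(shift, 1, size=(n, 50))
    logit = X[:, 0] + 0.5 * X[:, 1]
    y = rng.binomial(1, 1 / (1 + np.exp(-logit)))
    return X, y

X_dev, y_dev = make_site(300, shift=0.0)    # small, high-dimensional development site
X_ext, y_ext = make_site(2000, shift=0.5)   # external site with different case mix

for C in (100.0, 1.0, 0.01):  # smaller C = stronger L2 regularisation
    model = LogisticRegression(C=C, max_iter=2000).fit(X_dev, y_dev)
    internal = roc_auc_score(y_dev, model.predict_proba(X_dev)[:, 1])
    external = roc_auc_score(y_ext, model.predict_proba(X_ext)[:, 1])
    print(f"C={C:>6}: internal AUC {internal:.2f}, external AUC {external:.2f}")
```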

Explainability

In EBM, hypotheses are stated, variables are often assumed to have definable frequencies and statistical distributions, analytical methods are detailed, and mechanistic explanations of cause and effect which lend plausibility to observed findings are offered (when known). In ML, data are probed for structure in accordance with a hypothesised model, and implicit assumptions are made about causal structure when models are being built, especially in the case of supervised learning. However, the ‘black box’ workings and outputs of deep learning models are often opaque and non-explanatory. In other models, correlations or associations, such as between inotropic drugs and deaths in critically ill patients, or between insertion of chest tubes and pneumothoraces, may be misinterpreted as causal relationships. Some ML methods such as deep learning aim to minimise predictive error rather than clarify the relationships between features or variables, or the assumptions arising from model predictions. Various software tools25 can now identify salient clinical features within these models, which may help clinicians better understand and accept an ML prediction or recommendation.
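
One such approach is permutation importance, which measures how much model performance degrades when each feature is shuffled in turn; it describes what the model relies on, not what is causally important for the patient. The sketch below illustrates the method in Python with scikit-learn on a synthetic task; the feature names are hypothetical.

```python
# Minimal sketch of permutation importance as a post hoc explanation for an
# otherwise opaque model; the synthetic task and feature names are
# illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=8, n_informative=3, random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# Shuffle one feature at a time and measure the drop in AUC; a large drop
# means the model leans heavily on that feature.
result = permutation_importance(
    model, X_test, y_test, n_repeats=10, random_state=0, scoring="roc_auc"
)
for name, imp in sorted(zip(feature_names, result.importances_mean), key=lambda t: -t[1]):
    print(f"{name}: {imp:.3f}")
```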

Clinical impact

The quantification of clinically important benefits and harms of interventions, compared with ‘usual care’, is an important EBM goal, as is estimation of the accuracy of new diagnostic tests compared with current reference standards, and of new prediction rules compared with existing rules or clinical intuition. Similar impact standards must apply to ML applications, and be more rigorous as the latter move from diagnostic aids involving imaging data,26 to more nuanced prognostic and therapeutic predictions27 where satisfactory performance is less guaranteed.28 To date, few deep learning applications have been tested in clinical trials in contrast to older ML methods, and efficient methods for assessing the safety and effectiveness of ML models as they iteratively adapt over time in response to new data are yet to be developed.29

EBM and ML must both continue to contend with three ongoing challenges related to clinical impact, although these do not reflect failings of EBM and ML in themselves. First, their reporting of ‘average patient’ estimates of benefit and harm, if accompanied by significance levels just above or below the conventional p value of 0.05, can be misinterpreted when applied to individual patients.30 Second, making a tool or model capable of more precise risk estimates is wasted effort if these estimates have no potential to meaningfully influence care decisions.31 Third, embedding new decision support technologies into routine practice is often difficult, requiring the identification and establishment of the structures, culture and incentives necessary to change the behaviour of both clinicians and patients.32

Integration into routine practice

EBM has sought to close evidence-practice gaps by making high-quality evidence accessible to, and usable by, busy clinicians at the time and place of decision-making. ML shares the same aim, and computerised decision support systems (CDSS) serve as the natural translation vehicle. However, in both instances, clinicians must be encouraged and able to use CDSS-integrated tools appropriately when developing and discussing care plans, and must know when to critically evaluate CDSS-generated recommendations.33 CDSS can deskill clinicians who become over-reliant on, and no longer able to evaluate, such recommendations. EHR-enabled CDSS that are poorly integrated into clinical workflows can cause cognitive overload.34 A key unanswered question is how best to leverage the strengths of EBM and ML in augmenting clinician gestalt and experience.

Conclusion

ML serves as a new technology for improving diagnosis, prognostication and determining responsiveness to therapies by using computer programs to discover previously indiscernible patterns and associations within large datasets, and to build these into predictive models. However, ML applications must be evaluated with the same epistemological rigour that lies at the heart of EBM. At the same time, the traditional methods of EBM will benefit from the addition of ML in furthering the science of medicine and informing individualised decision-making based on all the available evidence.35


Footnotes

  • Contributors IS conceived the idea, undertook the original research and drafted the manuscript. DC critically reviewed the manuscript, submitted additional concepts and assisted in writing the final version. EC critically reviewed the manuscript, submitted additional concepts and references, and assisted in writing the final version.

  • Funding The authors have not declared a specific grant for this research from any funding agency in the public, commercial or not-for-profit sectors.

  • Competing interests None declared.

  • Provenance and peer review Not commissioned; externally peer reviewed.