
Systematic reviews of diagnostic test evaluations: what’s behind the scenes?
  1. Madhukar Pai, MD,
  2. Michael McCulloch, LAc, MPH,
  3. Wayne Enanoria, MPH,
  4. John M Colford, Jr, MD, PhD
  1. Berkeley Systematic Reviews Group
 University of California, Berkeley
 Berkeley, California, USA



    As readers of Evidence-Based Medicine, you are aware that systematic reviews are considered the best source of evidence for evidence-based clinical practice. Systematic reviews synthesise data from existing primary research and bring some order and sanity to the otherwise stressful process of sorting out a plethora of studies and staying up to date. However, since not all reviews are created equal, it is important to be able to critically assess their quality. In this editorial, we take you behind the scenes of a systematic review, using diagnostic test accuracy as an illustration. A clear understanding of the process will, hopefully, guide what you look for in a review. Furthermore, if you can’t find an existing diagnostic review and decide to do one yourself, we provide you with a “road map” (figure) for navigation.

    Se  =  sensitivity; Sp  =  specificity; LR  =  likelihood ratios; DOR  =  diagnostic odds ratios; ROC  =  receiver operating characteristic; SROC  =  summary receiver operating characteristic; TP  =  true positives; FP  =  false positives; TN  =  true negatives; FN  =  false negatives; TPR  =  true positive rate; FPR  =  false positive rate. Superscripts indicate reference numbers.


    Systematic reviews are done on a range of clinical questions, such as therapy, diagnosis, prognosis, aetiology, harm, and disease prevalence. All systematic reviews follow the same critical steps:

    1. Formulation of the review question

    2. A comprehensive, systematic search and selection of primary studies

    3. Critical appraisal of included studies for quality and data extraction

    4. Synthesis and summary of study results

    5. Interpretation of the results

    These steps resemble those of the evidence-based medicine (EBM) process, but are more thorough. In the EBM process, our objective is to quickly hunt down a valid source of evidence (eg, a high quality systematic review) on a focused clinical question and get to the bottom line (clinically meaningful results) within minutes. In contrast, the systematic review involves doing a comprehensive search for all published and unpublished primary studies on a focused question, critical appraisal of the relevant studies, and synthesis of these studies to generate evidence for clinical practice. This process typically takes months, not minutes.

    The core steps of the systematic review process (shaded boxes in the figure) can be broken down further into more discrete steps. Drawing on our experience in conducting reviews and developing training material, we have provided some helpful tricks and tips for surviving the process, along with references to important articles and resources for each of the major steps.


    Although not as common as systematic reviews on therapeutic questions (ie, of randomised controlled trials [RCTs]), diagnostic reviews are increasingly published in the medical literature. The main objective of a diagnostic review is to summarise the evidence on the accuracy of a test or instrument (in this case accuracy refers to such measures as sensitivity, specificity, and likelihood ratios). The other objectives are to critically evaluate the quality of primary studies, check for heterogeneity (variability) in results across studies, and determine sources of heterogeneity, where necessary.


    The first step is to formulate a clear, focused review question. It is important to specify the patient population (or the disease of interest) and setting, the index test (or tests) being evaluated, the reference standard (comparison), and the outcomes (eg, sensitivity and specificity). For example, consider a review on ultrasonography for suspected deep venous thrombosis.

    A focused question would be:

    Is ultrasonography (test) a sensitive and specific (outcomes) test compared with venography (reference standard) in the diagnosis of suspected deep venous thrombosis in adults (patients)?

    A focused question will help in searching databases and also with formulating explicit eligibility criteria for selecting studies.


    The second step is to conduct an exhaustive search for primary studies. The search might include general databases (eg, Medline and EMBASE/Excerpta Medica) and subject specific databases (eg, MEDION, a database of diagnostic literature); scanning the bibliographies of included studies; contacting authors and experts to locate ongoing and unpublished studies; and contacting test manufacturers. It is important to extend the search beyond Medline to cover other databases as well. Once all sources have been searched, the accumulated citations are screened independently by 2 reviewers, who select the studies to be included in the review. This process reduces missed studies and bias in study selection.


    The third step is to critically appraise the included studies. Quality assessment, again, is ideally done by 2 reviewers independently. Several quality criteria need to be considered when evaluating diagnostic studies. These include the clinical spectrum of included patients; blinded interpretation of test and reference standard results; potential for verification bias; consecutive patient sampling; prospective design; and adequate description of the index test, reference standard, and study population. Often, several of these features are not reported in the primary studies, and reviewers may need to contact the study authors for additional information. Reviewers might choose to exclude low quality studies from the review at this stage. An alternative approach is to stratify studies by quality at the time of analysis and examine the effect of study quality on test accuracy.

    Data extraction is done in parallel with quality assessment. The outcomes reported in diagnostic reviews are the measures of accuracy: sensitivity, specificity, likelihood ratios, diagnostic odds ratios, and receiver operating characteristic (ROC) curve data. Where possible, reviewers should extract raw data to fill the 4 cell values of a diagnostic 2 × 2 table: true positives, false positives, true negatives, and false negatives.
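    To make these measures concrete, here is a minimal sketch (ours, not the authors') that derives each accuracy measure from the four cells of a diagnostic 2 × 2 table, using hypothetical counts:

```python
def accuracy_measures(tp, fp, fn, tn):
    """Derive the standard accuracy measures from raw 2x2 counts:
    sensitivity, specificity, likelihood ratios, and the diagnostic
    odds ratio."""
    sensitivity = tp / (tp + fn)               # true positive rate
    specificity = tn / (tn + fp)               # true negative rate
    lr_pos = sensitivity / (1 - specificity)   # positive likelihood ratio
    lr_neg = (1 - sensitivity) / specificity   # negative likelihood ratio
    dor = lr_pos / lr_neg                      # diagnostic odds ratio
    return sensitivity, specificity, lr_pos, lr_neg, dor

# Hypothetical study: 90 TP, 10 FP, 10 FN, 90 TN
se, sp, lrp, lrn, dor = accuracy_measures(90, 10, 10, 90)
print(se, sp)   # → 0.9 0.9
```

    Extracting the raw counts, rather than the reported summary percentages, is what makes calculations such as this (and the pooled analyses that follow) possible.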


    Analysis begins with simple tabulation of study characteristics and results. Forest plots of accuracy measures (eg, sensitivity and specificity) show the estimate from each study with its confidence interval and provide a useful visual summary of the data. Although, as with intervention studies, all measures of accuracy can be statistically pooled using random or fixed effects methods, this may not always be appropriate. Each study in the meta-analysis contributes a pair of numbers: the true positive rate (sensitivity) and the false positive rate (1 − specificity). Because these measures are correlated and vary with the thresholds (cutpoints for determining test positives) used, it is important to analyse them as pairs and also to explore the effect of threshold on study results. Simple pooling of accuracy measures does not address these important issues.

    A more meaningful approach is to summarise the joint distribution of sensitivity and specificity using the summary ROC curve. Unlike a traditional ROC plot, which explores the effect of varying thresholds on sensitivity and specificity in a single study, each data point in the summary ROC space represents a separate study. The summary ROC curve is obtained by fitting a regression curve to the pairs of sensitivity and specificity. The summary ROC curve and the area under it present a global summary of test performance and show the trade-off between sensitivity and specificity. A symmetric, shoulder-like ROC curve suggests that variability in the thresholds used could, in part, explain the variability in study results.
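    As an illustration (not part of the editorial itself), the regression fit commonly used for summary ROC curves is the Moses–Littenberg approach, which regresses the log diagnostic odds ratio on a proxy for the test threshold. A minimal sketch, assuming one set of raw 2 × 2 counts per study:

```python
import math

def sroc_fit(studies):
    """Fit the Moses-Littenberg summary ROC line D = a + b*S, where
    D = logit(TPR) - logit(FPR) is the log diagnostic odds ratio and
    S = logit(TPR) + logit(FPR) is a proxy for the threshold.
    studies: list of (tp, fp, fn, tn) tuples, one per study.
    A 0.5 continuity correction guards against zero cells."""
    logit = lambda p: math.log(p / (1 - p))
    d_vals, s_vals = [], []
    for tp, fp, fn, tn in studies:
        tpr = (tp + 0.5) / (tp + fn + 1)
        fpr = (fp + 0.5) / (fp + tn + 1)
        d_vals.append(logit(tpr) - logit(fpr))
        s_vals.append(logit(tpr) + logit(fpr))
    n = len(studies)
    s_bar, d_bar = sum(s_vals) / n, sum(d_vals) / n
    # Ordinary least squares slope and intercept
    b = (sum((s - s_bar) * (d - d_bar) for s, d in zip(s_vals, d_vals))
         / sum((s - s_bar) ** 2 for s in s_vals))
    a = d_bar - b * s_bar
    return a, b   # b near 0 implies a symmetric SROC curve

# Three hypothetical studies with similar accuracy at different thresholds
a, b = sroc_fit([(90, 10, 10, 90), (80, 5, 20, 95), (95, 20, 5, 80)])
```

    The fitted line is back-transformed to draw the curve in ROC space; a slope near zero yields the symmetric, shoulder-like shape described above.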

    Heterogeneity in meta-analysis refers to a high degree of variability in study results, a fairly common finding in diagnostic meta-analyses. For example, one might find reviews with sensitivity estimates ranging from 0% to 100%. Such heterogeneity could be due to variability in thresholds, disease spectrum, test methods, and study quality. In the presence of significant heterogeneity, a pooled summary estimate is not meaningful, and reviewers should instead focus on finding the sources of heterogeneity. This can be accomplished by examining the details of the studies (eg, selection of patients or test procedure), examining subgroups to look for homogeneous populations, and using meta-regression to statistically assess the differences in study design that might explain the variation in findings. Graphical methods can also be used to identify sources of heterogeneity.
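    One conventional statistical check for heterogeneity (our illustration; the editorial does not prescribe a specific test) is Cochran's Q, sketched here for hypothetical study-level estimates such as log diagnostic odds ratios with their variances:

```python
def cochran_q(estimates, variances):
    """Cochran's Q statistic for heterogeneity across study estimates
    (eg, log diagnostic odds ratios with their within-study variances).
    Returns Q and the inverse-variance-weighted pooled estimate."""
    weights = [1 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, estimates))
    return q, pooled

# Two hypothetical log DORs of 0 and 2, each with variance 1
q, pooled = cochran_q([0.0, 2.0], [1.0, 1.0])
```

    Q is compared against a chi-squared distribution with k − 1 degrees of freedom (k studies); a large Q signals variability beyond chance and a cue to investigate subgroups or fit a meta-regression rather than report the pooled estimate.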


    The final steps in the systematic review process are interpretation of the results, including discussion of issues such as applicability, and writing of the report for publication. Reviewers also need to discuss the limitations of the primary studies reviewed and the limitations of the review itself. The review usually concludes with a discussion of the implications for clinical practice and the need for further research on the clinical question.


    Just as systematic reviews of high quality clinical trials are considered to be at the top of the hierarchy of evidence for treatment, properly conducted systematic reviews of valid diagnostic studies are at the top of the hierarchy of diagnostic evidence. A clear understanding of how systematic reviews are done will help clinicians appreciate the strengths and limitations of the reviews they read.

