Grading evidence from test accuracy studies: what makes it challenging compared with the grading of effectiveness studies?
Ewelina Rogozińska1,2, Khalid Khan1,2

1 Women's Health Research Unit, Blizard Institute, Barts and the London School of Medicine and Dentistry, Queen Mary University of London, London, UK
2 Department of Multidisciplinary Evidence Synthesis Hub (mEsh), Barts and the London School of Medicine, Queen Mary University London, London, UK

Correspondence to Ewelina Rogozińska, Women's Health Research Unit, Blizard Institute, Barts and the London School of Medicine and Dentistry, Queen Mary University of London, London, UK; e.a.rogozinska{at}


Grading of Recommendations Assessment, Development and Evaluation (GRADE) is increasingly used to synthesise evidence for practice and policy development.1 2 The GRADE domains, that is, type of evidence and its consistency, directness, precision and risk of bias, etc,3 4 are frequently and readily applied to therapeutic effectiveness research.5 6 However, clinical practice requires direction about the accuracy of tests to make a diagnosis before contemplating decisions about treatment.7 For assessment of evidence concerning the former, guidance on the use of GRADE principles still requires more attention.4 8 The aim of this paper is to raise awareness of the challenges associated with grading the strength of test accuracy evidence and to contrast them with the issues relevant to the evaluation of effectiveness research. As an example, we use the grading of the quality of test accuracy evidence employed in the WHO guideline on antenatal care for a positive pregnancy experience.9

The basics: accuracy versus effectiveness research

Typically in test accuracy research, the question format is as follows: clearly defined participants, an object of the evaluation (an index test) and a comparator (a reference standard test to verify the presence or absence of the outcome or condition of interest) (table 1). The 2×2 contingency table created this way can be used to calculate test accuracy measures such as sensitivity and specificity.10 Accuracy research informs us about how well tests can detect a given condition. In conjunction with effectiveness research, it can be used to inform an antenatal management algorithm to rationalise the use of tests and treatments. If the effectiveness of interventions is unclear or unknown, assessment of test accuracy has limited utility. Equally, if accurate tests do not exist, it is difficult to know whom to treat. Whereas the definitive study design for effectiveness research is a controlled trial with randomisation,11 study designs for the evaluation of test accuracy do not require this approach. The most valid accuracy results are obtained from cross-sectional studies that concurrently apply index and reference tests and avoid features that can introduce bias.12
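The calculation of sensitivity and specificity from a 2×2 table can be sketched as follows. This is a minimal illustration only: the class name `TwoByTwo` and the counts are hypothetical and are not taken from any study included in the guideline.

```python
from dataclasses import dataclass


@dataclass
class TwoByTwo:
    """Cross-classification of index test results against the reference standard."""
    tp: int  # index test positive, reference standard positive
    fp: int  # index test positive, reference standard negative
    fn: int  # index test negative, reference standard positive
    tn: int  # index test negative, reference standard negative

    def sensitivity(self) -> float:
        # Proportion of participants with the condition whom the index test detects
        return self.tp / (self.tp + self.fn)

    def specificity(self) -> float:
        # Proportion of participants without the condition whom the index test clears
        return self.tn / (self.tn + self.fp)


# Hypothetical counts chosen for illustration
table = TwoByTwo(tp=45, fp=20, fn=55, tn=880)
print(f"sensitivity = {table.sensitivity():.2f}")
print(f"specificity = {table.specificity():.2f}")
```

Note that each measure uses only one column of the table: sensitivity draws on participants with the condition, specificity on those without it.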

Table 1

Differences between grading of strength of accuracy and effectiveness evidence

In the remainder of the text, to illustrate the application of the GRADE approach to accuracy research, we use an example (table 2) derived from the assessment prepared to inform the WHO antenatal guideline.9 The guideline was prepared in line with the WHO internal standards and guided by standard operating procedures that both authors took part in developing (details available on request). Undetected asymptomatic bacteriuria, if left untreated in pregnancy, might lead to serious complications,13 and the quality of accuracy evidence for urine dipstick (nitrites marker only) in detecting the infection was one of the evaluations prepared for the guideline (figure 1). Details of the full evaluation of the accuracy of on-site tests to detect asymptomatic bacteriuria are available elsewhere.14 The robustness of all GRADE features (table 1) was considered for its potential to weaken the overall strength of evidence through downgrading of individual aspects.

Figure 1

Graphic display of evidence quality of urine dipstick (nitrites marker only) accuracy to detect asymptomatic bacteriuria in pregnancy (top graphs). The graphs represent the quality of features shown on the shape corners. For each of the corners, the distance from the centre represents the level of evidence strength with the lowest close to the shape’s centre (bottom left example) and highest at its maximum (bottom right graph).

Table 2

GRADE assessment of evidence quality of urine dipstick (nitrites) accuracy (index test) to detect asymptomatic bacteriuria (reference standard: urine culture) in pregnancy8

Risk of bias

The Quality Assessment of Diagnostic Accuracy Studies (QUADAS-2) tool is used to assess the risk of bias in accuracy evidence. The tool comprises domains that can each be assessed as at low, unclear or high risk of bias: participant selection, implementation of the index test, the reference standard, and study flow and timing.15 The approach is based on the same concept as the tool used to assess effectiveness research,16 with domains relevant to the study designs used in accuracy research.

The accuracy evidence for urine dipstick was downgraded from ‘not serious’ to ‘serious’ (table 2), as more than half of the pooled studies were classified as at moderate or high risk of bias (see online supplementary appendix 1). Before grading, the studies were classified as at low, moderate or high risk of bias based on the respective scoring of the domains (see online supplementary appendix 2).
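The downgrading rule described above can be expressed as a short function. This is a sketch under stated assumptions: only the "more than half" threshold comes from the text, while the function name, the rating labels and the two-level outcome are illustrative. The full rule in supplementary appendix 1 may contain further levels.

```python
def grade_risk_of_bias(study_ratings):
    """Return 'serious' when more than half of the studies are at moderate or high risk.

    Assumption: a simplified two-level version of the rule; the complete grading
    rule is given in the supplementary appendix, not reproduced here.
    """
    flagged = sum(1 for rating in study_ratings if rating in ("moderate", "high"))
    return "serious" if flagged > len(study_ratings) / 2 else "not serious"


# Hypothetical overall ratings for five pooled studies
ratings = ["low", "moderate", "high", "moderate", "low"]
print(grade_risk_of_bias(ratings))
```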


Indirectness

The QUADAS-2 tool comprises two parts: the first focuses on the methodological quality of the study design (discussed above), and the second addresses the applicability of the study to the research question. The applicability part consists of three domains that allow us to assess the indirectness of evidence with regard to population, index test and reference standard (see online supplementary appendix 1). For effectiveness research, the respective aspect is assessed based on how well the populations, interventions, comparators and reported outcomes match the research question. The QUADAS-2 tool, with its applicability part, allows the indirectness of accuracy studies to be assessed in a more structured and transparent way than is done for effectiveness research. We set a grading rule for the applicability of synthesised evidence (see online supplementary appendix 1), which led us to downgrade the evidence strength in our example, as around 50% of the studies used in the synthesis were assessed as raising ‘high’ or ‘unclear’ concern over their applicability (table 2).


Inconsistency

Between-study heterogeneity is anticipated more often in accuracy than in effectiveness research. Furthermore, inconsistency can occur in not one but two performance measures (table 1). Grading of accuracy evidence is therefore two-dimensional, with its strength assessed separately for sensitivity and specificity. The statistical test used to evaluate between-study heterogeneity in effectiveness research does not work well for accuracy measures. We chose to assess inconsistency through visual inspection of the overlap of the CIs around the performance measures of the pooled studies. The domain was graded according to the degree to which the CIs failed to overlap (see online supplementary appendix 1). The evidence for the sensitivity of urine dipstick (nitrites) was downgraded, with inconsistency judged to seriously decrease the quality of evidence, because of visible variability in the performance estimates between the studies14 (table 2).
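A numerical stand-in for the visual inspection of CI overlap might look like the sketch below. Several things here are assumptions rather than the method used for the guideline: the Wilson score interval is one of several possible CI methods, counting non-overlapping pairs is only a rough proxy for inspecting a forest plot, and the study counts are hypothetical.

```python
import math


def wilson_ci(events, total, z=1.96):
    """Wilson score interval for a proportion (approximately 95% when z = 1.96)."""
    p = events / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half


def non_overlapping_pairs(intervals):
    """Count pairs of study intervals that share no common values."""
    count = 0
    for i in range(len(intervals)):
        for j in range(i + 1, len(intervals)):
            lower = max(intervals[i][0], intervals[j][0])
            upper = min(intervals[i][1], intervals[j][1])
            if lower > upper:
                count += 1
    return count


# Hypothetical (true positives, participants with the condition) per study
studies = [(10, 40), (30, 40), (18, 36)]
sens_cis = [wilson_ci(tp, n) for tp, n in studies]
for (tp, n), (low, high) in zip(studies, sens_cis):
    print(f"sensitivity {tp / n:.2f} (95% CI {low:.2f} to {high:.2f})")
print("non-overlapping pairs:", non_overlapping_pairs(sens_cis))
```

The same check would be run a second time on the specificity intervals, since the two measures are graded separately.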


Imprecision

The wider the CI of pooled estimates, the poorer the precision and the weaker the strength of evidence. When grading the imprecision of performance measures, the same rule applies to both types of research, with a similar challenge when the condition (event) is rare. Sensitivity is estimated only from participants with the condition, who are few when prevalence is low, whereas specificity is estimated from the much larger group without it. As a result, the CI for pooled sensitivity tends to be wider than that for pooled specificity. The consequence is a differential assessment of the evidence strength for test sensitivity and specificity, as in our example (table 2).
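The effect of low prevalence on precision can be illustrated with a small calculation. This sketch assumes a simple normal-approximation (Wald) interval and treats the pooled data as a single sample, which a formal meta-analysis model would not do. All numbers are hypothetical.

```python
import math


def wald_ci_width(proportion, n, z=1.96):
    """Width of a normal-approximation 95% CI for a proportion."""
    return 2 * z * math.sqrt(proportion * (1 - proportion) / n)


# Hypothetical pooled data set: 1000 participants, 5% prevalence
participants = 1000
prevalence = 0.05
with_condition = round(participants * prevalence)    # denominator for sensitivity
without_condition = participants - with_condition    # denominator for specificity

sens_width = wald_ci_width(0.50, with_condition)
spec_width = wald_ci_width(0.95, without_condition)
print(f"sensitivity CI width: {sens_width:.3f}")
print(f"specificity CI width: {spec_width:.3f}")
```

With these illustrative inputs the interval around sensitivity is about ten times wider than the one around specificity, which is why the two measures can receive different imprecision grades.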

Publication bias

Funnel plot asymmetry tests are used to examine small-study effects and are treated as an indicator of the potential risk of publication bias.17

A funnel plot based on effective sample size, with the associated regression test of asymmetry, is currently recommended for detecting sample size-related bias when pooling accuracy studies.18 In contrast, statistical tests that use SEs of ORs, commonly used in effectiveness research, are likely to be misleading if applied to a meta-analysis of accuracy measures. However, the impact of small-study effects is not as clear in accuracy research, and the power of the currently available test is modest,19 which led us to leave this domain out (table 2).
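For readers unfamiliar with the effective sample size approach, the sketch below shows its core quantities as we understand them: the log diagnostic odds ratio of each study is regressed on the inverse square root of its effective sample size, with the effective sample size as the weight. The function names and study counts are hypothetical, the continuity correction is a common convention rather than part of the source, and no significance test is computed. This domain was not graded in the example discussed in this paper.

```python
import math


def effective_sample_size(n_with, n_without):
    """Effective sample size from the numbers with and without the condition."""
    return 4 * n_with * n_without / (n_with + n_without)


def log_dor(tp, fp, fn, tn):
    """Log diagnostic odds ratio, adding 0.5 to every cell if any cell is zero."""
    if 0 in (tp, fp, fn, tn):
        tp, fp, fn, tn = [cell + 0.5 for cell in (tp, fp, fn, tn)]
    return math.log((tp * tn) / (fp * fn))


def weighted_slope(x, y, w):
    """Slope of a weighted least-squares regression of y on x."""
    total_w = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / total_w
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / total_w
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    return sxy / sxx


# Hypothetical studies as (tp, fp, fn, tn)
studies = [(20, 5, 10, 100), (8, 2, 12, 40), (50, 30, 25, 400), (5, 1, 5, 20)]
ess = [effective_sample_size(tp + fn, fp + tn) for tp, fp, fn, tn in studies]
x = [1 / math.sqrt(e) for e in ess]
y = [log_dor(*study) for study in studies]
print(f"asymmetry slope: {weighted_slope(x, y, ess):.2f}")
```

A slope far from zero would suggest that smaller studies report systematically different accuracy, but with few studies such an estimate is unstable, in line with the modest power noted above.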


Conclusions

Accuracy research, as an important element of any clinical management algorithm, requires a thorough and unequivocal assessment of its quality for evidence syntheses. While the assessment of domains such as risk of bias, indirectness or imprecision of accuracy measures should not pose any greater challenge than in the case of effectiveness research, more insight is needed into the impact of heterogeneity and publication bias on the synthesis of accuracy evidence to facilitate this task.

Without a doubt, members of the GRADE Working Group are aware of the above-mentioned issues, and in due course we will surely see more guidance on the application of GRADE to accuracy evidence, with our work contributing to its use. Hopefully, future guidance will also cover the application of GRADE to evidence derived from a single study and the use of likelihood ratios, parameters describing test performance that are generally better understood by clinicians.20


Acknowledgments

The authors would like to acknowledge the assistance of the following advisors: A Metin Gülmezoglu and Özge Tunçalp from the WHO Department of Reproductive Health and Research, and Professor Javier Zamora from the Clinical Biostatistics Unit, Hospital Ramon y Cajal (IRYCIS) and CIBER Epidemiology and Public Health, Madrid.



  • Contributors KK and ER prepared the diagnostic GRADE assessment for the evidence on the accuracy of the onsite test to detect asymptomatic bacteriuria in pregnancy in GRADEpro GDT (web). ER wrote the initial draft of the manuscript and all subsequent drafts after critical review by KK. KK is guarantor for the manuscript.

  • Competing interests None declared.

  • Provenance and peer review Not commissioned; externally peer reviewed.