Grading evidence from test accuracy studies: what makes it challenging compared with the grading of effectiveness studies? | BMJ Evidence-Based Medicine

Perspective

Ewelina Rogozińska¹,², Khalid Khan¹,²

¹ Women's Health Research Unit, Blizard Institute, Barts and the London School of Medicine and Dentistry, Queen Mary University of London, London, UK
² Department of Multidisciplinary Evidence Synthesis Hub (mEsh), Barts and the London School of Medicine, Queen Mary University of London, London, UK

Correspondence to Ewelina Rogozińska, Women's Health Research Unit, Blizard Institute, Barts and the London School of Medicine and Dentistry, Queen Mary University of London, London, UK; e.a.rogozinska{at}qmul.ac.uk

https://doi.org/10.1136/ebmed-2017-110717

Introduction

Grading of Recommendations Assessment, Development and Evaluation (GRADE) is increasingly used to synthesise evidence for practice and policy development.1 2 The GRADE domains, that is, type of evidence and its consistency, directness, precision and risk of bias, etc,3 4 are frequently and readily applied to therapeutic effectiveness research.5 6 However, clinical practice requires direction about the accuracy of tests used to make a diagnosis before decisions about treatment are contemplated.7 For assessment of evidence concerning test accuracy, guidance on the use of GRADE principles still requires more attention.4 8 The aim of this paper is to raise awareness of the challenges of grading the strength of test accuracy evidence and to contrast them with the issues relevant to the evaluation of effectiveness research. As an example, we use the grading of the quality of test accuracy evidence undertaken for the WHO guideline on antenatal care for a positive pregnancy experience.9

The basics: accuracy versus effectiveness research

Typically in test accuracy research, the question format is as follows: clearly defined participants, an object of the evaluation (an index test) and a comparator (a reference standard test to verify the presence or absence of the outcome or condition of interest) (table 1). The 2×2 contingency table created this way can be used to calculate test accuracy measures such as sensitivity and specificity.10 Accuracy research informs us about how well tests can detect a given condition. In conjunction with effectiveness research, it can be used to inform an antenatal management algorithm to rationalise the use of tests and treatments. If the effectiveness of interventions is unclear or unknown, assessment of test accuracy has limited utility. Equally, if accurate tests do not exist, it is difficult to know whom to treat. Whereas the definitive study design for effectiveness research is a controlled trial with randomisation,11 study designs for evaluation of test accuracy do not require this approach. The most valid accuracy results are obtained from cross-sectional studies that concurrently apply index and reference tests and avoid features that can introduce bias.12
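To make the 2×2 table concrete, the following minimal sketch computes sensitivity and specificity from the four cells. The class name `TwoByTwo` and the counts are hypothetical and are not taken from any study cited in this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoByTwo:
    """Index test results cross-classified against the reference standard."""
    tp: int  # index test positive, reference standard positive
    fp: int  # index test positive, reference standard negative
    fn: int  # index test negative, reference standard positive
    tn: int  # index test negative, reference standard negative

    def sensitivity(self) -> float:
        # Proportion of participants with the condition whom the test detects.
        return self.tp / (self.tp + self.fn)

    def specificity(self) -> float:
        # Proportion of participants without the condition whom the test clears.
        return self.tn / (self.tn + self.fp)


# Hypothetical counts for illustration only.
table = TwoByTwo(tp=45, fp=20, fn=55, tn=880)
print(f"sensitivity = {table.sensitivity():.2f}")
print(f"specificity = {table.specificity():.2f}")
```

Note that each measure uses only one column of the table: sensitivity draws on participants with the condition and specificity on those without it.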

Table 1

Differences between grading of strength of accuracy and effectiveness evidence

In the sections that follow, to illustrate the application of the GRADE approach to accuracy research, we use an example (table 2) derived from the assessment prepared to inform the WHO antenatal guideline.9 The guideline was prepared in line with the WHO internal standards and guided by standard operating procedures that both authors took part in developing (details available on request). Undetected asymptomatic bacteriuria, if left untreated in pregnancy, might lead to serious complications,13 and the quality of accuracy evidence for urine dipstick (nitrites marker only) in detecting the infection was one of the evaluations prepared for the guideline (figure 1). Details of the full evaluation of the accuracy of on-site tests to detect asymptomatic bacteriuria are available elsewhere.14 The robustness of each GRADE feature (table 1) was considered for its potential to weaken the overall strength of evidence through downgrading of individual aspects.

Figure 1

Graphic display of evidence quality of urine dipstick (nitrites marker only) accuracy to detect asymptomatic bacteriuria in pregnancy (top graphs). The graphs represent the quality of features shown on the shape corners. For each of the corners, the distance from the centre represents the level of evidence strength with the lowest close to the shape’s centre (bottom left example) and highest at its maximum (bottom right graph).

Table 2

GRADE assessment of evidence quality of urine dipstick (nitrites) accuracy (index test) to detect asymptomatic bacteriuria (reference standard: urine culture) in pregnancy8

Risk of bias

The Quality Assessment of Diagnostic Accuracy Studies (QUADAS-2) tool is used to assess the risk of bias in accuracy evidence. The tool comprises four domains, each of which can be rated as being at low, unclear or high risk of bias: participant selection, implementation of the index test, the reference standard, and study flow and timing.15 The approach is based on the same concept as the tool used to assess effectiveness research,16 with domains relevant to the study designs used in accuracy research.

The accuracy evidence for urine dipstick was downgraded from ‘not serious’ to ‘serious’ (table 2), as more than half of the pooled studies were classified as being at moderate or high risk of bias (see online supplementary appendix 1). Before grading, the studies were classified as being at low, moderate or high risk of bias based on the respective scoring of the domains (see online supplementary appendix 2).
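The downgrading decision described above can be sketched as a simple majority rule. This is an illustration only: the exact rule is set out in the supplementary appendices, which are not reproduced here, so the function name, the rating labels and the 0.5 threshold are assumptions.

```python
from collections import Counter


def downgrade_for_risk_of_bias(ratings, threshold=0.5):
    """Return 'serious' when the share of studies rated moderate or high
    risk of bias exceeds `threshold` (assumed here to be one half)."""
    if not ratings:
        raise ValueError("at least one study rating is required")
    counts = Counter(ratings)
    flagged = counts["moderate"] + counts["high"]
    return "serious" if flagged / len(ratings) > threshold else "not serious"


# Hypothetical study-level ratings, not the ratings from the WHO assessment.
ratings = ["low", "moderate", "high", "moderate", "low"]
print(downgrade_for_risk_of_bias(ratings))
```

Writing the rule down in this form before looking at the pooled estimates keeps the judgement transparent and reproducible.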

Supplementary Material

Supplementary Appendix 1 [Appendix_1.docx]

Supplementary Appendix 2 [Appendix_2.docx]

Indirectness

The QUADAS-2 tool comprises two parts: the first focuses on the methodological quality of the study design (discussed above), and the second addresses the applicability of the study to the research question. The applicability part comprises three domains that allow us to assess the indirectness of evidence with regard to participant selection, index test and reference standard (see online supplementary appendix 1). For effectiveness research, the respective aspect is assessed based on how well the populations, interventions, comparators and reported outcomes match the research question. The QUADAS-2 tool, with its applicability part, allows the indirectness of accuracy studies to be assessed in a more structured and transparent way than is done for effectiveness research. We set a grading rule for the applicability of synthesised evidence (see online supplementary appendix 1) that led us to downgrade the evidence strength in our example, as around 50% of the studies used in the synthesis were assessed as raising ‘high’ or ‘unclear’ concern over their applicability (table 2).

Inconsistency

Between-study heterogeneity is anticipated more often in accuracy than in effectiveness research. Furthermore, inconsistency can occur in not one but two performance measures (table 1). Grading of accuracy evidence is therefore two dimensional, with its strength assessed separately for sensitivity and specificity. The statistical test used to evaluate between-study heterogeneity in effectiveness research does not work well for accuracy measures. We chose to assess inconsistency through visual inspection of the overlap of CIs around the performance measures of the pooled studies. The domain was graded depending on the degree of lack of overlap between CIs (see online supplementary appendix 1). The evidence for the sensitivity of urine dipstick (nitrites) was downgraded, with inconsistency judged as seriously decreasing the quality of evidence, because of visible variability in the performance estimates between the studies14 (table 2).
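The assessment described here was made by eye on forest plots. As a rough numerical stand-in, the sketch below computes a Wilson score interval for each study's sensitivity and reports the share of study pairs whose intervals do not overlap. The choice of the Wilson interval, the function names and the counts are assumptions made for illustration and do not reproduce the rule in the supplementary appendix.

```python
import math
from itertools import combinations


def wilson_interval(events, total, z=1.96):
    """Wilson score interval for a proportion (95% by default)."""
    p = events / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half


def share_of_non_overlapping_pairs(intervals):
    """Fraction of study pairs whose confidence intervals are disjoint."""
    pairs = list(combinations(intervals, 2))
    disjoint = sum(
        1 for (lo1, hi1), (lo2, hi2) in pairs if hi1 < lo2 or hi2 < lo1
    )
    return disjoint / len(pairs)


# Hypothetical (true positives, participants with the condition) per study.
sensitivity_counts = [(10, 50), (40, 50), (25, 50)]
intervals = [wilson_interval(tp, n) for tp, n in sensitivity_counts]
share = share_of_non_overlapping_pairs(intervals)
print(f"share of non-overlapping pairs = {share:.2f}")
```

The same check would be run separately on specificity, since the domain is graded for each measure in its own right.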

Imprecision

The wider the CI of pooled estimates, the poorer the precision and the weaker the strength of evidence. When grading the imprecision of performance measures, the same rule applies to both types of research, with a similar challenge when the condition (event) is rare. If the prevalence of the condition is low, few participants have the condition, and the CI around the pooled sensitivity, which is estimated only from those participants, is wide. Because accuracy is described by two measures, the CI for pooled sensitivity therefore tends to be wider than that for pooled specificity. The consequence is a differential assessment of the evidence strength for test sensitivity and specificity, as in our example (table 2).
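The effect of low prevalence on precision can be seen with a simple calculation. The sketch below uses a normal-approximation (Wald) interval for a single proportion, which is an assumption made for simplicity and is not the model used to pool studies. The prevalence, sample size and proportion are hypothetical.

```python
import math


def wald_ci_width(proportion, n, z=1.96):
    """Full width of a normal-approximation 95% CI for a proportion."""
    return 2 * z * math.sqrt(proportion * (1 - proportion) / n)


# Hypothetical cohort: 1000 participants, 5% prevalence.
n_total = 1000
n_with_condition = 50                        # denominator for sensitivity
n_without = n_total - n_with_condition       # denominator for specificity

# The same proportion is used for both measures to isolate the effect of group size.
sens_width = wald_ci_width(0.8, n_with_condition)
spec_width = wald_ci_width(0.8, n_without)
print(f"sensitivity CI width = {sens_width:.3f}")
print(f"specificity CI width = {spec_width:.3f}")
```

With identical proportions, the ratio of the two widths depends only on the group sizes, here the square root of 950/50, so the sensitivity interval is more than four times as wide.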

Publication bias

Funnel plot asymmetry tests are used to examine the impact of small-study effects and are treated as an indicator of the potential risk of publication bias.17

A funnel plot based on effective sample size, with the associated regression test of asymmetry, is currently recommended for detecting sample size-related bias when pooling accuracy studies.18 In contrast, the statistical tests that use SEs of ORs, commonly used in effectiveness research, are likely to be misleading if applied to a meta-analysis of accuracy measures. However, the impact of small-study effects is not as clear in accuracy research, and the power of the currently available test is modest,19 which led us to leave this domain out of the assessment (table 2).
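For readers unfamiliar with the effective sample size approach, the following sketch shows one way it is commonly formulated: the log diagnostic OR of each study is regressed on 1/√ESS, weighted by ESS, where ESS = 4·n₁·n₂/(n₁+n₂) and n₁, n₂ are the numbers with and without the condition. The function name, the continuity correction and the study counts are assumptions for illustration; this domain was not assessed in our example.

```python
import math


def deeks_asymmetry(studies):
    """Weighted regression of ln(DOR) on 1/sqrt(ESS), weights = ESS.

    studies: iterable of (tp, fp, fn, tn) counts, one tuple per study.
    Returns (slope, standard_error) for the 1/sqrt(ESS) coefficient.
    """
    xs, ys, ws = [], [], []
    for tp, fp, fn, tn in studies:
        n_with, n_without = tp + fn, fp + tn
        ess = 4 * n_with * n_without / (n_with + n_without)
        if 0 in (tp, fp, fn, tn):
            # Assumption: add 0.5 to every cell when any cell is empty.
            tp, fp, fn, tn = (c + 0.5 for c in (tp, fp, fn, tn))
        xs.append(1 / math.sqrt(ess))
        ys.append(math.log((tp * tn) / (fp * fn)))
        ws.append(ess)

    if len(xs) < 3:
        raise ValueError("at least three studies are required")
    sw = sum(ws)
    x_bar = sum(w * x for w, x in zip(ws, xs)) / sw
    y_bar = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - x_bar) ** 2 for w, x in zip(ws, xs))
    if sxx == 0:
        raise ValueError("studies must differ in effective sample size")
    sxy = sum(w * (x - x_bar) * (y - y_bar) for w, x, y in zip(ws, xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    rss = sum(w * (y - intercept - slope * x) ** 2 for w, x, y in zip(ws, xs, ys))
    se = math.sqrt(rss / (len(xs) - 2) / sxx)
    return slope, se


# Hypothetical (tp, fp, fn, tn) counts for five studies.
studies = [(12, 8, 10, 170), (30, 25, 20, 425), (8, 3, 12, 77),
           (50, 40, 45, 865), (15, 20, 6, 159)]
slope, se = deeks_asymmetry(studies)
print(f"slope = {slope:.2f}, t = {slope / se:.2f} on {len(studies) - 2} df")
```

The t statistic would be compared against a t distribution with the stated degrees of freedom. With only a handful of studies the test has little power, which is consistent with the caution expressed above.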

Conclusion

Accuracy research, as an important element of any clinical management algorithm, requires a thorough and unequivocal assessment of its quality for evidence syntheses. While assessment of domains such as risk of bias, indirectness or imprecision of accuracy measures in the evidence synthesis should not pose any greater challenges than in the case of effectiveness research, more insight is needed into the impact of heterogeneity and publication bias on the synthesis of accuracy evidence to facilitate this task.

Without a doubt, members of the GRADE Working Group are aware of the above-mentioned issues, and in due course we will surely see more guidance on the application of GRADE to accuracy evidence, with our work contributing to its use. Hopefully, future guidance will also cover the application of GRADE to evidence derived from a single study and the use of likelihood ratios, a parameter describing test performance that is generally better understood by clinicians.20
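As a brief illustration of why likelihood ratios are convenient at the bedside, the sketch below derives them from sensitivity and specificity and uses them to update a pre-test probability. The function names and the input values are hypothetical.

```python
def likelihood_ratios(sensitivity, specificity):
    """Return (LR+, LR-) for a test with the given sensitivity and specificity."""
    lr_positive = sensitivity / (1 - specificity)
    lr_negative = (1 - sensitivity) / specificity
    return lr_positive, lr_negative


def post_test_probability(pre_test_probability, likelihood_ratio):
    """Convert probability to odds, apply the likelihood ratio, convert back."""
    pre_odds = pre_test_probability / (1 - pre_test_probability)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1 + post_odds)


# Hypothetical test characteristics and pre-test probability.
lr_pos, lr_neg = likelihood_ratios(sensitivity=0.5, specificity=0.9)
after_positive = post_test_probability(0.10, lr_pos)
print(f"LR+ = {lr_pos:.1f}, LR- = {lr_neg:.2f}")
print(f"probability after a positive result = {after_positive:.2f}")
```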

Acknowledgments

The authors would like to acknowledge the assistance of the following advisors from the WHO Department of Reproductive Health and Research: A Metin Gülmezoglu, Özge Tunçalp and Professor Javier Zamora from Clinical Biostatistics Unit, Hospital Ramon y Cajal (IRYCIS) and CIBER Epidemiology and Public Health, Madrid.

References

1. Developing NICE Guidelines: The Manual. NICE process and methods guides. London: BMJ Publishing Group, 2015.
2. World Health Organization. Evidence retrieval and synthesis. WHO handbook for guideline development. 2nd ed, 2014:93–108.
3. Guyatt GH, Oxman AD, Kunz R, et al. Going from evidence to recommendations. BMJ 2008;336:1049–51. doi:10.1136/bmj.39493.646875.AE
4. Schünemann HJ, Schünemann AH, Oxman AD, et al. Grading quality of evidence and strength of recommendations for diagnostic tests and strategies. BMJ 2008;336:1106–10. doi:10.1136/bmj.39500.677199.AE
5. Khan KS BE, Roos C, Kowalska M, et al. For the EBM-CONNECT Collaboration. Making GRADE accessible: a proposal for graphic display of evidence quality assessments. Evidence-Based Medicine 2011;16.
6. Murad MH, Almasri J, Alsawas M, et al. Grading the quality of evidence in complex interventions: a guide for evidence-based practitioners. Evid Based Med 2017;22:20–2. doi:10.1136/ebmed-2016-110577
7. Alonso-Coello P, Schünemann HJ, Moberg J, et al. GRADE evidence to decision (EtD) frameworks: a systematic and transparent approach to making well informed healthcare choices. 1: introduction. BMJ 2016;353:i2016. doi:10.1136/bmj.i2016
8. Hsu J, Brożek JL, Terracciano L, et al. Application of GRADE: making evidence-based recommendations about diagnostic tests in clinical practice guidelines. Implement Sci 2011;6:62. doi:10.1186/1748-5908-6-62
9. World Health Organization. WHO recommendations on antenatal care for a positive pregnancy experience. Geneva: BMJ Publishing Group, 2016.
10. Leeflang MM, Deeks JJ, Gatsonis C, et al. Systematic reviews of diagnostic test accuracy. Ann Intern Med 2008;149:889–97. doi:10.7326/0003-4819-149-12-200812160-00008
11. Khan KS, Kunz R, Kleijnen J, et al. Five steps to conducting a systematic review. J R Soc Med 2003;96:118–21. doi:10.1258/jrsm.96.3.118
12. Rutjes AW, Reitsma JB, Di Nisio M, et al. Evidence of bias and variation in diagnostic accuracy studies. CMAJ 2006;174:469–76. doi:10.1503/cmaj.050090
13. Honest H, Forbes CA, Durée KH, et al. Screening to prevent spontaneous preterm birth: systematic reviews of accuracy and effectiveness literature with economic modelling. Health Technol Assess 2009;13:1–627. doi:10.3310/hta13430
14. Rogozińska E, Formina S, Zamora J, et al. Accuracy of onsite tests to detect asymptomatic bacteriuria in pregnancy: a systematic review and meta-analysis. Obstet Gynecol 2016;128:495–503. doi:10.1097/AOG.0000000000001597
15. Whiting PF, Rutjes AW, Westwood ME, et al. QUADAS-2: a revised tool for the quality assessment of diagnostic accuracy studies. Ann Intern Med 2011;155:529–36. doi:10.7326/0003-4819-155-8-201110180-00009
16. Higgins JPT GSe. Cochrane handbook for systematic reviews of interventions: the Cochrane Collaboration. 2011. cochrane-handbook.org.
17. Sterne JA, Sutton AJ, Ioannidis JP, et al. Recommendations for examining and interpreting funnel plot asymmetry in meta-analyses of randomised controlled trials. BMJ 2011;343:d4002. doi:10.1136/bmj.d4002
18. Deeks JJ, Macaskill P, Irwig L. The performance of tests of publication bias and other sample size effects in systematic reviews of diagnostic test accuracy was assessed. J Clin Epidemiol 2005;58:882–93. doi:10.1016/j.jclinepi.2005.01.016
19. Macaskill PGC, Deeks J, Harbord R, et al. Chapter 10 analysing and presenting results. Cochrane handbook for systematic reviews of diagnostic test accuracy, 2010.
20. Whiting PF, Davenport C, Jameson C, et al. How well do health professionals interpret diagnostic information? A systematic review. BMJ Open 2015;5:e008155. doi:10.1136/bmjopen-2015-008155

Footnotes

Contributors KK and ER prepared the diagnostic GRADE assessment for the evidence on the accuracy of the onsite test to detect asymptomatic bacteriuria in pregnancy in GRADEpro GDT (web). ER wrote the initial draft of the manuscript and all subsequent drafts after critical review by KK. KK is guarantor for the manuscript.
Competing interests None declared.
Provenance and peer review Not commissioned; externally peer reviewed.