An emerging consensus on grading recommendations?
  1. Gordon Guyatt, MSc, MD [1]
  2. Gunn Vist, PhD [2]
  3. Yngve Falck-Ytter, MD [3]
  4. Regina Kunz, MD, MSc, PhD [4]
  5. Nicola Magrini, MD [5]
  6. Holger Schunemann, MD, PhD [6]

  [1] McMaster University, Hamilton, Ontario, Canada
  [2] Norwegian Knowledge Centre for the Health Services, Oslo, Norway
  [3] Case Western Reserve University, Cleveland, Ohio, USA
  [4] Institute for Clinical Epidemiology, Basel, Switzerland
  [5] Centre for Evaluation of the Effectiveness of Health Care, Modena, Italy
  [6] McMaster University, Hamilton, Ontario, Canada; Italian National Cancer Institute Regina Elena, Rome, Italy

    Clinical practice guidelines have improved in quality over the past 10 years by adhering to a few basic principles, such as conducting thorough systematic reviews of relevant evidence and grading the recommendations and the quality of the underlying evidence. The large number of systems of measuring the quality of evidence and recommendations that have emerged are, however, confusing.1

    The mission of the Grading of Recommendations, Assessment, Development, and Evaluation (GRADE) working group is to help resolve the confusion among the different systems of rating evidence and recommendations. The group has wide representation from many organisations including the Agency for Healthcare Research and Quality in the US, the National Institute for Clinical Excellence for England and Wales, and the World Health Organization. Developing a new uniform rating system is challenging because all systems have limitations and because many organisations have invested a great deal of time and effort to develop their rating systems and are understandably reluctant to adopt a new system.

    The GRADE working group first published the results of its work in 2004 in the BMJ.2 A simpler, clinically oriented description will soon be published.3 GRADE has taken care to ensure its suggested system is simple to use and applicable to a wide variety of clinical recommendations that span the full spectrum of medical specialties and clinical care.

    The GRADE system classifies recommendations in 1 of 2 levels—strong and weak—and quality of evidence into 1 of 4 levels—high, moderate, low, and very low. Evidence based on randomised controlled trials (RCTs) begins with a top rating on GRADE’s 4 level quality of evidence classification (table 1). GRADE takes into account, however, that not all RCTs are alike, and that limitations of individual RCTs may compromise the quality of their evidence (table 2).

    Table 1. Quality of evidence and definitions

    Table 2. Factors in deciding on confidence in estimates of benefits, risks, burden, and costs

    Firstly, quality decreases if most of the evidence comes from RCTs with serious methodological flaws such as lack of allocation concealment or blinding, or large loss to follow up. A second reason for downgrading is inconsistency of results; our confidence in the estimates of benefit or risk is weaker if some studies show substantial effects and other apparently similar studies show no effect at all.

    Indirectness may compromise the quality of evidence. Evidence is indirect if there are no head to head comparisons between therapeutic alternatives. For instance, drug benefit plans or formularies have to choose between funding of a number of bisphosphonates, including alendronate and risedronate, for prevention of osteoporotic fractures. Unfortunately, the decision must be made on a comparison of trials evaluating alendronate against placebo, and risedronate against placebo, rather than direct comparisons of alendronate and risedronate. Evidence may also be indirect if differences exist in populations (we are interested in valvular atrial fibrillation, but all RCTs are in non-valvular atrial fibrillation), interventions (we’d like to know about relatively low dose angiotensin converting enzyme inhibition, but all trials are in higher dose), or outcomes (we’d like to know about long term effectiveness, but all trials have only short follow up durations).

    When total sample size is small and outcome events are few, our uncertainty about estimates of benefit and risk increases. GRADE continues to debate the appropriate thresholds for decreasing strength of inference: when are confidence intervals too wide, and how few events are too few?

    While observational studies (eg, cohort studies) start with a “low quality” rating, they may be graded upwards if the magnitude of the treatment effect is very large (eg, hip replacement for severe hip osteoarthritis), if there is evidence of a dose response relation, or if all apparent confounders would decrease the magnitude of the treatment effect (table 2). For example, a systematic review showed higher mortality in for-profit than in not-for-profit hospitals.4 This result was found despite the fact that for-profit hospitals usually admit healthier patients with a higher socioeconomic status and have more resources at their disposal. These potential confounders, if anything, favour for-profit hospitals. If such confounders were taken into account, the magnitude of effect favouring not-for-profit hospitals would be even larger.
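The rating logic described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the GRADE working group's algorithm: the factor names are hypothetical labels for the reasons discussed in the text, and the rule that each factor moves the rating by exactly one level is a simplification introduced here.

```python
from enum import IntEnum
from typing import Iterable


class Quality(IntEnum):
    """The 4 quality of evidence levels named in the text."""
    VERY_LOW = 0
    LOW = 1
    MODERATE = 2
    HIGH = 3


# Hypothetical labels for the reasons the text gives for lowering a rating.
DOWNGRADE = {"study_limitations", "inconsistency", "indirectness", "sparse_data"}
# Hypothetical labels for the reasons the text gives for raising a rating.
UPGRADE = {"large_effect", "dose_response", "confounders_reduce_effect"}


def rate_quality(randomised: bool, factors: Iterable[str] = ()) -> Quality:
    """Return an illustrative quality rating for a body of evidence."""
    factors = set(factors)
    unknown = factors - DOWNGRADE - UPGRADE
    if unknown:
        raise ValueError(f"unknown factors: {sorted(unknown)}")

    # RCT evidence starts at the top rating, observational evidence at low.
    start = Quality.HIGH if randomised else Quality.LOW

    # Assumption: every factor shifts the rating by one level.
    level = start - len(factors & DOWNGRADE) + len(factors & UPGRADE)

    # Keep the result inside the 4 level scale.
    return Quality(max(Quality.VERY_LOW, min(Quality.HIGH, level)))
```

For instance, `rate_quality(True, {"inconsistency"})` yields `Quality.MODERATE`, and `rate_quality(False, {"large_effect"})` also yields `Quality.MODERATE`. In practice, judging whether a factor applies, and how heavily it should weigh, is a matter of judgement that this sketch does not capture.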

    As noted, the GRADE system offers 2 levels of recommendations: strong and weak. When the benefits of an intervention clearly outweigh its risks and burden, or clearly do not, strong recommendations are warranted. On the other hand, when the tradeoff between benefits and risks is less certain, either because of low quality evidence or because high quality evidence suggests benefits and risks are closely balanced, weak recommendations become appropriate.
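The two-level decision can be written as a small rule. The sketch below is deliberately simplified and rests on assumptions that are not part of GRADE itself: the function name and parameters are hypothetical, low or very low quality evidence is treated as always making the tradeoff uncertain, and the other factors that can bear on recommendation strength are ignored.

```python
QUALITY_LEVELS = {"high", "moderate", "low", "very low"}
# Assumption: these levels always leave the tradeoff uncertain.
UNCERTAIN_LEVELS = {"low", "very low"}


def recommendation_strength(tradeoff_is_clear: bool, evidence_quality: str) -> str:
    """Return "strong" or "weak" under the simplified rule described above.

    tradeoff_is_clear: True when benefits clearly outweigh risks and burden,
    or clearly do not; False when benefits and risks are closely balanced.
    """
    if evidence_quality not in QUALITY_LEVELS:
        raise ValueError(f"unknown quality level: {evidence_quality!r}")
    if tradeoff_is_clear and evidence_quality not in UNCERTAIN_LEVELS:
        return "strong"
    return "weak"
```

Under this rule, high quality evidence of a closely balanced tradeoff gives `recommendation_strength(False, "high") == "weak"`, matching the second route to a weak recommendation described in the paragraph.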

    This 2 level approach is easy to put into practice. For strong recommendations in which it is clear that benefits far outweigh risks, or risks far outweigh benefits, virtually all patients will make the same choice (eg, aspirin in the setting of acute myocardial infarction). In such instances, physicians can confidently recommend treatment. For weak recommendations, different patients may choose different approaches to treatment. One example is the use of hormone replacement therapy for menopausal hot flashes. Under these circumstances, clinicians must know the evidence and communicate the evidence to their patients, or conduct a detailed inquiry to ensure their recommendations are consistent with patients’ values and preferences.5 Beyond the quality of the evidence, a number of other factors may bear on whether recommendations are strong or weak (table 3).

    Table 3. Factors in deciding on a strong or weak recommendation

    The GRADE system is rigorous in its methodology, yet practical to use. It is neither too complex nor misleadingly simple. The Cochrane Collaboration is moving to adopt the GRADE approach for the rating of methodological quality, and the revised Quality of Reporting of Systematic Reviews (QUOROM) statement is likely to endorse the approach. The Endocrine Society is the first North American organisation to adopt GRADE for its recommendations while another important organisation, the American College of Chest Physicians (ACCP), has adopted a slightly modified version of GRADE. Other organisations, such as the American Thoracic Society and the BMJ Publishing Group, which publishes Clinical Evidence, will be exploring possible use of the GRADE approach.

    The ACCP modification, which collapses the low and very low quality categories, represents a simplification that may be attractive to groups providing recommendations primarily for clinical practice (rather than, for instance, public health interventions). In a particularly significant development, the popular electronic medical text UpToDate is moving to formal structured recommendations using this particular modification of the GRADE approach.
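As a minimal sketch, the collapse from 4 quality levels to 3 is a simple mapping. The function name is hypothetical, and reusing the label "low" for the merged category is an assumption made here for illustration; the text does not say what the ACCP calls its categories.

```python
# Assumption: the merged category keeps the label "low".
COLLAPSED = {
    "high": "high",
    "moderate": "moderate",
    "low": "low",
    "very low": "low",
}


def collapse_quality(grade_level: str) -> str:
    """Map a 4 level quality rating onto a 3 level scale."""
    try:
        return COLLAPSED[grade_level]
    except KeyError:
        raise ValueError(f"unknown quality level: {grade_level!r}") from None
```

The mapping loses information in one direction only: `collapse_quality("very low")` and `collapse_quality("low")` both return `"low"`, so a 3 level rating cannot be converted back to the 4 level scale.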

    The leading American and European urology associations have, for some years, been among the leaders in evidence-based guidelines. But, as in every other area, individual urology organisations have collected and summarised evidence separately, and come up with separate approaches to developing and grading recommendations. This is not only unnecessarily time consuming but also confusing for consumers of guidelines.

    American, European, and Asian urology organisations all plan to adopt GRADE for their recommendations. This alliance is also likely to see them pooling resources and using standard approaches to collect and summarise evidence relevant to urology practice, and sharing these evidence summaries across groups. Recommendations may still differ according to local circumstances and differing values and preferences, but the evidentiary basis, and the quality of evidence rating (using the GRADE approach) will be uniform. Guidelines will be framed using the simple, clinically applicable GRADE system of strong or weak recommendations.

    This development has enormous implications for efficiency, improved communication, and optimal clinical decision making. If the urology community can overcome political barriers and achieve this eminently sensible revolution in knowledge management and expert guidance, why not orthopaedic surgery, respirology, nephrology, and even cardiology?

    Most of the developments we have described are still evolving. Perhaps our vision of their eventual outcome is overly sanguine. If we can maintain momentum, however, GRADE will do more than achieve the worthy and important goal of standardising systems of grading quality of evidence and recommendations for clinical practice. GRADE may facilitate the evolution toward a world in which expert recommendations for frontline clinicians uniformly adhere to principles of evidence management and guideline development that flow from the intellectual movement we call evidence-based medicine.

