When busy clinicians bump into a new treatment, they ask themselves 2 questions. Firstly, is it better than (“superior to”) what they are using now? Secondly, if it’s not superior, is it as good as what they are using now (“non-inferior”) and preferable for some other reason (eg, fewer side effects or more affordable)? Moreover, they want answers to these questions right away. Evidence-Based Medicine and its related evidence-based journals do their best to answer these questions in their “more informative titles.” That’s why this issue contains titles such as: “Angioplasty at an invasive treatment centre reduced mortality compared with first contact thrombolysis”1 (http://ebm.bmjjournals.com/cgi/content/9/2/42) and “Ximelagatran was non-inferior to warfarin in preventing stroke and systemic embolism in atrial fibrillation.”2 (http://ebm.bmjjournals.com/cgi/content/9/2/43) The latter of these 2 studies prompted this editorial.
Progress toward this “more informative” goal has been slow because we have been prisoners of traditional statistical concepts that call for 2-sided tests of statistical significance and require rejection of the null hypothesis. We have further imprisoned ourselves by misinterpreting “statistically nonsignificant” results of these 2-tailed tests. Rather than recognising such results as “indeterminate” (uncertain), we conclude that they are “negative” (certain, providing proof of no difference between treatments). This editorial will address the problems created by these ways of thinking and, more importantly, their clinically relevant solutions.
At the root of our problem is the “null hypothesis,” which decrees that the difference between a new and standard treatment ought to be zero. Two-sided p values tell us the probability of observing a difference at least as large as ours, in either direction, if that null hypothesis were true. When that probability is small (say, <5%), we “reject” the null hypothesis and “accept” the “alternative hypothesis” that the difference we’ve observed is not zero. In doing so, however, we make no distinction between the new treatment being better, on the one hand, or worse, on the other, than the standard treatment.
There are 3 consequences of this faulty reasoning. Firstly, by performing “2-sided” tests of statistical significance, investigators turn their backs on the “1-sided” clinical questions of superiority and non-inferiority. Secondly, they often fail to recognise that the results of these 2-sided tests, especially in small trials, can be “statistically nonsignificant” even when their confidence intervals include clinically important benefit or harm. Thirdly, investigators (abetted by editors) frequently misinterpret this failure to reject the null hypothesis (based on 2-sided p values >5%, or 95% confidence intervals that include zero). Rather than recognising their results as uncertain (“indeterminate”), they report them as “negative” and conclude that there is “no difference” between the treatments. By doing so, authors and editors and readers regularly fall into the trap of concluding that the “absence of proof of a difference” between 2 treatments constitutes “proof of an absence of a difference” between them. This mistake was forcefully pointed out by Phil Alderson and Iain Chalmers: “It is never correct to claim that treatments have no effect or that there is no difference in the effects of treatments. It is impossible to prove ... that two treatments have the same effect. There will always be some uncertainty surrounding estimates of treatment effects, and a small difference can never be excluded.”3
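The second consequence can be made concrete with a small worked sketch. The event counts below are hypothetical (they come from no trial cited here), and the function name `risk_difference_summary` and the simple Wald (normal approximation) interval are assumptions chosen for brevity rather than a recommended analysis. The point is only that a 2-sided p value above 5% can sit alongside a confidence interval that still contains a clinically important benefit.

```python
from math import sqrt
from statistics import NormalDist


def risk_difference_summary(events_new, n_new, events_std, n_std, level=0.95):
    """Return (difference, lower, upper, two_sided_p) for new minus standard.

    Uses the Wald normal approximation for a difference in event risks.
    """
    risk_new = events_new / n_new
    risk_std = events_std / n_std
    diff = risk_new - risk_std
    se = sqrt(risk_new * (1 - risk_new) / n_new
              + risk_std * (1 - risk_std) / n_std)
    z_crit = NormalDist().inv_cdf(1 - (1 - level) / 2)
    p_two_sided = 2 * (1 - NormalDist().cdf(abs(diff / se)))
    return diff, diff - z_crit * se, diff + z_crit * se, p_two_sided


# Hypothetical small trial: 12/50 bad outcomes on the new treatment,
# 18/50 on the standard treatment.
diff, lower, upper, p = risk_difference_summary(12, 50, 18, 50)
print(f"risk difference {diff:+.3f}, "
      f"95% CI {lower:+.3f} to {upper:+.3f}, 2-sided p = {p:.2f}")
```

With these made-up numbers the interval spans zero, so the result is “statistically nonsignificant,” yet it also reaches well past a 10 percentage point reduction in bad outcomes. Calling such a result “negative” would be exactly the mistake described above; “indeterminate” is the honest label.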
A solution to both this incompatibility (between 1-sided clinical reasoning and 2-sided statistical testing) and confusion (about the clinical interpretation of statistically nonsignificant results) has been around for decades, but is just now gaining widespread recognition and application. I assign most of the credit to a pair of biostatisticians, Charles Dunnett and Michael Gent, and others have also contributed to its development4 (although the latter sometimes refer to “non-inferiority” as “equivalence,” a term whose common usage fails to distinguish 1-sided from 2-sided thinking). I’ll illustrate the contribution of Charles Dunnett and Michael Gent with a pair of trials in which their thinking helped clinical colleagues escape from the prison of 2-sided null hypothesis testing and, by doing so, prevented the misinterpretation of statistically nonsignificant results.5
Thirty years ago, a group of us performed a randomised controlled trial (RCT) of nurse practitioners as providers of primary care.6 We wanted to know if patients fared as well under their care as under the care of general practitioners. Guided by Mike Gent, we came to realise that a 2-sided analysis that produced an “indeterminate,” statistically nonsignificant difference in patient outcomes could confuse rather than clarify matters. We therefore abandoned our initial 2-sided null hypothesis and decided that we’d ask a non-inferiority question: Were the outcomes of patients cared for by nurse practitioners non-inferior to those of patients cared for by general practitioners? Mike then helped us recognise the need to specify our limit of acceptable “inferiority” in terms of these outcomes. With his prodding, we decided that we would tolerate no worse than 5% lower physical, social, or emotional function at the end of the trial among patients randomised to our nurse practitioners as we observed among patients randomised to our general practitioners. As it happened, our 1-sided analysis revealed that the probability that our nurse practitioners’ patients were worse off (by ⩾5%) than our general practitioners’ patients was as small as 0.008. We had established that nurse practitioners were not inferior to general practitioners as providers of primary care.
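The logic of that non-inferiority question can be sketched in a few lines. Everything specific here is an assumption for illustration: the counts are hypothetical and are not the trial's data, the outcome is simplified to the proportion of patients with good function, the 5% margin is treated as 5 percentage points, and `noninferiority_p_value` is a made-up name for a plain 1-sided z test. It is not a reconstruction of the analysis we actually ran.

```python
from math import sqrt
from statistics import NormalDist


def noninferiority_p_value(good_new, n_new, good_std, n_std, margin):
    """One-sided p value for H0: the new arm is worse by at least `margin`.

    `margin` is the largest acceptable shortfall, as a proportion (0.05 = 5
    percentage points). A small p value supports non-inferiority.
    """
    prop_new = good_new / n_new
    prop_std = good_std / n_std
    diff = prop_new - prop_std
    se = sqrt(prop_new * (1 - prop_new) / n_new
              + prop_std * (1 - prop_std) / n_std)
    # Distance of the observed difference above the inferiority limit,
    # in standard errors.
    z = (diff + margin) / se
    return 1 - NormalDist().cdf(z)


# Hypothetical data: 696/800 with good function under the new providers,
# 688/800 under the standard providers, margin fixed in advance at 0.05.
p = noninferiority_p_value(696, 800, 688, 800, margin=0.05)
print(f"1-sided p value for inferiority beyond the margin: {p:.4f}")
```

Note that the margin enters the calculation before any data do, which is why it has to be chosen, and defended clinically, at the design stage.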
Twenty years ago, a group of us performed an RCT of superficial temporal artery–middle cerebral artery anastomosis (“EC-IC bypass”) for patients with threatened stroke.7 To the disappointment of many, we failed to show a statistically significant superiority of surgery for preventing subsequent fatal and non-fatal stroke. It became important to overcome the ambiguity of this “indeterminate” result. We therefore asked the 1-sided question: What degree of surgical benefit could we rule out? That 1-sided analysis, which calculated the upper end of a 90% (rather than 95%) confidence interval, excluded a surgical benefit as small as 3%. When news of this 1-sided result got around, performance of this operation rapidly declined.
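The same idea, turned around to ask “what benefit can we rule out?”, is sketched below. The upper end of a 2-sided 90% confidence interval is the same number as a 1-sided 95% upper bound, which is why the 90% interval appears in a 1-sided analysis. The event counts are hypothetical and are not the bypass trial's results, the function name `upper_bound_on_benefit` is invented for this sketch, and benefit is expressed as an absolute risk reduction purely for illustration.

```python
from math import sqrt
from statistics import NormalDist


def upper_bound_on_benefit(events_std, n_std, events_new, n_new,
                           one_sided_level=0.95):
    """One-sided upper confidence bound on the absolute risk reduction.

    Benefit is risk on standard care minus risk on the new treatment, so
    positive values favour the new treatment. Wald normal approximation.
    """
    risk_std = events_std / n_std
    risk_new = events_new / n_new
    benefit = risk_std - risk_new
    se = sqrt(risk_std * (1 - risk_std) / n_std
              + risk_new * (1 - risk_new) / n_new)
    return benefit + NormalDist().inv_cdf(one_sided_level) * se


# Hypothetical data: 200/700 events on standard care, 200/660 after surgery.
upper = upper_bound_on_benefit(200, 700, 200, 660)
smallest_worthwhile_benefit = 0.03  # assumed threshold, set before the trial
print(f"1-sided 95% upper bound on benefit: {upper:.3f}")
print("worthwhile benefit ruled out:", upper < smallest_worthwhile_benefit)
```

If the upper bound falls below the smallest benefit worth having, the trial has done more than fail to show superiority: it has excluded that benefit at the stated 1-sided confidence level.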
Thanks to statisticians like Charlie Dunnett and Mike Gent, we now know how to translate rational, 1-sided clinical reasoning into sensible, 1-sided statistical analysis. Moreover, this modern strategy of asking 1-sided non-inferiority and superiority questions in RCTs is gathering momentum. The CONSORT statement on recommendations for reporting RCTs omits any requirement for 2-sided significance testing. Even some journal editors are getting the message, for 1-sided non-inferiority and superiority trials have now appeared in the New England Journal of Medicine,8 Lancet,9 and JAMA,10 and this issue of Evidence-Based Medicine includes another Lancet article (http://ebm.bmjjournals.com/cgi/content/9/2/43).2
An essential prerequisite to doing 1-sided testing is the specification of the exact non-inferiority and superiority questions before the RCT begins. As with unannounced subgroup analyses, readers can and should be suspicious of authors who apply 1-sided analyses without previous planning and notice. Have they been slipped in only after a peek at the data revealed that conventional 2-sided tests generated indeterminate results? This need for prior specification of 1-sided analyses provides yet another argument for registering RCTs in their design stages, and for publishing their protocols in open access journals such as BioMed Central (http://www.biomedcentral.com).
I hope that this editorial will help free frontline clinicians, investigators, and editors from the 2-sided null-hypothesis prison. If any traditional, 2-sided biostatisticians happen upon it, they may object. If their objections are relevant to this journal’s readers, they might appear in these pages.