
On reporting and interpreting statistical significance and p values in medical research
Herman Aguinis (1), Matt Vassar (2), Cole Wayant (2)
  1. Management, The George Washington University, Washington, District of Columbia, USA
  2. Psychiatry and Behavioral Sciences, Oklahoma State University Center for Health Sciences, Tulsa, Oklahoma, USA
Correspondence to Cole Wayant, Oklahoma State University Center for Health Sciences, Tulsa, OK 74107, USA; cole.wayant{at}okstate.edu


Recent proposals to change the p value threshold from 0.05 to 0.005 or to retire statistical significance altogether have garnered much criticism and debate.1 2 At the time of writing, the proposal to eliminate statistical significance testing, backed by over 800 signatories, had achieved record-breaking status on Altmetric, with an attention score exceeding 13 000 derived from 19 000 Twitter comments and 35 news stories. We appreciate the renewed enthusiasm for tackling important issues related to the analysis, reporting and interpretation of scientific research results. Our perspective, however, focuses on the current use and reporting of statistical significance and where we should go from here.

  1. We begin by saying that p values themselves are not flawed. Rather, the flaw lies in the misuse or abuse of p values in ways antithetical to rigorous scientific pursuits. If p values are a hammer, scientists are the hammer wielders. One would not discard the hammer because the wielder repeatedly missed the nail, nor because the wielder used it for a purpose it does not suit, such as driving a screw. One would instead conclude that the fault lies with the wielder and recommend ways to refine the hammer’s use. Thus, a focus on education and reform may be more helpful than the abandonment of statistical significance testing, which is a tool that can be used well, misused or even abused.

  2. In this perspective, we argue that abandoning statistical significance because scientists misuse p values does not address the underlying problem of statistical negligence. Nor does it address the incorrect belief that statistical significance equates to clinical significance.3

The a priori significance level (ie, alpha or type I error rate) and the exact observed probability values (ie, p) should be explicitly stated and justified in protocols and published reports of medical studies. We examined current guidance on p value reporting in influential sources in medicine (table 1). Generally, this guidance supports reporting exact p values but gives no direction on specifying the a priori significance level. The ‘conventional’ a priori significance level (ie, type I error rate) in many scientific disciplines is 0.05, an arbitrary choice. Two issues arise when scientists arbitrarily default to an a priori significance level: results can become misleading, and the relative seriousness of making a type I (‘false-positive’) or type II (‘false-negative’) error is ignored.

Table 1

Guidance on p value, alpha prespecification and effect size reporting from influential sources in medicine

First, misleading results may fall on either side of the conventional 0.05 threshold, with scientists blindly rejecting or accepting the null hypothesis while failing to consider sample size, measurement error and other factors that affect observed p values but are unrelated to the size of the effect in the population. Also, on the dichotomous interpretation of a truly continuous probability, Rosnow and Rosenthal4 sarcastically lamented that ‘Surely, God loves the 0.06 nearly as much as the 0.05’. Second, the choice of an a priori significance level should be made in the context of the potential for type II error. When researchers arbitrarily default to a type I error rate of 0.05, it has been calculated that the corresponding type II error rate is approximately 60%, because statistical power (ie, the probability of correctly rejecting a false null hypothesis) is usually insufficient given small sample sizes and the pervasive and unavoidable use of less-than-perfectly reliable measures.5 6 In other words, while authors focus on whether their results show an acceptably small type I error rate, type II error (the probability of erroneously accepting the null hypothesis and incorrectly concluding that an effect is absent) looms large. Do authors, peer reviewers, editors and readers of studies that fail to reach statistical significance consider the probability that the results are falsely ‘negative’?
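The trade-off between type I and type II error can be made concrete with a rough power calculation. The following is a minimal sketch, not taken from the article or its references: it assumes a two-sided, two-sample z-test with equal group sizes and a known common variance, and the function name `two_sample_power` and the effect size of 0.3 are illustrative choices only.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_power(effect_size: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided, two-sample z-test.

    effect_size is a standardised mean difference (Cohen's d). Equal group
    sizes and a known common variance are assumed for simplicity.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Expected value of the test statistic under the alternative hypothesis.
    noncentrality = effect_size * sqrt(n_per_group / 2)
    # Probability that the statistic lands in either rejection region.
    return std_normal.cdf(noncentrality - z_crit) + std_normal.cdf(-noncentrality - z_crit)


if __name__ == "__main__":
    # Hypothetical scenario: a modest true effect (d = 0.3) at several sample sizes.
    for n in (20, 50, 100, 200):
        power = two_sample_power(effect_size=0.3, n_per_group=n)
        print(f"n per group = {n:3d}: power = {power:.2f}, type II error = {1 - power:.2f}")
```

Under these assumptions, holding alpha at 0.05 while the sample is small leaves the type II error rate large, which is the point made above about defaulting to a threshold without considering power.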

A second limitation of the current guidance is its inconsistency in mandating the reporting of effect sizes, which describe the strength of the relationship and/or the effect found. The only information to be gleaned from p values is whether the observed data would be likely were the null hypothesis (that no effect exists) true. Therefore, a p value without an effect size is like peering into a pool of murky water: one cannot determine the depth, only say that a pool likely exists. Consider interventions for improving medication adherence for patients with hypertension. A recent systematic review of medication adherence interventions found that the overall standardised mean difference for systolic blood pressure was 0.235, corresponding to a 3 mm Hg difference.7 Translating mean differences to clinical differences assists in determining the practical value of the intervention. In this example, the clinician must consider whether a 3 mm Hg reduction in systolic blood pressure is clinically meaningful and weigh this reduction against the factors associated with enacting the intervention, as well as whether other interventions might yield a more clinically meaningful improvement. Some of the influential guidance (or omission thereof) provided to authors in medicine (table 1) may serve to promote the poor statistical practices that readers work to mitigate. Therefore, it is our perspective that not only should all guidance emphasise reporting effect sizes, but guidance on how to interpret and report effect sizes in a meaningful way should be included as well. For example, one may report the absolute difference between groups and the number needed to treat for a medical intervention. Readers may be incapable of determining the meaningfulness of a p value but are well-equipped to interpret an absolute difference in effectiveness.
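The effect size measures mentioned above are simple to compute. The sketch below is illustrative only: the event rates (20% versus 15%) are hypothetical, and the pooled SD of about 12.8 mm Hg is an assumption back-calculated from the 0.235 and 3 mm Hg figures quoted above, not a value reported in the review.

```python
from math import ceil


def absolute_risk_reduction(control_event_rate: float, treated_event_rate: float) -> float:
    """Absolute difference in event rates between control and treated groups."""
    return control_event_rate - treated_event_rate


def number_needed_to_treat(control_event_rate: float, treated_event_rate: float) -> int:
    """Number needed to treat, rounded up to a whole patient by convention."""
    arr = absolute_risk_reduction(control_event_rate, treated_event_rate)
    if arr <= 0:
        raise ValueError("treatment shows no absolute benefit; NNT is undefined")
    # Round first so floating-point noise (eg, 10.000000000000002) is not rounded up.
    return ceil(round(1 / arr, 9))


def smd_to_raw_units(smd: float, pooled_sd: float) -> float:
    """Convert a standardised mean difference back to the original measurement units."""
    return smd * pooled_sd


if __name__ == "__main__":
    # Hypothetical trial: 20% event rate under control, 15% under treatment.
    print(f"ARR = {absolute_risk_reduction(0.20, 0.15):.3f}")
    print(f"NNT = {number_needed_to_treat(0.20, 0.15)}")
    # Assumed pooled SD of 12.8 mm Hg (back-calculated, not reported in the source).
    print(f"Difference = {smd_to_raw_units(0.235, 12.8):.1f} mm Hg")
```

Reporting quantities in these units lets a clinician judge the practical value of an intervention directly, which a p value alone cannot do.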

Taken together, reporting (1) precise observed p values (rather than whether they are larger or smaller than arbitrary cutoffs), (2) effect sizes and (3) the practical importance of effect sizes (ie, their interpretation for clinical practice) would improve our understanding of the meaning of study findings. Let us not throw out the baby with the bathwater.
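As a rough illustration of the three reporting elements, the sketch below summarises a two-group comparison from summary statistics. It assumes a large-sample normal (z) approximation rather than a t-test, and the blood pressure values and the function name `summarise_difference` are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def summarise_difference(mean_a, sd_a, n_a, mean_b, sd_b, n_b, alpha=0.05):
    """Return the raw difference, its confidence interval, exact p value and SMD.

    Uses a normal approximation, which is reasonable only for large samples.
    """
    std_normal = NormalDist()
    diff = mean_a - mean_b
    se = sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)
    z = diff / se
    p_value = 2 * (1 - std_normal.cdf(abs(z)))  # two-sided
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    pooled_sd = sqrt(((n_a - 1) * sd_a ** 2 + (n_b - 1) * sd_b ** 2) / (n_a + n_b - 2))
    return {
        "difference": diff,
        "ci": (diff - z_crit * se, diff + z_crit * se),
        "p_value": p_value,
        "smd": diff / pooled_sd,
    }


if __name__ == "__main__":
    # Hypothetical systolic blood pressure data: the same 3 mm Hg difference at two sample sizes.
    for n in (50, 500):
        r = summarise_difference(137.0, 13.0, n, 140.0, 13.0, n)
        low, high = r["ci"]
        print(f"n = {n} per group: difference = {r['difference']:.1f} mm Hg "
              f"(95% CI {low:.1f} to {high:.1f}), SMD = {r['smd']:.2f}, p = {r['p_value']:.4f}")
```

In this hypothetical comparison the difference and the standardised effect size are identical at both sample sizes, while the p value changes considerably, so the p value alone says little about clinical importance.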



Footnotes

  • Twitter @ColeWayant_OK

  • Contributors HA and MV conceptualised the paper. CW extracted all the data. HA, MV and CW wrote the manuscript and approve of it in its final form.

  • Competing interests None declared.

  • Patient consent for publication Not required.

  • Provenance and peer review Not commissioned; externally peer reviewed.
