Statistical approaches to uncertainty: p values and confidence intervals unpacked
HELEN DOLL, BSc, DIP APP STATS, MSc (1); STUART CARNEY, MB, ChB, MPH, MRCPsych (2)

(1) Department of Public Health, University of Oxford, Oxford, UK
(2) Department of Psychiatry, University of Oxford, Oxford, UK

The absolute risk reduction (ARR; the difference in risk) is estimated to be 19.6% with a 95% confidence interval (CI) of 5.7% to 33.6%. The p value of 0.006 means that an ARR of 19.6% or more would occur in only 6 in 1000 trials if streptomycin were no more effective than bed rest. Since the p value is less than 0.05, the results are statistically significant (ie, it is unlikely that streptomycin is ineffective in preventing death). The 95% CI suggests that the true benefit of streptomycin could be as small as 5.7% or as large as 33.6%, but is very unlikely to be 0% or less. Our best estimate for the ARR is 19.6% and hence the number needed to treat (NNT) is 6 (95% CI 3 to 18). This means that we might have to treat as many as 18 people with streptomycin, or as few as 3, to prevent 1 additional person dying of tuberculosis.
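The NNT figures quoted above follow from taking the reciprocal of the ARR (expressed as a proportion) and rounding up to the next whole person. The sketch below reproduces that arithmetic from the reported ARR and its 95% CI; the function name `nnt_from_arr` is ours, and rounding up is the usual convention, assumed here rather than stated in the article.

```python
import math


def nnt_from_arr(arr):
    """Number needed to treat: reciprocal of the ARR (a proportion), rounded up."""
    if arr <= 0:
        raise ValueError("this sketch only handles a positive ARR")
    return math.ceil(1.0 / arr)


# Values reported in the text: ARR 19.6%, 95% CI 5.7% to 33.6%.
arr, ci_low, ci_high = 0.196, 0.057, 0.336

nnt = nnt_from_arr(arr)
# The largest plausible benefit gives the smallest NNT, and vice versa.
nnt_best_case = nnt_from_arr(ci_high)
nnt_worst_case = nnt_from_arr(ci_low)

print(f"NNT = {nnt} (95% CI {nnt_best_case} to {nnt_worst_case})")
# prints: NNT = 6 (95% CI 3 to 18)
```

Note that the limits swap places: the upper limit of the ARR becomes the lower limit of the NNT.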

## WHY IS THE 5% LEVEL (P<0.05) USED TO INDICATE STATISTICAL SIGNIFICANCE?

Conventionally, a p value of <0.05 is taken to indicate statistical significance. This 5% level is, however, an arbitrary minimum and p values should be much smaller, as in the above study (p = 0.006), before they can be considered to provide strong evidence against the null hypothesis. Hence reporting the exact p value (eg, p = 0.027) is more helpful than simply stating that the result is significant at the 5% level (or 1% level, as above).

## IF AN EFFECT IS STATISTICALLY SIGNIFICANT, DOES THIS MEAN IT IS CLINICALLY SIGNIFICANT?

A statistically significant difference is not necessarily one that is of clinical significance. In the above example, the statistically significant effect (p = 0.006) is also clinically significant, as even a modest improvement in survival is important. For many effects, however, the benefit needs to be somewhat greater than zero for it to be of clinical significance (ie, of sufficient benefit to be worth the effort of treatment). In figure 1, while both studies (a) and (c) show a statistically significant result, with the CIs not including the “no difference” value, only (a) has a result that is consistent (in terms of the CI) with at least a minimum clinically important difference (MCID). Studies (b) and (d) are not statistically significant, as their CIs include the value of no difference.

Figure 1. Clinical significance and statistical significance.
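The two judgements described above can be made separately from a CI: whether it excludes the value of no difference, and where it lies relative to the MCID. The following is a minimal sketch under stated assumptions. The function name, the labels, the MCID of 5, and all four example intervals are hypothetical and are not read from figure 1. It also assumes that larger positive values mean greater benefit, and it takes "consistent with at least the MCID" to mean that the whole CI lies at or above the MCID.

```python
def classify_interval(ci_low, ci_high, mcid, no_difference=0.0):
    """Return (statistically_significant, clinical_label) for one CI."""
    # Statistically significant when the CI excludes the no-difference value.
    significant = not (ci_low <= no_difference <= ci_high)

    # Clinical importance is judged against the MCID, not against zero.
    if ci_low >= mcid:
        clinical = "at least the MCID"
    elif ci_high < mcid:
        clinical = "smaller than the MCID"
    else:
        clinical = "inconclusive"
    return significant, clinical


# Hypothetical intervals, with a hypothetical MCID of 5 units.
examples = {
    "large effect, narrow CI": (8.0, 20.0),
    "wide CI crossing zero": (-10.0, 25.0),
    "tiny effect, very narrow CI": (0.5, 2.0),
    "narrow CI crossing zero": (-3.0, 4.0),
}
results = {name: classify_interval(low, high, mcid=5.0)
           for name, (low, high) in examples.items()}
for name, outcome in results.items():
    print(name, outcome)
```

The third example is the case the text warns about: statistically significant, yet the whole interval falls short of the MCID.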

## ARE P VALUES AND CONFIDENCE INTERVALS RELATED?

While the 2 approaches to dealing with the problem of uncertainty are somewhat different, p values and CIs generally provide consistent results. If the effect is statistically significant (at the 5% level), then the 95% CI will not include the value of “no difference”, and vice versa. While CIs are preferable to p values in summarising study results, both approaches are commonly used.
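One way to see the link is to work backwards from a reported 95% CI to an approximate p value. The sketch below does this for the streptomycin ARR, assuming the CI was built as estimate ± 1.96 standard errors and that a two-sided normal (z) test applies. The original analysis may have used a different test, so this is an illustration of the relationship, not a reconstruction of the authors' calculation. The function name is ours.

```python
import math


def p_value_from_ci(estimate, ci_low, ci_high, z_crit=1.96):
    """Approximate two-sided p value from a point estimate and its 95% CI."""
    # Under the normal approximation the CI spans 2 * z_crit standard errors.
    se = (ci_high - ci_low) / (2.0 * z_crit)
    z = estimate / se  # test statistic against "no difference" (0)
    # Two-sided tail area of the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return se, z, p


# Values reported in the text: ARR 19.6%, 95% CI 5.7% to 33.6%.
ci_low, ci_high = 0.057, 0.336
se, z, p = p_value_from_ci(0.196, ci_low, ci_high)
print(f"SE = {se:.4f}, z = {z:.2f}, p = {p:.4f}")
```

The result is roughly 0.006, in line with the p value reported for the trial, and the CI excludes 0 exactly when p falls below 0.05.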

## WHY DOES SAMPLE SIZE HAVE TO BE CONSIDERED WHEN INTERPRETING THE SIZE OF THE P VALUE AND THE WIDTH OF THE CI?

The larger the sample, the less the uncertainty, the narrower the CI, and hence the smaller the observed effect that can be declared statistically significant (p<0.05). Thus, if a sample is very large, even a very small difference (which may be of no clinical relevance) may be statistically significant (see (c) in figure 1). The width of a CI is affected by both the sample size (n) and the sample standard deviation (SD). The larger the sample (and the smaller its variability), the greater the precision of the sample estimate and thus the narrower the CI. A wide CI can thus reflect either a small sample or one with large variability (see (b) in figure 1).
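For a sample mean, the dependence on n and SD can be written down directly: the 95% CI extends about 1.96 × SD / √n either side of the estimate. The sketch below uses that normal-approximation formula with a hypothetical SD of 10 units; the function name and the sample sizes are illustrative only.

```python
import math


def ci_half_width(sd, n, z_crit=1.96):
    """Half-width of an approximate 95% CI for a mean: z * SD / sqrt(n)."""
    return z_crit * sd / math.sqrt(n)


# Hypothetical SD of 10 units; each step quadruples the sample size.
for n in (25, 100, 400):
    print(n, round(ci_half_width(10.0, n), 2))
# prints:
# 25 3.92
# 100 1.96
# 400 0.98
```

Because the width shrinks with the square root of n, halving the width of a CI takes four times as many participants, whereas halving the SD would achieve the same with no extra participants.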

## CAN THE CONCLUSIONS FROM A HYPOTHESIS TEST BE IN ERROR?

Since hypothesis tests are based on estimates of probability, their conclusions can be in error. There are 2 types of error: rejecting the null hypothesis when it is true (type I error; the probability of this error is 5% if the 5% significance level is used) and failing to reject the null hypothesis when it is false (type II error; the probability of this error is 1 − Power) (figure 2). Power and sample size will be discussed in more detail in a future Statistics Note.

Figure 2. Type I and II errors.
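Both error rates can be illustrated by simulating many trials and counting how often the null hypothesis is rejected at the 5% level. The sketch below is a simplified illustration, not a method from the article: it assumes two groups of 50, a known SD of 1, a two-sided z test, and a hypothetical true difference of 0.5 SD when the null hypothesis is false. All of these settings, and the function name, are our assumptions.

```python
import math
import random


def rejection_rate(true_diff, n_per_group=50, sd=1.0, n_trials=2000, seed=0):
    """Proportion of simulated two-group trials with p < 0.05 (z test, known SD)."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    se = sd * math.sqrt(2.0 / n_per_group)  # SE of a difference in means
    rejections = 0
    for _ in range(n_trials):
        mean_a = sum(rng.gauss(0.0, sd) for _ in range(n_per_group)) / n_per_group
        mean_b = sum(rng.gauss(true_diff, sd) for _ in range(n_per_group)) / n_per_group
        z = (mean_b - mean_a) / se
        p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided p value
        if p < 0.05:
            rejections += 1
    return rejections / n_trials


# Null hypothesis true: every rejection is a type I error (expected near 5%).
type_i_rate = rejection_rate(true_diff=0.0)

# Null hypothesis false: the rejection rate estimates the power,
# and 1 - power is the type II error rate.
power = rejection_rate(true_diff=0.5)
type_ii_rate = 1.0 - power

print(f"type I error rate ~ {type_i_rate:.3f}")
print(f"power ~ {power:.3f}, type II error rate ~ {type_ii_rate:.3f}")
```

Under these assumptions the type I error rate should settle near 0.05 and the power near 0.7; increasing `n_per_group` raises the power while leaving the type I error rate at about 5%.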

Table 1. Outcomes in the RCT comparing streptomycin with bed rest alone in the treatment of tuberculosis

Comparison of the use of p values and confidence intervals in statistical inference
