Guidance on p value, alpha prespecification and effect size reporting from influential sources in medicine
Source | Verbatim statement on p value reporting | Verbatim statement on alpha specification | Verbatim statement on effect size reporting |
New England Journal of Medicine [8] | Unless one-sided tests are required by study design, such as in non-inferiority clinical trials, all reported p values should be two-sided. In general, p values larger than 0.01 should be reported to two decimal places, and those between 0.01 and 0.001 to three decimal places; p values smaller than 0.001 should be reported as p<0.001. Notable exceptions to this policy include p values arising from tests associated with stopping rules in clinical trials or from genome-wide association studies. When comparing outcomes in two or more groups in confirmatory analyses, investigators should use the testing procedures specified in the protocol and SAP to control the overall type I error—for example, Bonferroni adjustments or prespecified hierarchical procedures. P values adjusted for multiplicity should be reported when appropriate and labelled as such in the manuscript. In hierarchical testing procedures, p values should be reported only until the last comparison for which the p value was statistically significant. P values for the first non-significant comparison and for all comparisons thereafter should not be reported. For prespecified exploratory analyses, investigators should use methods for controlling the false discovery rate described in the SAP—for example, Benjamini-Hochberg procedures. When no method to adjust for multiplicity of inferences or controlling false discovery rate was specified in the protocol or SAP of a clinical trial, the report of all secondary and exploratory endpoints should be limited to point estimates of treatment effects with 95% CIs. In such cases, the Methods section should note that the widths of the intervals have not been adjusted for multiplicity and that the inferences drawn may not be reproducible. No p values should be reported for these analyses. Therefore, in most cases, no p values for interaction should be provided in the forest plots. If significance tests of safety outcomes (when not primary outcomes) are reported along with the treatment-specific estimates, no adjustment for multiplicity is necessary. Because information contained in the safety endpoints may signal problems within specific organ classes, the editors believe that the type I error rates larger than 0.05 are acceptable. Editors may request that p values be reported for comparisons of the frequency of adverse events among treatment groups, regardless of whether such comparisons were prespecified in the SAP. When appropriate, observational studies should use prespecified accepted methods for controlling family-wise error rate or false discovery rate when multiple tests are conducted. In manuscripts reporting observational studies without a prespecified method for error control, summary statistics should be limited to point estimates and 95% CIs. In such cases, the Methods section should note that the widths of the intervals have not been adjusted for multiplicity and that the inferences drawn may not be reproducible. No p values should be reported for these analyses. | When comparing outcomes in two or more groups in confirmatory analyses, investigators should use the testing procedures specified in the protocol and SAP to control the overall type I error—for example, Bonferroni adjustments or prespecified hierarchical procedures. Because information contained in the safety endpoints may signal problems within specific organ classes, the editors believe that the type I error rates larger than 0.05 are acceptable. | Significance tests should be accompanied by CIs for estimated effect sizes, measures of association or other parameters of interest. The CIs should be adjusted to match any adjustment made to significance levels in the corresponding test. |
Journal of the American Medical Association [9] | Avoid solely reporting the results of statistical hypothesis testing, such as p values, which fail to convey important quantitative information. For most studies, p values should follow the reporting of comparisons of absolute numbers or rates and measures of uncertainty (eg, 0.8%, 95% CI −0.2% to 1.8%; p=0.13). P values should never be presented alone without the data that are being compared. If p values are reported, follow standard conventions for decimal places: for p values less than 0.001, report as ‘p<0.001’; for p values between 0.001 and 0.01, report the value to the nearest thousandth; for p values greater than or equal to 0.01, report the value to the nearest hundredth; and for p values greater than 0.99, report as ‘p>0.99’. For studies with exponentially small p values (eg, genetic association studies), p values may be reported with exponents (eg, p=1×10⁻⁵). In general, there is no need to present the values of test statistics (eg, F statistics or χ² results) and df when reporting results. | No guidance | Meta-analyses should state the major outcomes that were pooled and include ORs or effect sizes. |
The Lancet [10] | P values should be given to two significant figures, unless p<0.0001. | No guidance | No guidance |
BMJ | No guidance; refers readers to SAMPL [11] | No guidance | No guidance; refers readers to SAMPL |
Annals of Internal Medicine [12] | For p values between 0.001 and 0.20, please report the value to the nearest thousandth. For p values greater than 0.20, please report the value to the nearest hundredth. For p values less than 0.001, report as ‘p<0.001’. | No guidance | Authors should report results for meaningful metrics rather than reporting raw results. For example, rather than reporting the log OR from a logistic regression, authors should transform coefficients into the appropriate measure of effect size, OR, relative risk or risk difference. |
ICH Harmonised Tripartite Guideline: Statistical Principles for Clinical Trials E9 [13] | When reporting the results of significance tests, precise p values (eg, p=0.034) should be reported rather than making exclusive reference to critical values. | Conventionally, the probability of type I error is set at 5% or less or as dictated by any adjustments made necessary for multiplicity considerations; the precise choice may be influenced by the prior plausibility of the hypothesis under test and the desired impact of the results. Alternative values to the conventional levels of type I and type II errors may be acceptable or even preferable in some cases. | No guidance |
SAMPL guideline [11] | Although not preferred to CIs, if desired, p values should be reported as equalities when possible and to one or two decimal places (eg, p=0.03 or 0.22 not as inequalities: eg, p<0.05). Do NOT report ‘NS’; give the actual p value. The smallest p value that needs to be reported is p<0.001, save in studies of genetic associations. | Report the alpha level (eg, 0.05) that defines statistical significance. | Likewise, p values are not sufficient for re-analysis. Needed instead are descriptive statistics for the variables being compared, including sample size of the groups involved, the estimate (or ‘effect size’) associated with the p value and a measure of precision for the estimate, usually a 95% CI. |
CONSORT statement [14] | Actual p values (eg, p=0.003) are strongly preferable to imprecise threshold reports, such as p<0.05. | No guidance | For each outcome, study results should be reported as a summary of the outcome in each group (eg, the number of participants with or without the event and the denominators, or the mean and SD of measurements), together with the contrast between the groups, known as the effect size. |
SAP: statistical analysis plan; SAMPL: Statistical Analyses and Methods in the Published Literature; ICH: International Council for Harmonisation; CONSORT: Consolidated Standards of Reporting Trials
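
The decimal-place conventions quoted in the table for JAMA and Annals of Internal Medicine are mechanical enough to sketch in code. The Python function below is a minimal illustration only: the name `format_p`, the `style` labels and the `p=` output format are assumptions made for this example and do not come from either journal. It does not cover the exceptions the sources mention (for example, genetic association studies, where exponent notation may be used), and it treats the boundary at exactly 0.20 as belonging to the "nearest thousandth" range.

```python
def format_p(p: float, style: str = "jama") -> str:
    """Format a p value following the decimal-place rules quoted in the table.

    style="jama":   <0.001 -> 'p<0.001'; 0.001 to <0.01 -> 3 decimals;
                    >=0.01 -> 2 decimals; >0.99 -> 'p>0.99'.
    style="annals": <0.001 -> 'p<0.001'; 0.001 to 0.20 -> 3 decimals;
                    >0.20 -> 2 decimals.
    The function name and style labels are hypothetical.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p value must lie in [0, 1]")

    # Both journals report very small values as an inequality.
    if p < 0.001:
        return "p<0.001"

    if style == "jama":
        if p > 0.99:
            return "p>0.99"
        decimals = 3 if p < 0.01 else 2
    elif style == "annals":
        decimals = 3 if p <= 0.20 else 2
    else:
        raise ValueError(f"unknown style: {style!r}")

    return f"p={p:.{decimals}f}"


if __name__ == "__main__":
    for value in (0.0004, 0.0042, 0.0342, 0.13, 0.25, 0.995):
        print(value, format_p(value, "jama"), format_p(value, "annals"))
```

The same input can therefore be printed differently depending on the target journal: 0.0342 is rendered to the nearest hundredth under the JAMA rule and to the nearest thousandth under the Annals rule.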