## Abstract

**Objectives** It has been suggested that the term ‘statistically significant’ in relation to P-values should be ‘retired’, as it is regarded as a major cause of misinterpretation of scientific data. It is also associated with less frequent than expected replication of studies, especially in the social sciences and medicine, which may be having a negative impact on evidence-based healthcare. This presentation uses the familiar reasoning processes of medicine as a guide to reasoning with scientific data in order to avoid current pitfalls. The steps used in interpreting clinical data are: checking the reliability of symptoms, signs and test results; if these facts are reliable, using them as evidence to make predictions; and using the predictions and their probabilities to make decisions. The focus here is on the first of these steps: assessing the reliability of scientific study results by analogous reasoning.

**Method** The reliability of a diagnostic test or study result can be assessed by estimating the probability with which it would fall within any ‘true’ range if the study or observation were repeated an almost infinite number of times. For example, this range might lie on one side of a test’s ‘lower end of normal’ or of a ‘null hypothesis’ (e.g. a true result of no difference between the effects of a treatment and a placebo). The calculation is based on regarding the data set as a subset of a larger data set within which the probability of each possible true value is set to be the same. The resulting calculated ‘idealistic’ probability of long-term replication assumes the ideal situation in which the observations are conducted in an impeccably consistent way. Failing this, the probability of ‘non-impeccability’ is used to estimate a ‘realistic’ probability of replication.
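The uniform-prior calculation above can be sketched numerically. This is a minimal illustration, not the author's code: it assumes a normal model with a known standard error and an illustrative observed mean difference of 2.0. With a uniform prior over the true value, the posterior for the true difference is normal and centred on the observed value, so the probability that the true value lies on the observed side of the null hypothesis can be computed directly.

```python
import math

def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Hypothetical study: observed mean difference and its standard error
x_bar, se = 2.0, 1.0

# One-sided P-value against the null hypothesis of no difference (true value = 0)
p_value = 1.0 - phi(x_bar / se)

# Under a uniform prior, the posterior for the true difference is N(x_bar, se),
# so the probability that the true value is on the observed side of the null is:
prob_less_extreme_than_null = 1.0 - phi((0.0 - x_bar) / se)

# For this symmetrical (normal) model the two quantities agree exactly:
# prob_less_extreme_than_null == 1 - p_value
```

The numbers (2.0 and 1.0) are placeholders; the identity holds for any observed value under the normal model because the posterior and the sampling distribution are mirror images of each other about the observed result.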

**Results** If the data can be modelled with a ‘normal’ or other symmetrical distribution, then the probability of a true result less extreme than the null hypothesis is shown to be exactly 1 − P; if the distribution is not symmetrical, it is only approximately equal to 1 − P. If a Bayesian prior probability is specified, it is shown that this can be incorporated into the calculated probability of replication. The result of a prior pilot study and the results of subsequent studies (e.g. in a meta-analysis) can also be incorporated to provide a probability of replication within any specified range given all the evidence. The probability of getting the same P-value or better after repeating a study is 50%. If the probability of impeccably consistent methodology is low (e.g. due to cherry-picking), then the realistic probability of replication will be lower than the idealistic probability.^{1}
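The 50% figure can be checked by simulation under the same uniform-prior normal model (a sketch with illustrative numbers, not the author's code): draw a true effect from the flat-prior posterior, simulate a replication study under that true effect, and count how often the replication is at least as extreme as the original result, which is equivalent to its P-value being at least as small.

```python
import random

random.seed(42)

# Hypothetical original study: observed mean difference and its standard error
x_bar, se = 2.0, 1.0

n_sims = 200_000
at_least_as_extreme = 0
for _ in range(n_sims):
    # True effect drawn from the flat-prior posterior N(x_bar, se)
    theta = random.gauss(x_bar, se)
    # Replication study conducted under that true effect
    x_rep = random.gauss(theta, se)
    # Replication result at least as extreme, i.e. replication P-value <= original
    if x_rep >= x_bar:
        at_least_as_extreme += 1

fraction = at_least_as_extreme / n_sims  # approximately 0.5
```

The fraction hovers around 0.5 regardless of the observed value, because the predictive distribution for the replication is symmetrical about the original result.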

**Conclusions** This approach is based on an improved understanding of the relationship between P-values and the probability of replication according to well recognized Bayesian principles. Instead of assuming uniform prior probabilities, it creates this condition with a new universal set that contains the study data and within which the prior probabilities of true outcomes are uniform. It thus replaces the confusing definition of the P-value (the probability of the observed study result and of more extreme results that were not seen, conditional on a null hypothesis) with a probability of replication less extreme than the null hypothesis. This approach avoids the pitfalls of classifying P-values as ‘significant’ or ‘not significant’ and of over-estimating the probability of replication.

**References**

Llewelyn H. Replacing P-values with frequentist posterior probabilities of replication—When possible parameter values must have uniform marginal prior probabilities. PLoS ONE, 2019, 14(2): e0212302. https://doi.org/10.1371/journal.pone.0212302