Background: what is data dredging bias?
Data-dredging bias encompasses a number of more specific questionable practices (eg, fishing, p-hacking), all of which involve probing data using unplanned analyses and then reporting salient results without accurately describing the processes by which the results were generated. Almost any process of data analysis involves numerous decisions necessary to complete the analysis (eg, how to handle outliers, whether to combine groups, whether to include or exclude covariates). Where possible, best practice is for these decisions to be guided by a principled approach and prespecified in a publicly available protocol. When that is not possible, authors must be transparent about the open-ended nature of their analysis.
Many different sets of choices may well be methodologically defensible and reliable when the specifications are made prior to the analysis. However, probing the data and selectively reporting an outcome as if it were always the intended course of analysis dramatically increases the likelihood of finding a statistically significant result when there is in fact no effect (ie, a false positive).
As an intuitive example, consider a hypothesis that a given coin is unfairly biased to heads. Suppose I flip the coin twenty times each day for a week and assume that I am allowed to (1) eliminate data from any given day; (2) consider only the first 10 flips in a day, the last 10 flips in a day, or all 20 flips; and (3) restrict my consideration to only flips that were preceded by a heads or flips that were preceded by a tails. I would be conducting a fair trial if I prespecify that I will consider only the results when the prior flip was ‘tails’ and the flip was one of the last 10 in the series for the day, but I will be excluding the results from Wednesday. This is because none of these factors actually influence the probability of a flip coming up heads. However, if I am allowed to dredge through the data and freely examine various combinations of these restrictions after the results are known, I am virtually certain to find some set of specifications that makes it appear that the coin is biased. Importantly, the bias depends on analysing and reporting the data as if the analytical choices were specified ahead of time. Once the curtain is pulled back on how the results were generated, no one should be convinced by the evidence, least of all the analyst.
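The coin example can be explored with a small simulation. The sketch below is illustrative only and rests on several assumptions that are not in the text: at most one day may be dropped, the 'prior flip' is taken within the same day, a one-sided binomial tail is used as the test, and all function and constant names are hypothetical. It compares how often a fair coin is declared biased under the single prespecified analysis described above with how often at least one of the permitted specifications reaches p<0.05.

```python
import random
from itertools import product
from math import comb


def upper_tail_p(heads, n):
    """P(X >= heads) for X ~ Binomial(n, 0.5): a one-sided test for heads bias."""
    if n == 0:
        return 1.0
    return sum(comb(n, k) for k in range(heads, n + 1)) / 2 ** n


def count_heads(week, drop_day, window, prior):
    """Apply one specification to a week of flips and return (heads, n).

    drop_day: index of a day to exclude, or None (assumption: at most one day)
    window:   'first' (flips 1-10), 'last' (flips 11-20) or 'all'
    prior:    'heads', 'tails' or 'any' (outcome of the previous flip that day)
    """
    heads = n = 0
    for day, flips in enumerate(week):
        if day == drop_day:
            continue
        for i, flip in enumerate(flips):
            if window == "first" and i >= 10:
                continue
            if window == "last" and i < 10:
                continue
            if prior != "any":
                if i == 0:
                    continue  # the first flip of a day has no prior flip
                wanted = 1 if prior == "heads" else 0
                if flips[i - 1] != wanted:
                    continue
            heads += flip
            n += 1
    return heads, n


# Hypothetical menu of analytical choices: 8 x 3 x 3 = 72 specifications.
DROP_DAYS = [None] + list(range(7))
WINDOWS = ["first", "last", "all"]
PRIORS = ["heads", "tails", "any"]
# The prespecified analysis from the text: drop Wednesday (Monday = 0),
# last 10 flips of each day, prior flip was tails.
PRESPECIFIED = (2, "last", "tails")


def false_positive_rates(trials=200, alpha=0.05, seed=0):
    """Return (prespecified rate, dredged rate) for a fair coin."""
    rng = random.Random(seed)
    prespecified_hits = dredged_hits = 0
    for _ in range(trials):
        # 7 days of 20 fair flips each; 1 = heads, 0 = tails.
        week = [[rng.randint(0, 1) for _ in range(20)] for _ in range(7)]
        if upper_tail_p(*count_heads(week, *PRESPECIFIED)) < alpha:
            prespecified_hits += 1
        # Dredging: report a "finding" if any specification is significant.
        if any(upper_tail_p(*count_heads(week, *spec)) < alpha
               for spec in product(DROP_DAYS, WINDOWS, PRIORS)):
            dredged_hits += 1
    return prespecified_hits / trials, dredged_hits / trials


if __name__ == "__main__":
    prespecified, dredged = false_positive_rates()
    print(f"prespecified: {prespecified:.3f}  dredged: {dredged:.3f}")
```

Because the prespecified analysis is itself one of the 72 specifications, the dredged rate can never be lower than the prespecified rate in this sketch. Allowing more freedom (for example, dropping several days) would be expected to widen the gap further.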
Though the intuitive example suggests that data dredging is obviously unreliable, many common forms of data dredging arise from a lack of knowledge rather than an intent to deceive. As has been noted,1 ‘it can seem entirely appropriate to look at the data and construct reasonable rules for data exclusion, coding, and analysis that can lead to statistical significance’ (p. 461). Because researchers often remain unaware of the ways in which their biases shape their decisions, there is no substitute for a prespecified course of analysis. The absence of a study protocol outlining the planned methods and statistical analyses is a red flag that the results may be a product of data dredging.
To be clear, unplanned analyses are appropriate for hypothesis generation. Similarly, planned interim analyses are a standard part of adaptive trials. However, such analyses must be reported accurately, and different statistical analyses are often required. In contrast, data dredging has occurred if the results are reported as if they were the result of a single preplanned analysis, but were actually generated by: collecting the data first and then systematically adjusting data analysis approaches until the researcher finds a set of choices that produces a statistically significant result (‘p-hacking’)2; assessing models with multiple combinations of variables and selectively reporting the ‘best’ model (‘fishing’)3; making decisions about whether to collect new data on the basis of interim results; or generating a hypothesis to explain results which have already been obtained but presenting it as if it were a hypothesis one had prior to collecting the data (HARKing (‘hypothesising after the results are known’)).4
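One item on this list, deciding whether to collect more data on the basis of interim results, can also be illustrated by simulation. The following is a minimal sketch under stated assumptions: a fair coin, a look at the data after every 10 flips up to 100, and a two-sided binomial test at each look with no correction for repeated looks. The function names and parameters are hypothetical and are not drawn from any cited study.

```python
import random
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def two_sided_p(heads, n):
    """Two-sided binomial p-value against a fair coin (doubled smaller tail)."""
    upper = sum(comb(n, k) for k in range(heads, n + 1)) / 2 ** n
    lower = sum(comb(n, k) for k in range(0, heads + 1)) / 2 ** n
    return min(1.0, 2 * min(upper, lower))


def stopping_rates(trials=2000, max_flips=100, look_every=10,
                   alpha=0.05, seed=0):
    """Return (fixed-sample rate, peeking rate) of false positives.

    fixed:   one test after all max_flips flips.
    peeking: test at every look and stop as soon as p < alpha,
             reporting that result as if it were the planned analysis.
    """
    rng = random.Random(seed)
    looks = range(look_every, max_flips + 1, look_every)
    fixed_hits = peeking_hits = 0
    for _ in range(trials):
        flips = [rng.randint(0, 1) for _ in range(max_flips)]
        # Peeking "succeeds" if any interim look crosses the threshold.
        if any(two_sided_p(sum(flips[:n]), n) < alpha for n in looks):
            peeking_hits += 1
        # The fixed design tests once, at the full sample size.
        if two_sided_p(sum(flips), max_flips) < alpha:
            fixed_hits += 1
    return fixed_hits / trials, peeking_hits / trials


if __name__ == "__main__":
    fixed, peeking = stopping_rates()
    print(f"fixed: {fixed:.3f}  peeking: {peeking:.3f}")
```

With the default settings the final look coincides with the fixed-sample test, so every false positive under the fixed design is also a false positive under peeking. Planned interim analyses in adaptive trials avoid this inflation by using adjusted thresholds, which this sketch deliberately omits.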
The use of progestogens to prevent pregnancy loss was supported by numerous randomised controlled trials and meta-analyses of those trials; however, this apparent benefit disappears when the analysis is restricted to trials reporting preregistered primary outcomes.5 When meta-analyses included trials that were not preregistered, 19 of 29 meta-analyses found significant benefits for progestogens. In contrast, a meta-analysis constrained to the 22 trials which reported preregistered primary outcomes provided substantial evidence that progestogens were ineffective (RR=1.00, 95% CI 0.94 to 1.07). Publication bias could account for such a difference only if there were an implausibly large number of unpublished trials, so the difference is likely caused by various forms of data dredging in the unregistered trials.5
Another well-known and extreme example of data dredging is the set of studies conducted at the Cornell Food and Brand Lab, under the supervision of Brian Wansink.6 Scrutiny was drawn to his work after a November 2016 blog post in which Wansink described how to succeed in academia. In the post Wansink recounts a study which was intended to replicate his earlier finding that patrons would eat more at an all-you-can-eat buffet if they paid more. Instead, the results contradicted his earlier work. Rather than publish the results of the trial as designed, Wansink passed the data from the ‘failed study’ to a visiting scholar in his lab. In his original recounting, Wansink valorises the visiting scholar who, at his direction, continually reanalysed the data looking for some result they could ‘salvage’. Indeed, she found several.
In response to subsequent criticism, Wansink wrote: ‘With field studies, hypotheses usually don’t ‘come out’ on the first data run. But instead of dropping the study, a person contributes more to science by figuring out when the hypo (sic) worked and when it didn’t. This is Plan B. Perhaps your hypo worked during lunches but not dinners, or with small groups but not large groups. You don’t change your hypothesis, but you figure out where it worked and where it didn’t. Cool data contains cool discoveries.’7 Again, had the article described all of the unplanned analyses that ultimately culminated in the published result, such an attitude might have been defensible. However, the results were presented as if they were testing a hypothesis with a preplanned course of analysis,8 and so represent an egregious form of data dredging.
While the exact frequency of data dredging has not been determined, the unusually large number of published studies that just pass the threshold p<0.05 has been offered as evidence for its prevalence.9 10 While others have suggested that p-hacking is unlikely to have a significant effect on areas of research where there are enough studies to conduct a meta-analysis,11 such a conclusion depends on those meta-analyses containing multiple studies with large sample sizes. P-curve analyses suggest that the distribution of p values indicates that most research is investigating real effects; however, some forms of data dredging could also produce a similar distribution.12 13 Despite remaining uncertainties, there is no dispute that the bias introduced by data dredging will be most severe where the effect size is small, the dependent measures are imprecise, research designs are flexible and studies are conducted on small samples.14
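The intuition behind examining the distribution of significant p values can be sketched as follows. This is not the published p-curve procedure, and it does not model data dredging itself; it only contrasts a true null with a real effect under simple assumptions (a normally distributed test statistic with unit variance and a hypothetical `true_shift` parameter for the size of the effect). When there is no effect, significant p values are spread evenly between 0 and 0.05, so about half fall below 0.025; when there is a real effect, small p values are over-represented.

```python
import random
from math import erfc, sqrt


def two_sided_p_from_z(z):
    """Two-sided p-value for a standard normal test statistic."""
    return erfc(abs(z) / sqrt(2))


def share_of_small_p(true_shift, trials=20000, alpha=0.05, seed=0):
    """Among significant results (p < alpha), return the share with p < alpha / 2.

    true_shift is the mean of the test statistic (hypothetical parameter):
    0.0 corresponds to no effect, larger values to a real effect.
    """
    rng = random.Random(seed)
    significant = []
    for _ in range(trials):
        p = two_sided_p_from_z(rng.gauss(true_shift, 1.0))
        if p < alpha:
            significant.append(p)
    if not significant:
        return float("nan")
    return sum(p < alpha / 2 for p in significant) / len(significant)


if __name__ == "__main__":
    print(f"no effect:   {share_of_small_p(0.0):.2f}")
    print(f"real effect: {share_of_small_p(2.0):.2f}")
```

As the text notes, some forms of data dredging could produce a distribution resembling the 'real effect' case, so a right-skewed distribution on its own does not rule out dredging.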
Because they do not recognise such practices as flawed, researchers often unabashedly endorse data dredging (eg, inspecting interim results to determine whether to continue collecting data, deciding whether to exclude outliers after assessing the effects of doing so). Accordingly, the most promising preventive step is education and a change in disciplinary norms. Best practice includes designing a trial with sufficient power to detect meaningful effects and prespecifying both the rules for stopping data collection and the plans for data analysis, including criteria for excluding outliers, expected variable transformations and whether covariates will be controlled for. Such plans should be registered before data collection commences.14 Published articles should identify the complete list of variables studied in the trial, report all planned analyses (indicating which results were prespecified as primary and which as secondary), and conduct robustness analyses for methodological choices.2
Some researchers have advocated p-curve analysis as a formal means of correcting p-hacking15 16; however, more research is needed to validate the procedure empirically. Because confounding variables can render p-curve analysis unreliable, its application beyond randomised trials is particularly uncertain.13
Yet the above solutions fail to capture the extent to which data dredging is a problem of bad barrels rather than bad apples.17 Wansink may have been wrong about the reliability of the procedures he used, but he was not wrong about their ability to increase academic output. So long as disciplinary incentives reward data dredging, researchers will face pressures to engage in it. Accordingly, the most successful interventions require changes to disciplinary standards of designing, implementing and publishing scientific research.18 For example, journals that accept articles based on their prospective designs mitigate the need and ability to dredge through data to produce ‘attractive’ results.19 Given the current standards of inquiry, such interventions would require fundamental changes to most disciplines.
Correction notice This article has been corrected since it first published. The first affiliation has been corrected; the ORCID iD has been included for the first author, and the third author's initials have been added.
Contributors AE and BH cowrote a first draft of the article, which was subsequently corrected and revised by all authors. All authors contributed to the drafting of this article, have approved the final draft, and agree to be accountable for all aspects of this work.
Funding The authors have not declared a specific grant for this research from any funding agency in the public, commercial or not-for-profit sectors.
Competing interests None declared.
Provenance and peer review Not commissioned; externally peer reviewed.