Data-dredging bias
  1. Adrian Erasmus1,2,
  2. Bennett Holman3,4,
  3. John P A Ioannidis5,6
  1. Department of Philosophy, University of Alabama, Tuscaloosa, Alabama, USA
  2. Institute for the Future of Knowledge, University of Johannesburg, Auckland Park, Gauteng, South Africa
  3. Underwood International College, Yonsei University, Incheon, Seoul, Korea (the Republic of)
  4. University of Johannesburg Faculty of Humanities, Auckland Park, South Africa
  5. Meta-Research Innovation Center at Stanford (METRICS), Stanford University, Stanford, California, USA
  6. Department of Epidemiology and Population Health, Stanford University, Stanford, California, USA
  Correspondence to Dr Bennett Holman, Underwood International College, Yonsei University, Incheon, Seoul 21983, Korea (the Republic of); bholman{at}yonsei.ac.kr


Background: what is data-dredging bias?

Data-dredging bias encompasses a number of more specific questionable practices (eg, fishing, p-hacking), all of which involve probing data using unplanned analyses and then reporting salient results without accurately describing the processes by which the results were generated. Almost any process of data analysis involves numerous decisions necessary to complete the analysis (eg, how to handle outliers, whether to combine groups, which covariates to include or exclude). Where possible, best practice is for these decisions to be guided by a principled approach and prespecified in a publicly available protocol. When this is not possible, authors must be transparent about the open-ended nature of their analysis.

Many different sets of choices may well be methodologically defensible and reliable when the specifications are made prior to the analysis. However, probing the data and selectively reporting an outcome as if it had always been the intended course of analysis dramatically increases the likelihood of finding a statistically significant result when there is in fact no effect (ie, a false positive).

As an intuitive example, consider the hypothesis that a given coin is unfairly biased towards heads. Suppose I flip the coin twenty times each day for a week and that I am allowed to (1) eliminate the data from any given day; (2) consider only the first 10 flips in a day, the last 10 flips in a day, or all 20 flips; and (3) restrict my consideration to only flips that were preceded by a heads or only flips that were preceded by a tails. I would be conducting a fair trial if I prespecified that I would consider only the results where the prior flip was ‘tails’ and the flip was one of the last 10 in the series for that day, and that I would exclude the results from Wednesday. This is because none of …
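To make the arithmetic of these choices concrete, the sketch below simulates the coin example in Python using only the standard library. It generates weeks of genuinely fair flips and compares a single prespecified analysis (here, simply pooling every flip) against the dredging strategy of declaring the coin biased whenever any of the allowed analysis paths (drop one day or none, first 10/last 10/all 20 flips, optionally conditioning on the preceding flip) yields a one-sided binomial p value below 0.05. The function names, the 0.05 threshold and the number of simulated weeks are our own illustrative choices rather than anything specified in the article; exact figures will vary from run to run, but the dredged procedure typically "finds" bias several times more often than the nominal 5% of the time.

```python
# Minimal sketch of the coin-flip dredging example (illustrative choices only).
import random
from math import comb

def p_value_excess_heads(k, n, p=0.5):
    """One-sided binomial p value: probability of k or more heads in n fair flips."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def simulate_week(rng, days=7, flips_per_day=20):
    """Flip a fair coin 20 times a day for a week; 1 = heads, 0 = tails."""
    return [[rng.randint(0, 1) for _ in range(flips_per_day)] for _ in range(days)]

def dredged_analyses(week):
    """Enumerate the analysis paths described in the text: optionally drop one day,
    keep the first 10 / last 10 / all 20 flips per day, and optionally keep only
    flips preceded (within the day) by a heads or by a tails."""
    for dropped in [None] + list(range(len(week))):
        for window in ("first10", "last10", "all20"):
            for condition in ("all", "after_heads", "after_tails"):
                kept = []
                for d, day in enumerate(week):
                    if d == dropped:
                        continue
                    idx = {"first10": range(0, 10),
                           "last10": range(10, 20),
                           "all20": range(20)}[window]
                    for i in idx:
                        if condition == "after_heads" and not (i > 0 and day[i - 1] == 1):
                            continue
                        if condition == "after_tails" and not (i > 0 and day[i - 1] == 0):
                            continue
                        kept.append(day[i])
                if kept:
                    yield sum(kept), len(kept)

rng = random.Random(0)
n_weeks, alpha = 1000, 0.05   # takes a few seconds in pure Python
prespecified_hits = dredged_hits = 0
for _ in range(n_weeks):
    week = simulate_week(rng)
    # Prespecified analysis: pool every flip, no exclusions.
    all_flips = [f for day in week for f in day]
    prespecified_hits += p_value_excess_heads(sum(all_flips), len(all_flips)) < alpha
    # Data dredging: declare "bias towards heads" if ANY path crosses the threshold.
    dredged_hits += any(p_value_excess_heads(k, n) < alpha for k, n in dredged_analyses(week))

print(f"False-positive rate, single prespecified analysis: {prespecified_hits / n_weeks:.2f}")
print(f"False-positive rate, best result over all paths:   {dredged_hits / n_weeks:.2f}")
```

Prespecifying any single one of these paths before looking at the data would keep the error rate near its nominal level; it is the freedom to pick a path after seeing the results that inflates it.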
