Data-dredging bias
  1. Adrian Erasmus1,2,
  2. Bennett Holman3,4,
  3. John P A Ioannidis5,6

  1. Department of Philosophy, University of Alabama, Tuscaloosa, Alabama, USA
  2. Institute for the Future of Knowledge, University of Johannesburg, Auckland Park, Gauteng, South Africa
  3. Underwood International College, Yonsei University, Incheon, Seoul, Korea (the Republic of)
  4. University of Johannesburg Faculty of Humanities, Auckland Park, South Africa
  5. Meta-Research Innovation Center at Stanford (METRICS), Stanford University, Stanford, California, USA
  6. Department of Epidemiology and Population Health, Stanford University, Stanford, California, USA

  Correspondence to Dr Bennett Holman, Underwood International College, Yonsei University, Incheon, Seoul 21983, Korea (the Republic of); bholman{at}yonsei.ac.kr


Background: what is data dredging bias?

Data-dredging bias encompasses a number of more specific questionable practices (eg, fishing, p-hacking), all of which involve probing data using unplanned analyses and then reporting salient results without accurately describing the process by which those results were generated. Almost any data analysis involves numerous decisions (eg, how to handle outliers, whether to combine groups, which covariates to include or exclude). Where possible, best practice is for these decisions to be guided by a principled approach and prespecified in a publicly available protocol. When this is not possible, authors must be transparent about the open-ended nature of their analysis.

Many different sets of choices may well be methodologically defensible and reliable when the specifications are made prior to the analysis. However, probing the data and selectively reporting an outcome as if it had always been the intended course of analysis dramatically increases the likelihood of finding a statistically significant result when there is in fact no effect (ie, a false positive).
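The scale of this inflation is easy to see under the simplifying assumption that an analyst takes k independent looks at null data, each tested at alpha=0.05 (a sketch for intuition, not a model of any particular study):

```python
# Family-wise false-positive rate under the simplifying assumption of
# k independent tests of null data, each at alpha = 0.05.
def familywise_error(k, alpha=0.05):
    """Probability that at least one of k independent tests is 'significant'."""
    return 1 - (1 - alpha) ** k

for k in (1, 5, 10, 20):
    print(k, round(familywise_error(k), 3))
# prints: 1 0.05, 5 0.226, 10 0.401, 20 0.642
```

In practice the candidate analyses overlap and are correlated, so the inflation is milder than the independent-tests bound, but a single prespecified test is the only case that keeps the nominal 5% rate.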

As an intuitive example, consider the hypothesis that a given coin is unfairly biased towards heads. Suppose I flip the coin twenty times each day for a week and assume that I am allowed to (1) eliminate the data from any given day; (2) consider only the first 10 flips in a day, the last 10 flips in a day, or all 20 flips; and (3) restrict my consideration to only flips that were preceded by a heads or only flips that were preceded by a tails. I would be conducting a fair trial if I prespecify that I will consider only the results where the prior flip was ‘tails’ and the flip was one of the last 10 in the series for the day, and will exclude the results from Wednesday. This is because none of …
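The multiplicity hidden in this example can be made concrete with a short Monte Carlo sketch. The path counts below (drop one day or none, three windows, two conditioning choices, giving 8 × 3 × 2 = 48 analysis paths) and the within-window conditioning are one reading of the three degrees of freedom, not details fixed by the example:

```python
# Monte Carlo sketch: a genuinely fair coin, flipped 20 times a day for a
# week, analysed along 48 hypothetical paths. "Dredging" reports the best
# path as if it were planned; the honest analyst sticks to one prespecified
# path (here: drop Wednesday, last 10 flips, prior flip tails).
import math
import random

ALPHA = 0.05

def p_heads_biased(heads, n):
    """Exact one-sided P(X >= heads) for n flips of a fair coin."""
    if n == 0:
        return 1.0
    return sum(math.comb(n, k) for k in range(heads, n + 1)) / 2 ** n

def analysis_paths(week):
    """Yield ((dropped_day, window, prev_was_heads), heads, n) per path."""
    for dropped in [None] + list(range(7)):          # drop one day, or none
        days = [d for i, d in enumerate(week) if i != dropped]
        for window in ("first10", "last10", "all"):
            for prev_was_heads in (True, False):
                heads = n = 0
                for day in days:
                    sel = {"first10": day[:10],
                           "last10": day[10:],
                           "all": day}[window]
                    # condition on the preceding flip (within the window --
                    # a simplification)
                    for prev, cur in zip(sel, sel[1:]):
                        if prev == prev_was_heads:
                            n += 1
                            heads += cur
                yield (dropped, window, prev_was_heads), heads, n

rng = random.Random(0)
trials = 400
honest_hits = dredged_hits = 0
for _ in range(trials):
    week = [[rng.random() < 0.5 for _ in range(20)] for _ in range(7)]
    ps = {key: p_heads_biased(h, n) for key, h, n in analysis_paths(week)}
    honest_hits += ps[(2, "last10", False)] < ALPHA   # the prespecified path
    dredged_hits += min(ps.values()) < ALPHA          # best of all 48 paths
print("prespecified:", honest_hits / trials, "dredged:", dredged_hits / trials)
```

The prespecified path rejects at roughly the nominal 5% rate, while reporting the best-looking of the 48 correlated paths "finds" bias in a fair coin several times as often.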


Footnotes

  • Correction notice This article has been corrected since it first published. The first affiliation has been corrected; the ORCID iD has been included for the first author, and the third author's initials have been added.

  • Contributors AE and BH cowrote a first draft of the article, which was subsequently corrected and revised by all authors. All authors contributed to the drafting of this article, have approved the final draft, and agree to be accountable for all aspects of this work.

  • Funding The authors have not declared a specific grant for this research from any funding agency in the public, commercial or not-for-profit sectors.

  • Competing interests None declared.

  • Provenance and peer review Not commissioned; externally peer reviewed.
