Abstract
Background: Clinical trials are commonly done without blinded outcome assessors despite the risk of bias. We wanted to evaluate the effect of nonblinded outcome assessment on estimated effects in randomized clinical trials with outcomes that involved subjective measurement scales.
Methods: We conducted a systematic review of randomized clinical trials with both blinded and nonblinded assessment of the same measurement scale outcome. We searched PubMed, EMBASE, PsycINFO, CINAHL, Cochrane Central Register of Controlled Trials, HighWire Press and Google Scholar for relevant studies. Two investigators agreed on the inclusion of trials and the outcome scale. For each trial, we calculated the difference in effect size (i.e., the difference between the standardized mean differences from the nonblinded and blinded assessments). A difference in effect size of less than 0 suggested that nonblinded assessors generated more optimistic estimates of effect. We pooled the differences in effect size using inverse variance random-effects meta-analysis and used metaregression to identify potential reasons for variation.
Results: We included 24 trials in our review. The main meta-analysis included 16 trials (involving 2854 patients) with subjective outcomes. The estimated treatment effect was more beneficial when based on nonblinded assessors (pooled difference in effect size −0.23 [95% confidence interval (CI) −0.40 to −0.06]). In relative terms, nonblinded assessors exaggerated the pooled effect size by 68% (95% CI 14% to 230%). Heterogeneity was moderate (I² = 46%, p = 0.02) and unexplained by metaregression.
Interpretation: We provide empirical evidence for observer bias in randomized clinical trials with subjective measurement scale outcomes. A failure to blind assessors of outcomes in such trials results in a high risk of substantial bias.
A failure to blind assessors of outcomes in randomized clinical trials may result in bias. Observer bias, sometimes called “detection bias” or “ascertainment bias,” occurs when outcome assessments are systematically influenced by the assessors’ conscious or unconscious predispositions — for example, because of hope or expectations, often favouring the experimental intervention.1
Blinded outcome assessors are used in many trials to avoid such bias. However, the use of nonblinded assessors remains common,2–4 especially in nonpharmacological trials; for example, nonblinded outcome assessment was used in 90% of trials involving orthopedic traumatology3 and 74% of trials involving strength training for muscles.4
Unfortunately, the empirical evidence on observer bias in randomized clinical trials has been incomplete. Meta-epidemiological studies have compared double-blind trials with similar trials that were not double-blind.5,6 However, such studies address blinding crudely because “double-blind” is an ambiguous term.3,7 Furthermore, the risk of confounding is considerable in indirect between-trial analyses, as “double-blind” trials may have better overall methods and larger sample sizes than trials that are not reported as “double-blind.”
A more reliable approach involves analyses of trials that use both blinded and nonblinded outcome assessors, because such a within-trial design provides a direct comparison between blinded and nonblinded assessments of the same outcome in the same patients. Our previous analysis of such trials with binary outcomes found substantial observer bias.8
Although subjective measurement scales such as illness severity scores are popular, they may be susceptible to observer bias. They are frequently used as outcomes in clinical scenarios with no naturally distinct categories, and adjacent subcategories on a scale typically involve minor and vaguely defined differences.
We decided to systematically review trials with both blinded and nonblinded assessment of outcomes using the same measurement scales. Our primary objective was to evaluate the impact of nonblinded outcome assessment on estimated treatment effects in randomized clinical trials. Our secondary objective was to examine reasons for variation in observer bias.
Methods
Eligibility criteria
We included randomized clinical trials with blinded and nonblinded assessment of the same measurement scale outcome. We excluded trials for which the distinction between the experimental and control groups was unclear, because such trials would not allow us to determine the direction of any bias; trials for which only a subgroup of patients were evaluated by blinded and nonblinded assessors, unless selected at random; trials in which blinded and nonblinded assessors had access to each other’s results; and trials in which initially blinded assessors became unblinded (e.g., when radiographs showed ceramic material indicative of the experimental intervention).
Search strategy
We searched the following databases from their inception onwards without language restrictions: PubMed, EMBASE, PsycINFO, CINAHL, The Cochrane Central Register of Controlled Trials, HighWire Press and Google Scholar. Our core search string was random* AND (“blind* and unblind*” OR “masked and unmasked”) with variations according to the specific database (Appendix 1, available at www.cmaj.ca/lookup/suppl/doi:10.1503/cmaj.120744/-/DC1). We performed the last search on Jan. 26, 2010. We read the references of all of the included trials and asked the authors of all included trials whether they knew of additional trials to identify any further studies that should be included.
Data abstraction
One investigator read all abstracts from standard databases and all text fragments from full-text databases. If a study was identified as potentially eligible for inclusion, we retrieved a full study report, which was read by an investigator who excluded all clearly ineligible studies. Two investigators read all other study reports and decided on eligibility. Disagreements were resolved by discussion.
We selected a single measurement scale from each trial. If several outcomes had been assessed under both blinded and nonblinded conditions, we preferred the primary outcome of the trial and the first assessment after the end of treatment (unless the primary outcome prescribed a different time point). Two investigators selected the outcomes independently. Again, disagreements were resolved by discussion. For trials with more than 2 groups, we pooled the results in the experimental or control groups.1
From each trial we extracted the following data: posttreatment mean, standard deviation and the numbers of patients in the experimental and control groups in the blinded assessments, and the corresponding data from the nonblinded assessments. For crossover and split-body trials, we extracted the standard deviation of the paired difference between treatments. If possible, we also extracted data on the correlation between blinded and nonblinded assessment (e.g., Spearman rank correlation coefficient) and data on interobserver variation between assessors (blinded or nonblinded).
If data were incomplete, we contacted the authors of the trial by email or telephone. We also searched the US Food and Drug Administration (FDA) website for trial outcome data. If standard deviations were not reported, we used standard deviations from a comparable trial that used the same measurement scale. If interobserver data were not available, we tried to obtain them from independent scale-validation studies.
For each trial, we evaluated 5 prespecified potential confounders in the comparison between blinded and nonblinded outcome assessments: a considerable time lapse between the 2 assessments, different types of assessors (e.g., nurses v. physicians), different assessment procedures (e.g., direct visual assessment of a wound v. a photograph of a wound), a substantial risk of ineffective blinding and different patients being assessed (i.e., some patients who had been evaluated blindly had not been evaluated nonblindly and vice versa). The first 4 items were evaluated by 2 investigators masked to any information relating to the comparison between blinded and nonblinded assessors. The masking was done by manipulating PDF versions of the trial reports so that tables, graphs or text describing the results of any comparison between blinded and nonblinded assessors were blanked out. There were no cases of accidental unmasking.
In addition, for each trial, we evaluated 3 characteristics of the outcomes that could possibly explain variations in observer bias. Two masked investigators independently evaluated the following 3 factors on a scale from 1 to 5 (1 = low, 5 = high): the degree of subjectivity of the outcome (i.e., the degree to which the assessors’ judgment affected the outcome; high in global assessment of patient improvement and low in reading a laboratory sheet); the nonblinded assessor’s overall involvement in the trial (i.e., a proxy for the degree of personal preference for a result favourable to the experimental intervention); and the vulnerability of the outcome to nonblinded patients (high in outcomes based on interviews with nonblinded patients and low in outcomes involving pure observation, such as the inspection of photographs). Disagreements were resolved by discussion.
Statistical analysis
For each trial, we calculated the effect size (i.e., standardized mean difference) based on the blinded and nonblinded assessments using the pooled standard deviation of the blinded assessments as the common standardizing unit. An effect size of less than 0 suggests a beneficial effect of the experimental intervention. We subsequently summarized the impact of nonblinded outcome assessment as the difference between the 2 effect sizes. A difference in effect size of less than 0 suggests that the nonblinded assessments generate more optimistic estimates of effect than do the blinded assessments.
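The calculation described above can be illustrated with a minimal Python sketch. The `Arm` class, the function names and the example numbers are hypothetical and are not taken from any included trial. The sketch assumes that lower scale scores indicate a better outcome and that the difference is taken as nonblinded minus blinded, which matches the stated sign convention.

```python
import math
from dataclasses import dataclass


@dataclass
class Arm:
    """Posttreatment summary of one trial arm (hypothetical container)."""
    mean: float
    sd: float
    n: int


def pooled_sd(exp: Arm, ctl: Arm) -> float:
    """Pooled standard deviation of the experimental and control arms."""
    num = (exp.n - 1) * exp.sd ** 2 + (ctl.n - 1) * ctl.sd ** 2
    return math.sqrt(num / (exp.n + ctl.n - 2))


def effect_size_difference(blind_exp: Arm, blind_ctl: Arm,
                           open_exp: Arm, open_ctl: Arm):
    """Return (blinded ES, nonblinded ES, nonblinded minus blinded).

    Both effect sizes are standardized by the pooled SD of the BLINDED
    assessments, so the two are expressed in the same unit.
    """
    unit = pooled_sd(blind_exp, blind_ctl)
    es_blinded = (blind_exp.mean - blind_ctl.mean) / unit
    es_nonblinded = (open_exp.mean - open_ctl.mean) / unit
    return es_blinded, es_nonblinded, es_nonblinded - es_blinded


if __name__ == "__main__":
    # Hypothetical data: the nonblinded assessor scores the experimental arm lower.
    result = effect_size_difference(
        Arm(20.0, 10.0, 50), Arm(22.0, 10.0, 50),   # blinded assessment
        Arm(18.0, 9.0, 50), Arm(22.0, 11.0, 50),    # nonblinded assessment
    )
    print(result)
```

In this hypothetical example the difference in effect size is negative, which would indicate a more optimistic estimate from the nonblinded assessor.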
We pooled the differences in effect size from individual trials by meta-analysis using random-effects models and inverse variance weights.9 The standard error of the difference in effect size used for the main analysis disregarded the correlation between blinded and nonblinded assessments (Appendix 2, available at www.cmaj.ca/lookup/suppl/doi:10.1503/cmaj.120744/-/DC1).
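As an illustration of this kind of pooling, the following Python sketch combines per-trial differences with inverse variance weights under a random-effects model. It assumes the DerSimonian-Laird moment estimator for the between-trial variance, which the text above does not specify, and a simple independence formula for the standard error of a difference; the exact formula used in the main analysis is given in Appendix 2 and may differ. All function names and inputs are hypothetical.

```python
import math


def se_difference_ignoring_correlation(se_blinded: float, se_nonblinded: float) -> float:
    """SE of (nonblinded ES - blinded ES) if the two are treated as uncorrelated.

    Assumption for illustration only; see Appendix 2 for the formula actually used.
    """
    return math.sqrt(se_blinded ** 2 + se_nonblinded ** 2)


def random_effects_pool(estimates, std_errors, z=1.96):
    """Inverse variance random-effects pooling (DerSimonian-Laird, assumed)."""
    k = len(estimates)
    w = [1.0 / se ** 2 for se in std_errors]            # fixed-effect weights
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, estimates)) / sw
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, estimates))  # Cochran's Q
    df = k - 1
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0     # between-trial variance
    w_re = [1.0 / (se ** 2 + tau2) for se in std_errors]
    pooled = sum(wi * yi for wi, yi in zip(w_re, estimates)) / sum(w_re)
    se_pooled = math.sqrt(1.0 / sum(w_re))
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0  # heterogeneity, percent
    return {
        "pooled": pooled,
        "ci": (pooled - z * se_pooled, pooled + z * se_pooled),
        "tau2": tau2,
        "q": q,
        "i2": i2,
    }


if __name__ == "__main__":
    # Hypothetical differences in effect size from three trials.
    print(random_effects_pool([-0.5, -0.1, 0.1], [0.1, 0.1, 0.1]))
```

When the estimated between-trial variance is 0, the random-effects weights reduce to the fixed-effect inverse variance weights.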
We tested the robustness of our main analysis with secondary analyses addressing the type of analysis (e.g., incorporating the correlation between blinded and nonblinded assessments), type of data, clinical condition, trial characteristics, risk of confounding and trial size. In addition, we examined the percentage by which the nonblinded effect estimate exceeded the blinded effect estimate (effect size difference/blinded effect size), approximating the confidence interval for the percentage according to Fieller.10
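A Fieller-type interval for a ratio of two estimates can be sketched as follows. This is a generic textbook form based on a normal approximation, not necessarily the exact variant applied here: the treatment of the covariance between numerator and denominator is not stated in the text, so `cov` is a hypothetical parameter that defaults to 0, and the example inputs are invented.

```python
import math


def fieller_ratio_ci(num, den, se_num, se_den, cov=0.0, z=1.96):
    """Point estimate and Fieller confidence limits for num / den.

    Solves (num - t*den)^2 <= z^2 * (se_num^2 - 2*t*cov + t^2*se_den^2) for t.
    """
    a = den ** 2 - z ** 2 * se_den ** 2
    if a <= 0:
        # Denominator not clearly different from 0: the interval is unbounded.
        raise ValueError("denominator too imprecise for a bounded interval")
    b = -2.0 * (num * den - z ** 2 * cov)
    c = num ** 2 - z ** 2 * se_num ** 2
    disc = b ** 2 - 4.0 * a * c
    if disc < 0:
        raise ValueError("no real solution for the confidence limits")
    root = math.sqrt(disc)
    lower = (-b - root) / (2.0 * a)
    upper = (-b + root) / (2.0 * a)
    return num / den, lower, upper


if __name__ == "__main__":
    # Hypothetical inputs: effect size difference and blinded effect size.
    ratio, lower, upper = fieller_ratio_ci(-0.2, -0.4, 0.05, 0.1)
    print(f"{100 * ratio:.0f}% ({100 * lower:.0f}% to {100 * upper:.0f}%)")
```

The interval is typically asymmetric around the point estimate, and it widens sharply as the denominator approaches 0.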
Finally, we used univariable random-effects metaregression to determine whether variations in effect size differences were associated with the 3 prespecified outcome characteristics we described earlier.
Results
We identified 537 publications from 1835 hits in standard databases and 2200 hits in full-text databases. We excluded 513 studies, mostly because they were not randomized clinical trials or because they lacked blinded or nonblinded outcome assessment (Figure 1). Thus, 24 trials were included in our qualitative synthesis.11–36
Of these 24 trials, 16 (involving 2854 patients) provided outcome data for both the blinded and nonblinded assessors. The characteristics of the trials are described in Table 1. The clinical specialties represented were neurology, cosmetic surgery, cardiology, psychiatry, otolaryngology, dermatology, gynecology and infectious diseases.
The outcomes of the trials were generally subjective; 13 of the 16 trials (81%) scored 4 or 5 on our scale of subjectivity (Table 2). The median Spearman rank correlation coefficient between blinded and nonblinded assessments in the 7 trials with such data was 0.67 (Appendix 3, available at www.cmaj.ca/lookup/suppl/doi:10.1503/cmaj.120744/-/DC1). We identified validation studies for scales used in 10 of the included trials, which generally reported good interobserver agreement (median weighted κ 0.64 [5 trials]; median intraclass correlation coefficient 0.82 [5 trials]) (Appendix 3).
In 10 trials (63%), the effect size point estimate was more optimistic as determined by the nonblinded assessors (Figure 2). Among all 16 trials, the difference in effect size ranged from −1.10 to 0.14. The pooled difference in effect size was −0.23 (95% confidence interval [CI] −0.40 to −0.06), with moderate heterogeneity (I² = 46%, p = 0.02) (Figure 3). Thus, the estimated treatment effect based on the assessments of the nonblinded assessors was exaggerated by about one-quarter of the standard deviation of the measurement scale used.
The pooled effect size based on the assessments of the blinded assessors was −0.34 (95% CI −0.55 to −0.14). Thus, the nonblinded assessors exaggerated the estimated effect size by about 68% (95% CI 14% to 230%) (i.e., −0.23/−0.34 = 0.68).
Our main result was robust, although CIs in our secondary analyses were wide (Table 3). One trial was free from any of the 5 prespecified possible confounders (effect size difference −0.22 [95% CI −0.61 to 0.16]).15 The difference in effect size seemed not to be influenced by any of the suspected confounders (Table 3) or by trial size (data not shown).
Eight trials (involving 980 patients) were included in our review but not in our main meta-analysis because of incomplete or inconsistent data. Qualitative information, or results from other similar trials, suggested notable observer bias in 3 of these trials and no or little bias in 2 trials (Appendix 3).
Using univariable metaregression, we found no statistically significant associations between differences in effect size and high scores for outcome subjectivity (p = 0.29), the degree to which the nonblinded assessors were involved in the trials (p = 0.64), or the vulnerability of the outcome to nonblinded patients (p = 0.80). However, the slope of the regression line between differences in effect sizes and scores for outcome subjectivity was in the expected direction (data not shown). The 13 trials with clearly subjective outcomes had a pooled effect size difference of −0.29 (−0.50 to −0.08) (data not shown). The 3 trials with moderately subjective outcomes had a pooled effect size difference of −0.04 (−0.32 to 0.25) (data not shown).
Interpretation
Nonblinded assessors of subjective measurement scale outcomes in randomized clinical trials tended to generate substantially biased effect sizes. Standardized mean differences were exaggerated by 0.23 standard deviations (95% CI 0.06 to 0.40) or, in relative terms, by 68% (95% CI 14% to 230%).
Observer bias can be perceived as the result of the interaction between observers’ predispositions and the subjectivity of the outcome. Predispositions are likely to differ substantially from observer to observer and from trial to trial. In some trials, conscientious nonblinded assessors may overcompensate for an expected bias in favour of the experimental intervention and paradoxically induce a bias favouring the control, whereas other trials will have fairly neutral assessors with no important bias. Thus, the degree of observer bias in trials with clearly predisposed outcome assessors is likely to be considerably higher than the mean we see here, which is based on all of the included trials. When determining the risk of bias attributable to nonblinded assessors in a randomized trial, we suggest being mindful of the range of observer bias we have found, and not only the pooled mean.
Based largely on convention, standardized mean differences of −0.2 are considered small effects, −0.5 are considered medium effects, and −0.8 are considered large effects.37 By such standards, our result constitutes a small to moderate difference. However, it seems inappropriate to interpret a degree of bias in the same way as we would interpret a treatment effect. The relevant problem is how much bias can be expected when using a nonblinded assessor, not whether that degree of bias represents a clinically worthwhile effect. In a situation with a large true treatment effect with a standardized mean difference of −0.8, the average degree of observer bias when using nonblinded observers, −0.23, would imply an exaggeration of the treatment effect estimate by 29%. This percentage increases to 115% if effects are small (i.e., if the standardized mean difference is −0.2). In the 16 trials we analyzed, the pooled estimated treatment effect was exaggerated by 68% (14% to 230%) when based on data from nonblinded assessors. Thus, we interpret our result as evidence for a substantial degree of observer bias.
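The percentages in this paragraph follow from dividing the average observer bias by the assumed effect size. A short Python sketch of that arithmetic, using only the values quoted above (the function name is hypothetical):

```python
def relative_exaggeration(bias: float, effect: float) -> float:
    """Observer bias as a percentage of the effect size it is added to."""
    return 100.0 * bias / effect


if __name__ == "__main__":
    bias = -0.23  # pooled difference in effect size
    # Large effect, pooled blinded effect, small effect (all from the text).
    for effect in (-0.8, -0.34, -0.2):
        pct = relative_exaggeration(bias, effect)
        print(f"effect {effect}: exaggerated by {pct:.0f}%")
```

The same absolute bias therefore weighs more heavily the smaller the underlying treatment effect is.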
In a Cochrane review of the effect of progressive resistance strength training, Liu and colleagues compared the pooled standardized mean difference in a subgroup of 54 randomized trials using nonblinded assessors (−0.88 [95% CI −0.77 to −0.99]) with that in 19 trials using blinded assessors (−0.23 [95% CI −0.13 to −0.34]).4,38 The result of this indirect comparison is within the range of our findings. Meta-epidemiological studies of trials with binary outcomes have reported inconsistent estimates of the effect of a lack of double-blinding.5 However, our result is consistent with that of Savovic and colleagues,6 and with our previous study of observer bias in trials with binary outcomes.8
It may be tempting to use measures for interobserver agreement (e.g., weighted κ, intraclass correlation coefficients) as surrogate markers for risk of observer bias. Similarly, training nonblinded observers to reduce interobserver variation39 could be seen as an appealing alternative to blinding in a situation where blinding is challenging. However, good interobserver agreement does not prevent observer bias. For example, the trial with the largest degree of observer bias11 used a scale reported to have an intraclass correlation coefficient as high as 0.87.40
Some researchers consider the blinding of outcome assessors too resource-demanding, superfluous, or misconceived;41,42 however, planning and running a randomized clinical trial is already a logistically very challenging undertaking. The comparatively minor investment of using blinded outcome assessors reduces the risk of bias considerably. Blinding outcome assessors is possible in most trials.43,44
Limitations
The trials we included in our analysis are contemporary and represent a variety of clinical specialties, and their design implies a low risk of confounding. However, these trials are not representative of medical trials in general. We included no trials with clearly objective measurement scale outcomes, such as nonrepeatable automatized laboratory measures. The included trials had subjective outcomes, and our results apply only to similar trials. Furthermore, extrapolating our results to all trials with subjective measurement scale outcomes assumes that trials with both blinded and nonblinded assessors are comparable with trials with only nonblinded assessors.
Our preplanned main analysis disregarded the correlation between blinded and nonblinded assessments, and its confidence interval may thus be somewhat inflated. However, the correlation was available for 7 trials, and secondary analyses incorporating the correlation between blinded and nonblinded assessments provided results similar to those of the main analysis.
Because searching for trials with both blinded and nonblinded assessors is challenging, some such studies may not have been identified by our literature search. However, it is unclear whether such trials would report substantially different results. Publication bias is normally driven by the effect of a treatment45 and may have a limited, yet unpredictable, effect on our comparison between types of assessments.
Conclusion
We provide empirical evidence for observer bias in randomized clinical trials with subjective measurement scale outcomes. Failure to blind outcome assessors in such trials results in a high risk of substantial bias.
Acknowledgements
The authors thank the following trial authors for sharing unpublished outcome data: Peggy Vandervoort, George C. Ebers, Daniel Burkhoff, Cheryl Iglesia, Borwin Bandelow and Dina S. Reddihough, and Frances S. Weaver and the US Department of Veterans Affairs (VA) Cooperative Study Program, as well as the VA CSP study #468 “A comparison of best medical therapy and deep brain stimulation of subthalamic nucleus and globus pallidus for the treatment of Parkinson’s disease.” The authors also thank Peter C. Gøtzsche and Andreas Lundh for valuable comments on previous versions of the manuscript.
Footnotes
Competing interests: Frida Emanuelsson and Ann Sofia Skou Thomsen have received grants from the Danish Council for Independent Research. No other competing interests were declared.
This article has been peer reviewed.
Contributors: Asbjørn Hróbjartsson conceived the idea and design of the study, organized the study and wrote the first draft of the manuscript. Ann Thomsen and Asbjørn Hróbjartsson developed the search strategy. Ann Thomsen, Frida Emanuelsson, Britta Tendal, Stig Brorson and Asbjørn Hróbjartsson did the nonmasked data collection. Isabelle Boutron, Philippe Ravaud, Stig Brorson, Britta Tendal and Asbjørn Hróbjartsson did the masked data collection. Asbjørn Hróbjartsson and Jørgen Hilden did the statistical analyses. All of the authors revised the manuscript for important intellectual content and approved the final version submitted for publication.
Funding: The study was partially funded by the Danish Council for Independent Research: Medical Sciences. The funder had no influence on the study’s design, the collection, analysis, and interpretation of data, or the writing of the article and the decision to submit it for publication.