Published in the Physician Executive Journal November/December 2011

Clinical peer review is evolving toward a model that is better adapted to supporting ongoing improvements in healthcare quality and patient safety. This Quality Improvement (QI) model has been described (1) and validated. (2) It presents a stark contrast to the dysfunctional Quality Assurance (QA) model, which has dominated for the past 30 years. Table 1 (see the QI Model) compares the distinguishing features of these 2 models for clinical peer review. Both models, however, share a concern for minimizing bias.

From a scientific perspective, the process of making and interpreting measurements is subject to bias. Bias is a form of special cause variation, which can degrade the reliability and validity of measurement or measurement interpretation. The evaluations made in clinical peer review are not exempt. In the QA model, bias can affect the judgments of whether the standard of care was met. In the QI model, it can affect the measurement of clinical performance.

Sources of Bias

The chief sources of bias in peer review include the clinical outcome and the reviewer. Less commonly, bias may involve both the reviewed physician and the reviewer, as in cronyism and bad faith peer review.

Outcome bias presents something of a dilemma. Our mental model is that serious adverse events in medicine are more likely to be associated with substandard care. In fact, in case reviews reported by Hayward et al., substandard care was identified among 30% of patient deaths, but only 9% of deaths were judged preventable. (1) While the authors acknowledge the potential for outcome bias, they observe that patients who die tend to present with greater acuity and complexity thereby giving rise to more opportunities for errors in management, which may or may not result in death. This work basically replicates findings from the Harvard Medical Practice Study (2) in which the proportion of adverse events due to negligence increased with the level of harm. Preventability was not assessed. The RAND group came to similar conclusions in a rigorous assessment of quality of care variation over time under the Medicare prospective payment system. (3)

On the other hand, Caplan et al. found consistently harsher ratings when the outcome of a case abstract was altered to indicate a worse outcome. (4) One could argue that a case abstract is a poor proxy for the medical record, which is a universal standard for peer review. While case abstracts are used to complement the medical record among 35% of hospitals, (5) the abstract is itself subject to bias in terms of what gets included or excluded. In the Caplan study, however, the only difference was the outcome.

Berlin distinguishes hindsight bias (Monday morning quarterbacking) from outcome bias as “the tendency for people with knowledge of the actual outcome of an event to believe falsely that they would have predicted the outcome.” (6) Hindsight bias should be addressed in the process of identifying potential strategies to prevent recurrence of an error or adverse event. It may be of less concern for the process of clinical performance measurement per se. This is because clinical performance measures look at multiple aspects of performance, which may be independent of an adverse event, and not simply whether the standard of care was met overall.

Reviewer-specific bias can be problematic when it occurs, but the problem may be the exception rather than the rule. Hayward et al. reported the reviewer-specific effect for the various implicit measures studied in evaluating the care of general medical inpatients. (7) This study involved multiple independent reviews without discussion. The reviewer-specific effect was small for most measures. Ratings of resource use and appropriateness of discharge had the highest reviewer-specific effects accounting for about 15% of variance. This was not due to a single outlier reviewer. While a reviewer-specific effect contributed to 31% of variation in ratings of errors in following physician orders, these were not ratings of physician-attributable performance.

Case selection for peer review presents another potential source of bias. It has not been formally studied. While cases with serious adverse outcomes are more readily identified than those with lesser levels of harm, the effect is equal for all physicians. It’s not unreasonable that the care of higher risk conditions is more closely scrutinized for system and individual improvement opportunities. On the other hand, most programs have a pre-review screening process by which identified cases are selected or rejected for peer review. There is wide variation in how this is done and in the proportion of cases screened out, which has no demonstrable effect on the effectiveness of peer review. (5) Within a given institution, variation in the case selection process could make it difficult to compare the results of peer review across committees. Case selection would not of itself influence the results of specific clinical performance measurements.

Minimization of Bias

A common misconception is that bias can be minimized by parsing peer review findings into a small number of categories. Such thinking stems from confusion between reliability of a measure and agreement between raters. Agreement is an illusion if the reliability of a measure is low. Reliability deals with the ability of a measure to make fine differentiation among subjects. Up to the point of 7 to 10 categories, more categories will invariably produce higher reliability. (8)
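The distinction between agreement and reliability can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the analysis from the cited reference: the uniform "true quality" scores, the Gaussian rater noise (`noise_sd`), the equal-width categories, and the use of the correlation between two simulated raters as a stand-in for reliability are all hypothetical choices made for illustration.

```python
import random

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def categorize(score, k):
    """Map a continuous score onto k equal-width categories (0..k-1)."""
    clipped = min(max(score, 0.0), 1.0)
    return min(int(clipped * k), k - 1)

def simulate(k, n_cases=5000, noise_sd=0.15, seed=0):
    """Return (exact agreement, inter-rater correlation) for a k-category scale.

    All parameters are hypothetical: true quality is uniform on [0, 1] and
    each rater sees it through independent Gaussian noise.
    """
    rng = random.Random(seed)
    true_quality = [rng.random() for _ in range(n_cases)]
    rater_a = [categorize(q + rng.gauss(0, noise_sd), k) for q in true_quality]
    rater_b = [categorize(q + rng.gauss(0, noise_sd), k) for q in true_quality]
    agreement = sum(a == b for a, b in zip(rater_a, rater_b)) / n_cases
    return agreement, pearson(rater_a, rater_b)

for k in (2, 3, 5, 7, 10):
    agreement, reliability = simulate(k)
    print(f"{k:2d} categories: agreement={agreement:.2f}, reliability={reliability:.2f}")
```

Under these assumptions, the two-category scale should show the highest exact agreement but the lowest correlation between raters, which is the pattern the paragraph above describes: collapsing categories makes reviewers appear to agree while discarding the information needed to differentiate among cases.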

The limited categories approach also confuses judgment (test result interpretation) with measurement (test result). In doing so, it forces a black and white, high-stakes process that perpetuates the dysfunctional QA model for peer review. Where reviewers are looking for agreement on whether the standard of care was met (judgment), they will succeed when the standard (cut point) is set low enough. The value of such judgments, however, is correspondingly low. What is more, they don’t lend themselves to data aggregation. Borderline performance cannot be addressed, because it’s not differentiated. In short, the limited categories approach to peer review is powerless to shift the curve of group performance. In contrast, the QI model is associated with significantly greater medical staff satisfaction than the QA model, in large measure because it is perceived as a fair, credible, consistent and non-punitive process. (9)

When properly designed, subjective measures of clinical performance made in the peer review process can have good enough reliability for use in aggregate assessments, even if they are not reliable enough for unqualified judgments about individual cases. (10) They can be applied either with individual physician or case-level attribution. Thus, they work well within a quality improvement framework. In the language of industrial quality control, since performance may vary over time and clinical context, the “chart” of multiple cases will demonstrate whether the process is well-controlled. From this perspective, a single case (point measure) displaying marked deviation raises the question of special cause variation that may warrant further investigation.
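The control chart idea can be sketched in a few lines. The example below assumes a hypothetical 0 to 100 clinical performance score per reviewed case and uses the sample standard deviation of a baseline series to set three-sigma limits. This is a simplification: a formal individuals chart would typically estimate variation from moving ranges. The data and function names are invented for illustration.

```python
from statistics import mean, stdev

def control_limits(baseline, sigmas=3.0):
    """Return (lower, center, upper) limits from a baseline series of scores."""
    center = mean(baseline)
    spread = stdev(baseline)  # simplification: sample SD, not moving range
    return center - sigmas * spread, center, center + sigmas * spread

def flag_special_cause(scores, baseline, sigmas=3.0):
    """Return indexes of scores that fall outside the baseline control limits."""
    lower, _, upper = control_limits(baseline, sigmas)
    return [i for i, s in enumerate(scores) if s < lower or s > upper]

# Hypothetical case-level performance scores for one physician or group.
baseline = [78, 82, 80, 85, 79, 81, 83, 80, 84, 82]
new_scores = [81, 79, 55, 83]

print(flag_special_cause(new_scores, baseline))  # -> [2]
```

A flagged point is a prompt for further investigation of that case, not a judgment in itself, which is consistent with using these measures in aggregate rather than for unqualified conclusions about a single review.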

If case reviews are evenly distributed to members of the review committee without regard to clinical subject matter, the aggregate view of the data will largely neutralize reviewer-specific bias. Scoring differences between reviewers can also be minimized through attention to selection and training, routine feedback of ratings (a comparative reviewer profile), and committee discussions.
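A comparative reviewer profile can be as simple as each reviewer's mean rating expressed as an offset from the committee-wide mean. The sketch below assumes a hypothetical list of `(reviewer, score)` pairs on a common scale. It also assumes that cases are distributed evenly, so that offsets reflect reviewer tendencies rather than differences in case mix. The names and data are illustrative only.

```python
from collections import defaultdict
from statistics import mean

def reviewer_profile(ratings):
    """Return {reviewer: mean rating minus overall mean} for (reviewer, score) pairs."""
    by_reviewer = defaultdict(list)
    for reviewer, score in ratings:
        by_reviewer[reviewer].append(score)
    overall = mean(score for _, score in ratings)
    return {r: round(mean(scores) - overall, 2) for r, scores in by_reviewer.items()}

# Hypothetical ratings: reviewer B scores consistently lower than colleagues.
ratings = [
    ("A", 80.0), ("A", 82.0), ("A", 78.0),
    ("B", 70.0), ("B", 72.0), ("B", 68.0),
    ("C", 81.0), ("C", 79.0), ("C", 83.0),
]

print(reviewer_profile(ratings))  # -> {'A': 3.0, 'B': -7.0, 'C': 4.0}
```

Feeding such a profile back to reviewers gives them a concrete basis for discussing scoring differences. With few cases per reviewer the offsets will be noisy, so they are better read as conversation starters than as verdicts.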

Reviewers who customarily work together in a committee setting tend to develop common standards. In a study of peer review reliability, Hofer et al. found that pairs of reviewers who worked together rated cases more similarly than others, even before discussion. (11) While their independent ratings were even more similar following case discussion, differences across pairs of reviewers persisted.

Although duplicate review and discussion is not routinely done in practice because of the resource cost, it can be useful as a training exercise to help review committee members develop the habit of talking about rating appropriateness. The process is simple. Divide the committee into pairs of reviewers. Assign one case to each pair. Allow them time to review the case either independently or together and to meet for discussion. Ask each pair to generate a single final set of clinical performance ratings for their case. The committee then regroups to discuss all the cases and the lessons learned. The exercise can be done within the confines of a single meeting or the reviews can be pre-assigned a reasonable time in advance of the group session.

When clinical performance is quantified, review data become useful for program management as well as individual and group performance feedback. Ratings can then be meaningfully compared across reviewers and across committees to identify bias. Feedback of such data can promote self-corrective action.

Outcome bias can be mitigated by mindfulness of the effect, by looking at multiple cases to assess individual performance, and by a uniform case selection process. If reviewers understand the risk of outcome bias, the committee can develop the habit of explicitly discussing the issue. It may also prove useful to separate the evaluation of patient harm from the evaluation of clinical performance. Certainly, as an inevitable consequence of a diligent review of the record, a reviewer would generally be aware of any significant harm that might have occurred. Nevertheless, asking the reviewer to specifically evaluate harm as part of the review process elevates the importance of the issue. This strategy needs to be formally tested. Most importantly, to promote a culture of safety, we must avoid a “No Harm, No Foul” philosophy when evaluating the pivotal behavioral choices that were made. Risky and reckless behaviors need to be addressed regardless of the outcome to mitigate the potential for future harm. To ignore such behaviors would favor the normalization of deviance in the organization.

In summary, although the clinical peer review process is susceptible to bias, with reasonable attention to process and governance, this issue can be managed. The data-rich structure of the QI model is well-suited to monitor for and minimize such bias, just as it better supports overall program effectiveness.


  1. Hayward RA, Bernard AM, Rosevear JS, et al. An evaluation of generic screens for poor quality of hospital care on a general medicine service. Med Care 1993;31(5):394-402.
  2. Brennan TA, Leape LL, Laird NM, Hebert L, Localio R et al. Incidence of adverse events and negligence in hospitalized patients: Results of the Harvard Medical Practice Study I. NEJM 1991;324(6):370-376.
  3. Rubenstein LV, Kahn KL, Reinisch EJ, Sherwood MJ et al. Changes in quality of care for five diseases measured by implicit review, 1981 to 1986. JAMA. 1990;264(15):1974-1979.
  4. Caplan RA, Posner KL, Cheney FW. Effect of outcome on physician judgments of appropriateness of care. JAMA 1991;265(15):1957-1960.
  5. Edwards MT, Benjamin EM. The process of peer review in US hospitals and its perceived impact on quality of care. J Clin Outcomes Manage. 2009(Oct);16(10):461-467.
  6. Berlin L. Outcome bias. AJR 2004;183(3):557-560.
  7. Hayward RA, McMahon LF, Bernard AM. Evaluating the care of general medicine inpatients: How good is implicit review? Ann Intern Med. 1993;118(7):551-557.
  8. Edwards MT. Measuring clinical performance. Phys Exec. 2009;35(6):40-43.
  9. Edwards MT. Clinical peer review program self-evaluation for US hospitals. Am J Med Qual. 2010;25(6):474-480.
  10. Goldman RL, Ciesco E. Improving peer review: alternatives to unstructured judgments by a single reviewer. Jt Comm J Qual Improv 1996;22(11):762-769.
  11. Hofer TP, Bernstein SJ, DeMonner S, Hayward RA. Discussion between reviewers does not improve reliability of peer review of hospital quality. Med Care 2000;38(2):152-161.