Annotated Bibliography

Makover HB. The quality of medical care. Am J Public Health. 1951;41(7):824-832.
The author assessed the quality of care delivered by 26 medical groups contracted with the Health Insurance Plan of Greater New York (HIP) through a systematic review of a stratified random sample of 2,148 medical records focusing on routine health examinations, pediatric care, evaluation of suspected cancer, and gastrointestinal complaints. The evaluation was based on “ratings of clinical performance rather than conformance to standards.” The medical record evaluation was supplemented by interviews with the 26 medical directors. The resulting categorization of group performance was well received. Better performance correlated with younger physicians, centralized physical facilities and laboratories.
Daily EF, Morehead MA. A method of evaluating and improving the quality of medical care. Am J Public Health. 1956;46(7):848-854.
An interim report of a repeat study of medical care delivered by medical groups contracted with HIP, which expanded on the program reported by Makover using a structured implicit review methodology and ratings on a 3-level scale of good, fair, and unsatisfactory. The ratings were translated into a case score of up to 100 points allocated among the major categories of records, diagnostic evaluation, and treatment/follow-up. Allowed for the possibility that minimalist record-keeping could still be compatible with quality of care by adjusting diagnostic and treatment scores based on case discussions with the involved physician. Offers a qualitative description of the methodology, findings and impact. Noted that the greatest differentiator within and across groups was the performance of family physicians in the management of possibly serious illness. Also found a general consistency in patient management by a given physician at both a high and low level of quality.
Morehead MA. The medical audit as an operational tool. Am J Public Health. 1967;57(9):1643-1656.
This reflection on her work with the HIP and Teamster studies (reported elsewhere) includes illuminating comments on what worked and what didn’t as well as a sample evaluation form. For example, efforts to devise a method of abstracting records with explicit care items failed to correlate with reviewer judgments. Criteria for case selection were important in identifying improvement opportunities.
Sanazaro PJ, Williamson JW. A classification of physician performance in internal medicine. J Med Educ. 1968;43(3):389-397.
Used the critical incident technique to obtain 2,589 descriptions of effective and ineffective performance by internists as recalled by 2,499 physicians in private practice who were recommended by the department chairs of 20 medical schools from 14 states in 4 specialties. This was a subset of the 12,886 descriptions received and reported in Med Care. 1970 Jul-Aug;8(4):299-308. The classification scheme and the table of frequencies are useful references.
Fine J, Morehead MA. Study of peer review of in hospital patient care with the aid of a manual. NY State J Med. 1971(Aug 15):1963-1973.
An early effort to develop a structured review methodology with yes/no patient management questions (a “review pattern”) and an overall 3-level quality rating focusing on 6 common disorders. Created a taxonomy of deficiencies and presented data on their frequency among 792 cases from 6 hospitals reviewed by 6 physicians.
Peters EN. Practical vs. impractical peer review. Med Care. 1972;10(6):516-521.
A response to a flawed estimate of the number of independent reviews required to have confidence in the result for a single case presented in Richardson FM: Peer review of medical care. Med Care 1972; 10:29-39.
Hulka BS, Kupper LL, Cassel JC. Physician management in primary care. Am J Public Health. 1976;66(12):1173-1179.
Minimal explicit consensus criteria in the management of patients with four indicator conditions were established by an ad hoc committee of primary care physicians practicing in different locations. These criteria were then applied to the practices of primary care physicians located in a single community by abstracting medical records and obtaining questionnaire data about patients with the indicator conditions. A standardized management score for each physician was used as the dependent variable in stepwise regression analysis with physician/practice and patient/disease characteristics as the candidate independent variables. For all physicians combined, the mean management scores were high, ranging from .78 to .93 for the four conditions. For two of the conditions, care of the normal infant and pregnant woman, the management scores were better for pediatricians and obstetricians respectively than for family physicians. For the other two conditions, adult onset diabetes and congestive heart failure, there were no differences between the management scores of family physicians and internists. Patient/disease characteristics did not contribute significantly to explaining the variation in the standardized management scores.
Haines S. Hospital peer review systems: An overview. Health Matrix. 1984-85;2(4):30-32.
Sanazaro PJ, Worth RM. Measuring clinical performance of internists in office and hospital practice. Med Care. 1985;23(9):1097-1114.
An important analysis showing that small sample sizes can reliably compare clinical performance when using structured review methods. Found that a sample of 8 records would suffice to identify 92% of physicians who provide substandard care to 10% of their cases with a specific diagnosis. Found a lack of clustering of performance across the conditions that were reviewed.
No standardized methods exist for reliably measuring physicians' performance against clinically valid standards. In this study, statistically reliable peer review based on current national standards of internal medicine was used to evaluate the clinical performance of 66 internists in office and hospital practice. Evaluation was limited to the substantiation of diagnosis, prescribing indicated drug regimen, monitoring, and attaining expected patient response. Performance in substantiating diagnosis was better than in therapeutic management, and management in hospitalized patients was superior to office management. Superior performance by a physician was not consistent across diagnoses, but substandard office treatment in at least one diagnosis was associated with substandard office treatment in other conditions. Internists' performance was unrelated to their certification status but inversely related to the number of years since graduation from medical school. This method could be used to evaluate the effectiveness of continuing education in improving physicians' performance and to validate current examinations used in recertifying internists.
Park RE, Fink A, Brook RH, Chassin MR, Kahn KL, Merrick NJ, Kosecoff J, Solomon DH. Physician ratings of appropriate indications for three procedures: theoretical indications vs indications used in practice. Am J Public Health. 1989;79(4):445-447.
This and similar RAND studies served as the foundation for Value Health Management which did surgical pre-certification for the insurance industry in the 90s. The lack of clarity about how to determine agreement for surgical indications did not deter them. Unfortunately, this study is focused on absolute answers for single cases, rather than approximate answers for multiple cases for data aggregation purposes and could be misleading if applied to peer review without additional consideration. Furthermore, the inter-rater reliability of the measure (intra-class correlation coefficient) was not reported.
We previously reported substantial disagreement among expert physician panelists about the appropriateness of performing six medical and surgical procedures for a large number of theoretical indications. A recently completed community-based medical records study of about 4,500 patients who had one of three procedures--coronary angiography, upper gastrointestinal endoscopy, and carotid endarterectomy--shows that many of the theoretical indications are seldom or never used in practice. However, we find that there is also substantial disagreement (5, 25, or 32 per cent for angiography, endoscopy, or endarterectomy, respectively) about the appropriateness of indications used in actual cases if disagreement is defined by first discarding the two extreme of nine ratings, then looking for at least one rating near the bottom (1 to 3) and one near the top (7 to 9) of the 9-point scale. Patients should know that a substantial percentage of procedures are performed for indications about which expert physicians disagree.
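The disagreement rule described in the abstract above can be sketched in a few lines. This is only an illustration: the function name is hypothetical, and the reading of "discarding the two extreme ratings" as dropping the single lowest and single highest of the nine ratings is an assumption, since the abstract does not spell it out.

```python
def has_disagreement(ratings):
    """Apply the trimmed disagreement rule to one indication's panel ratings.

    Assumption: "the two extreme ratings" means the single lowest and the
    single highest of the nine ratings.
    """
    if len(ratings) != 9 or any(not 1 <= r <= 9 for r in ratings):
        raise ValueError("expected nine ratings on a 1-9 scale")
    trimmed = sorted(ratings)[1:-1]          # drop one low and one high extreme
    has_low = any(r <= 3 for r in trimmed)   # at least one rating near the bottom
    has_high = any(r >= 7 for r in trimmed)  # at least one rating near the top
    return has_low and has_high


# A panel split between the ends of the scale counts as disagreement;
# a panel with only isolated outliers does not.
split_panel = [1, 2, 5, 5, 5, 5, 5, 8, 9]
outliers_only = [1, 5, 5, 5, 5, 5, 5, 5, 9]
print(has_disagreement(split_panel), has_disagreement(outliers_only))
```

Trimming before testing means a single dissenting panelist at either end cannot by itself trigger a finding of disagreement.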
Weinberg NS. The relation of medical problem solving and therapeutic errors to disease categories. QRB Qual Rev Bull. 1989 Sep;15(9):266-72.
This study lacked rigor, but is still a helpful reference on the epidemiology of medical error and is of use in thinking about how to mitigate human factors.
Review of 146 internal medicine cases at Emerson Hospital in Concord, Massachusetts, revealed significant variations in the patterns of physician error and medical problem solving for five diseases: recurrent congestive heart failure, transient ischemic attacks, recurrent cerebrovascular accidents, upper gastrointestinal hemorrhage, and acute bacterial pneumonia. Reviewers used general criteria to identify quality issues, which were separated into six error categories: insufficient data acquisition, inadequate hypothesis generation, inattention to or misinterpretation of cues, inappropriate or mismanaged therapy, delayed or missed diagnoses, and delayed treatment. The most common errors were inadequate hypothesis generation (38%) and inattention to or misinterpretation of cues (32%). Inappropriate or mismanaged therapy was found in 21% of cases.
Ash A, Schwartz M, Payne SM, Restuccia JD. The Self-Adapting Focused Review System. Probability sampling of medical records to monitor utilization and quality of care. Med Care. 1990 Nov;28(11):1025-39.
For those who seek a rigorous methodological approach to targeting cases for peer review, this article is a must-read. If you’re still enthusiastic after distilling the content and appreciating that the authors were not able to actually implement their model (ostensibly due to information system changes at the hospital), then God bless you: you are willing to invest a level of effort that goes far beyond what is commonly found.
Medical record review is increasing in importance as the need to identify and monitor utilization and quality of care problems grows. To conserve resources, reviews are usually performed on a subset of cases. If judgment is used to identify subgroups for review, this raises the following questions: How should subgroups be determined, particularly since the locus of problems can change over time? What standard of comparison should be used in interpreting rates of problems found in subgroups? How can population problem rates be estimated from observed subgroup rates? How can the bias be avoided that arises because reviewers know that selected cases are suspected of having problems? How can changes in problem rates over time be interpreted when evaluating intervention programs? Simple random sampling, an alternative to subgroup review, overcomes the problems implied by these questions but is inefficient. The Self-Adapting Focused Review System (SAFRS), introduced and described here, provides an adaptive approach to record selection that is based upon model-weighted probability sampling. It retains the desirable inferential properties of random sampling while allowing reviews to be concentrated on cases currently thought most likely to be problematic. Model development and evaluation are illustrated using hospital data to predict inappropriate admissions.
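The general idea behind model-weighted probability sampling can be illustrated with a generic unequal-probability sample and an inverse-probability (Horvitz-Thompson style) estimate of the population problem rate. This is not the SAFRS algorithm itself, whose details are not given in the abstract; the function names, the `floor` parameter, and the use of independent per-record selection are all assumptions made for the sketch.

```python
import random


def selection_probabilities(risk_scores, expected_sample_size, floor=0.02):
    """Selection probability roughly proportional to model-predicted risk.

    `floor` (an assumed parameter) gives every record some chance of review,
    which keeps the sample usable for population-level inference.
    """
    total = sum(risk_scores)
    return [min(1.0, max(floor, expected_sample_size * s / total))
            for s in risk_scores]


def draw_sample(probs, rng):
    """Select each record independently with its own probability."""
    return [i for i, p in enumerate(probs) if rng.random() < p]


def estimate_problem_rate(sampled, probs, has_problem, population_size):
    """Inverse-probability-weighted estimate of the population problem rate.

    `has_problem` maps a reviewed record's index to the reviewer's finding.
    """
    weighted_count = sum(has_problem[i] / probs[i] for i in sampled)
    return weighted_count / population_size


# Hypothetical example: four records with model-predicted risk scores.
risk = [0.05, 0.10, 0.40, 0.45]
probs = selection_probabilities(risk, expected_sample_size=2)
sample = draw_sample(probs, random.Random(0))
```

Because each reviewed record is weighted by the inverse of its selection probability, concentrating review on high-risk records does not by itself bias the estimated population rate, which is the property the abstract attributes to the approach.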
Caplan RA, Posner KL, Cheney FW. Effect of outcome on physician judgments of appropriateness of care. JAMA. 1991;265(15):1957–1960.
One could argue that a case abstract is a poor proxy for the medical record, which is a universal standard for peer review. While case abstracts are used to complement the medical record among 35% of hospitals, the abstract is itself subject to bias in terms of what gets included or excluded. In this study, however, the only difference was the outcome. Given that outcome bias is a realistic concern in the context of clinical peer review, the process ought to be concerned with mitigating its potential for harm. See my whitepaper: Minimizing Bias in Clinical Peer Review.
Is a permanent injury more likely to elicit a rating of inappropriate care than a temporary injury? To explore this question, we asked 112 practicing anesthesiologists to judge the appropriateness of care in 21 cases involving adverse anesthetic outcomes. The original outcome in each case was classified as either temporary or permanent. The authors then generated a matching alternate case identical to the original in every respect except that a plausible outcome of opposite severity was substituted. The original and alternate cases were randomly divided into two sets and assigned to reviewers who were blind to the intent of the study. The reviewers were asked to rate independently the care in each case as appropriate, less than appropriate, or impossible to judge, based on their personal (implicit) judgment of reasonable and prudent practice. A significant inverse relationship between severity of outcome and judgments of appropriateness of care was observed in 15 (71%) of the 21 matched pairs of cases. Overall, the proportion of ratings for appropriate care decreased by 31 percentage points when the outcome was changed from temporary to permanent and increased by 28 percentage points when the outcome was changed from permanent to temporary. We conclude that knowledge of the severity of outcome can influence a reviewer's judgment of the appropriateness of care.
Sanazaro PJ, Mills DH. A critique of the use of generic screening in quality assessment. JAMA. 1991;265(15):1977-1981.
This article is a must-read for those who would persist in using generic screens to identify cases for peer review. Not only does it raise red flags, it was written by those who originally developed the methodology while exploring the potential of a no-fault medical malpractice program. Their work was the fore-runner of the Harvard Medical Practice Study.
This article summarizes available information on the efficiency and effectiveness of generic occurrence screening when used in quality assessment. Generic screening is relatively inefficient because of its multitiered review system and high rates of errors and false positives. Overall sensitivity may approach 70% to 80%, but specificity is estimated to range from about 22% to 73%. Effectiveness of generic screening in identifying problems in quality is limited by variability in peer review. Other limitations of generic screens include their lack of inherent relationship to the quality of patient care and their inability to provide direct performance measures for use in the periodic reappraisal of clinical privileges of medical staff members. We propose the monitoring of specific adverse surgical and medical clinical outcomes and related risk factors to increase efficiency in quality assessment and provide a more adequate database for the continual improvement of patient care and clinical performance.
Rubenstein LV, Kahn KL, Reinisch EJ et al. Changes in quality of care for five diseases measured by implicit review, 1981 to 1986. JAMA. 1990;264(15):1974-1979.
An application of structured implicit review methodology to the task of clinical performance evaluation. Disease-specific forms were used. They were appropriate to the objectives of the study, but may be less useful for general application to the many diseases and clinical scenarios involved in clinical peer review.
We measured quality of care before and after implementation of the prospective payment system. We developed a structured implicit review form and applied it to a sample of 1366 Medicare patients with congestive heart failure, acute myocardial infarction, pneumonia, cerebrovascular accident, or hip fracture who were hospitalized in 1981-1982 or 1985-1986. Very poor quality of care was associated with increased death rates 30 days after admission (17% with very good care died vs 30% with very poor care). The quality of medical care improved between 1981-1982 and 1985-1986 (from 25% receiving poor or very poor care to 12%), although more patients were judged to have been discharged too soon and in unstable condition (7% vs 4%). Except for discharge planning processes, the quality of hospital care has continued to improve for Medicare patients despite, or because of, the introduction of the prospective payment system with its accompanying professional review organization review.
Rubin HR, Rogers WH, Kahn KL, Rubenstein LV, Brook RH. Watching the doctor-watchers: How well do peer review organization methods detect hospital care quality problems? JAMA. 1992;267:2349-2354.
Also see: Rubin HR, Rubenstein LV, Kahn KL, Sherwood M. Guidelines for Structured Implicit Review of Diverse Medical and Surgical Conditions. Santa Monica, CA: RAND, N-3066-HCFA; 1990.
A generalization of the structured implicit review methodology reported in JAMA 1990. Although the application was an independent evaluation of PRO assessments of hospital quality of care, the approach is applicable to the hospital medical staff peer review setting. The tools used, however, required extensive reviewer training and took 30+ minutes per case to complete. The publication of this study contributed to a major change in Medicare administrative policy, which ultimately gave rise to the QIO Model. It established structured implicit review as the gold standard for clinical performance measurement when explicit standards do not exist, which is true most of the time in the peer review setting.
Objective: To determine how well one state's peer review organization (PRO) judged the quality of hospital care compared with an independent, credible judgment of quality of care.
Design: Retrospective study comparing a PRO's review, including initial screening, physician review, and final judgments, with an independent "study judgment" based on blinded, structured, implicit reviews of hospital records.
Setting: One state's medical and surgical Medicare hospitalizations during 1985 through 1987 audited randomly by the state's PRO.
Sample: Stratified random sampling of records: 62 records that passed the PRO initial screening process and were not referred for PRO physician review; 50 records that failed PRO screen and then were confirmed by PRO physicians to be "quality problems."
Intervention: None.
Main Outcome Measure: A study judgment of below standard or standard or above based on the mean of overall ratings by five internists for records in medical diagnosis related groups (DRGs) and by five internists and five surgeons for surgical DRGs. Each step in the PRO review was evaluated for how many records passing or failing that step were judged standard or above or below standard in the study (positive and negative predictive value) and how well that step classified records that the study judged below standard or standard or above (sensitivity and specificity).
Results: An estimated 18% of records reviewed by the PRO were below standard according to the study judgment, compared with 6.3% quality problems according to the PRO's final judgment (difference, 12%; 95% confidence interval, 1 to 23). The PRO's initial screening process failed to detect and refer for PRO physician review two of three records that the study judged below standard. In addition, only one of three of the records that PRO physicians judged to be quality problems were judged below standard by the study judgment. Therefore, the PRO's final quality of care judgment and the study judgment agreed little more than expected by chance, especially about poor quality of care. Although the PRO correctly classified 95% of the records that the study judged standard or above, it detected only 11% of records judged below standard by the study.
Conclusions: Most of all, this PRO review process would be improved by additional preliminary screens to identify the 67% of records that the study judged below standard but that passed its initial screening. The screening process also must be more accurate in order to be cost-effective, as it was only slightly better than random sampling at correctly identifying below standard care. More reproducible physician review is also needed and might be accomplished through improved reviewer selection and training, a structured review method, and more physician reviewers per record.
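The four measures named in the abstract above (sensitivity, specificity, positive and negative predictive value) all derive from a 2x2 table of screening result against the reference judgment. The sketch below uses made-up counts and a hypothetical function name. It does not reproduce the study's figures, which were estimated from a stratified sample and would require the sampling weights.

```python
def screening_metrics(true_pos, false_pos, false_neg, true_neg):
    """Accuracy measures for a screen judged against a reference standard.

    Here "positive" means the screen flagged the record, and the reference
    standard is an independent judgment that care was below standard.
    """
    flagged = true_pos + false_pos
    passed = false_neg + true_neg
    below_standard = true_pos + false_neg
    standard_or_above = false_pos + true_neg
    if 0 in (flagged, passed, below_standard, standard_or_above):
        raise ValueError("each row and column of the 2x2 table needs at least one record")
    return {
        "sensitivity": true_pos / below_standard,     # below-standard records the screen caught
        "specificity": true_neg / standard_or_above,  # acceptable records the screen passed
        "ppv": true_pos / flagged,                    # flagged records truly below standard
        "npv": true_neg / passed,                     # passed records truly acceptable
    }


# Hypothetical counts, chosen only to illustrate the arithmetic.
metrics = screening_metrics(true_pos=8, false_pos=2, false_neg=4, true_neg=86)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

A screen can show high specificity and still miss most below-standard care, which is the pattern the abstract reports for the PRO.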
Hershey N. Compensation and accountability: the way to improve peer review. Qual Assur Util Rev. 1992 Spring;7(1):23-9.
In reflecting on the finding that the Joint Commission issued Type 1 recommendations to more than 50% of hospitals surveyed from 1986-88 related to medical staff standards requiring “monitoring and evaluation of medical staff/department care” and “surgical case review”, the author recommended the compensation of reviewers as a means of improving accountability and effectiveness.
Lefevre F, Feinglass J, Yarnold PR, Martin GJ, Webster J. Use of the Rand Structured Implicit Review Instrument for quality of care assessment. Am J Med Sci. 1993;305(4):222-228.
Tested the RAND instruments for specified conditions (CHF, AMI and pneumonia) at a teaching hospital, but did not invest a comparable level of training in their use (4 vs. 15 hours). The resulting inter-rater reliability was slightly lower (and similar to Hayward et al.), but still adequate for aggregate comparisons.
The Rand Structured Implicit Review Instrument is a 27-item instrument that rates process quality of care for patients with five common illnesses. This study reports on the use of this instrument for hospitalized patients with long lengths of stay. A total of 120 medical records were reviewed by multiple physician reviewers for patients discharged with congestive heart failure, acute myocardial infarction, and pneumonia. Mean inter-rater reliability was assessed for a subsample of six records by kappa score. A multiple regression analysis was used to estimate the relationship between process ratings for the quality of documentation, assessment, monitoring, and therapy and overall quality of care scores, controlled for physician judgments about patients' prognosis and selected patient characteristics. Each reviewer also evaluated the instrument. Mean kappa for trichotomized ratings of quality of care was 0.50. The majority of all quality of care ratings were in the good or very good range (77.5%). The full regression model, including process subscale quality ratings, prognostic items, and patient characteristics, accounted for 38% of the total variance in the quality of care ratings. Items measuring the quality of assessment (p < 0.0001), therapy (p < 0.02) and monitoring (p < 0.01) were significant. Physicians accepted the use of such a form moderately well. The Rand quality of care form shows consistency in rating overall quality of care and individual dimensions of quality. Achieving a high level of inter-rater reliability is difficult with implicit review. By focusing on specific areas of potentially deficient care, structured review instruments can improve clinical quality improvement efforts.
Ramsey PG, Wenrich MD, Carline JD, Inui TS, Larson EB, LoGerfo JP. Use of peer ratings to evaluate physician performance. JAMA. 1993;269(13):1655-1660.
Demonstrates the feasibility of using peer ratings to evaluate physician performance in terms of both clinical skills and humanistic qualities and is highly relevant to the credentialing/re-credentialing process in hospitals where there is insufficient clinical activity at the institution. Note the relatively high number of independent ratings required to generate a reliable assessment according to the authors’ standards, which are likely more stringent than those applied in practice.
OBJECTIVE: To assess the feasibility and measurement characteristics of ratings completed by professional associates to evaluate the performance of practicing physicians.
DESIGN: The clinical performance of physicians was evaluated using written questionnaires mailed to professional associates (physicians and nurses). Physician-associates were randomly selected from lists provided by both the subjects and medical supervisors, and detailed information was collected concerning the professional and social relationships between the associate and the subject. Responses were analyzed to determine factors that affect ratings and measurement characteristics of peer ratings.
SETTING AND PARTICIPANTS: Physician-subjects were selected from among practicing internists in New York, New Jersey, and Pennsylvania who received American Board of Internal Medicine certification 5 to 15 years previously.
MAIN OUTCOME MEASURE: Physician performance as assessed by peers.
RESULTS: Peer ratings are not biased substantially by the method of selection of the peers or the relationship between the rater and the subject. Factor analyses suggest a two-dimensional conceptualization of clinical skills: one factor represents cognitive and clinical management skills and the other factor represents humanistic qualities and management of psychosocial aspects of illness. Ratings from 11 peer physicians are needed to provide a reliable assessment in these two areas.
CONCLUSIONS: These findings suggest that it is feasible to obtain assessments from professional associates of practicing physicians in areas such as clinical skills, humanistic qualities, and communication skills. Using a shorter version of the questionnaire used in this study, peer ratings provide a practical method to assess clinical performance in areas such as humanistic qualities and communication skills that are difficult to assess with other measures.
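One standard way to reason about how many raters are needed for a reliable averaged rating is the Spearman-Brown prophecy formula. The sketch below is offered only as an illustration of that relationship: the abstract does not say which method the authors used to arrive at 11 raters, nor their single-rater reliability or target threshold, so the function names and the numbers in the example are assumptions.

```python
import math


def reliability_of_mean(single_rater_reliability, n_raters):
    """Spearman-Brown: reliability of the average of n_raters parallel ratings."""
    r = single_rater_reliability
    return n_raters * r / (1 + (n_raters - 1) * r)


def raters_needed(single_rater_reliability, target_reliability):
    """Smallest number of raters whose averaged rating reaches the target."""
    r, target = single_rater_reliability, target_reliability
    if not (0 < r < 1 and 0 < target < 1):
        raise ValueError("reliabilities must lie strictly between 0 and 1")
    exact = target * (1 - r) / (r * (1 - target))
    return max(1, math.ceil(exact - 1e-9))  # small tolerance for float rounding


# Hypothetical inputs: a single-rater reliability of 0.25 and a target of 0.70.
print(raters_needed(0.25, 0.70))
```

The formula makes the trade-off explicit: the lower the agreement between individual raters, or the stricter the target, the more independent ratings must be averaged.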
Ashton CM, Kuykendall DH, Johnson ML, Wray NP. An Empirical Assessment of the Validity of Explicit and Implicit Process-of-Care Criteria for Quality Assessment. Med Care 1999;37(8):798-808.
Not unexpectedly, found higher inter-rater reliability for explicit measures of quality. Their implicit review instrument, however, did not use a psychometrically valid rating scale. It only captured yes/no judgments. If explicit measures were available for the full range of clinical scenarios, the point would be moot. In reality, we must live with implicit (subjective) measures. The challenge is to maximize their reliability and to use them appropriately in decision-making and performance feedback.
OBJECTIVE: To evaluate the validity of three criteria-based methods of quality assessment: unit weighted explicit process-of-care criteria; differentially weighted explicit process-of-care criteria; and structured implicit process-of-care criteria.
METHODS: The three methods were applied to records of index hospitalizations in a study of unplanned readmission involving roughly 2,500 patients with one of three diagnoses treated at 12 Veterans Affairs hospitals. Convergent validity among the three methods was estimated using Spearman rank correlation. Predictive validity was evaluated by comparing process-of-care scores between patients who were or were not subsequently readmitted within 14 days.
RESULTS: The three methods displayed high convergent validity and substantial predictive validity. Index-stay mean scores, using explicit criteria, were generally lower in patients subsequently readmitted, and differences between readmitted and nonreadmitted patients achieved statistical significance as follows: mean readiness-for-discharge scores were significantly lower in patients with heart failure or with diabetes who were readmitted; and mean admission work-up scores were significantly lower in patients with lung disease who were readmitted. Scores derived from the structured implicit review were lower in patients eventually readmitted but significantly so only in diabetics.
CONCLUSIONS: These three criteria-based methods of assessing process of care appear to be measuring the same construct, presumably "quality of care." Both the explicit and implicit methods had substantial validity, but the explicit method is preferable. In this study, as in others, it had greater inter-rater reliability.
Hayward RA, Bernard AM, Rosevear JS, Anderson JE, McMahon LF Jr. An evaluation of generic screens for poor quality of hospital care on a general medicine service. Med Care. 1993 May;31(5):394-402.
Comment in: Med Care. 1994 Apr;32(4):405-6.
The generic screens for case identification assessed in this study were all based on hospital administrative data. The authors used a 28-day readmission window, but found that using 14 days gave similar results: no better than random review for identifying substandard care. The added correspondence dealt with the authors’ error in including deaths in the analysis of readmissions. Because of adjustment for the oversampling, the proportion of discharges with evidence of substandard care and without early readmission should have been stated as 11.4% instead of 11.9%.
In this study, 675 general medicine admissions at a university teaching hospital were reviewed to evaluate six potential generic quality screens: 1) in-hospital death; 2) 28-day early readmission; 3) low patient satisfaction; 4) worsening severity of illness (as determined by an increase in the Laboratory Acute Physiology and Chronic Health Evaluation [APACHE-L] score); 5) deviations from expected hospital length of stay; and 6) deviations from expected ancillary resource use. The quality of care for a stratified random sample of admissions was evaluated using structured implicit review (inter-rater reliability, kappa = 0.5). Patients who died in-hospital were substantially more likely than those who were discharged alive to be rated as having had substandard care (30% vs. 10%; P < 0.001). In contrast, cases who had subsequent early readmissions did not have poorer quality ratings. Similarly, lower patient satisfaction was not associated with poorer ratings of technical process of care. Cases with lower-than-expected ancillary resource use (case-mix adjusted for diagnosis-related group) were more likely to be rated as having received substandard care than those with higher-than-expected resource use (16% vs. 6%; P < 0.05), and there was a similar trend for cases with shorter than expected length of stays. Associations between worsening severity of illness, as determined by APACHE-L scores, and quality were confounded because such patients were more likely to have died in-hospital. When deaths were excluded from the analysis, we found no association between increases in APACHE-L scores during the hospital course and quality of care. It was found that in-hospital death and lower-than-expected ancillary resource use were associated with poorer implicit quality ratings, but that early readmission, lower patient satisfaction, and increases in APACHE-L scores were not. Quality screens, such as early readmissions, should be critically evaluated before being used administratively.
Hayward RA, McMahon LF, Bernard AM. Evaluating the care of general medicine inpatients: How good is implicit review? Ann Intern Med. 1993;118(7):551-557.
Tested structured implicit review methodology in the peer review setting using an adaptation of the RAND approach on randomly selected charts. Found the method adequately reliable for aggregate comparisons of quality of care, but not for assessment of resource use or readiness for discharge. Remember that this was the early 90s. I would hypothesize that, given 20 years of exposure to managed care, including Interqual criteria and Milliman & Robertson guidelines, physicians today would have less problem converging on such assessments.
Objective: Peer review often consists of implicit evaluations by physician reviewers of the quality and appropriateness of care. This study evaluated the ability of implicit review to measure reliably various aspects of care on a general medicine inpatient service.
Design: Retrospective review of patients' charts, using structured implicit review, of a stratified random sample of consecutive admissions to a general medicine ward.
Setting: A university teaching hospital.
Patients: Twelve internists were trained in structured implicit review and reviewed 675 patient admissions (with 20% duplicate reviews for a total of 846 reviews).
Results: Although inter-rater reliabilities for assessments of overall quality of care and preventable deaths (kappa = 0.5) were adequate for aggregate comparisons (for example, comparing mean ratings on two hospital wards), they were inadequate for reliable evaluations of single patients using one or two reviewers. Reviewers' agreement about most focused quality problems (for example, timeliness of diagnostic evaluation and clinical readiness at time of discharge) and about the appropriateness of hospital ancillary resource use was poor (kappa < 0.2). For most focused implicit measures, bias due to specific reviewers who were systematically more harsh or lenient (particularly for evaluation of resource-use appropriateness) accounted for much of the variation in reviewers' assessments, but this was not a substantial problem for the measure of overall quality. Reviewers rarely reported being unable to evaluate the quality of care because of deficiencies in documentation in the patient's chart.
Conclusion: For assessment of overall quality and preventable deaths of general medicine inpatients, implicit review by peers had moderate degrees of reliability, but for most other specific aspects of care, physician reviewers could not agree. Implicit review was particularly unreliable at evaluating the appropriateness of hospital resource use and the patient's readiness for discharge, two areas where this type of review is often used.
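The gap between aggregate comparisons and single-case judgments can be illustrated with the Spearman-Brown formula, which projects the reliability of the mean of several independent reviews from the reliability of one review. The sketch below is not taken from the study. It assumes that the reported kappa of 0.5 can be treated as a single-review reliability coefficient, the function names are hypothetical, and the target of 0.8 is an illustrative choice only.

```python
import math


def reliability_of_mean(single_review_reliability: float, n_reviews: int) -> float:
    """Spearman-Brown projection: reliability of the mean of n independent reviews."""
    r = single_review_reliability
    return n_reviews * r / (1 + (n_reviews - 1) * r)


def reviews_needed(single_review_reliability: float, target: float) -> int:
    """Smallest number of reviews whose mean reaches the target reliability."""
    r = single_review_reliability
    # Solve target = k*r / (1 + (k - 1)*r) for k, then round up.
    k = target * (1 - r) / (r * (1 - target))
    return math.ceil(k - 1e-9)  # small tolerance guards against float round-off


if __name__ == "__main__":
    # Reported for overall quality; treated here as a reliability (assumption).
    kappa_overall = 0.5
    for n in (1, 2, 4, 8):
        print(n, round(reliability_of_mean(kappa_overall, n), 2))
    # 0.8 is an illustrative target, not a threshold taken from the study.
    print(reviews_needed(kappa_overall, 0.8))
```

Under these assumptions, two reviews per case raise the projected reliability only to about 0.67, and four reviews are needed to reach 0.80. This is consistent with the finding that one or two reviewers are not enough for judgments about a single patient.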
Goldman RL. The reliability of peer assessments: A meta-analysis. Eval Health Prof. 1994;17(1):3-21.
This analysis shows that unstructured methods have consistently poor inter-rater reliability compared to structured implicit review. A synopsis of the study was later published in JAMA, but it leaves out a lot of important detail. Use this version.
A meta-analysis of studies examining the interrater reliability of the standard practice of peer assessments of quality of care was conducted. Using the Medline, Health Planning and Administration, and SCISEARCH databases, the English-language literature from 1966 through 1991 was searched for studies of chance corrected agreement among peer reviewers. The weighted mean kappa of 21 independent findings from 13 studies was .31. Comparison of this result with widely used standards suggests that the interrater reliability of peer assessment is quite limited and needs improvement. Research needs to be directed at modifying the peer review process to improve its reliability or at identifying indexes of quality with sufficient validity and reliability that they can be employed without subsequent peer review.
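For readers less familiar with chance-corrected agreement, the following is a minimal sketch of Cohen's kappa for two reviewers rating the same cases. The ratings are invented for illustration and do not come from any study summarized here. The studies pooled in the meta-analysis may have used weighted kappa or other variants, which this sketch does not cover.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Unweighted Cohen's kappa for two raters over the same cases."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both raters must rate the same non-empty set of cases")
    n = len(ratings_a)
    # Observed proportion of cases on which the two raters agree.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance, from each rater's marginal frequencies.
    count_a, count_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1:
        raise ValueError("kappa is undefined when both raters use a single category")
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Hypothetical ratings: 1 = substandard care, 0 = acceptable care.
    reviewer_a = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
    reviewer_b = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
    print(round(cohens_kappa(reviewer_a, reviewer_b), 2))
```

In this invented example the reviewers agree on 8 of 10 cases, yet kappa is only about 0.52, because 58% agreement would be expected by chance given how rarely either reviewer rates care as substandard.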
Goldman RL, Ciesco E. Improving peer review: alternatives to unstructured judgments by a single reviewer. Jt Comm J Qual Improv. 1996;22(11):762-769.
Highlights findings from Goldman’s meta-analysis regarding the superiority of structured vs. unstructured judgments of quality. Notes the potential for outcome bias and the need for reviewer training. Also reports an informal survey of 16 VA hospitals with a peer review committee, which showed an average of 6 members per committee, 7 reviews per meeting, 10.7 minutes per case, and good general satisfaction with the process. Contrasts this kind of process with independent review procedures and formal discussion-to-consensus methods.
BACKGROUND: Peer review, usually involving unstructured judgments by a single peer reviewing medical records, was the backbone of quality management in health care organizations until recent years. Although other approaches to quality management, such as continuous quality improvement, are now being widely adopted, peer review can still play a role in identifying dysfunctional organizational processes and individuals delivering poor care. Given the questionable reliability and validity of peer review as usually practiced, however, a more rigorous and thoughtful approach is needed, with consideration of issues such as the number of peers involved, the nature of their interaction while forming judgments, and the type of assessment instruments they use.
MULTIPLE-REVIEWER PROCEDURES: Such procedures include the use of two or more reviewers making their assessments independently, discussion to consensus among reviewers, and committee review.
STRUCTURED ASSESSMENT INSTRUMENTS: A number of studies suggest that instruments that carefully guide the reviewer have higher interrater reliability than less structured instruments.
IDENTIFICATION OF SYSTEM ISSUES: The focus of peer review can be expanded beyond individual practitioner performance to include assessment of the operational environment in which the clinician functions. For example, one might ask whether any changes in hospital policies or procedures or in administrative or computerized support for clinicians might have improved the care received by a patient.
CONCLUSION: The available data and the known limitations of unstructured judgments by a single reviewer justify serious consideration of multiple reviewer procedures and structured assessment instruments, particularly for reviews that have major consequences for patients and practitioners.
Wakefield DS, Helms CM. The role of peer review in a health care organization driven by TQM/CQI. Jt Comm J Qual Improv. 1995;21(5):227-231.
A theoretical treatment of why and how peer review should be better integrated with organizational quality improvement.
BACKGROUND: Many health care organizations have embraced the philosophy and tools of total quality management (TQM) and continuous quality improvement (CQI) without overt linkage to existing peer review processes. Achieving total quality in an organization requires that both peer review and TQM/CQI improvement processes be effectively used.
EXAMPLES: Three ways of linking peer review and TQM/CQI include: 1) coordinating TQM/CQI and peer review quality improvement initiatives whenever possible; 2) expanding the focus of peer review to include assessment of the processes and systems within which the clinician functions; and 3) linking peer review and TQM/CQI improvement processes to address behavioral and attitudinal issues having economic roots.
Smith MA, Atherly AJ, Kane RL, Pacala JT. Peer review of the quality of care: reliability and sources of variability for outcome and process assessments. JAMA. 1997;278(19):1573-1578.
Adapted the RAND methodology to evaluate care across the continuum during a 1-year period for both process and overall quality. This is a very different evaluative challenge than the typical event-driven hospital inpatient case review. Although training was comparable to the RAND specifications, the study found considerable variation in ratings between nurse practitioner and physician reviewers, and among physicians, for process measures, which in general had low reliability. Inter-rater reliability for the overall measure of quality was, however, consistent with other studies. The study raises a cautionary note for those pioneers who might seek to apply methodologies to novel situations without testing.
CONTEXT: Peer assessments have traditionally been used to judge the quality of care, but a major drawback has been poor interrater reliability.
OBJECTIVES: To compare the interrater reliability for outcome and process assessments in a population of frail older adults and to identify systematic sources of variability that contribute to poor reliability.
SETTING: Eight sites participating in a managed care program that integrates acute and long-term care for frail older adults.
PATIENTS: A total of 313 frail older adults.
DESIGN: Retrospective review of the medical record with 180 charts randomly assigned to 2 geriatricians, 2 geriatric nurse practitioners, or 1 geriatrician and 1 geriatric nurse practitioner and 133 charts randomly assigned to either a geriatrician or a geriatric nurse practitioner.
MAIN OUTCOME MEASURES: Interrater reliabilities for structured implicit judgments about process and outcomes for overall care and care for each of 8 tracer conditions (eg, arthritis).
RESULTS: Outcome measures had higher interrater reliability than process measures. Five outcome measures achieved fair to good reliability (more than 0.40), while none of the process measures achieved reliabilities more than 0.40. Three factors contributed to poorer reliabilities for process measures: (1) an inability of reviewers to differentiate among cases with respect to the quality of management, (2) systematic bias from individual reviewers, and (3) systematic bias related to the professional training of the reviewer (ie, physician or nurse practitioner).
CONCLUSIONS: Peer assessments can play an important role in characterizing the quality of care for complex patients with multiple interrelated chronic conditions, but reliability can be poor. Strategies to achieve adequate reliability for these assessments should be applied. These strategies include emphasizing outcomes measurement, providing more structured assessments to identify true differences in patient management, adjusting systematic bias resulting from the individual reviewer and their professional background, and averaging scores from multiple reviewers. Future research on the reliability of peer assessments should focus on improving the ability of process measures to differentiate among cases with respect to the quality of management and on identifying additional sources of systematic bias for both process and outcome measures. Explicit recognition of factors influencing reliability will strengthen efforts to develop sound measures for quality assurance.
Hofer TP, Bernstein SJ, DeMonner S, Hayward RA. Discussion between reviewers does not improve reliability of peer review of hospital quality. Med Care. 2000;38(2):152-161.
A detailed statistical analysis of inter-rater reliability for evaluation of causation and overall quality based on cases of severe adverse events. A methodological study, not an actual review program assessment. Found that discussion between pairs of reviewers improved the consistency of scoring only between those 2, and not across pairs. More importantly, the process of working together appeared to make the reviewer pairs more consistent with each other before discussion. Also the mean scores changed little with discussion.
OBJECTIVES: Peer review is used to make final judgments about quality of care in many quality assurance activities. To overcome the low reliability of peer review, discussion between several reviewers is often recommended to point out overlooked information or allow for reconsideration of opinions and thus improve reliability. The authors assessed the impact of discussion between 2 reviewers on the reliability of peer review.
METHODS: A group of 13 board-certified physicians completed a total of 741 structured implicit record reviews of 95 records for patients who experienced severe adverse events related to laboratory abnormalities while in the hospital (hypokalemia, hyperkalemia, renal failure, hyponatremia, and digoxin toxicity). They independently assessed the degree to which each adverse event was caused by medical care and the quality of the care leading up to the adverse event. Working in pairs, they then discussed differences of opinion, clarified factual discrepancies, and rerated the record. The authors compared the reliability of each measure before and after discussion, and between and within pairs of reviewers, using the intraclass correlation coefficient for continuous ratings and the kappa statistic for a dichotomized rating.
RESULTS: The assessment of whether the laboratory abnormality was iatrogenic had a reliability of 0.46 before discussion and 0.71 after discussion between paired reviewers, indicating considerably improved agreement between the members of a pair. However, across reviewer pairs, the reviewer reliability was 0.36 before discussion and 0.40 after discussion. Similarly, for the rating of overall quality of care, reliability of physician review went from 0.35 before discussion to 0.58 after discussion as assessed by pair. However, across pairs the reliability increased only from 0.14 to 0.17. Even for prediscussion ratings, reliability was substantially higher between 2 members of a pair than across pairs, suggesting that reviewers who work in pairs learn to be more consistent with each other even before discussion, but this consistency also did not improve overall reliability across pairs.
CONCLUSIONS: When 2 physicians discuss a record that they are reviewing, it substantially improves the agreement between those 2 physicians. However, this improvement is illusory, as discussion does not improve the overall reliability as assessed by examining the reliability between physicians who were part of different discussions. This finding may also have implications with regard to how disagreements are resolved on consensus panels, guideline committees, and reviews of literature quality for meta-analyses.
Epstein RM, Hundert EM. Defining and assessing professional competence. JAMA. 2002;287(2):226-235.
A good review and synthesis of the literature. While primarily targeting the evaluation of medical students, the findings are applicable to the ongoing assessment of all physicians.
CONTEXT: Current assessment formats for physicians and trainees reliably test core knowledge and basic skills. However, they may underemphasize some important domains of professional medical practice, including interpersonal skills, lifelong learning, professionalism, and integration of core knowledge into clinical practice.
OBJECTIVES: To propose a definition of professional competence, to review current means for assessing it, and to suggest new approaches to assessment.
DATA SOURCES: We searched the MEDLINE database from 1966 to 2001 and reference lists of relevant articles for English-language studies of reliability or validity of measures of competence of physicians, medical students, and residents.
STUDY SELECTION: We excluded articles of a purely descriptive nature, duplicate reports, reviews, and opinions and position statements, which yielded 195 relevant citations.
DATA EXTRACTION: Data were abstracted by 1 of us (R.M.E.). Quality criteria for inclusion were broad, given the heterogeneity of interventions, complexity of outcome measures, and paucity of randomized or longitudinal study designs.
DATA SYNTHESIS: We generated an inclusive definition of competence: the habitual and judicious use of communication, knowledge, technical skills, clinical reasoning, emotions, values, and reflection in daily practice for the benefit of the individual and the community being served. Aside from protecting the public and limiting access to advanced training, assessments should foster habits of learning and self-reflection and drive institutional change. Subjective, multiple-choice, and standardized patient assessments, although reliable, underemphasize important domains of professional competence: integration of knowledge and skills, context of care, information management, teamwork, health systems, and patient-physician relationships. Few assessments observe trainees in real-life situations, incorporate the perspectives of peers and patients, or use measures that predict clinical outcomes.
CONCLUSIONS: In addition to assessments of basic skills, new formats that assess clinical reasoning, expert judgment, management of ambiguity, professionalism, time management, learning strategies, and teamwork promise a multidimensional assessment while maintaining adequate reliability and validity. Institutional support, reflection, and mentoring must accompany the development of assessment programs.
Streiner DL, Norman GR. Health Measurement Scales: A Practical Guide to their Development and Use. 3rd ed. New York, NY: Oxford University Press; 2003.
This well-written text makes measurement methodology accessible to clinicians. A great resource for clinical performance measurement development.
Lockyer J. Multisource feedback in the assessment of physician competencies. J Contin Educ Health Prof. 2003;23(1):4-12.
Multisource feedback is not yet widely used in the context of either peer review or credentialing, but one of our Showcase articles gives an example. Here is a good reference article on the subject.
Multisource feedback (MSF), or 360-degree employee evaluation, is a questionnaire-based assessment method in which ratees are evaluated by peers, patients, and coworkers on key performance behaviors. Although widely used in industrial settings to assess performance, the method is gaining acceptance as a quality improvement method in health systems. This article describes MSF, identifies the key aspects of MSF program design, summarizes some of the salient empirical research in medicine, and discusses possible limitations for MSF as an assessment tool in health care. In industry and in health care, experience suggests that MSF is most likely to succeed and result in changes in performance when attention is paid to structural and psychometric aspects of program design and implementation. A carefully selected steering committee ensures that the behaviors examined are appropriate, the communication package is clear, and the threats posed to individuals are minimized. The instruments that are developed must be tested to ensure that they are reliable, achieve a generalizability coefficient of Eρ² = .70, have face and content validity, and examine variance in performance ratings to understand whether ratings are attributable to how the physician performs and not to factors beyond the physician's control (e.g., gender, age, or setting). Research shows that reliable data can be generated with a reasonable number of respondents, and physicians will use the feedback to contemplate and initiate changes in practice. Performance may be affected by familiarity between rater and ratee and sociodemographic and continuing medical education characteristics; however, little of the variance in performance is explained by factors outside the physician's control. MSF is not a replacement for audit when clinical outcomes need to be assessed. However, when interpersonal, communication, professionalism, or teamwork behaviors need to be assessed and guidance given, it is one of the better tools that may be adopted and implemented to provide feedback and guide performance.
Hofer TP, Asch SM, Hayward RA, Rubenstein LV, Hogan MM, Adams J, Kerr EA. Profiling quality of care: Is there a role for peer review? BMC Health Serv Res. 2004;4(1):9.
A well-designed study looking globally at the quality of care received by patients over a 12-month period, covering both inpatient and outpatient records. Offers a thorough analysis of the sources of variance in assessing longitudinal quality of care for patients with any of 4 defined conditions in the VA healthcare system using structured implicit review. The results are not directly applicable to the evaluation of physician performance through case-based reviews, but the method of analysis is highly relevant.
BACKGROUND: We sought to develop a more reliable structured implicit chart review instrument for use in assessing the quality of care for chronic disease and to examine if ratings are more reliable for conditions in which the evidence base for practice is more developed.
METHODS: We conducted a reliability study in a cohort with patient records including both outpatient and inpatient care as the objects of measurement. We developed a structured implicit review instrument to assess the quality of care over one year of treatment. 12 reviewers conducted a total of 496 reviews of 70 patient records selected from 26 VA clinical sites in two regions of the country. Each patient had between one and four conditions specified as having a highly developed evidence base (diabetes and hypertension) or a less developed evidence base (chronic obstructive pulmonary disease or a collection of acute conditions). Multilevel analysis that accounts for the nested and cross-classified structure of the data was used to estimate the signal and noise components of the measurement of quality and the reliability of implicit review.
RESULTS: For COPD and a collection of acute conditions the reliability of a single physician review was quite low (intra-class correlation = 0.16-0.26) but comparable to most previously published estimates for the use of this method in inpatient settings. However, for diabetes and hypertension the reliability is significantly higher at 0.46. The higher reliability is a result of the reviewers collectively being able to distinguish more differences in the quality of care between patients (p < 0.007) and not due to less random noise or individual reviewer bias in the measurement. For these conditions the level of true quality (i.e. the rating of quality of care that would result from the full population of physician reviewers reviewing a record) varied from poor to good across patients.
CONCLUSIONS: For conditions with a well-developed quality of care evidence base, such as hypertension and diabetes, a single structured implicit review to assess the quality of care over a period of time is moderately reliable. This method could be a reasonable complement or alternative to explicit indicator approaches for assessing and comparing quality of care. Structured implicit review, like explicit quality measures, must be used more cautiously for illnesses for which the evidence base is less well developed, such as COPD and acute, short-course illnesses.
Jamtvedt G, Young JM, Kristoffersen DT, O'Brien MA, Oxman AD. Does telling people what they have been doing change what they do? A systematic review of the effects of audit and feedback. Qual Saf Health Care. 2006;15(6):433-436.
Clinical audit has remained the standard peer review process outside the USA. This study shows that audit and feedback can be effective, although the impact may be relatively small.
BACKGROUND: Many people advocate audit and feedback as a strategy for improving professional practice. The main results of an update of a Cochrane review on the effects of audit and feedback are reported.
DATA SOURCES: The Cochrane Effective Practice and Organisation of Care Group's register up to January 2004 was searched. Randomised trials of audit and feedback that reported objectively measured professional practice in a healthcare setting or healthcare outcomes were included.
REVIEW METHODS: Data were independently extracted and the quality of studies was assessed by two reviewers. Quantitative, visual and qualitative analyses were undertaken.
MAIN RESULTS: 118 trials are included in the review. In the primary analysis, 88 comparisons from 72 studies were included that compared any intervention in which audit and feedback was a component to no intervention. For dichotomous outcomes, the median-adjusted risk difference of compliance with desired practice was 5% (interquartile range 3-11). For continuous outcomes, the median-adjusted percentage change relative to control was 16% (interquartile range 5-37). Low baseline compliance with recommended practice and higher intensity of audit and feedback appeared to predict the effectiveness of audit and feedback.
CONCLUSIONS: Audit and feedback can be effective in improving professional practice. The effects are generally small to moderate. The absolute effects of audit and feedback are likely to be larger when baseline adherence to recommended practice is low and intensity of audit and feedback is high.
Dharmar M, Marcin JP, Kuppermann N, Andrada ER, Cole S, Harvey DJ, Romano PS. A new implicit review instrument for measuring quality of care delivered to pediatric patients in the emergency department. BMC Emerg Med. 2007;7:13.
While this study was designed to assess quality of care in rural EDs, the review instrument has general applicability. Readers should use caution in interpreting the intra-class correlation coefficients reported for the scale items. They are for the average of 2 independent reviews. As with other forms of peer review assessment, they are good for aggregate comparisons. They are not sufficiently reliable for judgment of a single case by one reviewer.
Background: There are few outcomes experienced by children receiving care in the Emergency Department (ED) that are amenable to measurement for the purpose of assessing quality of care. The purpose of this study was to develop, test, and validate a new implicit review instrument that measures quality of care delivered to children in EDs.
Methods: We developed a 7-point structured implicit review instrument that encompasses four aspects of care, including the physician's initial data gathering, integration of information and development of appropriate diagnoses; initial treatment plan and orders; and plan for disposition and follow-up. Two pediatric emergency medicine physicians applied the 5-item instrument to children presenting in the highest triage category to four rural EDs, and we assessed the reliability of the average summary scores (possible range of 5–35) across the two reviewers using standard measures. We also validated the instrument by comparing this mean summary score between those with and without medication errors (ascertained independently by two pharmacists) using a two-sample t-test.
Results: We reviewed the medical records of 178 pediatric patients for the study. The mean and median summary scores for this cohort of patients were 27.4 and 28.5, respectively. Internal consistency was high (Cronbach's alpha of 0.92 and 0.89). All items showed a significant (p < 0.005) positive correlation between reviewers using the Spearman rank correlation (range 0.24 to 0.39). Exact agreement on individual items between reviewers ranged from 70.2% to 85.4%. The Intra-class Correlation Coefficient for the mean of the total summary score across the two reviewers was 0.65. The validity of the instrument was supported by the finding of a higher score for children without medication errors compared to those with medication errors, a difference that trended toward significance (mean score = 28.5 vs. 26.0, p = 0.076).
Conclusion: The instrument we developed to measure quality of care provided to children in the ED has high internal consistency, fair to good inter-rater reliability and inter-rater correlation, and high content validity. The validity of the instrument is supported by the fact that the instrument's average summary score was lower in the presence of medication errors, which trended towards statistical significance.
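The caution about the reported intra-class correlation can be made concrete by stepping the two-reviewer figure down to a single review with the inverse of the Spearman-Brown formula. This is a rough sketch rather than an analysis from the paper. It assumes the reported 0.65 is an average-measures coefficient for two reviewers to which Spearman-Brown applies, and the function name is hypothetical.

```python
def single_review_reliability(mean_reliability: float, n_reviews: int) -> float:
    """Invert Spearman-Brown: reliability of one review, given the
    reliability of the mean of n independent reviews."""
    r_mean = mean_reliability
    return r_mean / (n_reviews - (n_reviews - 1) * r_mean)


if __name__ == "__main__":
    # 0.65 is the ICC reported for the mean summary score across two reviewers.
    icc_mean_of_two = 0.65
    print(round(single_review_reliability(icc_mean_of_two, 2), 2))
```

Under that assumption, the implied reliability of one reviewer's summary score is about 0.48, in line with the single-review estimates reported elsewhere in this bibliography.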