Original articles
Bias and prevalence effects on kappa viewed in terms of sensitivity and specificity

https://doi.org/10.1016/S0895-4356(99)00174-2

Abstract

Paradoxical effects of bias and prevalence on the kappa coefficient are examined using the concepts of sensitivity and specificity. Results that appear paradoxical when viewed as a 2 × 2 table of frequencies do not appear paradoxical when viewed as a pair of sensitivity and specificity measures where each observer is treated as a predictor of the other observer. An adjusted kappa value can be obtained from these sensitivity/specificity measures but simulation studies indicate that it would result in substantial overestimation of reliability when bias or prevalence effects are observed. It is suggested that investigators concentrate on obtaining populations with trait prevalence near 50% rather than searching for statistical indices to rescue or excuse inefficient experiments.

Introduction

The kappa coefficient, first described by Cohen in 1960 [1], is commonly used to assess agreement between two or more observers. In its simplest form, it is used to analyze the case where two observers classify a number of cases according to whether some finding (e.g., a tumor or other disease process) is present or absent. The results may be arranged in the four cells of a 2 × 2 table as shown below.

                          Observer #2: present    Observer #2: absent
Observer #1: present      a                       c
Observer #1: absent       b                       d

Cell a indicates the proportion of cases where both observers agree that the finding is present. Cell b indicates the proportion of cases where observer #2 claims that the finding is present while observer #1 claims that it is not. Cell c indicates the alternative form of disagreement (where observer #1 claims that the finding is present and observer #2 claims that it is not) and cell d indicates the proportion of cases where both observers agree that the finding is absent. When these are expressed as proportions, a + b + c + d = 1.0.

The kappa coefficient (κ) is obtained by calculating the proportion of observed agreements (p_o) and expected agreements (p_e) as follows:

p_o = a + d
p_e = (a + b)(a + c) + (b + d)(c + d)
κ = (p_o − p_e) / (1 − p_e)
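As a concrete illustration of these formulas, here is a minimal Python sketch; the cell proportions in the example call are illustrative and are not taken from any table in this article.

def cohen_kappa(a: float, b: float, c: float, d: float) -> float:
    """Cohen's kappa for a 2 x 2 agreement table given as proportions (a + b + c + d = 1)."""
    p_o = a + d                                   # observed agreement
    p_e = (a + b) * (a + c) + (b + d) * (c + d)   # chance-expected agreement
    return (p_o - p_e) / (1.0 - p_e)

# Illustrative cells: 40% agree "present", 45% agree "absent", 15% split disagreements
print(cohen_kappa(0.40, 0.08, 0.07, 0.45))        # roughly 0.70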

Although the kappa coefficient is quite simple and is widely used to test interrater agreement, a number of authors have described seeming paradoxes associated with the effects of marginal proportions [2–8]. Feinstein and Cicchetti [3] and Byrt et al. [6] have presented the examples reproduced in Table 1.

The first paradox is shown in Examples 1 and 2. Both examples have an identical proportion of agreements (0.85) but the kappa coefficient is substantially lower in Example 2. This effect is attributed to the higher prevalence seen in that example. The second paradox is shown in Examples 3 and 4. Again, the proportion of agreements is the same (0.60) but the kappa coefficient calculated for Example 4 is higher, apparently because that example shows a greater discrepancy between the two observers in marginal proportions.
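The sketch below reproduces the flavor of the first paradox with two hypothetical tables (not the actual Table 1 entries): both have an agreement proportion of 0.85, but the high-prevalence table yields a much lower kappa.

def kappa(a, b, c, d):
    p_o = a + d
    p_e = (a + b) * (a + c) + (b + d) * (c + d)
    return (p_o - p_e) / (1.0 - p_e)

balanced  = (0.45, 0.075, 0.075, 0.40)   # prevalence near 50%
high_prev = (0.80, 0.075, 0.075, 0.05)   # most cases judged positive by both observers

print(kappa(*balanced))    # about 0.70
print(kappa(*high_prev))   # about 0.31, despite the identical proportion of agreement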

Examples presented by Lantz and Nebenzahl [7] are shown in Table 2. In all three examples, the two observers agree on 90% of the cases, but the kappa coefficients are quite different.

These problems have been explained in terms of bias and prevalence effects. Prevalence effects occur when the overall proportion of positive results is substantially different from 50%. Bias effects occur when the two observers differ on the proportion of positive results. Byrt et al. [6] and Lantz and Nebenzahl [7] have independently examined an adjustment of kappa for prevalence and bias. Their equation for Prevalence and Bias Adjusted Kappa (PABAK) can be presented as:

PABAK = 2p_o − 1 = κ(1 − PI² + BI²) + PI² − BI²

This equation uses a bias index (BI) and a prevalence index (PI) defined as:

BI = b − c
PI = a − d

Use of the bias index is equivalent to replacing cells b and c by their average ([b + c]/2) while use of the prevalence index is equivalent to replacing cells a and d by their average ([a + d]/2) and calculating kappa in the usual fashion.
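A short sketch, again with assumed cell values, illustrates this equivalence: PABAK can be obtained directly as 2p_o − 1, as ordinary kappa computed on the table with b and c (and a and d) replaced by their averages, or from the identity relating it to κ, PI, and BI.

def kappa(a, b, c, d):
    p_o = a + d
    p_e = (a + b) * (a + c) + (b + d) * (c + d)
    return (p_o - p_e) / (1.0 - p_e)

a, b, c, d = 0.80, 0.10, 0.05, 0.05        # assumed proportions for illustration
BI = b - c                                  # bias index
PI = a - d                                  # prevalence index

pabak_direct   = 2 * (a + d) - 1
pabak_averaged = kappa((a + d) / 2, (b + c) / 2, (b + c) / 2, (a + d) / 2)
pabak_identity = kappa(a, b, c, d) * (1 - PI**2 + BI**2) + PI**2 - BI**2

print(pabak_direct, pabak_averaged, pabak_identity)   # all three give 0.70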


Analyses based on sensitivity and specificity

Claims of “paradoxical” effects on kappa often note discrepancies between the kappa coefficient and the total proportion of agreements. However, some reflection on these measures tells us that the raw proportion of agreements can be a deceptive measure. Consider a study where all cases are classified as positive; the proportion of cases in cell a is 1.00 and the proportion of cases in all other cells is 0.0. We have the highest possible proportion of agreements, but the study has told us nothing about the observers' ability to discriminate between the presence and absence of the finding.
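As an illustration of this view (with assumed cell proportions, not values from the article), each observer can be scored as a predictor of the other, giving a sensitivity/specificity pair in each direction:

a, b, c, d = 0.80, 0.10, 0.05, 0.05   # assumed proportions for illustration

# Observer #2 as a "predictor" of observer #1's classification
sens_2_given_1 = a / (a + c)   # fraction of observer #1's positives that observer #2 also calls positive
spec_2_given_1 = d / (b + d)   # fraction of observer #1's negatives that observer #2 also calls negative

# Observer #1 as a "predictor" of observer #2's classification
sens_1_given_2 = a / (a + b)
spec_1_given_2 = d / (c + d)

print(sens_2_given_1, spec_2_given_1)   # about 0.94 and 0.33
print(sens_1_given_2, spec_1_given_2)   # about 0.89 and 0.50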

Simulation studies of bias and prevalence effects on kappa

In order to examine bias and prevalence effects on the ordinary and adjusted kappa score, a series of simulation studies was performed. For each situation, a total of 10,000 simulations of a 100-patient study were performed. For the “balanced” simulations, both observers had a 50% chance of a positive finding while for the “prevalence effect” simulations both observers had a 90% chance of a positive finding. For the “bias effect” simulations, observer #1 had a 75% chance of a positive finding
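The snippet above is truncated and does not give the full data-generating mechanism, so the sketch below reproduces only the stated structure (10,000 runs of 100 patients with the stated probabilities of a positive finding) under the simplifying assumption that the two observers rate independently; it is not the paper's actual simulation code, and the bias-condition rate for observer #2 is assumed.

import random

def mean_simulated_kappa(p1: float, p2: float, n_patients: int = 100, n_runs: int = 10_000) -> float:
    """Average kappa over simulated studies; p1 and p2 are each observer's probability of a positive finding."""
    kappas = []
    for _ in range(n_runs):
        a = b = c = d = 0
        for _ in range(n_patients):
            r1 = random.random() < p1   # observer #1's rating
            r2 = random.random() < p2   # observer #2's rating
            if r1 and r2:
                a += 1
            elif r2 and not r1:
                b += 1
            elif r1 and not r2:
                c += 1
            else:
                d += 1
        p_o = (a + d) / n_patients
        p_e = ((a + b) * (a + c) + (b + d) * (c + d)) / n_patients ** 2
        if p_e < 1.0:
            kappas.append((p_o - p_e) / (1.0 - p_e))
    return sum(kappas) / len(kappas)

print(mean_simulated_kappa(0.50, 0.50))   # "balanced" condition
print(mean_simulated_kappa(0.90, 0.90))   # "prevalence effect" condition
print(mean_simulated_kappa(0.75, 0.50))   # "bias effect": observer #1 at 75%; observer #2's rate is cut off in the snippet and assumed here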

Discussion

There has been a substantial amount of investigation involving problems and “paradoxes” of the kappa coefficient. Fig. 1 suggests that these paradoxes may be more apparent than real. When investigators examine a 2 × 2 table of frequency counts they will, from long experience, tend to look at the diagonal cells. If one diagonal has a substantially greater frequency than the other, a highly significant result will be expected, and a relatively low kappa coefficient may, therefore, come as a surprise.

