What is Cohen's Kappa?

Cohen's Kappa is a numerical measure of inter-rater reliability designed for situations where two annotators independently categorize the same items. It accounts for the likelihood of chance agreement between the annotators and provides a chance-corrected assessment of their reliability.

Cohen's Kappa reliability and validity

Cohen's Kappa coefficient measures the reliability of agreement between annotators on categorical data. However, it does not guarantee the validity of the measurements, as it does not evaluate their accuracy or correctness: annotators can agree on incorrect annotations. Validity, on the other hand, is about how well the measurements actually capture what they are supposed to measure.

For instance, consider a situation where two annotators, Alice and Bob, are asked to judge a person's emotions, and the person also reports their actual emotions. The collected results are as follows.

Annotators assessing emotions of individuals

As seen from the diagram, Alice and Bob have 100% agreement in their assessments, so their ratings are perfectly reliable. However, their answers are not valid, since the assessed person reports the opposite emotions. Thus, while Cohen's Kappa assesses reliability, validity ensures the accuracy and meaningfulness of the measurements.

Cohen's Kappa mathematically

Cohen's Kappa is calculated by comparing the observed agreement between two annotators with the agreement expected by chance. The formula for calculating Cohen's Kappa is as follows.
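κ = (P_o - P_e) / (1 - P_e)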

  • P_o represents the observed agreement between the raters, i.e., the proportion of cases in which they agree with each other in their classifications or judgments, without considering what would be expected by chance.

  • P_e represents the agreement expected by chance, i.e., the amount of agreement that would occur by random chance alone, if the annotators' judgments were purely random and unrelated.

We can tabulate the results of the annotators' decisions for this case, where each annotator chooses between "Yes" and "No" for every instance.

The cells of the resulting 2×2 table are defined as follows, and the layout is sketched after the list.

  • a: The number of instances where both annotators agreed on "Yes".

  • b: The number of instances where annotator A said "Yes" but annotator B said "No".

  • c: The number of instances where annotator A said "No" but annotator B said "Yes".

  • d: The number of instances where both annotators agreed on "No".
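Based on these definitions, the table can be laid out with annotator A's decisions as rows and annotator B's decisions as columns:

           B: Yes    B: No
A: Yes        a         b
A: No         c         d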

Calculating P_o (observed agreement)

To calculate P_o, you divide the number of agreements by the total number of observations.
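In terms of the table cells, with N = a + b + c + d total instances:

P_o = (a + d) / N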

Calculating P_e (chance agreement)

To calculate P_e, you determine the expected agreement for each category based on chance. In this case, since there are two categories ("Yes" and "No"), you calculate the chance agreement for each category from the annotators' marginal proportions. Then, you sum these expected agreement values to determine P_e.
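Using the marginal totals of the table (again with N = a + b + c + d), the chance agreement for each category is the product of the two annotators' marginal proportions:

P_Yes = ((a + b) / N) × ((a + c) / N)
P_No = ((c + d) / N) × ((b + d) / N)
P_e = P_Yes + P_No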

Significance of Cohen's Kappa value

To understand the significance of the κ (kappa) value, we can examine the interpretation presented in the following table.

Value          Significance
< 0            No agreement
0 – 0.20       Slight
0.21 – 0.40    Fair
0.41 – 0.60    Moderate
0.61 – 0.80    Substantial
0.81 – 1.0     Almost perfect

The coefficient ranges from -1 to 1, with higher values indicating agreement beyond chance, values close to 0 suggesting chance-level agreement, and values near -1 indicating disagreement beyond chance.
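To make the calculation concrete, the following is a minimal Python sketch that computes Cohen's Kappa from the four cell counts a, b, c, and d defined earlier. The counts in the example call are hypothetical.

```python
def cohens_kappa(a, b, c, d):
    """Compute Cohen's Kappa from a 2x2 agreement table.

    a: both annotators said "Yes"
    b: annotator A said "Yes", annotator B said "No"
    c: annotator A said "No", annotator B said "Yes"
    d: both annotators said "No"
    """
    n = a + b + c + d
    p_o = (a + d) / n                       # observed agreement
    p_yes = ((a + b) / n) * ((a + c) / n)   # chance agreement on "Yes"
    p_no = ((c + d) / n) * ((b + d) / n)    # chance agreement on "No"
    p_e = p_yes + p_no                      # total chance agreement
    return (p_o - p_e) / (1 - p_e)

# Hypothetical counts, for illustration only
print(round(cohens_kappa(a=20, b=5, c=5, d=10), 2))  # 0.47 -> moderate agreement
```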

Example

To illustrate the calculation of Cohen's Kappa, let's work through an example. Imagine a situation where two instructors independently labeled each student as "good" or "bad" based on their behavior. This assessment was performed on a group of 50 students, with both instructors evaluating every student.

The results are tabulated as follows.

Results of two annotators annotating 50 students as bad or good

To calculate Cohen's Kappa, we need P_o and P_e.

We will first calculate P_o using the observed agreement formula given above.

Next, we will calculate P_e (the expected agreement by chance). To do this, we first need to determine P_Good and P_Bad.

Now we can finally calculate Cohen's Kappa, as we have both P_e and P_o.

Therefore, the calculated κ = 0.40 represents fair reliability, or agreement, between the two instructors.
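In practice, you would usually compute κ directly from the two annotators' raw labels rather than by hand. The following is a minimal sketch using scikit-learn's cohen_kappa_score; the labels below are hypothetical and are not the actual data from the example above.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical labels from the two instructors for six students
instructor_a = ["good", "good", "bad", "good", "bad", "bad"]
instructor_b = ["good", "bad", "bad", "good", "good", "bad"]

kappa = cohen_kappa_score(instructor_a, instructor_b)
print(round(kappa, 2))  # 0.33 for these hypothetical labels
```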

Summary

We have learned to differentiate between reliability and validity, explain Cohen's Kappa, and interpret its value. This statistical measure is highly useful: it is an effective tool for assessing performance in scenarios involving multi-class classification and imbalanced class distributions.
