Cohen's Kappa is a statistical measure of reliability designed for situations where two annotators evaluate the same items. It accounts for the likelihood of chance agreement between the annotators and therefore provides a chance-corrected assessment of their reliability.
The Cohen's Kappa coefficient is a dependable way to assess agreement between annotators on categorical data. However, it does not guarantee the validity of the measurements, because it does not evaluate their accuracy or correctness: annotators can agree on incorrect annotations. Validity, on the other hand, is about how well the measurements actually capture what they are supposed to measure.
For instance, consider a situation where two annotators, Alice and Bob, are asked to judge a person's emotions, while the person also reports their actual emotions. The collected results are as follows.
As the diagram shows, Alice and Bob agree 100% in their assessments, so their annotations are reliable. However, their answers are not valid, because the assessed individual reports the opposite emotions. Thus, while Cohen's Kappa assesses reliability, validity ensures the accuracy and meaningfulness of the measurements.
Cohen's Kappa is calculated by comparing the observed agreement between two annotators with the agreement expected by chance. The formula for calculating Cohen's Kappa is as follows.
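$$\kappa = \frac{p_o - p_e}{1 - p_e}$$

Here, $p_o$ is the observed agreement, i.e., the proportion of instances on which the two annotators agree, and $p_e$ is the agreement expected by chance, based on how often each annotator uses each label.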
We can tabulate the results of the annotators' decisions for the case where each annotator chooses either "Yes" or "No" for every instance.
The tabulated format is as follows.
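|                  | Annotator B: Yes | Annotator B: No |
| ---------------- | ---------------- | --------------- |
| Annotator A: Yes | a                | b               |
| Annotator A: No  | c                | d               |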
a: Instances where both annotators agreed with "Yes."
b: Instances where annotator A said "Yes" while annotator B said "No."
c: Instances where annotator A said "No" while annotator B said "Yes."
d: Instances where both annotators agreed with "No."
To calculate the observed agreement, $p_o$, we take the proportion of instances on which the two annotators agree:

$$p_o = \frac{a + d}{a + b + c + d}$$

To calculate the expected agreement, $p_e$, we add the probability that both annotators would say "Yes" by chance to the probability that both would say "No" by chance:

$$p_e = \frac{(a + b)(a + c) + (c + d)(b + d)}{(a + b + c + d)^2}$$
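To make these steps concrete, here is a minimal Python sketch that computes $p_o$, $p_e$, and $\kappa$ from the four counts; the function name is our own, and the code is only an illustration of the formulas above.

```python
def cohens_kappa(a, b, c, d):
    """Cohen's Kappa from the four cells of a 2x2 agreement table.

    a: both annotators said "Yes"
    b: annotator A said "Yes", annotator B said "No"
    c: annotator A said "No", annotator B said "Yes"
    d: both annotators said "No"
    """
    n = a + b + c + d
    p_o = (a + d) / n                                       # observed agreement
    p_e = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2  # chance agreement
    return (p_o - p_e) / (1 - p_e)
```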
To understand the significance of the resulting Kappa value, we can refer to the following interpretation scale.
| Value       | Significance   |
| ----------- | -------------- |
| < 0         | No agreement   |
| 0 – 0.20    | Slight         |
| 0.21 – 0.40 | Fair           |
| 0.41 – 0.60 | Moderate       |
| 0.61 – 0.80 | Substantial    |
| 0.81 – 1.0  | Almost perfect |
The coefficient ranges from -1 to 1, with higher values indicating agreement beyond chance, values close to 0 suggesting chance-level agreement, and values near -1 indicating disagreement beyond chance.
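As a small convenience, the scale above can be encoded in a helper function; this is a sketch of our own, not part of any standard library.

```python
def interpret_kappa(kappa):
    """Map a Kappa value to the interpretation bands listed above."""
    if kappa < 0:
        return "No agreement"
    if kappa <= 0.20:
        return "Slight"
    if kappa <= 0.40:
        return "Fair"
    if kappa <= 0.60:
        return "Moderate"
    if kappa <= 0.80:
        return "Substantial"
    return "Almost perfect"
```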
To illustrate the calculation of Cohen's Kappa, let's work through an example. Imagine that two instructors independently label each student as "good" or "bad" based on their behavior, and both of them evaluate the same group of 50 students.
The results are tabulated as follows.
To calculate Cohen's Kappa, we need the four counts a, b, c, and d from the table above.
We will first calculate the observed agreement, $p_o$.
Next, we will calculate the expected agreement, $p_e$.
Now we can finally calculate Cohen's Kappa, as we have both $p_o$ and $p_e$.
Therefore, the value just calculated can be interpreted against the scale above to judge the strength of agreement between the two instructors.
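In practice, you rarely need to compute this by hand. The sketch below uses scikit-learn's `cohen_kappa_score` on two label lists; the labels are purely illustrative and do not reproduce the 50-student table from this example.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical labels from the two instructors for ten students
# (illustrative only, not the data from the example above).
instructor_a = ["good", "good", "bad", "good", "bad",
                "good", "bad", "bad", "good", "good"]
instructor_b = ["good", "bad", "bad", "good", "bad",
                "good", "good", "bad", "good", "good"]

kappa = cohen_kappa_score(instructor_a, instructor_b)
print(f"Cohen's Kappa: {kappa:.2f}")
```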
We can now differentiate between reliability and validity, explain Cohen's Kappa, and interpret its value. This statistical measure is broadly useful, and it is particularly effective for assessing agreement in scenarios involving multi-class classification and imbalanced class distributions.