I would very much appreciate advice on best practice for comparing two
Kappa values. For example, suppose I have used a Kappa statistic to
measure agreement between the grades allocated by internal and external
examiners for 50 students, where different students have different
examiners, and that I have calculated this statistic separately for the
years 2000 and 2005. I would like to identify an appropriate hypothesis
test or other procedure to assess whether the change in agreement
between the two years is significant.
I have a related question which I would like to include here. In the
type of study I am describing, suppose a number of internal examiners
'break the rules' and, instead of allocating a fixed grade, suggest an
'A' or 'B' grade without deciding between the two. If there are
sufficiently many such cases for me to be concerned about sample size
were I to omit them, what would you suggest I do when calculating a
Kappa statistic? One important point is that the external examiners
always choose a single grade. I had previously thought of including
'undecided' as a category, but of course this does not help when one is
trying to measure the extent of disagreement between examiners in terms
of how far apart their respective grades are (by means of a weighted
Kappa statistic, say).
Thank you so much for your kind assistance.
Regards
Margaret
Year 2000

                     Internal
                 Pass   Fail   Total
External  Pass     30      5      35
          Fail      5     10      15
          Total    35     15      50
Observed proportion of agreements: Po = 40/50 = 0.8
Proportion expected by chance:
    Pe = [(35 x 35/50) + (15 x 15/50)] / 50 = 29/50 = 0.58
So kappa = (Po - Pe) / (1 - Pe) = 0.22/0.42 = 0.52
SE(kappa) = sqrt{ [Po(1 - Po)] / [N(1 - Pe)^2] }
          = sqrt{ [0.8 x 0.2] / [50 x (0.42)^2] }
          = 0.135
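The arithmetic above is easy to check with a short script. Here is a
sketch in Python (the function name is my own invention), computing
kappa and the Streiner & Norman standard error from the 2x2 table:

```python
import math

def kappa_and_se(table):
    """Cohen's kappa and its approximate SE for a square agreement table.

    table[i][j] = number of subjects placed in category i by rater 1
    and category j by rater 2.  The SE formula is the one quoted from
    Streiner & Norman: SE = sqrt( Po(1 - Po) / (N (1 - Pe)^2) ).
    """
    n = sum(sum(row) for row in table)
    k = len(table)
    # Observed proportion of agreement: the diagonal of the table
    po = sum(table[i][i] for i in range(k)) / n
    row_totals = [sum(table[i]) for i in range(k)]
    col_totals = [sum(table[i][j] for i in range(k)) for j in range(k)]
    # Chance-expected agreement from the marginal totals
    pe = sum(row_totals[i] * col_totals[i] for i in range(k)) / n ** 2
    kappa = (po - pe) / (1 - pe)
    se = math.sqrt(po * (1 - po) / (n * (1 - pe) ** 2))
    return kappa, se

# Year 2000 table from above: rows = External, columns = Internal
kappa, se = kappa_and_se([[30, 5], [5, 10]])
print(round(kappa, 2), round(se, 3))  # 0.52 0.135
```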
This would allow the calculation of confidence intervals for the two
years, which might help... but I wouldn't be surprised if they
overlapped.
In terms of a hypothesis test, it would be natural to subtract one kappa
from the other and see whether the difference is significantly different
from zero. I'm not sure how you would use the standard errors calculated
above to do this, but someone else in the Group will know, I'm sure...
help, anyone?
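For what it's worth, one common approach (my suggestion, not something
established in this thread) is to treat the two kappas as independent
estimates: the SE of the difference is sqrt(SE1^2 + SE2^2), which gives
a z statistic and a confidence interval for the difference. A Python
sketch, where the year-2005 figures are invented purely for
illustration:

```python
import math
from statistics import NormalDist

def compare_kappas(k1, se1, k2, se2):
    """Two-sided z-test for the difference of two independent kappas."""
    diff = k1 - k2
    se_diff = math.sqrt(se1 ** 2 + se2 ** 2)
    z = diff / se_diff
    # Two-sided p-value from the standard normal distribution
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    # Approximate 95% confidence interval for the difference
    ci = (diff - 1.96 * se_diff, diff + 1.96 * se_diff)
    return z, p, ci

# Year 2000 values from above; 0.70 and 0.12 are hypothetical 2005 values
z, p, ci = compare_kappas(0.52, 0.135, 0.70, 0.12)
```

If the confidence interval for the difference contains zero (or,
equivalently, p is above the chosen significance level), there is no
evidence of a change in agreement between the two years.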
My tuppenny worth, although I hope someone comes up with more than this !
Best Wishes,
Martin
PS Source of the standard error formula: "Health Measurement Scales",
2nd ed., by Streiner, D. and Norman, G.
Many thanks for your reply. I guess that if we were talking confidence
intervals we would really want the SE of the difference between the
Kappa values (or of the ratio of the Kappa values, if this is more
appropriate), as I would wish a CI for a statistic comparing the two
Kappa values. I wonder how such an SE is obtained, and am still
interested to learn of any appropriate hypothesis tests for assessing
the significance of the statistic used to compare the two Kappa values.
Regards
Margaret
Thank you for your correspondence.
I have found something useful in the second edition of Fleiss'
'Statistical Methods for Rates and Proportions' relating to the
comparison of different Kappa values to determine whether they differ
significantly. From the example covered, I can see that the recommended
approach is designed for the case where both the samples being
categorized and the raters are non-identical across the different
studies. I am not yet absolutely clear whether the same approach can
accommodate cases where the grading system used to assign categories
differs across the studies. (The example refers to "a given kind of
rating". I wish to see whether changing a grading system makes a
difference to agreement, and I suspect the method can be used in this
way too.) For your info, the relevant details can be found under
Problem 13.3 (b) on p. 254 of this reference.
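In case it helps others following the thread, my understanding of the
kind of homogeneity test Fleiss describes for independent kappas (a
sketch of the general idea, not a transcription of the book's formulas)
is: weight each kappa by its inverse variance, and compare the weighted
sum of squared deviations from the weighted mean kappa against a
chi-square distribution with g - 1 degrees of freedom, where g is the
number of studies. In Python:

```python
def kappa_homogeneity(kappas, ses):
    """Chi-square homogeneity test for g independent kappa estimates.

    Each kappa is weighted by its inverse variance (1/SE^2); the
    statistic is the weighted sum of squared deviations from the
    weighted mean kappa, referred to chi-square with g - 1 df.
    """
    weights = [1 / se ** 2 for se in ses]
    # Inverse-variance-weighted mean of the kappas
    k_bar = sum(w * k for w, k in zip(weights, kappas)) / sum(weights)
    chi2 = sum(w * (k - k_bar) ** 2 for w, k in zip(weights, kappas))
    df = len(kappas) - 1
    return chi2, df, k_bar

# Two studies: the 2000 kappa and SE from earlier in the thread,
# plus a hypothetical second pair for illustration
chi2, df, k_bar = kappa_homogeneity([0.52, 0.70], [0.135, 0.12])
```

With g = 2 this reduces to the square of the usual z statistic for the
difference of two independent estimates, so the two approaches agree in
the two-year case.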
Comments are welcome from those who are interested.
Regards
Margaret