Fleiss kappa interpretation recommendations


Bob Green

Jan 25, 2022, 4:28:02 PM
to meds...@googlegroups.com
Hello


The 1977 paper by Landis and Koch, which provided guidelines for
interpreting kappa values in the scenario of two coders and
multiple categories, has been cited over 69,000 times. My searching so
far suggests a variety of approaches to this issue.

I was hoping for a recommendation of a paper that provides
general guidelines for interpreting Fleiss' multirater kappa
results (multiple coders coding the same items using the same codes).

Any recommendations are appreciated.

Bob

Bruce Weaver

Jan 25, 2022, 7:40:29 PM
to MedStats
Hi Bob.  I believe that Gwet's Handbook of Inter-rater Reliability is considered a pretty good resource for all such matters these days.  I think I have a copy of the 4th edition back in the office I have not visited for so long, but I see on Gwet's website that there is now a 2-volume 5th edition.
You could also try searching his blog page.

HTH.

Brian Dates

Jan 25, 2022, 8:49:31 PM
to meds...@googlegroups.com
Bob,

Here's a cross-reference of the strength-of-agreement benchmarks for the kappa-like statistics (Cohen, Fleiss, etc.). Overall references appear at the bottom. I hope this helps.

Benchmarking

Landis and Koch (1977)            Altman, DG (1991)              Fleiss et al. (2003)
0.81 – 1.00   Almost perfect      0.81 – 1.00   Very good        0.75 – 1.00   Excellent
0.61 – 0.80   Substantial         0.61 – 0.80   Good             0.41 – 0.75   Fair to good
0.41 – 0.60   Moderate            0.41 – 0.60   Moderate         < 0.40        Poor
0.21 – 0.40   Fair                0.21 – 0.40   Fair
0.00 – 0.20   Slight              < 0.20        Poor
< 0.00        Poor

Gwet’s Benchmarking System for the AC1

Cohen's Type Kappa
Benchmark Scale     Description             Gwet Scale for AC1
Landis and Koch     Substantial             Moderate
Altman              Good                    Moderate
Fleiss              Intermediate to Good    Intermediate to Good

Fleiss' Type Kappa
Benchmark Scale     Description             Gwet Scale for AC1
Landis and Koch     Substantial             Moderate
Altman              Good                    Moderate
Fleiss              Intermediate to Good    Intermediate to Good

Bennett's Type S
Benchmark Scale     Description             Gwet Scale for AC1
Landis and Koch     Almost Perfect          Substantial
Altman              Very Good               Good
Fleiss              Excellent               Intermediate to Good

Gwet's AC1
Benchmark Scale     Description             Gwet Scale for AC1
Landis and Koch     Almost Perfect          Substantial
Altman              Very Good               Good
Fleiss              Excellent               Intermediate to Good


References:

Altman, D. G. (1991). Practical statistics for medical research. Chapman and Hall.

Fleiss, J. L., Levin, B., & Paik, M. C. (2003). Statistical methods for rates and proportions (3rd ed.). John Wiley & Sons, Inc.

Gwet, K. L. (2014). Handbook of inter-rater reliability. Advanced Analytics Press. ISBN 978-0970806284.

Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33, 159–174.
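
If it helps to have the first table in code form, here is a small R sketch. The function name benchmark_kappa and the cut-points are mine, lifted from the Landis-Koch, Altman, and Fleiss rows above; it is not from any package.

benchmark_kappa <- function(k) {
  # classify a coefficient on each of the three published scales
  landis_koch <- cut(k, breaks = c(-Inf, 0, 0.20, 0.40, 0.60, 0.80, 1),
                     labels = c("Poor", "Slight", "Fair", "Moderate",
                                "Substantial", "Almost perfect"))
  altman <- cut(k, breaks = c(-Inf, 0.20, 0.40, 0.60, 0.80, 1),
                labels = c("Poor", "Fair", "Moderate", "Good", "Very good"))
  fleiss <- cut(k, breaks = c(-Inf, 0.40, 0.75, 1),
                labels = c("Poor", "Fair to good", "Excellent"))
  data.frame(value = k, landis_koch, altman, fleiss)
}

# Example: 0.53 is "Moderate" on the Landis-Koch and Altman scales
# and "Fair to good" on the Fleiss et al. scale.
benchmark_kappa(0.53)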



Brian G. Dates, M.A.
Consultant in Program Evaluation, Research, and Statistics



Rich Ulrich

Jan 25, 2022, 9:29:26 PM
to meds...@googlegroups.com
I have always been skeptical about kappa with multiple
categories, unless the author is presenting a /set/ of values
for tables that are highly parallel -- Kappa is potentially very
vulnerable to happenstances of marginal frequencies. And
if the categories are ordered, which almost always is the
case, kappa ignores the ordering. 

When there are multiple raters, a statistic across raters is
okay when the summary is, "Everything is Great!"  In practice,
I want to see for myself the comparison of pairs of raters.
If I'm reviewing a paper, I want to know that the authors looked.

What you publish might be summaries.  When the data suit it,
I like to see, for each pair of raters, the correlation (for
association) and the paired t-test (for bias).
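
As a sketch of that sort of pairwise check (not Rich's own code; it assumes the ratings sit in a data frame named ratings with one numeric column per rater):

# loop over every pair of raters and report association and bias
pairs <- combn(names(ratings), 2, simplify = FALSE)
for (p in pairs) {
  x <- ratings[[p[1]]]
  y <- ratings[[p[2]]]
  cat(p[1], "vs", p[2],
      "| r =", round(cor(x, y), 2),                                    # association
      "| paired t p =", round(t.test(x, y, paired = TRUE)$p.value, 3), # bias
      "\n")
}
# For categorical ratings, irr::kappa2(ratings[, p]) gives the pairwise kappa instead.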

--
Rich Ulrich


Chris Evans

Jan 26, 2022, 5:37:04 AM
to medstats
Interesting thread, and oddly synchronous with my writing both an Rblog post about kappa (https://link.psyctc.org/kappaRblog1) and
a more lay-oriented blog post (https://link.psyctc.org/kappaBlog1).  The former particularly might intrigue people here.

However, my real reason for posting here is to see if I am the only one here who is very wary of all these categorisations of
continuous-valued indices and statistics.  As a "psycho" (psychiatrist/psychotherapist) I'm intrigued by how we seem to need to
translate numbers (back) into words, or rather, by how we choose to do it.  What do these cutting points and mappings really
mean?  An inter-rater agreement that is merely "good enough" may show a rating system to be perfectly adequate for a low-cost,
large-sample, between-groups comparison, where strengthening the ratings simply adjusts the planned power/precision analysis.
By contrast, we want to see much higher agreement for a rating system that is going to be used to make key decisions about
individuals -- observer ratings of X-rays, say.  Surely we should make these mappings to "adequate", "good", "excellent" or
whatever according to purpose, rather than as if they add much to the simple kappa, or Bangdiwala, or whatever, considered
irrespective of the use to which the rating will be put.

Yet we seem to need these mappings and even their histories (thanks to Brian Dates for a lovely summary and comparison).

Does this resonate with anyone?  Does it affect the original question here?

Very best all,

Chris



Chris Evans (he/him)        <ch...@psyctc.org>     
Visiting Professor, UDLA, Quito, Ecuador & Honorary Professor, University of Roehampton, London, UK.
Work web site:                  https://www.psyctc.org/psyctc/ 
CORE site:                        http://www.coresystemtrust.org.uk/ 
Personal site:                    https://www.psyctc.org/pelerinage2016/ 


Bruce Weaver

Jan 26, 2022, 10:35:43 AM
to MedStats
Rich wrote: 
<quote>
Kappa is potentially very vulnerable to happenstances of marginal frequencies. And if the categories are ordered, which almost always is the
case, kappa ignores the ordering. 
</quote>

For 2x2 tables with extreme marginal frequencies, Yule's Q (a.k.a. Yule's coefficient of association) is arguably a better choice than Cohen's kappa.  (Many years ago, a co-author of mine described this approach in an article, but unfortunately, he called the statistic phi.  I always thought that caused unnecessary confusion, because phi was already in use as the name for Pearson r on a 2x2 table.  Perhaps my colleague did not then know about Yule's Q.)

Regarding the second point, people I know have always used weighted kappa when they have ordinal variables. 
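
A quick sketch of both suggestions (the variable names r1 and r2 are mine, standing for two raters' ratings of the same items):

# Yule's Q from the 2x2 agreement table of two binary ratings
tab <- table(r1, r2)
n11 <- tab[1, 1]; n12 <- tab[1, 2]; n21 <- tab[2, 1]; n22 <- tab[2, 2]
Q <- (n11 * n22 - n12 * n21) / (n11 * n22 + n12 * n21)

# Weighted kappa for ordinal ratings; "equal" gives linear weights,
# "squared" gives quadratic weights
library(irr)
kappa2(data.frame(r1, r2), weight = "squared")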

Cheers,
Bruce

Brian Dates

Jan 26, 2022, 11:28:08 AM
to meds...@googlegroups.com
Bruce,

Considering Fleiss and Cohen's 1973 article on the equivalence of weighted kappa and the ICC, is there a reason not to just use the ICC with ordinal data? Also, Aickin's alpha tries to address the difference between difficult and easy items, which has the potential to eliminate the happenstance that Rich refers to.
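
For what it's worth, a minimal sketch of the ICC route, assuming a subjects-by-raters matrix named ratings with the ordinal scores treated as numeric (which ICC model matches weighted kappa is the part to check against Fleiss and Cohen 1973):

library(irr)
# two-way model, absolute agreement, single ratings -- the ICC commonly
# equated with quadratically weighted kappa
icc(ratings, model = "twoway", type = "agreement", unit = "single")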

B

Brian G. Dates, M.A.
Consultant in Program Evaluation, Research, and Statistics

Bruce Weaver

Jan 26, 2022, 2:13:30 PM
to MedStats
Hi Brian.  Re ICC vs weighted kappa, sure, if the ICC model you want to use is the one that matches weighted kappa.  IIRC, Norman & Streiner said in one (or more) of their books that which one you report can be a matter of convention in the particular discipline.  One could use software to compute an ICC but then report it as a weighted kappa (if that is the convention in a given discipline).  But some of that might have been written before McGraw & Wong (1996) came along. 

McGraw, K. O., & Wong, S. P. (1996). Forming inferences about some intraclass correlation coefficients. Psychological Methods, 1(1), 30–46.

McGraw, K. O., & Wong, S. P. (1996). "Forming inferences about some intraclass correlation coefficients": Correction. Psychological Methods.

Bob Green

Jan 28, 2022, 4:40:08 PM
to meds...@googlegroups.com

Thanks to all for their replies.

I have looked into Kilem Gwet's irrCAC package in R.


The data are from 3 coders who coded 101 statements. Each statement
was coded into one of 10 categories; the categories are unordered.


I obtained a Fleiss coefficient value of 0.58 in irrCAC and 0.53 in
irr. In irrCAC it didn't really matter which option I selected; the
results were basically the same.

            coeff.name        pa        pe coeff.val coeff.se      conf.int p.value     w.name
1 Krippendorff's Alpha 0.5855308 0.1187356   0.52969  0.04175 (0.447,0.613)       0 unweighted

     coeff.name        pa        pe coeff.val coeff.se      conf.int p.value     w.name
1  Fleiss' Kappa 0.5841584 0.1187356   0.52813  0.04175 (0.445,0.611)       0 unweighted
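
For reference, a minimal sketch of calls that produce output of this shape (assuming the 101 x 3 data frame of codes, statements in rows and coders in columns, is named ratings; the object name is mine):

library(irrCAC)
krippen.alpha.raw(ratings)$est   # Krippendorff's alpha (pa, pe, coeff.val, ...)
fleiss.kappa.raw(ratings)$est    # Fleiss' kappa

# In irr, detail = TRUE also reports a kappa for each category
library(irr)
kappam.fleiss(ratings, detail = TRUE)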


Does the difference between the irr and irrCAC results suggest use of
one over the other?


Does anyone know if coefficients can be obtained for each category in
irrCAC, as they can be in irr?



Any assistance is appreciated,

Regards

Bob

Bob Green

Jan 31, 2022, 3:23:59 AM
to meds...@googlegroups.com

Thanks to all for their replies.

I realised I was reading the wrong column. In this case irr and irrCAC agree.