Fleiss kappa interpretation recommendations


Bob Green

Jan 25, 2022, 4:28:02 PM
to meds...@googlegroups.com
Hello


The 1977 paper by Landis and Koch, which provided guidelines for
interpreting kappa values in the scenario of two coders and
multiple categories, has been cited over 69,000 times. My searching so
far suggests a variety of approaches to this issue.

I was hoping for a recommendation of a paper that provides
general guidelines for interpreting Fleiss' multirater kappa
results (multiple coders coding the same items using the same codes).

Any recommendations are appreciated.

Bob

Bruce Weaver

Jan 25, 2022, 7:40:29 PM
to MedStats
Hi Bob.  I believe that Gwet's Handbook of Inter-rater Reliability is considered a pretty good resource for all such matters these days.  I think I have a copy of the 4th edition back in the office I have not visited for so long, but I see on Gwet's website that there is now a 2-volume 5th edition.
You could also try searching his blog page.

HTH.

Brian Dates

Jan 25, 2022, 8:49:31 PM
to meds...@googlegroups.com
Bob,

Here's a cross-reference of the strength-of-agreement benchmarks for the kappa-like statistics (Cohen, Fleiss, etc.). Overall references appear at the bottom. I hope this helps.

Benchmarking

Landis and Koch (1977)            Altman, DG (1991)              Fleiss et al. (2003)
0.81 – 1.00   Almost perfect      0.81 – 1.00   Very good        0.75 – 1.00   Excellent
0.61 – 0.80   Substantial         0.61 – 0.80   Good             0.41 – 0.75   Fair to good
0.41 – 0.60   Moderate            0.41 – 0.60   Moderate         < 0.40        Poor
0.21 – 0.40   Fair                0.21 – 0.40   Fair
0.00 – 0.20   Slight              < 0.20        Poor
< 0.00        Poor

Gwet’s Benchmarking System for the AC1

Cohen's Type Kappa
Benchmark Scale     Description             Gwet Scale for AC1
Landis and Koch     Substantial             Moderate
Altman              Good                    Moderate
Fleiss              Intermediate to Good    Intermediate to Good

Fleiss' Type Kappa
Benchmark Scale     Description             Gwet Scale for AC1
Landis and Koch     Substantial             Moderate
Altman              Good                    Moderate
Fleiss              Intermediate to Good    Intermediate to Good

Bennett's Type S
Benchmark Scale     Description             Gwet Scale for AC1
Landis and Koch     Almost Perfect          Substantial
Altman              Very Good               Good
Fleiss              Excellent               Intermediate to Good

Gwet's AC1
Benchmark Scale     Description             Gwet Scale for AC1
Landis and Koch     Almost Perfect          Substantial
Altman              Very Good               Good
Fleiss              Excellent               Intermediate to Good


References:

Altman, D. G. (1991). Practical statistics for medical research. Chapman and Hall.

Fleiss, J. L., Levin, B., & Paik, M. C. (2003). Statistical methods for rates and proportions (3rd ed.). John Wiley & Sons, Inc.

Gwet, K. L. (2014). Handbook of inter-rater reliability. Advanced Analytics Press. ISBN 978-0970806284.

Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33, 159–174.
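
If it helps to have the first table in code form, here is a small R sketch. The function name benchmark_kappa and the cut-points are mine, lifted from the Landis-Koch, Altman, and Fleiss rows above; it is not from any package.

benchmark_kappa <- function(k) {
  # classify a coefficient on each of the three published scales
  landis_koch <- cut(k, breaks = c(-Inf, 0, 0.20, 0.40, 0.60, 0.80, 1),
                     labels = c("Poor", "Slight", "Fair", "Moderate",
                                "Substantial", "Almost perfect"))
  altman <- cut(k, breaks = c(-Inf, 0.20, 0.40, 0.60, 0.80, 1),
                labels = c("Poor", "Fair", "Moderate", "Good", "Very good"))
  fleiss <- cut(k, breaks = c(-Inf, 0.40, 0.75, 1),
                labels = c("Poor", "Fair to good", "Excellent"))
  data.frame(value = k, landis_koch, altman, fleiss)
}

# Example: 0.53 is "Moderate" on the Landis-Koch and Altman scales
# and "Fair to good" on the Fleiss et al. scale.
benchmark_kappa(0.53)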



Brian G. Dates, M.A.
Consultant in Program Evaluation, Research, and Statistics



Rich Ulrich

Jan 25, 2022, 9:29:26 PM
to meds...@googlegroups.com
I have always been skeptical about kappa with multiple
categories, unless the author is presenting a /set/ of values
for tables that are highly parallel -- Kappa is potentially very
vulnerable to happenstances of marginal frequencies. And
if the categories are ordered, which almost always is the
case, kappa ignores the ordering. 

When there are multiple raters, a statistic across raters is
okay when the summary is, "Everything is Great!"  In practice,
I want to see for myself the comparison of pairs of raters.
If I'm reviewing a paper, I want to know that the authors looked.

What you publish might be summaries.  When the data suit it,
I like to see, for each pair of raters, the correlation (for
association) and the paired t-test (for bias).
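
As a sketch of that sort of pairwise check (not Rich's own code; it assumes the ratings sit in a data frame named ratings with one numeric column per rater):

# loop over every pair of raters and report association and bias
pairs <- combn(names(ratings), 2, simplify = FALSE)
for (p in pairs) {
  x <- ratings[[p[1]]]
  y <- ratings[[p[2]]]
  cat(p[1], "vs", p[2],
      "| r =", round(cor(x, y), 2),                                    # association
      "| paired t p =", round(t.test(x, y, paired = TRUE)$p.value, 3), # bias
      "\n")
}
# For categorical ratings, irr::kappa2(ratings[, p]) gives the pairwise kappa instead.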

--
Rich Ulrich


Chris Evans

Jan 26, 2022, 5:37:04 AM
to medstats
Interesting thread, and oddly synchronous with my writing both an Rblog post about kappa (https://link.psyctc.org/kappaRblog1) and
a more lay-oriented blog post (https://link.psyctc.org/kappaBlog1).  The former particularly might intrigue people here.

However, my real reason for posting here is to see if I am the only one here who is very wary of all these categorisations of
continuous-valued indices and statistics.  As a "psycho" (psychiatrist/psychotherapist) I'm intrigued by how we seem to need to
translate numbers (back) into words, or rather, by how we choose to do it.  What do these cutting points and mappings really
mean?  An inter-rater agreement that is merely "good enough" may show a rating system to be perfectly adequate for a low-cost,
large-sample, between-groups comparison, where strengthening the ratings simply adjusts the planned power/precision analysis.
By contrast, we want to see much higher agreement for a rating system that is going to be used to make key decisions about
individuals -- observer ratings of X-rays, say.  Surely we should make these mappings to "adequate", "good", "excellent" or
whatever according to purpose, rather than as if they add much to the simple kappa, or Bangdiwala, or whatever, considered
irrespective of the use to which the rating will be put.

Yet we seem to need these mappings and even their histories (thanks to Brian Dates for a lovely summary and comparison).

Does this resonate with anyone?  Does it affect the original question here?

Very best all,

Chris



Chris Evans (he/him)        <ch...@psyctc.org>     
Visiting Professor, UDLA, Quito, Ecuador & Honorary Professor, University of Roehampton, London, UK.
Work web site:                  https://www.psyctc.org/psyctc/ 
CORE site:                        http://www.coresystemtrust.org.uk/ 
Personal site:                    https://www.psyctc.org/pelerinage2016/ 


Bruce Weaver

Jan 26, 2022, 10:35:43 AM
to MedStats
Rich wrote: 
<quote>
Kappa is potentially very vulnerable to happenstances of marginal frequencies. And if the categories are ordered, which almost always is the
case, kappa ignores the ordering. 
</quote>

For 2x2 tables with extreme marginal frequencies, Yule's Q (a.k.a. Yule's coefficient of association) is arguably a better choice than Cohen's kappa.  (Many years ago, a co-author of mine described this approach in an article, but unfortunately, he called the statistic phi.  I always thought that caused unnecessary confusion, because phi was already in use as the name for Pearson r on a 2x2 table.  Perhaps my colleague did not then know about Yule's Q.)

Regarding the second point, people I know have always used weighted kappa when they have ordinal variables. 
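
A quick sketch of both suggestions (the variable names r1 and r2 are mine, standing for two raters' ratings of the same items):

# Yule's Q from the 2x2 agreement table of two binary ratings
tab <- table(r1, r2)
n11 <- tab[1, 1]; n12 <- tab[1, 2]; n21 <- tab[2, 1]; n22 <- tab[2, 2]
Q <- (n11 * n22 - n12 * n21) / (n11 * n22 + n12 * n21)

# Weighted kappa for ordinal ratings; "equal" gives linear weights,
# "squared" gives quadratic weights
library(irr)
kappa2(data.frame(r1, r2), weight = "squared")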

Cheers,
Bruce

Brian Dates

Jan 26, 2022, 11:28:08 AM
to meds...@googlegroups.com
Bruce,

Considering Fleiss and Cohen's 1973 article on the equivalence of weighted kappa and the ICC, is there a reason not to just use the ICC with ordinal data? Also, Aickin's alpha tries to address the difference between difficult and easy items, which has the potential to eliminate the happenstance that Rich refers to.
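
For what it's worth, a minimal sketch of the ICC route, assuming a subjects-by-raters matrix named ratings with the ordinal scores treated as numeric (which ICC model matches weighted kappa is the part to check against Fleiss and Cohen 1973):

library(irr)
# two-way model, absolute agreement, single ratings -- the ICC commonly
# equated with quadratically weighted kappa
icc(ratings, model = "twoway", type = "agreement", unit = "single")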

B

Brian G. Dates, M.A.
Consultant in Program Evaluation, Research, and Statistics

Bruce Weaver

Jan 26, 2022, 2:13:30 PM
to MedStats
Hi Brian.  Re ICC vs weighted kappa, sure, if the ICC model you want to use is the one that matches weighted kappa.  IIRC, Norman & Streiner said in one (or more) of their books that which one you report can be a matter of convention in the particular discipline.  One could use software to compute an ICC but then report it as a weighted kappa (if that is the convention in a given discipline).  But some of that might have been written before McGraw & Wong (1996) came along. 

McGraw, K. O., & Wong, S. P. (1996). Forming inferences about some intraclass correlation coefficients. Psychological Methods, 1(1), 30–46.

McGraw, K. O., & Wong, S. P. (1996). "Forming inferences about some intraclass correlation coefficients": Correction. Psychological Methods.

Bob Green

Jan 28, 2022, 4:40:08 PM
to meds...@googlegroups.com

Thanks to all for their replies.

I have looked into Kilem Gwet's irrCAC package in R.


The data are from 3 coders who coded 101 statements. Each statement
was coded into one of 10 categories; the categories are unordered.


I obtained a Fleiss coefficient value of 0.58 in irrCAC and 0.53 in
irr. In irrCAC it didn't really matter which option I selected; the
results were basically the same.

            coeff.name        pa        pe coeff.val coeff.se      conf.int p.value     w.name
1 Krippendorff's Alpha 0.5855308 0.1187356   0.52969  0.04175 (0.447,0.613)       0 unweighted

     coeff.name        pa        pe coeff.val coeff.se      conf.int p.value     w.name
1  Fleiss' Kappa 0.5841584 0.1187356   0.52813  0.04175 (0.445,0.611)       0 unweighted
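
For reference, a minimal sketch of calls that produce output of this shape (assuming the 101 x 3 data frame of codes, statements in rows and coders in columns, is named ratings; the object name is mine):

library(irrCAC)
krippen.alpha.raw(ratings)$est   # Krippendorff's alpha (pa, pe, coeff.val, ...)
fleiss.kappa.raw(ratings)$est    # Fleiss' kappa

# In irr, detail = TRUE also reports a kappa for each category
library(irr)
kappam.fleiss(ratings, detail = TRUE)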


Does the difference between the irr and irrCAC results suggest use of
one over the other?


Does anyone know if coefficients can be obtained for each category in
irrCAC, as they can be in irr?



Any assistance is appreciated,

Regards

Bob

Bob Green

Jan 31, 2022, 3:23:59 AM
to meds...@googlegroups.com

Thanks to all for their replies.

I realised I was reading the wrong column. In this case irr and irrCAC agree.