Agreement between multiple raters and comparison of kappa values


Giovanni Delli Carpini

May 10, 2021, 4:29:29 AM
to MedStats
Good morning,

I am Giovanni Delli Carpini, a researcher in Gynecology and Obstetrics at Università Politecnica delle Marche, Ancona, Italy.
Thank you for accepting my request to join this interesting group. 

We are conducting a clinical study in which we are evaluating the inter-rater agreement for two scoring systems (CDC criteria and ASEPSIS score) in assessing the presence of surgical site infections after cesarean section. 
Both systems provide a categorical classification in three classes according to the presence of infection (e.g., 1. no infection, 2. mild infection, 3. severe infection). 
Three raters were asked to determine the scores and assign each patient to one of the three classes. 

My question is: which method should we use to obtain a kappa value for multiple raters (weighted kappa? Fleiss' kappa? Krippendorff's alpha?)? Subsequently, is it possible to compare the resulting kappa values to verify whether there is any difference between them (in other words, whether one of the two scoring systems provides higher concordance between raters)?

Thank you for the help,

Giovanni Delli Carpini

William Stanbury

May 10, 2021, 6:19:41 AM
to meds...@googlegroups.com
Giovanni,

Good afternoon, welcome!

:-)

Thank you for your interesting email.

If I may, I have a single question, please: for the categorical classification, why only three classes (1. no infection, 2. mild infection, 3. severe infection) and not, for example, a "moderate" class or other classes?

Grazie mille!

:-)

William Stanbury.


Giovanni Delli Carpini

May 10, 2021, 6:57:50 AM
to MedStats
Dear William,

Thank you for your answer!

The CDC criteria present three outcome categories of surgical site infection (SSI): "no SSI", "superficial incisional SSI", and "deep incisional SSI", while the ASEPSIS score presents five outcome categories: "satisfactory healing" (0–10 points), "disturbance of healing" (11–20 points), "minor SSI" (21–30 points), "moderate SSI" (31–40 points), and "severe SSI" (>40 points).
In order to compare the two systems and to determine the required sample size (with the R package "kappasize"), we decided to group the ASEPSIS categories as follows: satisfactory healing plus disturbance of healing, minor plus moderate SSI, and severe SSI, to obtain three categories comparable to the CDC criteria.
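For concreteness, this is roughly how we recoded the ASEPSIS totals in R (the vector name "asepsis_points" is just a placeholder, and the values below are made up for illustration):

## Sketch of the ASEPSIS regrouping: 0-20 -> category 1, 21-40 -> category 2, >40 -> category 3
asepsis_points <- c(5, 18, 27, 36, 44)   # toy values, not real data

asepsis_3cat <- cut(
  asepsis_points,
  breaks = c(-Inf, 20, 40, Inf),
  labels = c("satisfactory healing / disturbance of healing",
             "minor/moderate SSI",
             "severe SSI"),
  ordered_result = TRUE
)
table(asepsis_3cat)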

I have found some literature that explains several methodologies for obtaining a kappa value for multiple raters, but I am unable to find the right software to perform the analysis and to choose the correct methodology.
I have tried SPSS (for weighted kappa), but it only gives pairwise comparisons between raters, and the R package "rel" for Krippendorff's alpha, but I was unable to obtain a confidence interval. Moreover, I don't know how to subsequently compare the kappa values (CDC vs ASEPSIS); the R package "svanbelle/multiagree" seems useful, but it is difficult to run.

Giovanni

William Stanbury

May 10, 2021, 11:48:32 AM
to meds...@googlegroups.com
Dear Giovanni,

Thank you for your reply. There are many members here with seriously impressive statistical knowledge; with luck, one or more will choose to communicate with you.

My chief concern behind my question was (and is) medical-surgical: with due respect to the CDC, naturally, is it fair to ask whether they are necessarily the best influence on the design of research methodology? Might they have too much of a "Headquarters" mentality and thus sit behind the frontline curve? To quote an old English expression, "a stitch in time saves nine". I'm worried that using only three classes may mean the outcomes miss out on all kinds of medical-surgical research knowledge for the future.

Very best regards,

William Stanbury.

Chris Evans

May 11, 2021, 2:03:27 PM
to medstats
It's a very long time since I did any work on inter-rater agreement, but I'll chip in pending far more expert input (I hope).

Working backwards from William Stanbury's point: I don't see how you can use any kappa-related methodology to look at agreement between classifications with different numbers of levels unless you do, as you have done, a remap of the one with more categories.  You could address the issue by trying all possible (monotonically sensible) remappings, I guess, but you would want to be very clear that one was the a priori, planned remap and the others were exploring the impact of the remapping.  Perhaps if you haven't prespecified the analytic method my liking for a distinction between a priori and post hoc exploratory/expansive analyses is not pertinent, but I do believe it's always good to be crystal clear about the distinction (and able to show evidence, from a [pre]registration, of the history).

As to your methods: how many raters?  Clearly more than two, but I think it matters whether you have k raters who rated all n cases, or the more complex scenarios where you have, say, two raters at each site, so you have k raters but in k/2 pairs with no cases rated by more than two raters, or the situation in which you have subsets of the total cases rated by subsets of the raters.  I see the svanbelle/multiagree package and the paper by Vanbelle it cites, but I don't know either.  I see she uses bootstrapping, which is what I was going to suggest if there is no analytic CI for whatever index of agreement you use.
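To make the bootstrap idea concrete, here is a minimal sketch of what I had in mind, assuming the simple design (three raters who all rated all n cases) and assuming the ratings for each scale sit in data frames called cdc and asepsis, subjects in rows and the three raters in columns (both names are placeholders).  It resamples subjects, recomputes Fleiss' kappa for each scale, and gives percentile intervals for each kappa and for their difference:

library(irr)    # for kappam.fleiss()
library(boot)

kappa_diff <- function(data, idx) {
  d <- data[idx, ]                             # resample subjects (rows)
  k_cdc     <- kappam.fleiss(d[, 1:3])$value   # Fleiss' kappa, CDC columns
  k_asepsis <- kappam.fleiss(d[, 4:6])$value   # Fleiss' kappa, ASEPSIS columns
  c(k_cdc, k_asepsis, k_cdc - k_asepsis)
}

both <- cbind(cdc, asepsis)                    # keep the two scales paired within subject
set.seed(2021)
b <- boot(both, kappa_diff, R = 2000)

boot.ci(b, type = "perc", index = 1)           # CI for kappa (CDC)
boot.ci(b, type = "perc", index = 2)           # CI for kappa (ASEPSIS)
boot.ci(b, type = "perc", index = 3)           # CI for the difference

If the percentile interval for the difference excludes zero, that is at least crude evidence that one scoring system gives higher inter-rater concordance than the other, though others here may prefer the Vanbelle approach implemented in multiagree.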

You talk about weighted kappa: my recollection is that quadratic weighting actually gives a kappa that is the same as a Pearson correlation, which always struck me as suggesting that either that weighting is a bit odd or that there's no real need for that weighted kappa.  Others may know more about that.

If you have non-overlapping subsets of raters and cases, I guess another approach is to see this as a problem of estimating rater effects (tendencies to rate higher or lower) and case-mix effects (since raters rating a set of cases with a more restricted range of categories than other raters create the obvious risk of confounding rater effects with case mix).

Sorry, not very helpful but perhaps others will join in and add much more wisdom and experience to my thoughts!!

Cheers all,

Chris



--
Chris Evans (he/him) <ch...@psyctc.org> Visiting Professor, University of Sheffield <chris...@sheffield.ac.uk>
I do some consultation work for the University of Roehampton <chris...@roehampton.ac.uk> and other places
but <ch...@psyctc.org> remains my main Email address.  I have a work web site at:
   https://www.psyctc.org/psyctc/
and a site I manage for CORE and CORE system trust at:
   http://www.coresystemtrust.org.uk/
I have "semigrated" to France, see:
   https://www.psyctc.org/pelerinage2016/semigrating-to-france/
   https://www.psyctc.org/pelerinage2016/register-to-get-updates-from-pelerinage2016/

If you want an Emeeting, I am trying to keep them to Thursdays and my diary is at:
   https://www.psyctc.org/pelerinage2016/ceworkdiary/
Beware: French time, generally an hour ahead of UK.  

Bruce Weaver

May 11, 2021, 5:40:51 PM
to MedStats
Good comments, Chris.  I'll offer just one small correction via this excerpt from Fleiss & Cohen (1973): 

"This paper establishes the equivalence of weighted kappa with the intraclass correlation coefficient under general conditions. Krippendorff (1970) demonstrated essentially the same result."

Fleiss, J. L., & Cohen, J. (1973). The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. Educational and Psychological Measurement, 33(3), 613-619.

A Google Scholar search on the title of that article takes me to a PDF I can open from home.  YMMV.
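In case a numerical illustration helps, here is a quick toy check in R with the irr package (made-up ratings from two raters on a three-point ordinal scale; I am assuming the two-way agreement ICC is the right counterpart here, so treat it as a sketch rather than gospel).  The quadratically weighted kappa and the ICC should come out very close:

library(irr)

set.seed(1)
r1 <- sample(1:3, 60, replace = TRUE)
noise <- sample(-1:1, 60, replace = TRUE, prob = c(0.2, 0.6, 0.2))
r2 <- pmin(pmax(r1 + noise, 1), 3)           # second rater mostly agrees with the first
ratings <- data.frame(r1, r2)

kappa2(ratings, weight = "squared")          # Cohen's kappa with quadratic weights
icc(ratings, model = "twoway", type = "agreement", unit = "single")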

Cheers,
Bruce

John Whittington

May 11, 2021, 8:08:35 PM
to meds...@googlegroups.com
At 16:48 10/05/2021, William Stanbury wrote:
....I'm worried that using only three classes may mean the outcomes miss out on all kinds of medical-surgical research knowledge for the future.

Just one passing comment, since I think that, as a generalisation, there are two sides to the above point, the other side being 'spurious precision'.

If there is some fairly objective way of semi-quantifying, or at least ordering, things, then what you say is clearly correct.  In fact, it then really becomes an example of the oft-criticised practice of unnecessarily 'categorising'/collapsing/reducing data, which nearly always results in an unwelcome loss of information.

However, when the assessment is largely or totally subjective, things become more difficult and less clear-cut.  One would hope to get fairly good agreement between raters with something like none/mild/moderate/severe, but as soon as one starts trying to introduce intermediate categories (e.g. 'very mild' or 'fairly severe'), agreement becomes a lot less good, and what might appear to be 'greater precision' can actually be little more than an increase in 'noise' (inter-rater variability), with no real gain in actually useful information.

People seem to self-impose this limited range of responses even when one attempts to avoid 'categorising' by using tools such as Visual Analogue Scales.  Very many years (decades) ago I was involved in some attempts to look at this.  Presented with a scale labelled, say, "Negligible" at one end and "The most severe possible" at the other, very few people would mark the scale right at one end, but the great majority made their mark either fairly close to one end or roughly in the middle of the scale!

... just some thoughts!

Kind Regards,
John



----------------------------------------------------------------
Dr John Whittington,       Voice:    +44 (0) 1296 730225
Mediscience Services       Fax:      +44 (0) 1296 738893
Twyford Manor, Twyford,    E-mail:   Joh...@mediscience.co.uk
Buckingham  MK18 4EL, UK            
----------------------------------------------------------------

Brian Dates

May 11, 2021, 9:24:37 PM
to meds...@googlegroups.com
Bruce's comments and reference are valuable. If you think that three categories provide enough room for variance, then use the ICC. Otherwise, when it comes to either Fleiss' kappa or Krippendorff's alpha, I have some offerings.

Krippendorff's alpha uses disagreement, while Fleiss' kappa uses agreement; they are really mirror images of each other. If we use 1 - D_e as the disagreement by chance and convert kappa to a disagreement-based statistic, the denominator of the expression for kappa is nr(nr - 1), while for Krippendorff's alpha it is (nr)^2, where r is the number of raters and n is the number of items/subjects (Di Eugenio, B., & Glass, M. (2004). The kappa statistic: A second look. Computational Linguistics, 30(1), 95-101). So the difference between the two disappears as the number of items/subjects gets larger, and the decision about which to use is up to you.

If you think that the categories are ordinal, I'd recommend Krippendorff's alpha, because it is applicable to nominal, ordinal, interval, and ratio data. I don't know what software you're using, but Andrew Hayes has SPSS syntax (the KALPHA macro) which lets the user select which data type is to be considered in the analysis. If you have difficulty finding it, I can send it.
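In R, since the rel package was mentioned, the irr package will give both statistics in a couple of lines. A minimal sketch, assuming a matrix or data frame called ratings with subjects in rows and the three raters in columns (the name is a placeholder):

library(irr)

kappam.fleiss(ratings)                        # Fleiss' kappa (treats categories as nominal)

# kripp.alpha() expects raters in rows and subjects in columns, hence the transpose,
# and the categories need to be numeric codes (e.g. 1, 2, 3) for the ordinal metric.
kripp.alpha(t(as.matrix(ratings)), method = "ordinal")

Neither call gives a confidence interval out of the box, so you would still pair this with the bootstrap that Chris described.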

Cheers!

Brian



Brian G. Dates, M.A.
Consultant in Program Evaluation, Research, and Statistics



Rich Ulrich

May 11, 2021, 11:02:43 PM
to MedStats
First, I want to point out that apparently there are /three/ ratings
that might be compared, if you count the underlying 50-point rating
scale.  Or, there are many more if you count the items that make up
those 50 points.

[Google -- Points are given for the need for Additional treatment,
the presence of Serous discharge, Erythema, Purulent exudate, and
Separation of the deep tissues, the Isolation of bacteria, and the
duration of inpatient Stay (ASEPSIS). ]

Second, I ask: what are you trying to accomplish? A three-point scale with limited description is hardly commensurable with a 50-point scale covering several dimensions. A reasonable /aim/, I suggest, is that you seek to quantify just how inferior the three-point scale is.

Offhand, I figure that the items going into the 50 points by Rater 1 will probably do a better job of "predicting" the three-point scale by Rater 2 than the three-point scale by Rater 1 does.

Third, whatever you do is constrained by design.  In your case, you have
a set of cases rated by three raters, each rater using each of the scales.
Which rating is done first might be randomized, since a "gestalt" rating
like a three-point summary is apt to be more accurate when it comes /after/
a thoughtful review of all those relevant items.  (But - what makes that
less than ideal is that the raters can hardly be blind to the purpose, and
they sometimes may aim at "similar" ratings.)

In documenting the deficiency, it should be useful to "debrief" the raters, so that they can explain why a lower/higher rating on the three-point scale is justified by the difference in content. Of course, every reliability assessment is limited by the sample it is applied to.

--
Rich Ulrich


Chris Evans

May 12, 2021, 5:07:07 AM
to medstats
From: "Bruce Weaver" <bwe...@lakeheadu.ca>
To: "medstats" <meds...@googlegroups.com>
Sent: Tuesday, 11 May, 2021 22:40:51
Subject: Re: {MEDSTATS} Agreement between multiple raters and comparison of kappa values
Good comments, Chris.  I'll offer just one small correction via this excerpt from Fleiss & Cohen (1973): 

"This paper establishes the equivalence of weighted kappa with the intraclass correlation coefficient under general conditions. Krippendorff (1970) demonstrated essentially the same result."

Fleiss, J. L., & Cohen, J. (1973). The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. Educational and psychological measurement, 33(3), 613-619.

A Google Scholar search on the title of that article takes me to a PDF I can open from home.  YMMV.
Soooo good to have this, Bruce: yes, it was the Fleiss & Cohen paper I was misremembering.  Just thought I should confirm that here; it is also good to see all the other thinking.  I am sure I will need this at some point in the next year or so: very helpful.

Very best all,

Chris

Martin Holt

May 21, 2021, 8:33:58 AM
to MedStats
"Health Measurement Scales" by David L. Streiner and Geoffrey R.Norman available at a cheap price on Abebooks.

"Men can read smaller print than women can; women can hear better"!
