Interrater reliability


Christoph Ruehlemann

May 9, 2022, 5:19:39 AM
to statforli...@googlegroups.com
Hi All,

I have data for a large number of `Trial`s (only three shown here) and ratings by subjects `A`, `B`, `C`, `D`, and `E` (many more in the actual data). In each `Trial`, subjects were asked to determine whether event `f` or event `n` occurred:

    df <- structure(list(Trial = 1:3, Trial_time = c("00:00:00.001", "00:00:00.002",
    "00:00:00.003"), A = c("f", "n", "n"), B = c("f", "n", "f"),
        C = c("f", "f", "n"), D = c("f", "f", "n"), E = c("f", "f",
        "n")), row.names = c(NA, -3L), class = c("tbl_df", "tbl",
    "data.frame"))

How can I establish an interrater reliability score for this kind of rating? Help is much appreciated!

Best
Christoph

--
Albert-Ludwigs-Universität Freiburg
Projekt-Leiter DFG-Projekt "Analyse multimodaler Interaktion im Geschichtenerzählen"
ἰχθύς

Stefan Th. Gries

May 9, 2022, 10:06:29 AM
to StatForLing with R
Have you checked the package irr? It has functions such as
kappam.fleiss or kripp.alpha.
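
A minimal sketch of what Fleiss' kappa computes for the three example trials, written in base R so it runs without extra packages. The helper name `fleiss_kappa` and the plain `ratings` data frame are assumptions made for illustration, not part of irr; `irr::kappam.fleiss()` applied to the same rater columns should give a comparable estimate.

```r
# Rater columns only (Trial and Trial_time are dropped); one row per trial
ratings <- data.frame(
  A = c("f", "n", "n"),
  B = c("f", "n", "f"),
  C = c("f", "f", "n"),
  D = c("f", "f", "n"),
  E = c("f", "f", "n")
)

# Hypothetical helper: Fleiss' kappa for a trials x raters table
# (assumes more than one trial and no missing ratings)
fleiss_kappa <- function(ratings) {
  m <- as.matrix(ratings)
  cats <- sort(unique(as.vector(m)))
  # counts[i, k] = number of raters who chose category k in trial i
  counts <- sapply(cats, function(k) rowSums(m == k))
  n_raters <- ncol(m)
  p_cat <- colSums(counts) / sum(counts)             # overall category shares
  p_trial <- (rowSums(counts^2) - n_raters) /
    (n_raters * (n_raters - 1))                      # per-trial agreement
  p_obs <- mean(p_trial)                             # observed agreement
  p_exp <- sum(p_cat^2)                              # chance agreement
  (p_obs - p_exp) / (1 - p_exp)
}

kappa_est <- fleiss_kappa(ratings)
print(round(kappa_est, 3))  # 0.306 for the three example trials
```

With only three trials the value is purely illustrative; the actual data would have many more rows.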

Best,
STG

Christoph Ruehlemann

May 9, 2022, 3:20:20 PM
to statforli...@googlegroups.com
Yes, I have. The function kappam.light() seemed to suit my data scenario best. But I'm of course not 100% sure ...
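
For comparison, Light's kappa is the mean of Cohen's kappa over all pairs of raters. The following is a rough base R sketch on the three example trials; `cohen_kappa` is a hypothetical helper written for illustration, and `irr::kappam.light()` is the packaged version, which also reports a test statistic.

```r
# Rater columns only; one row per trial
ratings <- data.frame(
  A = c("f", "n", "n"),
  B = c("f", "n", "f"),
  C = c("f", "f", "n"),
  D = c("f", "f", "n"),
  E = c("f", "f", "n")
)

# Hypothetical helper: Cohen's kappa for two raters' label vectors
cohen_kappa <- function(x, y) {
  lev <- union(x, y)
  # joint proportions, with both raters on the same category order
  tab <- table(factor(x, lev), factor(y, lev)) / length(x)
  p_obs <- sum(diag(tab))                      # observed agreement
  p_exp <- sum(rowSums(tab) * colSums(tab))    # chance agreement
  (p_obs - p_exp) / (1 - p_exp)
}

# All unordered rater pairs (10 pairs for 5 raters)
rater_pairs <- combn(names(ratings), 2)
pair_kappas <- apply(rater_pairs, 2, function(p) {
  cohen_kappa(ratings[[p[1]]], ratings[[p[2]]])
})

light_kappa <- mean(pair_kappas)
print(round(light_kappa, 2))  # 0.31 for the three example trials
```

Both measures handle more than two raters; Fleiss' kappa pools the category shares over all raters, whereas Light's kappa keeps each rater's own marginals, so the two need not coincide.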

--
You received this message because you are subscribed to the Google Groups "StatForLing with R" group.
To unsubscribe from this group and stop receiving emails from it, send an email to statforling-wit...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/statforling-with-r/CAFrBz2nO3Zk_RSbk_PYhLO85DF%3DSYGjrbSQqKpqWsG2cZXmB8Q%40mail.gmail.com.

Stefan Th. Gries

May 9, 2022, 4:51:41 PM
to StatForLing with R
Carletta (1996, Computational Linguistics) is widely cited; she
discusses Landis & Koch (don't know the ref off the top of my head),
but one of the most comprehensive reviews might be Artstein & Poesio
(2008). I've used irr in the past and it worked well.

Christoph Ruehlemann

May 10, 2022, 2:00:24 AM
to statforli...@googlegroups.com
As a follow-up: what we are trying to do is compare two methods of determining eye gaze: the traditional method of eyeballing and the eyetracking method. So we get two irr scores, one for the eyeballing group and another for the eyetracking group. Using, for example, Fleiss' kappa, we'd then have two values, say, 0.78 and 0.88. How could we determine which method facilitates greater agreement in general, that is, beyond the samples at hand?
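
One possible way to approach this, sketched here under stated assumptions rather than as an established recipe, is a bootstrap over trials: resample trials with replacement, recompute both kappas, and look at the percentile interval of their difference. The data below are simulated and purely hypothetical (200 trials, 5 raters per method), `fleiss_kappa` and `simulate_ratings` are helpers invented for illustration, and the paired resampling assumes both groups rated the same trials; if they did not, each group would be resampled independently.

```r
set.seed(1)

# Hypothetical helper: Fleiss' kappa for a trials x raters character matrix
fleiss_kappa <- function(m) {
  m <- as.matrix(m)
  cats <- sort(unique(as.vector(m)))
  counts <- sapply(cats, function(k) rowSums(m == k))
  n_raters <- ncol(m)
  p_obs <- mean((rowSums(counts^2) - n_raters) / (n_raters * (n_raters - 1)))
  p_exp <- sum((colSums(counts) / sum(counts))^2)
  (p_obs - p_exp) / (1 - p_exp)
}

# Simulated (not real) ratings: each rater reports the true event
# with probability p_correct, otherwise the other event
simulate_ratings <- function(n_trials, n_raters, p_correct) {
  truth <- sample(c("f", "n"), n_trials, replace = TRUE)
  sapply(seq_len(n_raters), function(r) {
    flip <- runif(n_trials) > p_correct
    ifelse(flip, ifelse(truth == "f", "n", "f"), truth)
  })
}
eyeballing  <- simulate_ratings(200, 5, 0.90)
eyetracking <- simulate_ratings(200, 5, 0.95)

# Paired bootstrap: the same resampled trial indices are used for both methods
boot_diff <- replicate(2000, {
  idx <- sample(nrow(eyeballing), replace = TRUE)
  fleiss_kappa(eyetracking[idx, ]) - fleiss_kappa(eyeballing[idx, ])
})

# 95% percentile interval for kappa(eyetracking) - kappa(eyeballing)
ci <- quantile(boot_diff, c(0.025, 0.975), na.rm = TRUE)
print(ci)
```

If the interval excludes zero, that would suggest the difference in agreement is not just a property of the sampled trials. Note that this resamples trials only, so it generalizes over trials and not over raters.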
