Oh, yes, that ICC. That could work. But wouldn’t that be limited to comparing two or more raters in terms of their scores for the total number of utterances in a sample? It would not look at how they arrived at that number. So, one rater could break utterance 52 into two pieces, but then join utterances 83 and 84 into one, and end up with the same total as the gold-standard segmentation.
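To make that concrete, here is a minimal Python sketch (illustrative only, not CLAN or RELY code, with invented utterances) of two segmentations that an ICC over utterance counts could not tell apart:

    # Illustrative only: two segmentations of the same speech with
    # equal utterance counts but no matching boundaries.
    gold  = ["the dog ran", "and then he stopped", "he barked"]
    rater = ["the dog", "ran and then he stopped he", "barked"]

    # An ICC computed over totals sees perfect agreement here...
    print(len(gold), len(rater))                      # 3 3

    # ...even though not a single utterance matches.
    print(sum(g == r for g, r in zip(gold, rater)))   # 0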
More generally, the standard measures of reliability, including those computed by RELY, are meant to apply to coding systems, such as those one would find on the %spa or %cod lines, not to the main line of a transcript.
To compute word-level agreement, one could turn to something like the BLEU score used in judging machine translations. There is a good discussion of this on the Wikipedia page. However, BLEU operates on pairs of utterances, and we already have a problem if the two coders have segmented things differently.
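For what it is worth, here is a rough sketch of computing BLEU between two versions of a single utterance using NLTK (the utterances are invented for illustration). Note that it presupposes an utterance-by-utterance alignment between the two transcripts, which is exactly what differing segmentation breaks:

    # A sketch using NLTK's BLEU implementation (assumes nltk is installed).
    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    # One reference transcription and one hypothesis for the *same* utterance;
    # BLEU has no way to score utterances that the coders delimited differently.
    reference = [["the", "dog", "ran", "and", "then", "he", "stopped"]]
    hypothesis = ["the", "dog", "ran", "then", "he", "stopped"]

    # Smoothing avoids zero scores on short utterances with missing n-grams.
    smooth = SmoothingFunction().method1
    print(sentence_bleu(reference, hypothesis, smoothing_function=smooth))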
Grant and journal reviewers often ask whether transcription reliability has been computed. This seems like a reasonable request, but the fact is that there is no straightforward way to compute it. Perhaps the “bag of words” method available through the fourth function for RELY in section 7.21 of the CLAN manual makes the most sense.
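As an illustration of the general idea (my own sketch, not the actual RELY implementation described in the manual), a bag-of-words comparison pools all the word tokens from each transcript and ignores utterance boundaries entirely, which is what lets it sidestep the segmentation problem:

    # Sketch of a bag-of-words agreement measure; not the RELY algorithm.
    from collections import Counter

    def bag_of_words_agreement(transcript_a, transcript_b):
        """Proportion of word tokens shared between two transcripts."""
        bag_a = Counter(w for utt in transcript_a for w in utt.split())
        bag_b = Counter(w for utt in transcript_b for w in utt.split())
        shared = sum((bag_a & bag_b).values())    # multiset intersection
        total = max(sum(bag_a.values()), sum(bag_b.values()))
        return shared / total if total else 0.0

    gold  = ["the dog ran", "and then he stopped", "he barked"]
    rater = ["the dog", "ran and then he stopped he", "barked"]
    print(bag_of_words_agreement(gold, rater))    # 1.0: boundaries ignored

Because the word bags here are identical, the score is perfect even though every utterance boundary differs, which is precisely the trade-off: segmentation disagreements no longer block the comparison, but they also no longer count against it.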
—Brian MacWhinney