Oh, yes, that ICC. That could work. But wouldn’t that be limited to comparing two or more raters in terms of their scores for the total number of utterances in a sample? It would not look at how they arrived at that number. So, one rater could break utterance 52 into two pieces, but then join utterances 83 and 84 into one, and end up with the same total as the gold-standard segmentation.
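To make that concrete, here is a minimal Python sketch (illustrative only, not CLAN or RELY code, with invented utterances) of two segmentations that an ICC over utterance counts could not tell apart:

    # Illustrative only: two segmentations of the same speech with
    # equal utterance counts but no matching boundaries.
    gold  = ["the dog ran", "and then he stopped", "he barked"]
    rater = ["the dog", "ran and then he stopped he", "barked"]

    # An ICC computed over totals sees perfect agreement here...
    print(len(gold), len(rater))                      # 3 3

    # ...even though not a single utterance matches.
    print(sum(g == r for g, r in zip(gold, rater)))   # 0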
More generally, the standard measures of reliability, including those computed by RELY, are meant to apply to coding systems, such as those one would find on the %spa or %cod lines, not to the main line of a transcript.
To compute word-level agreement, one could turn to something like the BLEU score used in judging machine translations. There is a good discussion of this on the Wikipedia page. However, BLEU operates on pairs of utterances, and we already have a problem if the two coders have segmented things differently.
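For what it is worth, here is a rough sketch of computing BLEU between two versions of a single utterance using NLTK (the utterances are invented for illustration). Note that it presupposes an utterance-by-utterance alignment between the two transcripts, which is exactly what differing segmentation breaks:

    # A sketch using NLTK's BLEU implementation (assumes nltk is installed).
    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    # One reference transcription and one hypothesis for the *same* utterance;
    # BLEU has no way to score utterances that the coders delimited differently.
    reference = [["the", "dog", "ran", "and", "then", "he", "stopped"]]
    hypothesis = ["the", "dog", "ran", "then", "he", "stopped"]

    # Smoothing avoids zero scores on short utterances with missing n-grams.
    smooth = SmoothingFunction().method1
    print(sentence_bleu(reference, hypothesis, smoothing_function=smooth))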
Grant and journal reviewers often ask whether transcription reliability has been computed. This seems like a reasonable request, but the fact is that there is no straightforward way to compute it. Perhaps the “bag of words” method available through the fourth function for RELY in section 7.21 of the CLAN manual makes the most sense.
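As an illustration of the general idea (my own sketch, not the actual RELY implementation described in the manual), a bag-of-words comparison pools all the word tokens from each transcript and ignores utterance boundaries entirely, which is what lets it sidestep the segmentation problem:

    # Sketch of a bag-of-words agreement measure; not the RELY algorithm.
    from collections import Counter

    def bag_of_words_agreement(transcript_a, transcript_b):
        """Proportion of word tokens shared between two transcripts."""
        bag_a = Counter(w for utt in transcript_a for w in utt.split())
        bag_b = Counter(w for utt in transcript_b for w in utt.split())
        shared = sum((bag_a & bag_b).values())    # multiset intersection
        total = max(sum(bag_a.values()), sum(bag_b.values()))
        return shared / total if total else 0.0

    gold  = ["the dog ran", "and then he stopped", "he barked"]
    rater = ["the dog", "ran and then he stopped he", "barked"]
    print(bag_of_words_agreement(gold, rater))    # 1.0: boundaries ignored

Because the word bags here are identical, the score is perfect even though every utterance boundary differs, which is precisely the trade-off: segmentation disagreements no longer block the comparison, but they also no longer count against it.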
—Brian MacWhinney