
interclass vs intraclass correlation


JP

Mar 26, 2002, 4:49:08 PM
I have been unable to find an adequate explanation of exactly what the
difference is between an interclass and an intraclass correlation, and
of the circumstances in which you would choose either.
Ian Kestin


Rich Ulrich

Mar 26, 2002, 6:18:55 PM

For the 'intraclass correlation', you *assume* that the ratings
share a common mean, mu. It can be computed for a *set* of raters.
Deviations are computed around that common mean.

Thus, you only use it for 2 ratings when you expect equality,
with parallel measurements. The intra-class is what is available
when you can't systematically "distinguish" the raters in
a fair way.

For contrasting 2 raters, I like using a paired t-test and the
corresponding interclass correlation. That shows you both of the
main pieces of information, without confusing or confounding them.
You get r to measure parallelism; you get t to measure the
mean-difference.

(What you don't have assured is equality of variances,
which presumably isn't much of a problem.)
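
Here is a minimal Python sketch of that two-number report, using scipy;
the two rating vectors are invented for illustration:

import numpy as np
from scipy import stats

# Invented ratings of the same 8 subjects by two raters.
rater1 = np.array([3, 5, 4, 6, 7, 2, 5, 4])
rater2 = np.array([4, 6, 4, 7, 8, 3, 6, 5])

# r measures parallelism: do the raters rank the subjects alike?
r, r_p = stats.pearsonr(rater1, rater2)

# Paired t measures mean-difference: is one rater systematically higher?
t, t_p = stats.ttest_rel(rater1, rater2)

print(f"Pearson r = {r:.3f} (p = {r_p:.3f})")
print(f"paired t  = {t:.3f} (p = {t_p:.3f})")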

--
Rich Ulrich, wpi...@pitt.edu
http://www.pitt.edu/~wpilib/index.html

John W. Kulig

Mar 27, 2002, 10:54:19 AM

JP wrote: [...]

This is in answer to the second half of your question: Use the interclass
when you are correlating two different things, such as height and weight. You
can go through the entire data set of paired scores and it is obvious which
number in each pair is a height (call it A) and which is a weight (B).
Intraclass correlations are used when you cannot do this. They are used
(as one example) to get correlations for twins. If you had a set of IQ scores
from twins, each pair of IQ scores is from a twin pair - but there is no
basis for assigning one A and the other B. Each member of a twin duo could as
easily be thrown into column A as column B.
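
A classical way to compute the twin-type coefficient is Fisher's
"double entry" intraclass r: enter every pair twice, once in each order,
so that neither member has to be called A or B, then take an ordinary
Pearson r. A minimal Python sketch, with invented IQ pairs:

import numpy as np

# Invented IQ scores for five twin pairs; within-pair order is arbitrary.
pairs = np.array([[103, 109], [88, 84], [128, 119],
                  [97, 92], [110, 114]], dtype=float)

# Double entry: list every pair in both orders, then correlate.
a = np.concatenate([pairs[:, 0], pairs[:, 1]])
b = np.concatenate([pairs[:, 1], pairs[:, 0]])
print(np.corrcoef(a, b)[0, 1])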

--
------------------------------------------------------------------
John W. Kulig ku...@mail.plymouth.edu
Department of Psychology http://oz.plymouth.edu/~kulig
Plymouth State College tel: (603) 535-2468
Plymouth NH USA 03264 fax: (603) 535-2412
------------------------------------------------------------------
"Kane to kaló ke ríchto sto yaló."
(Do a good deed and cast it to the sea)
Ancient Greek saying
------------------------------------------------------------------



Dennis Roberts

Mar 27, 2002, 1:25:04 PM
At 10:50 AM 3/27/02 -0500, John W. Kulig wrote:


>JP wrote:
>
> > I have been unable to find an adequate explanation of exactly what the
> > difference is between an interclass and an intraclass correlation [...]
>
>This is in answer to the second half of your question: Use the interclass
>when you are correlating two different things, such as height and weight.
>[...] Each member of a twin duo could as easily be thrown into column A
>as column B.


to expand on this a bit without going into stat funk detail, think about
some strategies one might look at when you want to examine the
"relationship" between two columns of scores ... when we have (say) twin data

to make it very simple, say we have 4 twin pairs on IQs ...

IQ A IQ B

103 109
88 84
128 119
97 92

now, we could do a regular PPM (Pearson product-moment) r on the data we
have ... r = .928

but, since there is no necessary logical reason why the twins are ordered
across the lines as they are ... the set of data could just as sensibly
have been

103 109
84 88
128 119
92 97

in this case, the r = .964

well, which is right? what if we rearranged the data in all possible IQA
and IQB configurations ... did all the rs ... and took an average? we might
say that this is the "typical" r you might get when correlating the two
columns of values ... no matter which of the pair comes first or second

another conceptual way would be the following

the notion above is that the pair of values on a line ... that is, the twin
PAIR ... should not be as different as the differences we might see DOWN
one column or the other, IF there is something going on in terms of
genetics. that is ... the between-column variance (across rows) should be
rather small ... compared to the WITHIN-column variance

in the extreme case, the two values on each line might be the
same in both columns, like:

103 103
84 84
128 128
97 97

here we see no BETWEEN column variability ... across the rows ... but,
clearly, there is within column variability

now, this would not be usual of course but, it does set the upper limit

thus, some comparison of between column variation (across the rows) versus
within column variation ... can help to examine the issue of relationship
between the columns of data

typically, scenarios like this might have the paired columns being for
twins ... versus the paired columns being for non-twin siblings ... and, in
the analysis above, if the comparison of between-column (across the rows)
variances versus within-column variances is different in the two cases ...
then, we say that the role of genetics (perhaps) is the reason why the
relationship between the sets of data is different
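
Both ideas in this post can be checked with a short Python sketch; only
numpy is assumed, and the data are the four twin pairs above:

import itertools
import numpy as np

pairs = np.array([[103, 109],
                  [88, 84],
                  [128, 119],
                  [97, 92]], dtype=float)

# Idea 1: average the Pearson r over every possible within-pair ordering.
rs = []
for flips in itertools.product([0, 1], repeat=len(pairs)):
    a = np.array([p[f] for p, f in zip(pairs, flips)])
    b = np.array([p[1 - f] for p, f in zip(pairs, flips)])
    rs.append(np.corrcoef(a, b)[0, 1])
print("mean r over all orderings:", np.mean(rs))

# Idea 2: one-way ANOVA with twin pairs as groups; compare between-pair
# and within-pair mean squares.  With 2 members per pair,
# ICC(1) = (MSB - MSW) / (MSB + MSW).
grand = pairs.mean()
msb = 2 * ((pairs.mean(axis=1) - grand) ** 2).sum() / (len(pairs) - 1)
msw = ((pairs - pairs.mean(axis=1, keepdims=True)) ** 2).sum() / len(pairs)
print("intraclass r:", (msb - msw) / (msb + msw))   # about .93 here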


Dennis Roberts, 208 Cedar Bldg., University Park PA 16802
<Emailto: d...@psu.edu>
WWW: http://roberts.ed.psu.edu/users/droberts/drober~1.htm
AC 8148632401

JP

Mar 27, 2002, 4:17:22 PM

"John W. Kulig" <ku...@mail.plymouth.edu> wrote in message
news:3CA1EA47...@mail.plymouth.edu...

> This is in answer to the second half of your question: Use the interclass
> when you are correlating two different things, such as height and weight.
> [...] Each member of a twin duo could as easily be thrown into column A
> as column B.

Thank you, this does help, although the data I have does not fit either of
your examples. I have a single candidate's answer sheet for 12 questions
(each question is scored 1, 2, 3, or 4) which has been marked by 15
different examiners. I wish to have a single number to assess overall
inter-examiner agreement. I had thought that interclass correlation was
the correct technique, but was told I should be using intraclass
correlation instead, and I have been unable to find a convincing
explanation since.
Ian Kestin

Rich Ulrich

Mar 27, 2002, 5:05:27 PM
On Wed, 27 Mar 2002 21:17:22 +0000 (UTC), "JP"
<Janepeutrell@removethis bit.btinternet.com> wrote:
> Thank you, this does help, although the data I have does not fit either
> of your examples. I have a single candidate's answer sheet for 12
> questions (each question is scored 1, 2, 3, or 4) which has been marked
> by 15 different examiners. I wish to have a single number to assess
> overall inter-examiner agreement. [...]
> Ian Kestin

Oh! well, you are right.
You certainly do not have data for computing a correlation,
either the usual intraclass or interclass.
Without a *sample* to represent a *range* of traits,
you are limited, without a doubt, to describing
deviations rather than similarity.

You can describe how much the raters vary on a
question, say, as the standard deviation of
responses. Or you can report their range.

You could get a number across the 12 questions,
which would be computed with a correlation formula.
It would not be a Pearson r, though, if 'r' is a
reference to something with a known statistical
distribution. That would be something that falls
into the class of 'profile analysis'. It would be
somewhat pointless or weird to compute it if you
didn't have a context and an a priori reason for it.
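
A minimal Python sketch of that kind of descriptive summary; the 15 x 12
score matrix here is random stand-in data, not the actual marks:

import numpy as np

rng = np.random.default_rng(0)
# Stand-in data: 15 examiners each scoring the same 12 answers on a 1-4 scale.
scores = rng.integers(1, 5, size=(15, 12))

# Spread of the 15 examiners' marks on each question.
sds = scores.std(axis=0, ddof=1)
ranges = scores.max(axis=0) - scores.min(axis=0)

for q in range(scores.shape[1]):
    print(f"question {q + 1:2d}: sd = {sds[q]:.2f}, range = {ranges[q]}")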

David Treder

Mar 28, 2002, 8:17:23 AM

JP wrote:

> Thank you, this does help, although the data I have does not fit either
> of your examples. [...] I had thought that interclass correlation was
> the correct technique, but was told I should be using intraclass
> correlation instead, and I have been unable to find a convincing
> explanation since.

see -- Shrout, P.E., and Fleiss, J.L. (1979). "Intraclass correlations:
uses in assessing rater reliability," Psychological Bulletin, 86,
420-428.

SAS has a macro available (INTRACC) that
computes the different types of ICCs. Included
in the macro is a good, brief explanation of the
specific ICC you should use.

Dave
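
For readers without SAS, here is a rough Python rendering of one of those
forms, Shrout and Fleiss's ICC(2,1) (two-way random effects, single
rating); the ratings matrix is invented, and this sketch is no substitute
for a vetted routine:

import numpy as np

def icc_2_1(x):
    # x: n targets (rows) rated by k judges (columns).
    n, k = x.shape
    grand = x.mean()
    msr = k * ((x.mean(axis=1) - grand) ** 2).sum() / (n - 1)  # targets
    msc = n * ((x.mean(axis=0) - grand) ** 2).sum() / (k - 1)  # judges
    resid = x - x.mean(axis=1, keepdims=True) - x.mean(axis=0) + grand
    mse = (resid ** 2).sum() / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Invented example: 6 targets rated by 3 judges.
ratings = np.array([[9, 2, 5],
                    [6, 1, 3],
                    [8, 4, 6],
                    [7, 1, 2],
                    [10, 5, 6],
                    [6, 2, 4]], dtype=float)
print(icc_2_1(ratings))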

SR Millis

Mar 28, 2002, 9:48:21 AM
Rich Ulrich wrote:
> For contrasting 2 raters, I like using a paired t-test and the
> corresponding interclass correlation. That shows you both of the
> main pieces of information, without confusing or confounding them.
> You get r to measure parallelism; you get t to measure the
> mean-difference.

Lin has discussed the shortcomings of the t-test for assessing
concordance between raters (Biometrics, 1989, 45, 255-268). Among other
things, the paired t-test fails to detect poor agreement in pairs of
data such as (1,3)(2,3)(3,3)(4,3)(5,3).

The Pearson correlation coefficient can be a good starting point for
detecting lack of agreement. But a high r doesn't necessarily indicate
agreement. As a follow-up, Lin's concordance correlation coefficient or
the Bradley-Blackwood procedure can be useful supplements.

SR Millis

Dennis Roberts

Mar 28, 2002, 10:34:12 AM
At 09:54 AM 3/28/02 -0800, SR Millis wrote:

>The Pearson correlation coefficient can be a good starting point for
>detecting lack of agreement. But a high r doesn't necessarily indicate
>agreement.

actually, the r by itself tells you nothing about agreement ... as the
following example will show ... say we have a rating scale of 1 to 10 ...
bad to good ... and 2 raters, say, across 5 Ss (that they rate)

rater 1 rater 2

10 5
9 4
8 3
7 2
6 1

the r is +1 but, there is not one single rating in agreement ... in fact,
rater 1 says all are good ... and rater 2 says all are bad

in fact, an r close to 0 MAY be accompanied by more agreement, such as

rater 1 rater 2

5 2
4 4
3 5
2 3
1 1

the r here is not exactly 0 (it works out to about .3) ... but, it is not
good ... if you look at the differences in ratings, however, the sizes of
the differences are smaller than in the first example ... at least both
raters are rating the 5 Ss in the same area of the rating scale
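
The two tables above are easy to check in Python; r and the typical size
of the rating differences tell different stories:

import numpy as np

# Example 1: perfect r, zero agreement (rater 1 always 5 points higher).
a1, b1 = np.array([10, 9, 8, 7, 6]), np.array([5, 4, 3, 2, 1])
# Example 2: weak r, but both raters stay in the same region of the scale.
a2, b2 = np.array([5, 4, 3, 2, 1]), np.array([2, 4, 5, 3, 1])

for a, b in [(a1, b1), (a2, b2)]:
    r = np.corrcoef(a, b)[0, 1]
    mad = np.abs(a - b).mean()
    print(f"r = {r:+.2f}, mean |difference| = {mad:.1f}")

# prints r = +1.00 with mean difference 5.0, then r = +0.30 with 1.2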

Roy St Laurent

Mar 28, 2002, 11:20:04 AM
Lin's concordance correlation coefficient, which is not model-based, is
very similar to the model-based procedures discussed in the Shrout and
Fleiss article. Lin does provide an excellent discussion of the failings
of t-tests and of the use of the ordinary correlation coefficient in
such settings. However, in most settings I would advocate the use of
model-based approaches, such as those proposed by Shrout and Fleiss,
over the method that Lin has proposed.

Another useful reference in this context is the article

Altman, D.G., and Bland, J.M. (1983). "Measurement in medicine: the
analysis of method comparison studies," The Statistician, 32, 307-317.

I have never heard of the Bradley-Blackwood procedure. Do you have
a reference for it?

Shrout, P.E., and Fleiss, J.L. (1979). "Intraclass correlations: uses in
assessing rater reliability," Psychological Bulletin, 86, 420-428.


--
Roy St. Laurent
Mathematics & Statistics
Northern Arizona University
http://odin.math.nau.edu/~rts


Rich Ulrich

Mar 28, 2002, 12:17:36 PM
to srmillis and sci.stat.edu,

On 28 Mar 2002 06:48:21 -0800, srmi...@mindspring.com (SR Millis)
wrote:

> Rich Ulrich wrote:
> > For contrasting 2 raters, I like using a paired t-test and the
> > corresponding interclass correlation. That shows you both of the
> > main pieces of information, without confusing or confounding them.
> > You get r to measure parallelism; you get t to measure the
> > mean-difference.
>
> Lin has discussed the shortcomings of the t-test for assessing
> concordance between raters (Biometrics, 1989, 45, 255-268). Among other
> things, the paired t-test fails to detect poor agreement in pairs of
> data such as (1,3)(2,3)(3,3)(4,3)(5,3).

Thanks for the reference. I did not have that.

Sure, the t-test fails to detect.... No one uses it *alone*, do they?
I always say, you use *both* the r and the t-test.

There are the two elements: you can look at them separately,
or look at them confounded with each other.

For publication, editors (historically) like a single number.
For research, researchers ought to see what's what.
Both errors matter, but they are very distinct:
it takes far less training to get rid of *bias* in rating scores
than to generate *correlation* where it is absent.

A decent paired t-test procedure (in SPSS, for example) shows
you both the r and the t-test. (I don't know whether paired-t
is still missing from SAS.)

Here are lines I found in a help file, downloaded from a webpage
on the Stata module for concordance --
http://ideas.uqam.ca/ideas/data/Softwares/bocbocodeS404501.html

" Lin's coefficient increases in value as a function of the nearness
of the data's reduced major axis to the line of perfect concordance
(the accuracy of the data) and of the tightness of the data
about its reduced major axis (the precision of the data). The Pearson
correlation coefficient, r, the bias-correction factor, C_b, and the
equation of the reduced major axis are reported to show these
components. Note that the concordance correlation coefficient, rho_c,
can be expressed as the product of r, the measure of precision, and
C_b, the measure of accuracy. "

Okay, Lin's measure might be fine for the editor.
(I don't know what this C_b is, but it certainly starts out as
obscure, compared to t-tests that are in every intro-stats course.)
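
The rho_c = r x C_b decomposition quoted above can be written out
directly. A minimal sketch, using the usual sample formula for Lin's
coefficient and made-up data:

import numpy as np

def lin_ccc(x, y):
    # rho_c = 2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2)
    sxy = np.cov(x, y, bias=True)[0, 1]
    return 2 * sxy / (x.var() + y.var() + (x.mean() - y.mean()) ** 2)

x = np.array([1., 2., 3., 4., 5.])
print(lin_ccc(x, x + 2))           # 0.50: r = 1, but a constant bias of 2
print(lin_ccc(x, np.full(5, 3.)))  # 0.00: the flat-rater case given above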

> The Pearson correlation coefficient can be a good starting point for
> detecting lack of agreement. But a high r doesn't necessarily indicate
> agreement. As a follow-up, Lin's concordance correlation coefficient or
> the Bradley-Blackwood procedure can be useful supplements.

Bradley-Blackwood is also new to me.
I don't find much from Google, except the reference: 1991, Journal of
Quality Technology, 23, 12-16. It appears that there is an F-test, which
seems to be another single-number report.
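
For what it is worth, the Bradley-Blackwood procedure is usually
described as regressing the paired differences on the paired sums and
F-testing both coefficients at once; the sketch below assumes that
description, since only the reference is given here:

import numpy as np
from scipy import stats

def bradley_blackwood(x, y):
    # Simultaneous test of equal means and equal variances for paired
    # data: regress d = x - y on s = x + y, test intercept = slope = 0.
    d, s = x - y, x + y
    n = len(d)
    X = np.column_stack([np.ones(n), s])
    beta, *_ = np.linalg.lstsq(X, d, rcond=None)
    sse = ((d - X @ beta) ** 2).sum()   # residual SS, full model
    sst = (d ** 2).sum()                # SS under H0 (d identically zero)
    f = ((sst - sse) / 2) / (sse / (n - 2))
    return f, stats.f.sf(f, 2, n - 2)

rng = np.random.default_rng(1)
x = rng.normal(0, 1, 30)
y = x + rng.normal(0.3, 0.5, 30)   # second rater biased and noisier
print(bradley_blackwood(x, y))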

ak83...@gmail.com

Jan 31, 2016, 11:32:06 AM
On Tuesday, March 26, 2002 at 2:49:08 PM UTC-7, JP wrote:

Rich Ulrich

Jan 31, 2016, 1:20:06 PM
Googling < interclass vs intraclass correlation > finds the post.

That thorough discussion in 2002 (13 posts from 8 authors,
including me) contains a number of good references for
further reading.

I assume that JP posted this fragment while failing to notice the
date, and without warning from Google that he was sending
to a Usenet group.

--
Rich Ulrich