For the 'intraclass correlation', you *assume* that the ratings
share a common mean, and deviations are computed around that
common mean. It can be computed for a *set* of raters, not just two.
Thus, with 2 ratings you only use it when you expect equality,
as with parallel measurements. The intraclass correlation is also
what is available when you cannot systematically "distinguish"
the raters in any fair way.
For contrasting 2 raters, I like using a paired t-test and the
corresponding interclass (Pearson) correlation. Together they show
you both of the main pieces of information, without confusing them or
confounding them at all: you get r to measure parallelism,
and t to measure the mean difference.
(What you are not assured of is equality of variances,
which presumably isn't much of a problem.)
--
Rich Ulrich, wpi...@pitt.edu
http://www.pitt.edu/~wpilib/index.html
John W. Kulig wrote:
This is in answer to the second half of your question: Use the interclass
when you are correlating two different things, such as height and weight. You
can go through the entire data set of paired scores and it is obvious which
number in each pair is a height (call it A) and which is a weight (B).
Intraclass correlations are used when you cannot do this. They are used
(as one example) to get correlations for twins. If you had a set of IQ scores
from twins, each pair of IQ scores is from a twin pair - but there is no
basis for assigning one A and the other B. Each member of a twin duo could as
easily be thrown into column A as column B.
--
------------------------------------------------------------------
John W. Kulig ku...@mail.plymouth.edu
Department of Psychology http://oz.plymouth.edu/~kulig
Plymouth State College tel: (603) 535-2468
Plymouth NH USA 03264 fax: (603) 535-2412
------------------------------------------------------------------
"Kane to kaló ke ríchto sto yaló."
(Do a good deed and cast it to the sea)
Ancient Greek saying
------------------------------------------------------------------
>JP wrote:
>
> > I have been unable to find an adequate explanation of exactly what is the
> > difference between an interclass and an intraclass correlation, and the
> > circumstances in which you would choose either.
> > Ian Kestin
>
>[John W. Kulig's explanation of interclass vs. intraclass correlations, given in full above]
to expand on this a bit without going into heavy statistical detail, think
about some strategies you might use to examine the "relationship" between
two columns of scores when you have (say) twin data ...
to make it very simple, say we have IQs for 4 twin pairs ...
IQ A IQ B
103 109
88 84
128 119
97 92
now, we could compute a regular PPM (Pearson product-moment) r on the data
as given ... r = .928
but, since there is no logical reason why the twins are ordered within
the rows as they are ... the data could just as sensibly have been
103 109
84 88
128 119
92 97
in this case, the r = .964
well, which is right? what if we rearranged the data in all possible IQ A
and IQ B configurations ... computed all the rs ... and took an average? we might
say that this is the "typical" r you would get when correlating the two
columns of values ... no matter which member of the pair comes first or second
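here is a minimal numpy sketch of that thought experiment (Python; the
looping scheme and names are mine):

    import numpy as np
    from itertools import product

    # the 4 twin pairs above; which twin lands in column A is arbitrary
    pairs = np.array([[103, 109],
                      [88,  84],
                      [128, 119],
                      [97,  92]])

    # all 2^4 = 16 ways of assigning each pair's scores to columns A and B,
    # a Pearson r for each, then the average
    rs = []
    for flips in product([0, 1], repeat=len(pairs)):
        a = [row[f] for row, f in zip(pairs, flips)]
        b = [row[1 - f] for row, f in zip(pairs, flips)]
        rs.append(np.corrcoef(a, b)[0, 1])

    print(round(float(np.mean(rs)), 3))   # the "typical" r over all orderings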
another conceptual approach would be the following
the notion here is that the pair of values on a line ... that is, the twin
PAIR ... should not differ as much as the values differ DOWN ONE
COLUMN OR THE OTHER, IF there is something going on in terms of
genetics. that is ... the between-column variance (within rows) should be
rather small ... compared to the WITHIN-column variance
in the extreme case, the two values in each row might be the
same, like:
103 103
84 84
128 128
97 97
here we see no BETWEEN-column variability ... across the rows ... but,
clearly, there is within-column variability
now, this would not be typical of course but, it does set the upper limit
thus, some comparison of between-column variation (across the rows) versus
within-column variation ... can help to examine the issue of relationship
between the columns of data
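that between-versus-within comparison is exactly what the one-way ANOVA
intraclass correlation computes ... a minimal numpy sketch for the 4 pairs
above (ICC(1) in Shrout-Fleiss notation; the labels are mine):

    import numpy as np

    pairs = np.array([[103, 109],
                      [88,  84],
                      [128, 119],
                      [97,  92]], dtype=float)

    n, k = pairs.shape                     # n = 4 rows (pairs), k = 2 columns
    grand = pairs.mean()
    row_means = pairs.mean(axis=1)

    # between-rows mean square vs. within-rows mean square (one-way ANOVA)
    ms_between = k * ((row_means - grand) ** 2).sum() / (n - 1)
    ms_within = ((pairs - row_means[:, None]) ** 2).sum() / (n * (k - 1))

    # one-way intraclass correlation -- ICC(1) in Shrout-Fleiss notation:
    # near 1 when rows differ far more than the two scores within a row
    icc1 = (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
    print(round(icc1, 3))                  # about 0.93 for these four pairs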
typically, scenarios like this might have the paired columns being for
twins ... versus the paired columns being for non-twin siblings ... and, in
the analysis above, if the comparison of between-column (across the rows)
variance versus within-column variance is different in the two cases ...
then, we say that the role of genetics (perhaps) is the reason why the
relationship between the sets of data is different
Dennis Roberts, 208 Cedar Bldg., University Park PA 16802
Email: d...@psu.edu
WWW: http://roberts.ed.psu.edu/users/droberts/drober~1.htm
tel: (814) 863-2401
Oh! well, you are right.
You certainly do not have data for computing a correlation,
either the usual intraclass or interclass.
Without a *sample* to represent a *range* of traits,
you are limited, without a doubt, to describing
deviations rather than similarity.
You can describe how much the raters vary on a
question as, say, the standard deviation of their
responses, or as their range.
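For instance, a quick sketch of those per-question summaries,
with made-up scores standing in for the real answer sheet (Python):

    import numpy as np

    rng = np.random.default_rng(0)
    # hypothetical stand-in for the real answer sheet:
    # 15 examiners (rows) each scoring the same 12 questions 1-4
    scores = rng.integers(1, 5, size=(15, 12))

    sd_per_q = scores.std(axis=0, ddof=1)            # spread across examiners
    range_per_q = scores.max(axis=0) - scores.min(axis=0)
    for q in range(12):
        print(f"Q{q + 1:2d}: SD = {sd_per_q[q]:.2f}, range = {range_per_q[q]}")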
You could get a single number across the 12 questions,
computed with a correlation formula. It would not
be a Pearson r, though, if 'r' is a reference to
something with a known statistical distribution;
it would fall into the class of 'profile analysis'.
It would be somewhat pointless, or weird, to compute
it if you didn't have a context and an a-priori
reason for it.
JP wrote:
> Thank you, this does help, although the data I have does not fit either of
> your examples. I have a single candidate's answer sheet to 12 questions (each
> question is scored 1, 2, 3, or 4) which has been marked by 15 different
> examiners. I wish to have a single number to assess overall inter-examiner
> agreement. I had thought that interclass correlation was the correct
> technique, but was told I should be using intraclass correlation instead,
> and have been unable to find a convincing explanation ever since.
see Shrout, P.E., and Fleiss, J.L. (1979), "Intraclass
correlations: uses in assessing rater reliability,"
Psychological Bulletin, 86, 420-428.
SAS has a macro available (INTRACC) that
computes the different types of ICCs. Included
in the macro is a good, brief explanation of the
specific ICC you should use.
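For those without SAS, here is a rough numpy sketch of the three
single-rater ICCs from that paper (my own transcription of the
Shrout-Fleiss formulas; worth checking against the paper before
relying on it):

    import numpy as np

    def shrout_fleiss_iccs(x):
        # x: n targets (rows) rated by k raters (columns)
        x = np.asarray(x, float)
        n, k = x.shape
        grand = x.mean()
        row_m, col_m = x.mean(axis=1), x.mean(axis=0)

        bms = k * ((row_m - grand) ** 2).sum() / (n - 1)         # between targets
        jms = n * ((col_m - grand) ** 2).sum() / (k - 1)         # between judges
        wms = ((x - row_m[:, None]) ** 2).sum() / (n * (k - 1))  # within targets
        ems = (((x - row_m[:, None] - col_m[None, :] + grand) ** 2).sum()
               / ((n - 1) * (k - 1)))                            # residual

        icc_1_1 = (bms - wms) / (bms + (k - 1) * wms)            # one-way random
        icc_2_1 = (bms - ems) / (bms + (k - 1) * ems + k * (jms - ems) / n)
        icc_3_1 = (bms - ems) / (bms + (k - 1) * ems)            # raters fixed
        return icc_1_1, icc_2_1, icc_3_1

For the answer sheet described above, x would be the 12 questions (rows)
by the 15 examiners (columns).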
Dave
Lin has discussed the shortcomings of the t-test for assessing
concordance between raters (Biometrics, 1989, 45, 255-268). Among other
things, the paired t-test fails to detect poor agreement in pairs of
data such as (1,3), (2,3), (3,3), (4,3), (5,3).
The Pearson correlation coefficient can be a good starting point for
detecting lack of agreement, but a high r doesn't necessarily indicate
agreement. As a follow-up, Lin's concordance correlation coefficient or
the Bradley-Blackwood procedure can be useful supplements.
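A minimal sketch of Lin's coefficient (the formula from the 1989 paper;
the function name is mine), applied to the example above:

    import numpy as np

    def lin_ccc(x, y):
        # Lin's concordance correlation coefficient (Biometrics, 1989):
        # 2*cov / (var_x + var_y + squared mean difference), n denominators
        x, y = np.asarray(x, float), np.asarray(y, float)
        sxy = ((x - x.mean()) * (y - y.mean())).mean()
        return 2 * sxy / (x.var() + y.var() + (x.mean() - y.mean()) ** 2)

    # the pairs above: the mean difference is 0, so a paired t finds nothing,
    # yet only one of the five pairs actually agrees
    x, y = [1, 2, 3, 4, 5], [3, 3, 3, 3, 3]
    print(lin_ccc(x, y))   # 0.0 -- the CCC flags the lack of concordance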
SR Millis
>The Pearson correlation coefficient can be a good starting point for
>detecting lack of agreement, but a high r doesn't necessarily indicate
>agreement.
actually, the r by itself tells you nothing about agreement ... as the
following example will show ... say we have a rating scale of 1 to 10 ...
bad to good ... and 2 raters rating, say, 5 Ss
rater 1 rater 2
10 5
9 4
8 3
7 2
6 1
the r is +1 but, there is not one single rating in agreement ... in fact,
rater 1 says all are good ... and rater 2 says all are bad
in fact, an r close to 0 MAY come with more agreement, such as
rater 1 rater 2
5 2
4 4
3 5
2 3
1 1
the r here is not exactly 0 (it works out to .30) ... which is not good ...
but if you look at the differences in ratings, the sizes of the differences
are smaller than in the first example ... at least both raters are rating
the 5 Ss in the same region of the rating scale
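applying the same Lin CCC sketch from earlier in the thread to these two
tables makes the point numerically:

    import numpy as np

    def lin_ccc(x, y):                       # same sketch as above
        x, y = np.asarray(x, float), np.asarray(y, float)
        sxy = ((x - x.mean()) * (y - y.mean())).mean()
        return 2 * sxy / (x.var() + y.var() + (x.mean() - y.mean()) ** 2)

    # first table: r = +1, but no rating agrees at all
    print(round(lin_ccc([10, 9, 8, 7, 6], [5, 4, 3, 2, 1]), 2))   # 0.14
    # second table: r is only .30, but the raters stay in the same region
    print(round(lin_ccc([5, 4, 3, 2, 1], [2, 4, 5, 3, 1]), 2))    # 0.30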
Another useful reference in this context is the article:
Altman, D.G., and Bland, J.M. (1983), "Measurement in medicine: the
analysis of method comparison studies," The Statistician, 32, 307-317.
I have never heard of the Bradley-Blackwood procedure. Do you have
a reference for it?
--
Roy St. Laurent
Mathematics & Statistics
Northern Arizona University
http://odin.math.nau.edu/~rts
On 28 Mar 2002 06:48:21 -0800, srmi...@mindspring.com (SR Millis)
wrote:
> Rich Ulrich wrote:
> > For contrasting 2 raters, I like using a paired t-test and the
> > corresponding interclass (Pearson) correlation. Together they show
> > you both of the main pieces of information, without confusing them or
> > confounding them at all: you get r to measure parallelism,
> > and t to measure the mean difference.
>
> Lin has discussed the shortcomings of the t-test for assessing
> concordance between raters (Biometrics, 1989, 45, 255-268). Among other
> things, the paired t-test fails to detect poor agreement in pairs of
> data such as (1,3)(2,3)(3,3)(4,3)(5,3).
Thanks for the reference. I did not have that.
Sure, the t-test fails to detect.... No one uses it *alone*, do they?
I always say, you use *both* the r and the t-test.
There are the two elements: you can look at them separately,
or look at them confounded with each other.
For publication, editors (historically) like a single number.
For research, researchers ought to see what's what.
Both errors matter, but they are very distinct:
it takes far less training to get rid of *bias* in rating scores
than to generate *correlation* where it is absent.
A decent paired t-test procedure (in SPSS, for example) shows
you both the r and the t-test. (I don't know whether paired-t
is still missing from SAS.)
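A quick sketch of that two-number report in Python (made-up ratings,
scipy for the tests):

    import numpy as np
    from scipy import stats

    # two raters' scores on the same 8 subjects (made-up numbers)
    r1 = np.array([3, 4, 2, 5, 4, 3, 2, 4])
    r2 = np.array([2, 4, 3, 4, 3, 3, 1, 4])

    r, _ = stats.pearsonr(r1, r2)     # parallelism
    t, p = stats.ttest_rel(r1, r2)    # mean difference (bias)
    print(f"r = {r:.2f}, paired t = {t:.2f}, p = {p:.3f}")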
Here are lines I found in a help file, downloaded from a webpage
on the Stata module for concordance --
http://ideas.uqam.ca/ideas/data/Softwares/bocbocodeS404501.html
" Lin's coefficient increases in value as a function of the nearness
of the data's reduced major axis to the line of perfect concordance
(the accuracy of the data) and of the tightness of the data
about its reduced major axis (the precision of the data). The Pearson
correlation coefficient, r, the bias-correction factor, C_b, and the
equation of the reduced major axis are reported to show these
components. Note that the concordance correlation coefficient, rho_c,
can be expressed as the product of r, the measure of precision, and
C_b, the measure of accuracy. "
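A small numerical check of that decomposition, rho_c = r * C_b, using
Lin's formulas (the data here are arbitrary):

    import numpy as np

    x = np.array([3.0, 4, 2, 5, 4, 3])
    y = np.array([2.0, 4, 3, 4, 3, 2])

    r = np.corrcoef(x, y)[0, 1]                              # precision
    v = x.std() / y.std()                                    # scale shift
    u = (x.mean() - y.mean()) / np.sqrt(x.std() * y.std())   # location shift
    c_b = 2 / (v + 1 / v + u ** 2)                           # accuracy

    sxy = ((x - x.mean()) * (y - y.mean())).mean()
    rho_c = 2 * sxy / (x.var() + y.var() + (x.mean() - y.mean()) ** 2)
    print(np.isclose(rho_c, r * c_b))                        # True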
Okay, Lin's measure might be fine for the editor.
(I don't know what this C_b is, but it certainly starts out obscure,
compared to t-tests, which are in every intro-stats course.)
> The Pearson correlation coefficient can be a good starting point for
> detecting lack of agreement, but a high r doesn't necessarily indicate
> agreement. As a follow-up, Lin's concordance correlation coefficient or
> the Bradley-Blackwood procedure can be useful supplements.
Bradley-Blackwood is also new to me.
I don't find much from Google except the reference: Journal of
Quality Technology, 1991, 23, 12-16. It seems that there is an F-test,
which appears to be another single-number report.
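Here is a rough sketch of my reading of that procedure (the
regress-differences-on-sums form; treat the details as an assumption
to verify against the 1989/1991 papers):

    import numpy as np
    from scipy import stats

    def bradley_blackwood(x, y):
        # regress d = x - y on s = x + y; testing intercept = slope = 0
        # simultaneously tests equal means AND equal variances
        x, y = np.asarray(x, float), np.asarray(y, float)
        d, s = x - y, x + y
        n = len(d)
        slope, intercept = np.polyfit(s, d, 1)
        sse = ((d - (intercept + slope * s)) ** 2).sum()
        f = ((d ** 2).sum() - sse) / 2 / (sse / (n - 2))
        return f, stats.f.sf(f, 2, n - 2)

    # made-up ratings: a small mean shift between the two raters
    print(bradley_blackwood([3, 4, 2, 5, 4, 3, 2, 4],
                            [2, 4, 3, 4, 3, 3, 1, 4]))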