inter-observer variation


Evie

Dec 16, 2009, 8:54:03 AM
to MedStats
I need to compare the readings of 2 observers on 5 moulds. As you can
see from the data below, the values for the 5 moulds cover a wide
range (one is very small and one is large). Because of this I can't
use the intra-class correlation coefficient. Do you have any
suggestions for what I could use instead? (So far I've calculated the
% difference between observers.)

Observer 1   Observer 2
     44.49        44.96
      9.29         9.03
      3.65         3.51
      1.00         1.02
      0.02         0.02

Each observer measured each mould 10 times so the above values are the
means of the 10 replicates.
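
For readers following along, the percentage difference mentioned above could be computed roughly as follows. This is only a sketch: the names obs1, obs2 and pct_diff are made up for illustration, and dividing by the mean of the two observers is just one common convention, not necessarily the one used for the original calculation.

```r
# Means of the 10 replicates per mould, as posted above
obs1 <- c(44.49, 9.29, 3.65, 1.00, 0.02)
obs2 <- c(44.96, 9.03, 3.51, 1.02, 0.02)

# Percentage difference relative to the mean of the two observers
# (assumed convention; dividing by obs1 or obs2 alone is also seen)
pct_diff <- 100 * (obs1 - obs2) / ((obs1 + obs2) / 2)

round(pct_diff, 2)
```

Expressing the difference relative to the size of the mould puts the very small and very large moulds on a comparable footing, which is the same idea as working on a log scale.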

I appreciate your advice.

Ted Harding

Dec 16, 2009, 10:16:48 AM
to meds...@googlegroups.com
If at all possible, get hold of the full data (the 10 separate replicates
for each number). There Is No Substitute For The Original Data!

Given that the five moulds are clearly very different in size:

45:9:3.5:1:0.02

I have quickly run a simple paired comparison on a log scale (R code):

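# NB: 44.90 and 3.52 are as typed in this reply; the original post
# gives 44.49 and 3.51. The output below corresponds to the values
# typed here.
X1 <- c(44.90, 9.29, 3.65, 1.00, 0.02)
X2 <- c(44.96, 9.03, 3.52, 1.02, 0.02)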

mean(log(X1/X2))
# [1] 0.008702865
sd(log(X1/X2))
# [1] 0.02310973
mean(log(X1/X2))/( sd(log(X1/X2))/sqrt(5) )
# [1] 0.842078 ### (compare with t-test below)

t.test(log(X1),log(X2),paired=TRUE)
# Paired t-test
# data: log(X1) and log(X2)
# t = 0.8421, df = 4, p-value = 0.4471
# alternative hypothesis: true difference in means is not equal to 0
# 95 percent confidence interval:
# -0.01999166 0.03739739
# sample estimates:
# mean of the differences
# 0.008702865

so there is absolutely no evidence in those five pairs of numbers
that the two readers differ. Nor is one systematically greater
than the other:

X1 <- c(44.90, 9.29, 3.65, 1.00, 0.02)
X2 <- c(44.96, 9.03, 3.52, 1.02, 0.02)

X1 compared with X2 (>, = or <), pair by pair:   <   >   >   <   =

The only evidence of how much variation to expect when one person
measures a mould of given size will be in the individual replicate
results.

Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.H...@manchester.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 16-Dec-09 Time: 15:16:44
------------------------------ XFMail ------------------------------

Bruce Weaver

Dec 17, 2009, 1:55:50 PM
to MedStats

I forwarded the original post to Geoff Norman at McMaster University,
because I figured this is in his bailiwick. He asked me to post the
following on his behalf. I think only one or two responses had been
posted when he wrote this, by the way.


--- Start of Norman's Response ---

The example poses a number of issues.

1) What do you do with objects that are so clearly different?

Reliability is formally defined as

True variance between objects / Total variance

Foreshadowing my comment to the first respondent below, this
definition is not in dispute. It is over 100 years old. The ICC, which
follows directly from the definition, was Chapter 8 in Fisher's stats
book, Statistical Methods for Research Workers, published in 1925. But
the definition goes back to Pearson before that. And it's in APA
guidelines and measurement books.

So, it's obvious from these data that the reliability is and will
remain close to 1.
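
[A minimal R sketch, not part of Norman's response, for anyone who wants to check the "close to 1" claim numerically. It computes a one-way random-effects ICC, often labelled ICC(1,1), from the five pairs of means in the original post. The variable names and the choice of the one-way form are assumptions made for illustration. Because only the means are used, pure within-observer error is not visible in this calculation.]

```r
# Means of the 10 replicates per mould, from the original post
obs1 <- c(44.49, 9.29, 3.65, 1.00, 0.02)
obs2 <- c(44.96, 9.03, 3.51, 1.02, 0.02)

# Long format: one row per reading, mould as a factor
d <- data.frame(
  value = c(obs1, obs2),
  mould = factor(rep(1:5, times = 2))
)
k <- 2  # readings (observers) per mould

# One-way ANOVA: between-mould and within-mould mean squares
ms  <- anova(lm(value ~ mould, data = d))[["Mean Sq"]]
msb <- ms[1]  # between moulds
msw <- ms[2]  # within moulds (observer disagreement)

# One-way random-effects ICC for a single reading
icc <- (msb - msw) / (msb + (k - 1) * msw)
icc
```

[The between-mould mean square dwarfs the within-mould one, so the ratio comes out very close to 1, as stated above.]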

You could do a log transform to make things more normal, but then you
would be asking about the reliability of the logs.

So one answer is that this is a bit like looking at the reliability of
height measurement -- it just isn't worth doing.


2) However, let's suppose you had a more typical problem.

There is a very good reason to go back to the raw data. By using the
means, you have lost any way to estimate error variance within
observer, since the means have effectively reduced it by a factor of
10.

What you really have is two sources of error -- pure within-rater
measurement error and error between raters. You've confounded the two.
The right approach is to treat these two sources explicitly by
conducting a two-factor (Observation / 10 levels and Observer / 2
levels) repeated measures ANOVA. Then compute variance components and
use these to construct ICCs that estimate the intra-rater and
inter-rater reliability. The method is called "Generalizability
Theory", originally described by Cronbach in 1972, and described in a
couple of measurement books -- Brennan RL, Generalizability Theory, and
Streiner DL & Norman GR, Health Measurement Scales (OUP, 2007).
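
[A rough R sketch of the variance-components calculation described above, not part of Norman's response. The raw replicates were never posted, so the data here are simulated: the mould sizes, the observer offset and the error SD are all hypothetical. The model (mould, observer and their interaction all treated as random) and the two coefficients at the end are one common formulation, assumed for illustration.]

```r
set.seed(1)
n_mould <- 5; n_obs <- 2; n_rep <- 10

# Hypothetical replicate-level data (NOT the real measurements)
d <- expand.grid(rep = 1:n_rep,
                 observer = factor(1:n_obs),
                 mould = factor(1:n_mould))
true_size <- c(45, 9, 3.5, 1, 0.02)  # assumed mould sizes
obs_shift <- c(0, 0.1)               # assumed observer offset
d$y <- true_size[as.integer(d$mould)] +
  obs_shift[as.integer(d$observer)] +
  rnorm(nrow(d), sd = 0.3)           # assumed replicate error

# Two-way crossed ANOVA with replication
tab <- anova(lm(y ~ mould * observer, data = d))
ms  <- setNames(tab[["Mean Sq"]], rownames(tab))

# Variance components from expected mean squares
# (negative estimates are truncated at zero)
var_e  <- ms[["Residuals"]]
var_mo <- max((ms[["mould:observer"]] - var_e) / n_rep, 0)
var_m  <- max((ms[["mould"]] - ms[["mould:observer"]]) / (n_obs * n_rep), 0)
var_o  <- max((ms[["observer"]] - ms[["mould:observer"]]) / (n_mould * n_rep), 0)

total <- var_m + var_o + var_mo + var_e
inter_rater <- var_m / total                     # different observers
intra_rater <- (var_m + var_o + var_mo) / total  # same observer, repeated
c(inter_rater = inter_rater, intra_rater = intra_rater)
```

[With real replicates in place of the simulated y, var_e is the pure within-rater error that is lost when only the means are kept.]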

3) The confusion between stability and reliability

The first respondent makes the common error of confusing reliability
with stability. A difference between the means of the two observers
says nothing about the ability of the measurement to discriminate the
objects. You would get the same mean difference shown by Respondent 1
if you rearranged the observations so the first pair was 44.49 and
0.02. And of course it would be even less likely that the difference
would be significant. Or, if you like, take 10 random numbers and put
them in 2 columns of 5. The mean difference is 0 (on average), so by
Respondent 1's criterion it's good measurement. But of course
reliability is 0 on average.
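
[A small R illustration of the rearrangement argument, not part of Norman's response. It uses the means from the original post on the raw scale; reversing the second column is just one convenient way to pair 44.49 with 0.02, and the variable names are made up for the example.]

```r
obs1 <- c(44.49, 9.29, 3.65, 1.00, 0.02)
obs2 <- c(44.96, 9.03, 3.51, 1.02, 0.02)

# Rearranged so that the first pair is 44.49 and 0.02
obs2_shuffled <- rev(obs2)

# The mean difference does not depend on the pairing
mean_diff_orig <- mean(obs1 - obs2)
mean_diff_shuf <- mean(obs1 - obs2_shuffled)
all.equal(mean_diff_orig, mean_diff_shuf)

# The paired t-test is even further from significance after shuffling
p_orig <- t.test(obs1, obs2, paired = TRUE)$p.value
p_shuf <- t.test(obs1, obs2_shuffled, paired = TRUE)$p.value

# But the ability to tell the moulds apart is destroyed
r_orig <- cor(obs1, obs2)
r_shuf <- cor(obs1, obs2_shuffled)
c(r_orig = r_orig, r_shuf = r_shuf)
```

[The mean difference and the t-test cannot tell the two arrangements apart in any useful way, whereas the correlation between observers goes from nearly 1 to negative.]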

Geoff Norman
McMaster University
1200 Main St. W.
Hamilton ON L8N3Z5, Canada

--- End of Norman's response ---

--
Bruce Weaver
bwe...@lakeheadu.ca
http://sites.google.com/a/lakeheadu.ca/bweaver/Home
"When all else fails, RTFM."

BXC (Bendix Carstensen)

Dec 17, 2009, 6:13:26 PM
to meds...@googlegroups.com
The reliability measure is indeed a well established quantity.

It involves the total variance.
Now if you are speaking purely of the differences between the observers, and merely taking the moulds as a vehicle to assess this, the variation between moulds is irrelevant. By that token, so is the total variation, and hence also the reliability.

So reliability refers to a situation where the moulds actually represent a sample of some relevant population of moulds about which we want to make a statement in terms of the observers' ability to measure them. And so a sample size of 5 is indeed very small.

It is not really clear from the original post whether the moulds are considered a sample from a specific population or just a convenience sample to assess the agreement between the observers. My initial post implicitly assumed the latter (owing to the title of the post).
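
[A minimal R sketch of an agreement summary that ignores the variation between moulds, added for illustration and not taken from this post. It computes Bland-Altman style limits of agreement on the log scale, using the means from the original post. The choice of method and the variable names are assumptions, and with only 5 pairs the limits are very imprecise.]

```r
obs1 <- c(44.49, 9.29, 3.65, 1.00, 0.02)
obs2 <- c(44.96, 9.03, 3.51, 1.02, 0.02)

# Log ratio of the two observers for each mould; the mould size
# itself cancels out, so between-mould variation plays no role
log_ratio <- log(obs1 / obs2)

bias_log <- mean(log_ratio)
sd_log   <- sd(log_ratio)

# Approximate 95% limits of agreement, back-transformed to a ratio
loa_ratio <- exp(bias_log + c(-1.96, 1.96) * sd_log)
names(loa_ratio) <- c("lower", "upper")

exp(bias_log)  # typical ratio observer 1 / observer 2
loa_ratio
```

[Limits that sit symmetrically around a ratio of 1 would suggest no systematic difference between the observers. How wide they may be before the observers count as disagreeing is a subject-matter question, not a statistical one.]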

Best regards,
Bendix Carstensen
_______________________________________________

Bendix Carstensen
Senior Statistician
Steno Diabetes Center
Niels Steensens Vej 2-4
DK-2820 Gentofte
Denmark
+45 44 43 87 38 (direct)
+45 30 75 87 38 (mobile)
b...@steno.dk http://www.biostat.ku.dk/~bxc
www.steno.dk
