Re: {MEDSTATS} Re: Agreement


Peter Flom

Aug 16, 2008, 11:40:09 AM
to MedS...@googlegroups.com
Janet Hill <janet...@yahoo.co.uk> wrote
>Sent: Aug 16, 2008 10:14 AM
>To: MedS...@googlegroups.com
>Subject: {MEDSTATS} Re: Agreement
>
>
> I have been asked to look at two scales to assess
>"oral health" and test whether they agree. One is
>derived from physical observations and is an ordinal
>scale 1 to 4, and the other from various clinical
>measurements is from 1 to 6.
> I have considered using chi-squared or similar,
>or ordered logistic regression, but I would be
>grateful for any advice on the appropriate analysis.
>Many thanks,
>Janet
>


Well, you could just answer "no" but that's probably not what they want :-)

First, you have to figure out what they mean by 'agree'. Some possibilities:

1) That people who are higher on one scale are higher on the other

2) That people who are at the extremes on one are on the extremes on the other, and otherwise, like 1)

3) That people who are at 1 on the four point are at 1 or 2 on six, at 2 are on 2 or 3, at 3 at 4 or 5 and at 4 at 5 or 6.

and other schemes that are slight variations on 3)

4) That the means of groups of people fit a linear approximation between the scales

and who knows what else!
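
Possibility 3) is concrete enough to check directly. The following is a minimal sketch, assuming the mapping exactly as listed in 3); the name ALLOWED, the function name, and the six example pairs are made up for illustration and are not from Janet's data.

```python
# Hypothetical mapping for scheme 3: each 4-point score lists the
# 6-point scores that would count as "agreeing" with it.
ALLOWED = {1: {1, 2}, 2: {2, 3}, 3: {4, 5}, 4: {5, 6}}


def proportion_consistent(pairs, allowed=ALLOWED):
    """Fraction of (four_point, six_point) pairs that fit the mapping."""
    if not pairs:
        raise ValueError("need at least one pair of scores")
    hits = sum(1 for four, six in pairs if six in allowed[four])
    return hits / len(pairs)


# Made-up scores for six patients, for illustration only.
example_pairs = [(1, 1), (1, 2), (2, 3), (3, 3), (4, 6), (4, 4)]
print(proportion_consistent(example_pairs))
```

A variation on 3) only needs a different ALLOWED table, which is one way to make the clinicians spell out what they mean.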

Peter

Peter L. Flom, PhD
Statistical Consultant
www DOT peterflom DOT com

Peter Flom

Aug 16, 2008, 11:42:32 AM
to MedS...@googlegroups.com

Bland, M.

Aug 18, 2008, 8:03:24 AM
to MedS...@googlegroups.com
Peter is right, the usual meaning of agreement is that the two scales
give the same answer. This is impossible here, as they have different
ranges.

I agree that you need to talk to them about what they actually want. My
guess would be that they mean "Do they measure the same or similar
things?". I would suggest a correlation approach. This is not useful
for agreement, because it does not notice things like one scale giving
bigger numbers than the other, but is useful for validity. You will
need a rank correlation, I think, because these are only ordinal
variables, and I would suggest Kendall's tau b to deal with the many
tied ranks. However, once you have done that and got your number, what
does it mean? You can test the null hypothesis that tau = 0, but a
significant correlation does not tell you much. You need to set a
reasonable value for tau to represent agreement. This is difficult to
do, because tau, like any correlation, depends on the variability of the
quantity being measured. When we have a square table, with the same
possible scores for both variables, tau b tends to be bigger than
linearly weighted kappa. I would suggest multiplying the usual lower
limits for categories of kappa (>0.8 = very good agreement, >0.6 = good
agreement, >0.4 = moderate agreement, >0.2 = fair agreement) by 1.1 to
give plausible categories for tau b.
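
As a rough illustration of this suggestion, here is a minimal sketch in plain Python. It computes Kendall's tau b by counting pairs and then applies the kappa limits multiplied by 1.1. The example scores are invented, and the label "below fair" for anything under the lowest limit is an assumption, since no name was given for that range.

```python
import math


def kendall_tau_b(x, y):
    """Kendall's tau-b from pairwise comparisons, corrected for ties."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    concordant = discordant = tied_x = tied_y = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0:
                tied_x += 1
            if dy == 0:
                tied_y += 1
            if dx * dy > 0:
                concordant += 1
            elif dx * dy < 0:
                discordant += 1
    n_pairs = n * (n - 1) // 2
    denom = math.sqrt((n_pairs - tied_x) * (n_pairs - tied_y))
    if denom == 0:
        raise ValueError("tau-b is undefined when one variable is constant")
    return (concordant - discordant) / denom


# Usual lower limits for kappa, as quoted above.
KAPPA_LIMITS = [(0.8, "very good"), (0.6, "good"),
                (0.4, "moderate"), (0.2, "fair")]


def tau_b_category(tau, factor=1.1):
    """Apply the kappa limits, each multiplied by `factor`, to tau-b."""
    for limit, label in KAPPA_LIMITS:
        if tau > limit * factor:
            return label
    return "below fair"  # assumed label; not named in the thread


# Invented scores for eight patients (4-point and 6-point scales).
four_point = [1, 1, 2, 2, 3, 3, 4, 4]
six_point = [2, 1, 2, 3, 3, 5, 4, 6]
tau = kendall_tau_b(four_point, six_point)
print(round(tau, 3), tau_b_category(tau))
```

The pair counting is quadratic in the number of patients, which is fine for a clinical sample but not for very large data sets.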

Martin

--
***************************************************
J. Martin Bland
Prof. of Health Statistics
Dept. of Health Sciences
Seebohm Rowntree Building Area 2
University of York
Heslington
York YO10 5DD

Email: mb...@york.ac.uk
Phone: 01904 321334 Fax: 01904 321382
Web site: http://martinbland.co.uk/
***************************************************

Peter Flom

Aug 18, 2008, 8:34:52 AM
to MedS...@googlegroups.com
Martin wrote

>Peter is right, the usual meaning of agreement is that the two scales
>give the same answer. This is impossible here, as they have different
>ranges.
>

I thought so .... glad to have independent confirmation :-)


>I agree that you need to talk to them about what they actually want. My
>guess would be that they mean "Do they measure the same or similar
>things?". I would suggest a correlation approach. This is not useful
>for agreement, because it does not notice things like one scale giving
>bigger numbers than the other, but is useful for validity. You will
>need a rank correlation, I think, because these are only ordinal
>variables, and I would suggest Kendall's tau b to deal with the many
>tied ranks. However, once you have done that and got your number, what
>does it mean? You can test the null hypothesis that tau = 0, but a
>significant correlation does not tell you much. You need to set a
>reasonable value for tau to represent agreement. This is difficult to
>do, because tau, like any correlation, depends on the variability of the
>quantity being measured. When we have a square table, with the same
>possible scores for both variables, tau b tends to be bigger than
>linearly weighted kappa. I would suggest multiplying the usual lower
>limits for categories of kappa (>0.8 = very good agreement, >0.6 = good
>agreement, >0.4 = moderate agreement, >0.2 = fair agreement) by 1.1 to
>give plausible categories for tau b.
>

I did not know of this relation between kappa and tau-b, so that is good to know.

Correlation is certainly one vital part of validating one scale against another. But, depending on what use they are making of these things, it may not be enough. Although the original poster didn't say why this question was being asked (always a useful thing to know!) it strikes me that it might be that some group wants to substitute one scale for the other --- perhaps it's cheaper or something, or requires less expertise. If this is the case, then I, for one, would insist on considerably more evidence of validity.

Diana Miglioretti

Aug 18, 2008, 10:58:17 AM
to meds...@googlegroups.com
 
On a slightly related note --- Can someone explain the difference between Kendall's tau b and Kendall's coefficient of concordance (which I have seen written as 'W') when there are only two raters? Is tau a measure of correlation and W a measure of agreement?
 
Thanks!
Diana



Ray Koopman

Aug 18, 2008, 12:55:32 PM
to MedStats
On Aug 18, 7:58 am, Diana Miglioretti <dimigliore...@hotmail.com>
wrote:
> On a slightly related note --- Can someone explain the difference
> between Kendall's Tau b and Kendall's coefficient of concordance
> (I have seen as 'w') when there are only two raters?
> is Tau a measure of correlation and w a measure of agreement?

W = r + (1-r)/k, where k = the number of raters and r = the average
pairwise Spearman correlation among the raters. I generally find it
more useful to think in terms of r rather than W.

Spearman's r is just the Pearson r calculated on the ranks of the
ratings (i.e., with each rater's data transformed to have a uniform
marginal distribution) and is thus a measure of the linearity of
the relation of the ranks.

Kendall's tau operates in different units. It counts the agreements
and disagreements in the pairwise orderings implied by the ratings,
and gives the difference between the two, divided by the number of
pairwise comparisons.

How the labels 'correlation' and 'agreement' should be assigned is
a matter of convention.
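
The three quantities described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the two rating vectors are invented, the example has no ties, and the tau computed here divides by all n(n-1)/2 pairs as described in this post (often called tau-a), not the tie-corrected tau b.

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        shared = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman's r: the Pearson r computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def concordance_w_two_raters(x, y):
    """W = r + (1 - r)/k with k = 2 raters."""
    r = spearman(x, y)
    return r + (1 - r) / 2


def kendall_tau_a(x, y):
    """(agreements - disagreements) / number of pairwise comparisons."""
    n = len(x)
    diff = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            diff += (s > 0) - (s < 0)
    return diff / (n * (n - 1) / 2)


# Invented ratings of five subjects by two raters.
rater_a = [1, 2, 3, 4, 5]
rater_b = [2, 1, 4, 3, 5]
print(spearman(rater_a, rater_b))
print(concordance_w_two_raters(rater_a, rater_b))
print(kendall_tau_a(rater_a, rater_b))
```

With two raters W is just (1 + r)/2, so it carries no information beyond Spearman's r, which fits the remark that r is the more useful quantity to think about.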

John Uebersax

Aug 27, 2008, 12:43:05 PM
to MedStats
If you can accept that the intervals on both scales are equally
spaced, or at least approximately so, then you can use the Pearson
correlation coefficient to quantify agreement.

If not, a good alternative is the polychoric correlation coefficient
and its generalizations.

See:

http://ourworld.compuserve.com/homepages/jsuebersax/tetra.htm

(if the page is not working, Google 'polychoric correlation' and refer
to the cached version of the page.)

HTH.

John Uebersax PhD


aindrayan

Aug 27, 2008, 9:17:52 PM
to MedStats
Please never use or suggest the Pearson correlation coefficient for
assessing agreement. If one scale always reads two points higher
than the other, the Pearson correlation is one but they have no agreement.
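
A minimal sketch of this point, using invented scores: the second scale is the first plus two, so the Pearson correlation is 1 while not a single pair of scores matches.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Invented scores; scale_b is always exactly two points higher.
scale_a = [1, 2, 3, 4, 2, 3]
scale_b = [a + 2 for a in scale_a]

r = pearson(scale_a, scale_b)
exact_matches = sum(a == b for a, b in zip(scale_a, scale_b))
print(r, exact_matches)
```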

~Abhaya Indrayan
Professor of Biostatistics
Delhi University College of Medical Sciences
Delhi 110095

SadaNand Dwivedi

Aug 28, 2008, 1:45:58 AM
to MedS...@googlegroups.com
I also reiterate the view already expressed by Prof. Indrayan. Pearson's correlation coefficient should never be used to talk about agreement between two procedures/methods recording quantitative values of a variable in the same range. A specific set of analyses is required for this purpose, and the observations should be made accordingly.
 
 
 
 
S.N. Dwivedi, Ph.D., FSMS, FRSS (UK)
Additional Professor & Head
Department of Biostatistics
All India Institute of Medical Sciences
Ansari Nagar
New Delhi-110029, India
Tel: 91-11-26588441 (Direct-Residence)
91-11-26588500 Ext.3394 (Residence)
91-11-26588500 Ext.3387 (Office)
91-9810571956 (Mobile)
Other Emails: dwiv...@hotmail.com
dwi...@aiims.ac.in
dwiv...@yahoo.com


--

John Uebersax

Aug 28, 2008, 3:37:15 AM
to MedStats
Dear Professor Indrayan,

If the two scales are on two different metrics, as in this case, so
that a fixed 'bias' of one relative to the other is meaningless, would
you then revise your caution?

Thanks,

John Uebersax

BXC (Bendix Carstensen)

Aug 28, 2008, 3:54:40 AM
to MedS...@googlegroups.com
The utter irrelevance of correlation for assessing agreement does not disappear because the measurements by one method are scaled by some factor.
The original example just has some (unknown) original data scaled to the intervals (0;4-eps) and (0;5-eps) and subsequently truncated to integers, but this exercise does not make the correlation more sensible. As in all studies of agreement the real problem is in the definition of the subject-matter question.

Bendix
______________________________________________

Bendix Carstensen
Senior Statistician
Steno Diabetes Center
Niels Steensens Vej 2-4
DK-2820 Gentofte
Denmark
+45 44 43 87 38 (direct)
+45 30 75 87 38 (mobile)
b...@steno.dk http://www.biostat.ku.dk/~bxc

ghasem yadegarfar

Aug 28, 2008, 4:13:17 AM
to MedS...@googlegroups.com
Hi All,
 
Use of Pearson's correlation coefficients has been criticised by Bland & Altman: See below,

They have also suggested a very understandable and doable method to evaluate agreement between two numerical scales.
 

STATISTICAL METHODS FOR ASSESSING AGREEMENT BETWEEN TWO METHODS OF CLINICAL MEASUREMENT

J. Martin Bland, Douglas G. Altman

Department of Clinical Epidemiology and Social Medicine, St. George's Hospital Medical School, London SW17 0RE; and Division of Medical Statistics, MRC Clinical Research Centre, Northwick Park Hospital, Harrow, Middlesex

(Lancet, 1986; i: 307-310)

 
Ghasem
 
 
Dr Ghasem Yadegarfar
PhD in Epidemiology, MSc in Biostatistics, BSc in Maths
Assistant Professor, Epidemiologist
Biostat & Epidemiology Dept.,
School of Public Health Sciences
Isfahan University of Medical Sciences
Isfahan, IRAN
Tel: 0098 311 792 2771 (Office) 
Fax: 0098 311- 6682509
Email: yadeg...@yahoo.co.uk


John Uebersax

Aug 28, 2008, 4:31:08 AM
to MedStats
Dear Dr. Dwivedi,

Janet, the original poster wrote:

> I have been asked to look at two scales to assess
> "oral health" and test whether they agree. One is
> derived from physical observations and is an ordinal
> scale 1 to 4, and the other from various clinical
> measurements is from 1 to 6.

Unless I misunderstand what this says, the two measures are on
entirely different scales. Hence usual concerns that correlation
coefficients are insensitive to a fixed-bias difference between two
scales would not appear to apply here. The best way to assess
"agreement", in the broad sense of the term, is correlational, per
Professor Bland's suggestion.

So to be clear, is your concern (1) that correlational methods should
not be used here at all, or (2) that what correlation methods assess
here should not be called 'agreement'?

The latter issue would seem to require a very strict definition of
'agreement'. Strict definitions are not always a good idea; to
circumvent it in this case, that is, if we were to insist on the
utmost precision, Janet's question might need to be rephrased as: "Do
these two scales have the property that lacks a precise name, but
which pertains generally to the subject of consistency,
correspondence, interchangeability, redundancy, reliability, and
validity -- and which, in a colloquial sense, is close enough to the
meaning of the word 'agreement' that, since any other alternative
word is subject to at least as much difficulty, we may use it here?"

To illustrate that "agreement" has a number of meanings, some of which
seem appropriate here, below are some verbatim definitions found in
online dictionary sources:

* compatibility of observations; "there was no agreement between
theory and measurement";
"the results of two tests were in correspondence" (WordNet,
Princeton)

* harmony of people's opinions or actions or characters; "the two
parties were in agreement"
(WordNet, Princeton)

* to be similar : CORRESPOND; to be consistent <the story agrees
with the facts>
(Merriam-Webster dictionary)

If the objection is, "if you use the word 'agreement' here, people
will misunderstand it to refer to a correspondence of the actual
categories, e.g., a '1' for the first scale and a '1' on the second
scale," a reply might be that the context of the problem indicates
that such exact correspondence is not an issue here; hence, a broader
meaning of 'agreement' is to be taken.

Prof. Indrayan wrote:

> never use or suggest Pearson correlation coefficient for assessing agreement.

'Never' is a very strong word. What about cases where a fixed
difference between two scales is unimportant for research purposes?
For example, suppose that reported numeric results of Test A and Test
B: (1) always differ by 1.5; and (2) have a Pearson correlation of
1.0. Further, suppose a researcher wishes to, in a new clinical trial
that examines pre-treatment/post-treatment change, replace expensive
old measure A with less expensive new measure B. That is, whereas in
an old study A was used for pre- and post-treatment evaluation,
measure B will be used in the new study at both timepoints. The
researcher wishes to verify that measure B will be just as useful as
measure A.

In this case, since the pre-post differences will remove any effects
of the fixed difference between A and B (i.e., the pre-post
differences will be the same in either case), then would it not appear
that all that matters here is the Pearson correlation?
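
The cancellation argued here can be sketched with invented numbers. The idealised assumption, taken straight from the example, is that measure B equals measure A plus exactly 1.5 with no further error; the values and names below are hypothetical.

```python
OFFSET = 1.5  # fixed difference between the two measures, as in the example


def change_scores(pre, post):
    """Post-treatment minus pre-treatment value for each subject."""
    return [after - before for before, after in zip(pre, post)]


# Invented pre/post values on measure A for four subjects.
pre_a = [10.0, 12.0, 9.0, 14.0]
post_a = [8.0, 11.0, 9.5, 10.0]

# Idealised measure B: always exactly OFFSET higher than A.
pre_b = [v + OFFSET for v in pre_a]
post_b = [v + OFFSET for v in post_a]

change_a = change_scores(pre_a, post_a)
change_b = change_scores(pre_b, post_b)
print(change_a)
print(change_b)
```

In real data the difference between two measures is rarely exactly constant, so how far the change scores match would depend on the scatter around that offset.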

Bendix Carstensen wrote:

> The utter irrelevance of a correlations for assessing agreement

Please see comments above concerning the various definitions of
'agreement'. Statisticians do not dictate the meaning of words. If
absolute agreement is required of two statistical measures, we may call
this 'absolute agreement', leaving the word 'agreement' broader,
flexible, and with its intended meaning determined by context.

John Uebersax PhD
http://satyagraha.wordpress.com

Bland, M.

Aug 28, 2008, 4:38:30 AM
to MedS...@googlegroups.com
As we keep repeating, it all depends on what you mean by agreement. The
clinicians involved need to explain this, but I do not think they mean
agreement in the sense of this paper. It was about measurements of
physical quantities and the examples given were peak expiratory flow
rate, arterial oxygen saturation, and mean velocity of circumferential
fibre shortening in cardiography. For these we want to record "PEFR =
600 litres/min", not "PEFR = 600 on the Wright scale". "Agreement"
means the two methods give similar numerical values, which are
interchangeable. Clearly correlation cannot address this.

I do not think this is the kind of agreement we are looking for. I
suspect we are asking "How closely do these methods measure the same
thing, even though it is measured in different units?". As the two
scales are unlikely to have interval properties, I suggested rank
correlation as an approach. The limits of agreement approach proposed
by Altman and Bland should not be used for these data.

Martin


Abhaya Indrayan

Aug 28, 2008, 5:18:23 AM
to MedS...@googlegroups.com
Yes, indeed, 'never' is too strong. I agree that agreement is not necessarily numerical agreement in the conventional sense where the values are interchangeable, although we tend to use it in this sense. Martin has explained it well.
~Abhaya Indrayan
--
Dr Abhaya Indrayan, PhD(OhioState),FAMS,FRSS
Professor and Head
Department of Biostatistics and Medical Informatics

University College of Medical Sciences
Dilshad Garden, Delhi 110 095
Phones: +91-11-2259 6637, 2259 4451
Fax: +91-11-2259 0495
Website: http://www.geocities.com/aindrayan

Peter Flom

Aug 28, 2008, 7:48:12 AM
to MedStats
A few more thoughts on this whole thread

1) I think agreement is a weird word to use when scales are completely incompatible. But people do use it. When they do, the statistician must clarify what they mean before he or she can say anything useful.

2) Even when scales are compatible, agreement can be tricky. How close do two measurements have to be in agreement? Suppose we are talking about ordinary bathroom scales. Five people try each of two scales and get:

Person          1    2    3    4    5
First scale   180  150  192  198  125
Second scale  181  149  194  200  123

do they "agree"? How could we tell? Would correlation be good?

If the two were off by only ounces, would that be enough?
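
One way to look at the bathroom-scale numbers is the limits of agreement approach from the Bland and Altman Lancet paper cited earlier in the thread: the mean difference plus or minus 1.96 standard deviations of the differences. The sketch below applies it to the five pairs above and also computes the Pearson correlation for comparison. With only five people the 1.96 multiplier is a rough approximation, and whether the resulting limits are acceptable is a judgement the code cannot make.

```python
import math
import statistics

first = [180, 150, 192, 198, 125]
second = [181, 149, 194, 200, 123]


def limits_of_agreement(a, b, z=1.96):
    """Mean of the differences b - a and the limits mean +/- z * SD."""
    diffs = [y - x for x, y in zip(a, b)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample SD (n - 1 denominator)
    return mean_diff, mean_diff - z * sd_diff, mean_diff + z * sd_diff


def pearson(x, y):
    """Pearson correlation coefficient."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((p - mx) * (q - my) for p, q in zip(x, y))
    sxx = sum((p - mx) ** 2 for p in x)
    syy = sum((q - my) ** 2 for q in y)
    return sxy / math.sqrt(sxx * syy)


mean_diff, lower, upper = limits_of_agreement(first, second)
print(mean_diff, lower, upper)
print(pearson(first, second))
```

The correlation comes out very close to 1 largely because the five people differ so much in weight, while the limits of agreement are expressed in the units of the scales and can be compared with whatever discrepancy is considered tolerable.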


3) In the case that started all this, the scales aren't *completely* incompatible, they are somewhat incompatible. 1 through 4 and 1 through 6 aren't so far off, but it's not clear what to do to make them equivalent.

4) Whether correlation is, or is not, a good measure of agreement does not, to me, depend on the compatibility of the scales, but on the meaning of agreement.

BXC (Bendix Carstensen)

Aug 28, 2008, 10:55:56 AM
to MedS...@googlegroups.com
The limits of agreement predict the difference between two measurements, i.e. they answer whether the difference is 0, bar a clinically irrelevant random error.

Or otherwise phrased: If we use method 1 and predict that method 2 would have given the same result we are only off by a clinically irrelevant amount.

So it all boils down to clinical judgement (which I normally find very hard to tease out of clinicians).

Likewise, if there is a linear relationship between method 1 and 2 we may predict that method 2 would have given a value of A + B x method1 bar a clinically irrelevant random error. Thus the scaling does not make any difference to the essence of the problem. And a linear transformation of the method 2 results does not alter the correlation, so correlation is not relevant in this case either.
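
The invariance of the correlation under a linear transformation can be checked numerically. This is a small sketch with invented measurements and arbitrary, hypothetical values of A and B; with a negative B the sign of the correlation flips but its size stays the same.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((p - mx) * (q - my) for p, q in zip(x, y))
    sxx = sum((p - mx) ** 2 for p in x)
    syy = sum((q - my) ** 2 for q in y)
    return sxy / math.sqrt(sxx * syy)


# Invented measurements by two methods on five subjects.
method1 = [2.0, 3.5, 5.0, 4.0, 6.5]
method2 = [2.4, 3.1, 5.6, 4.2, 6.0]

A, B = 10.0, 2.5  # arbitrary intercept and slope, for illustration only
rescaled = [A + B * v for v in method2]
reversed_scale = [A - B * v for v in method2]

r_original = pearson(method1, method2)
r_rescaled = pearson(method1, rescaled)
r_reversed = pearson(method1, reversed_scale)
print(r_original, r_rescaled, r_reversed)
```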

But in the case of A \neq 0 and B \neq 1 there is of course an estimation problem to get A and B.
This has partly been addressed by regressing differences on averages in:

@Article{Bland:Altman.1999,
author = {Bland, J.M. and Altman, D.G.},
title = {Measuring agreement in method comparison studies.},
journal = {Statistical Methods in Medical Research},
year = {1999},
volume = {8},
pages = {136--160},
}

A fuller discussion of this specific topic is available in:

@TechReport{Carstensen.2008a,
author = {B Carstensen},
title = {Limits of agreement:
How to use the regression of differences on averages.},
institution = {Department of Biostatistics, University of Copenhagen},
year = {2008},
number = {08.6},
address = {\url{http://cms.ku.dk/sund-sites/ifsv-sites/english/about/departments/biostatistics/reports/2008/researchreport08-06.pdf}},
}

Best regards,
Bendix
______________________________________________

Bendix Carstensen
Senior Statistician
Steno Diabetes Center
Niels Steensens Vej 2-4
DK-2820 Gentofte
Denmark
+45 44 43 87 38 (direct)
+45 30 75 87 38 (mobile)
b...@steno.dk http://www.biostat.ku.dk/~bxc

jie kuang

Aug 30, 2008, 1:09:39 AM
to MedS...@googlegroups.com
A good BBS site for using Stata http://www.pinggu.name/bbs/X_AdvCom_Get.asp?UserID=436556





--
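Department of Child, Adolescent, Maternal and Child Health, School of Public Health, Kunming Medical College
Kuang Jie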

SadaNand Dwivedi

Sep 1, 2008, 12:55:39 AM
to MedS...@googlegroups.com
Dear Dr Uebersax,
 
Yes, I agree with you that these two measurements are on different scales. What I meant is that if the scales are quantitative within the same range, we should not use the correlation coefficient for assessing agreement. For this problem, where the scales are different, we obviously have to look for other appropriate measures of agreement. I have never dealt with such a problem. One has to find a solution through careful discussion and research on the issue.
 
SN Dwivedi

--

Bruce Weaver

Sep 2, 2008, 9:53:02 AM
to MedStats

On Sep 1, 12:55 am, "SadaNand Dwivedi" <dwive...@gmail.com> wrote:
> Dear Dr Uebersax,
>
> Yes, I agree with you that these two measurements are on different scale.
> What I meant is if scales are quantitative within same range, we should not
> use correlation coefficient for assessing agreement.

--- snip ---

I would amend that to never use *only* the Pearson correlation. If
you use it in conjunction with a paired t-test, you get answers to two
questions about possible sources of disagreement:

1) How strong is the linear association between the two sets of
measurements?

2) Is there any systematic bias (e.g., one rater a hawk and the other
a dove)?

The intra-class correlation (ICC), which is usually recommended in
this situation, mushes those two possible sources of disagreement
together into a single number. Without a scatterplot or some other
additional information, you can't tell whether a low ICC value is due
to poor linear association, bias, or both.
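
The following sketch illustrates the point with invented ratings in which the second rater is consistently about two points higher. It computes the Pearson r, the paired t statistic, and an ICC. Which form of ICC is meant above is not stated; the one-way random-effects ICC(1,1) is used here purely as an assumption.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((p - mx) * (q - my) for p, q in zip(x, y))
    sxx = sum((p - mx) ** 2 for p in x)
    syy = sum((q - my) ** 2 for q in y)
    return sxy / math.sqrt(sxx * syy)


def paired_t(x, y):
    """t statistic for the mean of the paired differences y - x.

    Undefined (division by zero) if all differences are identical.
    """
    diffs = [b - a for a, b in zip(x, y)]
    n = len(diffs)
    mean_d = sum(diffs) / n
    var_d = sum((d - mean_d) ** 2 for d in diffs) / (n - 1)
    return mean_d / math.sqrt(var_d / n)


def icc_one_way(x, y):
    """One-way random-effects ICC(1,1) for two ratings per subject."""
    n, k = len(x), 2
    subject_means = [(a + b) / 2 for a, b in zip(x, y)]
    grand_mean = sum(subject_means) / n
    ms_between = (k * sum((m - grand_mean) ** 2 for m in subject_means)
                  / (n - 1))
    ms_within = (sum((a - m) ** 2 + (b - m) ** 2
                     for a, b, m in zip(x, y, subject_means))
                 / (n * (k - 1)))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


# Invented ratings of six subjects; rater 2 tends to score higher.
rater1 = [4, 5, 6, 7, 8, 9]
rater2 = [6, 8, 7, 10, 9, 12]

r = pearson(rater1, rater2)
t = paired_t(rater1, rater2)
icc = icc_one_way(rater1, rater2)
print(round(r, 3), round(t, 2), round(icc, 3))
```

In this made-up example the linear association is strong and the paired t statistic is large, while the ICC is much lower than r; the single ICC figure alone would not say that the bias is the main reason.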

--
Bruce Weaver
bwe...@lakeheadu.ca
www.angelfire.com/wv/bwhomedir
"When all else fails, RTFM."