Use of Mann-Whitney U-test with Likert-Scale data


Margaret

Nov 3, 2005, 2:56:32 PM
to MedStats
Dear all

I would very much like to receive feedback on best practice when
analysing data on a Likert scale. Typically, I find that one is
encouraged to use the Mann-Whitney U-test to get a feel for the
difference between the median levels across two groups for which
Likert-scale data are collected. Also, it is frequently assumed that
this is legitimate because the data for both groups should be skewed.
However, some survey populations are unusual, and I wonder whether we
have any reason to assume that we have prior knowledge of the
distributions of the Likert scales for the two parent populations.

If such knowledge is lacking and, in turn, the samples from these
populations display markedly dissimilar distributions, it seems to me
a little improper to assume that the conditions for the Mann-Whitney
U-test have been satisfied. One could try to address this problem by
using the Chi-Square test of association as a test of difference, but
for a two-tailed test on a five-point scale this does not suggest much
about the direction of differences. A further possibility is to use a
Chi-Square test of linear trend to compare the two groups, but I wonder
what others think about this particular idea.
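
To make the idea concrete, I believe the trend test could be run in R
along these lines (the counts here are entirely made up, purely to show
the shape of the call):

## Hypothetical frequencies of two groups across a 5-point Likert scale:
groupA <- c(8, 10, 6, 4, 2)    # group A counts at levels 1..5
groupB <- c(3, 5, 7, 9, 6)     # group B counts at levels 1..5
## Chi-square test for linear trend in the proportion falling in group A
## as the scale level increases (Cochran-Armitage style):
prop.trend.test(x = groupA, n = groupA + groupB, score = 1:5)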

I would be grateful for recommendations on good practice, as (to state
the obvious) not all statisticians are agreed.

Many thanks to all

Best wishes

Margaret

Jeremy Miles

Nov 3, 2005, 3:30:27 PM
to MedS...@googlegroups.com

To throw one more into the mix, what about doing an ordinal
logistic/probit regression?
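
Something along these lines in R, say -- a minimal sketch with made-up
data (polr lives in the MASS package):

library(MASS)
## Hypothetical 5-point Likert scores in two groups:
d <- data.frame(
  score = factor(c(1, 2, 2, 3, 3, 4, 2, 3, 4, 4, 5, 5),
                 levels = 1:5, ordered = TRUE),
  group = rep(c("A", "B"), each = 6))
fit <- polr(score ~ group, data = d, method = "logistic", Hess = TRUE)
summary(fit)  # the group coefficient is a cumulative (proportional-odds)
              # log-odds ratio; method = "probit" gives the probit variant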

JM
--
Jeremy Miles
mailto:jn...@york.ac.uk http://www-users.york.ac.uk/~jnvm1/
Dept of Health Sciences (Area 4), University of York, York, YO10 5DD
Phone: 01904 321375 Mobile: 07941 228018 Fax 01904 321320

Ted Harding

Nov 3, 2005, 4:04:58 PM
to MedS...@googlegroups.com
Hi Margaret,
The danger signal in your description above is "If such knowledge
is lacking and, in turn, the samples from these populations display
markedly dissimilar distributions, ... "

While the Mann-Whitney test is commonly used to test "whether
two samples differ in median" or "in location" or the like,
what it really tests is whether the two samples come from the
same distribution.

What it is sensitive to is whether, when a value X is sampled
from one distribution and a value Y from the other, the chance
that X < Y is different from 1/2. If less than 1/2, then low
values of the Mann-Whitney U are to be expected, and sufficiently
small values will be significant; likewise, if this chance is
greater than 1/2, then large values are to be expected, and
sufficiently large values are significant.

There are all sorts of ways in which P(X < Y) can be different
from 1/2, of which difference in median (or location) is only one.
For example, if the sample x is

1, 2, ... , 8, 9, 10, 20.5, 21, 22, ... , 30

and the sample y is

11, 12, ... , 19, 20, 20.5, 31, 32, ... , 40

then they have the same median (20.5), but the P-value for
a Mann-Whitney (actually Wilcoxon, but they come to the same thing)
is 0.012.

And it's easy to arrange two samples x and y with the same mean
as well as the same median, but with a similarly small P-value.
Just add something large enough to the largest x:

x = 30 -> x = 30 + 21*(mean(y) - mean(x)) = 230

and now the P-value is 0.02435 -- not so extreme as before,
but still significant.
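
If you want to replay this, a quick R sketch (the tie at 20.5 means the
exact p-values may vary slightly with how the software handles ties):

x <- c(1:10, 20.5, 21:30)        # median 20.5
y <- c(11:20, 20.5, 31:40)       # median 20.5
wilcox.test(x, y)                # p about 0.012

x2 <- c(1:10, 20.5, 21:29, 230)  # same median, and mean(x2) == mean(y)
wilcox.test(x2, y)               # p about 0.024 -- still significant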

So: same mean, same median, but a significant difference on
the M-W test! So what's being tested? Why, "P(X < Y) = 1/2"
of course! Conclusion: x and y were sampled from two populations
X and Y in which it is more likely than not that a value from
Y will be greater than a value from X.

But note especially that in the second case, x and y now
perhaps "display markedly dissimilar distributions", in your
words. In fact x has a big outlier.

In the first case, however, you could view the distribution
of y as being effectively the same as that of x, but shifted
up by 10.0, with the exception that where on that basis you
would expect to find an x = 10.5 (analogous to the y = 20.5)
you instead have x = 20.5, and this is the sole point
of difference between y and the shifted x. So in this case
the two distributions might not perhaps be considered "markedly
dissimilar" except in their general location.

And you might indeed say that they were similar in the second
case, if you regard the outlier as a rare exception.

So where, in fact, is the significant difference in the test
coming from? From the fact that most X values are less than
corresponding Y values.

You could in fact obtain exactly equivalent situations, as
far as the test is concerned, by applying the following
changes (or similar) to any, some or all of the blocks of data

1, 2, ... , 9, 10 -> 5.01, 5.02, ... , 5.10
11, 12, ... , 20 -> 15.01, 15.02, ... , 15.10
21, 22, ... , 30 -> 25.01, 25.02, ... , 25.10
31, 32, ... , 40 -> 35.01, 35.02, ... , 35.10

which would introduce all sorts of differences in detail between
the x and y samples, but would not affect the outcome of the test.
All that matters is that the ranks do not get disturbed.

In the face of this, you may well ask what you learn from
a significant Mann-Whitney when applied to two samples
which "display markedly dissimilar distributions".

All you learn is that they are not from the same distribution,
in that a Y is likely to be greater than an X!

So, unless you can see that the two samples are basically
similar apart from a possible shift, what you learn from
a Mann-Whitney test may not be what you had hoped to learn.

As to your question about what may be "good practice" in the
situation you refer to, I think we need to know what you are
looking for (which you do not explicitly state). What is the
difference of interest, which you want your test to be sensitive
to, and what are the differences not of interest (which you would
like your test not to be sensitive to)?

Best wishes,
Ted.


--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.H...@nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 03-Nov-05 Time: 21:04:54
------------------------------ XFMail ------------------------------

Timot...@iop.kcl.ac.uk

Nov 4, 2005, 4:25:59 AM
to MedS...@googlegroups.com
I raised a similar question earlier on but received no reply, yet
Margaret is already getting two replies to her question, so I'm
getting a bit jealous!

My concern was that the normal approximation used in a U-test may not
hold given Likert-type scales, as there will be lots of ties, so I had
preferred a t-test if the data were not overly skewed; and if they were
overly skewed, I'd dichotomize the data into 0 and >0 and run a Chi-square.
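
In R terms, the dichotomised version would be a sketch like this (the
scores are invented purely for illustration):

s1 <- c(0, 0, 0, 1, 0, 2, 0, 3)  # hypothetical group 1 scores
s2 <- c(0, 1, 2, 2, 0, 3, 1, 4)  # hypothetical group 2 scores
grp <- rep(c("1", "2"), times = c(length(s1), length(s2)))
chisq.test(table(grp, c(s1, s2) > 0))  # 2x2 table: group by (0 vs >0)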

(I had avoided ordinal/poisson/binomial regression because A. it's not
available in SPSS, and B. often in descriptive stats you want to keep the
stats simple.)

Tim


Bland, M.

Nov 4, 2005, 4:49:02 AM
to MedS...@googlegroups.com
Then get a better program. Stata is good, R is free.

I agree with Ted about the Mann Whitney. This stuff about medians is a
total red herring. The Mann Whitney only tests the difference between
medians if we assume that the groups differ only in location, i.e. that
the difference between medians is equal to the difference between means
and the variances are the same. This question comes up again and again
on discussion lists.

This assumption is most unlikely for Likert scales, which are limited at
both ends and where people use both ends. I don't think it is possible
for such a scale to have the same shape but a difference in median.

I think that Mann Whitney is OK for this, by the way, but it only does
what it does, i.e. tests the null hypothesis that the two groups are not
stochastically different.

Martin
--
***************************************************
J. Martin Bland
Prof. of Health Statistics
Dept. of Health Sciences
Seebohm Rowntree Building Area 2
University of York
Heslington
York YO10 5DD

Email: mb...@york.ac.uk
Phone: 01904 321334
Fax: 01904 321382
Web site: http://www-users.york.ac.uk/~mb55/
***************************************************

Margaret

Nov 4, 2005, 5:37:25 AM
to MedStats
Dear Ted

Thank you for this kind advice. You wrote:

> As to your question about what may be "good practice" in the
> situation you refer to, I think we need to know what you are
> looking for (which you do not explicitly state). What is the
> difference of interest, which you want your test to be sensitive
> to, and what are the differences not of interest (which you would
> like your test not to be sensitive to)?

My query relates to a variety of scenarios which could arise, all of
which have the following characteristics (or similar ones) in common:

* overall performance is to be assessed for two groups according to
quality ratings by observers on a Likert scale

* sample sizes are small and so histograms for both groups show
irregular and dissimilar distributions

** the main question is: which group performed better?

Thank you in advance for your reply and to all those who would like to
join in!

Yours most gratefully,

Margaret

Ted Harding

Nov 4, 2005, 6:27:08 AM
to MedS...@googlegroups.com
On 04-Nov-05 Timot...@iop.kcl.ac.uk wrote:
>
> I raised a similar question earlier on but received no reply, yet
> Margaret is already getting two replies to her question, so I'm
> getting a bit jealous!

Sorry about that -- no slight intended (though it did follow on
from an extended discussion under "non-parametric methods" which
had a lot about the U test, so maybe people felt there was not
much to add).

> My concern was that the normal approximation used in a U-test
> may not hold given Likert-type scales, as there will be lots of ties,
> so I had preferred a t-test if the data were not overly skewed; and if
> they were overly skewed, I'd dichotomize the data into 0 and >0 and
> run a Chi-square.
>
> (I had avoided ordinal/poisson/binomial regression because A. it's not
> available in SPSS, and B. often in descriptive stats you want to keep
> the stats simple.)
>
> Tim

The U test does not depend on a normal approximation (though this
can be used for large samples), since there is a well-established
exact distribution which any decent software should be able to
compute.

Ties are more problematic, since there is an issue about what they
really represent. If the data represent genuinely integer-valued
entities (like number of children), a tie is a tie. However, if
they represent a discretisation of an underlying continuum
(as many "subjective scales" probably do), then it is reasonable
to suppose that a tie in the data does not represent a tie in
the underlying variable, so you can consider breaking ties at
random: since for two individuals who both score 2 you do not
know whether they were "really" 1.5 vs 1.6 or 1.6 vs 1.5, you can
consider the implications of allowing it to be either way round.

This could either be done by simulation (add random noise of
small magnitude to each integer value on the scale, and then
run the U test on the new "data", which will no longer have ties),
or in fact there is a theoretical approach which enables the
distribution of possible P-values to be computed. For practical
purposes, the simulation approach would be more easily accessible
and should be adequate.
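
For instance, a rough R sketch with invented scores (noise of magnitude
under half the scale step leaves the ordering of distinct values intact):

set.seed(1)
g1 <- c(1, 2, 2, 3, 3, 3, 4, 5)  # hypothetical Likert scores, group 1
g2 <- c(2, 3, 3, 4, 4, 4, 5, 5)  # hypothetical Likert scores, group 2
p <- replicate(2000,
  wilcox.test(g1 + runif(length(g1), -0.4, 0.4),
              g2 + runif(length(g2), -0.4, 0.4),
              exact = TRUE)$p.value)  # no ties left, so exact is available
summary(p)  # spread of p-values over the random tie-breakings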

Regarding skewness: for tests whose usual implementation depends
on normal-distribution assumptions justified by "large sample
theory", the goodness of approximation depends on the closeness
of approach to a limiting normal distribution. E.g. in a t-test
it is assumed that the numerator has a normal distribution,
and this is justified in practice for large samples because
the numerator is a sample mean and the central limit theorem
assures convergence to a normal distribution as sample size
increases.

However, in this convergence it is in fact skewness which tends
to impede convergence more than most! The mean of a sample from
a skew distribution converges more slowly than the mean of a
sample from a symmetrical one.

That being said, if none the less the main distinction between
your samples is in the proportions of "0" scores, then possibly
the approach you mention would be OK; though I would still
want to look for possible differences in how the remaining
scores were distributed.

For example, the two populations may consist (A) of people
who were very clear and definite in their opinions,
(B) people who were uncertain and confused. On a 0-5 scale,
A-people may tend to give 0 or 5, with few in between, while
B-people may be all over the place. There may well be a
difference in proportions of "0", but that's not the whole
story!

Best wishes,
Ted.


--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.H...@nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 04-Nov-05 Time: 11:26:38
------------------------------ XFMail ------------------------------

John Whittington

Nov 4, 2005, 6:51:46 AM
to MedS...@googlegroups.com
At 09:25 04/11/05 +0000, Timot...@iop.kcl.ac.uk wrote (in part):

>My concern was that the normal approximation used in a U-test may not hold
>given Likert-type scales, as there will be lots of ties ....

There are many issues and potential problems that could be discussed here,
but I don't think that's one of them. There are no assumptions about
normality (or approximation to normality) involved in the Mann Whitney test.

I presume what you are talking about is the conversion of U to a
'p-value'. For N up to 20 or so, the 'exact' critical values of U are
available from published Tables. For higher N, one often uses a
calculation which involves a normal approximation (TO THE DISTRIBUTION OF
U) for large samples. However, there's no reason why one has to use that
method if one doesn't want to; 'exact' critical values for U can be
calculated for any N (not just those normally tabulated in textbooks)
without making any assumptions - and, indeed, any current decent
statistical software ought to be able to do that. Be warned, though, that
if N is very large, then the process may be lengthy even with modern
computers, and the answer obtained would be virtually identical to that
obtained using the 'normal approximation'.
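
In R, for instance, one can request either method explicitly
(hypothetical untied data):

x <- rnorm(30); y <- rnorm(35)
wilcox.test(x, y, exact = TRUE)   # exact null distribution of U
wilcox.test(x, y, exact = FALSE)  # large-sample normal approximation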

Large numbers of ties do, indeed, present a potential problem - but that
has nothing to do with the 'normal approximation' to which I assume you refer.

Kind Regards,


John

----------------------------------------------------------------
Dr John Whittington, Voice: +44 (0) 1296 730225
Mediscience Services Fax: +44 (0) 1296 738893
Twyford Manor, Twyford, E-mail: Joh...@mediscience.co.uk
Buckingham MK18 4EL, UK medis...@compuserve.com
----------------------------------------------------------------

Timot...@iop.kcl.ac.uk

Nov 4, 2005, 7:56:31 AM
to MedS...@googlegroups.com
Sorry, I must have got the normal approximation and the tie thing a bit
confused. But I have been warned about ties from the moment I learnt the
Mann-Whitney test, and hence it's been a constant warning note in my head.
I think you all agree that ties can potentially be a problem. But when does
that 'potential' become a real problem? The common disregard of it would
suggest that for most situations (including that of Likert-type data) it
is not a problem, but I'd like to see more evidence of why that's the case.

I've got a book called The Statistical Sleuth by Ramsey and Schafer. In
Chapter 4 it examines an example which I'll quote here. We're comparing
the number of O-ring incidents on 24 space shuttle flights prior to the
Challenger disaster at two launch temperatures: below 65F and above 65F.
The results are as follows:
Below: 1 1 1 3
Above: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 2

Using a permutation test based on the t-test, they obtained a one-tailed
p-value of 0.0099. Now I plugged the data into SPSS and ran a U-test,
specifying even the Exact option (which I thought was equivalent to a
permutation test, but perhaps not). It gives me an exact significance of
0.002, for both one and two tails (asymptotically it's 0.001). Now that,
I think, is a considerable discrepancy in results, and I attributed the
discrepancy to the presence of ties.
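
For comparison, the corresponding R call would be roughly as below; with
this many ties R abandons the exact calculation and reports a tie-corrected
normal approximation, so its p-value need not agree exactly with either
figure above:

below <- c(1, 1, 1, 3)
above <- c(rep(0, 17), 1, 1, 2)
wilcox.test(below, above)  # warns about ties; normal approximation used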

And to my mind, data from Likert-type scales are often not too different
from the above. True, the above is probably a bit sparser than the average
results you have. But at least, if the permutation results can be
considered superior, it demonstrates that the U test cannot be applied
blindly to Likert-type data.

Tim

Ted - You mentioned that the solution to this may be by some sort of
simulation, but I wouldn't imagine this is routinely done in software.




Margaret

Nov 4, 2005, 8:59:20 AM
to MedStats
Let me clarify: in stressing my concern about using the Mann-Whitney
U-test with "markedly dissimilar distributions", I was thinking in
particular of a scenario in which the shapes of the distributions for the
two samples are different and we have no reason to suppose that this is
not also so for the parent populations from which they were obtained. My
understanding was that some such distributional assumption ought to be
in place.

I would still appreciate comments on my last message concerning what I
am testing, and also on the usefulness of the Chi-Square test of linear
trend for this purpose.

Best wishes

Margaret

Margaret

Nov 4, 2005, 9:12:55 AM
to MedStats
Just to add a little more meat to my previous message: as it is an
assumption of the Mann-Whitney U-test that the two parent populations
have the same distribution after a translation of size k, how do we
know that our assumption has been met? More precisely, how can we
assume that
P(X < x) = P(Y < x + k) for populations X and Y, any natural
number x and a fixed constant k?
Surely we need some evidence before leaping in and doing this test,
even if we have a Likert scale.

Comments on this specific point would be valued. I appreciate that
indeed the null hypothesis of the Mann-Whitney U-test is that the
populations have the same distribution. Ted's 'equal medians' example
was also interesting. It takes a little extra time to make up these
examples!

Thank you very much

Margaret

Timot...@iop.kcl.ac.uk

Nov 4, 2005, 9:16:49 AM
to MedS...@googlegroups.com
Apologies for adding to the confusion with my last message:

I ran some permutation tests again myself, based on the dataset I quoted
last time. It turned out that the difference in p-values lies not in the
fact that one takes account of ties and the other doesn't; it is simply
because their permutation test uses a t-statistic. I tried the permutation
again using a U-statistic, and the significance came out at about 0.002.
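
A rank-based permutation of that kind can be sketched in R as follows
(not necessarily exactly what I ran, but the same idea):

below <- c(1, 1, 1, 3)
above <- c(rep(0, 17), 1, 1, 2)
r <- rank(c(below, above))  # mid-ranks take care of the ties
n1 <- length(below)
obs <- sum(r[1:n1])         # observed rank sum for the Below group
perm <- replicate(20000, sum(r[sample(length(r), n1)]))
mean(perm >= obs)           # one-sided permutation p-value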

But I would still appreciate a more rigorous demonstration of why we can
in general ignore the problem of ties.

Tim





John Whittington

Nov 4, 2005, 9:53:09 AM
to MedS...@googlegroups.com
At 05:59 04/11/05 -0800, Margaret wrote:

>Let me clarify: in stressing my concern about using the Mann-Whitney
>U-test with "markedly dissimilar distributions", I was thinking in
>particular of a scenario in which the shapes of the distributions for the
>two samples are different and we have no reason to suppose that this is
>not also so for the parent populations from which they were obtained. My
>understanding was that some such distributional assumption ought to be
>in place.

Does it not follow from what several people have said, in both this and
other recent threads, that if the two distributions are appreciably
different then a Mann-Whitney test is most unlikely to be of any value -
since the difference in distributions itself should result in the test
revealing a 'significant difference'?

I posed much the same question myself recently, conceding that tests like
the MW one are inappropriate in such circumstances, but asking what one CAN
do when distributions are clearly different. I suppose the real question
which then arises is what on earth one is trying to do by comparing
data from two 'clearly differently-shaped distributions' - on a purely
conceptual level, it's difficult to say 'what one would be looking for'
(and what it would mean) if the distributions were of very different shapes.

For me, this overlaps considerably with a different, but much more common,
situation I also asked about recently - what if one has every reason to
believe that the two parent populations have very similar (shape)
distributions but the samples one has have very different distributions
from one another? In that situation, my inclination is to ignore the shape
of the sample distributions and 'go with' the assumption of similar
population distributions (and hence justify the use of the tests).

Bland, M.

Nov 4, 2005, 10:02:17 AM
to MedS...@googlegroups.com
Sorry, Margaret, but I think that your understanding is wrong. You need
assume only that the observations can be ranked. It is a test for
ordinal data. There need be no distribution to have a shape. However,
IF you want to use it as a test for equality of medians THEN you must
assume that the distributions have the same shape. Many books do not make
this clear and there is endless confusion about it. If you want to test
the null hypothesis that a random member of one population will exceed a
random member of the other population with probability 0.5, then you need
assume only that the observations can be ordered. I thought that Ted
explained this very clearly.

There is a real example in my book An Introduction to Medical
Statistics, 3rd edition, where the medians of both groups are zero. A
Mann Whitney test was significant and we used the 3rd quartile as the
location statistic.

Martin

Bland, M.

Nov 4, 2005, 10:11:20 AM
to MedS...@googlegroups.com

John Whittington wrote:

>
> Does it not follow from what several people have said, in both this and
> other recent threads, that if the two distributions are appreciably
> different then a Mann-Whitney test is most unlikely to be of any value
> - since the difference in distributions itself should result in the
> test revealing a 'significant difference'?


There is a big difference between saying they look different and saying
that there is evidence that they are different. That is what the test
is for.

>
> I posed much the same question myself recently, conceding that tests
> like the MW one are inappropriate in such circumstances, but asking
> what one CAN do when distributions are clearly different. I suppose
> the real question which then arises is what on earth one is
> trying to do by comparing data from two 'clearly differently-shaped
> distributions' - on a purely conceptual level, it's difficult to say
> 'what one would be looking for' (and what it would mean) if the
> distributions were of very different shapes.
>

You can test the null hypothesis that two distributions are the same in
every respect, i.e. looking for any difference at all, down to the fifth
and sixth moments, using the Kolmogorov-Smirnov two-sample test. But,
as with most tests, you should plan this before you look at the data.
As it is not usually an interesting hypothesis, it is not used much. Also,
with such a broad alternative, the power is very low.
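
In R, for example (invented data; note that with heavily tied ordinal
data the K-S p-value is itself only approximate):

x <- rnorm(50)                    # same mean ...
y <- rnorm(50, mean = 0, sd = 2)  # ... but twice the spread
ks.test(x, y)                     # two-sample Kolmogorov-Smirnov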

Martin

John Whittington

Nov 4, 2005, 9:39:14 AM
to MedS...@googlegroups.com
At 12:56 04/11/05 +0000, Timot...@iop.kcl.ac.uk wrote (in part):

>Sorry, I must have got the normal approximation and the tie thing a bit
>confused. But I have been warned about ties from the moment I learnt the
>Mann-Whitney test, and hence it's been a constant warning note in my head.
>I think you all agree that ties can potentially be a problem. But when does
>that 'potential' become a real problem? The common disregard of it would
>suggest that for most situations (including that of Likert-type data) it
>is not a problem, but I'd like to see more evidence of why that's the case.

Siegel's 'bible' on non-parametric statistics contains an interesting
discussion about ties in Mann-Whitney tests. One of the things he points
out is that ties WITHIN one of the groups do not affect the U value
obtained; U is only affected if ties occur BETWEEN values in the two
groups. He says that the effect of ties is "usually negligible" but gives
a method for correcting for ties (which is only applicable if one uses
the 'normal approximation' method for converting U to 'p'). His book
contains an interesting example. His dataset has a total of 39 values
(16+23), but only 3 of these are untied. There are only 12 unique values,
with some of them having up to 6 ties. It is therefore a 'pretty tied'
set of data. Calculations of z (normal approximation method) with and
without a correction for ties result in values of 3.45 and 3.43
respectively, so the corresponding difference in p-values would, indeed,
be 'negligible'. The other important thing he reminds us of is that the
effect of correcting for ties is to reduce the p-value - ignoring ties
therefore leads to a conservative test, not an increase in the risk of
Type I error. His overall recommendation is that "one should correct for
ties only if the proportion of ties is quite large, if some of the ties
involve large numbers of tied observations, or if the p-value obtained
without correction for ties is close to one's previously set value for
alpha".
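
For anyone who wants to experiment, here is a sketch of that calculation
in R, using the usual tie-corrected variance (my reading of Siegel's
formula, so treat it as illustrative rather than definitive):

z_mw <- function(x, y) {
  m <- length(x); n <- length(y); N <- m + n
  r <- rank(c(x, y))                  # mid-ranks for ties
  U <- sum(r[1:m]) - m * (m + 1) / 2  # Mann-Whitney U for sample x
  tie_sizes <- table(c(x, y))         # sizes of the tied groups
  sd0 <- sqrt(m * n * (N + 1) / 12)   # standard deviation ignoring ties
  sd1 <- sqrt(m * n / (N * (N - 1)) *
              ((N^3 - N) / 12 - sum(tie_sizes^3 - tie_sizes) / 12))
  c(z_uncorrected = (U - m * n / 2) / sd0,
    z_corrected   = (U - m * n / 2) / sd1)
}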

>I've got a book called The Statistical Sleuth by Ramsey and Schafer. In
>Chapter 4 it examines an example which I'll quote here. We're comparing
>the number of O-ring incidents on 24 space shuttle flights prior to the
>Challenger disaster at two launch temperatures: below 65F and above 65F.
>The results are as follows:
>Below: 1 1 1 3
>Above: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 2

Are those figures given in chronological order? If so, one would be very
concerned about apparent 'changes with time', regardless of temperature!

>Using a permutation test based on the t-test, they obtained a one-tailed
>p-value of 0.0099. Now I plugged the data into SPSS and ran a U-test,
>specifying even the Exact option (which I thought was equivalent to a
>permutation test, but perhaps not). It gives me an exact significance of
>0.002, for both one and two tails

I find it a bit hard to understand how the one- and two-tailed p-values can
be identical, but maybe I'm missing something!

>Now that, I think, is a considerable discrepancy in results, and I
>attributed the discrepancy to the presence of ties.

It's certainly a considerable discrepancy. In order to remove software
issues from the equation, I've just attempted a U-test on that data by
hand. I get the U-value as 6, which, using the normal approximation
method, translates to a p-value (one-tailed) of 0.0043 without any
correction for ties and 0.0023 with Siegel's correction method (double
those p-values for 2-tailed); 'correction for ties' hence made quite a
difference in that case of pretty extreme ties (17/24 figures being tied
with the same value, and 5/24 with another value). Quite where they fit
into the picture of 'their' and 'your' answers, I'm not sure!! - but my
hastily conducted calculations could, of course, be in error!

Ted Harding

Nov 4, 2005, 10:38:04 AM
to MedS...@googlegroups.com
On 04-Nov-05 Margaret wrote:
>
> Just to add a little more meat to my previous message: as it is an
> assumption of the Mann-Whitney U-test that the two parent populations
> have the same distribution after a translation of size k, how do we
> know that our assumption has been met? More precisely, how can we
> assume that
> P(X < x) = P(Y < x + k) for populations X and Y, any natural
> number x and a fixed constant k?
> Surely we need some evidence before leaping in and doing this test,
> even if we have a Likert scale.

Just to clarify: the assumption you state above is not a pre-requisite
for the U-test, since as a test it refers only to the Null Hypothesis,
which is that the two distributions are identical.

It is when you come to interpret the results of the test that such
issues become relevant. In the first instance, what the U-test is
directly sensitive to is the value of P(X < Y): if < 1/2, then
significantly small values become more likely; if > 1/2, then
significantly large values become more likely.

This is true regardless of whether the two distributions "resemble"
each other or not: the main influence is the value of P(X < Y).

When you come to interpret a significant outcome, however, it is
time to ask, in the context of your inquiry, how it may reasonably
come about that, say, P(X < Y) > 1/2. One possibility is, of course,
that the distribution of Y is the same as that of X, but shifted up.

And, when the two distributions are the same apart from shift,
difference of median is the same as difference of mean. So, in
that case, the distinction between mean and median is irrelevant.

But "P(X < Y) > 1/2" can also come about by changing the shape,
without changing median or mean (as indeed on of my examples
showed).

Sometimes one is primarily interested in the mean. Insurance
companies tend to favour this, since their long-term average
profit essentially depends on the mean.

But, in deciding whether to underwrite a particular risk of
type X rather than a particular risk of type Y, they could well
be interested in whether the loss with Y is more likely than not
to be greater than the loss with type X. The mean loss of type X
versus the mean loss of type Y may not be of immediate concern.

> Comments on this specific point would be valued. I appreciate that
> indeed the null hypothesis of the Mann-Whitney U-test is that the
> populations have the same distribution. Ted's 'equal medians' example
> was also interesting. It takes a little extra time to make up these
> examples!

I'll have to defer comment on your further development (in previous
mails) of your original query for today, but will come back to it.

Best wishes,
Ted.


--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.H...@nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 04-Nov-05 Time: 15:37:59
------------------------------ XFMail ------------------------------

John Whittington

Nov 4, 2005, 10:46:34 AM
to MedS...@googlegroups.com
At 15:11 04/11/05 +0000, Bland, M. wrote:

>There is a big difference between saying they look different and saying
>that there is evidence that they are different. That is what the test is for.

Indeed ... but is not the point that, if the test tells us that there is
evidence of a difference, this may well be due to 'differences in shape'
rather than the sort of differences people are usually looking for - and
that the chances of that being the case grow stronger if, in situations
like the one Margaret has described, the distributions do look very
different in shape?

I think we keep coming back to the point you have made repeatedly - that,
unless other assumptions are satisfied, tests of this sort are NOT tests of
any of the usual measures of location (mean, median etc.), even though that
is effectively what people usually assume they are and 'use them as'. In
bottom-line terms, they are looking for a p-value to associate with a
difference in means or medians. If non-statisticians were told 'the truth'
- that "there is a difference in mean/median values between the groups, and
a significant difference between the distributions in the two groups, but
we don't know whether there is a significant difference between the
means/medians" then I strongly suspect that most would not feel that they
could take the difference 'seriously'.

>You can test the null hypothesis that two distributions are the same in
>every respect, i.e. looking for any difference at all, down to the fifth
>and sixth moments, using the Kolmogorov Smirnov two-sample test.

Sure - but again, most people would not regard that as very (if at all)
useful. Not unreasonably, they want to know the NATURE of the difference
between distributions, not simply that one has evidence that SOME
difference exists.

Worse, if they didn't understand the nature of the test, they could be
seriously misled by the result. Say one was comparing two
anti-hypertensive drugs in terms of the amount of BP reduction they
produced. If one drug resulted in much more variability of response than
the other, then, if the sample size was big enough, a K-S test would reveal
a 'significant difference' even if the 'average' (whatever!) difference
between responses in the two groups was very small (indeed, even zero). If
a p-value arising from a K-S test was put next to a statement about, say,
the difference in mean responses in the two groups, I'm sure that would
seriously mislead the great majority of non-Statisticians (and, dare I say,
probably even some Statisticians!).

Bland, M.

Nov 4, 2005, 11:10:46 AM
to MedS...@googlegroups.com
To make what is positively my last comment on this: people should know
what they are doing, in statistics as in most other things. But for us,
particularly in statistics!

Martin

Margaret

Nov 4, 2005, 11:16:09 AM
to MedStats
Dear Ted

This is helpful, thank you. I got my M-W U-test assumption from the
Oxford Dictionary of Statistics, which I thought was infallible. Sorry!

My earlier point simply refers to the question of the usefulness of the
Chi-Square test of linear trend in the context of comparing Likert
scores for two groups, where the intention is to show, for example, that
one group scored better than the other. Clearly, we needn't expect a
perfectly monotonic trend in frequencies, but do you think that such a
test is useful in this context?

By the way, if you are doing a two-sided M-W U-test, how do you know
which is the better group, using the P(X<Y)>0.5 approach? Surely one
could equally well have concluded that P(Y<X)>0.5.

Best wishes

Margaret

Margaret

Nov 4, 2005, 11:25:42 AM
to MedStats
Dear Martin

I am much clearer on this now, so thank you very much. Can you please
clarify whether I am correct in my assumptions below, however:

If two distributions are skewed and similar in shape, and a two-sided
Mann-Whitney U-test leads us to conclude that the two parent
populations associated with these distributions are different, then we
can use the medians to decide on the direction of the difference. This
question is obviously relevant to the issue of deciding which group
performed better.

I look forward to your welcome advice.

Regards

Margaret

Doug Altman

Nov 4, 2005, 11:38:04 AM
to MedS...@googlegroups.com, MedStats
If high values have low ranks (i.e. you rank from highest to lowest), then
the group with the smaller sum of ranks is the one which tends to have
higher values (and vice versa).

No need to consider medians or anything else.
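
In R, which ranks from lowest to highest (so the reading is reversed:
the larger mean rank goes with the higher values), a quick check with
made-up scores:

g1 <- c(2, 3, 3, 4, 5); g2 <- c(1, 1, 2, 2, 3)
grp <- rep(c("g1", "g2"), times = c(length(g1), length(g2)))
tapply(rank(c(g1, g2)), grp, mean)  # larger mean rank = higher values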

Doug
_____________________________________________________

Doug Altman
Professor of Statistics in Medicine
Centre for Statistics in Medicine
Wolfson College Annexe
Linton Road
Oxford OX2 6UD

email: doug....@cancer.org.uk
Tel: 01865 284400 (direct line 01865 284401)
Fax: 01865 284424

Web: http://www.csm-oxford.org.uk/




Jeremy Miles

Nov 4, 2005, 4:17:35 PM
to MedS...@googlegroups.com
Timot...@iop.kcl.ac.uk wrote:

> (I had avoided ordinal/poisson/binomial regression because A. it's not
> available in SPSS, and B. often in descriptive stats you want to keep the
> stats simple.)
>

Both ordinal and Poisson regression are available in SPSS. Ordinal
regression is pretty straightforward. Poisson is a nightmare - syntax
is here:
http://www.childrens-mercy.org/stats/model/poisson/poiss_syntax.asp, but
as someone said, get R instead.

Jeremy

John Whittington

Nov 7, 2005, 7:07:59 AM
to MedS...@googlegroups.com
At 16:10 04/11/05 +0000, Bland, M. wrote:

>To make what is positively my last comment on this: people should know
>what they are doing, in statistics as in most other things. But for us,
>particularly in statistics!

I guess it's a bit rotten to avail myself of such an invitation to 'have
the last word', but I feel moved to comment that, whilst Statisticians
obviously 'need to know what they are doing in statistics', they are almost
invariably doing things (whatever they may be) for 'consumption'
predominantly by non-statisticians. It is therefore incumbent upon
Statisticians to do things, and present things, in a manner which is
helpful, understandable, and certainly not potentially misleading, to/by
that wider audience.

I would imagine (hope!) that no-one reading this will disagree!

Ted Harding

Nov 7, 2005, 8:26:29 AM
to MedS...@googlegroups.com
On 07-Nov-05 John Whittington wrote:
>
> At 16:10 04/11/05 +0000, Bland, M. wrote:
>
>>To make what is positively my last comment on this: people
>>should know what they are doing, in statistics as in most other things.
>>But for us, particularly in statistics!
>
> I guess it's a bit rotten to avail myself of such an invitation
> to 'have the last word', but I feel moved to comment that, whilst
> Statisticians obviously 'need to know what they are doing in
> statistics', they are almost invariably doing things (whatever
> they may be) for 'consumption' predominantly by non-statisticians.
> It is therefore incumbent upon Statisticians to do things, and
> present things, in a manner which is helpful, understandable, and
> certainly not potentially misleading, to/by that wider audience.
>
> I would imagine (hope!) that no-one reading this will disagree!
>
> Kind Regards,
>
> John

I certainly do agree! And would like to add a comment (and I'm not
claiming that this is the last word, nor likely to be).

This is that there is a certain somewhat ill-defined responsibility
upon the statistician.

Often non-statisticians who none the less need the "statistical input"
give an impression that they think Statistics is more intelligent than
it really is -- that, for instance, the implementation of the logistic
regression model (to take only one instance), when applied to, say,
mortality data in the context of a certain medical condition,
encapsulates the physical mechanisms leading to death and has a
scientific validity in that context.

A possible underlying logic for this (though perhaps not rigorously
examined) might be that they wouldn't have been taught that this is
what you do unless it was right.

The statistician, however, knows, or should know, better. Any such
procedure is simply a filter (analogous to the frequency-response
shaper you can adjust on your HiFi amplifier), transforming one
set of records into another. As such, it has certain properties
and does not have other properties. A different choice of procedure
would have different properties and lack some of those of the first.

So the key questions are: will the procedure be capable of passing
through the information in the primary data which is relevant to
the physical processes giving rise to the primary data and which
is relevant to the questions being asked (the signal)? And will
it adequately suppress the information in the primary data which
is irrelevant (the noise)? And how is the output of the procedure
to be interpreted?

For all that the non-statistical consumer may have varying degrees
of appreciation of these questions and/or of the technicalities
concerning their implementation in a particular procedure (and/or
in software), the understanding of these things is primarily the
responsibility of the statistician.

This particular responsibility is not at all ill-defined.

The ill-defined one is the responsibility of trying to ascertain,
in the particular case that is in hand, whether satisfactory
answers can be found to the above questions.

If the statistician is already expert in the subject domain, and
in the methods by which the primary data have been obtained, then
again this is well-defined enough.

But if not, then the issue has to be broached in dialogue with
the "consumer", it may involve any amount of all sorts of
interactions, and it may be protracted.

This is clearly ill-defined, since how it proceeds is contingent
on the circumstances of the case, but it is a responsibility!

Again analogously, think perhaps of a "juge d'instruction" in
a continental court, or your lawyer when he first sets about
ascertaining the facts and issues in a legal case that you are
on one side of. Or, indeed, of a doctor eliciting an initial
case history from a patient. In all cases, in the background
are procedures for focussed investigation which can be, and of
which some will probably be, set in train. In order to do this
responsibly, the "interrogators" must find out what is going on,
and how it relates to what might be done.

Best wishes to all,
Ted.


--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.H...@nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 07-Nov-05 Time: 13:26:26
------------------------------ XFMail ------------------------------

John Whittington

Nov 8, 2005, 12:18:54 PM
to MedS...@googlegroups.com
At 13:26 07/11/05 +0000, Ted Harding wrote (in very small part):

>Often non-statisticians who none the less need the "statistical input"
>give an impression that they think Statistics is more intelligent than
>it really is ...

Ted, your point is well taken, and I'd like to add a comment about a
matter somewhat related to this .... Those non-statisticians who need
'statistical input' often seem to think of the particular statistician
they are dealing with as 'the statistical establishment'. As a result,
they sometimes seem to think that if they can persuade, cajole, bribe,
bully or intimidate that individual into utilising some method, or
producing a particular result, then they will have 'won their battle
with the scientific establishment' and will not be at any risk of further
'difficulties' because of what has been done. Perhaps the two most common
situations are in relation to sample size estimations (with attempts being
made to 'talk the statistician into' producing a smaller estimate), or when
attempts are made to convince a statistician that one-tailed tests 'would
be appropriate' when two-tailed ones have not quite hit the ubiquitous
'p=0.05' threshold!

I have a few slides entitled 'Why not to bully your Statistician', which I
often wheel out when giving statistics-related presentations to audiences
of non-Statisticians; they attempt to illustrate the folly of the above
scenarios!

SR Millis

Nov 8, 2005, 5:01:31 PM
to MedS...@googlegroups.com
In the diagnostic setting, one can estimate a test's positive predictive
value (or posterior probability) through the simple application of
Bayes' theorem:

P(D+|T+) = P(D+)P(T+|D+) / [P(D+)P(T+|D+) + P(D-)P(T+|D-)]

Now, if I want to estimate P(D-|T+), is it simply:

1 - P(D+|T+) ?

Thanks,
SR Millis

Ted Harding

Nov 8, 2005, 5:17:22 PM
to MedS...@googlegroups.com
Provided your conditional probabilities P(D+|T+) and P(D-|T+)
have the ordinary interpretation of probabilities, and also
D+ and D- are the only possibilities and are mutually exclusive,
then the answer has to be Yes!
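
As a quick numerical illustration (prevalence, sensitivity and
specificity made up for the purpose):

prev <- 0.10; sens <- 0.90; spec <- 0.80
ppv <- prev * sens / (prev * sens + (1 - prev) * (1 - spec))
ppv      # P(D+|T+) = 1/3 with these numbers
1 - ppv  # P(D-|T+) = 2/3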

Best wishes,
Ted.


--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.H...@nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 08-Nov-05 Time: 22:17:17
------------------------------ XFMail ------------------------------

John Whittington

Nov 9, 2005, 6:06:49 AM
to MedS...@googlegroups.com
At 22:17 08/11/05 +0000, Ted Harding wrote:

>On 08-Nov-05 SR Millis wrote:
> >[snip]
> > P(D+|T+) = P(D+)P(T+|D+) /
> > [P(D+)P(T+|D+)+P(D-)P(T+|D-)]
> > Now, if I want to estimate P(D-|T+) is it simply:
> > 1 - P(D+|T+) ?
>
>Provided your conditional probabilities P(D+|T+) and P(D-|T+)
>have the ordinary interpretation of probabilities, and also
>D+ and D- are the only possibilities and are mutually exclusive,
>then the answer has to be Yes!

Indeed. I suppose, in addition to that 'obvious' (and certainly simplest)
approach based on probability theory, another way of looking at it is that
the definition of which of the results is D+ and which is D- is essentially
arbitrary - so, on the assumption that they are mutually exclusive, one
could write an equivalent equation for P(D-|T+) simply by swapping all the
D- and D+ terms in the equation for P(D+|T+) above - and I sincerely hope
that the resulting equation could be simplified to 1 - P(D+|T+)!!

SR Millis

Nov 9, 2005, 10:08:02 AM
to MedS...@googlegroups.com
Thanks, Ted!

As a follow-up,

P(T+|D+) is sensitivity
P(T-|D-) is specificity

P(D+|T+) is PPV

What would P(D-|T+) be called?

Thanks,
SR Millis

BXC (Bendix Carstensen)

Nov 9, 2005, 10:12:44 AM
to MedS...@googlegroups.com
P(D-|T+) has no name, but

P(D+|T+) is called "predictive value of a positive test" or just
positive predictive value, PV+
similarly,
P(D-|T-) is called "predictive value of a negative test" or just
negative predictive value, PV-

Best,
Bendix Carstensen
----------------------
Bendix Carstensen
Senior Statistician
Steno Diabetes Center
Niels Steensens Vej 2
DK-2820 Gentofte
Denmark
tel: +45 44 43 87 38
mob: +45 30 75 87 38
fax: +45 44 43 07 06
b...@steno.dk
www.biostat.ku.dk/~bxc
----------------------

John Whittington

Nov 9, 2005, 10:26:54 AM
to MedS...@googlegroups.com
At 07:08 09/11/05 -0800, SR Millis wrote:

>Thanks, Ted!
>As a follow-up,
>P(T+|D+) is sensitivity
>P(T-|D-) is specificity
>P(D+|T+) is PPV
>What would P(D-|T+) be called?

Whilst you await Ted's reply ....

You'll sometimes see P(D-|T+) called the 'false positive rate' and
P(D+|T-) the 'false negative rate', but those terms are ambiguous (they
are more commonly used for P(T+|D-) and P(T-|D+) respectively) and
therefore probably best avoided. To complete the 'naming', P(D-|T-) is,
of course, 'NPV' (Negative Predictive Value).