pooled estimate of CV


simon bond

Oct 4, 2007, 10:18:46 AM
to MedS...@googlegroups.com
Problem:
 
I have several empirical means and sds and sample sizes from different groups of observations. I want to obtain from these statistics an estimate of a pooled coefficient of variation that is assumed to be constant across the groups, but I can't assume the means and sds are constant across groups.
 
Can anyone suggest methods to produce an estimate?
 
I'd be reasonably happy to assume the underlying observations follow a log-normal distribution.
 
 
Thanks
 
Simon Bond.



Bland, M.

Oct 5, 2007, 4:46:51 AM
to MedS...@googlegroups.com
I would do a weighted average, weighting by the number of subjects. Why
do anything complicated?
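
A minimal sketch of that N-weighted average in Python, assuming the group summaries are held in three parallel lists. The function name and the numbers are made up for illustration and are not from the original question.

```python
def weighted_mean_cv(means, sds, ns):
    """N-weighted average of the per-group CVs (sd / mean)."""
    cvs = [sd / m for m, sd in zip(means, sds)]
    return sum(n * cv for n, cv in zip(ns, cvs)) / sum(ns)


# Hypothetical summary statistics for three groups
means = [10.0, 20.0, 40.0]
sds = [2.0, 5.0, 8.0]
ns = [10, 20, 30]

# Per-group CVs here are 0.20, 0.25 and 0.20; the result is roughly 0.217
print(weighted_mean_cv(means, sds, ns))
```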

Martin

simon bond wrote:
> Problem:
>
> I have several empirical means and sds and sample sizes from different
> groups of observations. I want to obtain from these pairs of
> statistics an estimate of a pooled coefficient of variation that is
> assumed to be constant across the groups, but can't assume the means
> and sds are constant across groups.
>
> Can anyone suggest methods to produce an estimate?
>
> I'd be reasonably happy to assume the underlying observations follow a
> log-normal distribution.
>
>
> Thanks
>
> Simon Bond.
>


--
***************************************************
J. Martin Bland
Prof. of Health Statistics
Dept. of Health Sciences
Seebohm Rowntree Building Area 2
University of York
Heslington
York YO10 5DD

Email: mb...@york.ac.uk
Phone: 01904 321334
Fax: 01904 321382
Web site: http://martinbland.co.uk/
***************************************************

John Whittington

Oct 7, 2007, 5:59:58 AM
to MedS...@googlegroups.com
At 14:18 04/10/07 +0000, simon bond wrote:

>Problem:
>I have several empirical means and sds and sample sizes from different
>groups of observations. I want to obtain from these pairs of statistics an
>estimate of a pooled coefficient of variation that is assumed to be
>constant across the groups, but can't assume the means and sds are
>constant across groups.
>Can anyone suggest methods to produce an estimate?
>I'd be reasonably happy to assume the underlying observations follow a
>log-normal distribution.

As is so often the case, the answer probably lies in a clearer
understanding of what you are trying to achieve. It is also rather
dependent upon your actual data. For example, if the data indicate
considerable differences between the CVs within each of the groups, then
one (at least I!) can't help but wonder what would be the meaning (or
value) of a 'pooled CV' which (seemingly incorrectly) assumed that CV was
constant across groups. In such a situation (and, indeed, more generally),
it might be more useful to quote the RANGE of within-group CVs, rather than
attempting a 'pooled' estimate.

In a mechanical sense, you obviously could calculate a mean, or N-weighted
mean, of the within-group CVs, but I'm not too sure what that would really
mean. Such an estimate would obviously have a variability associated with
it, echoing what I said about 'the range of CVs'.

Kind Regards,


John

----------------------------------------------------------------
Dr John Whittington, Voice: +44 (0) 1296 730225
Mediscience Services Fax: +44 (0) 1296 738893
Twyford Manor, Twyford, E-mail: Joh...@mediscience.co.uk
Buckingham MK18 4EL, UK
----------------------------------------------------------------

Ted Harding

Oct 7, 2007, 2:17:56 PM
to MedS...@googlegroups.com
On 07-Oct-07 09:59:58, John Whittington wrote:
> At 14:18 04/10/07 +0000, simon bond wrote:
>>Problem:
>>I have several empirical means and sds and sample sizes from
>>different groups of observations. I want to obtain
>>from these pairs of statistics an estimate of a pooled
>>coefficient of variation that is assumed to be constant
>>across the groups, but can't assume the means and sds are
>>constant across groups.
>>Can anyone suggest methods to produce an estimate?
>>I'd be reasonably happy to assume the underlying observations
>>follow a log-normal distribution.
>
> As is so often the case, the answer probably lies in a
> clearer understanding of what you are trying to achieve.
> It is also rather dependent upon your actual data. For
> example, if the data indicate considerable differences
> between the CVs within each of the groups, then one (at
> least I!) can't help but wonder what would be the meaning
> (or value) of a 'pooled CV' which (seemingly incorrectly)
> assumed that CV was constant across groups. In such a
> situation (and, indeed, more generally), it might be more
> useful to quote the RANGE of within-group CVs, rather than
> attempting a 'pooled' estimate.
>
> In a mechanical sense, you obviously could calculate a mean,
> or N-weighted mean, of the within-group CVs, but I'm not too
> sure what that would really mean. Such an estimate would
> obviously have a variability associated with it, echoing
> what I said about 'the range of CVs'.

I'm broadly with John's general comments above. In particular,
what do you want "pooled CV" to mean?

One trivial suggestion, which is probably bad for you,
is that you can get a "pooled sample mean" as

sum of (samplemean x samplesize)
divided by (sum of sample size)

and similarly get a "pooled sample SD" as the square root of

sum of (sampleSD^2 times (samplesize-1))
PLUS sum of (samplesize times (samplemean - pooledmean)^2)
divided by ((sum of samplesizes) - 1)

which corresponds to throwing all the sample values into one
bucket

But this has every chance of giving something you would not
want to use.
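
For what it's worth, the "one bucket" calculation can be sketched in Python as below. This assumes the standard split of the total sum of squares into a within-group part and a between-group part (around the pooled mean), divided by the total sample size minus 1. The function name is hypothetical.

```python
import math


def bucket_mean_sd(means, sds, ns):
    """Mean and SD of all observations thrown into one bucket,
    reconstructed from per-group means, SDs and sample sizes."""
    total_n = sum(ns)
    pooled_mean = sum(n * m for n, m in zip(ns, means)) / total_n
    # Within-group sum of squares: (n - 1) * SD^2 for each group
    within_ss = sum((n - 1) * sd ** 2 for n, sd in zip(ns, sds))
    # Between-group sum of squares: n * (group mean - pooled mean)^2
    between_ss = sum(n * (m - pooled_mean) ** 2 for n, m in zip(ns, means))
    pooled_sd = math.sqrt((within_ss + between_ss) / (total_n - 1))
    return pooled_mean, pooled_sd


# Groups {1, 2, 3} and {4, 6}: means 2 and 5, SDs 1 and sqrt(2)
print(bucket_mean_sd([2.0, 5.0], [1.0, math.sqrt(2)], [3, 2]))
```

Because the between-group term enters the SD, groups with very different means inflate the resulting SD/mean ratio, which is why this is unlikely to be what is wanted as a common CV.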


Picking up on your remark "I'd be reasonably happy to assume
the underlying observations follow a log-normal distribution":

*IF* X has a log-normal distribution, i.e. log(X) has a normal
distribution with mean mu and variance (s^2), then:

The mean of X = exp(mu + (s^2)/2)

Variance of X = exp(2*mu + (s^2)) * (exp(s^2) - 1)

so

The CV = sqrt(exp(s^2) - 1)

and therefore depends on (s^2) but not on mu.
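
A small numerical check of these formulae in Python, as a sketch only. The function names are made up, and s2 stands for (s^2).

```python
import math


def lognormal_mean(mu, s2):
    """Mean of X when log(X) ~ Normal(mu, s2)."""
    return math.exp(mu + s2 / 2)


def lognormal_var(mu, s2):
    """Variance of X when log(X) ~ Normal(mu, s2)."""
    return math.exp(2 * mu + s2) * (math.exp(s2) - 1)


def lognormal_cv(s2):
    """CV of X; the exp(mu + s2/2) factor cancels in SD/mean."""
    return math.sqrt(math.exp(s2) - 1)


s2 = 0.25
for mu in (0.0, 1.0, 5.0):
    ratio = math.sqrt(lognormal_var(mu, s2)) / lognormal_mean(mu, s2)
    # ratio agrees with lognormal_cv(s2) whatever the value of mu
    print(mu, ratio, lognormal_cv(s2))
```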

Therefore, if you had the raw data, you could, for instance,
check whether you have reasonably nearly equal values of
variance for log(X) across your several empirical samples.
In that case, it becomes reasonable to adopt an assumption
that CV is the same across samples.

This would then also allow you to obtain a pooled estimate
of variance for log(X) (in the usual way for several groups
of normally distributed variables), and use this for (s^2)
in the above formula for CV. Provided the log-normal assumption
is valid, and the variances of log(X) are effectively uniform,
then this CV can be adopted uniformly across samples.
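
If the raw data were available, that route might look like the following Python sketch. It assumes each group is a list of at least two positive observations. The function name is hypothetical and the data are invented.

```python
import math
import statistics


def pooled_cv_from_raw(groups):
    """Pool the variances of log(X) across groups in the usual way,
    then convert the pooled (s^2) to a CV via sqrt(exp(s^2) - 1)."""
    logs = [[math.log(x) for x in g] for g in groups]
    # Within-group sums of squares of log(X), and their degrees of freedom
    ss = sum((len(g) - 1) * statistics.variance(g) for g in logs)
    df = sum(len(g) - 1 for g in logs)
    s2_pooled = ss / df
    return math.sqrt(math.exp(s2_pooled) - 1)


# Invented raw data for three groups with very different means
groups = [
    [8.1, 9.7, 12.3, 10.4],
    [95.0, 120.0, 101.0, 88.0, 130.0],
    [1.1, 0.9, 1.3],
]
print(pooled_cv_from_raw(groups))
```

Before trusting the pooled value, it would be sensible to print the per-group variances of log(X) and check that they are reasonably similar.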

Of course, if the values of variance in the different samples
look different, then you cannot validly do this.

You say you are "happy to assume" the log-normal distribution.
I hope your happiness has an objective foundation, and is not
the bliss of someone about to do something expedient but dodgy
for the sake of getting around an obstacle!

And, while I'm at it: If you are undertaking analyses based
on log-normal data, DO NOT WORK WITH THE RAW DATA. Take logs
first, and then carry on. Otherwise, you can drop yourself in
horrible messes to do with hypothesis tests and confidence
intervals -- the "usual methods" can (and often will) give
hopelessly wrong results.

To come back to what you say: You have "several empirical means
and sds and sample sizes from different groups of observations".

This suggests that in fact you may not have the raw data, and
that the means and SDs may have been calculated from the raw data.
If so, then: Oh Dear! See the above remark. You need the
means and SDs of the log(X)'s.

However, all is not necessarily totally lost, since you can
make a shot at it using the above formulae for mean(X) and
var(X).

First of all, the CV can be estimated (THOUGH NOT EFFICIENTLY
AND NOT WITHOUT BIAS) as the ratio SD/Mean in each sample.
(See the Final Remarks at the end).

This then gives you an estimate (with even worse properties)
of sqrt(exp(s^2) - 1), hence of

(s^2) = log(1 + (CV^2))

in each sample. If these don't look too grossly different,
then you can again envisage pooling these estimates of
variance: each within-sample SS would be (samplesize-1)*(s^2),
add these up, and divide by (sum of samplesizes - k) where
k is the number of samples.

However, I will not give any guarantee that this is valid.
Log-normal distributions can be horribly skew, thus distorting
both sample mean and sample variance.
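
Subject to those caveats, the summary-statistics route just described can be sketched in Python as follows. The function name and the example numbers are made up. The per-group values of (s^2) are returned as well, so that they can be compared before the pooled figure is used.

```python
import math


def pooled_cv_from_summary(means, sds, ns):
    """Pooled CV from per-group mean, SD and sample size of X,
    assuming X is log-normal with a common variance of log(X)."""
    # Per-group estimate of (s^2) = log(1 + CV^2), with CV = SD / mean
    s2_each = [math.log(1 + (sd / m) ** 2) for m, sd in zip(means, sds)]
    # Pool: sum of (n - 1) * (s^2), divided by (sum of n) - k
    ss = sum((n - 1) * s2 for n, s2 in zip(ns, s2_each))
    df = sum(ns) - len(ns)
    s2_pooled = ss / df
    return math.sqrt(math.exp(s2_pooled) - 1), s2_each


# Hypothetical summaries for three groups
cv, s2_each = pooled_cv_from_summary(
    means=[10.0, 20.0, 40.0], sds=[2.0, 5.0, 8.0], ns=[10, 20, 30]
)
print(s2_each)  # inspect these for gross differences first
print(cv)
```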

And you cannot say much, either, about mu (the mean of log(X))
from knowing sample mean and sample variance of X, since

a) mean(X) is not equal to exp(mean(log(X))), i.e.

mean(log(X)) is not equal to log(mean(X))

b) (see formula for mean(X)): You're multiplying by the
factor exp((s^2)/2), which is not directly related
to your sample means and SDs either.

Finally, a comment on using the raw data (means and SDs thereof)
for estimating mu and (s^2) in the distribution of log(X)
(i.e. by solving the equations, which give mean and variance
of X in terms of mu and (s^2), for mu and (s^2) when you have
calculated the sample means and variances of the raw data).

The large-sample efficiencies of this depend on the value
of (s^2). If (s^2) = 0 (so no variation), then the efficiency
is 1 (100%) for both mu and (s^2) (no surprise there).

For (s^2) = 0.2, the efficiency for estimating (s^2) is
about 50%;
For (s^2) = 0.5, it is about 20%;
For (s^2) = 1.0 it is less than 10%;
And for (s^2) > 1.5 the efficiency is negligible (i.e. you
basically get no information at all about (s^2) from the
values of sample mean and sample variance).

Similarly (though less drastically):
For (s^2) = 0.5, the efficiency for estimating mu is about 65%;
For (s^2) = 0.6, the efficiency is about 50%;
For (s^2) = 1.0 it is about 13%;
And, again, for (s^2) > 1.5 it is negligible.

(In the above, I'm reading off graphs [p.41] in the book

The Lognormal Distribution
J. Aitchison
J.A.C. Brown
Cambridge University Press, 1957

which deals with many aspects of inference from log-normal data
very thoroughly and in great detail).
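
For reference, solving the mean and variance equations for mu and (s^2), as discussed in these final remarks, can be sketched in Python like this. The function name is hypothetical, and the efficiency warnings above apply in full to whatever it returns.

```python
import math


def lognormal_moment_estimates(mean_x, sd_x):
    """Solve mean(X) = exp(mu + s2/2) and
    var(X) = exp(2*mu + s2) * (exp(s2) - 1) for mu and s2."""
    s2 = math.log(1 + (sd_x / mean_x) ** 2)
    mu = math.log(mean_x) - s2 / 2
    return mu, s2


# Round trip: start from mu = 1.0 and s2 = 0.25, compute the exact
# mean and SD of X, and recover mu and s2 from them.
mu, s2 = 1.0, 0.25
mean_x = math.exp(mu + s2 / 2)
sd_x = math.sqrt(math.exp(2 * mu + s2) * (math.exp(s2) - 1))
print(lognormal_moment_estimates(mean_x, sd_x))
```

The round trip only shows that the algebra is consistent; with sample means and SDs in place of the exact values, the estimates inherit the poor efficiency described above.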

Hoping that this helps to clarify the issues, though it may
well not help a lot with your data!

Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.H...@manchester.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 07-Oct-07 Time: 16:30:14
------------------------------ XFMail ------------------------------

--------------------------------------------------------------------
E-Mail: (Ted Harding) <ted.h...@nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 07-Oct-07 Time: 19:17:26
------------------------------ XFMail ------------------------------
