Help: How to Analyze Cost Data - Should I log transform?

jaf...@pobox.com

Feb 24, 2008, 6:35:05 PM
to MedStats
Hi folks

I have a dataset with cost data for coronary procedures. I have the
absolute hospital cost for each patient and I wish to examine the
determinants of cost.


Question: since cost data is not normally distributed, what's the
best way of analyzing this data?


One option that came to mind was log transformation of the cost
variable followed by linear regression. If so, how does one
interpret the results? As I understand it (please correct me if I'm
wrong), the coefficients from such a regression represent the change
in log cost per unit increase in a continuous independent variable, or
the change in log cost relative to the reference category for a
categorical variable.


I've read about retransformation of the data from log to the original
scale (absolute cost values) but need some help as to how to do it.


1. Do I run the linear regression analysis on the log(cost) variable
and then transform back to the original scale?
2. Exactly how do I transform the data back in SPSS?
3. I've read somewhere about a "smearing factor" needed when
retransforming data but really don't understand it. Can someone
explain?


Many thanks


Fahim H. Jafary, MD
Aga Khan University Hospital
Karachi, Pakistan

Bjoern

Feb 25, 2008, 2:11:50 AM
to MedStats
On Feb 25, 12:35 am, "jaf...@pobox.com" <jaf...@pobox.com> wrote:
> Hi folks
>
> I have a dataset with cost data for coronary procedures. I have the
> absolute hospital cost for each patient and I wish to examine the
> determinants of cost.
>
> Question: since cost data is not normally distributed, what's the
> best way of analyzing this data?
>
> One option that came to mind was log transformation of the cost
> variable and then linear regression?

Whether the cost data is normally distributed is not the issue when
using standard linear regression (or ANCOVA); the issue is whether the
residuals after fitting the model are normal. That tends to be
approximately true more often than the response variable itself being
normally distributed.
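
A minimal sketch of that point, using simulated data rather than anything from this thread: the covariate is strongly skewed, so the response is skewed too, yet the residuals from the fitted line are close to symmetric. The sample size, coefficients, and the `skewness` helper are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
x = rng.exponential(scale=1.0, size=n)            # strongly right-skewed covariate
y = 2.0 + 3.0 * x + rng.normal(0.0, 1.0, size=n)  # normal errors around a linear trend

# Ordinary least squares fit of y on x (with intercept).
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

def skewness(v):
    """Sample skewness: third standardized central moment."""
    d = v - v.mean()
    return (d ** 3).mean() / (d ** 2).mean() ** 1.5

skew_y = skewness(y)
skew_resid = skewness(resid)
print(f"skewness of response:  {skew_y:.2f}")
print(f"skewness of residuals: {skew_resid:.2f}")
```

Here the response is clearly non-normal while the residuals are not, which is why a histogram of the raw outcome says little about whether the regression assumptions hold.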

However, even if the residuals (and the true error terms) of your
model are not exactly normal (and in reality of course essentially
every model is more or less wrong), models with normal error
assumptions (t-test, linear regression, ANCOVA) tend to be very robust
and you need quite drastic cases to get serious problems due to their
use (of course if your case is one of them...). With large sample
sizes things become even more stable.

Oh, and note that log-transformation implies multiplicative effects on
the original scale instead of additive ones, because log(a) + log(b) =
log(a*b).
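
To make the multiplicative reading concrete, here is a small sketch with invented, noise-free numbers (a baseline cost of 1000 and a factor of 1.5 per unit of x are assumptions, not values from the poster's data). Regressing log(cost) on x and exponentiating the slope recovers the ratio by which cost is multiplied for each one-unit increase in x.

```python
import numpy as np

# Hypothetical, noise-free data: each extra unit of x multiplies cost by 1.5.
x = np.arange(6, dtype=float)
cost = 1000.0 * 1.5 ** x

# Linear regression of log(cost) on x (with intercept).
X = np.column_stack([np.ones_like(x), x])
(intercept, slope), *_ = np.linalg.lstsq(X, np.log(cost), rcond=None)

ratio = np.exp(slope)              # multiplicative change in cost per unit of x
pct_change = 100.0 * (ratio - 1.0) # the same effect expressed as a percentage
print(f"slope on log scale: {slope:.4f}")
print(f"cost ratio per unit of x: {ratio:.3f} ({pct_change:.1f}% change)")
```

For a categorical predictor the same logic applies: exp(coefficient) is the ratio of cost in that category to cost in the reference category.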

Sanjoy Paul

Feb 25, 2008, 3:44:02 AM
to meds...@googlegroups.com
Hello,

Highly positively skewed cost data often do not achieve approximate normality after logarithmic transformation. Even the Box-Cox transformation does not work in most cases.

On some occasions, I have used a generalized linear mixed-effects model with a Gamma distribution. The mixed effect was introduced to adjust for clustering. Your case may be different, and a simple generalized linear model with a gamma distribution might work. Bootstrapping the model fit will be helpful in this case.
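
As a rough illustration of the bootstrap idea, the sketch below swaps in the simplest possible "model", a difference in mean cost between two groups, and builds a percentile confidence interval by resampling patients with replacement. The gamma-distributed costs, group sizes, and number of resamples are invented for the example; with a regression model one would refit the model on each resample instead of recomputing two means.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical skewed costs for two groups (invented numbers).
cost_a = rng.gamma(shape=2.0, scale=1500.0, size=200)
cost_b = rng.gamma(shape=2.0, scale=2000.0, size=200)

def mean_diff(a, b):
    """Difference in mean cost, group B minus group A."""
    return b.mean() - a.mean()

point = mean_diff(cost_a, cost_b)

# Percentile bootstrap: resample patients with replacement within each group.
n_boot = 2000
boot = np.empty(n_boot)
for i in range(n_boot):
    a_star = rng.choice(cost_a, size=cost_a.size, replace=True)
    b_star = rng.choice(cost_b, size=cost_b.size, replace=True)
    boot[i] = mean_diff(a_star, b_star)

ci_low, ci_high = np.percentile(boot, [2.5, 97.5])
print(f"difference in mean cost: {point:.0f}")
print(f"95% percentile bootstrap CI: ({ci_low:.0f}, {ci_high:.0f})")
```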

I have also used Quantile Regression in the context of analysing highly skewed semi-continuous data. You may like to see my presentation at the RSS conference:  http://www.rss.org.uk/pdf/Paul%20RSSPresentation-2005.pdf

Hope this helps.

Sanjoy



Dr. Sanjoy K. Paul
Head of Statistics & Modelling Group
DTU, OCDEM
University of Oxford
Oxford, OX3 7LJ
Tel: +44 (0)1865 857283 (Office)
+44 (0) 1865 770769 (Residence)
+44 (0) 7888712313 (Mobile)
Fax: +44 (0)1865 857260
Email: sanjo...@dtu.ox.ac.uk
sambh...@hotmail.com

Andrea Manca (Work)

Feb 25, 2008, 4:01:55 AM
to MedS...@googlegroups.com
Fahim,

If you really want to go down the route of log-transformation the
smearing coefficient is obtained as follows:

1. Log-transform your individual patient-level cost data.
2. Estimate E[ln(y)|x] in each treatment group, as the distribution of
the costs will differ between groups.
3. Calculate Zi = ln(Yi) - E[ln(y)|x], the departure of each
individual's log-transformed cost from its group mean.
4. Estimate your treatment-group-specific smearing coefficient as
S = E[exp(Zi)].
5. Back-transform the log-transformed group mean cost onto the original
scale as exp(E[ln(y)|x]) * S.
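
The five steps above can be sketched as follows. This is only an illustration on simulated lognormal costs; the group labels, sample sizes, and the `smeared_mean` helper are assumptions, not part of any package.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical strictly positive costs for two treatment groups.
costs = {
    "A": rng.lognormal(mean=8.0, sigma=0.8, size=500),
    "B": rng.lognormal(mean=8.3, sigma=1.0, size=500),
}

def smeared_mean(cost):
    """Return (back-transformed mean, smearing coefficient) for one group."""
    log_cost = np.log(cost)          # step 1: log-transform
    log_mean = log_cost.mean()       # step 2: estimate E[ln(y) | group]
    z = log_cost - log_mean          # step 3: deviations from the group log-mean
    smear = np.exp(z).mean()         # step 4: smearing coefficient S
    return np.exp(log_mean) * smear, smear  # step 5: back-transform

results = {group: smeared_mean(c) for group, c in costs.items()}
for group, (estimate, smear) in results.items():
    naive = np.exp(np.log(costs[group]).mean())  # exp of the log-mean, no smearing
    print(f"group {group}: S = {smear:.3f}, naive = {naive:.0f}, smeared = {estimate:.0f}")
```

With nothing but a group indicator in the model, the smeared estimate reproduces the arithmetic group mean exactly, and S is always at least 1. The correction becomes non-trivial once covariates enter the regression, where Zi is the residual from the fitted log-scale model rather than the deviation from a group mean.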

However, for many reasons log transformation of cost data is not a good
idea. These data are usually characterized by right-skewness and
excess zeros, and a log transformation might solve the former but fails
to address the latter, since the log of zero is undefined.

A common solution is to use a GLM regression model with gamma
distributed errors and an identity or log link. If you use the identity
link function, you are assuming that the determinants of costs act
additively, and there is no need to back-transform your results as you
will be working on the natural scale already. If you use the log
link, the assumption is that covariates act multiplicatively, and you
can simply exponentiate your results to get them back onto the natural
scale, without the need to use a smearing coefficient to back-transform
the mean of your log-transformed costs. The reason for this is that in
this framework you will be working on ln[E(y|x)] and not on E[ln(y)|x].
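
A minimal sketch of the difference between the two quantities, on simulated data. It fits a gamma GLM with log link by hand using iteratively reweighted least squares (for this family and link the working weights are constant, so each step is an ordinary least-squares fit), then compares the exponentiated GLM intercept with the exponentiated intercept from OLS on log(cost). The data-generating values and variable names are assumptions; in practice one would use a statistics package rather than this hand-rolled loop.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4000
x = rng.integers(0, 2, size=n).astype(float)       # hypothetical binary covariate
mu_true = np.exp(8.0 + 0.4 * x)                    # true mean cost, multiplicative effect
shape = 2.0
y = rng.gamma(shape=shape, scale=mu_true / shape)  # gamma costs with mean mu_true

X = np.column_stack([np.ones(n), x])

# Gamma GLM with log link via IRLS: working response z, constant weights.
beta = np.array([np.log(y.mean()), 0.0])
for _ in range(25):
    eta = X @ beta
    mu = np.exp(eta)
    z = eta + (y - mu) / mu
    beta_new, *_ = np.linalg.lstsq(X, z, rcond=None)
    converged = np.max(np.abs(beta_new - beta)) < 1e-10
    beta = beta_new
    if converged:
        break

# OLS on log(cost) for comparison: this models E[ln(y)|x], not ln E[y|x].
ols_beta, *_ = np.linalg.lstsq(X, np.log(y), rcond=None)

glm_mean0 = np.exp(beta[0])        # estimated mean cost when x = 0
naive0 = np.exp(ols_beta[0])       # geometric mean when x = 0 (no smearing)
sample_mean0 = y[x == 0].mean()
print(f"sample mean (x=0): {sample_mean0:.0f}")
print(f"gamma GLM, exp(b0): {glm_mean0:.0f}")
print(f"log-OLS, exp(b0):   {naive0:.0f}")
print(f"GLM cost ratio exp(b1): {np.exp(beta[1]):.3f}")
```

The exponentiated GLM intercept matches the arithmetic mean cost in the reference group, while the exponentiated log-OLS intercept is the geometric mean and falls short of it, which is the gap the smearing coefficient is meant to close.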

Hope this helps,
Andrea

--
_______________________________________________________________
Andrea Manca, Ph.D.
Senior Research Fellow

Centre for Health Economics
The University of York
Alcuin College, A Block
York YO10 5DD United Kingdom
Tel: +44 (0)1904 321430
Fax: +44 (0)1904 321402

E-mail: am...@york.ac.uk
Home page: http://www.york.ac.uk/inst/che/staff/manca.htm
http://myprofile.cos.com/am126
________________________________________________________________

Doug Fuller

Mar 5, 2008, 11:16:17 AM
to MedS...@googlegroups.com
This discussion of transformation vs. link function is
very intriguing. Is there a journal article or book
chapter to which you could direct me to better
understand why one method would be preferred over the
other?

Thanks,
Doug

Andrea Manca (Home)

Mar 5, 2008, 1:18:01 PM
to MedS...@googlegroups.com
Sure,

here are some references to start you off:

Basu A, Arondekar BV, Rathouz PJ. Scale of interest versus scale of estimation: comparing alternative estimators for the incremental costs of a comorbidity. Health Economics 2006;15(10):1091-1107.

Basu A, Manning WG, Mullahy J. Comparing alternative models: Log vs Cox proportional hazard? Health Economics 2004;13(8):749-765.

Manning WG. The logged dependent variable, heteroscedasticity, and the retransformation problem. Journal of Health Economics 1998;17(3):283-295.

Manning WG, Basu A, Mullahy J. Generalized modeling approaches to risk adjustment of skewed outcomes data. Journal of Health Economics 2005;24(3):465-488.

Manning WG, Mullahy J. Estimating log models: To transform or not to transform? Journal of Health Economics 2001;20(4):461-494.

HTH
Andrea