Outliers v. Skew in Linear Regression


mcap

Nov 23, 2010, 3:51:59 PM
to MedStats
We have developed a least squares regression model with the SF-12 as
the outcome and work-related pain (none, minor, moderate) as the
primary predictor. We also have age, gender and occupation (only 2
groups) in the model. There is a pain/age interaction for both
minor and moderate pain.

The issue relates to the distribution of the SF-12 scores. It is
negatively skewed in each pain group. This can be unavoidable if
your mean score is above national norms. The residuals are also
skewed at each pain level, but slightly less so. The skew
results from a series of outliers in each group, almost all of which
are negative. Without the outliers, the residual distributions are
roughly normal. We have a large sample (about 1,445) and I would
say there are about 25-30 observations with standardized residuals
between -3 and -5.5. With the exception of a couple of cases,
most of these don't seem to cause much havoc. They are interesting
and may deserve to influence the coefficients. Hard to tell, even
from a clinical perspective.

So, is this a question of non-normality, an outlier problem, or
both? Robust regression results in coefficients that are about 1
less than standard regression (the coefficients were between about 3 and 5 in
standard regression; we are reporting several values across a range
of ages though). Transformations of the DV don't seem to add much.
Looking at a variety of transformations, only the cubic seemed to make
things even close to normal. Transformations would also make this very
difficult to interpret.

Sorry for the length....just trying to include everything you may ask
for........Thanks!!!!

Any thoughts?
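
The residual check described in this post can be sketched in a few lines. This is only an illustration on simulated data, not the poster's analysis: the variable names (`age`, `score`), the single-predictor model, the size of the outlier shifts, and the simplified standardization (residual divided by the residual SD, ignoring leverage) are all assumptions.

```python
import random
import statistics


def ols_simple(x, y):
    """Least-squares intercept and slope for a single predictor."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def skewness(values):
    """Moment-based sample skewness (third central moment / sd^3)."""
    m = statistics.fmean(values)
    n = len(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical data: 1,445 scores with a mild age effect, plus 28
# observations pushed far below the line to mimic negative outliers.
rng = random.Random(0)
n = 1445
age = [rng.uniform(20, 65) for _ in range(n)]
score = [55 - 0.1 * a + rng.gauss(0, 5) for a in age]
for i in rng.sample(range(n), 28):
    score[i] -= rng.uniform(20, 30)

intercept, slope = ols_simple(age, score)
resid = [s - intercept - slope * a for a, s in zip(age, score)]
resid_sd = (sum(r * r for r in resid) / (n - 2)) ** 0.5
std_resid = [r / resid_sd for r in resid]  # simple scaling, leverage ignored

flagged = [z for z in std_resid if z < -3]
kept = [r for r, z in zip(resid, std_resid) if z >= -3]

print("residual skewness, all cases:", round(skewness(resid), 2))
print("cases with standardized residual < -3:", len(flagged))
print("residual skewness without them:", round(skewness(kept), 2))
```

In this simulation a small share of large negative residuals is enough to make the whole residual distribution look skewed, which is the "skew or outliers?" ambiguity raised above.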

stephane heritier

Nov 23, 2010, 4:19:05 PM
to meds...@googlegroups.com
Try a robust fit and see what you get.

See my book: Heritier, Cantoni, Copt, Victoria-Feser (2009), Robust Methods in Biostatistics,
Wiley, Chapter 3.

http://www.unige.ch/ses/metri/cantoni/RobustBiostat/index.html


Stephane


 
> --
> To post a new thread to MedStats, send email to MedS...@googlegroups.com .
> MedStats' home page is http://groups.google.com/group/MedStats .
> Rules: http://groups.google.com/group/MedStats/web/medstats-rules

Ray Koopman

Nov 23, 2010, 4:48:56 PM
to MedStats
Have you tried Quantile Regression?

mcap

Nov 23, 2010, 5:51:07 PM
to MedStats
Thanks all. I did try median regression and the results were similar
to what I get with robust regression.

Ray Koopman

Nov 23, 2010, 8:18:00 PM
to MedStats
Have you looked at the coefficients as functions of the quantile, as
in Koenker & Hallock's analysis of determinants of infant birthweight?
See Figure 4 in Koenker R & Hallock KF (2001). Quantile regression.
Journal of Economic Perspectives, 15:143-156.
http://www.econ.uiuc.edu/~roger/research/rq/QRJEP.pdf
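
To make "coefficients as functions of the quantile" concrete, here is a minimal sketch in plain Python. It is not how production software fits these models: it uses a brute-force search that is exact for one predictor only because a minimiser of the check loss always exists among lines passing through two observations. The data are synthetic and deterministic (scores placed at evenly spaced quantiles of two left-skewed distributions), and the names `no_pain`, `pain`, and `quantile_line` are hypothetical.

```python
import math


def check_loss(residual, tau):
    """Koenker-Bassett check (pinball) loss for one residual."""
    return residual * (tau - (1.0 if residual < 0 else 0.0))


def quantile_line(x, y, tau):
    """Brute-force quantile regression with a single predictor.

    Tries every line through two observations with distinct x values
    and keeps the one with the smallest summed check loss. This is
    O(n^3), so it is only suitable for small illustrative data sets.
    """
    best = None
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            if x[i] == x[j]:
                continue
            slope = (y[j] - y[i]) / (x[j] - x[i])
            intercept = y[i] - slope * x[i]
            loss = sum(
                check_loss(yk - intercept - slope * xk, tau)
                for xk, yk in zip(x, y)
            )
            if best is None or loss < best[0]:
                best = (loss, intercept, slope)
    return best[1], best[2]


# Synthetic scores: both groups have a long lower tail, the pain group
# a much longer one, so the group difference depends on the quantile.
n = 45
u = [(k + 0.5) / n for k in range(n)]
no_pain = [56 + 4 * math.log(v) for v in u]
pain = [54 + 12 * math.log(v) for v in u]
x = [0] * n + [1] * n
y = no_pain + pain

slopes = {tau: quantile_line(x, y, tau)[1] for tau in (0.1, 0.5, 0.9)}
for tau, slope in slopes.items():
    print(f"tau={tau}: pain coefficient = {slope:.2f}")
```

With these made-up data the pain coefficient is strongly negative at the 10th percentile and much smaller at the 90th, which is the kind of pattern a coefficient-by-quantile plot is meant to reveal. A real analysis would use a dedicated quantile regression routine rather than this search.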

Peter Flom

Nov 23, 2010, 8:28:02 PM
to meds...@googlegroups.com
I'd just like to note that I think quantile regression is an excellent idea
here, and a very underused technique.

Peter

Kornbrot, Diana

Nov 24, 2010, 2:41:12 AM
to meds...@googlegroups.com
Me too
Sometimes called ordinal regression, as in SPSS but can be done with categorical as well as metric predictors
diana
Professor Diana Kornbrot
email:  d.e.ko...@herts.ac.uk
web:    http://web.me.com/kornbrot/KornbrotHome.html
Work
Centre for Lifespan & Chronic Illness Research, CLiCIR
School of Psychology
University of Hertfordshire
College Lane, Hatfield, Hertfordshire AL10 9AB, UK
voice:  +44 (0) 1707 28 46 26
Home
19 Elmhurst Avenue
London N2 0LT, UK
 landline:  +44 (0) 20 8444  2081
 mobile:   +44 (0) 7403 18 16 12
fax:        +44 (0) 8707 06 49 97






Bruce Weaver

Nov 24, 2010, 7:17:10 AM
to MedStats
Diana, are you saying that one can perform quantile regression via the
PLUM procedure in SPSS? I thought the only way to get it was via the
R-language extension, as described here:

http://faculty.chass.ncsu.edu/garson/PA765/regress.htm#quantile

Thanks for clarifying.

Cheers,
Bruce


mcap

Nov 24, 2010, 8:26:05 AM
to MedStats
Hi:

Thanks for the references and ideas. I will have a look at
quantile regression. I am not sure where this is in SPSS (I have the
faculty pack). In Stata this is readily available.

Through the past few versions of SPSS (or should I say PASW), I have been
disappointed by the lack of new features in the core statistics
modules. They really seem to be focusing on business and large
datasets.

Thanks again everyone. I will try to post an update once we get
the paper out the door. Any further thoughts are welcome, of
course.


Marc

Stan Alekman

Nov 24, 2010, 9:38:36 AM
to meds...@googlegroups.com
Hello,

Quantile Regression will be proposed by the joint FDA-Industry working
group of the Product Quality Research Institute for establishing
expiration periods for drugs.

If readers can identify commercially available programs where Quantile
Regression is available, I would be grateful.

Thank you.

Regards,
Stan Alekman



ציפי שוחט

Nov 24, 2010, 10:12:48 AM
to meds...@googlegroups.com
SAS has PROC QUANTREG.

By the way, I don't think that quantile regression (where the focus is on a specific percentile of Y) is equivalent to ordinal regression (which may be run with SAS PROC LOGISTIC).
 
Cheers.
 
Tzippy Shochat
 

Marc Schwartz

Nov 24, 2010, 10:26:14 AM
to meds...@googlegroups.com
Why limit yourself to "commercial" programs?

R has the excellent 'quantreg' package by Roger Koenker:

http://cran.us.r-project.org/web/packages/quantreg/index.html

and if you have any concerns about using R for regulated clinical trials, read:

http://www.r-project.org/doc/R-FDA.pdf

The FDA is using R internally on an increasing basis. R was used internally by the FDA for the Avandia safety meta-analysis and it does not get more high profile than that.

HTH,

Marc Schwartz

Stan Alekman

Nov 24, 2010, 12:52:03 PM
to meds...@googlegroups.com
Thank you.

Stan Alekman


SR Millis

Nov 24, 2010, 1:50:43 PM
to meds...@googlegroups.com
Stata Rel. 11 can perform quantile regression via its qreg program.


~~~~~~~~~~~
Scott R Millis, PhD, ABPP, CStat, CSci
Professor
Wayne State University School of Medicine
Email: aa3...@wayne.edu
Email: srmi...@yahoo.com
Tel: 313-993-8085



Kornbrot, Diana

Nov 25, 2010, 5:16:50 AM
to meds...@googlegroups.com
My mistake.
Quantile regression is similar to ordinal regression as implemented in SPSS PLUM, but not the same.
Ordinal regression seems to me more useful in many situations, as the cut-points are determined by the data rather than by arbitrary quantiles.
I am interested in the pros and cons of these two methods from experts.
But here is an example.
Suppose we have a data set where we want to determine survival as a function of some continuous variable that has been categorized into 3 ordinal groups. Should one choose the group boundaries as the 33rd and 66th percentiles? Or should one choose known levels of the continuous variable that approximately divide the data into thirds?
If the study is replicated or compared with a similar study, the tertiles will inevitably fall at different values of the predictor variable, whereas cutpoints defined as values on the predictor will remain the same. Hence a more accurate comparison will be possible.
In medical applications people often seem to use tertiles or quartiles, whereas known values of the predictor, e.g. a cancer marker, would in my view be better.
What do you all think?
Best
Diana



Peter Flom

Nov 25, 2010, 6:21:18 AM
to meds...@googlegroups.com

Diana Kornbrot wrote:

<<<<

My mistake.
Quantile regression is similar to ordinal regression as implemented in SPSS PLUM, but not the same.
Ordinal regression seems to me more useful in many situations, as the cut-points are determined by the data rather than by arbitrary quantiles.

I am interested in the pros and cons of these two methods from experts.

>>>>

I don't know SPSS and PLUM, but in what I've seen, ordinal logistic regression uses data that are already categorized when they come in, whereas quantile regression models the quantiles rather than the mean. In SAS, if you try ordinal regression (PROC LOGISTIC) on continuous data you will get a model with a different intercept for each level of the dependent variable. This is not good, obviously. Quantile regression (PROC QUANTREG) lets you model the specific quantiles that you want, and gives you parameter estimates for all IVs for each quantile. In addition, ordinal logistic regression (in its most common form) makes the proportional odds assumption. Quantile regression does not. In fact, QUANTREG seems more like multinomial than ordinal logistic.

In addition, QUANTREG and LOGISTIC give output in very different forms: the former looks like OLS regression, only it's modeling something other than the mean; the latter gives parameter estimates that need to be plugged into logistic equations, and gives odds ratios, which don't really translate.

Finally, the two PROCs are modeling different things. QUANTREG models a particular quantile; LOGISTIC models being in a particular category.

So, now I am wondering what PLUM in SPSS does.

<<<<

But here is an example.
Suppose we have a data set where we want to determine survival as a function of some continuous variable that has been categorized into 3 ordinal groups. Should one choose the group boundaries as the 33rd and 66th percentiles? Or should one choose known levels of the continuous variable that approximately divide the data into thirds?
If the study is replicated or compared with a similar study, the tertiles will inevitably fall at different values of the predictor variable, whereas cutpoints defined as values on the predictor will remain the same. Hence a more accurate comparison will be possible.

In medical applications people often seem to use tertiles or quartiles, whereas known values of the predictor, e.g. a cancer marker, would in my view be better.

>>>>

I think these are BOTH bad options.  Continuous data should only be categorized based on some strong theoretical grounds.  One example I run into often is birthweight data.  Babies under 2.5 kg are “low birthweight”, those above are “normal”.  This is nonsensical but nearly universal.  It treats a baby who weighs 1 kg as identical to one who weighs 2.49 kg, and one who is 2.51 kg as identical to one who is 4 kg.  I have explained this to clients who then say “You’re right, that makes no sense, but it’s what everyone does, so we’re going to do it”.

Sometimes there are practical reasons for categorizing data, and these may be important. For instance, if you are asking about income, the responses are nearly always in categories. Not only are people more willing to answer when it is formulated this way, but many do not know their precise incomes, so they will be guessing anyway. In addition, since income data are very highly skewed, they will have to be transformed anyway.

Happy Thanksgiving!

Peter

Bruce Weaver

Nov 25, 2010, 11:00:12 AM
to MedStats
On Nov 25, 6:21 am, "Peter Flom" <peterflomconsult...@mindspring.com>
wrote:

> So, now I am wondering what plum in SPSS does


Ordinal logistic regression. One sees different variations on what
PLUM stands for. One version is Polytomous Logit Universal Model.


> Happy Thanksgiving!
>
> Peter

Happy *American* Thanksgiving. The Canadian version was back in
October. ;-)

--
Bruce Weaver
bwe...@lakeheadu.ca
http://sites.google.com/a/lakeheadu.ca/bweaver/Home
"When all else fails, RTFM."

Allan Reese

Nov 26, 2010, 7:30:35 AM
to MedStats
This discussion seems wildly eccentric to the original query. It's
confusing ordinal variable as response (ordinal regression), with
predictors that differ from the normal expected value (robust and
quantile regressions), with ordinal variables as predictors (mixed
model), and with developing a non-linear model by optimizing
categorical dummy variables (GAM?). The starting point asked how to
handle skewness (or query outliers) in SF-12 measure as response -
which might suggest a GLM. No wonder people find statistics "can
prove anything".

This seems to me a good example of getting lost in a morass of
methodology rather than looking at the provenance and handling of the
data. Is it "reasonable" to assume the scores are iid, or more likely
there is a mixture of populations represented?

Will SF-12 correlate with your happiness score?

Allan

mcap

Nov 26, 2010, 12:30:45 PM
to MedStats
> This seems to me a good example of getting lost in a morass of
> methodology rather than looking at the provenance and handling of the
> data.  Is it "reasonable" to assume the scores are iid, or more likely
> there is a mixture of populations represented?

I agree that we may be getting lost in the details, although it
certainly is an interesting discussion and I always learn a lot from
everyone.

As for assuming the observations are iid, that can be tough. I imagine
most study populations aren't in some way (even though they are
assumed to be), and it can be just a matter of how much. I think the
assumption is reasonable here (at least within the subgroups we are
controlling for).

I ran median regression and I ran robust regression. The coefficient for minor
pain (in terms of the SF-12) is very similar. The coefficient for moderate
pain drops from about 5 to 4 when I use either of the alternative
strategies (as I calculate the coefficients across a range of ages, this
difference either shrinks or grows).
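
The kind of shift described here, where a robust fit pulls a group coefficient back toward zero, can be reproduced on simulated data. The sketch below is an assumption-laden illustration, not the poster's model: it uses a single binary predictor (`pain`), a Huber M-estimator fitted by iteratively reweighted least squares with the conventional tuning constant 1.345, and made-up effect sizes. The poster did not say which robust estimator was used.

```python
import random
import statistics


def weighted_line(x, y, w):
    """Weighted least-squares intercept and slope for one predictor."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def huber_line(x, y, c=1.345, iterations=50):
    """Huber M-estimate via iteratively reweighted least squares."""
    weights = [1.0] * len(x)
    for _ in range(iterations):
        intercept, slope = weighted_line(x, y, weights)
        resid = [yi - intercept - slope * xi for xi, yi in zip(x, y)]
        # Robust scale: median absolute residual, rescaled for normal errors.
        scale = statistics.median([abs(r) for r in resid]) / 0.6745
        cutoff = c * scale
        # Full weight inside the cutoff, shrinking weight beyond it.
        weights = [1.0 if abs(r) <= cutoff else cutoff / abs(r) for r in resid]
    return intercept, slope


# Hypothetical data: true pain effect of -4, with 28 large negative
# outliers placed in the pain group only.
rng = random.Random(1)
pain = [0] * 700 + [1] * 700
score = [52 - 4 * p + rng.gauss(0, 6) for p in pain]
for i in rng.sample(range(700, 1400), 28):
    score[i] -= 25

ols_slope = weighted_line(pain, score, [1.0] * len(pain))[1]
huber_slope = huber_line(pain, score)[1]
print("OLS pain coefficient:  ", round(ols_slope, 2))
print("Huber pain coefficient:", round(huber_slope, 2))
```

In this setup the outliers sit in one group, so least squares attributes part of their effect to the pain coefficient while the Huber fit downweights them. Whether the downweighted cases "deserve" their influence is a substantive question the code cannot answer.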

I guess, theoretically, I am looking at a dataset that is skewed
because of outliers, but handling a skewed DV and a DV with outliers
can be different.
This is a very large dataset without a ton of large
residuals, but those cases do influence the results.


I am also curious to see if anyone else has had similar issues working
with the SF-12. It is standardized to a population mean of 50 and
SD of 10. If your population is relatively less or more healthy
overall, you will end up with a floor or ceiling effect.
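
A quick simulation shows how a ceiling alone can produce negative skew when the sample is healthier than the norm. The numbers are assumptions chosen for illustration: a latent mean of 58 on the 50/10 scale, and a hypothetical maximum attainable score of 65 (the real instrument's upper limit is not stated in this thread).

```python
import random
import statistics


def skewness(values):
    """Moment-based sample skewness (third central moment / sd^3)."""
    m = statistics.fmean(values)
    n = len(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


CEILING = 65.0  # hypothetical maximum attainable score

rng = random.Random(2)
latent = [rng.gauss(58, 10) for _ in range(5000)]   # healthier-than-norm sample
observed = [min(v, CEILING) for v in latent]        # scores capped at the ceiling

print("skewness of latent scores:  ", round(skewness(latent), 2))
print("skewness of observed scores:", round(skewness(observed), 2))
print("share at the ceiling:", sum(v == CEILING for v in observed) / len(observed))
```

Here the latent scores are symmetric and the observed scores are negatively skewed purely because the upper tail is compressed, with no outliers involved. That is a different mechanism from a handful of extreme low scores, even though both show up as negative skew.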

The numbers don't matter so much to me. I think that if I use least
squares or another method and point out the limitations we could be OK,
and reviewers could always ask us to consider something else. Further
ideas??

Mike Campbell

Dec 1, 2010, 10:56:54 AM
to MedStats
Ordinal regression applied to quality of life data was discussed in a
review some time ago (Lall R, Campbell MJ, Walters SJ, Morgan K, MRC
CFAS (2002) A review of ordinal regression models applied to health
related quality of life assessments. Stat Meth Med Res, 11, 49-67).
There are a number of models to choose from, which make different
assumptions about how the odds change for different categories.

However, I am surprised at the outliers having so much effect in such
a large data set. The problem with quality of life data, because they
are bounded, is usually the floor and/or ceiling effects. Nevertheless
we often find that linear regression works 'well enough'. This is
further discussed in 'Quality of life outcomes in clinical trials and
health care evaluation: a practical guide to analysis and
interpretation', Walters SJ, Wiley, 2009.
Mike

mcap

Dec 2, 2010, 11:16:06 AM
to MedStats
Thanks All!
