regression of data with nonstandard outcome shape

Juliet Hannah

Sep 21, 2009, 10:45:39 PM

to meds...@googlegroups.com

Hi Group,

I am trying to model a continuous outcome. If I make a histogram of
this outcome, it looks bimodal (but my question is about irregular
shapes in general). Even after including all known covariates, the
residuals are still shaped in this manner. It does not seem like one
of those things that can be handled by transformation.

I could be missing a covariate that is unknown, or maybe not. I really have no
idea.

Does anyone have suggestions of topics I should look up to analyze this data?

Also, have I missed something or do none of my graduate level regression books
cover this topic? It does not seem like it should be that uncommon.

In my actual problem, there are groups in my data (correlated data),
so that I would
have used a mixed model if my residuals had looked ok.

In addition, does anyone have any software recommendations?

Thanks for your time.

Juliet

Ted Harding

Sep 22, 2009, 5:47:01 AM

to meds...@googlegroups.com

Hi Juliet,
This is a preliminary comment (more in the spirit of solidarity
than to propose a definitive solution ... ).

Apparent regression effects which are really due to unidentified
latent groupings in the data are far more common than is usually
acknowledged ("usually acknowledged" in fact = "hardly ever").

Consider the following example of artificial data made up of two
groups. There are two variates: Y = "outcome", X = "covariate",
both Normally distributed in each group. In each group, the
outcome Y is independent of the covariate X.

X1 100 values, Normal, mean=0, SD=2
Y1 100 values, Normal, mean=0, SD=1
X1 and Y1 generated independently of each other

X2 100 values, Normal, mean=4, SD=2
Y2 100 values, Normal, mean=2, SD=1
X2 and Y2 generated independently of each other

Now pool them:
X 200 values, (X1 and X2 pooled)
Y 200 values, (Y1 and Y2 pooled)

Now plot (X,Y): a clear apparent increase in Y as X increases.
Perform a linear regression of Y on X. A typical P-value for the
slope of the regression is of the order of 1e-10, i.e. 10^(-10).

BUT: In each group there is NO DEPENDENCY WHATEVER between Y & X.

For R users: the following code does the above (with an instance
of the regression output):

X1 <- rnorm(100,0,2); Y1 <- rnorm(100,0,1)
X2 <- rnorm(100,4,2); Y2 <- rnorm(100,2,1)
X <- c(X1,X2); Y <- c(Y1,Y2)
plot(X,Y)

summary(lm(Y~X))$coef
#              Estimate Std. Error  t value     Pr(>|t|)
# (Intercept) 0.6674836 0.10161794 6.568561 4.379292e-10
# X           0.2223979 0.03125616 7.115330 1.999573e-11

Now look at the histograms of the pooled values X and Y:

hist(X)
hist(Y)

There is no obvious bimodality (despite the fact that the true
joint distribution generating the data is bimodal), though there is
some indication that the distributions are not Normal. A detailed
examination of the higher moments (especially kurtosis) would
be more clearly diagnostic.
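
A minimal sketch of such a moment check, in base R. The helper names
skewness() and excess_kurtosis() are illustrative (they are not from any
package) and the seed is arbitrary. For an equal mixture of two Normals
with a common SD the population excess kurtosis is negative (here -0.5
for both X and Y), so that is the direction to look for; with only 200
values the sample estimate is noisy.

```r
# Illustrative helpers (not from a package): moment-based estimates.
skewness <- function(v) {
  z <- v - mean(v)
  mean(z^3) / mean(z^2)^1.5
}
excess_kurtosis <- function(v) {
  z <- v - mean(v)
  mean(z^4) / mean(z^2)^2 - 3   # 0 for a Normal distribution
}

set.seed(1)  # arbitrary seed, only for reproducibility
X <- c(rnorm(100, 0, 2), rnorm(100, 4, 2))
Y <- c(rnorm(100, 0, 1), rnorm(100, 2, 1))
res <- residuals(lm(Y ~ X))

# One column per variable, one row per moment
moments <- sapply(list(X = X, Y = Y, residuals = res),
                  function(v) c(skewness = skewness(v),
                                excess_kurtosis = excess_kurtosis(v)))
print(round(moments, 3))
```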

Even the histogram of the residuals from the regression does
not look obviously non-Normal (though the kurtosis is clearly
suspect):

hist(residuals(lm(Y~X)))

However, you can get a slightly sharper picture from histograms
of the Principal Components of the data:

princomp(cbind(X,Y))$loadings
# Loadings:
#   Comp.1 Comp.2
# X -0.965  0.262
# Y -0.262 -0.965
hist(0.965*X + 0.262*Y)
# (which is beginning to look bimodal)
hist(0.262*X - 0.965*Y)
# (which definitely looks leptokurtic)

There is a certain (and indeed theoretically correct) sense in
which there is a genuine increasing regression relationship
between Y and X, if you stick to the primary definition of
"regression" as the expected value of Y conditional on the
value of X, considered as a function of X.

In this case, imagine a population consisting of an equal mix
of the two groups. You sample an individual from that population,
and get (X,Y) for that individual. That individual is equally
likely to be from Group 1 or Group 2. If from Group 1, there is
no relationship between Y and X; the same if from Group 2.

BUT: Given the value of X, an individual with that value of X
is more (or less) likely to be from Group 1 than from Group 2,
depending on the value of X: For low values of X, Group 1 is
more likely, conditional on X. For high values of X, Group 2
is more likely, conditional on X. The Group 1 Y-mean is lower
than the Group 2 Y-mean.

Hence, conditional on a low value of X, the Y-value is more
likely to be a Group 1 Y-value and hence have a lower expected
value; for a high value of X, the Y-value is more likely to be
a Group 2 Y-value and hence have a higher expected value. So the
expected value of Y, conditional on X, increases as X increases.

Exercise for the reader: The probability of sampling an individual
from Group 2, conditional on X, is a logistic function of X, so
if you score "0" for Group 1, and "1" for Group 2, the regression
of Score on X will be a logistic regression!
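
One way to check the exercise numerically, as a sketch. With equal group
sizes and the means and SD used above, the log-odds of Group 2 given X=x
work out to x - 2, so a logistic regression of Score on X should give an
intercept near -2 and a slope near 1. The seed and the larger sample size
(2000 per group, chosen only to make the estimates stable) are my
assumptions, not part of the original example.

```r
set.seed(2)   # arbitrary seed
n <- 2000     # per group; larger than the 100 above, for stable estimates
X <- c(rnorm(n, 0, 2), rnorm(n, 4, 2))
Score <- rep(c(0, 1), each = n)   # 0 = Group 1, 1 = Group 2

# Logistic regression of group membership on X
fit <- glm(Score ~ X, family = binomial)
print(coef(fit))

# Theoretical log-odds under these assumptions:
#   log f2(x)/f1(x) = (x^2 - (x - 4)^2) / (2 * 2^2) = x - 2
```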

Since the regression of Y on X here is

Expected value of Y given X=x

= (mean of Grp 1)*Prob(from Grp 1 given X=x) +
(mean of Grp 2)*Prob(from Grp 2 given X=x)

= (mean of Grp 1) +
((mean of Grp 2) - (mean of Grp 1))*Prob(from Grp 2 given X=x)

the regression of Y on X follows the logistic fit

Prob(from Grp 2 given X=x)

So there is a genuine regression lurking here -- it just ain't linear!
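
To see the shape of that regression, here is a small sketch which
overlays the population curve on the scatterplot, next to the straight
line from lm() and a lowess smooth. It assumes the group means, SD and
equal mixing of the example above, for which Prob(from Grp 2 given X=x)
is plogis(x - 2); the function name true_reg() and the seed are
illustrative choices.

```r
set.seed(3)   # arbitrary seed
X <- c(rnorm(100, 0, 2), rnorm(100, 4, 2))
Y <- c(rnorm(100, 0, 1), rnorm(100, 2, 1))

# E[Y | X = x] = (mean of Grp 1) + (difference of means) * P(Grp 2 | x)
true_reg <- function(x) 0 + (2 - 0) * plogis(x - 2)

plot(X, Y, col = "grey40")
abline(lm(Y ~ X), lty = 2)                # straight-line fit
curve(true_reg(x), add = TRUE, lwd = 2)   # population regression curve
lines(lowess(X, Y), col = "red")          # data-driven smooth
legend("topleft", lty = c(2, 1, 1), lwd = c(1, 2, 1),
       col = c("black", "black", "red"),
       legend = c("lm fit", "E[Y | X]", "lowess"))
```

The curve flattens out towards 0 on the left and towards 2 on the right,
which a straight line cannot do.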

Over the years, I have come across many datasets where the implicit
existence of two (or more) groups has provided an excellent -- and
realistic -- model for apparent dependency of Y on X. I may post
later on some good examples of this which people can study, if
interested.

I agree that this sort of question is not discussed in standard
texts, which tend to concentrate on development and explanation
(often, of course, very well done) of standard types of analysis
presented in a kind of ritualistic way ("When the value of Y
increases as X increases, we perform the ceremony of Linear
Regression. If your Study Design is Worthy, you will be blessed
with a Significant P-value, which is the Key of Entry into the
Kingdom of Publication Heaven." No, that is not a quotation
from anything ... ).

A better domain for exploring this kind of thing is the literature
on Exploratory Data Analysis. One type of analysis which can be
useful in helping to identify "hidden sub-groups" is based on
mixture models, for which routines are available in many statistical
software packages. There is an enormous variety of approaches to
mixture models.
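
As a rough illustration of what such a routine does, here is a
hand-rolled EM fit of a two-component bivariate Normal mixture to data
like the example above, in base R only. It is a sketch under stated
assumptions, not a recommended implementation: the two components are
assumed to share one covariance matrix, the start is a median split on
the first principal component, and a fixed number of iterations replaces
a proper convergence check. The helper dmvn() and all object names are
mine. Dedicated packages handle these details far more carefully.

```r
set.seed(4)   # arbitrary seed
Z <- cbind(X = c(rnorm(100, 0, 2), rnorm(100, 4, 2)),
           Y = c(rnorm(100, 0, 1), rnorm(100, 2, 1)))

# Bivariate Normal density evaluated at each row of Z (illustrative helper)
dmvn <- function(Z, mu, Sigma) {
  Zc <- sweep(Z, 2, mu)                      # centre each column
  q <- rowSums((Zc %*% solve(Sigma)) * Zc)   # squared Mahalanobis distances
  exp(-0.5 * q) / (2 * pi * sqrt(det(Sigma)))
}

# Starting weights: split at the median of the first principal component
pc1 <- prcomp(Z)$x[, 1]
r <- as.numeric(pc1 > median(pc1))   # r[i] = weight of row i in component B

for (iter in 1:300) {
  # M-step: mixing proportion, component means, shared covariance
  w   <- mean(r)
  muA <- colSums(Z * (1 - r)) / sum(1 - r)
  muB <- colSums(Z * r) / sum(r)
  ZA  <- sweep(Z, 2, muA) * sqrt(1 - r)
  ZB  <- sweep(Z, 2, muB) * sqrt(r)
  Sigma <- (crossprod(ZA) + crossprod(ZB)) / nrow(Z)
  # E-step: posterior probability that each row belongs to component B
  dA <- (1 - w) * dmvn(Z, muA, Sigma)
  dB <- w * dmvn(Z, muB, Sigma)
  r  <- dB / (dA + dB)
}

print(round(rbind(A = muA, B = muB), 2))   # estimated component means
print(round(w, 2))                         # estimated proportion in B
print(round(Sigma, 2))                     # estimated shared covariance
```

Which component ends up labelled A or B is arbitrary. The vector r can
then be plotted against X to compare with the logistic curve discussed
earlier.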

Coming back to some specific points in your query.
The fact that you have observed clear bimodality in your data
very strongly suggests that you should be looking along the above
lines. The above example was deliberately chosen so that the
effect (in apparent dependency of Y on X) would be very marked
indeed, yet the basic diagnostics (histograms of X, Y etc.)
would not give a particularly definitive impression.

You also say that "there are groups in my data". If these
are explicit, have you allowed for them before looking at the
bimodalities?

Hoping this may help a bit!
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.H...@manchester.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 22-Sep-09 Time: 10:46:58
------------------------------ XFMail ------------------------------

Juliet Hannah

Sep 22, 2009, 9:30:12 AM

to meds...@googlegroups.com

Here is another way I was thinking about it, perhaps incorrectly.

Let's say an individual's sex had the biggest influence on the outcome. But
let's say it was unmeasured. This kind of situation could create
bimodality, right? Even if the data given sex were approximately
normal, I still
cannot proceed as usual no matter what.

Would quantile regression help out in this situation? It is not just
that distributional assumptions are not met. A huge effect is unmeasured,
and the remaining variation may only be explained by accounting for this
unmeasured effect.

Based on my initial search, I was about to look into "mixture models",
but before
I invested the time, I wanted to check if this is a reasonable way to
proceed. It
seems this is what Ted is suggesting I do.

Is this a book I should look into:

Peter Schlattmann
Medical Applications of Finite Mixture Models

As far as the grouping I mentioned earlier, it is the same as
a treatment given to animals in a litter. There is a litter effect that
could be treated as a random intercept (in a mixed model setting).

In sum, I have this bimodal outcome on clustered data.
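
To make the scenario concrete, here is a small simulated sketch. All
numbers, variable names and effect sizes are invented for illustration
and are not taken from the data described in this thread. A strong
binary covariate (called sex here) is either omitted from or included in
a random-intercept model for litters, fitted with lme() from the nlme
package, which is assumed to be installed (it normally ships with R).

```r
library(nlme)

set.seed(5)                       # arbitrary seed
n_litter <- 30                    # hypothetical number of litters
per      <- 8                     # hypothetical animals per litter
n        <- n_litter * per
litter   <- factor(rep(seq_len(n_litter), each = per))
u        <- rnorm(n_litter, 0, 0.5)[as.integer(litter)]  # litter effect
sex      <- rbinom(n, 1, 0.5)     # the covariate that goes "unmeasured"
x        <- rnorm(n)              # a measured covariate
y        <- 1 + 0.3 * x + 4 * sex + u + rnorm(n)
d        <- data.frame(y, x, sex, litter)

# Random intercept per litter, with and without the strong covariate
fit_omit <- lme(y ~ x,       random = ~ 1 | litter, data = d)
fit_full <- lme(y ~ x + sex, random = ~ 1 | litter, data = d)

op <- par(mfrow = c(1, 2))
hist(resid(fit_omit), main = "sex omitted")    # two humps remain
hist(resid(fit_full), main = "sex included")   # roughly Normal
par(op)
```

In this simulation the residuals stay bimodal until the missing
covariate enters the model, whatever is done with the random effect.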

As I mentioned, I am surprised that in all my courses this problem was not
discussed. I get out in the field and within a few months, I'm stumped! :)

Thanks!

Peter Flom

Sep 22, 2009, 9:39:17 AM

to meds...@googlegroups.com

Juliet Hannah <juliet...@gmail.com> wrote

>
>Here is another way I was thinking about it, perhaps incorrectly.
>
>Let's say an individual's sex had the biggest influence on the outcome. But
>let's say it was unmeasured. This kind of situation could create
>bimodality, right. Even if the data given sex were approximately
>normal, I still
>cannot proceed as usual no matter what.
>
>Would quantile regression help out in this situation. It is not just
>that distributional assumptions are not met. A huge effect is unmeasured,
>and the remaining variation may only be explained by accounting for this
>unmeasured effect.
>
>Based on my initial search, I was about to look into "mixture models",
>but before
>I invested the time, I wanted to check if this is a reasonable way to
>proceed. It
>seems this is what Ted is suggesting I do.
>

If there really is a mixture, then that can certainly be a valuable approach.
I am not sure what would happen if you tried to apply a mixture model
where there was a single population. Perhaps nothing bad.

In my experience, mixture models are tricky to apply and to interpret, but that may be me.

But it's hard to know if there really is a mixture, since you haven't given us any context.

What's your DV and what are your IVs?
What is it you are trying to do?
How strong is the bimodality?

>As I mentioned, I am surprised that in all my courses this problem was not
>discussed. I get out in the field and within a few months, I'm stumped! :)

It took that long????? :-)

Seriously, real world problems often don't match up with textbook solutions.

Peter

Peter L. Flom, PhD
Statistical Consultant
Website: www DOT peterflomconsulting DOT com
Writing: http://www.associatedcontent.com/user/582880/peter_flom.html
Twitter: @peterflom

Juliet Hannah

Sep 22, 2009, 10:02:55 AM

to meds...@googlegroups.com

Hi Group,

One more thing.

The setup is a drug is given to individuals and a continuous response
is measured. I am testing if a biomarker is related to the response. Several
individuals come from families so I must take the correlation between
these individuals into account.

I realized I may have sounded as if I am set on the idea that there
is a missing covariate. I'm not, but I'm trying to
learn what to do if that is the case.

I'm going to read up on quantile regression and mixture models.

Thanks!

Juliet

Chris Everyman

Sep 27, 2009, 8:44:21 AM

to MedStats

On 22 Sep, 10:47, (Ted Harding) <Ted.Hard...@manchester.ac.uk> wrote:
> A better domain for exploring his kind of thing is the literature
> on Exploratory Data Analysis. One type of analysis which can be
> useful in helping to identify "hidden sub-groups" is based on
> mixture models, for which routines are available in many statistical
> software packages. There is an enormous variety of approaches to
> mixture models.

Ted, thanks for a thoughtful and fascinating post. Are there any
particular articles and/or books on this that you would recommend,
especially with regards to the use of exploratory data analysis in
this context?

As Ted, and others, may already be aware, it is possible to find
"hidden sub-groups" in what is really a single population. The
following quotes are from
Tarpey, T., Yun, D., & Petkova, E. (2008). Model misspecification:
finite mixture or homogeneous? Statistical Modelling, Vol. 8, Iss. 2,
pp. 199-218.
http://smj.sagepub.com/cgi/content/abstract/8/2/199
"The problem, highlighted in this paper, is that in many cases
mixture distributions and homogeneous non-normal distributions will be
virtually identical to one another. Discerning a finite mixture from
some other homogeneous non-normal distribution is an old problem.
Pearson (1895) states, ‘The question may be raised, how are we to
discriminate between a true curve of skew type and a compound curve,’
where by compound he means mixture.... More recently Bauer and Curran
(2003) demonstrate that a growth mixture model may appear optimal even
in cases where the true distribution is not a mixture." (Tarpey et
al., 2008, p. 201).
"Closely related to finite mixture models is clustering (discussed
in Section 5). The k-means algorithm (for example, Forgy, 1965;
Hartigan and Wong, 1979; MacQueen 1967) is frequently used to discover
distinct clusters in a data set. If the data is from a homogeneous
distribution, the k-means algorithm will nonetheless converge to a set
of well-defined cluster means which are called self-consistent points
(Flury, 1993) of the empirical distribution and are estimators of the
principal points of the underlying distribution (Flury,
1990)." (Tarpey et al., 2008, p. 201).
"It is well-known that any given continuous distribution can be
approximated by a mixture model. We have demonstrated through the
population-based EM algorithm that mixture models with as few as two
or three mixture components can provide a very good approximation to
some well-known non-normal homogeneous distributions." (Tarpey et al.,
2008, pp. 215-216).
"Bimodality in large samples is often (but not always, see Tarpey
and Petkova, 2007) evidence of at least two distinct sub-populations.
Of course, this will only occur if the mixture component means are
well separated and/or the mixture component variances are relatively
small. Powerful statistical techniques are essential in cases when the
mixtures are not well-separated, but unfortunately, in these cases, we
will not always be able to distinguish a mixture from some other
homogeneous non-normal distribution. The problem is compounded because
the mixture model and the homogeneous non-normal probability model
present two very different models for reality." (Tarpey et al., 2008,
p. 216).
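
The k-means point in the second quote is easy to reproduce. A minimal
sketch in base R: the data below come from a single N(0,1) population
with no sub-groups at all, yet kmeans() with two centres returns two
well-defined cluster means. They should land close to the two principal
points of the standard Normal, which are -sqrt(2/pi) and +sqrt(2/pi).
The sample size, seed and nstart value are arbitrary choices of mine.

```r
set.seed(6)          # arbitrary seed
z <- rnorm(5000)     # one homogeneous population, no hidden groups

km <- kmeans(z, centers = 2, nstart = 10)

print(sort(km$centers[, 1]))      # estimated cluster means
print(c(-1, 1) * sqrt(2 / pi))    # principal points of N(0,1) for k = 2
print(table(km$cluster))          # roughly equal cluster sizes
```

So finding two tidy clusters is not, on its own, evidence of two
sub-populations.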

Cheers,

Chris
