I am trying to model a continuous outcome. If I make a histogram of
this outcome, it looks bimodal (but my question is about irregular
shapes in general). Even after including all known covariates, the
residuals are still shaped in this manner. It does not seem like one
of those things that can be handled by a transformation. I could be
missing a covariate that is unknown, or maybe not. I really have no
idea.
Does anyone have suggestions of topics I should look up to analyze this data?
Also, have I missed something, or do none of my graduate-level regression
books cover this topic? It does not seem like it should be that uncommon.
In my actual problem, there are groups in my data (correlated data),
so I would have used a mixed model if my residuals had looked ok.
In addition, does anyone have any software recommendations?
Thanks for your time.
Juliet
Apparent regression effects which are really due to unidentified
latent groupings in the data are far more common than is usually
acknowledged ("usually acknowledged" in fact = "hardly ever").
Consider the following example of artificial data made up of two
groups. There are two variates: Y = "outcome", X = "covariate",
both Normally distributed in each group. In each group, the
outcome Y is independent of the covariate X.
X1 100 values, Normal, mean=0, SD=2
Y1 100 values, Normal, mean=0, SD=1
X1 and Y1 generated independently of each other
X2 100 values, Normal, mean=4, SD=2
Y2 100 values, Normal, mean=2, SD=1
X2 and Y2 generated independently of each other
Now pool them:
X 200 values, (X1 and X2 pooled)
Y 200 values, (Y1 and Y2 pooled)
Now plot (X,Y): a clear apparent increase in Y as X increases.
Perform a linear regression of Y on X. The typical P-value for the
slope of the regression is of the order of 1e-10, i.e. 10^(-10).
BUT: In each group there is NO DEPENDENCY WHATEVER between Y & X.
For R users: the following code does the above (with an instance
of the regression output):
X1 <- rnorm(100,0,2); Y1 <- rnorm(100,0,1)
X2 <- rnorm(100,4,2); Y2 <- rnorm(100,2,1)
X <- c(X1,X2); Y <- c(Y1,Y2)
plot(X,Y)
summary(lm(Y~X))$coef
#              Estimate Std. Error  t value     Pr(>|t|)
# (Intercept) 0.6674836 0.10161794 6.568561 4.379292e-10
# X           0.2223979 0.03125616 7.115330 1.999573e-11
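As a hedged check on the claim that neither group has any dependency of Y on X, the sketch below refits the regression within each group, and again on the pooled data with the group label included. The seed (42), the label G and the object names p_pooled, p_g1, p_g2 and coef_adj are assumptions added for illustration; in real data the label would of course be unobserved.

```r
set.seed(42)  # assumed seed, only so the run is reproducible

# Same simulation as above: two groups, Y independent of X within each
X1 <- rnorm(100, 0, 2); Y1 <- rnorm(100, 0, 1)
X2 <- rnorm(100, 4, 2); Y2 <- rnorm(100, 2, 1)
X <- c(X1, X2); Y <- c(Y1, Y2)
G <- factor(rep(c("g1", "g2"), each = 100))  # hypothetical group label

# P-value of the slope: pooled data versus each group on its own
p_pooled <- summary(lm(Y ~ X))$coef["X", "Pr(>|t|)"]
p_g1     <- summary(lm(Y1 ~ X1))$coef["X1", "Pr(>|t|)"]
p_g2     <- summary(lm(Y2 ~ X2))$coef["X2", "Pr(>|t|)"]
print(c(pooled = p_pooled, group1 = p_g1, group2 = p_g2))

# Pooled regression once the group label is included as a covariate
coef_adj <- summary(lm(Y ~ X + G))$coef
print(coef_adj)
```

The pooled slope should be strongly "significant", while the within-group slopes and the slope adjusted for G should typically sit near zero; exact values depend on the seed.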
Now look at the histograms of the pooled values X and Y:
hist(X)
hist(Y)
There is no obvious bimodality (despite the fact that the true
distribution generating the data is bimodal), though there is some
indication that the distributions are not Normal. A detailed
examination of the higher moments (especially kurtosis) would
be more clearly diagnostic.
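A minimal sketch of such a moment check, in base R only. The helper excess_kurtosis() is a hypothetical name written here for illustration (it is not a built-in function), and the seed is an assumption. For an equal mixture of two Normals with a common SD the theoretical excess kurtosis is negative, so values clearly below 0 would be the thing to look for.

```r
set.seed(42)  # assumed seed, only so the run is reproducible

X1 <- rnorm(100, 0, 2); Y1 <- rnorm(100, 0, 1)
X2 <- rnorm(100, 4, 2); Y2 <- rnorm(100, 2, 1)
X <- c(X1, X2); Y <- c(Y1, Y2)

# Sample excess kurtosis: fourth central moment over squared variance,
# minus 3 (so that a Normal distribution scores about 0)
excess_kurtosis <- function(v) {
  z <- v - mean(v)
  mean(z^4) / mean(z^2)^2 - 3
}

k <- c(X     = excess_kurtosis(X),
       Y     = excess_kurtosis(Y),
       resid = excess_kurtosis(resid(lm(Y ~ X))))
print(round(k, 2))
```

With only 200 points the sampling error of a kurtosis estimate is sizeable, so this is a pointer rather than a test.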
Even the histogram of the residuals from the regression does
not look obviously non-Normal (though the kurtosis is clearly
suspect):
hist(lm(Y~X)$res)
However, you can get a slightly sharper picture from histograms
of the Principal Components of the data:
princomp(cbind(X,Y))$loadings
# Loadings:
#   Comp.1 Comp.2
# X -0.965  0.262
# Y -0.262 -0.965
hist(0.965*X + 0.262*Y)
# (which is beginning to look bimodal)
hist(0.262*X - 0.965*Y)
# (which definitely looks leptokurtic)
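The same projections can be taken from the scores that princomp() returns, which avoids copying loadings by hand. This is a sketch under an assumed seed; the group label G is hypothetical and is used only to show how far apart the groups sit on the first component. The sign of each component is arbitrary, so the histograms may come out mirrored.

```r
set.seed(42)  # assumed seed, only so the run is reproducible

X1 <- rnorm(100, 0, 2); Y1 <- rnorm(100, 0, 1)
X2 <- rnorm(100, 4, 2); Y2 <- rnorm(100, 2, 1)
X <- c(X1, X2); Y <- c(Y1, Y2)
G <- factor(rep(c("g1", "g2"), each = 100))  # hypothetical group label

pc <- princomp(cbind(X, Y))
scores <- pc$scores          # columns Comp.1 and Comp.2

if (interactive()) {
  hist(scores[, 1], main = "First principal component")
  hist(scores[, 2], main = "Second principal component")
}

# Group means along the first component (only possible here because
# the simulation tells us which group each point came from)
comp1_means <- tapply(scores[, 1], G, mean)
print(comp1_means)
```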
There is a certain (and indeed theoretically correct) sense in
which there is a genuine increasing regression relationship
between Y and X, if you stick to the primary definition of
"regression" as the expected value of Y conditional on the
value of X, considered as a function of X.
In this case, imagine a population consisting of an equal mix
of the two groups. You sample an individual from that population,
and get (X,Y) for that individual. That individual is equally
likely to be from Group 1 or Group 2. If from Group 1, there is
no relationship between Y and X; the same if from Group 2.
BUT: Given the value of X, an individual with that value of X
is more (or less) likely to be from Group 1 than from Group 2,
depending on the value of X: For low values of X, Group 1 is
more likely, conditional on X. For high values of X, Group 2
is more likely, conditional on X. The Group 1 Y-mean is lower
than the Group 2 Y-mean.
Hence, conditional on a low value of X, the Y-value is more
likely to be a Group 1 Y-value and hence have a lower expected
value; for a high value of X, the Y-value is more likely to be
a Group 2 Y-value and hence have a higher expected value. So the
expected value of Y, conditional on X, increases as X increases.
Exercise for the reader: The probability of sampling an individual
from Group 2, conditional on X, is a logistic function of X, so
if you score "0" for Group 1, and "1" for Group 2, the regression
of Score on X will be a logistic regression!
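A hedged sketch of that exercise: for two Normals with a common SD s and means m1, m2, the log-odds of Group 2 given X=x works out to ((m2 - m1)/s^2)*x - (m2^2 - m1^2)/(2*s^2), which for the example above is x - 2. The larger sample size (5000 per group) and the seed are assumptions made here so that the fitted coefficients settle close to those values.

```r
set.seed(42)  # assumed seed, only so the run is reproducible

n <- 5000                      # assumed; larger than the 100 used above
X1 <- rnorm(n, 0, 2)           # Group 1 covariate
X2 <- rnorm(n, 4, 2)           # Group 2 covariate
X <- c(X1, X2)
Score <- rep(c(0, 1), each = n)  # 0 = Group 1, 1 = Group 2

# Logistic regression of the group score on X
fit <- glm(Score ~ X, family = binomial)
b <- coef(fit)
print(b)

# Theoretical values for comparison: intercept -2, slope 1
theory <- c(-(4^2 - 0^2) / (2 * 2^2), (4 - 0) / 2^2)
print(theory)
```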
Since the regression of Y on X here is

  Expected value of Y given X=x
    = (Y-mean of Grp 1)*Prob(from Grp 1 given X=x) +
      (Y-mean of Grp 2)*Prob(from Grp 2 given X=x)
    = (Y-mean of Grp 1) +
      ((Y-mean of Grp 2) - (Y-mean of Grp 1))*Prob(from Grp 2 given X=x)

the regression of Y on X follows the logistic curve
Prob(from Grp 2 given X=x), shifted and scaled by the group Y-means.
So there is a genuine regression lurking here -- it just ain't linear!
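The curve can be written down and compared with the data. The function cond_mean_Y() below is a hypothetical helper, with the example's parameters as defaults; it uses the fact that, for two equally weighted Normals with a common SD, the log-odds of Group 2 given x is linear in x. The seed and the bin width are assumptions.

```r
# Theoretical E[Y | X = x] for an equal mix of the two groups
cond_mean_Y <- function(x, mX1 = 0, mX2 = 4, sX = 2, mY1 = 0, mY2 = 2) {
  # Prob(from Grp 2 given X = x): logistic in x
  p2 <- plogis((mX2 - mX1) / sX^2 * x - (mX2^2 - mX1^2) / (2 * sX^2))
  mY1 + (mY2 - mY1) * p2
}

set.seed(42)  # assumed seed, only so the run is reproducible
X1 <- rnorm(100, 0, 2); Y1 <- rnorm(100, 0, 1)
X2 <- rnorm(100, 4, 2); Y2 <- rnorm(100, 2, 1)
X <- c(X1, X2); Y <- c(Y1, Y2)

# Crude empirical check: mean of Y within bins of X against the curve
bins <- cut(X, breaks = seq(-6, 10, by = 2))
bin_mid    <- tapply(X, bins, mean)
bin_mean_Y <- tapply(Y, bins, mean)
print(round(cbind(observed = bin_mean_Y, theory = cond_mean_Y(bin_mid)), 2))

if (interactive()) {
  plot(X, Y)
  curve(cond_mean_Y(x), add = TRUE, lwd = 2)  # logistic-shaped regression
  abline(lm(Y ~ X), lty = 2)                  # straight-line fit
}
```

Empty bins show up as NA, and bins with few points will be noisy.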
Over the years, I have come across many datasets where the implicit
existence of two (or more) groups has provided an excellent -- and
realistic -- model for apparent dependency of Y on X. I may post
later on some good examples of this which people can study, if
interested.
I agree that this sort of question is not discussed in standard
texts, which tend to concentrate on development and explanation
(often, of course, very well done) of standard types of analysis
presented in a kind of ritualistic way ("When the value of Y
increases as X increases, we perform the ceremony of Linear
Regression. If your Study Design is Worthy, you will be blessed
with a Significant P-value, which is the Key of Entry into the
Kingdom of Publication Heaven." No, that is not a quotation
from anything ... ).
A better domain for exploring this kind of thing is the literature
on Exploratory Data Analysis. One type of analysis which can be
useful in helping to identify "hidden sub-groups" is based on
mixture models, for which routines are available in many statistical
software packages. There is an enormous variety of approaches to
mixture models.
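To give a flavour of what such routines do, here is a minimal base-R sketch of the EM algorithm for a two-component univariate Normal mixture. It is an illustration under stated assumptions, not a substitute for a packaged routine: em_two_normals() is a hypothetical helper, the simulated outcome (means 0 and 4, SD 1) is made up and much better separated than the example above, and there are no safeguards against degenerate solutions.

```r
set.seed(1)  # assumed seed
y <- c(rnorm(150, 0, 1), rnorm(150, 4, 1))  # made-up bimodal outcome

em_two_normals <- function(y, iters = 200, tol = 1e-8) {
  mu <- unname(quantile(y, c(0.25, 0.75)))  # crude starting means
  sigma <- rep(sd(y) / 2, 2)                # crude starting SDs
  p <- 0.5                                  # mixing weight of component 2
  loglik_old <- -Inf
  for (i in seq_len(iters)) {
    # E-step: posterior probability that each point is from component 2
    d1 <- (1 - p) * dnorm(y, mu[1], sigma[1])
    d2 <- p * dnorm(y, mu[2], sigma[2])
    w <- d2 / (d1 + d2)
    # Log-likelihood at the parameters used in this E-step
    loglik <- sum(log(d1 + d2))
    # M-step: weighted updates of weight, means and SDs
    p <- mean(w)
    mu <- c(sum((1 - w) * y) / sum(1 - w), sum(w * y) / sum(w))
    sigma <- c(sqrt(sum((1 - w) * (y - mu[1])^2) / sum(1 - w)),
               sqrt(sum(w * (y - mu[2])^2) / sum(w)))
    if (abs(loglik - loglik_old) < tol) break
    loglik_old <- loglik
  }
  list(p = p, mu = mu, sigma = sigma, posterior = w, loglik = loglik)
}

fit <- em_two_normals(y)
print(fit[c("p", "mu", "sigma")])
```

The posterior element gives each observation's estimated probability of belonging to the second component, which is what one would inspect when hunting for hidden sub-groups.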
Coming back to some specific points in your query.
The fact that you have observed clear bimodality in your data
very strongly suggests that you should be looking along the above
lines. The above example was deliberately chosen so that the
effect (in apparent dependency of Y on X) would be very marked
indeed, yet the basic diagnostics (histograms of X, Y etc.)
would not give a particularly definitive impression.
You also say that "there are groups in my data". If these
are explicit, have you allowed for them before looking at the
bimodalities?
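One hedged way of allowing for explicit groups before judging the shape of the outcome is to remove each group's mean and look at what is left. The sketch below uses base R's ave(); the group labels, group effects and sample sizes are all made up for illustration.

```r
set.seed(42)  # assumed seed

# Hypothetical clustered data: 20 explicit groups of 5 observations
group <- factor(rep(1:20, each = 5))
group_effect <- rnorm(20, 0, 2)                    # made-up group effects
y <- group_effect[as.integer(group)] + rnorm(100)  # outcome

# Subtract each group's own mean from its observations
y_centered <- y - ave(y, group)

if (interactive()) {
  hist(y, main = "Raw outcome")
  hist(y_centered, main = "Outcome after removing group means")
}

print(c(var_raw = var(y), var_centered = var(y_centered)))
```

If an irregular shape in the raw outcome disappears after centering, the explicit groups account for it; if it persists within groups, something else is going on.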
Hoping this may help a bit!
Ted.
--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.H...@manchester.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 22-Sep-09 Time: 10:46:58
------------------------------ XFMail ------------------------------
Let's say an individual's sex had the biggest influence on the outcome. But
let's say it was unmeasured. This kind of situation could create
bimodality, right? Even if the data given sex were approximately
normal, I still cannot proceed as usual no matter what.
Would quantile regression help out in this situation? It is not just
that distributional assumptions are not met. A huge effect is unmeasured,
and the remaining variation may only be explained by accounting for this
unmeasured effect.
Based on my initial search, I was about to look into "mixture models",
but before I invested the time, I wanted to check if this is a reasonable
way to proceed. It seems this is what Ted is suggesting I do.
Is this a book I should look into?
Peter Schlattmann,
Medical Applications of Finite Mixture Models
As far as the grouping I mentioned earlier, it is the same as
a treatment given to animals in a litter. There is a litter effect that
could be treated as a random intercept (in a mixed model setting).
In sum, I have this bimodal outcome on clustered data.
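For reference, a random intercept for litter could be sketched as below with lme() from the nlme package (one of R's recommended packages). Everything in the simulated data frame d, including the variable names y, x and litter and the effect sizes, is made up for illustration, and this sketch deals only with the clustering, not with the bimodality.

```r
library(nlme)

set.seed(42)  # assumed seed
n_litters <- 20; per_litter <- 5   # made-up design
litter <- factor(rep(seq_len(n_litters), each = per_litter))
x <- rnorm(n_litters * per_litter)          # hypothetical covariate
u <- rnorm(n_litters, 0, 1)                 # litter-level random intercepts
y <- 1 + 0.5 * x + u[as.integer(litter)] + rnorm(length(x), 0, 0.5)
d <- data.frame(y = y, x = x, litter = litter)

# Linear mixed model: fixed effect of x, random intercept per litter
fit <- lme(y ~ x, random = ~ 1 | litter, data = d)
print(fixef(fit))     # fixed-effect estimates
print(VarCorr(fit))   # between-litter and residual variance components

if (interactive()) hist(resid(fit), main = "Within-litter residuals")
```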
As I mentioned, I am surprised that in all my courses this problem was not
discussed. I get out in the field and within a few months, I'm stumped! :)
Thanks!
>
>Here is another way I was thinking about it, perhaps incorrectly.
>
>Let's say an individual's sex had the biggest influence on the outcome. But
>let's say it was unmeasured. This kind of situation could create
>bimodality, right? Even if the data given sex were approximately
>normal, I still cannot proceed as usual no matter what.
>
>Would quantile regression help out in this situation? It is not just
>that distributional assumptions are not met. A huge effect is unmeasured,
>and the remaining variation may only be explained by accounting for this
>unmeasured effect.
>
>Based on my initial search, I was about to look into "mixture models",
>but before I invested the time, I wanted to check if this is a reasonable
>way to proceed. It seems this is what Ted is suggesting I do.
>
If there really is a mixture, then that can certainly be a valuable approach.
I am not sure what would happen if you tried to apply a mixture model
where there was a single population. Perhaps nothing bad.
In my experience, mixture models are tricky to apply and to interpret, but that may be me.
But it's hard to know if there really is a mixture, since you haven't given us any context.
What's your DV and what are your IVs?
What is it you are trying to do?
How strong is the bimodality?
>As I mentioned, I am surprised that in all my courses this problem was not
>discussed. I get out in the field and within a few months, I'm stumped! :)
It took that long????? :-)
Seriously, real world problems often don't match up with textbook solutions.
Peter
Peter L. Flom, PhD
Statistical Consultant
Website: www DOT peterflomconsulting DOT com
Writing: http://www.associatedcontent.com/user/582880/peter_flom.html
Twitter: @peterflom
One more thing.
The setup is that a drug is given to individuals and a continuous response
is measured. I am testing whether a biomarker is related to the response.
Several individuals come from the same families, so I must take the
correlation between these individuals into account.
I realized I may have sounded as if I am set on the idea that there
is a missing covariate. I'm not, but I'm trying to
learn what to do if that is the case.
I'm going to read up on quantile regression and mixture models.
Thanks!
Juliet