On 8/15/2012 3:19 PM, Vlad wrote:
> Does regression methods prove causal relationship between the
> independent variable and the dependent one ?
To expand on the already good comments, establishing causation typically
requires making untestable assumptions about your data. Are these
untestable assumptions reasonable? They are very reasonable in the case
of randomization. Other approaches, such as instrumental variables and
propensity score matching, are good, but even so, the assumptions they
require are at times difficult to believe.
It sounds just awful to make untestable assumptions, but we do it all
the time. We assume, for example, that the laws of physics apply equally
well at all points in time. But can we prove that the effect of gravity
is the same today as it was 4.5 billion years ago when our solar system
was just being formed?
We statisticians blithely assume independence, and even if you run a
runs test or some other test of autocorrelation, that does not entirely
rule out a lack of independence. You often have to take the independence
assumption on faith. It is, for the most part, an untestable assumption.
Most of the time, as long as you are not dealing with an infectious
disease, you did not recruit patients at that festival of twins held
every year in Twinsburg, Ohio, one person did not copy off another
during the exam, and so on, it is a very reasonable assumption.
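If you want to see what a runs test actually does, here is a quick
sketch in Python (a toy implementation of the Wald-Wolfowitz runs test
that I cooked up for illustration, run on simulated data, not anything
from a real study):

```python
import math
import random

def runs_test(x):
    """Wald-Wolfowitz runs test: dichotomize at the median, count runs,
    and return an approximate z statistic. Too few runs (z << 0) or too
    many runs (z >> 0) is evidence against independence."""
    med = sorted(x)[len(x) // 2]
    signs = [v > med for v in x if v != med]  # drop ties at the median
    n1 = sum(signs)            # observations above the median
    n2 = len(signs) - n1       # observations below the median
    n = n1 + n2
    runs = 1 + sum(signs[i] != signs[i - 1] for i in range(1, n))
    mu = 2.0 * n1 * n2 / n + 1
    var = 2.0 * n1 * n2 * (2 * n1 * n2 - n) / (n * n * (n - 1))
    return (runs - mu) / math.sqrt(var)

random.seed(1)
iid = [random.gauss(0, 1) for _ in range(200)]             # independent draws
trend = [i / 25 + random.gauss(0, 1) for i in range(200)]  # drifting series

z_iid = runs_test(iid)      # modest z: no evidence against independence
z_trend = runs_test(trend)  # large negative z: far too few runs
print(round(z_iid, 2), round(z_trend, 2))
```

Notice that an unremarkable result, as you get for the independent
series, is only an absence of evidence against independence, not proof
of it.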
There's nothing wrong with untestable assumptions as long as you are
honest with yourself about them.
I like the Hill criteria that others have mentioned, but there are
plenty of counterexamples out there. For example, X and Y could have a
dose-response relationship that is entirely an artefact. Birth order
has a dose-response relationship with Down's syndrome: first-born
children have less risk than second-born, who have less risk than
third-born, and so on. But the real cause is mother's age, and it just
happens that birth order and mother's age are closely related. It's
pretty hard to be the seventh child of a twenty-year-old mother and a
lot easier to be the seventh child of a forty-year-old mother.
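You can watch this artefact appear in a toy simulation (every number
below is invented purely for illustration; this is not real Down's
syndrome data):

```python
import random

random.seed(42)

# Toy model: the risk of the outcome depends ONLY on mother's age,
# never on birth order. All rates and ranges are made up.
births = []
for _ in range(500_000):
    age = random.uniform(18, 45)
    # Older mothers have, on average, higher-parity births.
    order = 1 + int(random.random() * (age - 17) / 6)
    affected = random.random() < 0.0005 * (age - 15)  # age-only risk
    births.append((order, affected))

# Yet the observed risk rises steadily with birth order, purely
# because birth order tracks mother's age.
rates = {}
for k in range(1, 5):
    group = [aff for order, aff in births if order == k]
    rates[k] = sum(group) / len(group)
    print("birth order", k, "-", round(rates[k] * 1000, 2), "per 1000")
```

The simulated risk never looks at birth order, yet a table of risk by
birth order shows a clean dose-response pattern.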
The Hill criteria have a cumulative impact in that the more of them are
satisfied, the harder it is to envision how the statistical
relationship could have been produced artefactually. But the Hill
criteria are never sufficient. They get you closer to establishing
causality, but there will always be some residual doubt.
All in all, if you want to establish causation, especially in a
non-randomized study, you have to make qualitative arguments that are
independent of the data themselves. There is no formal statistical test
that can establish causation.
If you want to understand this better, think of it as a missing data
problem. The people in the treatment group are missing data on what
their outcome would have been if they had been given the placebo.
Likewise the people in the placebo group are missing data on what their
outcome would have been if they had been given the treatment. In a
randomized study, you have what is equivalent to the missing completely
at random (MCAR) case. It's pretty easy to impute values in the MCAR
case. Missing data in an observational study is, at best, missing at
random (MAR). You can impute values in the MAR case, but it takes a lot
more work, and you are never quite sure if you have MAR or the dreaded
missing not at random (MNAR) case. Distinguishing between MAR and MNAR
requires untestable assumptions about your data that are not much
different from the untestable assumptions you have to make about
causation.
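The missing data framing is easy to demonstrate with simulated
potential outcomes (again, a toy sketch with made-up numbers, not a
real study):

```python
import random

random.seed(7)
N = 100_000

# Each subject has two potential outcomes: y1 under treatment and
# y0 under placebo. The true treatment effect is exactly 2.
frailty = [random.gauss(0, 1) for _ in range(N)]  # unmeasured trait
y0 = [f + random.gauss(0, 1) for f in frailty]
y1 = [v + 2 for v in y0]
true_effect = 2.0

def observed_difference(assign):
    """Naive treated-minus-control difference in observed means.
    Each subject reveals only one of their two potential outcomes."""
    treated = [y1[i] for i in range(N) if assign[i]]
    control = [y0[i] for i in range(N) if not assign[i]]
    return sum(treated) / len(treated) - sum(control) / len(control)

# Randomized assignment: the unseen potential outcome is missing
# completely at random, so the naive difference recovers the truth.
rand_assign = [random.random() < 0.5 for _ in range(N)]
est_rct = observed_difference(rand_assign)

# Observational assignment: frailer subjects seek treatment more
# often, so which outcome is missing depends on the outcome itself,
# and the naive difference is badly biased.
obs_assign = [random.random() < (0.8 if f > 0 else 0.2) for f in frailty]
est_obs = observed_difference(obs_assign)

print(round(est_rct, 2), round(est_obs, 2))
```

Under randomization the naive difference in means recovers the true
effect; when frail patients preferentially seek treatment, the same
calculation is biased, and nothing in the observed data alone tells
you so.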
If you can stand a philosophical approach to causation, read up on
counterfactual statements. To imply causation is to make a
counterfactual statement. I've never been comfortable with a
philosophical discourse on causation, but that's more my limitation than
a statement of the validity of the philosophical approach.
Steve Simon,
n...@pmean.com, Standard Disclaimer.
Sign up for the Monthly Mean, the newsletter that
dares to call itself average at
www.pmean.com/news