Correlation between a categorical variable and a continuous variable

RAMS

Aug 4, 2005, 6:50:34 AM
to MedStats
Hi all,

I wish to do a logistic regression for the smear conversion of
TB, which has a dichotomous outcome, and I have four independent
variables: group regimen, which has 3 random treatments; sex;
weight, a continuous variable; and pre-sensitivity status, which
is also dichotomous with values yes or no.

Before doing this I want to check the multicollinearity between
the independent variables. For a dichotomous and a continuous
variable I did a point-biserial correlation, and to compare the
two dichotomous variables I did kappa. I found that there is no
multicollinearity between the independents by the above-mentioned
tests.


For a categorical variable (3 categories) and a continuous
variable, may I calculate a Wilcoxon signed-rank correlation or a
canonical correlation?

Please help me

Thanks in advance.

Ted Harding

Aug 4, 2005, 7:44:45 AM
to MedS...@googlegroups.com
The essential issue underlying "multicollinearity" in regression
is whether you have redundant (or nearly redundant) independent
variables, i.e. one (or more) are linearly predictable from the
others.

To investigate that issue, I would analyse the "design matrix"
of the regression. This consists of columns of 0s and 1s for
the categorical variables (your 3-level factor "regimen" would
need 2 columns, your binary variables only 1 each), and a column
for each continuous variable (in your case only 1 column); and
as many rows as there are cases.
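
For instance, in R (using made-up data in place of the real TB
data -- the variable names regimen, sex, weight, presens and,
later, smear are just placeholders for those described in the
question), the design matrix can be built with model.matrix():

  set.seed(1)
  tb <- data.frame(
    regimen = factor(sample(c("A","B","C"), 100, replace = TRUE)),
    sex     = factor(sample(c("M","F"),     100, replace = TRUE)),
    weight  = rnorm(100, mean = 55, sd = 8),
    presens = factor(sample(c("yes","no"),  100, replace = TRUE))
  )
  ## Two 0/1 columns for the 3-level regimen, one column each for
  ## the binary variables, and one column for weight (plus an
  ## intercept column).
  X <- model.matrix(~ regimen + sex + weight + presens, data = tb)
  head(X)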

You can then look at all columns together, or at a subset of
columns to investigate collinearity between a subset of your
variables.

The mathematical criterion for collinearity is that the rank
of the matrix (using the set of columns you are interested in)
is less than the number of columns of the matrix. This would
be reflected in the occurrence of "zero" values in the "singular
value" component of a Singular Value Decomposition of the matrix.

However, in practice exact collinearity may not occur, while
the variables may be sufficiently nearly "collinear" to cause
trouble in the regression. This can be assessed by looking
at the relative magnitudes of the singular values: If one or
more is small in magnitude compared with the others, then
you are probably facing that situation.

I would know how to set this up in a statistical package like R
(See http://www.r-project.org ) or equivalently in S-Plus, and
also in matrix-oriented numerical software like Matlab or Octave
(See http://www.octave.org ), but am not familiar enough with
other statistical software to know how to do it (in, say, Stata
or SAS). However, if any package is halfway decent, it should
be possible.
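
As a rough sketch of that check in R, continuing with the design
matrix X built above (dropping the intercept column, and scaling
the columns so that the units of weight do not dominate the
comparison):

  sv <- svd(scale(X[, -1]))$d  # singular values, largest first
  sv                           # exact zeros mean exact collinearity
  sv[1] / sv[length(sv)]       # a very large ratio suggests
                               #   near-collinearity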

I see that you have apparently carried out statistical significance
tests to check for collinearity. I would suggest that this is
not the right approach at all. The existence (or near-existence)
of collinearity is simply a numerical fact, and its importance
in regression lies in the fact that, when it is present, it makes
some of the estimated coefficients in the regression aliases of
others -- i.e. the variation in the dependent variable
which can be explained by some of the variables can be equally
explained by others, leaving out the first, and the data will
offer you no information to choose between these equally well
supported possibilities.

By the same token, the effect of collinearity will be manifested
in the fact that the matrix of variances and covariances for
the estimated coefficients (and all decent regression software
will provide this output) is singular or nearly singular, which
can be examined in the same way as above (or, since it is now
a square matrix, by finding the eigenvalues of this matrix and
applying the same criterion to the eigenvalues). Many statistical
packages will in fact warn that there is singularity.
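
In R, for example, this can be examined directly from the fitted
model; the 0/1 outcome "smear" below is simulated purely so that
the sketch runs on the made-up data from earlier:

  tb$smear <- rbinom(100, 1, 0.5)   # placeholder outcome
  fit <- glm(smear ~ regimen + sex + weight + presens,
             family = binomial, data = tb)
  ## Eigenvalues of the estimated coefficient variance-covariance
  ## matrix: values near zero relative to the largest flag
  ## (near-)singularity.
  ev <- eigen(vcov(fit))$values
  ev
  max(ev) / min(ev)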

But, as for a "statistical test for collinearity", what can this
mean? Does it mean that you have set up a Null Hypothesis of
collinearity and rejected it at, say, a 5% P-value? And what happens
if the data give a non-significant result? Do you then infer that
you can't "reject collinearity"? This is not relevant to the
underlying issue!

Since you are doing a logistic regression, a related phenomenon
that you need to watch out for is what is called "perfect separation",
in which the set of covariate values associated with the cases
where Response=1, and the set of covariate values associated with
cases where Response=0, can be linearly separated in covariate
space, i.e. you can find coefficients a1, a2, ... such that the
value of

a1*X1 + a2*X2 + ...

for every "0" case is less than its value for every "1" case.
In statistical multivariate analysis terms, this amounts to
saying that a linear discriminant analysis of the independent
variables, grouped according to the 0/1 value of the dependent
variable, gives complete discrimination (i.e. there is a linear
discriminant which perfectly predicts which group a case belongs
to). The importance of this in logistic regression is that such
a linear combination allows perfect prediction of outcome
(according to the data): the maximum likelihood fit will have
a scale parameter with value 0, so that negative values of the
linear function will predict P=0 for those cases with Response=0,
and positive values will predict P=1 for those cases with
Response=1. Since real life doesn't behave like that, if it
happens you know that your results are unrealistic! Again,
decent software will warn when this is happening.
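
A toy illustration in R: the single covariate x below separates
the 0 cases from the 1 cases completely, and glm() warns that
"fitted probabilities numerically 0 or 1 occurred"; the estimated
coefficient for x and its standard error come out absurdly large:

  y <- c(0, 0, 0, 0, 1, 1, 1, 1)
  x <- 1:8
  sep <- glm(y ~ x, family = binomial)  # warns about separation
  summary(sep)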

Hoping this helps,
Ted.


--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.H...@nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 04-Aug-05 Time: 12:44:37
------------------------------ XFMail ------------------------------

Martin P. Holt

Aug 4, 2005, 10:17:13 AM
to MedS...@googlegroups.com
Hello RAMS,

Once you have the design matrix X as described by Ted, one way of detecting
collinearity without fitting the regression model is to compute the matrix's
"Condition Number". This is defined as the square root of the ratio of the
largest to smallest eigenvalue of X_transposed_X (the Information Matrix).

It is a measure of the ratio of the variance of the linear combination of
the columns of X with the greatest variance to that of the combination with
the smallest variance. Collinearity makes the smallest of these eigenvalues
close to zero, so large values of this ratio, say exceeding 30, indicate
serious problems with collinearity.
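
For example, in R, one version of this (using the design matrix X
from Ted's sketch, with the intercept column dropped and the
columns scaled so that the units of weight do not dominate):

  Xs <- scale(X[, -1])
  ev <- eigen(crossprod(Xs))$values  # eigenvalues of X'X
  sqrt(max(ev) / min(ev))            # condition number: values much
                                     #   above 30 suggest trouble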

SPSS gives the following in its help menu:
"Collinearity diagnostics. Collinearity (or multicollinearity) is the
undesirable situation when one independent variable is a linear function of
other independent variables. Eigenvalues of the scaled and uncentered
cross-products matrix, condition indices, and variance-decomposition
proportions are displayed along with variance inflation factors (VIF) and
tolerances for individual variables."
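
The VIFs and tolerances SPSS mentions can also be computed by
hand; for example, in R, the diagonal of the inverse of the
correlation matrix of the predictor columns gives the
(column-wise) VIFs, and tolerance is simply 1/VIF:

  vifs <- diag(solve(cor(X[, -1])))  # variance inflation factors
  vifs
  1 / vifs                           # tolerances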

So I would hope that your software is that helpful too.

Best Wishes,
Martin