Logistic Regression and Sample Size

White, Robin HSURC

Dec 18, 1997, 3:00:00 AM

Hi Everyone,

I'm new to logistic regression and would welcome an applied perspective
on the following:

We are predicting Y after controlling for (entering) 39 control
variables. Our X variable of interest (the 40th variable to enter the
model) is dichotomous. The 0 value has an n of 648; and the 1 value has
an n of 10.

Can we still get meaningful results given the total number of variables
and the sample sizes (n) for X?

Thanks in advance!

Sincerely,

Robin White
whi...@sdh.sk.ca

Terry Taerum

Dec 19, 1997, 3:00:00 AM

White, Robin HSURC wrote in message
<4C681945EB2DD11189A...@master.sdh.sk.ca>...

>Hi Everyone,
>
>I'm new to logistic regression and would welcome an applied perspective
>on the following:
>
>We are predicting Y after controlling for (entering) 39 control
>variables. Our X variable of interest (the 40th variable to enter the
>model) is dichotomous. The 0 value has an n of 648; and the 1 value has
>an n of 10.
>
>Can we still get meaningful results given the total number of variables
>and the sample sizes (n) for X?

Given the power of desktop computers these days, you can run a set of
simulations to answer your own question. At the bottom is some SPSS
code that you could run a number of times to address what I think is
your question.

The first thing you will notice is that you usually get 1 or 2 variables
that are significant at the .05 level, even though the data are
completely random. Another thing you will notice is the similarity in
results between linear regression and logistic regression. In the
situation you are dealing with, you likely should use logistic
regression (I can't think of a situation when you would use linear
regression with a dichotomous dependent variable), but the point is
that some of what you have already learned about linear regression can
be applied to logistic regression.
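
As a rough check on the "1 or 2 significant variables" observation, the
arithmetic can be sketched in Python (not SPSS). This is only an
illustration and assumes the 39 tests are independent and that each has
a true false-positive rate of .05; the names below are made up for the
example.

```python
from math import comb

N_PREDICTORS = 39   # pure-noise predictors entered in the model
ALPHA = 0.05        # nominal significance level of each test


def prob_exactly(k, n=N_PREDICTORS, alpha=ALPHA):
    """Binomial probability that exactly k of n null tests are 'significant'.

    Assumes the n tests are independent, which is only an approximation
    for coefficients estimated in the same model.
    """
    return comb(n, k) * alpha**k * (1 - alpha) ** (n - k)


# Expected number of chance findings: n * alpha
expected_false_positives = N_PREDICTORS * ALPHA

# Probability that at least one predictor looks significant by chance
prob_at_least_one = 1 - prob_exactly(0)

print(f"expected chance findings: {expected_false_positives:.2f}")
print(f"P(at least one):          {prob_at_least_one:.2f}")
```

Under those assumptions the expected count is 39 x .05 = 1.95, and the
chance of seeing at least one "significant" noise variable is roughly
0.86.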

Along that line of reasoning then, you need to come to a better
understanding of the 39 variables you are entering into the logistic
regression. From the strictly statistical point of view, you might want
to at least consider factor analysis of the predictor variables.

You also need to come to a better understanding of the dependent
variable. You need to ask yourself why there are only 10 1's on the
dependent side of the equation.

You need to come to a better understanding of what the relationship
between the 39 predictors and the 1 dichotomous variable might look
like if it were meaningful. Would, for instance, only extreme values
of the predictor variables be associated with the 10 1's on the
dependent side of the equation?

Finally, you need to ask yourself whether there is a better way to
test whatever hypothesis it is you are examining.

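* Simulate pure-noise data: 658 cases, v40 = 1 for 10 and 0 for 648 .
* v1 to v39 are random integers 0 to 4, unrelated to v40 .
new file .
input program .
numeric v1 to v40 .
vector vx=v1 to v39 .
loop #i=1 to 658 .
if (#i le 10) v40=1 .
if (#i gt 10) v40=0 .
loop #j=1 to 39 .
compute vx(#j)=trunc(5*rv.uniform(0,1)) .
end loop .
end case .
end loop .
end file .
end input program .
execute .
* Linear regression of v40 on the 39 noise predictors, for comparison .
regression vars=v1 to v40
/dependent=v40 /enter v1 to v39 .
* Logistic regression of v40 on the same predictors .
LOGISTIC REGRESSION VAR=v40
/METHOD=ENTER v1 to v39 .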

Terry....@ualberta.ca

Richard F Ulrich

Dec 19, 1997, 3:00:00 AM

White, Robin HSURC (whi...@SDH.SK.CA) wrote:

: I'm new to logistic regression and would welcome an applied perspective
: on the following:

: We are predicting Y after controlling for (entering) 39 control
: variables. Our X variable of interest (the 40th variable to enter the
: model) is dichotomous. The 0 value has an n of 648; and the 1 value has
: an n of 10.

: Can we still get meaningful results given the total number of variables
: and the sample sizes (n) for X?

- That depends on what you have in mind as "meaningful results".

One lesson that they teach in classes in experimental design is
that you maximize your power by having equal Ns in the groups. If
Ns are unequal, the "equivalent sample size" can be estimated by
getting the "reciprocal average" (the harmonic mean) of the Ns - for
10 vs 648, you have about the same design-power as if you had 20 vs 20.
(It would be no more than 20 vs 20 if it were 10 vs 6000, since the
small N will dominate.)
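
The "reciprocal average" arithmetic can be sketched in Python as
follows. This is just an illustration of the harmonic mean of two group
sizes; the function name is made up for the example.

```python
def equivalent_group_size(n1, n2):
    """Harmonic mean of two group sizes.

    Interpreted here as the per-group N of a balanced design with
    roughly the same power as the unbalanced n1 vs n2 design.
    """
    return 2 / (1 / n1 + 1 / n2)


for n_small, n_large in [(10, 648), (10, 6000)]:
    n_eq = equivalent_group_size(n_small, n_large)
    print(f"{n_small} vs {n_large}: about {n_eq:.1f} per group")
```

For 10 vs 648 this gives about 19.7 per group, and for 10 vs 6000 about
20.0. The value can never reach twice the smaller N, however large the
other group gets.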

With two samples of 20, you cannot get any meaningful results at all
unless you are expecting a rather large effect size. So, 648 vs 10
is really a *small* sample, for some important purposes, since the
lesser number really matters.

With 10 cases to pick out of the total, you have a rather large
capacity for capitalizing on chance with 40 variables - the amount
depends, to some extent, on whether other variables are split in the
same proportions. Will you get a screwed up solution? - maximum
likelihood logistic regression is worse that way than least squares
regression - that is, ML-logistic will blow up earlier (lose all
hope of any asymptotic correctness in the tests, or utility in the
coefficients), and give less evidence to the user of what is going
on.

If you are controlling for 39 variables because they *do* contribute
to the distinction between groups, and you allow their random variances
to play a role, too, then I strongly suspect that you will use up
the available, predictable part of the Outcome variance; and thus
your power for detecting anything more will be nil.

One possibility of how to proceed - You may combine the 39 into one
predictor, and use just the one covariate. For references to a general
approach to that, follow my homepage to my stats FAQ and see the
contribution that someone posted some months ago on Propensity scoring.

Rich Ulrich, biostatistician wpi...@pitt.edu
http://www.pitt.edu/~wpilib/index.html Univ. of Pittsburgh

Michael Lacy

Dec 19, 1997, 3:00:00 AM

In article <4C681945EB2DD11189A...@master.sdh.sk.ca> "White, Robin HSURC" <whi...@SDH.SK.CA> writes:
>Hi Everyone,
>
>I'm new to logistic regression and would welcome an applied perspective
>on the following:
>
>We are predicting Y after controlling for (entering) 39 control
>variables. Our X variable of interest (the 40th variable to enter the
>model) is dichotomous. The 0 value has an n of 648; and the 1 value has
>an n of 10.
>
>Can we still get meaningful results given the total number of variables
>and the sample sizes (n) for X?

I would flatly say no. In considering sample size in logistic
regression, the relevant N is the minimum of the frequency of
"0" vs. "1", which in your case is only 10. One rule of thumb I've
heard (from an article by Harrell et al., I think) is that the
number of events per independent variable be > 10. In your
situation, the number of events, which is defined as min(f(0), f(1)),
is 10, so this would suggest that using even 1 independent
variable is pushing the limits.
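
The events-per-variable arithmetic behind that rule of thumb can be
sketched in Python. This is only an illustration; the function name and
the threshold constant are assumptions made for the example, with the
threshold taken from the "> 10" rule as recalled above.

```python
EPV_THRESHOLD = 10  # rule of thumb as recalled above: want more than 10


def events_per_variable(n_zeros, n_ones, n_predictors):
    """Events per variable, with 'events' defined as min(f(0), f(1))."""
    events = min(n_zeros, n_ones)
    return events / n_predictors


for k in (40, 1):
    epv = events_per_variable(648, 10, k)
    verdict = "meets" if epv > EPV_THRESHOLD else "falls short of"
    print(f"{k} predictor(s): EPV = {epv:.2f}, {verdict} the rule of thumb")
```

With 40 predictors the ratio is 10/40 = 0.25. With a single predictor
it is 10/1 = 10, which still does not exceed 10.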

The problem is that, with so many independent variables, you
will get a good fit simply by chance, and, to my knowledge,
this sort of "chance" is not in any way tapped by log-likelihood
ratio tests and the like. Your situation is analogous to doing
ordinary multiple linear regression with 40 independent variables
and an N of 10.

Sorry to be the bearer of bad news. Perhaps someone else would
have a more optimistic take on your situation.

Regards,
--
=-=-=-=-=-=-=-=-=-==-=-=-=
Mike Lacy, Sociology Dept., Colo. State Univ. FT COLLINS CO 80523
voice (970) 491-6721 fax (970) 491-2191