logistic model with quite a small sample size


Daniel Smith

Oct 16, 2015, 11:51:24 PM
to regmod
Dear Frank,

Apologies for all these questions!

Lastly, for my modelling problem, I have data from 52 patients with oral cancer and want to build a logistic model of survival (yes/no) from various predictors (smoking, age, alcohol consumption, etc.). If I subset the data to only those patients in stage 4, this reduces it to 38 patients. I guess I can already predict that you are going to say the data set is too small!

I have read that you recommend 15 outcomes per explanatory variable (EV). By the way, does that apply if the variable is continuous? What if it is categorical with a number of factor levels, hence df > 1, or does it not matter since the variables will be dummy coded? In that case perhaps the model can only really have 2 or 3 EVs? Will the penalised regression method in rms account for this and not select any more variables than it should, or do we have to somehow specify or constrain this?

I have also read that you make the case that for a logistic model 96 outcomes are needed just to estimate the intercept to 90% confidence, and that this should somehow be taken into the rule of thumb as well. So in an ideal world, the number of outcomes should be 96 + 15 for each EV? Does the 15 refer to the total outcome count (i.e., could it be 10 dead and 5 alive, or 2 dead and 13 alive)? Does the 15:1 rule still apply if the binary outcome is not roughly 50:50?

So given my small sample size, is it even possible to produce a logistic model? Or can it be done, but with less confidence in its predictive ability and more caution?

Many thanks indeed!

Frank Harrell

Oct 17, 2015, 8:49:40 AM
to regmod
We need a more comprehensive rule of thumb that is not really linear in the number of parameters, but what we know at present is that it is safe to use a rule like the following. Suppose you have binary Y and you observe e events and m non-events. Then, for a model to perform as well in the future as we think it does now when penalization (shrinkage) is not used, we need

min(e, m) ≥ 96 + 15p

where p is the number of parameters in the model. 96 is the number of observations needed just to estimate the intercept with 0.95 confidence and a margin of error of +/- 0.1 on the absolute risk scale. Note that you do not get much benefit from adding more subjects to the more frequent outcome category, although they never hurt and sometimes allow you to estimate absolute risk and not just relative odds.
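
In case it helps, here is that arithmetic in base R (just the usual normal-approximation sample size for a binomial proportion, taking the worst case p = 0.5):

  z   <- qnorm(0.975)    # 1.96
  moe <- 0.1             # margin of error on the absolute risk scale
  z^2 * 0.25 / moe^2     # 96.04, i.e., about 96 observations

  p <- 3                 # parameters in the model
  96 + 15 * p            # need min(e, m) of at least 141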

p includes one parameter for each category of each categorical predictor other than the reference category, plus all nonlinear terms. Note that p is pre-specified in ordinary estimation, and equals the number of candidate predictors if doing variable selection. If using shrinkage, p is the number of effective degrees of freedom.
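
To make the counting concrete, here is a sketch in rms; the variable names and the data frame d are hypothetical:

  require(rms)
  # rcs(age, 4): restricted cubic spline, 4 knots -> 3 parameters
  # smoking with 3 categories                     -> 2 parameters
  # alcohol as a linear continuous term           -> 1 parameter
  # so p = 6, and the rule asks for min(e, m) >= 96 + 15*6 = 186
  f <- lrm(death ~ rcs(age, 4) + smoking + alcohol, data = d,
           x = TRUE, y = TRUE)   # design matrix stored for pentrace

  # with shrinkage, count the effective degrees of freedom instead
  pen <- pentrace(f, seq(0, 20, by = 0.5))
  effective.df(f, pen)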

Consider all the biomarker and genomics work that tries to create predictions and classifiers with small min(e, m), where the situation is hopeless even for estimating crude marginal probabilities that ignore the biomarkers.

When n < 96 and Y is binary (i.e., a minimum-information outcome variable), as in your case, the exercise is nearly futile: even if you ignored all the predictors and just wanted to estimate the overall Prob(Y = 1), you couldn't really do it. With n = 52, if 26 events were observed, the 0.95 Wilson confidence interval for the probability that Y = 0 is [0.37, 0.63], which means you don't know very much. This is why multi-center clinical studies are usually needed, as opposed to getting data from only one center.
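
You can reproduce that interval with binconf in the Hmisc package:

  require(Hmisc)
  binconf(26, 52, alpha = 0.05, method = "wilson")
  #   PointEst  Lower  Upper
  #        0.5   0.37   0.63   (approximately)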
