Multivariate regression, cell with few counts, how should it be handled?


Thomas Fröjd

Nov 24, 2009, 8:45:32 AM
to MedStats
Hi, this is a pretty basic question, but I haven't encountered it before,
so I would like to ask for the proper way to handle it.

X and Z are both categorical variables, and I am interested in how they
together predict Y. X has three possible values (A, B, C); Z has two (A, B).
The model I am interested in is Y = a + X + Z + X*Z.

Tabulating X against Z, the frequency table looks like this:

      Z=A   Z=B
X=A   590   100
X=B   490    46
X=C   279     1

The trouble is that there is only one observation for Z=B and X=C. How
should I handle this?

Do I recode X and Z into a new variable containing five categories? This
looks like the easiest way, but it will make it harder to interpret the
effect of Z on Y.

Can I nest the variables in some way?

Best regards.
/Thomas

Frank Harrell

Nov 24, 2009, 9:01:28 AM
to MedStats
Please tell us more about Y and about what was known a priori about the
likely interaction between X and Z.

I assume you meant to say 'multivariable' instead of 'multivariate'
because your problem is univariate with only one dependent variable.

Frank

Ted Harding

Nov 24, 2009, 10:05:35 AM
to meds...@googlegroups.com
On 24-Nov-09 14:01:28, Frank Harrell wrote:
> Please tell us more about Y and about what was known apriori
> about the likely interaction between X and Z.
>
> I assume you meant to say 'multivariable' instead of 'multivariate'
> because your problem is univariate with only one dependent variable.
> Frank

Well said, Frank!
Regarding the OP's question, I think the simplest approach (and the
most transparent) would be to simply apply a standard regression
analysis to the data as they stand, using the desired model
Y = a + X + Z + X*Z

In the result, there will be a reported estimate of the effect
of Z at level C of X (i.e. in effect the interaction term at
that level of X); but you will simply have to accept that you
have almost no information in the data as to what it should be,
so the estimate of it will be very poor (and probably might as
well be ignored).

Ted. [fellow member of the "multivariable" lobby]


--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.H...@manchester.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 24-Nov-09 Time: 15:05:21
------------------------------ XFMail ------------------------------

Thomas Fröjd

Nov 25, 2009, 6:32:21 AM
to MedStats
Hi,
I should have learnt to provide context when asking questions by
now...

The groups are naturally occurring subgroups. The first variable is
exposure to an event (was there and physically affected, was there,
was not there). The second variable is whether the person lost a relative
or friend in the event (yes or no).

Ted's suggestion feels pretty reasonable to me. Any other takes?





Frank Harrell

Nov 25, 2009, 9:48:58 AM
to MedStats
Ted's response is a good one. It would still be nice to know more
about Y. Is it continuous? Ordinal? If ordinal, what is the
distribution of frequencies?

Frank

Ted Harding

Nov 25, 2009, 9:52:45 AM
to meds...@googlegroups.com
On 25-Nov-09 11:32:21, Thomas Fröjd wrote:
> Hi,
> I should have learnt to provide context when asking question by
> now....
>
> Groups are naturally occuring subgroups. The first variable is
> exposure to an event(was there and physically affected, was there,
> weren't there). Second variable is lost a relative or friend in the
> event (yes or no).
>
> Ted's suggestion feels pretty reasonable to me. Any other takes?

Just to follow up on my suggestion (the rest snipped).
I have run an R simulation of a model (with non-zero interaction terms)
to check on what I suggested previously. R code with comments follows.

With the system of contrasts used in the regression, the true
effect and interaction values are:

Intercept = 0.0000
XB = 2.0000 XC = 5.0000
ZB = 2.0000
XB:ZB = 1.0000 XC:ZB = 1.0000

You can see that the SE for the XC:ZB interaction is much larger
than the SEs for the other effects, coming out at 1.0155 throughout,
i.e. basically the same as the SD 1.0 of the N(mu,1) data points,
while the others range from about 0.2 down to about 0.04 depending on
the associated N.

Following the first run, the single data point corresponding
to X=C, Z=B is changed (sampling from the same N(8,1) distribution
as in the first run), leaving the other 1505 data unchanged, and
this is done three times (four runs in all).

The estimates of all effects and interactions are unchanged (to 4 d.p.)
in the repeated runs, except for the XC:ZB interaction which (in the
four repeats) gets estimates

-0.32126, 1.72541, 0.27920, 0.15458

(to be compared with its true value 1.0000). Now read on:

#        Z=A   Z=B
# X=A    590   100
# X=B    490    46
# X=C    279     1

# Factor levels laid out cell by cell: the three Z=A cells first, then Z=B
X <- factor(c(rep("A",590),rep("B",490),rep("C",279),
              rep("A",100),rep("B", 46),rep("C",  1)))
Z <- factor(c(rep("A",590+490+279),rep("B",100+46+1)))
# Total N = 1506

# True cell means m.XZ (first letter = level of X, second = level of Z)
m.AA <- 0 ; m.BA <- 2 ; m.CA <- 5
m.AB <- 2 ; m.BB <- 5 ; m.CB <- 8
mu <- c(rep(m.AA,590),rep(m.BA,490),rep(m.CA,279),
        rep(m.AB,100),rep(m.BB, 46),rep(m.CB,  1))

set.seed(54321)
Y <- rnorm(1506,mu,1)   # N(mu,1) responses, one per observation
print(summary(lm(Y ~ X*Z))$coef,4)
#              Estimate Std. Error t value   Pr(>|t|)
# (Intercept)  -0.03639    0.04149 -0.8771  3.806e-01
# XB            1.95764    0.06160 31.7801 6.717e-170
# XC            4.98101    0.07323 68.0215  0.000e+00
# ZB            2.12411    0.10899 19.4890  1.383e-75
# XB:ZB         0.92999    0.18982  4.8992  1.066e-06
# XC:ZB        -0.32126    1.01550 -0.3164  7.518e-01

Y.N <- Y[1506] # Save it for future reference in case needed
Y[1506] <- m.CB + rnorm(1) #change it
print(summary(lm(Y ~ X*Z))$coef,4)
# Intercept, XB, XC, ZB and XB:ZB unchanged from above.
# XC:ZB 1.72541 1.01550 1.699 8.951e-02

Y[1506] <- m.CB + rnorm(1) #change it again
print(summary(lm(Y ~ X*Z))$coef,4)
# Intercept, XB, XC, ZB and XB:ZB unchanged from above.
# XC:ZB 0.27920 1.01550 0.2749 7.834e-01

Y[1506] <- m.CB + rnorm(1) #change it again
print(summary(lm(Y ~ X*Z))$coef,4)
# Intercept, XB, XC, ZB and XB:ZB unchanged from above.
# XC:ZB 0.15458 1.01550 0.1522 8.790e-01

#########################################################
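
The pattern in the standard errors can be checked directly from the cell counts. The following is a minimal sketch, assuming the usual linear model with treatment contrasts and reference cell (X=A, Z=A), so that each interaction estimate is a difference of differences of four cell means. The helper se_interaction and the choice sigma = 1 are illustrative assumptions, not part of the code above.

```r
# Cell counts from the table above (rows X = A, B, C; columns Z = A, B).
# matrix() fills column by column, so the first three values are the Z=A column.
n <- matrix(c(590, 490, 279,
              100,  46,   1),
            nrow = 3,
            dimnames = list(X = c("A", "B", "C"), Z = c("A", "B")))

# Hypothetical helper: the interaction for level x of X is
#   (mean[x,B] - mean[x,A]) - (mean[A,B] - mean[A,A]),
# so its variance is sigma^2 times the sum of the four reciprocal cell counts.
se_interaction <- function(n, x, sigma = 1) {
  sigma * sqrt(1 / n[x, "B"] + 1 / n[x, "A"] +
               1 / n["A", "B"] + 1 / n["A", "A"])
}

se_XB_ZB <- se_interaction(n, "B")
se_XC_ZB <- se_interaction(n, "C")
print(round(c(XB_ZB = se_XB_ZB, XC_ZB = se_XC_ZB), 4))
```

With sigma = 1 this gives roughly 0.19 for XB:ZB and 1.01 for XC:ZB. The 1/1 term from the single (X=C, Z=B) observation dominates the second sum, so that SE cannot fall below sigma however large the other cells are. lm() uses the estimated residual SD in place of sigma, which is why the reported standard errors differ slightly from these values.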

Ted.


Thomas Fröjd

Nov 26, 2009, 4:23:25 AM
to MedStats
Frank: Actually it is a count distribution, but I use a negative binomial
(negbin) distribution function in the model, so for this exercise I believe
we can treat it as continuous.

Ted: Thanks, that settles it.

As an experiment I tried running your code but switching the cell counts to

      Z=A   Z=B
X=A   590     1
X=B   490    46
X=C   279   100

It gives estimates around 0 for both XB:ZB and XC:ZB and std. errors
around 1. Not very surprising, maybe. This parametrization is actually
closer to what I would prefer, using X=A and Z=A as my reference groups.
The reason is that they are the non-exposed, which would make
interpretation a bit easier.

Do I actually lose anything by recoding the two factors into one factor
with five outcomes? Since there is only one interaction that can be
estimated, shouldn't the estimate of it equal the contrast
(XC:ZB - XC:ZA) - (XB:ZB - XB:ZA)?

Also, another question on the same subject.

Let's say the interaction is not there and I have the frequencies above,
that is

      Z=A   Z=B
X=A   590     1
X=B   490    46
X=C   279   100

The model is now Y = a + X + Z.

What is the Z parameter an estimate of now? The average difference
overall between the levels of Z across all three groups of X, or will it
be heavily weighted towards the average difference within X=B and X=C?
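
A minimal sketch of how the weighting works, assuming an ordinary least-squares fit with normal errors (for the negative binomial model this should be read as a rough guide only). In that setting the Z coefficient of the additive model equals a weighted average of the within-X differences in mean Y between Z=B and Z=A, with weights n_A * n_B / (n_A + n_B) per level of X. The effect sizes and seed below are hypothetical.

```r
# Cell counts as in the table above (rows X = A, B, C; columns Z = A, B)
X <- factor(c(rep("A", 590), rep("B", 490), rep("C", 279),
              rep("A",   1), rep("B",  46), rep("C", 100)))
Z <- factor(c(rep("A", 590 + 490 + 279), rep("B", 1 + 46 + 100)))

# Hypothetical additive truth: X effects 0, 2, 5 and a Z effect of 2
set.seed(1)  # arbitrary seed
Y <- rnorm(length(X),
           mean = c(A = 0, B = 2, C = 5)[as.character(X)] + 2 * (Z == "B"),
           sd = 1)

fit <- lm(Y ~ X + Z)  # additive model, no interaction

# Within each level of X: difference in mean Y between Z=B and Z=A
cell_means <- tapply(Y, list(X, Z), mean)
d <- cell_means[, "B"] - cell_means[, "A"]

# Weight of each level of X: n_A * n_B / (n_A + n_B)
counts <- table(X, Z)
w <- counts[, "A"] * counts[, "B"] / (counts[, "A"] + counts[, "B"])

weighted_avg <- sum(w * d) / sum(w)
print(round(w / sum(w), 3))  # share of the weight carried by each level of X
print(all.equal(unname(weighted_avg), unname(coef(fit)["ZB"])))
```

With these counts the weights are about 1.0 for X=A, 42.1 for X=B and 73.6 for X=C, so X=A carries under 1% of the weight. Under least squares, then, the Z estimate is determined almost entirely by the differences within X=B and X=C.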

Thanks for all the help so far.

/Thomas