Cells empty in lower and higher values of an age factor, should I remove observations or recode?

Thomas

May 6, 2009, 4:11:01 AM
to MedStats
Hi.

I am doing a regression with three factors as covariates: country (two
levels), sex, and age (five levels). For the lowest and the highest
levels of the age factor (<29 years and 60-69 years), a frequency table
of sex*age or age*country shows empty cells and a few cells with very
few observations.
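
The check described above can be sketched in a few lines of Python. This is only an illustration: the records, the `sparse_cells` helper, and the sex labels are made up, and the age labels simply follow the five levels named in this thread.

```python
from collections import Counter

# Hypothetical records: (sex, age_group). Real data would come from the study file.
records = [
    ("F", "30-39"), ("F", "40-49"), ("F", "40-49"), ("F", "50-59"),
    ("M", "<29"), ("M", "30-39"), ("M", "40-49"), ("M", "50-59"), ("M", "60-69"),
]
SEXES = ["F", "M"]
AGE_GROUPS = ["<29", "30-39", "40-49", "50-59", "60-69"]


def sparse_cells(rows, min_count=1):
    """Return {(sex, age_group): count} for cells with fewer than min_count rows."""
    counts = Counter(rows)
    return {
        (s, a): counts[(s, a)]
        for s in SEXES
        for a in AGE_GROUPS
        if counts[(s, a)] < min_count
    }


# Cells of the sex*age table that are completely empty.
empty = sparse_cells(records)
print(empty)
```

Raising `min_count` (for example to 5) also flags the cells that are not empty but hold very few observations.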

If I want the age estimates to be meaningful, I therefore need to
address this in some way. As I see it, I have two options.

1) Changing the study population and removing the <29 years level and
the 60-69 years level. I lose some observations but if it is right it
is worth it.

2) Recoding age so that the lowest two and the highest two levels are
combined, giving me <39, 40-49, 50-69 as the three levels of age.

Is this a matter of judgement, or is there something wrong with option
2?

Best regards.

/Thomas

Bruce Weaver

May 6, 2009, 7:00:29 AM
to MedStats
Was age collected as categorical, or do you have the actual ages?

--
Bruce Weaver
bwe...@lakeheadu.ca
http://sites.google.com/a/lakeheadu.ca/bweaver/
"When all else fails, RTFM."

Thomas

May 6, 2009, 8:15:25 AM
to MedStats
No, too bad. It was collected in the categories I outlined in my first
message. It is data on employees, so it starts at 20 years and goes in
10-year intervals up to 60.

Bruce Weaver

May 6, 2009, 2:08:08 PM
to MedStats
On May 6, 8:15 am, Thomas <tfr...@gmail.com> wrote:
> No, too bad. It was collected in the categories I outlined in my first
> message. It is data on employees, so it starts at 20 years and goes in
> 10-year intervals up to 60.


Well, one question then is whether you get substantially different
results using the two approaches you described, i.e.,

1) Removing the <29 years level and the 60-69 years level.

2) Recoding age so that the lowest two and the highest two levels are
combined, giving <39, 40-49, 50-69 as the three levels of age.

If the two approaches yield similar results, then you're agonizing
over nothing. If they yield substantially different results, then
you'll have to figure out why.
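
As a rough sketch of how the two candidate data sets could be prepared for that comparison, assuming rows of (country, sex, age group, outcome): the data, the `drop_extremes` and `collapse_extremes` helpers, and the country and sex labels are all hypothetical. The same regression model would then be fitted to `option1` and `option2` in whatever package is being used.

```python
# Hypothetical rows: (country, sex, age_group, outcome).
rows = [
    ("A", "F", "30-39", 2.1), ("A", "M", "<29", 1.8), ("A", "M", "40-49", 2.6),
    ("B", "F", "40-49", 2.4), ("B", "F", "50-59", 3.0), ("B", "M", "60-69", 3.3),
    ("B", "M", "30-39", 2.0), ("A", "F", "50-59", 2.9),
]

# Option 1: the two sparse extreme levels are removed from the study population.
EXTREMES = {"<29", "60-69"}

# Option 2: the lowest two and highest two levels are merged.
COLLAPSE = {
    "<29": "<39", "30-39": "<39",
    "40-49": "40-49",
    "50-59": "50-69", "60-69": "50-69",
}


def drop_extremes(data):
    """Keep only observations outside the extreme age levels."""
    return [r for r in data if r[2] not in EXTREMES]


def collapse_extremes(data):
    """Keep all observations but recode age to three levels."""
    return [(c, s, COLLAPSE[a], y) for c, s, a, y in data]


option1 = drop_extremes(rows)
option2 = collapse_extremes(rows)
print(len(option1), sorted({r[2] for r in option1}))
print(len(option2), sorted({r[2] for r in option2}))
```

Note that the two options estimate slightly different things: option 1 changes who is in the study population, while option 2 keeps everyone but changes what each age level means.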

--
Bruce Weaver
bwe...@lakeheadu.ca

Barry McDonald

May 6, 2009, 5:41:40 PM
to MedS...@googlegroups.com
Thomas,
Is it reasonable in your context to model age as a covariate rather
than a factor, e.g. using age-group midpoints 25, 35, 45, 55, 65 to
look for a linear trend with age? You can then fit interaction models,
and the interactions are interpreted as different slopes. The problem
of empty cells goes away.

To check whether it is reasonable to linearize age this way, you could either
1. do a pure error test (essentially an F test based on the residual
   SS from treating age as linear minus the residual SS from treating
   age as a factor as in ANOVA, adjusted for the difference in df), or
2. simply add an age^2 and possibly an age^3 covariate to the model.
   These would pick up the grossest deviations from linearity. Test
   whether these higher-order covariates are needed. Even if they are,
   this will still take up fewer df than treating age group as a factor,
   and it has the advantage that you can still test for interactions
   such as sex*age and sex*age^2 without the problem of empty cells.
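
A minimal sketch of the first check, with age as the only predictor to keep it short (a real analysis would also include country and sex). The data and function names are hypothetical, and only the F statistic and its df are computed; a p-value would need an F distribution from a statistics package.

```python
# Age-group midpoints as suggested above.
MIDPOINTS = {"<29": 25, "30-39": 35, "40-49": 45, "50-59": 55, "60-69": 65}

# Hypothetical (age_group, outcome) pairs with some curvature in the group means.
data = [
    ("<29", 0.9), ("<29", 1.1), ("30-39", 1.1), ("30-39", 1.3),
    ("40-49", 1.9), ("40-49", 2.1), ("50-59", 3.4), ("50-59", 3.6),
    ("60-69", 5.4), ("60-69", 5.6),
]


def rss_linear(xs, ys):
    """Residual SS from a simple least-squares line y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    return sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))


def rss_factor(groups, ys):
    """Residual SS from treating age as a factor (one mean per group)."""
    by_group = {}
    for g, y in zip(groups, ys):
        by_group.setdefault(g, []).append(y)
    rss = 0.0
    for vals in by_group.values():
        mean = sum(vals) / len(vals)
        rss += sum((v - mean) ** 2 for v in vals)
    return rss, len(by_group)


def lack_of_fit_f(groups, ys):
    """F statistic comparing the linear-age model with the age-as-factor model."""
    xs = [MIDPOINTS[g] for g in groups]
    rss_lin = rss_linear(xs, ys)
    rss_fac, k = rss_factor(groups, ys)
    df_num = k - 2            # extra parameters used by the factor model
    df_den = len(ys) - k      # residual df of the factor model
    f_stat = ((rss_lin - rss_fac) / df_num) / (rss_fac / df_den)
    return f_stat, df_num, df_den


groups = [g for g, _ in data]
ys = [y for _, y in data]
f_stat, df_num, df_den = lack_of_fit_f(groups, ys)
print(f_stat, df_num, df_den)
```

A large F relative to the F(df_num, df_den) distribution suggests that a straight line in the midpoints is not adequate, which is where the age^2 and age^3 terms of the second check come in.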

This approach doesn't always help but it might.
-Barry
