logistic regression

Adi Milrad

Jan 17, 2018, 8:11:36 AM
to israel-r-...@googlegroups.com
Hi,
I'm trying to understand whether a categorical variable with about 40,000 possible values affects a ratio variable.
I would also like to see the impact of each category.

I thought of running a logistic regression:

model_1 <- glm(cbind(clicks, impressions - clicks) ~ profile, data = data, family = binomial)

where profile has about 40K different string values,
but it seems to be too heavy to handle.

Eventually I would like to add more predictor variables (country, for example).
Can you think of a better way to approach the problem?

Thanks.


עמרי מנדלס

Jan 22, 2018, 2:19:45 AM
to Israel R User Group
Do you want to test whether there is a significant effect, or to predict the ratio variable?
How many samples do you have?

Essentially, when you convert the categorical variable into binary indicators (one-hot encoding), you get 40,000 different features.
From there you can apply feature selection or other dimensionality reduction techniques. You could also incorporate new features (country, as you mentioned) at this point. This puts you in a better position to predict the ratio variable.
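
A minimal sketch of that idea, on made-up toy data with the column names from the original post. It assumes the Matrix package (normally shipped with R) for a sparse one-hot encoding, and uses a minimum-impressions filter as one possible, very simple form of feature selection; the threshold is a hypothetical value, not a recommendation.

```r
library(Matrix)  # assumed available; ships with R as a recommended package

# Toy data: column names follow the original post, values are made up.
toy <- data.frame(
  profile     = factor(c("a", "b", "c", "a", "b", "d")),
  clicks      = c(5, 0, 2, 7, 1, 0),
  impressions = c(100, 3, 40, 120, 5, 2)
)

# One-hot encode profile as a sparse matrix. Dropping the intercept (- 1)
# gives every level its own column; with 40K levels a dense matrix would
# be mostly zeros and much heavier to store.
X <- sparse.model.matrix(~ profile - 1, data = toy)
dim(X)  # rows = observations, columns = profile levels

# Simple filter: keep only profiles with enough total impressions.
min_impressions <- 10  # hypothetical threshold
imp_per_profile <- tapply(toy$impressions, toy$profile, sum)
keep <- as.vector(imp_per_profile >= min_impressions)
X_reduced <- X[, keep, drop = FALSE]
colnames(X_reduced)
```

The reduced sparse matrix can then be passed to any fitting routine that accepts sparse input, and further columns (country, for example) can be added to the formula in the same way.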

If you wish to understand the effect, you can cluster categories (profiles) together and look for an effect at a coarser level of granularity.
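
One way this grouping could look, as a rough sketch on simulated data (all values, the number of groups, and the name profile_group are assumptions for illustration). It clusters profiles by their observed click ratio with kmeans and then fits the binomial model on the groups instead of on the individual profiles. Note that grouping by the outcome itself makes the resulting p-values optimistic, so if the profiles have a natural hierarchy or other attributes, grouping on those would be the safer choice.

```r
set.seed(1)

# Simulated per-profile totals: 200 profiles with made-up counts.
n_profiles <- 200
agg <- data.frame(
  profile     = paste0("p", seq_len(n_profiles)),
  impressions = rpois(n_profiles, 500) + 1
)
agg$clicks <- rbinom(n_profiles, agg$impressions, runif(n_profiles, 0.01, 0.10))

# Cluster profiles on their empirical click ratio (5 groups is arbitrary).
agg$ctr <- agg$clicks / agg$impressions
km <- kmeans(agg$ctr, centers = 5, nstart = 10)
agg$profile_group <- factor(km$cluster)

# Same binomial model as in the question, but with 5 levels instead of 40K.
model_grouped <- glm(cbind(clicks, impressions - clicks) ~ profile_group,
                     data = agg, family = binomial)
summary(model_grouped)$coefficients
```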

Hope this helps,
Omri