Create GLM model with performance comparable to forward selection approach


greif...@gmail.com

Sep 25, 2015, 1:07:51 PM9/25/15
to H2O Open Source Scalable Machine Learning - h2ostream
My GLM models in H2O are repeatedly beaten by stepwise procedures in SAS (using gini on a validation data set as the criterion). It makes me both upset (I think stepwise is generally a wrong idea) and sad (because I continue to believe that elastic net regularization is cool and should outperform stepwise).

I am able to create a comparable model, but only at the expense of too many variables in the model (e.g. 600). I have 3,000 attributes, all of them factors (enums), leading to 18,000 weights to estimate in total.

My goal is to find a parsimonious model that includes fewer than a few tens of attributes (< 20). I tried a grid search over alpha with lambda search enabled, and manually chosen lambdas/alphas, but I do not see any way to get a sparse model (in terms of used attributes, not estimated weights) that would be powerful enough.

What would be a good approach to this?

tomas....@gmail.com

Sep 25, 2015, 3:29:59 PM9/25/15
to H2O Open Source Scalable Machine Learning - h2ostream, greif...@gmail.com
Hi Tomas,

That is surprising to me as well. Your case is specific, though, in that you want to select only a very small number of predictors out of the total, and it is possible that the step-wise approach works better here: the regularization strength needed to filter out all the other coefficients might simply be too strong.

The approach I would recommend is to run lambda search with alpha = 1 and set max_active_predictors slightly higher than what you want, e.g. ~30 if you want about ~20 in your model. That number really specifies the number of active predictors after applying strong-rules screening, and is generally a little higher than the actual number of nonzero coefficients in the final model. If the model is not good enough, you can retrain the GLM with only the coefficients present in the model.

Alternatively, you may try to run a full lambda search, take the best result, keep the N coefficients with the highest absolute values, and retrain your model with those.
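As a rough sketch of the first recommendation (frame and column names are placeholders, and `h2o.coef` plus the level-name convention `attribute.level` are assumptions about the R API of that release, not guaranteed):

```r
library(h2o)  # assumes a running H2O cluster and loaded frames

# Lambda search with a pure-l1 penalty, capped at roughly 30 active
# predictors; the lambda path stops once the cap is hit.
fit <- h2o.glm(y = "target", x = predictors,
               training_frame = train.hex, validation_frame = valid.hex,
               family = "binomial", alpha = 1,
               lambda_search = TRUE, max_active_predictors = 30)

# Optional second pass: retrain on only the attributes that survived.
# For enums, level-wise coefficient names are mapped back to their
# parent columns first (the "attribute.level" naming is assumed).
nz_attrs <- unique(sub("\\..*$", "",
                       names(which(h2o.coef(fit)[-1] != 0))))
refit <- h2o.glm(y = "target", x = nz_attrs,
                 training_frame = train.hex, validation_frame = valid.hex,
                 family = "binomial")
```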

Best,

Tomas

Tomáš Greif

Sep 25, 2015, 3:47:45 PM9/25/15
to tomas....@gmail.com, H2O Open Source Scalable Machine Learning - h2ostream
Hi Tomas,

Thank you for your help. Strange things happen when I use max_active_predictors in R: the model always converges with the intercept only. I have tried this with alpha ranging from 1 down to 0.95, e.g.:

fit.glm.alpha.1 <- h2o.glm(y = 'target', 
                           x = incl_vars,
                           training_frame = def_train.hex, validation_frame = def_test.hex, 
                           alpha = .95,  max_active_predictors = 40,
                           family = 'binomial', link = 'logit', solver = 'L_BFGS', standardize = FALSE, model_id = 'fit.glm.alpha.1')

Unless I set the number of active predictors to something extremely high (e.g. 20,000), it always ends with the intercept only (and converges immediately, in no time). I am using version 3.2.0.3. I have also observed some issues with lambda search. With my data, it takes an extreme number of iterations to test all lambdas - it looks like something prevents the algorithm from moving on to the next lambda. I had to increase the number of iterations significantly (e.g. to 30,000) in order to test more than 20 lambdas (the lower ones). This is why I am now mostly setting lambda manually.

I would somewhat dispute that my case is specific - in the credit management domain, it has been common practice over the last decades to build parsimonious models with a small number of attributes; this makes the model easier to understand, deploy, manage, and explain to the regulator.

Tomas

tomas....@gmail.com

Sep 25, 2015, 4:08:28 PM9/25/15
to H2O Open Source Scalable Machine Learning - h2ostream, tomas....@gmail.com, greif...@gmail.com
Hi Tomas,

The max_active_predictors option will generally work only when used in combination with lambda search, and with a sufficiently small distance between two consecutive lambdas (the defaults for nlambdas and lambda_min give roughly a 0.92 ratio between each lambda and the next). That is because we can then use recursive strong rules, which are far more efficient at filtering out inactive columns.
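For concreteness, the default path is geometric; a sketch of the sequence implied by those defaults (the 1e-4 min-to-max ratio is an assumption about the defaults of that release, and lambda_max is a placeholder - H2O derives it from the data):

```r
nlambdas <- 100
lambda_min_ratio <- 1e-4   # assumed default ratio lambda_min / lambda_max
r <- lambda_min_ratio^(1 / (nlambdas - 1))   # step ratio, roughly 0.91
lambda_max <- 1            # placeholder; in H2O this is data-dependent
lambdas <- lambda_max * r^(0:(nlambdas - 1)) # descending geometric path
```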

The excessive number of iterations is due to the L-BFGS solver being used together with the l1 penalty. Can you use IRLSM instead? IRLSM generally does a better job with the l1 penalty, and it is much faster.

Tomáš Greif

Sep 25, 2015, 5:24:21 PM9/25/15
to Tomas Nykodym, H2O Open Source Scalable Machine Learning - h2ostream
Hi Tomas,

Excellent advice - this should definitely be part of the documentation! I had used L-BFGS exclusively because, without the max_active_predictors constraint, I got an error for some other configuration saying the gram matrix would not fit in memory.

It definitely runs much faster and seems to pick up some of the most important variables. I tried a naive grid search over alpha values close to one and different values of max_active_predictors:

alpha <- c(0.95, 0.97, 0.98, 0.99, 1)
vars  <- c(25, 100, 150, 200, 250, 300, 350, 400)
models <- list()

for (i in seq_along(alpha)) {
  models <- c(models, lapply(vars, function(x) {
    h2o.glm(y = 'target', x = incl_vars,
            training_frame = def.hex, validation_frame = def_test.hex, 
            lambda_search = TRUE, nlambdas = 100, max_active_predictors = x, alpha = alpha[i],
            family = 'binomial', link = 'logit', solver = 'IRLSM', standardize = FALSE, 
            model_id = paste0('p_', x, 'a_', alpha[i]))
  }))
}


This will give me (once it computes) a list of 5×8 = 40 models in which, hopefully, only strong (but non-overlapping) variables will be represented. I will try to analyze the variable-selection overlap and then refit on the top variables with lambda search off and lambda = 0, in order to fit weights to all factor levels. Would you generally agree with such an approach?

Will post some update on my progress later.

Tomas

tomas....@gmail.com

Sep 25, 2015, 6:16:33 PM9/25/15
to H2O Open Source Scalable Machine Learning - h2ostream, tomas....@gmail.com, greif...@gmail.com
Hi Tomas,

Cool - I am curious how your overall results will compare to the step-wise method.

In my experiments, the GLM was not too sensitive to alpha, i.e. the models for alpha values close to 1 were very similar to each other; but I was mostly interested in the best overall model and did not inspect the low-variable-count models too closely.

As for running with lambda = 0: I never run with no regularization myself, and I would generally not recommend it. If you want coefficients for all levels, you can set alpha = 0 to disable the l1 penalty and then pick (via lambda search?) some small lambda to add a bit of l2 penalty. The l2 penalty then handles correlated variables nicely (with no l2 you actually have to drop one level from each categorical variable, otherwise the levels are collinear with the intercept) and prevents overfitting.
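That retraining step could look roughly like this (selected_vars being the attributes kept from the sparse model; frame and column names are placeholders):

```r
library(h2o)  # assumes a running H2O cluster and loaded frames

# alpha = 0 turns off l1 entirely; lambda search then picks a small l2
# penalty that keeps all factor levels while handling collinearity.
refit <- h2o.glm(y = "target", x = selected_vars,
                 training_frame = train.hex, validation_frame = valid.hex,
                 family = "binomial", alpha = 0, lambda_search = TRUE)
```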

Best,

Tomas

Tomáš Greif

Sep 26, 2015, 12:47:27 AM9/26/15
to Tomas Nykodym, H2O Open Source Scalable Machine Learning - h2ostream
Well, I hope I have figured out something that can be considered a "strategy" for model selection:

1) Run a grid search over alpha values close to one and a varying number of max_active_predictors (I would like to end up with 20 predictors; my range was from 25 to 400); use lambda_search with the default settings (nlambdas = 100)
2) Take only the variables that rank in the top N (I chose 30) into the next step, based on a defined criterion (mine was the total sum of absolute weights across all models from step 1)
3) Run a model on the selected variables with some small alpha (e.g. 0.05) and lambda_search
4) (Optional) Take the data for the variables from step 2 and use it in a stepwise logistic regression (with such a small number of variables this should be computable even in single-threaded R using step())
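Step 2 can be sketched roughly like this (assuming every model in `models` reports the same coefficient names, and that `h2o.coef` returns the level-wise weights named "attribute.level" - both are assumptions about the R API):

```r
# Score each attribute by the total absolute weight of its levels,
# summed over all models from the grid in step 1.
scores <- rowSums(sapply(models, function(m) {
  co <- abs(h2o.coef(m)[-1])                     # drop the intercept
  tapply(co, sub("\\..*$", "", names(co)), sum)  # collapse levels -> attribute
}))
top_vars <- names(sort(scores, decreasing = TRUE))[1:30]
```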


Using this, I was able to get results on par with the SAS stepwise procedure. Almost all of the top variables were the same; there was quite a big difference in the other half of the variables, but I guess that is just because there are many extremely correlated variables and therefore many ways to produce a "best" model.