default lambda in glm elastic net regularization

greif...@gmail.com

Sep 24, 2015, 3:54:59 PM

to H2O Open Source Scalable Machine Learning - h2ostream

What is the default lambda in GLM? I am trying to do grid search over alpha (without lambda search) and wonder how lambda is determined, because there is always some lambda in the result, e.g. I get:

GLM Model: summary
  family   link  regularization                                 number_of_predictors_total  number_of_active_predictors
1 binomial logit Elastic Net (alpha = 0.05, lambda = 0.07968 )  18576                       434

Erin LeDell

Sep 24, 2015, 9:27:27 PM

to greif...@gmail.com, H2O Open Source Scalable Machine Learning - h2ostream, Tomas Nykodym

Hi,
It looks like our R documentation is not clear on the topic of GLM's
lambda. In the R docs it specifies a default of lambda = 1e-05,
however, if you don't explicitly set the lambda, it will use a heuristic
based on the training set to select a lambda to use. I will make a note
to update our R docs.

The more accurate definition for lambda is explained on pg 23 of our GLM
booklet (section 5.5.1), available here:
http://h2o-release.s3.amazonaws.com/h2o/master/3191/docs-website/h2o-docs/booklets/GLM_Vignette.pdf

I don't have any further info on the heuristic used, but if you are
interested, reply back to this email and Tomas (the GLM author) may be
able to explain in further detail.

Best,
Erin

Tomáš Greif

Sep 24, 2015, 9:59:50 PM

to Erin LeDell, H2O Open Source Scalable Machine Learning - h2ostream, Tomas Nykodym

Thank you, Erin. It would be great to understand more about how the default lambda is chosen.

Given the number of predictors I have, I am using grid search over alpha (.05, .25, .5, .75, .95) without lambda search to reduce the data to a reasonable subset that most likely contains all the relevant predictors, so that I can then run a more extensive search. The result obviously depends on how aggressive the lambda heuristic is, since high values close to lambda max will cut off more predictors.

Also, the example with grid search over alpha here: http://s3.amazonaws.com/h2o-release/h2o/master/3147/docs-website/h2o-docs/booklets/GLM_Vignette.pdf and here: http://h2o-release.s3.amazonaws.com/h2o/master/3191/docs-website/h2o-docs/booklets/GLM_Vignette.pdf does not work (it is "overlisted" :):

Failed models
-------------
 alpha     status_failed  msgs_failed
 [[0.0]]   FAIL           "Cannot set field 'alpha'"
 [[0.25]]  FAIL           "Cannot set field 'alpha'"
 [[0.5]]   FAIL           "Cannot set field 'alpha'"
 [[0.75]]  FAIL           "Cannot set field 'alpha'"
 [[1.0]]   FAIL           "Cannot set field 'alpha'"

I made it work by passing the vector of alpha values as a single item of the list:

library(h2o)
h2o.init()

path = system.file("extdata", "prostate.csv", package = "h2o")
h2o_df = h2o.importFile(path)
h2o_df$CAPSULE = as.factor(h2o_df$CAPSULE)

########## Alpha grid search
# Pass the alpha values as one plain vector (a single list item), not as a list of lists
alpha_opts <- c(0.05, 0.25, 0.5, 0.75, 0.95)
hyper_parameters <- list(alpha = alpha_opts)

grid <- h2o.grid("glm", hyper_params = hyper_parameters, y = "CAPSULE", x = c("AGE", "RACE", "PSA", "GLEASON"),
                 training_frame = h2o_df, validation_frame = h2o_df, family = "binomial")

# Fetch each fitted model from the grid by its id
grid_models <- lapply(grid@model_ids, function(model_id) h2o.getModel(model_id))

# Print the regularization summary and the train/validation metrics of each model
for (i in seq_along(grid_models)) {
  print(sprintf("regularization: %-50s auc: %f train gini: %f valid gini: %f",
                grid_models[[i]]@model$model_summary$regularization,
                h2o.auc(grid_models[[i]]),
                h2o.giniCoef(grid_models[[i]], train = TRUE),
                h2o.giniCoef(grid_models[[i]], valid = TRUE)))
}

Erin LeDell

Sep 24, 2015, 10:21:04 PM

to Tomáš Greif, H2O Open Source Scalable Machine Learning - h2ostream, Tomas Nykodym

Hi,
Are you using the latest version of H2O and the latest docs? I am able to run the code in the example without errors. The grid search code examples from the booklet are available here:

https://github.com/h2oai/h2o-3/blob/master/h2o-docs/src/booklets/v2_2015/source/glm/glm_grid_search_over_lambda.R
https://github.com/h2oai/h2o-3/blob/master/h2o-docs/src/booklets/v2_2015/source/glm/glm_grid_search_over_alpha.R

Make sure you are always using the latest version of the GLM booklet. We build the booklets with each nightly build and it looks like you are pointing to release 3147 and 3191.

I'll ask (our) Tomas to respond in further detail about the lambda heuristic.

Thanks,
Erin

-- 
Erin LeDell Ph.D.
Statistician & Machine Learning Scientist | H2O.ai

On 9/24/15 6:37 PM, Erin LeDell wrote:

One way to make a "generic" JSON parser is to only support JSON that has a particular format. For example, I wrote a bunch of code a while ago that used RethinkDB (a nosql db) -- and this is their requirement for JSON (basically it must be able to be easily converted to a Frame):

RethinkDB will accept two formats for JSON files:

An array of JSON documents.

[ { field: "value" }, { field: "value" }, ... ]

Whitespace-separated JSON rows.

{ field: "value" } { field: "value" }

On 9/24/15 6:18 PM, Spencer Aiello wrote:

are there tools that do this automatically without the user knowing? there are those that will like this and those that will hate this. IMO, better for user to manage and understand that what's being shown is not what's being processed in a learning task.

tomas....@gmail.com

Sep 25, 2015, 3:09:18 PM

to H2O Open Source Scalable Machine Learning - h2ostream, er...@h2o.ai, to...@h2o.ai, greif...@gmail.com

Hi Tomas.

The default lambda is selected by a heuristic based on lambda_max (the smallest lambda value such that all coefficients are 0) for the given dataset. For a dataset with P predictors and N rows, it is 0.1*lambda_max if P >= 4*N, and 0.001*lambda_max otherwise.
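
For illustration only, the rule above can be written as a small R function. This is a sketch of the rule as described in this message, not the actual H2O source; the helper name default_lambda and its argument names are made up for this example, and lambda_max has to be supplied by the caller.

```r
# Hypothetical helper, not part of the h2o package: encodes the rule of thumb
# described above (0.1 * lambda_max for wide data, 0.001 * lambda_max otherwise).
default_lambda <- function(lambda_max, n_predictors, n_rows) {
  stopifnot(lambda_max >= 0, n_predictors > 0, n_rows > 0)
  # "Wide" data: at least four times as many predictors as rows
  ratio <- if (n_predictors >= 4 * n_rows) 0.1 else 0.001
  ratio * lambda_max
}

# Made-up inputs, only to show the two branches
default_lambda(lambda_max = 1, n_predictors = 5000, n_rows = 1000)  # P >= 4*N branch
default_lambda(lambda_max = 1, n_predictors = 4, n_rows = 380)      # P <  4*N branch
```

If the heuristic is too aggressive or too mild for a pre-screening run, lambda can also be passed explicitly to the model instead of relying on the default.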

Best,

Tomas
