Trouble with GLM in h2o

Lorenzo Isella

Nov 6, 2016, 11:25:22 AM
to h2os...@googlegroups.com
Dear All,
Could you check whether anything looks odd in the script at the end of this
email?
When I run it on my workstation and limit the training set
to e.g. 20000 rows, everything is fine.
If instead I use the whole data set (fewer than 200 thousand rows,
not huge by h2o standards), I get an error.
The data set is available at

https://dl.dropboxusercontent.com/u/5685598/test_engineered2.RDS


This is some of the output I get when I run my script without limiting the
number of rows:


Hyper-parameter: alpha, 0.665
[2016-11-06 16:52:27] failure_details: DistributedException from /127.0.0.1:54321
[2016-11-06 16:52:27] failure_stack_traces: DistributedException from /127.0.0.1:54321, caused by java.lang.ArrayIndexOutOfBoundsException: 99
    at water.MRTask.getResult(MRTask.java:477)
    at water.MRTask.getResult(MRTask.java:485)
    at water.MRTask.doAll(MRTask.java:389)
    at water.MRTask.doAll(MRTask.java:395)
    at hex.glm.GLMModel.predictScoreImpl(GLMModel.java:1192)
    at hex.Model.score(Model.java:926)
    at hex.Model.score(Model.java:898)
    at hex.glm.GLM$GLMDriver.scoreAndUpdateModel(GLM.java:947)
    at hex.glm.GLM$GLMDriver.computeImpl(GLM.java:1068)
    at hex.ModelBuilder$Driver.compute2(ModelBuilder.java:169)
    at hex.glm.GLM$GLMDriver.compute2(GLM.java:515)
    at water.H2O$H2OCountedCompleter.compute(H2O.java:1203)
    at jsr166y.CountedCompleter.exec(CountedCompleter.java:468)
    at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
    at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:974)
    at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1477)
    at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 99
    at hex.glm.GLMModel$GLMOutput.bestSubmodel(GLMModel.java:895)
    at hex.glm.GLMModel.makeMetricBuilder(GLMModel.java:101)
    at hex.glm.GLMScore.map(GLMScore.java:55)
    at water.MRTask.compute2(MRTask.java:645)
    at water.MRTask.compute2(MRTask.java:585)
    at water.MRTask.compute2(MRTask.java:585)
    at water.MRTask.compute2(MRTask.java:585)
    at water.MRTask.compute2(MRTask.java:585)
    at water.H2O$H2OCountedCompleter.compute1(H2O.java:1206)
    at hex.glm.GLMScore$Icer.compute1(GLMScore$Icer.java)
    at water.H2O$H2OCountedCompleter.compute(H2O.java:1202)
    ... 5 more





Here is my sessionInfo() output

> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 8 (jessie)

locale:
[1] LC_CTYPE=en_GB.utf8 LC_NUMERIC=C
[3] LC_TIME=en_GB.utf8 LC_COLLATE=en_GB.utf8
[5] LC_MONETARY=en_GB.utf8 LC_MESSAGES=en_GB.utf8
[7] LC_PAPER=en_GB.utf8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.utf8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] h2o_3.10.0.8 statmod_1.4.26

loaded via a namespace (and not attached):
[1] tools_3.3.2 RCurl_1.95-4.8 jsonlite_1.1 bitops_1.0-6


Any suggestion is appreciated. I am very puzzled.
Cheers

Lorenzo



###################################################################
###################################################################
###################################################################

library(h2o)



## h2o.removeAll()


localH2O <- h2o.init(nthreads = -1, max_mem_size = "29g")

train <- readRDS("train_engineered2.RDS")


# take only part of training
# If I do not subset the train set, then I have an error.
train <- train[1:20000, ]

predictors <- 1:(ncol(train))
predictors <- predictors[-ncol(train)]


response <- ncol(train)

#change into h2o objects

train.hex <- as.h2o(train)


set.seed(1234)

alpha_opts = seq(0,.95, len=11)
hyper_params = list(alpha = alpha_opts)

gbm_grid <- h2o.grid("glm",
                     grid_id = "mygrid",
                     x = predictors,
                     y = response,
                     training_frame = train.hex,
                     nfolds = 3,
                     family = "gaussian",
                     ## I can only specify a grid in alpha.
                     ## H2o will look for the optimal value of lambda
                     lambda_search = TRUE,
                     nlambdas = 100,
                     hyper_params = hyper_params)

gbm_sorted_grid <- h2o.getGrid(grid_id = "mygrid", sort_by = "mae")

print(gbm_sorted_grid)

best_model <- h2o.getModel(gbm_sorted_grid@model_ids[[1]])
summary(best_model)


h2o.saveModel(object = best_model, path = ".", force=TRUE)



Darren Cook

Nov 6, 2016, 1:22:40 PM
to h2os...@googlegroups.com
> I ask if you can find anything odd in the script at the end of the
> email.
> Essentially, when I run it on my workstation and I limit the train data set
> to e.g. 20000 rows, everything is fine.
> If instead I use the whole data set (less than 200 thousand rows,
> not huge by h2o standards), then I get an error.

I couldn't spot any problem with the script. It dies halfway through the
grid, at an alpha of 0.665... could you be running out of memory, or
something like that? (Viewing cluster status from Flow, while the
models are building, can show you the memory status.)

Troubleshooting ideas:
* Is this repeatable, and does it always die at the same alpha value?
(If so, what about if you don't use a grid, and use that alpha value
directly?)
* What about if you try 50,000 or 100,000 rows?
* Does your full 200K rows work if you guess alpha as 0.5, and don't
use the grid?
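
To make the checks above concrete, something along these lines might help.
It is only a rough sketch: row_counts_to_try and fit_single_glm are made-up
helper names, alpha = 0.665 is simply the value shown in the failure output,
and the remaining settings are copied from the original script. It fits one
GLM at a fixed alpha (no grid) on the first n rows and reports whether the
fit completed.

```r
# Hypothetical helper: evenly spaced row counts, from a size that is
# known to work up to the full data set.
row_counts_to_try <- function(n_total, start = 20000, steps = 4) {
  stopifnot(n_total >= 1, steps >= 2)
  start <- min(start, n_total)
  counts <- round(seq(start, n_total, length.out = steps))
  unique(as.integer(counts))
}

# Hypothetical helper: fit a single GLM with a fixed alpha on the first
# n_rows rows. Returns TRUE if the fit completes, FALSE (after printing
# the error message) otherwise. Assumes the response is the last column,
# as in the original script, and that an H2O cluster is already running.
fit_single_glm <- function(train, n_rows, alpha = 0.665) {
  train_hex <- h2o::as.h2o(train[seq_len(n_rows), ])
  response <- ncol(train)
  predictors <- seq_len(ncol(train) - 1)
  tryCatch({
    h2o::h2o.glm(x = predictors,
                 y = response,
                 training_frame = train_hex,
                 nfolds = 3,
                 family = "gaussian",
                 alpha = alpha,
                 lambda_search = TRUE,
                 nlambdas = 100)
    TRUE
  }, error = function(e) {
    message("n_rows = ", n_rows, ", alpha = ", alpha, ": ",
            conditionMessage(e))
    FALSE
  })
}

# Usage (needs a running H2O cluster and the data file):
# library(h2o)
# h2o.init(nthreads = -1, max_mem_size = "29g")
# train <- readRDS("train_engineered2.RDS")
# for (n in row_counts_to_try(nrow(train))) {
#   cat(n, "rows:", fit_single_glm(train, n), "\n")
# }
```

If the single model also fails at the full row count, the grid is probably
not the culprit; if it only fails inside the grid, that is worth reporting too.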

Darren


> Hyper-parameter: alpha, 0.665
> [2016-11-06 16:52:27] failure_details: DistributedException from
> /127.0.0.1:54321
> [2016-11-06 16:52:27] failure_stack_traces: DistributedException from
> /127.0.0.1:54321, caused by java.lang.ArrayIndexOutOfBoundsException:
> 99
> ...
--
Darren Cook, Software Researcher/Developer
My New Book: Practical Machine Learning with H2O,
published by O'Reilly. If interested, let me know and
I'll send you a discount code as soon it is released.

Lorenzo Isella

Nov 6, 2016, 5:47:24 PM
to h2os...@googlegroups.com

Apologies,
This is the right link to my train dataset

https://dl.dropboxusercontent.com/u/5685598/train_engineered2.RDS
