Dear All,
Could you check whether there is anything odd in the script at the end of this
email?
Essentially, when I run it on my workstation and limit the training data set
to e.g. 20000 rows, everything is fine.
If instead I use the whole data set (fewer than 200 thousand rows,
not huge by h2o standards), then I get an error.
The data set is available at
https://dl.dropboxusercontent.com/u/5685598/test_engineered2.RDS
This is some of the output I get when I run my script without limiting the
number of rows:
Hyper-parameter: alpha, 0.665
[2016-11-06 16:52:27] failure_details: DistributedException from /127.0.0.1:54321
[2016-11-06 16:52:27] failure_stack_traces: DistributedException from /127.0.0.1:54321, caused by java.lang.ArrayIndexOutOfBoundsException: 99
    at water.MRTask.getResult(MRTask.java:477)
    at water.MRTask.getResult(MRTask.java:485)
    at water.MRTask.doAll(MRTask.java:389)
    at water.MRTask.doAll(MRTask.java:395)
    at hex.glm.GLMModel.predictScoreImpl(GLMModel.java:1192)
    at hex.Model.score(Model.java:926)
    at hex.Model.score(Model.java:898)
    at hex.glm.GLM$GLMDriver.scoreAndUpdateModel(GLM.java:947)
    at hex.glm.GLM$GLMDriver.computeImpl(GLM.java:1068)
    at hex.ModelBuilder$Driver.compute2(ModelBuilder.java:169)
    at hex.glm.GLM$GLMDriver.compute2(GLM.java:515)
    at water.H2O$H2OCountedCompleter.compute(H2O.java:1203)
    at jsr166y.CountedCompleter.exec(CountedCompleter.java:468)
    at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
    at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:974)
    at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1477)
    at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 99
    at hex.glm.GLMModel$GLMOutput.bestSubmodel(GLMModel.java:895)
    at hex.glm.GLMModel.makeMetricBuilder(GLMModel.java:101)
    at hex.glm.GLMScore.map(GLMScore.java:55)
    at water.MRTask.compute2(MRTask.java:645)
    at water.MRTask.compute2(MRTask.java:585)
    at water.MRTask.compute2(MRTask.java:585)
    at water.MRTask.compute2(MRTask.java:585)
    at water.MRTask.compute2(MRTask.java:585)
    at water.H2O$H2OCountedCompleter.compute1(H2O.java:1206)
    at hex.glm.GLMScore$Icer.compute1(GLMScore$Icer.java)
    at water.H2O$H2OCountedCompleter.compute(H2O.java:1202)
    ... 5 more
Here is my sessionInfo() output
> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 8 (jessie)
locale:
[1] LC_CTYPE=en_GB.utf8 LC_NUMERIC=C
[3] LC_TIME=en_GB.utf8 LC_COLLATE=en_GB.utf8
[5] LC_MONETARY=en_GB.utf8 LC_MESSAGES=en_GB.utf8
[7] LC_PAPER=en_GB.utf8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.utf8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] h2o_3.10.0.8 statmod_1.4.26
loaded via a namespace (and not attached):
[1] tools_3.3.2 RCurl_1.95-4.8 jsonlite_1.1 bitops_1.0-6
Any suggestion is appreciated. I am very puzzled.
Cheers
Lorenzo
###################################################################
###################################################################
###################################################################
library(h2o)
## h2o.removeAll()
localH2O <- h2o.init(nthreads = -1, max_mem_size = "29g")
train <- readRDS("train_engineered2.RDS")
# take only part of training
# If I do not subset the train set, then I have an error.
train <- train[1:20000, ]
predictors <- 1:(ncol(train))
predictors <- predictors[-ncol(train)]
response <- ncol(train)
#change into h2o objects
train.hex <- as.h2o(train)
set.seed(1234)
alpha_opts <- seq(0, 0.95, length.out = 11)
hyper_params <- list(alpha = alpha_opts)
gbm_grid <- h2o.grid("glm",
                     grid_id = "mygrid",
                     x = predictors,
                     y = response,
                     training_frame = train.hex,
                     nfolds = 3,
                     family = "gaussian",
                     ## I can only specify a grid in alpha.
                     ## H2O will look for the optimal value of lambda.
                     lambda_search = TRUE,
                     nlambdas = 100,
                     hyper_params = hyper_params)
gbm_sorted_grid <- h2o.getGrid(grid_id = "mygrid", sort_by = "mae")
print(gbm_sorted_grid)
best_model <- h2o.getModel(gbm_sorted_grid@model_ids[[1]])
summary(best_model)
h2o.saveModel(object = best_model, path = ".", force = TRUE)