installing h2oEnsemble-package


samira Ellouze

Oct 6, 2015, 2:49:26 PM
to H2O Open Source Scalable Machine Learning - h2ostream
Hello,
I have tried to install "h2oEnsemble-package" many times, but I always get the following error:

 Downloading github repo h2oai/h2o-2@master
Error in curl::curl_fetch_memory(url, handle = handle) : 
  Timeout was reached

I use this code:
library(devtools)
install_github("h2oai/h2o-2/R/ensemble/h2oEnsemble-package")
I use R version 3.2.2.

Can someone help me please?

Best,
Samira 

Erin LeDell

Oct 6, 2015, 5:23:13 PM
to samira Ellouze, H2O Open Source Scalable Machine Learning - h2ostream
Hi Samira,
You should be using the H2O 3.0 compatible version of h2oEnsemble.  You should use the h2o-3 repository instead of the h2o-2.

So first, change the URL to:
library(devtools)
install_github("h2oai/h2o-3/R/ensemble/h2oEnsemble-package")

And then try again to see if it works.

I have not seen that error before, but maybe you don't have curl installed on your machine?  Are you using a vanilla linux distro?  That is an issue with devtools rather than h2oEnsemble.

If you don't want to use devtools, you can install h2oEnsemble from the main h2o-3 repo as follows:

git clone https://github.com/h2oai/h2o-3.git
R CMD INSTALL h2o-3/h2o-r/ensemble/h2oEnsemble-package

Best,
Erin

-- 
Erin LeDell Ph.D.
Statistician & Machine Learning Scientist | H2O.ai

samira Ellouze

Oct 10, 2015, 1:53:42 AM
to H2O Open Source Scalable Machine Learning - h2ostream, ellouze...@gmail.com
Hi Erin,
Thank you very much.
As you said, I should use the h2o-3 repository instead of h2o-2, but I had to change the URL to:
install_github("h2oai/h2o-3/h2o-r/ensemble/h2oEnsemble-package")
With that, I finally installed h2oEnsemble.
Now I would like some information about h2oEnsemble.
When I run h2o.ensemble, I can set the number of folds for each learner with the "cvControl" argument.
I would like to know if I can use cross-validation for the metalearner.
Best,
Samira

Erin LeDell

Oct 10, 2015, 5:13:58 AM
to samira Ellouze, H2O Open Source Scalable Machine Learning - h2ostream
Hi Samira,

Ah, yes, thanks for pointing that out...your install_github URL is correct. 

When you say "cross-validation for the metalearner", do you mean cross-validation for the whole ensemble?  I am assuming you are asking about the same behavior induced by the nfolds argument in the regular H2O algos.  That feature is not yet enabled in h2o.ensemble, but you could do it manually.  There is an example in this blog post of how to manually CV any H2O algo: http://h2o.ai/blog/2015/07/kfold-cross-validation/  This was a post from before we enabled nfolds in all the algos.

For the Super Learner ensemble algorithm itself, there is no need to cross-validate the metalearner on the training data.  It requires only fitting the metalearner once on the full training set.  If you wanted to know the cross-validated performance of the metalearner, then you could pull out the Z matrix from the output and use any H2O algorithm with the nfolds argument.
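
As a rough illustration of the point above, the sketch below builds a level-one matrix Z from cross-validated base-learner predictions and then fits the metalearner once on Z. It is plain base R on simulated data, not h2oEnsemble's implementation; the data, the two toy learners and every variable name are assumptions made for the example.

```r
# Minimal base-R sketch of stacking (Super Learner style).
# Illustrative only: simulated data and hypothetical names, not h2oEnsemble code.
set.seed(1)
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 2 * dat$x1 - dat$x2 + rnorm(n, sd = 0.5)

V <- 5                                              # inner folds (the role of cvControl$V)
fold_id <- sample(rep(seq_len(V), length.out = n))  # one fold id per row

# Two toy base learners: each takes (train, test) and returns test predictions
learners <- list(
  lm_full = function(train, test) predict(lm(y ~ x1 + x2, data = train), newdata = test),
  lm_x1   = function(train, test) predict(lm(y ~ x1, data = train), newdata = test)
)

# Z: n x L matrix of cross-validated (held-out) base-learner predictions
Z <- matrix(NA_real_, nrow = n, ncol = length(learners),
            dimnames = list(NULL, names(learners)))
for (v in seq_len(V)) {
  held_out <- fold_id == v
  for (l in names(learners)) {
    Z[held_out, l] <- learners[[l]](dat[!held_out, ], dat[held_out, ])
  }
}

# The metalearner is fit a single time, on Z and the original response
meta_data <- data.frame(Z, y = dat$y)
meta_fit <- lm(y ~ ., data = meta_data)
print(coef(meta_fit))
```

To estimate the metalearner's own cross-validated performance, one would cross-validate a model on Z and y, which is what handing the Z matrix to an H2O algorithm with nfolds would do.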

-Erin

samira Ellouze

Oct 13, 2015, 11:22:49 PM
to H2O Open Source Scalable Machine Learning - h2ostream, ellouze...@gmail.com
Hi Erin,
Thank you for your help.
I used the h2o.kfold function from http://h2o.ai/blog/2015/07/kfold-cross-validation/ as follows:
fit.dl <- h2o.kfold(3, up_sum.hex, 1:89, 90, h2o_ensemble, h2o.predict, TRUE)
where h2o_ensemble is defined as:
h2o_ensemble <- function(up_sum.hex, X, Y) {
  h2o.ensemble(x = X, y = Y, training_frame = up_sum.hex,
               learner = c("h2o.glm.wrapper", "h2o.randomForest.wrapper",
                           "h2o.gbm.wrapper", "h2o.deeplearning.wrapper"),
               metalearner = "h2o.deeplearning.wrapper",
               cvControl = list(V = 10, shuffle = TRUE))
}

I notice that h2o.ensemble is run 3 times (3 iterations). Each time, a model is built for each learner ("h2o.glm.wrapper", "h2o.randomForest.wrapper", "h2o.gbm.wrapper", "h2o.deeplearning.wrapper") and a model is also built for the metalearner.
In my understanding, the metalearner should be built only once and then tested on each fold. Is that right?
Also, after the 3rd iteration, an error message is displayed:
Error: could not find function "h2o.getFutureModel"
Can you, or someone else, help me understand this error?
Best,
Samira

samira Ellouze

Oct 14, 2015, 11:40:44 PM
to H2O Open Source Scalable Machine Learning - h2ostream, ellouze...@gmail.com
Hi,
I execute the command source("file.R"), where file.R contains the h2o.kfold function and:
X <- 1:89
Y <- 90
h2o_ensemble <- function(up_sum.hex, X, Y) {
  h2o.ensemble(x = X, y = Y, training_frame = ......................)
}
fit.dl <- h2o.kfold(3, ................)
To clarify: the "h2o.getFutureModel" function is called from the h2o.kfold function, but I cannot find any code for it. Is it a predefined function?
Do you have any clue about what is wrong in my code? I appreciate any help.

This is a portion of the output from R:

 h2o_ensemble <-function(up_sum.hex,X,Y) { 
+  h2o.ensemble(x = X, y = Y,training_frame =up_sum.hex,learner =c("h2o.glm.wrapper", "h2o.randomForest.w ..." ... [TRUNCATED] 

> fit.dl <- h2o.kfold(3, up_sum.hex, 1:89, 90, h2o_ensemble, h2o.predict, TRUE)
[1] 2200    1
  |======================================================================| 100%
[1] "Cross-validating and training base learner 1: h2o.glm.wrapper"
  |======================================================================| 100%
[1] "Cross-validating and training base learner 2: h2o.randomForest.wrapper"
  |======================================================================| 100%
[1] "Cross-validating and training base learner 3: h2o.gbm.wrapper"
  |======================================================================| 100%
[1] "Cross-validating and training base learner 4: h2o.deeplearning.wrapper"
  |======================================================================| 100%
[1] "Metalearning"
  |======================================================================| 100%
  |======================================================================| 100%
[1] "Cross-validating and training base learner 1: h2o.glm.wrapper"
  |======================================================================| 100%
[1] "Cross-validating and training base learner 2: h2o.randomForest.wrapper"
  |======================================================================| 100%
[1] "Cross-validating and training base learner 3: h2o.gbm.wrapper"
  |======================================================================| 100%
[1] "Cross-validating and training base learner 4: h2o.deeplearning.wrapper"
  |======================================================================| 100%
[1] "Metalearning"
  |======================================================================| 100%
  |======================================================================| 100%
[1] "Cross-validating and training base learner 1: h2o.glm.wrapper"
  |======================================================================| 100%
[1] "Cross-validating and training base learner 2: h2o.randomForest.wrapper"
  |======================================================================| 100%
[1] "Cross-validating and training base learner 3: h2o.gbm.wrapper"
  |======================================================================| 100%
[1] "Cross-validating and training base learner 4: h2o.deeplearning.wrapper"
  |======================================================================| 100%
[1] "Metalearning"
  |======================================================================| 100%
Error: could not find function "h2o.getFutureModel"


Erin LeDell

Oct 15, 2015, 2:54:41 AM
to samira Ellouze, H2O Open Source Scalable Machine Learning - h2ostream
Hi Samira,
It seems like you might be using an older version of H2O.  Can you tell me which version you are running?

Check the h2o cluster version by executing:
h2o.clusterInfo()


And if it's not 3.2.0.5 or above, can you upgrade (as follows)?

# The following two commands remove any previously installed H2O packages for R.
if ("package:h2o" %in% search()) { detach("package:h2o", unload=TRUE) }
if ("h2o" %in% rownames(installed.packages())) { remove.packages("h2o") }

# Now we download, install and initialize the H2O package for R.
install.packages("h2o", type="source", repos=(c("http://h2o-release.s3.amazonaws.com/h2o/rel-slater/8/R")))
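
A quick way to compare version strings in R, in case it helps: the helper below (a hypothetical name, not part of h2o) uses base R's package_version so that versions are compared component by component rather than as text. Note that it checks the installed R package, whereas h2o.clusterInfo() reports on the running cluster.

```r
# Hypothetical helper (not an h2o function): TRUE when `installed` is at least `minimum`.
meets_min_version <- function(installed, minimum = "3.2.0.5") {
  package_version(installed) >= package_version(minimum)
}

# Only query the package version if h2o is actually installed
if (requireNamespace("h2o", quietly = TRUE)) {
  installed <- as.character(utils::packageVersion("h2o"))
  print(meets_min_version(installed))
} else {
  message("The h2o package is not installed.")
}
```

Comparing the raw strings instead would be unreliable, because "3.10.0.1" sorts before "3.2.0.5" as text even though it is the newer version.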

Best,
Erin

Erin LeDell

Oct 15, 2015, 2:58:40 AM
to samira Ellouze, H2O Open Source Scalable Machine Learning - h2ostream
Samira,
Also, yes h2o.getFutureModel is a function in the H2O code.  You can view it here: https://github.com/h2oai/h2o-3/blob/master/h2o-r/h2o-package/R/models.R#L155

-Erin



Erin LeDell

Oct 15, 2015, 3:02:59 AM
to samira Ellouze, H2O Open Source Scalable Machine Learning - h2ostream
Samira,
To answer your third question: cross-validation of the base learners and one fit of the metalearner are performed once per iteration.  If k = 3, then you train the ensemble (which includes cross-validating the base learners and training the metalearner) three times in total.
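
To make the cost of this nesting concrete, here is a small arithmetic sketch. It assumes each base learner is fit V times for the inner cross-validation plus once on all of that iteration's training rows (which the "Cross-validating and training base learner" messages suggest, but that is an assumption here), followed by one metalearner fit. The function name is hypothetical.

```r
# Rough count of model fits in nested cross-validation (illustrative arithmetic only).
# Assumption: each base learner = V inner CV fits + 1 fit on the full training rows.
count_fits <- function(k, n_learners, V) {
  per_ensemble <- n_learners * (V + 1) + 1  # base-learner fits plus one metalearner fit
  k * per_ensemble                          # repeated once per outer fold
}

# Setup from this thread: 3 outer folds, 4 base learners, cvControl V = 10
print(count_fits(k = 3, n_learners = 4, V = 10))
```

Under that assumption the run in this thread involves 135 model fits, which is why each outer iteration repeats the full sequence of base-learner and metalearning messages.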

Can you please post your full code (and data if possible), so I can try to reproduce your error?

Thanks,
Erin

samira Ellouze

Oct 15, 2015, 4:14:43 AM
to H2O Open Source Scalable Machine Learning - h2ostream, ellouze...@gmail.com
Hi Erin,
I have h2o 3.2.0.5.
I am sending you the code and a sample of my data.
Thank you very much for your help.

samira Ellouze

Oct 15, 2015, 8:45:01 AM
to H2O Open Source Scalable Machine Learning - h2ostream, ellouze...@gmail.com
Hi Erin,
I sent the code and the sample of data to your email address "er...@h2o.ai".
Thank you.
Best,
Samira

Erin LeDell

Oct 19, 2015, 6:56:37 PM
to samira Ellouze, H2O Open Source Scalable Machine Learning - h2ostream
Hi Samira,

Here is an example of performing cross-validation on `h2o.ensemble`.  Right now you can run this as a stand-alone function, but in the future you can expect an "nfolds" argument to be added to `h2o.ensemble`, similar to the other algos.  This function runs the cv process in a simple loop.

I have only tested with binary classification (demo below), but it should work for regression as well.  Let me know if you have any issues.

kfold_h2o_ensemble <- function(x, y, training_frame, family, learner, metalearner, nfolds = 5,
                               cvControl = list(V = 5, shuffle = TRUE), seed = 1, fold_column = NULL) {
 
  # Note: Only tested on a binary classification ensemble
  # TO DO: Test regression
 
  # Create the cross-validation folds (for external cross-validation)
  N <- nrow(training_frame)
  if (is.numeric(seed)) set.seed(seed)  #If seed is specified, set seed prior to next step
  if (is.null(fold_column)) {
    folds <- as.h2o(sample(rep(seq(nfolds), ceiling(N/nfolds)))[1:N])  # 1-col H2O Frame of fold ids for each row
  } else {
    folds <- training_frame[,c(fold_column)]
  }

  # For storing results
  models <- list()
  preds <- h2o.createFrame(rows = N, cols = 1,
                          randomize = FALSE,
                          value = 0.0,
                          categorical_fraction = 0.0,
                          integer_fraction = 0.0,
                          missing_fraction = 0.0)
 
  for (k in 1:nfolds) {
    print(paste0("Begin outer cross-validation loop: ", k, " of ", nfolds))
   
    # Train an ensemble model on folds != k
    fold_idx_test <- which(as.data.frame((folds==k))[,1]==1)
    fold_idx_train <- which(as.data.frame((folds==k))[,1]==0)
    fold_train <- training_frame[fold_idx_train,]
    fold_test <- training_frame[fold_idx_test,]  
    fold_fit <- h2o.ensemble(x = x, y = y,
                             training_frame = fold_train,
                             family = family,
                             learner = learner,
                             metalearner = metalearner,
                             cvControl = cvControl,
                             seed = seed)
   
    # Generate predictions on the test set
    pp <- predict.h2o.ensemble(fold_fit, fold_test)
   
    # Insert preds into appropriate rows
    if (family == "binomial") {
      preds[fold_idx_test,] <- pp$pred$p1     
    } else if (family == "gaussian") {
      preds[fold_idx_test,] <- pp$pred$predict
    }

    # Collect models
    models[[(length(models)+1)]] <- fold_fit
  }

  # Return the results
  return(list(models = models, folds = folds, preds = preds))
}


# An example of binary classification on a local machine, which cross-validates h2o.ensemble

library(h2oEnsemble)  # Requires version >=0.0.4 of h2oEnsemble
library(cvAUC)  # Used to calculate test set AUC (requires version >=1.0.1 of cvAUC)
localH2O <-  h2o.init(nthreads = -1)  # Start an H2O cluster with nthreads = num cores on your machine

# Import a sample binary outcome train/test set into H2O
train <- h2o.importFile("http://www.stat.berkeley.edu/~ledell/data/higgs_10k.csv")

# Identify response variable and predictor cols
y <- "C1"
x <- setdiff(names(train), y)

# Convert response to a categorical (for binary classification)
family <- "binomial"
train[,y] <- as.factor(train[,y])


# Specify the base learner library & the metalearner
# Let's use a reproducible library (set seed on RF and GBM):
h2o.randomForest.1 <- function(..., ntrees = 100, seed = 1) h2o.randomForest.wrapper(..., ntrees = ntrees, seed = seed)
h2o.gbm.1 <- function(..., ntrees = 100, seed = 1) h2o.gbm.wrapper(..., ntrees = ntrees, seed = seed)
learner <- c("h2o.glm.wrapper", "h2o.randomForest.1", "h2o.gbm.1")
metalearner <- "h2o.glm.wrapper"

# Cross-validate the ensemble with nfolds = 5
# nfolds relates to outer loop cross-validation (cross-validate the ensemble)
# cvControl$V relates to inner loop cross-validation (inside the ensemble)
# Note: nfolds and cvControl$V do not have to be the same number
cve <- kfold_h2o_ensemble(x = x, y = y,
                          training_frame = train,
                          family = family,
                          learner = learner,
                          metalearner = metalearner,
                          nfolds = 5,
                          cvControl = list(V = 5, shuffle = TRUE),
                          seed = 1)

# If we want to calculate the CV AUC of the ensemble,
# we can use the cvAUC package, but that requires us to
# pull the folds, preds and labels into R as follows
library(cvAUC)
folds <- as.data.frame(cve$folds)[,1]  #Folds vector
preds <- as.data.frame(cve$preds)[,1]  #Cross-validated predicted values
labels <- as.data.frame(train[,y])[,1]  #Response vector

auc <- cvAUC(predictions = preds, labels = labels, folds = folds)
auc

#$fold.AUC
#[1] 0.7822530 0.7957805 0.7834457 0.7837479 0.7821255
#
#$cvAUC
#[1] 0.7854705
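
A side note on the fold-assignment line in kfold_h2o_ensemble above: because it shuffles a padded vector and then truncates it to N rows, fold sizes can differ by more than one when N is not a multiple of nfolds. The base-R variant below is only a sketch of an alternative, not what the function above does; make_folds is a hypothetical name.

```r
# Hypothetical alternative for assigning outer folds (base R only).
# rep(..., length.out = N) fixes the fold sizes first; sample() then shuffles them,
# so fold sizes never differ by more than one.
make_folds <- function(N, nfolds, seed = 1) {
  set.seed(seed)
  sample(rep(seq_len(nfolds), length.out = N))
}

folds <- make_folds(N = 10, nfolds = 3)
print(table(folds))  # one fold of size 4 and two of size 3
```

The resulting integer vector could be converted with as.h2o() in the same way as in the function above.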

samira Ellouze

Oct 20, 2015, 3:07:00 AM
to H2O Open Source Scalable Machine Learning - h2ostream, ellouze...@gmail.com
Hi Erin,
Thank you very much for your efforts.
I ran "kfold_h2o_ensemble" and it works correctly.
However, the result does not display MSE, R2, or Mean Residual Deviance for the metalearner on cross-validation data; it displays them only for training data.
How can I display these metrics (MSE, R2, ...) for cross-validation data?

Best,
Samira

samira Ellouze

Oct 20, 2015, 6:13:37 AM
to H2O Open Source Scalable Machine Learning - h2ostream, ellouze...@gmail.com
Hi Erin,
In the final result, I found R2, MSE, and Mean Residual Deviance reported on training data for the metalearner, once for each fold.
I have five folds, which means I found five reports on training data for the metalearner,
but I cannot find any report (R2, MSE, Mean Residual Deviance) for the metalearner on cross-validation data.
Best,
Samira

samira Ellouze

Oct 21, 2015, 9:14:30 AM
to H2O Open Source Scalable Machine Learning - h2ostream, ellouze...@gmail.com
Hi Erin,
I have run kfold_h2o_ensemble many times, and sometimes I encounter the following error:
Got exception 'class java.lang.AssertionError', with msg 'null'
java.lang.AssertionError
        at water.fvec.Vec.chunkForChunkIdx(Vec.java:819)
        at hex.glm.GLMTask$GLMGradientTask.map(GLMTask.java:495)
        at water.MRTask.compute2(MRTask.java:657)
        at water.H2O$H2OCountedCompleter.compute(H2O.java:1017)
        at jsr166y.CountedCompleter.exec(CountedCompleter.java:429)
        at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
        at jsr166y.ForkJoinPool$WorkQueue.pollAndExecAll(ForkJoinPool.java:914)
        at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:979)
        at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1477)
        at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)

Error: 'null'

Can you help me solve this error?

Best,
Samira

Erin LeDell

Oct 21, 2015, 2:32:10 PM
to samira Ellouze, H2O Open Source Scalable Machine Learning - h2ostream
Samira,
h2o.ensemble does not automatically calculate performance metrics such as R2, MSE, AUC... (that is currently unimplemented, but documented here: https://0xdata.atlassian.net/browse/PUBDEV-2237).

Until that ticket is closed, you will have to calculate those metrics manually in R, as was done with cross-validated AUC in my earlier example.

-Erin

samira Ellouze

Oct 21, 2015, 11:01:08 PM
to H2O Open Source Scalable Machine Learning - h2ostream, ellouze...@gmail.com
Hi Erin,
Thank you for your help, I really appreciate it.

I used the instructions you gave to calculate performance metrics, but I got this error:

# If we want to calculate the CV AUC of the ensemble, 
> # we can use the cvAUC package, but that requires us to 
> # pull the preds, labels and int .... [TRUNCATED] 

> folds <- as.data.frame(fit$folds)[,1]  #Folds vector

> preds <- as.data.frame(fit$preds)[,1]  #Cross-validated predicted values

> labels <- as.data.frame(up_sum.hex[,Y])[,1]  #Response vector

> auc <- cvAUC(predictions = preds, labels = labels, folds = folds)
Error in prediction(predictions = predictions, labels = labels, label.ordering = label.ordering) : 
  Number of classes is not equal to 2.
ROCR currently supports only evaluation of binary classification tasks.

Can you help me to solve this error?


Also, I sometimes get this error:

Got exception 'class java.lang.AssertionError', with msg 'null'
java.lang.AssertionError
        at water.fvec.Vec.chunkForChunkIdx(Vec.java:819)
        at hex.glm.GLMTask$GLMGradientTask.map(GLMTask.java:495)
        at water.MRTask.compute2(MRTask.java:657)
        at water.H2O$H2OCountedCompleter.compute(H2O.java:1017)
        at jsr166y.CountedCompleter.exec(CountedCompleter.java:429)
        at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
        at jsr166y.ForkJoinPool$WorkQueue.pollAndExecAll(ForkJoinPool.java:914)
        at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:979)
        at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1477)
        at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)

Error: 'null'

Do you have any idea how to avoid this error?

Erin LeDell

Oct 22, 2015, 12:24:38 AM
to samira Ellouze, H2O Open Source Scalable Machine Learning - h2ostream
Hi Samira,
Are you doing a binary classification problem or a regression problem?  Let me know what type of data you have in your response vector.  Is it real valued, or binary (0/1 or something equivalent)?

Using the `labels` vector from the earlier code example, you could type the following to count the number of unique elements in your response vector:
length(unique(labels))  # this should equal 2; if not, you will see the error you reported

However, if you are doing a regression problem, then you would probably be interested in calculating the cross-validated MSE or R2.

# Calculate MSE and CV MSE

mse <- function(pred_y, true_y) {
 return(mean((pred_y - true_y)^2))
}

cv_mse <- function(pred_y, true_y, folds) {
  fold_ids <- sort(unique(folds))
  fold_mse <- sapply(fold_ids, function(i) mse(pred_y[folds==i], true_y[folds==i]))
  print("fold MSE:")
  print(fold_mse)
  print("CV MSE")
  return(mean(fold_mse))
}

cv_mse(pred_y, true_y, folds)  #will return CV MSE assuming `pred_y` and `true_y` are numeric vectors
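
Since R2 is mentioned above but only MSE is shown, here is a sketch of an analogous cross-validated R2 in the same style. It uses the definition R2 = 1 - SSE/SST with each fold's own mean as the baseline; H2O's reported R2 may be computed differently, and the toy vectors are made up for illustration.

```r
# R2 as 1 - SSE/SST (assumed definition; not necessarily how H2O reports R2)
r2 <- function(pred_y, true_y) {
  1 - sum((true_y - pred_y)^2) / sum((true_y - mean(true_y))^2)
}

# Per-fold R2 and their average, mirroring cv_mse above
cv_r2 <- function(pred_y, true_y, folds) {
  fold_ids <- sort(unique(folds))
  fold_r2 <- sapply(fold_ids, function(i) r2(pred_y[folds == i], true_y[folds == i]))
  list(fold_r2 = fold_r2, cv_r2 = mean(fold_r2))
}

# Toy numeric vectors (made up for illustration)
true_y <- c(1, 2, 3, 4, 5, 6)
pred_y <- c(1.1, 1.9, 3.2, 3.8, 5.1, 5.9)
folds  <- c(1, 2, 1, 2, 1, 2)
res <- cv_r2(pred_y, true_y, folds)
print(res)
```

With very small folds the per-fold SST can be tiny, which makes fold-level R2 unstable, so the per-fold values are worth inspecting alongside the average.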


-Erin

samira Ellouze

Oct 30, 2015, 8:20:00 AM
to H2O Open Source Scalable Machine Learning - h2ostream, ellouze...@gmail.com
Hi Erin,
Thank you for your help. I would like to calculate correlation rather than MSE, so I wrote a function that calculates correlation:
folds <- as.data.frame(fit$folds)[,1]  #Folds vector
pred_y <- as.data.frame(fit$preds)[,1]  #Cross-validated predicted values
true_y <- as.data.frame(up_sum.hex[,Y])[,1]

correlation <- function(pred_y, true_y) {
 return(cor(pred_y, true_y))
}
cv_correlation <- correlation(pred_y, true_y)
print("cross-validation correlation:")
print(cv_correlation)

I would like to know whether this method of calculating correlation for cross-validation is correct.
I would also like to know whether R2 can be negative (or greater than 1), because I found R2 = -2.136219:


H2ORegressionMetrics: glm
** Reported on cross-validation data. **
Description: 5-fold cross-validation on training data

MSE:  0.01187553
R2 :  -2.136219
Mean Residual Deviance :  0.01187553
Null Deviance :0.175854
Null D.o.F. :44
Residual Deviance :0.5343988
Residual D.o.F. :5
AIC :10.20708

Could you please answer these questions?

Best,
Samira
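
For readers with the same two questions, a hedged sketch follows. The code in the post computes one pooled correlation over all cross-validated predictions; a per-fold average in the style of cv_mse is another common summary, and the two need not agree. Separately, R2 defined as 1 - SSE/SST cannot exceed 1, but it is negative whenever the predictions have a larger squared error than simply predicting the mean. The function name cv_cor and the toy vectors are assumptions made for illustration, and H2O's reported R2 may use a different formula.

```r
# Per-fold correlation, its average, and the pooled correlation for comparison
cv_cor <- function(pred_y, true_y, folds) {
  fold_ids <- sort(unique(folds))
  fold_cor <- sapply(fold_ids, function(i) cor(pred_y[folds == i], true_y[folds == i]))
  list(fold_cor = fold_cor,
       mean_fold_cor = mean(fold_cor),
       pooled_cor = cor(pred_y, true_y))
}

# Toy vectors (made up for illustration)
true_y <- c(1, 2, 3, 4, 5, 6)
pred_y <- c(1.2, 1.8, 3.3, 3.9, 4.7, 6.4)
folds  <- c(1, 2, 1, 2, 1, 2)
res <- cv_cor(pred_y, true_y, folds)
print(res)

# R2 = 1 - SSE/SST goes negative when predictions are worse than the mean
bad_pred <- c(6, 5, 4, 3, 2, 1)
neg_r2 <- 1 - sum((true_y - bad_pred)^2) / sum((true_y - mean(true_y))^2)
print(neg_r2)  # -3 for these toy values
```

A negative R2 on cross-validation data therefore indicates a model that generalizes worse than a constant prediction, rather than a calculation error.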