Get the prediction from cross validation in h2o


Vinh DANG

Feb 2, 2016, 9:50:02 AM
to H2O Open Source Scalable Machine Learning - h2ostream
Hello all

I am using h2o.randomForest with cross validation.

After training the model, I can get the confusion matrix for all rows in my data, but how can I get the predictions as a vector in R?

Thank you very much

Tom Kraljevic

Feb 2, 2016, 12:29:30 PM
to Vinh DANG, H2O Open Source Scalable Machine Learning - h2ostream

Generate it again.
h2o.predict()


Vinh Dang

Feb 2, 2016, 12:34:45 PM
to Tom Kraljevic, H2O Open Source Scalable Machine Learning - h2ostream
Hello

It seems to me that h2o.predict() returns the probability for each class. Is there a parameter that forces h2o.predict() to return the predicted label instead?
Best Regards

Vinh DANG

Erin LeDell

Feb 2, 2016, 12:37:51 PM
to Vinh Dang, Tom Kraljevic, H2O Open Source Scalable Machine Learning - h2ostream
The predicted label is always in the first column.  Just pull out the first column of the frame.
-- 
Erin LeDell Ph.D.
Statistician & Machine Learning Scientist | H2O.ai

Vinh Dang

Feb 2, 2016, 12:41:59 PM
to Erin LeDell, Tom Kraljevic, H2O Open Source Scalable Machine Learning - h2ostream
Thank you Erin

However, I think I did something wrong.

Let’s say I have a data frame all_data

data = as.h2o(all_data)

rf = h2o.randomForest(x = c("V2", "V3"), y = "V1", training_frame = data, nfolds = 5)

Then

pre = h2o.predict(rf, newdata = data)

table(all_data$V1, as.vector(pre[,1]))

which gives me a confusion matrix with ~100% accuracy, which is unbelievable.

(If I print(rf), the accuracy is ~60% on 5 folds, which makes sense.)
Best Regards

Vinh DANG
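
A note on the ~100% figure: calling h2o.predict() on the same frame used for training scores the final model on rows it has already seen, so the result is a training accuracy and not a cross-validated one. The sketch below illustrates the same effect without h2o, using a hand-written 1-nearest-neighbour classifier on simulated data (both are illustrative assumptions, not part of the thread). It compares accuracy on the training rows with accuracy from out-of-fold predictions.

```r
set.seed(1)

# Simulated two-class data (hypothetical, for illustration only).
n <- 200
x <- matrix(rnorm(n * 2), ncol = 2)
y <- factor(ifelse(x[, 1] + rnorm(n) > 0, "a", "b"))

# 1-nearest-neighbour classifier: it memorises the training rows.
predict_1nn <- function(train_x, train_y, new_x) {
  idx <- apply(new_x, 1, function(row) {
    # squared Euclidean distance from `row` to every training row
    which.min(colSums((t(train_x) - row)^2))
  })
  train_y[idx]
}

# Scoring the training rows themselves: every row is its own neighbour.
train_acc <- mean(predict_1nn(x, y, x) == y)

# 5-fold cross-validation: each row is predicted by a model that never saw it.
k <- 5
fold <- sample(rep(seq_len(k), length.out = n))
oof <- factor(rep(NA, n), levels = levels(y))
for (f in seq_len(k)) {
  held <- fold == f
  oof[held] <- predict_1nn(x[!held, , drop = FALSE], y[!held],
                           x[held, , drop = FALSE])
}
cv_acc <- mean(oof == y)

print(c(training = train_acc, cross_validated = cv_acc))
```

The vector `oof` has one prediction per row of the data, each made by a model that did not train on that row, which is the kind of vector a cross-validated confusion matrix should be built from.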

Tom Kraljevic

Feb 2, 2016, 12:42:22 PM
to Vinh Dang, H2O Open Source Scalable Machine Learning - h2ostream

the first column is the chosen class based on a default threshold.
but most people inspect the per class probability and do their own thresholding.
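
A minimal sketch of what "pull out the first column" and "do your own thresholding" can look like once the prediction frame has been converted with as.data.frame(). The data frame below is a hand-made stand-in, and its column names (`predict`, `no`, `yes`) and the 0.3 threshold are assumptions chosen for illustration.

```r
# Hypothetical stand-in for as.data.frame(h2o.predict(model, newdata)):
# a label column followed by one probability column per class.
pred <- data.frame(
  predict = c("no", "yes", "no", "yes"),
  no  = c(0.90, 0.40, 0.65, 0.20),
  yes = c(0.10, 0.60, 0.35, 0.80),
  stringsAsFactors = FALSE
)

# 1. The label chosen by the model is simply the first column.
labels <- pred[[1]]

# 2. Custom threshold on the probability of the positive class.
threshold <- 0.3  # illustrative value only
custom <- ifelse(pred$yes >= threshold, "yes", "no")

# 3. Multi-class style: pick the class with the largest probability.
prob_cols <- setdiff(names(pred), "predict")
argmax <- prob_cols[max.col(as.matrix(pred[prob_cols]), ties.method = "first")]

print(data.frame(labels, custom, argmax))
```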

Vinh Dang

Feb 2, 2016, 12:52:06 PM
to Erin LeDell, Tom Kraljevic, H2O Open Source Scalable Machine Learning - h2ostream
Hello Erin and Tom

The idea is, I want to calculate ROC AUC of a random forest classifier.

However, as far as I can see, it is not reported by h2o, so I want to pull out all the predictions that h2o.randomForest made during the 5 folds; in the end I would have a vector of predictions with length equal to the number of rows in my data.

Then I will use this vector to calculate ROC AUC by myself.
Best Regards

Vinh DANG
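
For the out-of-fold vector described above, one possible route is to ask h2o to keep the holdout predictions made during cross-validation. The sketch below assumes the installed h2o release supports the `keep_cross_validation_predictions` argument and the `h2o.cross_validation_holdout_predictions()` function; the 3.6 series discussed in this thread may not, so check the documentation for your version. The wrapper name `cv_holdout_predictions` is hypothetical.

```r
# Sketch: train a random forest with k-fold cross-validation and return the
# out-of-fold predictions as a plain R data frame.
# Assumes the installed h2o version provides
# `keep_cross_validation_predictions` and
# `h2o.cross_validation_holdout_predictions()`.
cv_holdout_predictions <- function(frame, x, y, nfolds = 5, ...) {
  if (!requireNamespace("h2o", quietly = TRUE)) {
    stop("the h2o package is required")
  }
  model <- h2o::h2o.randomForest(
    x = x, y = y, training_frame = frame,
    nfolds = nfolds,
    keep_cross_validation_predictions = TRUE,
    ...
  )
  # One row per training row; each prediction comes from the fold model
  # that did not see that row.
  holdout <- h2o::h2o.cross_validation_holdout_predictions(model)
  as.data.frame(holdout)
}

# Usage (needs a running cluster, so it is left commented out):
# h2o::h2o.init()
# data <- h2o::as.h2o(all_data)
# oof  <- cv_holdout_predictions(data, x = 2:21, y = 1)
# table(all_data[[1]], oof$predict)
```

The returned frame should line up row for row with the training frame, so its label column can be tabulated against the true response and its probability columns can be passed to any AUC routine.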

Tom Kraljevic

Feb 2, 2016, 1:13:25 PM
to Vinh Dang, Erin LeDell, H2O Open Source Scalable Machine Learning - h2ostream

to get the auc, call

h2o.performance()
and 
h2o.auc()

you can, of course, also calculate it by hand if you want to.
this is a good discussion to read if you choose that path:

Vinh Dang

Feb 2, 2016, 1:15:01 PM
to Tom Kraljevic, Erin LeDell, H2O Open Source Scalable Machine Learning - h2ostream
Hello Tom

In my case, h2o.auc (rf) returns NULL.


Best Regards

Vinh DANG

Tom Kraljevic

Feb 2, 2016, 1:25:47 PM
to Vinh Dang, Erin LeDell, H2O Open Source Scalable Machine Learning - h2ostream


Try h2o.auc(xval=TRUE)

Tom


h2o.auc {h2o}    R Documentation

Retrieve the AUC

Description

Retrieves the AUC value from an H2OBinomialMetrics. If "train", "valid", and "xval" parameters are FALSE (default), then the training AUC value is returned. If more than one parameter is set to TRUE, then a named vector of AUCs are returned, where the names are "train", "valid" or "xval".

Usage

h2o.auc(object, train = FALSE, valid = FALSE, xval = FALSE, ...)

Arguments

object    An H2OBinomialMetrics object.
train     Retrieve the training AUC
valid     Retrieve the validation AUC
xval      Retrieve the cross-validation AUC
...       extra arguments to be passed if 'object' is of type H2OModel (e.g. train=TRUE)


Vinh Dang

Feb 2, 2016, 1:28:10 PM
to Tom Kraljevic, Erin LeDell, H2O Open Source Scalable Machine Learning - h2ostream
Hello

Yes, I tried both of them, and here is the result:

> h2o.auc(rf_h2o)
NULL
> h2o.auc(rf_h2o, xval = TRUE)
Error in names(v) <- v_names : attempt to set an attribute on NULL


My rf_h2o

Hit Ratio Table: Extract with `h2o.hit_ratio_table(<model>, <data>)`
=======================================================================
Top-6 Hit Ratios: 
  k hit_ratio
1 1  0.635951
2 2  0.867441
3 3  0.959588
4 4  0.993948
5 5  0.998975
6 6  1.000000


rf_h2o = h2o.randomForest(x = 2:21, y = 1, training_frame = data, ntrees = 450, nfolds = 5)

Best Regards

Vinh DANG

Tom Kraljevic

Feb 2, 2016, 1:37:46 PM
to Vinh Dang, Erin LeDell, H2O Open Source Scalable Machine Learning - h2ostream

Here is another example from the glm booklet.
It runs for me on h2o 3.6.0.8.


library(h2o)
h2o.init()
path = system.file("extdata", "prostate.csv", package = "h2o")
h2o_df = h2o.importFile(path)
h2o_df$CAPSULE = as.factor(h2o_df$CAPSULE)
binomial.fit = h2o.glm(y = "CAPSULE", x = c("AGE", "RACE", "PSA", "GLEASON"), training_frame = h2o_df, family = "binomial", nfolds = 5)
print(binomial.fit)
print(paste("training auc:        ", binomial.fit@model$training_metrics@metrics$AUC))
print(paste("cross-validation auc:", binomial.fit@model$cross_validation_metrics@metrics$AUC))

[1] "training auc:         0.79276438916242"
[1] "cross-validation auc: 0.783090034839193"


Vinh Dang

Feb 2, 2016, 2:14:09 PM
to Tom Kraljevic, Erin LeDell, H2O Open Source Scalable Machine Learning - h2ostream
Well, I could run your example also

Maybe the problem comes from multi-class classification in my case? Is it possible to calculate AUC with multi-class classification?
Best Regards

Vinh DANG

Tom Kraljevic

Feb 2, 2016, 2:23:18 PM
to Vinh Dang, Erin LeDell, H2O Open Source Scalable Machine Learning - h2ostream

no, auc is only valid for binomial classification.

Vinh Dang

Feb 2, 2016, 2:32:24 PM
to Tom Kraljevic, Erin LeDell, H2O Open Source Scalable Machine Learning - h2ostream
Thank you very much. I ask because some people are asking me to calculate AUC for multi-class classification :’(

If possible, could you give me a reference for this fact (a paper or book)? Of course, it is totally optional, and I understand that you have already helped me a lot.
Best Regards

Vinh DANG

Vinh Dang

Feb 2, 2016, 2:35:08 PM
to Tom Kraljevic, Erin LeDell, H2O Open Source Scalable Machine Learning - h2ostream
However, I think multi class AUC is available at https://cran.r-project.org/web/packages/pROC/pROC.pdf 

It implements the idea from

David J. Hand and Robert J. Till (2001). A Simple Generalisation of the Area Under the ROC Curve for Multiple Class Classification Problems. Machine Learning 45(2), p. 171–186. DOI: 10.1023/A:1010920819831.
Best Regards

Vinh DANG


Vinh Dang

Feb 2, 2016, 2:43:15 PM
to Tom Kraljevic, Erin LeDell, H2O Open Source Scalable Machine Learning - h2ostream
Therefore, to calculate AUC for multi class classification, it would be good if I can extract the prediction values from h2o.randomForest object ...
Best Regards

Vinh DANG

Erin LeDell

Feb 2, 2016, 7:27:53 PM
to Vinh Dang, Tom Kraljevic, H2O Open Source Scalable Machine Learning - h2ostream
Vinh,
There are a few definitions of multiclass AUC and the Hand and Till version is one of the most popular.

If you want to extract predictions generated on a test set, you use the h2o.predict() function as Tom mentioned above.  Look at the R docs for how to use that function.

-Erin
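
Once per-class probabilities have been extracted, the Hand and Till (2001) measure cited earlier in the thread can be computed in a few lines of base R. This is a minimal sketch, not the pROC implementation: it averages the pairwise AUCs over all unordered class pairs, using the rank-based (Mann-Whitney) form of the binary AUC. The function names and the toy probability matrix are hypothetical, and the matrix columns are assumed to be named after the class labels.

```r
# Binary AUC from ranks (Mann-Whitney form); average ranks handle ties.
binary_auc <- function(score, is_pos) {
  r <- rank(score)
  n_pos <- sum(is_pos)
  n_neg <- sum(!is_pos)
  (sum(r[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Hand & Till M: mean over class pairs {i, j} of (A(i|j) + A(j|i)) / 2,
# where A(i|j) uses the probability of class i to separate rows of class i
# from rows of class j.
hand_till_auc <- function(probs, truth) {
  pairs <- combn(colnames(probs), 2)
  a <- apply(pairs, 2, function(p) {
    keep <- truth %in% p
    a_ij <- binary_auc(probs[keep, p[1]], truth[keep] == p[1])
    a_ji <- binary_auc(probs[keep, p[2]], truth[keep] == p[2])
    (a_ij + a_ji) / 2
  })
  mean(a)
}

# Toy example (made-up numbers): every row gets its highest probability
# on its true class, so all pairwise AUCs equal 1.
truth <- c("a", "a", "b", "b", "c", "c")
probs <- rbind(c(0.8, 0.1, 0.1), c(0.7, 0.2, 0.1),
               c(0.1, 0.8, 0.1), c(0.2, 0.6, 0.2),
               c(0.1, 0.1, 0.8), c(0.2, 0.2, 0.6))
colnames(probs) <- c("a", "b", "c")

m <- hand_till_auc(probs, truth)
print(m)
```

With a real model, `probs` would be the per-class probability columns of the prediction frame converted to a matrix, and `truth` the observed response for the same rows.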