Get the prediction from cross validation in h2o

Vinh DANG

Feb 2, 2016, 9:50:02 AM

to H2O Open Source Scalable Machine Learning - h2ostream

Hello all

I am using h2o.randomForest with cross validation.

After training the model, I can get the confusion matrix for all rows in my data, but how can I get the predictions as a vector in R?

Thank you very much

Tom Kraljevic

Feb 2, 2016, 12:29:30 PM

to Vinh DANG, H2O Open Source Scalable Machine Learning - h2ostream

Generate it again:

h2o.predict()

Vinh Dang

Feb 2, 2016, 12:34:45 PM

to Tom Kraljevic, H2O Open Source Scalable Machine Learning - h2ostream

Hello

It seems to me that h2o.predict() returns the probability for each class. Is there a parameter that forces h2o.predict() to return the predicted class instead?

—

Best Regards

Vinh DANG

Erin LeDell

Feb 2, 2016, 12:37:51 PM

to Vinh Dang, Tom Kraljevic, H2O Open Source Scalable Machine Learning - h2ostream

The predicted label is always in the first column. Just pull out the first column of the frame.
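
A minimal sketch of what pulling out the first column looks like in R. It assumes a running local H2O cluster and uses the built-in iris data as a stand-in for the poster's data; the object names (model, pred, labels, probs) are illustrative only.

```r
library(h2o)
h2o.init()

# iris ships with R; Species has three levels, so this is a multi-class problem
data <- as.h2o(iris)
model <- h2o.randomForest(x = 1:4, y = "Species", training_frame = data,
                          ntrees = 20, seed = 1)

# The prediction frame has a "predict" column followed by one probability column per class
pred <- h2o.predict(model, newdata = data)

labels <- as.vector(pred[, 1])                 # predicted labels as a plain R vector
probs  <- as.data.frame(pred[, 2:ncol(pred)])  # per-class probabilities as an R data.frame
```

Note that these are predictions on the training rows, so they say nothing about out-of-sample accuracy.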

-- 
Erin LeDell Ph.D.
Statistician & Machine Learning Scientist | H2O.ai

Vinh Dang

Feb 2, 2016, 12:41:59 PM

to Erin LeDell, Tom Kraljevic, H2O Open Source Scalable Machine Learning - h2ostream

Thank you Erin

However, I think I did something wrong.

Let’s say I have a data frame all_data:

data = as.h2o(all_data)

rf = h2o.randomForest(x = c("V2", "V3"), y = "V1", training_frame = data, nfolds = 5)

Then

pre = h2o.predict(rf, newdata = data)

table(all_data$V1, as.vector(pre[, 1]))

which gives me a confusion matrix with ~100% accuracy, which is unbelievable.

(If I print(rf), the accuracy is ~60% over the 5 folds, which makes sense.)

—

Best Regards

Vinh DANG

Tom Kraljevic

Feb 2, 2016, 12:42:22 PM

to Vinh Dang, H2O Open Source Scalable Machine Learning - h2ostream

The first column is the chosen class based on a default threshold.

But most people inspect the per-class probabilities and do their own thresholding.

Vinh Dang

Feb 2, 2016, 12:52:06 PM

to Erin LeDell, Tom Kraljevic, H2O Open Source Scalable Machine Learning - h2ostream

Hello Erin and Tom

The idea is that I want to calculate the ROC AUC of a random forest classifier.

However, as far as I can see, h2o does not report it, so I want to pull out all the holdout predictions that h2o.randomForest made during the 5 folds. In the end I will have a vector of predictions whose length equals the number of rows in my data.

Then I will use this vector to calculate the ROC AUC myself.
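
Scoring the training frame with h2o.predict() uses the final model, which has already seen every row, so it cannot reproduce the cross-validated accuracy. Later h2o releases can keep the holdout predictions made during cross-validation instead. The sketch below assumes a release that supports the keep_cross_validation_predictions argument and the h2o.cross_validation_holdout_predictions() accessor (they may not be available in the 3.6 line discussed in this thread), and uses iris as a stand-in dataset.

```r
library(h2o)
h2o.init()

data <- as.h2o(iris)
rf <- h2o.randomForest(
  x = 1:4, y = "Species", training_frame = data,
  nfolds = 5,
  keep_cross_validation_predictions = TRUE,  # retain the per-fold holdout predictions
  seed = 1
)

# One row per training row; each row is scored by a fold model that did not train on it
cv_pred <- h2o.cross_validation_holdout_predictions(rf)
cv_labels <- as.vector(cv_pred[, 1])

# Cross-validated confusion matrix; rows should line up with the training frame
print(table(actual = iris$Species, predicted = cv_labels))
```

The remaining columns of cv_pred hold the per-class probabilities, which is what a hand-rolled AUC calculation needs.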

—

Best Regards

Vinh DANG

Tom Kraljevic

Feb 2, 2016, 1:13:25 PM

to Vinh Dang, Erin LeDell, H2O Open Source Scalable Machine Learning - h2ostream

To get the AUC, call

h2o.performance()

and

h2o.auc()

You can, of course, also calculate it by hand if you want to.

This is a good discussion to read if you choose that path:

http://stats.stackexchange.com/questions/145566/how-to-calculate-area-under-the-curve-auc-or-the-c-statistic-by-hand
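
As a rough illustration of the by-hand route for the binomial case, here is a base-R sketch that computes AUC from ranks (the Mann-Whitney U formulation). The function name auc_by_hand and the toy labels and scores are made up for the example.

```r
# AUC as the probability that a random positive is scored above a random negative,
# computed from the rank sum of the positives (Mann-Whitney U statistic)
auc_by_hand <- function(labels, scores) {
  stopifnot(length(labels) == length(scores))
  pos <- labels == 1
  n_pos <- sum(pos)
  n_neg <- sum(!pos)
  r <- rank(scores)  # ties receive average ranks
  (sum(r[pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

labels <- c(0, 0, 1, 1)
scores <- c(0.1, 0.4, 0.35, 0.8)
print(auc_by_hand(labels, scores))  # 0.75
```

Three of the four positive/negative pairs are ordered correctly here, hence 0.75.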

Vinh Dang

Feb 2, 2016, 1:15:01 PM

to Tom Kraljevic, Erin LeDell, H2O Open Source Scalable Machine Learning - h2ostream

Hello Tom

In my case, h2o.auc(rf) returns NULL.

—

Best Regards

Vinh DANG

Tom Kraljevic

Feb 2, 2016, 1:25:47 PM

to Vinh Dang, Erin LeDell, H2O Open Source Scalable Machine Learning - h2ostream

Try h2o.auc(rf, xval = TRUE)

Tom

h2o.auc {h2o}

R Documentation

Retrieve the AUC

Description

Retrieves the AUC value from an H2OBinomialMetrics. If "train", "valid", and "xval" parameters are FALSE (default), then the training AUC value is returned. If more than one parameter is set to TRUE, then a named vector of AUCs are returned, where the names are "train", "valid" or "xval".

Usage

h2o.auc(object, train = FALSE, valid = FALSE, xval = FALSE, ...)

Arguments

`object`	An H2OBinomialMetrics object.
`train`	Retrieve the training AUC
`valid`	Retrieve the validation AUC
`xval`	Retrieve the cross-validation AUC
`...`	extra arguments to be passed if 'object' is of type H2OModel (e.g. train=TRUE)

Vinh Dang

Feb 2, 2016, 1:28:10 PM

to Tom Kraljevic, Erin LeDell, H2O Open Source Scalable Machine Learning - h2ostream

Hello

Yes, I tried both of them, and here is the result:

> h2o.auc(rf_h2o)
NULL
> h2o.auc(rf_h2o, xval = TRUE)
Error in names(v) <- v_names : attempt to set an attribute on NULL

My rf_h2o

…

Hit Ratio Table: Extract with `h2o.hit_ratio_table(<model>, <data>)`
=======================================================================
Top-6 Hit Ratios:
  k hit_ratio
1 1  0.635951
2 2  0.867441
3 3  0.959588
4 4  0.993948
5 5  0.998975
6 6  1.000000

The model was built with:

rf_h2o = h2o.randomForest(x = 2:21, y = 1, training_frame = data, ntrees = 450, nfolds = 5)

—

Best Regards

Vinh DANG

Tom Kraljevic

Feb 2, 2016, 1:37:46 PM

to Vinh Dang, Erin LeDell, H2O Open Source Scalable Machine Learning - h2ostream

Here is another example from the glm booklet.

It runs for me on h2o 3.6.0.8.

library(h2o)
h2o.init()
path = system.file("extdata", "prostate.csv", package = "h2o")
h2o_df = h2o.importFile(path)
h2o_df$CAPSULE = as.factor(h2o_df$CAPSULE)
binomial.fit = h2o.glm(y = "CAPSULE", x = c("AGE", "RACE", "PSA", "GLEASON"), training_frame = h2o_df, family = "binomial", nfolds = 5)
print(binomial.fit)
print(paste("training auc:        ", binomial.fit@model$training_metrics@metrics$AUC))
print(paste("cross-validation auc:", binomial.fit@model$cross_validation_metrics@metrics$AUC))

[1] "training auc: 0.79276438916242"
[1] "cross-validation auc: 0.783090034839193"
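
The same two numbers should also be reachable through the h2o.auc() accessor rather than the model slots. A sketch under the same assumptions as the example above (running H2O cluster, prostate.csv bundled with the h2o package); the names fit, aucs and xval_perf are illustrative.

```r
library(h2o)
h2o.init()

# Same prostate data and GLM as in the example above (binomial response)
path <- system.file("extdata", "prostate.csv", package = "h2o")
h2o_df <- h2o.importFile(path)
h2o_df$CAPSULE <- as.factor(h2o_df$CAPSULE)
fit <- h2o.glm(y = "CAPSULE", x = c("AGE", "RACE", "PSA", "GLEASON"),
               training_frame = h2o_df, family = "binomial", nfolds = 5)

# Asking for more than one metric set returns a named vector ("train", "xval")
aucs <- h2o.auc(fit, train = TRUE, xval = TRUE)
print(aucs)

# Equivalent route through a metrics object
xval_perf <- h2o.performance(fit, xval = TRUE)
print(h2o.auc(xval_perf))
```

This only works for a two-level response; for a multi-class model there is no binomial metrics object to read the AUC from.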

Vinh Dang

Feb 2, 2016, 2:14:09 PM

to Tom Kraljevic, Erin LeDell, H2O Open Source Scalable Machine Learning - h2ostream

Well, I could run your example also.

Maybe the problem comes from multi-class classification in my case? Is it possible to calculate AUC with multi-class classification?

—

Best Regards

Vinh DANG

Tom Kraljevic

Feb 2, 2016, 2:23:18 PM

to Vinh Dang, Erin LeDell, H2O Open Source Scalable Machine Learning - h2ostream

> On Feb 2, 2016, at 11:14 AM, Vinh Dang <dqvi...@gmail.com> wrote:
>
> Well, I could run your example also.
>
> Maybe the problem comes from multi-class classification in my case? Is it possible to calculate AUC with multi-class classification?

No, AUC is only valid for binomial classification.

Vinh Dang

Feb 2, 2016, 2:32:24 PM

to Tom Kraljevic, Erin LeDell, H2O Open Source Scalable Machine Learning - h2ostream

Thank you very much. I ask because some people are asking me to calculate AUC for multi-class classification :’(

If possible, could you give me a reference for this fact (a paper or a book)? Of course, this is entirely optional; you have already helped me a lot.

—

Best Regards

Vinh DANG

Vinh Dang

Feb 2, 2016, 2:35:08 PM

to Tom Kraljevic, Erin LeDell, H2O Open Source Scalable Machine Learning - h2ostream

However, I think multi-class AUC is available in the pROC package: https://cran.r-project.org/web/packages/pROC/pROC.pdf

It implements the approach described in

David J. Hand and Robert J. Till (2001). A Simple Generalisation of the Area Under the ROC Curve for Multiple Class Classification Problems. Machine Learning 45(2), p. 171–186. DOI: 10.1023/A:1010920819831.
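
A hedged sketch of what the pROC route could look like. It assumes a pROC version whose multiclass.roc() accepts a matrix of per-class probabilities with the class levels as column names. To stay self-contained it does not use h2o: leave-one-out posteriors from MASS::lda on iris serve as a stand-in for a model's class probabilities, and the object names are illustrative.

```r
library(MASS)  # lda(); used only to produce some class probabilities
library(pROC)  # multiclass.roc()

# Stand-in for a classifier's per-class probabilities:
# leave-one-out posterior probabilities from linear discriminant analysis
fit <- lda(Species ~ ., data = iris, CV = TRUE)
probs <- fit$posterior  # matrix with columns setosa, versicolor, virginica

# Multi-class AUC in the spirit of Hand and Till (2001)
mc <- multiclass.roc(response = iris$Species, predictor = probs)
mc_auc <- as.numeric(auc(mc))
print(mc_auc)
```

With h2o, the probability columns of a prediction frame, converted with as.data.frame(), would take the place of probs.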

—

Best Regards

Vinh DANG

Vinh Dang

Feb 2, 2016, 2:43:15 PM

to Tom Kraljevic, Erin LeDell, H2O Open Source Scalable Machine Learning - h2ostream

Therefore, to calculate AUC for multi-class classification, it would be good if I could extract the predicted values from the h2o.randomForest object ...

—

Best Regards

Vinh DANG

Erin LeDell

Feb 2, 2016, 7:27:53 PM

to Vinh Dang, Tom Kraljevic, H2O Open Source Scalable Machine Learning - h2ostream

Vinh,
There are a few definitions of multiclass AUC and the Hand and Till version is one of the most popular.

If you want to extract predictions generated on a test set, you use the h2o.predict() function as Tom mentioned above. Look at the R docs for how to use that function.

-Erin
