How to run prediction on a new dataframe without values for y (the predicted variable)

agen...@gmail.com

Mar 7, 2016, 12:12:35 PM
to H2O Open Source Scalable Machine Learning - h2ostream
I've used GBM to learn a model for classifying some data. I wanted to test it on another dataframe I loaded. The predicted variable is named "verdict". I set y to this when learning the model. The dataframe I want to test on doesn't have values for this; the whole point is that this is what I want to predict.

When I call h2o on this dataframe it says:
Test/Validation dataset has a categorical response column 'verdict' with no levels in common with the model

Do I need to remove columns from the dataframe I'm testing on so that it matches the X values in the original call to learn the model? Is there a canonical way to do this? I'd rather not alter the dataframe, but it's huge, so making a copy with the necessary columns isn't really great.

thanks

Spencer Aiello

Mar 7, 2016, 12:37:17 PM
to agen...@gmail.com, H2O Open Source Scalable Machine Learning - h2ostream
Hi, which interface are you using to drive this?

From Python you could do:

     my_model.predict(my_new_data)

or from R:

     h2o.predict(my_model, my_new_data)


Could you give these a try?

agen...@gmail.com

Mar 7, 2016, 1:36:11 PM
to H2O Open Source Scalable Machine Learning - h2ostream, agen...@gmail.com
I was using the H2O web interface: inspect the model, click "predict", then select the new dataframe that I loaded.

I removed the offending columns and got a prediction, although now I'm uncertain of how to interpret the output.

Again, the important point is that the predicted variable is there in the training data (it's the label in labeled data for supervised learning) whereas it's not in the test frame in this case (no ground truth to test against; I'm trying to make a prediction).

I'll try these other methods.

Spencer Aiello

Mar 7, 2016, 2:06:19 PM
to John Langton, H2O Open Source Scalable Machine Learning - h2ostream
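Does this example fit what you're trying to achieve? (R code follows.)


     library(h2o)
     h2o.init()                    # start or connect to a local H2O cluster

     fr <- as.h2o(iris)
     r <- h2o.runif(fr)            # one uniform random number per row
     train <- fr[r < 0.8, ]        # roughly 80% of the rows, all columns
     test <- fr[r >= 0.8, 1:4]     # remaining rows; drop the Species column

     # train a model
     gbm <- h2o.gbm(x = 1:4, y = 5, training_frame = train)

     # predict (no label)
     h2o.predict(gbm, test)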
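
For anyone on the Python client, the same idea (predict on a frame that simply lacks the response column) might be sketched as below. The helper name `predict_without_response` is hypothetical and not part of H2O. It assumes `frame.columns` lists the column names and `frame[list_of_names]` returns a column subset, as an H2OFrame does, and it relies on the `my_model.predict(my_new_data)` call mentioned earlier in this thread.

```python
def predict_without_response(model, frame, response="verdict"):
    """Score `frame` using every column except `response`.

    Hypothetical helper, not an H2O function. `model` is assumed to expose
    predict(frame); `frame` is assumed to expose `.columns` and to accept a
    list of column names in square brackets, as an H2OFrame does.
    """
    # Build the list of columns to keep; the original frame object is not modified.
    keep = [name for name in frame.columns if name != response]
    # Predict on the column subset, so the unlabeled response never reaches the model.
    return model.predict(frame[keep])
```

With a trained model and a loaded frame, a call such as `predict_without_response(my_model, my_new_data)` would return the prediction frame.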




agen...@gmail.com

Mar 7, 2016, 2:23:43 PM
to H2O Open Source Scalable Machine Learning - h2ostream, agen...@gmail.com
Close. It would be if the species column was still there but had 'unknown' as a value (and was an enum in the training set and testing set). When I loaded that into H2O Flow, I couldn't use that frame for predicting.

So you are saying that it is necessary to remove y from the dataframe that I want to predict against? Is that correct? So for any prediction, you must remove y from the dataframe?

Erin LeDell

Mar 7, 2016, 3:01:50 PM
to agen...@gmail.com, H2O Open Source Scalable Machine Learning - h2ostream
I think what you are saying is that your test set does have a "verdict"
column (the response), but there are NAs currently filling this column
in your test frame? What exactly do you mean by "The dataframe I want
to test on doesn't have values for this"? I assume if the column is
there, but "doesn't have values", that means those values are NA...

The h2o.predict function won't fill in an empty/NA response column in
your test set; it will return a frame of predicted values for that
column. If you want to keep the predictions alongside your test set,
a new "predicted_verdict" column would be a decent place to put them.

It's important to differentiate between "verdict" (actual labels) and
"predicted_verdict" (values or labels predicted by the model) because
they are not the same thing.
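
As a rough illustration of that suggestion in the Python client, the sketch below stores the predicted labels in a separate "predicted_verdict" column. The helper name `attach_predictions` is hypothetical. It assumes the prediction frame of a classifier carries the predicted label in a column named "predict", that a new column can be assigned with `frame[name] = ...`, and that `frame` is one the model accepts for prediction.

```python
def attach_predictions(model, frame, new_col="predicted_verdict"):
    """Add the model's predicted labels to `frame` under `new_col`.

    Hypothetical helper, not an H2O function. Assumes model.predict(frame)
    returns a frame whose "predict" column holds the predicted label, and
    that `frame` supports column assignment by name.
    """
    preds = model.predict(frame)
    # Store predictions in their own column so they are never confused
    # with the actual labels in the response column.
    frame[new_col] = preds["predict"]
    return frame
```

Note that this modifies `frame` in place by adding a column; the response column itself is left untouched.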

-Erin
--
Erin LeDell Ph.D.
Statistician & Machine Learning Scientist | H2O.ai

Spencer Aiello

Mar 7, 2016, 3:03:52 PM
to John Langton, H2O Open Source Scalable Machine Learning - h2ostream
No, it's not necessary; the predict call will automatically ignore it for you.

agen...@gmail.com

Mar 7, 2016, 3:20:09 PM
to H2O Open Source Scalable Machine Learning - h2ostream, agen...@gmail.com
Yes, very close, sorry for my inconsistent language.
"verdict" is the response column. In my training data the possible values are 'good' and 'bad'. It's being treated as an enum. In my data I want to predict on, the values are 'unknown'. I expected h2o to ignore the response column altogether when trying to run the prediction, however, it did not. It complained that the test data had a categorical response column 'verdict' with no levels in common with the model (which is accurate).

I'm inferring there are two options:

1. the test frame has the response column with possible values in common with the training data set, in which case I'll probably get r^2 and other evaluation scores back

2. I omit the response column altogether from the test frame, in which case I will get back results and no accuracy scores

I was confused at first because I wanted to make sure that prediction wasn't using my response column in some way in the predictions, and that it knew this was the response column from training and was therefore treating it as such in prediction. It wasn't clear to me from the error, since I had assumed it would be ignored entirely.

I think this makes sense now. But please do correct me if any of this is wrong. Thanks!!

Spencer Aiello

Mar 7, 2016, 3:24:05 PM
to John Langton, H2O Open Source Scalable Machine Learning - h2ostream
Yep you've got it completely correct.

For the test data, you get metrics on the target column (if present) -- with some hollering if the target column in the test data is inappropriate in some way (in this case mismatched levels).

Otherwise, you only get the raw predictions.
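
The two cases can be sketched in Python roughly as follows. The helper name `score_or_predict` is hypothetical. It assumes the model exposes `model_performance(test_data=...)` for metrics and `predict(...)` for raw predictions, as the H2O Python estimators do, and it does not handle the mismatched-levels case (such as a response column filled with 'unknown'), where H2O is expected to complain.

```python
def score_or_predict(model, frame, response="verdict"):
    """Return metrics when `frame` has the response column, else raw predictions.

    Hypothetical helper, not an H2O function. It only checks whether the
    response column is present, not whether its levels match the model's.
    """
    if response in frame.columns:
        # Labeled frame: ask the model for performance metrics on it.
        return "metrics", model.model_performance(test_data=frame)
    # Unlabeled frame: only raw predictions are available.
    return "predictions", model.predict(frame)
```

The first element of the returned pair says which case applied, so the caller can tell a metrics object from a prediction frame.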


Thanks,
Spencer