H2O Model Validation and Prediction


nate.a...@gmail.com

May 19, 2014, 3:33:57 PM
to h2os...@googlegroups.com
Hello,

Just started playing with H2O about a week ago (professional applied stats guy), trying to see if this is something that would be useful for exploring new datasets. A couple of problems for me; maybe other users have found work-arounds, or I've just missed a feature.

1) Lack of cross-validation. Can't believe there's not a built-in engine for specifying a quick n-fold cross-validation. Doing this manually is for the birds.

2) Lack of model prediction visualization. I mostly do regression models, and the generated statistics are woefully lacking. Aggregate statistics like RMSE can be a complete lie depending on the range of the target being modeled and how the error ends up being distributed. As it is, it seems like I have to manually create a hold-out set, predict that hold-out set, and download the results, all so I can get a predicted vs. actual plot.

3) Large number of descriptors. We typically have hundreds of descriptors (columns) and 10k+ rows in our datasets. The displays for large numbers of descriptors are pretty rough; trying to horizontally scroll through a list of 50 checkboxes to select variables for a rebuild is kind of tough.


It seems like an interesting tool. It's very accessible, and it's relatively easy to get data in and models constructed. However, right now it seems really hard to actually evaluate the performance of a model built on the platform. Any suggestions on how people are doing this in practice would be appreciated.

Nate

Tom Kraljevic

May 20, 2014, 2:47:51 AM
to nate.a...@gmail.com, h2os...@googlegroups.com

Hi Nate,


Thanks for trying out H2O and for your feedback.


Regarding 1), our GLM algorithm does have cross-validation built in, but the other algorithms currently do not. What we do currently have is a train (source key) + test (validation key) methodology that we have been propagating through all the supervised algorithms. You can use either R or the Frame Split page (in the beta tab, available when H2O is started with -beta) to have H2O assist in doing train/test splits. Over the past few months we have been adding capabilities to make it easier for modelers to compare results from different algorithms with each other. More on this will be coming.
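
For concreteness, the flow from R looks roughly like the following. Treat it as an untested sketch: the file path and column indices are placeholders, and exact function signatures may differ across h2o package versions.

library(h2o)
localH2O <- h2o.init()

# Load data into the H2O cloud (placeholder path).
data <- h2o.importFile(localH2O, path = "mydata.csv")

# Train/test split via a uniform random vector (~80/20).
r     <- h2o.runif(data)
train <- data[r <= 0.8, ]
test  <- data[r >  0.8, ]

# GLM has n-fold cross-validation built in (nfolds).
glm_model <- h2o.glm(x = 1:10, y = 11, data = train,
                     family = "gaussian", nfolds = 5)

# The other supervised algorithms take the held-out frame
# as a validation key at model-build time.
gbm_model <- h2o.gbm(x = 1:10, y = 11, data = train,
                     distribution = "gaussian", validation = test)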

With respect to your point 2), we do have ROC plots and confusion matrices built into the product. If you are interested in getting prediction results out of H2O for plotting, you could try extracting them with H2O's R package and plotting from there.
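
For a regression model, getting a predicted vs. actual plot that way might look like this (again a sketch; the response column name is a placeholder):

# Score the hold-out frame in H2O, then download the (possibly
# large) results into plain R vectors for plotting.
pred <- h2o.predict(glm_model, newdata = test)

actual    <- as.data.frame(test)[, "response"]   # placeholder name
predicted <- as.data.frame(pred)[, 1]

plot(actual, predicted, xlab = "Actual", ylab = "Predicted")
abline(0, 1, col = "red")   # y = x reference line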

On 3), I suggest you look at our R integration if you want to automate or repeat things. In the future we will also be adding a screen that lets you take an existing model as the basis for a new one, which might reduce some of the repetitiveness of selecting columns in the UI.


Here is an example, written in R, that reads a dataset, builds a train/test split, builds a model, and does scoring.
https://github.com/0xdata/h2o/blob/master/R/tests/testdir_demos/runit_demo_cm_roc_.R


Thanks,
Tom

nate.a...@gmail.com

May 20, 2014, 8:46:20 AM
to h2os...@googlegroups.com, nate.a...@gmail.com
First of all, thanks for your quick response. It is an interesting product, and I do look forward to seeing where it goes. I'm always looking for better, faster ways to evaluate the modelability of datasets, and the ability to deploy models from H2O is an interesting one.

I'll just quickly respond to your comments.

First, the split frame doesn't allow any obvious way to do cross-validation (I suppose you can argue it allows sampling with replacement), and even if you do use it, managing the resulting N datasets and N models will be tough. Personally, I'd really want a cross-validation wrapper around all models, one that just makes use of the validation set feature all models already have (which I do like: being able to specify that set at model-build time and not have to come back and predict later). A sketch of the manual version is below.

Second, both validation measures you mention are for classification models. I'm still out in the woods with regression.

Lastly, I haven't looked much at the external API. Do you envision that much of the heavy use will be through that external API, or is it a temporary crutch while features are implemented inside of H2O?
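
Here is roughly the manual wrapper I mean (a sketch only; I'm borrowing the h2o calls from your example and assuming frame slicing with a random vector works this way):

# Manual n-fold cross-validation; build_fn is any function that
# takes (train, valid) H2O frames and returns a model.
manual_cv <- function(data, n_folds, build_fn) {
  r <- h2o.runif(data)                 # one uniform draw per row
  models <- vector("list", n_folds)
  for (k in seq_len(n_folds)) {
    lo <- (k - 1) / n_folds
    hi <- k / n_folds
    valid <- data[r >= lo & r < hi, ]  # fold k is the hold-out
    train <- data[r < lo | r >= hi, ]  # the rest is training data
    models[[k]] <- build_fn(train, valid)
  }
  models  # caller is still left juggling N models and N stat sets
}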

Thanks again and looking forward to seeing where this goes,
Nate

Tom Kraljevic

May 20, 2014, 12:39:56 PM
to nate.a...@gmail.com, h2os...@googlegroups.com

Nate,


> First of all, thanks for your quick response. It is an interesting product, and I do look forward to seeing where it goes. I'm always looking for better, faster ways to evaluate the modelability of datasets, and the ability to deploy models from H2O is an interesting one.

Great!


> First, the split frame doesn't allow any obvious way to do cross-validation (I suppose you can argue it allows sampling with replacement), and even if you do use it, managing the resulting N datasets and N models will be tough. Personally, I'd really want a cross-validation wrapper around all models, one that just makes use of the validation set feature all models already have.

Other people have also asked for this. I added your vote to the ticket.


> Second, both validation measures you mention are for classification models. I'm still out in the woods with regression.

I’d appreciate it if you sent me a specific example of the use case you are advocating (e.g., what the plot looks like, how you interpret it, and anything else you can think of that’s useful for us to know).
Also, please keep in mind the big data perspective: if you were dealing with a billion rows, would your point of view change? If so, how?


> Lastly, I haven't looked much at the external API. Do you envision that much of the heavy use will be through that external API, or is it a temporary crutch while features are implemented inside of H2O?

We don’t view the REST API as a crutch at all; it’s a first-class citizen.
Our UI uses the REST API, the R client uses the REST API, and our Python test infrastructure uses the REST API.
It’s also what allows people to automate things for production.

At this time, most people who drive H2O programmatically do it through the H2O R client package.
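
Even from R, the REST API is just HTTP plus JSON; a rough sketch (the endpoint and field names here are assumptions and may differ in your build, so check the API docs shipped with it):

library(httr)
library(jsonlite)

# Ask a running H2O node for its cloud status.
resp  <- GET("http://localhost:54321/Cloud.json")   # assumed endpoint
cloud <- fromJSON(content(resp, as = "text"))
print(cloud$cloud_size)                             # assumed field name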


Thanks,
Tom

nate.a...@gmail.com

May 21, 2014, 10:57:30 AM
to h2os...@googlegroups.com, nate.a...@gmail.com
On Tuesday, May 20, 2014 12:39:56 PM UTC-4, Tom Kraljevic wrote:
> > First of all, thanks for your quick response. It is an interesting product, and I do look forward to seeing where it goes.
>
> Great!
>
> > Personally, I'd really want a cross-validation wrapper around all models.
>
> Other people have also asked for this. I added your vote to the ticket.

Thank you!

> > Second, both validation measures you mention are for classification models. I'm still out in the woods with regression.
>
> I’d appreciate it if you sent me a specific example of the use case you are advocating (e.g., what the plot looks like, how you interpret it, and anything else you can think of that’s useful for us to know).
> Also, please keep in mind the big data perspective: if you were dealing with a billion rows, would your point of view change? If so, how?

Standard predicted vs. actual:
http://www.jmp.com/support/help/Graphs_for_Goodness_of_Fit.shtml

For larger datasets, the individual points aren't useful, so we use a density heatmap (nice example: http://www.chrisstucchio.com/blog/2012/dont_use_scatterplots.html, though we use much higher resolution) or we calculate contours on the density. I also typically calculate multiple statistics (Pearson's and Spearman's correlations, % within a log unit) so that we can better understand the error. Our data tops out at 1-2 million rows, so I have no practical opinion on billion-row models.

I model biological measurement data: the error is non-linear, the data is not uniformly distributed, and we are more sensitive to predictive errors in some ranges than others. In other words, someone may not care about a high vs. very high mis-prediction, but care greatly about a medium vs. low mis-prediction. I want to know how badly the model is regressing to the mean and how evenly distributed the prediction error is.
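
For what it's worth, once the predictions are downloaded into R, base R can already draw the heatmap version. A rough sketch (smoothScatter needs the KernSmooth package installed; actual and predicted are plain numeric vectors):

# Predicted vs. actual density heatmap, plus the error statistics
# mentioned above.
pred_vs_actual <- function(actual, predicted) {
  smoothScatter(actual, predicted, nbin = 256,  # high-res density grid
                xlab = "Actual", ylab = "Predicted")
  abline(0, 1, col = "red", lty = 2)            # perfect-prediction line
  cat("Pearson r:    ", cor(actual, predicted, method = "pearson"), "\n")
  cat("Spearman rho: ", cor(actual, predicted, method = "spearman"), "\n")
  cat("% within 1 log unit:",
      100 * mean(abs(actual - predicted) <= 1), "\n")
}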

> > Lastly, I haven't looked much at the external API. Do you envision that much of the heavy use will be through that external API, or is it a temporary crutch while features are implemented inside of H2O?
>
> We don’t view the REST API as a crutch at all; it’s a first-class citizen.

Thanks, I will look at the REST API using Python.

Tom Kraljevic

May 21, 2014, 12:34:52 PM
to nate.a...@gmail.com, h2os...@googlegroups.com

Thanks, Nate,

I created a ticket capturing your feedback.

Tom

dus...@gmail.com

Jun 5, 2015, 10:50:17 AM
to h2os...@googlegroups.com, nate.a...@gmail.com
On Tuesday, May 20, 2014 at 3:33:57 AM UTC+8, nate.a...@gmail.com wrote:

Hi all,

Any improvements, or did you give up on H2O?