Just started playing with H2O about a week ago (professional applied stats guy), trying to see if it would be useful for exploring new datasets. A couple of problems for me; maybe other users have found work-arounds, or I've just missed a feature.
1) Lack of cross-validation. I can't believe there's no built-in way to specify a quick n-fold cross-validation. Doing it manually is for the birds (see the first sketch after this list for what I mean).
2) Lack of model prediction visualization. I mostly do regression models, and the generated statistics are woefully lacking. An aggregate statistic like RMSE can be a complete lie depending on the range of the target being modeled and how the error ends up being distributed. As it is, it seems like I have to manually create a hold-out set, predict it, and download the results, all just to get a predicted-vs-actual plot (second sketch below).
3) Large number of descriptors. We typically have hundreds of descriptors (columns) and 10k+ rows in our datasets. The displays for large numbers of descriptors are pretty rough; horizontally scrolling through a list of 50 checkboxes to select variables for a rebuild is tough (the third sketch below shows the kind of programmatic selection I'd prefer).
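For concreteness, here's the shape of the manual n-fold loop I mean in (1). It's a sketch in plain scikit-learn on random stand-in data, just to show how little logic is involved; against H2O, each fold turns into its own upload/train/predict round trip:

```python
# Manual 5-fold CV for a regression model; random stand-in data,
# not our actual descriptor sets.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.random((10_000, 300))                   # ~300 descriptors, 10k rows
y = 2.0 * X[:, 0] + rng.normal(0, 0.1, 10_000)  # synthetic target

fold_rmse = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = Ridge().fit(X[train_idx], y[train_idx])
    preds = model.predict(X[test_idx])
    fold_rmse.append(mean_squared_error(y[test_idx], preds) ** 0.5)

print(f"5-fold RMSE: {np.mean(fold_rmse):.4f} +/- {np.std(fold_rmse):.4f}")
```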
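And this is the predicted-vs-actual plot from (2) that I end up building by hand after downloading the hold-out predictions. The file name and column names here are made up; the point is how simple the plot is once the predictions are local:

```python
# Scatter of predicted vs. actual on the hold-out set, with a y = x
# reference line. "holdout_preds.csv" and its column names are
# hypothetical stand-ins for whatever gets exported.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("holdout_preds.csv")     # assumed columns: actual, predicted
lo, hi = df["actual"].min(), df["actual"].max()

plt.scatter(df["actual"], df["predicted"], s=8, alpha=0.4)
plt.plot([lo, hi], [lo, hi], "r--", label="y = x")
plt.xlabel("Actual")
plt.ylabel("Predicted")
plt.title("Hold-out set: predicted vs. actual")
plt.legend()
plt.show()
```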
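Finally, on (3), something like this is what I'd want instead of checkbox scrolling: derive the predictor list by rule and hand it to the model. This assumes H2O's Python client, and the file and column names are made up, so treat it as illustrative rather than tested:

```python
# Pick predictors programmatically instead of via checkboxes.
# Assumes the H2O Python client; file and column names are hypothetical.
import h2o
from h2o.estimators.glm import H2OGeneralizedLinearEstimator

h2o.init()                                  # connect to (or start) a local H2O
frame = h2o.import_file("descriptors.csv")  # hypothetical ~300-column dataset

response = "activity"                       # hypothetical target column
ignore = {response, "compound_id"}          # hypothetical ID column to exclude
predictors = [c for c in frame.columns if c not in ignore]

model = H2OGeneralizedLinearEstimator()     # gaussian family by default
model.train(x=predictors, y=response, training_frame=frame)
print(model.rmse())                         # training RMSE
```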
It seems like an interesting tool. It's very accessible and relatively easy to get data in and models constructed. Right now, though, it seems really hard to actually evaluate the performance of a model built on the platform. Any suggestions on how people are doing this in practice would be appreciated.
Nate