Getting prediction intervals from GBM models


SC

Jul 8, 2016, 1:31:41 PM
to H2O Open Source Scalable Machine Learning - h2ostream

I am working with GBM regression models in H2O, using the quantile distribution for the distribution parameter. I am looking for a way to provide prediction intervals in addition to the point prediction. What is the best way to achieve this in H2O?


Since we are using quantile regression, I was hoping we could exploit it as follows:

Let's say we want a 95% prediction interval. We train one GBM model with quantile alpha = 0.025 and another with quantile alpha = 0.975, and then call predict on these two models to get the lower and upper bounds of the range. Does that seem appropriate?
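For readers who want to try this, here is a minimal sketch of the two-model approach using the H2O Python API. It assumes an already running H2O cluster and an H2OFrame named `train`. The helper names `interval_alphas` and `train_interval_models` are hypothetical, introduced only for illustration; `distribution="quantile"` and `quantile_alpha` are the GBM parameters discussed in this thread.

```python
def interval_alphas(coverage):
    """Return the (lower, upper) quantile levels of a central interval.

    Hypothetical helper: a 95% interval maps to roughly (0.025, 0.975).
    """
    if not 0.0 < coverage < 1.0:
        raise ValueError("coverage must be strictly between 0 and 1")
    tail = (1.0 - coverage) / 2.0
    return tail, 1.0 - tail


def train_interval_models(train, x, y, coverage=0.95, **gbm_params):
    """Train one quantile GBM per interval bound on the H2OFrame `train`.

    Assumes h2o.init() has been called elsewhere. Extra keyword arguments
    (for example ntrees or learn_rate) are passed through to the estimator.
    """
    # Imported lazily so interval_alphas() works without an H2O install.
    from h2o.estimators import H2OGradientBoostingEstimator

    models = []
    for alpha in interval_alphas(coverage):
        model = H2OGradientBoostingEstimator(
            distribution="quantile", quantile_alpha=alpha, **gbm_params
        )
        model.train(x=x, y=y, training_frame=train)
        models.append(model)
    return tuple(models)  # (lower_model, upper_model)


lower_alpha, upper_alpha = interval_alphas(0.95)
print(round(lower_alpha, 3), round(upper_alpha, 3))
```

Calling `predict` on each returned model with the same test frame would then give the lower and upper bound for every row.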

Navdeep Gill

Jul 8, 2016, 9:28:37 PM
to H2O Open Source Scalable Machine Learning - h2ostream
Hi,

Yes, your approach seems appropriate and should work. Also, here is a quick example of applying Quantile GBM: https://github.com/h2oai/h2o-3/blob/739ec856995d066ccecb8eb605ca9ef5a9d3baa6/h2o-r/tests/testdir_algos/gbm/runit_GBM_quantile.R

Please let us know if you have any more questions!

Thanks,
Navdeep

Navdeep Gill

Jul 8, 2016, 9:32:51 PM
to H2O Open Source Scalable Machine Learning - h2ostream

SC

Jul 21, 2016, 6:19:20 PM
to H2O Open Source Scalable Machine Learning - h2ostream
Hi,
I was able to build a range using quantile GBM. It works pretty well, but I noticed one issue.

We use quantile alpha = 0.4 to predict the point estimate for our target, and 0.05 and 0.95 to get the prediction range. Ideally, I would expect that for any given set of features, the prediction from the point estimate model (alpha = 0.4) will ALWAYS lie WITHIN the range. But for 824/14326 = 5.7% of the rows in my test set, the point estimate was outside the range.

Is this a bug in H2O's quantile GBM, or are my assumptions incorrect?

Thanks

ma...@0xdata.com

Jul 28, 2016, 10:00:50 AM
to H2O Open Source Scalable Machine Learning - h2ostream
Hi,

I have used quantile regression fairly extensively in a couple of GBM implementations, including H2O. It is not uncommon to find quantile predictions "out of order", since each quantile is solved independently. I usually impose a sort on the final predictions, as I expect some degree of this behavior. Further, I would expect similar behavior from any method (GBM or other) that solves quantiles independently.
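As an illustration of the sort mentioned above, here is a minimal sketch in plain Python. It assumes the predictions from each quantile model have already been pulled into ordinary lists of equal length, given in ascending order of quantile level. The function name `enforce_quantile_order` and the sample numbers are made up for the example.

```python
def enforce_quantile_order(*columns):
    """Sort quantile predictions within each row.

    `columns` are per-model prediction lists ordered by ascending quantile
    level (for example p05, p40, p95). Returns new lists in the same order
    in which every row satisfies lower <= middle <= upper.
    """
    if len({len(col) for col in columns}) > 1:
        raise ValueError("all prediction columns must have the same length")
    # Sort each row independently, then transpose back to columns.
    sorted_rows = [sorted(row) for row in zip(*columns)]
    return [list(col) for col in zip(*sorted_rows)]


# Made-up predictions: rows 2 and 3 are out of order before sorting.
p05 = [1.0, 2.0, 5.0]
p40 = [1.5, 1.8, 5.5]
p95 = [3.0, 4.0, 5.2]

lower, point, upper = enforce_quantile_order(p05, p40, p95)
print(lower, point, upper)
```

Sorting only repairs the ordering within a row; it does not by itself make the individual models better calibrated.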

What you are experiencing does seem a bit extreme, though. If I'm understanding this correctly, 5.7% of the predictions from the alpha 0.4 model are either lower than the prediction for the same row at alpha 0.05 or higher than the prediction for the same row at alpha 0.95. Strictly speaking, the 0.05 model's predictions ought to be lower than 95% of your targets, the 0.4 model's lower than 60%, and the 0.95 model's lower than 5%. Can you measure whether this is the case for (1) your train predictions and (2) your test predictions?
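A minimal sketch of both measurements in plain Python, assuming targets and predictions are available as ordinary lists. The function names and the sample numbers are hypothetical. For a well-calibrated model at quantile alpha, roughly a fraction alpha of the targets should fall below its predictions.

```python
def fraction_below(targets, predictions):
    """Share of rows whose actual target is below the prediction."""
    if len(targets) != len(predictions) or not targets:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(1 for t, p in zip(targets, predictions) if t < p)
    return hits / len(targets)


def fraction_outside(point, lower, upper):
    """Share of rows whose point estimate lies outside [lower, upper]."""
    if not (len(point) == len(lower) == len(upper)) or not point:
        raise ValueError("inputs must be non-empty and of equal length")
    misses = sum(
        1 for p, lo, hi in zip(point, lower, upper) if p < lo or p > hi
    )
    return misses / len(point)


# Made-up numbers purely to show the calls.
targets = [10.0, 20.0, 30.0, 40.0]
predictions = [12.0, 18.0, 35.0, 39.0]
print(fraction_below(targets, predictions))

point = [5.0, 7.0, 9.0, 11.0]
lower = [4.0, 8.0, 8.0, 10.0]
upper = [6.0, 9.0, 10.0, 10.5]
print(fraction_outside(point, lower, upper))
```

Running `fraction_below` once per quantile model, on train and on test predictions separately, gives the numbers asked for above.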

Trying more precise GBMs (more trees, lower learning rates) may help. At a high learning rate like the default, I have run into 20/50/80 quantiles being out of order at a rate similar to what you have experienced. Those got a little better after reducing the learning rate. Also, that particular model had few features, mainly categoricals, so intuitively it was a fairly fragile set of models, which can make the independent fits highly variable.

Internally, we can double-check our test cases for quantile regression and ensure it is operating the way we intend.

Thanks,

Mark Landry
Data scientist, H2O