Finding the optimal number of trees and nodesize in h2o random forest

Vinh DANG

Jan 5, 2016, 10:25:44 AM

to H2O Open Source Scalable Machine Learning - h2ostream

Dear all,

How can I tune these two parameters (number of trees and nodesize) in h2o.randomForest? For randomForest in R, the caret package can do this.

Thanks a lot

Erin LeDell

Jan 5, 2016, 2:27:35 PM

to Vinh DANG, H2O Open Source Scalable Machine Learning - h2ostream

Vinh,
I think you may be looking for the h2o.grid() function, which lets you train models over a set of model parameter values. It produces a list of models, which can then be sorted by a performance metric of your choosing (e.g., AUC or MSE).

-Erin

-- 
Erin LeDell Ph.D.
Statistician & Machine Learning Scientist | H2O.ai

Vinh Đặng

Jan 8, 2016, 10:45:03 AM

to Erin LeDell, H2O Open Source Scalable Machine Learning - h2ostream

Hello Erin

Thanks for your comment.

However, if I understand correctly, the idea of a grid is that you give the function a list of parameter values and it tells you which one is best. I think we could even do this manually: run randomForest several times with a different parameter value each time, log all the results, and see which value gives the highest accuracy.

But my question is: what should the "best" parameters be? For instance, in randomForest, how many trees (ntrees) should we use, and what about the other parameters? Of course, the naive way is to pass the list 1:5000000 to h2o.grid() and let it run, but I believe there should be a smarter way.

Please correct me if I am wrong.

----------------------------------
Best Regards

Vinh Dang

Erin LeDell

Jan 11, 2016, 3:51:33 PM

to Vinh Đặng, H2O Open Source Scalable Machine Learning - h2ostream

Hi,

ntrees is unusual in that increasing it pretty much always produces better results, by any measure. Try 500, 1000, and 2000.
For most other parameters, the results are not usually monotonic, so just make a grid around the default value. For example, if a parameter ranges from 0 to 1 and its default is 0.5, you might want to try 0.0, 0.25, 0.5, 0.75, and 1.0.
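
A minimal sketch of the "grid around the default" idea in plain Python, with no H2O calls. The helper `grid_around` and the parameter name `some_fraction` are hypothetical placeholders for illustration; only the ntrees values and the 0-to-1 example come from the text above.

```python
from itertools import product


def grid_around(default, low, high, steps_each_side=2):
    """Evenly spaced candidates from low to high that pass through default."""
    below = [low + i * (default - low) / steps_each_side
             for i in range(steps_each_side)]
    above = [default + i * (high - default) / steps_each_side
             for i in range(1, steps_each_side + 1)]
    return below + [default] + above


# Candidate values per parameter. "some_fraction" is a made-up name for
# a parameter that ranges from 0 to 1 with a default of 0.5.
hyper_params = {
    "ntrees": [500, 1000, 2000],
    "some_fraction": grid_around(0.5, 0.0, 1.0),
}

# Every combination of candidate values: one model would be trained per entry.
combos = [dict(zip(hyper_params, values))
          for values in product(*hyper_params.values())]

print(hyper_params["some_fraction"])  # [0.0, 0.25, 0.5, 0.75, 1.0]
print(len(combos))                    # 15
```

The number of combinations is the product of the list lengths, so a few values per parameter is usually all a full grid can afford.
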
Every problem is different, so you have to try a wide variety of parameter values.
The "best" set of model parameters depends on your definition of "best". If you are trying to minimize MSE, then you should choose the set of model parameters that minimizes MSE. If you want to maximize AUC instead, that will probably select a different set.
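
To illustrate how the choice of metric can change the winner, here is a small plain-Python sketch. The labels and the two prediction lists are toy values made up for this example (they are not output from any H2O model), and the names "A" and "B" stand in for two hypothetical parameter sets.

```python
def mse(y_true, y_pred):
    """Mean squared error between labels and predicted probabilities."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def auc(y_true, y_pred):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [p for t, p in zip(y_true, y_pred) if t == 1]
    neg = [p for t, p in zip(y_true, y_pred) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy validation labels and predictions (made up for illustration only).
y_true = [0, 0, 1, 1]
candidates = {
    "A": [0.1, 0.6, 0.5, 0.9],    # confident predictions, one pair mis-ranked
    "B": [0.4, 0.45, 0.5, 0.55],  # perfect ranking, but timid probabilities
}

best_by_mse = min(candidates, key=lambda k: mse(y_true, candidates[k]))
best_by_auc = max(candidates, key=lambda k: auc(y_true, candidates[k]))

print(best_by_mse, best_by_auc)  # A B
```

Here "A" has the lower MSE while "B" has the higher AUC, so the two criteria disagree about which candidate is best.
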

-Erin
