Random Forest Classification in Python


David Comfort

Feb 17, 2016, 6:02:02 PM
to H2O Open Source Scalable Machine Learning - h2ostream
Hi,
I am trying to run Random Forest on Python 2.7 for a classification task. How do I specify that this is a classification task not regression? I do not see an appropriate parameter.

For instance, I see that there is a classification parameter here: https://h2o.gitbooks.io/h2o-training-day/content/hands-on_training/classification.html

classification = TRUE,

However, there isn't here:


Also, for the documentation of H2O in Python, it looks like there are blocks of R code


I want to be able to output the AUC and ROC, but it seems to be running as a regression.

> rf = H2ORandomForestEstimator(seed=12, ntrees=10, max_depth=20, balance_classes=False)
> rf.train(x=X, y=Y, training_frame=df_h2o_train_hex, validation_frame=df_h2o_valid_hex)


Model Details
=============
H2ORandomForestEstimator :  Distributed Random Forest
Model Key:  DRF_model_python_1455741432772_1

Model Summary: 
number_of_trees  model_size_in_bytes  min_depth  max_depth  mean_depth  min_leaves  max_leaves  mean_leaves
10.0             624029.0             20.0       20.0       20.0        4502.0      5701.0      5365.2
ModelMetricsRegression: drf
** Reported on train data. **

MSE: 0.203627622723
R^2: 0.083251815499
Mean Residual Deviance: 0.203627622723

ModelMetricsRegression: drf
** Reported on validation data. **

MSE: 0.190313952689
R^2: 0.151951524407
Mean Residual Deviance: 0.190313952689



Tom Kraljevic

Feb 17, 2016, 6:32:40 PM
to David Comfort, H2O Open Source Scalable Machine Learning - h2ostream
On Feb 17, 2016, at 3:02 PM, David Comfort <davidmich...@gmail.com> wrote:

Hi,
I am trying to run Random Forest on Python 2.7 for a classification task. How do I specify that this is a classification task not regression? I do not see an appropriate parameter.

For instance, I see that there is a classification parameter here: https://h2o.gitbooks.io/h2o-training-day/content/hands-on_training/classification.html

classification = TRUE,


This is the h2o-world 2014 training material, which is outdated. (The introduction points you to the 2015 material, although the individual sections, including the one you linked to above, currently do not.)

See the latest here:  http://learn.h2o.ai

The short answer is you need to cast your Y column to a factor first.

For example with h2o 3.6.0.8 or h2o 3.8.0.3:
test["CAPSULE"] = test["CAPSULE"].asfactor()
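
A minimal sketch of why the cast matters, assuming the h2o Python package: H2O appears to choose between regression and classification from the type of the response column, and `H2OFrame.types` reports factor columns as "enum". The helper names `problem_type`, `ensure_factor` and `demo` are hypothetical and only for illustration, and the mapping is a simplification that ignores string and time columns.

```python
def problem_type(column_type):
    """Map an H2O column type to the kind of model H2O would build for it."""
    # H2OFrame.types reports categorical (factor) columns as "enum".
    return "classification" if column_type == "enum" else "regression"


def ensure_factor(frame, response):
    """Convert the response column to a factor unless it already is one."""
    if frame.types[response] != "enum":
        frame[response] = frame[response].asfactor()
    return frame


def demo():
    """Needs the h2o package and a local cluster; not run automatically."""
    import h2o

    h2o.init()
    frame = h2o.H2OFrame({"x": [0.1, 0.4, 0.35, 0.8], "label": [0, 1, 0, 1]})
    print(problem_type(frame.types["label"]))  # numeric before the cast
    frame = ensure_factor(frame, "label")
    print(problem_type(frame.types["label"]))
```

Calling `demo()` against a running cluster prints the inferred problem type before and after the cast.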

Thanks,
Tom



David Comfort

Feb 17, 2016, 7:01:23 PM
to H2O Open Source Scalable Machine Learning - h2ostream
Thanks for the prompt response. 

Isn't that an R command or an H2O method? As far as I know, Python does not have an "asfactor()" method.

Spencer Aiello

Feb 17, 2016, 7:03:15 PM
to David Comfort, H2O Open Source Scalable Machine Learning - h2ostream
Hi David,

The H2OFrame object has the "asfactor" member method (of course drawing its inspiration from the world of R).


David Comfort

Feb 17, 2016, 7:27:39 PM
to H2O Open Source Scalable Machine Learning - h2ostream
Thanks Spencer and Tom,

I did get it working using the following code.

df_h2o[255] = df_h2o[255].asfactor()

# Split into training and validation sets
# Generate random numbers and create training, validation, testing splits 
r = df_h2o.runif()   # Random UNIForm numbers, one per row 
df_h2o_train_hex = df_h2o[r  < 0.8] 
df_h2o_valid_hex = df_h2o[r >= 0.8] 

# Features
X = df_h2o.columns[2:255]

# Classification
Y = df_h2o.columns[255:256]

from h2o.estimators.random_forest import H2ORandomForestEstimator
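
The post ends at the import, so the training step itself is not shown. As a hedged sketch (not the poster's actual code), the remaining steps might look like the following, assuming a binary response that was already converted with `asfactor()`. The helper names `select_columns` and `train_and_score` are hypothetical, and passing the response as a single column name rather than a one-element list is an assumption of this sketch.

```python
def select_columns(columns, first_feature, response_index):
    """Return (feature_names, response_name) from a list of column names."""
    features = list(columns[first_feature:response_index])
    response = columns[response_index]  # a single name, not a one-element list
    return features, response


def train_and_score(train, valid, features, response):
    """Train a small random forest and return (model, validation AUC, ROC points).

    Assumes `response` is a binary factor column and that an H2O cluster
    is already running.
    """
    # Imported here so the pure helper above works without h2o installed.
    from h2o.estimators.random_forest import H2ORandomForestEstimator

    model = H2ORandomForestEstimator(ntrees=10, max_depth=20, seed=12)
    model.train(x=features, y=response,
                training_frame=train, validation_frame=valid)

    auc = model.auc(valid=True)         # defined for binomial models
    fprs, tprs = model.roc(valid=True)  # false/true positive rates
    return model, auc, list(zip(fprs, tprs))
```

With the frames from this post, that would be `features, response = select_columns(df_h2o.columns, 2, 255)` followed by `train_and_score(df_h2o_train_hex, df_h2o_valid_hex, features, response)`.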