Random Forest Classification in Python

David Comfort

Feb 17, 2016, 6:02:02 PM

to H2O Open Source Scalable Machine Learning - h2ostream

Hi,

I am trying to run Random Forest on Python 2.7 for a classification task. How do I specify that this is a classification task rather than regression? I do not see an appropriate parameter.

For instance, I see that there is a classification parameter here: https://h2o.gitbooks.io/h2o-training-day/content/hands-on_training/classification.html

classification = TRUE,

However, there isn't one here:

https://github.com/h2oai/h2o-3/blob/master/h2o-py/demos/rf_balance_classes.ipynb

Also, the documentation for H2O in Python appears to contain blocks of R code:

https://h2o-release.s3.amazonaws.com/h2o/rel-slater/5/docs-website/h2o-py/docs/h2o.html

I want to be able to output the AUC and ROC, but it seems to be running as a regression.

> rf = H2ORandomForestEstimator(seed=12, ntrees=10, max_depth=20, balance_classes=False)

> rf.train(x=X, y=Y, training_frame=df_h2o_train_hex, validation_frame=df_h2o_valid_hex)

Model Details
=============
H2ORandomForestEstimator :  Distributed Random Forest
Model Key:  DRF_model_python_1455741432772_1

Model Summary:

number_of_trees  model_size_in_bytes  min_depth  max_depth  mean_depth  min_leaves  max_leaves  mean_leaves
10.0             624029.0             20.0       20.0       20.0        4502.0      5701.0      5365.2

ModelMetricsRegression: drf
** Reported on train data. **

MSE: 0.203627622723
R^2: 0.083251815499
Mean Residual Deviance: 0.203627622723

ModelMetricsRegression: drf
** Reported on validation data. **

MSE: 0.190313952689
R^2: 0.151951524407
Mean Residual Deviance: 0.190313952689

Tom Kraljevic

Feb 17, 2016, 6:32:40 PM

to David Comfort, H2O Open Source Scalable Machine Learning - h2ostream

On Feb 17, 2016, at 3:02 PM, David Comfort <davidmich...@gmail.com> wrote:

> Hi,
> I am trying to run Random Forest on Python 2.7 for a classification task. How do I specify that this is a classification task not regression? I do not see an appropriate parameter.
>
> For instance, I see that there is a classification parameter here: https://h2o.gitbooks.io/h2o-training-day/content/hands-on_training/classification.html
>
> classification = TRUE,

That is the H2O World 2014 training material, which is out of date. (Its introduction points you to the 2015 material, although the individual sections, including the one you linked to above, currently do not.)

See the latest here: http://learn.h2o.ai

The short answer is that you need to cast your Y column to a factor before training. H2O infers classification versus regression from the type of the response column.

For example, with h2o 3.6.0.8 or h2o 3.8.0.3:

	test["CAPSULE"] = test["CAPSULE"].asfactor()
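As a minimal sketch of where that cast sits in practice: the response generally needs to be a factor in every frame handed to the model, not only in one of them. The helper name `cast_response_to_factor` below is hypothetical; the only H2O call it relies on is `asfactor()`, as shown above. It takes any objects that behave like an H2OFrame, so it can be read without a running cluster.

```python
def cast_response_to_factor(response, *frames):
    """Cast the `response` column to a factor in each frame, in place.

    `frames` are assumed to be h2o.H2OFrame objects (hypothetical helper,
    not part of the h2o package). `response` is a column name or index.
    """
    for frame in frames:
        # asfactor() returns a categorical version of the column;
        # assigning it back replaces the numeric column in the frame.
        frame[response] = frame[response].asfactor()
    return frames


# Usage with real H2OFrames (needs a running H2O cluster, so not run here):
#   train, test = cast_response_to_factor("CAPSULE", train, test)
```

Doing the cast before calling `train()` matters, because the model type is decided from the training frame at that point.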

Thanks,

Tom

David Comfort

Feb 17, 2016, 7:01:23 PM

to H2O Open Source Scalable Machine Learning - h2ostream

Thanks for the prompt response.

Isn't that an R command or an H2O method? As far as I know, Python does not have an "asfactor()" method.

Spencer Aiello

Feb 17, 2016, 7:03:15 PM

to David Comfort, H2O Open Source Scalable Machine Learning - h2ostream

Hi David,

The H2OFrame object has an "asfactor" member method (inspired, of course, by the world of R).

David Comfort

Feb 17, 2016, 7:27:39 PM

to H2O Open Source Scalable Machine Learning - h2ostream

Thanks Spencer and Tom,

I got it working with the following code:

# Cast the response (column 255) to a factor so H2O treats this as classification
df_h2o[255] = df_h2o[255].asfactor()

# Split into training and validation sets (roughly 80/20)
r = df_h2o.runif() # Random UNIForm numbers, one per row
df_h2o_train_hex = df_h2o[r < 0.8]
df_h2o_valid_hex = df_h2o[r >= 0.8]

# Features
X = df_h2o.columns[2:255]

# Classification (response column)
Y = df_h2o.columns[255:256]

from h2o.estimators.random_forest import H2ORandomForestEstimator
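The post stops at the import, so here is a hedged sketch of how training and the AUC/ROC output asked about in the first message might follow. It assumes the response was already cast to a factor (as above) and has two levels, and that `y` is a single column name (for the frame above, `df_h2o.columns[255]`). The function name `train_and_score` is hypothetical. `auc(valid=True)` and `roc(valid=True)` are taken from the binomial model API in current h2o-3 releases and may differ in the 3.6/3.8 versions mentioned in this thread.

```python
def train_and_score(estimator, train, valid, x, y):
    """Train `estimator` as a classifier and return validation AUC and ROC points.

    Assumptions: `estimator` is e.g. an H2ORandomForestEstimator, `train`
    and `valid` are H2OFrames whose response column `y` is already a
    two-level factor, and `x` is a list of feature column names.
    """
    estimator.train(x=x, y=y, training_frame=train, validation_frame=valid)

    # With a factor response the model reports classification metrics
    # instead of MSE / R^2; valid=True selects the validation-frame metrics.
    auc = estimator.auc(valid=True)
    # roc() is assumed to return (false positive rates, true positive rates).
    fprs, tprs = estimator.roc(valid=True)
    return auc, fprs, tprs


# Usage against a live H2O cluster (not run here):
#   rf = H2ORandomForestEstimator(seed=12, ntrees=10, max_depth=20)
#   auc, fprs, tprs = train_and_score(
#       rf, df_h2o_train_hex, df_h2o_valid_hex, X, df_h2o.columns[255])
```

If the printed model summary still says ModelMetricsRegression, the response column was most likely numeric when `train()` was called.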
