How to randomly split a frame into train, valid and test frames?


thib...@gmail.com

Aug 25, 2015, 10:58:30 AM
to H2O Open Source Scalable Machine Learning - h2ostream
Hi, I have been trying to do this using the Python API and Flow, but with little success, whatever I pass as the destination_frames argument (a list of strings, a list of frames, or nothing).

Thanks!

thib...@gmail.com

Aug 25, 2015, 10:59:40 AM
to H2O Open Source Scalable Machine Learning - h2ostream, thib...@gmail.com
NB: I have been trying with the split_frame Frame method in Python.

Erin LeDell

Aug 25, 2015, 1:46:07 PM
to thib...@gmail.com, H2O Open Source Scalable Machine Learning - h2ostream
Hi,
The split_frame method actually just splits the frame into two pieces
without shuffling the rows. To split randomly, here is an example of
how to create an 80/20 train/validation split in H2O Python:

# Start H2O, then load the data and prepare for modeling
import h2o
h2o.init()
airlines_hex = h2o.import_file(path = "allyears2k_headers.csv")

# Generate random numbers and create the training and validation splits
r = airlines_hex.runif() # Random UNIForm numbers, one per row
air_train_hex = airlines_hex[r < 0.8]
air_valid_hex = airlines_hex[r >= 0.8]


myX = ["DayofMonth", "DayOfWeek"]

# Now, train the GBM model:
air_model = h2o.gbm(y = "IsDepDelayed", x = myX,
distribution="bernoulli", training_frame = air_train_hex,
validation_frame = air_valid_hex, ntrees=100, max_depth=4, learn_rate=0.1)
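
The same idea extends to the three-way train/valid/test split asked about in the original question: draw one uniform random number per row and cut it at two thresholds. Below is a minimal sketch in plain Python (no H2O required) that only illustrates the threshold logic. The function name three_way_masks and the 60/20/20 default ratios are assumptions chosen for illustration; they are not part of the H2O API.

```python
import random

def three_way_masks(n_rows, train=0.6, valid=0.2, seed=1234):
    """Return (train, valid, test) boolean masks, one entry per row.

    Hypothetical helper for illustration only. Each row gets one uniform
    random number; the two cut points decide which split it falls into.
    """
    rng = random.Random(seed)
    r = [rng.random() for _ in range(n_rows)]   # one number in [0, 1) per row
    cut = train + valid                         # upper bound of the validation band
    train_mask = [x < train for x in r]
    valid_mask = [train <= x < cut for x in r]
    test_mask = [x >= cut for x in r]
    return train_mask, valid_mask, test_mask

train_mask, valid_mask, test_mask = three_way_masks(1000)
# Split sizes are approximate, because each row is assigned independently.
print(sum(train_mask), sum(valid_mask), sum(test_mask))
```

With an H2OFrame, the analogous step would presumably be three masks built from the same runif() column r, for example r < 0.6, (r >= 0.6) & (r < 0.8), and r >= 0.8. Combining frame masks with & is an assumption worth checking against your H2O version.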

-Erin

thib...@gmail.com

Aug 25, 2015, 1:57:19 PM
to H2O Open Source Scalable Machine Learning - h2ostream, thib...@gmail.com
Thanks! I actually tried to do what you did, but using a numpy array of random numbers instead of the random vec/frame generated from the data frame itself. So if I understand correctly, only a frame can be used for advanced indexing/masking on another frame?

Erin LeDell

Aug 25, 2015, 3:22:35 PM
to thib...@gmail.com, H2O Open Source Scalable Machine Learning - h2ostream
To the best of my knowledge, yes.

We have discussed adding a utility function like `h2o.split_frame` that
randomly splits the data at a desired rate (80/20, 70/30, etc) since
that is more common in practice... It's on the to-do list.

-Erin

Spencer Aiello

Aug 25, 2015, 7:56:42 PM
to thib...@gmail.com, H2O Open Source Scalable Machine Learning - h2ostream
In general, you'll get the most efficient row selection using a one-column H2OFrame mask.

But you definitely can select rows in a number of ways:

            my_frame[:400,:]            # first 400 rows
            my_frame[[25,99,200],:]  # row indexes in an array

If you have a numpy array, you should convert it to a Python list first.
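
For the index-based route, one way to get a random split is to sample row indexes and pass them as a plain Python list. This is a rough sketch using only the standard library; random_row_indexes is a hypothetical helper, not an H2O function, and the mask approach above is still likely to be the more efficient choice on large frames.

```python
import random

def random_row_indexes(n_rows, train_fraction=0.8, seed=1234):
    """Return (train_idx, test_idx) as sorted, disjoint lists of row indexes.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    n_train = int(n_rows * train_fraction)
    # Sample without replacement so no row lands in the training set twice.
    train_idx = sorted(rng.sample(range(n_rows), n_train))
    chosen = set(train_idx)
    test_idx = [i for i in range(n_rows) if i not in chosen]
    return train_idx, test_idx

train_idx, test_idx = random_row_indexes(1000)
# Both results are ordinary Python lists of ints, the form used in
# my_frame[[25,99,200],:] above, e.g. my_frame[train_idx, :].
# A numpy index array could be converted first with its .tolist() method.
```

Unlike the runif() mask, this gives an exact split size (here 800 and 200 rows) rather than an approximate one.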


And as Erin mentioned, we'll probably have a piece of functionality very soon that mirrors scikit-learn's cross_validation.train_test_split function.
