Multi class classification for large data set using H2O

ranjana...@gmail.com

Aug 10, 2017, 7:10:34 AM

to H2O Open Source Scalable Machine Learning - h2ostream

We are working on a multi-class classification problem.
Currently the ranger package in R can handle up to 1.1 million records, but training on a machine with 128 GB of RAM takes 12 days, which is not practical.

In the future we will have a dataset of about 10 million records, so we are looking for a package or framework that can handle 10 million records with at least 12,000 features in much less training time.

Can H2O handle data of this size, and does it support the functionality listed below?

The package or framework we are looking for should handle all of the following tasks:

1. Pre-processing of the words in the corpus (stopword removal, stemming, removal of special characters)
2. Constructing the document-term matrix
3. Sparse matrix support
4. Feature selection methods such as chi-square, information gain, and gain ratio
5. Random forest classification, etc.

Kindly let us know whether H2O can scale up to 10 million rows and 12k columns.

Erin LeDell

Aug 10, 2017, 2:07:35 PM

to ranjana...@gmail.com, H2O Open Source Scalable Machine Learning - h2ostream

Hi,

Ranger was designed for genomic data (high-dimensional data, where p >>
n) but it was not designed to scale in terms of the number of rows.
There are some benchmarks in the JSS paper that show this:
https://www.jstatsoft.org/article/view/v077i01

Yes, 10 million rows is easy for H2O if you're using an appropriate
machine. We recommend using a machine with RAM of at least 3x the size
of your dataset on disk. Just like Ranger was designed for datasets
with a lot of columns, H2O was designed for datasets with a lot of
rows. You *should* be able to train a Random Forest with 12k columns,
but the only way to know for sure is to try it.
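
As a rough back-of-the-envelope check on the sizing rule above, the sketch below turns an on-disk size into a RAM figure. The function names and the 8-bytes-per-cell dense estimate are assumptions made for illustration. A sparse document-term matrix will usually be far smaller on disk than the dense worst case, so measure the actual file rather than relying on the dense number.

```python
def recommended_ram_gb(dataset_size_gb, factor=3.0):
    """Apply the 'RAM of at least 3x the dataset size on disk' rule of thumb."""
    return dataset_size_gb * factor


def dense_size_gb(n_rows, n_cols, bytes_per_cell=8):
    """Worst-case size if every cell were stored as an 8-byte number.

    The 8-byte cell width is an assumption, not a statement about how
    any particular file format or H2O stores the data.
    """
    return n_rows * n_cols * bytes_per_cell / 1e9


# Hypothetical example: a 40 GB file on disk suggests about 120 GB of RAM.
print(recommended_ram_gb(40))

# Dense worst case for 10 million rows x 12,000 columns, in GB.
print(dense_size_gb(10_000_000, 12_000))
```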

H2O is not designed to be a text processing engine so there are no
functions for removing stopwords, etc. H2O does have a word2vec
implementation, so you can use that if it's helpful. My
recommendation is to keep your current text processing framework and
then switch over to H2O for the machine learning aspect -- essentially
swap out ranger in your pipeline for H2O. If your current text
processing framework is not cutting it, then you could use Spark to do
the data munging and use H2O's Sparkling Water for the Random Forest piece.

To answer the rest of your questions, H2O can load sparse data from
SVMLight format. H2O does not have special tools for feature selection
-- we rely on the algos to do feature selection themselves via
regularization, or rely on the user to filter out features manually
based on feature importance (all the H2O algos provide feature importance).
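
For the sparse-data part, here is a minimal sketch of writing rows of a document-term matrix in SVMLight format, where each line is `<label> <index>:<value> ...` with ascending feature indices. The helper names, the 1-based indices, the numeric class labels, and the output file name are assumptions for illustration, not part of any H2O API.

```python
def to_svmlight_line(label, features):
    """Format one row as '<label> <index>:<value> ...'.

    `features` maps a 1-based column index to its value; zero entries
    are skipped so only the non-zero cells are written.
    """
    pairs = " ".join(
        f"{idx}:{val:g}" for idx, val in sorted(features.items()) if val != 0
    )
    return f"{label} {pairs}".rstrip()


def write_svmlight(path, rows):
    """Write an iterable of (label, features) pairs, one row per line."""
    with open(path, "w") as fh:
        for label, features in rows:
            fh.write(to_svmlight_line(label, features) + "\n")


# Hypothetical two-document example: class id, then term counts by column.
rows = [(2, {7: 1.0, 3: 2.5}), (0, {1: 4, 12000: 1})]
for label, features in rows:
    print(to_svmlight_line(label, features))
```

Writing only the non-zero cells is what keeps a 12k-column term matrix manageable on disk.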

-Erin

--
Erin LeDell Ph.D.
Statistician & Machine Learning Scientist | H2O.ai

Tom Kraljevic

Aug 10, 2017, 4:04:42 PM

to Erin LeDell, ranjana...@gmail.com, H2O Open Source Scalable Machine Learning - h2ostream

Hi,

I would also add that for multinomial classification with RF, the training time is proportional to the number of levels in the response.

For each “tree” the user requests, the algo underneath builds an internal tree for each level in the response.

So nTrees * numLevels total internal trees get trained.

(Binomial has a special shortcut where it trains just one of the classes and then does 1-probability).

(So, just know upfront, don’t ask for 1000 levels and expect it to take the same time as 2 levels.)
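
The arithmetic above can be sketched as a small helper. The function name is hypothetical; it only restates the nTrees * numLevels rule and the binomial shortcut described in this post.

```python
def internal_tree_count(ntrees, num_levels):
    """Number of internal trees built for a given number of response levels."""
    if num_levels < 2:
        raise ValueError("a classification response needs at least 2 levels")
    if num_levels == 2:
        # Binomial shortcut: one class is trained, the other is 1 - probability.
        return ntrees
    # Multinomial: one internal tree per level for each requested tree.
    return ntrees * num_levels


print(internal_tree_count(50, 2))     # binomial
print(internal_tree_count(50, 1000))  # multinomial with 1000 levels
```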

Thanks,

Tom

Erin LeDell

Aug 10, 2017, 6:50:19 PM

to Tom Kraljevic, ranjana...@gmail.com, H2O Open Source Scalable Machine Learning - h2ostream

Good points, Tom.

My recommendation is to start training your H2O RF with 1,000 columns, see how that goes, then move up to 5,000 then 12,000.
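
A minimal sketch of that staged approach using the H2O Python API, assuming the data is already in a file H2O can import. The helper names, the choice to take the first N feature columns at each stage, and `ntrees=50` are assumptions for illustration; in practice you might pick the columns by frequency or importance instead.

```python
def staged_column_subsets(feature_cols, stage_sizes=(1000, 5000, 12000)):
    """Yield growing prefixes of feature_cols, one per stage size."""
    for size in stage_sizes:
        yield feature_cols[:min(size, len(feature_cols))]


def train_staged(path, response, ntrees=50):
    """Train a Random Forest on each column subset and print the wall time."""
    import time

    import h2o
    from h2o.estimators import H2ORandomForestEstimator

    h2o.init()
    frame = h2o.import_file(path)
    # Treat the response as categorical so H2O does classification.
    frame[response] = frame[response].asfactor()
    features = [c for c in frame.columns if c != response]

    for cols in staged_column_subsets(features):
        start = time.time()
        model = H2ORandomForestEstimator(ntrees=ntrees)
        model.train(x=cols, y=response, training_frame=frame)
        print(len(cols), "columns:", round(time.time() - start, 1), "seconds")
```

Calling something like `train_staged("train.csv", "label")` (both names hypothetical) shows how training time grows with the column count before you commit to all 12,000.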

-Erin
