In the future we will have a dataset of about 10 million records, and we are looking for a package or framework that can handle 10 million records with at least 12,000 features while keeping training time low.
Can H2O handle data of this size, and does it provide packages that support the functionality listed below?
The package or framework we are looking for should handle all of the following tasks:
1. Pre-processing of words in the corpus (stopword removal, stemming, removal of special characters)
2. Constructing a document-term matrix
3. Sparse matrix support
4. Feature selection methods such as chi-square, information gain, and gain ratio
5. Random forest classification, etc.
Kindly let us know whether H2O can scale up to 10 million rows and 12k columns.
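
To make tasks 1 to 4 concrete, here is a toy-scale sketch in plain Python of what we mean by a sparse document-term matrix and chi-square feature scoring. It is only an illustration and not H2O code: the stopword list, the sample documents, the labels, and the helper names (`tokenize`, `build_sparse_dtm`, `chi_square`) are all made up for this example, and stemming is left out.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"a", "an", "and", "is", "of", "the", "to"}

def tokenize(text):
    """Lowercase, keep only runs of letters, and drop stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOPWORDS]

def build_sparse_dtm(docs):
    """Return (vocab, rows); each row is a sparse {column_index: count} dict."""
    vocab = {}
    rows = []
    for doc in docs:
        row = {}
        for term, count in Counter(tokenize(doc)).items():
            col = vocab.setdefault(term, len(vocab))  # assign next free column
            row[col] = count
        rows.append(row)
    return vocab, rows

def chi_square(rows, labels, col):
    """Chi-square statistic of term presence vs. a binary label (2x2 table)."""
    n = len(rows)
    a = sum(1 for r, y in zip(rows, labels) if col in r and y == 1)
    b = sum(1 for r, y in zip(rows, labels) if col in r and y == 0)
    c = sum(1 for r, y in zip(rows, labels) if col not in r and y == 1)
    d = n - a - b - c
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom

docs = [
    "The match was a great win!",
    "A great goal and a win",
    "The market is down",
    "Stocks and the market fell",
]
labels = [1, 1, 0, 0]  # toy labels: 1 = sports, 0 = finance

vocab, rows = build_sparse_dtm(docs)
scores = {term: chi_square(rows, labels, col) for term, col in vocab.items()}
top_terms = sorted(scores, key=scores.get, reverse=True)[:3]
print(top_terms)
```

At our real scale the rows would of course need to live in a distributed sparse structure rather than Python dicts, which is exactly the capability we are asking about.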
--
Good points, Tom.
My recommendation is to start training your H2O RF with 1,000
columns, see how that goes, then move up to 5,000 and then 12,000.
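
A minimal sketch of that staged approach, assuming you already have the list of feature column names. The helper `staged_column_runs` and the stand-in trainer `fake_train` are hypothetical names for illustration only. With the H2O Python client, the callback would typically wrap something like `H2ORandomForestEstimator(...).train(x=subset, y=..., training_frame=...)`, where the response column and frame are yours to fill in.

```python
import time

def staged_column_runs(all_columns, stage_sizes, train_fn):
    """Train on growing prefixes of all_columns and time each stage.

    train_fn is any callable that accepts a list of column names and
    returns a model (or any result object).
    """
    results = {}
    for size in stage_sizes:
        if size > len(all_columns):
            break  # not enough columns for this stage
        subset = all_columns[:size]
        start = time.perf_counter()
        model = train_fn(subset)
        results[size] = {
            "seconds": time.perf_counter() - start,
            "model": model,
        }
    return results

# Stand-in trainer so the sketch runs without a cluster (hypothetical).
def fake_train(columns):
    return {"n_columns": len(columns)}

columns = [f"term_{i}" for i in range(12000)]
runs = staged_column_runs(columns, [1000, 5000, 12000], fake_train)
for size, info in runs.items():
    print(size, "columns:", round(info["seconds"], 4), "seconds")
```

Comparing the timings between stages gives a rough idea of how training time grows with column count before you commit to the full 12,000.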
-Erin