Data Preprocessing in H2O


carmelo

Apr 5, 2018, 8:50:54 AM
to H2O Open Source Scalable Machine Learning - h2ostream
Hi,
I couldn't find anything in this group or in the documentation that explains whether and how data preprocessing is possible in H2O in general, and in H2O Flow in particular.
The only comment I found was about automatically normalizing the data to N(0,1) for Deep Learning.

Is there something similar to R's caret package, where I can pass a parameter to the preProcess function, or is the only option to perform such preprocessing "outside H2O" and then work with the transformed data frame?
 

"Quick summary of all of the transform methods supported in the method argument of the preProcess() function in caret:

  • "BoxCox": apply a Box-Cox transform; values must be non-zero and positive.
  • "YeoJohnson": apply a Yeo-Johnson transform; like BoxCox, but values can be negative.
  • "expoTrans": apply a power transform like BoxCox and YeoJohnson.
  • "zv": remove attributes with zero variance (all the same value).
  • "nzv": remove attributes with near-zero variance (close to the same value).
  • "center": subtract the mean from values.
  • "scale": divide values by the standard deviation.
  • "range": normalize values.
  • "pca": transform data to the principal components.
  • "ica": transform data to the independent components.
  • "spatialSign": project data onto a unit circle.
"

thanks
Carmelo

Erin LeDell

Apr 6, 2018, 1:32:55 AM
to carmelo, H2O Open Source Scalable Machine Learning - h2ostream

H2O algos that require it (DL, GLM, Naive Bayes) center/scale the inputs automatically.  Tree based algos don't require this so we don't do it.  All algos impute missing values (using the mean by default) automatically.  So there's nothing else you need to do if that's a sufficient amount of pre-processing for your needs.
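
To make the two automatic steps mentioned above concrete, here is a minimal plain-Python sketch of mean imputation followed by centering and scaling on a single numeric column. This is only an illustration of the idea, not H2O's implementation: the function name impute_and_standardize is hypothetical, and the order of operations and the use of the population standard deviation are assumptions made for the example.

```python
from statistics import mean, pstdev
from typing import List, Optional, Sequence


def impute_and_standardize(column: Sequence[Optional[float]]) -> List[float]:
    """Fill missing entries (None) with the column mean, then center and scale.

    Hypothetical helper for illustration only; it does not call H2O.
    """
    observed = [v for v in column if v is not None]
    if not observed:
        raise ValueError("column has no observed values to compute a mean from")

    # Mean imputation: replace each missing entry with the mean of observed values.
    mu = mean(observed)
    filled = [mu if v is None else v for v in column]

    # Assumption: scale by the population standard deviation of the filled column.
    sigma = pstdev(filled)
    if sigma == 0:
        # A constant column cannot be scaled; centering alone leaves all zeros.
        return [0.0 for _ in filled]

    # Center (subtract the mean) and scale (divide by the standard deviation).
    return [(v - mu) / sigma for v in filled]


if __name__ == "__main__":
    print(impute_and_standardize([1.0, None, 3.0]))
```

An imputed entry lands exactly at zero after centering, because it was filled with the mean that is then subtracted.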

We have PCA if you want to use that, but there's no option to turn it on automatically. http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/pca.html

-Erin

-- 
Erin LeDell Ph.D.
Chief Machine Learning Scientist | H2O.ai