Is there a list of which H2O algorithms can handle NAs? So far I have noticed that GBM and RF can, and that GLM can't. How about DeepLearning, for instance?
Kind regards,
Herman
Hi Herman,
Yes, GBM/RF handle missing values (they go into the left-most bin). Currently, GLM and NaiveBayes skip rows with missing values. DeepLearning performs mean-imputation for missing numericals and creates a separate factor level for missing categoricals by default. KMeans also handles missing values by assuming that missing feature distance contributions are equal to the average of all other distance term contributions.
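To make the KMeans rule above concrete, here is a minimal sketch of one way to read it: each missing distance term is assumed to equal the average of the observed terms, so the sum over observed features is scaled up to the full feature count. The function name is hypothetical and this is not taken from H2O's source.

```python
import math

def squared_distance_with_missing(point, center):
    """Squared Euclidean distance that skips NaN features in `point` and
    scales the observed sum up to the full number of features."""
    observed = [(p - c) ** 2 for p, c in zip(point, center) if not math.isnan(p)]
    if not observed:
        return float("nan")  # no observed features: distance is undefined
    # Assumption: every missing term equals the mean of the observed terms,
    # so total = mean(observed) * number_of_features.
    return sum(observed) * len(point) / len(observed)

# Second feature is missing; observed terms are 1 and 9.
d_missing = squared_distance_with_missing([1.0, float("nan"), 3.0], [0.0, 0.0, 0.0])
# No missing features: plain squared Euclidean distance.
d_full = squared_distance_with_missing([1.0, 2.0, 3.0], [0.0, 0.0, 0.0])
```

With one of three features missing, the observed sum of 10 is scaled by 3/2, so the point is neither pulled toward nor pushed away from the center by the missing value.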
Best regards,
Arno
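# Example data: y has 10 observed values followed by 10 NAs
df = data.frame(x = 1:20, y = c(1:10,rep(NA,10)))
# Mean imputation: replace every NA in y with the mean of the observed values
df$y[is.na(df$y)] = mean(df$y, na.rm=TRUE)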
Cliff
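For readers who want to try the two imputation styles described for DeepLearning outside of R, here is a rough Python sketch: mean imputation for a numeric column and a separate level for a categorical column. The helper names and the "(missing)" label are made up for illustration and are not H2O's API.

```python
import math

def impute_numeric_mean(values):
    """Replace NaN entries with the mean of the observed entries."""
    observed = [v for v in values if not math.isnan(v)]
    if not observed:
        return list(values)  # nothing observed, nothing to impute from
    mean = sum(observed) / len(observed)
    return [mean if math.isnan(v) else v for v in values]

def impute_categorical_level(values, missing_label="(missing)"):
    """Replace None entries with a dedicated extra factor level."""
    return [missing_label if v is None else v for v in values]

# Same shape as the R example: 1..10 followed by ten missing values.
y = [float(i) for i in range(1, 11)] + [float("nan")] * 10
y_imputed = impute_numeric_mean(y)
colors = impute_categorical_level(["red", None, "blue"])
```

Here the ten missing values in `y` all become 5.5, the mean of 1 through 10, and the missing color becomes its own level rather than being merged into an existing one.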
Is this still the case?
If missing data goes into the left-most bin, that's not handling missing data. I've seen the following in several decision tree implementations:
if ( value >= cut ) rightbin
else leftbin
What happens with the following rule?
if ( value < cut ) leftbin
else rightbin
... NaN is now in the rightbin, because every ordered comparison with NaN evaluates to false. Really you need:
if ( isNaN(value) ) handleNAN // decide what to do
else if ( value >= cut ) rightbin
else leftbin
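The difference between the rules above is easy to check in any language with IEEE 754 floats. A small Python sketch follows; the function names are only for illustration.

```python
import math

def rule_ge(value, cut):
    # First rule: the test is "value >= cut", everything else goes left.
    return "right" if value >= cut else "left"

def rule_lt(value, cut):
    # Second rule: the test is "value < cut", everything else goes right.
    return "left" if value < cut else "right"

def rule_explicit(value, cut):
    # Third rule: check for NaN first, so the comparison never sees it.
    if math.isnan(value):
        return "needs a decision"
    return "right" if value >= cut else "left"

nan = float("nan")
ge_branch = rule_ge(nan, 5.0)        # "left": nan >= 5.0 is False
lt_branch = rule_lt(nan, 5.0)        # "right": nan < 5.0 is also False
explicit = rule_explicit(nan, 5.0)   # "needs a decision"
```

For ordinary values the first two rules agree; they only disagree on NaN, which is why the branch a missing value takes ends up depending on how the comparison happens to be written.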
There may be different policies for handling missing values in decision trees -- really it should be a parameter to the algorithm. Almost all of them are better than imputation. All of them are better than stuffing null values down an arbitrary branch.
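As a rough sketch of what "a parameter to the algorithm" could look like, the routing function below takes the missing-value policy as an argument. The policy names, and the "majority" rule that sends missing values to the child that received more training rows, are illustrative assumptions and not an existing H2O option.

```python
import math
from enum import Enum

class NaPolicy(Enum):
    LEFT = "left"          # always send missing values left
    RIGHT = "right"        # always send missing values right
    MAJORITY = "majority"  # follow the child with more training rows

def route(value, cut, policy, n_left=0, n_right=0):
    """Pick a branch for `value`; NaN is routed by `policy`, not by comparison."""
    if math.isnan(value):
        if policy is NaPolicy.MAJORITY:
            return "left" if n_left >= n_right else "right"
        return policy.value
    return "right" if value >= cut else "left"

nan = float("nan")
fixed = route(nan, 5.0, NaPolicy.RIGHT)
majority = route(nan, 5.0, NaPolicy.MAJORITY, n_left=3, n_right=7)
ordinary = route(7.0, 5.0, NaPolicy.LEFT)
```

The point is only that the choice is explicit and visible to the user; other policies could be plugged into the same place.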