H2O algos and missing values


hermanv...@gmail.com

Sep 1, 2014, 8:47:37 AM
to h2os...@googlegroups.com
Hi all,

Is there some list of which H2O algos are capable of dealing with NAs? So far I have noticed that GBM and RF are, and that GLM isn't. How about Deep Learning, for instance?

Kind regards,
Herman

arno....@gmail.com

Sep 2, 2014, 11:07:46 AM
to h2os...@googlegroups.com, hermanv...@gmail.com

Hi Herman,
Yes, GBM/RF handle missing values (they go into the left-most bin). Currently, GLM and NaiveBayes skip rows with missing values. DeepLearning performs mean-imputation for missing numericals and creates a separate factor level for missing categoricals by default. KMeans also handles missing values by assuming that missing feature distance contributions are equal to the average of all other distance term contributions.
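The KMeans rule is the least obvious of these, so here is a minimal Python sketch of the behavior as described (my own illustration of the idea, not H2O's actual implementation):

```python
def kmeans_distance(point, center):
    # Squared Euclidean distance where each missing feature (None)
    # contributes the average of the observed features' contributions.
    observed = [(p - c) ** 2 for p, c in zip(point, center) if p is not None]
    n_missing = len(point) - len(observed)
    if not observed:  # every feature missing: no distance information
        return 0.0
    avg = sum(observed) / len(observed)
    return sum(observed) + n_missing * avg

# Observed contributions are 1 and 9; the missing feature adds their average, 5.
print(kmeans_distance([1.0, None, 3.0], [0.0, 0.0, 0.0]))  # 15.0
```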
Best regards,
Arno

cli...@0xdata.com

Sep 2, 2014, 12:15:08 PM
to h2os...@googlegroups.com, hermanv...@gmail.com
In addition to Arno's response: it's fairly straightforward in R to impute various values for NAs before modeling.
Pulled from: http://www.r-bloggers.com/example-2014-5-simple-mean-imputation/

df = data.frame(x = 1:20, y = c(1:10, rep(NA, 10)))   # second half of y is NA
df$y[is.na(df$y)] = mean(df$y, na.rm = TRUE)          # replace NAs with the column mean

Cliff




s...@physiosigns.com

Oct 10, 2016, 1:39:33 PM
to H2O Open Source Scalable Machine Learning - h2ostream, hermanv...@gmail.com, arno....@gmail.com
> Yes, GBM/RF handle missing values (they go into the left-most bin).

Is this still the case?
If missing data goes into the left-most bin, that's not handling missing data. I've seen the following in several decision tree implementations:

if ( value >= cut ) rightbin
else leftbin

What happens with the following rule?

if ( value < cut ) leftbin
else rightbin

... NaN now ends up in the rightbin, because NaN fails every comparison. Really you need:

if ( isNaN(value) ) handleNAN // decide explicitly what to do
else if ( value >= cut ) rightbin
else leftbin

There may be different policies for handling missing values in decision trees -- really, the policy should be a parameter to the algorithm. Almost all of them are better than imputation, and all of them are better than stuffing missing values down an arbitrary branch.
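To make the point concrete, here is a toy Python sketch of a split test that takes the NaN policy as an explicit parameter (hypothetical names, purely for illustration; not any particular library's API):

```python
import math

def route(value, cut, nan_policy="left"):
    # Send NaN where the caller asked, instead of wherever the
    # comparison operator happens to put it.
    if isinstance(value, float) and math.isnan(value):
        return nan_policy
    return "right" if value >= cut else "left"

print(route(5.0, 3.0))                               # right
print(route(float("nan"), 3.0))                      # left (explicit default)
print(route(float("nan"), 3.0, nan_policy="right"))  # right
```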

Tom Kraljevic

Oct 10, 2016, 7:08:11 PM
to s...@physiosigns.com, H2O Open Source Scalable Machine Learning - h2ostream, hermanv...@gmail.com, arno....@gmail.com

Geoffrey Anderson

Oct 27, 2016, 4:46:52 PM
to H2O Open Source Scalable Machine Learning - h2ostream, hermanv...@gmail.com
Can the H2O staff kindly elaborate on the meaning of missing_values_handling = "Skip" for the deeplearning routine, beyond what the deeplearning module's documentation already says about it?

I am consistently seeing failure to converge the MSE when missing_values_handling = "Skip" with autoencoders.

What exactly is being skipped? It's really not self-evident what "Skip" refers to, except that something is being skipped. For example, is it a whole training row, or a single variable within a row? Thanks.

Thanks again; I love the product overall, and also the staff's consistently excellent responsiveness in helping everyone.

Darren Cook

Oct 27, 2016, 5:18:32 PM
to h2os...@googlegroups.com
> Can the H2O staff kindly elaborate on the meaning of missing_values_handling =
> "Skip" for the deeplearning routine, above and beyond the documentation for
> deeplearning module regarding the missing_values_handling = "Skip"?

I'm not "staff" :-)... but I believe the whole row is thrown away
if any column has a missing value.

The alternative, MeanImputation, takes the average of the column
and uses that for any missing values in that column.
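A toy Python sketch of the two policies on a two-column table (my illustration of the behavior as described, not H2O's code):

```python
rows = [[1.0, 10.0], [2.0, None], [3.0, 30.0]]

# "Skip": drop any row that contains a missing value.
skipped = [r for r in rows if None not in r]
print(skipped)   # [[1.0, 10.0], [3.0, 30.0]]

# "MeanImputation": replace each missing value with its column's mean.
cols = list(zip(*rows))
means = [sum(v for v in col if v is not None) / sum(v is not None for v in col)
         for col in cols]
imputed = [[v if v is not None else means[j] for j, v in enumerate(r)]
           for r in rows]
print(imputed)   # [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
```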

> I am consistently seeing failure to converge the MSE,
> when missing_values_handling = "Skip" with autoencoders.

You should be able to see in the model summary how many rows of your
data were actually used for training.

If it is much lower than you expect, you perhaps have one column with
lots of missing values. When that happens it can often be worth just
deleting the column (if it has lots of missing values, there is
probably not much to learn from it anyway).

(In one of the data sets for my book, I discovered that I was losing 85%
of the data due to one bad column -- but only in my validation data set;
needless to say, that was distorting everything!)

Darren


--
Darren Cook, Software Researcher/Developer
My New Book: Practical Machine Learning with H2O,
published by O'Reilly. If interested, let me know and
I'll send you a discount code as soon as it is released.

Erin LeDell

Oct 30, 2016, 6:36:07 PM
to Darren Cook, h2os...@googlegroups.com
Darren is correct -- "Skip" means skip that row in training (i.e., throw
it away).



--
Erin LeDell Ph.D.
Statistician & Machine Learning Scientist | H2O.ai
