H2O data frame not using reference level from R data frame


mr.li...@gmail.com

May 9, 2016, 1:38:51 PM
to H2O Open Source Scalable Machine Learning - h2ostream
I have an R data frame that I convert to an H2O data frame for modeling. After the conversion, the reference levels of the factors no longer match those of the original R data. For example, my "GENDER" variable initially has reference level "F", and I'd like to change that to "M":

mydata$GENDER = relevel(mydata$GENDER, ref='M')
mydata.h2o = as.h2o(mydata)

The first line of code successfully changes the reference to "M" in the "mydata" R data frame. Yet after running the second line of code to convert to an H2O data frame, that new reference level is lost. Can anyone explain why?

~ Li

Erin LeDell

May 9, 2016, 1:47:59 PM
to mr.li...@gmail.com, H2O Open Source Scalable Machine Learning - h2ostream

Li,

Check out h2o.setLevels(); it will allow you to change the order of your levels (and hence redefine your reference level).
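
A minimal sketch of fixing the reference level after conversion, not a definitive recipe. As far as the h2o documentation describes it, h2o.setLevels() expects the new levels to be aligned with the existing ones, so it is worth checking the result with h2o.table() to confirm that values were not simply relabeled. Newer releases of the h2o R package also provide h2o.relevel(x, y), which moves a chosen level to the front. The sketch assumes such a release and a running cluster; the helper name as_h2o_with_reference is hypothetical.

```r
# Hypothetical helper (not part of h2o): convert an R data.frame to an
# H2OFrame, then move `ref` to the front of the levels of `column`.
# Assumes the h2o package is installed and h2o.init() has been called.
as_h2o_with_reference <- function(df, column, ref) {
  hf <- h2o::as.h2o(df)
  # h2o.relevel() reorders the levels so that `ref` becomes the first one.
  hf[, column] <- h2o::h2o.relevel(hf[, column], ref)
  hf
}

# Usage (requires a running H2O cluster):
# h2o::h2o.init()
# mydata.h2o <- as_h2o_with_reference(mydata, "GENDER", "M")
# h2o::h2o.levels(mydata.h2o$GENDER)
```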

-Erin

-- 
Erin LeDell Ph.D.
Statistician & Machine Learning Scientist | H2O.ai

mr.li...@gmail.com

May 9, 2016, 2:08:29 PM
to H2O Open Source Scalable Machine Learning - h2ostream, mr.li...@gmail.com


Hi Erin,

Thanks for the suggestion, and in fact that's exactly what I'm doing as a workaround. But it would be nice if the "as.h2o" function could automatically inherit or at least provide the option of inheriting the reference level from the R data so that I won't need to use the "h2o.setLevels()" function. Perhaps that could be a new feature?

Best,
~ Li

Erin LeDell

May 9, 2016, 2:15:29 PM
to mr.li...@gmail.com, H2O Open Source Scalable Machine Learning - h2ostream

Li,

OK, I'm glad you know about h2o.setLevels().

H2O's parser always orders factor levels alphabetically. I agree that it would make sense to inherit any factor level information from the data.frame in R or the Pandas DataFrame in Python. I will have to look into how to make that work across all our different APIs, and whether or how changing the expected behavior of our parsing would affect the rest of the data processing workflow.
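
To see why the releveled reference is lost, here is a small base-R illustration. It does not call H2O at all; it only assumes the alphabetical ordering described above and compares it with the level order that relevel() produces in R. The variable names are made up for the example.

```r
# A toy factor standing in for the GENDER column.
gender <- factor(c("F", "M", "M", "F"))
gender <- relevel(gender, ref = "M")

# Level order as R stores it after relevel(): the reference comes first.
r_levels <- levels(gender)

# Level order an alphabetical parse would produce from the same values.
alphabetical <- sort(unique(as.character(gender)))

# TRUE when the two orders differ, i.e. when the reference level would
# need to be set again after conversion.
needs_fix <- !identical(r_levels, alphabetical)
```

Here r_levels starts with "M" while alphabetical starts with "F", which is the mismatch reported at the top of the thread.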

Thanks for the suggestion.

-Erin

Erin LeDell

May 9, 2016, 2:18:19 PM
to mr.li...@gmail.com, H2O Open Source Scalable Machine Learning - h2ostream

I'll also add that most of our users don't convert an R data.frame with as.h2o(); instead they use h2o.importFile() to move the data from disk directly into the H2O cluster (skipping the step where you duplicate the data in R memory).

Unless you need to do data munging in R itself, it is best to load the data frame into H2O memory and use the rest of the H2O utilities to munge the data directly in H2O. (That way you also wouldn't have to set the levels twice.)
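
The import-first workflow could look roughly like the sketch below. It assumes a running cluster and an h2o release that includes h2o.relevel(); the helper name import_with_reference and the file path in the usage comment are hypothetical.

```r
# Hypothetical helper (not part of h2o): load a file straight into the
# H2O cluster and set the reference level there, without building an
# R data.frame first.
import_with_reference <- function(path, column, ref) {
  hf <- h2o::h2o.importFile(path)
  # Make sure the column is categorical before reordering its levels.
  hf[, column] <- h2o::h2o.asfactor(hf[, column])
  hf[, column] <- h2o::h2o.relevel(hf[, column], ref)
  hf
}

# Usage (requires a running H2O cluster; the path is a placeholder):
# h2o::h2o.init()
# mydata.h2o <- import_with_reference("mydata.csv", "GENDER", "M")
```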

-Erin


Li Yang

May 9, 2016, 3:17:03 PM
to Erin LeDell, H2O Open Source Scalable Machine Learning - h2ostream
Got it. That makes sense. Thanks, Erin.

~ Li