Pass a pandas DataFrame to H2O

cong yue

Jul 30, 2015, 5:46:49 PM
to H2O Open Source Scalable Machine Learning - h2ostream
Hello:

How can I pass a pandas DataFrame to H2O?
I checked the documentation of 

I tried my_df_h2o = h2o.frame.H2OFrame(python_obj = my_df), and it said there is no H2OFrame function in h2o.frame.
When I tried my_df_h2o = h2o.H2OFrame(python_obj = my_df), it said
ValueError: `python_obj` must be a tuple, list, dict, collections.OrderedDict. Got: <class 'pandas.core.frame.DataFrame'>

Could somebody advise how I can pass a pandas DataFrame to H2O? I would like to avoid writing it to disk temporarily.


Thanks,
Cong

Spencer Aiello

Jul 30, 2015, 5:59:49 PM
to cong yue, H2O Open Source Scalable Machine Learning - h2ostream
try one of these:


    import h2o
    import pandas as pd
    f = pd.read_csv(...)

    h2o.H2OFrame(f.values.tolist())   # rows only: the header (column names) is lost

    h2o.H2OFrame(f.to_dict())         # columns may come out of order (Python dict)
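
A small pandas-only sketch of why the first option has no header: `values.tolist()` returns the rows as plain lists, and the column names live separately in `columns`. The two-column frame is a hypothetical stand-in for the CSV, and no H2O call is made here.

```python
import pandas as pd

# Hypothetical stand-in for the frame loaded with pd.read_csv(...)
f = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

rows = f.values.tolist()      # one list per row; column names are not included
header = f.columns.tolist()   # the names that the row lists leave behind

print(rows)    # [[1, 4], [2, 5], [3, 6]]
print(header)  # ['A', 'B']
```

Keeping `header` around makes it possible to restore the names after the import.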



cong yue

Jul 30, 2015, 6:03:00 PM
to H2O Open Source Scalable Machine Learning - h2ostream, yueco...@gmail.com, spe...@h2o.ai
Thanks. Got the point now.

cong yue

Jul 30, 2015, 6:23:09 PM
to H2O Open Source Scalable Machine Learning - h2ostream, yueco...@gmail.com, spe...@h2o.ai
I checked the data carefully and found that the imported data is wrong. I am not sure whether it is a bug or whether I did something wrong.

The DataFrame is shown in the attachment; it is the Kaggle Titanic data.
The result of train_df.to_dict() looks like
{'Age': {0: 22.0,
  1: 38.0,
  2: 26.0,
  3: 35.0,
  4: 35.0,
  5: 29.69911764705882,
  6: 54.0,
  7: 2.0,
  8: 27.0,
...
which is as expected.

But with the following code, it seems that only the index keys of each inner dictionary are imported instead of the values.
train_df_h2o = h2o.H2OFrame(python_obj=train_df.to_dict())
train_df_h2o.show()

See the attachment for detail.
Screenshot from 2015-07-30 15:19:35.png
Screenshot from 2015-07-30 15:22:39.png

Spencer Aiello

Jul 30, 2015, 6:35:44 PM
to cong yue, H2O Open Source Scalable Machine Learning - h2ostream
this is what my comment on Python dictionaries is about. They do not preserve order. 

cong yue

Jul 30, 2015, 6:39:16 PM
to Spencer Aiello, H2O Open Source Scalable Machine Learning - h2ostream
Hi Spencer:

The issue is not that the order is lost, but that the data itself is not imported at all.
You can see that all the values are lost and only the index numbers (1, 2, 3, ...) are imported.

thanks,
Cong

Spencer Aiello

Jul 30, 2015, 6:47:42 PM
to cong yue, H2O Open Source Scalable Machine Learning - h2ostream
looks like you've got your pandas frame indexed -- try removing the index and seeing what you get out of the to_dict

Spencer Aiello

Jul 30, 2015, 6:49:51 PM
to cong yue, H2O Open Source Scalable Machine Learning - h2ostream
pass "list" into to_dict
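
To illustrate the difference on the pandas side only (no H2O involved): by default `to_dict()` nests each column as `{index: value}`, while `orient="list"` gives each column as a plain list of values. The three-row frame below is a made-up sample, not the actual Titanic file.

```python
import pandas as pd

# Made-up sample with a default integer index
train_df = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0],
    "Ticket": ["A/5 21171", "PC 17599", "113803"],
})

nested = train_df.to_dict()                 # default orient="dict": {column: {index: value}}
as_lists = train_df.to_dict(orient="list")  # {column: [value, value, ...]}

print(nested["Age"])    # {0: 22.0, 1: 38.0, 2: 26.0}
print(as_lists["Age"])  # [22.0, 38.0, 26.0]
```

The nested form is what puts index keys where column values were expected.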

here's the documentation on pandas to_dict:

cong yue

Jul 30, 2015, 7:01:36 PM
to H2O Open Source Scalable Machine Learning - h2ostream, yueco...@gmail.com, spe...@h2o.ai
yup. I also just got it. 

train_df_h2o = h2o.H2OFrame(python_obj=train_df.to_dict('list'))

But it seems some values are still lost. Please check the attachment. Several values are missing in the 'Ticket' column. Could you please advise on the right way to import data from a Python object?

thanks,
Cong
Screenshot from 2015-07-30 15:59:49.png

cong yue

Jul 30, 2015, 7:09:57 PM
to H2O Open Source Scalable Machine Learning - h2ostream, yueco...@gmail.com, spe...@h2o.ai
https://github.com/h2oai/h2o-3/blob/master/h2o-py/h2o/frame.py

Internally, it seems the Python object is written to a temporary file anyway, so it might be better for me to write it to a CSV from pandas and then just pass the path to H2O.
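
A minimal sketch of the pandas half of that approach, assuming a made-up two-column frame and a temporary directory. `index=False` keeps the pandas row index out of the file. Handing the path to H2O is left as a comment because it needs a running cluster.

```python
import os
import tempfile

import pandas as pd

# Made-up sample standing in for the real training data
train_df = pd.DataFrame({
    "Age": [22.0, 38.0],
    "Ticket": ["A/5 21171", "113803"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "train.csv")  # hypothetical file name
    train_df.to_csv(csv_path, index=False)         # header row is written, index is not

    with open(csv_path) as fh:
        first_line = fh.readline().strip()

    # At this point csv_path could be passed to H2O (not executed here).

print(first_line)  # Age,Ticket
```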

thanks,
Cong

Spencer Aiello

Jul 30, 2015, 7:11:58 PM
to cong yue, H2O Open Source Scalable Machine Learning - h2ostream
i agree -- you'll have to import the dataset but force the h2o parser to read the column as an ENUM instead of as numeric -- it's forcing non-numeric values to NA in that column.
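
As a rough analogy only (this is pandas, not the H2O parser): coercing a mixed column to numeric turns every non-numeric entry into a missing value, while treating it as categorical keeps every entry. The ticket strings are illustrative.

```python
import pandas as pd

# Illustrative ticket values: some look numeric, some do not
tickets = pd.Series(["A/5 21171", "PC 17599", "113803"])

as_numeric = pd.to_numeric(tickets, errors="coerce")  # non-numeric strings become NaN
missing = as_numeric.isna().tolist()

as_category = tickets.astype("category")              # all three values survive

print(missing)               # [True, True, False]
print(as_category.tolist())  # ['A/5 21171', 'PC 17599', '113803']
```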

Spencer Aiello

Jul 30, 2015, 7:19:15 PM
to cong yue, H2O Open Source Scalable Machine Learning - h2ostream
i've added a jira so that this feature gets added:



for now, you can do it "the hard way" by first dumping the data to disk and then doing the parse as follows:

fraw = h2o.import_file("/path/to/your/data") 
fsetup = h2o.parse_setup(fraw) 
fsetup["column_types"][1] = "Enum" # change second column "CAPSULE" to categorical 
fr = h2o.parse_raw(fsetup) 


hope that helps!

andrew...@gmail.com

Oct 6, 2015, 3:24:11 PM
to H2O Open Source Scalable Machine Learning - h2ostream

I would like to +1 this feature. I cannot use h2o.import_file("/path/to/your/data") for files in the local file system on my edge node at work, so I would have to write to HDFS and then read from there, which is too much I/O in my opinion.

Using pandas_df.to_dict() messes up the data, as noted above.

Spencer Aiello

Oct 6, 2015, 3:56:21 PM
to andrew...@gmail.com, H2O Open Source Scalable Machine Learning - h2ostream
@Andrew

There's an additional argument you need to add on the pandas side of things (please refer to the above discussion for this line):

           train_df_h2o = h2o.H2OFrame(python_obj=train_df.to_dict('list'))

@Cong, you should be able to supply a dictionary of column types to h2o.import_file (requires a bleeding-edge build)


Spencer

Spencer Aiello

Oct 6, 2015, 4:03:04 PM
to andrew...@gmail.com, H2O Open Source Scalable Machine Learning - h2ostream
oh, one more thing:

you can use 

        h2o.upload_file

to push from your local machine to a remote machine (with no intermediary python parse).

Bingjing Gu

Jul 11, 2016, 11:59:18 AM
to H2O Open Source Scalable Machine Learning - h2ostream, yueco...@gmail.com
If tolist() gives no header, how does H2O know which columns are the predictor variables?
Is passing just x and y enough?

Lauren DiPerna

Jul 11, 2016, 1:58:14 PM
to Bingjing Gu, H2O Open Source Scalable Machine Learning - h2ostream, yueco...@gmail.com
When you pass a pandas data frame to an H2OFrame, the headers take the form C1, C2, C3, etc.

To see this, try the following code:

import pandas as pd
import h2o
h2o.init()
frame = pd.DataFrame({'A':[1,2,3],'B':[4,5,6]})
h2o.H2OFrame(frame)


You can either rename those headers with their original names or use the new headers. Either way, you have to specify which columns are your features and which column you want to predict: pass the C1, C2, C3, etc. headers that correspond to your features to `x`, and the C# header of the column you want to predict to `y`.
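
A sketch of that bookkeeping in pandas only, assuming the generic headers follow the original column order and, purely for illustration, that the last column is the one to predict. Passing `x` and `y` to an H2O model is not shown.

```python
import pandas as pd

frame = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# Generic headers C1, C2, ... in the same order as the pandas columns (assumption)
generic = ["C%d" % (i + 1) for i in range(len(frame.columns))]
name_map = dict(zip(generic, frame.columns))  # generic header -> original name

y = generic[-1]   # illustration: treat the last column as the one to predict
x = generic[:-1]  # every other column is a feature

print(name_map)  # {'C1': 'A', 'C2': 'B'}
print(x, y)      # ['C1'] C2
```

`name_map` can also be used to rename the headers back to their original names.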

cheers,

Lauren
