Change H2O feature name in Python


Shi Yu

May 5, 2017, 4:20:32 PM
to H2O Open Source Scalable Machine Learning - h2ostream
I created an H2O frame from a PySpark DataFrame (sparse vectors):

h2o_frame = h2c.as_h2o_frame(all_data)

when I describe it:

h2o_frame.describe()

it shows the automatically named feature names:

feature1,   feature2,  feature3, ..

I tried to rename them using 

h2o_frame.names = myexpected_names

However, this does not work. When I call describe(), or run the model in H2O Flow, the displayed feature names are still "feature1, feature2, feature3, ..."

How can I change those feature names so that they are more meaningful in the H2O figures?

Erin LeDell

May 5, 2017, 6:21:03 PM
to Shi Yu, H2O Open Source Scalable Machine Learning - h2ostream
Any chance you can make a reproducible example?

I have verified that this works in regular Python h2o, so if you can provide a reproducible example, I will file a bug report. Also, please note which version you are using.

Here is a regular h2o Python example:

import h2o
h2o.init()
iris = h2o.import_file(path="https://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv")
iris.names = ["a","b","c","d","e"]

In [9]: iris.names
Out[9]: [u'a', u'b', u'c', u'd', u'e']

-- 
Erin LeDell Ph.D.
Statistician & Machine Learning Scientist | H2O.ai

Shi Yu

May 7, 2017, 12:23:10 PM
to Erin LeDell, H2O Open Source Scalable Machine Learning - h2ostream
Sure, here is the code. I am using H2O 3.10.0.7 with Python 2.7.12 final and Spark 1.6, so I can only reproduce the issue in this environment. I don't have access to other versions of H2O, so I don't know whether it works elsewhere.

The problem is that once the H2O frame has been initialized, it seems the feature (column) names cannot be changed. One might ask why we don't change the names before initialization; the reason is that we use sparse vectors in an RDD, and it is easier for us to reassign the names after the H2O frame is created.


Below is the reproducible code:

# I create an h2o frame first  with old column names:

testdata = sc.parallelize([[1,2],[3,4],[5,6]])
df = testdata.toDF(['col1','col2'])
h2o_test = h2c.as_h2o_frame(df)
h2o_test.describe()

#--display--
         col1  col2
type     int   int
mins     1.0   2.0
mean     3.0   4.0
maxs     5.0   6.0
sigma    2.0   2.0
zeros    0     0
missing  0     0
0        1.0   2.0
1        3.0   4.0
2        5.0   6.0

# then I change the first feature's name

h2o_test.names[0]='newcol1'
h2o_test.names

# ['newcol1', u'col2']  looks correct

# But describe() still shows the old name, and H2O model training and its figures still use the old names.
h2o_test.describe()

         col1  col2
type     int   int
mins     1.0   2.0
mean     3.0   4.0
maxs     5.0   6.0
sigma    2.0   2.0
zeros    0     0
missing  0     0
0        1.0   2.0
1        3.0   4.0
2        5.0   6.0


Shi Yu

May 7, 2017, 10:33:33 PM
to Erin LeDell, H2O Open Source Scalable Machine Learning - h2ostream
After some searching on the web, I found that this method works:

h2o_test.set_names(['newcol1','newcol2'])
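
A toy sketch of one possible explanation, not H2O's actual implementation: if the list returned by names is only a client-side cache, item assignment edits that cache and never reaches the backend that describe() reads, whereas set_names pushes the whole list through. ToyFrame and describe_header are hypothetical names used only for illustration.

```python
class ToyFrame:
    """Hypothetical stand-in for a client-side frame handle (not H2O code)."""

    def __init__(self, names):
        self._backend_names = list(names)  # what the server side would report
        self._cached_names = list(names)   # client-side copy handed out by .names

    @property
    def names(self):
        # Returns the cached list itself, so callers can mutate it in place.
        return self._cached_names

    def set_names(self, names):
        # An explicit call updates both the backend and the local cache.
        self._backend_names = list(names)
        self._cached_names = list(names)

    def describe_header(self):
        # Plays the role of the column header printed by describe().
        return list(self._backend_names)


frame = ToyFrame(["col1", "col2"])
frame.names[0] = "newcol1"               # edits only the local cache
local_view = list(frame.names)
backend_view = frame.describe_header()   # backend is unchanged

frame.set_names(["newcol1", "newcol2"])  # goes through the explicit method
after_set_names = frame.describe_header()
```

In this model the local view and the backend view disagree after the item assignment, which matches the symptom reported above; set_names brings them back in sync.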

Erin LeDell

May 7, 2017, 11:38:01 PM
to Shi Yu, H2O Open Source Scalable Machine Learning - h2ostream

Great,

Looks like you've solved the problem. If I remember correctly, we had this same issue in the h2o Python module a while back (where frames were uploaded directly from disk, not copied from Spark). That bug was fixed, but the fix may not work for frames copied over from Spark for whatever reason.

I filed a bug report here: https://0xdata.atlassian.net/browse/SW-425

Thanks!

-Erin

Erin LeDell

May 8, 2017, 1:12:18 AM
to Shi Yu, H2O Open Source Scalable Machine Learning - h2ostream

Yes, good question. 

If you google terms like "h2o python docs" or something similar, you may find older versioned copies of the documentation.  What you're looking for is:

Python module docs: http://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/index.html

You can find this link by going to docs.h2o.ai and scrolling down to the Python section. This is the "Python module documentation", not to be confused with the regular H2O documentation, a.k.a. the "H2O User Guide".

For the H2O User Guide: http://docs.h2o.ai/h2o/latest-stable/h2o-docs/index.html  Or you can find it by going to docs.h2o.ai and clicking on "H2O User Guide."

Best,
Erin



On 5/7/17 9:46 PM, Shi Yu wrote:
Yes, it was confusing because h2o.names and h2o.describe show different results. But I did see https://0xdata.atlassian.net/browse/PUBDEV-2466 and got a hint from it.

BTW, is there a way to get an up-to-date Python API reference (list of methods) for H2O? I found many pieces of information here and there, but it is hard to find a go-to place (most of them are for R, not Python).

Hansu Gu

Oct 2, 2017, 5:33:40 PM
to H2O Open Source Scalable Machine Learning - h2ostream
A follow-up question on the column name change. It does not work as expected; here is the code:

iris.names[0] = 'a'
iris.names

We still get ['sepal_len', 'sepal_wid', 'petal_len', 'petal_wid', 'class']

Can we get some help?

Hansu

Erin LeDell

Oct 2, 2017, 6:58:36 PM
to Hansu Gu, H2O Open Source Scalable Machine Learning - h2ostream

The Python H2OFrame follows pandas conventions (as much as we can), though we have aliases: columns, col_names, names.

In a pandas DataFrame, the way you're doing it doesn't work either:

import pandas
df = pandas.DataFrame([{'c1':3,'c2':10},{'c1':2, 'c2':30},{'c1':1,'c2':20},{'c1':2,'c2':15},{'c1':2,'c2':100}])
df.columns[0] = "a1"  #gives an error
df.columns.values[0] = "a1"  #works

The pandas DataFrame has a rename method: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rename.html

df = df.rename(columns = {'c1':'bb'})

And the H2OFrame has a set_name (and set_names) method (I opened a ticket to create a rename method that wraps it). It looks like set_name works but is buggy: it throws an error yet completes the rename. We will fix this in the next bug fix release: https://0xdata.atlassian.net/browse/PUBDEV-4969

import h2o
h2o.init()

hf = h2o.H2OFrame(df)
hf = hf.set_name('c2', 'bb')  #this contains a bug: it throws an error, but at the same time also renames the column

    
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-59-b1fb03098a83> in <module>()
----> 1 hf = hf.set_name(col = 'c2', name = 'bb')
/usr/local/lib/python2.7/site-packages/h2o/frame.pyc in set_name(self, col, name)
   1051             self._frame()._ex._cache.fill()
   1052         else:
-> 1053             self._ex._cache._names = self.names[:col] + [name] + self.names[col + 1:]
   1054             self._ex._cache._types[name] = self._ex._cache._types.pop(oldname)
   1055         return
TypeError: slice indices must be integers or None or have an __index__ method
# see that it's updated
hf.names
# [u'c1', u'bb']
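
The TypeError above is what plain Python raises when a string is used as a slice bound, which suggests that col was still the column name 'c2' rather than its integer position when line 1053 ran. A minimal sketch of that failure and of one way around it (resolving the name to an index first); the variable names are illustrative, and the actual fix in H2O may differ:

```python
names = ["c1", "c2"]
col = "c2"  # a column name, not an integer position

# Slicing a list with a string bound raises TypeError.
try:
    names[:col]
    slicing_failed = False
except TypeError:
    slicing_failed = True

# Resolving the name to its position first makes the same expression valid.
idx = names.index(col)
renamed = names[:idx] + ["bb"] + names[idx + 1:]
```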

If you want to replace the whole list of names, you can use h2o.H2OFrame.set_names, which does not throw any errors.
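
As a sketch of that whole-list approach, a small helper can build the full name list from a partial mapping, in the spirit of the pandas rename call above. apply_renames is a hypothetical helper, not part of the h2o module; only set_names comes from this thread, and the call to it is left as a comment because it needs a running H2O cluster.

```python
def apply_renames(names, mapping):
    """Return a new list of names with the renames in `mapping` applied.

    Hypothetical helper for illustration; raises KeyError if a column
    to be renamed is not present in `names`.
    """
    unknown = [old for old in mapping if old not in names]
    if unknown:
        raise KeyError("columns not found: %s" % unknown)
    return [mapping.get(name, name) for name in names]


new_names = apply_renames(["c1", "c2"], {"c2": "bb"})

# With a live H2OFrame `hf`, the result would then be passed on in one call:
# hf.set_names(apply_renames(hf.names, {"c2": "bb"}))
```

Checking for unknown columns up front avoids silently passing an unchanged list to set_names when a name is misspelled.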

-Erin

Erin LeDell

Oct 2, 2017, 7:31:49 PM
to Hansu Gu, H2O Open Source Scalable Machine Learning - h2ostream

Follow-up: the TypeError that you get with set_name was recently fixed, and it works in 3.14.0.3.
