H2O + Python - Iterating over rows to create new frames


carolynl...@gmail.com

Apr 4, 2016, 5:23:13 PM
to H2O Open Source Scalable Machine Learning - h2ostream

I have an H2O frame that I want to break into multiple H2O frames, one for each distinct value of a certain field.

(e.g. using Boolean Masking, df2[ df2["B"] == "a", :] )

However, first I need to know all the possible values this field has

(e.g. df2["B"].unique() )

Right now I am doing something that feels very inefficient and sort of kludgey.

df3 = df2['B'].unique()

for i in range(df3.nrow):
    val = df3[i,:].as_data_frame(use_pandas=False)[1][0]
    newframe = df2[df2['B']==val]

    # *** Do Computation on newframe ***


Additionally, when I do this, it eventually (and seemingly at random) produces an error message such as

>ERROR MESSAGE:
>
>Temp ID py_417 already exists

Is there a better way to do this?

-Carolyn

Erin LeDell

Apr 4, 2016, 10:38:25 PM
to carolynl...@gmail.com, H2O Open Source Scalable Machine Learning - h2ostream
Hi Carolyn,
Here's how I would do it (using an example dataset). Let me know if
this works for you.

import h2o
h2o.init()

csv_url = "https://h2o-public-test-data.s3.amazonaws.com/eeg_eyestate_splits.csv"
data = h2o.import_file(csv_url)

categories = data['split'].levels()[0]

# list of H2OFrames that are subsets of your original frame
df_list = [data[data['split'] == x] for x in categories]

print(df_list[0].dim)
#[2996, 16]


Best,
Erin
--
Erin LeDell Ph.D.
Statistician & Machine Learning Scientist | H2O.ai

carolynl...@gmail.com

Apr 5, 2016, 9:45:43 AM
to H2O Open Source Scalable Machine Learning - h2ostream, carolynl...@gmail.com
Oh that's nice.....

Perfect. Thank you!

-Carolyn

carolynl...@gmail.com

Apr 5, 2016, 10:06:56 AM
to H2O Open Source Scalable Machine Learning - h2ostream, carolynl...@gmail.com
Hi Erin,

As a follow-up question, let's say I now want to create a histogram of the distribution of lengths of this set of dataframes.

I wrote:

all_lengths = [x.dim[0] for x in df_list]

However, this call is sloooooowwwwwww. It is taking me much longer to gather this statistic about the frames than it took to construct them in the first place.

Any ideas as to why that is or how this could be improved?

Spencer Aiello

Apr 5, 2016, 1:53:33 PM
to carolynl...@gmail.com, H2O Open Source Scalable Machine Learning - h2ostream
youch! Many H2OFrame ops are already vectorized, so iterating over the H2OFrame is rarely going to be the right thing to do :).

For example, you want frequencies of a categorical column:

      my_frame["my_cat_column"].table()

will give you those frequencies...


Spencer Aiello

Apr 5, 2016, 1:54:39 PM
to carolynl...@gmail.com, H2O Open Source Scalable Machine Learning - h2ostream
i.e., no need to split up the H2OFrame into many H2OFrame objects just to dim[0] the result; it's pointless work when all you really want is a single pass over the H2OFrame.
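To illustrate the single-pass point in plain Python (this is not H2O code; the `split_column` list and its values are made up for illustration), counting group sizes directly visits the data once, whereas splitting first scans it once per distinct value:

```python
from collections import Counter

# Hypothetical stand-in for a categorical column held in local memory.
split_column = ["train", "test", "train", "valid", "train", "test"]

# Split-then-count: one full scan of the column per distinct value.
sizes_by_splitting = {
    value: len([x for x in split_column if x == value])
    for value in set(split_column)
}

# Single pass: every element is visited exactly once.
sizes_single_pass = Counter(split_column)

print(dict(sizes_single_pass))
```

Both approaches give the same counts; the difference is only in how many times the data is scanned.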

Erin LeDell

Apr 5, 2016, 2:26:06 PM
to Spencer Aiello, carolynl...@gmail.com, H2O Open Source Scalable Machine Learning - h2ostream
Spencer, Carolyn,

It sounds like the main goal is to create the subset dataframes and a secondary goal is to get their dimensions.... I don't think it was her goal to create the subset dataframes just to count the number of elements in each category.  However, Spencer is right that you can get this info very efficiently by using the table() method.

In [26]: print data['split'].table()
split      Count
-------  -------
test        2996
train       8988
valid       2996

[3 rows x 2 columns]



carolynl...@gmail.com

Apr 5, 2016, 3:32:34 PM
to H2O Open Source Scalable Machine Learning - h2ostream, carolynl...@gmail.com
Indeed,

What I really want to do is to extract a frame and then perform an operation that takes into account the order of the rows in the frame.

(in pseudo code)

state_value = initial_value

for i in range(frame.dim[0]):
    state_value = defined_function(state_value, frame[i,:])
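The pseudocode above is a left fold over the rows. As a minimal sketch, assuming the subset is small enough to pull into local Python memory first (for example with `as_data_frame`, as used earlier in the thread), the same pattern can be written with `functools.reduce`. The `rows` data and the body of `defined_function` here are hypothetical placeholders:

```python
from functools import reduce


def defined_function(state_value, row):
    # Hypothetical example state update: track the largest "A" seen so far.
    return max(state_value, row["A"])


# Hypothetical rows, assumed already pulled into local memory in row order.
rows = [{"A": 1.0}, {"A": 3.5}, {"A": 2.0}]

initial_value = float("-inf")

# reduce applies defined_function to the running state and each row in order.
state_value = reduce(defined_function, rows, initial_value)
print(state_value)
```

This runs on the client rather than on the H2O cluster, so it only makes sense when each extracted frame is modest in size.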

Spencer Aiello

Apr 5, 2016, 3:40:54 PM
to Carolyn Phillips, H2O Open Source Scalable Machine Learning - h2ostream
Hi Carolyn,

I see.

There are a number of primitive ops that are pre-defined for H2OFrame and a limited number of functions that can be user-defined (for use in H2OFrame.apply).

It looks like you want to accumulate some value/list per-row with some arbitrary function. In previous versions of H2O we've had this js-style accumulator functionality, but it requires backend java pieces that are no longer around.

If you have a clearer description of the function/use-case we could look at building this back into h2o.

There are some vanilla accumulation methods (cumsum, cummin, etc.) as well as some groupby functions (min,max,mean,sd, etc.)... but if these aren't in the realm of what "defined_function" attempts, then it will require more work.
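For readers unfamiliar with the term, a cumulative operation such as cumsum or cummin carries one fixed kind of running state down a column. A plain-Python illustration of those semantics (not H2O code; the `values` list is made up):

```python
from itertools import accumulate
import operator

# Hypothetical numeric column.
values = [3, 1, 4, 1, 5]

# Running sum: each entry is the sum of all values up to that position.
running_sum = list(accumulate(values, operator.add))

# Running minimum: each entry is the smallest value seen so far.
running_min = list(accumulate(values, min))

print(running_sum)
print(running_min)
```

Anything that needs richer state than a single running number falls outside this pattern.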

Alternatively, if you feel comfortable with Java, you could write these types of distributed/parallel ops right on H2O itself via MRTask.

Thanks!

carolynl...@gmail.com

Apr 5, 2016, 4:26:36 PM
to H2O Open Source Scalable Machine Learning - h2ostream, carolynl...@gmail.com
Hi Spencer,

The data frame being extracted is a time series and we want to be able to implement fairly flexible logic to determine if a certain sequence of events happened over the time series.

For example,
1) Did the value in column B drop by more than 50% while the value in column A was rising?
2) What events are separated by no more than 1 second?

As a result, I doubt a vanilla accumulation method will get us there.

What would it take to do this in H2O?
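To make the two example questions concrete, here is a rough sketch in plain Python over rows already held in local memory. It is not H2O code. The column names `t`, `A`, `B`, the sample values, and the choice to compare only consecutive rows are all assumptions made for illustration; the real logic might look at longer windows.

```python
# Hypothetical time-ordered rows: t is in seconds.
rows = [
    {"t": 0.0, "A": 10.0, "B": 100.0},
    {"t": 0.5, "A": 11.0, "B": 90.0},
    {"t": 2.0, "A": 12.0, "B": 40.0},
    {"t": 2.8, "A": 11.5, "B": 45.0},
]


def consecutive_pairs(rows):
    # Pair each row with the row that follows it.
    return zip(rows, rows[1:])


def b_halved_while_a_rose(rows):
    # True if, between some pair of consecutive rows, B fell by more
    # than 50% while A increased (assumes B is positive).
    return any(
        cur["B"] < 0.5 * prev["B"] and cur["A"] > prev["A"]
        for prev, cur in consecutive_pairs(rows)
    )


def close_events(rows, max_gap=1.0):
    # Timestamps of consecutive events separated by at most max_gap seconds.
    return [
        (prev["t"], cur["t"])
        for prev, cur in consecutive_pairs(rows)
        if cur["t"] - prev["t"] <= max_gap
    ]


print(b_halved_while_a_rose(rows))
print(close_events(rows))
```

Both checks depend on row order and on comparing a row with its neighbor, which is why a single cumulative sum or minimum does not express them.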

Carolyn Phillips

Apr 5, 2016, 4:28:13 PM
to H2O Open Source Scalable Machine Learning - h2ostream, Carolyn Phillips
Also, I have heard H2O has stencils, but I don't see them in the documentation.  Are they available for python?


Spencer Aiello

Apr 5, 2016, 10:28:43 PM
to Carolyn Phillips, H2O Open Source Scalable Machine Learning - h2ostream
sure we accept contributions (code or $$)

Erin LeDell

Apr 5, 2016, 11:31:50 PM
to Spencer Aiello, Carolyn Phillips, H2O Open Source Scalable Machine Learning - h2ostream
Carolyn,

If you are interested in contributing to H2O, we have some info here: https://github.com/h2oai/h2o-3/blob/master/CONTRIBUTING.md and feel free to reach out to us directly if you have questions.

However, I would first look into using H2OFrame.apply, as suggested by Spencer, to see if it can solve your problem.

Best,
Erin

Spencer Aiello

Apr 5, 2016, 11:37:22 PM
to Carolyn Phillips, H2O Open Source Scalable Machine Learning - h2ostream
all kidding aside, we can accommodate you in making an open source contribution if you're sufficiently motivated! or we could move this discussion to a different channel to discuss supporting your use case further.

We don't have stencils, do you have an example use? 

We have a lot of requests for time-series (classical and not) capabilities. If you have more use cases, we can discuss further support :)

thanks!
spencer 

carolynl...@gmail.com

Apr 6, 2016, 5:55:07 PM
to H2O Open Source Scalable Machine Learning - h2ostream

Thanks for your help Spencer and Erin