Pandas groupby: Possible to build a custom grouper with overlapping groups?

Olof Sjöbergh

Apr 22, 2015, 11:06:22 AM

to pyd...@googlegroups.com

Hi,

I have some problems where I would like to create Pandas groupings with overlapping groups, and am trying to find out if this is possible. Basically, I would like to build a custom grouper that assigns some items to multiple groups, and then be able to run aggregations on those groups.

This is related to a question posted on Stackoverflow here: http://stackoverflow.com/questions/29032937/aggregate-events-with-start-and-end-times-with-pandas

The data there is for a number of events that have both a start and an end time, and then I need to calculate certain aggregations for the items active at different times. A custom grouper that could assign the items to multiple groups would be a nice solution.
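As a rough sketch of what "assigning items to multiple groups" could look like without a custom grouper: repeat each event once per time bucket it overlaps, then use an ordinary groupby. The column names (`start`, `end`, `value`), the hourly bucket size, and the sample data are all made up for illustration; they are not taken from the Stack Overflow question.

```python
import pandas as pd

# Hypothetical events; names and values are illustrative only.
events = pd.DataFrame({
    "start": pd.to_datetime(["2015-01-01 00:10", "2015-01-01 00:50", "2015-01-01 02:05"]),
    "end": pd.to_datetime(["2015-01-01 01:20", "2015-01-01 02:30", "2015-01-01 02:40"]),
    "value": [1.0, 2.0, 4.0],
})

# Emit one (bucket, value) row for every hourly bucket an event touches,
# so an event spanning several hours lands in several groups.
rows = [
    (bucket, value)
    for start, end, value in zip(events["start"], events["end"], events["value"])
    for bucket in pd.date_range(start.floor("h"), end.floor("h"), freq="h")
]
expanded = pd.DataFrame(rows, columns=["bucket", "value"])

# Plain groupby on the expanded frame: events active in each hour.
active = expanded.groupby("bucket")["value"].agg(["count", "sum"])
print(active)
```

The cost is memory: the expanded frame has one row per (event, bucket) pair, so long events with fine buckets can blow it up considerably.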

I have tried to read through the source in groupby.py, but it's a bit hard to follow in some parts.

Do you think it is possible to build such a custom grouper, or is it not supported?

Best regards,

Olof Sjöbergh

Teake Nutma

Jan 7, 2016, 8:52:55 AM

to PyData

FWIW, I'm also interested in this. If anybody could give some pointers I'd be much obliged.
Best,

Teake Nutma

tom

Jan 7, 2016, 9:03:41 AM

to pyd...@googlegroups.com

I wonder if this is something that Stephan’s IntervalIndex can handle naturally? The tricky part would be getting the intervals I think.
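For reference, an interval index is available in pandas as `pandas.IntervalIndex`. The following is only a sketch of how it might be used for membership tests, not a grouper: it checks which intervals contain each query point and aggregates the matching rows. The numeric `start`/`end` columns, `value`, and `query_times` are hypothetical.

```python
import pandas as pd

# Hypothetical events with numeric start/end times.
events = pd.DataFrame({
    "start": [0.0, 1.0, 4.0],
    "end": [3.0, 5.0, 6.0],
    "value": [1.0, 2.0, 4.0],
})

# One closed interval per event.
intervals = pd.IntervalIndex.from_arrays(
    events["start"].to_numpy(), events["end"].to_numpy(), closed="both"
)

# For each query time, select the events whose interval contains it.
query_times = [0.5, 2.0, 4.5]
totals = {
    t: events.loc[intervals.contains(t), "value"].sum()
    for t in query_times
}
print(totals)
```

This loops over the query points in Python, so it answers "what is active at time t" but does not by itself give a single groupby object with overlapping groups.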

Teake, do you have example data for your problem? Or does the Stackoverflow question accurately capture it?

- Tom

Teake Nutma

Jan 8, 2016, 3:13:29 AM

to PyData

The data in my case is something like the following:

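import numpy
import pandas
import scipy.stats  # the stats submodule has to be imported explicitly

n = 500
# x is uniform on [-2.5, 2.5); y is a noisy standard normal density evaluated at x
x = 5 * (numpy.random.rand(n) - 0.5)
y = scipy.stats.norm.pdf(x) + 0.05 * numpy.random.randn(n)
df = pandas.DataFrame({'x': x, 'y': y})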

I'd like to group by, e.g., |x - x_i| < 0.1 for each x_i in numpy.linspace(df.x.min(), df.x.max(), 100), so a single row can fall into several groups.
I can of course create separate dataframes for each x_i, but then I can't use the groupby aggregation functions.
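One possible workaround, sketched under some assumptions: build a boolean membership matrix with NumPy broadcasting, duplicate each row once per window it falls into, and run a normal groupby on the window center. The names `centers`, `half_width`, and `expanded` are mine, the random seed is fixed only for reproducibility, and the normal density is written out in NumPy so the snippet does not need scipy.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
x = 5 * (rng.random(n) - 0.5)
# Standard normal density plus noise, written out to avoid a scipy dependency.
y = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi) + 0.05 * rng.standard_normal(n)
df = pd.DataFrame({"x": x, "y": y})

centers = np.linspace(df["x"].min(), df["x"].max(), 100)
half_width = 0.1

# mask[i, j] is True when row i lies within half_width of centers[j].
mask = np.abs(df["x"].to_numpy()[:, None] - centers[None, :]) < half_width
row_idx, center_idx = np.nonzero(mask)

# One row per (observation, window) pair; a row in k windows appears k times.
expanded = df.iloc[row_idx].assign(center=centers[center_idx])

# The usual groupby aggregations now work on the overlapping windows.
result = expanded.groupby("center")["y"].agg(["mean", "std", "count"])
print(result.head())
```

The membership matrix has n x 100 entries, which is fine here but would need chunking, or a sort plus `numpy.searchsorted`, for much larger inputs.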
Best,

Teake Nutma
