
Candidate for a new itertool


Raymond Hettinger

Mar 7, 2009, 7:47:21 PM
The existing groupby() itertool works great when every element in a
group has the same key, but it is not so handy when groups are
determined by boundary conditions.

For edge-triggered events, we need to convert a boundary-event
predicate to a groupby-style key function. The code below encapsulates
that process in a new itertool called split_on().

Would love you guys to experiment with it for a bit and confirm that
you find it useful. Suggestions are welcome.

Raymond

-----------------------------------------

from itertools import groupby

def split_on(iterable, event, start=True):
    'Split iterable on event boundaries (either start events or stop events).'
    # split_on('X1X23X456X', 'X'.__eq__, True) --> X1 X23 X456 X
    # split_on('X1X23X456X', 'X'.__eq__, False) --> X 1X 23X 456X
    def transition_counter(x, start=start, cnt=[0]):
        # cnt is a fresh mutable default for each split_on() call; it
        # counts the boundary events seen so far.  Elements that share
        # a count fall into the same group.
        before = cnt[0]
        if event(x):
            cnt[0] += 1
        after = cnt[0]
        return after if start else before
    return (g for k, g in groupby(iterable, transition_counter))

if __name__ == '__main__':
    for start in True, False:
        for g in split_on('X1X23X456X', 'X'.__eq__, start):
            print list(g)
        print

    from pprint import pprint
    boundary = '--===============2615450625767277916==\n'
    email = open('email.txt')
    for mime_section in split_on(email, boundary.__eq__):
        pprint(list(mime_section))
        print '= = ' * 30
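
The second demo expects a local 'email.txt'; a self-contained variant
(the file contents and boundary below are made up for illustration)
would be:

from StringIO import StringIO   # Python 2; on Python 3 use io.StringIO

boundary = '--BOUNDARY\n'
fake_email = StringIO('preamble\n--BOUNDARY\npart one\n--BOUNDARY\npart two\n')
for mime_section in split_on(fake_email, boundary.__eq__):
    print(list(mime_section))

# ['preamble\n']
# ['--BOUNDARY\n', 'part one\n']
# ['--BOUNDARY\n', 'part two\n']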

bearoph...@lycos.com

Mar 7, 2009, 9:58:44 PM
Raymond Hettinger, maybe it would be useful to add an optional flag
telling split_on whether to keep the separators? This is the
xsplit I usually use:


def xsplit(seq, key=bool, keepkeys=True):
    """xsplit(seq, key=bool, keepkeys=True): given an iterable seq and a
    predicate key, splits the iterable where key(item) is True and yields
    the parts as lists.  If keepkeys is True then the splitting items are
    kept at the beginning of the sublists (but the first sublist may miss
    the key item).

    >>> list(xsplit([]))
    []

    >>> key = lambda x: 0x80 & x
    >>> l = [1,2,3,0xF0,4,5,6,0xF1,7,8,0xF2,9,10,11,12,13]
    >>> list(xsplit(l, key=key))
    [[1, 2, 3], [240, 4, 5, 6], [241, 7, 8], [242, 9, 10, 11, 12, 13]]

    >>> l = [0xF0,1,2,3,0xF0,4,5,6,0xF1,7,8,0xF2,9,10,11,12,13,0xF0,14,0xF1]
    >>> list(xsplit(l, key=key, keepkeys=False))
    [[1, 2, 3], [4, 5, 6], [7, 8], [9, 10, 11, 12, 13], [14]]

    >>> s1 = "100001000101100001000000010000"
    >>> ["".join(map(str, g)) for g in xsplit(s1, key=int)]
    ['10000', '1000', '10', '1', '10000', '10000000', '10000']

    >>> from itertools import groupby  # To compare against groupby
    >>> s2 = "1111100011111100011100101011111"
    >>> ["".join(map(str, g)) for h, g in groupby(s2, key=int)]
    ['11111', '000', '111111', '000', '111', '00', '1', '0', '1', '0', '11111']
    """
    group = []
    for el in seq:
        if key(el):
            if group:
                yield group
                group = []
            if keepkeys:
                group.append(el)
        else:
            group.append(el)
    if group:
        yield group


Maybe it's better to separate out the separators, or mark them in some way?

A possibility:
"X1X23X456X" => "X", "1", "X", "23", "X", "456", "X"

Another possibility:
"X1X23X456X" => ("", "X"), ("1", "X"), (["2", "3"], "X"), (["4", "5",
"6"], "X")

Another possibility (True == is a separator):
"X1X23X456X" => (True, "X"), (False, ["1"]), (True, "X"), (False,
["2", "3"]), (True, "X"), (False, ["4", "5", "6"]), (True, "X")

Is it useful to merge successive separators (notice two X)?

"X1X23XX456X" => (True, ["X"]), (False, ["1"]), (True, ["X"]), (False,
["2", "3"]), (True, ["X", "X"]), (False, ["4", "5", "6"]), (True,
["X"])

Oops, this is groupby :-)
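
A quick check of that observation (session reconstructed, not from
the original post):

>>> from itertools import groupby
>>> [(k, list(g)) for k, g in groupby("X1X23XX456X", "X".__eq__)]
[(True, ['X']), (False, ['1']), (True, ['X']), (False, ['2', '3']), (True, ['X', 'X']), (False, ['4', '5', '6']), (True, ['X'])]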

Is a name like isplitter or splitter better for this itertool?

Bye,
bearophile

jay logan

Mar 8, 2009, 10:54:01 PM

I've found this type of splitting quite useful when grouping sections
of a text file. I used the groupby function directly on the file, when
I would rather have used something like this.

However, I wonder if it would be helpful to break that function into
two instead of having the "start" flag. The flag feels odd to me
(maybe it's the name?), and the documentation might read more
naturally to a newcomer. Also, it would be cool if the function took
keyword arguments; I wonder why most of the other functions in the
itertools module don't.

I wouldn't split out the keys separately from the groups. But the idea
of a flag to exclude the keys sounds interesting to me.

Thank you for giving me the opportunity to use the nonlocal keyword
for the first time since trying out Python 3.0. I hope this is an
appropriate usage:


import itertools as it

def split_on(iterable, key=bool, start=True):
    'Split iterable on boundaries (either start events or stop events).'
    # split_on('X1X23X456X', 'X'.__eq__, True) --> X1 X23 X456 X
    # split_on('X1X23X456X', 'X'.__eq__, False) --> X 1X 23X 456X

    flag = 0

    def event_marker(x, start_flag=start):
        nonlocal flag   # key is only read, so it needs no declaration
        before = flag
        if key(x):
            flag += 1
        after = flag
        return after if start_flag else before

    return (g for k, g in it.groupby(iterable, key=event_marker))
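
A quick check of the version above on Python 3 (output reconstructed,
not from the original post):

for start in (True, False):
    print([list(g) for g in split_on('X1X23X456X', 'X'.__eq__, start)])

# [['X', '1'], ['X', '2', '3'], ['X', '4', '5', '6'], ['X']]
# [['X'], ['1', 'X'], ['2', '3', 'X'], ['4', '5', '6', 'X']]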


Raymond Hettinger

Mar 9, 2009, 4:12:01 AM
On Mar 7, 7:58 pm, bearophileH...@lycos.com wrote:
> Raymond Hettinger, maybe it can be useful to add an optional argument
> flag to tell such split_on to keep the separators or not? This is the
> xsplit I usually use:

In your experiences with xsplit(), do most use cases involve removing
the separators?

My thought is that if you know the separator is at the beginning, it
is easy to strip it off with a next() on the subgroup iterator.
If so, then there's no reason to complicate the API with another
set of options.
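
For example, a minimal sketch using the split_on() posted at the top
of the thread:

for group in split_on('X1X23X456X', 'X'.__eq__, True):
    next(group)                 # discard the leading 'X' separator
    print(list(group))

# ['1']
# ['2', '3']
# ['4', '5', '6']
# []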

Also, separation events do not necessarily tie themselves to specific
values that can be thrown away. Perhaps the event detector is counting
the number of lines written on a page and generates a new-page event
every time a page is full.


> Maybe it's better to separate or denote the separators in some way?

Probably not. Itertools work together best when a stream of tokens
all have the same meaning. Otherwise, you end up complicating consumer
code with something like:

for val in split(...):
    if val == separator:
        do_onething(val)
    else:
        do_another(val)

> Is it useful to merge successive separators (notice two X)?

Probably depends on real-world use cases. Do you have any?

> Is a name like isplitter or splitter better for this itertool?

Like groupbyer, filterer, teer, starmapper, and chainer ;-)

Raymond

bearoph...@lycos.com

Mar 9, 2009, 8:35:11 AM
Raymond Hettinger:

>In your experiences with xsplit(), do most use cases involve removing the separators?<

Unfortunately I am not able to tell you how often I remove them. But
with strings, I usually want to remove the separators:

>>> "aXcdXfg".split("X")
['a', 'cd', 'fg']

So sometimes I want to do the same with lists too.
Maybe I am also a bit spoiled by the D (V.1) language, where strings
are dynamic arrays, so you can use the same templated functions for
arrays and strings, and an xsplit() works on both.

>>Is it useful to merge successive separators (notice two X)?<<

>Probably depends on real-world use cases. Do you have any?<

I don't think so.

Bye and thank you for your work that keeps shrinking my bag of tricks
(about 100+ functions to go),
bearophile

prue...@latinmail.com

Mar 9, 2009, 11:26:23 AM
On Mar 7, 8:47 pm, Raymond Hettinger <pyt...@rcn.com> wrote:

Sorry to hijack the thread, but I know that you have a knack for
finding good iterator patterns. I have noticed a pattern lately:
aggregation using a defaultdict. I quickly found two examples of
problems that could use this:

http://groups.google.com/group/comp.lang.python/browse_frm/thread/c8b3976ec3ceadfd

http://www.willmcgugan.com/blog/tech/2009/1/17/python-coder-test/

To show an example, using data like this:

>>> data=[('red',2,'other data'),('blue',5,'more data'),('yellow',3,'lots of things'),('blue',1,'data'),('red',2,'random data')]

Then
>>> from itertools import groupby
>>> from operator import itemgetter
>>> from collections import defaultdict

We can use groupby to do this:
>>> [(el[0],sum(x[1] for x in el[1])) for el in groupby(sorted(data,key=itemgetter(0)),itemgetter(0))]
[('blue', 6), ('red', 4), ('yellow', 3)]

>>> [(el[0],[x[1] for x in el[1]]) for el in groupby(sorted(data,key=itemgetter(0)),itemgetter(0))]
[('blue', [5, 1]), ('red', [2, 2]), ('yellow', [3])]

>>> [(el[0],set([x[1] for x in el[1]])) for el in groupby(sorted(data,key=itemgetter(0)),itemgetter(0))]
[('blue', set([1, 5])), ('red', set([2])), ('yellow', set([3]))]

But this way seems to be more efficient:

>>> def aggrsum(data, key, agrcol):
        dd = defaultdict(int)
        for el in data:
            dd[key(el)] += agrcol(el)
        return dd.items()

>>> aggrsum(data, itemgetter(0), itemgetter(1))
[('blue', 6), ('yellow', 3), ('red', 4)]

>>> def aggrlist(data, key, agrcol):
        dd = defaultdict(list)
        for el in data:
            dd[key(el)].append(agrcol(el))
        return dd.items()

>>> aggrlist(data, itemgetter(0), itemgetter(1))
[('blue', [5, 1]), ('yellow', [3]), ('red', [2, 2])]

>>> def aggrset(data, key, agrcol):
        dd = defaultdict(set)
        for el in data:
            dd[key(el)].add(agrcol(el))
        return dd.items()

>>> aggrset(data, itemgetter(0), itemgetter(1))
[('blue', set([1, 5])), ('yellow', set([3])), ('red', set([2]))]


The data often contains objects with attributes instead of tuples, and
I expect the new namedtuple datatype to be used as elements of the
processed lists as well.

But I haven't found a nice generalized way to express that kind of
pattern (aggregating a list of one datatype into a list of key plus
output datatype) that would make it practical and suitable for
inclusion in the standard library.

Raymond Hettinger

Mar 9, 2009, 6:55:28 PM
[prueba]

> The data often contains objects with attributes instead of tuples, and
> I expect the new namedtuple datatype to be used also as elements of
> the list to be processed.
>
> But I haven't found a nice generalized way for that kind of pattern
> that aggregates from a list of one datatype to a list of key plus
> output datatype that would make it practical and suitable for
> inclusion in the standard library.

Looks like you've searched the possibilities thoroughly and no single
aggregation function seems to meet all needs. That's usually a cue
not to build one, and instead to let simple Python loops do the work
for you (that also saves the awkward itemgetter() calls in your
examples). To my eyes, all three examples look like straightforward,
easy-to-write, easy-to-read, fast plain Python:

>>> d = defaultdict(int)
>>> for color, n, info in data:
...     d[color] += n
>>> d.items()
[('blue', 6), ('yellow', 3), ('red', 4)]

>>> d = defaultdict(list)
>>> for color, n, info in data:
...     d[color].append(n)
>>> d.items()
[('blue', [5, 1]), ('yellow', [3]), ('red', [2, 2])]

>>> d = defaultdict(set)
>>> for color, n, info in data:
...     d[color].add(n)
>>> d.items()
[('blue', set([1, 5])), ('yellow', set([3])), ('red', set([2]))]


I don't think you can readily combine all three examples into a single
aggregator without the obfuscation and awkwardness that comes from
parameterizing all of the varying parts:

def aggregator(default_factory, adder, iterable, keyfunc, valuefunc):
    d = defaultdict(default_factory)
    for record in iterable:
        key = keyfunc(record)
        value = valuefunc(record)
        adder(d[key], value)
    return d.items()

>>> aggregator(list, list.append, data, itemgetter(0), itemgetter(1))
[('blue', [5, 1]), ('yellow', [3]), ('red', [2, 2])]

>>> aggregator(set, set.add, data, itemgetter(0), itemgetter(1))
[('blue', set([1, 5])), ('yellow', set([3])), ('red', set([2]))]

Yuck! Plain Python wins.


Raymond


P.S. The aggregator doesn't work so well for:

>>> aggregator(int, operator.iadd, data, itemgetter(0), itemgetter(1))
[('blue', 0), ('yellow', 0), ('red', 0)]

The problem is that operator.iadd() doesn't have a way to both
retrieve and store back into a dictionary.
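
A variant that illustrates the mismatch (a sketch; the name
aggregator2 is mine, not from the thread): storing the adder's return
value back into the dict handles immutable accumulators like int, but
then mutating methods like list.append, which return None, break
instead:

from collections import defaultdict
from operator import add, itemgetter

def aggregator2(default_factory, adder, iterable, keyfunc, valuefunc):
    d = defaultdict(default_factory)
    for record in iterable:
        key = keyfunc(record)
        # store the result back, so int accumulators work
        d[key] = adder(d[key], valuefunc(record))
    return d.items()

# aggregator2(int, add, data, itemgetter(0), itemgetter(1))
# -> [('blue', 6), ('yellow', 3), ('red', 4)]   (Python 2 output)

Neither calling convention covers both cases, which is more support
for the plain loops.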

prue...@latinmail.com

Mar 10, 2009, 11:35:48 AM

Yes, thinking about this more, one probably needs two code paths
depending on whether the type returned by default_factory is mutable
or immutable. But you are probably right that the ratio of redundancy
to variability is pretty low for such a function, and the plainly
written out for loop is not too painful. The only redundancy is the
creation and manipulation of the dictionary and the explicit looping.

prue...@latinmail.com

Mar 12, 2009, 1:36:30 PM

For me your examples don't justify why you would need such a general
algorithm. A split function that works on iterables instead of just
strings seems straightforward, so maybe we should have that, plus one
more function backed by examples of problems where a plain split does
not work. Something like this should work for the two examples you
gave, where the boundaries are known constants (and therefore there is
really no need to keep them; I can always add them back later):

def split_on(iterable, boundary):
    l = []
    for el in iterable:
        if el != boundary:
            l.append(el)
        else:
            yield l
            l = []
    yield l

def join_on(iterable, boundary):
    it = iter(iterable)
    for x in it.next():         # the first group gets no leading boundary
        yield x
    for el in it:
        yield boundary
        for x in el:
            yield x

if __name__ == '__main__':
    lst = []
    for g in split_on('X1X23X456X', 'X'):
        print list(g)
        lst.append(g)
    print
    print list(join_on(lst, 'X'))
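
For reference, the demo should print (output reconstructed from the
code, not from the original post):

[]
['1']
['2', '3']
['4', '5', '6']
[]

['X', '1', 'X', '2', '3', 'X', '4', '5', '6', 'X']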

Aahz

Mar 18, 2009, 10:46:18 PM
In article <6ca71455-2fb2-4dd0...@v6g2000vbb.googlegroups.com>,

Raymond Hettinger <pyt...@rcn.com> wrote:
>
>For edge-triggered events, we need to convert a boundary-event
>predicate to groupby-style key function. The code below encapsulates
>that process in a new itertool called split_on().
>
>Would love you guys to experiment with it for a bit and confirm that
>you find it useful. Suggestions are welcome.

It seems a little esoteric for the standard library. It was non-trivial
for me to make it work with what seems to me an obvious use-case
(although perhaps I'm missing something), and I'm not sure it has enough
general utility otherwise:

from math import log10

class C:
    def __init__(self, data):
        self.data = data
        self.index = 0
        self.sentinel = None

    def __iter__(self):
        return self

    def next(self):
        if self.index >= len(self.data):
            raise StopIteration
        value = self.data[self.index]
        self.index += 1
        return value

    def make_sentinel(self, value):
        return (int(value / 10) + 1) * 10

    def grouper(self, value):
        if self.sentinel is None:
            self.sentinel = self.make_sentinel(value)
            return False
        if value >= self.sentinel:
            self.sentinel = self.make_sentinel(value)
            return True
        return False

if __name__ == '__main__':
    for start in True, False:
        L_iter = C([11, 12, 13, 20, 32])
        for g in split_on(L_iter, L_iter.grouper, start):
            print list(g)
        print

This prints:

[11, 12, 13]
[20]
[32]

[11, 12, 13, 20]
[32]
--
Aahz (aa...@pythoncraft.com) <*> http://www.pythoncraft.com/

"Programming language design is not a rational science. Most reasoning
about it is at best rationalization of gut feelings, and at worst plain
wrong." --GvR, python-ideas, 2009-3-1

George Sakkis

Mar 19, 2009, 12:02:06 AM
On Mar 7, 8:47 pm, Raymond Hettinger <pyt...@rcn.com> wrote:

> The existing groupby() itertool works great when every element in a
> group has the same key, but it is not so handy when groups are
> determined by boundary conditions.
>
> For edge-triggered events, we need to convert a boundary-event
> predicate to groupby-style key function.  The code below encapsulates
> that process in a new itertool called split_on().
>
> Would love you guys to experiment with it for a bit and confirm that
> you find it useful.  Suggestions are welcome.

That's pretty close to a recipe [1] I had posted some time ago (and
you commented on improving it :)). As they stand, neither is a
generalization of the other; your version allows either start or stop
events (but not both), while mine requires the start event but takes
an optional stop event (plus an option to control whether separators
are returned). If we were to generalize them, the resulting signature
would be something like:

def split_on(iterable, **kwds):
    '''Split iterable on event boundaries.

    @keyword start, stop: The start and stop boundaries. Both are
        optional (but at least one of them must be given).
    @keyword yield_bounds: If True, yield also the boundary events.
    '''
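
A rough sketch of how the merged semantics might behave (untested, my
own reading of the combined API, with start and stop as optional
predicates):

def split_on(iterable, start=None, stop=None, yield_bounds=False):
    group = []
    for x in iterable:
        if start is not None and start(x):
            if group:                       # a start event opens a new group
                yield group
            group = [x] if yield_bounds else []
        elif stop is not None and stop(x):
            if yield_bounds:                # a stop event closes the group
                group.append(x)
            yield group
            group = []
        else:
            group.append(x)
    if group:
        yield group

With start='X'.__eq__ and yield_bounds=True this reproduces the
X1 X23 X456 X grouping; with stop instead it gives X 1X 23X 456X.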


On a related note, I recently needed a grouper that I couldn't build
with either groupby() or the split_on() above. The reason is that it
needs two consecutive elements, instead of one, to decide whether to
make a split. An example would be to partition an iterable of numbers
(or any orderable objects for that matter) into increasing or
non-decreasing groups:

>>> from operator import gt, ge
>>> list(groupby_needsbettername([3,4,4,2,2,5,1], gt))
[[3, 4, 4], [2, 2, 5], [1]]
>>> list(groupby_needsbettername([3,4,4,2,2,5,1], ge))
[[3, 4], [4], [2], [2, 5], [1]]

def groupby_needsbettername(iterable, is_boundary):
    it = iter(iterable)
    try:
        cur = it.next()
    except StopIteration:
        return
    group = [cur]
    for next in it:
        if is_boundary(cur, next):
            yield group
            group = []
        group.append(next)
        cur = next
    yield group


George

[1] http://code.activestate.com/recipes/521877/
