
itertools.groupby


7stud

May 27, 2007, 1:17:52 PM
Bejeezus. The description of groupby in the docs is a poster child
for why the docs need user comments. Can someone explain to me in
what sense the name 'uniquekeys' is used in this example:


import itertools

mylist = ['a', 1, 'b', 2, 3, 'c']

def isString(x):
    s = str(x)
    if s == x:
        return True
    else:
        return False

uniquekeys = []
groups = []
for k, g in itertools.groupby(mylist, isString):
    uniquekeys.append(k)
    groups.append(list(g))

print uniquekeys
print groups

--output:--
[True, False, True, False, True]
[['a'], [1], ['b'], [2, 3], ['c']]

Steve Howell

May 27, 2007, 1:28:41 PM
to 7stud, pytho...@python.org

--- 7stud <bbxx78...@yahoo.com> wrote:

> Bejeezus. The description of groupby in the docs is
> a poster child
> for why the docs need user comments. Can someone
> explain to me in
> what sense the name 'uniquekeys' is used this

> example: [...]
>

The groupby method has its uses, but its behavior is
going to be very surprising to anybody that has used
the "group by" syntax of SQL, because Python's groupby
method will repeat groups if your data is not sorted,
whereas SQL has the luxury of (knowing that it's)
working with a finite data set, so it can provide the
more convenient semantics.
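
A small sketch of that difference, assuming Python 3 and made-up log messages. The variable names are illustrative, and `Counter` only stands in for what SQL's GROUP BY with COUNT(*) would return:

```python
from collections import Counter
from itertools import groupby

# Made-up, unsorted log data (illustrative only).
messages = ['disk full', 'disk full', 'login failed', 'disk full']

# groupby starts a new group whenever the key changes, so 'disk full'
# appears twice when the input is not sorted.
adjacent_runs = [(key, len(list(group))) for key, group in groupby(messages)]

# Sorting first brings equal items together, giving one group per key.
sorted_runs = [(key, len(list(group)))
               for key, group in groupby(sorted(messages))]

# SQL-style aggregation: input order does not matter, but the whole
# data set has to be consumed before any result is available.
sql_style_counts = dict(Counter(messages))

print(adjacent_runs)     # [('disk full', 2), ('login failed', 1), ('disk full', 1)]
print(sorted_runs)       # [('disk full', 3), ('login failed', 1)]
print(sql_style_counts)  # {'disk full': 3, 'login failed': 1}
```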




7stud

May 27, 2007, 1:49:06 PM
On May 27, 11:28 am, Steve Howell <showel...@yahoo.com> wrote:

> --- 7stud <bbxx789_0...@yahoo.com> wrote:
> > Bejeezus. The description of groupby in the docs is
> > a poster child
> > for why the docs need user comments. Can someone
> > explain to me in
> > what sense the name 'uniquekeys' is used this
> > example: [...]
>
> The groupby method has its uses, but it's behavior is
> going to be very surprising to anybody that has used
> the "group by" syntax of SQL, because Python's groupby
> method will repeat groups if your data is not sorted,
> whereas SQL has the luxury of (knowing that it's)
> working with a finite data set, so it can provide the
> more convenient semantics.
>

> The groupby method has its uses

I'd settle for a simple explanation of what it does in python.

Steve Howell

May 27, 2007, 1:57:29 PM
to 7stud, pytho...@python.org

--- 7stud <bbxx78...@yahoo.com> wrote:

> Bejeezus. The description of groupby in the docs is
> a poster child
> for why the docs need user comments.

I would suggest an example with a little more
concreteness than what's currently there.

For example, this code...

import itertools

syslog_messages = [
    'out of file descriptors',
    'out of file descriptors',
    'unexpected access',
    'out of file descriptors',
]

for message, messages in itertools.groupby(syslog_messages):
    print message, len(list(messages))

...produces this...

out of file descriptors 2
unexpected access 1
out of file descriptors 1



Steve Howell

May 27, 2007, 2:20:34 PM
to 7stud, pytho...@python.org

--- 7stud <bbxx78...@yahoo.com> wrote:
>
> I'd settle for a simple explanation of what it does
> in python.
>

The groupby function saves you from having to
write awkward (and possibly broken) code like this:

group = []
lastKey = None
for item in items:
    newKey = item.key()
    if newKey == lastKey:
        group.append(item)
    elif group:
        doSomething(group)
        group = []
    lastKey = newKey
if group:
    doSomething(group)

See my other reply for what it actually does in a
simple example.
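
For comparison, here is a minimal sketch of a hand-written loop that groups adjacent items, next to the groupby() version. It assumes Python 3; `manual_groups`, `groupby_groups`, `first_letter`, and the word list are hypothetical names and data. The loop as posted above never appends the item that starts a run, which is exactly the kind of slip groupby() avoids.

```python
from itertools import groupby

def manual_groups(items, key):
    """Hand-rolled grouping of adjacent items (hypothetical helper)."""
    groups = []
    group = []
    last_key = None
    for item in items:
        new_key = key(item)
        if group and new_key != last_key:
            groups.append(group)   # key changed: flush the finished group
            group = []
        group.append(item)         # every item lands in the current group
        last_key = new_key
    if group:
        groups.append(group)       # do not forget the final group
    return groups

def groupby_groups(items, key):
    """The same result, with the bookkeeping left to groupby."""
    return [list(group) for _, group in groupby(items, key)]

def first_letter(word):
    return word[0]

words = ['apple', 'avocado', 'banana', 'blueberry', 'cherry', 'apricot']
print(manual_groups(words, first_letter))
print(groupby_groups(words, first_letter))
```

Both calls group only adjacent words, so 'apricot' ends up in its own group at the end.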




Steve Howell

May 27, 2007, 2:31:31 PM
to pytho...@python.org

--- Steve Howell <show...@yahoo.com> wrote:

>
> --- 7stud <bbxx78...@yahoo.com> wrote:
>
> > Bejeezus. The description of groupby in the docs is
> > a poster child
> > for why the docs need user comments.
>

Regarding the pitfalls of groupby in general (even
assuming we had better documentation), I invite people
to view the following posting that I made on
python-ideas, entitled "SQL-like way to manipulate
Python data structures":

http://mail.python.org/pipermail/python-ideas/2007-May/000807.html

In the thread, I don't really make a proposal, so much
as a problem statement, but my radical idea is that
lists of dictionaries fit the relational model
perfectly, so why not allow some kind of native SQL
syntax in Python that allows you to manipulate those
data structures more naturally?




Carsten Haese

May 27, 2007, 3:47:01 PM
to pytho...@python.org

The so-called example you're quoting from the docs is not an actual
example of using itertools.groupby, but suggested code for how you can
store the grouping if you need to iterate over it twice, since iterators
are in general not repeatable.

As such, 'uniquekeys' lists the key values that correspond to each group
in 'groups'. groups[0] is the list of elements grouped under
uniquekeys[0], groups[1] is the list of elements grouped under
uniquekeys[1], etc. You are getting surprising results because your data
is not sorted by the group key. Your group key alternates between True
and False.
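
As a rough illustration of the sorting point, here is the original poster's list sorted by the group key before grouping. This is only a sketch, assuming Python 3; `is_string` is a simplified stand-in for the poster's isString.

```python
from itertools import groupby

mylist = ['a', 1, 'b', 2, 3, 'c']

def is_string(x):
    # Simplified stand-in for the isString function in the original post.
    return isinstance(x, str)

# False sorts before True, and sorted() is stable, so the original
# relative order is kept within each key.
data = sorted(mylist, key=is_string)

uniquekeys = []
groups = []
for k, g in groupby(data, is_string):
    uniquekeys.append(k)
    groups.append(list(g))

print(uniquekeys)  # [False, True]
print(groups)      # [[1, 2, 3], ['a', 'b', 'c']]
```

With the data sorted on the key, each key appears once, so the name 'uniquekeys' is accurate.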

Maybe you need to explain to us what you're actually trying to do.
User-supplied comments to the documentation won't help with that.

Regards,

--
Carsten Haese
http://informixdb.sourceforge.net


paul

May 27, 2007, 3:54:32 PM
to pytho...@python.org
Steve Howell schrieb:

> --- Steve Howell <show...@yahoo.com> wrote:
>
>> --- 7stud <bbxx78...@yahoo.com> wrote:
>>
>>> Bejeezus. The description of groupby in the docs is
>>> a poster child
>>> for why the docs need user comments.
>
> Regarding the pitfalls of groupby in general (even
> assuming we had better documentation), I invite people
> to view the following posting that I made on
> python-ideas, entitled "SQL-like way to manipulate
> Python data structures":
>
> http://mail.python.org/pipermail/python-ideas/2007-May/000807.html
>
> In the thread, I don't really make a proposal, so much
> as a problem statement, but my radical idea is that
> lists of dictionaries fit the relational model
> perfectly, so why not allow some kind of native SQL
> syntax in Python that allows you to manipulate those
> data structures more naturally?

LINQ?

cheers
Paul

Steve Howell

May 27, 2007, 5:54:54 PM
to paul, pytho...@python.org

--- paul <pa...@subsignal.org> wrote:
> >
> > Regarding the pitfalls of groupby in general (even
> > assuming we had better documentation), I invite people
> > to view the following posting that I made on
> > python-ideas, entitled "SQL-like way to manipulate
> > Python data structures":
> >
>
> LINQ?
>

Maybe. I think they're at least trying to solve the
same problem as I am.




Steve Howell

May 27, 2007, 5:59:43 PM
to Carsten Haese, pytho...@python.org

--- Carsten Haese <car...@uniqsys.com> wrote:

> On Sun, 2007-05-27 at 10:17 -0700, 7stud wrote:
> > Bejeezus. The description of groupby in the docs is a poster child
> > for why the docs need user comments. Can someone explain to me in
> > what sense the name 'uniquekeys' is used this example:
> >
> >
> > import itertools
> >
> > mylist = ['a', 1, 'b', 2, 3, 'c']
> >
> > def isString(x):
> >     s = str(x)
> >     if s == x:
> >         return True
> >     else:
> >         return False
> >
> > uniquekeys = []
> > groups = []
> > for k, g in itertools.groupby(mylist, isString):
> >     uniquekeys.append(k)
> >     groups.append(list(g))
> >
> > print uniquekeys
> > print groups
> >
> > --output:--
> > [True, False, True, False, True]
> > [['a'], [1], ['b'], [2, 3], ['c']]
>
> The so-called example you're quoting from the docs is not an actual
> example of using itertools.groupby [...]

Huh? How is code that uses itertools.groupby not an
actual example of using itertools.groupby?

These docs need work. Please do not defend them;
please suggest improvements.




Carsten Haese

May 27, 2007, 8:31:28 PM
to pytho...@python.org
On Sun, 2007-05-27 at 14:59 -0700, Steve Howell wrote:
> Huh? How is code that uses itertools.groupby not an
> actual example of using itertools.groupby?

Here's how:

"""
The returned group is itself an iterator that shares the underlying
iterable with groupby(). Because the source is shared, when the groupby
object is advanced, the previous group is no longer visible. So, if that
data is needed later, it should be stored as a list:

groups = []
uniquekeys = []
for k, g in groupby(data, keyfunc):
    groups.append(list(g)) # Store group iterator as a list
    uniquekeys.append(k)
"""

It does not say "Here is an example for how to use itertools.groupby."
It's an abstract code pattern for an abstract use case. There is an
example on the following page, called Examples!

> These docs need work. Please do not defend them;

The docs do their job in specifying what groupby does. Providing code
snippets is not the job of the documentation. There are plenty of other
resources with code snippets.

To name just one, there's an example of itertools.groupby in the last
code snippet at
http://informixdb.blogspot.com/2007/04/power-of-generators-part-two.html

> please suggest improvements.

Why should I? The docs suit my needs just fine. Of course, that
shouldn't stop you from suggesting improvements.

Best regards,

Raymond Hettinger

May 27, 2007, 8:50:41 PM
On May 27, 2:59 pm, Steve Howell <showel...@yahoo.com> wrote:
> These docs need work. Please do not defend them;
> please suggest improvements.

FWIW, I wrote those docs. Suggested improvements are
welcome; however, I think they already meet a somewhat
high standard of quality:

- there is an accurate, succinct one-paragraph description
of what the itertool does.

- there is advice to pre-sort the data using the same
key function.

- it also advises when to list-out the group iterator
and gives an example with working code.

- it includes a pure python equivalent which shows precisely
how the function operates.

- there are two more examples on the next page. those two
examples also give sample inputs and outputs.
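
The exact pure-Python equivalent is the one in the docs. The following is only a simplified sketch of the same idea, assuming Python 3: `simple_groupby` is a hypothetical name, and unlike the real groupby() it builds each group eagerly as a list instead of returning a lazy iterator.

```python
from itertools import groupby

def simple_groupby(iterable, key=None):
    """Eager, simplified stand-in for groupby: yields (key, list) pairs."""
    if key is None:
        key = lambda x: x          # identity, like groupby's default
    sentinel = object()            # marks "no group started yet"
    current_key = sentinel
    current_group = []
    for item in iterable:
        k = key(item)
        if current_key is not sentinel and k != current_key:
            yield current_key, current_group   # key changed: emit the run
            current_group = []
        current_key = k
        current_group.append(item)
    if current_key is not sentinel:
        yield current_key, current_group       # emit the final run

data = [1, 2, 2, 2, 1, 1, 3]
print(list(simple_groupby(data)))
# [(1, [1]), (2, [2, 2, 2]), (1, [1, 1]), (3, [3])]
```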

This is the most complex itertool in the module. I believe
the docs for it can be usable, complete, and precise,
but not idiot-proof.

The groupby itertool came-out in Py2.4 and has had remarkable
success (people seem to get what it does and like using it, and
there have been no bug reports or reports of usability problems).
All in all, that ain't bad (for what 7stud calls a poster child).


Raymond

Steve Howell

May 27, 2007, 9:12:15 PM
to Carsten Haese, pytho...@python.org

--- Carsten Haese <car...@uniqsys.com> wrote:
> [...] It's an abstract code pattern for an abstract use
> case.

I question the use of abstract code patterns in
documentation, as they just lead to confusion. I
really think concrete examples are better in any
circumstance.

Also, to the OP's original contention, there is no way
that "uniquekeys" is a sensible variable in the overly
abstract example that is provided as an example in
the, er, non-examples portion of the documentation.
With the abstract non-example that's posted as an
example, the assertion of uniqueness implicit in the
name of the variable doesn't make any sense.


> There is an
> example on the following page, called Examples!
>

The example is useful. Thank you.



> > These docs need work. Please do not defend them;
>

> [...]


> To name just one, there's an example of
> itertools.groupby in the last
> code snippet at
> http://informixdb.blogspot.com/2007/04/power-of-generators-part-two.html
>

Do we now, or could we, link to this example from the
docs?

> [...] that shouldn't stop you from suggesting improvements.
>

I already did in a previous reply.

To repeat myself, I think a concrete example is
beneficial even on the main page:


import itertools

syslog_messages = [
    'out of file descriptors',
    'out of file descriptors',
    'unexpected access',
    'out of file descriptors',
]

for message, messages in itertools.groupby(syslog_messages):
    print message, len(list(messages))

...produces this...

out of file descriptors 2
unexpected access 1
out of file descriptors 1

Steve Howell

May 27, 2007, 9:26:09 PM
to Raymond Hettinger, pytho...@python.org

--- Raymond Hettinger <pyt...@rcn.com> wrote:

>
> FWIW, I wrote those docs. Suggested improvements
> are
> welcome; however, I think they already meet a
> somewhat
> high standard of quality:
>

I respectfully disagree, and I have suggested
improvements in this thread.

Without even reading the doc, I completely understand
the motivation for this function, and I understand its
implementation from reading email threads where it was
discussed, but when I go back a couple days later to
read the docs, I find it hard to grok how to actually
use the module.

You provided a bunch of points that clarify what you
did specify correctly in the documentation, and I'm
not going to attempt to refute them individually. I'm
simply going to agree with the original poster that
the docs as written are hard to understand, and I'll
leave it to you to make your own judgment upon
re-reading the docs.

It could come down to simply needing a better
motivating example.

My suggestions mostly come down to providing better
example code (I provided some in a separate reply),
but I think you could also clarify the main use case
(aggregating a stream of data) and the main limitation
(requirement to sort by key since the iteration could
be infinite)--which I know you mention, but you maybe
could emphasize it more.
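
The point about unbounded input can be sketched as follows, assuming Python 3. `itertools.count()` stands in for an endless stream, and the key function and names are made up for illustration:

```python
from itertools import count, groupby, islice

# count() never ends, so sorting it first is impossible. groupby still
# works because it only ever compares adjacent items.
stream = count()                                  # 0, 1, 2, 3, ...
batches = groupby(stream, key=lambda n: n // 3)   # key changes every 3 items

# Take just the first three groups from the endless sequence of groups.
first_three = [(key, list(group)) for key, group in islice(batches, 3)]

print(first_three)  # [(0, [0, 1, 2]), (1, [3, 4, 5]), (2, [6, 7, 8])]
```

This is also why groupby() cannot sort on the caller's behalf: it would have to read the whole input before yielding anything.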


> This is most complex itertool in the module. I
> believe
> the docs for it can be usable, complete, and
> precise,
> but not idiot-proof.
>

Agreed, of course, that nothing can be idiot-proof,
and I understand the limitations myself, and I
understand groupby's power.

> The groupby itertool came-out in Py2.4 and has had
> remarkable
> success (people seem to get what it does and like
> using it, and
> there have been no bug reports or reports of
> usability problems).
> All in all, that ain't bad (for what 7stud calls a
> poster child).
>

I agree that "poster child" is way too strong, but
please don't disregard 7stud completely just because
he exaggerates a tad.

I've had minor usability problems with groupby, and I
just haven't reported them. I'm still on 2.3 for
most of my day-to-day work, but I've been
sufficiently intrigued by the power of groupby() to
try it out in a later version.

I really mean all of these suggestions constructively.
It's a great function to have in Python, and I think
the documentation's mostly good, just could be even
better.




Paul Rubin

May 27, 2007, 11:28:26 PM
Raymond Hettinger <pyt...@rcn.com> writes:
> The groupby itertool came-out in Py2.4 and has had remarkable
> success (people seem to get what it does and like using it, and
> there have been no bug reports or reports of usability problems).
> All in all, that ain't bad (for what 7stud calls a poster child).

I use the module all the time now and it is great. Basically it
gets rid of the problem of the "lump moving through the snake"
when iterating through a sequence, noticing when some condition
changes, and having to juggle an element from one call to another.
That said, I too found the docs a little confusing on first reading.
I'll see if I can go over them again and suggest improvements.

Here for me is a typical example: you have a file of multi-line
records. Each record starts with a marker saying "New Record". You
want to iterate through the records. You could do it by collecting
lines til you see a new record marker, then juggling the marker into
the next record somehow; in some situations you could do it by some
kind of pushback mechanism that's not supported in the general
iterator protocol (maybe it should be); I like to do it with what I
call a "Bates stamp". (A Bates stamp is a rubber stamp with a serial
numbering mechanism, so each time you operate it the number goes
higher by one. You use it to stamp serial numbers on pages of legal
documents and that sort of thing). I use enumerate to supply Bates
numbers to the lines from the file, incrementing the number every
time there's a new record:

import operator
from itertools import groupby, imap

fst = operator.itemgetter(0)
snd = operator.itemgetter(1)

def bates(fd):
    # generate tuples (n,d) of lines from file fd,
    # where n is the record number. Just iterate through all lines
    # of the file, stamping a number on each one as it goes by.
    n = 0 # record number
    for d in fd:
        if d.startswith('New Record'): n += 1
        yield (n, d)

def records(fd):
    for i,d in groupby(bates(fd), fst):
        yield imap(snd, d)

This shows a "straight paper path" approach where all the buffering
and juggling is hidden inside groupby.
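
For anyone who wants to try the Bates-stamp idea on sample data, here is a minimal sketch, assuming Python 3 (where imap no longer exists). The sample lines are made up, and this version yields each record as a list rather than a lazy iterator:

```python
from itertools import groupby
from operator import itemgetter

def bates(lines):
    """Stamp each line with the number of the record it belongs to."""
    n = 0
    for line in lines:
        if line.startswith('New Record'):
            n += 1                 # a marker line opens the next record
        yield n, line

def records(lines):
    """Yield one list of lines per record."""
    for _, stamped in groupby(bates(lines), key=itemgetter(0)):
        yield [line for _, line in stamped]   # strip the stamp again

sample = ['New Record', 'alpha', 'beta', 'New Record', 'gamma']
result = list(records(sample))
print(result)  # [['New Record', 'alpha', 'beta'], ['New Record', 'gamma']]
```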

Raymond Hettinger

May 28, 2007, 1:29:58 AM
On May 27, 8:28 pm, Paul Rubin <http://phr...@NOSPAM.invalid> wrote:
> I use the module all the time now and it is great.

Thanks for the accolades and the great example.


FWIW, I checked in a minor update to the docs:

+++ python/trunk/Doc/lib/libitertools.tex Mon May 28 07:23:22 2007
@@ -138,6 +138,13 @@
 identity function and returns the element unchanged. Generally, the
 iterable needs to already be sorted on the same key function.

+ The operation of \function{groupby()} is similar to the \code{uniq} filter
+ in \UNIX{}. It generates a break or new group every time the value
+ of the key function changes (which is why it is usually necessary
+ to have sorted the data using the same key function). That behavior
+ differs from SQL's GROUP BY which aggregates common elements regardless
+ of their input order.
+
 The returned group is itself an iterator that shares the underlying
 iterable with \function{groupby()}. Because the source is shared, when
 the \function{groupby} object is advanced, the previous group is no
@@ -147,6 +154,7 @@
 \begin{verbatim}
 groups = []
 uniquekeys = []
+ data = sorted(data, key=keyfunc)
 for k, g in groupby(data, keyfunc):
     groups.append(list(g)) # Store group iterator as a list
     uniquekeys.append(k)


Raymond

Paul Rubin

May 28, 2007, 2:34:55 AM
Raymond Hettinger <pyt...@rcn.com> writes:
> On May 27, 8:28 pm, Paul Rubin <http://phr...@NOSPAM.invalid> wrote:
> > I use the module all the time now and it is great.
> Thanks for the accolades and the great example.

Thank YOU for the great module ;). Feel free to use the example in the
docs if you want. The question someone coincidentally posted about
finding sequences of capitalized words also made a nice example.

Here's yet another example that came up in something I was working on:
you are indexing a book and you want to print a list of page numbers
for pages that refer to George Washington. If Washington occurs on
several consecutive pages you want to print those numbers as a
hyphenated range, e.g.

Washington, George: 5, 19, 37-45, 82-91, 103

This is easy with groupby (this version not tested but it's pretty close
to what I wrote in the real program). Again it works by Bates numbering,
but a little more subtly (enumerate generates the Bates numbers):

snd = operator.itemgetter(1) # as before

def page_ranges():
    pages = sorted(filter(contains_washington, all_page_numbers))
    for d,g in groupby(enumerate(pages), lambda (i,p): i-p):
        h = map(snd, g)
        if len(h) > 1:
            yield '%d-%d'% (h[0], h[-1])
        else:
            yield '%d'% h[0]
print ', '.join(page_ranges())

See what has happened: for a sequence of p's that are consecutive, i-p
stays constant, and groupby splits out the clusters where this occurs.
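
A runnable sketch of the same trick, assuming Python 3 (which dropped tuple-unpacking lambdas). Here `page_ranges` takes the page list as an argument, and the page numbers are made up to match the index entry shown earlier:

```python
from itertools import groupby

def page_ranges(pages):
    """Yield 'n' or 'first-last' strings for runs of consecutive pages."""
    pages = sorted(set(pages))
    # For consecutive pages, index minus page number stays constant,
    # so that difference works as the grouping key.
    for _, run in groupby(enumerate(pages), key=lambda pair: pair[0] - pair[1]):
        run_pages = [page for _, page in run]
        if len(run_pages) > 1:
            yield '%d-%d' % (run_pages[0], run_pages[-1])
        else:
            yield '%d' % run_pages[0]

pages = [5, 19] + list(range(37, 46)) + list(range(82, 92)) + [103]
index_entry = ', '.join(page_ranges(pages))
print(index_entry)  # 5, 19, 37-45, 82-91, 103
```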

> FWIW, I checked in a minor update to the docs: ...

The uniq example certainly should be helpful for Unix users.

Carsten Haese

May 28, 2007, 9:17:55 AM
to pytho...@python.org
On Sun, 2007-05-27 at 18:12 -0700, Steve Howell wrote:
> [...] there is no way
> that "uniquekeys" is a sensible variable [...]

That's because the OP didn't heed the advice from the docs that
"Generally, the iterable needs to already be sorted on the same key
function."

> http://informixdb.blogspot.com/2007/04/power-of-generators-part-two.html


> >
>
> Do we now, or could we, link to this example from the
> docs?

Do we? No. Could we? Technically we could, but I'm not sure we should.
The article is about more than just groupby. I just brought it up as one
readily available instance of an independent source of examples.

> To repeat myself, I think a concrete example is
> beneficial even on the main page:

I disagree. Examples add clutter to the page of synopses. Somebody
looking for examples should look on the page that's conveniently called
Examples.

Suppose hypothetically you wanted to show off a really neat example that
involves chain, izip, and groupby. If the examples were forced into the
page of function synopses, you'd have to duplicate it in all three
functions, or randomly pick one function for which your example is an
example. Having a separate examples page that is not arbitrarily
sectioned by function name makes more sense.

Carsten Haese

May 28, 2007, 9:50:24 AM
to pytho...@python.org
On Sun, 2007-05-27 at 20:28 -0700, Paul Rubin wrote:
> fst = operator.itemgetter(0)
> snd = operator.itemgetter(1)
>
> def bates(fd):
>     # generate tuples (n,d) of lines from file fd,
>     # where n is the record number. Just iterate through all lines
>     # of the file, stamping a number on each one as it goes by.
>     n = 0 # record number
>     for d in fd:
>         if d.startswith('New Record'): n += 1
>         yield (n, d)
>
> def records(fd):
>     for i,d in groupby(bates(fd), fst):
>         yield imap(snd, d)

Now there's a clever variation of the Schwartzian transform: decorate,
groupby, undecorate.

That's a nice example of groupby, but it could benefit from using better
variable names.

Steve Howell

May 28, 2007, 10:43:37 AM
to Raymond Hettinger, pytho...@python.org

--- Raymond Hettinger <pyt...@rcn.com> wrote:

> + The operation of \function{groupby()} is similar to the \code{uniq} filter
> + in \UNIX{}. [...]

Thanks!

The comparison of groupby() to "uniq" really clicks
with me.

To the extent that others like the Unix command line
analogy for understanding Python idioms, I compiled
the following list, which includes a couple groupby
examples from Raymond.


>>> 'abacadabra'[:5] # head -5
'abaca'

>>> 'abacadabra'[-5:] # tail -5
'dabra'

>>> [word for word in 'aaa,abc,foo,zzz,cba'.split(',') if 'a' in word] # grep a
['aaa', 'abc', 'cba']

>>> sorted('abracadabra') # sort
['a', 'a', 'a', 'a', 'a', 'b', 'b', 'c', 'd', 'r', 'r']

>>> list(reversed(sorted('abracadabra'))) # sort -r
['r', 'r', 'd', 'c', 'b', 'b', 'a', 'a', 'a', 'a', 'a']

>>> from itertools import groupby
>>> [k for k, g in groupby(sorted('abracadabra'))] # sort | uniq
['a', 'b', 'c', 'd', 'r']

>>> [(k, len(list(g))) for k, g in groupby(sorted('abracadabra'))] # sort | uniq -c
[('a', 5), ('b', 2), ('c', 1), ('d', 1), ('r', 2)]

>>> [k for k, g in groupby(sorted('abracadabra')) if len(list(g)) > 1] # sort | uniq -d
['a', 'b', 'r']




Steve Howell

May 28, 2007, 11:14:24 AM
to pytho...@python.org

--- Paul Rubin <"http://phr.cx"@NOSPAM.invalid> wrote:
> [...]
> [...]

Cool.

Here's another variation on itertools.groupby, which
wraps text from paragraphs:

import itertools
lines = [line.strip() for line in '''
This is the
first paragraph.

This is the second.
'''.split('\n')]

for has_chars, frags in itertools.groupby(lines,
        lambda x: len(x) > 0):
    if has_chars:
        print ' '.join(list(frags))
# prints this:
#
# This is the first paragraph.
# This is the second.


I put the above example here:

http://wiki.python.org/moin/SimplePrograms



7stud

May 28, 2007, 11:34:11 AM

As is often the case, the specifics of the description may only be
meaningful to someone who already knows what groupby does. There are
many terms and concepts that experienced programmers use to describe
programming problems, but often the terms and concepts only ring true
with people who already understand the problem, and they are not at
all helpful for someone who is trying to learn about the concept.

Sometimes when you describe a function accurately, the description
becomes almost impossible to read because of all the detail. What is
really needed is a general, simple description of the primary use of
the function, so that a reader can immediately employ the function in
a basic way. Code snippets are extremely helpful in that regard.
Subsequently, the details and edge cases can be fleshed out in the
rest of the description.

>- there is advice to pre-sort the data using the same
> key function.

But you have to know why that is relevant in the first place--
otherwise it is just a confusing detail. Two short code examples
could flesh out the relevance of that comment. I think I now
understand why pre-sorting is necessary: groupby only groups similar
items that are adjacent to each other in the sequence, and similar
items that are elsewhere in the sequence will be in a different group.

>- it includes a pure python equivalent which shows precisely
> how the function operates.

It's too complicated. Once again, it's probably most useful to an
experienced programmer who is trying to figure out some edge case. So
the code example is certainly valuable to one group of readers--just
not a reader who is trying to get a basic idea of what groupby does.

>- there are two more examples on the next page. those two
> examples also give sample inputs and outputs.

I didn't see those.

> people seem to get what it does and like
> using it, and
> there have been no bug reports or reports of
> usability problems

Wouldn't you get the same results if not many people used groupby
because they couldn't understand what it does?

I don't think you even need good docs if you allow users to attach
comments to the docs because all the issues will get fleshed out by
the users. I appreciate the fact that it must be difficult to write
the docs--that's why I think user comments can help.

How about this for the beginning of the description of groupby in the
docs:

groupby divides a sequence into groups of similar elements.

Compare to:

> Make an iterator that returns consecutive keys and groups
> from the iterable.

Huh?

Continuing with a kinder, gentler description:

With a starting sequence like this:

lst = [1, 2, 2, 2, 1, 1, 3]

groupby divides the sequence into groups like this:

[1], [2, 2, 2], [1, 1], [3]

groupby takes similar elements that are adjacent to each other and
gathers them into a group. If you sort the sequence beforehand, then
all the similar elements in a sequence will be adjacent to one
another, and therefore they will all end up in one group.

Optionally, you can specify a function func which groupby will use to
determine which elements in the sequence are similar (if func isn't
specified or is None, then func defaults to the identity function
which returns the element unchanged). An example:

------
import itertools

lst = [1, 2, 2, 2, 1, 1, 3]

def func(num):
    if num == 2:
        return "a"
    else:
        return "b"

keys = []
groups = []
for k, g in itertools.groupby(lst, func):
    keys.append(k)
    groups.append( list(g) )

print keys
print groups

---output:---
['b', 'a', 'b']
[[1], [2, 2, 2], [1, 1, 3]]

When func is applied to an element in the list, and the return
value(or key) is equal to "a", the element is considered similar to
other elements with a return value(or key) equal to "a". As a result,
the adjacent elements that all have a key equal to "a" are put in a
group; likewise the adjacent elements that all have a key equal to "b"
are put in a group.

RETURN VALUE: groupby returns an iterator; each item it yields is a tuple consisting of:
1) the key for the current group; all the elements of a group have the
same key

2) an iterator for the current group, which you normally use list(g)
on to get the current group as a list.
-----------------

That description probably contains some inaccuracies, but sometimes a
less accurate description can be more useful.

Raymond Hettinger

May 28, 2007, 1:29:48 PM
On May 28, 8:34 am, 7stud <bbxx789_0...@yahoo.com> wrote:
> >- there are two more examples on the next page. those two
> > examples also give sample inputs and outputs.
>
> I didn't see those.

Ah, there's the rub. The two sections of examples and recipes
are there for a reason. This isn't a beginner module. It is
full of high-speed optimizations applied in a functional style.
That's not for everyone, so it isn't a loss if someone sticks
with writing plain, clear everyday Python instead of an itertool.


Raymond

Steve Howell

May 28, 2007, 1:39:46 PM
to pytho...@python.org

--- Carsten Haese <car...@uniqsys.com> wrote:

> On Sun, 2007-05-27 at 18:12 -0700, Steve Howell
> wrote:
> > [...] there is no way
> > that "uniquekeys" is a sensible variable [...]
>
> That's because the OP didn't heed the advice from
> the docs that
> "Generally, the iterable needs to already be sorted
> on the same key
> function."
>

It was I who was complaining about the variable name
uniquekeys, because the example itself didn't
incorporate a call to sorted(). I would have proposed
two solutions to prevent people from falling into the
trap/pitfall:

1) Add a call to sorted() in the example.
2) Rename the variable to
not_necessarily_unique_keys or something like that.

Raymond did the former, so I'm happy.

Although I do think, of course, that people need to
read docs carefully, I think it was a big trap/pitfall
that people might assume the semantics of the SQL
"group by" syntax, so I'm glad that Raymond now calls
out the pitfall, and compares it to the Unix "uniq"
command, which has more similar semantics.

> Suppose hypothetically you wanted to show off a
> really neat example that
> involves chain, izip, and groupby.

It's hypothetical for itertools, but I can understand
your premise for other modules, where you do more
typically need multiple functions from the module to
provide meaningful examples.

> If the examples
> were forced into the
> page of function synopses, you'd have to duplicate
> it in all three
> functions, or randomly pick one function for which
> your example is an
> example.

There's no reason why all three functions couldn't
link to the same example on the Examples page, though.

> Having a separate examples page that is not
> arbitrarily
> sectioned by function name makes more sense.
>

I'd propose a separate examples page that is not
sectioned by function name at all, but is instead
organized in whatever way best helps users use the
module. In the case of
itertools, I'd see the benefit of a separate section
with many examples of groupby(), since it's a very
rich function in its capabilities, and it doesn't
really require anything else from itertools to write
useful programs.




Steve Howell

unread,
May 28, 2007, 1:58:50 PM5/28/07
to Raymond Hettinger, pytho...@python.org

--- Raymond Hettinger <pyt...@rcn.com> wrote:

> That's not for everyone, so it isn't a loss if
> someone sticks
> with writing plain, clear everyday Python instead of
> an itertool.
>

I know most of the module is fairly advanced, and that
average users can mostly avoid it, but this is a very
common antipattern that groupby() solves:

group = []
lastKey = None
for item in items:
    newKey = item.key()
    if newKey == lastKey:
        group.append(item)
    elif group:
        doSomething(group)
        group = []
        lastKey = newKey
if group:
    doSomething(group)

See this recent thread, for example:

http://mail.python.org/pipermail/python-list/2007-May/442602.html
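
For comparison, a rough sketch of the same job done with groupby(), in Python 3 syntax. The names process_runs and do_something are hypothetical, and a plain key function stands in for the item.key() method used above. Note that the hand-written loop as posted never appends the item that starts a new group, which rather supports the point that the pattern is easy to get wrong.

```python
from itertools import groupby

def process_runs(items, key, do_something):
    """Call do_something once per run of consecutive items sharing a key."""
    for _, run in groupby(items, key):
        do_something(list(run))

seen = []
process_runs(['apple', 'avocado', 'banana', 'blueberry', 'apricot'],
             key=lambda word: word[0],
             do_something=seen.append)
print(seen)
# [['apple', 'avocado'], ['banana', 'blueberry'], ['apricot']]
```

There is no lastKey bookkeeping and no special case for the final group; groupby() handles both.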



Raymond Hettinger

unread,
May 28, 2007, 4:02:12 PM5/28/07
to
> > That's not for everyone, so it isn't a loss if
> > someone sticks
> > with writing plain, clear everyday Python instead of
> > an itertool.
>
> I know most of the module is fairly advanced, and that
> average users can mostly avoid it, but this is a very
> common-antipattern that groupby() solves:

I think the OP would have been better-off with plain
vanilla Python such as:

See http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/259173

Raymond

Alex Martelli

unread,
May 28, 2007, 5:12:32 PM5/28/07
to
Steve Howell <show...@yahoo.com> wrote:
...

> for has_chars, frags in itertools.groupby(lines,
> lambda x: len(x) > 0):

Hmmm, it appears to me that itertools.groupby(lines, bool) should do
just the same job, just a bit faster and simpler, no?
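
A quick check of that claim, as a small Python 3 sketch with made-up sample lines. For strings, bool(x) and len(x) > 0 always agree, so the two calls should produce identical groups:

```python
from itertools import groupby

lines = ['', 'first', 'paragraph', '', '', 'second']

with_lambda = [(k, list(g)) for k, g in groupby(lines, lambda x: len(x) > 0)]
with_bool = [(k, list(g)) for k, g in groupby(lines, bool)]

print(with_lambda == with_bool)  # True
```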


Alex

Gordon Airporte

unread,
May 28, 2007, 5:19:00 PM5/28/07
to
7stud wrote:
> Bejeezus. The description of groupby in the docs is a poster child

> for why the docs need user comments. Can someone explain to me in
> what sense the name 'uniquekeys' is used this example:
>

This is my first exposure to this function, and I see that it does have
some uses in my code. I agree that it is confusing, however.
IMO the confusion could be lessened if the function with the current
behavior were renamed 'telescope' or 'compact' or 'collapse' or
something (since it collapses the iterable linearly over homogeneous
sequences.)
A function named groupby could then have what I think is the clearly
implied behavior of creating just one iterator for each unique type of
thing in the input list, as categorized by the key function.

Paul Rubin

unread,
May 28, 2007, 5:31:22 PM5/28/07
to
Gordon Airporte <JHo...@fbi.gov> writes:
> This is my first exposure to this function, and I see that it does
> have some uses in my code. I agree that it is confusing, however.
> IMO the confusion could be lessened if the function with the current
> behavior were renamed 'telescope' or 'compact' or 'collapse' or
> something (since it collapses the iterable linearly over homogeneous
> sequences.)

It chops up the iterable into a bunch of smaller ones, but the total
size ends up the same. "Telescope", "compact", "collapse" etc. make
it sound like the output is going to end up smaller than the input.

There is also a dirty secret involved <wink>, which is that the
itertools functions (including groupby) are mostly patterned after
similarly named functions in the Haskell Prelude, which do about the
same thing. They are aimed at helping a similar style of programming,
so staying with similar names IMO is a good thing.

> A function named groupby could then have what I think is the clearly
> implied behavior of creating just one iterator for each unique type of
> thing in the input list, as categorized by the key function.

But that is what groupby does, except its notion of uniqueness is
limited to contiguous runs of elements having the same key.

Steve Howell

unread,
May 28, 2007, 5:50:01 PM5/28/07
to pytho...@python.org

--- Paul Rubin <"http://phr.cx"@NOSPAM.invalid> wrote:
>
>
> But that is what groupby does, except its notion of
> uniqueness is
> limited to contiguous runs of elements having the
> same key.

It occurred to me that we could also rename the
function uniq(), or unique(), after its Unix
counterpart, but then I thought better of it. As one
of the folks who was making a bit of noise about the
groupby() semantics before, I'm really fine with the
name now that the docs make it a little more clear how
it behaves.

Steve Howell

unread,
May 28, 2007, 6:03:42 PM5/28/07
to pytho...@python.org

--- Raymond Hettinger <pyt...@rcn.com> wrote:

That recipe implements SQL-like Group By, but it's the
uniq-like reimplementation of groupby() that I think
is error prone.

The OP was writing code that wanted the uniq-like
semantics.

Steve Howell

unread,
May 28, 2007, 6:22:25 PM5/28/07
to pytho...@python.org

Agreed.

I updated the webpages with your change (after testing
it in 2.5):

http://wiki.python.org/moin/SimplePrograms

Paul Rubin

unread,
May 28, 2007, 6:49:35 PM5/28/07
to
Raymond Hettinger <pyt...@rcn.com> writes:
> I think the OP would have been better-off with plain
> vanilla Python such as:
>
> See http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/259173

But that recipe generates the groups in a random order depending on
the dict hashing, instead of keeping them in the original sequence's
order, which the OP's application might well require.
itertools.groupby really is the right thing. I agree that itertools
is not the easiest module in the world for beginning programmers to
understand, but every serious Python user should spend some time
figuring it out sooner or later. Iterators and itertools really turn
Python into a higher-level language than it was before, giving
powerful and streamlined general-purpose mechanisms that replace a lot
of special-purpose hand-coding that usually ends up being a lot more
work to debug in addition to bloating the user's code. Itertools
should by no means be thought of as just a performance hack. It makes
programs smaller and sharper. It quickly becomes the One Obvious Way
To Do It.

In my past few kloc of Python, I think I've written just one or two
"class" statements. I used to use class instances all the time, to
maintain little bits of state that had to be held between different
operations in a program. Using itertools means I now tend to organize
entire programs as iterator pipelines so that all the data runs
"through the goose" exactly once and there is almost no need to
maintain any state anywhere outside the scope of simple function
invocations. There are just fewer user-written moving parts when a
program is written that way, and therefore fewer ways for the program
to go wrong. Messy edge cases that used to take a lot of thought to
handle correctly now need no attention at all--they just handle
themselves.

Also I think it's generally better to use a documented standard
library routine than a purpose-written routine or even a downloaded
recipe, since the stdlib routine will stay easily available from one
project to another and as the user gains experience with it, it will
become more and more powerful in his or her hands.

Also, these days I think I'd write that recipe with a defaultdict
instead of setdefault, but that's new with Python 2.5.
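
A guess at what such a defaultdict rewrite could look like, in Python 3 syntax. This is neither the cookbook's code nor mine as actually used anywhere; the function name group_by_key is invented here.

```python
from collections import defaultdict

def group_by_key(seq, key=lambda x: x):
    """Collect every item of seq under its key, SQL GROUP BY style."""
    groups = defaultdict(list)
    for value in seq:
        groups[key(value)].append(value)
    return groups

groups = group_by_key(['a', 1, 'b', 2, 3, 'c'],
                      key=lambda x: isinstance(x, str))
print(dict(groups))  # {True: ['a', 'b', 'c'], False: [1, 2, 3]}
```

Unlike itertools.groupby, this reads the whole input before returning anything, so each key appears once no matter how the input is ordered.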

Paul Rubin

unread,
May 28, 2007, 6:58:22 PM5/28/07
to
Paul Rubin <http://phr...@NOSPAM.invalid> writes:
> > See http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/259173
> But that recipe generates the groups in a random order depending on
> the dict hashing,

Correction, it generates the right order in this case, although it
builds up an in-memory copy of the entire input, which could be
problematic if the input is large (e.g. the input sequence is coming
from a file or socket stream).

Gordon Airporte

unread,
May 28, 2007, 11:02:31 PM5/28/07
to
Paul Rubin wrote:

> It chops up the iterable into a bunch of smaller ones, but the total
> size ends up the same. "Telescope", "compact", "collapse" etc. make
> it sound like the output is going to end up smaller than the input.

Good point... I guess I was thinking in terms of the number of iterators
being returned being smaller than the length of the input, and ordered
relative to the input - not about the fact that the iterators contain
all of the objects.


> There is also a dirty secret involved <wink>, which is that the
> itertools functions (including groupby) are mostly patterned after
> similarly named functions in the Haskell Prelude, which do about the
> same thing. They are aimed at helping a similar style of programming,
> so staying with similar names IMO is a good thing.

Ah - those horrible, intolerant Functionalists. I dig ;-).

> But that is what groupby does, except its notion of uniqueness is
> limited to contiguous runs of elements having the same key.

"itertools.groupby_except_the_notion_of_uniqueness_is_limited_to-
_contiguous_runs_of_elements_having_the_same_key()" doesn't have much of
a ring to it. I guess this gets back to documentation problems, because
the help string says nothing about this limitation:

'''
class groupby(__builtin__.object)
| groupby(iterable[, keyfunc]) -> create an iterator which returns
| (key, sub-iterator) grouped by each value of key(value).
|
'''

"Each" seems to imply uniqueness here.

Paul Rubin

unread,
May 28, 2007, 11:18:18 PM5/28/07
to
Gordon Airporte <JHo...@fbi.gov> writes:
> "itertools.groupby_except_the_notion_of_uniqueness_is_limited_to-
> _contiguous_runs_of_elements_having_the_same_key()" doesn't have much
> of a ring to it. I guess this gets back to documentation problems,
> because the help string says nothing about this limitation:
>
> '''
> class groupby(__builtin__.object)
> | groupby(iterable[, keyfunc]) -> create an iterator which returns
> | (key, sub-iterator) grouped by each value of key(value).
> |
> '''

I wouldn't call it a "limitation"; it's a designed behavior which is
the right thing for some purposes and maybe not for others. For
example, groupby (as currently defined) works properly on infinite
sequences, but a version that scans the entire sequence to bring
together every occurrence of every key would fail in that situation.
I agree that the doc description could be reworded slightly.
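
A small Python 3 sketch of the infinite-sequence case. The key function i // 3 is just an arbitrary choice for the demonstration:

```python
from itertools import count, groupby, islice

# count() never ends; group the integers into runs of three by i // 3.
endless_groups = groupby(count(), key=lambda i: i // 3)

# Take only the first four groups; nothing beyond them is ever computed,
# apart from the one item groupby reads to notice that the key changed.
first_four = [(k, list(g)) for k, g in islice(endless_groups, 4)]
print(first_four)
# [(0, [0, 1, 2]), (1, [3, 4, 5]), (2, [6, 7, 8]), (3, [9, 10, 11])]
```

A dict-based grouper could never return here, because it would have to exhaust count() first.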

Carsten Haese

unread,
May 28, 2007, 11:36:02 PM5/28/07
to pytho...@python.org
On Mon, 28 May 2007 23:02:31 -0400, Gordon Airporte wrote

> '''
> class groupby(__builtin__.object)
> | groupby(iterable[, keyfunc]) -> create an iterator which returns
> | (key, sub-iterator) grouped by each value of key(value).
> |
> '''
>
> "Each" seems to imply uniqueness here.

Yes, I can see how somebody might read it this way.

How about "...grouped by contiguous runs of key(value)" instead? And while
we're at it, it probably should be keyfunc(value), not key(value).

Raymond Hettinger

unread,
May 29, 2007, 2:34:36 AM5/29/07
to
On May 28, 8:36 pm, "Carsten Haese" <cars...@uniqsys.com> wrote:
> And while
> we're at it, it probably should be keyfunc(value), not key(value).

No dice. The itertools.groupby() function is typically used
in conjunction with sorted(). It would be a mistake to call
it keyfunc in one place and not in the other. The mental
association is essential. The key= nomenclature is used
throughout Python -- see min(), max(), sorted(), list.sort(),
itertools.groupby(), heapq.nsmallest(), and heapq.nlargest().
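
A minimal illustration of that pairing in Python 3 syntax, with an invented word list. The same function is passed as key= to both sorted() and groupby():

```python
from itertools import groupby

words = ['kiwi', 'fig', 'plum', 'pear', 'date', 'yam']

# The same key function goes to both calls, under the same keyword name.
by_length = {
    length: list(group)
    for length, group in groupby(sorted(words, key=len), key=len)
}
print(by_length)
# {3: ['fig', 'yam'], 4: ['kiwi', 'plum', 'pear', 'date']}
```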

Really. People need to stop making up random edits to the docs.
For the most part, the current wording is there for a reason.
The poster who wanted to rename the function to telescope() did
not participate in the extensive python-dev discussions on the
subject, did not consider the implications of unnecessarily
breaking code between versions, did not consider that the term
telescope() would mean A LOT of different things to different
people, did not consider the useful mental associations with SQL, etc.

I recognize that the naming of things and the wording
of documentation is something *everyone* has an opinion
about. Even on python-dev, it is typical that posts with
technical analysis or use case studies are far outnumbered
by posts from folks with strong opinions about how to
name things.

I also realize that you could write a book on the subject
of this particular itertool and someone somewhere would still
find it confusing. In response to this thread, I've put in
additional documentation (described in an earlier post).
I think it is time to call this one solved and move on.
It currently has a paragraph plain English description,
a pure python equivalent, an example, advice on when to
list-out the iterator, triply repeated advice to pre-sort
using the same key function, an alternate description as
a tool that groups whenever key(x) changes, a comparison to
UNIX's uniq filter, a contrast against SQL's GROUP BY clauses,
and two worked-out examples on the next page which show
sample inputs and outputs. It is now one of the most
thoroughly documented individual functions in the language.
If someone reads all that, runs a couple of experiments
at the interactive prompt, and still doesn't get it,
then god help them when they get to the threading module
or to regular expressions.

If the posters on this thread have developed an interest
in the subject, I would find it useful to hear their
ideas on new and creative ways to use groupby(). The
analogy to UNIX's uniq filter was found only after the
design was complete. Likewise, the page numbering trick
(shown above by Paul and in the examples in the docs)
was found afterwards. I have a sense that there are entire
classes of undiscovered use cases which would emerge
if serious creative effort were focused on new and
interesting key= functions (the page numbering trick
ought to serve as inspiration in this regard).
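
For readers without the docs at hand, here is a Python 3 sketch of what I take the page numbering trick to be, namely the consecutive-runs example: the key is index minus value, which stays constant inside a run of consecutive integers. The function name consecutive_runs and the page list are made up for illustration.

```python
from itertools import groupby

def consecutive_runs(numbers):
    """Split an increasing list of ints into runs of consecutive values."""
    runs = []
    # index - value is constant within a run of consecutive numbers.
    for _, group in groupby(enumerate(numbers),
                            key=lambda pair: pair[0] - pair[1]):
        runs.append([value for _, value in group])
    return runs

pages = [1, 4, 5, 6, 10, 15, 16, 17, 18, 22, 25, 26, 27, 28]
print(consecutive_runs(pages))
# [[1], [4, 5, 6], [10], [15, 16, 17, 18], [22], [25, 26, 27, 28]]
```

Each run could then be printed as a page range such as 4-6 or 15-18.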

The gauntlet has been thrown down. Any creative thinkers
up to the challenge? Give me cool recipes.


Raymond

Raymond Hettinger

unread,
May 29, 2007, 6:02:33 AM5/29/07
to
On May 28, 8:02 pm, Gordon Airporte <JHoo...@fbi.gov> wrote:
> "Each" seems to imply uniqueness here.

Doh! This sort of micro-massaging the docs misses the big picture.
If "each" meant unique across the entire input stream, then how the
heck could the function work without reading in the entire data stream
all at once. An understanding of iterators and itertools philosophy
reveals the correct interpretation. Without that understanding, it is
a fool's errand to try to inject all of the attendant knowledge into
the docs for each individual function. Without that understanding, a
user would be *much* better off using list based functions (i.e. using
zip() instead of izip() so that they will have a thorough understanding
of what their code actually does).

The itertools module necessarily requires an understanding of
iterators. The module has a clear philosophy and unifying theme. It
is about consuming data lazily, writing out results in small bits,
keeping as little as possible in memory, and being a set of composable
functional-style tools running at C speed (often making it possible to
avoid the Python eval-loop entirely).

The docs intentionally include an introduction that articulates the
philosophy and unifying theme. Likewise, there is a reason for the
examples page and the recipes page. Taken together, those three
sections and the docs on the individual functions guide a programmer
to a clear sense of what the tools are for, when to use them, how to
compose them, their inherent strengths and weaknesses, and a good
intuition about how they work under the hood.

Given that context, it is a trivial matter to explain what groupby()
does: it is an itertool (with all that implies) that emits groups
from the input stream whenever the key(x) function changes or the
stream ends.
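
That one-sentence description can be sketched in a few lines of Python 3. This is a simplified stand-in, not the actual implementation: the name simple_groupby is invented, and each run is collected into a list, whereas the real groupby() hands out a lazy sub-iterator.

```python
def simple_groupby(iterable, key=lambda x: x):
    """Yield (key, list_of_items) for each run of equal keys."""
    iterator = iter(iterable)
    try:
        first = next(iterator)
    except StopIteration:
        return  # empty input: no groups at all
    current_key, run = key(first), [first]
    for item in iterator:
        k = key(item)
        if k != current_key:
            yield current_key, run      # the key changed: emit the run
            current_key, run = k, []
        run.append(item)
    yield current_key, run              # the stream ended: emit the last run

print(list(simple_groupby('AAABBA')))
# [('A', ['A', 'A', 'A']), ('B', ['B', 'B']), ('A', ['A'])]
```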

Without the context, someone somewhere will find a way to get confused
no matter how the individual function docs are worded. When the OP
said that he hadn't read the examples, it is not surprising that he
found a way to get confused about the most complex tool in the
toolset.*

Debating the meaning of "each" is a sure sign of ignoring context and
editing with tunnel vision instead of holistic thinking. Similar
issues arise in the socket, decimal, threading and regular expression
modules. For users who do not grok those modules' unifying concepts,
no massaging of the docs for individual functions can prevent
occasional bouts of confusion.


Raymond


* -- FWIW, the OP then did the RightThing (tm) by experimenting at the
interactive prompt to observe what the function actually does and then
posted on comp.lang.python in a further effort to resolve his
understanding.

Carsten Haese

unread,
May 29, 2007, 8:39:29 AM5/29/07
to Raymond Hettinger, pytho...@python.org
On Mon, 2007-05-28 at 23:34 -0700, Raymond Hettinger wrote:
> On May 28, 8:36 pm, "Carsten Haese" <cars...@uniqsys.com> wrote:
> > And while
> > we're at it, it probably should be keyfunc(value), not key(value).
>
> No dice. The itertools.groupby() function is typically used
> in conjunction with sorted(). It would be a mistake to call
> it keyfunc in one place and not in the other. The mental
> association is essential. The key= nomenclature is used
> throughout Python -- see min(), max(), sorted(), list.sort(),
> itertools.groupby(), heapq.nsmallest(), and heapq.nlargest().

Point taken, but in that case, the argument name in the function
signature is technically incorrect. I don't really need this corrected,
I was merely pointing out the discrepancy between the name 'keyfunc' in
the signature and the call 'key(value)' in the description. For what
it's worth, which is probably very little, help(sorted) correctly
identifies the name of the key argument as 'key'.

As an aside, while groupby() will indeed often be used in conjunction
with sorted(), there is a significant class of use cases where that's
not the case: I use groupby to produce grouped reports from the results
of an SQL query. In such cases, I use ORDER BY to guarantee that the
results are supplied in the correct order rather than using sorted().
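
A sketch of that kind of report in Python 3 syntax, using an in-memory sqlite3 database so it is self-contained. The sales table, its columns, and the sample rows are all invented for the example; any database whose query uses ORDER BY on the grouping column would do.

```python
import sqlite3
from itertools import groupby
from operator import itemgetter

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE sales (region TEXT, item TEXT, amount INTEGER)')
conn.executemany('INSERT INTO sales VALUES (?, ?, ?)', [
    ('west', 'nuts', 5), ('east', 'bolts', 7),
    ('west', 'bolts', 2), ('east', 'nuts', 1),
])

# ORDER BY does the job that sorted() would otherwise do.
rows = conn.execute(
    'SELECT region, item, amount FROM sales ORDER BY region, item')

report = []
for region, group in groupby(rows, key=itemgetter(0)):
    details = [(item, amount) for _, item, amount in group]
    report.append((region, sum(amount for _, amount in details), details))

for region, total, details in report:
    print(region, total, details)
# east 8 [('bolts', 7), ('nuts', 1)]
# west 7 [('bolts', 2), ('nuts', 5)]
conn.close()
```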

Having said that, I'd like to expressly thank you for providing such a
mindbogglingly useful feature. Writing reports would be much less
enjoyable without groupby.

Best regards,

Steve Howell

unread,
May 29, 2007, 6:18:25 PM5/29/07
to Carsten Haese, Raymond Hettinger, pytho...@python.org

--- Carsten Haese <car...@uniqsys.com> wrote:
> As an aside, while groupby() will indeed often be
> used in conjunction
> with sorted(), there is a significant class of use
> cases where that's
> not the case: I use groupby to produce grouped
> reports from the results
> of an SQL query. In such cases, I use ORDER BY to
> guarantee that the
> results are supplied in the correct order rather
> than using sorted().
>

Although I'm not trying to preoptimize here, it seems
a mistake to use sorted() and groupby() in
conjunction, if you're dealing with a use case where
you don't need the groups themselves to be sorted.
Instead, you'd use something more straightforward (and
faster, I think) like the cookbook "SQL-like Group By"
example.

http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/259173

It seems to me that the classic use case for
itertools.groupby() is the never-ending stream of data
where you're just trying to pick out consecutive
related elements. For example, if you're snooping on
syslog, you could use groupby() to avoid repeating
duplicate messages to some other output stream.
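
Something along these lines, as a Python 3 sketch. The function name squeeze_repeats, the output format, and the sample messages are made up; a real syslog reader would supply the messages from a file or socket.

```python
from itertools import groupby

def squeeze_repeats(messages):
    """Yield each message once per run, noting how often it repeated."""
    for message, run in groupby(messages):
        count = sum(1 for _ in run)
        if count == 1:
            yield message
        else:
            yield '%s (repeated %d times)' % (message, count)

log = ['disk full', 'disk full', 'disk full', 'login ok', 'disk full']
squeezed = list(squeeze_repeats(log))
print(squeezed)
# ['disk full (repeated 3 times)', 'login ok', 'disk full']
```

Because only consecutive duplicates are merged, the later 'disk full' is reported again, which is the behavior you want for a log.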

George Sakkis

unread,
May 29, 2007, 11:30:14 PM5/29/07
to
On May 29, 2:34 am, Raymond Hettinger <pyt...@rcn.com> wrote:

> If the posters on this thread have developed an interest
> in the subject, I would find it useful to hear their
> ideas on new and creative ways to use groupby(). The
> analogy to UNIX's uniq filter was found only after the
> design was complete. Likewise, the page numbering trick
> (shown above by Paul and in the examples in the docs)
> was found afterwards. I have a sense that there are entire
> classes of undiscovered use cases which would emerge
> if serious creative effort where focused on new and
> interesting key= functions (the page numbering trick
> ought to serve as inspiration in this regard).
>
> The gauntlet has been thrown down. Any creative thinkers
> up to the challenge? Give me cool recipes.

Although obfuscated one-liners don't have a large coolness factor in
Python, I'll bite:

from itertools import groupby
from random import randint
x = [randint(0,100) for _ in xrange(20)]
print x
n = 7
# <-- insert fat comments here about the next line --> #
reduce(lambda acc,(rem,divs): acc[rem].extend(divs) or acc,
       groupby(x, key=lambda div: div%n),
       [[] for _ in xrange(n)])
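
Spelled out as a loop in Python 3 syntax, the one-liner amounts to roughly the following. The function name bucket_by_remainder and the fixed input list are illustrative only; the posted version uses random input.

```python
from itertools import groupby

def bucket_by_remainder(numbers, n):
    """Return n lists; buckets[r] holds the numbers with value % n == r."""
    buckets = [[] for _ in range(n)]
    # The input is not sorted, so a remainder can come up in several
    # runs; extend() merges those runs into the same bucket.
    for remainder, run in groupby(numbers, key=lambda value: value % n):
        buckets[remainder].extend(run)
    return buckets

print(bucket_by_remainder([3, 10, 4, 8, 1, 15], 7))
# [[], [8, 1, 15], [], [3, 10], [4], [], []]
```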


George

Paul Rubin

unread,
May 29, 2007, 11:30:22 PM5/29/07
to
Raymond Hettinger <pyt...@rcn.com> writes:
> The gauntlet has been thrown down. Any creative thinkers
> up to the challenge? Give me cool recipes.

Here is my version (with different semantics) of the grouper recipe in
the existing recipe section:

import operator
from itertools import groupby, imap

snd = operator.itemgetter(1) # I use this so often...

def grouper(seq, n):
    for k,g in groupby(enumerate(seq), lambda (i,x): i//n):
        yield imap(snd, g)

I sometimes use the above for chopping large (multi-gigabyte) data
sets into manageable sized runs of a program. That is, my value of n
might be something like 1 million, so making tuples that size (as the
version in the itertools docs does) starts being unpleasant. Plus,
I think the groupby version makes more intuitive sense, though it
has pitfalls if you do anything with the output other than iterate
through each item as it emerges. I guess you could always use map
instead of imap.
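
For anyone trying this on Python 3, where imap and tuple-unpacking lambdas are gone, a rough equivalent might look like this. It is a sketch of the same idea, with map (already lazy in Python 3) standing in for imap:

```python
from itertools import groupby
from operator import itemgetter

snd = itemgetter(1)

def grouper(seq, n):
    """Yield lazy iterators over successive chunks of at most n items."""
    for _, chunk in groupby(enumerate(seq), key=lambda pair: pair[0] // n):
        yield map(snd, chunk)

# Each chunk is consumed before the next one is requested.
chunks = [list(chunk) for chunk in grouper(range(7), 3)]
print(chunks)  # [[0, 1, 2], [3, 4, 5], [6]]
```

The pitfall mentioned above still applies: the chunks share one underlying iterator, so a chunk has to be used up before moving on to the next.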

Steve Howell

unread,
May 29, 2007, 11:53:00 PM5/29/07
to pytho...@python.org
On May 29, 2:34 am, Raymond Hettinger
<pyt...@rcn.com> wrote:
> The gauntlet has been thrown down. Any creative
> thinkers
> up to the challenge? Give me cool recipes.
>

I don't make any claims to coolness, but I can say
that I myself would have written the code below with
significantly more complexity before groupby(), and I
can see the utility for this code in dealing with
certain mail programs.

The code is phrased verbosely, even if you remove the
tests, but the meat of it could be boiled down to a
one-liner.


import itertools
lines = '''
This is the
first paragraph.

This is the second.
'''.splitlines()
# Use itertools.groupby and bool to return groups of
# consecutive lines that either have content or don't.
for has_chars, frags in itertools.groupby(lines, bool):
    if has_chars:
        print ' '.join(frags)
# PRINTS:
# This is the first paragraph.
# This is the second.



Steve Howell

unread,
May 30, 2007, 12:07:23 AM5/30/07
to pytho...@python.org
> Raymond Hettinger <pyt...@rcn.com> writes:
> > The gauntlet has been thrown down. Any creative
> thinkers
> > up to the challenge? Give me cool recipes.
>

Twin primes? (Sorry, no code, but there's a good
Python example somewhere that returns an iterator that
keeps doing the sieve, feed it to groupby,...)




tut...@gmail.com

unread,
Jun 5, 2007, 12:29:35 PM6/5/07
to
On May 27, 7:50 pm, Raymond Hettinger <pyt...@rcn.com> wrote:
> The groupby itertool came-out in Py2.4 and has had remarkable
> success (people seem to get what it does and like using it, and
> there have been no bug reports or reports of usability problems).

With due respect, I disagree. Bug ID #1212077 is either a bug report
or a report of a usability problem, depending on your point of view.
You may disagree on whether or not this is a problem that needs to be
fixed, but it *is* a report.

http://sourceforge.net/tracker/index.php?func=detail&aid=1212077&group_id=5470&atid=105470


I think the semantics of the itertools groupby are too tricky for
naive users--I find them confusing myself, and I've been using Python
for quite a while. I still hope that Python will someday gain a
groupby function suitable for ordinary use. Until that happens, I
recommend the following cookbook entry:

# from http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/259173

class groupby(dict):
    def __init__(self, seq, key=lambda x:x):
        for value in seq:
            k = key(value)
            self.setdefault(k, []).append(value)
    __iter__ = dict.iteritems


Mike

sjde...@yahoo.com

unread,
Jun 5, 2007, 1:17:30 PM6/5/07
to
tut...@gmail.com wrote:
> On May 27, 7:50 pm, Raymond Hettinger <pyt...@rcn.com> wrote:
> > The groupby itertool came-out in Py2.4 and has had remarkable
> > success (people seem to get what it does and like using it, and
> > there have been no bug reports or reports of usability problems).
>
> With due respect, I disagree. Bug ID #1212077 is either a bug report
> or a report of a usability problem, depending on your point of view.
> You may disagree on whether or not this is a problem that needs to be
> be fixed, but it *is* a report.
>
> http://sourceforge.net/tracker/index.php?func=detail&aid=1212077&group_id=5470&atid=105470
>
>
> I think the semantics of the itertools groupby are too tricky for
> naive users

Itertools isn't targeted primarily at naive users. It can be useful
to them, but it's really there to allow sophisticated work on
iterables without reading them all in at once (indeed, it works
properly on infinite iterables). That's pretty much _the_ defining
characteristic of itertools.

Anyone who's doing that knows you can't do infinite lookahead, so you
can't do a sort or a group-by over the entire data set. IOW, for
anyone who would be looking to use itertools for what it's designed
for, the kind of operation you specify below would be very unexpected.

>--I find them confusing myself, and I've been using Python
> for quite a while. I still hope that Python will someday gain a
> groupby function suitable for ordinary use. Until that happens, I
> recommend the following cookbook entry:
>
> # from http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/259173
>
> class groupby(dict):
> def __init__(self, seq, key=lambda x:x):
> for value in seq:
> k = key(value)
> self.setdefault(k, []).append(value)
> __iter__ = dict.iteritems

The itertools groupby is incredibly useful for writing SQL object
mappers. It's exactly what I wanted when I first started looking in
itertools to see if there was a way to consolidate rows.

Also, that recipe goes against the spirit of itertools--if I'm going
out of my way to use itertools, it usually means I may be working with
very large data sets that I can't read into memory. It's a useful
recipe, but it's also likely to be unusable in the context of
itertools-related problem domains.

BJörn Lindqvist

unread,
Jun 5, 2007, 3:00:40 PM6/5/07
to 7stud, pytho...@python.org
On 27 May 2007 10:49:06 -0700, 7stud <bbxx78...@yahoo.com> wrote:
> On May 27, 11:28 am, Steve Howell <showel...@yahoo.com> wrote:
> > The groupby method has its uses, but it's behavior is
> > going to be very surprising to anybody that has used
> > the "group by" syntax of SQL, because Python's groupby
> > method will repeat groups if your data is not sorted,
> > whereas SQL has the luxury of (knowing that it's)
> > working with a finite data set, so it can provide the
> > more convenient semantics.
> > The groupby method has its uses
>
> I'd settle for a simple explanation of what it does in python.

Here is another example:

import itertools
import random

dierolls = sorted(random.randint(1, 6) for x in range(200))

for number, numbers in itertools.groupby(dierolls):
    number_count = len(list(numbers))
    print number, "came up", number_count, "times."

--
mvh Björn
