getting n items at a time from a generator


Kugutsumen

Dec 27, 2007, 6:34:57 AM
I am relatively new to the Python language and I am afraid I am missing
some clever construct or built-in way equivalent to my 'chunk'
generator below.

def chunk(size, items):
    """generate N items from a generator."""
    chunk = []
    count = 0
    while True:
        try:
            item = items.next()
            count += 1
        except StopIteration:
            yield chunk
            break
        chunk.append(item)
        if not (count % size):
            yield chunk
            chunk = []
            count = 0

>>> t = (i for i in range(30))
>>> c = chunk(7, t)
>>> for i in c:
...     print i
...
[0, 1, 2, 3, 4, 5, 6]
[7, 8, 9, 10, 11, 12, 13]
[14, 15, 16, 17, 18, 19, 20]
[21, 22, 23, 24, 25, 26, 27]
[28, 29]

In my real-world project, I have over 250 million items that are too
big to fit in memory and that are processed and later used to update
records in a database. To minimize disk I/O, I found it was more
efficient to process them in batches or "chunks" of 50,000 or so, hence
the generator above.

Is this the proper way to do this?
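
For readers who want to see the batching pattern described above end to end, here is a minimal sketch in Python 3 (not the Python 2 of this thread). It assumes an in-memory SQLite database purely for illustration; the table name `records`, its columns, and the helper `update_in_batches` are hypothetical and not from the original post.

```python
import sqlite3
from itertools import islice


def chunk(items, size):
    """Yield lists of up to `size` items from any iterable."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def update_in_batches(conn, rows, batch_size=50000):
    """Apply (value, record_id) updates, committing once per batch.

    The table and column names are hypothetical.
    """
    batches = 0
    for batch in chunk(rows, batch_size):
        conn.executemany("UPDATE records SET value = ? WHERE id = ?", batch)
        conn.commit()
        batches += 1
    return batches


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, value INTEGER)")
conn.executemany("INSERT INTO records VALUES (?, 0)", ((i,) for i in range(30)))

# The rows come from a generator, so only one batch is held in memory at a time.
rows = ((i * 10, i) for i in range(30))
n_batches = update_in_batches(conn, rows, batch_size=7)
```

Only one batch is materialized as a list at any moment, which is the property the original post is after.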

Paul Hankin

Dec 27, 2007, 7:07:52 AM
On Dec 27, 11:34 am, Kugutsumen <kugutsu...@gmail.com> wrote:
> I am relatively new to the Python language and I am afraid I am missing
> some clever construct or built-in way equivalent to my 'chunk'
> generator below.
>
> def chunk(size, items):
>     """generate N items from a generator."""
>     chunk = []
>     count = 0
>     while True:
>         try:
>             item = items.next()
>             count += 1
>         except StopIteration:
>             yield chunk
>             break
>         chunk.append(item)
>         if not (count % size):
>             yield chunk
>             chunk = []
>             count = 0

The itertools module is always a good place to look when you've got a
complicated generator.

import itertools
import operator

def chunk(N, items):
    "Group items in chunks of N"
    def clump((n, _)):
        return n // N
    for _, group in itertools.groupby(enumerate(items), clump):
        yield itertools.imap(operator.itemgetter(1), group)

for ch in chunk(7, range(30)):
    print list(ch)


I've changed chunk to yield a generator for each chunk rather than
building a list, since each chunk is probably only going to be iterated
over. But if you prefer the list version, replace 'itertools.imap' with
'map'.
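
The same groupby approach can be sketched in Python 3, where tuple parameters such as `clump((n, _))` and `itertools.imap` no longer exist. This is an adaptation under that assumption, not the code as posted: a lambda indexes the `(index, item)` pair, and the built-in `map` is already lazy.

```python
from itertools import groupby
from operator import itemgetter


def chunk(n, items):
    """Group items in chunks of n, yielding one lazy iterator per chunk."""
    # enumerate() pairs each item with its index; index // n is the same
    # for every item in a chunk, so groupby() splits on chunk boundaries.
    for _, group in groupby(enumerate(items), key=lambda pair: pair[0] // n):
        # Strip the index again, keeping only the original item.
        yield map(itemgetter(1), group)


# Each chunk is consumed with list() before the next one is requested.
chunks = [list(ch) for ch in chunk(7, range(30))]
```

One caveat with this design: the groups share the underlying iterator, so each chunk has to be consumed before advancing to the next one.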

--
Paul Hankin

Kugutsumen

Dec 27, 2007, 7:17:30 AM

Thanks, I am going to take a look at itertools.
I prefer the list version since I need to buffer that chunk in memory
at this point.

Steven D'Aprano

Dec 27, 2007, 7:24:29 AM
On Thu, 27 Dec 2007 03:34:57 -0800, Kugutsumen wrote:

> I am relatively new to the Python language and I am afraid I am missing
> some clever construct or built-in way equivalent to my 'chunk' generator
> below.
>
> def chunk(size, items):
>     """generate N items from a generator."""

[snip code]


Try this instead:


import itertools

def chunk(iterator, size):
    # I prefer the argument order to be the reverse of yours.
    while True:
        chunk = list(itertools.islice(iterator, size))
        if chunk: yield chunk
        else: break


And in use:

>>> it = chunk(iter(xrange(30)), 7)
>>> for L in it:
...     print L
...
[0, 1, 2, 3, 4, 5, 6]
[7, 8, 9, 10, 11, 12, 13]
[14, 15, 16, 17, 18, 19, 20]
[21, 22, 23, 24, 25, 26, 27]
[28, 29]

--
Steven
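
As a side note, the same islice loop can be written more compactly with the two-argument form of the built-in `iter()`. This is a sketch in Python 3, not something proposed in the thread: `iter(callable, sentinel)` keeps calling the callable until it returns a value equal to the sentinel, here the empty list.

```python
from itertools import islice


def chunk(iterable, size):
    """Return an iterator over lists of up to `size` items."""
    iterator = iter(iterable)
    # Each call takes the next `size` items; an empty list means the
    # source is exhausted, which matches the sentinel and stops iteration.
    return iter(lambda: list(islice(iterator, size)), [])


result = list(chunk(range(30), 7))
```

Whether this is clearer than the explicit while loop is a matter of taste; the behaviour is intended to be the same.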

Terry Jones

Dec 27, 2007, 7:24:51 AM
to Kugutsumen, pytho...@python.org
>>>>> "Kugutsumen" == Kugutsumen <kugut...@gmail.com> writes:

Kugutsumen> On Dec 27, 7:07 pm, Paul Hankin <paul.han...@gmail.com> wrote:
>> On Dec 27, 11:34 am, Kugutsumen <kugutsu...@gmail.com> wrote:
>>
>> > I am relatively new to the Python language and I am afraid I am missing
>> > some clever construct or built-in way equivalent to my 'chunk'
>> > generator below.

Kugutsumen> Thanks, I am going to take a look at itertools. I prefer the
Kugutsumen> list version since I need to buffer that chunk in memory at
Kugutsumen> this point.

Also consider this solution from O'Reilly's Python Cookbook (2nd Ed.) p705

from itertools import izip

def chop(iterable, length=2):
    return izip(*(iter(iterable),) * length)

Terry

Kugutsumen

Dec 27, 2007, 8:00:27 AM
On Dec 27, 7:24 pm, Terry Jones <te...@jon.es> wrote:

Thanks Terry,

However, chop ignores the remainder of the data in the example.

>>> t = (i for i in range(30))
>>> c = chop(t, 7)
>>> for ch in c:
...     print ch
...
(0, 1, 2, 3, 4, 5, 6)
(7, 8, 9, 10, 11, 12, 13)
(14, 15, 16, 17, 18, 19, 20)
(21, 22, 23, 24, 25, 26, 27)

k

Kugutsumen

Dec 27, 2007, 8:31:00 AM
On Dec 27, 7:24 pm, Terry Jones <te...@jon.es> wrote:

> [snip code]
>
> Try this instead:
>
> import itertools
>
> def chunk(iterator, size):
>     # I prefer the argument order to be the reverse of yours.
>     while True:
>         chunk = list(itertools.islice(iterator, size))
>         if chunk: yield chunk
>         else: break

Steven, I really like your version since I've managed to understand it
in one pass.
Paul's version works but is too obscure to read for me :)

Thanks a lot again.


Shane Geiger

Dec 27, 2007, 10:43:13 AM
to Kugutsumen, pytho...@python.org
# http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/496958

from itertools import *

def group(lst, n):
    """group([0,3,4,10,2,3], 2) => iterator

    Group an iterable into an n-tuples iterable. Incomplete tuples
    are padded with Nones e.g.

    >>> list(group(range(10), 3))
    [(0, 1, 2), (3, 4, 5), (6, 7, 8), (9, None, None)]
    """
    iters = tee(lst, n)
    iters = [iters[0]] + [chain(iter, repeat(None))
                          for iter in iters[1:]]
    return izip(
        *[islice(iter, i, None, n) for i, iter
          in enumerate(iters)])

import string
for grp in list(group(string.letters,25)):
    print grp

"""
('a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n',
'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y')
('z', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M',
'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X')
('Y', 'Z', None, None, None, None, None, None, None, None, None, None,
None, None, None, None, None, None, None, None, None, None, None, None,
None)

"""


--
Shane Geiger
IT Director
National Council on Economic Education
sge...@ncee.net | 402-438-8958 | http://www.ncee.net

Leading the Campaign for Economic and Financial Literacy

Tim Roberts

Dec 29, 2007, 2:18:39 AM
Kugutsumen <kugut...@gmail.com> wrote:
>
>I am relatively new to the Python language and I am afraid I am missing
>some clever construct or built-in way equivalent to my 'chunk'
>generator below.

I have to say that I have found this to be a surprisingly common need as
well. Would this be an appropriate construct to add to itertools?
--
Tim Roberts, ti...@probo.com
Providenza & Boekelheide, Inc.
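
For readers on a current Python: itertools did eventually gain such a function, `itertools.batched`, in Python 3.12. The sketch below uses it when available and otherwise falls back to a small islice-based stand-in; the fallback is an illustration written for this example, not the standard library's implementation.

```python
from itertools import islice

try:
    from itertools import batched  # available from Python 3.12
except ImportError:
    def batched(iterable, n):
        """Stand-in for older versions: yield tuples of up to n items."""
        iterator = iter(iterable)
        while True:
            batch = tuple(islice(iterator, n))
            if not batch:
                return
            yield batch


result = list(batched(range(30), 7))
```

Note that the batches are tuples rather than lists, and the final batch is shorter rather than padded.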

Igor V. Rafienko

Dec 29, 2007, 2:36:31 PM
[ Terry Jones ]

[ ... ]

> Also consider this solution from O'Reilly's Python Cookbook (2nd Ed.) p705
>
> def chop(iterable, length=2):
>     return izip(*(iter(iterable),) * length)


Is this *always* guaranteed by the language to work? Should the
iterator returned by izip() change the implementation and evaluate the
underlying iterators in, say, reverse order, the solution would no
longer function, would it? Or is it something the return value of
izip() would never do?

(I am just trying to understand the solution, not criticize it. Took a
while to parse the argument(s) to izip in the example).

ivr
--
<+Kaptein-Dah> igorr: for få parenteser
<+Kaptein-Dah> igorr: parenteser virker som lubrication under iterasjon
<+Kaptein-Dah> igorr: velkjent

Terry Jones

Dec 29, 2007, 3:01:22 PM
to Igor V. Rafienko, pytho...@python.org
Hi Igor

>>>>> "Igor" == Igor V Rafienko <ig...@ifi.uio.no> writes:

>> Also consider this solution from O'Reilly's Python Cookbook (2nd Ed.)
>> p705
>>
>> def chop(iterable, length=2):
>>     return izip(*(iter(iterable),) * length)

Igor> Is this *always* guaranteed by the language to work? Should the
Igor> iterator returned by izip() change the implementation and evaluate
Igor> the underlying iterators in, say, reverse order, the solution would
Igor> no longer function, would it? Or is it something the return value of
Igor> izip() would never do?

Igor> (I am just trying to understand the solution, not criticize it. Took
Igor> a while to parse the argument(s) to izip in the example).

I had to look at it a bit too. I actually deleted the comment I wrote
about it in my own code before posting it here and decided to simply say
"consider" in the above instead :-)

As far as I understand it, you're right. The docstring for izip doesn't
guarantee that it will pull items from the passed iterables in any
particular order. So an alternate implementation of izip might produce
other results. If it pulled them in reverse order, you'd get each
n-chunk reversed, etc.
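
To make the ordering question concrete, here is a small sketch in Python 3 (where the built-in `zip` plays the role of `izip`). The key point is that `(it,) * length` holds several references to one iterator object, so the result depends on the order in which zip pulls from its arguments. The function `reversed_pull` is a hypothetical zip variant written only to show what a right-to-left implementation would produce.

```python
def chop(iterable, length=2):
    it = iter(iterable)
    # The tuple holds `length` references to the same iterator object,
    # so each output tuple consumes `length` consecutive items.
    return zip(*(it,) * length)


def reversed_pull(it, length):
    """Hypothetical zip variant that fills each tuple right to left."""
    while True:
        slots = [None] * length
        try:
            for i in reversed(range(length)):
                slots[i] = next(it)
        except StopIteration:
            return  # drop the incomplete tuple, as zip does
        yield tuple(slots)


args = (iter("abcdef"),) * 3
same_object = all(a is args[0] for a in args)

left_to_right = list(chop(range(6), 3))
right_to_left = list(reversed_pull(iter(range(6)), 3))
```

With left-to-right evaluation the chunks come out in source order; the hypothetical right-to-left variant reverses each chunk, as described above.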

Terry

Raymond Hettinger

Dec 29, 2007, 3:09:01 PM
> >     def chop(iterable, length=2):
> >         return izip(*(iter(iterable),) * length)
>
> Is this *always* guaranteed by the language to work?

Yes!

Users requested this guarantee, and I agreed. The docs now explicitly
guarantee this behavior.


Raymond

Raymond Hettinger

Dec 29, 2007, 3:12:30 PM
> > Also consider this solution from O'Reilly's Python Cookbook (2nd Ed.) p705
>
> >     def chop(iterable, length=2):
> >         return izip(*(iter(iterable),) * length)
>
> However, chop ignores the remainder of the data in the example.

There is a recipe in the itertools docs which handles the odd-length
data at the end:

from itertools import chain, izip, repeat

def grouper(n, iterable, padvalue=None):
    "grouper(3, 'abcdefg', 'x') --> ('a','b','c'), ('d','e','f'), ('g','x','x')"
    return izip(*[chain(iterable, repeat(padvalue, n-1))]*n)


Raymond
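
In Python 3 the same padded grouping is usually written with `itertools.zip_longest`, which supplies the fill value itself. The following is a sketch of that variant, keeping the argument order of the recipe quoted above; it is an adaptation, not the recipe as posted.

```python
from itertools import zip_longest


def grouper(n, iterable, padvalue=None):
    """grouper(3, 'abcdefg', 'x') --> ('a','b','c'), ('d','e','f'), ('g','x','x')"""
    # n references to one iterator: each output tuple takes n consecutive items,
    # and zip_longest pads the last tuple when the data runs out.
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=padvalue)


groups = list(grouper(3, "abcdefg", "x"))
```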

NickC

Jan 1, 2008, 12:06:53 AM

To work with an arbitrary iterable, the islice version needs an extra
line at the start (the call to iter() below) to ensure the items are
consumed correctly each time around the loop. It may also be better to
ensure the final chunk is the same length as the other chunks. If that
isn't what you want (you want to know where the data really ends), then
leave out the parts relating to adding the padding object.

import itertools

def chunk(iterable, size, pad=None):
    iterator = iter(iterable)
    padding = [pad]
    while True:
        chunk = list(itertools.islice(iterator, size))
        if chunk:
            yield chunk + (padding*(size-len(chunk)))
        else:
            break

Cheers,
Nick.
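
Why the extra `iter()` line matters can be shown in a few lines. This is a small Python 3 illustration, not part of the post above: `islice` applied directly to a list starts from the beginning on every call, so a loop built on it would keep producing the first chunk, whereas an iterator remembers its position.

```python
from itertools import islice

data = list(range(10))

# Without iter(): every islice() call starts over at the front of the list.
first = list(islice(data, 3))
second = list(islice(data, 3))

# With iter(): the iterator keeps its position between calls.
iterator = iter(data)
a = list(islice(iterator, 3))
b = list(islice(iterator, 3))
```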


Shane Geiger

Jan 10, 2008, 10:36:31 PM
to Paul Rubin, pytho...@python.org

Paul Rubin wrote:

> Tim Roberts <ti...@probo.com> writes:
>
>> I have to say that I have found this to be a surprisingly common need as
>> well. Would this be an appropriate construct to add to itertools?
>>
>
> I'm in favor.
>


I am ecstatic about the idea of getting n items at a time from a
generator! This would eliminate the use of less elegant functions to do
this sort of thing which I would do even more frequently if it were
easier.

Is it possible that this syntax for generator expressions could be adopted?

>>> sentence = 'this is a senTence WiTH'
>>> generator = (word.capitalize() for word in sentence.split())
>>> print generator.next(3,'PadValue')
('This','Is','A')
>>> print generator.next(3,'PadValue')
('Sentence','With','PadValue')
>>> generator.next(3,'PadValue')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration
>>>


While on the topic of generators:

Something else I have longed for is assignment within a while loop. (I
realize this might be more controversial and might have been avoided on
purpose, but I wasn't around for that discussion.)


>>> sentence = 'this is a senTence WiTH'
>>> generator = (word.capitalize() for word in sentence.split())
>>> while a,b,c = generator.next(3,'PadValue'):
...     print a,b,c
...
This Is A
Sentence With PadValue
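
Neither piece of syntax above exists, but something close can be sketched in Python 3.8 or later, where the assignment expression `:=` allows assignment inside a while condition. The helper `take` below is hypothetical and stands in for the proposed `generator.next(3, 'PadValue')`; unlike the proposal, it returns an empty tuple when the generator is exhausted instead of raising StopIteration, so that the loop condition can test it.

```python
from itertools import islice


def take(iterator, n, pad):
    """Return the next n items as a tuple, padded with `pad`.

    Returns an empty tuple once the iterator is exhausted.
    """
    items = tuple(islice(iterator, n))
    if not items:
        return ()
    return items + (pad,) * (n - len(items))


sentence = 'this is a senTence WiTH'
generator = (word.capitalize() for word in sentence.split())

lines = []
# The walrus operator binds `group` and tests it in the same expression.
while group := take(generator, 3, 'PadValue'):
    a, b, c = group
    lines.append((a, b, c))
```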
