Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

chunking a long string?

63 views
Skip to first unread message

Roy Smith

unread,
Nov 8, 2013, 12:48:12 PM11/8/13
to Python List
I have a long string (several Mbytes). I want to iterate over it in manageable chunks (say, 1 kbyte each). For (a small) example, if I started with "this is a very long string", and I wanted 10 character chunks, I should get:

"this is a "
"very long "
"string"

This seems like something itertools would do, but I don't see anything. Is there something, or do I just need to loop and slice (and worry about getting all the edge conditions right) myself?

---
Roy Smith
r...@panix.com

Mark Lawrence

unread,
Nov 8, 2013, 12:52:30 PM11/8/13
to pytho...@python.org
Any good to you http://pythonhosted.org/more-itertools/api.html ?

--
Python is the second best programming language in the world.
But the best has yet to be invented. Christian Tismer

Mark Lawrence

Nick Cash

unread,
Nov 8, 2013, 12:59:33 PM11/8/13
to Roy Smith, Python List
> I have a long string (several Mbytes). I want to iterate over it in manageable chunks

This is a weirdly common question. See http://stackoverflow.com/questions/312443/how-do-you-split-a-list-into-evenly-sized-chunks-in-python for several solutions.

It's been proposed to be added to itertools before, but rejected: https://mail.python.org/pipermail/python-ideas/2012-July/015671.html and http://bugs.python.org/issue13095

- Nick Cash

Peter Otten

unread,
Nov 8, 2013, 1:02:39 PM11/8/13
to pytho...@python.org
Roy Smith wrote:

> I have a long string (several Mbytes). I want to iterate over it in
> manageable chunks (say, 1 kbyte each). For (a small) example, if I
> started with "this is a very long string", and I wanted 10 character
> chunks, I should get:
>
> "this is a "
> "very long "
> "string"
>
> This seems like something itertools would do, but I don't see anything.
> Is there something, or do I just need to loop and slice (and worry about
> getting all the edge conditions right) myself?

(x)range() can take care of the edges:

>>> s = "this is a very long string"
>>> def chunks(s, size):
... for start in xrange(0, len(s), size):
... yield s[start:start+size]
...
>>> list(chunks(s, 10))
['this is a ', 'very long ', 'string']
>>> list(chunks(s, 5))
['this ', 'is a ', 'very ', 'long ', 'strin', 'g']
>>> list(chunks(s, 100))
['this is a very long string']

Or you use StringIO:

>>> from functools import partial
>>> from StringIO import StringIO
>>> list(iter(partial(StringIO(s).read, 5), ""))
['this ', 'is a ', 'very ', 'long ', 'strin', 'g']

And no, this need not be a one-liner ;)

Skip Montanaro

unread,
Nov 8, 2013, 1:04:55 PM11/8/13
to Roy Smith, Python List
> I have a long string (several Mbytes). I want to iterate over it in manageable chunks (say, 1 kbyte each).

You don't mention if the string is in memory or on disk. If it's in memory:

>>> for i in range(0, len(s), 10):
... print repr(s[i:i+10])
...
'this is a '
'very long '
'string'

If your string is on disk, just loop over an open file object, reading
your chunk size every pass of the loop.

Skip

Zero Piraeus

unread,
Nov 8, 2013, 1:09:33 PM11/8/13
to pytho...@python.org
:

On Fri, Nov 08, 2013 at 12:48:12PM -0500, Roy Smith wrote:
> I have a long string (several Mbytes). I want to iterate over it in
> manageable chunks (say, 1 kbyte each).
>
> "this is a "
> "very long "
> "string"
>
> This seems like something itertools would do, but I don't see anything.

You could use io.StringIO (or StringIO.StringIO in Python 2.x):

from io import StringIO
big_str = 'x' * 10000000
stream = StringIO(big_str)
while True:
chunk = stream.read(1024)
if not chunk:
break
# process chunk

-[]z.

--
Zero Piraeus: ad referendum
http://etiol.net/pubkey.asc

Roy Smith

unread,
Nov 8, 2013, 1:24:40 PM11/8/13
to Nick Cash, Python List
Oh my, it turns out I don't really need to do this after all, due to previously undiscovered uber-coolness in the tools I'm using!

My use case is that from inside of a Django view, I needed to retrieve a large file via a HTTP GET, and serve that back up, with some time delays inserted into the data stream. Turns out, requests (uber-cool tool #1) provides a way to iterate over the content of a GET, and Django (uber-cool tool #2) provides a way to build a HttpResponse from the data in an iterator. Epiphany! I ended up with (essentially) this:

def stream_slow(request, song_id):
"""Streams a song, but does it extra slowly, for client testing
purposes.

"""
def _slow_stream(r, chunk_size):
for chunk in r.iter_content(chunk_size):
yield chunk
time.sleep(0.1)

url = get_url(song_id)
response = requests.get(url, stream=True)
return HttpResponse(_slow_stream(response, 1024))




On Nov 8, 2013, at 12:59 PM, Nick Cash wrote:

>> I have a long string (several Mbytes). I want to iterate over it in manageable chunks
>
> This is a weirdly common question. See http://stackoverflow.com/questions/312443/how-do-you-split-a-list-into-evenly-sized-chunks-in-python for several solutions.
>
> It's been proposed to be added to itertools before, but rejected: https://mail.python.org/pipermail/python-ideas/2012-July/015671.html and http://bugs.python.org/issue13095
>
> - Nick Cash
>


---
Roy Smith
r...@panix.com



wxjm...@gmail.com

unread,
Nov 8, 2013, 3:43:43 PM11/8/13
to

"(say, 1 kbyte each)": one "kilo" of characters or bytes?

Glad to read some users are still living in an ascii world,
at the "Unicode time" where an encoded code point size may vary
between 1-4 bytes.


Oops, sorry, I'm wrong, it can be much more.

>>> sys.getsizeof('ab')
27
>>> sys.getsizeof('a\U0001d11e')
48
>>>

jmf


Chris Angelico

unread,
Nov 8, 2013, 3:53:10 PM11/8/13
to pytho...@python.org
On Sat, Nov 9, 2013 at 7:43 AM, <wxjm...@gmail.com> wrote:
> Oops, sorry, I'm wrong, it can be much more.
>
>>>> sys.getsizeof('ab')
> 27
>>>> sys.getsizeof('a\U0001d11e')
> 48
>>>>

I know, overhead sucks doesn't it. Python is really abysmal at that;
look how big a single bit is:

>>> sys.getsizeof(1)
14
>>> sys.getsizeof(True)
14

On the flip side, Python gets really awesome at some other things.
Your operating system probably takes an entire CD to distribute, maybe
even a DVD, so that's either 700MB or 4.7GB, give or take. Look how
efficiently Python can represent it:

>>> sys.getsizeof(os)
36

Wow!

ChrisA

Mark Lawrence

unread,
Nov 8, 2013, 3:57:32 PM11/8/13
to pytho...@python.org
For any newcomers please ignore the rubbish that "Joseph McCarthy" Faust
comes up with from time to time. He's been asked repeatedly to come up
with evidence to support his claims regarding PEP 393, the Flexible
String Representation, but he never does, clearly because he can't.
Instead he provides micro benchmarks or meaningless numbers like those
above.

Tim Chase

unread,
Nov 8, 2013, 4:04:43 PM11/8/13
to Chris Angelico, pytho...@python.org
On 2013-11-09 07:53, Chris Angelico wrote:
> On the flip side, Python gets really awesome at some other things.
> Your operating system probably takes an entire CD to distribute,
> maybe even a DVD, so that's either 700MB or 4.7GB, give or take.
> Look how efficiently Python can represent it:
>
> >>> sys.getsizeof(os)
> 36

Someone has been hanging out too much over on that thread about
compressing random data ;-)

-tkc



Mark Lawrence

unread,
Nov 8, 2013, 4:06:17 PM11/8/13
to pytho...@python.org
On 08/11/2013 20:53, Chris Angelico wrote:
> On Sat, Nov 9, 2013 at 7:43 AM, <wxjm...@gmail.com> wrote:
>> Oops, sorry, I'm wrong, it can be much more.
>>
>>>>> sys.getsizeof('ab')
>> 27
>>>>> sys.getsizeof('a\U0001d11e')
>> 48
>>>>>
>
> I know, overhead sucks doesn't it. Python is really abysmal at that;
> look how big a single bit is:
>
>>>> sys.getsizeof(1)
> 14
>>>> sys.getsizeof(True)
> 14
>
> On the flip side, Python gets really awesome at some other things.
> Your operating system probably takes an entire CD to distribute, maybe
> even a DVD, so that's either 700MB or 4.7GB, give or take. Look how
> efficiently Python can represent it:
>
>>>> sys.getsizeof(os)
> 36
>
> Wow!
>
> ChrisA
>

Those figures look really good but I actually want figures that do
things my way, even if the figures aren't as good or even suck
completely. Can you help me with this even if I've already asked 42
times before but have always been given the same figures in response?

Chris Angelico

unread,
Nov 8, 2013, 4:04:17 PM11/8/13
to pytho...@python.org
On Sat, Nov 9, 2013 at 8:04 AM, Tim Chase <pytho...@tim.thechases.com> wrote:
> On 2013-11-09 07:53, Chris Angelico wrote:
>> On the flip side, Python gets really awesome at some other things.
>> Your operating system probably takes an entire CD to distribute,
>> maybe even a DVD, so that's either 700MB or 4.7GB, give or take.
>> Look how efficiently Python can represent it:
>>
>> >>> sys.getsizeof(os)
>> 36
>
> Someone has been hanging out too much over on that thread about
> compressing random data ;-)

Hey, that's a bit unfair! Operating systems aren't full of random data!

At least... well, I can't speak for Windows here...

*dives for cover*

ChrisA

Chris Angelico

unread,
Nov 8, 2013, 4:17:33 PM11/8/13
to pytho...@python.org
On Sat, Nov 9, 2013 at 8:06 AM, Mark Lawrence <bream...@yahoo.co.uk> wrote:
> Those figures look really good but I actually want figures that do things my
> way, even if the figures aren't as good or even suck completely. Can you
> help me with this even if I've already asked 42 times before but have always
> been given the same figures in response?

Yep! I can even offer you an hourglass figure. Here, watch this figure
of an hourglass while you wait for a different answer...

ChrisA

Steven D'Aprano

unread,
Nov 8, 2013, 7:46:32 PM11/8/13
to
On Fri, 08 Nov 2013 12:43:43 -0800, wxjmfauth wrote:

> "(say, 1 kbyte each)": one "kilo" of characters or bytes?
>
> Glad to read some users are still living in an ascii world, at the
> "Unicode time" where an encoded code point size may vary between 1-4
> bytes.
>
>
> Oops, sorry, I'm wrong,

That part is true.


> it can be much more.

That part is false. You're measuring the overhead of the object
structure, not the per-character storage. This has been the case going
back since at least Python 2.2: strings are objects, and have overhead.

>>>> sys.getsizeof('ab')
> 27

27 bytes for two characters! Except it isn't, it's actually 25 bytes for
the object header and two bytes for the two characters.

>>>> sys.getsizeof('a\U0001d11e')
> 48

And here you have four bytes each for the two characters and a 40 byte
header. Observe:

py> c = '\U0001d11e'
py> len(c)
1
py> sys.getsizeof(2*c) - sys.getsizeof(c)
4
py> sys.getsizeof(1000*c) - sys.getsizeof(999*c)
4


How big is the object overhead on a (say) thousand character string? Just
one percent:

py> (sys.getsizeof(1000*c) - 4000)/4000
0.01



--
Steven

Steven D'Aprano

unread,
Nov 8, 2013, 7:54:13 PM11/8/13
to
What edge conditions? Should be trivially easy to loop and slice:

def grouper(string, size):
i = 0
while i <= len(string):
yield string[i:i+size]
i += size


But if you prefer, there is a recipe in the itertools documentation to
solve this problem for you:

http://docs.python.org/2/library/itertools.html#recipes

It's short enough to reproduce here.

from itertools import izip_longest
def grouper(iterable, n, fillvalue=None):
"Collect data into fixed-length chunks or blocks"
# grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
args = [iter(iterable)] * n
return izip_longest(fillvalue=fillvalue, *args)

grouper(your_string, 10, '')

ought to give you the results you want.


I expect (but haven't tested) that for strings, the slice version will be
faster.


--
Steven

wxjm...@gmail.com

unread,
Nov 9, 2013, 3:14:20 AM11/9/13
to
--------

Sure, the new phone "xyz" does not cost 600$, it only cost
only 100$ more than the "abc" 500$ phone model.


If you wish to count the the frequency of chars in a text
and store the results in a dict, {char: number_of_that_char, ...},
do not forget to save the key in utf-XXX, it saves memory.

After all, it is much more funny to waste its time in coding
and in attempting to handle unicode properly and to observe
this poor Python wasting its time in conversions.

>>> sys.getsizeof('a')
26
>>> sys.getsizeof('\U0001d11e')
44
>>> sys.getsizeof('\U0001d11e'.encode('utf-32'))
25


Hint: If you attempt to do the same exercise with
words in a "latin" text, never forget the length average
of a word is approximatively 1000 chars.

jmf


Chris Angelico

unread,
Nov 9, 2013, 3:26:20 AM11/9/13
to pytho...@python.org
On Sat, Nov 9, 2013 at 7:14 PM, <wxjm...@gmail.com> wrote:
> If you wish to count the the frequency of chars in a text
> and store the results in a dict, {char: number_of_that_char, ...},
> do not forget to save the key in utf-XXX, it saves memory.

Oh, if you're that concerned about memory usage of individual
characters, try storing them as integers:

>>> sys.getsizeof("a")
26
>>> sys.getsizeof("a".encode("utf-32"))
25
>>> sys.getsizeof("a".encode("utf-8"))
18
>>> sys.getsizeof(ord("a"))
14

I really don't see that UTF-32 is much advantage here. UTF-8 happens
to be, because I used an ASCII character, but the integer beats them
all, even for larger numbers:
>>> sys.getsizeof(ord("\U0001d11e"))
16

And there's even less difference on my Linux box, but of course, you
never compare against Linux because Python 3.2 wide builds don't suit
your numbers.

For longer strings, there's an even more efficient way to store them.
Just store the memory address - that's going to be 4 bytes or 8,
depending on whether it's a 32-bit or 64-bit build of Python. There's
a name for this method of comparing strings: interning. Some languages
do it automatically for all strings, others (like Python) only when
you ask for it. Suddenly it doesn't matter at all what the storage
format is - if the two strings are the same, their addresses are the
same, and conversely. That's how to make it cheap.

> Hint: If you attempt to do the same exercise with
> words in a "latin" text, never forget the length average
> of a word is approximatively 1000 chars.

I think you're confusing length of word with value of picture.

ChrisA

Mark Lawrence

unread,
Nov 9, 2013, 5:13:45 AM11/9/13
to pytho...@python.org
On 09/11/2013 08:14, wxjm...@gmail.com wrote:

I'll ask again, please don't send us double spaced google crap.

Roy Smith

unread,
Nov 9, 2013, 9:37:54 AM11/9/13
to
In article <mailman.2283.1383985...@python.org>,
Chris Angelico <ros...@gmail.com> wrote:

> Some languages [intern] automatically for all strings, others
> (like Python) only when you ask for it.

What does "only when you ask for it" mean?

Chris Angelico

unread,
Nov 9, 2013, 10:02:52 AM11/9/13
to pytho...@python.org
You can explicitly intern a Python string with the sys.intern()
function, which returns either the string itself or an
indistinguishable "interned" string. Two equal strings, when interned,
will return the same object:

>>> foo = "asdf"
>>> bar = "as"
>>> bar += "df"
>>> foo is bar
False

Note that the Python interpreter is free to answer True there, but
there's no mandate for it.

>>> foo = sys.intern(foo)
>>> bar = sys.intern(bar)
>>> foo is bar
True

Now it's mandated. The two strings must be the same object. Interning
in this way makes string equality come down to an 'is' check, which is
potentially a lot faster than actual string equality.

Some languages (eg Pike) do this automatically with all strings - the
construction of any string includes checking to see if it's a
duplicate of any other string. This adds cost to string manipulation
and speeds up string comparisons; since the engine knows that all
strings are interned, it can do the equivalent of an 'is' check for
any string equality.

So what I meant, in terms of storage/representation efficiency, is
that you can store duplicate strings very efficiently if you simply
increment the reference counts of the same few objects. Python won't
necessarily do that for you; check memory usage of something like
this:

strings = [open("some_big_file").read() for _ in range(10000)]

And compare against this:

strings = [sys.intern(open("some_big_file").read()) for _ in range(10000)]

In a language that guarantees string interning, the syntax of the
former would have the memory consumption of the latter. Whether that
memory saving and improved equality comparison is worth the effort of
dictionarification is one of those eternally-debatable points.

ChrisA

Roy Smith

unread,
Nov 9, 2013, 10:21:26 AM11/9/13
to
In article <mailman.2298.1384009...@python.org>,
Chris Angelico <ros...@gmail.com> wrote:

> On Sun, Nov 10, 2013 at 1:37 AM, Roy Smith <r...@panix.com> wrote:
> > In article <mailman.2283.1383985...@python.org>,
> > Chris Angelico <ros...@gmail.com> wrote:
> >
> >> Some languages [intern] automatically for all strings, others
> >> (like Python) only when you ask for it.
> >
> > What does "only when you ask for it" mean?
>
> You can explicitly intern a Python string with the sys.intern()
> function
> [long, and good, explanation of interning]

But, you missed the point of my question. You said that Python does
this "only when you ask for it". That implies it never interns strings
if you don't ask for it, which is clearly not true:

$ python
Python 2.7.1 (r271:86832, Jul 31 2011, 19:30:53)
[...]
>>> x = "foo"
>>> y = "foo"
>>> x is y
True

I think what you're trying to say is that there are several possible
interning policies:

1) Strings are never interned

2) Strings are always interned

3) Strings are optionally interned, at the discretion of the
implementation

4) The user may force a specific string to be interned by explicitly
requesting it.

and that Pike implements #1, while Python implements #3 and #4.

Chris Angelico

unread,
Nov 9, 2013, 10:30:16 AM11/9/13
to pytho...@python.org
On Sun, Nov 10, 2013 at 2:21 AM, Roy Smith <r...@panix.com> wrote:
> But, you missed the point of my question. You said that Python does
> this "only when you ask for it". That implies it never interns strings
> if you don't ask for it, which is clearly not true:
>
> $ python
> Python 2.7.1 (r271:86832, Jul 31 2011, 19:30:53)
> [...]
>>>> x = "foo"
>>>> y = "foo"
>>>> x is y
> True

Ah! Yes, that's true; literals are interned - I forgot that. But
anything from an external source won't be, hence my example with
reading in the contents of a file.

> I think what you're trying to say is that there are several possible
> interning policies:
>
> 1) Strings are never interned
>
> 2) Strings are always interned
>
> 3) Strings are optionally interned, at the discretion of the
> implementation
>
> 4) The user may force a specific string to be interned by explicitly
> requesting it.
>
> and that Pike implements #1, while Python implements #3 and #4.

Pike implements #2, I presume that was a typo. And yes, the interning
of literals falls under #3, while sys.intern() gives #4. Use of #1
would be restricted to languages with mutable strings, I would expect,
for the same reason that Python tuples might be shared but lists won't
be.

ChrisA

Roy Smith

unread,
Nov 9, 2013, 10:35:25 AM11/9/13
to
In article <mailman.2301.1384011...@python.org>,
Chris Angelico <ros...@gmail.com> wrote:

> Pike implements #2, I presume that was a typo.

Duh. Yes.

Steven D'Aprano

unread,
Nov 9, 2013, 10:37:03 AM11/9/13
to
In Python 2:

help(intern)


In Python 3:

import sys
help(sys.intern)


for more info. I think that Chris is wrong about Python "only" interning
strings if you explicitly ask for it. I recall that Python will (may?)
automatically intern strings which look like identifiers (e.g. "spam" but
not "Hello World" or "123abc"). Let's see now:

# using Python 3.1 on Linux

py> s = "spam"
py> t = "spam"
py> s is t
True

but:

py> z = ''.join(["sp", "am"])
py> z is s
False

However:

py> u = "123abc"
py> v = "123abc"
py> u is v
True

Hmmm, obviously the rules are a tad more complicated than I thought... in
any case, you shouldn't rely on automatic interning since it is an
implementation dependent optimization and will probably change without
notice.



--
Steven

Chris Angelico

unread,
Nov 9, 2013, 5:14:28 PM11/9/13
to pytho...@python.org
On Sun, Nov 10, 2013 at 2:37 AM, Steven D'Aprano
<steve+comp....@pearwood.info> wrote:
> I think that Chris is wrong about Python "only" interning
> strings if you explicitly ask for it. I recall that Python will (may?)
> automatically intern strings which look like identifiers (e.g. "spam" but
> not "Hello World" or "123abc").

I'm pretty sure it's simply that literals are interned, or at least
shared across a module (and the interactive interpreter "counts" as a
module). And it might still only be ones which look like identifiers,
because:

>>> foo = "lorem ipsum dolor sit amet"
>>> bar = "lorem ipsum dolor sit amet"
>>> foo is bar
False

My "only" was false because of the sharing/interning of (some)
literals, which I'd forgotten about; however, there's still the
distinction that I was trying to draw, that in Python _some strings_
are interned (a feature you can explicitly request), rather than _all
strings_ being interned. And as is typical of python-list, it's this
extremely minor point that became the new course of the thread - my
main point was not whether all, some, or no strings get interned, but
that string interning makes the storage space of duplicate strings
immaterial :)

ChrisA

Steven D'Aprano

unread,
Nov 10, 2013, 1:39:38 AM11/10/13
to
On Sun, 10 Nov 2013 09:14:28 +1100, Chris Angelico wrote:

> And
> as is typical of python-list, it's this extremely minor point that
> became the new course of the thread -

You say that as if it were a bad thing :-P


> my main point was not whether all,
> some, or no strings get interned, but that string interning makes the
> storage space of duplicate strings immaterial :)

True. It's not just a memory saver[1], but a time saver too. Using Python
3.3:

py> from timeit import Timer
py> t1 = Timer('s == t', setup='s = "a b"*10000; t = "a b"*10000')
py> t2 = Timer('s == t',
... setup='from sys import intern; s = intern("a b"*10000); '
... 't = intern("a b"*10000)')
py> min(t1.repeat(number=100000))
7.651959054172039
py> min(t2.repeat(number=100000))
0.00881262868642807


String equality does a short-cut of checking for identity; if the strings
are interned, they will be identical.



[1] Assuming that you actually do have duplicate strings. If every string
is unique, interning them potentially wastes memory.



--
Steven

Chris Angelico

unread,
Nov 10, 2013, 3:46:27 AM11/10/13
to pytho...@python.org
On Sun, Nov 10, 2013 at 5:39 PM, Steven D'Aprano
<steve+comp....@pearwood.info> wrote:
> On Sun, 10 Nov 2013 09:14:28 +1100, Chris Angelico wrote:
>
>> And
>> as is typical of python-list, it's this extremely minor point that
>> became the new course of the thread -
>
> You say that as if it were a bad thing :-P

More a curiosity than a bad thing.

ChrisA
0 new messages