
Processing a large string


goldtech

Aug 11, 2011, 10:03:36 PM
Hi,

Say I have a very big string with a pattern like:

akakksssk3dhdhdhdbddb3dkdkdkddk3dmdmdmd3dkdkdkdk3asnsn.....

I want to split the string into separate parts on the "3" and process
each part separately. I might run into memory limitations if I use
"split" and get a big array(?) I wondered if there's a way I could
read (stream?) the string from start to finish and read what's
delimited by the "3" into a variable, process the smaller string
variable then append/build a new string with the processed data?

Would I loop it and read it char by char till a "3"...? Or?

Thanks.

MRAB

Aug 11, 2011, 10:15:58 PM
to pytho...@python.org
You could write a generator like this:

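def split(string, sep):
    pos = 0
    try:
        while True:
            # str.index raises ValueError when sep is not found
            next_pos = string.index(sep, pos)
            yield string[pos : next_pos]
            # skip past the separator, whatever its length
            pos = next_pos + len(sep)
    except ValueError:
        # no more separators: yield the remainder
        yield string[pos : ]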

string = "akakksssk3dhdhdhdbddb3dkdkdkddk3dmdmdmd3dkdkdkdk3asnsn..."

for part in split(string, "3"):
    print(part)

Steven D'Aprano

Aug 11, 2011, 10:30:57 PM
goldtech wrote:

> Hi,
>
> Say I have a very big string with a pattern like:
>
> akakksssk3dhdhdhdbddb3dkdkdkddk3dmdmdmd3dkdkdkdk3asnsn.....


Define "big".

What seems big to you is probably not big to your computer.


> I want to split the string into separate parts on the "3" and process
> each part separately. I might run into memory limitations if I use
> "split" and get a big array(?) I wondered if there's a way I could
> read (stream?) the string from start to finish and read what's
> delimited by the "3" into a variable, process the smaller string
> variable then append/build a new string with the processed data?
>
> Would I loop it and read it char by char till a "3"...? Or?

You could, but unless there are a lot of 3s, it will probably be slow. If
the 3s are far apart, it will be better to do this:

# untested
def split(source):
    start = 0
    i = source.find("3")
    while i >= 0:
        yield source[start:i]
        start = i+1
        i = source.find("3", start)
    # the piece after the last "3" (or the whole string if there is none)
    yield source[start:]


That should give you the pieces of the string one at a time, as efficiently
as possible.


--
Steven

Nobody

Aug 12, 2011, 12:11:40 AM

Use the .find() or .index() methods to find the next occurrence of a
character.

Building a large string by concatenation is inefficient, as each append
will copy the original string. If you must have the result as a
single string, using cStringIO would be preferable. But you'd be better
off if you can work with a list of strings.
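
A minimal sketch of that advice, assuming Python 3, where io.StringIO takes the place of cStringIO. The names iter_parts, process, rebuild_with_join and rebuild_with_buffer are hypothetical and used only for illustration; process stands in for whatever per-piece work is actually needed, and the separator is assumed to be non-empty.

```python
import io

def iter_parts(text, sep):
    """Yield the pieces of text between occurrences of sep (sep must be non-empty)."""
    start = 0
    while True:
        i = text.find(sep, start)
        if i < 0:
            # No further separator: the rest of the string is the last piece.
            yield text[start:]
            return
        yield text[start:i]
        start = i + len(sep)

def process(part):
    # Hypothetical placeholder for the real per-piece processing.
    return part.upper()

def rebuild_with_join(text, sep):
    # Collect the processed pieces in a list and join them once at the end.
    return sep.join([process(p) for p in iter_parts(text, sep)])

def rebuild_with_buffer(text, sep):
    # Write the processed pieces to an in-memory buffer instead of
    # growing a string by repeated concatenation.
    out = io.StringIO()
    for n, part in enumerate(iter_parts(text, sep)):
        if n:
            out.write(sep)
        out.write(process(part))
    return out.getvalue()
```

Both variants hold only one unprocessed piece at a time; the join version still keeps every processed piece in a list until the end, which is usually acceptable unless the result itself is too large for memory.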

Peter Otten

Aug 12, 2011, 4:39:38 AM
goldtech wrote:

You can read the file in chunks:

from functools import partial

def read_chunks(instream, chunksize=None):
    if chunksize is None:
        chunksize = 2**20
    # call instream.read(chunksize) repeatedly until it returns ""
    return iter(partial(instream.read, chunksize), "")

def split_file(instream, delimiter, chunksize=None):
    leftover = ""
    chunk = None
    # pass chunksize through so the caller's value is actually used
    for chunk in read_chunks(instream, chunksize):
        chunk = leftover + chunk
        parts = chunk.split(delimiter)
        # the last part may be incomplete; carry it into the next chunk
        leftover = parts.pop()
        for part in parts:
            yield part
    if leftover or chunk is None or chunk.endswith(delimiter):
        yield leftover

I hope I got the corner cases right.
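
One way to check the corner cases is to compare the generator against str.split on small in-memory streams with a tiny chunk size. The sketch below is a restatement under assumptions, not necessarily the exact code from the post: it passes chunksize through to read(), uses io.StringIO in place of a real file, and check is a hypothetical helper.

```python
import io
from functools import partial

def split_file(instream, delimiter, chunksize=2**20):
    # Restated from the post above, with chunksize passed through to read().
    leftover = ""
    chunk = None
    for chunk in iter(partial(instream.read, chunksize), ""):
        chunk = leftover + chunk
        parts = chunk.split(delimiter)
        leftover = parts.pop()
        for part in parts:
            yield part
    if leftover or chunk is None or chunk.endswith(delimiter):
        yield leftover

def check(text, delimiter, chunksize):
    # Hypothetical helper: the streamed result should match str.split.
    streamed = list(split_file(io.StringIO(text), delimiter, chunksize))
    return streamed == text.split(delimiter)

# Empty input, leading/trailing/adjacent delimiters, and no delimiter at all,
# each read with chunk sizes small enough to cut the text at awkward places.
cases = ["", "3", "a3", "3a", "a33b", "abc", "ab3cd3ef3"]
results = {(t, n): check(t, "3", n) for t in cases for n in (1, 2, 5)}
```

If every value in results is True, the streamed pieces agree with str.split for those inputs; that is evidence for the corner cases listed, not a proof for all inputs.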

PS: This has come up before, but I couldn't find the relevant threads...

goldtech

Aug 12, 2011, 9:36:52 AM
Thanks for all this info.

Peter Otten

Aug 12, 2011, 10:48:10 AM
to pytho...@python.org
Peter Otten wrote:

> goldtech wrote:

>> Say I have a very big string with a pattern like:
>>
>> akakksssk3dhdhdhdbddb3dkdkdkddk3dmdmdmd3dkdkdkdk3asnsn.....
>>
>> I want to split the string into separate parts on the "3" and process
>> each part separately. I might run into memory limitations if I use
>> "split" and get a big array(?) I wondered if there's a way I could
>> read (stream?) the string from start to finish and read what's
>> delimited by the "3" into a variable, process the smaller string
>> variable then append/build a new string with the processed data?

> PS: This has come up before, but I couldn't find the relevant threads...

Alex Martelli a looong time ago:

> from __future__ import generators
>
> def splitby(fileobj, splitter, bufsize=8192):
>     buf = ''
>
>     while True:
>         try:
>             item, buf = buf.split(splitter, 1)
>         except ValueError:
>             more = fileobj.read(bufsize)
>             if not more: break
>             buf += more
>         else:
>             yield item + splitter
>
>     if buf:
>         yield buf

http://mail.python.org/pipermail/python-list/2002-September/770673.html


Paul Rudin

Aug 28, 2011, 3:18:11 PM
goldtech <gold...@worldpost.com> writes:

import itertools

s = "akakksssk3dhdhdhdbddb3dkdkdkddk3dmdmdmd3dkdkdkdk3asnsn"
for k, subs in itertools.groupby(s, lambda x: x=="3"):
    print(''.join(subs))


What you actually do in the body of the loop depends on what you want to
do with the bits.
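
Note that groupby also yields the runs of "3" themselves, and a run of consecutive "3"s comes back as a single group, so no empty pieces appear. A minimal sketch that keeps only the non-delimiter runs by checking the key groupby returns (pieces_between is a hypothetical name used for illustration):

```python
import itertools

def pieces_between(s, sep_char):
    """Yield maximal runs of characters that are not sep_char."""
    for is_sep, run in itertools.groupby(s, lambda ch: ch == sep_char):
        if not is_sep:
            yield "".join(run)

parts = list(pieces_between("ab3cd33ef3", "3"))
# parts == ["ab", "cd", "ef"]; unlike str.split, no empty strings appear
```

This walks the string one character at a time, so for a long string with widely spaced delimiters it is likely to be slower than the find-based generators earlier in the thread.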
