Say I have a very big string with a pattern like:
akakksssk3dhdhdhdbddb3dkdkdkddk3dmdmdmd3dkdkdkdk3asnsn.....
I want to split the string into separate parts on the "3" and process
each part separately. I might run into memory limitations if I use
"split" and get a big array(?) I wondered if there's a way I could
read (stream?) the string from start to finish and read what's
delimited by the "3" into a variable, process the smaller string
variable then append/build a new string with the processed data?
Would I loop it and read it char by char till a "3"...? Or?
Thanks.
def split(string, sep):
    pos = 0
    try:
        while True:
            next_pos = string.index(sep, pos)
            yield string[pos:next_pos]
            # Skip the whole separator, not just one character.
            pos = next_pos + len(sep)
    except ValueError:
        # No more separators: yield whatever is left.
        yield string[pos:]

string = "akakksssk3dhdhdhdbddb3dkdkdkddk3dmdmdmd3dkdkdkdk3asnsn..."
for part in split(string, "3"):
    print(part)
> Hi,
>
> Say I have a very big string with a pattern like:
>
> akakksssk3dhdhdhdbddb3dkdkdkddk3dmdmdmd3dkdkdkdk3asnsn.....
Define "big".
What seems big to you is probably not big to your computer.
> I want to split the string into separate parts on the "3" and process
> each part separately. I might run into memory limitations if I use
> "split" and get a big array(?) I wondered if there's a way I could
> read (stream?) the string from start to finish and read what's
> delimited by the "3" into a variable, process the smaller string
> variable then append/build a new string with the processed data?
>
> Would I loop it and read it char by char till a "3"...? Or?
You could, but unless there are a lot of 3s, it will probably be slow. If
the 3s are far apart, it will be better to do this:
# untested
def split(source):
    start = 0
    i = source.find("3")
    while i >= 0:
        yield source[start:i]
        start = i + 1
        i = source.find("3", start)
    # Yield the piece after the last "3" as well.
    yield source[start:]
That should give you the pieces of the string one at a time, as efficiently
as possible.
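
For comparison, here is a rough sketch of both approaches side by side: scanning character by character, and jumping between separators with str.find(). The function names are made up for illustration, and both versions assume a single fixed separator and also yield the text after the last separator, so that they agree with str.split().

```python
def split_by_char(source, sep="3"):
    """Scan one character at a time, yielding each piece as it completes."""
    piece = []
    for ch in source:
        if ch == sep:
            yield "".join(piece)
            piece = []
        else:
            piece.append(ch)
    # Whatever follows the last separator is the final piece.
    yield "".join(piece)


def split_by_find(source, sep="3"):
    """Jump from separator to separator using str.find()."""
    start = 0
    i = source.find(sep)
    while i >= 0:
        yield source[start:i]
        start = i + len(sep)
        i = source.find(sep, start)
    yield source[start:]


if __name__ == "__main__":
    sample = "akakksssk3dhdhdhdbddb3dkdkdkddk3dmdmdmd3dkdkdkdk3asnsn"
    for part in split_by_find(sample):
        print(part)
```

Both are generators, so only one piece exists at a time. The find() version does its searching inside the interpreter's C code, which is why it tends to win when the separators are far apart.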
--
Steven
Use the .find() or .index() methods to find the next occurrence of a
character.
Building a large string by repeated concatenation is inefficient, as each
append may copy the whole string built so far. If you must have the result
as a single string, using cStringIO (io.StringIO in Python 3) would be
preferable. But you'd be better off if you can work with a list of strings.
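
A minimal sketch of the two ways of building the result mentioned above, assuming Python 3 (so io.StringIO rather than cStringIO). The names iter_parts, process, rebuild_with_join and rebuild_with_stringio are invented for this example, and process() is only a placeholder for whatever you really do with each piece.

```python
import io


def iter_parts(source, sep):
    """Yield the pieces of source between occurrences of sep, one at a time."""
    start = 0
    while True:
        i = source.find(sep, start)
        if i < 0:
            yield source[start:]
            return
        yield source[start:i]
        start = i + len(sep)


def process(part):
    # Placeholder processing step (an assumption): upper-case the piece.
    return part.upper()


def rebuild_with_join(source, sep):
    # str.join() assembles the result in one pass over the processed pieces.
    return sep.join(process(p) for p in iter_parts(source, sep))


def rebuild_with_stringio(source, sep):
    # Write each processed piece into an in-memory text buffer.
    out = io.StringIO()
    for n, part in enumerate(iter_parts(source, sep)):
        if n:
            out.write(sep)
        out.write(process(part))
    return out.getvalue()
```

Either way the pieces are processed one by one, and nothing is appended to an ever-growing string with +=.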
You can read the file in chunks:
from functools import partial

def read_chunks(instream, chunksize=None):
    if chunksize is None:
        chunksize = 2**20
    # iter() calls instream.read(chunksize) until it returns "" (end of file).
    return iter(partial(instream.read, chunksize), "")

def split_file(instream, delimiter, chunksize=None):
    leftover = ""
    chunk = None
    for chunk in read_chunks(instream, chunksize):
        chunk = leftover + chunk
        parts = chunk.split(delimiter)
        # The last part may be incomplete; carry it into the next chunk.
        leftover = parts.pop()
        for part in parts:
            yield part
    if leftover or chunk is None or chunk.endswith(delimiter):
        yield leftover
I hope I got the corner cases right.
PS: This has come up before, but I couldn't find the relevant threads...
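
One way to gain some confidence in the corner cases is to run a chunked splitter against str.split() with deliberately tiny chunk sizes. The sketch below is a self-contained variant written for that purpose, not the code above verbatim: split_stream and check are made-up names, io.StringIO stands in for a real file, and it assumes text mode under Python 3.

```python
import io
from functools import partial


def split_stream(instream, delimiter, chunksize=2**20):
    """Yield delimiter-separated pieces while reading instream in chunks."""
    leftover = ""
    # iter() keeps calling instream.read(chunksize) until it returns "".
    for chunk in iter(partial(instream.read, chunksize), ""):
        parts = (leftover + chunk).split(delimiter)
        # The last part may continue into the next chunk, so hold it back.
        leftover = parts.pop()
        yield from parts
    # The final piece (possibly empty) mirrors what str.split() returns.
    yield leftover


def check(text, delimiter, chunksize):
    """Return True if chunked splitting agrees with str.split()."""
    got = list(split_stream(io.StringIO(text), delimiter, chunksize))
    return got == text.split(delimiter)


if __name__ == "__main__":
    cases = ["", "3", "a3", "3a", "a33b", "akakksssk3dhdhdhdbddb3dkdk"]
    print(all(check(t, "3", n) for t in cases for n in (1, 2, 5, 100)))
```

Because the unfinished tail is always glued onto the next chunk before splitting, a delimiter longer than one character that straddles a chunk boundary is still found.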
> goldtech wrote:
>> Say I have a very big string with a pattern like:
>>
>> akakksssk3dhdhdhdbddb3dkdkdkddk3dmdmdmd3dkdkdkdk3asnsn.....
>>
>> I want to split the string into separate parts on the "3" and process
>> each part separately. I might run into memory limitations if I use
>> "split" and get a big array(?) I wondered if there's a way I could
>> read (stream?) the string from start to finish and read what's
>> delimited by the "3" into a variable, process the smaller string
>> variable then append/build a new string with the processed data?
> PS: This has come up before, but I couldn't find the relevant threads...
Alex Martelli a looong time ago:
> from __future__ import generators
>
> def splitby(fileobj, splitter, bufsize=8192):
>     buf = ''
>
>     while True:
>         try:
>             item, buf = buf.split(splitter, 1)
>         except ValueError:
>             more = fileobj.read(bufsize)
>             if not more: break
>             buf += more
>         else:
>             yield item + splitter
>
>     if buf:
>         yield buf
http://mail.python.org/pipermail/python-list/2002-September/770673.html
import itertools

s = "akakksssk3dhdhdhdbddb3dkdkdkddk3dmdmdmd3dkdkdkdk3asnsn"
for k, subs in itertools.groupby(s, lambda x: x == "3"):
    if not k:  # k is True for runs of "3"; skip those and keep the pieces
        print(''.join(subs))

What you actually do in the body of the loop depends on what you want to
do with the bits.