
Scanning a file


pinkflo...@gmail.com

unread,
Oct 28, 2005, 7:06:07 AM10/28/05
to
I want to scan a file byte for byte for occurrences of the four-byte
pattern 0x00000100. I've tried with this:

# start
import sys

numChars = 0
startCode = 0
count = 0

inputFile = sys.stdin

while True:
    ch = inputFile.read(1)
    numChars += 1

    if len(ch) < 1: break

    startCode = ((startCode << 8) & 0xffffffff) | (ord(ch))
    if numChars < 4: continue

    if startCode == 0x00000100:
        count = count + 1

print count
# end

But it is very slow. What is the fastest way to do this? Using some
native call? Using a buffer? Using whatever?

/David

Gerhard Häring

unread,
Oct 28, 2005, 7:21:07 AM10/28/05
to pytho...@python.org
pinkflo...@gmail.com wrote:
> I want to scan a file byte for byte [...]

> while True:
> ch = inputFile.read(1)
> [...] But it is very slow. What is the fastest way to do this? Using some

> native call? Using a buffer? Using whatever?

Read in blocks, not byte for byte. I had good experiences with block
sizes like 4096 or 8192.
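
For example, a minimal sketch of such a block-reading loop (it uses
str.count() per block and, for simplicity, ignores matches that straddle
two blocks; that corner case is discussed further down the thread):

import sys

pattern = '\x00\x00\x01\x00'
blocksize = 8192

count = 0
while True:
    block = sys.stdin.read(blocksize)
    if not block:
        break
    # count matches that fall entirely inside this block
    count += block.count(pattern)

print count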

-- Gerhard

pinkflo...@gmail.com

unread,
Oct 28, 2005, 7:25:22 AM10/28/05
to
Okay, how do I do this?

Also, if you look at the code, I build a 32-bit unsigned integer from
the bytes I read. And the 32-bit pattern I am looking for can start on
_any_ byte boundary in the file. It would be nice if I could somehow
just scan for that pattern explicitly, without having to build a 32-bit
integer first: if I could tell Python "scan this file for the bytes 0,
0, 1, 0 in succession. How many 0, 0, 1, 0 did you find?"

/David

Paul Rubin

unread,
Oct 28, 2005, 7:29:47 AM10/28/05
to
pinkflo...@gmail.com writes:
> I want to scan a file byte for byte for occurences of the the four byte
> pattern 0x00000100. I've tried with this:

Use re.search or string.find. The simplest way is just to read the whole
file into memory first. If the file is too big, you have to read it in
chunks and include some hair to notice if the four-byte pattern straddles
two adjoining chunks.

pinkflo...@gmail.com

unread,
Oct 28, 2005, 9:22:11 AM10/28/05
to
I'm now down to:

f = open("filename", "rb")
s = f.read()
sub = "\x00\x00\x01\x00"
count = s.count(sub)
print count

Which is quite fast. The only problem is that the file might be huge.
I really have no need for reading the entire file into a string as I am
doing here. All I want is to count occurrences of this substring. Can I
somehow count occurrences in a file without reading it into a string
first?

/David

Björn Lindström

unread,
Oct 28, 2005, 9:29:46 AM10/28/05
to
"pinkflo...@gmail.com" <pinkflo...@gmail.com> writes:

> f = open("filename", "rb")
> s = f.read()
> sub = "\x00\x00\x01\x00"
> count = s.count(sub)
> print count

That's a lot of lines. This is a bit off topic, but I just can't stand
unnecessary local variables.

print file("filename", "rb").read().count("\x00\x00\x01\x00")

--
Björn Lindström <bk...@stp.lingfil.uu.se>
Student of computational linguistics, Uppsala University, Sweden

Jorge Godoy

unread,
Oct 28, 2005, 9:35:08 AM10/28/05
to
"pinkflo...@gmail.com" <pinkflo...@gmail.com> writes:

> Which is quite fast. The only problems is that the file might be huge.
> I really have no need for reading the entire file into a string as I am
> doing here. All I want is to count occurences this substring. Can I
> somehow count occurences in a file without reading it into a string
> first?

How about iterating through the file? You can read it line by line, two lines
at a time. Pseudocode follows:

line1 = read_line
while line2 = read_line:
    line_to_check = ''.join([line1, line2])
    check_for_desired_string
    line1 = line2

With that you always have two lines in the buffer and you can check all of
them for your desired string, no matter what the size of the file is.


Be seeing you,
--
Jorge Godoy <go...@ieee.org>

Bernhard Herzog

unread,
Oct 28, 2005, 9:40:02 AM10/28/05
to
Jorge Godoy <go...@ieee.org> writes:

> How about iterating through the file? You can read it line by line, two lines
> at a time. Pseudocode follows:
>
> line1 = read_line
> while line2 = read_line:
> line_to_check = ''.join([line1, line2])
> check_for_desired_string
> line1 = line2
>
> With that you always have two lines in the buffer and you can check all of
> them for your desired string, no matter what the size of the file is.

This will fail if the string to search for is e.g. "\n\n\n\n" and it
actually occurs in the file.

Bernhard

--
Intevation GmbH http://intevation.de/
Skencil http://skencil.org/
Thuban http://thuban.intevation.org/

pinkflo...@gmail.com

unread,
Oct 28, 2005, 9:51:36 AM10/28/05
to
First of all, this isn't a text file, it is a binary file. Secondly,
substrings can overlap. In the sequence 0010010 the substring 0010
occurs twice.

/David

Andrew McCarthy

unread,
Oct 28, 2005, 9:59:40 AM10/28/05
to

Yes - use memory mapping (the mmap module). An mmap object is like a
cross between a file and a string, but the data is only read into RAM
when, and for as long as, necessary. An mmap object doesn't have a
count() method, but you can just use find() in a while loop instead.
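
A rough sketch of that find() loop over an mmap object (assuming a
regular file; 'filename' is a placeholder, and the code is untested here):

import mmap, os

f = open('filename', 'rb')
m = mmap.mmap(f.fileno(), os.path.getsize('filename'),
              access=mmap.ACCESS_READ)

count = 0
pos = m.find('\x00\x00\x01\x00')
while pos != -1:
    count += 1
    # restart one byte further on so overlapping matches are counted too
    pos = m.find('\x00\x00\x01\x00', pos + 1)

m.close()
f.close()
print count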

Andrew

Jeremy Sanders

unread,
Oct 28, 2005, 10:37:46 AM10/28/05
to
Gerhard Häring wrote:

It's difficult to handle overlaps. The four-byte sequence may occur at the
end of one block and the beginning of the next. You'd need to check for
these special cases.

Jeremy

--
Jeremy Sanders
http://www.jeremysanders.net/

Paul Watson

unread,
Oct 28, 2005, 11:59:24 AM10/28/05
to
<pinkflo...@gmail.com> wrote in message
news:1130497567.7...@g44g2000cwa.googlegroups.com...

How about something like:

#!/usr/bin/env python

import sys

fn = 't.dat'
ss = '\x00\x00\x01\x00'

be = len(ss) - 1 # length of overlap to check
blocksize = 4 * 1024 # need to ensure that blocksize > overlap

fp = open(fn, 'rb')
b = fp.read(blocksize)
found = 0
while len(b) > be:
    if b.find(ss) != -1:
        found = 1
        break
    b = b[-be:] + fp.read(blocksize)
fp.close()
sys.exit(found)


James Stroud

unread,
Oct 28, 2005, 1:53:06 PM10/28/05
to pytho...@python.org
On Friday 28 October 2005 06:29, Björn Lindström wrote:
> "pinkflo...@gmail.com" <pinkflo...@gmail.com> writes:
> > f = open("filename", "rb")
> > s = f.read()
> > sub = "\x00\x00\x01\x00"
> > count = s.count(sub)
> > print count
>
> That's a lot of lines. This is a bit off topic, but I just can't stand
> unnecessary local variables.
>
> print file("filename", "rb").read().count("\x00\x00\x01\x00")


The "f" is not terribly unnecessary, because the part of the code you didn't
see was

f.close()

Which would be considered good practice.

James

--
James Stroud
UCLA-DOE Institute for Genomics and Proteomics
Box 951570
Los Angeles, CA 90095

http://www.jamesstroud.com/

Mike Meyer

unread,
Oct 28, 2005, 2:39:04 PM10/28/05
to
Andrew McCarthy <a_mcc...@hotmail.com> writes:

Except if you can't read the file into memory because it's too large,
there's a pretty good chance you won't be able to mmap it either. To
deal with huge files, the only option is to read the file in
chunks, count the occurrences in each chunk, and then do some fiddling
to deal with the pattern landing on a boundary.

<mike
--
Mike Meyer <m...@mired.org> http://www.mired.org/home/mwm/
Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.

Kent Johnson

unread,
Oct 28, 2005, 2:53:22 PM10/28/05
to
pinkflo...@gmail.com wrote:
> I want to scan a file byte for byte for occurences of the the four byte
> pattern 0x00000100.

data = sys.stdin.read()
print data.count('\x00\x00\x01\x00')

Kent

Paul Watson

unread,
Oct 28, 2005, 7:09:18 PM10/28/05
to
<pinkflo...@gmail.com> wrote in message
news:1130497567.7...@g44g2000cwa.googlegroups.com...

Here is an attempt at counting and using the mmap facility. There appear to
be some serious backward compatibility issues. I tried Python 2.1 on
Windows and AIX and had some odd results. If you are 2.4.1 or higher that
should not be a problem.

#!/usr/bin/env python
import sys
import os
import mmap

fn = 't.dat'
ss = '\x00\x00\x01\x00'

fp = open(fn, 'rb')

b = mmap.mmap(fp.fileno(), os.stat(fp.name).st_size,
              access=mmap.ACCESS_READ)

count = 0
foundpoint = b.find(ss, 0)
while foundpoint != -1 and (foundpoint + 1) < b.size():
    count = count + 1
    foundpoint = b.find(ss, foundpoint + 1)
b.close()

print count

fp.close()
sys.exit(0)


Steven D'Aprano

unread,
Oct 28, 2005, 7:41:14 PM10/28/05
to
On Fri, 28 Oct 2005 06:22:11 -0700, pinkflo...@gmail.com wrote:

> Which is quite fast. The only problems is that the file might be huge.

What *you* call huge and what *Python* calls huge may be very different
indeed. What are you calling huge?

> I really have no need for reading the entire file into a string as I am
> doing here. All I want is to count occurences this substring. Can I
> somehow count occurences in a file without reading it into a string
> first?

Magic?

You have to read the file into memory at some stage, otherwise how can you
see what value the bytes are? The only question is, can you read it all
into one big string (in which case, your solution is unlikely to be
beaten), or do you have to read the file in chunks and deal with the
boundary cases (which is harder)?

Here is another thought. What are you going to do with the count when you
are done? That sounds to me like a pretty pointless result: "Hi user, the
file XYZ has 27 occurrences of bitpattern \x00\x00\x01\x00. Would you like
to do another file?"

If you are planning to use this count to do something, perhaps there is a
more efficient way to combine the two steps into one -- especially
valuable if your files really are huge.


--
Steven.

Steven D'Aprano

unread,
Oct 28, 2005, 7:41:54 PM10/28/05
to
On Fri, 28 Oct 2005 15:29:46 +0200, Björn Lindström wrote:

> "pinkflo...@gmail.com" <pinkflo...@gmail.com> writes:
>
>> f = open("filename", "rb")
>> s = f.read()
>> sub = "\x00\x00\x01\x00"
>> count = s.count(sub)
>> print count
>
> That's a lot of lines. This is a bit off topic, but I just can't stand
> unnecessary local variables.
>
> print file("filename", "rb").read().count("\x00\x00\x01\x00")

Funny you should say that, because I can't stand unnecessary one-liners.

In any case, you are assuming that Python will automagically close the
file when you are done. That's good enough for a script, but not best
practice.

f = open("filename", "rb")

print f.read().count("\x00\x00\x01\x00")
f.close()

is safer, has no unnecessary local variables, and is not unnecessarily
terse. Unfortunately, it doesn't solve the original poster's problem,
because his file is too big to read into memory all at once -- or so he
tells us.


--
Steven.

Alex Martelli

unread,
Oct 28, 2005, 11:03:17 PM10/28/05
to
Mike Meyer <m...@mired.org> wrote:
...

> Except if you can't read the file into memory because it's to large,
> there's a pretty good chance you won't be able to mmap it either. To
> deal with huge files, the only option is to read the file in in
> chunks, count the occurences in each chunk, and then do some fiddling
> to deal with the pattern landing on a boundary.

That's the kind of thing generators are for...:

def byblocks(f, blocksize, overlap):
    block = f.read(blocksize)
    yield block
    while block:
        block = block[-overlap:] + f.read(blocksize-overlap)
        if block: yield block

Now, to look for a substring of length N in an open binary file f:

f = open(whatever, 'b')
count = 0
for block in byblocks(f, 1024*1024, len(subst)-1):
    count += block.count(subst)
f.close()

not much "fiddling" needed, as you can see, and what little "fiddling"
is needed is entirely encompassed by the generator...


Alex

Paul Watson

unread,
Oct 28, 2005, 11:27:41 PM10/28/05
to
"Paul Watson" <pwa...@redlinepy.com> wrote in message
news:3sf070F...@individual.net...

> <pinkflo...@gmail.com> wrote in message
> news:1130497567.7...@g44g2000cwa.googlegroups.com...
>>I want to scan a file byte for byte for occurences of the the four byte
>> pattern 0x00000100. I've tried with this:
>>
>> # start
>> import sys
>>
>> numChars = 0
>> startCode = 0
>> count = 0
>>
>> inputFile = sys.stdin
>>
>> while True:
>> ch = inputFile.read(1)
>> numChars += 1
>>
>> if len(ch) < 1: break
>>
>> startCode = ((startCode << 8) & 0xffffffff) | (ord(ch))
>> if numChars < 4: continue
>>
>> if startCode == 0x00000100:
>> count = count + 1
>>
>> print count
>> # end
>>
>> But it is very slow. What is the fastest way to do this? Using some
>> native call? Using a buffer? Using whatever?
>>
>> /David

Here is a better one that counts, and not just detects, the substring. This
is -much- faster than using mmap; especially for a large file that may cause
paging to start. Using mmap can be -very- slow.

#!/usr/bin/env python
import sys

fn = 't2.dat'


ss = '\x00\x00\x01\x00'

be = len(ss) - 1 # length of overlap to check

blocksize = 64 * 1024 # need to ensure that blocksize > overlap

fp = open(fn, 'rb')
b = fp.read(blocksize)

count = 0
while len(b) > be:
    count += b.count(ss)
    b = b[-be:] + fp.read(blocksize)
fp.close()

print count
sys.exit(0)


Mike Meyer

unread,
Oct 29, 2005, 1:38:21 AM10/29/05
to
"Paul Watson" <pwa...@redlinepy.com> writes:
> Here is a better one that counts, and not just detects, the substring. This
> is -much- faster than using mmap; especially for a large file that may cause
> paging to start. Using mmap can be -very- slow.
>
> #!/usr/bin/env python
> import sys
>
> fn = 't2.dat'
> ss = '\x00\x00\x01\x00'
>
> be = len(ss) - 1 # length of overlap to check
> blocksize = 64 * 1024 # need to ensure that blocksize > overlap
>
> fp = open(fn, 'rb')
> b = fp.read(blocksize)
> count = 0
> while len(b) > be:
> count += b.count(ss)
> b = b[-be:] + fp.read(blocksize)
> fp.close()
>
> print count
> sys.exit(0)
>
>

Did you do timings on it vs. mmap? Having to copy the data multiple
times to deal with the overlap - thanks to strings being immutable -
would seem to be a lose, and makes me wonder how it could be faster
than mmap in general.

Thanks,

Fredrik Lundh

unread,
Oct 29, 2005, 2:48:18 AM10/29/05
to pytho...@python.org
Mike Meyer wrote:

> Did you do timings on it vs. mmap? Having to copy the data multiple
> times to deal with the overlap - thanks to strings being immutable -
> would seem to be a lose, and makes me wonder how it could be faster
> than mmap in general.

if you use "mmap" to read large files sequentially, without calling "madvise",
the system may try to keep old pages around just in case, and won't do as
much read-ahead as it can do. if you're low on memory, that means that
the system may waste some time swapping out data for other applications,
rather than throwing away data that you know that you will never look at
again.

if you have reasonably large files and you're not running on an overcrowded
machine, this is usually not a problem.

(does Python's mmap module support madvise, btw? doesn't look like it does...)

</F>

Bengt Richter

unread,
Oct 29, 2005, 3:27:34 AM10/29/05
to

ISTM you better let others know exactly what you mean by this, before
you use the various solutions suggested or your own ;-)

a) Are you talking about bit strings or byte strings?
b) Do you want to _count_ overlapping substrings??!!
Note result of s.count on your example:

>>> s = '0010010'
>>> s.count('0010')
1

vs. brute force counting overlapped substrings (not tested beyond what you see ;-)

>>> def ovcount(s, sub):
...     start = count = 0
...     while True:
...         start = s.find(sub, start) + 1
...         if start==0: break
...         count += 1
...     return count
...
>>> ovcount(s, '0010')
2

Regards,
Bengt Richter

Bengt Richter

unread,
Oct 29, 2005, 3:48:39 AM10/29/05
to

Do I get a job at google if I find something wrong with the above? ;-)

Regards,
Bengt Richter

Peter Otten

unread,
Oct 29, 2005, 4:34:24 AM10/29/05
to
Bengt Richter wrote:

Try it with a subst of length 1. Seems like you missed an opportunity :-)

Peter

Bengt Richter

unread,
Oct 29, 2005, 2:12:40 PM10/29/05
to

I was thinking this was an example a la Alex's previous discussion
of interviewee code challenges ;-)

What struck me was

>>> gen = byblocks(StringIO.StringIO('no'),1024,len('end?')-1)
>>> [gen.next() for i in xrange(10)]
['no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no']

Regards,
Bengt Richter

Alex Martelli

unread,
Oct 29, 2005, 2:23:25 PM10/29/05
to
Bengt Richter <bo...@oz.net> wrote:
...

> >>> while block:
> >>> block = block[-overlap:] + f.read(blocksize-overlap)
> >>> if block: yield block
...

> I was thinking this was an example a la Alex's previous discussion
> of interviewee code challenges ;-)
>
> What struck me was
>
> >>> gen = byblocks(StringIO.StringIO('no'),1024,len('end?')-1)
> >>> [gen.next() for i in xrange(10)]
> ['no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no']

Heh, OK, I should get back into the habit of adding a "warning: untested
code" when I post code (particularly when it's late and I'm
jetlagged;-). The code I posted will never exit, since block always
keeps the last overlap bytes; it needs to be changed into something like
(warning -- untested code!-)

if overlap>0:
    while True:
        next = f.read(blocksize-overlap)
        if not next: break
        block = block[-overlap:] + next
        yield block
else:
    while True:
        next = f.read(blocksize)
        if not next: break
        yield next

(the if/else is needed to handle requests for overlaps <= 0, if desired;
I think it's clearer to split the cases rather than to test inside the
loop's body).


Alex

Peter Otten

unread,
Oct 29, 2005, 2:36:12 PM10/29/05
to
Bengt Richter wrote:

> What struck me was
>
> >>> gen = byblocks(StringIO.StringIO('no'),1024,len('end?')-1)
> >>> [gen.next() for i in xrange(10)]
> ['no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no']

Ouch. Seems like I spotted the subtle cornercase error and missed the big
one.

Peter

Paul Watson

unread,
Oct 29, 2005, 3:15:46 PM10/29/05
to
"Mike Meyer" <m...@mired.org> wrote in message
news:864q70e...@bhuda.mired.org...
> "Paul Watson" <pwa...@redlinepy.com> writes:
...

> Did you do timings on it vs. mmap? Having to copy the data multiple
> times to deal with the overlap - thanks to strings being immutable -
> would seem to be a lose, and makes me wonder how it could be faster
> than mmap in general.

The only thing copied is a string one byte less than the search string for
each block.

I did not do due diligence with respect to timings. Here is a small
dataset read sequentially and using mmap.

$ ls -lgG t.dat
-rw-r--r-- 1 16777216 Oct 28 16:32 t.dat
$ time ./scanfile.py
1048576
0.80s real 0.64s user 0.15s system
$ time ./scanfilemmap.py
1048576
20.33s real 6.09s user 14.24s system

With a larger file, the system time skyrockets. I assume that to be the
paging mechanism in the OS. This is Cygwin on Windows XP.

$ ls -lgG t2.dat
-rw-r--r-- 1 268435456 Oct 28 16:33 t2.dat
$ time ./scanfile.py
16777216
28.85s real 16.37s user 0.93s system
$ time ./scanfilemmap.py
16777216
323.45s real 94.45s user 227.74s system


netva...@gmail.com

unread,
Oct 29, 2005, 3:35:32 PM10/29/05
to
I think implementing a finite state automaton would be a good (best?)
solution. I have drawn a FSM for you (try viewing the following in
fixed width font). Just increment the count when you reach state 5.

        0        0        1        0
 -->[1]---->[2]---->[3]---->[4]---->[5]

 state 1 (start):     on 0 go to 2, on 1 stay in 1
 state 2 (seen 0):    on 0 go to 3, on 1 go back to 1
 state 3 (seen 00):   on 0 stay in 3, on 1 go to 4
 state 4 (seen 001):  on 0 go to 5 (count a match), on 1 go back to 1
 state 5 (seen 0010): on 0 go back to 3, on 1 go back to 1

If you don't understand FSM's, try getting a book on computational
theory (the book by Hopcroft & Ullman is great.)

Here you don't have special cases whether you read in blocks or read the
whole file at once (as you only need one byte at a time).
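
A minimal sketch of that automaton in Python (untested here; byte values
other than 0x00 and 0x01 all fall back towards state 1, and overlapping
matches are counted because state 5 feeds back into state 3):

import sys

def count_pattern(f, blocksize=64 * 1024):
    # next_state[state] = (on 0x00, on 0x01, on any other byte)
    next_state = {
        1: (2, 1, 1),
        2: (3, 1, 1),
        3: (3, 4, 1),
        4: (5, 1, 1),
        5: (3, 1, 1),
    }
    state = 1
    count = 0
    while True:
        block = f.read(blocksize)
        if not block:
            break
        for ch in block:
            byte = ord(ch)
            if byte == 0:
                state = next_state[state][0]
            elif byte == 1:
                state = next_state[state][1]
            else:
                state = next_state[state][2]
            if state == 5:
                count += 1
    return count

print count_pattern(sys.stdin)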

Vaibhav

Steve Holden

unread,
Oct 29, 2005, 4:10:11 PM10/29/05
to pytho...@python.org

No, you just realised subconsciously that we'd all spot the obvious one
and decided to point out the bug that would remain after the obvious one
had been fixed.

regards
Steve
--
Steve Holden +44 150 684 7255 +1 800 494 3119
Holden Web LLC www.holdenweb.com
PyCon TX 2006 www.python.org/pycon/

Tim Roberts

unread,
Oct 29, 2005, 5:08:09 PM10/29/05
to
Steven D'Aprano <st...@REMOVETHIScyber.com.au> wrote:
>
>On Fri, 28 Oct 2005 15:29:46 +0200, Björn Lindström wrote:
>
>> "pinkflo...@gmail.com" <pinkflo...@gmail.com> writes:
>>
>>> f = open("filename", "rb")
>>> s = f.read()
>>> sub = "\x00\x00\x01\x00"
>>> count = s.count(sub)
>>> print count
>>
>> That's a lot of lines. This is a bit off topic, but I just can't stand
>> unnecessary local variables.
>>
>> print file("filename", "rb").read().count("\x00\x00\x01\x00")
>
>Funny you should say that, because I can't stand unnecessary one-liners.
>
>In any case, you are assuming that Python will automagically close the
>file when you are done.

Nonsense. This behavior is deterministic. At the end of that line, the
anonymous file object goes out of scope, the object is deleted, and the file is
closed.
--
- Tim Roberts, ti...@probo.com
Providenza & Boekelheide, Inc.

Scott David Daniels

unread,
Oct 29, 2005, 5:13:15 PM10/29/05
to
Paul Watson wrote:

> Here is a better one that counts, and not just detects, the substring. This
> is -much- faster than using mmap; especially for a large file that may cause
> paging to start. Using mmap can be -very- slow.
>

> <ss = pattern, be = len(ss) - 1>
> ...


> b = fp.read(blocksize)
> count = 0
> while len(b) > be:
> count += b.count(ss)
> b = b[-be:] + fp.read(blocksize)

> ...
In cases where that one wins and blocksize is big,
this should do even better:
...
block = fp.read(blocksize)
count = 0
while len(block) > be:
    count += block.count(ss)
    lead = block[-be :]
    block = fp.read(blocksize)
    count += (lead + block[: be]).count(ss)
...
--
-Scott David Daniels
scott....@acm.org

Alex Martelli

unread,
Oct 29, 2005, 5:14:59 PM10/29/05
to
Tim Roberts <ti...@probo.com> wrote:
...

> >> print file("filename", "rb").read().count("\x00\x00\x01\x00")
> >
> >Funny you should say that, because I can't stand unnecessary one-liners.
> >
> >In any case, you are assuming that Python will automagically close the
> >file when you are done.
>
> Nonsense. This behavior is deterministic. At the end of that line, the
> anonymous file object out of scope, the object is deleted, and the file is
> closed.

In today's implementations of Classic Python, yes. In other equally
valid implementations of the language, such as Jython, IronPython, or,
for all we know, some future implementation of Classic, that may well
not be the case. Many, quite reasonably, dislike relying on a specific
implementation's peculiarities, and prefer to write code that relies
only on what the _language_ specs guarantee.


Alex

Steve Holden

unread,
Oct 29, 2005, 5:43:18 PM10/29/05
to pytho...@python.org
Indeed, but reading one byte at a time is about the slowest way to
process a file, in Python or any other language, because it fails to
amortize the overhead cost of function calls over many characters.

Buffering wasn't invented because early programmers had nothing better
to occupy their minds, remember :-)

Paul Watson

unread,
Oct 29, 2005, 6:15:56 PM10/29/05
to
"Alex Martelli" <ale...@yahoo.com> wrote in message
news:1h5760l.1e2eatkurdeo7N%ale...@yahoo.com...

> In today's implementations of Classic Python, yes. In other equally
> valid implementations of the language, such as Jython, IronPython, or,
> for all we know, some future implementation of Classic, that may well
> not be the case. Many, quite reasonably, dislike relying on a specific
> implementation's peculiarities, and prefer to write code that relies
> only on what the _language_ specs guarantee.

How could I identify when Python code does not close files and depends on
the runtime to take care of this? I want to know that the code will work
well under other Python implementations and future implementations which
may not provide this.


Paul Rubin

unread,
Oct 29, 2005, 6:18:09 PM10/29/05
to
"Paul Watson" <pwa...@redlinepy.com> writes:
> How could I identify when Python code does not close files and depends on
> the runtime to take care of this? I want to know that the code will work
> well under other Python implementations and future implementations which may
> not have this provided.

There is nothing in the Python language reference that guarantees the
files will be closed when the references go out of scope. That
CPython does it is simply an implementation artifact. If you want to
make sure they get closed, you have to close them explicitly. There
are some Python language extensions in the works to make this more
convenient (PEP 343) but for now you have to do it by hand.
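
Doing it by hand boils down to a try/finally around the work, e.g. (a small
sketch using the counting one-liner from earlier in the thread; "filename"
is a placeholder):

f = open("filename", "rb")
try:
    print f.read().count("\x00\x00\x01\x00")
finally:
    f.close()    # runs even if the read or the count raises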

David Rasmussen

unread,
Oct 29, 2005, 6:35:25 PM10/29/05
to
netva...@gmail.com wrote:
> I think implementing a finite state automaton would be a good (best?)
> solution. I have drawn a FSM for you (try viewing the following in
> fixed width font). Just increment the count when you reach state 5.
>
> <---------------|
> | |
> 0 0 | 1 0 |0
> -->[1]--->[2]--->[3]--->[4]--->[5]-|
> ^ | | ^ | | |
> 1| |<---| | | |1 |1
> |_| 1 |_| | |
> ^ 0 | |
> |---------------------|<-----|
>
> If you don't understand FSM's, try getting a book on computational
> theory (the book by Hopcroft & Ullman is great.)
>

I already have that book. The above solution is very slow in practice. None
of the solutions presented in this thread is nearly as fast as the

print file("filename", "rb").read().count("\x00\x00\x01\x00")

/David

Steven D'Aprano

unread,
Oct 29, 2005, 6:49:02 PM10/29/05
to
On Sat, 29 Oct 2005 21:08:09 +0000, Tim Roberts wrote:

>>In any case, you are assuming that Python will automagically close the
>>file when you are done.
>
> Nonsense. This behavior is deterministic. At the end of that line, the
> anonymous file object out of scope, the object is deleted, and the file is
> closed.

That is an implementation detail. CPython may do that, but JPython does
not -- or at least it did not last time I looked. JPython doesn't
guarantee that the file will be closed at any particular time, just that
it will be closed eventually.

If all goes well. What if you have a circular dependence and the file
reference never gets garbage-collected? What happens if the JPython
process gets killed before the file is closed? You might not care about
one file not being released, but what if it is hundreds of files?

In general, it is best practice to release external resources as soon as
you're done with them, and not rely on a garbage collector which may or
may not release them in a timely manner.

There are circumstances where things do not go well and the file never
gets closed cleanly -- for example, when your disk is full, and the
buffer is only written to disk when you close the file. Would you
prefer that error to raise an exception, or to pass silently? If you want
close exceptions to pass silently, then by all means rely on the garbage
collector to close the file.

You might not care about these details in a short script -- when I'm
writing a use-once-and-throw-away script, that's what I do. But it isn't
best practice: explicit is better than implicit.

I should also point out that for really serious work, the idiom:

f = file("parrot")
handle(f)
f.close()

is insufficiently robust for production level code. That was a detail I
didn't think I needed to drop on the original newbie poster, but depending
on how paranoid you are, or how many exceptions you want to insulate the
user from, something like this might be needed:

try:
    f = file("parrot")
    try:
        handle(f)
    finally:
        try:
            f.close()
        except:
            print "The file could not be closed; see your sys admin."
except:
    print "The file could not be opened."


--
Steven.

Mike Meyer

unread,
Oct 29, 2005, 6:36:47 PM10/29/05
to
"Paul Watson" <pwa...@redlinepy.com> writes:

> "Mike Meyer" <m...@mired.org> wrote in message
> news:864q70e...@bhuda.mired.org...
>> "Paul Watson" <pwa...@redlinepy.com> writes:
> ...
>> Did you do timings on it vs. mmap? Having to copy the data multiple
>> times to deal with the overlap - thanks to strings being immutable -
>> would seem to be a lose, and makes me wonder how it could be faster
>> than mmap in general.
>
> The only thing copied is a string one byte less than the search string for
> each block.

Um - you removed the code, but I could have *sworn* that it did
something like:

buf = buf[testlen:] + f.read(bufsize - testlen)

which should cause the creation of three strings: the last few
bytes of the old buffer, a new bufferful from the read, then the sum
of those two - created by copying the first two into a new string. So
you wind up copying all the data.

Which, as you showed, doesn't take nearly as much time as using mmap.

David Rasmussen

unread,
Oct 29, 2005, 6:55:32 PM10/29/05
to
Steven D'Aprano wrote:
> On Fri, 28 Oct 2005 06:22:11 -0700, pinkflo...@gmail.com wrote:
>
>>Which is quite fast. The only problems is that the file might be huge.
>
> What *you* call huge and what *Python* calls huge may be very different
> indeed. What are you calling huge?
>

I'm not saying that it is too big for Python. I am saying that it is too
big for the systems it is going to run on. These files can be 22 MB or 5
GB or ..., depending on the situation. It might not be okay to run a
tool that claims that much memory, even if it is available.

>
>>I really have no need for reading the entire file into a string as I am
>>doing here. All I want is to count occurences this substring. Can I
>>somehow count occurences in a file without reading it into a string
>>first?
>
> Magic?
>

That would be nice :)

But you misunderstand me...

> You have to read the file into memory at some stage, otherwise how can you
> see what value the bytes are?

I haven't said that I would like to scan the file without reading it. I
am just saying that the .count() functionality implemented into strings
could just as well be applied to some abstraction such as a stream (I
come from C++). In C++, the count() functionality would be separated as
much as possible from any concrete datatype (such as a string),
precisely because it is a concept that is applicable at a more abstract
level. I should be able to say "count the substring occurrences of this
stream" or "using this iterator" or something to that effect. If I could say

print file("filename", "rb").count("\x00\x00\x01\x00")

(or something like that)

instead of the original

print file("filename", "rb").read().count("\x00\x00\x01\x00")

it would be exactly what I am after. What is the conceptual difference?
The first solution should be at least as fast as the second. I have to
read and compare the characters anyway. I just don't need to store them
in a string. In essence, I should be able to use the "count occurrences"
functionality on more things, such as a file, or even better, a file
read through a buffer with a size specified by me.

>
> Here is another thought. What are you going to do with the count when you
> are done? That sounds to me like a pretty pointless result: "Hi user, the
> file XYZ has 27 occurrences of bitpattern \x00\x00\x01\x00. Would you like
> to do another file?"
>

It might sound pointless to you, but it is not pointless for my purposes :)

If you must know, the above one-liner actually counts the number of
frames in an MPEG2 file. I want to know this number for a number of
files for various reasons. I don't want it to take forever.

> If you are planning to use this count to do something, perhaps there is a
> more efficient way to combine the two steps into one -- especially
> valuable if your files really are huge.
>

Of course, but I don't need to do anything else in this case.

/David

Alex Martelli

unread,
Oct 29, 2005, 7:36:22 PM10/29/05
to
Paul Watson <pwa...@redlinepy.com> wrote:

Then you should use try/finally (to have your code run correctly in all
of today's implementations; Python 2.5 will have a 'with' statement to
offer nicer syntax sugar for that, but it will be a while before all the
implementations get around to adding it).

If you're trying to test your code to ensure it explicitly closes all
files, you could (from within your tests) rebind built-ins 'file' and
'open' to be a class wrapping the real thing, and adding a flag to
remember if the file is open; at __del__ time it would warn if the file
had not been explicitly closed. E.g. (untested code):

import __builtin__
import warnings

_f = __builtin__.file
class testing_file(_f):
    def __init__(self, *a, **k):
        _f.__init__(self, *a, **k)
        self._opened = True
    def close(self):
        _f.close(self)
        self._opened = False
    def __del__(self):
        if self._opened:
            warnings.warn(...)
            self.close()

__builtin__.file = __builtin__.open = testing_file

Alex

Alex Martelli

unread,
Oct 29, 2005, 7:41:42 PM10/29/05
to
Steven D'Aprano <st...@REMOVETHIScyber.com.au> wrote:
...

> I should also point out that for really serious work, the idiom:
>
> f = file("parrot")
> handle(f)
> f.close()
>
> is insufficiently robust for production level code. That was a detail I
> didn't think I needed to drop on the original newbie poster, but depending
> on how paranoid you are, or how many exceptions you want to insulate the
> user from, something like this might be needed:
>
> try:
> f = file("parrot")
> try:
> handle(f)
> finally:
> try:
> f.close()
> except:
> print "The file could not be closed; see your sys admin."
> except:
> print "The file could not be opened."

The inner try/finally is fine, but both the try/except are total, utter,
unmitigated disasters: they will hide a lot of information about
problems, let the program continue in a totally erroneous state, give
mistaken messages if handle(f) causes any kind of error totally
unrelated to opening the file (or if the user hits control-C during a
lengthy run of handle(f)), emit messages that can erroneously end up in
the redirected stdout of your program... VERY, VERY bad things.

Don't ever catch and ``handle'' exceptions in such ways. In particular,
each time you're thinking of writing a bare 'except:' clause, think
again, and you'll most likely find a much better approach.


Alex

netva...@gmail.com

unread,
Oct 29, 2005, 10:02:16 PM10/29/05
to

Steve Holden wrote:
> Indeed, but reading one byte at a time is about the slowest way to
> process a file, in Python or any other language, because it fails to
> amortize the overhead cost of function calls over many characters.
>
> Buffering wasn't invented because early programmers had nothing better
> to occupy their minds, remember :-)

Buffer, and then read one byte at a time from the buffer.

Vaibhav

Peter Otten

unread,
Oct 30, 2005, 3:51:39 AM10/30/05
to
David Rasmussen wrote:

> None of the solutions presented in this thread is nearly as fast as the
>
> print file("filename", "rb").read().count("\x00\x00\x01\x00")

Have you already timed Scott David Daniels' approach with a /large/
blocksize? It looks promising.

Peter

Fredrik Lundh

unread,
Oct 30, 2005, 2:55:06 AM10/30/05
to pytho...@python.org
Paul Watson wrote:

> This is Cyngwin on Windows XP.

using cygwin to analyze performance characteristics of portable API:s
is a really lousy idea.

here are corresponding figures from a real operating system:

using a 16 MB file:

$ time python2.4 scanmap.py
real 0m0.080s
user 0m0.070s
sys 0m0.010s

$ time python2.4 scanpaul.py
real 0m0.458s
user 0m0.450s
sys 0m0.010s

using a 256 MB file (50% of available memory):

$ time python2.4 scanmap.py
real 0m0.913s
user 0m0.810s
sys 0m0.100s

$ time python2.4 scanpaul.py
real 0m7.149s
user 0m6.950s
sys 0m0.200s

using a 1024 MB file (200% of available memory):

$ time python2.4 scanpaul.py
real 0m34.274s
user 0m28.030s
sys 0m1.350s

$ time python2.4 scanmap.py
real 0m20.221s
user 0m3.120s
sys 0m1.520s

(Intel(R) Pentium(R) 4 CPU 2.80GHz, 512 MB RAM, relatively slow ATA
disks, relatively recent Linux, best result of multiple mixed runs shown.
scanmap performance would probably improve if Python supported the
"madvise" API, but I don't have time to test that today...)

</F>

Steven D'Aprano

unread,
Oct 30, 2005, 4:02:27 AM10/30/05
to

Of course. That's why I said "something like this" and not "do this" :-) I
don't expect people to "handle" exceptions with a print statement in
anything more serious than a throw-away script.

For serious, production-level code, more often than not you will end up
spending more time and effort handling errors than you spend on handling
the task your application is actually meant to do. But I suspect I'm not
telling Alex anything he doesn't already know.


> Don't ever catch and ``handle'' exceptions in such ways. In particular,
> each time you're thinking of writing a bare 'except:' clause, think
> again, and you'll most likely find a much better approach.

What would you -- or anyone else -- recommend as a better approach?

Is there a canonical list somewhere that states every possible exception
from a file open or close?

--
Steven.

Alex Martelli

unread,
Oct 30, 2005, 10:35:12 AM10/30/05
to
Steven D'Aprano <st...@REMOVETHIScyber.com.au> wrote:
...
> > Don't ever catch and ``handle'' exceptions in such ways. In particular,
> > each time you're thinking of writing a bare 'except:' clause, think
> > again, and you'll most likely find a much better approach.
>
> What would you -- or anyone else -- recommend as a better approach?

That depends on your application, and what you're trying to accomplish
at this point.


> Is there a canonical list somewhere that states every possible exception
> from a file open or close?

No. But if you get a totally unexpected exception, something that shows
the world has gone crazy and most likely any further action you perform
would run the risk of damaging the user's persistent data since the
machine appears to be careening wildly out of control... WHY would you
want to perform any further action? Crashing and burning (ideally
leaving as detailed a core-dump as feasible for later post-mortem)
appears to be preferable. (Detailed information for post-mortem
purposes is best dumped in a sys.excepthook handler, since wild
unexpected exceptions may occur anywhere and it's impractical to pepper
your application code with bare except clauses for such purposes).

Obviously, if your program is so life-crucial that it cannot be missing
for a long period of time, you will have separately set up a "hot spare"
system, ready to take over at the behest of a separate monitor program
as soon as your program develops problems of such magnitude (a
"heartbeat" system helps with monitoring). You do need redundant
hardware for that, since the root cause of unexpected problems may well
be in a hardware fault -- the disk has crashed, a memory chip just
melted, the CPU's on strike, locusts...! Not stuff any program can do
much about in the short term, except by switching to a different
machine.


Alex

John J. Lee

unread,
Oct 30, 2005, 5:01:02 PM10/30/05
to pytho...@python.org
ale...@yahoo.com (Alex Martelli) writes:
[...]

> If you're trying to test your code to ensure it explicitly closes all
> files, you could (from within your tests) rebind built-ins 'file' and
> 'open' to be a class wrapping the real thing, and adding a flag to
> remember if the file is open; at __del__ time it would warn if the file
> had not been explicitly closed. E.g. (untested code):
[...]

In general __del__ methods interfere with garbage collection, don't
they? I guess in the case of file objects this is unlikely to be
problematic (because there are unlikely to be any reference cycles), but I
thought it might be worth warning people that in general this
debugging strategy might get rather confusing since the __del__ could
actually change the operation of the program by preventing garbage
collection of the objects whose lifetime you're trying to
investigate...


John

Steven D'Aprano

unread,
Oct 30, 2005, 5:20:02 PM10/30/05
to
On Sun, 30 Oct 2005 08:35:12 -0700, Alex Martelli wrote:

> Steven D'Aprano <st...@REMOVETHIScyber.com.au> wrote:
> ...
>> > Don't ever catch and ``handle'' exceptions in such ways. In particular,
>> > each time you're thinking of writing a bare 'except:' clause, think
>> > again, and you'll most likely find a much better approach.
>>
>> What would you -- or anyone else -- recommend as a better approach?
>
> That depends on your application, and what you're trying to accomplish
> at this point.
>
>
>> Is there a canonical list somewhere that states every possible exception
>> from a file open or close?
>
> No. But if you get a totally unexpected exception,

I'm more concerned about getting an expected exception -- or more
accurately, *missing* an expected exception. Matching on Exception is too
high. EOFError will probably need to be handled separately, since it often
isn't an error at all, just a change of state. IOError is the bad one.
What else can go wrong?


> something that shows
> the world has gone crazy and most likely any further action you perform
> would run the risk of damaging the user's persistent data since the
> macchine appears to be careening wildly out of control... WHY would you
> want to perform any further action?

In circumstances where, as you put it, the hard disk has crashed, the CPU
is on strike, or the memory has melted down, not only can you not recover
gracefully, but you probably can't even fail gracefully -- at least not
without a special purpose fail-safe operating system.

I'm not concerned with mylist.append(None) unexpectedly -- and
impossibly? -- raising an ImportError. I can't predict every way things
can explode, and even if they do, I can't safely recover from them. But I
can fail gracefully from *expected* errors: rather than the application
just crashing, I can at least try to put up a somewhat informative dialog
box, or at least print something to the console. If opening a preferences
file fails, I can fall back on default settings. If writing the
preferences file fails, I can tell the user that their settings won't be
saved, and continue. Just because the disk is full or their disk quota is
exceeded, doesn't necessarily mean the app can't continue with the job on
hand.


--
Steven.

Alex Martelli

unread,
Oct 30, 2005, 7:08:04 PM10/30/05
to
John J. Lee <j...@pobox.com> wrote:

> ale...@yahoo.com (Alex Martelli) writes:
> [...]
> > If you're trying to test your code to ensure it explicitly closes all
> > files, you could (from within your tests) rebind built-ins 'file' and
> > 'open' to be a class wrapping the real thing, and adding a flag to
> > remember if the file is open; at __del__ time it would warn if the file
> > had not been explicitly closed. E.g. (untested code):
> [...]
>
> In general __del__ methods interfere with garbage collection, don't

Yes, cyclic gc only.

> they? I guess in the case of file objects this is unlikely to be
> problematic (because unlikely to be any reference cycles), but I
> thought it might be worth warning people that in general this
> debugging strategy might get rather confusing since the __del__ could
> actually change the operation of the program by preventing garbage
> collection of the objects whose lifetime you're trying to
> investigate...

Yeah, but you'll find them all listed in gc.garbage for your perusal --
they DO get collected, but they get put in gc.garbage rather than FREED,
that's all. E.g.:

>>> class a(object):
...     def __del__(self): print 'del', self.__class__.__name__
...
>>> class b(a): pass
...
>>> x=a(); y=b(); x.y=y; y.x=x
>>> del x, y
>>> gc.collect()
4
>>> gc.garbage
[<__main__.a object at 0x64cf0>, <__main__.b object at 0x58510>]

So, no big deal -- run a gc.collect() and parse through gc.garbage for
any instances of your "wrapper of file" class, and you'll find ones that
were forgotten as part of a cyclic garbage loop and you can check
whether they were explicitly closed or not.


Alex

Alex Martelli

unread,
Oct 30, 2005, 7:08:05 PM10/30/05
to
Steven D'Aprano <st...@REMOVETHIScyber.com.au> wrote:
...
> > No. But if you get a totally unexpected exception,
>
> I'm more concerned about getting an expected exception -- or more
> accurately, *missing* an expected exception. Matching on Exception is too
> high. EOFError will probably need to be handled separately, since it often
> isn't an error at all, just a change of state. IOError is the bad one.
> What else can go wrong?

Lots of things, but not ones you should WISH your application to
survive.

> > something that shows
> > the world has gone crazy and most likely any further action you perform
> > would run the risk of damaging the user's persistent data since the
> > macchine appears to be careening wildly out of control... WHY would you
> > want to perform any further action?
>
> In circumstances where, as you put it, the hard disk has crashed, the CPU
> is on strike, or the memory has melted down, not only can you not recover
> gracefully, but you probably can't even fail gracefully -- at least not
> without a special purpose fail-safe operating system.

Right: as I explained, all you can do is switch operation over to a "hot
spare" server (a different machine), through the good services of a
"monitor" machine - and hope the redundancy you've built into your
storage system (presumably a database with good solid mirroring running
on other machines yet, if your app is as critical as that) has survived.


> I'm not concerned with mylist.append(None) unexpectedly -- and
> impossibly? -- raising an ImportError. I can't predict every way things
> can explode, and even if they do, I can't safely recover from them. But I

Exactly my point -- and yet if you use "except:", that's exactly what
you're TRYING to do, rather than let your app die.

> can fail gracefully from *expected* errors: rather than the application
> just crashing, I can at least try to put up a somewhat informative dialog
> box, or at least print something to the console.

You can do that from your sys.excepthook routine, if it's OK for your
app to die afterwards. What we're discussing here are only, and
strictly, cases in which you wish your app to NOT die, but rather keep
processing (not just give nice error diagnostics and then terminate,
that's what sys.excepthook is for). And what I'm saying is that you
should keep processing only for errors you *DO* expect, and should not
even try if the error is NOT expected.

>. If opening a preferences
> file fails, I can fall back on default settings. If writing the
> preferences file fails, I can tell the user that their settings won't be
> saved, and continue. Just because the disk is full or their disk quota is
> exceeded, doesn't necessarily mean the app can't continue with the job on
> hand.

Sure, that's why you catch IOError, which covers these _expected_ cases
(or, if you want to be a bit wider, maybe OSError) AND check its errno
attribute to ensure you haven't mistakenly caught something you did NOT
in fact expect (and thus can't really handle), and use a bare raise
statement to re-raise the "accidentally caught" exception if needed.
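
A short sketch of that pattern (the file name and the fallback function
are made up for illustration):

import errno

prefs_path = "prefs.ini"          # hypothetical preferences file

def default_prefs():              # hypothetical fallback settings
    return {}

try:
    f = open(prefs_path, "rb")
except IOError, e:
    if e.errno == errno.ENOENT:   # the one failure we actually expect: file missing
        prefs = default_prefs()
    else:
        raise                     # anything unexpected: re-raise it
else:
    prefs = f.read()
    f.close()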

But if something goes wrong that you had NOT anticipated, just log as
much info as you can for the postmortem, give nice diagnostics to the
user if you wish, and do NOT keep processing -- and for these
diagnostic-only purposes, use sys.excepthook, not a slew of try/except
all over your application making it crufty and unmaintainable.


Alex

Bengt Richter

unread,
Oct 30, 2005, 7:23:41 PM10/30/05
to
On Sat, 29 Oct 2005 21:10:11 +0100, Steve Holden <st...@holdenweb.com> wrote:

>Peter Otten wrote:
>> Bengt Richter wrote:
>>
>>
>>>What struck me was
>>>
>>>
>>>>>> gen = byblocks(StringIO.StringIO('no'),1024,len('end?')-1)
>>>>>> [gen.next() for i in xrange(10)]
>>>
>>>['no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no']
>>
>>
>> Ouch. Seems like I spotted the subtle cornercase error and missed the big
>> one.
>>
>
>No, you just realised subconsciously that we'd all spot the obvious one
>and decided to point out the bug that would remain after the obvious one
>had been fixed.
>

I still smelled a bug in the counting of substrings in the overlap region,
and you motivated me to find it (obvious in hindsight, but aren't most ;-)

A substring can get over-counted if the "overlap" region joins infelicitously with the next
input. E.g., try counting 'xx' in 10*'xx' with a read chunk of 4 instead of 1024*1024:

Assuming corrections so far posted as I understand them:

>>> def byblocks(f, blocksize, overlap):
...     block = f.read(blocksize)
...     yield block
...     if overlap>0:
...         while True:
...             next = f.read(blocksize-overlap)
...             if not next: break
...             block = block[-overlap:] + next
...             yield block
...     else:
...         while True:
...             next = f.read(blocksize)
...             if not next: break
...             yield next
...
>>> def countsubst(f, subst, blksize=1024*1024):
...     count = 0
...     for block in byblocks(f, blksize, len(subst)-1):
...         count += block.count(subst)
...     f.close()
...     return count
...

>>> from StringIO import StringIO as S
>>> countsubst(S('xx'*10), 'xx', 4)
13
>>> ('xx'*10).count('xx')
10
>>> list(byblocks(S('xx'*10), 4, len('xx')-1))
['xxxx', 'xxxx', 'xxxx', 'xxxx', 'xxxx', 'xxxx', 'xx']

Of course, a large read chunk will make the problem either
go away

>>> countsubst(S('xx'*10), 'xx', 1024)
10

or might make it low probability depending on the data.

Regards,
Bengt Richter

Tony Nelson

unread,
Oct 30, 2005, 8:37:18 PM10/30/05
to
In article <1130637600....@g43g2000cwa.googlegroups.com>,
netva...@gmail.com wrote:

Have you measured it?

#!/usr/bin/python
'''Time some file scanning.
'''

import sys, time

f = open(sys.argv[1])
t = time.time()
while True:
    b = f.read(256*1024)
    if not b:
        break
print 'initial read', time.time() - t
f.close()

f = open(sys.argv[1])
t = time.time()
while True:
    b = f.read(256*1024)
    if not b:
        break
print 'second read', time.time() - t
f.close()

if 1:
    f = open(sys.argv[1])
    t = time.time()
    while True:
        b = f.read(256*1024)
        if not b:
            break
        for c in b:
            pass
    print 'third chars', time.time() - t
    f.close()

f = open(sys.argv[1])
t = time.time()
n = 0
srch = '\x00\x00\x01\x00'
laplen = len(srch)-1
lap = ''
while True:
    b = f.read(256*1024)
    if not b:
        break
    n += (lap+b[:laplen]).count(srch)
    n += b.count(srch)
    lap = b[-laplen:]
print 'fourth scan', time.time() - t, n
f.close()


On my (old) system, with a 512 MB file so it won't all buffer, the
second time I get:

initial read 14.513395071
second read 14.8771388531
third chars 178.250257969
fourth scan 26.1602909565 1
________________________________________________________________________
TonyN.:' *firstname*nlsnews@georgea*lastname*.com
' <http://www.georgeanelson.com/>

Paul Watson

unread,
Oct 30, 2005, 8:53:17 PM10/30/05
to
Fredrik Lundh wrote:
> Paul Watson wrote:
>
>>This is Cyngwin on Windows XP.
>
> using cygwin to analyze performance characteristics of portable API:s
> is a really lousy idea.

Ok. So, I agree. That is just what I had at hand. Here are some other
numbers to which due diligence has also not been applied. Source code
is at the bottom for both file and mmap process. I would be willing for
someone to tell me what I could improve.

$ python -V
Python 2.4.1

$ uname -a
Linux ruth 2.6.13-1.1532_FC4 #1 Thu Oct 20 01:30:08 EDT 2005 i686

$ cat /proc/meminfo|head -2
MemTotal: 514232 kB
MemFree: 47080 kB

$ time ./scanfile.py
16384

real 0m0.06s
user 0m0.03s
sys 0m0.01s

$ time ./scanfilemmap.py
16384

real 0m0.10s
user 0m0.06s
sys 0m0.00s

Using a ~ 250 MB file, not even half of physical memory.

$ time ./scanfile.py
16777216

real 0m11.19s
user 0m10.98s
sys 0m0.17s

$ time ./scanfilemmap.py
16777216

real 0m55.09s
user 0m43.12s
sys 0m11.92s

==============================

$ cat scanfile.py
#!/usr/bin/env python

import sys

fn = 't.dat'
ss = '\x00\x00\x01\x00'
ss = 'time'

be = len(ss) - 1 # length of overlap to check
blocksize = 64 * 1024 # need to ensure that blocksize > overlap

fp = open(fn, 'rb')

b = fp.read(blocksize)
count = 0
while len(b) > be:
    count += b.count(ss)
    b = b[-be:] + fp.read(blocksize)

fp.close()

print count
sys.exit(0)

===================================

$ cat scanfilemmap.py
#!/usr/bin/env python

import sys
import os
import mmap

fn = 't.dat'
ss = '\x00\x00\x01\x00'
ss='time'

fp = open(fn, 'rb')
b = mmap.mmap(fp.fileno(), os.stat(fp.name).st_size,
              access=mmap.ACCESS_READ)

count = 0
foundpoint = b.find(ss, 0)
while foundpoint != -1 and (foundpoint + 1) < b.size():
    #print foundpoint
    count = count + 1
    foundpoint = b.find(ss, foundpoint + 1)
b.close()

print count

fp.close()
sys.exit(0)

Steven D'Aprano

unread,
Oct 30, 2005, 9:08:27 PM10/30/05
to
Alex Martelli wrote:

> Steven D'Aprano <st...@REMOVETHIScyber.com.au> wrote:
> ...
>
>>>No. But if you get a totally unexpected exception,
>>
>>I'm more concerned about getting an expected exception -- or more
>>accurately, *missing* an expected exception. Matching on Exception is too
>>high. EOFError will probably need to be handled separately, since it often
>>isn't an error at all, just a change of state. IOError is the bad one.
>>What else can go wrong?
>
>
> Lots of things, but not ones you should WISH your application to
> survive.

[snip]

> Sure, that's why you catch IOError, which covers these _expected_ cases
> (or, if you want to be a bit wider, maybe OSError) AND check its errno
> attribute to ensure you haven't mistakenly caught something you did NOT
> in fact expect (and thus can't really handle), to use a bare raise
> statement to re-raise the "accidentally caught" exception if needed.

Ah, that's precisely the answer I was looking for:
IOError and OSError, and then check the errno.
Excellent: thanks for that.

> But if something goes wrong that you had NOT anticipated, just log as
> much info as you can for the postmortem, give nice diagnostics to the
> user if you wish, and do NOT keep processing -- and for these
> diagnostic-only purposes, use sys.excepthook, not a slew of try/except
> all over your application making it crufty and unmaintainable.

Agreed -- I never intended people to draw the
conclusion that every line, or even every logical block
of lines, should be wrapped in try/except.


--
Steven.

Peter Otten

unread,
Oct 31, 2005, 3:19:10 AM10/31/05
to
Bengt Richter wrote:

[David Rasmussen]

> First of all, this isn't a text file, it is a binary file. Secondly,
> substrings can overlap. In the sequence 0010010 the substring 0010
> occurs twice.

Coincidentally the "always overlap" case seems the easiest to fix. It
suffices to replace the count() method with

def count_overlap(s, token):
    pos = -1
    n = 0
    while 1:
        try:
            pos = s.index(token, pos+1)
        except ValueError:
            break
        n += 1
    return n

Or so I hope, without the thorough tests that are indispensable as we should
have learned by now...

Peter

David Rasmussen

unread,
Oct 31, 2005, 2:26:04 AM10/31/05
to
No comments to this post?

/David

Lasse Vågsæther Karlsen

unread,
Oct 31, 2005, 3:41:02 AM10/31/05
to
David Rasmussen wrote:
<snip>

> If you must know, the above one-liner actually counts the number of
> frames in an MPEG2 file. I want to know this number for a number of
> files for various reasons. I don't want it to take forever.
<snip>

Don't you risk getting more "frames" than the file actually has? What
if the encoded data happens to have the magic byte values for something
else?

--
Lasse Vågsæther Karlsen
http://usinglvkblog.blogspot.com/
mailto:la...@vkarlsen.no
PGP KeyID: 0x2A42A1C2

Steven D'Aprano

unread,
Oct 31, 2005, 5:16:41 AM10/31/05
to
David Rasmussen wrote:

> Steven D'Aprano wrote:
>
>> On Fri, 28 Oct 2005 06:22:11 -0700, pinkflo...@gmail.com wrote:
>>
>>> Which is quite fast. The only problems is that the file might be huge.
>>
>>
>> What *you* call huge and what *Python* calls huge may be very different
>> indeed. What are you calling huge?
>>
>
> I'm not saying that it is too big for Python. I am saying that it is too
> big for the systems it is going to run on. These files can be 22 MB or 5
> GB or ..., depending on the situation. It might not be okay to run a
> tool that claims that much memory, even if it is available.

If your files can reach multiple gigabytes, you will
definitely need an algorithm that avoids reading the
entire file into memory at once.


[snip]

> print file("filename", "rb").count("\x00\x00\x01\x00")
>
> (or something like that)
>
> instead of the original
>
> print file("filename", "rb").read().count("\x00\x00\x01\x00")
>
> it would be exactly what I am after.

I think I can say, without risk of contradiction, that
there is no built-in method to do that.


> What is the conceptual difference?
> The first solution should be at least as fast as the second. I have to
> read and compare the characters anyway. I just don't need to store them
> in a string. In essence, I should be able to use the "count occurences"
> functionality on more things, such as a file, or even better, a file
> read through a buffer with a size specified by me.

Of course, if you feel like coding the algorithm and
submitting it to be included in the next release of
Python... :-)


I can't help feeling that a generator with a buffer is
the way to go, but I just can't *quite* deal with the
case where the pattern overlaps the boundary... it is
very annoying.

But not half as annoying as it must be to you :-)
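
The nearest I have got is something along these lines -- untested, so take
it as an idea only. It carries the last len(pattern)-1 bytes of each block
over to the next read, so a marker straddling a block boundary should still
be seen; like str.count it only counts non-overlapping hits:

def count_in_file(f, pattern, blocksize=65536):
    count = 0
    tail = ""          # last len(pattern)-1 bytes of the previous block
    while True:
        block = f.read(blocksize)
        if not block:
            break
        data = tail + block
        count += data.count(pattern)
        tail = data[-(len(pattern) - 1):]   # enough to catch a straddler
    return count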

However, there may be a simpler solution *fingers
crossed* -- you are searching for a sub-string
"\x00\x00\x01\x00", which is hex 0x100. Surely you
don't want any old substring of "\x00\x00\x01\x00", but
only the ones which align on word boundaries?

So "ABCD\x00\x00\x01\x00" would match (in hex, it is
0x41424344 0x100), but "AB\x00\x00\x01\x00CD" should
not, because that is 0x41420000 0x1004344 in hex.

If that is the case, your problem is simpler: you don't
have to worry about the pattern crossing a boundary, so
long as your buffer is a multiple of four bytes.
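
If that alignment really does hold, a sketch like this would do (assuming
blocksize is kept a multiple of four, so only offsets divisible by four are
ever checked):

def count_word_aligned(f, pattern="\x00\x00\x01\x00", blocksize=65536):
    count = 0
    while True:
        block = f.read(blocksize)
        if not block:
            break
        count += sum(1 for i in xrange(0, len(block), 4)
                     if block[i:i+4] == pattern)
    return count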

--
Steven.

Paul Watson

unread,
Oct 31, 2005, 9:13:26 AM10/31/05
to
Alex Martelli wrote:
...

>>>>gc.garbage
>
> [<__main__.a object at 0x64cf0>, <__main__.b object at 0x58510>]
>
> So, no big deal -- run a gc.collect() and parse through gc.garbage for
> any instances of your "wrapper of file" class, and you'll find ones that
> were forgotten as part of a cyclic garbage loop and you can check
> whether they were explicitly closed or not.
>
>
> Alex

Since everyone needs this, how about building it in such that files
which are closed by the runtime, and not user code, are reported or
queryable? Perhaps a command line switch to either invoke or suppress
reporting them on exit.

Is there any facility for another program to peer into the state of a
Python program? Would this be a security problem?

Steve Holden

unread,
Oct 31, 2005, 10:07:12 AM10/31/05
to pytho...@python.org
Paul Watson wrote:
> Alex Martelli wrote:
> ...
>
>>>>>gc.garbage
>>
>>[<__main__.a object at 0x64cf0>, <__main__.b object at 0x58510>]
>>
>>So, no big deal -- run a gc.collect() and parse through gc.garbage for
>>any instances of your "wrapper of file" class, and you'll find ones that
>>were forgotten as part of a cyclic garbage loop and you can check
>>whether they were explicitly closed or not.
>>
>>
>>Alex
>
>
> Since everyone needs this, how about building it in such that files
> which are closed by the runtime, and not user code, are reported or
> queryable? Perhaps a command line switch to either invoke or suppress
> reporting them on exit.
>
This is a rather poor substitute for correct program design and
implementation. It also begs the question of exactly what constitutes a
"file". What about a network socket that the user has run makefile() on?
What about a pipe to another process? This suggestion is rather ill-defined.

> Is there any facility for another program to peer into the state of a
> Python program? Would this be a security problem?

It would indeed be a security problem, and there are enough of those
already without adding more.

regards
Steve
--
Steve Holden +44 150 684 7255 +1 800 494 3119
Holden Web LLC www.holdenweb.com
PyCon TX 2006 www.python.org/pycon/

Bengt Richter

unread,
Oct 31, 2005, 11:18:19 AM10/31/05
to
On Mon, 31 Oct 2005 09:41:02 +0100, Lasse Vågsæther Karlsen <la...@vkarlsen.no> wrote:

>David Rasmussen wrote:
><snip>
>> If you must know, the above one-liner actually counts the number of
>> frames in an MPEG2 file. I want to know this number for a number of
>> files for various reasons. I don't want it to take forever.
><snip>
>
>Don't you risk getting more "frames" than the file actually have? What
>if the encoded data happens to have the magic byte values for something
>else?
>

Good point, but perhaps the bit pattern the OP is looking for is guaranteed
(e.g. by some kind of HDLC-like bit or byte stuffing or escaping) not to occur
except as frame marker (which might make sense re the problem of re-synching
to frames in a glitched video stream).

The OP probably knows. I imagine this thread would have gone differently if the
title had been "How to count frames in an MPEG2 file?" and the OP had supplied
the info about what marks a frame and whether it is guaranteed not to occur
in the data ;-)

Regards,
Bengt Richter

Bengt Richter

unread,
Oct 31, 2005, 12:02:18 PM10/31/05
to

The OP didn't reply to my post re the above for some reason
http://groups.google.com/group/comp.lang.python/msg/dd4125bf38a54b7c?hl=en&

>
>Coincidentally the "always overlap" case seems the easiest to fix. It
>suffices to replace the count() method with
>
>def count_overlap(s, token):
>    pos = -1
>    n = 0
>    while 1:
>        try:
>            pos = s.index(token, pos+1)
>        except ValueError:
>            break
>        n += 1
>    return n
>
>Or so I hope, without the thorough tests that are indispensable as we should
>have learned by now...
>

Unfortunately, there is such a thing as a correct implementation of an incorrect spec ;-)
I have some doubts about the OP's really wanting to count overlapping patterns as above,
which is what I asked about in the above referenced post. Elsewhere he later reveals:

[David Rasmussen]


>> If you must know, the above one-liner actually counts the number of
>> frames in an MPEG2 file. I want to know this number for a number of
>> files for various reasons. I don't want it to take forever.

In which case I doubt whether he wants to count as above. Scanning for the
particular 4 bytes would assume that non-frame-marker data is escaped
one way or another so it can't contain the marker byte sequence.
(If it did, you'd want to skip it, not count it, I presume). Robust streaming video
format would presumably be designed for unambiguous re-synching, meaning
the data stream can't contain the sync mark. But I don't know if that
is guaranteed in conversion from file to stream a la HDLC or some link packet protocol
or whether it is actually encoded with escaping in the file. If framing in the file is with
length-specifying packet headers and no data escaping, then the filebytes.count(pattern)
approach is not going to do the job reliably, as Lasse was pointing out.

Requirements, requirements ;-)

Regards,
Bengt Richter

Paul Watson

unread,
Oct 31, 2005, 3:05:37 PM10/31/05
to
Steve Holden wrote:
>> Since everyone needs this, how about building it in such that files
>> which are closed by the runtime, and not user code, are reported or
>> queryable? Perhaps a command line switch to either invoke or suppress
>> reporting them on exit.
>>
> This is a rather poor substitute for correct program design and
> implementation. It also begs the question of exactly what constitutes a
> "file". What about a network socket that the user has run makefile() on?
> What about a pipe to another process? This suggestion is rather
> ill-defined.
>
>> Is there any facility for another program to peer into the state of a
>> Python program? Would this be a security problem?
>
> It would indeed be a security problem, and there are enough of those
> already without adding more.
>
> regards
> Steve

All I am looking for is for the runtime to tell me when it is doing things
that are outside the language specification and that the developer
should have coded.

How "ill" will things be when large bodies of code cannot run
successfully on a future version of Python or a non-CPython
implementation which does not close files. Might as well put file
closing on exit into the specification.

The runtime knows it is doing it. Please allow the runtime to tell me
what it knows it is doing. Thanks.

Steve Holden

unread,
Oct 31, 2005, 7:18:42 PM10/31/05
to pytho...@python.org

In point of fact I don't believe the runtime does any such thing
(though I must admit I haven't checked the source, so you may prove me
wrong).

As far as I know, Python simply relies on the operating system to close
files left open at the end of the program.

John J. Lee

unread,
Oct 31, 2005, 7:58:52 PM10/31/05
to pytho...@python.org
Paul Watson <pwa...@redlinepy.com> writes:
[...]

> How "ill" will things be when large bodies of code cannot run
> successfully on a future version of Python or a non-CPython
> implementation which does not close files. Might as well put file
> closing on exit into the specification.
[...]

There are many, many ways of making a large body of code "ill".

Closing off this particular one would make it harder to get benefit of
non-C implementations of Python, so it has been judged "not worth it".
I think I agree with that judgement.


John

Paul Rubin

unread,
Oct 31, 2005, 9:07:25 PM10/31/05
to
j...@pobox.com (John J. Lee) writes:
> Closing off this particular one would make it harder to get benefit of
> non-C implementations of Python, so it has been judged "not worth it".
> I think I agree with that judgement.

The right fix is PEP 343.

Alex Martelli

unread,
Nov 1, 2005, 12:05:04 AM11/1/05
to
Steve Holden <st...@holdenweb.com> wrote:
...

> > The runtime knows it is doing it. Please allow the runtime to tell me
> > what it knows it is doing. Thanks.
>
> In point oif fact I don't believe the runtime does any such thing
> (though I must admit I haven't checked the source, so you may prove me
> wrong).
>
> As far as I know, Python simply relies on the operating system to close
> files left open at the end of the program.

Nope, see
<http://cvs.sourceforge.net/viewcvs.py/python/python/dist/src/Objects/fi
leobject.c?rev=2.164.2.3&view=markup> :

"""
static void
file_dealloc(PyFileObject *f)
{
    int sts = 0;
    if (f->weakreflist != NULL)
        PyObject_ClearWeakRefs((PyObject *) f);
    if (f->f_fp != NULL && f->f_close != NULL) {
        Py_BEGIN_ALLOW_THREADS
        sts = (*f->f_close)(f->f_fp);
"""
etc.

Exactly how the OP wants to "allow the runtime to tell [him] what it
knows it is doing", in a way that is not equivalent to reading the freely
available sources of that runtime, is totally opaque to me, though.

"The runtime" (implementation of built-in object type `file`) could be
doing or not doing a bazillion things (in its ..._dealloc function as
well as many other functions), up to and including emailing the OP's
cousin if it detects the OP is up later than his or her bedtime -- the
language specs neither mandate nor forbid such behavior. How, exactly,
does the OP believe the language specs should "allow" (presumably,
REQUIRE) ``the runtime'' to communicate the sum total of all that it's
doing or not doing (beyond whatever the language specs themselves may
require or forbid it to do) on any particular occasion...?!


Alex

Paul Watson

unread,
Nov 1, 2005, 8:14:57 AM11/1/05
to

I am sure you are right. However, PEP 343 will not change the existing
body of Python source code. Nor will it, alone, change the existing
body of Python programmers who are writing code which does not close files.

Paul Watson

unread,
Nov 1, 2005, 8:26:42 AM11/1/05
to

The OP wants to know which files the runtime is closing automatically.
This may or may not occur on other or future Python implementations.
Identifying this condition will accelerate remediation efforts to avoid
the deleterious impact of failure to close().


The mechanism to implement such a capability might be similar to the -v
switch which traces imports, reporting to stdout. It might be a
callback function.
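
In the meantime, something in this direction can be rolled by hand -- a
hypothetical wrapper for illustration only, not a proposal for the actual
mechanism:

import warnings

class CheckedFile(file):
    """A file that complains if it is deallocated while still open."""
    def __del__(self):
        if not self.closed:
            warnings.warn("file %r was never explicitly closed" % self.name)
        # note: giving the class a __del__ means instances trapped in a
        # reference cycle will end up in gc.garbage instead of being freed

Open files via CheckedFile(path, "rb") instead of open(path, "rb"), and any
that the runtime, rather than user code, ends up closing will announce
themselves.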

Fredrik Lundh

unread,
Nov 1, 2005, 9:36:36 AM11/1/05
to pytho...@python.org
Alex Martelli wrote:

>> As far as I know, Python simply relies on the operating system to close
>> files left open at the end of the program.
>
> Nope, see
> <http://cvs.sourceforge.net/viewcvs.py/python/python/dist/src/Objects/fi
> leobject.c?rev=2.164.2.3&view=markup>

that's slightly misleading: CPython will make a decent attempt (via the module cleanup
mechanism: http://www.python.org/doc/essays/cleanup.html ), but any files that are not
closed by that process will be handled by the OS.

CPython is not designed to run on an OS that doesn't reclaim memory and other
resources upon program exit.

</F>

David Rasmussen

unread,
Nov 1, 2005, 2:35:07 PM11/1/05
to
Lasse Vågsæther Karlsen wrote:
> David Rasmussen wrote:
> <snip>
>
>> If you must know, the above one-liner actually counts the number of
>> frames in an MPEG2 file. I want to know this number for a number of
>> files for various reasons. I don't want it to take forever.
>
> Don't you risk getting more "frames" than the file actually have? What
> if the encoded data happens to have the magic byte values for something
> else?
>

I am not too sure about the details, but I've been told by a reliable
source that 0x00000100 only occurs as a "begin frame" marker, and not
anywhere else. So far, it has been true on the files I have tried it on.

/David

David Rasmussen

unread,
Nov 1, 2005, 2:36:07 PM11/1/05
to
Bengt Richter wrote:
>
> Good point, but perhaps the bit pattern the OP is looking for is guaranteed
> (e.g. by some kind of HDLC-like bit or byte stuffing or escaping) not to occur
> except as frame marker (which might make sense re the problem of re-synching
> to frames in a glitched video stream).
>

Exactly.

> The OP probably knows. I imagine this thread would have gone differently if the
> title had been "How to count frames in an MPEG2 file?" and the OP had supplied
> the info about what marks a frame and whether it is guaranteed not to occur
> in the data ;-)
>

Sure, but I wanted to ask the general question :) I am new to Python and
I want to learn about the language.

/David

David Rasmussen

unread,
Nov 1, 2005, 2:38:47 PM11/1/05
to
Steven D'Aprano wrote:
>
> However, there may be a simpler solution *fingers crossed* -- you are
> searching for a sub-string "\x00\x00\x01\x00", which is hex 0x100.
> Surely you don't want any old substring of "\x00\x00\x01\x00", but only
> the ones which align on word boundaries?
>

Nope, sorry. On the files I have tried this on, the pattern could occur
on any byte boundary.

/David

Steven D'Aprano

unread,
Nov 1, 2005, 8:37:56 PM11/1/05
to
David Rasmussen wrote:

Not too reliable then.

0x00000100 is one of a number of unique start codes in
the MPEG2 standard. It is guaranteed to be unique in
the video stream, however when searching for codes
within the video stream, make sure you're in the video
stream!

See, for example,
http://forum.doom9.org/archive/index.php/t-29262.html

"Actually, one easy way (DVD specific) is to look for
00 00 01 e0 at byte offset 00e of the pack. Then look
at byte 016, it contains the size of the extension.
Resume your scan at 017 + contents of 016."

Right. Glad that's the easy way.
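
Taken literally (reading those offsets as hex, and making no claim to cover
every pack layout), that recipe amounts to something like:

def video_payload_offset(pack):
    # 'pack' is one 2048-byte DVD program-stream pack
    if pack[0x0E:0x12] != "\x00\x00\x01\xe0":
        return None                    # not a video PES packet
    ext_len = ord(pack[0x16])          # "size of the extension" per the quote
    return 0x17 + ext_len              # resume the frame-marker scan here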

I really suspect that you need a proper MPEG2 parser,
and not just blindly counting bytes -- at least if you
want reliable, accurate counts and not just "number of
frames, plus some file-specific random number". And
heaven help you if you want to support MPEGs that are
slightly broken...

(It has to be said, depending on your ultimate needs,
"close enough" may very well be, um, close enough.)

Good luck!

--
Steven.

Bengt Richter

unread,
Nov 2, 2005, 1:46:43 AM11/2/05
to

It might be possible to recompile existing code (unchanged) to capture most
typical cpython use cases, I think...

E.g., I can imagine a family of command line options based on hooking import on
startup and passing option info to the selected and hooked import module,
which module would do extra things at the AST stage of compiling and executing modules
during import, to accomplish various things.

(I did a little proof of concept a while back, see

http://mail.python.org/pipermail/python-list/2005-August/296594.html

that gives me the feeling I could do this kind of thing).

E.g., for the purposes of guaranteeing close() on files opened in typical cpython
one-liners (or single-suiters), like e.g.

for i, line in enumerate(open(fpath)):
    print '%04d: %s' %(i, line.rstrip())

I think a custom import could recognize the open call
in the AST and extract it and wrap it up in a try/finally AST structure implementing
something like the following in place of the above:

__f = open(fpath) # (suitable algorithm for non-colliding __f names is required)
try:
    for i, line in enumerate(__f):
        print '%04d: %s' %(i, line.rstrip())
finally:
    __f.close()

In this case, the command line info passed to the special import might look like
python -with open script.py

meaning calls of open in a statement/suite should be recognized and extracted like
__f = open(fpath) above, and the try/finally be wrapped around the use of it.

I think this would capture a lot of typical usage, but of course I haven't bumped into
the gotchas yet, since I haven't implemented it ;-)

On a related note, I think one could implement macros of a sort in a similar way.
The command line parameter would pass the name of a class which is actually extracted
at AST-time, and whose methods and other class variables represent macro definitions
to be used in the processing of the rest of the module's AST, before compilation per se.

Thus you could implement e.g. in-lining, so that

----
#example.py
class inline:
    def mac(acc, x, y):
        acc += x*y

tot = 0
for i in xrange(10):
    mac(tot, i, i)
----

Could be run with

python -macros inline example.py

and get the same identical .pyc as you would with the source

----
#example.py
tot = 0
for i in xrange(10):
    tot += i*i
----

IOW, a copy of the macro body AST is substituted for the macro call AST, with
parameter names translated to actual macro call arg names. (Another variant
would also permit putting the macros in a separate module, and recognize their
import into other modules, and "do the right thing" instead of just translating
the import. Maybe specify the module by python -macromodule inline example.py
and then recognize "import inline" in example.py's AST).

Again, I just have a hunch I could make this work (and a number of people
here could beat me to it if they were motivated, I'm sure). Also have a hunch
I might need some flame shielding. ;-)

OTOH, it could be an easy way to experiment with some kinds of language
tweaks. The only limitation really is the necessity for the source to
look legal enough that an AST is formed and preserves the requisite info.
After that, there's no limit to what an AST-munger could do, especially
if it is allowed to call arbitrary tools and create auxiliary files such
as e.g. .dlls for synthesized imports plugging stuff into the final translated context ;-)
(I imagine this is essentially what the various machine code generating optimizers do).

IMO the concept of modules and their (optionally specially controlled) translation
and use could evolve in many interesting directions. E.g., __import__ could grow
keyword parameters too ... Good thing there is a BDFL with a veto, eh? ;-)

Should I bother trying to implement this import for with and macros from
the pieces I have (plus imp, to do it "right") ?

BTW, I haven't experimented with command line dependent site.py/sitecustomize.py stuff.
Would that be a place to do sessionwise import hooking and could one rewrite sys.argv
so the special import command line opts would not be visible to subsequent
processing (and the import hook would be in effect)? IWT so, but probably should read
site.py again and figure it out, but appreciate any hints on pitfalls ;-)

Regards,
Bengt Richter

David Rasmussen

unread,
Nov 2, 2005, 2:35:39 PM11/2/05
to
Steven D'Aprano wrote:
>
> 0x00000100 is one of a number of unique start codes in the MPEG2
> standard. It is guaranteed to be unique in the video stream, however
> when searching for codes within the video stream, make sure you're in
> the video stream!
>

I know I am in the cases I am interested in.

> And heaven help you if you want to support MPEGs that are slightly
> broken...
>

I don't. This tool is for use in house only. And on MPEGs that are
generated in house too.

/David
