Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

Python example: possible speedup?

1 view
Skip to first unread message

Hrvoje Niksic

unread,
Sep 8, 1999, 3:00:00 AM9/8/99
to
As a Python exercise, I wrote a simple program to "scratch an itch",
i.e. do something useful. However, I found that Python's lack of
speed really bytes me here, so I'd like to hear suggestions for
speedup. People who don't like that kind of topic, please skip to the
following article. Others, read on.

When browsing my Debian packages, I found I often wanted to know the
size of the installed packages. To achieve this, I wrote a simple
program that prints the installed packages and their sizes, all sorted
by size. Output looks like this:

tetex-base: 27652
tetex-extra: 27118
cmucl-safe: 22174
mozilla: 13007
timidity-patches: 11662
locales: 11489
quake2: 10800
[...]

To achieve this, the program needs to parse two files:
/var/lib/dpkg/status, to find which packages are installed, and
/var/lib/dpkg/available, to find the size of the installed packages.
Both files are textual, in the form of headers. Like this:

Package: telnet
Status: install ok installed
Priority: standard
Section: net
Installed-Size: 130
Maintainer: Herbert Xu <her...@debian.org>
Source: netkit-telnet
Version: 0.14-4
Replaces: netstd
Depends: libc6 (>= 2.1), libncurses4 (>= 4.2-3.1)
Description: The telnet client.
The telnet command is used for interactive communication with another host
using the TELNET protocol.

Package: packaging-manual
Status: purge ok not-installed
Priority: optional
Section: doc
[...]

The program was quite easy to write, and easy to read afterwards. The
problem is that it is also quite slow. On my system, it takes about
27 CPU seconds (as reported by `time' shell builtin) to do the work,
which can extend to more than a minute of real time, depending on the
system load.

As a comparison, the equivalent Perl program does the same thing in 9
CPU seconds. I tried everything I knew to make the Python version
fast. I tried to use `re' to avoid returning headers other than the
ones we're interested in. I tried changing self.__current to just
current to avoid a dictionary lookup. I tried to make self.__current
a list, to avoid the expensive `current = current + line' operation.
All of these things made the program measure slower.

I would really appreciate some suggestions. The code is not large,
and is (I hope) rather elegant. I am a Python beginner, so I'd also
appreciate tips on Python style and OO technique. I'll post/mail the
Perl equivalent on demand.

The code follows:

#!/usr/bin/python

import string

class Dpkg_Reader:
def __init__(self, file):
self.__current = ''
self.__fp = open(file)

def next_header(self):
while 1:
line = self.__fp.readline()
if line == '':
return (None, None) # EOF
if self.__current:
if line == "\n" or (line[0] != ' ' and line[0] != "\t"):
#print "{%s}" % (self.__current)
try:
name, value = string.split(self.__current, ': ', 1)
except:
name, value = string.split(self.__current, ':', 1)
value = value[:-1]
if line == "\n":
self.__current = ''
else:
self.__current = line
return (name, value)
self.__current = self.__current + line

def close(self):
self.__fp.close()

def process_status(file):
reader = Dpkg_Reader(file)
package = None
installed = {}
while 1:
name, value = reader.next_header()
if name is None:
break
if name == 'Package':
package = value
elif name == 'Status':
status = string.split(value, ' ')
if status[2] == 'installed':
installed[package] = 1
reader.close()
return installed

def process_available(file, installed):
reader = Dpkg_Reader(file)
sizes = {}
while 1:
name, value = reader.next_header()
if name is None:
break
elif name == 'Package':
package = value
elif name == 'Installed-Size':
if installed.has_key(package):
sizes[package] = string.atoi(value)
reader.close()
return sizes

def main():
installed = process_status('/var/lib/dpkg/status')
sizes = process_available('/var/lib/dpkg/available', installed)
lst = sizes.keys()
lst.sort(lambda a, b, sizes=sizes: cmp(sizes[b], sizes[a]))
for pack in lst:
print "%s: %d" % (pack, sizes[pack])

if __name__ == '__main__':
main()

Skip Montanaro

unread,
Sep 8, 1999, 3:00:00 AM9/8/99
to Hrvoje Niksic

Hrvoje> The program was quite easy to write, and easy to read
Hrvoje> afterwards. The problem is that it is also quite slow. On my
Hrvoje> system, it takes about 27 CPU seconds (as reported by `time'
Hrvoje> shell builtin) to do the work, which can extend to more than a
Hrvoje> minute of real time, depending on the system load.

...

Hrvoje> I would really appreciate some suggestions. The code is not
Hrvoje> large, and is (I hope) rather elegant. I am a Python beginner,
Hrvoje> so I'd also appreciate tips on Python style and OO technique.
Hrvoje> I'll post/mail the Perl equivalent on demand.

Assuming the package files aren't humongous, I'd suggest your __init__
method just slurp up the entire file instead of reading it line-by-line in
the next_header method. As has been recounted here before, Python does not
go to great lengths to optimize I/O the way Perl has, instead just relying
on a fairly straightforward interface to stdio.

Also, instead of calling next_header on a line-by-line basis, how about
calling it once per package and returning a dict containing the header names
as keys? Again, function calling in Python, while flexible, isn't terribly
efficient. A common optimization people use is to operate on aggregate
quantities in functions instead of their individual elements.

Skip Montanaro | http://www.mojam.com/
sk...@mojam.com | http://www.musi-cal.com/~skip/
847-971-7098 | Python: Programming the way Guido indented...

Hrvoje Niksic

unread,
Sep 8, 1999, 3:00:00 AM9/8/99
to Skip Montanaro
Skip Montanaro <sk...@mojam.com> writes:
[...]

Thanks for your comments. The package files aren't *humongous*, but
they aren't exactly small either. One file is 700K, and the other is
3M large. It would be really nice if it were possible to process them
in a stream.

But I'll try it, for the sake of argument, and see if the speedup is
worth it.

> Also, instead of calling next_header on a line-by-line basis, how
> about calling it once per package and returning a dict containing
> the header names as keys?

I'll try this too, and get back to you. That change will actually
make the code more correct because it will no longer assume that
`Package' header comes before `Status' and `Installed-Size' headers.

John Mitchell

unread,
Sep 8, 1999, 3:00:00 AM9/8/99
to
On 8 Sep 1999, Hrvoje Niksic wrote:

> As a Python exercise, I wrote a simple program to "scratch an itch",
> i.e. do something useful. However, I found that Python's lack of
> speed really bytes me here, so I'd like to hear suggestions for
> speedup. People who don't like that kind of topic, please skip to the
> following article. Others, read on.
>
> When browsing my Debian packages, I found I often wanted to know the
> size of the installed packages.


A few suggestions:

1) almost never use a catch-all-exceptions block. For example, in your
next_header method.

2) avoid doing semi-slow operations in the middle of a loop:

- string addition (self.__current = self.__current + line)

- reading a file one line at a time (self.__fp.readline())

3) lambdas are fun, but very slow.

4) objects are your friend. Dont know about speed, but they simplify code
*so much* that I always use them, even for dump dictionary-like and
list-like things.


That is:

1) make low-level object classes, usually subclassing from UserDict or
UserList.

2) do many, bulk, bulk operations, instead of doing a lot of stuff in the
middle of a loop. Repeated short loops are better.


Work is slow, so I've coded up an "example". It doesnt really work, since
I dont have a 'status' or 'available' files to use -- please send me yours
(privately via email), and I'll get this code to work on it.


- j


import string

from UserList import UserList
from UserDict import UserDict

# a single package entry, which is like a dictionary.
# IE: ent['Source'] => 'netkit-telnet'
#
class Entry(UserDict):
def isInstalled(self):
# "install ok installed" => 1, else 0
return string.split(self['Status'])[-1] == 'installed'
def installedSize(self):
if self.isInstalled():
return string.atoi( self['Installed-Size'] )
return 0

class DpkgReaderJM(UserDict):
_primaryKey = 'Package'

def __init__(self, path=None):
UserDict.__init__(self)
if path:
self.feed(path=path)

def parseLine(self, line):
if not line:
e, self._entry = self._entry, None
return e

if line[0] == ' ':
# XX: handle multiline values here (ie: Description)
pass
else:
key, value = string.split(line, ':', 1)
self._entry[key] = string.strip(value)
# def flush(self):
# return self._entry

def feed(self, path=None, data=None):
if not data:
data = open(path).read()
self._entry = Entry()
entryList = map(self.parseLine, string.split(data, '\n'))
# skip blank entries:
entryList = filter(None, entryList)
primaryKey = self._primaryKey
for ent in entryList:
self.data[ent[primaryKey]] = ent

class StatusReader(DpkgReaderJM):
pass
class AvailableReader(DpkgReaderJM):
pass


def main_orig():
installed = StatusReader('/var/lib/dpkg/status')
avail = AvailableReader('/var/lib/dpkg/available', installed)


lst = sizes.keys()
lst.sort(lambda a, b, sizes=sizes: cmp(sizes[b], sizes[a]))
for pack in lst:
print "%s: %d" % (pack, sizes[pack])

def main():
global status
status = StatusReader('status')
print 'applications:\n\t', status.keys()
print 'telnet installed size:\n\t', status['telnet'].installedSize()

if __name__=="__main__":
main()


Hrvoje Niksic

unread,
Sep 8, 1999, 3:00:00 AM9/8/99
to
John Mitchell <jo...@magnet.com> writes:

> A few suggestions:
>
> 1) almost never use a catch-all-exceptions block. For example, in your
> next_header method.

I don't think it causes a slowdown, but point taken.

> 2) avoid doing semi-slow operations in the middle of a loop:
>
> - string addition (self.__current = self.__current + line)

As I said in the original message, I've tried to optimize this by
making self.__current a list, but it didn't work any faster.

> - reading a file one line at a time (self.__fp.readline())

I don't see an alternative to this, except to read the whole file at
once, which I am trying to avoid, as the files are large.

> 3) lambdas are fun, but very slow.

I am almost completely sure that the sort you refer to is not the
cause of the slowdown.

> 4) objects are your friend. Dont know about speed, but they
> simplify code *so much* that I always use them, even for dump
> dictionary-like and list-like things.

I don't really understand this. I'll study your code and see if it's
clearer then.

Sch...@bam.com

unread,
Sep 8, 1999, 3:00:00 AM9/8/99
to hni...@srce.hr, pytho...@python.org

>> - reading a file one line at a time (self.__fp.readline())

>I don't see an alternative to this, except to read the whole file at
>once, which I am trying to avoid, as the files are large.

possibly using readlines(size), with an appropriate size, will help here.


Bill

Hrvoje Niksic

unread,
Sep 8, 1999, 3:00:00 AM9/8/99
to
Hrvoje Niksic <hni...@srce.hr> writes:

> But I'll try it [slurp all the lines at once and iterate over them],


> for the sake of argument, and see if the speedup is worth it.

Yes, this does speed things up. Instead of 27 CPU seconds, the
program now spends 19.5 seconds of CPU time. Along with:

> > Also, instead of calling next_header on a line-by-line basis, how
> > about calling it once per package and returning a dict containing
> > the header names as keys?

This buys me additional 2.5 CPU seconds.

The problem is that the thing is still twice slower than the
equivalent Perl program, which I haven't bothered to optimize at all.
Oh well.

In case anyone is still interested, after these two optimizations, the
code looks like this:

#!/usr/bin/python

import string

class Dpkg_Reader:
def __init__(self, file):
self.__current = ''

self.__lines = open(file).readlines()
self.__index = 0

def next_package(self):
package = {}
while 1:
if self.__index == len(self.__lines):
return None # EOF
line = self.__lines[self.__index]
self.__index = self.__index + 1


if self.__current:
if line == "\n" or (line[0] != ' ' and line[0] != "\t"):
#print "{%s}" % (self.__current)
try:
name, value = string.split(self.__current, ': ', 1)
except:
name, value = string.split(self.__current, ':', 1)
value = value[:-1]

package[name] = value


if line == "\n":
self.__current = ''

return package
else:
self.__current = line
continue


self.__current = self.__current + line

def close(self):
pass

def process_status(file):
reader = Dpkg_Reader(file)

installed = {}
while 1:
package = reader.next_package()
if package is None:
break
status = string.split(package['Status'], ' ')


if status[2] == 'installed':

installed[package['Package']] = 1
reader.close()
return installed

def process_available(file, installed):
reader = Dpkg_Reader(file)
sizes = {}
while 1:

package = reader.next_package()
if package is None:
break
pname = package['Package']
if installed.has_key(pname):
sizes[pname] = string.atoi(package['Installed-Size'])
reader.close()
return sizes

def main():
installed = process_status('/var/lib/dpkg/status')

sizes = process_available('/var/lib/dpkg/available', installed)


lst = sizes.keys()
lst.sort(lambda a, b, sizes=sizes: cmp(sizes[b], sizes[a]))
for pack in lst:
print "%s: %d" % (pack, sizes[pack])

if __name__ == '__main__':
main()

Martin von Loewis

unread,
Sep 8, 1999, 3:00:00 AM9/8/99
to
Skip Montanaro <sk...@mojam.com> writes:

> Also, instead of calling next_header on a line-by-line basis, how about
> calling it once per package and returning a dict containing the header names

> as keys? Again, function calling in Python, while flexible, isn't terribly
> efficient.

In addition, you could remove the __current member when doing so. If
there was a next_package method, it would just know that it is at the
beginning of a package (with __current==''), or at EOF. It would read
until EOF, or until it sees an empty line.

Regards,
Martin

Martin von Loewis

unread,
Sep 8, 1999, 3:00:00 AM9/8/99
to
Hrvoje Niksic <hni...@srce.hr> writes:

> The problem is that the thing is still twice slower than the
> equivalent Perl program, which I haven't bothered to optimize at all.
> Oh well.
>
> In case anyone is still interested, after these two optimizations, the
> code looks like this:

Starting from your last version, I performed a number of additional
optimizations:
- get rid of members, localize remaining members and module accesses
- use find instead of split
- reduce number of string slicings

With that, I've reduced the time from 7s to 3.7s on my machine, with
my local data. Please find the code below.

Regards,
Martin

#!/usr/local/bin/python

import string

class Dpkg_Reader:
def __init__(self, file):

self.lines = open(file).readlines()
self.index = 0

def next_package(self):
package = {}
if self.index == len(self.lines):
return None # EOF
index = self.index # Localize everything
lines = self.lines
find = string.find
current = lines[index] # Get the first line
index = index + 1
while 1:
line = lines[index]
index = index + 1
if line[0] == ' ' or line[0] == '\t':


current = current + line

continue
pos = find(current, ':')
name = current[:pos]
if current[pos+1] == ' ': # Skip space following :
value = current[pos+2:-1]
else:
value = current[pos:-1]


package[name] = value
if line == "\n":

self.index = index # write back index before returning
return package
current = line

def close(self):
pass

def main():
import time
start=time.time()
installed = process_status('status')
end=time.time()
print "status:",end-start
start=end
sizes = process_available('available', installed)
end=time.time()
print "available:",end-start
start=end
lst = sizes.items()
for i in range(0,len(lst)):
k, v = lst[i]
lst[i] = (v,k)
lst.sort()
end=time.time()
print "sort:",end-start
start=end
#for pack in lst:
# print "%s: %d" % (pack[1],pack[0])

Gordon McMillan

unread,
Sep 8, 1999, 3:00:00 AM9/8/99
to Hrvoje Niksic, pytho...@python.org
> Hrvoje Niksic <hni...@srce.hr> writes:
>
> > But I'll try it [slurp all the lines at once and iterate over them],
> > for the sake of argument, and see if the speedup is worth it.
>
> Yes, this does speed things up. Instead of 27 CPU seconds, the
> program now spends 19.5 seconds of CPU time. Along with:
>
> > > Also, instead of calling next_header on a line-by-line basis, how
> > > about calling it once per package and returning a dict containing
> > > the header names as keys?
>
> This buys me additional 2.5 CPU seconds.
>
> The problem is that the thing is still twice slower than the
> equivalent Perl program, which I haven't bothered to optimize at
> all. Oh well.

I'd guess you'd find this getting quite a bit closer to Perl speeds:

#!/usr/bin/python

import string

class Dpkg_Reader:
def __init__(self, file):

self.__current = ''
#data = open(file).read()
data = file
self.__pkgs = string.split(data, "\n\n")
self.__index = 0

def next_package(self, split=string.split, join=string.join,
find=string.find):
if self.__index == len(self.__pkgs):
return None
data = self.__pkgs[self.__index]


self.__index = self.__index + 1

if find(data, '\n ') > -1:
data = split(data, '\n ')
data = join(data, ' ')
if find(data, '\n\t') > -1:
data = split(data, '\n\t')
data = join(data, ' ')
data = split(data, '\n')
package = {}
for line in data[:-1]:
try:
name, value = split(line, ': ', 1)
except:
name, value = split(line, ':', 1)
package[name] = value
return package

def close(self):
pass

txt = """\


Package: telnet
Status: install ok installed
Priority: standard
Section: net
Installed-Size: 130
Maintainer: Herbert Xu <her...@debian.org>
Source: netkit-telnet
Version: 0.14-4
Replaces: netstd
Depends: libc6 (>= 2.1), libncurses4 (>= 4.2-3.1)
Description: The telnet client.
The telnet command is used for interactive communication with another
host using the TELNET protocol.

"""


def test():
import time
start = time.time()
for i in xrange(100):
reader = Dpkg_Reader(txt*100)
while 1:
rslt = reader.next_package()
if rslt is None:
break
print "Took %4.2f secs" % (time.time()-start)

def main():
installed = process_status('/var/lib/dpkg/status')
sizes = process_available('/var/lib/dpkg/available', installed)
lst = sizes.keys() lst.sort(lambda a, b, sizes=sizes:
cmp(sizes[b], sizes[a])) for pack in lst:

print "%s: %d" % (pack, sizes[pack])

if __name__ == '__main__':
test()


- Gordon

Mike Fletcher

unread,
Sep 8, 1999, 3:00:00 AM9/8/99
to
I do not have access to testing files, but I would guess the following would
be a speed up (untested pseudo-code). At the very least, it makes the code
easier to follow ;) . It minimizes memory overhead, doesn't bother with
counters at all, uses list indexing to find the end of the package, stores
the last name and value explicitly in a local variable (in fact, I could
have used the name and value variables, as they would always have the last
value, but that wouldn't be very readable), and generally rewrites the whole
algorithm.

def prime( self, heuristic ):
'''
prime buffer with "heuristic" lines
Should be ~2X larger than the largest possible
entry in file.
'''
self.lines[len(self.lines):] = self.file.readlines(heuristic)

def next_package(self, heuristic=1000, split=string.split,
strip=string.strip):
# initial priming of the buffer, in theory we might run past,
# so use a large heuristic...
if len( self.lines) < heuristic *2:
self.prime( heuristic )
package = {}
try:
packageindex = self.lines.index( '\n' )
except ValueError:# could do more checks here...
packageindex = len ( self.lines)
last = None
for line in self.lines[packageindex:]:
# is this a continuation
if line[0] in (' ' ,"\t"):
if last:
value = package[ last[0] ] = last[1]+line
last = (last[0], value)
else:
print '''Continued line follows null line!!!'''
# must be an opening/definition
else:
# get "cleaned up" versions
name, value = map( strip, split( line, ':', 1))
# store
package[name] = value
# save for continuation and deletion
last = (name, value)
del self.lines[ packageindex+1:]
return package # will be an empty dictionary if there is no next package

Enjoy yourselves,
Mike


Robert Vollmert

unread,
Sep 8, 1999, 3:00:00 AM9/8/99
to pytho...@python.org
On Wed, Sep 08, 1999 at 05:37:38PM +0200, Hrvoje Niksic wrote:
> As a Python exercise, I wrote a simple program to "scratch an itch",
> i.e. do something useful. However, I found that Python's lack of
> speed really bytes me here, so I'd like to hear suggestions for
> speedup. People who don't like that kind of topic, please skip to the
> following article. Others, read on.

...

> The program was quite easy to write, and easy to read afterwards. The
> problem is that it is also quite slow. On my system, it takes about
> 27 CPU seconds (as reported by `time' shell builtin) to do the work,
> which can extend to more than a minute of real time, depending on the
> system load.
>

> As a comparison, the equivalent Perl program does the same thing in 9
> CPU seconds. I tried everything I knew to make the Python version
> fast. I tried to use `re' to avoid returning headers other than the
> ones we're interested in. I tried changing self.__current to just
> current to avoid a dictionary lookup. I tried to make self.__current
> a list, to avoid the expensive `current = current + line' operation.
> All of these things made the program measure slower.
>

> I would really appreciate some suggestions. The code is not large,
> and is (I hope) rather elegant. I am a Python beginner, so I'd also
> appreciate tips on Python style and OO technique. I'll post/mail the
> Perl equivalent on demand.

Hi,

the following code is about five times faster here. That means it's
faster than perl, I suppose :-). It is of course not quite as general
as your's, but it seems to fit the job nicely. For this kind of
problem, I usually don't bother with OO.

The main speed advantage seems to be that the files aren't processed
line by line, which is of course very memory consuming.

HTH,
Robert

#! /usr/bin/env python

import string

def get_installed():
fstatus = open('/var/lib/dpkg/status', 'r')
status = fstatus.read()
fstatus.close()
status = string.split(status, '\n\n')
installed = []
for package in status:
fields = string.split(package, '\n')
name = fields[0][9:]
for line in fields:
if line[:7] == 'Status:':
if string.split(line[8:])[-1] == 'installed':
installed.append(name)
break
return installed

def get_sizes(packages):
favailable = open('/var/lib/dpkg/available', 'r')
available = favailable.read()
favailable.close()
available = string.split(available, '\n\n')
results = []
for package in available:
fields = string.split(package, '\n')
name = fields[0][9:]
if name in packages:
for line in fields:
if line[:15] == 'Installed-Size:':
results.append(name, int(line[16:]))
break
return results

def main():
results = get_sizes(get_installed())
results.sort(lambda a, b: cmp(b[1], a[1]))
for r in results:
print '%s: %d' % r

if __name__ == '__main__':
main()

--
Robert Vollmert rvol...@gmx.net

Tim Peters

unread,
Sep 8, 1999, 3:00:00 AM9/8/99
to pytho...@python.org
[Hrvoje Niksic, writes a pretty text-crunching program in Python
that's-- horrors! --slower than a Perl workalike]

[Robert Vollmert, among others, does various irksome things to
speed it up, and remarks ...]


> The main speed advantage seems to be that the files aren't
> processed line by line, which is of course very memory consuming.

This probably bears repetition about once a month <wink>: Python does
line-at-a-time input on top of the platform's stdio package. On the popular
platforms, Perl does line-at-a-time input by breaking into the platform's
stdio package, peeking and poking C FILE structs directly in
platform-specific ways. This can give Perl a huge speed advantage on
line-at-a-time input; indeed, on Windows, Perl does line-at-a-time input
faster than pure optimized C/C++ code using Microsoft's fgets, and about 3x
faster than Python under Win95. Perl's speed advantage here is directly
proportional to the stupidity of the platform's stdio implementation ("read
a line from a file? who on earth would want to do that?" <wink>).

eschew-files-embrace-3d-buttons-ly y'rs - tim

Mark Nottingham

unread,
Sep 9, 1999, 3:00:00 AM9/9/99
to Hrvoje Niksic
Have you tried profiling it? I.e.,

[...]

if __name__ == '__main__':
import profile
profile.run('main()')


Moshe Zadka

unread,
Sep 9, 1999, 3:00:00 AM9/9/99
to pytho...@python.org

Note an important speed-up in concatenating strings: use cStringIO:

a=cStringIO.StringIO()
for i in range(10000):
a.write('h')
a=a.getvalue()

rather then:

a=''
for i in range(10000):
a=a+'h'

Timothy R Evans

unread,
Sep 9, 1999, 3:00:00 AM9/9/99
to
Hrvoje Niksic <hni...@srce.hr> writes:

> As a Python exercise, I wrote a simple program to "scratch an itch",
> i.e. do something useful. However, I found that Python's lack of
> speed really bytes me here, so I'd like to hear suggestions for
> speedup. People who don't like that kind of topic, please skip to the
> following article. Others, read on.
>

> When browsing my Debian packages, I found I often wanted to know the

> size of the installed packages. To achieve this, I wrote a simple
> program that prints the installed packages and their sizes, all sorted
> by size. Output looks like this:
>
> tetex-base: 27652
> tetex-extra: 27118
> cmucl-safe: 22174
> mozilla: 13007
> timidity-patches: 11662
> locales: 11489
> quake2: 10800
> [...]
>
> To achieve this, the program needs to parse two files:
> /var/lib/dpkg/status, to find which packages are installed, and

> /var/lib/dpkg/available, to find the size of the installed packages.
[snip]

I had a very similar problem, I have this University project to
visualise the Debian package file, so I needed to read in into useful
language (C was discard as fast but too much work). The read tiem was
really slow, so I only do it once. After loading the file and
creating a data structure that stores the information, I pickle it.
Read time drops by an order of magnitude. Not so applicable to your
situation possibly but you could use it if you don't install new
packages often.

--
Tim Evans

Fredrik Lundh

unread,
Sep 9, 1999, 3:00:00 AM9/9/99
to
Hrvoje Niksic <hni...@srce.hr> wrote:
> > - reading a file one line at a time (self.__fp.readline())
>
> I don't see an alternative to this, except to read the whole file at
> once, which I am trying to avoid, as the files are large.

note that:

lines = fp.readlines(16384)
if not lines:
break
for line in lines:
...

is usually much faster than

line = fp.readline()
if not line:
break
...

(adjust the readlines buffer size as suitable)

but sure, Python's standard file object is slow.
working on a replacement (as an image processing
guy, I've never understood why anyone ever used
stdio), but there's so much code to write, and
sooo little time....

</F>


Hrvoje Niksic

unread,
Sep 9, 1999, 3:00:00 AM9/9/99
to
Timothy R Evans <tr...@pc142.cosc.canterbury.ac.nz> writes:

> I had a very similar problem, I have this University project to
> visualise the Debian package file, so I needed to read in into
> useful language (C was discard as fast but too much work). The read
> tiem was really slow, so I only do it once. After loading the file
> and creating a data structure that stores the information, I pickle
> it. Read time drops by an order of magnitude. Not so applicable to
> your situation possibly but you could use it if you don't install
> new packages often.

Thanks for the suggestion. However, I do install new packages often,
so that's not really an option.

Hrvoje Niksic

unread,
Sep 9, 1999, 3:00:00 AM9/9/99
to
Robert Vollmert <rvol...@gmx.net> writes:

> the following code is about five times faster here. That means it's
> faster than perl, I suppose :-).

Wow, this *is* fast. However, it also loses in generality. For
instance, if I wanted to do anything else, I'd have to rewrite it
completely.

meow

unread,
Sep 9, 1999, 3:00:00 AM9/9/99
to
Fredrik Lundh <fre...@pythonware.com> wrote:

> but sure, Python's standard file object is slow.
> working on a replacement (as an image processing
> guy, I've never understood why anyone ever used
> stdio), but there's so much code to write, and
> sooo little time....

It's not exactly general-purpose right now (there are still methods to
add), but I was able to get about an order of magnitude speed-up on
readline() by adapting the QIO functions from INN. In its current
form it probably won't work right on anything but Unix/Linux (I can
never remember what to do about those pesky line endings). FWIW,
readlines() is about neck and neck with the built-in

Anyway, it's only 10K and in C if anyone wants to play.
<URL:http://members.xoom.com/meowing/programs/qio.tar.gz>

Aahz Maruch

unread,
Sep 9, 1999, 3:00:00 AM9/9/99
to
In article <Pine.LNX.4.10.990909...@starship.python.net>,

I'll test this if I have to, but do you know off-hand how this compares
to

a = []
for i in range(10000):
a.append('h')
a = string.join(a,'')

What if you use range(100000) or range(1000000)?
--
--- Aahz (@netcom.com)

Androgynous poly kinky vanilla queer het <*> http://www.rahul.net/aahz/
Hugs and backrubs -- I break Rule 6 (if you want to know, do some research)

Skip Montanaro

unread,
Sep 9, 1999, 3:00:00 AM9/9/99
to Aahz Maruch

Aahz> I'll test this if I have to, but do you know off-hand how this
Aahz> compares to [list append]
....
Aahz> What if you use range(100000) or range(1000000)?

Executing

import cStringIO, time, string

def doit(n, r):
print "doing", n, "iterations - range function ==", r

c = time.clock()
for i in r(n):
pass
print "loop overhead:", time.clock() - c

c = time.clock()
a=cStringIO.StringIO()
for i in r(n):
a.write('h')
a=a.getvalue()
print "cStringIO:", time.clock() - c

c = time.clock()
a = []
for i in r(n):


a.append('h')
a = string.join(a,'')

print "list append:", time.clock() - c

print

for i in [1000, 5000, 10000, 50000, 100000, 1000000]:
doit(i, range)
doit(i, xrange)

I got

doing 1000 iterations - range function == <built-in function range>
loop overhead: 0.0
cStringIO: 0.01
list append: 0.01

doing 1000 iterations - range function == <built-in function xrange>
loop overhead: 0.0
cStringIO: 0.01
list append: 0.01

doing 5000 iterations - range function == <built-in function range>
loop overhead: 0.01
cStringIO: 0.03
list append: 0.04

doing 5000 iterations - range function == <built-in function xrange>
loop overhead: 0.01
cStringIO: 0.03
list append: 0.04

doing 10000 iterations - range function == <built-in function range>
loop overhead: 0.02
cStringIO: 0.06
list append: 0.08

doing 10000 iterations - range function == <built-in function xrange>
loop overhead: 0.02
cStringIO: 0.07
list append: 0.08

doing 50000 iterations - range function == <built-in function range>
loop overhead: 0.11
cStringIO: 0.32
list append: 0.42

doing 50000 iterations - range function == <built-in function xrange>
loop overhead: 0.1
cStringIO: 0.32
list append: 0.39

doing 100000 iterations - range function == <built-in function range>
loop overhead: 0.22
cStringIO: 0.65
list append: 0.92

doing 100000 iterations - range function == <built-in function xrange>
loop overhead: 0.18
cStringIO: 0.63
list append: 0.88

doing 1000000 iterations - range function == <built-in function range>
loop overhead: 2.22
cStringIO: 6.55
list append: 10.27

doing 1000000 iterations - range function == <built-in function xrange>
loop overhead: 1.77
cStringIO: 6.35
list append: 10.04

which suggests there is a performance boost for cStringIO vs list append and
xrange vs range for large numbers, but that for loops with less than about
10,000 iterations, it probably doesn't matter what you use.

Tim Peters

unread,
Sep 9, 1999, 3:00:00 AM9/9/99
to pytho...@python.org
[Robert Vollmert]

> the following code is about five times faster here. That means it's
> faster than perl, I suppose :-).

[Hrvoje Niksic]


> Wow, this *is* fast. However, it also loses in generality. For
> instance, if I wanted to do anything else, I'd have to rewrite it
> completely.

Was the Perl code to which you're comparing it as general as your Python
code?

conservation-of-agony-ly y'rs - tim

Doug Hellmann

unread,
Sep 10, 1999, 3:00:00 AM9/10/99
to

Probably not every time you run your app, though. You could parse the
data, pickle the results, and keep up with the mod times on the two
files. If the source data is newer than the pickle database you have to
update the pickle data. That would slow down some of your processing,
but in most cases would still let you take advantage of the faster
pickle version.

Doug

Les Schaffer

unread,
Sep 10, 1999, 3:00:00 AM9/10/99
to
> What if you use range(100000) or range(1000000)?
elapsed time
(sec)
trying NUM = 10
a=a+'h' --> 0.000987
cStringIO --> 0.000954
a.append('h') --> 0.001573
trying NUM = 100
a=a+'h' --> 0.004399
cStringIO --> 0.005185
a.append('h') --> 0.007262
trying NUM = 1000
a=a+'h' --> 0.056781
cStringIO --> 0.054581
a.append('h') --> 0.072653
trying NUM = 10000
a=a+'h' --> 3.217044
cStringIO --> 0.535160
a.append('h') --> 0.907285
trying NUM = 100000
a=a+'h' --> 324.303910
cStringIO --> 5.138107
a.append('h') --> 7.676265
trying NUM = 1000000

still running :-) [ i want skip m's computer !!! :-) ]

--
____ Les Schaffer ___| --->> Engineering R&D <<---
Theoretical & Applied Mechanics | Designspring, Inc.
Center for Radiophysics & Space Research | http://www.designspring.com/
Cornell Univ. scha...@tam.cornell.edu | l...@designspring.com

Hrvoje Niksic

unread,
Sep 10, 1999, 3:00:00 AM9/10/99
to
Doug Hellmann <doughe...@home.com> writes:

> > Thanks for the suggestion. However, I do install new packages
> > often, so that's not really an option.
>
> Probably not every time you run your app, though.

I use the application to see which packages waste too much space, so
that I know which ones to remove. Not to mention that I run nightly
updates of my Debian system. It is practically *guaranteed* that the
files will have changed between two runs of the program.

Sorry. I like the idea of pickling/marshalling, and I'd love to be
able to use it here. But it can't work that way.

> You could parse the data, pickle the results, and keep up with the
> mod times on the two files. If the source data is newer than the
> pickle database you have to update the pickle data.

The problem is that to "update" the pickle data (which I'd have to do
every time, as explained above), I have to parse the text files. And
it's the parsing that's slow, not the post-processing. If I need to
parse them every time, then I don't need pickling.

Hrvoje Niksic

unread,
Sep 10, 1999, 3:00:00 AM9/10/99
to
"Tim Peters" <tim...@email.msn.com> writes:

> Was the Perl code to which you're comparing it as general as your
> Python code?

The Perl code was almost equivalent to the first posted version of
Python code -- the only important difference is that it used a
functional approach rather than OOP. I can mail it to you if you are
curious.

Generality is important to me because another day I might wish to do
something else with those files, like draw a tree of dependencies, or
something. In that case, I'd surely want to reuse the existing
reader/parser. Speed is good, but not so good to justify loss of code
reuse.

Robert's solution is ideologically similar to the `grep' solution
proposed below.

Carel Fellinger

unread,
Sep 10, 1999, 3:00:00 AM9/10/99
to
Hrvoje Niksic <hni...@srce.hr> wrote:
> As a Python exercise, I wrote a simple program to "scratch an itch",
> i.e. do something useful. However, I found that Python's lack of
> speed really bytes me here, so I'd like to hear suggestions for
> speedup. People who don't like that kind of topic, please skip to the
> following article. Others, read on.

although many have answered you already, I can't resist in posting yet
another one. Your code takes almost 7 secs on my machine, mine takes a
little over 1 sec, and it's still OO and generic. Besides, I didn't use
any fance Python tricks, just the combination of lasy evaluation and bulk
IO did the trick here.

-----------------
#!/usr/bin/python

import string

class Dpkg_Munger:

def munge_all(self, file):
for raw_package in string.split(open(file).read(), '\n\n'):
if raw_package!='':
self.raw_package = '\n' + raw_package + '\n'
self.munge_one(self)

def __getitem__(self, key, find=string.find):
i = find(self.raw_package, '\n'+key+':')
if i == -1:
raise IndexError
i = i + len(key) + 3
if self.raw_package[i]==' ':
i = i + 1
j = i - 1
while 1:
j = find(self.raw_package,'\n',j+1)
if self.raw_package[j+1] != ' ' and self.raw_package[j+1] != '\t':
break
return self.raw_package[i:j]


class process_status(Dpkg_Munger):

def __init__(self, file):
self.installed = {}
self.munge_all(file)

def munge_one(self, package):
name = package['Package']
status = package['Status']
if string.split(status, ' ')[2] == 'installed':
self.installed[name] = 1


class process_available(Dpkg_Munger):

def __init__(self, file, installed):
self.installed, self.sizes = installed, {}
self.munge_all(file)

def munge_one(self, package):
name = package['Package']
size = package['Installed-Size']
if self.installed.has_key(name):
self.sizes[name] = string.atoi(size)


def main():
installed = process_status('/var/lib/dpkg/status').installed
sizes = process_available('/var/lib/dpkg/available', installed).sizes


lst = sizes.keys()
lst.sort(lambda a, b, sizes=sizes: cmp(sizes[b], sizes[a]))
for pack in lst:
print "%s: %d" % (pack, sizes[pack])

if __name__ == '__main__':
main()

--
groetjes, carel

Carel Fellinger

unread,
Sep 11, 1999, 3:00:00 AM9/11/99
to
Carel Fellinger <cfel...@iae.nl> wrote:
> another one. Your code takes almost 7 secs on my machine, mine takes a
> little over 1 sec, and it's still OO and generic. Besides, I didn't use

two small adjustments in the munge_one methods brought it's time just below
the magical 1 sec! so now mine's even more then 7 times faster:)

> def munge_one(self, package):
> name = package['Package']
> status = package['Status']
> if string.split(status, ' ')[2] == 'installed':

this should read:
if status[-10:]==' installed':

> def munge_one(self, package):
> name = package['Package']
> size = package['Installed-Size']

the above line should go in the body of the next if-statement

> if self.installed.has_key(name):
> self.sizes[name] = string.atoi(size)

--
groetjes, carel

Carel Fellinger

unread,
Sep 11, 1999, 3:00:00 AM9/11/99
to
Hrvoje Niksic <hni...@srce.hr> wrote:
> "Tim Peters" <tim...@email.msn.com> writes:

>> Was the Perl code to which you're comparing it as general as your
>> Python code?

why is it that all of a sudden Tim's postings don't make it to the newsgroup?
I don't recall having seen the above before, and i got the same impression in
other treads as well. And as you all know, whithout Tim's postings there is
hardly any point in reading the newsgroup at all:)

> Robert's solution is ideologically similar to the `grep' solution
> proposed below.

I seem to have missed this grep solution too. or was it part of Tim's posting?

--
groetjes, carel

Robert Vollmert

unread,
Sep 11, 1999, 3:00:00 AM9/11/99
to pytho...@python.org
On Wed, Sep 08, 1999 at 05:37:38PM +0200, Hrvoje Niksic wrote:
> To achieve this, the program needs to parse two files:
> /var/lib/dpkg/status, to find which packages are installed, and
> /var/lib/dpkg/available, to find the size of the installed packages.
> Both files are textual, in the form of headers. Like this:

While this is hardly relevant to python anymore, I suppose it might
interest some. While looking at the relevant data files, I found that
you only need to parse the status file as it actually contains all
information about installed packages, including an 'Installed-Size'
header.

It describes the currently installed packages, while the available
file is about the newest packages available and might contain invalid
information regarding the currently installed version.

--
Robert Vollmert rvol...@gmx.net

Tim Peters

unread,
Sep 11, 1999, 3:00:00 AM9/11/99
to pytho...@python.org
[Carel Fellinger, snips some priceless verbiage from Tim]

> why is it that all of a sudden Tim's postings don't make it to
> the newsgroup? I don't recall having seen the above before, and
> i got the same impression in other treads as well.

This is an exciting new FeaBug of the timbot delivery mechanism: in a fit
of maniacal egoecentricity, the timbot reprogrammed many of the world's news
readers to show its responses only to people it deemed worthy of reading
them. I'm sure your absence from the list of approved readers is a
temporary glitch due to inferior operating systems (like Windows, MacOS,
EPOC32, Unix, Linux and BeOS) not yet fully grasping the elegance of the
timbot's new "Discriminating Display(tm)" code.

Another cause may be that the timbot emails responses to the mailing-list
version of c.l.py, and the mail->news delivery agent at CNRI suffers
periodic bouts of constipation. The anonymous Barry
"${you_name_it}mas...@python.org" Warsaw spends many hours with his fist in
up to the elbow, but somtimes Perl blockage requires both fists to unplug.

> And as you all know, whithout Tim's postings there is hardly any point in
> reading the newsgroup at all:)

None, I'd say. Well, actually, Tim's (that's me) postings are generally
long, rambling, off-topic, and project an irritating smugness as if their
very lack of sensible content rendered them immune from criticism. The
timbot's postings, however, are pure joy. I've been looking into getting us
separate email addresses so people have an easier time telling us apart.

it-doesn't-help-that-the-mechanical-bastard-stole-my-
signoff-style-either-ly y'rs - tim

Cliff, or a close facsimile

unread,
Sep 13, 1999, 3:00:00 AM9/13/99
to
Pada Sat, 11 Sep 1999 16:29:56 -0400, Tim Peters bilang:

| I'm sure your absence from the list of approved readers is a
| temporary glitch due to inferior operating systems (like Windows, MacOS,
| EPOC32, Unix, Linux and BeOS)

You just ruled out a whole bunch of OS's there...so what does the timbot
run on then? ITS? A Lisp Machine? AmigaOS?


--
cliff crawford http://www.people.cornell.edu/pages/cjc26/
There are more stars in the sky than there are
-><- grains of sand on all the beaches of the world.

Cedric Adjih

unread,
Sep 15, 1999, 3:00:00 AM9/15/99
to
Hrvoje Niksic <hni...@srce.hr> wrote:
> Doug Hellmann <doughe...@home.com> writes:

>> > Thanks for the suggestion. However, I do install new packages
>> > often, so that's not really an option.
>>
>> Probably not every time you run your app, though.

> I use the application to see which packages waste too much space, so
> that I know which ones to remove. Not to mention that I run nightly
> updates of my Debian system. It is practically *guaranteed* that the
> files will have changed between two runs of the program.
>
> Sorry. I like the idea of pickling/marshalling, and I'd love to be
> able to use it here. But it can't work that way.

Maybe you can update the python structures at night too
(the same way the 'locate' database is built at night,
because a 'find' on the whole system is too slow).

>> You could parse the data, pickle the results, and keep up with the
>> mod times on the two files. If the source data is newer than the
>> pickle database you have to update the pickle data.
>
> The problem is that to "update" the pickle data (which I'd have to do
> every time, as explained above), I have to parse the text files. And
> it's the parsing that's slow, not the post-processing. If I need to
> parse them every time, then I don't need pickling.

Yes, the idea was to keep a cache of the pythonic database, so
whenever it has changed, you rebuild it, and save it, for reusing
it later.
Doing something clever would involve using "diff" for instance, but
is probably more complex than making a C module for Python :-).

-- Cedric Adjih

Neil Schemenauer

unread,
Sep 15, 1999, 3:00:00 AM9/15/99
to
Fredrik Lundh <fre...@pythonware.com> wrote:
>Hrvoje Niksic <hni...@srce.hr> wrote:
>> > - reading a file one line at a time (self.__fp.readline())
>>
>> I don't see an alternative to this, except to read the whole file at
>> once, which I am trying to avoid, as the files are large.
>
>note that:
>
> lines = fp.readlines(16384)
> if not lines:
> break
> for line in lines:
> ...
>
>is usually much faster than
>
> line = fp.readline()
> if not line:
> break
> ...

What would be cool is if readlines() returned a lazy sequence
object (ie. only read as much is needed using a certain block
size). This should give the advantages of readlines() without
the concern about sucking up a huge file all at once.

I implemented this idea (probably badly) in pure Python
and got about a 2x speedup verses readline(). It is a small
module so I will post it here.


being-lazy-has-its-advantages'ly Neil


import string

class BlockFile:
def __init__(self, file, blocksize=1024*40, sep='\n'):
self.file = file
self.blocksize = blocksize
self.sep = sep
self.line = -1
self.lines = []
self.end = ''

def __getitem__(self, i):
try:
self.line = self.line+1
return self.lines[self.line]
except IndexError:
self.line = 0
self._get_block()
return self.lines[0]

def _get_block(self):
data = self.file.read(self.blocksize)
if len(data) == 0:
raise IndexError
self.lines = string.split(data, self.sep)
self.lines[0] = self.lines[0] + self.end
if len(data) == self.blocksize:
self.end = self.lines[-1]
del self.lines[-1] # this _should_ be fast
else:
self.end = ''

def test_block(input):
for l in BlockFile(open(input)):
pass

def test_normal(input):
f = open(input)
while 1:
l = f.readline()
if not l:
break

def measure(function, *args):
import time
t1 = time.time()
apply(function, args)
t2 = time.time()
apply(function, args)
return t2-t1

if __name__ == '__main__':
import sys
print 'block time', measure(test_block, sys.argv[1])
print 'normal time', measure(test_normal, sys.argv[1])

meow

unread,
Sep 15, 1999, 3:00:00 AM9/15/99
to pytho...@python.org
In nnmh:python, William Tanksley <wtan...@dolphin.openprojects.net> wrote:

> I wonder -- suppose the official Python were to replace the current
> readlines() with this one (or with something similar). Would it be
> possible to preserve all the current behavior (except for the memory
> usage, of course)?

It's tough. I've been trying to do that with my attempt, and the
performance hit of keeping the seeks synchronized makes it not worth
the trouble (IMHO) to support writing or having more than one object
able to read from the same descriptor.

If a read only object is good enough, a cleaned-up version of the QIO
hack thingy I mentioned a few days ago is uploaded. I think I've got
its methods, attributes and exceptions down to a faithful reproduction
of Python's native file object now (except for writing, of
course). readinto() is absent for now since it's undocumented. It can
go in if someone can supply some Python k0dez showing what buffer
objects are good for =)

Also added some convenience features like user-specified line
terminators, automatic chopping, and a way to read SMTP-like sockets
wicked fast.

After adding the requisite bloat, vanilla qio.readline() is still
about 5-8 times as fast as native Python looping through big text
files (where stdio is glibc's). readlines() and read() are not big
wins (and in pathological cases they're slower), but they're there for
completeness.

Details at <URL:http://members.xoom.com/meowing/python/>

Meow.

Alex

unread,
Sep 15, 1999, 3:00:00 AM9/15/99
to

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


The fileinput module does something like that in the background, I
think...

I would check out fileinput first, but if that doesn't suit you, you
could use what I rolled for myself a little while back. I think it's
slightly easier to have multiple instances this way:

class lazy_file:
def __init__ (self, filename, sizehint = 10**7, mode = 'r'):
self.file = open (filename, mode)
self.line_count=0
self.sizehint = sizehint
self.__refresh_buffer ()
def __refresh_buffer (self):
self.buffer = self.file.readlines (self.sizehint)
self.buffer_length = len (self.buffer)
if self.buffer:self.buffer_index = 0
else: self.buffer_index = None
def readline (self):
if self.buffer_index==self.buffer_length:
self.__refresh_buffer ()
if not self.buffer: self.line = ''
else:
self.line = self.buffer[self.buffer_index]
self.buffer_index = self.buffer_index + 1
if self.line:self.line_count=self.line_count+1
return self.line

class File_loop:
def __init__(self,filename,sizehint=10**7,mode='r'):
self.file=lazy_file(filename,sizehint,mode)
def __getitem__(self,index):
if self.file.readline():return self.file.line
else:raise IndexError, 'File finished'

You can just use it like

for line in File_loop('filename'):
munge(line)

Alex.

- --
If cars were like computers, they would go 300 m.p.h. and get a hundred
miles to the gallon and cost $50. Except that twice a month someone a
thousand miles away would be able to blow up the car, killing everyone
nearby.

-----BEGIN PGP SIGNATURE-----
Version: PGP for Personal Privacy 5.0
Comment: Processed by Mailcrypt 3.5b6, an Emacs/PGP interface
Charset: noconv

iQA/AwUBN+BT93zN4ZFYpUPAEQJQFACgrZ2asMAo6PPw7He+yVZsXd7qyHQAoMOg
7szYmUaTVubWw0fXZMWxdjeh
=6uiN
-----END PGP SIGNATURE-----

William Tanksley

unread,
Sep 16, 1999, 3:00:00 AM9/16/99
to
On Wed, 15 Sep 1999 23:21:02 GMT, Neil Schemenauer wrote:
>Fredrik Lundh <fre...@pythonware.com> wrote:
>>Hrvoje Niksic <hni...@srce.hr> wrote:
>>> > - reading a file one line at a time (self.__fp.readline())

>>> I don't see an alternative to this, except to read the whole file at
>>> once, which I am trying to avoid, as the files are large.

>What would be cool is if readlines() returned a lazy sequence


>object (ie. only read as much is needed using a certain block
>size). This should give the advantages of readlines() without
>the concern about sucking up a huge file all at once.

I wonder -- suppose the official Python were to replace the current


readlines() with this one (or with something similar). Would it be
possible to preserve all the current behavior (except for the memory
usage, of course)?

That sure would seem nice. It wouldn't solve all of our problems, but it
would make the most obvious speedup idiom behave well.

--
-William "Billy" Tanksley

Alexander Williams

unread,
Sep 16, 1999, 3:00:00 AM9/16/99
to
On Thu, 16 Sep 1999 00:19:20 GMT, William Tanksley
<wtan...@dolphin.openprojects.net> wrote:

>That sure would seem nice. It wouldn't solve all of our problems, but it
>would make the most obvious speedup idiom behave well.

As I've often been heard to say (generally after a marathon Haskell
hacking session):

You can never have too much laziness.

Personally, I'd love to have almost Haskell-level laziness on tap in
Python, especially in things like file IO. There's a lot to be said
for being able to go slurping through a file and having that stuff
done for you under the hood.

--
Alexander Williams (tha...@gw.total-web.net)
"In the end ... Oblivion Always Wins."

Marc Christiansen

unread,
Sep 17, 1999, 3:00:00 AM9/17/99
to
Neil Schemenauer <nasc...@ucalgary.ca> wrote:
> What would be cool is if readlines() returned a lazy sequence
> object (ie. only read as much is needed using a certain block
> size). This should give the advantages of readlines() without
> the concern about sucking up a huge file all at once.
>
> I implemented this idea (probably badly) in pure Python
> and got about a 2x speedup verses readline(). It is a small
> module so I will post it here.

Here it's the other way round:

14.SRam:> python BlockFile.py daten:programme/python/e4xtc.txt
block time 9.74225389957
normal time 5.96557199955

This is on an Amiga, using Python 1.5.2, the file is 276089 Bytes long.
I tried also with blocksizes of 100k and 8k for BlockFile, but the
difference was about nought, so the speed difference is not blocksize
related.

And there is a error in _get_block(self):

> self.lines[0] = self.lines[0] + self.end

This should read: self.lines[0] = self.end + self.lines[0]

Ciao
Marc
--
"Claims about the evils of cannabis have been exaggerated: The question now
is not whether it should be legalised, but how and when."
http://www.newscientist.com/ns/980221/contents.html
PGP-Key fingerprint EE 2E C1 06 D1 28 89 D2 F9 CE 35 7B F8 F4 42 52

Neil Schemenauer

unread,
Sep 19, 1999, 3:00:00 AM9/19/99
to
William Tanksley <wtan...@dolphin.openprojects.net> wrote:
>I wonder -- suppose the official Python were to replace the current
>readlines() with this one (or with something similar). Would it be
>possible to preserve all the current behavior (except for the memory
>usage, of course)?

Not easily. The code I wrote is only good for putting in a for
loop. If you access the lines out of order things go crazy.
This is similar to the fileinput module.

I coded a C version of this. The speed is quite impressive. For
random 3 MB text file (using time()):

BlockFile: 0.420621991158
readline: 7.21456801891
readlines: 0.672585010529

Even with a cold cache I get 0.6 seconds. Incidently, this C
version is over 2.5 times faster than:

perl -e "while (<>) {}"

If anyone is interested I can make the code available.

Neil

--
"The percentage of users running Windows NT Workstation 4.0 whose PCs stopped
working more than once a month was less than half that of Windows 95 users."
-- microsoft.com/ntworkstation/overview/Reliability/Highest.asp

William Tanksley

unread,
Sep 19, 1999, 3:00:00 AM9/19/99
to
On Sun, 19 Sep 1999 02:56:03 GMT, Neil Schemenauer wrote:
>William Tanksley <wtan...@dolphin.openprojects.net> wrote:
>>I wonder -- suppose the official Python were to replace the current
>>readlines() with this one (or with something similar). Would it be
>>possible to preserve all the current behavior (except for the memory
>>usage, of course)?

>Not easily. The code I wrote is only good for putting in a for
>loop. If you access the lines out of order things go crazy.
>This is similar to the fileinput module.

Yes, I see. It's pretty clear that this will have to wait for the 'for'
loop changes we've pondered in the past (i.e. making a distinction between
streams and random access entities).

Hmm... There is a simple stopgap solution. We *could* simply keep track
of accesses, and on the first random access dump everything into memory
and act like a list.

>I coded a C version of this. The speed is quite impressive. For
>random 3 MB text file (using time()):

> BlockFile: 0.420621991158
> readline: 7.21456801891
> readlines: 0.672585010529

>Even with a cold cache I get 0.6 seconds. Incidently, this C
>version is over 2.5 times faster than:
> perl -e "while (<>) {}"

Hey, COOL! Finally we get a sop to our inferiority complex. ;-)

Seriously, Perl and Python have the most amusing codependancy. We both
spend so much time feeling inferior.

It occurs to me that one nice thing to see in Python 1.6 would be a
built-in function and datatype, so a file could be specified perhaps like
this:

for line in url('file:/dos.txt'):

Eventually, perhaps that could become a syntax -- I don't want to look TOO
much like Perl :), but other languages seem to have demonstrated that easy
entry does make for good-looking code -- and after all, that's what we're
here for: to look good. No, to look GREAT.

for line in %file:/dos.txt:

Naw, that doesn't look good in Python. Sure looks nice in Rebol, though.
I guess Python ain't Rebol, huh? :)

>If anyone is interested I can make the code available.

I'm *interested*, but if you make the code available I won't do a blessed
thing about it. I'd love to see it in the official binary so I could use
it, but until then I like to only use things I can depend on portably.
And I'm lazy -- oops, I mean busy.

> Neil

<snore>-ly y'rs

--
-William "Billy" Tanksley

Tim Peters

unread,
Sep 19, 1999, 3:00:00 AM9/19/99
to pytho...@python.org
[meow, about his or her qio module at
http://members.xoom.com/meowing/python/
]
> ...

> Also added some convenience features like user-specified line
> terminators, automatic chopping, and a way to read SMTP-like sockets
> wicked fast.

Yes, qio.readline() is about the most frightening function in the Python
universe now <wink>.

> After adding the requisite bloat, vanilla qio.readline() is still
> about 5-8 times as fast as native Python looping through big text
> files (where stdio is glibc's).

I compiled it for Win32 (btw, "mode_t" was not defined, & there are about a
half dozen warnings about mixing signed and unsigned ints), with
optimization cranked up. Under Win95 it was between 2-3 times as fast as
1.5.2 Python doing a native readline() loop over an 11Mb text file. It's
hard to believe that MS's grudging implementation of stdio is that much
better than glibc's, so not sure what to make of this -- possibly that
*nothing* can run really fast under Windows <0.9 wink>.

> ...

Nice work!

if-only-windows-had-any-text-files<0.7-wink>-ly y'rs - tim

greg andruk

unread,
Sep 20, 1999, 3:00:00 AM9/20/99
to
Tim Peters <tim...@email.msn.com> wrote:

> meow, about his or her

Most people use 'its' or a choice expletive.

> I compiled it for Win32 (btw, "mode_t" was not defined, & there are about a
> half dozen warnings about mixing signed and unsigned ints)

If you saved 'em, I'd love to see the error messages. I don't have
Windows kit handy to check for these things, and am just a bit
surprised it actually works there at all.

> It's hard to believe that MS's grudging implementation of stdio is
> that much better than glibc's

glibc really does have some odd features, like macros where most other
libs seem to use functions and vice versa. It's almost as if they
think maintainability is important or something.

But maybe you were trying to be fair, while standard benchmaking
practice dictates that one stacks the deck ;) To get up to the 8X
range I glued together 15MB of news overview files with an average
line length of 447. In that case the native loop is dominated by
getc(), while with shorter lines the Python loop overhead masks much
of the difference. Size really does matter, especially when one is
comtemplating whipping out one's Python at an all-Perl shop and wants
to make a good show of it.

Neil Schemenauer

unread,
Sep 20, 1999, 3:00:00 AM9/20/99
to
William Tanksley <wtan...@dolphin.openprojects.net> wrote:
>Yes, I see. It's pretty clear that this will have to wait for the 'for'
>loop changes we've pondered in the past (i.e. making a distinction between
>streams and random access entities).

Python 2.0 should be very cool.

>Hmm... There is a simple stopgap solution. We *could* simply keep track
>of accesses, and on the first random access dump everything into memory
>and act like a list.

Yes, that could work. It would be complicated though. Not very
Python like.

>I'm *interested*, but if you make the code available I won't do a blessed
>thing about it. I'd love to see it in the official binary so I could use
>it, but until then I like to only use things I can depend on portably.
>And I'm lazy -- oops, I mean busy.

As someone else asked for it, here it is anyhow:

http://www.acs.ucalgary.ca/~nascheme/python/blockfile.tgz

It should be possible to transparently integrate this with the
fileinput module. I'm to lazy right now to do it. :)


Neil

Fredrik Lundh

unread,
Sep 20, 1999, 3:00:00 AM9/20/99
to
William Tanksley <wtan...@dolphin.openprojects.net> wrote:
> It occurs to me that one nice thing to see in Python 1.6 would be a
> built-in function and datatype, so a file could be specified perhaps like
> this:
>
> for line in url('file:/dos.txt'):

why not just add

import urllib
def lazy(file):
return file.readlines() # for now
def url(url):
return lazy(urllib.urlopen(url))
import __builtin__
__builtin__.url = url

to your startup file ;-)

(I'll leave it to someone else to actually
implement the "lazy" class/factory).

</F>


Darrell

unread,
Sep 20, 1999, 3:00:00 AM9/20/99
to
Ran the test.py and enabled the whole file test as well.
Looks like readline() is a smoker, but read() needs help.

On NT with a 54meg file and average line length of 147

compare readlines()
Using native readlines
3 function calls in 10.539 CPU seconds

Ordered by: standard name

ncalls tottime percall cumtime percall filename:lineno(function)
1 1.237 1.237 10.479 10.479 <string>:1(?)
0 0.000 0.000 profile:0(profiler)
1 0.060 0.060 10.539 10.539 profile:0(testlines(filespec))
1 9.243 9.243 9.243 9.243 test.py:30(testlines)


Using QIO readlines
3 function calls in 10.212 CPU seconds

Ordered by: standard name

ncalls tottime percall cumtime percall filename:lineno(function)
1 1.556 1.556 10.211 10.211 <string>:1(?)
0 0.000 0.000 profile:0(profiler)
1 0.001 0.001 10.212 10.212
profile:0(testqiolines(filespec))
1 8.655 8.655 8.655 8.655 test.py:38(testqiolines)


compare readline() loops
Using native readline
3 function calls in 31.633 CPU seconds

Ordered by: standard name

ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 31.632 31.632 <string>:1(?)
0 0.000 0.000 profile:0(profiler)
1 0.001 0.001 31.633 31.633 profile:0(test(filespec))
1 31.632 31.632 31.632 31.632 test.py:5(test)


Using QIO readline
3 function calls in 5.539 CPU seconds

Ordered by: standard name

ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 5.538 5.538 <string>:1(?)
0 0.000 0.000 profile:0(profiler)
1 0.001 0.001 5.539 5.539 profile:0(testqio(filespec))
1 5.538 5.538 5.538 5.538 test.py:20(testqio)


compare read() of a whole file
standard
Size: 53992540
2 function calls in 2.427 CPU seconds

Ordered by: standard name

ncalls tottime percall cumtime percall filename:lineno(function)
1 2.425 2.425 2.425 2.425 <string>:1(?)
0 0.000 0.000 profile:0(profiler)
1 0.001 0.001 2.427 2.427 profile:0(z = f.read(); print
len(z))


qio
Size: 53992540
2 function calls in 15.612 CPU seconds

Ordered by: standard name

ncalls tottime percall cumtime percall filename:lineno(function)
1 15.550 15.550 15.550 15.550 <string>:1(?)
0 0.000 0.000 profile:0(profiler)
1 0.061 0.061 15.612 15.612 profile:0(z = f.read(); print
len(z))


--
--Darrell

greg andruk

unread,
Sep 21, 1999, 3:00:00 AM9/21/99
to
Darrell <ne...@dorb.com> wrote:

[Thanks for the patch to make Windows happy, BTW.]

> Ran the test.py and enabled the whole file test as well.
> Looks like readline() is a smoker, but read() needs help.

Right, because you end up with two copies of the file in memory if you
use a single read to get the whole file; that's gonna hurt if your
machine is low on RAM. For whole-file slurping, use the native file
objects which handle that case very efficiently.

0 new messages