When browsing my Debian packages, I found I often wanted to know the
size of the installed packages. To achieve this, I wrote a simple
program that prints the installed packages and their sizes, all sorted
by size. Output looks like this:
tetex-base: 27652
tetex-extra: 27118
cmucl-safe: 22174
mozilla: 13007
timidity-patches: 11662
locales: 11489
quake2: 10800
[...]
To achieve this, the program needs to parse two files:
/var/lib/dpkg/status, to find which packages are installed, and
/var/lib/dpkg/available, to find the size of the installed packages.
Both files are textual, in the form of headers. Like this:
Package: telnet
Status: install ok installed
Priority: standard
Section: net
Installed-Size: 130
Maintainer: Herbert Xu <her...@debian.org>
Source: netkit-telnet
Version: 0.14-4
Replaces: netstd
Depends: libc6 (>= 2.1), libncurses4 (>= 4.2-3.1)
Description: The telnet client.
The telnet command is used for interactive communication with another host
using the TELNET protocol.
Package: packaging-manual
Status: purge ok not-installed
Priority: optional
Section: doc
[...]
The program was quite easy to write, and easy to read afterwards. The
problem is that it is also quite slow. On my system, it takes about
27 CPU seconds (as reported by `time' shell builtin) to do the work,
which can extend to more than a minute of real time, depending on the
system load.
As a comparison, the equivalent Perl program does the same thing in 9
CPU seconds. I tried everything I knew to make the Python version
fast. I tried to use `re' to avoid returning headers other than the
ones we're interested in. I tried changing self.__current to just
current to avoid a dictionary lookup. I tried to make self.__current
a list, to avoid the expensive `current = current + line' operation.
All of these things made the program measure slower.
I would really appreciate some suggestions. The code is not large,
and is (I hope) rather elegant. I am a Python beginner, so I'd also
appreciate tips on Python style and OO technique. I'll post/mail the
Perl equivalent on demand.
The code follows:
#!/usr/bin/python
import string
class Dpkg_Reader:
def __init__(self, file):
self.__current = ''
self.__fp = open(file)
def next_header(self):
while 1:
line = self.__fp.readline()
if line == '':
return (None, None) # EOF
if self.__current:
if line == "\n" or (line[0] != ' ' and line[0] != "\t"):
#print "{%s}" % (self.__current)
try:
name, value = string.split(self.__current, ': ', 1)
except:
name, value = string.split(self.__current, ':', 1)
value = value[:-1]
if line == "\n":
self.__current = ''
else:
self.__current = line
return (name, value)
self.__current = self.__current + line
def close(self):
self.__fp.close()
def process_status(file):
reader = Dpkg_Reader(file)
package = None
installed = {}
while 1:
name, value = reader.next_header()
if name is None:
break
if name == 'Package':
package = value
elif name == 'Status':
status = string.split(value, ' ')
if status[2] == 'installed':
installed[package] = 1
reader.close()
return installed
def process_available(file, installed):
reader = Dpkg_Reader(file)
sizes = {}
while 1:
name, value = reader.next_header()
if name is None:
break
elif name == 'Package':
package = value
elif name == 'Installed-Size':
if installed.has_key(package):
sizes[package] = string.atoi(value)
reader.close()
return sizes
def main():
installed = process_status('/var/lib/dpkg/status')
sizes = process_available('/var/lib/dpkg/available', installed)
lst = sizes.keys()
lst.sort(lambda a, b, sizes=sizes: cmp(sizes[b], sizes[a]))
for pack in lst:
print "%s: %d" % (pack, sizes[pack])
if __name__ == '__main__':
main()
...
Hrvoje> I would really appreciate some suggestions. The code is not
Hrvoje> large, and is (I hope) rather elegant. I am a Python beginner,
Hrvoje> so I'd also appreciate tips on Python style and OO technique.
Hrvoje> I'll post/mail the Perl equivalent on demand.
Assuming the package files aren't humongous, I'd suggest your __init__
method just slurp up the entire file instead of reading it line-by-line in
the next_header method. As has been recounted here before, Python does not
go to great lengths to optimize I/O the way Perl has, instead just relying
on a fairly straightforward interface to stdio.
Also, instead of calling next_header on a line-by-line basis, how about
calling it once per package and returning a dict containing the header names
as keys? Again, function calling in Python, while flexible, isn't terribly
efficient. A common optimization people use is to operate on aggregate
quantities in functions instead of their individual elements.
Skip Montanaro | http://www.mojam.com/
sk...@mojam.com | http://www.musi-cal.com/~skip/
847-971-7098 | Python: Programming the way Guido indented...
Thanks for your comments. The package files aren't *humongous*, but
they aren't exactly small either. One file is 700K, and the other is
3M large. It would be really nice if it were possible to process them
in a stream.
But I'll try it, for the sake of argument, and see if the speedup is
worth it.
> Also, instead of calling next_header on a line-by-line basis, how
> about calling it once per package and returning a dict containing
> the header names as keys?
I'll try this too, and get back to you. That change will actually
make the code more correct because it will no longer assume that
`Package' header comes before `Status' and `Installed-Size' headers.
> As a Python exercise, I wrote a simple program to "scratch an itch",
> i.e. do something useful. However, I found that Python's lack of
> speed really bytes me here, so I'd like to hear suggestions for
> speedup. People who don't like that kind of topic, please skip to the
> following article. Others, read on.
>
> When browsing my Debian packages, I found I often wanted to know the
> size of the installed packages.
A few suggestions:
1) almost never use a catch-all-exceptions block. For example, in your
next_header method.
2) avoid doing semi-slow operations in the middle of a loop:
- string addition (self.__current = self.__current + line)
- reading a file one line at a time (self.__fp.readline())
3) lambdas are fun, but very slow.
4) objects are your friend. Dont know about speed, but they simplify code
*so much* that I always use them, even for dump dictionary-like and
list-like things.
That is:
1) make low-level object classes, usually subclassing from UserDict or
UserList.
2) do many, bulk, bulk operations, instead of doing a lot of stuff in the
middle of a loop. Repeated short loops are better.
Work is slow, so I've coded up an "example". It doesnt really work, since
I dont have a 'status' or 'available' files to use -- please send me yours
(privately via email), and I'll get this code to work on it.
- j
import string
from UserList import UserList
from UserDict import UserDict
# a single package entry, which is like a dictionary.
# IE: ent['Source'] => 'netkit-telnet'
#
class Entry(UserDict):
def isInstalled(self):
# "install ok installed" => 1, else 0
return string.split(self['Status'])[-1] == 'installed'
def installedSize(self):
if self.isInstalled():
return string.atoi( self['Installed-Size'] )
return 0
class DpkgReaderJM(UserDict):
_primaryKey = 'Package'
def __init__(self, path=None):
UserDict.__init__(self)
if path:
self.feed(path=path)
def parseLine(self, line):
if not line:
e, self._entry = self._entry, None
return e
if line[0] == ' ':
# XX: handle multiline values here (ie: Description)
pass
else:
key, value = string.split(line, ':', 1)
self._entry[key] = string.strip(value)
# def flush(self):
# return self._entry
def feed(self, path=None, data=None):
if not data:
data = open(path).read()
self._entry = Entry()
entryList = map(self.parseLine, string.split(data, '\n'))
# skip blank entries:
entryList = filter(None, entryList)
primaryKey = self._primaryKey
for ent in entryList:
self.data[ent[primaryKey]] = ent
class StatusReader(DpkgReaderJM):
pass
class AvailableReader(DpkgReaderJM):
pass
def main_orig():
installed = StatusReader('/var/lib/dpkg/status')
avail = AvailableReader('/var/lib/dpkg/available', installed)
lst = sizes.keys()
lst.sort(lambda a, b, sizes=sizes: cmp(sizes[b], sizes[a]))
for pack in lst:
print "%s: %d" % (pack, sizes[pack])
def main():
global status
status = StatusReader('status')
print 'applications:\n\t', status.keys()
print 'telnet installed size:\n\t', status['telnet'].installedSize()
if __name__=="__main__":
main()
> A few suggestions:
>
> 1) almost never use a catch-all-exceptions block. For example, in your
> next_header method.
I don't think it causes a slowdown, but point taken.
> 2) avoid doing semi-slow operations in the middle of a loop:
>
> - string addition (self.__current = self.__current + line)
As I said in the original message, I've tried to optimize this by
making self.__current a list, but it didn't work any faster.
> - reading a file one line at a time (self.__fp.readline())
I don't see an alternative to this, except to read the whole file at
once, which I am trying to avoid, as the files are large.
> 3) lambdas are fun, but very slow.
I am almost completely sure that the sort you refer to is not the
cause of the slowdown.
> 4) objects are your friend. Dont know about speed, but they
> simplify code *so much* that I always use them, even for dump
> dictionary-like and list-like things.
I don't really understand this. I'll study your code and see if it's
clearer then.
>I don't see an alternative to this, except to read the whole file at
>once, which I am trying to avoid, as the files are large.
possibly using readlines(size), with an appropriate size, will help here.
Bill
> But I'll try it [slurp all the lines at once and iterate over them],
> for the sake of argument, and see if the speedup is worth it.
Yes, this does speed things up. Instead of 27 CPU seconds, the
program now spends 19.5 seconds of CPU time. Along with:
> > Also, instead of calling next_header on a line-by-line basis, how
> > about calling it once per package and returning a dict containing
> > the header names as keys?
This buys me additional 2.5 CPU seconds.
The problem is that the thing is still twice slower than the
equivalent Perl program, which I haven't bothered to optimize at all.
Oh well.
In case anyone is still interested, after these two optimizations, the
code looks like this:
#!/usr/bin/python
import string
class Dpkg_Reader:
def __init__(self, file):
self.__current = ''
self.__lines = open(file).readlines()
self.__index = 0
def next_package(self):
package = {}
while 1:
if self.__index == len(self.__lines):
return None # EOF
line = self.__lines[self.__index]
self.__index = self.__index + 1
if self.__current:
if line == "\n" or (line[0] != ' ' and line[0] != "\t"):
#print "{%s}" % (self.__current)
try:
name, value = string.split(self.__current, ': ', 1)
except:
name, value = string.split(self.__current, ':', 1)
value = value[:-1]
package[name] = value
if line == "\n":
self.__current = ''
return package
else:
self.__current = line
continue
self.__current = self.__current + line
def close(self):
pass
def process_status(file):
reader = Dpkg_Reader(file)
installed = {}
while 1:
package = reader.next_package()
if package is None:
break
status = string.split(package['Status'], ' ')
if status[2] == 'installed':
installed[package['Package']] = 1
reader.close()
return installed
def process_available(file, installed):
reader = Dpkg_Reader(file)
sizes = {}
while 1:
package = reader.next_package()
if package is None:
break
pname = package['Package']
if installed.has_key(pname):
sizes[pname] = string.atoi(package['Installed-Size'])
reader.close()
return sizes
def main():
installed = process_status('/var/lib/dpkg/status')
sizes = process_available('/var/lib/dpkg/available', installed)
lst = sizes.keys()
lst.sort(lambda a, b, sizes=sizes: cmp(sizes[b], sizes[a]))
for pack in lst:
print "%s: %d" % (pack, sizes[pack])
if __name__ == '__main__':
main()
> Also, instead of calling next_header on a line-by-line basis, how about
> calling it once per package and returning a dict containing the header names
> as keys? Again, function calling in Python, while flexible, isn't terribly
> efficient.
In addition, you could remove the __current member when doing so. If
there was a next_package method, it would just know that it is at the
beginning of a package (with __current==''), or at EOF. It would read
until EOF, or until it sees an empty line.
Regards,
Martin
> The problem is that the thing is still twice slower than the
> equivalent Perl program, which I haven't bothered to optimize at all.
> Oh well.
>
> In case anyone is still interested, after these two optimizations, the
> code looks like this:
Starting from your last version, I performed a number of additional
optimizations:
- get rid of members, localize remaining members and module accesses
- use find instead of split
- reduce number of string slicings
With that, I've reduced the time from 7s to 3.7s on my machine, with
my local data. Please find the code below.
Regards,
Martin
#!/usr/local/bin/python
import string
class Dpkg_Reader:
def __init__(self, file):
self.lines = open(file).readlines()
self.index = 0
def next_package(self):
package = {}
if self.index == len(self.lines):
return None # EOF
index = self.index # Localize everything
lines = self.lines
find = string.find
current = lines[index] # Get the first line
index = index + 1
while 1:
line = lines[index]
index = index + 1
if line[0] == ' ' or line[0] == '\t':
current = current + line
continue
pos = find(current, ':')
name = current[:pos]
if current[pos+1] == ' ': # Skip space following :
value = current[pos+2:-1]
else:
value = current[pos:-1]
package[name] = value
if line == "\n":
self.index = index # write back index before returning
return package
current = line
def close(self):
pass
def main():
import time
start=time.time()
installed = process_status('status')
end=time.time()
print "status:",end-start
start=end
sizes = process_available('available', installed)
end=time.time()
print "available:",end-start
start=end
lst = sizes.items()
for i in range(0,len(lst)):
k, v = lst[i]
lst[i] = (v,k)
lst.sort()
end=time.time()
print "sort:",end-start
start=end
#for pack in lst:
# print "%s: %d" % (pack[1],pack[0])
I'd guess you'd find this getting quite a bit closer to Perl speeds:
#!/usr/bin/python
import string
class Dpkg_Reader:
def __init__(self, file):
self.__current = ''
#data = open(file).read()
data = file
self.__pkgs = string.split(data, "\n\n")
self.__index = 0
def next_package(self, split=string.split, join=string.join,
find=string.find):
if self.__index == len(self.__pkgs):
return None
data = self.__pkgs[self.__index]
self.__index = self.__index + 1
if find(data, '\n ') > -1:
data = split(data, '\n ')
data = join(data, ' ')
if find(data, '\n\t') > -1:
data = split(data, '\n\t')
data = join(data, ' ')
data = split(data, '\n')
package = {}
for line in data[:-1]:
try:
name, value = split(line, ': ', 1)
except:
name, value = split(line, ':', 1)
package[name] = value
return package
def close(self):
pass
txt = """\
Package: telnet
Status: install ok installed
Priority: standard
Section: net
Installed-Size: 130
Maintainer: Herbert Xu <her...@debian.org>
Source: netkit-telnet
Version: 0.14-4
Replaces: netstd
Depends: libc6 (>= 2.1), libncurses4 (>= 4.2-3.1)
Description: The telnet client.
The telnet command is used for interactive communication with another
host using the TELNET protocol.
"""
def test():
import time
start = time.time()
for i in xrange(100):
reader = Dpkg_Reader(txt*100)
while 1:
rslt = reader.next_package()
if rslt is None:
break
print "Took %4.2f secs" % (time.time()-start)
def main():
installed = process_status('/var/lib/dpkg/status')
sizes = process_available('/var/lib/dpkg/available', installed)
lst = sizes.keys() lst.sort(lambda a, b, sizes=sizes:
cmp(sizes[b], sizes[a])) for pack in lst:
print "%s: %d" % (pack, sizes[pack])
if __name__ == '__main__':
test()
- Gordon
def prime( self, heuristic ):
'''
prime buffer with "heuristic" lines
Should be ~2X larger than the largest possible
entry in file.
'''
self.lines[len(self.lines):] = self.file.readlines(heuristic)
def next_package(self, heuristic=1000, split=string.split,
strip=string.strip):
# initial priming of the buffer, in theory we might run past,
# so use a large heuristic...
if len( self.lines) < heuristic *2:
self.prime( heuristic )
package = {}
try:
packageindex = self.lines.index( '\n' )
except ValueError:# could do more checks here...
packageindex = len ( self.lines)
last = None
for line in self.lines[packageindex:]:
# is this a continuation
if line[0] in (' ' ,"\t"):
if last:
value = package[ last[0] ] = last[1]+line
last = (last[0], value)
else:
print '''Continued line follows null line!!!'''
# must be an opening/definition
else:
# get "cleaned up" versions
name, value = map( strip, split( line, ':', 1))
# store
package[name] = value
# save for continuation and deletion
last = (name, value)
del self.lines[ packageindex+1:]
return package # will be an empty dictionary if there is no next package
Enjoy yourselves,
Mike
...
> The program was quite easy to write, and easy to read afterwards. The
> problem is that it is also quite slow. On my system, it takes about
> 27 CPU seconds (as reported by `time' shell builtin) to do the work,
> which can extend to more than a minute of real time, depending on the
> system load.
>
> As a comparison, the equivalent Perl program does the same thing in 9
> CPU seconds. I tried everything I knew to make the Python version
> fast. I tried to use `re' to avoid returning headers other than the
> ones we're interested in. I tried changing self.__current to just
> current to avoid a dictionary lookup. I tried to make self.__current
> a list, to avoid the expensive `current = current + line' operation.
> All of these things made the program measure slower.
>
> I would really appreciate some suggestions. The code is not large,
> and is (I hope) rather elegant. I am a Python beginner, so I'd also
> appreciate tips on Python style and OO technique. I'll post/mail the
> Perl equivalent on demand.
Hi,
the following code is about five times faster here. That means it's
faster than perl, I suppose :-). It is of course not quite as general
as your's, but it seems to fit the job nicely. For this kind of
problem, I usually don't bother with OO.
The main speed advantage seems to be that the files aren't processed
line by line, which is of course very memory consuming.
HTH,
Robert
#! /usr/bin/env python
import string
def get_installed():
fstatus = open('/var/lib/dpkg/status', 'r')
status = fstatus.read()
fstatus.close()
status = string.split(status, '\n\n')
installed = []
for package in status:
fields = string.split(package, '\n')
name = fields[0][9:]
for line in fields:
if line[:7] == 'Status:':
if string.split(line[8:])[-1] == 'installed':
installed.append(name)
break
return installed
def get_sizes(packages):
favailable = open('/var/lib/dpkg/available', 'r')
available = favailable.read()
favailable.close()
available = string.split(available, '\n\n')
results = []
for package in available:
fields = string.split(package, '\n')
name = fields[0][9:]
if name in packages:
for line in fields:
if line[:15] == 'Installed-Size:':
results.append(name, int(line[16:]))
break
return results
def main():
results = get_sizes(get_installed())
results.sort(lambda a, b: cmp(b[1], a[1]))
for r in results:
print '%s: %d' % r
if __name__ == '__main__':
main()
--
Robert Vollmert rvol...@gmx.net
[Robert Vollmert, among others, does various irksome things to
speed it up, and remarks ...]
> The main speed advantage seems to be that the files aren't
> processed line by line, which is of course very memory consuming.
This probably bears repetition about once a month <wink>: Python does
line-at-a-time input on top of the platform's stdio package. On the popular
platforms, Perl does line-at-a-time input by breaking into the platform's
stdio package, peeking and poking C FILE structs directly in
platform-specific ways. This can give Perl a huge speed advantage on
line-at-a-time input; indeed, on Windows, Perl does line-at-a-time input
faster than pure optimized C/C++ code using Microsoft's fgets, and about 3x
faster than Python under Win95. Perl's speed advantage here is directly
proportional to the stupidity of the platform's stdio implementation ("read
a line from a file? who on earth would want to do that?" <wink>).
eschew-files-embrace-3d-buttons-ly y'rs - tim
[...]
if __name__ == '__main__':
import profile
profile.run('main()')
a=cStringIO.StringIO()
for i in range(10000):
a.write('h')
a=a.getvalue()
rather then:
a=''
for i in range(10000):
a=a+'h'
> As a Python exercise, I wrote a simple program to "scratch an itch",
> i.e. do something useful. However, I found that Python's lack of
> speed really bytes me here, so I'd like to hear suggestions for
> speedup. People who don't like that kind of topic, please skip to the
> following article. Others, read on.
>
> When browsing my Debian packages, I found I often wanted to know the
> size of the installed packages. To achieve this, I wrote a simple
> program that prints the installed packages and their sizes, all sorted
> by size. Output looks like this:
>
> tetex-base: 27652
> tetex-extra: 27118
> cmucl-safe: 22174
> mozilla: 13007
> timidity-patches: 11662
> locales: 11489
> quake2: 10800
> [...]
>
> To achieve this, the program needs to parse two files:
> /var/lib/dpkg/status, to find which packages are installed, and
> /var/lib/dpkg/available, to find the size of the installed packages.
[snip]
I had a very similar problem, I have this University project to
visualise the Debian package file, so I needed to read in into useful
language (C was discard as fast but too much work). The read tiem was
really slow, so I only do it once. After loading the file and
creating a data structure that stores the information, I pickle it.
Read time drops by an order of magnitude. Not so applicable to your
situation possibly but you could use it if you don't install new
packages often.
--
Tim Evans
note that:
lines = fp.readlines(16384)
if not lines:
break
for line in lines:
...
is usually much faster than
line = fp.readline()
if not line:
break
...
(adjust the readlines buffer size as suitable)
but sure, Python's standard file object is slow.
working on a replacement (as an image processing
guy, I've never understood why anyone ever used
stdio), but there's so much code to write, and
sooo little time....
</F>
> I had a very similar problem, I have this University project to
> visualise the Debian package file, so I needed to read in into
> useful language (C was discard as fast but too much work). The read
> tiem was really slow, so I only do it once. After loading the file
> and creating a data structure that stores the information, I pickle
> it. Read time drops by an order of magnitude. Not so applicable to
> your situation possibly but you could use it if you don't install
> new packages often.
Thanks for the suggestion. However, I do install new packages often,
so that's not really an option.
> the following code is about five times faster here. That means it's
> faster than perl, I suppose :-).
Wow, this *is* fast. However, it also loses in generality. For
instance, if I wanted to do anything else, I'd have to rewrite it
completely.
> but sure, Python's standard file object is slow.
> working on a replacement (as an image processing
> guy, I've never understood why anyone ever used
> stdio), but there's so much code to write, and
> sooo little time....
It's not exactly general-purpose right now (there are still methods to
add), but I was able to get about an order of magnitude speed-up on
readline() by adapting the QIO functions from INN. In its current
form it probably won't work right on anything but Unix/Linux (I can
never remember what to do about those pesky line endings). FWIW,
readlines() is about neck and neck with the built-in
Anyway, it's only 10K and in C if anyone wants to play.
<URL:http://members.xoom.com/meowing/programs/qio.tar.gz>
I'll test this if I have to, but do you know off-hand how this compares
to
a = []
for i in range(10000):
a.append('h')
a = string.join(a,'')
What if you use range(100000) or range(1000000)?
--
--- Aahz (@netcom.com)
Androgynous poly kinky vanilla queer het <*> http://www.rahul.net/aahz/
Hugs and backrubs -- I break Rule 6 (if you want to know, do some research)
Executing
import cStringIO, time, string
def doit(n, r):
print "doing", n, "iterations - range function ==", r
c = time.clock()
for i in r(n):
pass
print "loop overhead:", time.clock() - c
c = time.clock()
a=cStringIO.StringIO()
for i in r(n):
a.write('h')
a=a.getvalue()
print "cStringIO:", time.clock() - c
c = time.clock()
a = []
for i in r(n):
a.append('h')
a = string.join(a,'')
print "list append:", time.clock() - c
for i in [1000, 5000, 10000, 50000, 100000, 1000000]:
doit(i, range)
doit(i, xrange)
I got
doing 1000 iterations - range function == <built-in function range>
loop overhead: 0.0
cStringIO: 0.01
list append: 0.01
doing 1000 iterations - range function == <built-in function xrange>
loop overhead: 0.0
cStringIO: 0.01
list append: 0.01
doing 5000 iterations - range function == <built-in function range>
loop overhead: 0.01
cStringIO: 0.03
list append: 0.04
doing 5000 iterations - range function == <built-in function xrange>
loop overhead: 0.01
cStringIO: 0.03
list append: 0.04
doing 10000 iterations - range function == <built-in function range>
loop overhead: 0.02
cStringIO: 0.06
list append: 0.08
doing 10000 iterations - range function == <built-in function xrange>
loop overhead: 0.02
cStringIO: 0.07
list append: 0.08
doing 50000 iterations - range function == <built-in function range>
loop overhead: 0.11
cStringIO: 0.32
list append: 0.42
doing 50000 iterations - range function == <built-in function xrange>
loop overhead: 0.1
cStringIO: 0.32
list append: 0.39
doing 100000 iterations - range function == <built-in function range>
loop overhead: 0.22
cStringIO: 0.65
list append: 0.92
doing 100000 iterations - range function == <built-in function xrange>
loop overhead: 0.18
cStringIO: 0.63
list append: 0.88
doing 1000000 iterations - range function == <built-in function range>
loop overhead: 2.22
cStringIO: 6.55
list append: 10.27
doing 1000000 iterations - range function == <built-in function xrange>
loop overhead: 1.77
cStringIO: 6.35
list append: 10.04
which suggests there is a performance boost for cStringIO vs list append and
xrange vs range for large numbers, but that for loops with less than about
10,000 iterations, it probably doesn't matter what you use.
[Hrvoje Niksic]
> Wow, this *is* fast. However, it also loses in generality. For
> instance, if I wanted to do anything else, I'd have to rewrite it
> completely.
Was the Perl code to which you're comparing it as general as your Python
code?
conservation-of-agony-ly y'rs - tim
Probably not every time you run your app, though. You could parse the
data, pickle the results, and keep up with the mod times on the two
files. If the source data is newer than the pickle database you have to
update the pickle data. That would slow down some of your processing,
but in most cases would still let you take advantage of the faster
pickle version.
Doug
--
____ Les Schaffer ___| --->> Engineering R&D <<---
Theoretical & Applied Mechanics | Designspring, Inc.
Center for Radiophysics & Space Research | http://www.designspring.com/
Cornell Univ. scha...@tam.cornell.edu | l...@designspring.com
> > Thanks for the suggestion. However, I do install new packages
> > often, so that's not really an option.
>
> Probably not every time you run your app, though.
I use the application to see which packages waste too much space, so
that I know which ones to remove. Not to mention that I run nightly
updates of my Debian system. It is practically *guaranteed* that the
files will have changed between two runs of the program.
Sorry. I like the idea of pickling/marshalling, and I'd love to be
able to use it here. But it can't work that way.
> You could parse the data, pickle the results, and keep up with the
> mod times on the two files. If the source data is newer than the
> pickle database you have to update the pickle data.
The problem is that to "update" the pickle data (which I'd have to do
every time, as explained above), I have to parse the text files. And
it's the parsing that's slow, not the post-processing. If I need to
parse them every time, then I don't need pickling.
> Was the Perl code to which you're comparing it as general as your
> Python code?
The Perl code was almost equivalent to the first posted version of
Python code -- the only important difference is that it used a
functional approach rather than OOP. I can mail it to you if you are
curious.
Generality is important to me because another day I might wish to do
something else with those files, like draw a tree of dependencies, or
something. In that case, I'd surely want to reuse the existing
reader/parser. Speed is good, but not so good to justify loss of code
reuse.
Robert's solution is ideologically similar to the `grep' solution
proposed below.
although many have answered you already, I can't resist in posting yet
another one. Your code takes almost 7 secs on my machine, mine takes a
little over 1 sec, and it's still OO and generic. Besides, I didn't use
any fance Python tricks, just the combination of lasy evaluation and bulk
IO did the trick here.
-----------------
#!/usr/bin/python
import string
class Dpkg_Munger:
def munge_all(self, file):
for raw_package in string.split(open(file).read(), '\n\n'):
if raw_package!='':
self.raw_package = '\n' + raw_package + '\n'
self.munge_one(self)
def __getitem__(self, key, find=string.find):
i = find(self.raw_package, '\n'+key+':')
if i == -1:
raise IndexError
i = i + len(key) + 3
if self.raw_package[i]==' ':
i = i + 1
j = i - 1
while 1:
j = find(self.raw_package,'\n',j+1)
if self.raw_package[j+1] != ' ' and self.raw_package[j+1] != '\t':
break
return self.raw_package[i:j]
class process_status(Dpkg_Munger):
def __init__(self, file):
self.installed = {}
self.munge_all(file)
def munge_one(self, package):
name = package['Package']
status = package['Status']
if string.split(status, ' ')[2] == 'installed':
self.installed[name] = 1
class process_available(Dpkg_Munger):
def __init__(self, file, installed):
self.installed, self.sizes = installed, {}
self.munge_all(file)
def munge_one(self, package):
name = package['Package']
size = package['Installed-Size']
if self.installed.has_key(name):
self.sizes[name] = string.atoi(size)
def main():
installed = process_status('/var/lib/dpkg/status').installed
sizes = process_available('/var/lib/dpkg/available', installed).sizes
lst = sizes.keys()
lst.sort(lambda a, b, sizes=sizes: cmp(sizes[b], sizes[a]))
for pack in lst:
print "%s: %d" % (pack, sizes[pack])
if __name__ == '__main__':
main()
--
groetjes, carel
two small adjustments in the munge_one methods brought it's time just below
the magical 1 sec! so now mine's even more then 7 times faster:)
> def munge_one(self, package):
> name = package['Package']
> status = package['Status']
> if string.split(status, ' ')[2] == 'installed':
this should read:
if status[-10:]==' installed':
> def munge_one(self, package):
> name = package['Package']
> size = package['Installed-Size']
the above line should go in the body of the next if-statement
> if self.installed.has_key(name):
> self.sizes[name] = string.atoi(size)
--
groetjes, carel
>> Was the Perl code to which you're comparing it as general as your
>> Python code?
why is it that all of a sudden Tim's postings don't make it to the newsgroup?
I don't recall having seen the above before, and i got the same impression in
other treads as well. And as you all know, whithout Tim's postings there is
hardly any point in reading the newsgroup at all:)
> Robert's solution is ideologically similar to the `grep' solution
> proposed below.
I seem to have missed this grep solution too. or was it part of Tim's posting?
--
groetjes, carel
While this is hardly relevant to python anymore, I suppose it might
interest some. While looking at the relevant data files, I found that
you only need to parse the status file as it actually contains all
information about installed packages, including an 'Installed-Size'
header.
It describes the currently installed packages, while the available
file is about the newest packages available and might contain invalid
information regarding the currently installed version.
--
Robert Vollmert rvol...@gmx.net
> why is it that all of a sudden Tim's postings don't make it to
> the newsgroup? I don't recall having seen the above before, and
> i got the same impression in other treads as well.
This is an exciting new FeaBug of the timbot delivery mechanism: in a fit
of maniacal egoecentricity, the timbot reprogrammed many of the world's news
readers to show its responses only to people it deemed worthy of reading
them. I'm sure your absence from the list of approved readers is a
temporary glitch due to inferior operating systems (like Windows, MacOS,
EPOC32, Unix, Linux and BeOS) not yet fully grasping the elegance of the
timbot's new "Discriminating Display(tm)" code.
Another cause may be that the timbot emails responses to the mailing-list
version of c.l.py, and the mail->news delivery agent at CNRI suffers
periodic bouts of constipation. The anonymous Barry
"${you_name_it}mas...@python.org" Warsaw spends many hours with his fist in
up to the elbow, but somtimes Perl blockage requires both fists to unplug.
> And as you all know, whithout Tim's postings there is hardly any point in
> reading the newsgroup at all:)
None, I'd say. Well, actually, Tim's (that's me) postings are generally
long, rambling, off-topic, and project an irritating smugness as if their
very lack of sensible content rendered them immune from criticism. The
timbot's postings, however, are pure joy. I've been looking into getting us
separate email addresses so people have an easier time telling us apart.
it-doesn't-help-that-the-mechanical-bastard-stole-my-
signoff-style-either-ly y'rs - tim
You just ruled out a whole bunch of OS's there...so what does the timbot
run on then? ITS? A Lisp Machine? AmigaOS?
--
cliff crawford http://www.people.cornell.edu/pages/cjc26/
There are more stars in the sky than there are
-><- grains of sand on all the beaches of the world.
>> > Thanks for the suggestion. However, I do install new packages
>> > often, so that's not really an option.
>>
>> Probably not every time you run your app, though.
> I use the application to see which packages waste too much space, so
> that I know which ones to remove. Not to mention that I run nightly
> updates of my Debian system. It is practically *guaranteed* that the
> files will have changed between two runs of the program.
>
> Sorry. I like the idea of pickling/marshalling, and I'd love to be
> able to use it here. But it can't work that way.
Maybe you can update the python structures at night too
(the same way the 'locate' database is built at night,
because a 'find' on the whole system is too slow).
>> You could parse the data, pickle the results, and keep up with the
>> mod times on the two files. If the source data is newer than the
>> pickle database you have to update the pickle data.
>
> The problem is that to "update" the pickle data (which I'd have to do
> every time, as explained above), I have to parse the text files. And
> it's the parsing that's slow, not the post-processing. If I need to
> parse them every time, then I don't need pickling.
Yes, the idea was to keep a cache of the pythonic database, so
whenever it has changed, you rebuild it, and save it, for reusing
it later.
Doing something clever would involve using "diff" for instance, but
is probably more complex than making a C module for Python :-).
-- Cedric Adjih
What would be cool is if readlines() returned a lazy sequence
object (ie. only read as much is needed using a certain block
size). This should give the advantages of readlines() without
the concern about sucking up a huge file all at once.
I implemented this idea (probably badly) in pure Python
and got about a 2x speedup verses readline(). It is a small
module so I will post it here.
being-lazy-has-its-advantages'ly Neil
import string
class BlockFile:
def __init__(self, file, blocksize=1024*40, sep='\n'):
self.file = file
self.blocksize = blocksize
self.sep = sep
self.line = -1
self.lines = []
self.end = ''
def __getitem__(self, i):
try:
self.line = self.line+1
return self.lines[self.line]
except IndexError:
self.line = 0
self._get_block()
return self.lines[0]
def _get_block(self):
data = self.file.read(self.blocksize)
if len(data) == 0:
raise IndexError
self.lines = string.split(data, self.sep)
self.lines[0] = self.lines[0] + self.end
if len(data) == self.blocksize:
self.end = self.lines[-1]
del self.lines[-1] # this _should_ be fast
else:
self.end = ''
def test_block(input):
for l in BlockFile(open(input)):
pass
def test_normal(input):
f = open(input)
while 1:
l = f.readline()
if not l:
break
def measure(function, *args):
import time
t1 = time.time()
apply(function, args)
t2 = time.time()
apply(function, args)
return t2-t1
if __name__ == '__main__':
import sys
print 'block time', measure(test_block, sys.argv[1])
print 'normal time', measure(test_normal, sys.argv[1])
> I wonder -- suppose the official Python were to replace the current
> readlines() with this one (or with something similar). Would it be
> possible to preserve all the current behavior (except for the memory
> usage, of course)?
It's tough. I've been trying to do that with my attempt, and the
performance hit of keeping the seeks synchronized makes it not worth
the trouble (IMHO) to support writing or having more than one object
able to read from the same descriptor.
If a read only object is good enough, a cleaned-up version of the QIO
hack thingy I mentioned a few days ago is uploaded. I think I've got
its methods, attributes and exceptions down to a faithful reproduction
of Python's native file object now (except for writing, of
course). readinto() is absent for now since it's undocumented. It can
go in if someone can supply some Python k0dez showing what buffer
objects are good for =)
Also added some convenience features like user-specified line
terminators, automatic chopping, and a way to read SMTP-like sockets
wicked fast.
After adding the requisite bloat, vanilla qio.readline() is still
about 5-8 times as fast as native Python looping through big text
files (where stdio is glibc's). readlines() and read() are not big
wins (and in pathological cases they're slower), but they're there for
completeness.
Details at <URL:http://members.xoom.com/meowing/python/>
Meow.
The fileinput module does something like that in the background, I
think...
I would check out fileinput first, but if that doesn't suit you, you
could use what I rolled for myself a little while back. I think it's
slightly easier to have multiple instances this way:
class lazy_file:
def __init__ (self, filename, sizehint = 10**7, mode = 'r'):
self.file = open (filename, mode)
self.line_count=0
self.sizehint = sizehint
self.__refresh_buffer ()
def __refresh_buffer (self):
self.buffer = self.file.readlines (self.sizehint)
self.buffer_length = len (self.buffer)
if self.buffer:self.buffer_index = 0
else: self.buffer_index = None
def readline (self):
if self.buffer_index==self.buffer_length:
self.__refresh_buffer ()
if not self.buffer: self.line = ''
else:
self.line = self.buffer[self.buffer_index]
self.buffer_index = self.buffer_index + 1
if self.line:self.line_count=self.line_count+1
return self.line
class File_loop:
def __init__(self,filename,sizehint=10**7,mode='r'):
self.file=lazy_file(filename,sizehint,mode)
def __getitem__(self,index):
if self.file.readline():return self.file.line
else:raise IndexError, 'File finished'
You can just use it like
for line in File_loop('filename'):
munge(line)
Alex.
- --
If cars were like computers, they would go 300 m.p.h. and get a hundred
miles to the gallon and cost $50. Except that twice a month someone a
thousand miles away would be able to blow up the car, killing everyone
nearby.
-----BEGIN PGP SIGNATURE-----
Version: PGP for Personal Privacy 5.0
Comment: Processed by Mailcrypt 3.5b6, an Emacs/PGP interface
Charset: noconv
iQA/AwUBN+BT93zN4ZFYpUPAEQJQFACgrZ2asMAo6PPw7He+yVZsXd7qyHQAoMOg
7szYmUaTVubWw0fXZMWxdjeh
=6uiN
-----END PGP SIGNATURE-----
>>> I don't see an alternative to this, except to read the whole file at
>>> once, which I am trying to avoid, as the files are large.
>What would be cool is if readlines() returned a lazy sequence
>object (ie. only read as much is needed using a certain block
>size). This should give the advantages of readlines() without
>the concern about sucking up a huge file all at once.
I wonder -- suppose the official Python were to replace the current
readlines() with this one (or with something similar). Would it be
possible to preserve all the current behavior (except for the memory
usage, of course)?
That sure would seem nice. It wouldn't solve all of our problems, but it
would make the most obvious speedup idiom behave well.
--
-William "Billy" Tanksley
>That sure would seem nice. It wouldn't solve all of our problems, but it
>would make the most obvious speedup idiom behave well.
As I've often been heard to say (generally after a marathon Haskell
hacking session):
You can never have too much laziness.
Personally, I'd love to have almost Haskell-level laziness on tap in
Python, especially in things like file IO. There's a lot to be said
for being able to go slurping through a file and having that stuff
done for you under the hood.
--
Alexander Williams (tha...@gw.total-web.net)
"In the end ... Oblivion Always Wins."
Here it's the other way round:
14.SRam:> python BlockFile.py daten:programme/python/e4xtc.txt
block time 9.74225389957
normal time 5.96557199955
This is on an Amiga, using Python 1.5.2, the file is 276089 Bytes long.
I tried also with blocksizes of 100k and 8k for BlockFile, but the
difference was about nought, so the speed difference is not blocksize
related.
And there is a error in _get_block(self):
> self.lines[0] = self.lines[0] + self.end
This should read: self.lines[0] = self.end + self.lines[0]
Ciao
Marc
--
"Claims about the evils of cannabis have been exaggerated: The question now
is not whether it should be legalised, but how and when."
http://www.newscientist.com/ns/980221/contents.html
PGP-Key fingerprint EE 2E C1 06 D1 28 89 D2 F9 CE 35 7B F8 F4 42 52
Not easily. The code I wrote is only good for putting in a for
loop. If you access the lines out of order things go crazy.
This is similar to the fileinput module.
I coded a C version of this. The speed is quite impressive. For
random 3 MB text file (using time()):
BlockFile: 0.420621991158
readline: 7.21456801891
readlines: 0.672585010529
Even with a cold cache I get 0.6 seconds. Incidently, this C
version is over 2.5 times faster than:
perl -e "while (<>) {}"
If anyone is interested I can make the code available.
Neil
--
"The percentage of users running Windows NT Workstation 4.0 whose PCs stopped
working more than once a month was less than half that of Windows 95 users."
-- microsoft.com/ntworkstation/overview/Reliability/Highest.asp
>Not easily. The code I wrote is only good for putting in a for
>loop. If you access the lines out of order things go crazy.
>This is similar to the fileinput module.
Yes, I see. It's pretty clear that this will have to wait for the 'for'
loop changes we've pondered in the past (i.e. making a distinction between
streams and random access entities).
Hmm... There is a simple stopgap solution. We *could* simply keep track
of accesses, and on the first random access dump everything into memory
and act like a list.
>I coded a C version of this. The speed is quite impressive. For
>random 3 MB text file (using time()):
> BlockFile: 0.420621991158
> readline: 7.21456801891
> readlines: 0.672585010529
>Even with a cold cache I get 0.6 seconds. Incidently, this C
>version is over 2.5 times faster than:
> perl -e "while (<>) {}"
Hey, COOL! Finally we get a sop to our inferiority complex. ;-)
Seriously, Perl and Python have the most amusing codependancy. We both
spend so much time feeling inferior.
It occurs to me that one nice thing to see in Python 1.6 would be a
built-in function and datatype, so a file could be specified perhaps like
this:
for line in url('file:/dos.txt'):
Eventually, perhaps that could become a syntax -- I don't want to look TOO
much like Perl :), but other languages seem to have demonstrated that easy
entry does make for good-looking code -- and after all, that's what we're
here for: to look good. No, to look GREAT.
for line in %file:/dos.txt:
Naw, that doesn't look good in Python. Sure looks nice in Rebol, though.
I guess Python ain't Rebol, huh? :)
>If anyone is interested I can make the code available.
I'm *interested*, but if you make the code available I won't do a blessed
thing about it. I'd love to see it in the official binary so I could use
it, but until then I like to only use things I can depend on portably.
And I'm lazy -- oops, I mean busy.
> Neil
<snore>-ly y'rs
--
-William "Billy" Tanksley
Yes, qio.readline() is about the most frightening function in the Python
universe now <wink>.
> After adding the requisite bloat, vanilla qio.readline() is still
> about 5-8 times as fast as native Python looping through big text
> files (where stdio is glibc's).
I compiled it for Win32 (btw, "mode_t" was not defined, & there are about a
half dozen warnings about mixing signed and unsigned ints), with
optimization cranked up. Under Win95 it was between 2-3 times as fast as
1.5.2 Python doing a native readline() loop over an 11Mb text file. It's
hard to believe that MS's grudging implementation of stdio is that much
better than glibc's, so not sure what to make of this -- possibly that
*nothing* can run really fast under Windows <0.9 wink>.
> ...
> Details at <URL:http://members.xoom.com/meowing/python/>
Nice work!
if-only-windows-had-any-text-files<0.7-wink>-ly y'rs - tim
> meow, about his or her
Most people use 'its' or a choice expletive.
> I compiled it for Win32 (btw, "mode_t" was not defined, & there are about a
> half dozen warnings about mixing signed and unsigned ints)
If you saved 'em, I'd love to see the error messages. I don't have
Windows kit handy to check for these things, and am just a bit
surprised it actually works there at all.
> It's hard to believe that MS's grudging implementation of stdio is
> that much better than glibc's
glibc really does have some odd features, like macros where most other
libs seem to use functions and vice versa. It's almost as if they
think maintainability is important or something.
But maybe you were trying to be fair, while standard benchmaking
practice dictates that one stacks the deck ;) To get up to the 8X
range I glued together 15MB of news overview files with an average
line length of 447. In that case the native loop is dominated by
getc(), while with shorter lines the Python loop overhead masks much
of the difference. Size really does matter, especially when one is
comtemplating whipping out one's Python at an all-Perl shop and wants
to make a good show of it.
Python 2.0 should be very cool.
>Hmm... There is a simple stopgap solution. We *could* simply keep track
>of accesses, and on the first random access dump everything into memory
>and act like a list.
Yes, that could work. It would be complicated though. Not very
Python like.
>I'm *interested*, but if you make the code available I won't do a blessed
>thing about it. I'd love to see it in the official binary so I could use
>it, but until then I like to only use things I can depend on portably.
>And I'm lazy -- oops, I mean busy.
As someone else asked for it, here it is anyhow:
http://www.acs.ucalgary.ca/~nascheme/python/blockfile.tgz
It should be possible to transparently integrate this with the
fileinput module. I'm to lazy right now to do it. :)
Neil
why not just add
import urllib
def lazy(file):
return file.readlines() # for now
def url(url):
return lazy(urllib.urlopen(url))
import __builtin__
__builtin__.url = url
to your startup file ;-)
(I'll leave it to someone else to actually
implement the "lazy" class/factory).
</F>
On NT with a 54meg file and average line length of 147
compare readlines()
Using native readlines
3 function calls in 10.539 CPU seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 1.237 1.237 10.479 10.479 <string>:1(?)
0 0.000 0.000 profile:0(profiler)
1 0.060 0.060 10.539 10.539 profile:0(testlines(filespec))
1 9.243 9.243 9.243 9.243 test.py:30(testlines)
Using QIO readlines
3 function calls in 10.212 CPU seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 1.556 1.556 10.211 10.211 <string>:1(?)
0 0.000 0.000 profile:0(profiler)
1 0.001 0.001 10.212 10.212
profile:0(testqiolines(filespec))
1 8.655 8.655 8.655 8.655 test.py:38(testqiolines)
compare readline() loops
Using native readline
3 function calls in 31.633 CPU seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 31.632 31.632 <string>:1(?)
0 0.000 0.000 profile:0(profiler)
1 0.001 0.001 31.633 31.633 profile:0(test(filespec))
1 31.632 31.632 31.632 31.632 test.py:5(test)
Using QIO readline
3 function calls in 5.539 CPU seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 5.538 5.538 <string>:1(?)
0 0.000 0.000 profile:0(profiler)
1 0.001 0.001 5.539 5.539 profile:0(testqio(filespec))
1 5.538 5.538 5.538 5.538 test.py:20(testqio)
compare read() of a whole file
standard
Size: 53992540
2 function calls in 2.427 CPU seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 2.425 2.425 2.425 2.425 <string>:1(?)
0 0.000 0.000 profile:0(profiler)
1 0.001 0.001 2.427 2.427 profile:0(z = f.read(); print
len(z))
qio
Size: 53992540
2 function calls in 15.612 CPU seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 15.550 15.550 15.550 15.550 <string>:1(?)
0 0.000 0.000 profile:0(profiler)
1 0.061 0.061 15.612 15.612 profile:0(z = f.read(); print
len(z))
--
--Darrell
[Thanks for the patch to make Windows happy, BTW.]
> Ran the test.py and enabled the whole file test as well.
> Looks like readline() is a smoker, but read() needs help.
Right, because you end up with two copies of the file in memory if you
use a single read to get the whole file; that's gonna hurt if your
machine is low on RAM. For whole-file slurping, use the native file
objects which handle that case very efficiently.