Populating a dictionary, fast


Michael Bacarella

Nov 10, 2007, 4:56:35 PM
to pytho...@python.org

The id2name.txt file is an index of primary keys to strings. They look like this:

11293102971459182412:Descriptive unique name for this record\n
950918240981208142:Another name for another record\n

The file's properties are:

# wc -l id2name.txt
8191180 id2name.txt
# du -h id2name.txt
517M id2name.txt

I'm loading the file into memory with code like this:

id2name = {}
for line in iter(open('id2name.txt').readline, ''):
    id, name = line.strip().split(':')
    id = long(id)
    id2name[id] = name

This takes about 45 *minutes*

If I comment out the last line in the loop body it takes only about 30 _seconds_ to run.
This would seem to implicate the line id2name[id] = name as being excruciatingly slow.

Is there a fast, functionally equivalent way of doing this?

(Yes, I really do need this cached. No, an RDBMS or disk-based hash is not fast enough.)

Ben Finney

Nov 10, 2007, 5:28:15 PM
Michael Bacarella <mb...@gpshopper.com> writes:

> id2name = {}
> for line in iter(open('id2name.txt').readline, ''):
>     id, name = line.strip().split(':')
>     id = long(id)
>     id2name[id] = name
>
> This takes about 45 *minutes*
>
> If I comment out the last line in the loop body it takes only about
> 30 _seconds_ to run. This would seem to implicate the line
> id2name[id] = name as being excruciatingly slow.

Or, rather, that the slowdown is caused by allocating these items in a
dictionary at all.

Dictionaries are implemented very efficiently in Python, but there
will still be overhead in inserting millions of distinct items. Of
course, if you just throw each item away instead of allocating space
for it, the loop will run very quickly.

> Is there a fast, functionally equivalent way of doing this?

You could, instead of individual assignments in a 'for' loop, try
letting the 'dict' type operate on a generator::

    input_file = open("id2name.txt")
    id2name = dict(
        (long(id), name)
        for (id, name) in (line.strip().split(":", 1) for line in input_file)
        )

All that code inside the 'dict()' call is a "generator expression"; if
you don't know what they are yet, have a read of Python's
documentation on them. It creates a generator which will spit out
key+value tuples to be fed directly to the dict constructor as it
requests them.

That allows the generator to parse each item from the file exactly as
the 'dict' constructor needs it, possibly saving some extra "allocate,
assign, discard" steps. Not having your data set, I can't say if it'll
be significantly faster.
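(A minimal runnable miniature of that pattern, with invented sample lines and modern int() standing in for long(); the maxsplit of 1 keeps any colons inside the name intact:)

```python
# Miniature of the generator-expression pattern above, on invented
# sample lines; split(":", 1) splits only at the first colon.
lines = ["42:alpha", "7:beta", "99:gamma:delta"]
id2name = dict(
    (int(id_), name)
    for id_, name in (line.strip().split(":", 1) for line in lines)
)
print(id2name[99])  # gamma:delta
```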

--
\ "Compulsory unification of opinion achieves only the unanimity |
`\ of the graveyard." -- Justice Roberts in 319 U.S. 624 (1943) |
_o__) |
Ben Finney

Steven D'Aprano

Nov 10, 2007, 5:46:47 PM
On Sat, 10 Nov 2007 13:56:35 -0800, Michael Bacarella wrote:

> The id2name.txt file is an index of primary keys to strings. They look
> like this:
>
> 11293102971459182412:Descriptive unique name for this record\n
> 950918240981208142:Another name for another record\n
>
> The file's properties are:
>
> # wc -l id2name.txt
> 8191180 id2name.txt
> # du -h id2name.txt
> 517M id2name.txt
>
> I'm loading the file into memory with code like this:
>
> id2name = {}
> for line in iter(open('id2name.txt').readline, ''):
>     id, name = line.strip().split(':')
>     id = long(id)
>     id2name[id] = name

That's an awfully complicated way to iterate over a file. Try this
instead:

id2name = {}
for line in open('id2name.txt'):
    id, name = line.strip().split(':')
    id = long(id)
    id2name[id] = name


On my system, it takes about a minute and a half to produce a dictionary
with 8191180 entries.


> This takes about 45 *minutes*
>
> If I comment out the last line in the loop body it takes only about 30
> _seconds_ to run. This would seem to implicate the line id2name[id] =
> name as being excruciatingly slow.

No, dictionary access is one of the most highly-optimized, fastest, most
efficient parts of Python. What it indicates to me is that your system is
running low on memory, and is struggling to find room for 517MB worth of
data.


> Is there a fast, functionally equivalent way of doing this?
>
> (Yes, I really do need this cached. No, an RDBMS or disk-based hash is
> not fast enough.)

You'll pardon me if I'm skeptical. Considering the convoluted, weird way
you had to iterate over a file, I wonder what other less-than-efficient
parts of your code you are struggling under. Nine times out of ten, if a
program runs too slowly, it's because you're using the wrong algorithm.

--
Steven.

Paul Rubin

Nov 10, 2007, 5:54:49 PM
Michael Bacarella <mb...@gpshopper.com> writes:
> Is there a fast, functionally equivalent way of doing this?
>
> (Yes, I really do need this cached. No, an RDBMS or disk-based hash
> is not fast enough.)

As Steven says, maybe you need to add more RAM to your system. The
memory overhead of dictionary cells is considerable. If worst comes
to worst, you could concoct some more storage-efficient representation.
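(One shape such a storage-efficient representation could take — a sketch of mine, not something proposed in the thread: parallel sorted arrays of keys and values searched with bisect, trading per-entry hash-table overhead for O(log n) lookups.)

```python
import bisect

# Hypothetical compact index: a sorted key list plus a values list
# aligned by position; no per-entry dict cell or hashing overhead.
pairs = sorted([(11293102971459182412, "first"),
                (950918240981208142, "second")])
keys = [k for k, _ in pairs]
values = [v for _, v in pairs]

def lookup(key):
    # Binary search for the key: O(log n) instead of a hash probe.
    i = bisect.bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return values[i]
    raise KeyError(key)
```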

Michael Bacarella

Nov 10, 2007, 8:18:37 PM
to pytho...@python.org
> That's an awfully complicated way to iterate over a file. Try this
> instead:
>
> id2name = {}
> for line in open('id2name.txt'):
>     id, name = line.strip().split(':')
>     id = long(id)
>     id2name[id] = name

> > This takes about 45 *minutes*
>
> On my system, it takes about a minute and a half to produce a dictionary
> with 8191180 entries.

Doing something similar on my system is very fast as well.

$ cat dict-8191180.py
#!/usr/bin/python
v = {}
for i in xrange(8191180):
    v[i] = i

$ time ./dict-8191180.py
real    0m5.877s
user    0m4.953s
sys     0m0.924s

But...

> > If I comment out the last line in the loop body it takes only about 30
> > _seconds_ to run. This would seem to implicate the line id2name[id] =
> > name as being excruciatingly slow.
>
> No, dictionary access is one of the most highly-optimized, fastest, most
> efficient parts of Python. What it indicates to me is that your system is
> running low on memory, and is struggling to find room for 517MB worth of
> data.


If only it were so easy.


$ free
             total       used       free     shared    buffers     cached
Mem:       7390244    2103448    5286796          0      38996    1982756
-/+ buffers/cache:      81696    7308548
Swap:      2096472      10280    2086192

Here's your Python implementation running as badly as mine did.

$ wc -l id2name.txt
8191180 id2name.txt

$ cat cache-id2name.py
#!/usr/bin/python
id2name = {}
for line in open('id2name.txt'):
    id, name = line.strip().split(':', 1)
    id = long(id)
    id2name[id] = name

$ time ./cache-id2name.py
^C

I let it go 30 minutes before killing it since I had to leave. Here it is in top before I did the deed.


  PID USER  PR NI  VIRT  RES  SHR S %CPU %MEM   TIME+  COMMAND
18802 root  25  0 1301m 1.2g 1148 R 99.9 17.1 36:05.99 cache-id2name.p

36 minutes, 05.99 seconds.

To rule out the file reading/parsing logic as the culprit, here's the
same thing, but with the dictionary insertion removed.

$ cat nocache-id2name.py
#!/usr/bin/python
id2name = {}
for line in open('id2name.txt'):
    id, name = line.strip().split(':', 1)
    id = long(id)

$ time ./nocache-id2name.py
real    0m33.518s
user    0m33.070s
sys     0m0.415s

Here's a Perl implementation running very fast.

$ cat cache-id2name.pl
#!/usr/bin/perl
my %id2name = ();
my $line;
my $id;
my $name;
open(F, "<id2name.txt");
foreach $line (<F>) {
    chomp($line);
    ($id, $name) = split(/:/, $line, 2);
    $id = int($id);
    $id2name{$id} = $name;
}

$ time ./cache-id2name.pl
real    0m46.363s
user    0m43.730s
sys     0m2.611s

So, you think Python's dict implementation degrades towards O(N)
performance when it's fed millions of 64-bit pseudo-random longs?


Steven D'Aprano

Nov 10, 2007, 11:32:44 PM
On Sat, 10 Nov 2007 17:18:37 -0800, Michael Bacarella wrote:


> So, you think Python's dict implementation degrades towards O(N)
> performance when it's fed millions of 64-bit pseudo-random longs?

No.

Here's my sample file:

$ wc -l id2name.txt
8191180 id2name.txt

$ ls -lh id2name.txt
-rw-rw-r-- 1 steve steve 371M 2007-11-11 14:00 id2name.txt

And the results of reading it into a dict (code follows below):

$ time ./slurp_dict.py
Starting at Sun Nov 11 14:26:51 2007
Line 0
Line 1000000
Line 2000000
Line 3000000
Line 4000000
Line 5000000
Line 6000000
Line 7000000
Line 8000000
Items in dict: 8191180
Completed import at Sun Nov 11 14:29:31 2007
Starting to delete dict...


Traceback (most recent call last):
  File "./slurp_dict.py", line 20, in <module>
    del id2name
KeyboardInterrupt

real 35m52.334s
user 1m17.663s
sys 0m16.758s


Notice that the dict is completely read into memory in just two and a
half minutes. The script then tries to delete the dict, and 32 minutes
later is still struggling. That's the point I got sick of waiting and
interrupted the script.

Conclusion: it's a memory issue, or maybe a garbage collection issue, not
a problem with dicts.


Here's my code for creating the key:value file in the first place:

#!/usr/bin/python
"""Make a big file of 64-bit integer keys plus random values."""

bits64 = 2**64
import random
template = '%d:This is a bunch of text...\n'
fp = open('id2name.txt', 'w')
for i in xrange(8191180):
    fp.write(template % random.randint(0, bits64))
fp.close()

###

And here's my code for slurping it in to a dict:

#!/usr/bin/python
"""Read a big file into a dict."""

import gc
import time

print "Starting at %s" % time.asctime()
flag = gc.isenabled()
gc.disable()
id2name = {}
for n, line in enumerate(open('id2name.txt', 'r')):
    if n % 1000000 == 0:
        # Give feedback.
        print "Line %d" % n
    id, name = line.strip().split(':', 1)
    id = long(id)
    id2name[id] = name

print "Items in dict:", len(id2name)
print "Completed import at %s" % time.asctime()
print "Starting to delete dict..."
del id2name
print "Completed deletion at %s" % time.asctime()
if flag:
    gc.enable()
print "Finishing at %s" % time.asctime()

###


So, what can you do? Given that I'm not willing to do any more unpaid
experimentation for you, here are my suggestions, in no particular order:

(1) Presumably you don't want to run your app with the garbage collector
turned off. You might still want to play around with the gc module to see
what you can learn.

(2) More memory will help avoid paging. If you can't get more memory, try
more virtual memory. It will still be slow, but at least the operating
system doesn't have to try moving blocks around as much.

(3) Are you sure you need all eight-million-plus items in the cache all
at once?

(4) There are lots of algorithms out there for dealing with data too big
to fit into main memory. Do some research.

(5) There is a data structure designed for dealing with tens of millions
of records at once. It is called "a database". If you can't find a better
algorithm, and refuse to use an existing RDBMS, I suspect you're going to
end up inventing a primitive, inefficient, buggy database which is no
faster than existing systems out there.

--
Steven.

Yu-Xi Lim

Nov 11, 2007, 12:06:38 AM
Steven D'Aprano wrote:
> (2) More memory will help avoid paging. If you can't get more memory, try
> more virtual memory. It will still be slow, but at least the operating
> system doesn't have to try moving blocks around as much.
>

Based on his previous post, it would seem he has 7GB of RAM (with about
5GB free) and 2GB of swap. I don't think RAM is the issue.

Maybe there's something wrong with his specific Python installation.
What version is it and was it compiled by him?

Paul Rubin

Nov 11, 2007, 12:45:44 AM
Michael Bacarella <mb...@gpshopper.com> writes:
> If only it were so easy.

I think I know what's going on, the dictionary updates are sending the
GC into quadratic behavior. Try turning off the GC:

import gc
gc.disable()
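(Sketched as a reusable pattern around the bulk load; the try/finally, helper name, and sample lines are my additions, not from the thread.)

```python
import gc

def bulk_load(lines):
    # Disable only the cyclic collector for the duration of the load;
    # ordinary reference counting still reclaims garbage throughout.
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        table = {}
        for line in lines:
            id_, name = line.strip().split(":", 1)
            table[int(id_)] = name
        return table
    finally:
        if was_enabled:
            gc.enable()
```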

Michael Bacarella

Nov 11, 2007, 10:35:07 AM
to pytho...@python.org

Tried that already. No difference. :(

DouhetSukd

Nov 11, 2007, 10:49:40 AM
On Nov 11, 7:35 am, Michael Bacarella <m...@gpshopper.com> wrote:

> Tried that already. No difference. :(

Not sure if it would make a difference, and it would mean reorganizing
your preceding lines, but what about doing the dictionary build in one
go, rather than incrementally, using the dict function, which takes a
list of (key, value) tuples? I use it frequently as a space-saver and
it works well enough. It may speed things up, I don't know. I had to
break out your formatting into its own function, and that can't help
too much.

Something like:

def fmt(line):
    id, name = line.strip().split(':')
    id = long(id)
    return (id, name)

id2name = dict([fmt(line) for line in
                iter(open('id2name.txt').readline, '')])

Cheers

Alberto Berti

Nov 11, 2007, 10:50:36 AM
to pytho...@python.org
>>>>> "Steven" == Steven D'Aprano <st...@REMOVE-THIS-cybersource.com.au> writes:

    Steven> $ time ./slurp_dict.py
    Steven> Starting at Sun Nov 11 14:26:51 2007
    Steven> Line 0
    Steven> Line 1000000
    Steven> Line 2000000
    Steven> Line 3000000
    Steven> Line 4000000
    Steven> Line 5000000
    Steven> Line 6000000
    Steven> Line 7000000
    Steven> Line 8000000
    Steven> Items in dict: 8191180
    Steven> Completed import at Sun Nov 11 14:29:31 2007
    Steven> Starting to delete dict...

    Steven> Traceback (most recent call last):
    Steven>   File "./slurp_dict.py", line 20, in <module>
    Steven>     del id2name
    Steven> KeyboardInterrupt

    Steven> real 35m52.334s
    Steven> user 1m17.663s
    Steven> sys 0m16.758s

    Steven> Notice that the dict is completely read into memory in just
    Steven> two and a half minutes. The script then tries to delete the
    Steven> dict, and 32 minutes later is still struggling. That's the
    Steven> point I got sick of waiting and interrupted the script.

    Steven> Conclusion: it's a memory issue, or maybe a garbage
    Steven> collection issue, not a problem with dicts.


uh, strange results...

I ran your scripts both with and without garbage collection enabled,
and these are the results:

with gc enabled:

azazel@lizard:~/wip/zodb_test$ python slurp_dict.py
Starting at Sun Nov 11 16:35:12 2007


Line 0
Line 1000000
Line 2000000
Line 3000000
Line 4000000
Line 5000000
Line 6000000
Line 7000000
Line 8000000
Items in dict: 8191180

Completed import at Sun Nov 11 16:36:03 2007
Starting to delete dict...
Completed deletion at Sun Nov 11 16:36:09 2007
Finishing at Sun Nov 11 16:36:09 2007

and without gc enabled

azazel@lizard:~/wip/zodb_test$ python slurp_dict.py
Starting at Sun Nov 11 16:39:02 2007


Line 0
Line 1000000
Line 2000000
Line 3000000
Line 4000000
Line 5000000
Line 6000000
Line 7000000
Line 8000000
Items in dict: 8191180

Completed import at Sun Nov 11 16:39:49 2007
Starting to delete dict...
Completed deletion at Sun Nov 11 16:39:56 2007
Finishing at Sun Nov 11 16:39:56 2007


all with python2.4 on i386 Linux

cheers

Alberto

DouhetSukd

Nov 11, 2007, 10:58:16 AM
Ah, well, just noticed Ben's suggested this already. Mind you, his
code, while correct in intent, does look a bit fishy (missing those
square brackets), so don't dismiss it just because you had trouble
running it (or mine). Definitely worth a try and I'd be curious to
know if it makes a difference.

Michael Bacarella

Nov 11, 2007, 11:10:45 AM
to pytho...@python.org


This version:

$ python
Python 2.3.4 (#1, May 2 2007, 19:18:17)
[GCC 3.4.6 20060404 (Red Hat 3.4.6-8)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>>

$ rpm -qi python
Name : python Relocations: (not relocatable)
Version : 2.3.4 Vendor: CentOS
Release : 14.4 Build Date: Wed 02 May 2007 07:20:29 PM EDT
Install Date: Mon 04 Jun 2007 05:48:29 PM EDT Build Host: builder6
Group : Development/Languages Source RPM: python-2.3.4-14.4.src.rpm
Size : 21137194 License: PSF - see LICENSE
Signature : DSA/SHA1, Sat 05 May 2007 09:33:49 AM EDT, Key ID a53d0bab443e1821
URL : http://www.python.org/

$ uname -a
Linux xxx 2.6.9-22.ELsmp #1 SMP Sat Oct 8 21:32:36 BST 2005 x86_64 x86_64 x86_64 GNU/Linux


We've also tried it on this version (on a different machine):

$ python
Python 2.4.3 (#1, Mar 14 2007, 19:01:42)
[GCC 4.1.1 20070105 (Red Hat 4.1.1-52)] on linux2
Type "help", "copyright", "credits" or "license" for more information.

$ rpm -qi python
Name : python Relocations: (not relocatable)
Version : 2.4.3 Vendor: CentOS
Release : 19.el5 Build Date: Wed 14 Mar 2007 11:06:42 PM UTC
Install Date: Mon 29 Oct 2007 07:03:16 PM UTC Build Host: builder6
Group : Development/Languages Source RPM: python-2.4.3-19.el5.src.rpm
Size : 22087600 License: PSF - see LICENSE
Signature : DSA/SHA1, Wed 04 Apr 2007 12:26:58 AM UTC, Key ID a8a447dce8562897
URL : http://www.python.org/
Summary : An interpreted, interactive, object-oriented programming language.

$ uname -a
Linux yyy 2.6.18-8.el5 #1 SMP Thu Mar 15 19:46:53 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux


Istvan Albert

Nov 11, 2007, 11:11:09 AM
On Nov 10, 4:56 pm, Michael Bacarella <m...@gpshopper.com> wrote:

> This would seem to implicate the line id2name[id] = name as being excruciatingly slow.

As others have pointed out, there is no way that this takes 45
minutes. It must be something with your system or setup.

Functionally equivalent code runs for me in about 49 seconds!
(it ends up using about twice as much RAM as the data on disk)

i.


Michael Bacarella

Nov 11, 2007, 11:25:15 AM
to pytho...@python.org
Firstly, thank you for all of your help so far, I really appreciate it.

> > So, you think Python's dict implementation degrades towards O(N)
> > performance when it's fed millions of 64-bit pseudo-random longs?
>
> No.

Yes.

I tried your code (with one change, time on the feedback lines) and got
the same terrible performance against my data set.

$ cat stevescode.py
#!/usr/bin/python
"""Read a big file into a dict."""

import gc
import time

print "Starting at %s" % time.asctime()
flag = gc.isenabled()
gc.disable()
id2name = {}
for n, line in enumerate(open('id2name.txt', 'r')):
    if n % 1000000 == 0:
        # Give feedback.
        print "Line %d" % n, time.asctime()
    id, name = line.strip().split(':', 1)
    id = long(id)
    id2name[id] = name

print "Items in dict:", len(id2name)
print "Completed import at %s" % time.asctime()
print "Starting to delete dict..."
del id2name
print "Completed deletion at %s" % time.asctime()
if flag:
    gc.enable()
print "Finishing at %s" % time.asctime()

$ ./stevescode.py
Starting at Sun Nov 11 10:46:37 2007
Line 0 Sun Nov 11 10:46:37 2007
Line 1000000 Sun Nov 11 10:48:22 2007
Line 2000000 Sun Nov 11 10:51:08 2007
Line 3000000 Sun Nov 11 10:56:16 2007
Line 4000000 Sun Nov 11 10:58:41 2007
Line 5000000 Sun Nov 11 11:03:26 2007
^C

To prove that my machine is sane, I ran the same against your generated
sample file and got _excellent_ performance: start to finish in under a
minute.

$ cat steves-makedict.py
#!/usr/bin/python
"""Make a big file of 64-bit integer keys plus random values."""

bits64 = 2**64
import random
template = '%d:This is a bunch of text...\n'
fp = open('id2name.txt', 'w')
for i in xrange(8191180):
    fp.write(template % random.randint(0, bits64))
fp.close()

$ ./steves-makedict.py

$ ./stevescode.py
Starting at Sun Nov 11 11:15:31 2007
Line 0 Sun Nov 11 11:15:31 2007
Line 1000000 Sun Nov 11 11:15:37 2007
Line 2000000 Sun Nov 11 11:15:43 2007
Line 3000000 Sun Nov 11 11:15:49 2007
Line 4000000 Sun Nov 11 11:15:54 2007
Line 5000000 Sun Nov 11 11:16:00 2007
Line 6000000 Sun Nov 11 11:16:07 2007
Line 7000000 Sun Nov 11 11:16:12 2007
Line 8000000 Sun Nov 11 11:16:18 2007
Items in dict: 8191180
Completed import at Sun Nov 11 11:16:19 2007
Starting to delete dict...
Completed deletion at Sun Nov 11 11:16:23 2007
Finishing at Sun Nov 11 11:16:23 2007


> Notice that the dict is completely read into memory in just two and a
> half minutes. The script then tries to delete the dict, and 32 minutes
> later is still struggling. That's the point I got sick of waiting and
> interrupted the script.
>
> Conclusion: it's a memory issue, or maybe a garbage collection issue, not
> a problem with dicts.

As you can see, not the case at all against my data set.

> (1) Presumably you don't want to run your app with the garbage collector
> turned off. You might still want to play around with the gc module to see
> what you can learn.

As you can see, your version did no better. :(

> (2) More memory will help avoid paging. If you can't get more memory, try
> more virtual memory. It will still be slow, but at least the operating
> system doesn't have to try moving blocks around as much.

The machine has 8GB, and is not doing anything else when I run this.

> (3) Are you sure you need all eight-million-plus items in the cache all
> at once?

Yes.

> (4) There are lots of algorithms out there for dealing with data too big
> to fit into main memory. Do some research.

It DOES fit into main memory and a dictionary is exactly the right way
to do this.

> (5) There is a data structure designed for dealing with tens of millions
> of records at once. It is called "a database". If you can't find a better
> algorithm, and refuse to use an existing RDBMS, I suspect you're going to
> end up inventing a primitive, inefficient, buggy database which is no
> faster than existing systems out there.

I've tried three disk-based implementations already (BerkeleyDB, cdb,
and an RDBMS). Performance is terrible because they end up doing too
many random disk seeks. Pre-caching all of the data ahead of time has
offered us the best performance so far, but it is still slower than it
ought to be.

Creating a HEAP TABLE in the RDBMS is an idea, but moving all of this
really easy code into the RDBMS just to find a hashing algorithm that
doesn't choke on my keys sounds pretty lame.

A cached in main memory hash is the right way to do this.

The Perl version runs *very* fast, after all.

Istvan Albert

Nov 11, 2007, 11:34:37 AM
On Nov 11, 11:25 am, Michael Bacarella <m...@gpshopper.com> wrote:

> I tried your code (with one change, time on the feedback lines) and got
> the same terrible performance against my data set.
>
> To prove that my machine is sane, I ran the same against your generated
> sample file and got _excellent_ performance: start to finish in under a
> minute.

One possibility could be that your dataset turns out to be some sort
of pathological worst case for the hashing algorithm in python.
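(What a pathological case would look like, mechanically. In CPython — an implementation detail, not a language guarantee — a small int hashes to itself, and a dict picks a slot from the low-order bits of the hash, so keys that agree in their low bits all contend for the same slots:)

```python
# CPython detail: small ints hash to themselves, and a dict indexes
# its table by the low-order bits of the hash.
keys = [i << 20 for i in range(8)]      # all share identical low 20 bits
slots = {hash(k) & 0xFF for k in keys}  # low bits a small table would use
print(slots)                            # a single value: every key collides
```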

i.

Michael Bacarella

Nov 11, 2007, 11:51:37 AM
to pytho...@python.org
> > This would seem to implicate the line id2name[id] = name as being
> > excruciatingly slow.
>
> As others have pointed out there is no way that this takes 45
> minutes. Must be something with your system or setup.
>
> A functionally equivalent code for me runs in about 49 seconds!
> (it ends up using about twice as much ram as the data on disk)

You can download the list of keys from here, it's 43M gzipped:
http://www.sendspace.com/file/9530i7

and see it take about 45 minutes with this:

$ cat cache-keys.py
#!/usr/bin/python
v = {}
for line in open('keys.txt'):
    v[long(line.strip())] = True

Michael Bacarella

Nov 11, 2007, 11:54:52 AM
to pytho...@python.org
> > I tried your code (with one change, time on the feedback lines) and
> > got the same terrible performance against my data set.
> >
> > To prove that my machine is sane, I ran the same against your
> > generated sample file and got _excellent_ performance. Start to
> > finish in under a minute.
>
> One possibility could be that your dataset turns out to be some sort
> of pathological worst case for the hashing algorithm in python.


Cool!

Putting that on the resume. ;)

Arkanes

Nov 11, 2007, 1:42:08 PM
to pytho...@python.org
Michael Bacarella wrote:
> You can download the list of keys from here, it's 43M gzipped:
> http://www.sendspace.com/file/9530i7
>
> and see it take about 45 minutes with this:
>
> $ cat cache-keys.py
> #!/usr/bin/python
> v = {}
> for line in open('keys.txt'):
>     v[long(line.strip())] = True
>
>
It takes about 20 seconds for me. It's possible it's related to int/long
unification - try using Python 2.5. If you can't switch to 2.5, try
using string keys instead of longs.
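(The string-key variant, sketched on made-up lines: since the ids are only ever used for lookup, the long() conversion can be dropped entirely, sidestepping any long-hashing pathology.)

```python
# Use the raw id strings as dict keys; no long()/int() conversion.
lines = ["11293102971459182412\n", "950918240981208142\n"]
v = {}
for line in lines:
    v[line.strip()] = True

print("950918240981208142" in v)  # True
```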

Alberto Berti

Nov 11, 2007, 12:58:55 PM
to pytho...@python.org
>>>>> "Michael" == Michael Bacarella <mb...@gpshopper.com> writes:

    >> > This would seem to implicate the line id2name[id] = name as
    >> > being excruciatingly slow.
    >>
    >> As others have pointed out there is no way that this takes 45
    >> minutes. Must be something with your system or setup.
    >>
    >> A functionally equivalent code for me runs in about 49 seconds!
    >> (it ends up using about twice as much ram as the data on disk)

    Michael> You can download the list of keys from here, it's 43M
    Michael> gzipped: http://www.sendspace.com/file/9530i7

    Michael> and see it take about 45 minutes with this:

I've downloaded your keys, run your program and this is the result:

$ du -h keys.txt
128M keys.txt

$ time python cache_keys.py

real 0m55.913s
user 0m35.286s
sys 0m0.852s

$ python
Python 2.4.4 (#2, Apr 26 2007, 00:02:45)
[GCC 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)] on linux2

$ uname -a
Linux lizard 2.6.21-1-k7 #1 SMP Sat May 26 16:56:05 UTC 2007 i686 GNU/Linux

cheers

Alberto

Marc 'BlackJack' Rintsch

Nov 11, 2007, 1:04:02 PM
On Sun, 11 Nov 2007 08:51:37 -0800, Michael Bacarella wrote:

>> As others have pointed out there is no way that this takes 45
>> minutes.Must be something with your system or setup.
>>
>> A functionally equivalent code for me runs in about 49 seconds!
>> (it ends up using about twice as much ram as the data on disk)
>
> You can download the list of keys from here, it's 43M gzipped:
> http://www.sendspace.com/file/9530i7
>
> and see it take about 45 minutes with this:
>
> $ cat cache-keys.py
> #!/usr/bin/python
> v = {}
> for line in open('keys.txt'):
>     v[long(line.strip())] = True

Takes about 40 seconds here.

bj@s8n:~$ time python test.py

real 0m38.758s
user 0m25.290s
sys 0m1.580s

Ciao,
Marc 'BlackJack' Rintsch


Ricardo Aráoz

Nov 11, 2007, 2:18:55 PM
to Michael Bacarella, pytho...@python.org
Michael Bacarella wrote:
>>> This would seem to implicate the line id2name[id] = name as being
>>> excruciatingly slow.

>> As others have pointed out there is no way that this takes 45
>> minutes.Must be something with your system or setup.
>>
>> A functionally equivalent code for me runs in about 49 seconds!
>> (it ends up using about twice as much ram as the data on disk)
>
> You can download the list of keys from here, it's 43M gzipped:
> http://www.sendspace.com/file/9530i7
>
> and see it take about 45 minutes with this:
>
> $ cat cache-keys.py
> #!/usr/bin/python
> v = {}
> for line in open('keys.txt'):
>     v[long(line.strip())] = True

Have you tried taking the long away? I mean:

    v[line.strip()] = True

Ben Finney

Nov 11, 2007, 5:23:13 PM
DouhetSukd <Douhe...@gmail.com> writes:

> Ah, well, just noticed Ben's suggested this already. Mind you, his
> code, while correct in intent, does look a bit fishy (missing those
> square brackets)

By "missing those square brackets", what would be a list comprehension
(allocating an entire list in memory at once before proceeding)
becomes a generator expression (generating results from the sequence
only as needed).

Enter 'python "generator expression"' into your favourite search
engine to find out more.
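(The difference in one runnable contrast, on toy data of my own: identical results either way, but different peak memory while dict() consumes the pairs.)

```python
# With brackets, the whole list of pairs is materialized before dict()
# starts; without them, pairs are produced one at a time on demand.
pairs_list = [(i, i * i) for i in range(5)]   # list comprehension
pairs_gen = ((i, i * i) for i in range(5))    # generator expression

d1 = dict(pairs_list)
d2 = dict(pairs_gen)
print(d1 == d2)  # True: same dict either way
```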

--
\ "What I resent is that the range of your vision should be the |
`\ limit of my action." -- Henry James |
_o__) |
Ben Finney

Istvan Albert

Nov 11, 2007, 10:40:42 PM
On Nov 11, 11:51 am, Michael Bacarella <m...@gpshopper.com> wrote:

> and see it take about 45 minutes with this:
>
> $ cat cache-keys.py
> #!/usr/bin/python
> v = {}
> for line in open('keys.txt'):
>     v[long(line.strip())] = True

On my system (windows vista) your code (using your data) runs in:

36 seconds with python 2.4
25 seconds with python 2.5
39 seconds with python 3000


i.


Hrvoje Niksic

Nov 12, 2007, 5:37:42 AM

This is becoming an FAQ on this newsgroup, having come up first with
lists and now with dictionaries. It can also be considered a serious
implementation problem, since code like Michael's is expected to
perform in (close to) linear time and in fact did so in previous
versions of Python, and still does in Perl and other popular scripting
languages. Neither introductory nor intermediate Python learning
materials don't cover the need to disable GC when building large data
structures.

Maybe the default GC strategy should be rethought.
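(One milder knob than disabling collection outright — raising the generation-0 threshold with gc.set_threshold so collections fire far less often during bulk allocation; the numbers here are purely illustrative, not a recommendation from the thread.)

```python
import gc

saved = gc.get_threshold()          # remember whatever is configured
gc.set_threshold(100000, 10, 10)    # collect generation 0 far less often
try:
    # A bulk build that allocates many container objects.
    d = {i: str(i) for i in range(10000)}
finally:
    gc.set_threshold(*saved)        # restore the original thresholds
```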

Steven D'Aprano

Nov 12, 2007, 6:57:42 AM
On Sun, 11 Nov 2007 11:42:08 -0700, Arkanes wrote:

> It takes about 20 seconds for me. It's possible it's related to int/long
> unification - try using Python 2.5. If you can't switch to 2.5, try
> using string keys instead of longs.

I'm getting similar behaviour to the Original Poster, and I'm using
Python 2.5 under Linux.

Unlike him, my PC only has 1GB of RAM, and I've got a LOT of other apps
running, so I'm not surprised it's taking about two minutes to build the
dict. I put that down to low memory, and given the limitations of my
system, I'm not worried about that.

But taking 30+ minutes to delete the dict, even with garbage collection
off, that can't be good.

--
Steven.

Steven D'Aprano

Nov 12, 2007, 7:52:24 AM
On Sun, 11 Nov 2007 08:25:15 -0800, Michael Bacarella wrote:

> Firstly, thank you for all of your help so far, I really appreciate it.
>
>>> So, you think Python's dict implementation degrades towards O(N)
>>> performance when it's fed millions of 64-bit pseudo-random longs?
>>
>> No.
>
> Yes.
>
> I tried your code (with one change, time on the feedback lines) and
> got the same terrible performance against my data set.

Hmmm... you're getting even worse performance than I do. I read the dict
in pretty quickly considering the limited memory on my system, and only
run into trouble when I try deleting the dict.

But still... no, Python dicts shouldn't degrade that badly. Whatever is
going on, it almost certainly isn't a problem with the implementation of
dicts.

Here's a thought... try reading the data into one big list of key-value
pairs instead of a dict. If you still get terrible performance, I think
that pretty much eliminates the dict as the culprit.
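(That diagnostic, sketched: the identical parse work, but appending to a list instead of inserting into a dict — if this also crawls, the dict is exonerated. Helper name and sample lines are mine.)

```python
# Same parsing as the dict version, minus the dict insertion, so list
# growth is isolated from dict insertion.
def slurp_pairs(lines):
    pairs = []
    for line in lines:
        id_, name = line.strip().split(":", 1)
        pairs.append((int(id_), name))
    return pairs

print(slurp_pairs(["42:alpha\n", "7:beta\n"]))  # [(42, 'alpha'), (7, 'beta')]
```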

> To prove that my machine is sane, I ran the same against your generated
> sample file and got _excellent_ performance. Start to finish in under a
> minute.

Not a fair comparison, because it is doing something completely
different. But still valuable to see that your install isn't *completely*
broken.


>> Notice that the dict is completely read into memory in just two and a
>> half minutes. The script then tries to delete the dict, and 32 minutes
>> later is still struggling. That's the point I got sick of waiting and
>> interrupted the script.
>>
>> Conclusion: it's a memory issue, or maybe a garbage collection issue,
>> not a problem with dicts.
>
> As you can see, not the case at all against my data set.

Yes, I see now that your problems are worse than my problems.


>> (1) Presumably you don't want to run your app with the garbage
>> collector turned off. You might still want to play around with the gc
>> module to see what you can learn.
>
> As you can see, your version did no better. :(

Somewhat better, but still struggling.


>> (2) More memory will help avoid paging. If you can't get more memory,
>> try more virtual memory. It will still be slow, but at least the
>> operating system doesn't have to try moving blocks around as much.
>
> The machine has 8GB, and is not doing anything else when I run this.

Yes, fair enough.


>> (3) Are you sure you need all eight-million-plus items in the cache
>> all at once?
>
> Yes.

I remain skeptical, but what do I know, I don't even know what you're
doing with the data once you have it :-)


>> (4) There are lots of algorithms out there for dealing with data too
>> big to fit into main memory. Do some research.
>
> It DOES fit into main memory and a dictionary is exactly the right way
> to do this.

I agree now.


I think that you and I are experiencing anomalous results. I'm not even
certain that it is specifically a Python problem -- I'd be a lot happier
if somebody got the same terrible performance on a different OS.

What do we have in common? We're using different versions of Python (you
2.3, me 2.5) and different versions of Linux.

I wonder if we happen to have the same hardware... what's your CPU?

$ cat /proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 15
model : 107
model name : AMD Athlon(tm) 64 X2 Dual Core Processor 4400+
stepping : 1
cpu MHz : 1000.000
cache size : 512 KB
...
processor : 1
vendor_id : AuthenticAMD
cpu family : 15
model : 107
model name : AMD Athlon(tm) 64 X2 Dual Core Processor 4400+
stepping : 1
cpu MHz : 1000.000
cache size : 512 KB
...


--
Steven.

Hrvoje Niksic

Nov 12, 2007, 8:22:00 AM11/12/07
to
Michael Bacarella <mb...@gpshopper.com> writes:

> $ uname -a
> Linux xxx 2.6.9-22.ELsmp #1 SMP Sat Oct 8 21:32:36 BST 2005 x86_64 x86_64 x86_64 GNU/Linux
>
> We've also tried it on this version (on a different machine):
>

> $ uname -a
> Linux yyy 2.6.18-8.el5 #1 SMP Thu Mar 15 19:46:53 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux

Note that both machines are x86_64. Please try your test on a 32-bit
machine and report the results. I suspect performance won't degrade
there.

Most of the people who tried your test couldn't repeat your result
(including me), but some did. It is quite possible that those who did
were running on variants of x86_64, just like you. My guess is you've
hit a bug that degrades performance of hashing of longs, or of large
dicts, only affecting 64-bit architectures.

As a workaround, you can try skipping the long() construction and
simply hashing strings. (There should be no loss of information, and
string objects are probably just as compact as Python longs in that
range.)
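The string-key workaround can be sketched as follows (synthetic sample lines stand in for id2name.txt):

```python
# Keep the keys as strings: skip the long()/int() conversion entirely.
# Sample data stands in for the real id2name.txt.
lines = [
    "11293102971459182412:Descriptive unique name for this record",
    "950918240981208142:Another name for another record",
]

id2name = {}
for line in lines:
    key, name = line.strip().split(":", 1)
    id2name[key] = name  # the decimal string itself is the key
```

Lookups then take the key as a string, e.g. id2name['950918240981208142'].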

Jeffrey Froman

Nov 12, 2007, 10:41:20 AM11/12/07
to
Hrvoje Niksic wrote:

> Note that both machines are x86_64.  Please try your test on a 32-bit
> machine and report the results.  I suspect performance won't degrade
> there.

This theory seems to be supported by my findings. Running the test on a
32-bit machine took 45 seconds, while the same test on a 64-bit machine is
taking ... 30 minutes SO FAR (I'm getting impatient ;-)

Both machines are running Python 2.3.4 under CentOS-4, and both have 1 GB of
RAM. The 64-bit machine has two processors at 2792.097 MHz each, while the
32-bit machine has a single 1194.857 MHz processor.


Jeffrey

Michael Bacarella

Nov 12, 2007, 12:34:10 PM11/12/07
to Michael Bacarella, pytho...@python.org
> id2name[key >> 40][key & 0x10000000000] = name

Oops, typo. It's actually:

id2name[key >> 40][key & 0xffffffffff] = name


Michael Bacarella

Nov 12, 2007, 12:39:00 PM11/12/07
to pytho...@python.org
> > and see it take about 45 minutes with this:
> >
> > $ cat cache-keys.py
> > #!/usr/bin/python
> > v = {}
> > for line in open('keys.txt'):
> >     v[long(line.strip())] = True
>
> On my system (windows vista) your code (using your data) runs in:
>
> 36 seconds with python 2.4
> 25 seconds with python 2.5
> 39 seconds with python 3000

*wow*

The win32 Python or the cygwin Python?

What CPU architecture?


Michael Bacarella

Nov 12, 2007, 12:46:32 PM11/12/07
to pytho...@python.org
> > You can download the list of keys from here, it's 43M gzipped:
> > http://www.sendspace.com/file/9530i7
> >
> > and see it take about 45 minutes with this:
> >
> > $ cat cache-keys.py
> > #!/usr/bin/python
> > v = {}
> > for line in open('keys.txt'):
> >     v[long(line.strip())] = True
> >
> >
> It takes about 20 seconds for me. It's possible it's related to
> int/long
> unification - try using Python 2.5. If you can't switch to 2.5, try
> using string keys instead of longs.

Yes, this was it. It ran *very* fast on Python v2.5.

Terribly on v2.4, v2.3.

(I thought I had already evaluated v2.5, but I see now that the server
with 2.5 on it invokes 2.3 for 'python'.)

Thanks!

Paul Rubin

Nov 12, 2007, 2:31:00 PM11/12/07
to

Are you saying this is a patch that appeared after python 2.3?
Somehow I don't think it's come up in this thread whether you've
tried any later versions.

Istvan Albert

Nov 12, 2007, 4:40:23 PM11/12/07
to
On Nov 12, 12:39 pm, "Michael Bacarella" <m...@gpshopper.com> wrote:

> The win32 Python or the cygwin Python?
>
> What CPU architecture?

It is the win32 version, on a dual-core laptop with a T5220 Core 2.

MrJean1

Nov 12, 2007, 6:40:06 PM11/12/07
to
On MacOS** your version

#!/usr/bin/python
v = {}
for line in open('keys.txt'):
    v[long(line.strip())] = True

using these <http://www.sendspace.com/file/9530i7> keys takes

24.9 secs with ActivePython 2.5.1
and
212.3 secs with Apple's Python 2.3.5.


However, this version

#!/usr/bin/python
v = {}
for line in open('keys.txt'):
    v[line] = True

takes
16.5 secs with ActivePython 2.5.1
and
18.7 secs with Apple's Python 2.3.5.


/Jean Brouwers

**) MacOS X 10.4.10 on Mini Mac, 1.83 GHz Intel Core Duo, 2 GB RAM.

Michael Bacarella

Nov 12, 2007, 12:32:27 PM11/12/07
to pytho...@python.org
See end for solution.

> >> (3) Are you sure you need all eight-million-plus items in the cache
> >> all at once?
> >
> > Yes.
>
> I remain skeptical, but what do I know, I don't even know what you're
> doing with the data once you have it :-)

It's OK, I'd be skeptical too. ;)

> $ cat /proc/cpuinfo
> processor : 0
> vendor_id : AuthenticAMD
> cpu family : 15
> model : 107
> model name : AMD Athlon(tm) 64 X2 Dual Core Processor 4400+
> stepping : 1
> cpu MHz : 1000.000
> cache size : 512 KB
> ...
> processor : 1
> vendor_id : AuthenticAMD
> cpu family : 15
> model : 107
> model name : AMD Athlon(tm) 64 X2 Dual Core Processor 4400+
> stepping : 1
> cpu MHz : 1000.000
> cache size : 512 KB

# cat /proc/cpuinfo


processor : 0
vendor_id : AuthenticAMD
cpu family : 15

model : 5
model name : AMD Opteron(tm) Processor 246
stepping : 10
cpu MHz : 2009.305
cache size : 1024 KB
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext lm 3dnowext 3dnow
bogomips : 3981.31
TLB size : 1088 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp

processor : 1
vendor_id : AuthenticAMD
cpu family : 15

model : 5
model name : AMD Opteron(tm) Processor 246
stepping : 10
cpu MHz : 2009.305
cache size : 1024 KB
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext lm 3dnowext 3dnow
bogomips : 4014.08
TLB size : 1088 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp


As for the solution: after trying a half-dozen different integer hashing
functions and hash table sizes (the brute-force approach), on a total whim
I switched to a model with two dictionary tiers and got whole orders of
magnitude better performance.

The tiering is, for a given key of type long:

id2name[key >> 40][key & 0x10000000000] = name

Much, much better. A few minutes versus hours this way.

I suspect it could be brought down to seconds with a third level of
tiers but this is no longer posing the biggest bottleneck... ;)

Thanks!


Francesc Altet

Nov 13, 2007, 11:27:11 AM11/13/07
to pytho...@python.org
On Monday 12 November 2007, Michael Bacarella wrote:

> As for the solution, after trying a half-dozen different integer
> hashing functions
> and hash table sizes (the brute force approach), on a total whim I
> switched to a
> model with two dictionary tiers and got whole orders of magnitude
> better performance.
>
> The tiering is, for a given key of type long:
>
> id2name[key >> 40][key & 0x10000000000] = name
>
> Much, much better. A few minutes versus hours this way.
>
> I suspect it could be brought down to seconds with a third level of
> tiers but this is no longer posing the biggest bottleneck... ;)

I don't know exactly why you need a dictionary for keeping the data,
but in case you want ultra-fast access to values, there is no
replacement for keeping a sorted list of keys, a list with the
original indices to values, and the proper list of values. Then, to
access a value, you only have to do a binary search on the sorted list,
another lookup in the original indices list, and then go straight to the
value in the value list. This should be the fastest approach I can
think of.
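A minimal sketch of that scheme, using the standard bisect module (toy data; the benchmarks later in this thread find a plain dict faster):

```python
import bisect

# Sorted-keys approach: sort the keys once, keep a parallel list of the
# original indices, and binary-search at lookup time.
keys = [42, 7, 99, 13]
values = ["answer", "seven", "ninety-nine", "thirteen"]

order = sorted(range(len(keys)), key=keys.__getitem__)  # indices, sorted by key
sorted_keys = [keys[i] for i in order]

def lookup(key):
    pos = bisect.bisect_left(sorted_keys, key)
    if pos < len(sorted_keys) and sorted_keys[pos] == key:
        return values[order[pos]]  # map back to the original value list
    raise KeyError(key)
```

Each lookup here costs O(log N) plus two extra list indexings, versus the dict's O(1) hash probe.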

Another possibility is using an indexed column in a table in a DB.
Lookups there should be much faster than using a dictionary as well.

HTH,

--
>0,0< Francesc Altet     http://www.carabos.com/
V V Cárabos Coop. V.   Enjoy Data
"-"

Paul McGuire

Nov 13, 2007, 12:09:32 PM11/13/07
to
> Thanks!

Shouldn't this be:

id2name[key >> 40][key & 0xffffffffff] = name

?

-- Paul

Michael Bacarella

Nov 13, 2007, 12:28:19 PM11/13/07
to pytho...@python.org
> Shouldn't this be:
>
> id2name[key >> 40][key & 0xffffffffff] = name

Yes, exactly, I had done hex(pow(2,40)) when I meant hex(pow(2,40)-1)

I sent my correction a few minutes afterwards but Mailman
queued it for moderator approval (condition with replying to
myself?)
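With the corrected mask, the two-tier scheme can be sketched like this (the helper names are illustrative; setdefault creates an inner dict on first use):

```python
# Two-tier dictionary: the top bits (key >> 40) select the outer dict,
# the low 40 bits (mask 0xffffffffff == 2**40 - 1) select the inner one.
id2name = {}

def tier_insert(key, name):
    id2name.setdefault(key >> 40, {})[key & 0xffffffffff] = name

def tier_lookup(key):
    return id2name[key >> 40][key & 0xffffffffff]

tier_insert(11293102971459182412, "Descriptive unique name for this record")
```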


Hrvoje Niksic

Nov 13, 2007, 2:18:20 PM11/13/07
to
Francesc Altet <fal...@carabos.com> writes:

> I don't know exactly why you need a dictionary for keeping the
> data, but in case you want ultra-fast access to values, there is no
> replacement for keeping a sorted list of keys and a list with the
> original indices to values, and the proper list of values. Then, to
> access a value, you only have to do a binary search on the sorted
> list, another lookup in the original indices list and then go
> straight to the value in the value list.

Actually, dictionary lookup is faster than binary search because it
doesn't need to search through the table at all, it simply pinpoints
the value (most of the time).

Steven D'Aprano

Nov 13, 2007, 5:31:57 PM11/13/07
to
On Tue, 13 Nov 2007 17:27:11 +0100, Francesc Altet wrote:

> I don't know exactly why you need a dictionary for keeping the data,
> but in case you want ultra-fast access to values, there is no
> replacement for keeping a sorted list of keys and a list with the
> original indices to values, and the proper list of values. Then, to
> access a value, you only have to do a binary search on the sorted list,
> another lookup in the original indices list and then go straight to the
> value in the value list. This should be the fastest approach I can think
> of.


Maybe on Bizarro World, but not on Planet Earth.

Searching a dict takes on average a constant amount of time, regardless
of the key and the size of the dict. Because dicts are the basis of
Python's namespaces, it is optimized up the wazoo, and being implemented
in C it is extremely fast.

Using your suggestion, to get a value: you have to do a binary search
which is O(log N) instead of O(1), then ANOTHER lookup into a list of
indices, and finally a THIRD lookup to actually retrieve the value -- all
implemented in relatively slow Python instead of fast C.

And let's not even discuss the overhead of keeping these three lists
synchronized.

But don't take my word for it: measure for yourself. I'm not even going
to attempt to code your suggestion, but here's how you measure the time
it takes for dictionary lookup.


# Create a dataset to search on.
D = {}
chars = "abcdefghijklmnopqrstuvwxyz"
triplets = (a+b+c for a in chars for b in chars for c in chars)
for i, s in enumerate(triplets):
    D[s] = i  # D is of the form {'abc': 12345}

# Now time a successful search for an arbitrary triplet.
import timeit
min(timeit.Timer("D['foo']", "from __main__ import D").repeat(6))


On my PC, it takes about 0.2 seconds per million searches.

--
Steven.

Istvan Albert

Nov 13, 2007, 9:34:41 PM11/13/07
to
On Nov 13, 11:27 am, Francesc Altet <fal...@carabos.com> wrote:

> Another possibility is using an indexed column in a table in a DB.
> Lookups there should be much faster than using a dictionary as well.

I would agree with others who state that for a simple key-based lookup
nothing beats a dictionary.

But knowing that you are one of the authors of the excellent pytables
module I think you are approaching this problem from the perspective
of reading in a large number of values on each access. In those cases
specialized libraries (like the hdf) can directly load into memory
huge amounts of continuous data at speeds that substantially
outperform all other approaches.

i.

Francesc Altet

Nov 14, 2007, 6:29:49 AM11/14/07
to pytho...@python.org
On Tuesday 13 November 2007, Steven D'Aprano wrote:

> On Tue, 13 Nov 2007 17:27:11 +0100, Francesc Altet wrote:
> > I don't know exactly why you need a dictionary for keeping the
> > data, but in case you want ultra-fast access to values, there is no
> > replacement for keeping a sorted list of keys and a list with the
> > original indices to values, and the proper list of values. Then,
> > to access a value, you only have to do a binary search on the
> > sorted list, another lookup in the original indices list and then
> > go straight to the value in the value list. This should be the
> > fastest approach I can think of.
>
> Maybe on Bizarro World, but not on Planet Earth.
>
> Searching a dict takes on average a constant amount of time,
> regardless of the key and the size of the dict. Because dicts are the
> basis of Python's namespaces, it is optimized up the wazoo, and being
> implemented in C it is extremely fast.
>
> Using your suggestion, to get a value: you have to do a binary search
> which is O(log N) instead of O(1), then ANOTHER lookup into a list of
> indices, and finally a THIRD lookup to actually retrieve the value --
> all implemented in relatively slow Python instead of fast C.

Well, the bisect module in Python is implemented in fast C. Apart from
that, you are basically right, see below.

> And let's not even discuss the overhead of keeping these three lists
> synchronized.
>
> But don't take my word for it: measure for yourself. I'm not even
> going to attempt to code your suggestion, but here's how you measure
> the time it takes for dictionary lookup.
>
>
> # Create a dataset to search on.
> D = {}
> chars = "abcdefghijklmnopqrstuvwxyz"
> triplets = (a+b+c for a in chars for b in chars for c in chars)
> for i, s in enumerate(triplets):
>     D[s] = i  # D is of the form {'abc': 12345}
>
> # Now time a successful search for an arbitrary triplet.
> import timeit
> min(timeit.Timer("D['foo']", "from __main__ import D").repeat(6))
>
>
> On my PC, it takes about 0.2 seconds per million searches.

Oh yes. All of you are right, guys. I've implemented code (attached)
for measuring the lookup times using a a dictionary and using a binary
search approach (using numpy, mostly for space efficiency) and here are
my results:

$ python2.5 gq-dict.py
Creating the dictionary...
Time for dict creation: 89.115
Items in dict: 8191180
Timing queries...
Time for querying: 39.44

$ python2.5 gq-binary-search.py
Creating the lists...
Time for lists creation: 4.245
Sorting...
Time for sorting: 6.195
Timing queries...
Time for querying: 108.1

i.e. a dict approach proves to be almost 3x faster than a regular binary
search (5 us/lookup vs 13 us/lookup).

I think I got mixed up by some benchmarks that I did on this subject
some time ago; either my memory is bad or I made some mistake in
those experiments. My apologies.

Cheers,

gq-binary-search.py
gq-dict.py

Francesc Altet

Nov 14, 2007, 6:39:52 AM11/14/07
to pytho...@python.org
On Wednesday 14 November 2007, Istvan Albert wrote:

Yes, could be. However, I was well aware that the problem was looking
up one single value, so, mea culpa :(. Sorry for introducing noise.

Aaron Watters

Nov 14, 2007, 9:43:58 AM11/14/07
to
On Nov 12, 12:46 pm, "Michael Bacarella" <m...@gpshopper.com> wrote:
>
> > It takes about 20 seconds for me. It's possible it's related to
> > int/long
> > unification - try using Python 2.5. If you can't switch to 2.5, try
> > using string keys instead of longs.
>
> Yes, this was it. It ran *very* fast on Python v2.5.

Um. Is this the take away from this thread? Longs as dictionary
keys are bad? Only for older versions of Python?

This could be a problem for people like me who build
lots of structures using seek values, which are longs, as done in
http://nucular.sourceforge.net and http://bplusdotnet.sourceforge.net
and elsewhere. Someone please summarize.

-- Aaron Watters
===
http://www.xfeedme.com/nucular/pydistro.py/go?FREETEXT=white%20trash

Hrvoje Niksic

Nov 14, 2007, 12:16:25 PM11/14/07
to
Aaron Watters <aaron....@gmail.com> writes:

> On Nov 12, 12:46 pm, "Michael Bacarella" <m...@gpshopper.com> wrote:
>>
>> > It takes about 20 seconds for me. It's possible it's related to
>> > int/long
>> > unification - try using Python 2.5. If you can't switch to 2.5, try
>> > using string keys instead of longs.
>>
>> Yes, this was it. It ran *very* fast on Python v2.5.
>
> Um. Is this the take away from this thread? Longs as dictionary
> keys are bad? Only for older versions of Python?

It sounds like Python 2.4 (and previous versions) had a bug when
populating large dicts on 64-bit architectures.

> Someone please summarize.

Yes, that would be good.

Steven D'Aprano

Nov 14, 2007, 6:26:17 PM11/14/07
to

No, I found very similar behaviour with Python 2.5.


>> Someone please summarize.
>
> Yes, that would be good.


On systems with multiple CPUs or 64-bit systems, or both, creating and/or
deleting a multi-megabyte dictionary in recent versions of Python (2.3,
2.4, 2.5 at least) takes a LONG time, of the order of 30+ minutes,
compared to seconds if the system only has a single CPU. Turning garbage
collection off doesn't help.
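For reference, the usual way to take the cyclic garbage collector out of the picture during a bulk load looks like this (a toy-sized load; as noted above, this did not cure the slowdown being discussed):

```python
import gc

# Disable the cyclic collector for the duration of the load, then re-enable
# it even if the load raises.
gc.disable()
try:
    d = {}
    for i in range(100000):
        d[i] = str(i)
finally:
    gc.enable()
```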


--
Steven.

Aaron Watters

Nov 15, 2007, 9:40:14 AM11/15/07
to
On Nov 14, 6:26 pm, Steven D'Aprano <st...@REMOVE-THIS-

cybersource.com.au> wrote:
> >> Someone please summarize.
>
> > Yes, that would be good.
>
> On systems with multiple CPUs or 64-bit systems, or both, creating and/or
> deleting a multi-megabyte dictionary in recent versions of Python (2.3,
> 2.4, 2.5 at least) takes a LONG time, of the order of 30+ minutes,
> compared to seconds if the system only has a single CPU. Turning garbage
> collection off doesn't help.
>
> --
> Steven.

criminy... Any root cause? patch?

btw, I think I've seen this, but I think you need
to get into 10s of megs or more before it becomes
critical.

Note: I know someone will say "don't scare off the newbies"
but in my experience most Python programmers are highly
experienced professionals who need to know this sort of thing.
The bulk of the newbies are either off in VB land
or struggling with java.

-- Aaron Watters

===
http://www.xfeedme.com/nucular/pydistro.py/go?FREETEXT=silly+walk

Aaron Watters

Nov 15, 2007, 11:40:12 AM11/15/07
to
On Nov 14, 6:26 pm, Steven D'Aprano <st...@REMOVE-THIS-
cybersource.com.au> wrote:

> On systems with multiple CPUs or 64-bit systems, or both, creating and/or
> deleting a multi-megabyte dictionary in recent versions of Python (2.3,
> 2.4, 2.5 at least) takes a LONG time, of the order of 30+ minutes,
> compared to seconds if the system only has a single CPU. Turning garbage
> collection off doesn't help.

Fwiw, testing on a 2-CPU 64-bit machine with 1 GB real memory, I
consistently run out of real memory before I see this effect, so I guess
it kicks in for dicts that consume beyond that. That's better than I
feared at any rate...

-- Aaron Watters

===
http://www.xfeedme.com/nucular/pydistro.py/go?FREETEXT=especially+nasty+windows

Chris Mellon

Nov 15, 2007, 11:51:08 AM11/15/07
to pytho...@python.org
On Nov 14, 2007 5:26 PM, Steven D'Aprano

I can't duplicate this in a dual CPU (64 bit, but running in 32 bit
mode with a 32 bit OS) system. I added keys to a dict until I ran out
of memory (a bit over 22 million keys) and deleting the dict took
about 8 seconds (with a stopwatch, so not very precise, but obviously
less than 30 minutes).

>>> d = {}
>>> idx = 0
>>> while idx < 1e10:
...     d[idx] = idx
...     idx += 1
...
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
MemoryError
>>> len(d)
22369622
>>> del d

Istvan Albert

Nov 15, 2007, 2:11:57 PM11/15/07
to
On Nov 14, 6:26 pm, Steven D'Aprano <st...@REMOVE-THIS-
cybersource.com.au> wrote:

> On systems with multiple CPUs or 64-bit systems, or both, creating and/or
> deleting a multi-megabyte dictionary in recent versions of Python (2.3,
> 2.4, 2.5 at least) takes a LONG time, of the order of 30+ minutes,
> compared to seconds if the system only has a single CPU.

Please don't propagate this nonsense. If you see this then the problem
exists between the chair and monitor.

There is nothing wrong with either creating or deleting
dictionaries.

i.

Aaron Watters

Nov 15, 2007, 2:53:56 PM11/15/07
to
On Nov 15, 2:11 pm, Istvan Albert <istvan.alb...@gmail.com> wrote:
> There is nothing wrong with either creating or deleting
> dictionaries.

I suspect what happened is this: on 64 bit
machines the data structures for creating dictionaries
are larger (because pointers take twice as much space),
so you run into memory contention issues sooner than
on 32 bit machines, for similar memory sizes.
If there is something deeper going
on please correct me, I would very much like to know.

-- Aaron Watters

===
http://www.xfeedme.com/nucular/pydistro.py/go?FREETEXT=alien+friend

Michael Bacarella

Nov 15, 2007, 3:51:25 PM11/15/07
to pytho...@python.org
> On Nov 15, 2:11 pm, Istvan Albert <istvan.alb...@gmail.com> wrote:
> > There is nothing wrong with either creating or deleting
> > dictionaries.
>
> I suspect what happened is this: on 64 bit
> machines the data structures for creating dictionaries
> are larger (because pointers take twice as much space),
> so you run into memory contention issues sooner than
> on 32 bit machines, for similar memory sizes.
> If there is something deeper going
> on please correct me, I would very much like to know.

Since some people missed the EUREKA!, here's the executive summary:

Python2.3: about 45 minutes
Python2.4: about 45 minutes
Python2.5: about _30 seconds_

The cut/paste of the EUREKA MOMENT from earlier in this thread:



> > You can download the list of keys from here, it's 43M gzipped:
> > http://www.sendspace.com/file/9530i7
> >
> > and see it take about 45 minutes with this:
> >
> > $ cat cache-keys.py
> > #!/usr/bin/python
> > v = {}
> > for line in open('keys.txt'):
> >     v[long(line.strip())] = True
> >
> >

> It takes about 20 seconds for me. It's possible it's related to
> int/long
> unification - try using Python 2.5. If you can't switch to 2.5, try
> using string keys instead of longs.

Yes, this was it. It ran *very* fast on Python v2.5.

Terribly on v2.4, v2.3.

Steven D'Aprano

Nov 15, 2007, 3:57:52 PM11/15/07
to
On Thu, 15 Nov 2007 10:51:08 -0600, Chris Mellon wrote:

> I can't duplicate this in a dual CPU (64 bit, but running in 32 bit mode
> with a 32 bit OS) system.

Can you try it running in 64-bit mode?

What Python version are you running?

--
Steven.

Hrvoje Niksic

Nov 15, 2007, 3:51:21 PM11/15/07
to
Steven D'Aprano <st...@REMOVE-THIS-cybersource.com.au> writes:

>>> Someone please summarize.
>>
>> Yes, that would be good.
>
> On systems with multiple CPUs or 64-bit systems, or both, creating and/or
> deleting a multi-megabyte dictionary in recent versions of Python (2.3,
> 2.4, 2.5 at least) takes a LONG time, of the order of 30+ minutes,
> compared to seconds if the system only has a single CPU.

Can you post minimal code that exhibits this behavior on Python 2.5.1?
The OP posted a lot of different versions, most of which worked just
fine for most people.

Steven D'Aprano

Nov 15, 2007, 4:11:51 PM11/15/07
to

Please read the whole thread before making unsupported statements like
that. You should consider that this behaviour has been observed by
multiple people, before making insulting statements.

Both myself and the original poster have given code that demonstrates
this problem. We've given concrete evidence of a problem which is
replicable across different versions of Python and different versions of
the Linux operating system.

Unless you're accusing both myself and the original poster of outright
lying, of faking our results, what's your explanation? Have you tried
running our code on a 64-bit or multi-CPU system to see for yourself, or
are you just being closed-minded and arrogant?


--
Steven.

Chris Mellon

Nov 15, 2007, 4:16:36 PM11/15/07
to pytho...@python.org
On Nov 15, 2007 2:57 PM, Steven D'Aprano

C:\>python
Python 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit (Intel)] on
win32
Type "help", "copyright", "credits" or "license" for more information.
>>>

I can't switch to 64 bit right now. This is a dual Core 2 Duo machine
(2 CPUs, 2 cores each).

I've looked at the deallocation code for dict objects and I can't see
anything that should be affected by 64-bitness, at least not to nearly
the degree described - it DECREFs all the keys and values in the dict,
calls PyMem_Free on its memory pool (if it was large enough to be
malloced, which this certainly was), and either adds itself back to the
dict object pool or deallocs itself (if it's a subclass of dict).

Steven D'Aprano

Nov 15, 2007, 4:21:41 PM11/15/07
to
On Thu, 15 Nov 2007 15:51:25 -0500, Michael Bacarella wrote:

> Since some people missed the EUREKA!, here's the executive summary:
>
> Python2.3: about 45 minutes
> Python2.4: about 45 minutes
> Python2.5: about _30 seconds_

I'm really happy that upgrading to 2.5 solved the issue for you, but I
get similar behaviour and I'm running 2.5, so it isn't as simple as that.

--
Steven.

Michael Bacarella

Nov 15, 2007, 4:36:17 PM11/15/07
to pytho...@python.org

Maybe some more details about my environment will help.

Our fast Python is 2.5, built from source. Our slow Python is 2.3,
default installed with the OS, from rpm.

# python
Python 2.5.1 (r251:54863, May 11 2007, 14:17:21)
[GCC 3.4.4 20050721 (Red Hat 3.4.4-2)] on linux2


Type "help", "copyright", "credits" or "license" for more information.

# which python
/usr/local/bin/python

# ls -la /usr/local/bin/python
-rwxr-xr-x 2 root root 5312572 Nov 12 12:54 /usr/local/bin/python

# md5sum /usr/local/bin/python
5c24e54a6cf5a556e9371325d18bc1fb /usr/local/bin/python

# ls -la /usr/bin/python*
lrwxrwxrwx 1 root root 9 Nov 12 12:57 /usr/bin/python -> python2
lrwxrwxrwx 1 root root 6 Nov 12 12:57 /usr/bin/python2 -> python2.5
-rwxr-xr-x 1 root root 8344 May 2 2007 /usr/bin/python2.3
lrwxrwxrwx 1 root root 21 Nov 12 12:57 /usr/bin/python2.5 ->
/usr/local/bin/python

# ls -la /usr/local/src/ | grep Python
drwxr-xr-x 19 1000 1000 4096 Nov 12 12:54 Python-2.5.1
-rw-r--r-- 1 root root 9383651 Nov 12 12:51 Python-2.5.1.tar.bz2

# gcc -v
Reading specs from /usr/lib/gcc/x86_64-redhat-linux/3.4.4/specs
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man
--infodir=/usr/share/info --enable-shared --enable-threads=posix
--disable-checking --with-system-zlib --enable-__cxa_atexit
--disable-libunwind-exceptions --enable-java-awt=gtk
--host=x86_64-redhat-linux
Thread model: posix
gcc version 3.4.4 20050721 (Red Hat 3.4.4-2)

# cat /etc/redhat-release
CentOS release 4.2 (Final)

# uname -a
Linux xxx 2.6.9-22.ELsmp #1 SMP Sat Oct 8 21:32:36 BST 2005 x86_64 x86_64
x86_64 GNU/Linux

# free
total used free shared buffers cached
Mem: 7390244 6961440 428804 0 15520 2549732
-/+ buffers/cache: 4396188 2994056
Swap: 2096472 10280 2086192


***** information about Python 2.3 *****

# rpm -qi python
Name : python Relocations: (not relocatable)
Version : 2.3.4 Vendor: CentOS
Release : 14.4 Build Date: Wed 02 May 2007
07:20:29 PM EDT
Install Date: Mon 04 Jun 2007 05:48:29 PM EDT Build Host: builder6
Group : Development/Languages Source RPM:
python-2.3.4-14.4.src.rpm
Size : 21137194 License: PSF - see LICENSE
Signature : DSA/SHA1, Sat 05 May 2007 09:33:49 AM EDT, Key ID
a53d0bab443e1821
URL : http://www.python.org/
Summary : An interpreted, interactive, object-oriented programming
language.
Description :
Python is an interpreted, interactive, object-oriented programming
language often compared to Tcl, Perl, Scheme or Java. Python includes
modules, classes, exceptions, very high level dynamic data types and
dynamic typing. Python supports interfaces to many system calls and
libraries, as well as to various windowing systems (X11, Motif, Tk,
Mac and MFC).

Programmers can write new built-in modules for Python in C or C++.
Python can be used as an extension language for applications that need
a programmable interface. This package contains most of the standard
Python modules, as well as modules for interfacing to the Tix widget
set for Tk and RPM.

Note that documentation for Python is provided in the python-docs
package.

Steven D'Aprano

Nov 15, 2007, 5:10:02 PM11/15/07
to

Who were testing it on single-CPU, 32 bit systems.

The plot thickens... I wrote another version of my test code, reading the
data into a list of tuples rather than a dict:

$ python slurp_dict4.py # actually slurp a list, despite the name
Starting at Fri Nov 16 08:55:26 2007
Line 0
Line 1000000
Line 2000000
Line 3000000
Line 4000000
Line 5000000
Line 6000000
Line 7000000
Line 8000000
Items in list: 8191180
Completed import at Fri Nov 16 08:56:26 2007
Starting to delete list...
Completed deletion at Fri Nov 16 08:57:04 2007
Finishing at Fri Nov 16 08:57:04 2007

Quite a reasonable speed, considering my limited memory.

What do we know so far?

(1) The problem occurs whether or not gc is enabled.

(2) It only occurs on some architectures. 64 bit CPU seems to be common
factor.

(3) I don't think we've seen it demonstrated under Windows, but we've
seen it under at least two different Linux distros.

(4) It affects very large dicts, but not very large lists.

(5) I've done tests where instead of one really big dict, the data is put
into lots of smaller dicts. The problem still occurs.

(6) It was suggested the problem is related to long/int unification, but
I've done tests that kept the dict keys as strings, and the problem still
occurs.

(7) It occurs in Python 2.3, 2.4 and 2.5, but not 2.5.1.
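A reduced harness for reproducing these findings might look like this (N is scaled down from the thread's ~8.2 million keys, and the large base offset mimics the original data's big integer keys):

```python
import time

# Time both the creation and the deletion of a dict with large integer keys.
N = 200000
BASE = 10 ** 18  # keys of roughly the magnitude seen in id2name.txt

t0 = time.time()
d = dict((BASE + i, True) for i in range(N))
create_secs = time.time() - t0

t0 = time.time()
del d
delete_secs = time.time() - t0

print("create: %.2fs  delete: %.2fs" % (create_secs, delete_secs))
```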

Do we treat this as a solved problem and move on?

--
Steven.

Paul Rubin

Nov 15, 2007, 5:16:35 PM11/15/07