11293102971459182412:Descriptive unique name for this record\n
950918240981208142:Another name for another record\n
The file's properties are:
# wc -l id2name.txt
8191180 id2name.txt
# du -h id2name.txt
517M id2name.txt
I'm loading the file into memory with code like this:
id2name = {}
for line in iter(open('id2name.txt').readline,''):
    id,name = line.strip().split(':')
    id = long(id)
    id2name[id] = name
This takes about 45 *minutes*
If I comment out the last line in the loop body it takes only about 30 _seconds_ to run.
This would seem to implicate the line id2name[id] = name as being excruciatingly slow.
Is there a fast, functionally equivalent way of doing this?
(Yes, I really do need this cached. No, an RDBMS or disk-based hash is not fast enough.)
> id2name = {}
> for line in iter(open('id2name.txt').readline,''):
>     id,name = line.strip().split(':')
>     id = long(id)
>     id2name[id] = name
>
> This takes about 45 *minutes*
>
> If I comment out the last line in the loop body it takes only about
> 30 _seconds_ to run. This would seem to implicate the line
> id2name[id] = name as being excruciatingly slow.
Or, rather, that the slowdown is caused by allocating these items in a
dictionary at all.
Dictionaries are implemented very efficiently in Python, but there
will still be overhead in inserting millions of distinct items. Of
course, if you just throw each item away instead of allocating space
for it, the loop will run very quickly.
> Is there a fast, functionally equivalent way of doing this?
You could, instead of individual assignments in a 'for' loop, try
letting the 'dict' type operate on a generator::
input_file = open("id2name.txt")
id2name = dict(
    (long(id), name)
    for (id, name) in (line.strip().split(":", 1) for line in input_file)
)
All that code inside the 'dict()' call is a "generator expression"; if
you don't know what they are yet, have a read of Python's
documentation on them. It creates a generator which will spit out
key+value tuples to be fed directly to the dict constructor as it
requests them.
That allows the generator to parse each item from the file exactly as
the 'dict' constructor needs it, possibly saving some extra "allocate,
assign, discard" steps. Not having your data set, I can't say if it'll
be significantly faster.
--
\ "Compulsory unification of opinion achieves only the unanimity |
`\ of the graveyard." -- Justice Roberts in 319 U.S. 624 (1943) |
_o__) |
Ben Finney
> The id2name.txt file is an index of primary keys to strings. They look
> like this:
>
> 11293102971459182412:Descriptive unique name for this record\n
> 950918240981208142:Another name for another record\n
>
> The file's properties are:
>
> # wc -l id2name.txt
>
> 8191180 id2name.txt
> # du -h id2name.txt
> 517M id2name.txt
>
> I'm loading the file into memory with code like this:
>
> id2name = {}
> for line in iter(open('id2name.txt').readline,''):
>     id,name = line.strip().split(':')
>     id = long(id)
>     id2name[id] = name
That's an awfully complicated way to iterate over a file. Try this
instead:
id2name = {}
for line in open('id2name.txt'):
    id,name = line.strip().split(':')
    id = long(id)
    id2name[id] = name
On my system, it takes about a minute and a half to produce a dictionary
with 8191180 entries.
> This takes about 45 *minutes*
>
> If I comment out the last line in the loop body it takes only about 30
> _seconds_ to run. This would seem to implicate the line id2name[id] =
> name as being excruciatingly slow.
No, dictionary access is one of the most highly-optimized, fastest, most
efficient parts of Python. What it indicates to me is that your system is
running low on memory, and is struggling to find room for 517MB worth of
data.
> Is there a fast, functionally equivalent way of doing this?
>
> (Yes, I really do need this cached. No, an RDBMS or disk-based hash is
> not fast enough.)
You'll pardon me if I'm skeptical. Considering the convoluted, weird way
you had to iterate over a file, I wonder what other less-than-efficient
parts of your code you are struggling under. Nine times out of ten, if a
program runs too slowly, it's because you're using the wrong algorithm.
--
Steven.
As Steven says, maybe you need to add more RAM to your system. The
memory overhead of dictionary cells is considerable. If worse comes
to worst, you could concoct some more storage-efficient representation.
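One hypothetical direction for such a representation (a sketch in Python 3 syntax, since the thread's code is Python 2; the sample data and names are illustrative): keep the 64-bit keys in a compact C array from the standard `array` module instead of as boxed Python integers.

```python
from array import array

# Hypothetical sketch: an array('Q') stores each 64-bit key in exactly
# 8 bytes, while a dict of Python ints pays for a hash-table slot plus
# a full integer object per key.
keys = array('Q', [11293102971459182412, 950918240981208142])
names = ["Descriptive unique name", "Another name"]  # parallel list

# Each key occupies a fixed 8 bytes, regardless of how many keys exist.
key_bytes = keys.itemsize * len(keys)
```

This only stores the keys compactly; finding a key then requires a search over the array (e.g. after sorting, a binary search), which is the sorted-list approach suggested later in the thread.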
>
> id2name = {}
> for line in open('id2name.txt'):
>     id,name = line.strip().split(':')
>     id = long(id)
>     id2name[id] = name
>
> > This takes about 45 *minutes*
> >
> On my system, it takes about a minute and a half to produce a
> dictionary with 8191180 entries.
Doing something similar on my system is very fast as well.
$ cat dict-8191180.py
#!/usr/bin/python
v = {}
for i in xrange(8191180):
    v[i] = i
$ time ./dict-8191180.py
real 0m5.877s
user 0m4.953s
sys 0m0.924s
But...
> > If I comment out the last line in the loop body it takes only about
> > 30 _seconds_ to run. This would seem to implicate the line
> > id2name[id] = name as being excruciatingly slow.
>
> No, dictionary access is one of the most highly-optimized, fastest,
> most efficient parts of Python. What it indicates to me is that your
> system is running low on memory, and is struggling to find room for
> 517MB worth of data.
If only it were so easy.
$ free
total used free shared buffers cached
Mem: 7390244 2103448 5286796 0 38996 1982756
-/+ buffers/cache: 81696 7308548
Swap: 2096472 10280 2086192
Here's your Python implementation running as badly as mine did.
$ wc -l id2name.txt
8191180 id2name.txt
$ cat cache-id2name.py
#!/usr/bin/python
id2name = {}
for line in open('id2name.txt'):
    id,name = line.strip().split(':',1)
    id = long(id)
    id2name[id] = name
$ time ./cache-id2name.py
^C
I let it go 30 minutes before killing it since I had to leave. Here it is in top before I did the deed.
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
18802 root 25 0 1301m 1.2g 1148 R 99.9 17.1 36:05.99 cache-id2name.p
36 minutes, 05.99 seconds.
To rule out the file reading/parsing logic as the culprit, here's the
same thing, but with the dictionary insertion removed.
$ cat nocache-id2name.py
#!/usr/bin/python
id2name = {}
for line in open('id2name.txt'):
    id,name = line.strip().split(':',1)
    id = long(id)
$ time ./nocache-id2name.py
real 0m33.518s
user 0m33.070s
sys 0m0.415s
Here's a Perl implementation running very fast.
$ cat cache-id2name.pl
#!/usr/bin/perl
my %id2name = ();
my $line;
my $id;
my $name;
open(F, "<id2name.txt");
foreach $line (<F>) {
    chomp($line);
    ($id, $name) = split(/:/, $line, 2);
    $id = int($id);
    $id2name{$id} = $name;
}
$ time ./cache-id2name.pl
real 0m46.363s
user 0m43.730s
sys 0m2.611s
So, you think Python's dict implementation degrades towards O(N)
performance when it's fed millions of 64-bit pseudo-random longs?
> So, you think Python's dict implementation degrades towards O(N)
> performance when it's fed millions of 64-bit pseudo-random longs?
No.
Here's my sample file:
$ wc -l id2name.txt
8191180 id2name.txt
$ ls -lh id2name.txt
-rw-rw-r-- 1 steve steve 371M 2007-11-11 14:00 id2name.txt
And the results of reading it into a dict (code follows below):
$ time ./slurp_dict.py
Starting at Sun Nov 11 14:26:51 2007
Line 0
Line 1000000
Line 2000000
Line 3000000
Line 4000000
Line 5000000
Line 6000000
Line 7000000
Line 8000000
Items in dict: 8191180
Completed import at Sun Nov 11 14:29:31 2007
Starting to delete dict...
Traceback (most recent call last):
File "./slurp_dict.py", line 20, in <module>
del id2name
KeyboardInterrupt
real 35m52.334s
user 1m17.663s
sys 0m16.758s
Notice that the dict is completely read into memory in just two and a
half minutes. The script then tries to delete the dict, and 32 minutes
later is still struggling. That's the point I got sick of waiting and
interrupted the script.
Conclusion: it's a memory issue, or maybe a garbage collection issue, not
a problem with dicts.
Here's my code for creating the key:value file in the first place:
#!/usr/bin/python
"""Make a big file of 64-bit integer keys plus random values."""
bits64 = 2**64
import random
template = '%d:This is a bunch of text...\n'
fp = open('id2name.txt', 'w')
for i in xrange(8191180):
    fp.write(template % random.randint(0, bits64))
fp.close()
###
And here's my code for slurping it in to a dict:
#!/usr/bin/python
"""Read a big file into a dict."""
import gc
import time
print "Starting at %s" % time.asctime()
flag = gc.isenabled()
gc.disable()
id2name = {}
for n, line in enumerate(open('id2name.txt', 'r')):
    if n % 1000000 == 0:
        # Give feedback.
        print "Line %d" % n
    id,name = line.strip().split(':', 1)
    id = long(id)
    id2name[id] = name
print "Items in dict:", len(id2name)
print "Completed import at %s" % time.asctime()
print "Starting to delete dict..."
del id2name
print "Completed deletion at %s" % time.asctime()
if flag:
gc.enable()
print "Finishing at %s" % time.asctime()
###
So, what can you do? Given that I'm not willing to do any more unpaid
experimentation for you, here are my suggestions, in no particular order:
(1) Presumably you don't want to run your app with the garbage collector
turned off. You might still want to play around with the gc module to see
what you can learn.
(2) More memory will help avoid paging. If you can't get more memory, try
more virtual memory. It will still be slow, but at least the operating
system doesn't have to try moving blocks around as much.
(3) Are you sure you need all eight-million-plus items in the cache all
at once?
(4) There are lots of algorithms out there for dealing with data too big
to fit into main memory. Do some research.
(5) There is a data structure designed for dealing with tens of millions
of records at once. It is called "a database". If you can't find a better
algorithm, and refuse to use an existing RDBMS, I suspect you're going to
end up inventing a primitive, inefficient, buggy database which is no
faster than existing systems out there.
--
Steven.
Based on his previous post, it would seem he has 7GB of RAM (with about
5GB free) and 2GB of swap. I don't think RAM is the issue.
Maybe there's something wrong with his specific Python installation.
What version is it and was it compiled by him?
I think I know what's going on: the dictionary updates are sending the
GC into quadratic behavior. Try turning off the GC:
import gc
gc.disable()
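A minimal sketch of that suggestion (written here in Python 3 syntax; the helper name and sample data are made up): pause the cyclic collector for the duration of the bulk build, then restore its previous state.

```python
import gc

def build_dict(pairs):
    """Build a large dict with the cyclic garbage collector paused.

    CPython's collector can be triggered repeatedly during bulk
    allocation; disabling it for the duration of the build and
    restoring its prior state afterwards avoids those passes.
    """
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        return {key: value for key, value in pairs}
    finally:
        if was_enabled:
            gc.enable()

# Example with two hand-written (id, name) pairs.
cache = build_dict([(11293102971459182412, "first"),
                    (950918240981208142, "second")])
```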
Tried that already. No difference. :(
> Tried that already. No difference. :(
Not sure if it would make a difference, and it would imply
reorganizing your preceding lines, but what about doing the dictionary
build in one go, rather than incrementally? Use the dict function,
which takes a list of (key, value) tuples. I use it frequently as a
space-saver and it works well enough. It may speed things up, I
don't know. I had to break out your formatting into its own function,
and that can't help too much.
Something like:
def fmt(line):
    id,name = line.strip().split(':')
    id = long(id)
    return (id,name)

id2name = dict([fmt(line) for line in
                iter(open('id2name.txt').readline,'')])
Cheers
Steven> $ time ./slurp_dict.py
Steven> Starting at Sun Nov 11 14:26:51 2007
Steven> Line 0
Steven> Line 1000000
Steven> Line 2000000
Steven> Line 3000000
Steven> Line 4000000
Steven> Line 5000000
Steven> Line 6000000
Steven> Line 7000000
Steven> Line 8000000
Steven> Items in dict: 8191180
Steven> Completed import at Sun Nov 11 14:29:31 2007
Steven> Starting to delete dict...
Steven> Traceback (most recent call last):
Steven>   File "./slurp_dict.py", line 20, in <module>
Steven>     del id2name
Steven> KeyboardInterrupt
Steven> real 35m52.334s
Steven> user 1m17.663s
Steven> sys 0m16.758s
Steven> Notice that the dict is completely read into memory in
Steven> just two and a half minutes. The script then tries to
Steven> delete the dict, and 32 minutes later is still
Steven> struggling. That's the point I got sick of waiting and
Steven> interrupted the script.
Steven> Conclusion: it's a memory issue, or maybe a garbage
Steven> collection issue, not a problem with dicts.
uh, strange results...
I run your same scripts with and without garbage collection enabled
and those are the results:
with gc enabled:
azazel@lizard:~/wip/zodb_test$ python slurp_dict.py
Starting at Sun Nov 11 16:35:12 2007
Line 0
Line 1000000
Line 2000000
Line 3000000
Line 4000000
Line 5000000
Line 6000000
Line 7000000
Line 8000000
Items in dict: 8191180
Completed import at Sun Nov 11 16:36:03 2007
Starting to delete dict...
Completed deletion at Sun Nov 11 16:36:09 2007
Finishing at Sun Nov 11 16:36:09 2007
and without gc enabled
azazel@lizard:~/wip/zodb_test$ python slurp_dict.py
Starting at Sun Nov 11 16:39:02 2007
Line 0
Line 1000000
Line 2000000
Line 3000000
Line 4000000
Line 5000000
Line 6000000
Line 7000000
Line 8000000
Items in dict: 8191180
Completed import at Sun Nov 11 16:39:49 2007
Starting to delete dict...
Completed deletion at Sun Nov 11 16:39:56 2007
Finishing at Sun Nov 11 16:39:56 2007
all with python2.4 on i386 Linux
cheers
Alberto
This version:
$ python
Python 2.3.4 (#1, May 2 2007, 19:18:17)
[GCC 3.4.6 20060404 (Red Hat 3.4.6-8)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>>
$ rpm -qi python
Name : python Relocations: (not relocatable)
Version : 2.3.4 Vendor: CentOS
Release : 14.4 Build Date: Wed 02 May 2007 07:20:29 PM EDT
Install Date: Mon 04 Jun 2007 05:48:29 PM EDT Build Host: builder6
Group : Development/Languages Source RPM: python-2.3.4-14.4.src.rpm
Size : 21137194 License: PSF - see LICENSE
Signature : DSA/SHA1, Sat 05 May 2007 09:33:49 AM EDT, Key ID a53d0bab443e1821
URL : http://www.python.org/
$ uname -a
Linux xxx 2.6.9-22.ELsmp #1 SMP Sat Oct 8 21:32:36 BST 2005 x86_64 x86_64 x86_64 GNU/Linux
We've also tried it on this version (on a different machine):
$ python
Python 2.4.3 (#1, Mar 14 2007, 19:01:42)
[GCC 4.1.1 20070105 (Red Hat 4.1.1-52)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
$ rpm -qi python
Name : python Relocations: (not relocatable)
Version : 2.4.3 Vendor: CentOS
Release : 19.el5 Build Date: Wed 14 Mar 2007 11:06:42 PM UTC
Install Date: Mon 29 Oct 2007 07:03:16 PM UTC Build Host: builder6
Group : Development/Languages Source RPM: python-2.4.3-19.el5.src.rpm
Size : 22087600 License: PSF - see LICENSE
Signature : DSA/SHA1, Wed 04 Apr 2007 12:26:58 AM UTC, Key ID a8a447dce8562897
URL : http://www.python.org/
Summary : An interpreted, interactive, object-oriented programming language.
$ uname -a
Linux yyy 2.6.18-8.el5 #1 SMP Thu Mar 15 19:46:53 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux
> This would seem to implicate the line id2name[id] = name as being excruciatingly slow.
As others have pointed out, there is no way that this takes 45
minutes. Must be something with your system or setup.
A functionally equivalent code for me runs in about 49 seconds!
(it ends up using about twice as much ram as the data on disk)
i.
> > So, you think Python's dict implementation degrades towards O(N)
> > performance when it's fed millions of 64-bit pseudo-random longs?
>
> No.
Yes.
I tried your code (with one change, time on feedback lines) and got
the same terrible performance against my data set.
$ cat stevescode.py
#!/usr/bin/python
"""Read a big file into a dict."""
import gc
import time
print "Starting at %s" % time.asctime()
flag = gc.isenabled()
gc.disable()
id2name = {}
for n, line in enumerate(open('id2name.txt', 'r')):
    if n % 1000000 == 0:
        # Give feedback.
        print "Line %d" % n, time.asctime()
    id,name = line.strip().split(':', 1)
    id = long(id)
    id2name[id] = name
print "Items in dict:", len(id2name)
print "Completed import at %s" % time.asctime()
print "Starting to delete dict..."
del id2name
print "Completed deletion at %s" % time.asctime()
if flag:
gc.enable()
print "Finishing at %s" % time.asctime()
$ ./stevescode.py
Starting at Sun Nov 11 10:46:37 2007
Line 0 Sun Nov 11 10:46:37 2007
Line 1000000 Sun Nov 11 10:48:22 2007
Line 2000000 Sun Nov 11 10:51:08 2007
Line 3000000 Sun Nov 11 10:56:16 2007
Line 4000000 Sun Nov 11 10:58:41 2007
Line 5000000 Sun Nov 11 11:03:26 2007
^C
To prove that my machine is sane, I ran the same against your generated
sample file and got _excellent_ performance. Start to finish in under a minute.
$ cat steves-makedict.py
#!/usr/bin/python
"""Make a big file of 64-bit integer keys plus random values."""
bits64 = 2**64
import random
template = '%d:This is a bunch of text...\n'
fp = open('id2name.txt', 'w')
for i in xrange(8191180):
    fp.write(template % random.randint(0, bits64))
fp.close()
$ ./steves-makedict.py
$ ./stevescode.py
Starting at Sun Nov 11 11:15:31 2007
Line 0 Sun Nov 11 11:15:31 2007
Line 1000000 Sun Nov 11 11:15:37 2007
Line 2000000 Sun Nov 11 11:15:43 2007
Line 3000000 Sun Nov 11 11:15:49 2007
Line 4000000 Sun Nov 11 11:15:54 2007
Line 5000000 Sun Nov 11 11:16:00 2007
Line 6000000 Sun Nov 11 11:16:07 2007
Line 7000000 Sun Nov 11 11:16:12 2007
Line 8000000 Sun Nov 11 11:16:18 2007
Items in dict: 8191180
Completed import at Sun Nov 11 11:16:19 2007
Starting to delete dict...
Completed deletion at Sun Nov 11 11:16:23 2007
Finishing at Sun Nov 11 11:16:23 2007
> Notice that the dict is completely read into memory in just two and a
> half minutes. The script then tries to delete the dict, and 32
> minutes later is still struggling. That's the point I got sick of
> waiting and interrupted the script.
>
> Conclusion: it's a memory issue, or maybe a garbage collection issue,
> not a problem with dicts.
As you can see, not the case at all against my data set.
> (1) Presumably you don't want to run your app with the garbage collector
> turned off. You might still want to play around with the gc module to see
> what you can learn.
As you can see, your version did no better. :(
> (2) More memory will help avoid paging. If you can't get more memory,
> try more virtual memory. It will still be slow, but at least the
> operating system doesn't have to try moving blocks around as much.
The machine has 8GB, and is not doing anything else when I run this.
> (3) Are you sure you need all eight-million-plus items in the cache
> all at once?
Yes.
> (4) There are lots of algorithms out there for dealing with data too big
> to fit into main memory. Do some research.
It DOES fit into main memory and a dictionary is exactly the right way
to do this.
> (5) There is a data structure designed for dealing with tens of
> millions of records at once. It is called "a database". If you can't
> find a better algorithm, and refuse to use an existing RDBMS, I
> suspect you're going to end up inventing a primitive, inefficient,
> buggy database which is no faster than existing systems out there.
I've tried three disk-based implementations already (BerkeleyDB, cdb,
and an RDBMS). Performance is terrible because they end up doing too
many random disk seeks. Pre-caching all of the data ahead of time has
offered us the best performance so far, but it is still slower than it
ought to be.
Creating a HEAP TABLE in the RDBMS is an idea, but moving all of this
really simple code into the RDBMS just to find a hashing algorithm
that doesn't choke on my keys sounds pretty lame.
A hash cached in main memory is the right way to do this. The Perl
version runs *very* fast, after all.
> I tried your code (with one change, time on feedback lines) and got
> the same terrible performance against my data set.
> To prove that my machine is sane, I ran the same against your
> generated sample file and got _excellent_ performance. Start to
> finish in under a minute.
One possibility could be that your dataset turns out to be some sort
of pathological worst case for the hashing algorithm in python.
i.
You can download the list of keys from here, it's 43M gzipped:
http://www.sendspace.com/file/9530i7
and see it take about 45 minutes with this:
$ cat cache-keys.py
#!/usr/bin/python
v = {}
for line in open('keys.txt'):
    v[long(line.strip())] = True
Cool!
Putting that on the resume. ;)
>> > This would seem to implicate the line id2name[id] = name as being
>> > excruciatingly slow.
>>
>> As others have pointed out there is no way that this takes 45
>> minutes. Must be something with your system or setup.
>>
>> A functionally equivalent code for me runs in about 49 seconds!
>> (it ends up using about twice as much ram as the data on disk)

Michael> You can download the list of keys from here, it's 43M
Michael> gzipped: http://www.sendspace.com/file/9530i7

Michael> and see it take about 45 minutes with this:
I've downloaded your keys, run your program and this is the result:
$ du -h keys.txt
128M keys.txt
$ time python cache_keys.py
real 0m55.913s
user 0m35.286s
sys 0m0.852s
$ python
Python 2.4.4 (#2, Apr 26 2007, 00:02:45)
[GCC 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)] on linux2
$ uname -a
Linux lizard 2.6.21-1-k7 #1 SMP Sat May 26 16:56:05 UTC 2007 i686 GNU/Linux
cheers
Alberto
>> As others have pointed out there is no way that this takes 45
>> minutes. Must be something with your system or setup.
>>
>> A functionally equivalent code for me runs in about 49 seconds!
>> (it ends up using about twice as much ram as the data on disk)
>
> You can download the list of keys from here, it's 43M gzipped:
> http://www.sendspace.com/file/9530i7
>
> and see it take about 45 minutes with this:
>
> $ cat cache-keys.py
> #!/usr/bin/python
> v = {}
> for line in open('keys.txt'):
>     v[long(line.strip())] = True
Takes about 40 seconds here.
bj@s8n:~$ time python test.py
real 0m38.758s
user 0m25.290s
sys 0m1.580s
Ciao,
Marc 'BlackJack' Rintsch
Have you tried taking the long away? I mean:
v[line.strip()] = True
> Ah, well, just noticed Ben's suggested this already. Mind you, his
> code, while correct in intent, does look a bit fishy (missing those
> square brackets)
By "missing those square brackets", what would be a list comprehension
(allocating an entire list in memory at once before proceeding)
becomes a generator expression (generating results from the sequence
only as needed).
Enter 'python "generator expression"' into your favourite search
engine to find out more.
--
\ "What I resent is that the range of your vision should be the |
`\ limit of my action." -- Henry James |
_o__) |
Ben Finney
> and see it take about 45 minutes with this:
>
> $ cat cache-keys.py
> #!/usr/bin/python
> v = {}
> for line in open('keys.txt'):
>     v[long(line.strip())] = True
On my system (windows vista) your code (using your data) runs in:
36 seconds with python 2.4
25 seconds with python 2.5
39 seconds with python 3000
i.
This is becoming an FAQ on this newsgroup, having come up first with
lists and now with dictionaries. It can also be considered a serious
implementation problem, since code like Michael's is expected to
perform in (close to) linear time and in fact did so in previous
versions of Python, and still does in Perl and other popular scripting
languages. Neither introductory nor intermediate Python learning
materials cover the need to disable GC when building large data
structures.
Maybe the default GC strategy should be rethought.
> It takes about 20 seconds for me. It's possible it's related to int/long
> unification - try using Python 2.5. If you can't switch to 2.5, try
> using string keys instead of longs.
I'm getting similar behaviour to the Original Poster, and I'm using
Python 2.5 under Linux.
Unlike him, my PC only has 1GB of RAM, and I've got a LOT of other apps
running, so I'm not surprised it's taking about two minutes to build the
dict. I put that down to low memory, and given the limitations of my
system, I'm not worried about that.
But taking 30+ minutes to delete the dict, even with garbage collection
off, that can't be good.
--
Steven.
> Firstly, thank you for all of your help so far, I really appreciate it.
>
>> > So, you think Python's dict implementation degrades towards O(N)
>> > performance when it's fed millions of 64-bit pseudo-random longs?
>>
>> No.
>
> Yes.
>
> I tried your code (with one change, time on feedback lines) and got
> the same terrible performance against my data set.
Hmmm... you're getting even worse performance than I do. I read the dict
in pretty quickly considering the limited memory on my system, and only
run into trouble when I try deleting the dict.
But still... no, Python dicts shouldn't degrade that badly. Whatever is
going on, it almost certainly isn't a problem with the implementation of
dicts.
Here's a thought... try reading the data into one big list of key-value
pairs instead of a dict. If you still get terrible performance, I think
that pretty much eliminates the dict as the culprit.
> To prove that my machine is sane, I ran the same against your generated
> sample file and got _excellent_ performance. Start to finish in under a
> minute.
Not a fair comparison, because it is doing something completely
different. But still valuable to see that your install isn't *completely*
broken.
>> Notice that the dict is completely read into memory in just two and
>> a half minutes. The script then tries to delete the dict, and 32
>> minutes later is still struggling. That's the point I got sick of
>> waiting and interrupted the script.
>>
>> Conclusion: it's a memory issue, or maybe a garbage collection issue,
>> not a problem with dicts.
>
> As you can see, not the case at all against my data set.
Yes, I see now that your problems are worse than my problems.
>> (1) Presumably you don't want to run your app with the garbage
>> collector turned off. You might still want to play around with the gc
>> module to see what you can learn.
>
> As you can see, your version did no better. :(
Somewhat better, but still struggling.
>> (2) More memory will help avoid paging. If you can't get more
>> memory, try more virtual memory. It will still be slow, but at least
>> the operating system doesn't have to try moving blocks around as
>> much.
>
> The machine has 8GB, and is not doing anything else when I run this.
Yes, fair enough.
>> (3) Are you sure you need all eight-million-plus items in the cache
>> all at once?
>
> Yes.
I remain skeptical, but what do I know, I don't even know what you're
doing with the data once you have it :-)
>> (4) There are lots of algorithms out there for dealing with data too
>> big to fit into main memory. Do some research.
>
> It DOES fit into main memory and a dictionary is exactly the right way
> to do this.
I agree now.
I think that you and I are experiencing anomalous results. I'm not even
certain that it is specifically a Python problem -- I'd be a lot happier
if somebody got the same terrible performance on a different OS.
What do we have in common? We're using different versions of Python (you
2.3, me 2.5) and different versions of Linux.
I wonder if we happen to have the same hardware... what's your CPU?
$ cat /proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 15
model : 107
model name : AMD Athlon(tm) 64 X2 Dual Core Processor 4400+
stepping : 1
cpu MHz : 1000.000
cache size : 512 KB
...
processor : 1
vendor_id : AuthenticAMD
cpu family : 15
model : 107
model name : AMD Athlon(tm) 64 X2 Dual Core Processor 4400+
stepping : 1
cpu MHz : 1000.000
cache size : 512 KB
...
--
Steven.
> $ uname -a
> Linux xxx 2.6.9-22.ELsmp #1 SMP Sat Oct 8 21:32:36 BST 2005 x86_64 x86_64 x86_64 GNU/Linux
>
> We've also tried it on this version (on a different machine):
>
> $ uname -a
> Linux yyy 2.6.18-8.el5 #1 SMP Thu Mar 15 19:46:53 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux
Note that both machines are x86_64. Please try your test on a 32-bit
machine and report the results. I suspect performance won't degrade
there.
Most of the people who tried your test couldn't repeat your result
(including me), but some did. It is quite possible that those who did
were running on variants of x86_64, just like you. My guess is you've
hit a bug that degrades performance of hashing of longs, or of large
dicts, only affecting 64-bit architectures.
As a workaround, you can try skipping the long() construction and
simply hashing strings. (There should be no loss of information, and
string objects are probably just as compact as Python longs in that
range.)
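That workaround can be sketched as follows (Python 3 syntax, with made-up sample lines): keep the raw decimal strings as keys and drop the integer conversion entirely.

```python
sample = ["11293102971459182412:Descriptive unique name\n",
          "950918240981208142:Another name\n"]

id2name = {}
for line in sample:
    key, name = line.strip().split(":", 1)
    id2name[key] = name   # key stays a str; no long()/int() conversion
```

Lookups then use the decimal string form of the id, so callers must format their keys as strings before indexing.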
> Note that both machines are x86_64. Please try your test on a 32-bit
> machine and report the results. I suspect performance won't degrade
> there.
This theory seems to be supported by my findings. Running the test on a
32-bit machine took 45 seconds, while the same test on a 64-bit machine is
taking ... 30 minutes SO FAR (I'm getting impatient ;-)
Both machines are running Python 2.3.4 under CentOS-4, and both have
1G of RAM. The 64-bit machine has two processors weighing in at
2792.097 MHz each, while the 32-bit machine has only one 1194.857 MHz
processor.
Jeffrey
Oops, typo. It's actually:
id2name[key >> 40][key & 0xffffffffff] = name
*wow*
The win32 Python or the cygwin Python?
What CPU architecture?
Yes, this was it. It ran *very* fast on Python v2.5.
Terribly on v2.4, v2.3.
(I thought I had already evaluated v2.5, but I see now that the server
with 2.5 on it invokes 2.3 for 'python'.)
Thanks!
Are you saying this is a patch that appeared after python 2.3?
Somehow I don't think it's come up in this thread whether you've
tried any later versions.
> The win32 Python or the cygwin Python?
>
> What CPU architecture?
it is the win32 version, a dual core laptop with T5220 Core 2
#!/usr/bin/python
v = {}
for line in open('keys.txt'):
    v[long(line.strip())] = True
using these <http://www.sendspace.com/file/9530i7> keys takes
24.9 secs with ActivePython 2.5.1
and
212.3 secs with Apple's Python 2.3.5.
However, this version
#!/usr/bin/python
v = {}
for line in open('keys.txt'):
    v[line] = True
takes
16.5 secs with ActivePython 2.5.1
and
18.7 secs with Apple's Python 2.3.5.
/Jean Brouwers
**) MacOS X 10.4.10 on Mini Mac, 1.83 GHz Intel Core Duo, 2 GB RAM.
> >> (3) Are you sure you need all eight-million-plus items in the cache
> >> all at once?
> >
> > Yes.
>
> I remain skeptical, but what do I know, I don't even know what you're
> doing with the data once you have it :-)
It's OK, I'd be skeptical too. ;)
> $ cat /proc/cpuinfo
> processor : 0
> vendor_id : AuthenticAMD
> cpu family : 15
> model : 107
> model name : AMD Athlon(tm) 64 X2 Dual Core Processor 4400+
> stepping : 1
> cpu MHz : 1000.000
> cache size : 512 KB
> ...
> processor : 1
> vendor_id : AuthenticAMD
> cpu family : 15
> model : 107
> model name : AMD Athlon(tm) 64 X2 Dual Core Processor 4400+
> stepping : 1
> cpu MHz : 1000.000
> cache size : 512 KB
# cat /proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 15
model : 5
model name : AMD Opteron(tm) Processor 246
stepping : 10
cpu MHz : 2009.305
cache size : 1024 KB
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext lm 3dnowext 3dnow
bogomips : 3981.31
TLB size : 1088 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp
processor : 1
vendor_id : AuthenticAMD
cpu family : 15
model : 5
model name : AMD Opteron(tm) Processor 246
stepping : 10
cpu MHz : 2009.305
cache size : 1024 KB
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext lm 3dnowext 3dnow
bogomips : 4014.08
TLB size : 1088 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp
As for the solution: after trying a half-dozen different integer
hashing functions and hash table sizes (the brute force approach), on
a total whim I switched to a model with two dictionary tiers and got
whole orders of magnitude better performance.
The tiering is, for a given key of type long:
id2name[key >> 40][key & 0x10000000000] = name
Much, much better. A few minutes versus hours this way.
I suspect it could be brought down to seconds with a third level of
tiers but this is no longer posing the biggest bottleneck... ;)
Thanks!
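A hedged sketch of that two-tier scheme in Python 3 syntax (the mask below uses the corrected value 0xffffffffff, i.e. 2**40 - 1, per the follow-up in this thread; `put` and `get` are illustrative helper names, not the poster's code):

```python
LOW_MASK = (1 << 40) - 1   # 0xffffffffff: the low 40 bits of the key

id2name = {}

def put(key, name):
    # The top 24 bits pick the outer dict; the low 40 bits index the
    # inner dict, so no single dict ever holds all 8 million keys.
    id2name.setdefault(key >> 40, {})[key & LOW_MASK] = name

def get(key):
    return id2name[key >> 40][key & LOW_MASK]

put(11293102971459182412, "first record")
put(950918240981208142, "second record")
```

Splitting the key this way spreads the entries over many smaller hash tables, which is presumably why it sidestepped whatever degraded the single large dict.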
I don't know exactly why you need a dictionary for keeping the data,
but in case you want ultra-fast access to values, there is no
replacement for keeping a sorted list of keys, a list of the original
indices into the values, and the proper list of values. Then, to
access a value, you only have to do a binary search on the sorted list,
another lookup in the original-indices list, and then go straight to
the value in the value list. This is the fastest approach I can
think of.
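For concreteness, here is a minimal Python 3 sketch of the three-list scheme described above, with made-up keys and names; the helper names (`lookup`, `order`, `sorted_keys`) are mine, not from any posted code:

```python
import bisect

# Hypothetical data: keys and values in their original order.
keys = [950918240981208142, 11293102971459182412, 5]
values = ["name-a", "name-b", "name-c"]

# Build the sorted-key list and the index list mapping sorted
# positions back to original positions, once, up front.
order = sorted(range(len(keys)), key=lambda i: keys[i])
sorted_keys = [keys[i] for i in order]

def lookup(key):
    # Binary search on the sorted keys...
    pos = bisect.bisect_left(sorted_keys, key)
    if pos == len(sorted_keys) or sorted_keys[pos] != key:
        raise KeyError(key)
    # ...then follow the original-index list to the value.
    return values[order[pos]]

print(lookup(5))  # -> "name-c"
```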
Another possibility is using an indexed column in a table in a DB.
Lookups there should be much faster than using a dictionary as well.
HTH,
--
>0,0< Francesc Altet http://www.carabos.com/
V V Cárabos Coop. V. Enjoy Data
"-"
Shouldn't this be:
id2name[key >> 40][key & 0xffffffffff] = name
?
-- Paul
Yes, exactly. I had done hex(pow(2,40)) when I meant hex(pow(2,40)-1).
I sent my correction a few minutes afterwards, but Mailman
queued it for moderator approval (a condition of replying to
myself?).
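Putting the correction together, the two-tier scheme looks something like this (a sketch in modern Python 3 syntax; the `store`/`fetch` helper names are mine, not from the thread):

```python
# Two-tier dict: the high bits of the key select an inner dict,
# the low 40 bits index within it.
MASK = (1 << 40) - 1  # 0xffffffffff, the corrected mask

id2name = {}

def store(key, name):
    inner = id2name.setdefault(key >> 40, {})
    inner[key & MASK] = name

def fetch(key):
    return id2name[key >> 40][key & MASK]

store(11293102971459182412, "Descriptive unique name")
print(fetch(11293102971459182412))
```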
> I don't know exactly why do you need a dictionary for keeping the
> data, but in case you want ultra-fast access to values, there is no
> replacement for keeping a sorted list of keys and a list with the
> original indices to values, and the proper list of values. Then, to
> access a value, you only have to do a binary search on the sorted
> list, another lookup in the original indices list and then go
> straight to the value in the value list.
Actually, dictionary lookup is faster than binary search because it
doesn't need to search through the table at all, it simply pinpoints
the value (most of the time).
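A toy illustration of that "pinpointing": the key's hash selects a bucket directly, with no search over the other entries. This is only an illustration of the principle; CPython's real dict is far more sophisticated (perturbed probing, resizing, and so on):

```python
class ToyHashTable:
    """Toy open-addressing hash table, illustrating O(1) lookup."""

    def __init__(self, size=16):
        self.slots = [None] * size

    def _probe(self, key):
        # The hash jumps straight to a candidate slot...
        i = hash(key) % len(self.slots)
        # ...with linear probing only on collision.
        while self.slots[i] is not None and self.slots[i][0] != key:
            i = (i + 1) % len(self.slots)
        return i

    def __setitem__(self, key, value):
        self.slots[self._probe(key)] = (key, value)

    def __getitem__(self, key):
        slot = self.slots[self._probe(key)]
        if slot is None:
            raise KeyError(key)
        return slot[1]

t = ToyHashTable()
t["foo"] = 1
print(t["foo"])  # 1
```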
> I don't know exactly why do you need a dictionary for keeping the data,
> but in case you want ultra-fast access to values, there is no
> replacement for keeping a sorted list of keys and a list with the
> original indices to values, and the proper list of values. Then, to
> access a value, you only have to do a binary search on the sorted list,
> another lookup in the original indices list and then go straight to the
> value in the value list. This should be the faster approach I can think
> of.
Maybe on Bizarro World, but not on Planet Earth.
Searching a dict takes on average a constant amount of time, regardless
of the key and the size of the dict. Because dicts are the basis of
Python's namespaces, it is optimized up the wazoo, and being implemented
in C it is extremely fast.
Using your suggestion, to get a value: you have to do a binary search
which is O(log N) instead of O(1), then ANOTHER lookup into a list of
indices, and finally a THIRD lookup to actually retrieve the value -- all
implemented in relatively slow Python instead of fast C.
And let's not even discuss the overhead of keeping these three lists
synchronized.
But don't take my word for it: measure for yourself. I'm not even going
to attempt to code your suggestion, but here's how you measure the time
it takes for dictionary lookup.
# Create a dataset to search on.
D = {}
chars = "abcdefghijklmnopqrstuvwxyz"
triplets = (a+b+c for a in chars for b in chars for c in chars)
for i, s in enumerate(triplets):
    D[s] = i  # D is of the form {'abc': 12345}
# Now time a successful search for an arbitrary triplet.
import timeit
min(timeit.Timer("D['foo']", "from __main__ import D").repeat(6))
On my PC, it takes about 0.2 seconds per million searches.
--
Steven.
> Another possibility is using an indexed column in a table in a DB.
> Lookups there should be much faster than using a dictionary as well.
I would agree with others who state that for a simple key based lookup
nothing beats a dictionary.
But knowing that you are one of the authors of the excellent pytables
module, I think you are approaching this problem from the perspective
of reading in a large number of values on each access. In those cases
specialized libraries (like HDF) can directly load huge amounts of
contiguous data into memory at speeds that substantially outperform
all other approaches.
i.
Well, the bisect module in Python is implemented in fast C. Apart from
that, you are basically right, see below.
> And let's not even discuss the overhead of keeping these three lists
> synchronized.
>
> But don't take my word for it: measure for yourself. I'm not even
> going to attempt to code your suggestion, but here's how you measure
> the time it takes for dictionary lookup.
>
>
> # Create a dataset to search on.
> D = {}
> chars = "abcdefghijklmnopqrstuvwxyz"
> triplets = (a+b+c for a in chars for b in chars for c in chars)
> for i, s in enumerate(triplets):
>     D[s] = i  # D is of the form {'abc': 12345}
>
> # Now time a successful search for an arbitrary triplet.
> import timeit
> min(timeit.Timer("D['foo']", "from __main__ import D").repeat(6))
>
>
> On my PC, it takes about 0.2 seconds per million searches.
Oh yes. All of you are right, guys. I've implemented code (attached)
for measuring the lookup times using a dictionary and using a binary
search approach (using numpy, mostly for space efficiency) and here are
my results:
$ python2.5 gq-dict.py
Creating the dictionary...
Time for dict creation: 89.115
Items in dict: 8191180
Timing queries...
Time for querying: 39.44
$ python2.5 gq-binary-search.py
Creating the lists...
Time for lists creation: 4.245
Sorting...
Time for sorting: 6.195
Timing queries...
Time for querying: 108.1
i.e. a dict approach proves to be almost 3x faster than a regular binary
search (5 us/lookup vs 13 us/lookup).
I think I got mixed up by some benchmarks that I did on this subject
some time ago; either my memory is bad or I made some mistake in
those experiments. My apologies.
Cheers,
Yes, could be. However, I was well aware that the problem was looking
up one single value, so, mea culpa :(. Sorry for introducing noise.
Um. Is this the take away from this thread? Longs as dictionary
keys are bad? Only for older versions of Python?
This could be a problem for people like me who build
lots of structures using seek values, which are longs, as done in
http://nucular.sourceforge.net and http://bplusdotnet.sourceforge.net
and elsewhere. Someone please summarize.
-- Aaron Watters
===
http://www.xfeedme.com/nucular/pydistro.py/go?FREETEXT=white%20trash
> On Nov 12, 12:46 pm, "Michael Bacarella" <m...@gpshopper.com> wrote:
>>
>> > It takes about 20 seconds for me. It's possible it's related to
>> > int/long
>> > unification - try using Python 2.5. If you can't switch to 2.5, try
>> > using string keys instead of longs.
>>
>> Yes, this was it. It ran *very* fast on Python v2.5.
>
> Um. Is this the take away from this thread? Longs as dictionary
> keys are bad? Only for older versions of Python?
It sounds like Python 2.4 (and previous versions) had a bug when
populating large dicts on 64-bit architectures.
> Someone please summarize.
Yes, that would be good.
No, I found very similar behaviour with Python 2.5.
>> Someone please summarize.
>
> Yes, that would be good.
On systems with multiple CPUs or 64-bit systems, or both, creating and/or
deleting a multi-megabyte dictionary in recent versions of Python (2.3,
2.4, 2.5 at least) takes a LONG time, of the order of 30+ minutes,
compared to seconds if the system only has a single CPU. Turning garbage
collection off doesn't help.
--
Steven.
criminy... Any root cause? patch?
btw, I think I've seen this, but I think you need
to get into 10s of megs or more before it becomes
critical.
Note: I know someone will say "don't scare off the newbies"
but in my experience most Python programmers are highly
experienced professionals who need to know this sort of thing.
The bulk of the newbies are either off in VB land
or struggling with java.
-- Aaron Watters
===
http://www.xfeedme.com/nucular/pydistro.py/go?FREETEXT=silly+walk
> On systems with multiple CPUs or 64-bit systems, or both, creating and/or
> deleting a multi-megabyte dictionary in recent versions of Python (2.3,
> 2.4, 2.5 at least) takes a LONG time, of the order of 30+ minutes,
> compared to seconds if the system only has a single CPU. Turning garbage
> collection off doesn't help.
Fwiw, testing on a 2-CPU 64-bit machine with 1 GB real memory, I
consistently run out of real memory before I see this effect, so I
guess it kicks in for dicts that consume beyond that. That's better
than I feared at any rate...
-- Aaron Watters
===
http://www.xfeedme.com/nucular/pydistro.py/go?FREETEXT=especially+nasty+windows
I can't duplicate this in a dual CPU (64 bit, but running in 32 bit
mode with a 32 bit OS) system. I added keys to a dict until I ran out
of memory (a bit over 22 million keys) and deleting the dict took
about 8 seconds (with a stopwatch, so not very precise, but obviously
less than 30 minutes).
>>> d = {}
>>> idx = 0
>>> while idx < 1e10:
...     d[idx] = idx
...     idx += 1
...
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
MemoryError
>>> len(d)
22369622
>>> del d
> On systems with multiple CPUs or 64-bit systems, or both, creating and/or
> deleting a multi-megabyte dictionary in recent versions of Python (2.3,
> 2.4, 2.5 at least) takes a LONG time, of the order of 30+ minutes,
> compared to seconds if the system only has a single CPU.
Please don't propagate this nonsense. If you see this then the problem
exists between the chair and monitor.
There is nothing wrong with either creating or deleting
dictionaries.
i.
I suspect what happened is this: on 64 bit
machines the data structures for creating dictionaries
are larger (because pointers take twice as much space),
so you run into memory contention issues sooner than
on 32 bit machines, for similar memory sizes.
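On a modern Python you can check both halves of this guess directly; a quick sketch (`struct.calcsize("P")` gives the C pointer size for the running build, and `sys.getsizeof` reports only the dict's own table, not the keys and values):

```python
import struct
import sys

# 8 on a 64-bit Python build, 4 on a 32-bit one: the doubled pointer
# size is why the same dict occupies roughly twice the memory.
print("pointer size:", struct.calcsize("P"), "bytes")

# Rough look at how large the dict's own table gets.
d = {}
for i in range(100000):
    d[i] = None
print("table size: %d bytes" % sys.getsizeof(d))
```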
If there is something deeper going
on please correct me, I would very much like to know.
-- Aaron Watters
===
http://www.xfeedme.com/nucular/pydistro.py/go?FREETEXT=alien+friend
Since some people missed the EUREKA!, here's the executive summary:
Python2.3: about 45 minutes
Python2.4: about 45 minutes
Python2.5: about _30 seconds_
The cut/paste of the EUREKA MOMENT from earlier in this thread:
> > You can download the list of keys from here, it's 43M gzipped:
> > http://www.sendspace.com/file/9530i7
> >
> > and see it take about 45 minutes with this:
> >
> > $ cat cache-keys.py
> > #!/usr/bin/python
> > v = {}
> > for line in open('keys.txt'):
> >     v[long(line.strip())] = True
> >
> >
> It takes about 20 seconds for me. It's possible it's related to
> int/long unification - try using Python 2.5. If you can't switch to
> 2.5, try using string keys instead of longs.
Yes, this was it. It ran *very* fast on Python v2.5.
Terribly on v2.4, v2.3.
> I can't duplicate this in a dual CPU (64 bit, but running in 32 bit mode
> with a 32 bit OS) system.
Can you try it running in 64-bit mode?
What Python version are you running?
--
Steven.
>>> Someone please summarize.
>>
>> Yes, that would be good.
>
> On systems with multiple CPUs or 64-bit systems, or both, creating and/or
> deleting a multi-megabyte dictionary in recent versions of Python (2.3,
> 2.4, 2.5 at least) takes a LONG time, of the order of 30+ minutes,
> compared to seconds if the system only has a single CPU.
Can you post minimal code that exhibits this behavior on Python 2.5.1?
The OP posted a lot of different versions, most of which worked just
fine for most people.
Please read the whole thread before making unsupported statements like
that. You should consider that this behaviour has been observed by
multiple people, before making insulting statements.
Both myself and the original poster have given code that demonstrates
this problem. We've given concrete evidence of a problem which is
replicable across different versions of Python and different versions of
the Linux operating system.
Unless you're accusing both myself and the original poster of outright
lying, of faking our results, what's your explanation? Have you tried
running our code on a 64-bit or multi-CPU system to see for yourself, or
are you just being closed-minded and arrogant?
--
Steven.
C:\>python
Python 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit (Intel)] on
win32
Type "help", "copyright", "credits" or "license" for more information.
>>>
I can't switch to 64 bit right now. This is a dual Core 2 Duo machine
(2 CPUs, 2 cores each).
I've looked at the deallocation code for dict objects and I can't see
anything that should be affected by 64-bitness, at least not to nearly
the degree described - it DECREFs all the keys and values in the dict,
calls PyMem_Free on its memory pool (if it was large enough to be
malloced, which this certainly was), and either adds itself back to
the dict object pool or deallocs itself (if it's a subclass of dict).
> Since some people missed the EUREKA!, here's the executive summary:
>
> Python2.3: about 45 minutes
> Python2.4: about 45 minutes
> Python2.5: about _30 seconds_
I'm really happy that upgrading to 2.5 solved the issue for you, but I
get similar behaviour and I'm running 2.5, so it isn't as simple as that.
--
Steven.
Maybe some more details about my environment will help.
Our fast Python is 2.5, built from source. Our slow Python is 2.3,
default installed with the OS, from rpm.
# python
Python 2.5.1 (r251:54863, May 11 2007, 14:17:21)
[GCC 3.4.4 20050721 (Red Hat 3.4.4-2)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
# which python
/usr/local/bin/python
# ls -la /usr/local/bin/python
-rwxr-xr-x 2 root root 5312572 Nov 12 12:54 /usr/local/bin/python
# md5sum /usr/local/bin/python
5c24e54a6cf5a556e9371325d18bc1fb /usr/local/bin/python
# ls -la /usr/bin/python*
lrwxrwxrwx 1 root root 9 Nov 12 12:57 /usr/bin/python -> python2
lrwxrwxrwx 1 root root 6 Nov 12 12:57 /usr/bin/python2 -> python2.5
-rwxr-xr-x 1 root root 8344 May 2 2007 /usr/bin/python2.3
lrwxrwxrwx 1 root root 21 Nov 12 12:57 /usr/bin/python2.5 ->
/usr/local/bin/python
# ls -la /usr/local/src/ | grep Python
drwxr-xr-x 19 1000 1000 4096 Nov 12 12:54 Python-2.5.1
-rw-r--r-- 1 root root 9383651 Nov 12 12:51 Python-2.5.1.tar.bz2
# gcc -v
Reading specs from /usr/lib/gcc/x86_64-redhat-linux/3.4.4/specs
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man
--infodir=/usr/share/info --enable-shared --enable-threads=posix
--disable-checking --with-system-zlib --enable-__cxa_atexit
--disable-libunwind-exceptions --enable-java-awt=gtk
--host=x86_64-redhat-linux
Thread model: posix
gcc version 3.4.4 20050721 (Red Hat 3.4.4-2)
# cat /etc/redhat-release
CentOS release 4.2 (Final)
# uname -a
Linux xxx 2.6.9-22.ELsmp #1 SMP Sat Oct 8 21:32:36 BST 2005 x86_64 x86_64
x86_64 GNU/Linux
# free
total used free shared buffers cached
Mem: 7390244 6961440 428804 0 15520 2549732
-/+ buffers/cache: 4396188 2994056
Swap: 2096472 10280 2086192
***** information about Python 2.3 *****
# rpm -qi python
Name : python Relocations: (not relocatable)
Version : 2.3.4 Vendor: CentOS
Release : 14.4 Build Date: Wed 02 May 2007
07:20:29 PM EDT
Install Date: Mon 04 Jun 2007 05:48:29 PM EDT Build Host: builder6
Group : Development/Languages Source RPM:
python-2.3.4-14.4.src.rpm
Size : 21137194 License: PSF - see LICENSE
Signature : DSA/SHA1, Sat 05 May 2007 09:33:49 AM EDT, Key ID
a53d0bab443e1821
URL : http://www.python.org/
Summary : An interpreted, interactive, object-oriented programming
language.
Description :
Python is an interpreted, interactive, object-oriented programming
language often compared to Tcl, Perl, Scheme or Java. Python includes
modules, classes, exceptions, very high level dynamic data types and
dynamic typing. Python supports interfaces to many system calls and
libraries, as well as to various windowing systems (X11, Motif, Tk,
Mac and MFC).
Programmers can write new built-in modules for Python in C or C++.
Python can be used as an extension language for applications that need
a programmable interface. This package contains most of the standard
Python modules, as well as modules for interfacing to the Tix widget
set for Tk and RPM.
Note that documentation for Python is provided in the python-docs
package.
Who were testing it on single-CPU, 32 bit systems.
The plot thickens... I wrote another version of my test code, reading the
data into a list of tuples rather than a dict:
$ python slurp_dict4.py # actually slurp a list, despite the name
Starting at Fri Nov 16 08:55:26 2007
Line 0
Line 1000000
Line 2000000
Line 3000000
Line 4000000
Line 5000000
Line 6000000
Line 7000000
Line 8000000
Items in list: 8191180
Completed import at Fri Nov 16 08:56:26 2007
Starting to delete list...
Completed deletion at Fri Nov 16 08:57:04 2007
Finishing at Fri Nov 16 08:57:04 2007
Quite a reasonable speed, considering my limited memory.
What do we know so far?
(1) The problem occurs whether or not gc is enabled.
(2) It only occurs on some architectures. 64 bit CPU seems to be common
factor.
(3) I don't think we've seen it demonstrated under Windows, but we've
seen it under at least two different Linux distros.
(4) It affects very large dicts, but not very large lists.
(5) I've done tests where instead of one really big dict, the data is put
into lots of smaller dicts. The problem still occurs.
(6) It was suggested the problem is related to long/int unification, but
I've done tests that kept the dict keys as strings, and the problem still
occurs.
(7) It occurs in Python 2.3, 2.4 and 2.5, but not 2.5.1.
Do we treat this as a solved problem and move on?
--
Steven.
I'm not comfortable with this. Better to identify the cause for
certain. I'll look at it sometime if I get a chance. I have some
64-bit multi-cpu machines I can try it on.
If the problem was fixed accidentally as an undocumented by-product of a
patch aimed at something else, it might get similarly unfixed without
triggering a test failure. So if someone pins down the problem source well
enough to warn future maintainers in a comment or commit note, all the
better. Otherwise, wait until a reversion happens, if ever.
> Unless you're accusing both myself and the original poster of outright
> lying, of faking our results, what's your explanation?
I don't attribute it to malice, I think you're simply measuring
something else. You both must have some system setting that
forces your application to swap to disk.
Do you really believe that you cannot create or delete a large
dictionary with python versions less than 2.5 (on a 64 bit or multi-
cpu system)? That a bug of this magnitude has not been noticed until
someone posted on clp?
> Have you tried
> running our code on a 64-bit or multi-CPU system to see for yourself,
the answer is: yes and yes, I see nothing out of the ordinary.
i.
>> Can you post minimal code that exhibits this behavior on Python 2.5.1?
>> The OP posted a lot of different versions, most of which worked just
>> fine for most people.
>
> Who were testing it on single-CPU, 32 bit systems.
Still, I'd like to see a test case that fails (works slowly) for you,
so that I (and others) can try it on different machines. As I said,
the OP posted several versions of his code, and tended to depend on a
specific dataset. A minimal test case would help more people debug
it.
You're right, it is completely inappropriate for us to be showing our
dirty laundry to the public. From now we will try to deal with our troubles
privately.
I am so sorry to have ruined the decorum.
> You're right, it is completely inappropriate for us to be showing our
> dirty laundry to the public.
you are misinterpreting my words on many levels,
(and I of course could have refrained from the "chair-monitor" jab as
well)
anyhow, it is what it is, I could not reproduce any of the weird
behaviors myself, I got nothing more to add to this discussion
> Can you try it running in 64-bit mode?
Here are my results using the following test.py:
$ cat test.py
#!/usr/bin/python
import time
print "Starting: %s" % time.ctime()
v = {}
for line in open('keys.txt'):
    v[long(line.strip())] = True
print "Finished: %s" % time.ctime()
32-bit architecture:
-----------------------------------------
[machine1]$ python2.3 test.py
Starting: Fri Nov 16 11:51:22 2007
Finished: Fri Nov 16 11:52:39 2007
[machine2]$ python2.5 test.py
Starting: Fri Nov 16 11:57:57 2007
Finished: Fri Nov 16 11:58:39 2007
64-bit architecture (64-bit mode):
-----------------------------------------
[machine3]$ python2.3 test.py
Starting: Fri Nov 16 11:51:44 2007
Finished: Fri Nov 16 12:31:54 2007
[machine3]$ python2.5 test.py
Starting: Fri Nov 16 11:50:03 2007
Finished: Fri Nov 16 11:50:31 2007
Jeffrey
http://groups.google.com.au/group/comp.lang.python/msg/b33ceaf01db10a86
#!/usr/bin/python
"""Read a big file into a dict."""
import gc
import time
print "Starting at %s" % time.asctime()
flag = gc.isenabled()
gc.disable()
id2name = {}
for n, line in enumerate(open('id2name.txt', 'r')):
    if n % 1000000 == 0:
        # Give feedback.
        print "Line %d" % n
    id, name = line.strip().split(':', 1)
    id = long(id)
    id2name[id] = name
print "Items in dict:", len(id2name)
print "Completed import at %s" % time.asctime()
print "Starting to delete dict..."
del id2name
print "Completed deletion at %s" % time.asctime()
if flag:
    gc.enable()
print "Finishing at %s" % time.asctime()
I've since tried variants where the dict keys were kept as strings, and
it made no difference to the speed, and where the data was kept as a list
of (key, value) tuples, which made a HUGE difference.
You should also read this post here:
http://groups.google.com.au/group/comp.lang.python/msg/2535dc213bc45f84
showing Perl running very fast on the same machine that Python was
running like a one-legged sloth.
--
Steven.
> I am so sorry to have ruined the decorum.
Oh dear!
I confidently predict that this thread will now
degenerate to include such things as dignitas
and gravitas.
- Hendrik
Just for the record. I was unable to stop thinking about this, and
after some investigation, I guess my recollections came from some
benchmarks with code in Pyrex (i.e. pretty near to C speed).
In the attached benchmark, I compare the speed of dictionary lookups in
a big dictionary (more than 8 millions of entries) against doing the
same, but using a binary search approach. Here are the results:
$ python2.5 gq-binary-search.py
Creating the dictionary...
Time for dict creation: 13.001
Calibrating loop...
Calibrating time: 3.095
Timing queries with a dict...
Time for dict query: 7.747
Timing queries with binary search...
Time for binary queries: 3.551
so, a dictionary lookup takes around 950 ns, while a binary search
lookup takes around 380 ns, i.e. binary searches are more than 2x
faster than dictionary lookups, at least in this benchmark.
It might well be that the difference is due to the fact that the binary
lookup in this case is made for only 64-bit integers, while the
dictionary code has to check the type of index data first. OTOH, it is
also apparent that the binary code can be optimized for cache misses by
using two levels of indirection for binary lookups, but this is too
much work for today ;). At any rate, it seems rather clear that binary
lookups can be competitive against hashes, when using Pyrex at least!
Cheers,
--
>0,0< Francesc Altet http://www.carabos.com/
V V Cárabos Coop. V. Enjoy Data
"-"
FYI, I tried on two 64 bit SMP machines (4 way and 2 way), running
Mandriva 2007 and 2008
2.6.17-6mdv #1 SMP Wed Oct 25 12:17:57 MDT 2006 x86_64 AMD Opteron(tm)
Processor 280 GNU/Linux
Python 2.4.3 (#2, May 7 2007, 15:15:17)
[GCC 4.1.1 20060724 (prerelease) (4.1.1-3mdk)] on linux2
2.6.23.1-1mdvsmp #1 SMP Sat Oct 20 18:04:52 EDT 2007 x86_64 Intel(R)
Core(TM)2 CPU 6400 @ 2.13GHz GNU/Linux
Python 2.5.1 (r251:54863, Sep 13 2007, 09:02:56)
[GCC 4.2.1 20070828 (prerelease) (4.2.1-6mdv2008.0)] on linux2
I could not reproduce the problem with Steven's test file id2name.txt
on either.
sendspace seems to be over capacity, so I haven't been able download
your
keys.txt.
I wonder if this might not be a glibc issue?
Is your self-built Python 2.5 linked against the same glibc version as
the system Python?
What does
ldd /usr/bin/python
say vs.
ldd /usr/local/bin/python
?
Harri
> Just for the record. I was unable to stop thinking about this, and
> after some investigation, I guess that my rememberings were gathered
> from some benchmarks with code in Pyrex (i.e. pretty near to C speed).
Pretty interesting and quite unexpected.
i.
Yeah, and also bogus :(
It turned out that in my previous benchmark I used plain numpy int
arrays for the measurements, and when you extract an element out of a
numpy array what you obtain is a *numpy scalar*, which is a different
beast from a Python (long) integer. Unfortunately, Pyrex swallows it,
happily converts it to a C long long, and produces a wrong result.
This looks like a Pyrex fault, and I should report it to the Pyrex
list.
At any rate, with the corrected benchmarks using plain python lists
instead (attached), the numbers are:
Calibrating loop...
Calibrating time: 1.458
Creating the dictionary...
Time for dict creation: 15.305
Timing queries with a dict...
Time for dict query: 11.632
Creating BinarySearch object...
Time for BinarySearch creation: 8.648
Timing queries with a BinarySearch...
Time for BinarySearch queries: 19.393
That is, dictionaries do lookups 1.7x faster than a binary lookup (1.42
us/lookup vs 2.37 us/lookup), which is, I suppose, what is expected.
Well, this seems the end of my 'peculiar' theory, but at least served
to uncover what it seems a bug in pyrex :P
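Since the benchmark attachment isn't reproduced here, a plausible pure-Python equivalent of the comparison is sketched below (the real attachment used Pyrex and numpy; this `BinarySearch` class, the synthetic data, and the loop counts are all my own stand-ins):

```python
import bisect
import timeit

class BinarySearch:
    """Pure-Python stand-in for the benchmark's binary-search lookup."""

    def __init__(self, pairs):
        pairs = sorted(pairs)           # sort once by key, up front
        self.keys = [k for k, _ in pairs]
        self.values = [v for _, v in pairs]

    def lookup(self, key):
        pos = bisect.bisect_left(self.keys, key)
        if pos == len(self.keys) or self.keys[pos] != key:
            raise KeyError(key)
        return self.values[pos]

# Synthetic data: 100,000 key/value pairs.
pairs = [(i * 7, i) for i in range(100000)]
bs = BinarySearch(pairs)
d = dict(pairs)

assert bs.lookup(700) == d[700]
print("dict:  ", timeit.timeit("d[69993]", globals=globals(), number=100000))
print("bisect:", timeit.timeit("bs.lookup(69993)", globals=globals(), number=100000))
```

On any given machine the absolute times will differ from the posted numbers, but the dict lookup should come out comfortably ahead, matching the corrected conclusion above.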