Account Options

  1. Sign in
The old Google Groups will be going away soon, but your browser is incompatible with the new version.
Google Groups Home
« Groups Home
Populating a dictionary, fast
There are currently too many topics in this group that display first. To make this topic appear first, remove this option from another topic.
There was an error processing your request. Please try again.
flag
  Messages 1 - 25 of 73 - Collapse all  -  Translate all to Translated (View all originals)   Newer >
The group you are posting to is a Usenet group. Messages posted to this group will make your email address visible to anyone on the Internet.
Your reply message has not been sent.
Your post was successful
 
From:
To:
Cc:
Followup To:
Add Cc | Add Followup-to | Edit Subject
Subject:
Validation:
For verification purposes please type the characters you see in the picture below or the numbers you hear by clicking the accessibility icon. Listen and type the numbers you hear
 
Michael Bacarella  
View profile  
 More options Nov 10 2007, 4:56 pm
Newsgroups: comp.lang.python
From: Michael Bacarella <m...@gpshopper.com>
Date: Sat, 10 Nov 2007 13:56:35 -0800 (PST)
Local: Sat, Nov 10 2007 4:56 pm
Subject: Populating a dictionary, fast

The id2name.txt file is an index of primary keys to strings.  They look like this:

11293102971459182412:Descriptive unique name for this record\n
950918240981208142:Another name for another record\n

The file's properties are:

# wc -l id2name.txt

8191180 id2name.txt
# du -h id2name.txt
517M    id2name.txt

I'm loading the file into memory with code like this:

id2name = {}
for line in iter(open('id2name.txt').readline,''):
    id,name = line.strip().split(':')
    id = long(id)
    id2name[id] = name

This takes about 45 *minutes*

If I comment out the last line in the loop body it takes only about 30 _seconds_ to run.
This would seem to implicate the line id2name[id] = name as being excruciatingly slow.

Is there a fast, functionally equivalent way of doing this?

(Yes, I really do need this cached.  No, an RDBMS or disk-based hash is not fast enough.)


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Ben Finney  
View profile  
 More options Nov 10 2007, 5:28 pm
Newsgroups: comp.lang.python
From: Ben Finney <bignose+hates-s...@benfinney.id.au>
Date: Sun, 11 Nov 2007 09:28:15 +1100
Local: Sat, Nov 10 2007 5:28 pm
Subject: Re: Populating a dictionary, fast

Michael Bacarella <m...@gpshopper.com> writes:
> id2name = {}
> for line in iter(open('id2name.txt').readline,''):
>     id,name = line.strip().split(':')
>     id = long(id)
>     id2name[id] = name

> This takes about 45 *minutes*

> If I comment out the last line in the loop body it takes only about
> 30 _seconds_ to run.  This would seem to implicate the line
> id2name[id] = name as being excruciatingly slow.

Or, rather, that the slowdown is caused by allocating these items in a
dictionary at all.

Dictionaries are implemented very efficiently in Python, but there
will still be overhead in inserting millions of distinct items. Of
course, if you just throw each item away instead of allocating space
for it, the loop will run very quickly.

> Is there a fast, functionally equivalent way of doing this?

You could, instead of individual assignments in a 'for' loop, try
letting the 'dict' type operate on a generator::

    input_file = open("id2name.txt")
    id2name = dict(
        (long(id), name) for (id, name) in
            line.strip().split(":") for line in input_file
    )

All that code inside the 'dict()' call is a "generator expression"; if
you don't know what they are yet, have a read of Python's
documentation on them. It creates a generator which will spit out
key+value tuples to be fed directly to the dict constructor as it
requests them.

That allows the generator to parse each item from the file exactly as
the 'dict' constructor needs it, possibly saving some extra "allocate,
assign, discard" steps. Not having your data set, I can't say if it'll
be significantly faster.

--
 \      "Compulsory unification of opinion achieves only the unanimity |
  `\     of the graveyard."  -- Justice Roberts in 319 U.S. 624 (1943) |
_o__)                                                                  |
Ben Finney


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Steven D'Aprano  
View profile  
 More options Nov 10 2007, 5:46 pm
Newsgroups: comp.lang.python
From: Steven D'Aprano <st...@REMOVE-THIS-cybersource.com.au>
Date: Sat, 10 Nov 2007 22:46:47 -0000
Local: Sat, Nov 10 2007 5:46 pm
Subject: Re: Populating a dictionary, fast

That's an awfully complicated way to iterate over a file. Try this
instead:

id2name = {}
for line in open('id2name.txt'):
    id,name = line.strip().split(':')
    id = long(id)
    id2name[id] = name

On my system, it takes about a minute and a half to produce a dictionary
with 8191180 entries.

> This takes about 45 *minutes*

> If I comment out the last line in the loop body it takes only about 30
> _seconds_ to run. This would seem to implicate the line id2name[id] =
> name as being excruciatingly slow.

No, dictionary access is one of the most highly-optimized, fastest, most
efficient parts of Python. What it indicates to me is that your system is
running low on memory, and is struggling to find room for 517MB worth of
data.

> Is there a fast, functionally equivalent way of doing this?

> (Yes, I really do need this cached.  No, an RDBMS or disk-based hash is
> not fast enough.)

You'll pardon me if I'm skeptical. Considering the convoluted, weird way
you had to iterate over a file, I wonder what other less-than-efficient
parts of your code you are struggling under. Nine times out of ten, if a
program runs too slowly, it's because you're using the wrong algorithm.

--
Steven.


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Paul Rubin  
View profile  
 More options Nov 10 2007, 5:54 pm
Newsgroups: comp.lang.python
From: Paul Rubin <http://phr...@NOSPAM.invalid>
Date: 10 Nov 2007 14:54:49 -0800
Local: Sat, Nov 10 2007 5:54 pm
Subject: Re: Populating a dictionary, fast

Michael Bacarella <m...@gpshopper.com> writes:
> Is there a fast, functionally equivalent way of doing this?

> (Yes, I really do need this cached.  No, an RDBMS or disk-based hash
> is not fast enough.)

As Steven says maybe you need to add more ram to your system.  The
memory overhead of dictionary cells is considerable.  If worse comes
to worse you could concoct some more storage-efficient representation.

 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Michael Bacarella  
View profile  
 More options Nov 10 2007, 8:18 pm
Newsgroups: comp.lang.python
From: Michael Bacarella <m...@gpshopper.com>
Date: Sat, 10 Nov 2007 17:18:37 -0800 (PST)
Local: Sat, Nov 10 2007 8:18 pm
Subject: Re: Populating a dictionary, fast

> That's an awfully complicated way to iterate over a file. Try this
> instead:

> id2name = {}
> for line in open('id2name.txt'):
>    id,name = line.strip().split(':')
>    id = long(id)
>    id2name[id] = name

> > This takes about 45 *minutes*

> On my system, it takes about a minute and a half to produce a
 dictionary
> with 8191180 entries.

Doing something similar on my system is very fast as well.

$ cat dict-8191180.py

#!/usr/bin/python

v = {}

for i in xrange(8191180):

        v[i] = i

$ time ./dict-8191180.py

real    0m5.877s
user    0m4.953s
sys     0m0.924s

But...

> > If I comment out the last line in the loop body it takes only about
 30
> > _seconds_ to run. This would seem to implicate the line id2name[id] =
> > name as being excruciatingly slow.

> No, dictionary access is one of the most highly-optimized, fastest,
 most
> efficient parts of Python. What it indicates to me is that your system
 is
> running low on memory, and is struggling to find room for 517MB worth
 of
> data.

If only it were so easy.

$ free

             total       used       free     shared    buffers     cached

Mem:       7390244    2103448    5286796          0      38996    1982756

-/+ buffers/cache:      81696    7308548

Swap:      2096472      10280    2086192

Here's your Python implementation running as badly as mine did.

$ wc -l id2name.txt

8191180 id2name.txt

$ cat cache-id2name.py

#!/usr/bin/python

id2name = {}

for line in open('id2name.txt'):

        id,name = line.strip().split(':',1)

        id = long(id)

        id2name[id] = name

$ time ./cache-id2name.py

^C

I let it go 30 minutes before killing it since I had to leave.  Here it is in top before I did the deed.

PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND

18802 root      25   0 1301m 1.2g 1148 R 99.9 17.1  36:05.99 cache-id2name.p

36 minutes, 05.99 seconds.

To rule out the file reading/parsing logic as culprit, here's same thing, but with

dictionary insertion removed.

$ cat nocache-id2name.py

#!/usr/bin/python

id2name = {}

for line in open('id2name.txt'):

        id,name = line.strip().split(':',1)

        id = long(id)

$ time ./nocache-id2name.py

real    0m33.518s

user    0m33.070s

sys     0m0.415s

Here's a Perl implementation running very fast.

$ cat cache-id2name.pl

#!/usr/bin/perl

my %id2name = ();

my $line;

my $id;

my $name;

open(F,"<id2name.txt");

foreach $line (<F>) {

        chomp($line);

        ($id,$name) = split(/:/,$line,1);

        $id = int($id);

        $id2name{$id} = $name;

}

$ time ./cache-id2name.pl

real    0m46.363s

user    0m43.730s

sys     0m2.611s

So, you think the Python's dict implementation degrades towards O(N)

performance when it's fed millions of 64-bit pseudo-random longs?


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Steven D'Aprano  
View profile  
 More options Nov 10 2007, 11:32 pm
Newsgroups: comp.lang.python
From: Steven D'Aprano <st...@REMOVE-THIS-cybersource.com.au>
Date: Sun, 11 Nov 2007 04:32:44 -0000
Local: Sat, Nov 10 2007 11:32 pm
Subject: Re: Populating a dictionary, fast

On Sat, 10 Nov 2007 17:18:37 -0800, Michael Bacarella wrote:
> So, you think the Python's dict implementation degrades towards O(N)
> performance when it's fed millions of 64-bit pseudo-random longs?

No.

Here's my sample file:

$ wc -l id2name.txt
8191180 id2name.txt
$ ls -lh id2name.txt
-rw-rw-r-- 1 steve steve 371M 2007-11-11 14:00 id2name.txt

And the results of reading it into a dict (code follows below):

$ time ./slurp_dict.py
Starting at Sun Nov 11 14:26:51 2007
Line 0
Line 1000000
Line 2000000
Line 3000000
Line 4000000
Line 5000000
Line 6000000
Line 7000000
Line 8000000
Items in dict: 8191180
Completed import at Sun Nov 11 14:29:31 2007
Starting to delete dict...

Traceback (most recent call last):
  File "./slurp_dict.py", line 20, in <module>
    del id2name
KeyboardInterrupt

real    35m52.334s
user    1m17.663s
sys     0m16.758s

Notice that the dict is completely read into memory in just two and a
half minutes. The script then tries to delete the dict, and 32 minutes
later is still struggling. That's the point I got sick of waiting and
interrupted the script.

Conclusion: it's a memory issue, or maybe a garbage collection issue, not
a problem with dicts.

Here's my code for creating the key:value file in the first place:

#!/usr/bin/python
"""Make a big file of 64-bit integer keys plus random values."""

bits64 = 2**64
import random
template = '%d:This is a bunch of text...\n'
fp = open('id2name.txt', 'w')
for i in xrange(8191180):
    fp.write(template % random.randint(0, bits64))
fp.close()

###

And here's my code for slurping it in to a dict:

#!/usr/bin/python
"""Read a big file into a dict."""

import gc
import time
print "Starting at %s" % time.asctime()
flag = gc.isenabled()
gc.disable()
id2name = {}
for n, line in enumerate(open('id2name.txt', 'r')):
    if n % 1000000 == 0:
        # Give feedback.
        print "Line %d" % n
    id,name = line.strip().split(':', 1)
    id = long(id)
    id2name[id] = name
print "Items in dict:", len(id2name)
print "Completed import at %s" % time.asctime()
print "Starting to delete dict..."
del id2name
print "Completed deletion at %s" % time.asctime()
if flag:
    gc.enable()
print "Finishing at %s" % time.asctime()

###

So, what can you do? Given that I'm not willing to do any more unpaid
experimentation for you, here are my suggestions, in no particular order:

(1) Presumably you don't want to run your app with the garbage collector
turned off. You might still want to play around with the gc module to see
what you can learn.

(2) More memory will help avoid paging. If you can't get more memory, try
more virtual memory. It will still be slow, but at least the operating
system doesn't have to try moving blocks around as much.

(3) Are you sure you need all eight-million-plus items in the cache all
at once?

(4) There are lots of algorithms out there for dealing with data too big
to fit into main memory. Do some research.

(5) There is a data structure designed for dealing with tens of millions
of records at once. It is called "a database". If you can't find a better
algorithm, and refuse to use an existing RDBMS, I suspect you're going to
end up inventing a primitive, inefficient, buggy database which is no
faster than existing systems out there.

--
Steven.


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Yu-Xi Lim  
View profile  
 More options Nov 11 2007, 12:06 am
Newsgroups: comp.lang.python
From: Yu-Xi Lim <y...@ece.gatech.edu>
Date: Sun, 11 Nov 2007 00:06:38 -0500
Local: Sun, Nov 11 2007 12:06 am
Subject: Re: Populating a dictionary, fast

Steven D'Aprano wrote:
> (2) More memory will help avoid paging. If you can't get more memory, try
> more virtual memory. It will still be slow, but at least the operating
> system doesn't have to try moving blocks around as much.

Based on his previous post, it would seem he has 7GB of RAM (with about
5GB free) and 2GB of swap. I don't think RAM is the issue.

Maybe there's something wrong with his specific Python installation.
What version is it and was it compiled by him?


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Paul Rubin  
View profile  
 More options Nov 11 2007, 12:45 am
Newsgroups: comp.lang.python
From: Paul Rubin <http://phr...@NOSPAM.invalid>
Date: 10 Nov 2007 21:45:44 -0800
Local: Sun, Nov 11 2007 12:45 am
Subject: Re: Populating a dictionary, fast

Michael Bacarella <m...@gpshopper.com> writes:
> If only it were so easy.

I think I know what's going on, the dictionary updates are sending the
GC into quadratic behavior.  Try turning off the GC:

    import gc
    gc.disable()


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Michael Bacarella  
View profile  
 More options Nov 11 2007, 10:35 am
Newsgroups: comp.lang.python
From: Michael Bacarella <m...@gpshopper.com>
Date: Sun, 11 Nov 2007 07:35:07 -0800 (PST)
Local: Sun, Nov 11 2007 10:35 am
Subject: Re: Populating a dictionary, fast

> ----- Original Message ----
> From: Paul Rubin <http://phr...@NOSPAM.invalid>
> To: python-l...@python.org
> Sent: Sunday, November 11, 2007 12:45:44 AM
> Subject: Re: Populating a dictionary, fast

> Michael Bacarella <m...@gpshopper.com> writes:
> > If only it were so easy.

> I think I know what's going on, the dictionary updates are sending the
> GC into quadratic behavior.  Try turning off the GC:

>    import gc
>    gc.disable()

Tried that already.  No difference. :(

 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
DouhetSukd  
View profile  
 More options Nov 11 2007, 10:49 am
Newsgroups: comp.lang.python
From: DouhetSukd <DouhetS...@gmail.com>
Date: Sun, 11 Nov 2007 07:49:40 -0800
Local: Sun, Nov 11 2007 10:49 am
Subject: Re: Populating a dictionary, fast
On Nov 11, 7:35 am, Michael Bacarella <m...@gpshopper.com> wrote:

> Tried that already.  No difference. :(

Not sure if it would make a difference, and it would imply re-
organizing your preceding lines, but what about doing the dictionary
build in one go, rather than incrementally?  Using the dict function,
which takes a list of (key,value) tuples.  I use it frequently as a
space-saver and it works well enough.  It may speed things up, I
dunno.  I had to break out your formatting in its own function and
that can't help too much.

Something like:

def fmt(line):
    id,name = line.strip().split(':')
    id = long(id)
    return (id,name)

id2name = dict([fmt(line) for line in
iter(open('id2name.txt').readline,'')])

Cheers


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Alberto Berti  
View profile  
 More options Nov 11 2007, 10:50 am
Newsgroups: comp.lang.python
From: Alberto Berti <albe...@metapensiero.it>
Date: Sun, 11 Nov 2007 16:50:36 +0100
Local: Sun, Nov 11 2007 10:50 am
Subject: Re: Populating a dictionary, fast

>>>>> "Steven" == Steven D'Aprano <st...@REMOVE-THIS-cybersource.com.au> writes:

    Steven> $ time ./slurp_dict.py Starting at Sun Nov 11 14:26:51
    Steven> 2007 Line 0 Line 1000000 Line 2000000 Line 3000000 Line
    Steven> 4000000 Line 5000000 Line 6000000 Line 7000000 Line
    Steven> 8000000 Items in dict: 8191180 Completed import at Sun Nov
    Steven> 11 14:29:31 2007 Starting to delete dict...

    Steven> Traceback (most recent call last): File "./slurp_dict.py",
    Steven> line 20, in <module> del id2name KeyboardInterrupt

    Steven> real 35m52.334s user 1m17.663s sys 0m16.758s

    Steven> Notice that the dict is completely read into memory in
    Steven> just two and a half minutes. The script then tries to
    Steven> delete the dict, and 32 minutes later is still
    Steven> struggling. That's the point I got sick of waiting and
    Steven> interrupted the script.

    Steven> Conclusion: it's a memory issue, or maybe a garbage
    Steven> collection issue, not a problem with dicts.

uh, strange results...

I run your same scripts with and without garbage collection enabled
and those are the results:

with gc enabled:

azazel@lizard:~/wip/zodb_test$ python slurp_dict.py
Starting at Sun Nov 11 16:35:12 2007
Line 0
Line 1000000
Line 2000000
Line 3000000
Line 4000000
Line 5000000
Line 6000000
Line 7000000
Line 8000000
Items in dict: 8191180
Completed import at Sun Nov 11 16:36:03 2007
Starting to delete dict...
Completed deletion at Sun Nov 11 16:36:09 2007
Finishing at Sun Nov 11 16:36:09 2007

and without gc enabled

azazel@lizard:~/wip/zodb_test$ python slurp_dict.py
Starting at Sun Nov 11 16:39:02 2007
Line 0
Line 1000000
Line 2000000
Line 3000000
Line 4000000
Line 5000000
Line 6000000
Line 7000000
Line 8000000
Items in dict: 8191180
Completed import at Sun Nov 11 16:39:49 2007
Starting to delete dict...
Completed deletion at Sun Nov 11 16:39:56 2007
Finishing at Sun Nov 11 16:39:56 2007

all with python2.4 on and i386 Linux

cheers

Alberto


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
DouhetSukd  
View profile  
 More options Nov 11 2007, 10:58 am
Newsgroups: comp.lang.python
From: DouhetSukd <DouhetS...@gmail.com>
Date: Sun, 11 Nov 2007 07:58:16 -0800
Local: Sun, Nov 11 2007 10:58 am
Subject: Re: Populating a dictionary, fast
Ah, well, just noticed Ben's suggested this already.  Mind you, his
code, while correct in intent, does look a bit fishy (missing those
square brackets), so don't dismiss it just because you had trouble
running it (or mine).  Definitely worth a try and I'd be curious to
know if it makes a difference.

 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Michael Bacarella  
View profile  
 More options Nov 11 2007, 11:10 am
Newsgroups: comp.lang.python
From: Michael Bacarella <m...@gpshopper.com>
Date: Sun, 11 Nov 2007 08:10:45 -0800 (PST)
Local: Sun, Nov 11 2007 11:10 am
Subject: Re: Populating a dictionary, fast

> Steven D'Aprano wrote:
> > (2) More memory will help avoid paging. If you can't get more memory,
 try
> > more virtual memory. It will still be slow, but at least the
 operating
> > system doesn't have to try moving blocks around as much.

> Based on his previous post, it would seem he has 7GB of RAM (with about
> 5GB free) and 2GB of swap. I don't think RAM is the issue.

> Maybe there's something wrong with his specific Python installation.
> What version is it and was it compiled by him?

This version:

$ python
Python 2.3.4 (#1, May  2 2007, 19:18:17)
[GCC 3.4.6 20060404 (Red Hat 3.4.6-8)] on linux2
Type "help", "copyright", "credits" or "license" for more information.


$ rpm -qi python
Name        : python                       Relocations: (not relocatable)
Version     : 2.3.4                             Vendor: CentOS
Release     : 14.4                          Build Date: Wed 02 May 2007 07:20:29 PM EDT
Install Date: Mon 04 Jun 2007 05:48:29 PM EDT      Build Host: builder6
Group       : Development/Languages         Source RPM: python-2.3.4-14.4.src.rpm
Size        : 21137194                         License: PSF - see LICENSE
Signature   : DSA/SHA1, Sat 05 May 2007 09:33:49 AM EDT, Key ID a53d0bab443e1821
URL         : http://www.python.org/

$ uname -a
Linux xxx 2.6.9-22.ELsmp #1 SMP Sat Oct 8 21:32:36 BST 2005 x86_64 x86_64 x86_64 GNU/Linux

We've also tried it on this version (on a different machine):

$ python
Python 2.4.3 (#1, Mar 14 2007, 19:01:42)
[GCC 4.1.1 20070105 (Red Hat 4.1.1-52)] on linux2
Type "help", "copyright", "credits" or "license" for more information.

$ rpm -qi python
Name        : python                       Relocations: (not relocatable)
Version     : 2.4.3                             Vendor: CentOS
Release     : 19.el5                        Build Date: Wed 14 Mar 2007 11:06:42 PM UTC
Install Date: Mon 29 Oct 2007 07:03:16 PM UTC      Build Host: builder6
Group       : Development/Languages         Source RPM: python-2.4.3-19.el5.src.rpm
Size        : 22087600                         License: PSF - see LICENSE
Signature   : DSA/SHA1, Wed 04 Apr 2007 12:26:58 AM UTC, Key ID a8a447dce8562897
URL         : http://www.python.org/
Summary     : An interpreted, interactive, object-oriented programming language.

$ uname -a
Linux yyy 2.6.18-8.el5 #1 SMP Thu Mar 15 19:46:53 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Istvan Albert  
View profile  
 More options Nov 11 2007, 11:11 am
Newsgroups: comp.lang.python
From: Istvan Albert <istvan.alb...@gmail.com>
Date: Sun, 11 Nov 2007 16:11:09 -0000
Local: Sun, Nov 11 2007 11:11 am
Subject: Re: Populating a dictionary, fast
On Nov 10, 4:56 pm, Michael Bacarella <m...@gpshopper.com> wrote:

> This would seem to implicate the line id2name[id] = name as being excruciatingly slow.

As others have pointed out there is no way that this takes 45
minutes.Must be something with your system or setup.

A functionally equivalent code for me runs in about 49 seconds!
(it ends up using about twice as much ram as the data on disk)

i.


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Michael Bacarella  
View profile  
 More options Nov 11 2007, 11:25 am
Newsgroups: comp.lang.python
From: Michael Bacarella <m...@gpshopper.com>
Date: Sun, 11 Nov 2007 08:25:15 -0800 (PST)
Local: Sun, Nov 11 2007 11:25 am
Subject: Re: Populating a dictionary, fast
Firstly, thank you for all of your help so far, I really appreciate it.

> > So, you think the Python's dict implementation degrades towards
 O(N)
> > performance when it's fed millions of 64-bit pseudo-random longs?

> No.

Yes.

I tried your code (with one change, time on feedback lines) and got the
 same terrible
performance against my data set.

$ cat stevescode.py
#!/usr/bin/python
"""Read a big file into a dict."""

import gc
import time
print "Starting at %s" % time.asctime()
flag = gc.isenabled()
gc.disable()
id2name = {}
for n, line in enumerate(open('id2name.txt', 'r')):
    if n % 1000000 == 0:
        # Give feedback.
        print "Line %d" % n,time.asctime()
    id,name = line.strip().split(':', 1)
    id = long(id)
    id2name[id] = name
print "Items in dict:", len(id2name)
print "Completed import at %s" % time.asctime()
print "Starting to delete dict..."
del id2name
print "Completed deletion at %s" % time.asctime()
if flag:
    gc.enable()
print "Finishing at %s" % time.asctime()

$ ./stevescode.py
Starting at Sun Nov 11 10:46:37 2007
Line 0 Sun Nov 11 10:46:37 2007
Line 1000000 Sun Nov 11 10:48:22 2007
Line 2000000 Sun Nov 11 10:51:08 2007
Line 3000000 Sun Nov 11 10:56:16 2007
Line 4000000 Sun Nov 11 10:58:41 2007
Line 5000000 Sun Nov 11 11:03:26 2007
^C

To prove that my machine is sane, I ran the same against your generated
sample file and got _excellent_ performance.  Start to finish in under a minute.

$ cat steves-makedict.py
#!/usr/bin/python
"""Make a big file of 64-bit integer keys plus random values."""

bits64 = 2**64
import random
template = '%d:This is a bunch of text...\n'
fp = open('id2name.txt', 'w')
for i in xrange(8191180):
    fp.write(template % random.randint(0, bits64))
fp.close()

$ ./steves-makedict.py

$ ./stevescode.py
Starting at Sun Nov 11 11:15:31 2007
Line 0 Sun Nov 11 11:15:31 2007
Line 1000000 Sun Nov 11 11:15:37 2007
Line 2000000 Sun Nov 11 11:15:43 2007
Line 3000000 Sun Nov 11 11:15:49 2007
Line 4000000 Sun Nov 11 11:15:54 2007
Line 5000000 Sun Nov 11 11:16:00 2007
Line 6000000 Sun Nov 11 11:16:07 2007
Line 7000000 Sun Nov 11 11:16:12 2007
Line 8000000 Sun Nov 11 11:16:18 2007
Items in dict: 8191180
Completed import at Sun Nov 11 11:16:19 2007
Starting to delete dict...
Completed deletion at Sun Nov 11 11:16:23 2007
Finishing at Sun Nov 11 11:16:23 2007

> Notice that the dict is completely read into memory in just two and a
> half minutes. The script then tries to delete the dict, and 32
 minutes
> later is still struggling. That's the point I got sick of waiting and
> interrupted the script.

> Conclusion: it's a memory issue, or maybe a garbage collection issue, not
> a problem with dicts.

As you can see, not the case at all against my data set.

> (1) Presumably you don't want to run your app with the garbage collector
> turned off. You might still want to play around with the gc module to see
> what you can learn.

As you can see, your version did no better. :(

> (2) More memory will help avoid paging. If you can't get more memory, try
> more virtual memory. It will still be slow, but at least the
 operating
> system doesn't have to try moving blocks around as much.

The machine has 8GB, and is not doing anything else when I run this.

> (3) Are you sure you need all eight-million-plus items in the cache
 all
> at once?

Yes.

> (4) There are lots of algorithms out there for dealing with data too big
> to fit into main memory. Do some research.

It DOES fit into main memory and a dictionary is exactly the right way
 to do this.

> (5) There is a data structure designed for dealing with tens of millions
> of records at once. It is called "a database". If you can't find a better
> algorithm, and refuse to use an existing RDBMS, I suspect you're
 going to
> end up inventing a primitive, inefficient, buggy database which is no
> faster than existing systems out there.

I've tried three disk-based implementations already (BerkeleyDB, cdb, and an RDBMS)
Performance is terrible because they end up doing too many random disk seeks.
Pre-caching all of the data ahead of time has offered us the best performance so far,
but is still slower than it ought to be.

Creating a HEAP TABLE in the RDBMS is an idea, but moving all of this really easy
code into the RDBMS just to find a hashing algorithm that doesn't choke on my keys
sounds pretty lame.

 A cached in main memory hash is the right way to do this.

The Perl version runs *very* fast, after all.


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Istvan Albert  
View profile  
 More options Nov 11 2007, 11:34 am
Newsgroups: comp.lang.python
From: Istvan Albert <istvan.alb...@gmail.com>
Date: Sun, 11 Nov 2007 16:34:37 -0000
Local: Sun, Nov 11 2007 11:34 am
Subject: Re: Populating a dictionary, fast
On Nov 11, 11:25 am, Michael Bacarella <m...@gpshopper.com> wrote:

> I tried your code (with one change, time on feedback lines) and got the
>  same terrible
> performance against my data set.
> To prove that my machine is sane, I ran the same against your generated
> sample file and got _excellent_ performance.  Start to finish in under a minute.

One possibility could be that your dataset turns out to be some sort
of pathological worst case for the hashing algorithm in python.

i.


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Michael Bacarella  
View profile  
 More options Nov 11 2007, 11:51 am
Newsgroups: comp.lang.python
From: Michael Bacarella <m...@gpshopper.com>
Date: Sun, 11 Nov 2007 08:51:37 -0800 (PST)
Local: Sun, Nov 11 2007 11:51 am
Subject: Re: Populating a dictionary, fast
> > This would seem to implicate the line id2name[id] = name as being

 excruciatingly slow.

> As others have pointed out there is no way that this takes 45
> minutes.Must be something with your system or setup.

> A functionally equivalent code for me runs in about 49 seconds!
> (it ends up using about twice as much ram as the data on disk)

You can download the list of keys from here, it's 43M gzipped:
    http://www.sendspace.com/file/9530i7

and see it take about 45 minutes with this:

$ cat cache-keys.py
#!/usr/bin/python
v = {}
for line in open('keys.txt'):
        v[long(line.strip())] = True


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Michael Bacarella  
View profile  
 More options Nov 11 2007, 11:54 am
Newsgroups: comp.lang.python
From: Michael Bacarella <m...@gpshopper.com>
Date: Sun, 11 Nov 2007 08:54:52 -0800 (PST)
Local: Sun, Nov 11 2007 11:54 am
Subject: Re: Populating a dictionary, fast

> > I tried your code (with one change, time on feedback lines) and got
 the
> >  same terrible
> > performance against my data set.

> > To prove that my machine is sane, I ran the same against your
 generated
>>  sample file and got _excellent_ performance.  Start to finish in
 under a minute.

> One possibility could be that your dataset turns out to be some sort
> of pathological worst case for the hashing algorithm in python.

Cool!

Putting that on the resume. ;)


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Arkanes  
View profile  
 More options Nov 11 2007, 1:42 pm
Newsgroups: comp.lang.python
From: Arkanes <arka...@gmail.com>
Date: Sun, 11 Nov 2007 11:42:08 -0700
Local: Sun, Nov 11 2007 1:42 pm
Subject: Re: Populating a dictionary, fast
Michael Bacarella wrote:
> You can download the list of keys from here, it's 43M gzipped:
>     http://www.sendspace.com/file/9530i7

> and see it take about 45 minutes with this:

> $ cat cache-keys.py
> #!/usr/bin/python
> v = {}
> for line in open('keys.txt'):
>         v[long(line.strip())] = True

It takes about 20 seconds for me. It's possible it's related to int/long
unification - try using Python 2.5. If you can't switch to 2.5, try
using string keys instead of longs.

 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Alberto Berti  
View profile  
 More options Nov 11 2007, 12:58 pm
Newsgroups: comp.lang.python
From: Alberto Berti <albe...@metapensiero.it>
Date: Sun, 11 Nov 2007 18:58:55 +0100
Local: Sun, Nov 11 2007 12:58 pm
Subject: Re: Populating a dictionary, fast

>>>>> "Michael" == Michael Bacarella <m...@gpshopper.com> writes:

    >> > This would seem to implicate the line id2name[id] = name as
    >> being
    Michael>  excruciatingly slow.
    >>
    >> As others have pointed out there is no way that this takes 45
    >> minutes.Must be something with your system or setup.
    >>
    >> A functionally equivalent code for me runs in about 49 seconds!
    >> (it ends up using about twice as much ram as the data on disk)

    Michael> You can download the list of keys from here, it's 43M
    Michael> gzipped: http://www.sendspace.com/file/9530i7

    Michael> and see it take about 45 minutes with this:

I've downloaded your keys, run your program and this is the result:

$ du -h keys.txt
128M    keys.txt

$ time python cache_keys.py

real    0m55.913s
user    0m35.286s
sys     0m0.852s

$ python
Python 2.4.4 (#2, Apr 26 2007, 00:02:45)
[GCC 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)] on linux2

$ uname -a
Linux lizard 2.6.21-1-k7 #1 SMP Sat May 26 16:56:05 UTC 2007 i686 GNU/Linux

cheers

Alberto


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Marc 'BlackJack' Rintsch  
View profile  
 More options Nov 11 2007, 1:04 pm
Newsgroups: comp.lang.python
From: Marc 'BlackJack' Rintsch <bj_...@gmx.net>
Date: 11 Nov 2007 18:04:02 GMT
Local: Sun, Nov 11 2007 1:04 pm
Subject: Re: Populating a dictionary, fast

Takes about 40 seconds here.

bj@s8n:~$ time python test.py

real    0m38.758s
user    0m25.290s
sys     0m1.580s

Ciao,
        Marc 'BlackJack' Rintsch


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Ricardo Aráoz  
View profile  
 More options Nov 11 2007, 2:18 pm
Newsgroups: comp.lang.python
From: Ricardo Aráoz <ricar...@gmail.com>
Date: Sun, 11 Nov 2007 16:18:55 -0300
Local: Sun, Nov 11 2007 2:18 pm
Subject: Re: Populating a dictionary, fast

Have you tried taking the long away? I mean :
         v[line.strip()] = True

 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Ben Finney  
View profile  
 More options Nov 11 2007, 5:23 pm
Newsgroups: comp.lang.python
From: Ben Finney <bignose+hates-s...@benfinney.id.au>
Date: Mon, 12 Nov 2007 09:23:13 +1100
Local: Sun, Nov 11 2007 5:23 pm
Subject: Re: Populating a dictionary, fast

DouhetSukd <DouhetS...@gmail.com> writes:
> Ah, well, just noticed Ben's suggested this already.  Mind you, his
> code, while correct in intent, does look a bit fishy (missing those
> square brackets)

By "missing those square brackets", what would be a list comprehension
(allocating an entire list in memory at once before proceeding)
becomes a generator expression (generating results from the sequence
only as needed).

Enter 'python "generator expression"' into your favourite search
engine to find out more.

--
 \       "What I resent is that the range of your vision should be the |
  `\                              limit of my action."  -- Henry James |
_o__)                                                                  |
Ben Finney


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Istvan Albert  
View profile  
 More options Nov 11 2007, 10:40 pm
Newsgroups: comp.lang.python
From: Istvan Albert <istvan.alb...@gmail.com>
Date: Mon, 12 Nov 2007 03:40:42 -0000
Local: Sun, Nov 11 2007 10:40 pm
Subject: Re: Populating a dictionary, fast
On Nov 11, 11:51 am, Michael Bacarella <m...@gpshopper.com> wrote:

> and see it take about 45 minutes with this:

> $ cat cache-keys.py
> #!/usr/bin/python
> v = {}
> for line in open('keys.txt'):
>         v[long(line.strip())] = True

On my system (windows vista) your code (using your data) runs in:

36 seconds with python 2.4
25 seconds with python 2.5
39 seconds with python 3000

i.


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Hrvoje Niksic  
View profile  
 More options Nov 12 2007, 5:37 am
Newsgroups: comp.lang.python
From: Hrvoje Niksic <hnik...@xemacs.org>
Date: Mon, 12 Nov 2007 11:37:42 +0100
Local: Mon, Nov 12 2007 5:37 am
Subject: Re: Populating a dictionary, fast

Paul Rubin <http://phr...@NOSPAM.invalid> writes:
> Michael Bacarella <m...@gpshopper.com> writes:
>> If only it were so easy.

> I think I know what's going on, the dictionary updates are sending the
> GC into quadratic behavior.  Try turning off the GC:

>     import gc
>     gc.disable()

This is becoming an FAQ on this newsgroup, having come up first with
lists and now with dictionaries.  It can also be considered a serious
implementation problem, since code like Michael's is expected to
perform in (close to) linear time and in fact did so in previous
versions of Python, and still does in Perl and other popular scripting
languages.  Neither introductory nor intermediate Python learning
materials don't cover the need to disable GC when building large data
structures.

Maybe the default GC strategy should be rethought.


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Messages 1 - 25 of 73   Newer >
« Back to Discussions « Newer topic     Older topic »