The id2name.txt file is an index of primary keys to strings. They look like this:
11293102971459182412:Descriptive unique name for this record\n 950918240981208142:Another name for another record\n
The file's properties are:
# wc -l id2name.txt
8191180 id2name.txt # du -h id2name.txt 517M id2name.txt
I'm loading the file into memory with code like this:
id2name = {} for line in iter(open('id2name.txt').readline,''): id,name = line.strip().split(':') id = long(id) id2name[id] = name
This takes about 45 *minutes*
If I comment out the last line in the loop body it takes only about 30 _seconds_ to run. This would seem to implicate the line id2name[id] = name as being excruciatingly slow.
Is there a fast, functionally equivalent way of doing this?
(Yes, I really do need this cached. No, an RDBMS or disk-based hash is not fast enough.)
Michael Bacarella <m...@gpshopper.com> writes: > id2name = {} > for line in iter(open('id2name.txt').readline,''): > id,name = line.strip().split(':') > id = long(id) > id2name[id] = name
> This takes about 45 *minutes*
> If I comment out the last line in the loop body it takes only about > 30 _seconds_ to run. This would seem to implicate the line > id2name[id] = name as being excruciatingly slow.
Or, rather, that the slowdown is caused by allocating these items in a dictionary at all.
Dictionaries are implemented very efficiently in Python, but there will still be overhead in inserting millions of distinct items. Of course, if you just throw each item away instead of allocating space for it, the loop will run very quickly.
> Is there a fast, functionally equivalent way of doing this?
You could, instead of individual assignments in a 'for' loop, try letting the 'dict' type operate on a generator::
input_file = open("id2name.txt") id2name = dict( (long(id), name) for (id, name) in line.strip().split(":") for line in input_file )
All that code inside the 'dict()' call is a "generator expression"; if you don't know what they are yet, have a read of Python's documentation on them. It creates a generator which will spit out key+value tuples to be fed directly to the dict constructor as it requests them.
That allows the generator to parse each item from the file exactly as the 'dict' constructor needs it, possibly saving some extra "allocate, assign, discard" steps. Not having your data set, I can't say if it'll be significantly faster.
-- \ "Compulsory unification of opinion achieves only the unanimity | `\ of the graveyard." -- Justice Roberts in 319 U.S. 624 (1943) | _o__) | Ben Finney
> I'm loading the file into memory with code like this:
> id2name = {} > for line in iter(open('id2name.txt').readline,''): > id,name = line.strip().split(':') > id = long(id) > id2name[id] = name
That's an awfully complicated way to iterate over a file. Try this instead:
id2name = {} for line in open('id2name.txt'): id,name = line.strip().split(':') id = long(id) id2name[id] = name
On my system, it takes about a minute and a half to produce a dictionary with 8191180 entries.
> This takes about 45 *minutes*
> If I comment out the last line in the loop body it takes only about 30 > _seconds_ to run. This would seem to implicate the line id2name[id] = > name as being excruciatingly slow.
No, dictionary access is one of the most highly-optimized, fastest, most efficient parts of Python. What it indicates to me is that your system is running low on memory, and is struggling to find room for 517MB worth of data.
> Is there a fast, functionally equivalent way of doing this?
> (Yes, I really do need this cached. No, an RDBMS or disk-based hash is > not fast enough.)
You'll pardon me if I'm skeptical. Considering the convoluted, weird way you had to iterate over a file, I wonder what other less-than-efficient parts of your code you are struggling under. Nine times out of ten, if a program runs too slowly, it's because you're using the wrong algorithm.
Michael Bacarella <m...@gpshopper.com> writes: > Is there a fast, functionally equivalent way of doing this?
> (Yes, I really do need this cached. No, an RDBMS or disk-based hash > is not fast enough.)
As Steven says maybe you need to add more ram to your system. The memory overhead of dictionary cells is considerable. If worse comes to worse you could concoct some more storage-efficient representation.
> That's an awfully complicated way to iterate over a file. Try this > instead:
> id2name = {} > for line in open('id2name.txt'): > id,name = line.strip().split(':') > id = long(id) > id2name[id] = name
> > This takes about 45 *minutes*
> On my system, it takes about a minute and a half to produce a dictionary > with 8191180 entries.
Doing something similar on my system is very fast as well.
$ cat dict-8191180.py
#!/usr/bin/python
v = {}
for i in xrange(8191180):
v[i] = i
$ time ./dict-8191180.py
real 0m5.877s user 0m4.953s sys 0m0.924s
But...
> > If I comment out the last line in the loop body it takes only about 30 > > _seconds_ to run. This would seem to implicate the line id2name[id] = > > name as being excruciatingly slow.
> No, dictionary access is one of the most highly-optimized, fastest, most > efficient parts of Python. What it indicates to me is that your system is > running low on memory, and is struggling to find room for 517MB worth of > data.
If only it were so easy.
$ free
total used free shared buffers cached
Mem: 7390244 2103448 5286796 0 38996 1982756
-/+ buffers/cache: 81696 7308548
Swap: 2096472 10280 2086192
Here's your Python implementation running as badly as mine did.
$ wc -l id2name.txt
8191180 id2name.txt
$ cat cache-id2name.py
#!/usr/bin/python
id2name = {}
for line in open('id2name.txt'):
id,name = line.strip().split(':',1)
id = long(id)
id2name[id] = name
$ time ./cache-id2name.py
^C
I let it go 30 minutes before killing it since I had to leave. Here it is in top before I did the deed.
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
On Sat, 10 Nov 2007 17:18:37 -0800, Michael Bacarella wrote: > So, you think the Python's dict implementation degrades towards O(N) > performance when it's fed millions of 64-bit pseudo-random longs?
No.
Here's my sample file:
$ wc -l id2name.txt 8191180 id2name.txt $ ls -lh id2name.txt -rw-rw-r-- 1 steve steve 371M 2007-11-11 14:00 id2name.txt
And the results of reading it into a dict (code follows below):
$ time ./slurp_dict.py Starting at Sun Nov 11 14:26:51 2007 Line 0 Line 1000000 Line 2000000 Line 3000000 Line 4000000 Line 5000000 Line 6000000 Line 7000000 Line 8000000 Items in dict: 8191180 Completed import at Sun Nov 11 14:29:31 2007 Starting to delete dict...
Traceback (most recent call last): File "./slurp_dict.py", line 20, in <module> del id2name KeyboardInterrupt
real 35m52.334s user 1m17.663s sys 0m16.758s
Notice that the dict is completely read into memory in just two and a half minutes. The script then tries to delete the dict, and 32 minutes later is still struggling. That's the point I got sick of waiting and interrupted the script.
Conclusion: it's a memory issue, or maybe a garbage collection issue, not a problem with dicts.
Here's my code for creating the key:value file in the first place:
#!/usr/bin/python """Make a big file of 64-bit integer keys plus random values."""
bits64 = 2**64 import random template = '%d:This is a bunch of text...\n' fp = open('id2name.txt', 'w') for i in xrange(8191180): fp.write(template % random.randint(0, bits64)) fp.close()
###
And here's my code for slurping it in to a dict:
#!/usr/bin/python """Read a big file into a dict."""
import gc import time print "Starting at %s" % time.asctime() flag = gc.isenabled() gc.disable() id2name = {} for n, line in enumerate(open('id2name.txt', 'r')): if n % 1000000 == 0: # Give feedback. print "Line %d" % n id,name = line.strip().split(':', 1) id = long(id) id2name[id] = name print "Items in dict:", len(id2name) print "Completed import at %s" % time.asctime() print "Starting to delete dict..." del id2name print "Completed deletion at %s" % time.asctime() if flag: gc.enable() print "Finishing at %s" % time.asctime()
###
So, what can you do? Given that I'm not willing to do any more unpaid experimentation for you, here are my suggestions, in no particular order:
(1) Presumably you don't want to run your app with the garbage collector turned off. You might still want to play around with the gc module to see what you can learn.
(2) More memory will help avoid paging. If you can't get more memory, try more virtual memory. It will still be slow, but at least the operating system doesn't have to try moving blocks around as much.
(3) Are you sure you need all eight-million-plus items in the cache all at once?
(4) There are lots of algorithms out there for dealing with data too big to fit into main memory. Do some research.
(5) There is a data structure designed for dealing with tens of millions of records at once. It is called "a database". If you can't find a better algorithm, and refuse to use an existing RDBMS, I suspect you're going to end up inventing a primitive, inefficient, buggy database which is no faster than existing systems out there.
Steven D'Aprano wrote: > (2) More memory will help avoid paging. If you can't get more memory, try > more virtual memory. It will still be slow, but at least the operating > system doesn't have to try moving blocks around as much.
Based on his previous post, it would seem he has 7GB of RAM (with about 5GB free) and 2GB of swap. I don't think RAM is the issue.
Maybe there's something wrong with his specific Python installation. What version is it and was it compiled by him?
> ----- Original Message ---- > From: Paul Rubin <http://phr...@NOSPAM.invalid> > To: python-l...@python.org > Sent: Sunday, November 11, 2007 12:45:44 AM > Subject: Re: Populating a dictionary, fast
> Michael Bacarella <m...@gpshopper.com> writes: > > If only it were so easy.
> I think I know what's going on, the dictionary updates are sending the > GC into quadratic behavior. Try turning off the GC:
On Nov 11, 7:35 am, Michael Bacarella <m...@gpshopper.com> wrote:
> Tried that already. No difference. :(
Not sure if it would make a difference, and it would imply re- organizing your preceding lines, but what about doing the dictionary build in one go, rather than incrementally? Using the dict function, which takes a list of (key,value) tuples. I use it frequently as a space-saver and it works well enough. It may speed things up, I dunno. I had to break out your formatting in its own function and that can't help too much.
Something like:
def fmt(line): id,name = line.strip().split(':') id = long(id) return (id,name)
id2name = dict([fmt(line) for line in iter(open('id2name.txt').readline,'')])
>>>>> "Steven" == Steven D'Aprano <st...@REMOVE-THIS-cybersource.com.au> writes:
Steven> $ time ./slurp_dict.py Starting at Sun Nov 11 14:26:51 Steven> 2007 Line 0 Line 1000000 Line 2000000 Line 3000000 Line Steven> 4000000 Line 5000000 Line 6000000 Line 7000000 Line Steven> 8000000 Items in dict: 8191180 Completed import at Sun Nov Steven> 11 14:29:31 2007 Starting to delete dict...
Steven> Traceback (most recent call last): File "./slurp_dict.py", Steven> line 20, in <module> del id2name KeyboardInterrupt
Steven> real 35m52.334s user 1m17.663s sys 0m16.758s
Steven> Notice that the dict is completely read into memory in Steven> just two and a half minutes. The script then tries to Steven> delete the dict, and 32 minutes later is still Steven> struggling. That's the point I got sick of waiting and Steven> interrupted the script.
Steven> Conclusion: it's a memory issue, or maybe a garbage Steven> collection issue, not a problem with dicts.
uh, strange results...
I run your same scripts with and without garbage collection enabled and those are the results:
with gc enabled:
azazel@lizard:~/wip/zodb_test$ python slurp_dict.py Starting at Sun Nov 11 16:35:12 2007 Line 0 Line 1000000 Line 2000000 Line 3000000 Line 4000000 Line 5000000 Line 6000000 Line 7000000 Line 8000000 Items in dict: 8191180 Completed import at Sun Nov 11 16:36:03 2007 Starting to delete dict... Completed deletion at Sun Nov 11 16:36:09 2007 Finishing at Sun Nov 11 16:36:09 2007
and without gc enabled
azazel@lizard:~/wip/zodb_test$ python slurp_dict.py Starting at Sun Nov 11 16:39:02 2007 Line 0 Line 1000000 Line 2000000 Line 3000000 Line 4000000 Line 5000000 Line 6000000 Line 7000000 Line 8000000 Items in dict: 8191180 Completed import at Sun Nov 11 16:39:49 2007 Starting to delete dict... Completed deletion at Sun Nov 11 16:39:56 2007 Finishing at Sun Nov 11 16:39:56 2007
Ah, well, just noticed Ben's suggested this already. Mind you, his code, while correct in intent, does look a bit fishy (missing those square brackets), so don't dismiss it just because you had trouble running it (or mine). Definitely worth a try and I'd be curious to know if it makes a difference.
> Steven D'Aprano wrote: > > (2) More memory will help avoid paging. If you can't get more memory, try > > more virtual memory. It will still be slow, but at least the operating > > system doesn't have to try moving blocks around as much.
> Based on his previous post, it would seem he has 7GB of RAM (with about > 5GB free) and 2GB of swap. I don't think RAM is the issue.
> Maybe there's something wrong with his specific Python installation. > What version is it and was it compiled by him?
This version:
$ python Python 2.3.4 (#1, May 2 2007, 19:18:17) [GCC 3.4.6 20060404 (Red Hat 3.4.6-8)] on linux2 Type "help", "copyright", "credits" or "license" for more information.
$ rpm -qi python Name : python Relocations: (not relocatable) Version : 2.3.4 Vendor: CentOS Release : 14.4 Build Date: Wed 02 May 2007 07:20:29 PM EDT Install Date: Mon 04 Jun 2007 05:48:29 PM EDT Build Host: builder6 Group : Development/Languages Source RPM: python-2.3.4-14.4.src.rpm Size : 21137194 License: PSF - see LICENSE Signature : DSA/SHA1, Sat 05 May 2007 09:33:49 AM EDT, Key ID a53d0bab443e1821 URL : http://www.python.org/
$ uname -a Linux xxx 2.6.9-22.ELsmp #1 SMP Sat Oct 8 21:32:36 BST 2005 x86_64 x86_64 x86_64 GNU/Linux
We've also tried it on this version (on a different machine):
$ python Python 2.4.3 (#1, Mar 14 2007, 19:01:42) [GCC 4.1.1 20070105 (Red Hat 4.1.1-52)] on linux2 Type "help", "copyright", "credits" or "license" for more information.
$ rpm -qi python Name : python Relocations: (not relocatable) Version : 2.4.3 Vendor: CentOS Release : 19.el5 Build Date: Wed 14 Mar 2007 11:06:42 PM UTC Install Date: Mon 29 Oct 2007 07:03:16 PM UTC Build Host: builder6 Group : Development/Languages Source RPM: python-2.4.3-19.el5.src.rpm Size : 22087600 License: PSF - see LICENSE Signature : DSA/SHA1, Wed 04 Apr 2007 12:26:58 AM UTC, Key ID a8a447dce8562897 URL : http://www.python.org/ Summary : An interpreted, interactive, object-oriented programming language.
$ uname -a Linux yyy 2.6.18-8.el5 #1 SMP Thu Mar 15 19:46:53 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux
Firstly, thank you for all of your help so far, I really appreciate it.
> > So, you think the Python's dict implementation degrades towards O(N) > > performance when it's fed millions of 64-bit pseudo-random longs?
> No.
Yes.
I tried your code (with one change, time on feedback lines) and got the same terrible performance against my data set.
$ cat stevescode.py #!/usr/bin/python """Read a big file into a dict."""
import gc import time print "Starting at %s" % time.asctime() flag = gc.isenabled() gc.disable() id2name = {} for n, line in enumerate(open('id2name.txt', 'r')): if n % 1000000 == 0: # Give feedback. print "Line %d" % n,time.asctime() id,name = line.strip().split(':', 1) id = long(id) id2name[id] = name print "Items in dict:", len(id2name) print "Completed import at %s" % time.asctime() print "Starting to delete dict..." del id2name print "Completed deletion at %s" % time.asctime() if flag: gc.enable() print "Finishing at %s" % time.asctime()
$ ./stevescode.py Starting at Sun Nov 11 10:46:37 2007 Line 0 Sun Nov 11 10:46:37 2007 Line 1000000 Sun Nov 11 10:48:22 2007 Line 2000000 Sun Nov 11 10:51:08 2007 Line 3000000 Sun Nov 11 10:56:16 2007 Line 4000000 Sun Nov 11 10:58:41 2007 Line 5000000 Sun Nov 11 11:03:26 2007 ^C
To prove that my machine is sane, I ran the same against your generated sample file and got _excellent_ performance. Start to finish in under a minute.
$ cat steves-makedict.py #!/usr/bin/python """Make a big file of 64-bit integer keys plus random values."""
bits64 = 2**64 import random template = '%d:This is a bunch of text...\n' fp = open('id2name.txt', 'w') for i in xrange(8191180): fp.write(template % random.randint(0, bits64)) fp.close()
$ ./steves-makedict.py
$ ./stevescode.py Starting at Sun Nov 11 11:15:31 2007 Line 0 Sun Nov 11 11:15:31 2007 Line 1000000 Sun Nov 11 11:15:37 2007 Line 2000000 Sun Nov 11 11:15:43 2007 Line 3000000 Sun Nov 11 11:15:49 2007 Line 4000000 Sun Nov 11 11:15:54 2007 Line 5000000 Sun Nov 11 11:16:00 2007 Line 6000000 Sun Nov 11 11:16:07 2007 Line 7000000 Sun Nov 11 11:16:12 2007 Line 8000000 Sun Nov 11 11:16:18 2007 Items in dict: 8191180 Completed import at Sun Nov 11 11:16:19 2007 Starting to delete dict... Completed deletion at Sun Nov 11 11:16:23 2007 Finishing at Sun Nov 11 11:16:23 2007
> Notice that the dict is completely read into memory in just two and a > half minutes. The script then tries to delete the dict, and 32 minutes > later is still struggling. That's the point I got sick of waiting and > interrupted the script.
> Conclusion: it's a memory issue, or maybe a garbage collection issue, not > a problem with dicts.
As you can see, not the case at all against my data set.
> (1) Presumably you don't want to run your app with the garbage collector > turned off. You might still want to play around with the gc module to see > what you can learn.
As you can see, your version did no better. :(
> (2) More memory will help avoid paging. If you can't get more memory, try > more virtual memory. It will still be slow, but at least the operating > system doesn't have to try moving blocks around as much.
The machine has 8GB, and is not doing anything else when I run this.
> (3) Are you sure you need all eight-million-plus items in the cache all > at once?
Yes.
> (4) There are lots of algorithms out there for dealing with data too big > to fit into main memory. Do some research.
It DOES fit into main memory and a dictionary is exactly the right way to do this.
> (5) There is a data structure designed for dealing with tens of millions > of records at once. It is called "a database". If you can't find a better > algorithm, and refuse to use an existing RDBMS, I suspect you're going to > end up inventing a primitive, inefficient, buggy database which is no > faster than existing systems out there.
I've tried three disk-based implementations already (BerkeleyDB, cdb, and an RDBMS) Performance is terrible because they end up doing too many random disk seeks. Pre-caching all of the data ahead of time has offered us the best performance so far, but is still slower than it ought to be.
Creating a HEAP TABLE in the RDBMS is an idea, but moving all of this really easy code into the RDBMS just to find a hashing algorithm that doesn't choke on my keys sounds pretty lame.
A cached in main memory hash is the right way to do this.
On Nov 11, 11:25 am, Michael Bacarella <m...@gpshopper.com> wrote:
> I tried your code (with one change, time on feedback lines) and got the > same terrible > performance against my data set. > To prove that my machine is sane, I ran the same against your generated > sample file and got _excellent_ performance. Start to finish in under a minute.
One possibility could be that your dataset turns out to be some sort of pathological worst case for the hashing algorithm in python.
> > I tried your code (with one change, time on feedback lines) and got the > > same terrible > > performance against my data set.
> > To prove that my machine is sane, I ran the same against your generated >> sample file and got _excellent_ performance. Start to finish in under a minute.
> One possibility could be that your dataset turns out to be some sort > of pathological worst case for the hashing algorithm in python.
> $ cat cache-keys.py > #!/usr/bin/python > v = {} > for line in open('keys.txt'): > v[long(line.strip())] = True
It takes about 20 seconds for me. It's possible it's related to int/long unification - try using Python 2.5. If you can't switch to 2.5, try using string keys instead of longs.
>>>>> "Michael" == Michael Bacarella <m...@gpshopper.com> writes:
>> > This would seem to implicate the line id2name[id] = name as >> being Michael> excruciatingly slow. >> >> As others have pointed out there is no way that this takes 45 >> minutes.Must be something with your system or setup. >> >> A functionally equivalent code for me runs in about 49 seconds! >> (it ends up using about twice as much ram as the data on disk)
On Sun, 11 Nov 2007 08:51:37 -0800, Michael Bacarella wrote: >> As others have pointed out there is no way that this takes 45 >> minutes.Must be something with your system or setup.
>> A functionally equivalent code for me runs in about 49 seconds! >> (it ends up using about twice as much ram as the data on disk)
Michael Bacarella wrote: >>> This would seem to implicate the line id2name[id] = name as being > excruciatingly slow. >> As others have pointed out there is no way that this takes 45 >> minutes.Must be something with your system or setup.
>> A functionally equivalent code for me runs in about 49 seconds! >> (it ends up using about twice as much ram as the data on disk)
DouhetSukd <DouhetS...@gmail.com> writes: > Ah, well, just noticed Ben's suggested this already. Mind you, his > code, while correct in intent, does look a bit fishy (missing those > square brackets)
By "missing those square brackets", what would be a list comprehension (allocating an entire list in memory at once before proceeding) becomes a generator expression (generating results from the sequence only as needed).
Enter 'python "generator expression"' into your favourite search engine to find out more.
-- \ "What I resent is that the range of your vision should be the | `\ limit of my action." -- Henry James | _o__) | Ben Finney
Paul Rubin <http://phr...@NOSPAM.invalid> writes: > Michael Bacarella <m...@gpshopper.com> writes: >> If only it were so easy.
> I think I know what's going on, the dictionary updates are sending the > GC into quadratic behavior. Try turning off the GC:
> import gc > gc.disable()
This is becoming an FAQ on this newsgroup, having come up first with lists and now with dictionaries. It can also be considered a serious implementation problem, since code like Michael's is expected to perform in (close to) linear time and in fact did so in previous versions of Python, and still does in Perl and other popular scripting languages. Neither introductory nor intermediate Python learning materials don't cover the need to disable GC when building large data structures.
Maybe the default GC strategy should be rethought.