I have spent a lot of time in C#, where the standard advice is not to
mess about with the garbage collector because you'll probably just
make things worse. Does that advice apply in Python? Is it a bad idea
to call gc.disable() before loading the trie and then enable it again
afterwards? Does the fact that the garbage collector is taking so much
time indicate I'm doing something particularly bad here? Is there some
way to give the garbage collector a hint that the whole trie is going
to be long-lived and get it promoted straight to generation 2 rather
than scanning it over and over?
$ python
Python 2.6.4 (r264:75706, Dec 7 2009, 18:43:55)
[GCC 4.4.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>>
$ time python -c "import
trie;t=trie.Trie();t.load(open('sowpods.txt'))"
real 0m12.523s
user 0m12.380s
sys 0m0.140s
$ time python -c "import
trie;t=trie.Trie();t.load(open('sowpods.txt'))"
real 0m12.592s
user 0m12.480s
sys 0m0.110s
$ time python -c "import gc;gc.disable();import
trie;t=trie.Trie();t.load(open('sowpods.txt'))"
real 0m6.176s
user 0m5.980s
sys 0m0.190s
$ time python -c "import gc;gc.disable();import
trie;t=trie.Trie();t.load(open('sowpods.txt'))"
real 0m6.331s
user 0m5.530s
sys 0m0.170s
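For reference, the disable/re-enable pattern I have in mind looks roughly
like this (just a sketch; the load_trie helper is only for illustration, and
the gc.collect() at the end does one full pass so the surviving trie is
promoted to the oldest generation instead of being rescanned later):

import gc
import trie

def load_trie(path):
    # Turn the cyclic collector off only for the bulk load; re-enable it
    # even if loading fails, then do one full collection so the survivors
    # end up in generation 2.
    gc.disable()
    try:
        t = trie.Trie()
        t.load(open(path))
    finally:
        gc.enable()
    gc.collect()
    return t

t = load_trie('sowpods.txt')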
=== trie.py ===
class Trie(object):
    __slots__ = ("root", "active")

    def __init__(self):
        self.root = []        # list of (character, child Trie) pairs
        self.active = False   # True if a word ends at this node

    def insert(self, word):
        if len(word) == 0:
            self.active = True
        else:
            head = word[0]
            for ch, child in reversed(self.root):
                if ch == head:
                    child.insert(word[1:])
                    return
            child = Trie()
            self.root.append((head, child))
            child.insert(word[1:])

    def seek(self, word):
        if len(word) == 0:
            return self
        head = word[0]
        for ch, child in self.root:
            if ch == head:
                return child.seek(word[1:])
        return EMPTY_TRIE

    def load(self, file):
        for line in file:
            self.insert(line.strip().lower())

    def empty(self):
        return (not self.root) and not self.active

    def endings(self, prefix=""):
        if self.active:
            yield prefix
        for ch, child in self.root:
            for ending in child.endings(prefix + ch):
                yield ending

EMPTY_TRIE = Trie()
I believe not. It is known that certain patterns of object creation and
destruction can lead to bad gc behavior. No one has discovered a setting
of the internal tuning parameters for which there are no bad patterns,
and I suspect no such setting exists. This does not negate Xavier's
suggestion that a code change might also solve your problem.
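(For reference, the tuning parameters in question are the three collection
thresholds; something like the following shows and adjusts them, though
raising them only postpones the same scanning work, and the 10000 below is
an arbitrary example value:)

import gc

# Default thresholds are (700, 10, 10): a generation-0 collection runs once
# allocations minus deallocations exceed 700, and every 10 collections of a
# generation trigger a collection of the next older one.
print gc.get_threshold()

# Raising the first threshold makes collections rarer during heavy
# allocation, at the cost of letting more garbage pile up between passes.
gc.set_threshold(10000, 10, 10)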
tjr
Well, you are creating and destroying a lot of objects in the process,
so that will provoke the garbage collector. But you are also doing
reversing and searching, and that's slow. Does your application
really need to be able to keep things in order like this, or do you
just want to know if a word is in the dictionary? If you just want to
load up a dictionary and be able to see if words are in it, I would
use a dict instead of a list.
Even if you want to be able to print out the data in order, you might
consider using a dict instead of a list. The print operation could
use one sort at the end, instead of searching all the nodes on
insertion.
You could also use a character that doesn't appear in your data as a
sentinel, and then you don't need a separate active indicator -- every
leaf node will be empty and be referred to by the sentinel character.
You are also doing a lot of recursive operations that could be done
without recursing.
Finally, do you really need to keep an additional object around for
each node in the tree?
I have modified your trie code to use a dict and a sentinel, while
keeping basically the same API. This may or may not be the best way
to do this, depending on your other code which uses this data
structure. It could also probably be made faster by removing the
setdefault and by not re-building objects when you need them; even this
implementation will load faster if you disable the gc. In any case, this
might give you some ideas about how to make your code go faster.
Regards,
Pat
from collections import defaultdict

class TrieTop(object):
    sentinel = ' '  # a character assumed not to appear in the data

    def __init__(self, data=None):
        def defaultrecurse():
            return defaultdict(defaultrecurse)
        if data is None:
            data = defaultrecurse()
        self.data = data

    def insert(self, word):
        data = self.data
        for ch in word:
            data = data[ch]
        data[self.sentinel]   # mark the end of the word with an empty node

    def seek(self, word):
        data = self.data
        for ch in word:
            data = data.get(ch)
            if data is None:
                return EMPTY_TRIE
        return TrieTop(data)

    def load(self, file):
        for line in file:
            self.insert(line.strip().lower())

    def empty(self):
        return (not self.data)

    def endings(self, prefix=""):
        def recurse(data, prefix):
            if not data:
                yield prefix[:-1]   # strip the trailing sentinel
                return
            for ch, data in data.iteritems():
                for result in recurse(data, prefix + ch):
                    yield result
        return sorted(recurse(self.data, prefix))

EMPTY_TRIE = TrieTop()
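It is used much the same way as before; here is an untested sketch ('zym'
is just an example prefix):

t = TrieTop()
t.load(open('sowpods.txt'))

node = t.seek('zym')
if not node.empty():
    # endings() returns the matching words in sorted order,
    # with the sentinel already stripped off.
    for word in node.endings('zym'):
        print word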
> No one has discovered a setting
> of the internal tuning parameters for which there are no bad patterns
> and I suspect there are not any such. This does not negate Xavier's
> suggestion that a code change might also solve your problem.
Could it be that for implementing a structure like a trie, as the OP is
doing, where a lot of CPU cycles can be spent manipulating the structure, a
high-level language like Python, Perl or Ruby just gets in the way?
My feeling would be, try to get the language to do as much of the work for
you as possible. If you can’t do that, then you might be better off with a
lower-level language.
I would rather say that the specific problem of the trie data structure is
that it has extremely little benefit over other available data structures.
There may still be a couple of niches where it makes sense to consider it
as an alternative, but given that dicts are so heavily optimised in Python,
it'll be hard for tries to compete even when written in a low-level language.
Remember that the original use case was to load a dictionary from a text
file. For this use case, a trie can be very wasteful in terms of memory and
rather CPU cache unfriendly on traversal, whereas hash values are a) rather
fast to calculate for a string, and b) often just calculated once and then
kept alive in the string object for later reuse.
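A rough way to see the difference is to time exact lookups against the OP's
trie and against a plain set built from the same file. This is only a
sketch that assumes trie.py and sowpods.txt are at hand; the helper names
and the sampling step are arbitrary:

import timeit
import trie

words = [line.strip().lower() for line in open('sowpods.txt')]

t = trie.Trie()
for w in words:
    t.insert(w)
wordset = set(words)

sample = words[::1000]    # a spread of test words across the file

def trie_lookup():
    for w in sample:
        t.seek(w)

def set_lookup():
    for w in sample:
        w in wordset

print "trie:", min(timeit.repeat(trie_lookup, number=100, repeat=3))
print "set: ", min(timeit.repeat(set_lookup, number=100, repeat=3))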
> My feeling would be, try to get the language to do as much of the work for
> you as possible. If you can’t do that, then you might be better off with a
> lower-level language.
I agree with the first sentence, but I'd like to underline the word 'might'
in the second. As this newsgroup shows, very often it's enough to look for
a better algorithmic approach first.
Stefan
Not true.
>There may still be a couple of niches where it makes sense to consider it
>as an alternative, but given that dicts are so heavily optimised in Python,
>it'll be hard for tries to compete even when written in a low-level language.
It depends. If your data is not in nearly sorted order,
trees are some of the best mechanisms available.
>Remember that the original use case was to load a dictionary from a text
>file. For this use case, a trie can be very wasteful in terms of memory and
>rather CPU cache unfriendly on traversal, whereas hash values are a) rather
>fast to calculate for a string, and b) often just calculated once and then
>kept alive in the string object for later reuse.
You still have to walk the bucket in a hash map/table.
Performance may be orders of magnitude worse than for trees.
"walk the bucket" shouldn't be a significant cost factor here, especially
if you are doing meaningful work with the traversed items. In the CPython
implementation the total hash table size is less than a constant times
the number of actual items. Moreover, it's a linear scan over an array
rather than having to dereference pointers as in a tree.
"Orders of magnitude worse", in any case, sounds very exaggerated.
(and, of course, as the OP said, there's the major difference that the
dict type is implemented in C, which makes constant factors an order of
magnitude smaller than for a Python trie implementation)
The worst case can lose orders of magnitude if a lot of values hash
to the same bucket.
Well, perhaps one order of magnitude.
>>> for i in xrange(100):
...     n = 32*i+1
...     assert hash(2**n) == hash(2)
...
>>> d1 = dict.fromkeys(xrange(100))
>>> d2 = dict.fromkeys([2**(32*i+1) for i in xrange(100)])
>>>
>>> from timeit import Timer
>>> setup = "from __main__ import d1, d2"
>>> t1 = Timer("for k in d1.keys(): x = d1[k]", setup)
>>> t2 = Timer("for k in d2.keys(): x = d2[k]", setup)
>>>
>>> min(t1.repeat(number=1000, repeat=5))
0.026707887649536133
>>> min(t2.repeat(number=1000, repeat=5))
0.33103203773498535
--
Steven
Try with much more than 100 items (you might want to construct the
entries a little more intricately to avoid such big numbers). The point
is that access becomes O(N) instead of O(1). See:
http://www.cs.rice.edu/~scrosby/hash/
for the consequences. http://cr.yp.to/critbit.html discusses the
issue a little more.
But the ratio grows with the number of collisions:
$ python extrapolate.py
10
0.00120401382446
0.00753307342529
ratio: 6.25663366337
100
0.00542402267456
0.316139936447
ratio: 58.2851428571
1000
0.00553417205811
3.36690688133
ratio: 608.384930209
$ cat extrapolate.py
from timeit import Timer

class Item(object):
    def __init__(self, value, hash=None):
        self.value = value
        self.hash = value if hash is None else hash
    def __eq__(self, other):
        return self.value == other.value
    def __hash__(self):
        return self.hash

setup = "from __main__ import d"
bench = "for k in d: x = d[k]"

for n, number in (10, 100), (100, 100), (1000, 10):
    print n
    d1 = dict.fromkeys(Item(i) for i in xrange(n))     # distinct hashes
    d2 = dict.fromkeys(Item(i, 0) for i in xrange(n))  # all hash to 0
    ab = []
    for d in d1, d2:
        t = Timer(bench, setup)
        ab.append(min(t.repeat(number=number, repeat=3)))
        print ab[-1]
    print "ratio:", ab[1]/ab[0]
    print
See also http://xkcd.com/605/
Peter
While this is theoretically true, and it's good to be aware of this
possibility, common string hash functions make it so rare in practice that
a hash table will almost always outperform a trie for exact lookups. If it
happens, it will either show up clearly enough in benchmarks or not be
worth bothering about.
Stefan
It is unlikely to happen by accident. You might care that it can
happen on purpose. See: http://www.cs.rice.edu/~scrosby/hash/
that I cited in another post. The article shows some sample attacks
on Python cgi's.
Certainly interesting from a purely academic point of view, but in real
life if you want to cause a denial of service by overwhelming a server,
there are far more obvious options than trying to guess the server's use
of hash tables and trying to cause lots of collisions in them.
If you look at the very low bandwidth used in some of those hashtable
attacks, it's hard to see any other somewhat-generic attack that's
comparably effective. Usually we think of "DOS" as involving massive
botnets and the like, not a dribble of a few hundred characters per
second.