Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

Perfect hashing for Py

113 views
Skip to first unread message

bearoph...@lycos.com

unread,
Jul 11, 2008, 8:01:12 AM7/11/08
to
Following links from this thread:
http://groups.google.com/group/comp.lang.python/browse_thread/thread/179e1a45485ab36a#

I have found this perfect hash (minimal too) implementation:
http://burtleburtle.net/bob/hash/perfect.html

I have already translated part of it to D, and it seems to work well
enough. As discussed in the PyConDue, I think this may be used in
frozenset (and frozendict) to build a (minimal too?) perfect hash on
the fly, to allow (hopefully) faster retrieval of items that don't
change.
That code is C and I think it's public domain, so if experiments show
it gives enough speed up, it may be added to CPython 2.6/3.

Bye,
bearophile

Casey

unread,
Jul 11, 2008, 7:46:41 PM7/11/08
to
On Jul 11, 8:01 am, bearophileH...@lycos.com wrote:
> Following links from this thread:http://groups.google.com/group/comp.lang.python/browse_thread/thread/...

>
> I have found this perfect hash (minimal too) implementation:http://burtleburtle.net/bob/hash/perfect.html
>
> I have already translated part of it to D, and it seems to work well
> enough. As discussed in the PyConDue, I think this may be used in
> frozenset (and frozendict) to build a (minimal too?) perfect hash on
> the fly, to allow (hopefully) faster retrieval of items that don't
> change.
> That code is C and I think it's public domain, so if experiments show
> it gives enough speed up, it may be added to CPython 2.6/3.
>
> Bye,
> bearophile

It would be interesting to see if such an algorithm could actually
provide any significant performance improvements for the size of sets
that I suspect are most often used in practice. The chance of a hash
collision for a good 32-bit general is fairly low even for a set of
1,000,000 unique elements, which seems to me to be a pretty large
memory-based set. Compare that with the cost of determining a perfect
hash (O(n**2)?).

From my perspective, a perfect hash would certainly be a welcome
addition to the Python library or even as an optional algorithm
supporting
hash-based collections.

Raymond Hettinger

unread,
Jul 12, 2008, 3:13:26 AM7/12/08
to
On Jul 11, 3:01 pm, bearophileH...@lycos.com wrote:
> I have found this perfect hash (minimal too) implementation:http://burtleburtle.net/bob/hash/perfect.html
>
> I have already translated part of it to D, and it seems to work well
> enough. As discussed in the PyConDue, I think this may be used in
> frozenset (and frozendict) to build a (minimal too?) perfect hash on
> the fly, to allow (hopefully) faster retrieval of items that don't
> change.

A few thoughts: Objects currently have their own hash function that
is independent of a given container. We also have to hash types other
than strings (used in the examples in the links you provided). While
a container such as a dict or set is being built, we don't even know
how many unique keys there are until the container is fully
populated. Set-to-set operations and set-to-dict operations get their
speed from knowing that both containers use the same hash values --
this would get disrupted if each container had its own hash function.
Some types like strings (especially interned strings) remember their
own hash value -- this makes it very fast to look their values in a
set or dict -- that would be defeated if each container has its own
hash function which would need to be recomputed for every lookup. An
important use case for sets is uniquification -- in that use case,
speed is determined by insertion time, we just don't care about
subsequent lookup time -- anything that slows down insertion (such as
computing a perfect hash value) works to the detriment of that use
case). In the current design, sets and dicts are never more than 2/3
full and are usually much more sparse than that -- accordingly lookups
average between 1 and 1.5 probes per lookup. This is somewhat hard to
beat with perfect hashing since we typically get very few collisions
anyway -- so you get the cost of increased insertion time, loss of
objects being able to remember their own hash, challenges with non-
string keys, a more complicated algorithm, and decreased
interoperability for set-to-set and set-to-dict options -- the only
benefits are to reduce collisions to zero when they are already not
that common.

For frozensets, it may be possible to add an optimize() method that
takes a completely built frozenset and optimizes its insertion order
and/or increases its size to make it arbitrarily sparse. That would
only be useful for cases when the costs of optimizing the table is
fully repaid by an extensive number of lookups. There are use some
cases where it would be pay table optimization costs in order to win
decreased lookup costs, but there are plenty of use cases where that
would not be a winning trade-off. So, the optimize() step would need
to be optional, not builtin.


Raymond

FWIW, I mentioned all this at PyConDue but the message must have
gotten lost.

Raymond Hettinger

unread,
Jul 12, 2008, 10:11:35 AM7/12/08
to
On Jul 12, 10:13 am, Raymond Hettinger <pyt...@rcn.com> wrote:
> On Jul 11, 3:01 pm, bearophileH...@lycos.com wrote:
>
> > I have found this perfect hash (minimal too) implementation:http://burtleburtle.net/bob/hash/perfect.html
>
> > I have already translated part of it to D, and it seems to work well
> > enough. As discussed in the PyConDue, I think this may be used in
> > frozenset (and frozendict) to build a (minimal too?) perfect hash on
> > the fly, to allow (hopefully) faster retrieval of items that don't
> > change.

I played around with the idea and have a rough implementation sketch:

class PerfectSet(collections.Set):
__slots__ = ['len', 'hashvalue', 'hashfunc', 'hashtable']
DUMMY = object()
def __init__(self, keys, sparseness=0.5):
keys = frozenset(keys)
self.len = len(keys)
self.hashvalue = hash(keys)
n = (len(keys) / sparseness) + 1
assert n > self.len
self.hashfunc = make_perfect_hash_function(keys, n)
self.hashtable = [self.DUMMY] * n
for k in keys:
self.hashtable[self.hashfunc(k)] = k
del __len__(self, k):
return self.len
def __iter__(self, k)
return (k for k in self.hashtable if k is not self.DUMMY)
def __contains__(self, k):
return self.table[self.hashfunc(k)] is not self.DUMMY

The perfect hash function will need to use the regular hash(obj) as a
starting point. This helps with non-string types and it preserves
existing relationships between __hash__ and __eq__.

The construction is more expensive than with regular frozensets so it
is only useful when lookups are very common.

Playing around with the idea suggests that a sparseness variable for
frozensets would largely accomplish the same goal without the overhead
of creating and applying a perfect hash function. Sparse hash tables
rarely collide.


Raymond

0 new messages