Store hashes of objects in BTrees instead of objects?

Éloi Rivard

unread,

Jul 6, 2021, 8:43:17 PM7/6/21

to zodb

Hello,

The BTrees documentation tells us that :

There are variants of the data structures specialized to numbers, which are faster and more memory efficient than those dealing with objects.

Also Jason Madden did some benchmarking to illustrate this.

I just realized that, at least on my system, python hashes are less than 64bits long so they can fit as LOBTree keys.

>>> from BTrees.LOBTree import LOBTree
>>> h = hash("whatever")
>>> h.bit_length()
63

>>> t = LOBTree()
>>> t[h] = "anything"

>>> t[h]

'anything'

I was wondering if there are drawbacks to systematically using hashes of strings as LOBTree keys, instead of those whole strings as OOBTree keys.

What do you think?

Jason Madden

unread,

Jul 7, 2021, 12:07:48 PM7/7/21

to Éloi Rivard, zodb

> On Jul 6, 2021, at 19:43, Éloi Rivard <azm...@gmail.com> wrote:
>
> I was wondering if there are drawbacks to systematically using hashes of strings as LOBTree keys, instead of those whole strings as OOBTree keys.

I can think of two potential drawbacks.

First, hashes of strings in CPython are not stable (by default in Python 3, opt-in in Python 2.7). They change from one process to the next. So this can't be used for persistence, it would be limited to transient data in a single process. It's like trying to use `id(an_object)`.

$ python3.9 -c 'print(hash("hi"))'
-7167107578322364224
$ python3.9 -c 'print(hash("hi"))'
7111998896984987603
$ python3.9 -c 'print(hash("hi"))'
5216739811521604812

(Yes, you can opt-out of this in Python 3, but doing so has its own downsides.)

Second, there's the issue of hash collisions. It's possible for two completely unequal strings to still have equal hash() values. In that case, because you've thrown away the original string key, you can't say for sure that you're really getting back the data you want. In other words, something like this is theoretically possible:

tree = LOBTree()
tree[hash("hi")] = 42
assert len(tree) == 1
assert tree[hash("bye")] == 42

Granted, in modern versions of Python 3 this is often ignored because the algorithm that `str.__hash__()` uses is explicitly designed to spread hash values out. But it's not impossible: there are many more possible strings than there are distinct 64-bit numbers. Limiting ourselves to the 26 lower-case characters of the English alphabet, plus a space, we find that there are more than 2**64 possible strings of length 14 [27**14]; in fact, that's bigger than 2**66, meaning that there are something around 2**64 strings in that group that, by definition, would have to have duplicate hashes. If we allow upper and lower case letters, we're guaranteed of duplicates after 12 characters.

~Jason

Éloi Rivard

unread,

Jul 8, 2021, 8:43:35 AM7/8/21

to zodb

I was ignoring that the hash salt for those native python types was different on different processes, and generated different hashes. Some types seems to be safe in that regard though:

$ python3.9 -c "import uuid; print(hash(uuid.UUID('0ef31c24-06c8-460c-93c4-30f39c7c8ef1')))"

818830583600693081

$ python3.9 -c "import uuid; print(hash(uuid.UUID('0ef31c24-06c8-460c-93c4-30f39c7c8ef1')))"

81883058360069308

It looks that this can be mitigated by using hashing functions from hashlib, then extract a 64bit integer from the hashes.

$ python3.9 -c "import hashlib; print(int(hashlib.sha256(b'hi').hexdigest(), base=16) % 2 ** 64)"
14450742454081714852

I get your point on the string lengths. Now the question is to know if that collision probability is acceptable in my context...

Thank you for the hints!

Jim Fulton

unread,

Jul 8, 2021, 9:37:14 AM7/8/21

to Jason Madden, Éloi Rivard, zodb

Also, range searches, which are an important feature become useless.

--
You received this message because you are subscribed to the Google Groups "zodb" group.
To unsubscribe from this group and stop receiving emails from it, send an email to zodb+uns...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/zodb/6F0BAE57-F01E-4D87-938E-DB01602086BC%40nextthought.com.

Jim Fulton

unread,

Jul 8, 2021, 9:44:34 AM7/8/21

to Éloi Rivard, zodb

As noted elsewhere, this would be problematic because:

- Hash collisions

- Loss of range searches.

(Unstable hashes could be mitigated using a custom hash function, at least for selected types.)

A better approach, IMO, would be to build a scalable persistent hash table on top of BTrees. Use a BTree to maintain the hash table. The BTree values could either be hash buckets containing keys with the same hash mapped to their values, or target values in the common case of no collisions.

Jim

--
You received this message because you are subscribed to the Google Groups "zodb" group.
To unsubscribe from this group and stop receiving emails from it, send an email to zodb+uns...@googlegroups.com.

To view this discussion on the web visit https://groups.google.com/d/msgid/zodb/5762150c-6e0a-4104-b2bb-f0b476c26a9en%40googlegroups.com.

Reply all

Reply to author

Forward