> On Jul 6, 2021, at 19:43, Éloi Rivard <azm...@gmail.com> wrote:
>
> I was wondering if there are drawbacks to systematically using hashes of strings as LOBTree keys, instead of those whole strings as OOBTree keys.
I can think of two potential drawbacks.
First, hashes of strings in CPython are not stable across processes: hash randomization is on by default in Python 3 (since 3.3) and opt-in in Python 2.7. The hash of the same string changes from one process to the next, so these values can't be used for persistence; they'd be limited to transient data within a single process. It's like trying to use `id(an_object)` as a key.
$ python3.9 -c 'print(hash("hi"))'
-7167107578322364224
$ python3.9 -c 'print(hash("hi"))'
7111998896984987603
$ python3.9 -c 'print(hash("hi"))'
5216739811521604812
(Yes, you can opt-out of this in Python 3, but doing so has its own downsides.)
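(For the record, the opt-out is the `PYTHONHASHSEED` environment variable. A minimal sketch, using `subprocess` to spawn fresh interpreters, showing that a fixed seed makes string hashes reproducible across processes while the default randomization does not:)

```python
# Sketch: compare hash("hi") across separate interpreter processes,
# with and without a fixed PYTHONHASHSEED.
import os
import subprocess
import sys

def hash_in_subprocess(s, seed):
    """Compute hash(s) in a fresh interpreter with the given PYTHONHASHSEED."""
    env = dict(os.environ, PYTHONHASHSEED=seed)
    out = subprocess.check_output(
        [sys.executable, "-c", f"print(hash({s!r}))"], env=env, text=True
    )
    return int(out)

# With a fixed seed, two separate processes agree:
assert hash_in_subprocess("hi", "0") == hash_in_subprocess("hi", "0")

# With randomization ("random" is the default), they almost certainly differ:
print(hash_in_subprocess("hi", "random"), hash_in_subprocess("hi", "random"))
```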
Second, there's the issue of hash collisions. It's possible for two completely unequal strings to still have equal hash() values. In that case, because you've thrown away the original string key, you can't say for sure that you're really getting back the data you want. In other words, something like this is theoretically possible:
from BTrees.LOBTree import LOBTree

tree = LOBTree()
tree[hash("hi")] = 42
assert len(tree) == 1
assert tree[hash("bye")] == 42  # "succeeds" if hash("hi") == hash("bye")
Granted, in modern versions of Python 3 this risk is usually ignored, because the algorithm behind `str.__hash__()` is explicitly designed to spread hash values out. But collisions aren't impossible: there are far more possible strings than there are distinct 64-bit numbers. Limiting ourselves to the 26 lower-case letters of the English alphabet plus a space, there are 27**14 possible strings of length 14, and 27**14 is bigger than 2**66. By the pigeonhole principle, at least 27**14 - 2**64 of those strings (about 9 * 10**19 of them) must share a hash value with some other string. If we also allow upper-case letters, duplicates are guaranteed by length 12.
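(The counting argument is easy to check with plain integer arithmetic; nothing here depends on BTrees or on any particular hash function:)

```python
# Pigeonhole arithmetic behind the collision argument above.
ALPHABET = 27                       # 26 lower-case letters + space
LENGTH = 14
num_strings = ALPHABET ** LENGTH    # all strings of exactly that length
num_hashes = 2 ** 64                # distinct 64-bit hash values

# More strings of length 14 than there are 64-bit hash values -- in fact,
# more than four times as many:
assert num_strings > 2 ** 66

# At least this many strings are forced to share a hash with another:
forced = num_strings - num_hashes
assert forced > 3 * 2 ** 64

# Allowing upper- and lower-case letters plus space (53 symbols),
# length 12 is enough to guarantee duplicates; length 11 is not:
assert 53 ** 12 > 2 ** 64
assert 53 ** 11 < 2 ** 64
```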
~Jason