Applying this advice to the use cases above would require creating an arbitrarily large tuple in memory before passing it to hash(), which is then just thrown away. It would be preferable if there were a way to pass multiple values to hash() in a streaming fashion, such that the overall hash were computed incrementally, without building up a large object in memory first. Should there be better support for this use case? Perhaps hash() could support an alternative signature, allowing it to accept a stream of values whose combined hash would be computed incrementally in *constant* space and linear time, e.g. "hash(items=iter(self))".
You can write a simple function to use hash iteratively to hash the entire stream in constant space and linear time:
    def hash_stream(them):
        val = 0
        for it in them:
            val = hash((val, it))
        return val
Although this creates N 2-tuples, they come and go, so the memory use won't grow. Adjust the code as needed to achieve canonicalization before iterating.
Or maybe I am misunderstanding the requirements?
In other words, objects that do not compare equal can also have the same hash value (although too much of that will reduce the efficiency of Python's containers).
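For example, an order-sensitive content hash of a large stream can then be computed without ever materializing the data as one big tuple (a quick sketch using hash_stream above):

    # Same elements in the same order give the same hash, and nothing
    # larger than a 2-tuple is alive at any point during the fold.
    assert hash_stream(iter(range(10**6))) == hash_stream(range(10**6))

    # Order matters, just as it does for hash(tuple(...)).
    print(hash_stream([1, 2, 3]), hash_stream([3, 2, 1]))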
On Wed, Dec 28, 2016 at 11:48 AM, Ned Batchelder <n...@nedbatchelder.com> wrote:
> You can write a simple function to use hash iteratively to hash the entire stream in constant space and linear time:
>
>     def hash_stream(them):
>         val = 0
>         for it in them:
>             val = hash((val, it))
>         return val
>
> Although this creates N 2-tuples, they come and go, so the memory use won't grow. Adjust the code as needed to achieve canonicalization before iterating.
>
> Or maybe I am misunderstanding the requirements?
This is better than solutions like http://stackoverflow.com/a/27952689/161642 in the sense that it's a little higher level (no bit shifting or magic numbers).
But it's not clear that it's any better in the sense that you're still rolling your own incremental hash algorithm out of a lower-level primitive that doesn't document support for this, and therefore taking responsibility yourself for how well it distributes values into buckets.
Are you confident this results in good hash performance? Is this better than a solution built on top of a hash function with an explicit API for calculating a hash incrementally, such as the crc32 example I included? (And again, this would ideally be a sys.hash_info.hash_bits-bit algorithm.)
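For reference, an incremental approach built on an explicit streaming API might look roughly like the following sketch (not necessarily what the crc32 example above did; note that zlib.crc32 takes the running checksum as its second argument, but its result is only 32 bits, not sys.hash_info.hash_bits):

    import zlib

    def crc32_hash(iterable):
        # Fold each item's hash(), packed into 8 bytes, into a running
        # CRC-32 checksum: constant space, linear time. Only a 32-bit
        # result, so this is illustrative rather than a drop-in
        # __hash__ implementation.
        crc = 0
        for item in iterable:
            crc = zlib.crc32(hash(item).to_bytes(8, "little", signed=True), crc)
        return crc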
With respect Josh, I feel that this thread is based on premature
optimization. It seems to me that you're *assuming* that anything less
than some theoretically ideal O(1) space O(N) time hash function is
clearly and obviously unsuitable.
Of course I might be completely wrong. Perhaps you have implemented your
own __hash__ methods as suggested by the docs, as well as Ned's version,
and profiled your code and discovered that __hash__ is a significant
bottleneck. In which case, I'll apologise for doubting you, but in my
defence I'll say that the language you have used in this thread so far
gives no hint that you've actually profiled your code and found the
bottleneck.
On Thu, Dec 29, 2016 at 7:20 PM, Steven D'Aprano <st...@pearwood.info> wrote:
> I'd rather add a generator to the itertools
> module:
>
> itertools.iterhash(iterable) # yield incremental hashes
>
> or, copying the API of itertools.chain, add a method to hash:
>
> hash.from_iterable(iterable) # return hash calculated incrementally
The itertools module is mainly designed to be consumed lazily. The
hash has to be calculated eagerly, so it's not really a good fit for
itertools.
> But as you showed, it's certainly possible to do some exploration in the
> meantime. Prompted by your helpful comparison, I just put together
> https://gist.github.com/jab/fd78b3acd25b3530e0e21f5aaee3c674 to further
> compare hash_tuple vs. hash_incremental.
It's been a while :-) since I read Knuth[1] (and that was when Knuth
was authoritative on this subject 8^O), but neither methodology is
particularly helpful here. The ideal is effectively a uniform
distribution on a circle, which has no mean. Therefore standard
deviation is also uninteresting, since its definition uses the mean.
The right test is something like a χ² test against a uniform
distribution.
The scatter plots (of hash against test data) simply show the same
thing, without the precision of a statistical test. (Note: do not
deprecate the computer visualization + human eye + human brain system.
It is the best known machine for detecting significant patterns and
anomalies, though YMMV.) But it's not very good at detecting
high-dimensional patterns.
However, it's nowhere near good enough for a hash function to have a
uniform distribution on random data. It actually needs to be
"chaotic" in the sense that it also spreads "nearby" data out, where
"nearby" here would probably mean that as 4000-bit strings less than
m% (for small m) differ, as in real data you usually expect a certain
amount of structure (continuity in time, clustering around a mean,
line noise in a communication channel) so that you'd be likely to get
lots of "clusters" of very similar data. You don't want them pounding
on a small subset of buckets.
> I'm not sure if the methodology there is sound, as I'm new to
> analysis like this.
Even decades later, starting with Knuth[1] can't hurt. :-)
> Given sufficiently good distribution, I'd expect there to be unanimous
> agreement that an immutable collection, which could contain arbitrarily
> many items, should strongly prefer hash_incremental(self) over
> hash(tuple(self)), for the same reason we use generator comprehensions
> instead of list comprehensions when appropriate. Please correct me if I'm
> wrong.
I think that's a misleading analogy. Just stick to the direct argument:
for very large collections where we already have the data,
duplicating the data is very expensive. Furthermore, since the
purpose of hashing is making equality comparisons cheap, this is
likely to happen in a loop.
On the other hand, if there are going to be a *lot* of "large"
collections being stored, then they're actually not all that large
compared to the system, and you might not actually care that much
about the cost of the emphemeral tuples, because the real cost is in
an n^2 set of comparisons. From the point of view of NumPy, this is
an "are you kidding?" argument because large datasets are its stock in
trade, but for core Python, this may be sufficiently esoteric that it
should be delegated to third-party libraries.
On balance, the arguments that Steven d'Aprano advanced for having a
statistics module in the stdlib vs. importing pandas apply here. In
particular, I think there are a huge number of options for an
iterative hash. For example, Ned chained 2-tuples. But perhaps for
efficient time in bounded space you want to use bounded but larger
tuples. I don't know -- and that's the point. If we have a TOOWTDI
API like hash.from_iterable then smart people (and people with time on
their hands to do exhaustive experiments!) can tune that over time, as
has been done with the sort API.
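For instance, the "bounded but larger tuples" option might be sketched like this (the chunk size is an arbitrary knob, exactly the kind of thing that could be tuned behind a hash.from_iterable API):

    from itertools import islice

    def hash_chunked(iterable, chunk_size=128):
        # Fold the stream through fixed-size tuples instead of 2-tuples:
        # still bounded memory, but fewer hash() calls per element.
        it = iter(iterable)
        result = 0
        while True:
            chunk = tuple(islice(it, chunk_size))
            if not chunk:
                return result
            result = hash((result,) + chunk)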
Another option is yielding partials, as Steven says:
> > itertools.iterhash(iterable) # yield incremental hashes
That's a very interesting idea, though I suspect it rarely would make
a big performance improvement.
I'm not sure I like the "hash.from_iterable" name for this API. The
problem is that if it's not a concrete collection[3], then you throw
away the data. If the hash algorithm happens to suck for certain
data, you'll get a lot of collisions, and conclude that your data is
much more concentrated than it actually is. I find it hard to imagine
a use case where you have large data where you only care about whether
two data points are different (cf. equality comparisons for floats).
You want to know how they're different. So I think I would prefer to
name it "hash.from_collection" or similar. Of course whether the
implementation actually raises on a generator or other non-concrete
iterable is a fine point.
Footnotes:
[1] Of course I mean The Art of Computer Programming, Ch. 3.
[2] Including the newly ordered dicts, maybe? Somebody tweet
@dabeaz! What evil can he do with this?
[3] Yeah, I know, "re-iterables". But we don't have a definition of,
let alone an API for identifying, those Things.
Hi,
I'm the author of PEP 456 (SipHash24) and one of the maintainers of the
hashlib module.
Before we come up with a new API or recipe, I would like to understand
the problem first. Why does the OP consider hash(large_tuple) a
performance issue? If you have an object with lots of members that
affect both __hash__ and __eq__, then __hash__ is really the least of
your concerns. The hash has to be computed just once and then stays the
same over the lifetime of the object. Once computed, the hash can be
cached.
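To make that concrete, a minimal caching sketch might look like this (class and attribute names are purely illustrative; the caching is safe only because the contents never change after construction):

    class CachedHashCollection:
        def __init__(self, items):
            self._items = tuple(items)
            self._hash = None  # computed lazily, then reused

        def __iter__(self):
            return iter(self._items)

        def __eq__(self, other):
            if isinstance(other, CachedHashCollection):
                return self._items == other._items
            return NotImplemented

        def __hash__(self):
            # Computed at most once per object, no matter how many
            # lookups the object participates in.
            if self._hash is None:
                self._hash = hash(self._items)
            return self._hash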
On the other hand, __eq__ is called at least once for every successful
hash lookup. In the worst case it is called n-1 times for a dict of
size n, both for a match *and* for a hashmap miss. Every __eq__ call
has to compare between 1 and m member attributes. For a dict of 1,000
elements, each with 1,000 members, that's just 1,000 hash computations
with roughly 8 kB of memory allocation apiece, but almost a million
comparisons in the worst case.
A hasher object adds further overhead, e.g. object allocation and the
creation of a bound method for each call. It's also less CPU-cache
friendly than the linear data structure of a tuple.
On 12/30/2016 03:36 PM, j...@math.brown.edu wrote:
In the use cases I described, the objects' members are ordered. So in the unlikely event that two objects hash to the same value but are unequal, the __eq__ call should be cheap, because they probably differ in length or on their first member, and can skip further comparison. After a successful hash lookup of an object that's already in a set or dict, a successful identity check avoids an expensive equality check. Perhaps I misunderstood?
If you are relying on an identity check for equality then no two FrozenOrderedCollection instances can be equal. Was that your intention? If it was, then just hash the instance's id() and you're done.
If that wasn't your intention then, while there can certainly be a few quick checks to rule out equality (such as length) if those match, the expensive full equality check will have to happen.
No. It is possible to have two keys be equal but different -- an easy example is 1 and 1.0; they both hash the same, equal the same, but are not identical. dict has to check equality when two different objects hash the same but have non-matching identities.
On Fri, Dec 30, 2016 at 8:04 PM, Ethan Furman <et...@stoneleaf.us> wrote:
> No. It is possible to have two keys be equal but different -- an easy example is 1 and 1.0; they both hash the same, equal the same, but are not identical. dict has to check equality when two different objects hash the same but have non-matching identities.
...
I'm having trouble showing that two equal but nonidentical objects can both be in the same dict.
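For what it's worth, a quick check with 1 and 1.0 (a minimal example) confirms that two equal keys never coexist in one dict -- the second assignment finds the existing key via the equality check and only replaces the value:

    assert 1 == 1.0 and hash(1) == hash(1.0) and 1 is not 1.0

    d = {1: "first"}
    d[1.0] = "second"                  # __eq__ finds the existing key...
    assert d == {1: "second"}
    assert type(next(iter(d))) is int  # ...and the original key object is kept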
I don't think so. As someone else said, a hash can be calculated once and then cached, but __eq__ has to be called every time. Depending on the clustering of your data that could be quick... or not.
So maybe this will work?
    def __hash__(self):
        return hash(self.name) * hash(self.nick) * hash(self.color)
In other words, don't create a new tuple, just use the ones you already have and toss in a couple maths operations. (and have somebody vet that ;)
"Most hash schemes depend on having a "good" hash function, in the sense of simulating randomness. Python doesn't ..."
https://github.com/python/cpython/blob/d0a2f68a/Objects/dictobject.c#L133
...
How likely is it that you'll have this form of collision, rather than some other? Remember, collisions *will* happen, so don't try to eliminate them all; just try to minimize the chances of *too many* collisions. So if you're going to be frequently dealing with (1,2,3) and (1,3,2) and (2,1,3) and (3,1,2), then sure, you need to care about order; but otherwise, one possible cause of a collision is no worse than any other. Keep your algorithm simple, and don't sweat the details that you aren't sure matter.
On Fri, Dec 30, 2016 at 10:35 PM, Chris Angelico <ros...@gmail.com> wrote:
> How likely is it that you'll have this form of collision, rather than some other? Remember, collisions *will* happen, so don't try to eliminate them all; just try to minimize the chances of *too many* collisions. So if you're going to be frequently dealing with (1,2,3) and (1,3,2) and (2,1,3) and (3,1,2), then sure, you need to care about order; but otherwise, one possible cause of a collision is no worse than any other. Keep your algorithm simple, and don't sweat the details that you aren't sure matter.

In the context of writing a collections library, and not an application, it should work well for a diversity of workloads that your users could throw at it. In that context, it's hard to know what to do with advice like this. "Heck, just hash the first three items and call it a day!"
A set hashing algorithm is exposed as collections.Set._hash() in _collections_abc.py, which can be passed any sized iterable.
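For example (with the caveat that _hash is an underscore-prefixed helper rather than documented public API), it can be called as a plain function on any sized iterable of hashable items, and the result is order-insensitive:

    from collections.abc import Set

    # The argument needs len() and iteration; elements must be hashable.
    h1 = Set._hash(frozenset({1, 2, 3}))
    h2 = Set._hash((3, 2, 1))
    assert h1 == h2  # same elements, different order and container type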
But, I think that the problem with adding `__hash__` to collections.abc.Iterable is that not all iterables are immutable -- And if they aren't immutable, then allowing them to be hashed is likely to be a pretty bad idea...
On 5 January 2017 at 13:28, Neil Girdhar <miste...@gmail.com> wrote:
> The point is that the OP doesn't want to write his own hash function, but
> wants Python to provide a standard way of hashing an iterable. Today, the
> standard way is to convert to tuple and call hash on that. That may not be
> efficient. FWIW from a style perspective, I agree with OP.
The debate here regarding tuple/frozenset indicates that there may not
be a "standard way" of hashing an iterable (should order matter?).
Although I agree that assuming order matters is a reasonable
assumption to make in the absence of any better information.
Paul Moore writes:
> The debate here regarding tuple/frozenset indicates that there may not
> be a "standard way" of hashing an iterable (should order matter?).
If part of the data structure, yes, if an implementation accident, no.
> Although I agree that assuming order matters is a reasonable
> assumption to make in the absence of any better information.
I don't think so. Eg, with dicts now ordered by insertion, an
order-dependent default hash for collections means
    a = {}
    b = {}
    a['1'] = 1
    a['2'] = 2
    b['2'] = 2
    b['1'] = 1
    hash(a) != hash(b)  # modulo usual probability of collision
(and modulo normally not hashing mutables). For the same reason I
expect I'd disagree with Neil's proposal for an ImmutableWhatever
default __hash__: although the hash comparison is "cheap", it's still a
pessimization. Haven't thought that through, though.
BTW, it occurs to me that now that dictionaries are versioned, in some
cases it *may* make sense to hash dictionaries even though they are
mutable, although the "hash" would need to somehow account for the
version changing. Seems messy but maybe someone has an idea?
Steve
Neil Girdhar writes:
> I don't understand this? How is providing a default method in an
> abstract base class a pessimization? If it happens to be slower
> than the code in the current methods, it can still be overridden.
How often will people override until it's bitten them? How many
people will not even notice until they've lost business due to slow
response? If you don't have a default, that's much more obvious.
Note that if there is a default, the collections are "large", and
equality comparisons are "rare", this could be a substantial overhead.
> > BTW, it occurs to me that now that dictionaries are versioned, in some
> > cases it *may* make sense to hash dictionaries even though they are
> > mutable, although the "hash" would need to somehow account for the
> > version changing. Seems messy but maybe someone has an idea?
> I think it's important to keep in mind that dictionaries are not versioned
> in Python. They happen to be versioned in CPython as an unexposed
> implementation detail. I don't think that such details should have any
> bearing on potential changes to Python.
AFAIK the use of the hash member for equality checking is an
implementation detail too, although the language reference does
mention that set, frozenset and dict are "hashed collections". The
basic requirements on hashes are that (1) objects that compare equal
must hash to the same value, and (2) the hash bucket must not change
over an object's lifetime (this is the "messy" aspect that probably
kills the idea -- you'd need to fix up all hashed collections that
contain the object as a key).
On 6 January 2017 at 07:26, Neil Girdhar <miste...@gmail.com> wrote:
> On Fri, Jan 6, 2017 at 2:07 AM Stephen J. Turnbull
> <turnbull....@u.tsukuba.ac.jp> wrote:
>>
>> Neil Girdhar writes:
>>
>> > I don't understand this? How is providing a default method in an
>> > abstract base class a pessimization? If it happens to be slower
>> > than the code in the current methods, it can still be overridden.
>>
>> How often will people override until it's bitten them? How many
>> people will not even notice until they've lost business due to slow
>> response? If you don't have a default, that's much more obvious.
>> Note that if there is a default, the collections are "large", and
>> equality comparisons are "rare", this could be a substantial overhead.
>
>
> I still don't understand what you're talking about here. You're saying that
> we shouldn't provide a __hash__ in case the default hash happens to be
> slower than what the user wants and so by not providing it, we force the
> user to write a fast one? Doesn't that argument apply to all methods
> provided by abcs?
The point here is that ABCs should provide defaults for methods where
there is an *obvious* default. It's not at all clear that there's an
obvious default for __hash__.
Unless I missed a revision of your proposal, what you suggested was:
> A better option is to add collections.abc.ImmutableIterable that derives from Iterable and provides __hash__.
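For concreteness, such an ABC might be sketched roughly as follows (entirely hypothetical: the name comes from the proposal above, and the order-sensitive fold is just one possible default -- which is exactly the choice being debated):

    from collections.abc import Hashable, Iterable

    class ImmutableIterable(Iterable, Hashable):
        """Hypothetical ABC: an iterable that promises immutability
        and derives a default __hash__ from its contents."""

        __slots__ = ()

        def __hash__(self):
            # Order-sensitive fold, as in hash_stream() earlier in the
            # thread; an order-insensitive default would look more like
            # collections.abc.Set._hash().
            result = 0
            for item in self:
                result = hash((result, item))
            return result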