Applying this advice to the use cases above would require creating an arbitrarily large tuple in memory before passing it to hash(), which is then just thrown away. It would be preferable if there were a way to pass multiple values to hash() in a streaming fashion, such that the overall hash were computed incrementally, without building up a large object in memory first. Should there be better support for this use case? Perhaps hash() could support an alternative signature, allowing it to accept a stream of values whose combined hash would be computed incrementally in *constant* space and linear time, e.g. "hash(items=iter(self))".
You can write a simple function to use hash iteratively to hash the entire stream in constant space and linear time:
    def hash_stream(them):
        val = 0
        for it in them:
            val = hash((val, it))
        return val
Although this creates N 2-tuples, they come and go, so the memory use won't grow. Adjust the code as needed to achieve canonicalization before iterating.
Or maybe I am misunderstanding the requirements?
In other words, objects that do not compare equal can also have the same hash value (although too much of that will reduce the efficiency of Python's containers).
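For example, an order-sensitive content hash of a large stream can then be computed without ever materializing the data as one big tuple (a quick sketch using hash_stream above):

    # Same elements in the same order give the same hash, and nothing
    # larger than a 2-tuple is alive at any point during the fold.
    assert hash_stream(iter(range(10**6))) == hash_stream(range(10**6))

    # Order matters, just as it does for hash(tuple(...)).
    print(hash_stream([1, 2, 3]), hash_stream([3, 2, 1]))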
On Wed, Dec 28, 2016 at 11:48 AM, Ned Batchelder <n...@nedbatchelder.com> wrote:
> You can write a simple function to use hash iteratively to hash the entire stream in constant space and linear time:
>
>     def hash_stream(them):
>         val = 0
>         for it in them:
>             val = hash((val, it))
>         return val
>
> Although this creates N 2-tuples, they come and go, so the memory use won't grow. Adjust the code as needed to achieve canonicalization before iterating.
>
> Or maybe I am misunderstanding the requirements?
This is better than solutions like http://stackoverflow.com/a/27952689/161642 in the sense that it's a little higher level (no bit shifting or magic numbers).
But it's not clear that it's any better in the sense that you're still rolling your own incremental hash algorithm out of a lower-level primitive that doesn't document support for this, and therefore taking responsibility yourself for how well it distributes values into buckets.
Are you confident this results in good hash performance? Is this better than a solution built on top of a hash function with an explicit API for calculating a hash incrementally, such as the crc32 example I included? (And again, this would ideally be a sys.hash_info.hash_bits-bit algorithm.)
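For reference, an incremental approach built on an explicit streaming API might look roughly like the following sketch (not necessarily what the crc32 example above did; note that zlib.crc32 takes the running checksum as its second argument, but its result is only 32 bits, not sys.hash_info.hash_bits):

    import zlib

    def crc32_hash(iterable):
        # Fold each item's hash(), packed into 8 bytes, into a running
        # CRC-32 checksum: constant space, linear time. Only a 32-bit
        # result, so this is illustrative rather than a drop-in
        # __hash__ implementation.
        crc = 0
        for item in iterable:
            crc = zlib.crc32(hash(item).to_bytes(8, "little", signed=True), crc)
        return crc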
With respect Josh, I feel that this thread is based on premature
optimization. It seems to me that you're *assuming* that anything less
than some theoretically ideal O(1) space O(N) time hash function is
clearly and obviously unsuitable.
Of course I might be completely wrong. Perhaps you have implemented your
own __hash__ methods as suggested by the docs, as well as Ned's version,
and profiled your code and discovered that __hash__ is a significant
bottleneck. In which case, I'll apologise for doubting you, but in my
defence I'll say that the language you have used in this thread so far
gives no hint that you've actually profiled your code and found the
bottleneck.
On Thu, Dec 29, 2016 at 7:20 PM, Steven D'Aprano <st...@pearwood.info> wrote:
> I'd rather add a generator to the itertools
> module:
>
> itertools.iterhash(iterable) # yield incremental hashes
>
> or, copying the API of itertools.chain, add a method to hash:
>
> hash.from_iterable(iterable) # return hash calculated incrementally
The itertools module is mainly designed to be consumed lazily. The
hash has to be calculated eagerly, so it's not really a good fit for
itertools.
> But as you showed, it's certainly possible to do some exploration in the
> meantime. Prompted by your helpful comparison, I just put together
> https://gist.github.com/jab/fd78b3acd25b3530e0e21f5aaee3c674 to further
> compare hash_tuple vs. hash_incremental.
It's been a while :-) since I read Knuth[1] (and that was when Knuth
was authoritative on this subject 8^O), but neither methodology is
particularly helpful here. The ideal is effectively a uniform
distribution on a circle, which has no mean. Therefore standard
deviation is also uninteresting, since its definition uses the mean.
The right test is something like a χ² test against a uniform
distribution.
The scatter plots (of hash against test data) simply show the same
thing, without the precision of a statistical test. (Note: do not
deprecate the computer visualization + human eye + human brain system.
It is the best known machine for detecting significant patterns and
anomalies, though YMMV.) But it's not very good at detecting
high-dimensional patterns.
However, it's nowhere near good enough for a hash function to have a
uniform distribution on random data. It actually needs to be
"chaotic" in the sense that it also spreads "nearby" data out, where
"nearby" here would probably mean that as 4000-bit strings less than
m% (for small m) differ, as in real data you usually expect a certain
amount of structure (continuity in time, clustering around a mean,
line noise in a communication channel) so that you'd be likely to get
lots of "clusters" of very similar data. You don't want them pounding
on a small subset of buckets.
> I'm not sure if the methodology there is sound, as I'm new to
> analysis like this.
Even decades later, starting with Knuth[1] can't hurt. :-)
> Given sufficiently good distribution, I'd expect there to be unanimous
> agreement that an immutable collection, which could contain arbitrarily
> many items, should strongly prefer hash_incremental(self) over
> hash(tuple(self)), for the same reason we use generator comprehensions
> instead of list comprehensions when appropriate. Please correct me if I'm
> wrong.
I think that's a misleading analogy. Just stick to the direct argument:
for very large collections where we already have the data,
duplicating the data is very expensive. Furthermore, since the
purpose of hashing is making equality comparisons cheap, this is
likely to happen in a loop.
On the other hand, if there are going to be a *lot* of "large"
collections being stored, then they're actually not all that large
compared to the system, and you might not actually care that much
about the cost of the emphemeral tuples, because the real cost is in
an n^2 set of comparisons. From the point of view of NumPy, this is
an "are you kidding?" argument because large datasets are its stock in
trade, but for core Python, this may be sufficiently esoteric that it
should be delegated to third-party libraries.
On balance, the arguments that Steven d'Aprano advanced for having a
statistics module in the stdlib vs. importing pandas apply here. In
particular, I think there are a huge number of options for an
iterative hash. For example, Ned chained 2-tuples. But perhaps for
efficient time in bounded space you want to use bounded but larger
tuples. I don't know -- and that's the point. If we have a TOOWTDI
API like hash.from_iterable then smart people (and people with time on
their hands to do exhaustive experiments!) can tune that over time, as
has been done with the sort API.
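For instance, the "bounded but larger tuples" option might be sketched like this (the chunk size is an arbitrary knob, exactly the kind of thing that could be tuned behind a hash.from_iterable API):

    from itertools import islice

    def hash_chunked(iterable, chunk_size=128):
        # Fold the stream through fixed-size tuples instead of 2-tuples:
        # still bounded memory, but fewer hash() calls per element.
        it = iter(iterable)
        result = 0
        while True:
            chunk = tuple(islice(it, chunk_size))
            if not chunk:
                return result
            result = hash((result,) + chunk)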
Another option is yielding partials, as Steven says:
> > itertools.iterhash(iterable) # yield incremental hashes
That's a very interesting idea, though I suspect it rarely would make
a big performance improvement.
I'm not sure I like the "hash.from_iterable" name for this API. The
problem is that if it's not a concrete collection[3], then you throw
away the data. If the hash algorithm happens to suck for certain
data, you'll get a lot of collisions, and conclude that your data is
much more concentrated than it actually is. I find it hard to imagine
a use case where you have large data where you only care about whether
two data points are different (cf. equality comparisons for floats).
You want to know how they're different. So I think I would prefer to
name it "hash.from_collection" or similar. Of course whether the
implementation actually raises on a generator or other non-concrete
iterable is a fine point.
Footnotes:
[1] Of course I mean The Art of Computer Programming, Ch. 3.
[2] Including the newly ordered dicts, maybe? Somebody tweet
@dabeaz! What evil can he do with this?
[3] Yeah, I know, "re-iterables". But we don't have a definition of,
let alone an API for identifying, those Things.
Hi,
I'm the author of PEP 456 (SipHash24) and one of the maintainers of the
hashlib module.
Before we come up with a new API or recipe, I would like to understand
the problem first. Why does the OP consider hash(large_tuple) a
performance issue? If you have an object with lots of members that
affect both __hash__ and __eq__, then __hash__ is really the least of
your concerns. The hash has to be computed just once and then stays the
same over the lifetime of the object. Once computed, the hash can be
cached.
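To make that concrete, a minimal caching sketch might look like this (class and attribute names are purely illustrative; the caching is safe only because the contents never change after construction):

    class CachedHashCollection:
        def __init__(self, items):
            self._items = tuple(items)
            self._hash = None  # computed lazily, then reused

        def __iter__(self):
            return iter(self._items)

        def __eq__(self, other):
            if isinstance(other, CachedHashCollection):
                return self._items == other._items
            return NotImplemented

        def __hash__(self):
            # Computed at most once per object, no matter how many
            # lookups the object participates in.
            if self._hash is None:
                self._hash = hash(self._items)
            return self._hash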
On the other hand, __eq__ is called at least once for every successful
hash lookup. In the worst case it is called n-1 times for a dict of
size n, both for a match *and* for a hashmap miss. Every __eq__ call
has to compare between 1 and m member attributes. For a dict of 1,000
elements, each with 1,000 members, that's just 1,000 hash computations
with roughly 8 kB of memory allocation apiece, but almost a million
comparisons in the worst case.
A hasher object adds further overhead, e.g. object allocation and the
creation of a bound method for each call. It's also less CPU-cache
friendly than the linear data structure of a tuple.
On 12/30/2016 03:36 PM, j...@math.brown.edu wrote:
In the use cases I described, the objects' members are ordered. So in the unlikely event that two objects hash to the same value but are unequal, the __eq__ call should be cheap, because they probably differ in length or on their first member, and can skip further comparison. After a successful hash lookup of an object that's already in a set or dict, a successful identity check avoids an expensive equality check. Perhaps I misunderstood?
If you are relying on an identity check for equality then no two FrozenOrderedCollection instances can be equal. Was that your intention? If it was, then just hash the instance's id() and you're done.
If that wasn't your intention then, while there can certainly be a few quick checks to rule out equality (such as length) if those match, the expensive full equality check will have to happen.
No. It is possible to have two keys be equal but different -- an easy example is 1 and 1.0; they both hash the same, equal the same, but are not identical. dict has to check equality when two different objects hash the same but have non-matching identities.
On Fri, Dec 30, 2016 at 8:04 PM, Ethan Furman <et...@stoneleaf.us> wrote:
> No. It is possible to have two keys be equal but different -- an easy example is 1 and 1.0; they both hash the same, equal the same, but are not identical. dict has to check equality when two different objects hash the same but have non-matching identities.
...
I'm having trouble showing that two equal but nonidentical objects can both be in the same dict.
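For what it's worth, a quick check with 1 and 1.0 (a minimal example) confirms that two equal keys never coexist in one dict -- the second assignment finds the existing key via the equality check and only replaces the value:

    assert 1 == 1.0 and hash(1) == hash(1.0) and 1 is not 1.0

    d = {1: "first"}
    d[1.0] = "second"                  # __eq__ finds the existing key...
    assert d == {1: "second"}
    assert type(next(iter(d))) is int  # ...and the original key object is kept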
I don't think so. As someone else said, a hash can be calculated once and then cached, but __eq__ has to be called every time. Depending on the clustering of your data that could be quick... or not.
So maybe this will work?
    def __hash__(self):
        return hash(self.name) * hash(self.nick) * hash(self.color)
In other words, don't create a new tuple, just use the ones you already have and toss in a couple maths operations. (and have somebody vet that ;)
"Most hash schemes depend on having a "good" hash function, in the sense of simulating randomness. Python doesn't ..."
https://github.com/python/cpython/blob/d0a2f68a/Objects/dictobject.c#L133
...
How likely is it that you'll have this form of collision, rather than some other? Remember, collisions *will* happen, so don't try to eliminate them all; just try to minimize the chances of *too many* collisions. So if you're going to be frequently dealing with (1,2,3) and (1,3,2) and (2,1,3) and (3,1,2), then sure, you need to care about order; but otherwise, one possible cause of a collision is no worse than any other. Keep your algorithm simple, and don't sweat the details that you aren't sure matter.
On Fri, Dec 30, 2016 at 10:35 PM, Chris Angelico <ros...@gmail.com> wrote:
> How likely is it that you'll have this form of collision, rather than some other? Remember, collisions *will* happen, so don't try to eliminate them all; just try to minimize the chances of *too many* collisions. So if you're going to be frequently dealing with (1,2,3) and (1,3,2) and (2,1,3) and (3,1,2), then sure, you need to care about order; but otherwise, one possible cause of a collision is no worse than any other. Keep your algorithm simple, and don't sweat the details that you aren't sure matter.

In the context of writing a collections library, and not an application, it should work well for a diversity of workloads that your users could throw at it. In that context, it's hard to know what to do with advice like this. "Heck, just hash the first three items and call it a day!"
A set hashing algorithm is exposed as collections.Set._hash() in _collections_abc.py, which can be passed any sized iterable.
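For example (with the caveat that _hash is an underscore-prefixed helper rather than documented public API), it can be called as a plain function on any sized iterable of hashable items, and the result is order-insensitive:

    from collections.abc import Set

    # The argument needs len() and iteration; elements must be hashable.
    h1 = Set._hash(frozenset({1, 2, 3}))
    h2 = Set._hash((3, 2, 1))
    assert h1 == h2  # same elements, different order and container type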
But, I think that the problem with adding `__hash__` to collections.abc.Iterable is that not all iterables are immutable -- And if they aren't immutable, then allowing them to be hashed is likely to be a pretty bad idea...
On 5 January 2017 at 13:28, Neil Girdhar <miste...@gmail.com> wrote:
> The point is that the OP doesn't want to write his own hash function, but
> wants Python to provide a standard way of hashing an iterable. Today, the
> standard way is to convert to tuple and call hash on that. That may not be
> efficient. FWIW from a style perspective, I agree with OP.
The debate here regarding tuple/frozenset indicates that there may not
be a "standard way" of hashing an iterable (should order matter?).
Although I agree that assuming order matters is a reasonable
assumption to make in the absence of any better information.
Paul Moore writes:
> The debate here regarding tuple/frozenset indicates that there may not
> be a "standard way" of hashing an iterable (should order matter?).
If part of the data structure, yes, if an implementation accident, no.
> Although I agree that assuming order matters is a reasonable
> assumption to make in the absence of any better information.
I don't think so. Eg, with dicts now ordered by insertion, an
order-dependent default hash for collections means
    a = {}
    b = {}
    a['1'] = 1
    a['2'] = 2
    b['2'] = 2
    b['1'] = 1
    hash(a) != hash(b)  # modulo usual probability of collision
(and modulo normally not hashing mutables). For the same reason I
expect I'd disagree with Neil's proposal for an ImmutableWhatever
default __hash__: although the hash comparison is "cheap", it's still a
pessimization. Haven't thought that through, though.
BTW, it occurs to me that now that dictionaries are versioned, in some
cases it *may* make sense to hash dictionaries even though they are
mutable, although the "hash" would need to somehow account for the
version changing. Seems messy but maybe someone has an idea?
Steve
Neil Girdhar writes:
> I don't understand this? How is providing a default method in an
> abstract base class a pessimization? If it happens to be slower
> than the code in the current methods, it can still be overridden.
How often will people override until it's bitten them? How many
people will not even notice until they've lost business due to slow
response? If you don't have a default, that's much more obvious.
Note that if there is a default, the collections are "large", and
equality comparisons are "rare", this could be a substantial overhead.
> > BTW, it occurs to me that now that dictionaries are versioned, in some
> > cases it *may* make sense to hash dictionaries even though they are
> > mutable, although the "hash" would need to somehow account for the
> > version changing. Seems messy but maybe someone has an idea?
> I think it's important to keep in mind that dictionaries are not versioned
> in Python. They happen to be versioned in CPython as an unexposed
> implementation detail. I don't think that such details should have any
> bearing on potential changes to Python.
AFAIK the use of the hash member for equality checking is an
implementation detail too, although the language reference does
mention that set, frozenset and dict are "hashed collections". The
basic requirements on hashes are that (1) objects that compare equal
must hash to the same value, and (2) the hash bucket must not change
over an object's lifetime (this is the "messy" aspect that probably
kills the idea -- you'd need to fix up all hashed collections that
contain the object as a key).
On 6 January 2017 at 07:26, Neil Girdhar <miste...@gmail.com> wrote:
> On Fri, Jan 6, 2017 at 2:07 AM Stephen J. Turnbull
> <turnbull....@u.tsukuba.ac.jp> wrote:
>>
>> Neil Girdhar writes:
>>
>> > I don't understand this? How is providing a default method in an
>> > abstract base class a pessimization? If it happens to be slower
>> > than the code in the current methods, it can still be overridden.
>>
>> How often will people override until it's bitten them? How many
>> people will not even notice until they've lost business due to slow
>> response? If you don't have a default, that's much more obvious.
>> Note that if there is a default, the collections are "large", and
>> equality comparisons are "rare", this could be a substantial overhead.
>
>
> I still don't understand what you're talking about here. You're saying that
> we shouldn't provide a __hash__ in case the default hash happens to be
> slower than what the user wants and so by not providing it, we force the
> user to write a fast one? Doesn't that argument apply to all methods
> provided by abcs?
The point here is that ABCs should provide defaults for methods where
there is an *obvious* default. It's not at all clear that there's an
obvious default for __hash__.
Unless I missed a revision of your proposal, what you suggested was:
> A better option is to add collections.abc.ImmutableIterable that derives from Iterable and provides __hash__.
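For concreteness, such an ABC might be sketched roughly as follows (entirely hypothetical: the name comes from the proposal above, and the order-sensitive fold is just one possible default -- which is exactly the choice being debated):

    from collections.abc import Hashable, Iterable

    class ImmutableIterable(Iterable, Hashable):
        """Hypothetical ABC: an iterable that promises immutability
        and derives a default __hash__ from its contents."""

        __slots__ = ()

        def __hash__(self):
            # Order-sensitive fold, as in hash_stream() earlier in the
            # thread; an order-insensitive default would look more like
            # collections.abc.Set._hash().
            result = 0
            for item in self:
                result = hash((result, item))
            return result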