Status: New
Owner: ----
Labels: Type-Defect Priority-Medium
New issue 306 by matt.gi...@gmail.com: Python: Message objects should not be hashable
http://code.google.com/p/protobuf/issues/detail?id=306
What steps will reproduce the problem?
1. Create a simple .proto file; anything will do:
package test;
message Person {
required string name = 1;
}
2. Create two Message objects and set their fields identically:
>>> import test_pb2
>>> p = test_pb2.Person()
>>> q = test_pb2.Person()
>>> p.name = "Fred"
>>> q.name = "Fred"
3. Note that the two objects compare equal, but their hashes differ:
>>> p == q
True
>>> hash(p) == hash(q)
False
What is the expected output?
>>> hash(p) == hash(q)
TypeError: unhashable type: 'Person'
Rationale
The specification for hashing in Python
(http://docs.python.org/reference/datamodel.html#object.__hash__) states
that "The only required property is that objects which compare equal have
the same hash value". Therefore, it is a violation of Python's semantics for
p and q not to hash to the same value. A practical consequence is that if p
and q are both inserted into a set, or both used as dictionary keys, it is
undefined whether both will be stored or whether one will overwrite the
other (depending on the hash buckets used).
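The mismatch can be shown without protobuf by using a stand-in class. This is only a sketch: FakePerson is a hypothetical name, and it pins __hash__ to object.__hash__ to imitate the identity-based hash reported above (Python 3 would otherwise make a class that defines __eq__ unhashable on its own). On CPython, entries whose hashes differ are treated as distinct keys, so both objects end up in the set; other implementations are not obliged to behave the same way.

```python
class FakePerson:
    """Hypothetical stand-in for a generated Message class."""

    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        # Value-based equality, as Message objects provide.
        return isinstance(other, FakePerson) and self.name == other.name

    # Assumption: imitate the reported behaviour (identity-based hash).
    __hash__ = object.__hash__


p = FakePerson("Fred")
q = FakePerson("Fred")

equal = (p == q)                  # the objects compare equal
same_hash = (hash(p) == hash(q))  # but their hashes are unrelated
people = {p, q}                   # two "equal" objects in one set

print(equal, same_hash, len(people))
```

A set that holds two members which compare equal to each other is exactly the kind of inconsistency the hashing rule is meant to prevent.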
Unfortunately, it is not appropriate to override __hash__ so that two
Message objects hash equally whenever they compare equal, because Message
objects are mutable. The above specification continues, "If a class defines mutable
objects and implements a __cmp__() or __eq__() method, it should not
implement __hash__(), since hashable collection implementations require
that a object’s hash value is immutable (if the object’s hash value
changes, it will be in the wrong hash bucket)."
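The "wrong hash bucket" problem described in that quotation can be sketched as follows. ValueHashedPerson is a hypothetical class, not part of protobuf; it shows what could go wrong if a mutable object hashed by value.

```python
class ValueHashedPerson:
    """Hypothetical mutable class whose hash follows its current value."""

    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        return (isinstance(other, ValueHashedPerson)
                and self.name == other.name)

    def __hash__(self):
        # The hash changes whenever the name field is reassigned.
        return hash(self.name)


p = ValueHashedPerson("Fred")
seen = {p}

found_before = p in seen  # looked up under hash("Fred")

p.name = "Barney"         # mutate the object while it is stored in the set

found_after = p in seen   # now looked up under hash("Barney")
still_stored = any(item is p for item in seen)

print(found_before, found_after, still_stored)
```

After the mutation the set still holds the object, yet a membership test is expected to miss it, because the lookup uses the new hash while the entry was filed under the old one.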
The only valid solution is for Message objects to be unhashable (which can
be accomplished by setting __hash__ = None in the Message class). This is
the approach taken by all mutable built-in types in the Python standard
library (e.g., list, set and dict).
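A minimal sketch of the proposed behaviour, again using a hypothetical stand-in class rather than a real generated Message class:

```python
class UnhashablePerson:
    """Hypothetical stand-in for a Message class with hashing disabled."""

    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        return (isinstance(other, UnhashablePerson)
                and self.name == other.name)

    # Mark instances as unhashable, as list, set and dict are.
    __hash__ = None


p = UnhashablePerson("Fred")

try:
    hash(p)
except TypeError as exc:
    error = str(exc)
else:
    error = None

# The built-in list type fails in the same way.
try:
    hash([])
except TypeError:
    list_is_unhashable = True
else:
    list_is_unhashable = False

print(error, list_is_unhashable)
```

With this in place, an attempt to put such an object in a set or use it as a dictionary key fails immediately with a TypeError instead of silently misbehaving.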
This may break existing code, so perhaps it could be introduced as an
option in protoc (which would set __hash__ = None on all of the generated
classes). This would be a useful option, since all code which relies on the
hashability of Message objects is potentially buggy, due to the undefined
behaviour when inserting Messages into hash tables described above.
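Until such an option exists, the same opt-in effect might be approximated from user code. The sketch below is an assumption throughout: make_unhashable is a hypothetical helper, not a protoc or protobuf feature, and the Fake* classes are stand-ins. Whether real generated classes accept this assignment has not been verified here.

```python
def make_unhashable(*classes):
    """Hypothetical helper: opt the given classes out of hashing."""
    for cls in classes:
        cls.__hash__ = None


class FakePerson:
    """Stand-in for a generated class that is hashable by identity."""

    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        return isinstance(other, FakePerson) and self.name == other.name

    # Assumption: imitate the identity-based hash described in this issue.
    __hash__ = object.__hash__


def is_hashable(obj):
    """Return True if hash(obj) succeeds, False if it raises TypeError."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True


hashable_before = is_hashable(FakePerson("Fred"))
make_unhashable(FakePerson)
hashable_after = is_hashable(FakePerson("Fred"))

print(hashable_before, hashable_after)
```

Code that breaks once the helper is applied is code that was relying on the hashing behaviour described above, which makes the opt-in a way to locate those call sites.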