Pickling of CategoryObject instances

70 views
Skip to first unread message

Nils Bruin

unread,
Sep 14, 2013, 10:00:22 PM9/14/13
to sage-...@googlegroups.com
The problems arising on #15149 arise from the fact that if objects occur as keys in dictionaries, then their hashes might be required prior to `setstate` being called on them during unpickling. If we trace where the failing `hash` comes from, we find it's

CategoryObject.__hash__(self):
        if self._hash_value == -1:
            self._hash_value = hash(repr(self))
        return self._hash_value

so in a way, it's CategoryObject that breaks the unpickling: If CategoryObject introduces a custom hash, it should also ensure that (at least by default), pickling ensures that it gets unpacked appropriately and is usable prior to __setstate__, a requirement that may seem a little surprising at first. Of course, CategoryObject has no idea what's required to compute repr(self), but if it would do something like:

def __reduce__(self):
    return _CategoryObjectReconstructor, (type(self),self.hash()),self.__dict__

def _CategoryObjectReconstructor(T,hashval):
    v = CategoryObject.__new__(T)
    v._hash_value = hashval
    return v

we'd at least recover "hash" (and I think it's safe to assume that repr(loads(dumps(obj))) == repr(obj), so we'd get the same result).

Nils Bruin

unread,
Sep 15, 2013, 2:40:45 AM9/15/13
to sage-...@googlegroups.com


On Saturday, September 14, 2013 7:00:22 PM UTC-7, Nils Bruin wrote:
so in a way, it's CategoryObject that breaks the unpickling: If CategoryObject introduces a custom hash, it should also ensure that (at least by default), pickling ensures that it gets unpacked appropriately and is usable prior to __setstate__, a requirement that may seem a little surprising at first. Of course, CategoryObject has no idea what's required to compute repr(self), but if it would do something like:

And that isn't enough by itself: We need to make sure that equality testing works too. That's a bit hard to do. In fact, it's questionable that CategoryObject implements a hash without implementing an equality test.

Would it be reasonable to:
 1] equip CategoryObject with a __init__(self,*args,**kwargs)
 2] declare that proper use of CategoryObject includes calling this __init__
 3] let the __init__ save args and kwargs (this is the most problematic part, I'd say -- it could lead to extra memory use. Not circularity, though: these objects exist *PRIOR* to the new object being created
 4] implement a default hash and eq based in the creation parameters
 5] include the creation args and kwargs in the pickling data
 6] ensure that there's a __reduce__ that reinstates the cached init (args,kwargs) upon construction during unpickling, so that the default hash and eq work prior to __setstate__ being called.

Of course, objects inheriting from CategoryObject would be free to implement their own pickling, hash, and equality, in which case calling CategoryObject.__init__ would not be required either. But at least this way we'd provide a way to get basic functionality (including pickling).

Otherwise, proper use would have to include "write a pickling procedure that ensures that prior to __setstate__ (which CategoryObject does provide),  equality (which CategoryObject does not provide) and hashing (with CategoryObject provides in a rudimentary way) must already work.

Nils Bruin

unread,
Sep 16, 2013, 4:18:45 PM9/16/13
to sage-...@googlegroups.com
Perhaps as a bit more concrete example, if we do

base=object
class Node(base):
    def __init__(self,i):
        self.i=i
        self.neighbours=set()
    def __repr__(self):
        return "Node %s with neighbours %s"%(self.i,sorted([c.i for c in self.neighbours]))
    def connected_component(self):
        nodelist=[]
        todo=[self]
        visited=set()
        while todo:
            n=todo.pop()
            if n.i not in visited:
                nodelist.append(n)
                todo.extend(n.neighbours)
                visited.add(n.i)
        nodelist.sort(key=lambda n:n.i)
        return nodelist

n=4
N=[Node(i) for i in range(n)]
for j in range(n):
    N[j].neighbours=set(N[i] for i in range(n) if i != j)

n=loads(dumps(N[0]))

things work just fine.

If we change the first line to base=SageObject, we get an attribute error, which can be traced back to the definition of __hash__ (defined on SageObject) trying to call __repr__ . Thus, we can say that SageObject breaks pickling on circular structures, in that things that work perfectly fine when inheriting from "object", break when we use SageObject instead. This should probably at least be documented, but ideally we'd find some solution so that pickling doesn't break, or at least that there is a well-documented and low-cost way of making it work again.


On Saturday, September 14, 2013 7:00:22 PM UTC-7, Nils Bruin wrote:

Simon King

unread,
Sep 17, 2013, 5:15:14 AM9/17/13
to sage-...@googlegroups.com
Hi Nils,

On 2013-09-16, Nils Bruin <nbr...@sfu.ca> wrote:
> If we change the first line to base=SageObject, we get an attribute error,
> which can be traced back to the definition of __hash__ (defined on
> SageObject) trying to call __repr__ . Thus, we can say that SageObject
> breaks pickling on circular structures, in that things that work perfectly
> fine when inheriting from "object", break when we use SageObject instead.

I am of course not the person who created SageObject. But I think I can
see the rationale for how its hash is implemented: In Python's <object>,
the default is that comparison is by identity, and hence the hash is
obtained from the object's address in memory, id(). In Sage, we are more
likely in a situation where non-identical objects evaluate equal. Hence,
if we want a default hash at all, then it should take this into account.
And it is a good guess that two Sage objects evaluate equal only if their
string representations are equal (but not "if and only if").

Some problems, though: The string representation is not always available
when we need the hash. Computing the string representation is quite
expensive, and hence the default hash is very slow; but at least it is
cached now, as of version Sage-5.12.beta5. And, worst: The string
representation can be customized after object creation, and so the hash
value depends on whether the hash was first called before or after
renaming. Example:

sage: class MyParent(Parent): pass
sage: P1, P2, P3 = MyParent(), MyParent(), MyParent()
sage: repr(P1)==repr(P2)==repr(P3)
True
sage: P2.rename("foobar")
sage: P3.rename("foobar")
sage: hash(P2)
3433925302934160649
sage: P2.reset_name()
sage: P3.reset_name()
sage: repr(P1)==repr(P2)==repr(P3)
True
sage: hash(P2)==hash(P3)
False
sage: hash(P1)==hash(P3)
True

Hence, the hash of P2 and P3 are different, simply since the hash of P2
has been computed before the name was reset.

I have no recommendation what we should do.

Best regards,
Simon


Volker Braun

unread,
Sep 17, 2013, 6:00:32 AM9/17/13
to sage-...@googlegroups.com
The "rename" feature was always a bit odd. Does anybody use it? I think it comes up occasionally on sage-devel but, so far, inertia always won. 

Also, the fact that SageObject has a __hash__() in the first place is dubious. Only immutable object should be hash-able, not everything under the sun. Though by now we probably have lots of people using mutable objects as dictionary keys, so its probably difficult to change. 

In terms of immediately actionable items to improve pickling, I think we should first stop pickling caches. If only to be able to have reproducable doctests that don't depend on what has been cached before.

We should also check (using the TestSuite) that  __reduce__()[1] doesn't link back to the object being pickled, since that is not supported by Python.

Travis Scrimshaw

unread,
Sep 17, 2013, 10:55:52 AM9/17/13
to sage-...@googlegroups.com

The "rename" feature was always a bit odd. Does anybody use it? I think it comes up occasionally on sage-devel but, so far, inertia always won. 

It's useful when you want to create specific named examples. For example, you're creating the complete graph, but don't want it to return a string representation as "Graph with n vertices" to help distinguish it. Perhaps instead of rename() which can be called at anytime, we instead make the alternative name a part of the constructor?
 
Also, the fact that SageObject has a __hash__() in the first place is dubious. Only immutable object should be hash-able, not everything under the sun. Though by now we probably have lots of people using mutable objects as dictionary keys, so its probably difficult to change. 

I'd be somewhat surprised if there were lots of mutable dictionary keys since things that are considered mutable (ex. graphs/vectors/matrices) all have custom hashes that throw errors if called while mutable. It'd be interesting to find out how much of Sage breaks when changing the default __hash__().

Best,
Travis

Nils Bruin

unread,
Sep 17, 2013, 12:22:20 PM9/17/13
to sage-...@googlegroups.com
On Tuesday, September 17, 2013 2:15:14 AM UTC-7, Simon King wrote:
And it is a good guess that two Sage objects evaluate equal only if their
string representations are equal (but not "if and only if").

Some problems, though: The string representation is not always available
when we need the hash. Computing the string representation is quite
expensive, and hence the default hash is very slow; but at least it is
cached now, as of version Sage-5.12.beta5. And, worst: The string
representation can be customized after object creation, and so the hash
value depends on whether the hash was first called before or after
renaming.

Those are some serious warts indeed. With a caching hash we could cheaply repair pickling by storing the hash cache and reinstating that early, but that means that this uncomfortable "hashing depends on renaming status when first called" will get perpetuated into pickles too.
 
A lot of objects in sage get hashed that may not be meant to: cached functions (and hence CachedRepresentation and UniqueRepresentation, whose __init__ is cached) use the arguments they receive as a key into a dictionary, so I suspect that changing some of these less fortunate default hashes will have surprisingly large effects.

Going forward: I think we should try to contain the problem:

Proper use of SageObject involves overriding the problematic hasht, as well as providing a proper __eq__ . Having to write a __reduce__ that is safe in the face of circularity, or code that verifies there's no circularity upon pickling, which would still require writing a __reduce__, would then be part of proper use as well.

Since that's so onerous, the real advice then would be: Inherit from <these classes> rather than from SageObject, SageCategoryObject, because they implement reasonable __hash__, __eq__ and __reduce__ for you. UniqueRepresentation is one of the classes that does that, but people should of course only inherit from it if that behaviour is actually desired. Which classes can be included in <these classes>?

Nils Bruin

unread,
Sep 17, 2013, 1:12:33 PM9/17/13
to sage-...@googlegroups.com
On Tuesday, September 17, 2013 2:15:14 AM UTC-7, Simon King wrote:
And, worst: The string
representation can be customized after object creation, and so the hash
value depends on whether the hash was first called before or after
renaming.

We can fix that, assuming that `rename` is never used in particularly time-critical situations: Force the computation (and caching) of the hash prior to renaming. With that in place, pickling the cached hash shouldn't be too much of an issue either. Of course, if people implement their own __eq__ (as they almost always have to), they would still need to provide a __reduce__ that takes care of restoring enough attributes early to compute that. We may want to provide a __getnewargs__ type hook to make that easy.

Nils Bruin

unread,
Sep 18, 2013, 1:29:37 AM9/18/13
to sage-...@googlegroups.com
see http://trac.sagemath.org/ticket/15207 for a very preliminary draft of something that might help implementing better pickling procedures.

Simon King

unread,
Sep 18, 2013, 5:23:16 AM9/18/13
to sage-...@googlegroups.com
Hi Nils,

On 2013-09-17, Nils Bruin <nbr...@sfu.ca> wrote:
> On Tuesday, September 17, 2013 2:15:14 AM UTC-7, Simon King wrote:
>>
>> And, worst: The string
>> representation can be customized after object creation, and so the hash
>> value depends on whether the hash was first called before or after
>> renaming.
>>
>
> We can fix that, assuming that `rename` is never used in particularly
> time-critical situations: Force the computation (and caching) of the hash
> prior to renaming.

Hm. There might be situations in which a string representation will
*never* be available before renaming, which means that one can't call
the hash before renaming. I don't know if we really have concrete
examples of this scenario in Sage.

> With that in place, pickling the cached hash shouldn't
> be too much of an issue either.

Indeed this would be needed, since currently the cached hash is not pickled.
Consequence:

sage: class MyParent(Parent): pass
sage: P1 = MyParent()
sage: hash(P1)
1193921704
sage: P1.rename("foobar")
sage: hash(P1)
1193921704
sage: P2 = loads(dumps(P1))
sage: hash(P2)
-1969371895
sage: P1
foobar
sage: P2
foobar

Explanation: The custom name assigned after computing the hash of P1 is
stored in some attribute, which is preserved under pickling. But the
hash value stored for P1 is not preserved under pickling. Hence, P2 uses
the hash *after* renaming.

Best regards,
Simon

Simon King

unread,
Sep 18, 2013, 8:30:29 AM9/18/13
to sage-...@googlegroups.com
Hi Nils,

On 2013-09-17, Nils Bruin <nbr...@sfu.ca> wrote:
> Since that's so onerous, the real advice then would be: Inherit from <these
> classes> rather than from SageObject, SageCategoryObject, because they
> implement reasonable __hash__, __eq__ and __reduce__ for you.

"Rather than" is perhaps to strong. Better: "In addition to".

> UniqueRepresentation is one of the classes that does that, but people
> should of course only inherit from it if that behaviour is actually
> desired. Which classes can be included in <these classes>?
>

I think your suggestion at #15207 is reasonable: Create a base class
RichPickle that makes it very easy to implement a "hash-safe" __reduce__
method. Namely, just provide a class attribute that lists all attribute
names that are used inside __hash__.

And of course the recommendation would not be "replace SageObject by
RichPickle", but "if you implement a Python class with __hash__ and you
don't bother to implement your own __reduce__ for the class, then add
RichPickle to the list of base classes (in addition to SageObject and so
on)".

Best regards,
Simon


Reply all
Reply to author
Forward
0 new messages