70 views

Skip to first unread message

Sep 14, 2013, 10:00:22 PM9/14/13

to sage-...@googlegroups.com

The problems arising on #15149 arise from the fact that if objects occur as keys in dictionaries, then their hashes might be required prior to `setstate` being called on them during unpickling. If we trace where the failing `hash` comes from, we find it's

CategoryObject.__hash__(self):

if self._hash_value == -1:

self._hash_value = hash(repr(self))

return self._hash_value

so in a way, it's CategoryObject that breaks the unpickling: If CategoryObject introduces a custom hash, it should also ensure that (at least by default), pickling ensures that it gets unpacked appropriately and is usable prior to __setstate__, a requirement that may seem a little surprising at first. Of course, CategoryObject has no idea what's required to compute repr(self), but if it would do something like:

def __reduce__(self):

return _CategoryObjectReconstructor, (type(self),self.hash()),self.__dict__

def _CategoryObjectReconstructor(T,hashval):

v = CategoryObject.__new__(T)

v._hash_value = hashval

return v

we'd at least recover "hash" (and I think it's safe to assume that repr(loads(dumps(obj))) == repr(obj), so we'd get the same result).

CategoryObject.__hash__(self):

if self._hash_value == -1:

self._hash_value = hash(repr(self))

return self._hash_value

so in a way, it's CategoryObject that breaks the unpickling: If CategoryObject introduces a custom hash, it should also ensure that (at least by default), pickling ensures that it gets unpacked appropriately and is usable prior to __setstate__, a requirement that may seem a little surprising at first. Of course, CategoryObject has no idea what's required to compute repr(self), but if it would do something like:

def __reduce__(self):

return _CategoryObjectReconstructor, (type(self),self.hash()),self.__dict__

def _CategoryObjectReconstructor(T,hashval):

v = CategoryObject.__new__(T)

v._hash_value = hashval

return v

we'd at least recover "hash" (and I think it's safe to assume that repr(loads(dumps(obj))) == repr(obj), so we'd get the same result).

Sep 15, 2013, 2:40:45 AM9/15/13

to sage-...@googlegroups.com

On Saturday, September 14, 2013 7:00:22 PM UTC-7, Nils Bruin wrote:

so in a way, it's CategoryObject that breaks the unpickling: If CategoryObject introduces a custom hash, it should also ensure that (at least by default), pickling ensures that it gets unpacked appropriately and is usable prior to __setstate__, a requirement that may seem a little surprising at first. Of course, CategoryObject has no idea what's required to compute repr(self), but if it would do something like:

And that isn't enough by itself: We need to make sure that equality testing works too. That's a bit hard to do. In fact, it's questionable that CategoryObject implements a hash without implementing an equality test.

Would it be reasonable to:

1] equip CategoryObject with a __init__(self,*args,**kwargs)

2] declare that proper use of CategoryObject includes calling this __init__

3] let the __init__ save args and kwargs (this is the most problematic part, I'd say -- it could lead to extra memory use. Not circularity, though: these objects exist *PRIOR* to the new object being created

4] implement a default hash and eq based in the creation parameters

5] include the creation args and kwargs in the pickling data

6] ensure that there's a __reduce__ that reinstates the cached init (args,kwargs) upon construction during unpickling, so that the default hash and eq work prior to __setstate__ being called.

Of course, objects inheriting from CategoryObject would be free to implement their own pickling, hash, and equality, in which case calling CategoryObject.__init__ would not be required either. But at least this way we'd provide a way to get basic functionality (including pickling).

Otherwise, proper use would have to include "write a pickling procedure that ensures that prior to __setstate__ (which CategoryObject does provide), equality (which CategoryObject does not provide) and hashing (with CategoryObject provides in a rudimentary way) must already work.

Would it be reasonable to:

1] equip CategoryObject with a __init__(self,*args,**kwargs)

2] declare that proper use of CategoryObject includes calling this __init__

3] let the __init__ save args and kwargs (this is the most problematic part, I'd say -- it could lead to extra memory use. Not circularity, though: these objects exist *PRIOR* to the new object being created

4] implement a default hash and eq based in the creation parameters

5] include the creation args and kwargs in the pickling data

6] ensure that there's a __reduce__ that reinstates the cached init (args,kwargs) upon construction during unpickling, so that the default hash and eq work prior to __setstate__ being called.

Of course, objects inheriting from CategoryObject would be free to implement their own pickling, hash, and equality, in which case calling CategoryObject.__init__ would not be required either. But at least this way we'd provide a way to get basic functionality (including pickling).

Otherwise, proper use would have to include "write a pickling procedure that ensures that prior to __setstate__ (which CategoryObject does provide), equality (which CategoryObject does not provide) and hashing (with CategoryObject provides in a rudimentary way) must already work.

Sep 16, 2013, 4:18:45 PM9/16/13

to sage-...@googlegroups.com

Perhaps as a bit more concrete example, if we do

base=object

class Node(base):

def __init__(self,i):

self.i=i

self.neighbours=set()

def __repr__(self):

return "Node %s with neighbours %s"%(self.i,sorted([c.i for c in self.neighbours]))

def connected_component(self):

nodelist=[]

todo=[self]

visited=set()

while todo:

n=todo.pop()

if n.i not in visited:

nodelist.append(n)

todo.extend(n.neighbours)

visited.add(n.i)

nodelist.sort(key=lambda n:n.i)

return nodelist

n=4

N=[Node(i) for i in range(n)]

for j in range(n):

N[j].neighbours=set(N[i] for i in range(n) if i != j)

n=loads(dumps(N[0]))

things work just fine.

If we change the first line to base=SageObject, we get an attribute error, which can be traced back to the definition of __hash__ (defined on SageObject) trying to call __repr__ . Thus, we can say that SageObject breaks pickling on circular structures, in that things that work perfectly fine when inheriting from "object", break when we use SageObject instead. This should probably at least be documented, but ideally we'd find some solution so that pickling doesn't break, or at least that there is a well-documented and low-cost way of making it work again.

base=object

class Node(base):

def __init__(self,i):

self.i=i

self.neighbours=set()

def __repr__(self):

return "Node %s with neighbours %s"%(self.i,sorted([c.i for c in self.neighbours]))

def connected_component(self):

nodelist=[]

todo=[self]

visited=set()

while todo:

n=todo.pop()

if n.i not in visited:

nodelist.append(n)

todo.extend(n.neighbours)

visited.add(n.i)

nodelist.sort(key=lambda n:n.i)

return nodelist

n=4

N=[Node(i) for i in range(n)]

for j in range(n):

N[j].neighbours=set(N[i] for i in range(n) if i != j)

n=loads(dumps(N[0]))

things work just fine.

If we change the first line to base=SageObject, we get an attribute error, which can be traced back to the definition of __hash__ (defined on SageObject) trying to call __repr__ . Thus, we can say that SageObject breaks pickling on circular structures, in that things that work perfectly fine when inheriting from "object", break when we use SageObject instead. This should probably at least be documented, but ideally we'd find some solution so that pickling doesn't break, or at least that there is a well-documented and low-cost way of making it work again.

On Saturday, September 14, 2013 7:00:22 PM UTC-7, Nils Bruin wrote:

Sep 17, 2013, 5:15:14 AM9/17/13

to sage-...@googlegroups.com

Hi Nils,

On 2013-09-16, Nils Bruin <nbr...@sfu.ca> wrote:

> If we change the first line to base=SageObject, we get an attribute error,

> which can be traced back to the definition of __hash__ (defined on

> SageObject) trying to call __repr__ . Thus, we can say that SageObject

> breaks pickling on circular structures, in that things that work perfectly

> fine when inheriting from "object", break when we use SageObject instead.

I am of course not the person who created SageObject. But I think I can

see the rationale for how its hash is implemented: In Python's <object>,

the default is that comparison is by identity, and hence the hash is

obtained from the object's address in memory, id(). In Sage, we are more

likely in a situation where non-identical objects evaluate equal. Hence,

if we want a default hash at all, then it should take this into account.

And it is a good guess that two Sage objects evaluate equal only if their

string representations are equal (but not "if and only if").

Some problems, though: The string representation is not always available

when we need the hash. Computing the string representation is quite

expensive, and hence the default hash is very slow; but at least it is

cached now, as of version Sage-5.12.beta5. And, worst: The string

representation can be customized after object creation, and so the hash

value depends on whether the hash was first called before or after

renaming. Example:

sage: class MyParent(Parent): pass

sage: P1, P2, P3 = MyParent(), MyParent(), MyParent()

sage: repr(P1)==repr(P2)==repr(P3)

True

sage: P2.rename("foobar")

sage: P3.rename("foobar")

sage: hash(P2)

3433925302934160649

sage: P2.reset_name()

sage: P3.reset_name()

sage: repr(P1)==repr(P2)==repr(P3)

True

sage: hash(P2)==hash(P3)

False

sage: hash(P1)==hash(P3)

True

Hence, the hash of P2 and P3 are different, simply since the hash of P2

has been computed before the name was reset.

I have no recommendation what we should do.

Best regards,

Simon

On 2013-09-16, Nils Bruin <nbr...@sfu.ca> wrote:

> If we change the first line to base=SageObject, we get an attribute error,

> which can be traced back to the definition of __hash__ (defined on

> SageObject) trying to call __repr__ . Thus, we can say that SageObject

> breaks pickling on circular structures, in that things that work perfectly

> fine when inheriting from "object", break when we use SageObject instead.

see the rationale for how its hash is implemented: In Python's <object>,

the default is that comparison is by identity, and hence the hash is

obtained from the object's address in memory, id(). In Sage, we are more

likely in a situation where non-identical objects evaluate equal. Hence,

if we want a default hash at all, then it should take this into account.

And it is a good guess that two Sage objects evaluate equal only if their

string representations are equal (but not "if and only if").

Some problems, though: The string representation is not always available

when we need the hash. Computing the string representation is quite

expensive, and hence the default hash is very slow; but at least it is

cached now, as of version Sage-5.12.beta5. And, worst: The string

representation can be customized after object creation, and so the hash

value depends on whether the hash was first called before or after

renaming. Example:

sage: class MyParent(Parent): pass

sage: P1, P2, P3 = MyParent(), MyParent(), MyParent()

sage: repr(P1)==repr(P2)==repr(P3)

True

sage: P2.rename("foobar")

sage: P3.rename("foobar")

sage: hash(P2)

3433925302934160649

sage: P2.reset_name()

sage: P3.reset_name()

sage: repr(P1)==repr(P2)==repr(P3)

True

sage: hash(P2)==hash(P3)

False

sage: hash(P1)==hash(P3)

True

Hence, the hash of P2 and P3 are different, simply since the hash of P2

has been computed before the name was reset.

I have no recommendation what we should do.

Best regards,

Simon

Sep 17, 2013, 6:00:32 AM9/17/13

to sage-...@googlegroups.com

The "rename" feature was always a bit odd. Does anybody use it? I think it comes up occasionally on sage-devel but, so far, inertia always won.

Also, the fact that SageObject has a __hash__() in the first place is dubious. Only immutable object should be hash-able, not everything under the sun. Though by now we probably have lots of people using mutable objects as dictionary keys, so its probably difficult to change.

In terms of immediately actionable items to improve pickling, I think we should first stop pickling caches. If only to be able to have reproducable doctests that don't depend on what has been cached before.

We should also check (using the TestSuite) that __reduce__()[1] doesn't link back to the object being pickled, since that is not supported by Python.

Sep 17, 2013, 10:55:52 AM9/17/13

to sage-...@googlegroups.com

The "rename" feature was always a bit odd. Does anybody use it? I think it comes up occasionally on sage-devel but, so far, inertia always won.

It's useful when you want to create specific named examples. For example, you're creating the complete graph, but don't want it to return a string representation as "Graph with n vertices" to help distinguish it. Perhaps instead of rename() which can be called at anytime, we instead make the alternative name a part of the constructor?

Also, the fact that SageObject has a __hash__() in the first place is dubious. Only immutable object should be hash-able, not everything under the sun. Though by now we probably have lots of people using mutable objects as dictionary keys, so its probably difficult to change.

I'd be somewhat surprised if there were lots of mutable dictionary keys since things that are considered mutable (ex. graphs/vectors/matrices) all have custom hashes that throw errors if called while mutable. It'd be interesting to find out how much of Sage breaks when changing the default __hash__().

Best,

Travis

Sep 17, 2013, 12:22:20 PM9/17/13

to sage-...@googlegroups.com

On Tuesday, September 17, 2013 2:15:14 AM UTC-7, Simon King wrote:

And it is a good guess that two Sage objects evaluate equal only if their

string representations are equal (but not "if and only if").

Some problems, though: The string representation is not always available

when we need the hash. Computing the string representation is quite

expensive, and hence the default hash is very slow; but at least it is

cached now, as of version Sage-5.12.beta5. And, worst: The string

representation can be customized after object creation, and so the hash

value depends on whether the hash was first called before or after

renaming.

Those are some serious warts indeed. With a caching hash we could cheaply repair pickling by storing the hash cache and reinstating that early, but that means that this uncomfortable "hashing depends on renaming status when first called" will get perpetuated into pickles too.

A lot of objects in sage get hashed that may not be meant to: cached functions (and hence CachedRepresentation and UniqueRepresentation, whose __init__ is cached) use the arguments they receive as a key into a dictionary, so I suspect that changing some of these less fortunate default hashes will have surprisingly large effects.

Going forward: I think we should try to contain the problem:

Proper use of SageObject involves overriding the problematic hasht, as well as providing a proper __eq__ . Having to write a __reduce__ that is safe in the face of circularity, or code that verifies there's no circularity upon pickling, which would still require writing a __reduce__, would then be part of proper use as well.

Since that's so onerous, the real advice then would be: Inherit from <these classes> rather than from SageObject, SageCategoryObject, because they implement reasonable __hash__, __eq__ and __reduce__ for you. UniqueRepresentation is one of the classes that does that, but people should of course only inherit from it if that behaviour is actually desired. Which classes can be included in <these classes>?

Sep 17, 2013, 1:12:33 PM9/17/13

to sage-...@googlegroups.com

On Tuesday, September 17, 2013 2:15:14 AM UTC-7, Simon King wrote:

And, worst: The string

representation can be customized after object creation, and so the hash

value depends on whether the hash was first called before or after

renaming.

We can fix that, assuming that `rename` is never used in particularly time-critical situations: Force the computation (and caching) of the hash prior to renaming. With that in place, pickling the cached hash shouldn't be too much of an issue either. Of course, if people implement their own __eq__ (as they almost always have to), they would still need to provide a __reduce__ that takes care of restoring enough attributes early to compute that. We may want to provide a __getnewargs__ type hook to make that easy.

Sep 18, 2013, 1:29:37 AM9/18/13

to sage-...@googlegroups.com

see http://trac.sagemath.org/ticket/15207 for a very preliminary draft of something that might help implementing better pickling procedures.

Sep 18, 2013, 5:23:16 AM9/18/13

to sage-...@googlegroups.com

Hi Nils,

On 2013-09-17, Nils Bruin <nbr...@sfu.ca> wrote:

> On Tuesday, September 17, 2013 2:15:14 AM UTC-7, Simon King wrote:

>>

>> And, worst: The string

>> representation can be customized after object creation, and so the hash

>> value depends on whether the hash was first called before or after

>> renaming.

>>

>

> We can fix that, assuming that `rename` is never used in particularly

> time-critical situations: Force the computation (and caching) of the hash

> prior to renaming.

Hm. There might be situations in which a string representation will

*never* be available before renaming, which means that one can't call

the hash before renaming. I don't know if we really have concrete

examples of this scenario in Sage.

> With that in place, pickling the cached hash shouldn't

> be too much of an issue either.

Indeed this would be needed, since currently the cached hash is not pickled.

Consequence:

sage: class MyParent(Parent): pass

sage: P1 = MyParent()

sage: hash(P1)

1193921704

sage: P1.rename("foobar")

sage: hash(P1)

1193921704

sage: P2 = loads(dumps(P1))

sage: hash(P2)

-1969371895

sage: P1

foobar

sage: P2

foobar

Explanation: The custom name assigned after computing the hash of P1 is

stored in some attribute, which is preserved under pickling. But the

hash value stored for P1 is not preserved under pickling. Hence, P2 uses

the hash *after* renaming.

Best regards,

Simon

On 2013-09-17, Nils Bruin <nbr...@sfu.ca> wrote:

> On Tuesday, September 17, 2013 2:15:14 AM UTC-7, Simon King wrote:

>>

>> And, worst: The string

>> representation can be customized after object creation, and so the hash

>> value depends on whether the hash was first called before or after

>> renaming.

>>

>

> We can fix that, assuming that `rename` is never used in particularly

> time-critical situations: Force the computation (and caching) of the hash

> prior to renaming.

*never* be available before renaming, which means that one can't call

the hash before renaming. I don't know if we really have concrete

examples of this scenario in Sage.

> With that in place, pickling the cached hash shouldn't

> be too much of an issue either.

Consequence:

sage: class MyParent(Parent): pass

sage: P1 = MyParent()

sage: hash(P1)

1193921704

sage: P1.rename("foobar")

sage: hash(P1)

1193921704

sage: P2 = loads(dumps(P1))

sage: hash(P2)

-1969371895

sage: P1

foobar

sage: P2

foobar

Explanation: The custom name assigned after computing the hash of P1 is

stored in some attribute, which is preserved under pickling. But the

hash value stored for P1 is not preserved under pickling. Hence, P2 uses

the hash *after* renaming.

Best regards,

Simon

Sep 18, 2013, 8:30:29 AM9/18/13

to sage-...@googlegroups.com

> Since that's so onerous, the real advice then would be: Inherit from <these

> classes> rather than from SageObject, SageCategoryObject, because they

> implement reasonable __hash__, __eq__ and __reduce__ for you.

"Rather than" is perhaps to strong. Better: "In addition to".
> classes> rather than from SageObject, SageCategoryObject, because they

> implement reasonable __hash__, __eq__ and __reduce__ for you.

> UniqueRepresentation is one of the classes that does that, but people

> should of course only inherit from it if that behaviour is actually

> desired. Which classes can be included in <these classes>?

>

RichPickle that makes it very easy to implement a "hash-safe" __reduce__

method. Namely, just provide a class attribute that lists all attribute

names that are used inside __hash__.

And of course the recommendation would not be "replace SageObject by

RichPickle", but "if you implement a Python class with __hash__ and you

don't bother to implement your own __reduce__ for the class, then add

RichPickle to the list of base classes (in addition to SageObject and so

on)".

Best regards,

Simon

Reply all

Reply to author

Forward

0 new messages

Search

Clear search

Close search

Google apps

Main menu