Pickling limitations


Volker Braun

Sep 4, 2013, 7:29:30 PM
to sage-...@googlegroups.com
Having spent a day looking closer at pickling, I found some dangerous issues. In particular, Python currently has a bug where you can create pickles that unpickle incorrectly without raising any exceptions (http://trac.sagemath.org/15156). This usually requires a rather esoteric setup with a cyclic loop between __reduce__ calls. But in Sage basically every Cython class requires a __reduce__ method, so it is much more likely in Sage to fall into this trap. And with cached_methods and parents it is very easy to generate cyclic references.

The fundamental difficulty is the tension between uniqueness of objects (and constituent objects of objects) and cyclic references. We had some discussion at http://trac.sagemath.org/15155, but I think the general question of how to implement serialization properly belongs in front of a wider audience.

One possibility that I'd like to bring up is to write our own pickler. Since we have our own framework for creating unique objects, we don't *really* need the pickler to unpickle identical objects as identical objects when there are reference loops. Instead, we could modify the Python pickling framework to just break cyclic references, making it much simpler. Then it would be easy to fix the current bug(s). And the existing UniqueRepresentation framework would then be up to the task of making unique the objects that Sage really needs to be unique. Though perhaps there is a problem with this approach that I haven't thought about?

Nils Bruin

Sep 7, 2013, 12:12:30 PM
to sage-...@googlegroups.com

On Wednesday, September 4, 2013 4:29:30 PM UTC-7, Volker Braun wrote:
One possibility that I'd like to bring up is to write our own pickler. Since we have our own framework for creating unique objects, we don't *really* need the pickler to unpickle identical objects as identical objects when there are reference loops. Instead, we could modify the Python pickling framework to just break cyclic references, making it much simpler. Then it would be easy to fix the current bug(s). And the existing UniqueRepresentation framework would then be up to the task of making unique the objects that Sage really needs to be unique. Though perhaps there is a problem with this approach that I haven't thought about?

I assume you are thinking of copying the Python cPickle code into, say, cPickle_sage and then adapting it for our own purposes? That's a little unattractive on general grounds: we'd now have to maintain an extra piece of code, and our work on it wouldn't benefit Python at large; but if there were big advantages in return it could be worth it.

The thing I'm wondering about is the "just break cyclic references" part: the current idea of a pickler is that it's relatively dumb. It doesn't really have detailed information on how the objects offered to it fit together or how they are supposed to be created. Most of the time it leaves that to the relevant __reduce__ or __getstate__ routines.

I don't think most cyclic structures can be fruitfully serialized and then recreated by breaking an arbitrary cyclic reference. It depends on the nature of the cycle how it should be broken. (The garbage collector is different in that respect: the objects don't have to remain sane and valid once GC gets to them, but even there care needs to be taken; see the weakref notes in Python's gc module documentation.)

Unless we deviate very far from cPickle's design, the cycle-breaking information would still need to be encapsulated in __reduce__/__getstate__ routines on the objects rather than in the pickler. In that situation I'd think it would be worth a shot to try to make those routines work with the present cPickle already. Once we figure out common strategies, we can write tools and decorators to make it easier to apply them.

My initial impression is that forking cPickle will be a lot of work with obvious disadvantages and very few clear advantages and that trying other approaches first seems more promising.

Volker Braun

Sep 7, 2013, 12:38:04 PM
to sage-...@googlegroups.com
The Python pickling module isn't all that much code; the Pickler class is about 500 lines, including support for multiple pickle format versions. Just sage.misc.explain_pickle is already over twice as long as the whole Python pickle module. And we can probably use the same unpickling code.

On Saturday, September 7, 2013 5:12:30 PM UTC+1, Nils Bruin wrote:
I don't think most cyclic structures can be fruitfully serialized and then recreated by breaking an arbitrary cyclic reference. It depends on the nature of the cycle how it should be broken

Is that really true? That is basically my question. What would go wrong? If cyclic references can only be broken in specific orders, then the whole thing is probably not going to work, since we can't predict how a user will compose Sage objects.

Nils Bruin

Sep 7, 2013, 3:01:20 PM
to sage-...@googlegroups.com
On Saturday, September 7, 2013 9:38:04 AM UTC-7, Volker Braun wrote:
Is that really true? That is basically my question. What would go wrong? If cyclic references can only be broken in specific orders, then the whole thing is probably not going to work, since we can't predict how a user will compose Sage objects.

Let's try. I think there are structures where you can't just delete cyclic references at all, unless you take special effort to recreate them afterwards. For instance, consider

class GraphNode(object):
    def __init__(self, neighbours=None):
        if neighbours is None:
            self.neighbours = []
        else:
            self.neighbours = neighbours

# make the complete graph on 4 nodes
C = [GraphNode() for i in range(4)]
for c in C:
    c.neighbours.extend([b for b in C if b is not c])

I can imagine that similar situations could naturally arise in Sage. Perhaps the following:

Consider a class for ProjectiveSpace and its associated PicardGroup, created by

P2 = ProjectiveSpace(k, 2)
PicP2 = PicardGroup(P2)

Let's assume computing a PicardGroup is expensive, so we'd like to ensure that PicP2's lifetime coincides with P2's, so we put a reference to PicP2 on P2. Of course PicP2 needs a reference to P2.

When pickling P2 we don't absolutely need to pickle PicP2, although not doing so means that, after unpickling that P2, it might be expensive to obtain PicP2 again. A solution would be to pickle [P2, PicP2] instead.

Even for pickling PicP2 it would not be strictly necessary to store a reference to P2 explicitly: if it gets pickled as "execute PicardGroup(P2)", the reference to P2 would be implied anyway. However, that's of course NOT how we'd pickle something that's expensive to compute. It should probably have some __setstate__ way of recreating PicP2 with the expensively computed data explicitly passed to it. That probably includes P2.

Now a different example, which I think is a little more like the graph example above. Suppose we store the affine patches of P2 as affine spaces. Since affine spaces can exist on their own, their creation doesn't necessarily imply a reference to P2. However, for pickling the constellation of P2 with its affine patches, I would think that including the relations between them (which need to be stored in both directions!) is probably the expected behaviour.

We can of course impose a hierarchy (artificially) to break the circularity: we can say affine patches always know about their projective closure, and only cache the references from projective space to its patches for efficiency reasons. But doing so might have consequences when someone does:

A2 = AffineSpace(k, 2)
P2 = ProjectiveClosure(A2)
allA2s = AllAffinePatches(P2)

(i.e., here P2 is derived from A2 rather than the other way around)

The problem is that the relations between the A2s and the P2 cannot be found by coercion discovery: the relations are only there because of how the objects were constructed. So if we were to simply break the connections and pickle P2 and allA2s, we'd end up with just a bunch of ambient spaces, without the relations they are supposed to have.

I suspect similar things can happen with number fields, where certain coercions might get installed upon creation (e.g., a "represent_relative_extension_as_absolute_extension"), but where the creation data of the number field doesn't hold a hint that this happened. So here one would not necessarily end up with a defunct pickle, but unless one also explicitly pickles the appropriate "register_coercion", the pickle would lose information. I think solving this will simply be too complicated: this is state that ends up living outside the objects explicitly referenced, so pickle can't see it.

Finally, for rings there is always a fundamental circularity:

Rings tend to store a "zero" and a "one" element. Those being elements, they carry a pointer to their parent, so any ring tends to be part of a cycle. (This is actually nice for GC purposes: it automatically extends the life of rings to the next cyclic GC, so frequent ring creation/destruction is moderated. It also increases the burden on the GC, making it more expensive.)

So automatic pickling of a ring would have to deal with circularity too. Here, the appropriate solution is not to worry about zero and one: they'll get created anyway, so you shouldn't pickle them explicitly.
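A minimal sketch of that "don't pickle zero and one" strategy (the class names here are invented for illustration; this is not Sage's actual ring code):

```python
import pickle

class ToyElement(object):
    def __init__(self, parent, value):
        self.parent = parent            # element -> ring back-pointer
        self.value = value % parent.modulus

class ToyRing(object):
    def __init__(self, modulus):
        self.modulus = modulus
        # cached special elements close the cycle: ring -> element -> ring
        self.zero = ToyElement(self, 0)
        self.one = ToyElement(self, 1)

    def __reduce__(self):
        # pickle only the construction data; zero and one are cheap to
        # recreate in __init__, so shipping them would only drag the
        # cycle into the pickle
        return (ToyRing, (self.modulus,))

R = pickle.loads(pickle.dumps(ToyRing(7)))
```

Since __init__ runs again on unpickling, the cached elements come back automatically, cycle and all, without the cycle ever touching the pickle stream.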

Most of these situations have fairly simple solutions, but in each case specific knowledge of the objects at hand seems to be required.

Volker Braun

Sep 8, 2013, 10:15:50 AM
to sage-...@googlegroups.com
On Saturday, September 7, 2013 8:01:20 PM UTC+1, Nils Bruin wrote:
C = [GraphNode() for i in range(4)]
for c in C:
    c.neighbours.extend([b for b in C if b is not c])

I agree that you can't use the "object id as unique identifier" pattern. AFAIK the graph theory code in Sage does not work like that, but uses integers to enumerate nodes. If you really want unique GraphNodes, then the "proper Sage" way of writing it is to use UniqueRepresentation and, therefore, give each node some constructor argument that uniquely identifies it.

In fact, we could still preserve uniqueness in circular references as long as there is no __reduce__ loop involved. The problem with __reduce__ is that it runs code, and running code on incomplete objects leads to unexpected results. It is safe, by contrast, to stick your hand into the object internals and just change the __dict__ of all instances until they are back to the original state. Actually, this is not quite true: as Simon noted, unpickling of dictionaries still calls __hash__ on the keys. But close enough.

First of all, I think we should never pickle cached data. It's just needless complexity, and it leads to non-doctestable surprises where pickle-ability depends on which computations you did before pickling. Plus, switching it off is probably the only thing that we can change quickly.

Second, __reduce__ loops should be forbidden. Technically, they already are since Python does not support them. But unfortunately you just get corrupt pickles instead of a clear error at pickling time. Ideally, we'd have a doctest that ensures that there is no reduce loop in any Sage object. As a corollary, any __reduce__ method should always just pickle the absolute minimum to recreate the object.
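A doctest helper along those lines could look roughly like this (the function name and its scope are my own invention; it only follows plain containers and instance __dict__s, so it is a heuristic, not a proof of absence):

```python
def reduce_loop_in_args(obj):
    """Heuristic check: is obj reachable from its own __reduce__
    construction args?  That is the situation in which unpickling
    has to run code on incomplete objects."""
    try:
        reduction = obj.__reduce__()
    except TypeError:
        return False                # not picklable at all
    if not (isinstance(reduction, tuple) and len(reduction) >= 2):
        return False
    seen = set()
    stack = [reduction[1]]          # construction args only, not the state
    while stack:
        x = stack.pop()
        if id(x) in seen:
            continue
        seen.add(id(x))
        if x is obj:
            return True
        if isinstance(x, (tuple, list, set, frozenset)):
            stack.extend(x)
        elif isinstance(x, dict):
            stack.extend(x.keys())
            stack.extend(x.values())
        else:
            d = getattr(x, '__dict__', None)
            if d:
                stack.append(d)
    return False
```

A doctest could then assert `reduce_loop_in_args(obj)` is False for a sample of freshly created Sage objects.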
 

Nils Bruin

Sep 8, 2013, 1:35:32 PM
to sage-...@googlegroups.com
On Sunday, September 8, 2013 7:15:50 AM UTC-7, Volker Braun wrote:
 
I agree that you can't use the "object id as unique identifier" pattern. AFAIK the graph theory code in Sage does not work like that, but uses integers to enumerate nodes. If you really want unique GraphNodes, then the "proper Sage" way of writing it is to use UniqueRepresentation and, therefore, give each node some constructor argument that uniquely identifies it.

Oh, I guess it's an example of that as well. The example was meant to illustrate a situation where the circularity holds essential information, and it shows one thing about circularity: it must be created by *modifying* an existing object. If you only use __reduce__ in the most basic way then you cannot express that: the initial call is just the creation. Via __setstate__, however, there is an opportunity to first create a functional object and then modify its state to the desired point. For instance, in the GraphNode example (I've added ids now so that we can tell the nodes apart):

##############################
class GraphNode(object):
    def __init__(self, i, neighbours=None):
        self.i = i
        if neighbours is None:
            self.neighbours = []
        else:
            self.neighbours = neighbours

    def __repr__(self):
        return "Node %s with neighbours %s" % (self.i, [c.i for c in self.neighbours])

    def connected_component(self):
        # collect every node reachable from self, sorted by id
        nodelist = []
        todo = [self]
        visited = set()
        while todo:
            n = todo.pop()
            if n.i not in visited:
                nodelist.append(n)
                todo.extend(n.neighbours)
                visited.add(n.i)
        nodelist.sort(key=lambda n: n.i)
        return nodelist
###########################################

if we try to pickle with the naive reduce method:

    def __reduce__(self):
        return (GraphNode, (self.i, self.neighbours))

and we do

C = [GraphNode(i) for i in range(5)]

for c in C:
    c.neighbours.extend([b for b in C if b is not c])
L = loads(dumps(C[1]))

 we see that L.connected_component() does not reproduce what we'd want to see.

However, if we first create the nodes in a minimal state and then use __setstate__ to adjust them (we can now do that!), we do get good results:

    def __reduce__(self):
        return (GraphNode, (self.i,), {'neighbours': self.neighbours})

    def __setstate__(self, state):
        if 'neighbours' in state:
            self.neighbours.extend(state['neighbours'])

that's because now we're allowing the pickler to produce code that essentially reflects how we constructed the structure in the first place!
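To make that concrete, here is the whole construction as one self-contained script (a trimmed-down version of the class above, using plain pickle.loads/pickle.dumps in place of Sage's loads/dumps): the cycle survives the round trip with identities intact.

```python
import pickle

class GraphNode(object):
    def __init__(self, i, neighbours=None):
        self.i = i
        self.neighbours = [] if neighbours is None else neighbours

    def __reduce__(self):
        # creation phase: just the label, no cyclic data
        return (GraphNode, (self.i,), {'neighbours': self.neighbours})

    def __setstate__(self, state):
        # modification phase: the cycle is re-installed here, after
        # every node already exists in the unpickler's memo
        if 'neighbours' in state:
            self.neighbours.extend(state['neighbours'])

# complete graph on 5 labelled nodes
C = [GraphNode(i) for i in range(5)]
for c in C:
    c.neighbours.extend(b for b in C if b is not c)

L = pickle.loads(pickle.dumps(C[1]))
```

After the round trip, the node labelled 1 that node 0 points back to is L itself, not a duplicate.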


> First of all, I think we should never pickle cached data. It's just needless complexity, and non-doctestable surprises where pickle-ability now depends on which computations you did before pickling. Plus, switching it off is probably the only thing that we can change quickly.

That may be a bit harsh. Data may well get cached because it's extremely precious! For instance, if you have a number field, I'd think the main reason for caching would be to store the computed class group (together with the relation matrix that allows efficient computation with it) and, necessarily, the integral basis, which may well have needed some factorization hints to complete.

In a non-circular world, you could do that by pickling the class group rather than the number field, so if we find such strict pickling rules are necessary we can still find some (unintuitive) solutions.

However, shouldn't we recognize that caches are essentially *modifications of state* of an object after creation, and hence should be set during __setstate__ rather than during the creation call? It may well be that that solves most of our reduce loop problems (at least the ones related to caches).

That said, I suspect there's a lot of cache data that *can* safely be discarded upon pickling, because it's there to optimize things that are irrelevant for pickles.

> Second, __reduce__ loops should be forbidden. Technically, they already are since Python does not support them. But unfortunately you just get corrupt pickles instead of a clear error at pickling time. Ideally, we'd have a doctest that ensures that there is no reduce loop in any Sage object. As a corollary, any __reduce__ method should always just pickle the absolute minimum to recreate the object.
 
I think that can be relaxed a bit: data going into the *creation* phase should be "minimal" and avoid circularity, because that's where the loops really bite. I think the setstate phase can deal with circular stuff.

It probably means that we always have to handcraft our reduce methods, which is inconvenient to say the least...

Simon King

Sep 8, 2013, 7:21:01 PM
to sage-...@googlegroups.com
Hi Nils,

On 2013-09-08, Nils Bruin <nbr...@sfu.ca> wrote:
> However, shouldn't we recognize that caches are essentially *modifications
> of state* of an object after creation and hence should be set during
> setstate rather than during creation call? It may well be that that solves
> most of our reduce loop problems (at least the ones related to caches).

The point, at least in the original example, is: We have an object O and
an attribute A of O. For creation of A, we need hash(O). Hence, unpickling
of O will only work if A is not dealt with too early.

However, if it is badly done, then dealing with A in the "setstate"
phase of O might still not work easily: Perhaps hash(O) depends on
further data that are inserted during the "setstate" phase?

Actually this is exactly the problem in the original example (with different
names):
In the "creation" phase, a new instance of O.__class__ is created by means
of __new__, but without calling __init__. Then O.__dict__ is supposed to be
filled in the "setstate" phase. Python decides to first fill it with
A, whose creation relies on hash(O); but in order to compute hash(O) one
would first need to put another attribute "i" into O.__dict__, which Python
has not dealt with yet.

> I think that can be relaxed a bit: data going into the *creation* phase
> should be "minimal" and avoid circularity, because that's where the loops
> really bite. I think the setstate phase can deal with circular stuff.

As I have shown, problems can occur in both phases.

> It probably means that we always have to handcraft our reduce methods,
> which is inconvenient to say the least...

That's the point I wanted to make in my posts on the ticket (and here,
too): when we create a Python class, we *should* implement pickling
ourselves, either by providing a __reduce__ method or __getinitargs__
and the like. We can't rely on Python doing the right thing for us, since
Python can't deal with constructions like the following:
O.A = {O: 1}

And my modification of the original example has shown that the problem
is *not* in UniqueRepresentation or cached_method. You get the same
with any odd dictionary that appears as an attribute of "self" and uses
"self" as a key. Python needs help to accomplish pickling in this case.
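To see this in isolation, here is a self-contained sketch (the class names are invented): default pickling of an object that uses itself as a dictionary key fails, because the key is hashed while the object's attributes are still unrestored, whereas a handcrafted __reduce__ that restores the hash-relevant data in the creation phase works.

```python
import pickle

class Thing(object):
    """Hashes by its label and keeps a dict keyed by itself."""
    def __init__(self, i):
        self.i = i
        self.A = {self: 1}

    def __hash__(self):
        return hash(self.i)

    def __eq__(self, other):
        return isinstance(other, Thing) and self.i == other.i

# Default pickling puts self.A into the state dict; rebuilding the inner
# {self: 1} calls hash() on a shell Thing that has no .i attribute yet.
try:
    pickle.loads(pickle.dumps(Thing(7)))
    default_pickle_works = True
except AttributeError:
    default_pickle_works = False

class FixedThing(Thing):
    def __reduce__(self):
        # creation phase restores .i, so hash() works during setstate
        return (FixedThing, (self.i,), {'A': dict(self.A)})

    def __setstate__(self, state):
        self.A.update(state['A'])

F2 = pickle.loads(pickle.dumps(FixedThing(7)))
```

The unpickler's memo maps the dictionary key back to the new object itself, so F2.A comes back keyed by F2, as intended.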

Best regards,
Simon


Nils Bruin

Sep 8, 2013, 10:03:35 PM
to sage-...@googlegroups.com


On Wednesday, September 4, 2013 4:29:30 PM UTC-7, Volker Braun wrote:
Having spent a day looking closer at pickling, I found some dangerous issues. In particular, Python currently has a bug where you can create pickles that unpickle incorrectly without raising any exceptions (http://trac.sagemath.org/15156).

OK, the problem was two Sage classes, CachedFunction and Map, which tried to pickle caches and dictionaries (containing caches!) as part of the construction step. Since those can easily contain circular references, that's not a good idea. The solution is to move their initialization to the setstate phase.
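The shape of that change, sketched on an invented stand-in class (not the actual CachedFunction/Map code):

```python
import pickle

class CachingWrapper(object):
    """Stand-in for a cached-function-like object whose cache may
    contain circular references back to the object itself."""
    def __init__(self, name):
        self.name = name
        self.cache = {}

    def __reduce__(self):
        # construction step: minimal, acyclic data only;
        # setstate step: the cache, applied after the object exists
        return (CachingWrapper, (self.name,), {'cache': self.cache})

    def __setstate__(self, state):
        self.cache.update(state['cache'])

w = CachingWrapper("frobenius")
w.cache['self'] = w              # a circular cache entry
w2 = pickle.loads(pickle.dumps(w))
```

Because the object is already in the unpickler's memo when the state dict is rebuilt, the circular entry resolves to the new object itself.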

Of course, this only resolves the circularity part of the problems. Objects that end up unpickling badly because *too much* of their state is deferred to setstate need their `reduce` methods adjusted in the opposite direction.

Note that while Python has hooks such as __getinitargs__ and __getnewargs__ to customize reduction behaviour, most Sage code seems to prefer rolling its own via custom __reduce__ methods. This gives better control but makes it easier to err in the direction of doing too much at construction time.

Volker Braun

Sep 9, 2013, 6:11:53 AM
to sage-...@googlegroups.com
On Monday, September 9, 2013 3:03:35 AM UTC+1, Nils Bruin wrote:
Note that while Python has hooks such as __getinitargs__ and __getnewargs__ to customize reduction behaviour, most sage code seems to prefer to roll its own methods via custom __reduce__ methods.

For Cython classes you pretty much have to use __reduce__, so that's why it is so ubiquitous in Sage.
 