Getting rid of the pickle jar

Jeroen Demeyer

Oct 27, 2017, 4:42:53 AM
to sage-devel
This is a topic which comes up now and then but which hasn't been
resolved so far.

Sage has a "pickle jar" stored in src/ext/pickle_jar/pickle_jar.tar.bz2
and there are tests in src/sage/structure/sage_object.pyx which check
that every object in the pickle jar can be unpickled without raising an
exception.
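
A minimal sketch of what such an "everything unpickles" check amounts to, in plain Python rather than Sage's actual doctest. The archive layout (one pickle per tar member) and the helper names `unpickle_all` and `build_demo_jar` are assumptions for illustration; real Sage pickles are loaded with Sage's own `loads`, not plain `pickle.loads`.

```python
import io
import pickle
import tarfile

def unpickle_all(fileobj):
    """Try to unpickle every regular file in a .tar.bz2 archive.

    Returns a list of (member name, exception) pairs for the failures.
    Success only means "no exception was raised"; nothing here checks
    that the resulting object behaves correctly.
    """
    failures = []
    with tarfile.open(fileobj=fileobj, mode="r:bz2") as jar:
        for member in jar.getmembers():
            if not member.isfile():
                continue
            data = jar.extractfile(member).read()
            try:
                pickle.loads(data)
            except Exception as exc:
                failures.append((member.name, exc))
    return failures

def build_demo_jar(entries):
    """Build an in-memory .tar.bz2 from a {name: bytes} mapping (demo only)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:bz2") as jar:
        for name, data in entries.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            jar.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return buf
```

Calling `unpickle_all(build_demo_jar({...}))` returns an empty list when every entry loads, which is all that this kind of test can tell you.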

The theory is that this ensures that people having old pickles on their
machines can still use them with newer versions of Sage. However, the
pickle jar does not achieve this goal because

(1) The doctest only tests that the pickles can be unpickled. There are
no tests that the unpickled objects still function correctly. See
https://trac.sagemath.org/ticket/16311

(2) Nobody adds new pickles to the pickle jar. The last time that a
pickle was added was in 2011. So, if anything, it only tests that
pickles which are at least 6 years old still work correctly.

Since the pickle jar doesn't do what it is meant to do, I suggest to be
pragmatic and remove it completely.

Having the pickle jar is a burden for development because it prevents
certain refactorings or removals. For example finite_field_ext_pari.py
was deprecated in 2014 but we cannot remove it because it breaks the
pickle jar. I know that I could probably fix this with
register_unpickle_override() but that feels like a waste of time because I
wonder if anybody really cares. In the meantime, people need to
maintain that unused deprecated file, for example to make it Python 3
compatible.
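
The general idea behind an unpickle override can be sketched with the standard library alone. This is not Sage's implementation of register_unpickle_override(); the names `register_override`, `OverridingUnpickler`, `NewPoint` and the module name `old_geometry` are all hypothetical. The sketch redirects a class reference stored in an old pickle to a replacement class after the original module has been removed.

```python
import io
import pickle
import sys
import types

# Hypothetical table: (old module, old class name) -> replacement class.
UNPICKLE_OVERRIDES = {}

def register_override(module, name, replacement):
    UNPICKLE_OVERRIDES[(module, name)] = replacement

class OverridingUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Consult the override table before the normal import-based lookup.
        try:
            return UNPICKLE_OVERRIDES[(module, name)]
        except KeyError:
            return super().find_class(module, name)

def loads_with_overrides(data):
    return OverridingUnpickler(io.BytesIO(data)).load()

class NewPoint:
    """Replacement for a class that lived in a now-removed module."""
    def __init__(self, x, y):
        self.x, self.y = x, y

def make_legacy_pickle():
    """Simulate an old pickle: define a class in a throwaway module,
    pickle an instance, then remove the module again."""
    old_mod = types.ModuleType("old_geometry")

    class OldPoint:
        def __init__(self, x, y):
            self.x, self.y = x, y

    OldPoint.__module__ = "old_geometry"
    OldPoint.__qualname__ = "OldPoint"
    old_mod.OldPoint = OldPoint
    sys.modules["old_geometry"] = old_mod
    try:
        return pickle.dumps(OldPoint(1, 2))
    finally:
        del sys.modules["old_geometry"]
```

After `register_override("old_geometry", "OldPoint", NewPoint)`, `loads_with_overrides(make_legacy_pickle())` yields a `NewPoint`, whereas plain `pickle.loads` fails because the module no longer exists. Every removed class needs such an entry, which is the maintenance cost at issue here.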

I'm sure that some people have suggestions for improving the pickle jar
procedure. But still, the fact remains that many pickles in the current
pickle jar are broken. So there is no point in keeping them.


Thoughts?
Jeroen.

Jean-Pierre Flori

Oct 27, 2017, 6:26:05 AM
to sage-devel
Remove it.

Erik Bray

Oct 27, 2017, 8:19:31 AM
to sage-devel
+1

Plus, while pickling has many valid runtime use-cases, particularly
for IPC, and short-term preservation of objects between interpreter
sessions, it was *never* intended for long-term data storage, in part
precisely because it's directly tied to the source code that was used
to produce the pickle file. The only truly "correct" way to restore
old pickled objects is to do so with the same version of the software
the pickle was created with.
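
A minimal sketch of why a pickle is tied to the code that produced it: the payload records a class only by its import path, never its implementation. `referenced_globals` is a hypothetical helper; protocol 2 is chosen only because it emits a plain `GLOBAL` opcode for each class reference, which is easy to list with `pickletools`.

```python
import pickle
import pickletools
from fractions import Fraction

def referenced_globals(data):
    """List the "module name" pairs that a protocol <= 3 pickle imports on load."""
    return [arg for opcode, arg, _pos in pickletools.genops(data)
            if opcode.name == "GLOBAL"]

payload = pickle.dumps(Fraction(3, 4), protocol=2)
print(referenced_globals(payload))
```

If `fractions.Fraction` were moved or renamed, this payload would no longer load even though its bytes are intact.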

If better serialization formats for objects in Sage do not yet exist,
and/or there is a need for them, that's something worth talking seriously
about (I feel like the MitM work being done is to this effect though).
But pickle ain't it!

Travis Scrimshaw

Oct 27, 2017, 8:47:26 PM
to sage-devel
I am also +1 if there are no issues with the notebooks or CoCalc. Although I do not think this affects them because, in order to use the objects, I have to recreate them instead of having them be (functionally) persistent across sessions.

> Plus, while pickling has many valid runtime use-cases, particularly
> for IPC, and short-term preservation of objects between interpreter
> sessions, it was *never* intended for long-term data storage, in part
> precisely because it's directly tied to the source code that was used
> to produce the pickle file.  The only truly "correct" way to restore
> old pickled objects is to do so with the same version of the software
> the pickle was created with.

The TestSuite does a standard pickle/unpickle test for this and IMO every object should have at least one TestSuite(foo).run() test (typically in its __init__ method). We have also made it easier with git tags to revert to previous versions of Sage, so if you really need your object and the pickling has broken in the meantime, you can translate it into key data in a file that you can then read back and reconstruct in the later version. While this means we are less backwards compatible, I think this will not affect too many (any?) users, or at least not make them quit using Sage.
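
The "key data" route can be sketched as follows. This is only an illustration in plain Python, with `Fraction` standing in for a Sage rational; the JSON layout and the function names `export_key_data` and `rebuild` are assumptions, not an existing Sage facility.

```python
import json
from fractions import Fraction

def export_key_data(matrix):
    """Reduce a matrix of Fractions to plain JSON (entries become strings like '3/4')."""
    return json.dumps({
        "type": "rational_matrix",
        "rows": [[str(entry) for entry in row] for row in matrix],
    })

def rebuild(text):
    """Reconstruct the matrix from the exported key data."""
    doc = json.loads(text)
    if doc["type"] != "rational_matrix":
        raise ValueError("unexpected type: %r" % doc["type"])
    return [[Fraction(entry) for entry in row] for row in doc["rows"]]
```

The export would run in the old version and the rebuild in the new one; the file in between depends on no class names or module paths.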
 
Best,
Travis
 

William Stein

Oct 27, 2017, 11:56:00 PM
to sage-...@googlegroups.com
There is definitely no issue with CoCalc.


>> Plus, while pickling has many valid runtime use-cases, particularly
>> for IPC, and short-term preservation of objects between interpreter
>> sessions, it was *never* intended for long-term data storage, in part
>> precisely because it's directly tied to the source code that was used
>> to produce the pickle file.  The only truly "correct" way to restore
>> old pickled objects is to do so with the same version of the software
>> the pickle was created with.
>
> The TestSuite does a standard pickle/unpickle test for this and IMO every object should have at least one TestSuite(foo).run() test (typically in its __init__ method). We have also made it easier with git tags to revert to previous versions of Sage and then if you really need your object, you can translate it into key data in a file that you can then read back and reconstruct in the later version if the pickling has broken in the meantime. While this means we are less backwards compatible, I think this will not affect too many (any?) users, or at least not make them quit using Sage.

If we were to get rid of the pickle jar, then we should also add documentation (e.g., for the save/load functions, which save/load objects), saying -- basically -- do not trust this at all, except for ephemeral movement of objects, unless you are sure to keep the specific version of sage you used around.   We should say that we gave up and decided not to put even the slightest effort into making save/load work over time.   

William


 
 

--
You received this message because you are subscribed to the Google Groups "sage-devel" group.
To unsubscribe from this group and stop receiving emails from it, send an email to sage-devel+...@googlegroups.com.
To post to this group, send email to sage-...@googlegroups.com.
Visit this group at https://groups.google.com/group/sage-devel.
For more options, visit https://groups.google.com/d/optout.
--
-- William Stein

Eric Gourgoulhon

Oct 28, 2017, 4:17:59 AM
to sage-devel
Hi,


On Friday, 27 October 2017 at 10:42:53 UTC+2, Jeroen Demeyer wrote:
> Since the pickle jar doesn't do what it is meant to do, I suggest to be
> pragmatic and remove it completely.


+1

Eric.

Andrew

Oct 28, 2017, 6:17:17 AM
to sage-devel
Definitely +1.

Simon King

Oct 29, 2017, 5:05:22 AM
to sage-...@googlegroups.com
Hi Erik,

On 2017-10-27, Erik Bray <erik....@gmail.com> wrote:
> Plus, while pickling has many valid runtime use-cases, particularly
> for IPC, and short-term preservation of objects between interpreter
> sessions, it was *never* intended for long-term data storage,

Seriously? Said who?

I always thought of pickles as the default way to store the results
of long computations, for later (potentially MUCH later) use. Of course,
it always is possible to say "install an old software version to unpickle
the results" or "write a routine that allows to read the old pickle in
a new software version", but as part of user-friendliness, I would
recommend that SageMath-developers keep considering it as their duty to
make unpickling backward-compatible, to a reasonable extent.

Best regards,
Simon

David Roe

Oct 29, 2017, 5:11:16 AM
to sage-devel
I agree that removing pickles from 6+ years ago is a good idea.

I do think, however, that the idea of being able to save objects between versions of Sage is valuable.  And we need some way to test it.  Maybe we could move to some sort of rolling pickle jar, where we allow deprecations after a certain amount of time?
David


Dima Pasechnik

Oct 29, 2017, 10:58:36 AM
to sage-devel
The conventional wisdom is to avoid pickles in favour of JSON (the latter is a platform-independent human-parseable text, the former is some Python-only binary stuff).


William Stein

Oct 29, 2017, 12:48:00 PM
to sage-...@googlegroups.com
On Sun, Oct 29, 2017 at 7:58 AM Dima Pasechnik <dim...@gmail.com> wrote:
> The conventional wisdom is to avoid pickles in favour of JSON (the latter is a platform-independent human-parseable text, the former is some Python-only binary stuff).

Whether this is a good idea depends a lot on the domain. E.g., try storing a 10000x10000 matrix over GF(2) in JSON versus as a pickle in Sage (using the save command). To use JSON, you'll have to start by writing your own custom code to export (and import) to JSON, and whatever format you choose, it is going to be 100s of times less compact and slower than the fast custom code we wrote for pickling.

A huge amount of work has been put into making various things (e.g., matrices) pickle and unpickle very efficiently (and stably over versions).  Basically 0 work has gone into converting sage objects to/from JSON.  
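
A rough illustration of the size gap, in plain Python rather than Sage's Cython code. The bit-packing scheme and the names `pack_bits` and `unpack_bits` are assumptions made for this sketch; it compares a naive JSON list-of-lists with one bit per entry and does not measure Sage's actual pickle format.

```python
import json
import random

def pack_bits(rows):
    """Pack a 0/1 matrix row by row, 8 entries per byte (rows padded to whole bytes)."""
    out = bytearray()
    for row in rows:
        for i in range(0, len(row), 8):
            chunk = row[i:i + 8]
            byte = 0
            for bit in chunk:
                byte = (byte << 1) | bit
            byte <<= 8 - len(chunk)   # left-align a short final chunk
            out.append(byte)
    return bytes(out)

def unpack_bits(data, nrows, ncols):
    """Inverse of pack_bits for a matrix of the given shape."""
    stride = (ncols + 7) // 8         # bytes per row
    rows = []
    for r in range(nrows):
        chunk = data[r * stride:(r + 1) * stride]
        bits = [(byte >> (7 - k)) & 1 for byte in chunk for k in range(8)]
        rows.append(bits[:ncols])
    return rows

rng = random.Random(0)
n = 100
rows = [[rng.randrange(2) for _ in range(n)] for _ in range(n)]
packed = pack_bits(rows)
assert unpack_bits(packed, n, n) == rows
print(len(json.dumps(rows)), "bytes as JSON vs", len(packed), "bytes packed")
```

The JSON side could be made denser with a custom encoding, but then it stops being human-parseable, which was the argument for JSON in the first place.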

 -- William




--
-- William Stein

Simon King

Oct 29, 2017, 3:22:39 PM
to sage-...@googlegroups.com
On 2017-10-29, David Roe <roed...@gmail.com> wrote:
> I agree that removing pickles from 6+ years ago is a good idea.
>
> I do think, however, that the idea of being able to save objects between
> versions of Sage is valuable. And we need some way to test it. Maybe we
> could move to some sort of rolling pickle jar, where we allow deprecations
> after a certain amount of time?

Does the following make sense?

Some patchbot could be used to build (one after the other)
old-but-not-too-old versions of SageMath. In SageMath version
x-y-z, the original pickle jar is opened and then saved in
version x-y-z. So, we would create *several* (say, n) versions
of the pickle jar, and in the best case each Sage version
would be available on m machines with different architecture.

The above has to be done *once*, resulting in m*n pickle jar
versions.

As part of the release process of a new version of SageMath,
a new version of the pickle jar is created by some patchbots
on m different machines and replaces the m oldest versions of
the pickle jar.

Thus, one has a rolling pickle jar in m*n versions.

The m*n-fold pickle jar should not pollute the SageMath sources.
In that way, a new pickle jar version wouldn't result in a new
to-be-merged git commit. Instead, the jar should only be stored
on some SageMath servers.

It should be tested by our test bots (i.e. the machines
connected to trac tickets) whether all m*n versions of the pickle
jar unpickle, *and* whether all m*n versions of the same unpickled
object actually evaluate equal. It would just (to some extent)
test both machine independence and backwards compatibility.

However, that test would *not* remain part of the test suite that
is executed by a user doing "sage -t". It is a test for test
bots only.
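
The bookkeeping of such a rolling jar could look roughly like this. It is a sketch under stated assumptions: a "jar" is modelled as a plain {object name: pickle bytes} mapping, and the class name `RollingPickleJar` and its methods are hypothetical.

```python
import pickle
from collections import OrderedDict

class RollingPickleJar:
    """Keep the jars of the n most recent releases, one jar per machine."""

    def __init__(self, n_versions):
        self.n_versions = n_versions
        self.releases = OrderedDict()   # version -> {machine: jar}

    def add_release(self, version, jars_by_machine):
        """Register the m jars of a new release and drop the oldest release."""
        self.releases[version] = jars_by_machine
        while len(self.releases) > self.n_versions:
            self.releases.popitem(last=False)

    def check(self, name):
        """Unpickle every stored copy of `name`; True if all copies compare equal."""
        copies = [pickle.loads(jar[name])
                  for jars in self.releases.values()
                  for jar in jars.values()
                  if name in jar]
        return all(copy == copies[0] for copy in copies)
```

`check` covers points "unpickles" and "evaluates equal" across all m*n copies; it says nothing about whether the objects still compute correct results.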

Best regards,
Simon

Jeroen Demeyer

Oct 29, 2017, 3:46:49 PM
to sage-...@googlegroups.com
On 2017-10-29 20:22, Simon King wrote:
> As part of the release process of a new version of SageMath,
> a new version of the pickle jar is created by some patchbots
> on m different machines and replaces the m oldest versions of
> the pickle jar.

That's the easy part. The hard part is deciding what to put in the
pickle jar and how to test that the pickles still work. If we don't have
good ideas for the latter, the pickle jar is pointless.

William Stein

Oct 29, 2017, 4:07:17 PM
to sage-...@googlegroups.com
When I made the pickle jar in the first place, I automated putting the output of all doctests in the pickle jar.



--
-- William Stein

Travis Scrimshaw

Oct 29, 2017, 6:14:09 PM
to sage-devel

What about any object that does a TestSuite(foo).run()? This guarantees that it pickles (assuming it is not skipping that test) and is an object that someone would create. Usually when those are marked as "# long time", it is the tests that take a long time (say, iterated over 1000 objects) rather than creation and (un)pickling.

Best,
Travis

David Roe

Oct 29, 2017, 6:48:43 PM
to sage-devel
I think tying it in with TestSuite(foo).run() is a good idea, but we probably don't want to store pickles for every one (since test suites can be run for multiple elements of the same type).  Maybe check to see if a pickle has already been created for a given class, and if not, create one?
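
A minimal sketch of the "one pickle per class" rule, assuming the jar is a plain dictionary keyed by the class's import path; `add_if_new` is a hypothetical helper, not part of TestSuite.

```python
import pickle

def add_if_new(jar, obj):
    """Store a pickle of obj unless its class is already represented.

    `jar` maps "module.ClassName" to pickle bytes. Returns True if added.
    """
    cls = type(obj)
    key = "%s.%s" % (cls.__module__, cls.__qualname__)
    if key in jar:
        return False
    jar[key] = pickle.dumps(obj)
    return True
```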
David

Volker Braun

Oct 29, 2017, 7:27:21 PM
to sage-devel
That's still only addressing that objects can be unpickled; You'd also have to run the entire testsuite with the unpickled objects if you want to have any reasonable guarantee that they are actually working. Put differently, how likely do you think it is that some old pickle unpickles and passes superficial tests, but gives a mathematically incorrect answer if you call some specialized method?

Simon King

Oct 30, 2017, 3:42:00 AM
to sage-...@googlegroups.com
On 2017-10-29, Volker Braun <vbrau...@gmail.com> wrote:
> That's still only addressing that objects can be unpickled

No, it is also addressing that "the same" objects unpickled from
different SageMath versions and different machines evaluate equal.

> ; You'd also have
> to run the entire testsuite with the unpickled objects if you want to have
> any reasonable guarantee that they are actually working.

Sure, why not? After all, it was suggested in another posting that
objects should be automatically put into the pickle jar if they are
subject to a TestSuite.run() test.

> Put differently,
> how likely do you think it is that some old pickle unpickles and passes
> superficial tests, but gives a mathematically incorrect answer if you call
> some specialized method?

Put differently, would you rather have no test at all than a
superficial consistency test on a wide range of objects, versions and
machines?


Erik Bray

Oct 30, 2017, 10:13:03 AM
to sage-devel
I'll see if I can find an "authoritative" source making this claim,
but I believe it is conventional wisdom (along the lines of "goto
considered evil"--it's not as if there aren't valid uses for it but it
should be avoided unless you know what you're doing).

My point, however, is baked directly into the file format--the pickle
format is very Python version-dependent (there are I think 5 different
pickle formats now) and the way non-trivial objects are pickled is
highly tied to the version of the source code at which that pickled
object was saved. It depends on the module layout, class names,
implementation of that object's __reduce__ at the time the pickle was
made, etc. So while you can use pickle for long-term storage, it's
implicit in the format that if you want to unpickle an object saved at
a given time, you may need to be running the same version of the
software at which the pickle was created (it might be nice if there
were somehow a way to mark this explicitly when pickling Sage objects,
e.g. with the current git hash).
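
Marking a pickle with the version that wrote it could be sketched like this. The envelope layout and the names `dumps_stamped` and `loads_stamped` are assumptions for illustration; the version string could just as well be a git hash.

```python
import pickle
import warnings

def dumps_stamped(obj, version):
    """Pickle obj together with the software version that produced it."""
    return pickle.dumps({"version": version, "payload": pickle.dumps(obj)})

def loads_stamped(data, current_version):
    """Unpickle a stamped payload, warning when the versions differ."""
    envelope = pickle.loads(data)
    if envelope["version"] != current_version:
        warnings.warn(
            "pickle was written by version %s, loading under %s"
            % (envelope["version"], current_version))
    return pickle.loads(envelope["payload"])
```

The envelope holds only built-in types, so the stamp stays readable even when the inner payload can no longer be loaded, and the user at least learns which version to install.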

Anyway, as I wrote, for short-term storage (e.g. results of long
computations that will be reused later in the same software) it's
fine. For long-term storage you want a more transparent, more interoperable
serialization format. JSON isn't sufficient for all use cases, but
there are other alternatives as well, even for binary data.

Jeroen Demeyer

Oct 30, 2017, 10:13:49 AM
to sage-...@googlegroups.com
On 2017-10-30 08:41, Simon King wrote:
> would you rather have no test at all than a
> superficial consistency test on a wide range of objects, versions and
> machines?

Yes. A test which doesn't actually test anything is worse than no test.
That's exactly the situation that we currently have with the pickle jar.

Jeroen Demeyer

Oct 30, 2017, 10:16:04 AM
to sage-...@googlegroups.com
On 2017-10-29 20:22, Simon King wrote:
> The m*n-fold pickle jar should not pollute the SageMath sources.
> In that way, a new pickle jar version wouldn't result in a new
> to-be-merged git commit. Instead, the jar should only be stored
> on some SageMath servers.

I don't like this part. It should be easily available to users for
testing. Maybe make it an optional package and mark the test "# optional
- pickle_jar".

Jeroen Demeyer

Oct 30, 2017, 10:21:30 AM
to sage-...@googlegroups.com
On 2017-10-29 23:14, Travis Scrimshaw wrote:
> What about any object that does a TestSuite(foo).run()? This guarantees
> that it pickles (assuming it is not skipping that test) and is an object
> that someone would create.

Sounds like a good idea to me. We could have TestSuite(foo).run() create
a special pickle which also stores how TestSuite.run() was called. Then
we could re-run the TestSuite when testing the pickle jar.
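
Roughly, such a "special pickle" could bundle the object with the options of the test run so that the run can be replayed later. Everything below is a hypothetical stand-in: the `CHECKS` registry plays the role of the `_test_*` methods, and `record` and `replay` are illustrative names, not TestSuite API.

```python
import pickle

# Hypothetical registry of named checks standing in for _test_* methods.
CHECKS = {
    "roundtrip": lambda obj: pickle.loads(pickle.dumps(obj)) == obj,
    "hashable": lambda obj: isinstance(hash(obj), int),
}

def record(obj, skip=()):
    """Pickle obj together with the names of the checks that were skipped."""
    return pickle.dumps({"skip": sorted(skip), "payload": pickle.dumps(obj)})

def replay(entry):
    """Unpickle a recorded entry and re-run the non-skipped checks on it.

    Returns the names of the checks that failed.
    """
    envelope = pickle.loads(entry)
    obj = pickle.loads(envelope["payload"])
    return [name for name, check in sorted(CHECKS.items())
            if name not in envelope["skip"] and not check(obj)]
```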

Jeroen Demeyer

Oct 30, 2017, 10:35:24 AM
to sage-...@googlegroups.com
On 2017-10-30 15:12, Erik Bray wrote:
> My point, however, is baked directly into the file format--the pickle
> format is very Python version-dependent (there are I think 5 different
> pickle formats now)

It's true that the format has changed, but always in a
backward-compatible way. For basic Python objects (lists, numbers,
strings, ...) I'm pretty sure that old pickles can still be unpickled
correctly. The issue is with user-defined classes.
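
A quick check of that claim for basic objects: the current interpreter reads every protocol it can write. The sample data is arbitrary, and this only exercises one Python version, so it illustrates the backward compatibility of the format, not pickles moved between Python releases.

```python
import pickle

sample = {"ints": [0, -1, 2**70], "text": "pickle", "nested": (1.5, None, True)}

for protocol in range(pickle.HIGHEST_PROTOCOL + 1):
    data = pickle.dumps(sample, protocol=protocol)
    # pickle.loads detects the protocol from the data itself.
    assert pickle.loads(data) == sample
```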

Erik Bray

Oct 30, 2017, 11:03:23 AM
to sage-devel
Right, I could have been more clear about that. But for Sage it's
exactly those user-defined classes that matter most. It would be good
to have some serious thought/discussion about portable serialization
options for Sage objects.

Jeroen Demeyer

Oct 30, 2017, 11:17:49 AM
to sage-...@googlegroups.com
Another very relevant question: are pickles supposed to be
hardware/OS-independent? In other words: can I take a pickle from one
machine and unpickle it on a different machine (assuming that the
software version is the same)?

William Stein

Oct 30, 2017, 11:29:10 AM
to sage-...@googlegroups.com
Not necessarily. Pickle is *the* canonical extensible object serialization system for Python. It’s of course very extensible in that users can define how objects pickle, e.g. by defining a dunder reduce method. As such they can of course make that method store output in any binary format they want.

Example: look at some of the Cython matrix code I mentioned above. I hope it is architecture neutral as written, but I’m sure you could easily imagine how to write something similar that isn’t.

- William



--
-- William Stein

Jeroen Demeyer

Oct 30, 2017, 11:33:48 AM
to sage-...@googlegroups.com
On 2017-10-30 16:28, William Stein wrote:
> Not necessarily. Pickle is *the* canonical extensible object
> serialization system for Python. It’s of course very extensible in
> that users can define how objects pickle, eg by defining a dunder reduce
> method. As such they can of course make that method store output in any
> binary format they want.

The question is not what is possible, but what is the recommended way of
doing things. Are we supposed to make our pickles hardware-independent
or is that hopeless anyway?

Erik Bray

Oct 30, 2017, 12:53:46 PM
to sage-devel
I guess it depends on what you mean by "supposed to". I believe that
the pickle formats for built-in types are hardware-independent. E.g.
int values are stored as little-endian and loaded as little-endian in
a hardware-independent manner. So that at least appears to be the
intent.
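
This can be checked for one binary protocol with a small experiment. The value and the choice of protocol 2 are arbitrary; the byte order seen in the output is fixed by the pickle format, not by the machine running the code.

```python
import pickle
import struct

value = 0x01020304
data = pickle.dumps(value, protocol=2)

# The four payload bytes appear least-significant byte first,
# whatever the byte order of the machine running this.
assert struct.pack("<i", value) in data
assert struct.pack(">i", value) not in data
```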


But I don't know if there's any *requirement* that a pickle be
hardware-independent when it comes to custom types, though in general
it's probably better that they are. But for custom code that, say,
targets only one hardware platform anyways then that's up to the
developer....

Simon King

Oct 30, 2017, 1:18:27 PM
to sage-...@googlegroups.com
Hi Jeroen,
Should be, and would be tested with the scheme that I proposed for
the future pickle jars.

Cheers,
Simon

Simon King

Oct 30, 2017, 1:31:39 PM
to sage-...@googlegroups.com
Hi Jeroen,

On 2017-10-30, Jeroen Demeyer <jdem...@cage.ugent.be> wrote:
The current pickle jar tests that the pickles unpickle. My proposition
is: Test that pickles of the same object created under different
circumstances (1) unpickle, (2) evaluate equal and (3) pass
TestSuite.run().

That would be progress compared with the current pickle jar, and would
indeed be a nontrivial test, since it should be easy to create a custom
pickle format that simply dumps a bit pattern and would behave
differently on a big endian or little endian machine. Or (which is
the case for MeatAxe matrices) you can store data in a memory chunk
of size m*sizeof(long), even when the actual data doesn't fill all bytes
of the last long in the memory chunk; whether or not it is used for
specific data may depend on sizeof(long). In order to make pickling
machine independent (which it *should* be, IMHO), you need to dump
only the bytes that are actually used.
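
The difference between a machine-dependent and a machine-independent dump can be sketched in plain Python. `BitBlock` and `_rebuild_bitblock` are toy names invented for this example and have nothing to do with the MeatAxe code; the point is only that `__reduce__` emits the used bytes in a fixed order.

```python
import struct

class BitBlock:
    """Toy container for a non-negative integer bit pattern."""

    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        return isinstance(other, BitBlock) and self.value == other.value

    def __reduce__(self):
        # Dump only the bytes actually used, in a fixed (little-endian)
        # order, instead of a native-width, native-order memory image.
        nbytes = max(1, (self.value.bit_length() + 7) // 8)
        return (_rebuild_bitblock, (self.value.to_bytes(nbytes, "little"),))

def _rebuild_bitblock(raw):
    return BitBlock(int.from_bytes(raw, "little"))

# For contrast: a native-layout dump depends on the machine.
native = struct.pack("@l", 1)      # sizeof(long) bytes, native byte order
portable = struct.pack("<q", 1)    # always 8 bytes, always little-endian
```

Here `len(native)` is 4 or 8 depending on the platform's `long`, while the bytes returned by `__reduce__` are the same everywhere.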

Regards,
Simon

Dima Pasechnik

Oct 31, 2017, 5:35:47 AM
to sage-devel
Hi Simon,
This is a huge commitment: supporting backward compatibility of more or less arbitrary
Sage objects.
I am really not sure whether it is a good idea to support this.
Perhaps for a very restricted set of "easy" Sage objects, yes; but throwing complicated objects into the jar,
ones that depend in a nontrivial way on the category framework, no.

Dima


Jeroen Demeyer

Dec 8, 2017, 9:07:46 AM
to sage-devel
See https://trac.sagemath.org/ticket/24337 for actually doing this.