[Python-Dev] PEP 574 (pickle 5) implementation and backport available


Antoine Pitrou

May 24, 2018, 1:59:07 PM
to pytho...@python.org

Hi,

While PEP 574 (pickle protocol 5 with out-of-band data) is still in
draft status, I've made available an implementation in branch "pickle5"
in my GitHub fork of CPython:
https://github.com/pitrou/cpython/tree/pickle5

Also I've published an experimental backport on PyPI, for Python 3.6
and 3.7. This should help people play with the new API and features
without having to compile Python:
https://pypi.org/project/pickle5/
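
For illustration, here is a minimal sketch of the new API (protocol 5,
PickleBuffer, and the buffer_callback/buffers arguments) as exposed by
the backport; treat it as a quick demo rather than a reference:

try:
    import pickle5 as pickle   # backport for 3.6 / 3.7
except ImportError:
    import pickle              # a stdlib pickle with protocol 5 support

big = bytearray(b"x" * (16 * 1024 * 1024))

buffers = []
payload = pickle.dumps(
    pickle.PickleBuffer(big),        # wrap the data to opt in to out-of-band transfer
    protocol=5,
    buffer_callback=buffers.append,  # receives buffer views instead of in-band copies
)

# `payload` stays tiny; the 16 MiB of data live in `buffers` and can be
# shipped separately (shared memory, scatter/gather I/O, ...).
restored = pickle.loads(payload, buffers=buffers)
assert bytes(restored) == bytes(big)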

Any feedback is welcome.

Regards

Antoine.



Victor Stinner

May 24, 2018, 4:17:07 PM
to Antoine Pitrou, python-dev
Link to the PEP:

"PEP 574 -- Pickle protocol 5 with out-of-band data"
https://www.python.org/dev/peps/pep-0574/

Victor

Olivier Grisel

May 25, 2018, 1:00:55 PM
to Antoine Pitrou, pytho...@python.org
I tried this implementation to add no-copy pickling for large numpy arrays, and it seems to work as expected (for a simple contiguous array). I took some notes on the numpy tracker to advertise this PEP to the numpy developers:
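
For reference, a roundtrip of the kind described above could look roughly
like this, assuming a Python/numpy combination whose ndarray reducers
implement protocol 5 (e.g. a recent numpy on a Python whose stdlib pickle
supports protocol 5):

import numpy as np
import pickle  # or the pickle5 backport on 3.6 / 3.7

arr = np.arange(10_000_000, dtype="float64")

buffers = []
payload = pickle.dumps(arr, protocol=5, buffer_callback=buffers.append)

# The array data travels out-of-band in `buffers`; only metadata is in `payload`.
restored = pickle.loads(payload, buffers=buffers)
assert np.array_equal(arr, restored)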


--
Olivier

Raymond Hettinger

May 25, 2018, 1:37:53 PM
to Antoine Pitrou, Python-Dev@Python. Org


> On May 24, 2018, at 10:57 AM, Antoine Pitrou <soli...@pitrou.net> wrote:
>
> While PEP 574 (pickle protocol 5 with out-of-band data) is still in
> draft status, I've made available an implementation in branch "pickle5"
> in my GitHub fork of CPython:
> https://github.com/pitrou/cpython/tree/pickle5
>
> Also I've published an experimental backport on PyPI, for Python 3.6
> and 3.7. This should help people play with the new API and features
> without having to compile Python:
> https://pypi.org/project/pickle5/
>
> Any feedback is welcome.

Thanks for doing this.

Hope it isn't too late, but I would like to suggest that protocol 5 support fast compression by default. We normally pickle objects so that they can be transported (saved to a file or sent over a socket). Transport costs (reading and writing a file or socket) are generally proportional to size, so compression is likely to be a net win (much as it was for header compression in HTTP/2).

The PEP lists compression as a possible refinement only for large objects, but I expect it will be a win for most pickles to compress them in their entirety.


Raymond

Antoine Pitrou

May 25, 2018, 2:07:51 PM
to pytho...@python.org
On Fri, 25 May 2018 10:36:08 -0700
Raymond Hettinger <raymond....@gmail.com> wrote:
> > On May 24, 2018, at 10:57 AM, Antoine Pitrou <soli...@pitrou.net> wrote:
> >
> > While PEP 574 (pickle protocol 5 with out-of-band data) is still in
> > draft status, I've made available an implementation in branch "pickle5"
> > in my GitHub fork of CPython:
> > https://github.com/pitrou/cpython/tree/pickle5
> >
> > Also I've published an experimental backport on PyPI, for Python 3.6
> > and 3.7. This should help people play with the new API and features
> > without having to compile Python:
> > https://pypi.org/project/pickle5/
> >
> > Any feedback is welcome.
>
> Thanks for doing this.
>
> Hope it isn't too late, but I would like to suggest that protocol 5 support fast compression by default. We normally pickle objects so that they can be transported (saved to a file or sent over a socket). Transport costs (reading and writing a file or socket) are generally proportional to size, so compression is likely to be a net win (much as it was for header compression in HTTP/2).
>
> The PEP lists compression as a possible refinement only for large objects, but I expect it will be a win for most pickles to compress them in their entirety.

It's not too late (the PEP is still a draft, and there's a lot of time
before 3.8), but I wonder what would be the benefit of making it a part
of the pickle specification, rather than compressing independently.

Whether and how to compress is generally a compromise between
transmission (or storage) speed and computation speed. Also, there are
specialized compressors for higher efficiency (for example, Blosc has
datatype-specific compression for Numpy arrays). Such knowledge can be
embodied in domain-specific libraries such as Dask/distributed, but it
cannot really be incorporated in pickle itself.

Do you have something specific in mind?

Regards

Antoine.

Ivan Pozdeev via Python-Dev

May 25, 2018, 2:30:58 PM
to pytho...@python.org
On 25.05.2018 20:36, Raymond Hettinger wrote:
>
>> On May 24, 2018, at 10:57 AM, Antoine Pitrou <soli...@pitrou.net> wrote:
>>
>> While PEP 574 (pickle protocol 5 with out-of-band data) is still in
>> draft status, I've made available an implementation in branch "pickle5"
>> in my GitHub fork of CPython:
>> https://github.com/pitrou/cpython/tree/pickle5
>>
>> Also I've published an experimental backport on PyPI, for Python 3.6
>> and 3.7. This should help people play with the new API and features
>> without having to compile Python:
>> https://pypi.org/project/pickle5/
>>
>> Any feedback is welcome.
> Thanks for doing this.
>
> Hope it isn't too late, but I would like to suggest that protocol 5 support fast compression by default. We normally pickle objects so that they can be transported (saved to a file or sent over a socket). Transport costs (reading and writing a file or socket) are generally proportional to size, so compression is likely to be a net win (much as it was for header compression in HTTP/2).
>
> The PEP lists compression as a possible refinement only for large objects, but I expect it will be a win for most pickles to compress them in their entirety.

I would advise against that. The pickle format is unreadable as it is;
compression will make it literally impossible to diagnose problems.
Python supports transparent compression, e.g. with the 'zlib' codec.
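
For instance, compressing a pickle as a separate, explicit step is a
one-liner each way (a minimal sketch using the zlib codec mentioned
above):

import codecs
import pickle

payload = pickle.dumps({"answer": 42})
compressed = codecs.encode(payload, "zlib_codec")     # compress explicitly
original = pickle.loads(codecs.decode(compressed, "zlib_codec"))
assert original == {"answer": 42}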

>
> Raymond

--
Regards,
Ivan

Neil Schemenauer

May 25, 2018, 4:59:22 PM
to pytho...@python.org
On 2018-05-25, Antoine Pitrou wrote:
> Do you have something specific in mind?

I think compressed by default is a good idea. My quick proposal:

- Use fast compression like lz4 or zlib with Z_BEST_SPEED

- Add a 'compress' keyword argument with a default of None. For
protocol 5, None means to compress. Providing 'compress' != None
for older protocols will raise an error.

The compression overhead will be small compared to the
pickle/unpickle costs. If someone wants to apply their own (e.g.
better) compression, they can set compress=False.
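
For concreteness, a rough sketch of what this could amount to as a thin
wrapper today; note that the 'compress' keyword is only the proposal
above, not an existing pickle API:

import pickle
import zlib

def dumps(obj, protocol=pickle.HIGHEST_PROTOCOL, compress=True):
    data = pickle.dumps(obj, protocol=protocol)
    # Z_BEST_SPEED keeps the compression overhead small relative to pickling.
    return zlib.compress(data, zlib.Z_BEST_SPEED) if compress else data

def loads(blob, compressed=True):
    return pickle.loads(zlib.decompress(blob) if compressed else blob)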

An alternative idea is to have two different protocol formats, e.g.
5 and 6: one is "pickle 5" with compression, one without
compression. I don't like that as much, since it breaks the idea
that higher protocol numbers are "better".

Regards,

Neil

Antoine Pitrou

May 25, 2018, 5:12:53 PM
to pytho...@python.org
On Fri, 25 May 2018 14:50:57 -0600
Neil Schemenauer <nas-p...@arctrix.com> wrote:
> On 2018-05-25, Antoine Pitrou wrote:
> > Do you have something specific in mind?
>
> I think compressed by default is a good idea. My quick proposal:
>
> - Use fast compression like lz4 or zlib with Z_BEST_SPEED
>
> - Add a 'compress' keyword argument with a default of None. For
> protocol 5, None means to compress. Providing 'compress' != None
> for older protocols will raise an error.

The question is what purpose does it serve for pickle to do it rather
than for the user to compress the pickle themselves. You're basically
saving one line of code. Am I missing some other advantage?

(also note that it requires us to ship the lz4 library with Python, or
another modern compression library such as zstd; zlib's performance
characteristics are outdated)

Regards

Antoine.

Neil Schemenauer

May 25, 2018, 6:36:49 PM
to pytho...@python.org
On 2018-05-25, Antoine Pitrou wrote:
> The question is what purpose does it serve for pickle to do it rather
> than for the user to compress the pickle themselves. You're basically
> saving one line of code.

It's one line of code everywhere pickling or unpickling happens. And
you probably need to import a compression module, so at least two
lines. Then maybe you need to figure out if the pickle is
compressed and what kind of compression is used. So, add a few more
lines.
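
Something like the following sketch, say, if a consumer has to cope with
pickles that may or may not have been compressed (the magic-byte
prefixes shown are the usual ones for gzip, the lz4 frame format and a
default zlib header; a real application would check only the schemes it
actually uses):

import gzip
import pickle
import zlib

def loads_maybe_compressed(blob):
    if blob[:2] == b"\x1f\x8b":            # gzip
        blob = gzip.decompress(blob)
    elif blob[:4] == b"\x04\x22\x4d\x18":  # lz4 frame (needs the third-party lz4 package)
        import lz4.frame
        blob = lz4.frame.decompress(blob)
    elif blob[:1] == b"\x78":              # common zlib header
        blob = zlib.decompress(blob)
    # otherwise assume it is a raw, uncompressed pickle
    return pickle.loads(blob)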

It seems logical to me that users of pickle want it to be fast and
produce small pickles. Compressing by default seems the right
choice, even though it complicates the implementation. Ivan brings
up a valid point that compressed pickles are harder to debug.
However, I think that's much less important than being small.

> it requires us to ship the lz4 library with Python

Yeah, that's not so great. I think zlib with Z_BEST_SPEED would be
fine. However, some people might worry it is too slow or doesn't
compress enough. Having lz4 as a battery included seems like a good
idea anyhow. I understand that it is pretty well established as a
useful compression method. Obviously requiring a new C library to
be included expands the effort of implementation a lot.

This discussion can easily lead into bikeshedding (e.g. relative
merits of different compression schemes). Since I'm not
volunteering to implement anything, I will stop responding at this
point. ;-)

Regards,

Neil

Nathaniel Smith

May 25, 2018, 8:14:50 PM
to Neil Schemenauer, Python Dev
On Fri, May 25, 2018 at 3:35 PM, Neil Schemenauer
<nas-p...@arctrix.com> wrote:
> This discussion can easily lead into bikeshedding (e.g. relative
> merits of different compression schemes). Since I'm not
> volunteering to implement anything, I will stop responding at this
> point. ;-)

I think the bikeshedding -- or more to the point, the fact that
there's a wide variety of options for compressing pickles, and none of
them are appropriate in all circumstances -- means that this is
something that should remain a separate layer.

Even super-fast algorithms like lz4 are inefficient when you're
transmitting pickles between two processes on the same system – they
still add extra memory copies. And that's a very common use case.

-n

--
Nathaniel J. Smith -- https://vorpus.org

Stefan Behnel

May 26, 2018, 3:14:34 AM
to pytho...@python.org
Antoine Pitrou wrote on 25.05.2018 at 23:11:
> On Fri, 25 May 2018 14:50:57 -0600
> Neil Schemenauer wrote:
>> On 2018-05-25, Antoine Pitrou wrote:
>>> Do you have something specific in mind?
>>
>> I think compressed by default is a good idea. My quick proposal:
>>
>> - Use fast compression like lz4 or zlib with Z_BEST_SPEED
>>
>> - Add a 'compress' keyword argument with a default of None. For
>> protocol 5, None means to compress. Providing 'compress' != None
>> for older protocols will raise an error.
>
> The question is what purpose does it serve for pickle to do it rather
> than for the user to compress the pickle themselves. You're basically
> saving one line of code. Am I missing some other advantage?

Regarding the pickling side, if the pickle is large, then it can save
memory to compress while pickling, rather than compressing after pickling.
But that can also be done with file-like objects, so the advantage is small
here.
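
(A minimal sketch of that file-like-object approach: the compressor sees
the pickle stream incrementally, so the uncompressed pickle never has to
exist in memory as one large blob.)

import gzip
import pickle

obj = list(range(1_000_000))

with gzip.open("data.pkl.gz", "wb") as f:
    pickle.dump(obj, f)       # compress while pickling

with gzip.open("data.pkl.gz", "rb") as f:
    assert pickle.load(f) == obj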

I think a major advantage is on the unpickling side rather than the
pickling side. Sure, users can compress a pickle after the fact, but if
there's a (set of) standard algorithms that unpickle can handle
automatically, then it's enough to pass "something pickled" into unpickle,
rather than having to know (or figure out) if and how that pickle was
originally compressed, and build up the decompression pipeline for it to
get everything uncompressed efficiently without accidentally wasting memory
or processing time.

Obviously, auto-decompression opens up a gate for compression bombs, but
then, unpickling data from untrusted sources is discouraged anyway, so...

Stefan

Matthew Rocklin

May 26, 2018, 12:12:33 PM
to pytho...@python.org
Hi all, 

I agree that compression is often a good idea when moving serialized objects around on a network, but for what it's worth I, as a library author, would always set compress=False and then handle it myself as a separate step.  There are a few reasons for this:
  1. Bandwidth is often pretty good, especially intra-node, on high-performance networks, or on decent modern disks (NVMe)
  2. I often use different compression technologies in different situations.  LZ4 is a great all-around default, but often Snappy, Blosc, or Zstandard are better suited.  This depends strongly on the characteristics of the data.
  3. Very often the data isn't compressible, or is already in some compressed form, such as images, and so compressing only hurts you.
In general, my thought is that compression is a complex topic with enough intricacies that setting a single sane default that works 70+% of the time probably isn't possible (at least not with the applications that I get exposed to).

Instead of baking a particular method into pickle.dumps I would recommend trying to solve this problem through documentation, pointing users to the various compression libraries within the broader Python ecosystem, and perhaps pointing to one of the many blogposts that discuss their strengths and weaknesses.

Best,
-matt

Olivier Grisel

May 26, 2018, 12:44:53 PM
to Matthew Rocklin, pytho...@python.org
+1 for not adding in-pickle compression, as it is already very easy to handle compression externally (for instance by passing a compressing file object as an argument to the pickler). Furthermore, as PEP 574 makes it possible to stream the buffer bytes directly to the file object without any temporary memory copy, I don't see any benefit in including compression in the pickle protocol.

However, adding lz4.LZ4File to the standard library in addition to gzip.GzipFile and lzma.LZMAFile is probably a good idea, as LZ4 is really fast compared to zlib/gzip. But this is not related to PEP 574.

--
Olivier

Antoine Pitrou

May 26, 2018, 1:16:27 PM
to pytho...@python.org
On Sat, 26 May 2018 18:42:42 +0200
Olivier Grisel <olivier...@ensta.org> wrote:
>
> However adding lz4.LZ4File to the standard library in addition to
> gzip.GzipFile and lzma.LZMAFile is probably a good idea as LZ4 is really
> fast compared to zlib/gzip. But this is not related to PEP 574.

If we go that way, we probably want zstd as well :-). But, yes,
most likely unrelated to PEP 574.

Regards

Antoine.