Python 2->3 incompatibility in unpickling

Simon King

Jan 22, 2020, 4:35:16 PM
to sage-...@googlegroups.com
Hi!

I have (Sage-related) data pickled with Python-2. Part of the data is
binary data put into a Python-2 str.

Now, with Python-3, the binary data is put into bytes.

Consequently, when unpickling my old data with Python-3, I want the
Python-2 str to be interpreted as bytes. However, Python-3 insists on
misinterpreting it as str, and I have trouble turning that str into the
appropriate bytes.

Here is an example:
Unpickling the data initially results in a string, s:
sage: s = '\x80\x1f'
I want it to be interpreted as the following bytes, b:
sage: b = b'\x80\x1f'

How can I efficiently transform s into b? The following works, but I
doubt that it is very efficient:
sage: import struct
sage: struct.pack('{}B'.format(len(s)),*(ord(_) for _ in s)) == b
True

Note that sage.cpython.string.str_to_bytes doesn't do what I need:
sage: sage.cpython.string.str_to_bytes(s)
b'\xc2\x80\x1f'
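
The extra \xc2 byte above is consistent with the string being encoded as UTF-8, where the code point U+0080 takes two bytes. Here is a minimal sketch in plain Python 3 (no Sage), assuming that UTF-8 is indeed the encoding in effect for that call:

```python
# Plain Python 3; this only illustrates UTF-8 behaviour and does not call Sage.
s = '\x80\x1f'  # two code points: U+0080 and U+001F

utf8 = s.encode('utf-8')  # U+0080 is outside ASCII, so UTF-8 uses two bytes for it

print(utf8)               # the same bytes as the str_to_bytes output above
print(len(s), len(utf8))  # two characters become three bytes
```

So the result is a valid encoding of the text, just not the raw bytes that were originally pickled.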

Best regards,
Simon


Julien Puydt

Jan 22, 2020, 4:44:00 PM
to sage-...@googlegroups.com
On Wednesday, January 22, 2020 at 21:35 +0000, Simon King wrote:
> Consequently, when unpickling my old data with Python-3, I want the
> Python-2 str to be interpreted as bytes. However, Python-3 insists on
> misinterpreting it as str, and I have trouble to turn that str into
> an appropriate bytes.

Did you try forcing the pickling "protocol" parameter?

JP
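
On the loading side, the standard library's pickle.loads also takes an "encoding" argument that controls how Python-2 8-bit strings are read. Here is a minimal sketch with plain Python 3 pickle (not Sage's own loader); the pickle bytes are written by hand for illustration and are assumed to be equivalent to what Python 2 would emit for the str '\x80\x1f' at protocol 2:

```python
import pickle

# Hand-written protocol-2 pickle (an assumption for illustration):
# PROTO 2, SHORT_BINSTRING of length 2, BINPUT 0, STOP.
py2_pickle = b'\x80\x02U\x02\x80\x1fq\x00.'

# encoding='bytes' keeps 8-bit strings as bytes objects.
as_bytes = pickle.loads(py2_pickle, encoding='bytes')

# encoding='latin1' decodes them to str, one code point per byte.
as_str = pickle.loads(py2_pickle, encoding='latin1')

# The default encoding is ASCII, which cannot decode the byte 0x80.
try:
    pickle.loads(py2_pickle)
    default_fails = False
except UnicodeDecodeError:
    default_fails = True

print(type(as_bytes).__name__, as_bytes)
print(type(as_str).__name__, repr(as_str))
print(default_fails)
```

Note that encoding='bytes' applies to every 8-bit string in the pickle, not just the binary payload, so it may not suit pickles that also contain genuine text.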

Nils Bruin

Jan 22, 2020, 5:00:04 PM
to sage-devel
On Wednesday, January 22, 2020 at 1:35:16 PM UTC-8, Simon King wrote:
> Here is an example:
> Unpickling the data initially results in a string, s:
>   sage: s = '\x80\x1f'
> I want it to be interpreted as the following bytes, b:
>   sage: b = b'\x80\x1f'
>
> How can I efficiently transform s into b? The following works, but I

This is exactly what we encountered on #28444, where we found that "latin1" encoding is a left inverse of "latin1" decoding: decoding any bytes as "latin1" and then encoding the result as "latin1" gives back the original bytes. It looks like the encoding Py3 used for your py2-string was "latin1" (that's what we decided was the smart thing to do most of the time for possibly binary data), so

s.encode(encoding="latin1")

should do the trick. That should be pretty efficient.
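
As a minimal sketch in plain Python 3 (the variable names are only for illustration), the following checks that the "latin1" round trip is lossless for every byte value and that the result agrees with the struct-based conversion from the original question:

```python
import struct

s = '\x80\x1f'   # the str obtained from unpickling
b = b'\x80\x1f'  # the bytes that were originally pickled

encoded = s.encode(encoding='latin1')

# latin1 maps byte n to code point n, so decoding and then encoding is the
# identity on bytes. Check it for all 256 byte values.
all_bytes = bytes(range(256))
round_trip = all_bytes.decode('latin1').encode('latin1')

# The struct-based conversion from the question, for comparison.
via_struct = struct.pack('{}B'.format(len(s)), *(ord(c) for c in s))

print(encoded == b, round_trip == all_bytes, via_struct == encoded)
```

The encode call raises UnicodeEncodeError if the string contains a code point above 255, which should not happen for a str that was produced by "latin1" decoding.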

Simon King

Jan 22, 2020, 6:01:21 PM
to sage-...@googlegroups.com
Hi Nils,

yes, that's exactly the same problem! Thank you for reminding me. I had
totally forgotten what the "correct" encoding was called.

Best regards,
Simon