[SciPy-User] Numpy pickle format


David Baddeley

Nov 24, 2010, 5:00:56 PM
to scipy...@scipy.org
I was wondering if anyone could point me to any documentation for the (binary)
format of pickled numpy arrays.

To put my request into context, I'm using Pyro to communicate between Python and
Jython, and would like to push numpy arrays into the Python end and pull something
I can work with in Jython out the other end. I was thinking of a minimal class
wrapping the standard library's array.array and having some form of shape property
(I can pretty much guarantee that the data going in is C-contiguous, so there
shouldn't be any strides nastiness).

The proper way to do this would be to convert my numpy arrays to this minimal
wrapper before pushing them onto the wire, but I've already got a fair bit of
python code which pushes arrays round using Pyro, which I'd prefer not to have
to rewrite. The pickle representation of array.array is also slightly different
(broken) between CPython and Jython, and although you can pickle and unpickle,
you end up swapping the endianness, so to recover the data [in the Jython ->
Python direction] you've got to create a numpy array and then a view of that
with reversed endianness.

What I was hoping to do instead was to construct a dummy numpy.ndarray class in
jython which knew how to pickle/unpickle numpy arrays.

The ultimate goal is to create a Python -> ImageJ bridge so I can push images
from some python image processing code I've got across into ImageJ without
having to manually save and open the files.

Would appreciate any suggestions,

thanks,
David



_______________________________________________
SciPy-User mailing list
SciPy...@scipy.org
http://mail.scipy.org/mailman/listinfo/scipy-user

Christopher Barker

Nov 24, 2010, 5:20:05 PM
to SciPy Users List
On 11/24/10 2:00 PM, David Baddeley wrote:
> I was wondering if anyone could point me to any documentation for the (binary)
> format of pickled numpy arrays.
>
> To put my request into context, I'm using Pyro to communicate between python and
> jython, and would like push numpy arrays into the python end and pull something
> I can work with in jython out the other end

Maybe the native (*.npy) format would be easier to deal with.

http://svn.scipy.org/svn/numpy/trunk/doc/neps/npy-format.txt

And you can pack a bunch of those together in a zip file with savez.

If Jython has a struct and array.array (or SOME sort of binary format
suitable for storing the data), it would be pretty easy to unpack them
in Jython.
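
As a rough illustration of that idea, here is a minimal sketch of reading a version 1.0 .npy byte string with nothing but struct and ast. The names read_npy and STRUCT_CODES are made up for this example, only simple numeric dtypes are handled, and it is written for current CPython, so a Jython port may need adjusting.

```python
import ast
import struct

MAGIC = b"\x93NUMPY"

# NumPy kind+width codes mapped to struct format characters (standard sizes).
# Illustrative subset only; structured and object dtypes are not handled.
STRUCT_CODES = {
    "i1": "b", "i2": "h", "i4": "i", "i8": "q",
    "u1": "B", "u2": "H", "u4": "I", "u8": "Q",
    "f4": "f", "f8": "d",
}


def read_npy(buf):
    """Parse a version 1.x .npy byte string.

    Returns (shape, descr, fortran_order, values), where values is a flat
    list in the order the items are stored in the file.
    """
    if buf[:6] != MAGIC:
        raise ValueError("not an NPY byte string")
    major, minor = struct.unpack("<BB", buf[6:8])
    if major != 1:
        raise ValueError("this sketch only handles format version 1.x")
    # Version 1.x stores the header length as a little-endian unsigned short.
    (header_len,) = struct.unpack("<H", buf[8:10])
    header_text = buf[10:10 + header_len].decode("latin1").strip()
    header = ast.literal_eval(header_text)  # a Python dict literal

    descr = header["descr"]                 # e.g. '<i4'
    shape = header["shape"]
    count = 1
    for n in shape:
        count *= n

    # '|' means "byte order not applicable" (single-byte items).
    order = "<" if descr[0] == "|" else descr[0]
    fmt = "%s%d%s" % (order, count, STRUCT_CODES[descr[1:]])
    values = struct.unpack_from(fmt, buf, 10 + header_len)
    return shape, descr, header["fortran_order"], list(values)
```

The flat list plus the shape tuple is about all a minimal wrapper class needs; the values could just as well be loaded into an array.array.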

-Chris



--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception

Chris....@noaa.gov

Robert Kern

Nov 24, 2010, 5:21:36 PM
to David Baddeley, SciPy Users List
On Wed, Nov 24, 2010 at 16:00, David Baddeley
<david_b...@yahoo.com.au> wrote:
> I was wondering if anyone could point me to any documentation for the (binary)
> format of pickled numpy arrays.
>
> To put my request into context, I'm using Pyro to communicate between python and
> jython, and would like push numpy arrays into the python end and pull something
> I can work with in jython out the other end (I was thinking of a minimal class
> wrapping the std libraries array.array, and having some form of shape property
> (I can pretty much guarantee that the data going in is c-contiguous, so there
> shouldn't be any strides nastiness).
>
> The proper way to do this would be to convert my numpy arrays to this minimal
> wrapper before pushing them onto the wire, but I've already got a fair bit of
> python code which pushes arrays round using Pyro, which I'd prefer not to have
> to rewrite. The pickle representation of array.array is also slightly different
> (broken) between cPython and Jython, and although you can pickle and unpickle,
> you end up swapping the endedness, so to recover the data [in the Jython ->
> Python direction] you've got to create a numpy array and then a view of that
> with reversed endedness.
>
> What I was hoping to do instead was to construct a dummy numpy.ndarray class in
> jython which knew how to pickle/unpickle  numpy arrays.
>
> The ultimate goal is to create a Python -> ImageJ bridge so I can push images
> from some python image processing code I've got across into ImageJ without
> having to manually save and open the files.

[~]
|3> a = np.arange(5)

[~]
|4> a.__reduce_ex__()
(<function numpy.core.multiarray._reconstruct>,
 (numpy.ndarray, (0,), 'b'),
 (1,
  (5,),
  dtype('int32'),
  False,
  '\x00\x00\x00\x00\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00\x04\x00\x00\x00'))

[~]
|6> a.dtype.__reduce_ex__()
(numpy.dtype, ('i4', 0, 1), (3, '<', None, None, None, -1, -1, 0))


See the pickle documentation for how these tuples are interpreted:

http://docs.python.org/library/pickle#object.__reduce__

[~]
|12> x = np.core.multiarray._reconstruct(np.ndarray, (0,), 'b')

[~]
|13> x
array([], dtype=int8)

[~]
|14> x.__setstate__(Out[11][2])

[~]
|15> x
array([0, 1, 2, 3, 4])

[~]
|16> x.__setstate__?
Type: builtin_function_or_method
Base Class: <type 'builtin_function_or_method'>
String Form: <built-in method __setstate__ of numpy.ndarray object at 0x387df40>
Namespace: Interactive
Docstring:
    a.__setstate__(version, shape, dtype, isfortran, rawdata)

    For unpickling.

    Parameters
    ----------
    version : int
        optional pickle version. If omitted defaults to 0.
    shape : tuple
    dtype : data-type
    isFortran : bool
    rawdata : string or list
        a binary string with the data (or a list if 'a' is an object array)


In order to get pickle to work, you need to stub out the types
numpy.dtype and numpy.ndarray, and the function
numpy.core.multiarray._reconstruct(). You need numpy.dtype and
numpy.ndarray to define appropriate __setstate__ methods.

Check the functions arraydescr_reduce() and arraydescr_setstate() in
numpy/core/src/multiarray/descriptor.c for how to interpret the state
tuple for dtypes. If you're just dealing with straightforward image
types, then you really only need to pay attention to the first element
(the data kind and width, 'i4') in the argument tuple and the second
element (byte order character, '<') in the state tuple.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
  -- Umberto Eco

David Baddeley

Nov 24, 2010, 6:22:02 PM
to SciPy Users List
Thanks heaps for the detailed reply! That looks like it should be enough info to
get me started ... I know it's a bit of a niche application, but is anyone else
out there likely to be interested in similar functionality? I just want to know
if it's worth taking the time to think about supporting some of the additional
aspects of the protocol (e.g. C/Fortran order) before I cobble something
together. I wonder if one could wrap JAMA to provide some very basic array
functionality ...

cheers,
David

Francesc Alted

Nov 29, 2010, 8:46:04 AM
to David Baddeley, SciPy Users List
Hi David,

On Thursday 25 November 2010 00:22:02, David Baddeley wrote:


> Thanks heaps for the detailed reply! That looks like it should be
> enough info to get me started ... I know it's a bit of a niche
> application, but is there likely to be anyone else out there who's
> likely to be interested in similar functionality? Just want to know
> if it's worth taking the time to think about supporting some of the
> additional aspects of the protocol (eg c/fortran order) before I
> cobble something together - I wonder if one could wrap JAMA to
> provide some very basic array functionality ...

I'm interested. I'm looking to adopt a protocol for sending arrays that
can serialize/deserialize them without duplicating the contents in
memory (so that the serialized and deserialized versions do not have to
exist at the same time).

My idea is to adopt something similar to the native NPY format for
files:

http://svn.scipy.org/svn/numpy/trunk/doc/neps/npy-format.txt

but adapting it to support blocking, that is, to be able to send the
array in blocks and restore the original array by assembling these
blocks. That way, the serialized and deserialized versions do not have
to coexist in the same process memory (only one block does) when
sending the stream to its destination. As a plus, this would make it
possible to compress blocks transparently and, with a little more
effort, perhaps even allow random access when the serialization goes to
a file on disk (and not to a stream).
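
A minimal sketch of the blocking idea, using zlib from the standard library as a stand-in for whatever compressor would actually be chosen. The function names iter_blocks and assemble, and the choice to pass the total size out of band, are assumptions made for this example.

```python
import zlib


def iter_blocks(raw, block_size, compress=True):
    """Yield successive blocks of `raw`, each optionally zlib-compressed.

    Only one block is materialized at a time on the sending side.
    """
    view = memoryview(raw)
    for start in range(0, len(view), block_size):
        chunk = bytes(view[start:start + block_size])
        yield zlib.compress(chunk) if compress else chunk


def assemble(blocks, total_size, compressed=True):
    """Rebuild the original bytes by writing each block into one buffer."""
    out = bytearray(total_size)
    pos = 0
    for block in blocks:
        chunk = zlib.decompress(block) if compressed else block
        if pos + len(chunk) > total_size:
            raise ValueError("received more data than announced")
        out[pos:pos + len(chunk)] = chunk
        pos += len(chunk)
    if pos != total_size:
        raise ValueError("received less data than announced")
    return bytes(out)
```

Because every block is compressed independently, a receiver that also knows the block size could decompress any single block without touching the others, which is what would make random access on disk plausible.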

I'm thinking of supporting just the metadata that NPY supports right
now, that is, the dtype, the C/Fortran order and the shape, that's all.
Once this format is settled, several implementations can be written
(on top of Pyro or ZeroMQ, or just using something in the Python
standard library).

Do you think that this approach would fulfill your requirements?

--
Francesc Alted

Robert Kern

Nov 29, 2010, 9:09:30 AM
to SciPy Users List

Rather than "adapting the format" per se, just wrap your format around
it. Send a message containing the version number of your blocked
format, the number of header blocks, the number of data blocks, and
any information about the compression of the data. Then send the NPY
header in its own message. Then start sending the possibly compressed
data messages.
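
A rough sketch of that message sequence, with the NPY header carried as opaque bytes. The JSON envelope, its key names, and the functions frame and unframe are all assumptions made for this example, not an agreed format.

```python
import json

FORMAT_VERSION = 1  # version of the hypothetical blocked wrapper format


def frame(npy_header, data_blocks, compression=None):
    """Yield the messages in order: envelope, NPY header, then data blocks."""
    envelope = {
        "version": FORMAT_VERSION,
        "header_blocks": 1,
        "data_blocks": len(data_blocks),
        "compression": compression,
    }
    yield json.dumps(envelope).encode("ascii")
    yield npy_header              # passed through untouched
    for block in data_blocks:
        yield block


def unframe(messages):
    """Split a message sequence back into (envelope, npy_header, data_blocks)."""
    it = iter(messages)
    envelope = json.loads(next(it).decode("ascii"))
    if envelope["version"] != FORMAT_VERSION:
        raise ValueError("unsupported wrapper format version")
    header_parts = [next(it) for _ in range(envelope["header_blocks"])]
    blocks = [next(it) for _ in range(envelope["data_blocks"])]
    return envelope, b"".join(header_parts), blocks
```

The wrapper never looks inside the NPY header, so the two layers can evolve separately.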

--
Robert Kern

Francesc Alted

Nov 29, 2010, 11:04:14 AM
to SciPy Users List
On Monday 29 November 2010 15:09:30, Robert Kern wrote:

> Rather than "adapting the format" per se, just wrap your format
> around it. Send a message containing the version number of your
> blocked format, the number of header blocks, the number of data
> blocks, and any information about the compression of the data. Then
> send the NPY header in its own message. Then start send the possibly
> compressed data messages.

Well, I was basically thinking of extending NPY to incorporate
compression information, but your approach is feasible too (although it
requires sending one additional message). What advantage would your
suggestion have?

--
Francesc Alted

Robert Kern

Nov 29, 2010, 2:08:19 PM
to SciPy Users List
On Mon, Nov 29, 2010 at 10:04, Francesc Alted <fal...@pytables.org> wrote:
> On Monday 29 November 2010 15:09:30, Robert Kern wrote:
>> Rather than "adapting the format" per se, just wrap your format
>> around it. Send a message containing the version number of your
>> blocked format, the number of header blocks, the number of data
>> blocks, and any information about the compression of the data. Then
>> send the NPY header in its own message. Then start send the possibly
>> compressed data messages.
>
> Well, I was thinking basically in extending NPY for incorporating
> compression information, but your approach is feasible too (although it
> requires sending one additional message).  Which advantage would have
> your suggestion?

It's standard best practice for developing stacks of network protocols
(cf. UDP and TCP over IP over Ethernet). Not least, it keeps the two
protocols orthogonal to each other. If I change the NPY format
slightly (e.g. by adding another header key but not changing the
header/data separation), you don't have to change your protocol at
all. At least with ZeroMQ, adding an additional block is incredibly
cheap (you should probably err on the side of more blocks rather than
fewer).

--
Robert Kern
