Re: buffer interface (python-blosc)

Francesc Alted

Oct 15, 2010, 5:38:23 AM
to Han Genuit, bl...@googlegroups.com, car...@googlegroups.com
Hi Han,

I'm back at home, finally :-) I've been thinking about your efforts for
effectively compressing buffers of arrays. From what I have been able to
grasp, your solution works mainly for Python >= 2.7.

However, I'm thinking that compressing just a buffer is not really that
helpful for people that want to compress arrays in order to transmit
them or save them, because you still need to serialize the shape and
dtype information for a compressed buffer to be deserialized correctly.
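
To make the point concrete, here is a minimal sketch of what has to travel alongside a compressed buffer. It uses only the standard library, with `zlib` standing in for Blosc. The header layout and the helper names `pack_with_header` / `unpack_with_header` are illustrative assumptions, not part of python-blosc.

```python
import json
import struct
import zlib


def pack_with_header(raw, shape, dtype):
    """Prefix a compressed buffer with the metadata needed to rebuild the array."""
    header = json.dumps({"shape": list(shape), "dtype": dtype}).encode("ascii")
    # 4-byte little-endian header length, then the header, then the payload.
    return struct.pack("<I", len(header)) + header + zlib.compress(raw)


def unpack_with_header(blob):
    """Return (raw_bytes, shape, dtype) from a blob made by pack_with_header."""
    (hlen,) = struct.unpack_from("<I", blob, 0)
    meta = json.loads(blob[4:4 + hlen].decode("ascii"))
    raw = zlib.decompress(blob[4 + hlen:])
    return raw, tuple(meta["shape"]), meta["dtype"]


# A 2x3 array of little-endian int32 values, given as raw bytes.
raw = struct.pack("<6i", 0, 1, 2, 3, 4, 5)
blob = pack_with_header(raw, (2, 3), "<i4")
restored, shape, dtype = unpack_with_header(blob)
```

Without the header, the receiver would get the right bytes back but would have no way of knowing they represent a 2x3 array of 32-bit integers.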

Personally, I find the existing `blosc.pack_array` /
`blosc.unpack_array` to be much more useful for the typical use cases
mentioned above. The problem with this approach is twofold:

1) It is based on pickle, and for numpy arrays, performance is a bit
disappointing.

2) You need memory to uncompress the array before it can be unpickled.
That could represent a fair amount of resources for large arrays.

While 1) might be improved by optimizing the pickling / unpickling
routines in NumPy, I see 2) as a major drawback.
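
Problem 2) can be illustrated with a small sketch of the pickle-then-compress pattern. This is only an approximation under stated assumptions: it uses `zlib` in place of Blosc and a standard-library `array` in place of a NumPy array, and it is not the actual `blosc.pack_array` code.

```python
import pickle
import zlib
from array import array

data = array("d", range(100000))

# Pack: pickle first, then compress the resulting pickle stream.
packed = zlib.compress(pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL))

# Unpack: the whole pickle stream has to be decompressed into memory
# before pickle.loads() can start, so the uncompressed pickle and the
# restored array both exist at full size for a moment.
pickled = zlib.decompress(packed)
restored = pickle.loads(pickled)
```

Here `pickled` is at least as large as the array data itself, which is the extra memory referred to in 2).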

My opinion is that it would be best to be able to use a container that
is already compressed in-memory, so that we don't need to uncompress
buffers completely before being able to use the data in them. In that
sense, the carray project (http://github.com/FrancescAlted/carray) fits
the bill perfectly.
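
The idea of a container that stays compressed in memory can be sketched as follows. This is a toy illustration, not carray's implementation: the class name `ChunkedInts`, the int32-only layout and the use of `zlib` instead of Blosc are all assumptions made for the example.

```python
import struct
import zlib


class ChunkedInts:
    """Toy container: int32 values kept as independently compressed chunks."""

    def __init__(self, values, chunklen=1024):
        self.chunklen = chunklen
        self.length = len(values)
        self.chunks = []
        for start in range(0, self.length, chunklen):
            part = values[start:start + chunklen]
            raw = struct.pack("<%di" % len(part), *part)
            self.chunks.append(zlib.compress(raw))

    def __getitem__(self, index):
        if not 0 <= index < self.length:
            raise IndexError(index)
        nchunk, offset = divmod(index, self.chunklen)
        # Only the chunk holding the requested element is decompressed.
        raw = zlib.decompress(self.chunks[nchunk])
        return struct.unpack_from("<i", raw, 4 * offset)[0]


c = ChunkedInts(list(range(10000)), chunklen=1024)
value = c[5000]
```

Reading one element costs the decompression of a single chunk, so the full uncompressed data never has to exist in memory at once.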

So, my idea now is to provide a way to serialize carray containers so
that they can be transmitted or stored in files without the need to be
compressed/decompressed (thus avoiding problem 2). It could even be
desirable that they don't have to be transmitted/loaded completely
before the deserialization process can start.

Concretely, I would like to come up with a stream format that can
serialize/deserialize carrays. For example, I like the approach of the
NPY format (http://svn.scipy.org/svn/numpy/trunk/doc/neps/npy-format.txt),
but adapted to carrays. That would represent a near-optimal way of
transporting data (to other processes or to disk).
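
As a rough sketch of what such a stream format could look like: a magic string, a small metadata header, then each compressed chunk with a length prefix. Everything here is hypothetical (the `MAGIC` value, the JSON header, the function names); it is neither the NPY format nor an actual carray format, and `zlib` again stands in for Blosc.

```python
import io
import json
import struct
import zlib

MAGIC = b"TOYC\x01"  # hypothetical magic string plus a format version byte


def write_stream(fileobj, chunks, meta):
    """Write a header, then each already-compressed chunk, length-prefixed."""
    header = json.dumps(meta).encode("ascii")
    fileobj.write(MAGIC + struct.pack("<I", len(header)) + header)
    for chunk in chunks:
        # Chunks are copied as they are: no decompression, no recompression.
        fileobj.write(struct.pack("<I", len(chunk)) + chunk)


def read_stream(fileobj):
    """Return (meta, iterator over compressed chunks), reading lazily."""
    if fileobj.read(len(MAGIC)) != MAGIC:
        raise ValueError("not a toy chunk stream")
    (hlen,) = struct.unpack("<I", fileobj.read(4))
    meta = json.loads(fileobj.read(hlen).decode("ascii"))

    def chunks():
        while True:
            prefix = fileobj.read(4)
            if len(prefix) < 4:
                return  # end of stream
            (clen,) = struct.unpack("<I", prefix)
            yield fileobj.read(clen)

    return meta, chunks()


compressed = [zlib.compress(struct.pack("<4i", *range(i, i + 4))) for i in (0, 4, 8)]
buf = io.BytesIO()
write_stream(buf, compressed, {"dtype": "<i4", "shape": [12], "chunklen": 4})
buf.seek(0)
meta, chunk_iter = read_stream(buf)
first = struct.unpack("<4i", zlib.decompress(next(chunk_iter)))
```

Because the chunks are framed individually, a reader can start using the first chunk before the rest of the stream has arrived, and the writer never needs to decompress anything.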

With this, I don't really think there is much point in continuing with
the idea of implementing a compression function just for numpy buffers.

What do you think? Would that make sense to you?

[I'm sending a copy of this to the blosc and carray mailing lists where
we can continue discussing the thing, and others might want to
contribute ideas]

Francesc

On Saturday 09 October 2010 12:33:15 you wrote:
> Hmm, another setback. It turns out that the buffer object cannot be
> weakly referenced. That would have been nice.. ^^
>
> Cleaning after the buffer really is troublesome for Py <= 2.6..
>
> Regards,
> Han
>
> ________________________________________
> From: Han Genuit
> Sent: Friday, 8 October 2010 15:38
> To: 'Francesc Alted'
> Subject: Re: buffer interface (python-blosc)
>
> Hi, just a quick update:
>
> After some experimenting, I decided to try an implementation with a
> weak reference that binds a callback to the buffer object and
> frees the memory referred to by the buffer when it is garbage
> collected. Trouble is, the buffer object should be private, but I
> can work around that.. Now I’m struggling with the callback – I
> wanted to create a function in C and also refer to it from C. (To
> keep it ‘under the hood’ from Python space.) But it seems to be a
> lot of work to create a method that I can pass as callback function
> to PyWeakref_NewRef… Work in progress. ;-)
>
> Regards,
> Han
>
> ________________________________
> From: Han Genuit
> Sent: Thursday, 07 October, 2010 10:39
> To: 'Francesc Alted'
> Subject: RE: buffer interface (python-blosc)
>
> Hi Francesc,
>
> Thanks! It is the (new) standardized way for creating a memory view
> from a plain buffer, though, I think. It only took me a while to
> figure out that the buffer functions don’t need a PyObject * in all
> instances.. ;-)
>
> About using ctypes, I think the main problem with converting a void*
> buffer to Python-string is that Python creates a NULL-terminated
> C-string, which needs a memcpy regardless.. I already tried to use a
> similar approach for the construction of a ByteArray (which is
> mutable, I later learned, so not the best choice ;-), but from the
> Blosc source, I gathered that you handle the buffers as plain char *
> arrays, without any NULL-termination, so I gave up on that. The old
> buffer object doesn’t need NULL-termination, which made it the best
> candidate, were it not for the lack of an internal free()
> mechanism..
>
> Anyway, I’ll work on it. When I have it integrated cleanly, I’ll send
> you the patch!
>
> Regards,
> Han
>
>
> ________________________________
> From: fal...@gmail.com [mailto:fal...@gmail.com] On Behalf Of Francesc Alted
> Sent: Wednesday, 06 October, 2010 06:01
> To: Han Genuit
> Subject: Re: buffer interface (python-blosc)
>
> Hey Han,
>
> Your approach is pretty neat, I like it! For supporting 2.6, I
> suppose that it should be possible to use ctypes in order to get a
> string out of a buffer without copying.
>
> I'm in an autumn school now, but please send your patch to the github
> site, and I'll check it in more depth next week.
>
> Thanks!
>
> 2010/10/5 Han Genuit <J.W.G...@rijnhuizen.nl>
>
> Hey Francesc,
>
> I have some trouble implementing a (clean) buffer interface for
> python-blosc on version < 2.7, because I can't use the memoryview
> object as carrier.. I implemented the old buffer object, but the
> problem with the old one is that it doesn't release the underlying
> memory by itself, which causes big leaks. To fix that, I have to
> implement a whole new object to release the memory at dealloc and
> base the buffer on that object, but it is quite a can of worms,
> because you have to implement a new type object and various
> functions to get the object to work, and causes a lot of ugly code.
> I didn't get to the point where it works..
>
> For versions >= 2.7, you only need to do this:
>
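> static PyObject *
> _get_buffer(void *buf_p, size_t cbytes)
> {
>     /* Python >= 2.7 only: wrap the raw buffer in a memoryview. */
>     Py_buffer view;
>
>     /* exporter = NULL, readonly = 1; returns -1 with an exception set. */
>     if (PyBuffer_FillInfo(&view, NULL, buf_p, (Py_ssize_t)cbytes, 1,
>                           PyBUF_FORMAT) < 0)
>         return NULL;
>     return PyMemoryView_FromBuffer(&view);
> }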
>
> And that works. (Without leaks.)
>
> (I haven't tried the other way round yet, but that didn't make a
> copy, I think..)
>
> Question is; is it worth implementing a whole new object with
> buffer-interfacing and memory management just for Py 2.6?
>
> Regards,
> Han
>
>
>
> --
> Francesc Alted

--
Francesc Alted
