Pickling, data persistence

Ian
Feb 21, 2010, 1:46:44 PM
to APAM Python Users
I wrote a couple of functions that are useful for storing generic Python
objects containing large numpy arrays.

Click on http://groups.google.com/group/apam-python-users/web/pickling-data-persistence
- or copy & paste it into your browser's address bar if that doesn't
work.

Lisandro Dalcin
Feb 22, 2010, 1:49:48 PM
to apam-pyt...@googlegroups.com

Have you benchmarked cPickle.dump(npyobj, fileobj, 2), i.e., using
pickle's protocol version 2?
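
Something along these lines (the filename is just a placeholder):

import cPickle

# protocol 2 is a binary format, so open the file in binary mode
with open('npyobj.pkl', 'wb') as f:
    cPickle.dump(npyobj, f, 2)

# and to read it back
with open('npyobj.pkl', 'rb') as f:
    npyobj = cPickle.load(f)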


--
Lisandro Dalcin
---------------
Centro Internacional de Métodos Computacionales en Ingeniería (CIMEC)
Instituto de Desarrollo Tecnológico para la Industria Química (INTEC)
Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET)
PTLC - Güemes 3450, (3000) Santa Fe, Argentina
Tel/Fax: +54-(0)342-451.1594

Ian Langmore
Feb 22, 2010, 2:32:23 PM
to apam-pyt...@googlegroups.com
No, I hadn't.  Thanks for pointing it out.  Using protocol=2 I get much faster results than with the default protocol=0, but still slower than using numpy's methods.  Here are some benchmarks.  I am storing/reading an object with a few attributes, including three 11MB numpy arrays.

cPickle.dump, protocol=2:
dump: 4.6 sec, created one file of size 110MB
load: 10.2 sec

pickleme: < 0.5 sec, created three .npy files totaling 33MB + one .pick file of 371 bytes
unpickleme: < 0.1 sec

So the pickleme/unpickleme methods are much faster.  However, cPickle with protocol=2 is much, much faster than cPickle with the default protocol=0.  It's so much faster that I wouldn't have bothered writing these specialized methods if I had known about protocol=2.  One disadvantage of pickleme/unpickleme is that they can't be used to pickle anything in Python that doesn't have a __dict__ attribute (say, an integer).  This could be corrected, of course.
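
For context, the rough idea behind pickleme/unpickleme (this is a simplified sketch, not the actual code from the page linked above) is to pull the numpy arrays out of the object's __dict__, save each one with np.save, and pickle the small remainder into the .pick file:

import cPickle
import numpy as np

def pickleme(obj, basename):
    # sketch only: arrays go to individual .npy files, everything
    # else gets pickled into one small .pick file
    rest = {}
    for name, val in obj.__dict__.items():
        if isinstance(val, np.ndarray):
            np.save(basename + '_' + name, val)   # writes basename_name.npy
        else:
            rest[name] = val
    with open(basename + '.pick', 'wb') as f:
        cPickle.dump(rest, f, 2)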

-Ian



Lisandro Dalcin
Feb 22, 2010, 2:52:08 PM
to apam-pyt...@googlegroups.com
On 22 February 2010 16:32, Ian Langmore <ianla...@gmail.com> wrote:
> No I hadn't.  Thanks for pointing it out.  Using protocol=2 I get much
> faster results than the default protocol=0, but still slower than using
> numpy's methods.  Here are some benchmarks.  I am storing/reading an object
> with a few attributes including three 11MB numpy arrays.
>
> cPickle.dump,  protocol=2
> dump,  4.6 sec,  created one file of size 110MB
> load,  10.2 sec.
>
> pickleme,  < 0.5 sec, created three .npy files totaling 33MB + one .pick
> file of 371 bytes
> unpickleme,  < 0.1 sec.
>
> So the pickleme/unpickleme methods are much faster.  However, cPickle with
> protocol=2 is much much faster than the cPickle (default protocol=0).  It's
> so much faster that I wouldn't have bothered writing these specialized
> methods if I had known about protocol=2.

Still, the differences are striking... A quick test (with a single
32MB array) on my box shows me that ary.dump(filename) (basically,
cPickle.dump with protocol 2) is faster than numpy.save(filename, ary)
(basically, the npy format).


> One disadvantage with
> pickleme/unpickleme is that they can't be used to pickle anything in python
> that doesn't have a __dict__ attribute (say an integer).  This could be
> corrected of course...
>

Sorry, can you elaborate on this limitation? The pickle protocol lets
you serialize objects without __dict__ (like built-in types, or the ones
you get with 'cdef class' in Cython). You just have to implement some
special methods, like __reduce__ and __setstate__, or use the
'copy_reg' module...
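
A minimal illustration of that idea, using a plain Python class with __slots__ so the instances have no __dict__:

import cPickle

class Point(object):
    __slots__ = ('x', 'y')            # instances have no __dict__
    def __init__(self, x, y):
        self.x, self.y = x, y
    def __reduce__(self):
        # (callable, args) tells pickle how to rebuild the object
        return (Point, (self.x, self.y))

p = cPickle.loads(cPickle.dumps(Point(1.0, 2.0), 2))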

Ian Langmore
Feb 22, 2010, 3:18:23 PM
to apam-pyt...@googlegroups.com
I'm not sure what is causing these differences. 

The limitations I spoke of are due to the way I wrote pickleme/unpickleme.  I have them search through the object's dictionary.  This could be changed if someone wanted to.  First, though, I need to experiment to see what is causing the slowdown of cPickle on my system.  If cPickle is indeed faster than numpy.load/save, then there is no reason for my functions.

-Ian

Lisandro Dalcin
Feb 22, 2010, 3:32:50 PM
to apam-pyt...@googlegroups.com
On 22 February 2010 17:18, Ian Langmore <ianla...@gmail.com> wrote:
> I'm not sure what is causing these differences.
>
> The limitations I spoke of are due to the way I wrote pickleme/unpickleme.
> I have them search through the object's dictionary.  This could be changed
> if someone wanted to.

OK... now I understand your comments.


> First though I need to experiment to see what is
> causing the slowdown of cPickle on my system.
> If cPickle is indeed faster
> than numpy.load/save, then there is no reason for my functions.
>

No, sorry, I got that wrong... It was the other way around. See for
yourself: np.save() seems to be (a bit) faster on my box (using Python
2.6.2 and numpy 1.3.0).

In [1]: import numpy as np

In [2]: a = np.ones(4e6)

In [3]: a.nbytes
Out[3]: 32000000

In [4]: %timeit a.dump('/tmp/npyarray.tmp')
10 loops, best of 3: 195 ms per loop

In [5]: %timeit np.save('/tmp/npyarray.tmp', a) # note: saves to
'/tmp/npyarray.tmp.npy'
10 loops, best of 3: 136 ms per loop

Ian Langmore
Feb 22, 2010, 3:51:31 PM
to apam-pyt...@googlegroups.com
I get more or less the same results: a.dump takes 86 ms, and np.save takes 70.4 ms.  Since dump uses pickling, I'm guessing that it is optimized for numpy arrays.  It appears that something more complicated happens when the object is not simply a numpy array:

In [90]: class Myob:
   ....:     pass
   ....:

In [92]: myob = Myob()

In [93]: myob.name = 'temp'

In [94]: myob.a = np.ones(4e6)

In [95]: f = open('temp.pick', 'w')

In [96]: %timeit cPickle.dump(myob, f, protocol=2)
ERROR: An unexpected error occurred while tokenizing input
The following traceback may be corrupted or invalid
The error message is: ('EOF in multi-line statement', (17, 0))

Some other error messages....

I am able to dump using protocol=0, but it takes so long that I gave up on timing it.  This is odd, since I am able to use cPickle with protocol=2 (in IPython) on objects that I created outside of IPython.
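
(One possible culprit, though I haven't verified it: protocol 2 is a binary format, so the file should probably be opened in binary mode, i.e.

f = open('temp.pick', 'wb')   # binary mode for the binary protocols

rather than plain 'w'. I don't know yet whether that explains the IPython error.)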

-Ian


