One facet of the system involves scanning a couple of parallel
directory trees containing on the order of 10^4 files and loading
them into memory. The trees don't change during development/testing,
and the scan takes 30-40 seconds, so to save time I cache the loaded
tree structure to disk: in Perl with the Storable module, and in
Python with pickle.
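The Python side of the caching is roughly this shape (a simplified
sketch; the names are made up and the real code handles two parallel
trees):

import os
import pickle

CACHE = 'tree_cache.dat'   # hypothetical cache file name

def scan_tree(root):
    # Walk the tree and build the in-memory structure (simplified).
    listing = {}
    for dirpath, dirnames, filenames in os.walk(root):
        listing[dirpath] = filenames
    return listing

def load_tree(root):
    # Use the cached structure if present; otherwise scan and cache it.
    if os.path.exists(CACHE):
        with open(CACHE, 'rb') as f:
            return pickle.load(f)
    tree = scan_tree(root)
    with open(CACHE, 'wb') as f:
        pickle.dump(tree, f)
    return tree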
In Perl, the save operation produces a file of about 3 MB, and both
save and restore take a second or two. In Python, pickle.dump()
produces a similar-sized file but takes 20 seconds, and pickle.load()
takes 45 seconds, which is actually LONGER than scanning the
directory trees from scratch.
Is there anything I can do to speed up pickle.load() to get
performance comparable to Perl's Storable?
Have you read this?
http://www.python.org/doc/2.6/library/pickle.html
Have you considered using cPickle instead of pickle?
Have you considered using *ickle.dump(..., protocol=-1)?
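That is, something along these lines (a sketch; cPickle is the
Python 2 name, plain pickle on Python 3, and dir_listing stands in
for whatever structure you're caching):

import pickle   # on Python 2: import cPickle as pickle

dir_listing = {'some/dir': ['a.txt', 'b.txt']}   # stand-in data

with open('dirlisting.dat', 'wb') as f:
    # protocol=-1 means "highest protocol available", a compact binary
    # format; on Python 2 it is much faster than the default ASCII
    # protocol 0.
    pickle.dump(dir_listing, f, protocol=-1)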
I'm using Python 3 on Windows (Server 2003). According to the docs:
"The pickle module has a transparent optimizer (_pickle) written
in C. It is used whenever available. Otherwise the pure Python
implementation is used."
How can I tell if _pickle is being used?
Answered my own question:
>>> import _pickle
>>> dir (_pickle)
['PickleError', 'Pickler', 'PicklingError', 'Unpickler',
'UnpicklingError', '__doc__', '__name__', '__package__']
>>> dir(_pickle.Pickler)
['__class__', '__delattr__', '__doc__', '__eq__', '__format__',
'__ge__', '__getattribute__', '__gt__', '__hash__', '__init__',
'__le__', '__lt__', '__ne__', '__new__', '__reduce__', '__reduce_ex__',
'__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__',
'bin', 'clear_memo', 'dump', 'fast', 'memo', 'persistent_id']
_pickle seems to be there. Also, if I step into the load
call (PyDev under Eclipse), it steps into pickle.load() but
won't step into the call to the Unpickler constructor. I
assume that means it's calling out to the C implementation.
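A more direct check (this assumes CPython's arrangement, where
pickle.py imports the accelerated classes from _pickle when it can):

import pickle
import _pickle

# If the C accelerator was picked up, pickle's public classes *are*
# the _pickle ones, so an identity test settles it.
print(pickle.Pickler is _pickle.Pickler)   # True -> C implementation
print(pickle.Unpickler.__module__)         # '_pickle' when accelerated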
The slow performance is most likely due to the poor performance of
Python 3's I/O, which is caused by (among other things) a bad
buffering strategy. It's a Python 3 growing pain, and the I/O
library is being rewritten. Python 3.1 should be much faster, but
it hasn't been released yet.
As a workaround, mmap the file instead. For example (untested):
import mmap
import pickle

f = open('dirlisting.dat', 'rb')
try:
    f.seek(0, 2)        # seek to the end to learn the file size
    size = f.tell()
    f.seek(0, 0)        # rewind to the start
    m = mmap.mmap(f.fileno(), size, access=mmap.ACCESS_READ)
    try:
        # unpickle straight from the mapped bytes, bypassing the
        # slow buffered-read path
        dir_listing = pickle.loads(m)
    finally:
        m.close()
finally:
    f.close()
Pickling the output is left as an exercise.
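(One possible shape for that exercise, equally untested, in the same
style; dir_listing stands in for the structure built by the scan:)

import pickle

dir_listing = {}   # stand-in for the scanned tree structure

f = open('dirlisting.dat', 'wb')
try:
    # protocol=-1 picks the highest available protocol and keeps
    # the file compact
    pickle.dump(dir_listing, f, protocol=-1)
finally:
    f.close()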
Carl Banks
My suggestion for the Original Poster is just to try using Python 2.x,
if possible :-)
Bye,
bearophile
3.1a1 is out and I believe it has the io improvements.
Massive ones, too. It'd be interesting to see your results on the alpha.
On 3.1a1 the unpickle step takes 2.4 seconds, an 1875% improvement.
Thanks.
Surely you mean a 94.7% improvement?
Jean-Paul
regards
Steve
If you double your velocity (a 100% increase), the time required
goes down by 50%. A 1000% increase in velocity results in a 90%
decrease in time, and so on. I guess I equate "performance" with
velocity.
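For the numbers in this thread, both readings are easy to compute
(a quick sanity check, not from the original posts):

old, new = 45.0, 2.4   # unpickle time before and after 3.1a1, seconds

ratio = old / new                  # 18.75 -> 1875% of the old speed
increase = (ratio - 1) * 100       # 1775% increase in velocity
decrease = (1 - new / old) * 100   # 94.7% decrease in time

print(ratio, increase, decrease)   # 18.75 1775.0 94.66...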