cPickle Problem

Stephan Gabler

Apr 14, 2011, 5:39:33 AM
to gen...@googlegroups.com

Hello List,


I sometimes run into a problem when saving a large LSI model to disk.
There is enough space on the disk, so that should not be the problem.

You can find the error message here:
http://pastie.org/1794462

Any ideas?


stephan

Radim

Apr 15, 2011, 12:46:19 PM
to gensim
Hello Stephan, this is the first time I have seen that error! Do you have
a minimal reproducible example, so that we can post this to the numpy dev
forum? (I suspect this error is coming from deep within numpy.)

Radim

Radim

Apr 21, 2011, 12:17:27 PM
to gensim
Matt managed to pinpoint the problem, so I'm forwarding his reply to
clear up the issue.

Thanks a lot for investigating,
Radim


# From: Matt Goodman
# ----------------------------------------
# I dug in a bit more, and it is a pickle/cPickle error.
#
# A minimal example that reproduces the error is something like the following:
# pickle.dump("a"*2**(32-1), open("test.pkl","w"))
#
# See here for more details:
# http://www.gossamer-threads.com/lists/python/bugs/904320
#
# I reported it to both numpy (which might wrap the stupid error in the
# meantime, or do a size check before serializing) and the core python-dev
# (which already seems to have a similar report
# <http://bugs.python.org/issue11564>). It is not a typical problem for most
# Python users, but high-end power users can crash into this pretty easily,
# and I would have at least anticipated a more meaningful error message. As
# it stands, it looks like it is just overflowing an int32 and failing an
# internal sanity check ungracefully, hence the reporting ugliness.
#
# I definitely see hdf5 in the future of your project. Some time closer to
# the summer I might have some time to contribute code along that avenue.
# Your corpus and dictionary concepts could safely share a single file, and
# the IO would be a lot faster than having to wrangle ASCII each pass.
# --Matthew Goodman
#
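
For reference, the "size check before serializing" that Matt mentions could look roughly like the sketch below. It assumes the limit is the signed 32-bit maximum, as his diagnosis suggests, and that the payload is a single string or bytes object. The names fits_in_int32 and checked_dump are hypothetical helpers, not part of pickle, numpy or gensim.

```python
import pickle

# Assumption: the failing length check is against a signed 32-bit integer.
INT32_MAX = 2**31 - 1


def fits_in_int32(nbytes):
    """Return True if a length of `nbytes` fits in a signed 32-bit integer."""
    return 0 <= nbytes <= INT32_MAX


def checked_dump(payload, path):
    """Pickle a str/bytes payload to `path`, refusing early if it is too long.

    Hypothetical helper: it only checks the length of one flat payload,
    not the total size of an arbitrary object graph.
    """
    if not fits_in_int32(len(payload)):
        raise ValueError(
            "payload length %d exceeds the int32 limit %d"
            % (len(payload), INT32_MAX)
        )
    with open(path, "wb") as fout:
        pickle.dump(payload, fout, protocol=pickle.HIGHEST_PROTOCOL)
```

The point of checking up front is only to get a readable error message instead of the internal one; it does not make the large object storable.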

Stephan Gabler

May 4, 2011, 4:05:52 AM
to gen...@googlegroups.com

Hey guys,

thanks a lot for investigating the problem.

Radim: how do you suggest dealing with this problem?
I suggest adding functionality to the lsi_model (or to models in general) to
store its very large matrices using a different method (numpy's saving methods, for example) and
then excluding them from the normal pickling process, as described here:
http://stackoverflow.com/questions/2345944/exclude-objects-field-from-pickling-in-python
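
A minimal sketch of that idea, assuming a model with one large matrix attribute: the matrix goes to a separate .npy file via numpy.save, and __getstate__ keeps it out of the pickle. ModelSketch, its attribute names and the ".npy" suffix convention are made up for illustration; this is not the actual lsi_model code.

```python
import pickle

import numpy as np


class ModelSketch(object):
    """Hypothetical stand-in for a model holding one large matrix."""

    def __init__(self, matrix, name):
        self.matrix = matrix  # large array, stored outside the pickle
        self.name = name      # small attribute, pickled as usual

    def __getstate__(self):
        # Copy the instance dict and drop the matrix so pickle never sees it.
        state = self.__dict__.copy()
        del state["matrix"]
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self.matrix = None  # filled in by load()

    def save(self, fname):
        # The matrix goes to its own file; everything else is pickled.
        np.save(fname + ".npy", self.matrix)
        with open(fname, "wb") as fout:
            pickle.dump(self, fout, protocol=pickle.HIGHEST_PROTOCOL)

    @classmethod
    def load(cls, fname):
        with open(fname, "rb") as fin:
            model = pickle.load(fin)
        model.matrix = np.load(fname + ".npy")
        return model
```

With this layout the pickle file stays small regardless of the matrix size, since only the remaining attributes pass through pickle.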

What do you think?

stephan

Radim

May 4, 2011, 4:20:43 PM
to gensim
Hello,

> Radim: how do you suggest dealing with this problem?
> I suggest adding functionality to the lsi_model (or to models in general) to
> store its very large matrices using a different method (numpy's saving methods, for example) and
> then excluding them from the normal pickling process, as described here: http://stackoverflow.com/questions/2345944/exclude-objects-field-from...

Good question. I envision switching to storing the model matrices in
raw format, just like you say, so that they can later be mmap'ed (and
the read-only model then shared across multiple processes, to save
memory).
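
As a rough illustration of the mmap idea, and not a description of any existing gensim code: numpy.save writes an array in .npy format, and numpy.load with mmap_mode="r" maps that file read-only instead of reading it into memory. The function names save_raw and load_mmap are hypothetical, and the file name is assumed to end in ".npy".

```python
import numpy as np


def save_raw(matrix, fname):
    """Write `matrix` to `fname` in .npy format (fname should end in .npy)."""
    np.save(fname, matrix)


def load_mmap(fname):
    """Map the stored matrix read-only instead of loading it into memory.

    Pages are read from disk on demand, and processes that map the same
    file can share them through the operating system's page cache.
    """
    return np.load(fname, mmap_mode="r")
```

Because the mapping is opened read-only, an accidental write to the shared matrix raises an error instead of silently modifying the file.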

This is in anticipation of implementing distributed similarity
queries. I would like that done during the gensim coding sprint in
Berlin. Distributing stuff will be one of the major tasks there :-)

If you have a need to get this functionality faster, I can certainly
assist you in adding it and crafting the design.

Best,
Radim

