Matt managed to pinpoint the problem, so I'm forwarding his reply to
clear up the issue.
Thanks a lot for investigating,
Radim
# From: Matt Goodman
# ----------------------------------------
# I dug in a bit more, and it is a pickle/cPickle error.
#
# A minimal example that reproduces the error is something like the following:
# pickle.dump("a" * 2**(32-1), open("test.pkl", "wb"))
#
# See here for more details:
# http://www.gossamer-threads.com/lists/python/bugs/904320
#
# I reported it to both numpy (which might wrap the stupid error in the
# meantime, or do a size check before serializing) and the core python-dev
# (which already seems to have a similar report:
# http://bugs.python.org/issue11564). It is not a typical problem for most
# Python users, but high-end power users can crash into this pretty easily,
# and I would have at least anticipated a more meaningful error message. As
# it stands, it looks like it is just overflowing an int32 and failing an
# internal sanity check ungracefully, hence the reporting ugliness.
#
# I definitely see hdf5 in the future of your project. Some time closer to
# the summer I might have some time to contribute code along that avenue.
# Your corpus and dictionary concepts could safely share a single file, and
# the IO would be a lot faster than having to wrangle ASCII each pass.
# --Matthew Goodman
#
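
For reference, the "size check before serializing" that Matt mentions could be sketched roughly as below. This is only an illustration, not what numpy or python-dev actually do: payload_size, safe_dump and the limit parameter are hypothetical names, and the 2**31 - 1 threshold is an assumption based on the int32 overflow described above.

```python
import pickle

# Assumed threshold: the largest value a signed 32-bit integer can hold.
INT32_MAX = 2**31 - 1


def payload_size(obj):
    """Best-effort size estimate for objects with an obvious flat payload.

    Returns None when no cheap estimate is available (e.g. nested containers).
    """
    nbytes = getattr(obj, "nbytes", None)  # array-like objects expose this
    if isinstance(nbytes, int):
        return nbytes
    if isinstance(obj, (bytes, bytearray)):
        return len(obj)
    if isinstance(obj, str):
        return len(obj)  # lower bound: at least one byte per character
    return None


def safe_dump(obj, path, limit=INT32_MAX):
    """Pickle obj to path, failing early with a readable message if too big."""
    size = payload_size(obj)
    if size is not None and size > limit:
        raise ValueError(
            "object payload is about %d bytes, above the assumed limit of %d; "
            "refusing to pickle it" % (size, limit)
        )
    with open(path, "wb") as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)
```

Calling safe_dump("a" * 2**31, "test.pkl") would then raise a ValueError naming the size, rather than failing inside the pickler. The check only covers objects whose size is cheap to estimate; a large object buried inside a list or dict would still slip through.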