LSA Memory Error


aneesha

Jul 30, 2011, 8:31:28 PM
to gensim
Hi

I get the following error when trying to run the LSA model:

2011-07-31 10:25:08,801 : INFO : loaded corpus index from c:\gensim\wiki_tfidf.mm.index
2011-07-31 10:25:08,801 : INFO : initializing corpus reader from c:\gensim\wiki_tfidf.mm
2011-07-31 10:25:08,928 : INFO : accepted corpus with 3501556 documents, 100000 features, 542748074 non-zero entries
MmCorpus(3501556 documents, 100000 features, 542748074 non-zero entries)
2011-07-31 10:25:08,937 : INFO : using serial LSI version on this node
2011-07-31 10:25:08,938 : INFO : updating model with new documents
2011-07-31 10:25:08,979 : INFO : preparing a new chunk of documents
2011-07-31 10:25:56,012 : INFO : using 100 extra samples and 2 power iterations
2011-07-31 10:25:56,884 : INFO : 1st phase: constructing (100000, 300) action matrix
2011-07-31 10:26:27,049 : INFO : orthonormalizing (100000, 300) action matrix
Traceback (most recent call last):
  File "lsa.py", line 13, in <module>
    lsi = gensim.models.lsimodel.LsiModel(corpus=mm, id2word=id2word, num_topics=200)
  File "c:\python27\lib\site-packages\gensim-0.8.0-py2.7.egg\gensim\models\lsimodel.py", line 310, in __init__
    self.add_documents(corpus)
  File "c:\python27\lib\site-packages\gensim-0.8.0-py2.7.egg\gensim\models\lsimodel.py", line 366, in add_documents
    update = Projection(self.num_terms, self.num_topics, job)
  File "c:\python27\lib\site-packages\gensim-0.8.0-py2.7.egg\gensim\models\lsimodel.py", line 117, in __init__
    power_iters=P2_EXTRA_ITERS, extra_dims=P2_EXTRA_DIMS)
  File "c:\python27\lib\site-packages\gensim-0.8.0-py2.7.egg\gensim\models\lsimodel.py", line 642, in stochastic_svd
    q, r = matutils.qr_destroy(y)  # orthonormalize the range
  File "c:\python27\lib\site-packages\gensim-0.8.0-py2.7.egg\gensim\matutils.py", line 284, in qr_destroy
    a = numpy.asfortranarray(la[0])
  File "c:\python27\lib\site-packages\numpy\core\numeric.py", line 408, in asfortranarray
    return array(a, dtype, copy=False, order='F', ndmin=1)
MemoryError


I am using the following code:

import logging, gensim
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
id2word = gensim.corpora.Dictionary.load_from_text('c:\gensim\_wordids.txt')
mm = gensim.corpora.MmCorpus('c:\gensim\_tfidf.mm')
lsi = gensim.models.lsimodel.LsiModel(corpus=mm, id2word=id2word, num_topics=200)
lsi.save('c:\gensim\wikilsamodel.lsa')
lsi.print_topics(10)


Any help would be much appreciated

Aneesha

Radim

Jul 31, 2011, 4:30:27 PM
to gensim
Hello Aneesha,

memory management works differently under Windows, so it's quite
possible you're hitting different "memory limits" in LSA than you would
on the same hardware under a different OS.

You don't say how much RAM (physical, virtual) you have. My
suggestions are:

1. terminate unrelated programs that consume a lot of memory
(browsers, IDEs, ...?)
2. use a smaller vocabulary (100k words is a lot, try 50k); see the sketch just below.
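
For example, a rough sketch of trimming the dictionary to ~50k words before regenerating the tf-idf corpus (the thresholds and the keep_n argument are illustrative, not something tested in this thread; check that your gensim version's filter_extremes accepts them):

import gensim

# load the wordids file produced by the wikipedia preprocessing step
# (same path as in your script)
id2word = gensim.corpora.Dictionary.load_from_text(r'c:\gensim\_wordids.txt')

# keep only the ~50k most useful words; no_below / no_above / keep_n
# are illustrative values, tune them to your corpus
id2word.filter_extremes(no_below=20, no_above=0.1, keep_n=50000)
id2word.save_as_text(r'c:\gensim\_wordids_50k.txt')

# the tf-idf corpus then has to be rebuilt against this smaller
# dictionary before running LsiModel on it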

Let me know how that works; I'll modify the tutorial defaults so that
they work on Windows as well, for a 2GB RAM machine.

Radim

aneesha

Jul 31, 2011, 7:17:54 PM
to gensim
Hi

I have 8Gb of RAM. I have successfully run LDA with gensim using the
same 100k vocab. If I monitor the RAM usage, I can see that at 4Gb the
memory error is thrown. I will reduce the vocab and see how I go.

Many Thanks

Aneesha

Radim

Aug 1, 2011, 7:51:54 AM
to gensim
On Aug 1, 1:17 am, aneesha <aneesha.bakha...@gmail.com> wrote:
> Hi
>
> I have 8Gb of RAM. I have successfully run LDA with gensim using the
> same 100k vocab. If I monitor the RAM usage, I can see that at 4Gb the
> memory error is thrown. I will reduce the vocab and see how I go.

I'm guessing that limitation comes from using 32-bit Windows (or 32-bit
Python on 64-bit Windows), which I understand only gives 2GB of address
space per process.

Why you see 4GB used, I don't know; if you *do* have 64-bit Python,
there should be no such limit. Perhaps it's something specific to numpy on
Windows? Or your memory is so fragmented that malloc cannot find a
contiguous block of free memory for the large matrix.
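
A quick way to check which Python you're actually running (plain standard-library calls, nothing gensim-specific):

import platform, struct

print(platform.architecture())   # e.g. ('32bit', 'WindowsPE') vs ('64bit', 'WindowsPE')
print(struct.calcsize('P') * 8)  # pointer size in bits: 32 or 64

For scale: a (100000, 300) float64 matrix is roughly 100000 * 300 * 8 bytes, about 240 MB, and numpy.asfortranarray allocates another contiguous block of that size if the input isn't already Fortran-ordered, so a fragmented 32-bit address space can easily fail right there.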

The Wikipedia script certainly shouldn't require 8GB or even 4GB of
RAM, because I don't have that much myself :-)

Best,
Radim