ValueError: array is too big


Cássio Rampelotto Dias

May 1, 2019, 9:40:19 PM
to Gensim
Friends,
I'm building a Raspberry Pi 3B+ cluster. The cluster already runs the basic distributed LSA tests correctly with the data provided in the example. Now I am testing with a larger amount of data (a 500 MB ".txt" file). The first test is to measure the performance of LSA running serially on a single Raspberry Pi, so that I can later compare it with the performance on the cluster. I hit the following problem when I run LSA on just one Raspberry Pi.

pi@raspberrypi:~/wiki/01.data/100 $ python
Python 3.5.3 (default, Sep 27 2018, 17:25:39)
[GCC 6.3.0 20170516] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from gensim.corpora.textcorpus import TextCorpus
>>> from gensim.test.utils import datapath
>>> from gensim import utils
>>> from gensim.corpora import Dictionary, HashDictionary, MmCorpus, WikiCorpus
>>> from gensim.models import TfidfModel
>>> class CorpusMiislita(TextCorpus):
...     stopwords = set('for a of the and to in on'.split())
...     def get_texts(self):
...         for doc in self.getstream():
...             yield [word for word in utils.to_unicode(doc).lower().split() if word not in self.stopwords]
...     def __len__(self):
...         self.length = sum(1 for _ in self.get_texts())
...         return self.length
...
>>> outp = "/home/pi/wiki/01.data/100/human.txt.bz2"
>>> corpus = CorpusMiislita(datapath(outp))
>>> print(len(corpus))
268532
>>> document = next(iter(corpus.get_texts()))
>>> corpus.dictionary.save_as_text(outp + '_wordids.txt.bz2')
>>> dictionary = Dictionary.load_from_text( outp + '_wordids.txt.bz2')
>>> MmCorpus.serialize(outp + '_bow.mm', corpus)
>>> from gensim import corpora, models
>>> from gensim.corpora import Dictionary
>>> import logging
>>> logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
>>> corpus = corpora.MmCorpus('/home/pi/wiki/01.data/100/human.txt.bz2_bow.mm')
2019-05-01 21:03:19,159 : INFO : loaded corpus index from /home/pi/wiki/01.data/100/human.txt.bz2_bow.mm.index
2019-05-01 21:03:19,162 : INFO : initializing cython corpus reader from /home/pi/wiki/01.data/100/human.txt.bz2_bow.mm
2019-05-01 21:03:19,166 : INFO : accepted corpus with 268532 documents, 2009886 features, 46109215 non-zero entries
>>> id2word = Dictionary.load_from_text( '/home/pi/wiki/01.data/100/human.txt.bz2_wordids.txt.bz2')
>>> lsi = models.LsiModel(corpus, id2word=id2word, num_topics=200, chunksize=100)
2019-05-01 21:05:18,984 : INFO : using serial LSI version on this node
2019-05-01 21:05:18,985 : INFO : updating model with new documents
2019-05-01 21:05:19,360 : INFO : preparing a new chunk of documents
2019-05-01 21:05:19,570 : INFO : using 100 extra samples and 2 power iterations
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/pi/.local/lib/python3.5/site-packages/gensim/models/lsimodel.py", line 445, in __init__
    self.add_documents(corpus)
  File "/home/pi/.local/lib/python3.5/site-packages/gensim/models/lsimodel.py", line 512, in add_documents
    power_iters=self.power_iters, dtype=self.dtype
  File "/home/pi/.local/lib/python3.5/site-packages/gensim/models/lsimodel.py", line 199, in __init__
    extra_dims=self.extra_dims, dtype=dtype)
  File "/home/pi/.local/lib/python3.5/site-packages/gensim/models/lsimodel.py", line 919, in stochastic_svd
    y = np.zeros(dtype=dtype, shape=(num_terms, samples))
ValueError: array is too big; `arr.size * arr.dtype.itemsize` is larger than the maximum possible size.
>>>


Raspberry Pi 3B+ specs

CPU: 1.4GHz 64-bit quad-core ARM Cortex-A53;
RAM: 1GB LPDDR2 SDRAM;
SD card: 32 GB;
OS: Linux raspberrypi 4.14.79-v7+ #1159 SMP Sun Nov 4 17:50:20 GMT 2018 armv7 GNU/Linux

Thanks for any help!!



Radim Řehůřek

May 3, 2019, 3:52:27 AM
to Gensim
Hello Cassio,

The code looks fine, though it's also common to normalize the bag-of-words `corpus` first, e.g. by transforming it with TfidfModel.

What is the size of your Dictionary, i.e. `len(id2word)`?
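
The dictionary size matters because the failing line in the traceback allocates a dense array of shape (num_terms, samples). A rough back-of-the-envelope estimate from the numbers in the log, assuming 8-byte float64 values (an assumption about the dtype in use) and assuming the `armv7` kernel string means a 32-bit process; the helper names are hypothetical:

```python
def projection_bytes(num_terms, num_topics, extra_samples=100, itemsize=8):
    """Size in bytes of a dense (num_terms, num_topics + extra_samples) array.

    itemsize=8 assumes float64 values; this is an assumption, not read from gensim.
    """
    return num_terms * (num_topics + extra_samples) * itemsize


def max_terms_for(budget_bytes, num_topics, extra_samples=100, itemsize=8):
    """Largest vocabulary whose dense array still fits in budget_bytes."""
    return budget_bytes // ((num_topics + extra_samples) * itemsize)


# Numbers taken from the log in this thread:
# 2009886 features, num_topics=200, "using 100 extra samples".
array_bytes = projection_bytes(num_terms=2009886, num_topics=200)
print(array_bytes, "bytes, about", round(array_bytes / 2**30, 2), "GiB")

# A 32-bit process cannot address more than 2**32 bytes in total.
print("exceeds 32-bit address space:", array_bytes > 2**32)

# Hypothetical budget of 256 MiB for this one array on a 1 GB board.
print("terms that would fit:", max_terms_for(256 * 2**20, num_topics=200))
```

Under those assumptions the array alone is larger than both the 1 GB of RAM and what a 32-bit process can address, which would explain why NumPy refuses the allocation before trying it.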

If it's too large, try trimming unneeded words to reduce the memory footprint: see the `filter_extremes` method.

It's also a good idea to visually inspect whether the documents coming out of your CorpusMiislita match your expectations. You're simply calling `.split()` on strings to get words, which, unless you've taken care to preprocess the files, might produce "surprising" tokens.
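
One way to do that inspection, sketched in plain Python: count the tokens that are not purely alphabetic in the first few documents. The helper name `suspicious_tokens` and the sample lines are hypothetical:

```python
from collections import Counter
from itertools import islice


def suspicious_tokens(texts, limit=1000, top=20):
    """Return the most common non-alphabetic tokens in the first `limit` documents."""
    counts = Counter()
    for tokens in islice(texts, limit):
        counts.update(tok for tok in tokens if not tok.isalpha())
    return counts.most_common(top)


# Made-up raw lines, tokenized the same way as in the question: lower() + split().
sample_lines = [
    'The "human" interface, version 2.0!',
    "See http://example.org for the <b>graph</b> survey.",
]
texts = (line.lower().split() for line in sample_lines)

report = suspicious_tokens(texts)
for token, count in report:
    print(count, repr(token))
```

With the class from the question, `suspicious_tokens(corpus.get_texts())` should give a quick view of how much punctuation, markup, or URL debris ends up in the vocabulary, which is often what inflates a dictionary to millions of entries.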

HTH,
Radim