Friends,
I'm building a Raspberry Pi 3B+ cluster. The cluster is configured and runs the basic LSA tests perfectly with the data provided in the example.
Now I am testing with a larger amount of data (a 500 MB ".txt" file). The first test is to measure the performance of LSA running serially on a single Raspberry Pi, so that I can later compare it with the performance on the cluster.
I get the following error when I run LSA on a single Raspberry Pi:
pi@raspberrypi:~/wiki/01.data/100 $ python
Python 3.5.3 (default, Sep 27 2018, 17:25:39)
[GCC 6.3.0 20170516] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from gensim.corpora.textcorpus import TextCorpus
>>> from gensim.test.utils import datapath
>>> from gensim import utils
>>> from gensim.corpora import Dictionary, HashDictionary, MmCorpus, WikiCorpus
>>> from gensim.models import TfidfModel
>>> class CorpusMiislita(TextCorpus):
...     stopwords = set('for a of the and to in on'.split())
...     def get_texts(self):
...         for doc in self.getstream():
...             yield [word for word in utils.to_unicode(doc).lower().split() if word not in self.stopwords]
...     def __len__(self):
...         self.length = sum(1 for _ in self.get_texts())
...         return self.length
...
>>> outp = "/home/pi/wiki/01.data/100/human.txt.bz2"
>>> corpus = CorpusMiislita(datapath(outp))
>>> print(len(corpus))
268532
>>> document = next(iter(corpus.get_texts()))
>>> corpus.dictionary.save_as_text(outp + '_wordids.txt.bz2')
>>> dictionary = Dictionary.load_from_text( outp + '_wordids.txt.bz2')
>>> MmCorpus.serialize(outp + '_bow.mm', corpus)
>>> from gensim import corpora, models
>>> from gensim.corpora import Dictionary
>>> import logging
>>> logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
>>> corpus = corpora.MmCorpus('/home/pi/wiki/01.data/100/human.txt.bz2_bow.mm')
2019-05-01 21:03:19,159 : INFO : loaded corpus index from /home/pi/wiki/01.data/100/human.txt.bz2_bow.mm.index
2019-05-01 21:03:19,162 : INFO : initializing cython corpus reader from /home/pi/wiki/01.data/100/human.txt.bz2_bow.mm
2019-05-01 21:03:19,166 : INFO : accepted corpus with 268532 documents, 2009886 features, 46109215 non-zero entries
>>> id2word = Dictionary.load_from_text( '/home/pi/wiki/01.data/100/human.txt.bz2_wordids.txt.bz2')
>>> lsi = models.LsiModel(corpus, id2word=id2word, num_topics=200, chunksize=100)
2019-05-01 21:05:18,984 : INFO : using serial LSI version on this node
2019-05-01 21:05:18,985 : INFO : updating model with new documents
2019-05-01 21:05:19,360 : INFO : preparing a new chunk of documents
2019-05-01 21:05:19,570 : INFO : using 100 extra samples and 2 power iterations
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/pi/.local/lib/python3.5/site-packages/gensim/models/lsimodel.py", line 445, in __init__
    self.add_documents(corpus)
  File "/home/pi/.local/lib/python3.5/site-packages/gensim/models/lsimodel.py", line 512, in add_documents
    power_iters=self.power_iters, dtype=self.dtype
  File "/home/pi/.local/lib/python3.5/site-packages/gensim/models/lsimodel.py", line 199, in __init__
    extra_dims=self.extra_dims, dtype=dtype)
  File "/home/pi/.local/lib/python3.5/site-packages/gensim/models/lsimodel.py", line 919, in stochastic_svd
    y = np.zeros(dtype=dtype, shape=(num_terms, samples))
ValueError: array is too big; `arr.size * arr.dtype.itemsize` is larger than the maximum possible size.
>>>
Raspberry Pi 3B+ specs:
CPU: 1.4 GHz 64-bit quad-core ARM Cortex-A53;
RAM: 1 GB LPDDR2 SDRAM;
SD card: 32 GB;
OS: Linux raspberrypi 4.14.79-v7+ #1159 SMP Sun Nov 4 17:50:20 GMT 2018 armv7 GNU/Linux
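As a rough check on the error above, the size of the array requested in the last traceback frame, np.zeros(shape=(num_terms, samples)), can be estimated from the numbers in the log. This is only a sketch under assumptions: that the dtype is 8-byte float64, that samples is num_topics plus the 100 extra samples reported in the log, and that a 32-bit (armv7) userland cannot hold a single array larger than 2**31 - 1 bytes.

```python
# Rough size estimate for the array allocated in stochastic_svd.
# The counts come from the log above; the item size and the 32-bit
# limit are assumptions, not values confirmed by the traceback.
num_terms = 2009886      # "features" reported when the corpus was loaded
num_topics = 200         # num_topics passed to LsiModel
extra_samples = 100      # "using 100 extra samples" in the log
samples = num_topics + extra_samples

itemsize = 8             # assumed: 8 bytes per value (float64)
required_bytes = num_terms * samples * itemsize

# Assumed upper bound for one array on a 32-bit system.
max_bytes_32bit = 2**31 - 1

print("required bytes:", required_bytes)
print("required GiB: %.2f" % (required_bytes / 2**30))
print("fits in a 32-bit array:", required_bytes <= max_bytes_32bit)
```

If these assumptions hold, the request is several times larger than both the assumed 32-bit limit and the 1 GB of RAM on the board, so the allocation would fail regardless of chunksize; a smaller vocabulary or fewer topics would shrink it proportionally.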
Thanks for any help!!