Hi,
Using LdaMulticore with 500 topics on a preprocessed English Wikipedia corpus (~3 million docs, ~270,000 features) resulted in the error below. chunksize was set to 10,000 and workers to 6; otherwise I used the default settings. The script was run in a virtual environment with Python 3.6, gensim 3.1.0 and numpy 1.13.3, on a server running RHEL 7. The same script ran fine with 100 and 200 topics and the same settings.
Traceback (most recent call last):
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/multiprocessing/queues.py", line 240, in _feed
    send_bytes(obj)
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/multiprocessing/connection.py", line 393, in _send_bytes
    header = struct.pack("!i", n)
struct.error: 'i' format requires -2147483648 <= number <= 2147483647

I am a newbie to gensim, so I may be missing something trivial. However, this error looks like the limit on the size of objects passed between processes (see https://stackoverflow.com/questions/16576386/byte-limit-when-transferring-python-objects-between-processes-using-a-pipe). At first I thought it was related to the document chunk passed to each worker, but changing the chunk size did not help (same problem with chunksize=2000 and chunksize=1000). Is it due to the topics*words matrix, then? If so, is there an easy workaround? Any help is appreciated!
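As a rough check of that guess, here is a back-of-the-envelope sketch. It assumes the model holds dense float64 arrays of shape (num_topics, num_terms). I have not verified how many such arrays gensim actually pickles into a single message, so the "two matrices" case is purely hypothetical. The helper names matrix_bytes and fits_in_header are made up for this sketch, and pickling overhead is ignored.

```python
import struct

# Figures taken from the post above; the dtype is an assumption.
NUM_TERMS = 270_000
BYTES_PER_VALUE = 8      # assumption: dense float64 values
INT32_MAX = 2**31 - 1    # largest n that struct.pack("!i", n) accepts


def matrix_bytes(num_topics, num_terms=NUM_TERMS):
    """Raw size in bytes of one dense topics x terms matrix."""
    return num_topics * num_terms * BYTES_PER_VALUE


def fits_in_header(n):
    """True if a payload of n bytes fits the signed 32-bit length header
    that appears in the traceback (struct.pack("!i", n))."""
    try:
        struct.pack("!i", n)
        return True
    except struct.error:
        return False


if __name__ == '__main__':
    for topics in (100, 200, 500):
        one = matrix_bytes(topics)
        # Columns: topics, bytes for one matrix, one fits?, two fit?
        print(topics, one, fits_in_header(one), fits_in_header(2 * one))
```

Under these assumptions a single 500-topic matrix is about 1.08 GB and would still fit, while two of them together (about 2.16 GB) would not. At 200 topics even two matrices fit, which would be consistent with the smaller runs working.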
Here is the script I used:
# imports, setups
import logging, gensim, bz2

# main
if __name__ == '__main__':
    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
    logging.root.level = logging.INFO

    # base folder
    baseF = '/dartfs-hpc/rc/home/h/f002s3h/Documents/wikiCorpus/'
    targetF = '/dartfs-hpc/rc/home/h/f002s3h/Documents/wikiModel_topics500_multicore/'

    # load id to word mappings from txt, form dictionary
    id2wordDict = gensim.corpora.Dictionary.load_from_text(baseF + 'wiki_wordids.txt.bz2')

    # load corpus iterator
    mm = gensim.corpora.MmCorpus(baseF + 'wiki_bow.mm.bz2')

    # create LDA model, 500 topics, chunks 10000, passes 1, processes 6
    ldaModel = gensim.models.ldamulticore.LdaMulticore(
        corpus=mm, num_topics=500, id2word=id2wordDict,
        workers=6, chunksize=10000, passes=1)
    ldaModel.save(targetF + 'ldaModel')
Best,
adam