multiprocessing question, LDA multicore

adam....@gmail.com

Nov 27, 2017, 3:41:09 PM

to gensim

Hi,

Running LdaMulticore with 500 topics on a preprocessed English Wikipedia corpus (~3 million docs, ~270,000 features) resulted in the error below. Chunksize was set to 10,000 and workers to 6; otherwise I used the default settings. The script was run in a virtual environment with Python 3.6, gensim 3.1.0 and numpy 1.13.3, on a server running RHEL 7. The same script ran fine with 100 and 200 topics and otherwise identical settings.

Traceback (most recent call last):
File "/opt/rh/rh-python36/root/usr/lib64/python3.6/multiprocessing/queues.py", line 240, in _feed
    send_bytes(obj)
File "/opt/rh/rh-python36/root/usr/lib64/python3.6/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
File "/opt/rh/rh-python36/root/usr/lib64/python3.6/multiprocessing/connection.py", line 393, in _send_bytes
    header = struct.pack("!i", n)
struct.error: 'i' format requires -2147483648 <= number <= 2147483647

I am a newbie to gensim, so I could be missing something trivial. However, this error looks like the known limit on the size of objects passed between processes (see here: https://stackoverflow.com/questions/16576386/byte-limit-when-transferring-python-objects-between-processes-using-a-pipe). At first I thought it was related to the document chunk passed to each worker, but changing the chunk size did not help (same problem with chunksize=2000 and chunksize=1000). Is it caused by the topics × words matrix, then? If so, is there an easy workaround? Any help is appreciated!
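For reference, the limit can be reproduced in isolation. The traceback shows multiprocessing packing the message length with struct.pack("!i", n), i.e. as a signed 32-bit integer, so any payload of 2**31 bytes or more fails at that line. Below is a minimal sketch; the helper name fits_in_length_header is made up for illustration and is not part of multiprocessing or gensim.

```python
import struct

# Largest value the "!i" (signed 32-bit, network byte order) format can encode.
INT32_MAX = 2**31 - 1


def fits_in_length_header(num_bytes):
    """Return True if num_bytes can be packed the way the traceback's
    _send_bytes packs the message length (hypothetical helper)."""
    try:
        struct.pack("!i", num_bytes)
        return True
    except struct.error:
        return False


if __name__ == "__main__":
    print(fits_in_length_header(INT32_MAX))      # just under the limit
    print(fits_in_length_header(INT32_MAX + 1))  # one byte too many
```

This only checks the header encoding used in the Python 3.6 code path from the traceback; it says nothing about which gensim object was being sent.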

Here is the script I used:

# imports, setups
import logging, gensim, bz2

# main
if __name__ == '__main__':

    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
    logging.root.level = logging.INFO

    # base folder
    baseF = '/dartfs-hpc/rc/home/h/f002s3h/Documents/wikiCorpus/'
    targetF = '/dartfs-hpc/rc/home/h/f002s3h/Documents/wikiModel_topics500_multicore/'

    # load id to word mappings from txt, form dictionary
    id2wordDict = gensim.corpora.Dictionary.load_from_text(baseF + 'wiki_wordids.txt.bz2')
    # load corpus iterator
    mm = gensim.corpora.MmCorpus(baseF + 'wiki_bow.mm.bz2')

    # create LDA model, 500 topics, chunks 10000, passes 1, processes 6
    ldaModel = gensim.models.ldamulticore.LdaMulticore(
        corpus=mm, num_topics=500, id2word=id2wordDict,
        workers=6, chunksize=10000, passes=1)
    ldaModel.save(targetF + 'ldaModel')

Best,
adam

Ivan Menshikh

Nov 27, 2017, 10:58:16 PM

to gensim

Hello,

You are right: this error happens because LdaMulticore tried to pass an object between processes that was too big. The size of this object depends mainly on the size of the gensim.corpora.Dictionary and the number of topics.

A simple workaround is to reduce the number of topics and/or the dictionary size (for example, remove irrelevant tokens from the dictionary with the .filter_extremes method).
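To get a feel for how the number of topics and the dictionary size interact, here is a rough back-of-the-envelope sketch. It rests on two assumptions that are not confirmed in this thread: that values are stored as 8-byte floats, and that the pickled object carries about two topics × terms matrices (the matrices_in_payload parameter is a guess, not something taken from the gensim source). The function names are made up for illustration.

```python
# Largest message length the signed 32-bit header in the traceback can encode.
INT32_MAX = 2**31 - 1


def matrix_bytes(num_topics, num_terms, bytes_per_value=8):
    """Size of one dense topics x terms matrix (assumes 8-byte floats)."""
    return num_topics * num_terms * bytes_per_value


def estimated_payload_bytes(num_topics, num_terms, matrices_in_payload=2):
    """Rough payload estimate; matrices_in_payload=2 is an assumption."""
    return matrices_in_payload * matrix_bytes(num_topics, num_terms)


if __name__ == "__main__":
    # (topics, dictionary size): the two working runs' larger case, the
    # failing run, and a hypothetical dictionary cut down to 100,000 tokens.
    for num_topics, num_terms in [(200, 270_000), (500, 270_000), (500, 100_000)]:
        size = estimated_payload_bytes(num_topics, num_terms)
        verdict = "over" if size > INT32_MAX else "under"
        print(f"{num_topics} topics x {num_terms} terms: "
              f"~{size / 1e9:.2f} GB, {verdict} the 2**31 - 1 byte limit")
```

Under these assumptions the estimate matches what was reported: 200 topics stays under the limit and 500 topics goes over it with the full ~270,000-term dictionary. Capping the dictionary, for example with the keep_n argument of filter_extremes, would bring the 500-topic case back under. Note that filtering a dictionary reassigns token ids, so a bag-of-words corpus built with the old ids would need to be rebuilt or remapped afterwards.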

adam....@gmail.com

Nov 28, 2017, 9:37:17 AM

to gensim

Thanks for the quick answer. I will rely on the single-core version, which works just fine.

best,
