LdaMulticore spawning #workers processes but using a single processor


Stephen Wu

Jun 18, 2015, 5:02:47 PM
to gen...@googlegroups.com
I'm running on a machine with 16 cores.  LdaMulticore seems to recognize that I have 16 cores and by default starts 16 workers.  However, all the workers are divvying up work on the same processor.  So on my 900k-document corpus, this is taking a while.

I had a few hypotheses about why this might be happening and talked to others about some of them.  So far, I don't think the culprit is any of the below, but I could be wrong:
  • I wrapped LdaMulticore in a custom scikit-learn estimator, and this estimator does give real results after being trained.
  • I am running on a 900k-document corpus that sits in memory at about 10+GB.
  • I'm kicking it off within IPython inside a screen session.
  • I've tested running a few other Python processes, and they all use the same CPU.  E.g., I'm trying to parse Wikipedia using gensim, and its worker(s) also use the same CPU.
Any help appreciated.
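[Editor's note] For anyone debugging this, a quick way to check whether the interpreter itself has been pinned is to inspect its CPU affinity mask. A minimal sketch, assuming Linux and Python 3.3+ (`os.sched_getaffinity` is not available on other platforms):

```python
import os

# The set of CPU ids this process is *allowed* to run on (pid 0 = self).
# If every gensim worker reports a one-element set on a 16-core machine,
# something (commonly a BLAS library at import time) has pinned them.
allowed = os.sched_getaffinity(0)
print("process may run on %d of %d CPUs" % (len(allowed), os.cpu_count()))
pinned = len(allowed) < os.cpu_count()
```

Running the same check inside each worker (e.g. via `os.sched_getaffinity(worker_pid)`) shows whether the restriction was inherited from the parent.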

Stephen Wu

Jun 19, 2015, 12:21:16 PM
to gen...@googlegroups.com
I killed the processes and reran them with no/minimal changes and parallelization is working just fine.  Unclear why, which is a bit unsatisfying after several hours of digging.
Leading hypothesis: this was probably some OS-level thing, e.g., processes might have wanted to stay on the same processor to make use of caches efficiently.  
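[Editor's note] If the cause is a hard affinity restriction rather than the scheduler's cache preference, it can be undone at runtime. A hypothetical fix, Linux-only, Python 3.3+ (`os.sched_setaffinity` rewrites the allowed-CPU set for a pid):

```python
import os

# Re-allow the current process (pid 0) to run on every online CPU.
# Workers forked *after* this call inherit the widened mask, so this
# must run before LdaMulticore spawns its worker processes.
os.sched_setaffinity(0, range(os.cpu_count()))
widened = os.sched_getaffinity(0)
```

Note that a scheduler merely *preferring* one core would not show up here; only an explicit affinity mask does.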

stephen

Radim Řehůřek

Jun 19, 2015, 12:39:53 PM
to gen...@googlegroups.com, ste...@trapit.com
Hello Stephen,

do you happen to have a log from when things didn't work (INFO level, or preferably DEBUG)?

I'm thinking maybe one of the processes failed / died for some reason, and the multiprocessing didn't recover. If that's the case, there should be a stack trace in the log.

Just a wild hypothesis :)

Radim

Stephen Wu

Jun 19, 2015, 1:34:33 PM
to gen...@googlegroups.com
Thanks for following up.  In the end, I haven't actually gotten the training to work, so I'd welcome you looking at the issue!

I didn't see anything notable at INFO level, but unfortunately I don't have the logs for LdaMulticore.  I was running make_wiki simultaneously, though, and it was trying to do everything on the same core that LdaMulticore was -- so maybe there's something in that.  The make_wiki process would have completed but was just going really slowly.  Below is the fairly normal INFO output of make_wiki, up to the point where I cut it off.

stephen


2015-06-18 10:17:54,373 : INFO : adding document #2990000 to Dictionary(2000000 unique tokens: [u'tripolitan', u'ftdna', u'fi\u0250', u'soestdijk', u'phintella']...)
2015-06-18 10:20:31,873 : INFO : discarding 37835 tokens: [(u'giravee', 1), (u'actuariesindia', 1), (u'wonho', 1), (u'nerdocrumbesia', 1), (u'jidova', 1), (u'alfredomacias', 1), (u'ysa\u04f1e', 1), (u'saraldi', 1), (u'belvilacqua', 1), (u'cargharay', 1)]...
2015-06-18 10:20:31,879 : INFO : keeping 2000000 tokens which were in no less than 0 and no more than 3000000 (=100.0%) documents
2015-06-18 10:20:43,771 : INFO : resulting dictionary: Dictionary(2000000 unique tokens: [u'tripolitan', u'ftdna', u'fi\u0250', u'soestdijk', u'phintella']...)
2015-06-18 10:20:43,940 : INFO : adding document #3000000 to Dictionary(2000000 unique tokens: [u'tripolitan', u'ftdna', u'fi\u0250', u'soestdijk', u'phintella']...)^C
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/home/swu/trapit/research/.virt/lib/python2.7/site-packages/gensim/scripts/make_wiki.py", line 83, in <module>
    wiki = WikiCorpus(inp, lemmatize=lemmatize) # takes about 9h on a macbook pro, for 3.5m articles (june 2011)
  File "/home/swu/trapit/research/.virt/local/lib/python2.7/site-packages/gensim/corpora/wikicorpus.py", line 270, in __init__
    self.dictionary = Dictionary(self.get_texts())
  File "/home/swu/trapit/research/.virt/local/lib/python2.7/site-packages/gensim/corpora/dictionary.py", line 58, in __init__
    self.add_documents(documents, prune_at=prune_at)
  File "/home/swu/trapit/research/.virt/local/lib/python2.7/site-packages/gensim/corpora/dictionary.py", line 124, in add_documents
    logger.info("adding document #%i to %s", docno, self)
  File "/usr/lib/python2.7/logging/__init__.py", line 1140, in info
    self._log(INFO, msg, args, **kwargs)
  File "/usr/lib/python2.7/logging/__init__.py", line 1258, in _log
    self.handle(record)
  File "/usr/lib/python2.7/logging/__init__.py", line 1268, in handle
    self.callHandlers(record)
  File "/usr/lib/python2.7/logging/__init__.py", line 1308, in callHandlers
    hdlr.handle(record)
  File "/usr/lib/python2.7/logging/__init__.py", line 748, in handle
    self.emit(record)
  File "/usr/lib/python2.7/logging/__init__.py", line 867, in emit
    stream.write(fs % msg)
KeyboardInterrupt
Process PoolWorker-15:
Traceback (most recent call last):
  File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 85, in worker
    task = get()
  File "/usr/lib/python2.7/multiprocessing/queues.py", line 374, in get
    racquire()



ode...@berkeley.edu

Jun 25, 2015, 12:01:18 AM
to gen...@googlegroups.com
Hello, 

I'm having the same problem and would also really appreciate some help. 

Checking "ps -F -A | grep NameOfMyProgram" shows that gensim is spawning the correct number of processes by default, but that they are all on the same processor (I'm on a 24-core Red Hat machine). I'm running inside a virtual environment, but it looks like that shouldn't affect things; when I launched from outside the virtual environment, processes ran on 4 cores, which was better, but still not good. Note that I think I'm calling gensim correctly, as it does distribute to the two cores on my laptop when I run the same code there. 
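[Editor's note] The PSR column that `ps -F` prints can also be read programmatically. A sketch (Linux-only; `last_cpu` is a hypothetical helper name) that pulls the `processor` field — the CPU a task last ran on — out of /proc/&lt;pid&gt;/stat:

```python
import os

def last_cpu(pid):
    """CPU the process last ran on (the PSR column of `ps -F`)."""
    with open("/proc/%d/stat" % pid) as f:
        # the comm field may contain spaces/parens, so split after the
        # last ')' and index into the remaining space-separated fields
        rest = f.read().rsplit(")", 1)[1].split()
    return int(rest[36])  # `processor` is field 39 of /proc/<pid>/stat

print("this process last ran on CPU", last_cpu(os.getpid()))
```

Sampling this for every worker pid a few times during training would confirm whether they are genuinely stuck on one core or just happened to be there at the instant `ps` ran.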

Any help or suggestions are really appreciated, as I'm not really sure where to go from here.

Thanks.
Orianna

Stephen Wu

Jun 25, 2015, 12:01:40 PM
to gen...@googlegroups.com
Interesting, Orianna.  My problem did reappear as well -- shutting down the processes and restarting them doesn't always work.  I also suspect that some of the workers may end up jumping onto the same core later on in processing?  I could be totally wrong about that.  Radim, is there gensim-specific logging that you're looking for?

stephen

ode...@berkeley.edu

Jun 25, 2015, 4:45:25 PM
to gen...@googlegroups.com
Hi, 

Yes, this is a very unfortunate problem that I'd be very happy to see fixed. 

OK, so I double-checked that running in the virtual environment isn't causing any problems. When I run outside it, I also get 26 processes allocated to one processor (I have 24 processors). The output of ps looks like:

>> ps -F -A
UID        PID  PPID  C    SZ   RSS PSR STIME TTY          TIME CMD
[*snip*]
odemasi  61669 59981  0 2738764 9821704 14 03:42 pts/5 00:00:00 python RunLDA.py 2
odemasi  61670 59981  0 2738764 9821704 14 03:42 pts/5 00:00:00 python RunLDA.py 2
odemasi  61671 59981  0 2738764 9821704 14 03:42 pts/5 00:00:00 python RunLDA.py 2
odemasi  61672 59981  0 2738764 9821696 14 03:42 pts/5 00:00:00 python RunLDA.py 2
odemasi  61673 59981  0 2738764 9821696 14 03:42 pts/5 00:00:00 python RunLDA.py 2
odemasi  61674 59981  0 2738764 9821704 14 03:42 pts/5 00:00:00 python RunLDA.py 2
odemasi  61675 59981  0 2738764 9821704 14 03:42 pts/5 00:00:00 python RunLDA.py 2
odemasi  61676 59981  0 2738764 9821704 14 03:42 pts/5 00:00:00 python RunLDA.py 2
odemasi  61681 59981  0 2738764 9821680 14 03:42 pts/5 00:00:00 python RunLDA.py 2
odemasi  61682 59981  0 2738764 9821704 14 03:42 pts/5 00:00:00 python RunLDA.py 2
odemasi  61683 59981  0 2738764 9821704 14 03:42 pts/5 00:00:00 python RunLDA.py 2
odemasi  61684 59981  0 2738764 9821704 14 03:42 pts/5 00:00:00 python RunLDA.py 2
odemasi  61685 59981  0 2738764 9821704 14 03:42 pts/5 00:00:00 python RunLDA.py 2
odemasi  61686 59981  0 2738764 9821704 14 03:42 pts/5 00:00:00 python RunLDA.py 2
odemasi  61687 59981  0 2738764 9821704 14 03:42 pts/5 00:00:00 python RunLDA.py 2
odemasi  61688 59981  0 2738764 9821704 14 03:42 pts/5 00:00:00 python RunLDA.py 2
odemasi  61689 59981  0 2738764 9821704 14 03:42 pts/5 00:00:00 python RunLDA.py 2
odemasi  61694 59981  0 2738764 9821704 14 03:42 pts/5 00:00:00 python RunLDA.py 2
odemasi  61698 59981  0 2738764 9821704 14 03:42 pts/5 00:00:00 python RunLDA.py 2
odemasi  61699 59981  0 2738764 9821704 23 03:42 pts/5 00:00:00 python RunLDA.py 2
odemasi  61700 59981  0 2738764 9821704 14 03:42 pts/5 00:00:00 python RunLDA.py 2
odemasi  61701 59981  0 2738764 9821704 14 03:42 pts/5 00:00:00 python RunLDA.py 2
odemasi  61702 59981  0 2738764 9821696 14 03:42 pts/5 00:00:00 python RunLDA.py 2
odemasi  61703 59981  0 2738764 9821696 14 03:42 pts/5 00:00:00 python RunLDA.py 2
[*snip*]

The standard out that I'm getting is: 
/home/odemasi/Packages/venv/lib/python2.6/site-packages/numpy/lib/utils.py:95: DeprecationWarning: `scipy.sparse.sparsetools` is deprecated!
scipy.sparse.sparsetools is a private module for scipy.sparse, and should not be used.
  warnings.warn(depdoc, DeprecationWarning)
/home/odemasi/Packages/venv/lib/python2.6/site-packages/scipy/lib/_util.py:67: DeprecationWarning: Module scipy.linalg.blas.fblas is deprecated, use scipy.linalg.blas instead
  DeprecationWarning)
2015-06-25 03:36:38,835 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2015-06-25 03:39:34,893 : INFO : built Dictionary(5060602 unique tokens: [u'loyalsubscribers', u'iftheyclosedchipotleiddie', u'\u666e\u6bb5\u306e\u53e3\u8abf\u3067\u4f55\u6ce3\u3044\u3066\u308b\u3093\u3067\u3059\u304b\u79c1\u306f\u3069\u3053\u306b\u3082\u884c\u304d\u307e\u305b\u3093\u304b\u3089\u5927\u4e08\u592b\u3067\u3059\u3092\u8a00\u3046', u'deargodmakeatrade', u'billycorgan']...) from 1 documents (total 5060602 corpus positions)
2015-06-25 03:39:36,283 : INFO : using symmetric alpha at 0.01
2015-06-25 03:39:36,283 : INFO : using serial LDA version on this node
2015-06-25 03:42:20,479 : WARNING : input corpus stream has no len(); counting documents
2015-06-25 03:42:25,018 : INFO : running online LDA training, 100 topics, 1 passes over the supplied corpus of 100000 documents, updating every 48000 documents, evaluating every ~100000 documents, iterating 50x with a convergence threshold of 0.001000
2015-06-25 03:42:25,018 : WARNING : too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy
2015-06-25 03:42:25,023 : INFO : training LDA model using 24 processes
2015-06-25 03:42:27,407 : INFO : PROGRESS: pass 0, dispatched chunk #0 = documents up to #2000/100000, outstanding queue size 1
Traceback (most recent call last):
  File "/usr/lib64/python2.6/multiprocessing/queues.py", line 242, in _feed
    send(obj)
SystemError: NULL result without error in PyObject_Call
2015-06-25 03:42:30,449 : INFO : PROGRESS: pass 0, dispatched chunk #1 = documents up to #4000/100000, outstanding queue size 2
2015-06-25 03:42:30,612 : INFO : PROGRESS: pass 0, dispatched chunk #2 = documents up to #6000/100000, outstanding queue size 3
2015-06-25 03:42:30,793 : INFO : PROGRESS: pass 0, dispatched chunk #3 = documents up to #8000/100000, outstanding queue size 4


A little more about my application: each document is very tiny, and right now I'm constraining the training to 100,000 documents. It takes < 1 min to load and stream through the data. I know that running with this little data won't give me much performance gain, but until I can get it distributing the work I can't run with more data. The process has already been running for 17 hours, which seems like a ridiculously long time for a corpus that is a few MB (9 million documents is ~1.5GB). 

Any suggestions of what to check next? 

Thanks!
Orianna

ode...@berkeley.edu

Jun 26, 2015, 2:16:17 PM
to gen...@googlegroups.com
Hi Stephen, 

tl;dr: I'm hoping it's just a problem with OpenBLAS pinning CPU affinity to the processor the job is launched from, but I can't resolve the issue with the fixes I found online, so I'm sharing with you in hopes that you have brighter ideas than I had. 


Are your scipy and numpy also compiled against OpenBLAS or GotoBLAS? I think that's what I'm working with (OpenBLAS), and it seems that other people have also had trouble getting multiple Python processes to associate with different cores. In particular, I was looking at the following, and it looked like it pertained to our problem:


I tried both launching gensim with:
export OPENBLAS_MAIN_FREE=1 
python myLDAscript.py

and by putting 

import os
os.system('taskset -p 0xffffffff %d' % os.getpid()) # also tried os.system('taskset -p 0xff %d' % os.getpid())

at the beginning of myLDAscript.py. Sometimes that gave me a memory error, so I took it back out:

2015-06-26 00:21:20,469 : INFO : using symmetric alpha at 0.01
2015-06-26 00:21:20,469 : INFO : using serial LDA version on this node
Traceback (most recent call last):
  File "RunLDA_copy.py", line 52, in <module>
    lda = models.ldamulticore.LdaMulticore(corpus_memory_friendly, id2word=dictionary, num_topics=NUMTOPICS, workers=None)
  File "/usr/lib/python2.6/site-packages/gensim/models/ldamulticore.py", line 141, in __init__
    gamma_threshold=gamma_threshold)
  File "/usr/lib/python2.6/site-packages/gensim/models/ldamodel.py", line 313, in __init__
    self.sync_state()
  File "/usr/lib/python2.6/site-packages/gensim/models/ldamodel.py", line 326, in sync_state
    self.expElogbeta = numpy.exp(self.state.get_Elogbeta())
  File "/usr/lib/python2.6/site-packages/gensim/models/ldamodel.py", line 161, in get_Elogbeta
    return dirichlet_expectation(self.get_lambda())
  File "/usr/lib/python2.6/site-packages/gensim/models/ldamodel.py", line 157, in get_lambda
    return self.eta + self.sstats
MemoryError


or 

2015-06-26 00:22:24,037 : INFO : using symmetric alpha at 0.01
2015-06-26 00:22:24,037 : INFO : using serial LDA version on this node
Traceback (most recent call last):
  File "RunLDA_copy2.py", line 52, in <module>
    lda = models.ldamulticore.LdaMulticore(corpus_memory_friendly, id2word=dictionary, num_topics=NUMTOPICS, workers=None)
  File "/usr/lib/python2.6/site-packages/gensim/models/ldamulticore.py", line 141, in __init__
    gamma_threshold=gamma_threshold)
  File "/usr/lib/python2.6/site-packages/gensim/models/ldamodel.py", line 311, in __init__
    self.state = LdaState(self.eta, (self.num_topics, self.num_terms))
  File "/usr/lib/python2.6/site-packages/gensim/models/ldamodel.py", line 79, in __init__
    self.sstats = numpy.zeros(shape)
MemoryError

I tried editing gensim/utils.py and gensim/matutils.py, putting os.system('taskset -p 0xff %d' % os.getpid()) after the imports in there, but that didn't seem to fix things either, so I took it out. I did try running the toy script (with an SVD at the heart of the loop) from the Stack Overflow question above. It ran and distributed to the multiple cores just fine, so I couldn't reproduce the error that user had, even though I'm also running against OpenBLAS. However, gensim still won't work; I tried the fixes above to no avail. 

After all that, I was inspired by http://xcorr.net/2013/05/19/python-refuses-to-use-multiple-cores-solution/ and tried following that by putting 

import numpy
import scipy
import affinity
import multiprocessing
affinity.set_process_affinity_mask(0,2**multiprocessing.cpu_count()-1)

at the top of myLDAscript.py. That also didn't work. 

On a related note, I also tried to get distributed gensim running on my machine, but, well, that didn't go too well. If you got it working and have any suggestions, that would be great. 


I'm at my wits' end. If you have any thoughts I'd love to hear them, otherwise I might switch to another package that my team has used before. Thanks!
Orianna 

Related: 

ode...@berkeley.edu

Jun 26, 2015, 2:28:09 PM
to gen...@googlegroups.com
I just made the vocabulary smaller and now it seems to be distributing and, even more importantly, flying. I set the OPENBLAS_MAIN_FREE environment variable and nothing else.
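[Editor's note] For anyone hitting this later: OpenBLAS reads its environment variables when the library is first loaded, so the setting has to happen before the first numpy/scipy import anywhere in the process. A sketch of doing it from Python rather than the shell (`OPENBLAS_NUM_THREADS` is an optional extra knob to avoid BLAS threads fighting with LdaMulticore workers, not part of Orianna's fix):

```python
import os

# These must be set before the *first* `import numpy` / `import scipy`
# in the process; OpenBLAS reads them at load time.
os.environ["OPENBLAS_MAIN_FREE"] = "1"    # don't pin CPU affinity
os.environ["OPENBLAS_NUM_THREADS"] = "1"  # one BLAS thread per worker

# ... only now import the numeric stack, e.g.:
# import numpy
# from gensim.models import LdaMulticore
```

Setting the variables in the shell (`export OPENBLAS_MAIN_FREE=1`) before launching Python achieves the same thing and is harder to get wrong.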

Cloud Marked

Dec 6, 2016, 12:52:29 PM
to gensim
Has anyone ever figured out what the problem was? I am seeing identical behavior, with only one difference: the error message. Instead of complaining about a NULL result, it complains about an invalid length. All the other symptoms are the same: processes get spawned, but all except one are doing nothing. I did reset affinity in the code and set OPENBLAS_MAIN_FREE=1.

If anyone has figured it out, please, share.