Problem with workers and run time


Hannes

Jan 16, 2021, 3:02:52 PM
to Gensim
Hello all!

I'm currently working with LDA topic models and stumbled upon some things I can't clarify myself.
 First of all: here is one of the scripts I'm running. https://github.com/Veritogen/master_thesis/blob/master/run_pol_lda.py
It will call this code, where I implemented the LDA: https://github.com/Veritogen/master_thesis/blob/master/nlpipe/NlPipe.py
The first problem I encountered concerns CPU usage. Somehow gensim seems to occupy all(!) of the processors I have available (at least that's what it looks like in htop; it even takes about 5-6 seconds to open htop), even though I specified the number of processes. In this case it is 36, running on a machine with 2 Xeon Gold CPUs with 72 logical processors (36 physical and 36 hyper-threaded). The log shows the number of processes I passed to ldamulticore. The behaviour is the same on a machine with two Xeon Silver CPUs and on my laptop. I'm running gensim 3.8.3 on Python 3.8.5 and Ubuntu 20.04/18.04. Is this behaviour expected?
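The call boils down to roughly this (a simplified sketch, not the exact code from NlPipe.py; corpus and id2word stand in for my preprocessed bag-of-words corpus and dictionary):

from gensim.models import LdaMulticore

lda = LdaMulticore(
    corpus=corpus,    # bag-of-words corpus
    id2word=id2word,  # gensim Dictionary
    num_topics=50,
    workers=36,       # explicitly set, yet all 72 logical cores end up busy
    passes=2,
)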

My second problem is the run time of the script. As you can see here (https://github.com/Veritogen/master_thesis/blob/master/run_pol_lda.py#L26), I'm taking only 10% of my dataset, which includes ~3.3 million documents. For this corpus I ran a test on the impact of the maximum document frequency for the tokens (0.3/30%, 0.2/20%, 0.1/10%). The minimum frequency for each term is set to 25, and I keep 100,000 of the tokens. I then calculated the c_v coherence score for each model with the different numbers of topics and etas, running two passes over the documents. When I compare this with the training time listed here: https://radimrehurek.com/gensim/models/ldamulticore.html (~10x the documents I have), it's taking really, really long, even though I use many more processes. To figure out what is taking so long, I ran the script locally on a sample of only 20 documents while timing it. It seems that .get_coherence() takes up 75% of the total run time for this small corpus (I attached the call graph). From this I assume that calculating the coherence takes most of the time; am I right? Overall, for a maximum document frequency of 0.3, it took 22 hours on the big machine, with all 72 cores at 100%, to train the LDA models and compute the topic coherence (c_v) with 50/100 topics and eta 0.5/0.7 respectively. Is there a fundamental flaw in my code that I overlooked?
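The coherence computation is essentially this (again a simplified sketch; texts stands in for my tokenized documents):

from gensim.models import CoherenceModel

cm = CoherenceModel(
    model=lda,          # a trained LdaMulticore model
    texts=texts,        # tokenized documents, required for c_v
    dictionary=id2word,
    coherence='c_v',
    processes=36,
)
score = cm.get_coherence()  # this call accounts for ~75% of the run time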
I hope I described my problem properly. Please feel free to ask for more context etc. if there is anything I didn't explain well. I hope the code I provided is readable and self-explanatory. Thanks for the work you put into this library, and thanks in advance for any help you might provide.
Greetings, Hannes
[attachment: call_graph.jpg]

Hannes

Jan 17, 2021, 5:24:39 AM
to Gensim
Update: running this script on a different corpus of similar size (a different 4chan board, though) was considerably faster (14 hours), even though it ran on a smaller machine (48 logical cores (24 physical, 24 hyper-threaded) on 2x Xeon Silver) and with a minimum document frequency of 1. Gensim also used all the cores there. I have no clue where the problem is situated or why the training on the big machine is slower. For the run on the smaller machine I also enabled gensim logging at the debug level. It showed that only 10 processes are in use, yet all cores were pushed to 100%.

Hannes

Jan 17, 2021, 6:48:02 AM
to Gensim
Update 2: From what I understand, the load average is extremely high. The RAM usage is caused by having two dataframes open; I think that shouldn't be a problem, though. You can see the load of the bigger machine here (it has been running the https://github.com/Veritogen/master_thesis/blob/master/run_pol_lda.py script for about two days now):
[attachment: load.png]
Here you can see the load average of the smaller machine; it's training only one model at the moment (the workers parameter of ldamulticore is set to 10):
[attachment: load2.png]

Hannes

Jan 17, 2021, 12:46:42 PM
to Gensim
Update 3: Setting the workers parameter of ldamulticore to 1 yields the following:
[attachment: processes_1.png]
The coherence model now runs on exactly the number of processes given by its own processes parameter.

Sumit Sharma

Jan 17, 2021, 12:58:32 PM
to gen...@googlegroups.com
Dear all, greetings of the day!
I have a text file like:
"
Recommender systems with social regularization. Although Recommender Systems have been comprehensively analyzed in the past decade, the study of social-based recommender systems just started. In this paper, aiming at providing a general method for improving recommender systems by incorporating social network information, we propose a matrix factorization framework with social regularization. The contributions of this paper are four-fold: (1) We elaborate how social network information can benefit recommender systems; (2) We interpret the differences between social-based recommender systems and trust-aware recommender systems; (3) We coin the term Social Regularization to represent the social constraints on recommender systems, and we systematically illustrate how to design a matrix factorization objective function with social regularization; and (4) The proposed method is quite general, which can be easily extended to incorporate other contextual information, like social tags, etc. The empirical analysis on two large datasets demonstrates that our approaches outperform other state-of-the-art methods. Recommender Systems, Collaborative Filtering, Social Network, Matrix Factorization, Social Regularization
"
In the above text, I want to identify key terms such as:
["collaborative filtering", "recommender systems", "context-aware recommender systems", "matrix factorizations", "recommendation systems", "information retrieval", "matrix algebra", "collaborative filtering techniques", "social networks", "regularization", "factorization", "recommendation algorithms"]

Is it possible to extract this type of keyword using a gensim model? I also want to try DBpedia Spotlight. Please advise me on which model I should use and how.
I look forward to a positive response.



--
Thank you & Regards,
Sumit Sharma
Research Scholar, NIT KKR
Contact: +91 9760198031

Hannes

Jan 17, 2021, 1:50:23 PM
to Gensim
Hello,
unfortunately it is not clear to me what you want to achieve. I think it would be more helpful if you opened your own thread/question to explain your problem properly, rather than mixing it up with a separate issue.
Yours
Hannes

Hannes

Jan 18, 2021, 5:02:56 AM
to Gensim
Update 4: even though the training is running much faster now, it's still slower than described here: https://radimrehurek.com/gensim/models/ldamulticore.html, and that is on a tenth of the data set described in the link (about 300k docs). Could that be due to a limitation of the iterator? I've read something like that, but thought it would only apply when training on a dataset larger than memory, and I have the whole data set in memory. It could also be due to the overhead of spawning that many processes. Is there any best practice I can follow without touching the workers parameter of ldamulticore? I've fixed it at one worker, which still uses about 64 cores on the big machine. I also stumbled upon information in the log which could explain the usage of too many cores:
2021-01-17 20:45:12,028 : INFO : MainProcess : Note: detected 72 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
2021-01-17 20:45:12,028 : INFO : MainProcess : Note: NumExpr detected 72 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2021-01-17 20:45:12,028 : INFO : MainProcess : NumExpr defaulting to 8 threads.

These are the very first entries in the log. I'm still not sure whether this is related, since the coherence model does run the defined number of processes. From what I understand, though, the reason for the usage of this many cores despite workers=1 could be numexpr (https://github.com/pydata/numexpr). It seems to have been installed in my virtual environment as a dependency (pandas uses it). How it relates to ldamulticore is still not clear to me; maybe it's overriding some numpy routine used by ldamulticore but not by the coherence model? The log of the first pass (gist: https://gist.github.com/Veritogen/53e3e91aa431b7e560921d35e8bcd6d2) suggests that indeed only one worker is running, even though almost all cores run at max and the 15-minute load average is around 64.
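If numexpr really is the culprit, the log suggests it could at least be capped via the environment variable it mentions, set before pandas/numexpr is first imported (an untested sketch):

import os
# Must be set before numexpr is first imported (pandas pulls it in).
os.environ['NUMEXPR_MAX_THREADS'] = '8'

import pandas as pd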

Hannes

Jan 18, 2021, 5:32:14 AM
to Gensim
I guess the hunch about numexpr isn't right: after uninstalling it, ldamulticore still uses all the cores.

Hannes

Jan 18, 2021, 5:37:15 AM
to Gensim
Also, the smaller machine (48 cores) seems to be faster than the one with 72 cores with the same settings. I guess that points towards overhead due to multiprocessing.

Radim Řehůřek

Jan 19, 2021, 4:08:30 AM
to Gensim
Hi Hannes,

using so many workers that the overhead swamps the computation is one possibility.

Another reason might be your BLAS library, which can internally use multi-threading. In other words:

1. You tell Gensim to use multiple worker processes (e.g. 30).
2. Each process does its computation using BLAS, which in turn uses multiple cores (e.g. 20).

So in total, the training would use 30 * 20 = 600 threads => totally overwhelm the machine.

An easy way to test this is to run with just one Gensim worker. If you still see multiple cores used in top, it's likely due to BLAS (point 2).

I'd recommend using either one worker process, or one BLAS thread (force your BLAS to be single-threaded, no parallelization). The idea is to be explicit about where parallelization happens: either in the Python/Gensim world, or at the low BLAS level. But not both.
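For example, most BLAS backends can be capped with an environment variable, set before numpy is first imported (a sketch; which variable applies depends on the BLAS your numpy is linked against):

import os
# Cap BLAS threading; must happen before numpy is imported.
os.environ['OPENBLAS_NUM_THREADS'] = '1'  # OpenBLAS
os.environ['MKL_NUM_THREADS'] = '1'       # Intel MKL
os.environ['OMP_NUM_THREADS'] = '1'       # OpenMP-based backends

import numpy as np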

Hope that helps,
Radim

Hannes

Jan 19, 2021, 6:21:36 AM
to Gensim
Hello Radim,
thank you for your reply, it indeed explains the problem I ran into. Setting the number of gensim workers to 1 already improved the performance a lot. From what I know, using BLAS in the background could also slow the whole process down if all (hyper-threaded) cores are used and there are a lot of floating-point operations (e.g. an Intel Skylake (6th generation) core can only do 64 floating-point operations per clock cycle, WITH or WITHOUT hyper-threading; see: http://ppc.cs.aalto.fi/ch3/hyperthreading/). So setting the number of BLAS threads to 1 might yield another speed-up and leave the control over the workers to Gensim. It would also give more control over the overhead that comes with multiprocessing. I'll test this after my run with workers = 1 is finished (tomorrow :D).
After some quick googling, I found https://stackoverflow.com/a/59932798. Would it be helpful to implement something like this, i.e. check whether numpy uses BLAS (via numpy.show_config()) and then set the number of threads accordingly when ldamulticore is used, and/or allow setting the number of BLAS workers as well? If yes, I could see if I can implement it in about a month (when I'm done with my thesis).
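A rough sketch of the check I have in mind (untested; threadpool_info comes from the same threadpoolctl package used in the Stack Overflow answer):

import numpy as np
from threadpoolctl import threadpool_info

np.show_config()  # prints which BLAS numpy was built against

# threadpool_info() lists the thread pools loaded in the current process,
# including the BLAS API in use and its current thread count.
for pool in threadpool_info():
    print(pool['user_api'], pool['num_threads'])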
Yours,
Hannes

Hannes

Jan 19, 2021, 12:00:42 PM
to Gensim
Update: I couldn't get BLAS under control using environment variables; what worked, though, was adding the following around my parameter search, which calls ldamulticore:
from threadpoolctl import threadpool_limits

with threadpool_limits(limits=1, user_api='blas'):
    search_best_model()  # my parameter search, which calls ldamulticore
Now the workers parameter of ldamulticore does what it's supposed to do, and the speedup is huge. Now I'm wondering what an appropriate batch size might be: the 36 processes are chewing through the 2,000-document batches quite fast. Would a batch size of e.g. 10,000 or 20,000 reduce the multiprocessing overhead? Would that influence the model as well, e.g. the eval_every interval?

Radim Řehůřek

Jan 21, 2021, 12:08:03 PM
to Gensim
This is interesting. I had never heard of `threadpool_limits()`, thanks for following up!

Yes, having a way to control BLAS threading from within Gensim would be great, as long as the implementation is consistent and clean and works across different BLAS backends. A win-win for all users.
What BLAS do you use?

About batch size: the optimal batch-size-vs-performance trade-off will depend on the sparsity of your vectors and your RAM size. And yes, varying the batch size also affects the model itself: smaller batches lead to more frequent gradient updates. That implies slower processing, but may be a bit more accurate. In the extreme, once the batch size equals the entire corpus, you basically get the original LDA instead of online LDA, with a single parameter update per full pass over the corpus.
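In code, that just means tuning chunksize when constructing the model, e.g. (a sketch with illustrative values; corpus and id2word are your own data):

from gensim.models import LdaMulticore

lda = LdaMulticore(
    corpus=corpus, id2word=id2word, num_topics=100,
    workers=35,       # leave one core free for the master process
    chunksize=10000,  # larger batches = less multiprocessing overhead,
                      # but fewer (and coarser) online updates
)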

Hope that helps,
Radim



Hannes

Feb 3, 2021, 10:51:31 AM
to Gensim
Sorry for the late reply, I'm in the last weeks of my thesis, so I'm kind of busy. I'm using OpenBLAS. From what I understood, threadpool_limits with the argument 'blas' should work for all BLAS implementations, though. I'll look into it once I have more time to spare.
And thank you for your feedback on the batch size!