How to speed up CoherenceModel with "c_v"


Murad Bashirov

Dec 4, 2022, 3:26:43 AM
to Gensim
Hello everyone!
I am trying to compute coherence for a corpus of ~21,000 documents. I trained the LDA models with `LdaMulticore`, which gave a relatively high training speed, and I am now evaluating models with different numbers of topics using `CoherenceModel`. However, with the `c_v` option, `get_coherence()` takes too long. Is there any way to speed this up, for example by making `get_coherence` use multiple threads? I know that I can change `coherence` in `CoherenceModel` to "u_mass", but I do not get good results with "u_mass", so I want to use "c_v". Note that it took about 40 minutes to calculate coherence for 4 topics. I now need to do this for topic counts from 6 to 100, which is going to take a really long time, and I have a deadline coming.

Thank you for the help.

Best regards,
Murad Bashirov.

Murad Bashirov

Dec 4, 2022, 4:46:40 AM
to Gensim
Apparently a silent error is happening in the background. After turning on logging, I see the following error:
```
2022-12-04 18:29:56,249 : ERROR : worker encountered unexpected exception
Traceback (most recent call last):
  File "/home/muradb/school/hss407/venv/lib/python3.10/site-packages/gensim/topic_coherence/text_analysis.py", line 561, in run
    self._run()
  File "/home/muradb/school/hss407/venv/lib/python3.10/site-packages/gensim/topic_coherence/text_analysis.py", line 581, in _run
    self.accumulator.partial_accumulate(docs, self.window_size)
  File "/home/muradb/school/hss407/venv/lib/python3.10/site-packages/gensim/topic_coherence/text_analysis.py", line 353, in partial_accumulate
    super(WordOccurrenceAccumulator, self).accumulate(texts, window_size)
  File "/home/muradb/school/hss407/venv/lib/python3.10/site-packages/gensim/topic_coherence/text_analysis.py", line 296, in accumulate
    self.analyze_text(virtual_document, doc_num)
  File "/home/muradb/school/hss407/venv/lib/python3.10/site-packages/gensim/topic_coherence/text_analysis.py", line 360, in analyze_text
    self._slide_window(window, doc_num)
  File "/home/muradb/school/hss407/venv/lib/python3.10/site-packages/gensim/topic_coherence/text_analysis.py", line 375, in _slide_window
    self._token_at_edge = window[0]
IndexError: index 0 is out of bounds for axis 0 with size 0
```
I am not sure what is happening. This is what my code does:
```
from gensim.models import CoherenceModel, LdaMulticore

num_keywords = 15
num_topics = list(range(4, 101, 2))

# Load the previously trained models, keyed by number of topics.
LDA_models = {}
for i in num_topics:
    LDA_models[i] = LdaMulticore.load(f"models/{i}_multi_symm.gensim")

# Compute the c_v coherence of each model.
coherences = []
for i in num_topics:
    model = CoherenceModel(model=LDA_models[i], texts=data_lemmatized, dictionary=dirichlet_dict, coherence="c_v")
    coherences.append(model.get_coherence())
```
The files `models/{i}_multi_symm.gensim` are the saved `LdaMulticore` models that I previously trained with:
```
lda_model = gensim.models.LdaMulticore(corpus=bow_corpus,
                                       id2word=dirichlet_dict,
                                       num_topics=i,
                                       random_state=42,
                                       chunksize=len(bow_corpus),
                                       passes=1,
                                       workers=15,
                                      )
```
I have no idea how to debug this. If you could help, I would be very glad.

Best regards,
Murad Bashirov

Murad Bashirov

Dec 4, 2022, 6:55:20 AM
to Gensim
Applying https://github.com/RaRe-Technologies/gensim/pull/3406 solved the issue. There were some empty lists in my processed data.
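
For anyone who cannot apply the patch, a possible workaround is to drop the empty token lists before building the `CoherenceModel`. The snippet below is only a minimal sketch of that idea, not what the pull request itself does. The helper name `drop_empty_documents` and the tiny sample list are hypothetical, and it assumes the texts are a list of token lists, as `data_lemmatized` is in the code above.

```python
def drop_empty_documents(texts):
    """Return only the documents that contain at least one token."""
    return [doc for doc in texts if len(doc) > 0]


# Hypothetical stand-in for the real `data_lemmatized` list of token lists.
data_lemmatized = [["topic", "model"], [], ["coherence", "score"], []]

cleaned_texts = drop_empty_documents(data_lemmatized)
num_dropped = len(data_lemmatized) - len(cleaned_texts)
print(f"Dropped {num_dropped} empty documents, kept {len(cleaned_texts)}")
```

The cleaned list would then be passed as `texts=cleaned_texts` in place of `texts=data_lemmatized`. Printing the number of dropped documents is a quick way to confirm whether empty lists were present in the first place.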