How to speed up CoherenceModel with "c_v"

Murad Bashirov

Dec 4, 2022, 3:26:43 AM

to Gensim

Hello everyone!

I am trying to compute coherence for a corpus of ~21,000 documents. I trained the LDA models with `LdaMulticore`, which gave a relatively fast training speed, and I now want to evaluate models with different numbers of topics using `CoherenceModel`. With the `c_v` option, however, `get_coherence()` takes too long. Is there any way to speed this up, for example by making `get_coherence()` use multiple threads? I know I can change `coherence` in `CoherenceModel` to "u_mass", but I do not get good results with "u_mass", so I want to use "c_v". For reference, computing coherence for the 4-topic model took about 40 minutes, and I now need to do the same for topic counts from 6 to 100, which is going to take a really long time, and I have a deadline coming.

Thank you for the help.

Best regards,

Murad Bashirov.

Murad Bashirov

Dec 4, 2022, 4:46:40 AM

to Gensim

Apparently there is a silent error happening in the background. After turning on logging, I see the following error:

```
2022-12-04 18:29:56,249 : ERROR : worker encountered unexpected exception
Traceback (most recent call last):
  File "/home/muradb/school/hss407/venv/lib/python3.10/site-packages/gensim/topic_coherence/text_analysis.py", line 561, in run
    self._run()
  File "/home/muradb/school/hss407/venv/lib/python3.10/site-packages/gensim/topic_coherence/text_analysis.py", line 581, in _run
    self.accumulator.partial_accumulate(docs, self.window_size)
  File "/home/muradb/school/hss407/venv/lib/python3.10/site-packages/gensim/topic_coherence/text_analysis.py", line 353, in partial_accumulate
    super(WordOccurrenceAccumulator, self).accumulate(texts, window_size)
  File "/home/muradb/school/hss407/venv/lib/python3.10/site-packages/gensim/topic_coherence/text_analysis.py", line 296, in accumulate
    self.analyze_text(virtual_document, doc_num)
  File "/home/muradb/school/hss407/venv/lib/python3.10/site-packages/gensim/topic_coherence/text_analysis.py", line 360, in analyze_text
    self._slide_window(window, doc_num)
  File "/home/muradb/school/hss407/venv/lib/python3.10/site-packages/gensim/topic_coherence/text_analysis.py", line 375, in _slide_window
    self._token_at_edge = window[0]
IndexError: index 0 is out of bounds for axis 0 with size 0
```
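
For reference, a minimal sketch of how logging can be turned on so that worker errors like the one above become visible. It uses only the standard-library `logging` module, with a format string chosen to match the log line in the traceback. The `force=True` argument (Python 3.8+) and the logger name `coherence_debug` are assumptions added for illustration, not something taken from this thread.

```python
import logging

# Send INFO-and-above messages to the console, including those that
# libraries emit through the standard logging module.
logging.basicConfig(
    format="%(asctime)s : %(levelname)s : %(message)s",
    level=logging.INFO,
    force=True,  # assumption: replace any handlers configured earlier
)

logger = logging.getLogger("coherence_debug")  # hypothetical logger name
logger.info("logging enabled")
```

With this in place before `get_coherence()` is called, exceptions logged by background workers should show up in the console instead of passing silently.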

I am not sure what is happening. This is what my code does:

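```
from gensim.models import CoherenceModel, LdaMulticore

# data_lemmatized (tokenized documents) and dirichlet_dict (the dictionary)
# are built earlier in the script.
num_keywords = 15
num_topics = list(range(4, 101, 2))

# Load the previously trained models, keyed by number of topics.
LDA_models = {}
for i in num_topics:
    LDA_models[i] = LdaMulticore.load(f"models/{i}_multi_symm.gensim")

# Compute c_v coherence for each model.
coherences = []
for i in num_topics:
    model = CoherenceModel(model=LDA_models[i], texts=data_lemmatized, dictionary=dirichlet_dict, coherence="c_v")
    coherences.append(model.get_coherence())
```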

The files `models/{i}_multi_symm.gensim` are saved `LdaMulticore` models that I previously trained with:

```
import gensim

# Called once per topic count i; bow_corpus is the bag-of-words corpus.
lda_model = gensim.models.LdaMulticore(
    corpus=bow_corpus,
    id2word=dirichlet_dict,
    num_topics=i,
    random_state=42,
    chunksize=len(bow_corpus),
    passes=1,
    workers=15,
)
```

I have no idea how to debug this. If you could help, I would be very glad.

Best regards,

Murad Bashirov

Dec 4, 2022, 6:55:20 AM

to Gensim

Applying https://github.com/RaRe-Technologies/gensim/pull/3406 solved the issue. There were some empty lists in my processed data.
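
For anyone who cannot apply the patch, one possible workaround is to drop the empty documents before building the `CoherenceModel`. The sketch below is an assumption-based illustration rather than what the pull request itself does: `drop_empty_documents` and `sample_texts` are hypothetical names, and the sample data stands in for a tokenized corpus such as `data_lemmatized`.

```python
from typing import List, Sequence


def drop_empty_documents(texts: Sequence[Sequence[str]]) -> List[List[str]]:
    """Return only the documents that contain at least one token."""
    return [list(doc) for doc in texts if len(doc) > 0]


# Hypothetical stand-in for a preprocessed corpus with some empty documents.
sample_texts = [
    ["topic", "model"],
    [],
    ["coherence", "score", "window"],
    [],
]

cleaned_texts = drop_empty_documents(sample_texts)
print(f"kept {len(cleaned_texts)} of {len(sample_texts)} documents")
```

The filtered list would then be passed as `texts=` in place of the original one. This should be harmless for "c_v" in principle, since an empty document contributes no word co-occurrences, but it is worth checking the resulting scores on a small model first.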
