How to improve the coherence score

Hajar Zankadi

Nov 29, 2021, 11:17:41 AM

to Gensim

I am working with a dataset of about 128K tweets

I trained an LDA model and a GSDMM model, but both give me a low coherence score (0.37 and 0.39, respectively).

I tried training the LDA model with different values for passes, iterations and random_state, but I still get a low coherence score.

I also used grid search for hyperparameter tuning of alpha and eta (which is time-consuming), but still didn't get good results.

For GSDMM, I kept alpha and eta at 0.1 and n_iterations=30, as recommended in the authors' paper, and I only changed K.
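For readers following along, the grid search described above can be sketched roughly as below. This is only an illustration: `grid_search`, `train_and_score` and `fake_score` are hypothetical names, and the stand-in scorer exists only so the snippet runs without training a model. In practice the callable would wrap model training (for example gensim's `LdaModel`) and coherence evaluation (for example `CoherenceModel` with `coherence="c_v"`).

```python
from itertools import product


def grid_search(train_and_score, alphas, etas):
    """Try every (alpha, eta) pair and return the best one plus all results.

    `train_and_score` is a hypothetical callable: it should train a model
    with the given alpha/eta and return its coherence score.
    """
    results = []
    for alpha, eta in product(alphas, etas):
        score = train_and_score(alpha=alpha, eta=eta)
        results.append((score, alpha, eta))
    # Highest coherence first
    results.sort(key=lambda r: r[0], reverse=True)
    best_score, best_alpha, best_eta = results[0]
    return {"alpha": best_alpha, "eta": best_eta, "score": best_score}, results


# Stand-in scorer so the sketch runs without training anything;
# replace it with real training + coherence evaluation.
def fake_score(alpha, eta):
    return 1.0 - abs(alpha - 0.1) - abs(eta - 0.01)


best, all_results = grid_search(fake_score, alphas=[0.01, 0.1, 1.0], etas=[0.01, 0.1])
print(best)
```

The cost grows with the product of the grid sizes (here 3 x 2 = 6 training runs), which is why this kind of search gets slow on a corpus of this size.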

Could you please help me to improve the coherence score?

Is there any alternative that I should try?

Thank you in advance

Aly Abdurrazek

Dec 8, 2021, 1:19:14 PM

to Gensim

Why not try a different model, such as a coherence-aware topic model? There is an implementation you can try on SageMaker, and it optimizes coherence as it trains.

Also, you could tune the preprocessing steps to include bigrams and trigrams in the input tokens.

What coherence metric are you using (c_v, u_mass, etc.)?
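As a rough illustration of what adding bigrams to the tokens means, here is a minimal pure-Python sketch. It simply joins adjacent pairs that occur at least `min_count` times. The function name and the threshold are assumptions for this example; gensim's `Phrases` model uses a statistical score rather than a raw count, so this is not its implementation.

```python
from collections import Counter


def merge_frequent_bigrams(docs, min_count=2, delimiter="_"):
    """Join adjacent token pairs that occur at least `min_count` times."""
    pair_counts = Counter()
    for tokens in docs:
        pair_counts.update(zip(tokens, tokens[1:]))
    frequent = {pair for pair, n in pair_counts.items() if n >= min_count}

    merged_docs = []
    for tokens in docs:
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in frequent:
                merged.append(tokens[i] + delimiter + tokens[i + 1])
                i += 2  # skip the second half of the merged pair
            else:
                merged.append(tokens[i])
                i += 1
        merged_docs.append(merged)
    return merged_docs


docs = [
    ["topic", "model", "training"],
    ["topic", "model", "coherence"],
    ["coherence", "score"],
]
bigram_docs = merge_frequent_bigrams(docs)
print(bigram_docs)
```

Running the same function a second time on its own output would produce trigrams wherever a merged bigram is again followed by a frequent neighbour.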
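To make the difference between metrics more concrete, here is a small sketch of a UMass-style score for one topic, based on document co-occurrence counts. It is a simplified illustration (the function name, the smoothing value and the toy documents are assumptions), not gensim's implementation, and it is not c_v, which as far as I know is built on sliding-window counts and NPMI instead.

```python
import math
from itertools import combinations


def umass_coherence(top_words, docs, eps=1.0):
    """Average log((D(earlier, later) + eps) / D(earlier)) over word pairs.

    `top_words` is one topic's words, ordered from most to least probable.
    D(.) counts the documents that contain the given word(s).
    """
    doc_sets = [set(tokens) for tokens in docs]

    def doc_count(*words):
        return sum(1 for d in doc_sets if all(w in d for w in words))

    scores = []
    for earlier, later in combinations(top_words, 2):
        d_earlier = doc_count(earlier)
        if d_earlier == 0:
            continue  # word never seen; skip to avoid dividing by zero
        scores.append(math.log((doc_count(earlier, later) + eps) / d_earlier))
    return sum(scores) / len(scores) if scores else float("nan")


docs = [["cat", "dog"], ["cat", "dog"], ["cat", "fish"], ["car", "road"]]
coherent = umass_coherence(["cat", "dog"], docs)
incoherent = umass_coherence(["cat", "road"], docs)
print(coherent, incoherent)
```

With this kind of score, values closer to zero indicate words that tend to appear in the same documents, so the numbers are not comparable with c_v values such as 0.37.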

Hajar Zankadi

Dec 9, 2021, 4:54:14 AM

to gen...@googlegroups.com

Hello, thank you for your reply.

For coherence-aware topic modeling, could you refer me to some resources?

I used n-grams (bigrams and trigrams) in my preprocessing steps. I also extended the stop word list with the high-frequency words in the vocabulary, and I used id2word.filter_extremes when generating the id2word dictionary.

I am using c_v as the metric for the coherence score.

For the data cleaning, I removed hashtags, URLs, links, punctuation, RT tags, @ tags and emojis.

For the data preprocessing, I used tokenization, stop word removal, n-grams and lemmatization.
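For context, the idea behind filter_extremes can be sketched in plain Python as below. gensim's `Dictionary.filter_extremes` takes `no_below`, `no_above` and `keep_n`; the function here is a hypothetical stand-in that mimics only the first two on token lists, with toy thresholds chosen for the example.

```python
from collections import Counter


def filter_extremes_sketch(docs, no_below=2, no_above=0.5):
    """Drop tokens found in fewer than `no_below` documents or in more
    than a `no_above` fraction of all documents."""
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each token once per document
    n_docs = len(docs)
    keep = {
        tok for tok, df in doc_freq.items()
        if df >= no_below and df / n_docs <= no_above
    }
    return [[tok for tok in tokens if tok in keep] for tokens in docs]


docs = [
    ["covid", "vaccine", "today"],
    ["covid", "lockdown", "today"],
    ["covid", "vaccine", "news"],
    ["covid", "lockdown", "rare"],
]
filtered = filter_extremes_sketch(docs)
print(filtered)
```

In this toy corpus "covid" is dropped because it appears in every document, and "news" and "rare" are dropped because they appear only once. On short texts such as tweets, aggressive thresholds can leave some documents nearly empty, which is worth checking before training.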
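A cleaning step like the one described could look roughly like this. It is a sketch using only the standard library: the regular expressions, the emoji ranges and the choice to drop the whole hashtag (rather than only the # sign) are assumptions for this example, not the exact cleaning used here.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
RT_RE = re.compile(r"^\s*RT\b:?\s*")
MENTION_HASHTAG_RE = re.compile(r"[@#]\w+")
# Rough emoji/pictograph ranges; an assumption, not an exhaustive list
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF\uFE0F]")
PUNCT_RE = re.compile(r"[^\w\s]")


def clean_tweet(text):
    """Strip URLs, a leading RT tag, @/# tags, emojis and punctuation."""
    text = URL_RE.sub(" ", text)
    text = RT_RE.sub(" ", text)
    text = MENTION_HASHTAG_RE.sub(" ", text)
    text = EMOJI_RE.sub(" ", text)
    text = PUNCT_RE.sub(" ", text)
    # Lowercase and collapse the whitespace left behind by the removals
    return " ".join(text.lower().split())


cleaned = clean_tweet(
    "RT @user: Loving the new #gensim release 😀 https://example.com/x?y=1 !!!"
)
print(cleaned)
```

URLs are removed first so that the punctuation pass does not break them into stray tokens such as "https" or "com".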

Thank you for your feedback

kind regards


--
You received this message because you are subscribed to the Google Groups "Gensim" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gensim+un...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/gensim/bf7b8a6c-a788-431a-b304-d25eb94ef772n%40googlegroups.com.

--

Hajar Zankadi

State Engineer in Information Systems Management

Graduate of the Institut National des Postes et Télécommunications (INPT)

Aly Abdurrazek

Dec 9, 2021, 11:19:10 AM

to Gensim

Hello Hajar,

Please refer to this paper: https://aclanthology.org/D18-1096.pdf

You may also use OCTIS. They have an implementation of CTM, which usually gives a high coherence score (as well as diverse topics). Check their GitHub page: https://github.com/MIND-Lab/OCTIS. It is also as easy as 1, 2, 3.

The c_v score correlates best with human judgment, so yes, you need to improve the coherence score.

One final thought: have you tried BERTopic (https://maartengr.github.io/BERTopic/index.html)?

regards,

Aly

Hajar Zankadi

Dec 10, 2021, 7:21:29 AM

to gen...@googlegroups.com

Hello Aly

Thank you for your help and your feedback

Kind regards


Aly Abdurrazek

Dec 11, 2021, 10:35:51 AM

to Gensim

Sure.

Let us know whether it improved :)

BR

Aly
