Bug in c_v coherence in combination with dictionary.filter_extremes()

Myrthe van Dieijen

Feb 15, 2017, 7:54:51 AM

to gensim

Hi,

I've noticed a bug in the c_v coherence code. I'm trying to obtain the c_v coherence measure for various LDA models I've estimated, as follows:

from gensim.models import CoherenceModel

lda_1_c_v = CoherenceModel(model=lda_1, texts=texts, dictionary=dictionary, coherence='c_v')
print(lda_1_c_v.get_coherence())

Unfortunately, I kept getting a KeyError (a 'remark' KeyError, see screenshot attached). I did manage to get the u_mass coherence, which takes the corpus as an argument rather than the texts. The texts I use are a list of documents, and each document is itself a list of tokens. Hence, it's a list of lists (the same type as used in the tutorials), so the format of the texts isn't the problem either.

After trying many things (I posted a question here earlier but got no response), I noticed that I was able to obtain the c_v coherence once I stopped pruning the dictionary. Previously I had pruned the dictionary via filter_extremes() and then called compactify().
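One way to check whether pruning is behind an error like this is to list the tokens that still occur in the texts but are no longer in the pruned vocabulary. The sketch below is only an illustration on made-up data, not gensim code: `find_missing_tokens` is a hypothetical helper, and a plain dict stands in for the token-to-id mapping of a pruned dictionary (in gensim, `dictionary.token2id`).

```python
def find_missing_tokens(texts, vocabulary):
    """Return the set of tokens that occur in texts but not in vocabulary."""
    missing = set()
    for text in texts:
        for token in text:
            if token not in vocabulary:
                missing.add(token)
    return missing


# Toy data, made up for illustration: a list of tokenized documents.
texts = [["topic", "model", "remark"], ["model", "coherence"]]

# Stand-in for a pruned token-to-id mapping (assumption: "remark" was pruned).
pruned_vocabulary = {"topic": 0, "model": 1, "coherence": 2}

missing = find_missing_tokens(texts, pruned_vocabulary)
print(sorted(missing))  # ['remark']
```

If the result is non-empty, the texts and the dictionary no longer share a vocabulary, which would be consistent with a KeyError on one of the pruned words.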

Is there a way to fix this issue? Otherwise I would need to remove these extreme words from the texts before creating the dictionary and corpus, which is a lot more work than using filter_extremes(). I really hope there's a way to use both a pruned dictionary and c_v coherence.

Many thanks in advance for your help!

Best wishes,

Myrthe

Gensim KeyError Coherence.jpg

Lev Konstantinovskiy

Feb 17, 2017, 11:04:43 AM

to gensim

Hi Myrthe,

Apologies for the late response.

The problem is indeed that the model and the texts have different vocabularies. This is unexpected because the texts

To prune the texts so that they contain only dictionary words, you can use this code: [list(filter(lambda x: x in dictionary.token2id, t)) for t in texts]
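The same idea can be sketched as a small helper that keeps every document as a plain list, which matters on Python 3, where `filter` returns an iterator rather than a list. This is a minimal sketch under stated assumptions: `prune_texts` is a hypothetical name, and the toy dict below stands in for `dictionary.token2id` of the pruned gensim dictionary.

```python
def prune_texts(texts, vocabulary):
    """Drop tokens not in vocabulary; every document remains a list."""
    return [[token for token in text if token in vocabulary] for text in texts]


# Toy data, made up for illustration.
texts = [["topic", "model", "remark"], ["model", "coherence"]]

# Stand-in for the pruned token-to-id mapping (assumption: "remark" was pruned).
pruned_vocabulary = {"topic": 0, "model": 1, "coherence": 2}

pruned_texts = prune_texts(texts, pruned_vocabulary)
print(pruned_texts)  # [['topic', 'model'], ['model', 'coherence']]
```

With gensim, one would presumably call `prune_texts(texts, dictionary.token2id)` and pass the result as `texts=` to `CoherenceModel`. Membership tests against the mapping's keys are constant time, so this avoids rescanning the whole vocabulary for every token.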

Please let me know if it works for you,

Lev

Rogier Hintzen

Nov 22, 2017, 7:24:24 AM

to gensim

That filtering solution works a treat for me! Thanks very much!

Rogier
