Differences among Topic Coherence Metrics ("u_mass", "c_v", ...) & Choosing the Best Threshold Value to Filter Out "Low-quality" Documents

Feng Sun

Feb 2, 2018, 2:41:07 PM

to gensim

Simply put, I performed LDA on a document collection and on subsets of it. Each subset is generated (after the original model has been trained on the complete collection) by filtering out documents whose maximum topic weight is less than a certain threshold (sometimes called "low-quality" documents). I tested different threshold values and calculated topic coherence (u_mass and c_v) on the resulting models. Here are the results (x-axis is threshold):

#topics = 10

threshold	#docs
0	4095
0.1	4095
0.2	4094
0.3	3865
0.4	3082
0.5	2077
0.6	1337
0.7	780

#topics = 30

threshold	#docs
0	4095
0.1	4094
0.2	3982
0.3	3070
0.4	1980
0.5	1169
0.6	647
0.7	363
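
For readers who want to reproduce the filtering step described above, here is a minimal sketch. It assumes each document's topic distribution is available as a list of (topic_id, weight) pairs; the function name `filter_by_max_topic_weight` and the toy data are hypothetical and not taken from my actual pipeline.

```python
def filter_by_max_topic_weight(doc_topics, threshold):
    """Return indices of documents whose largest topic weight is >= threshold.

    doc_topics: one list of (topic_id, weight) pairs per document.
    Documents whose maximum weight is below the threshold are dropped.
    """
    kept = []
    for idx, topics in enumerate(doc_topics):
        # A document with no reported topics is treated as having weight 0.0.
        max_weight = max((weight for _, weight in topics), default=0.0)
        if max_weight >= threshold:
            kept.append(idx)
    return kept


# Toy per-document topic distributions (made up for illustration).
toy_doc_topics = [
    [(0, 0.90), (1, 0.10)],
    [(0, 0.40), (1, 0.35), (2, 0.25)],
    [(2, 0.55), (0, 0.45)],
]

kept_at_half = filter_by_max_topic_weight(toy_doc_topics, 0.5)
kept_at_zero = filter_by_max_topic_weight(toy_doc_topics, 0.0)
```

With gensim, the per-document lists could presumably come from `LdaModel.get_document_topics(bow, minimum_probability=0.0)`, and coherence for each retrained model from `CoherenceModel(..., coherence='u_mass')` or `coherence='c_v'` followed by `get_coherence()`.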

They yield a similar pattern:

For u_mass, there is a peak, after which the score trends downward.
For c_v, the score increases monotonically.

I know that several values are supported for the coherence measure, and that c_v reportedly gives the best results while u_mass is faster:

https://radimrehurek.com/gensim/models/coherencemodel.html

https://rare-technologies.com/validating-gensims-topic-coherence-pipeline/

But what are the exact differences among these values ('u_mass', 'c_v', 'c_uci', and 'c_npmi')?

And how can the patterns described above be explained?

Many thanks!

Ivan Menshikh

Feb 5, 2018, 1:28:14 AM

to gensim

Hello Feng,

For the exact differences, you can read the original article. You can also discuss these plots with Mack.
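
As a rough orientation before reading the article, here is a sketch of the pairwise scores that, as far as I understand, sit behind three of the measures. The function names and the smoothing constants are assumptions made for illustration, not gensim's internals. 'u_mass' is built on a log conditional probability from document co-occurrence counts, 'c_uci' on pointwise mutual information (PMI) estimated from a sliding window over the texts, and 'c_npmi' on the normalized version of that PMI. 'c_v' goes one step further and compares vectors of NPMI values with cosine similarity, so it is not shown here.

```python
import math

EPS = 1e-12  # smoothing constant; an assumption for this sketch


def umass_pair(docs_with_both, docs_with_conditioning_word):
    """UMass-style score: log((D(wi, wj) + 1) / D(wj)) from document counts."""
    return math.log((docs_with_both + 1) / docs_with_conditioning_word)


def pmi_pair(p_both, p_i, p_j):
    """UCI-style score: pointwise mutual information of two words."""
    return math.log((p_both + EPS) / (p_i * p_j))


def npmi_pair(p_both, p_i, p_j):
    """NPMI: PMI divided by -log P(wi, wj), which bounds it to roughly [-1, 1]."""
    return pmi_pair(p_both, p_i, p_j) / -math.log(p_both + EPS)


# Two words that always appear together: NPMI is close to 1.
npmi_always_together = npmi_pair(0.1, 0.1, 0.1)
# Two statistically independent words: PMI and NPMI are close to 0.
npmi_independent = npmi_pair(0.01, 0.1, 0.1)
```

One practical consequence is that u_mass is at most slightly above zero and usually negative, while the NPMI-based scores live on a bounded scale, so the absolute values of the two curves are not directly comparable.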
