Differences among Topic Coherence Metrics ("u_mass", "c_v", ...) & Choosing the Best Threshold Value to Filter Out "Low-quality" Documents

Feng Sun

Feb 2, 2018, 2:41:07 PM
to gensim
Simply put, I performed LDA on a document collection and on subsets of it. Each subset is generated (after the original model has been trained on the complete collection) by filtering out documents whose maximum topic weight is less than a certain threshold (sometimes called "low-quality" documents). I tested different threshold values and calculated topic coherence (u_mass and c_v) on the resulting models. Here are the results (x-axis is threshold):

#topics = 10

threshold   #docs
0           4095
0.1         4095
0.2         4094
0.3         3865
0.4         3082
0.5         2077
0.6         1337
0.7         780


#topics = 30

threshold   #docs
0           4095
0.1         4094
0.2         3982
0.3         3070
0.4         1980
0.5         1169
0.6         647
0.7         363
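
The filtering step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the exact code used for the numbers in this post: `filter_by_max_topic_weight` is a hypothetical helper, and the toy documents and weights are made up. With gensim, the per-document weights could come from something like `lda.get_document_topics(bow, minimum_probability=0)`, and the coherence of each resulting model from `CoherenceModel(...).get_coherence()`.

```python
def filter_by_max_topic_weight(docs, doc_topic_weights, threshold):
    """Keep documents whose largest topic weight is at least `threshold`.

    `docs` and `doc_topic_weights` are parallel lists; each entry of
    `doc_topic_weights` holds one weight per topic for that document.
    (Hypothetical helper, introduced only for illustration.)
    """
    kept = []
    for doc, weights in zip(docs, doc_topic_weights):
        if max(weights) >= threshold:
            kept.append(doc)
    return kept


# Toy data (made up): three tokenized documents, two topics.
docs = [["cat", "dog"], ["tax", "law"], ["cat", "law"]]
doc_topic_weights = [[0.9, 0.1], [0.2, 0.8], [0.55, 0.45]]

# Number of documents that survive each threshold.
counts = {
    t: len(filter_by_max_topic_weight(docs, doc_topic_weights, t))
    for t in (0.0, 0.5, 0.6, 0.85)
}
print(counts)
```

Each surviving subset would then be used to train a new model and compute coherence, which is how a threshold-versus-coherence curve like the ones described here can be produced.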



Both settings (10 and 30 topics) yield a similar pattern:

  • For u_mass, coherence peaks and then trends down
  • For c_v, coherence increases monotonically
I know that several values are supported for the coherence measure, and that c_v is said to give the best results while u_mass is faster.


But what are the exact differences among these values ('u_mass', 'c_v', 'c_uci', and 'c_npmi')?
And how can the patterns described above be explained?

Many thanks!

Ivan Menshikh

Feb 5, 2018, 1:28:14 AM
to gensim
Hello Feng,

About the "exact difference": you can read about it in the original article. You can also discuss these plots with Mack.
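
As a rough illustration of how two of the measures differ at the word-pair level, here is a minimal sketch, assuming document-level co-occurrence counts on a made-up toy corpus. The u_mass-style score is a smoothed log conditional probability, log((D(w_i, w_j) + 1) / D(w_j)), while the NPMI-style score is pointwise mutual information normalized to [-1, 1]. All function names are hypothetical. This is a simplification: in gensim, c_uci and c_npmi estimate probabilities with a sliding window over the texts rather than whole documents, and c_v additionally compares NPMI-based context vectors by cosine similarity, so these numbers will not match `CoherenceModel` output.

```python
import math


def doc_freq(docs, *words):
    """Number of documents that contain all of `words`."""
    return sum(1 for d in docs if all(w in d for w in words))


def umass_pair(docs, w_i, w_j):
    """u_mass-style score: log((D(w_i, w_j) + 1) / D(w_j)).

    Assumes w_j occurs in at least one document.
    """
    return math.log((doc_freq(docs, w_i, w_j) + 1) / doc_freq(docs, w_j))


def npmi_pair(docs, w_i, w_j):
    """NPMI-style score from document-level probabilities."""
    n = len(docs)
    p_i = doc_freq(docs, w_i) / n
    p_j = doc_freq(docs, w_j) / n
    p_ij = doc_freq(docs, w_i, w_j) / n
    if p_ij == 0:
        return -1.0  # the words never co-occur
    if p_ij == 1.0:
        return 0.0  # normalization undefined; convention of this sketch
    return math.log(p_ij / (p_i * p_j)) / -math.log(p_ij)


def topic_score(docs, topic_words, pair_fn):
    """Mean pair score over each word paired with every higher-ranked word."""
    scores = [
        pair_fn(docs, topic_words[i], topic_words[j])
        for i in range(1, len(topic_words))
        for j in range(i)
    ]
    return sum(scores) / len(scores)


# Toy corpus (made up): each document is a set of tokens.
docs = [{"cat", "dog", "pet"}, {"cat", "dog"}, {"cat", "tax"}, {"tax", "law"}]

coherent = ["cat", "dog", "pet"]  # words that tend to co-occur
mixed = ["cat", "law", "dog"]     # "law" never appears with the others

for name, fn in (("u_mass-style", umass_pair), ("npmi-style", npmi_pair)):
    print(name, topic_score(docs, coherent, fn), topic_score(docs, mixed, fn))
```

On this toy corpus both scores rank the coherent topic above the mixed one, but they live on different scales: the u_mass-style score is at most 0 here, while the NPMI-style score is bounded between -1 and 1.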