Can I use CoherenceModel on held-out data?


vy...@uit.edu.vn

Jan 5, 2018, 3:56:42 AM
to gensim
I am new to topic modeling and NLP in general. I have learned that perplexity does not correlate with human judgement, but topic coherence metrics do. So I would like to measure topic coherence on held-out data instead of perplexity. According to the article Topic Coherence To Evaluate Topic Models, if I understand correctly, I have to employ an extrinsic topic coherence metric on held-out data.

My first question is: Is c_v an extrinsic topic coherence metric?

My second question is: Why do some topic coherence measures return nan or inf on held-out data?

For example:

from gensim.models import CoherenceModel

# model: an LDA model trained on the training texts
# test_data: tokenized held-out documents
# dictionary: the gensim Dictionary built from the training texts
CM = CoherenceModel(model=model, texts=test_data, dictionary=dictionary, coherence='c_uci')
CM.get_coherence_per_topic()

The result will be: [inf, inf, inf, inf, inf, inf, -13.283505155981141, inf, -13.999429177407952, inf]
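
As a workaround, I can average over only the finite per-topic scores (a minimal sketch continuing from the call above; the filtering step is my own idea, not something from the gensim docs):

import numpy as np

per_topic = CM.get_coherence_per_topic()

# Keep only the finite scores so inf does not poison the average.
finite_scores = [score for score in per_topic if np.isfinite(score)]
print(len(finite_scores), 'of', len(per_topic), 'topics usable')
print('mean coherence over usable topics:', np.mean(finite_scores))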

Does this imply that the topics with inf results are not present in the test data?

However, it looks like all topics are present in the test data in this graph, except for topic 10:

[attached graph not shown]

Thank you very much for your time and consideration.


Yours sincerely,

Vy Thuy Nguyen.


vy...@uit.edu.vn

Jan 5, 2018, 4:41:55 AM
to gensim
Now that I am thinking about it: maybe inf implies that some of the topic's most popular words are not present in the test data?
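
A quick sketch to check this hypothesis (assuming model is a gensim LdaModel and test_data is the tokenized held-out corpus; topn=20 is an arbitrary choice):

# For each topic, list its top words that never occur in the held-out texts.
test_vocab = set(word for doc in test_data for word in doc)

for topic_id in range(model.num_topics):
    top_words = [word for word, _ in model.show_topic(topic_id, topn=20)]
    missing = [word for word in top_words if word not in test_vocab]
    print(topic_id, 'top words missing from test data:', missing)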

James Allen-Robertson

Jan 5, 2018, 10:39:54 AM
to gensim
If I recall correctly, 'inf' means infinite, as in the result of a division by zero. I'm not entirely sure how you avoid this with what you're doing; I'm not saying you can't, I just don't know the details of how the scores are generated.
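
To illustrate the divide-by-zero idea, here is a toy reconstruction of a PMI-style ratio (only an illustration of the mechanism, not gensim's exact c_uci code): if a word never occurs in the texts used for the counts, its estimated probability is zero and the log ratio comes out as inf.

import numpy as np

# Toy PMI-style ratio: log( p(w1, w2) / (p(w1) * p(w2)) ).
# If w2 never occurs in the scored texts, p(w2) == 0,
# the denominator is zero, and the score blows up to inf.
p_w1 = np.float64(0.01)    # word 1 occurs
p_w2 = np.float64(0.0)     # word 2 never occurs
p_joint = np.float64(0.0)  # they never co-occur

with np.errstate(divide='ignore'):
    pmi = np.log((p_joint + 1e-12) / (p_w1 * p_w2))
print(pmi)  # inf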

Ivan Menshikh

Jan 7, 2018, 11:47:32 PM
to gensim
Hello,

The inf/nan problem is already known; it is connected with the topic coherence formula. You can learn more details about this behavior from Parul and Chinmaya (I'm not sure I remember correctly, but it seems to be a problem with words that are not found in the training corpus).
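
If missing words are indeed the cause, one possible workaround (my own sketch, not an official gensim recipe) is to score only the topics whose top words all occur in the held-out texts, via CoherenceModel's topics parameter:

from gensim.models import CoherenceModel

# Keep only topics whose top words all occur in the held-out texts,
# since words with zero counts there are what produce inf.
test_vocab = set(word for doc in test_data for word in doc)
topics = [[word for word, _ in model.show_topic(t, topn=10)]
          for t in range(model.num_topics)]
scorable = [t for t in topics if all(word in test_vocab for word in t)]

# Assumes at least one topic survives the filter.
CM = CoherenceModel(topics=scorable, texts=test_data,
                    dictionary=dictionary, coherence='c_uci')
print(CM.get_coherence_per_topic())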