Cosine similarity and Hellinger distance do not agree on the trend of data

Cameron Fen

Mar 28, 2018, 11:41:36 AM

to gensim

I have a set of documents whose similarity I am trying to measure over time. I have trained word2vec, tf-idf and LSI models to get word vectors and calculated the average similarity of all documents published in each week from 1977 to 2017. This yields a downward trend in cosine similarity that is strongly significant (t-stat in the 100s), in accordance with our hypothesis.

However, with the same data I fit LDA and calculate the Hellinger distance (which, unlike cosine similarity, is 1 when the documents are maximally dissimilar and 0 when they are identical). This exhibits the same downward and strongly significant trend. The problem is that this disagrees with the results from the other methods: they imply that similarity among documents is decreasing, while LDA seems to suggest with near certainty (t-stat in the 100s) that similarity is increasing, since a falling distance means rising similarity. Can anyone suggest what might be going on? I don't know if this would make a good Cross Validated post, but I don't know where else to post it. I use Gensim for the models (word2vec, tf-idf, LSI, LDA) as well as the similarity calculations, and NLTK for data cleaning.
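
To make the direction of the two measures concrete, here is a minimal sketch in plain Python. It is not Gensim's implementation (Gensim ships `gensim.matutils.cossim` and `gensim.matutils.hellinger` for this), and the three topic distributions are made-up illustrative values, not my data. The helper `hellinger_similarity` is a hypothetical convenience that flips the distance so both measures point the same way.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two dense vectors (1 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def hellinger_distance(p, q):
    """Hellinger distance between two discrete probability distributions.

    0 means identical distributions, 1 means no overlapping support.
    """
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(0.5 * squared)


def hellinger_similarity(p, q):
    """Hypothetical helper: flip the distance so that larger = more similar."""
    return 1.0 - hellinger_distance(p, q)


# Made-up topic distributions over three topics (each sums to 1).
doc_a = [0.7, 0.2, 0.1]
doc_b = [0.6, 0.3, 0.1]    # close to doc_a
doc_c = [0.05, 0.05, 0.9]  # far from doc_a

for name, other in (("a vs b", doc_b), ("a vs c", doc_c)):
    print(
        name,
        "cosine similarity = %.3f" % cosine_similarity(doc_a, other),
        "Hellinger distance = %.3f" % hellinger_distance(doc_a, other),
        "1 - Hellinger = %.3f" % hellinger_similarity(doc_a, other),
    )
```

For the closer pair the cosine similarity is higher and the Hellinger distance is lower, so a downward trend in the distance and a downward trend in the similarity describe opposite behaviour.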

Ivan Menshikh

Mar 29, 2018, 9:09:32 AM

to gensim

Hello Cameron,

How did you train your models? Can you describe it more concretely (especially which part of the dataset you used for training and which for calculating similarity)?
