Choosing between Cosine Similarity and WMD (Word Mover's Distance) Similarity


Loreto Parisi

Oct 23, 2018, 12:44:44 PM
to Gensim
I'm using both Cosine Similarity and WMD to compare a list of documents to an input document, where a document has multiple lines separated by one or more '\n'.
I'm using the Word2Vec binary model from the FastText English WikiNews vectors, with embedding dimension 300.

Assume I have defined these simple methods for text pre-processing, centroid calculation, and cosine similarity:

import numpy as np
from nltk.tokenize import word_tokenize

def preprocess(doc, stop_words):
    doc = doc.lower()  # Lower the text.
    doc = word_tokenize(doc)  # Split into words.
    doc = [w for w in doc if w not in stop_words]  # Remove stopwords.
    doc = [w for w in doc if w.isalpha()]  # Remove numbers and punctuation.
    return doc

def sentence_centroid(sentence, wv):
    v = np.zeros(300)
    count = 0
    for w in sentence:
        if w in wv:  # skip out-of-vocabulary words
            v += wv[w]
            count += 1
    # Average over the in-vocabulary words; max() guards against empty input.
    return v / max(count, 1)

def cosine_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
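A quick self-contained sanity check of these helpers with a toy 300-dimensional "model" (the helpers are repeated here, with the centroid averaging only over in-vocabulary words, so the snippet runs on its own; `toy_wv` and its words are made up for illustration):

```python
import numpy as np

def sentence_centroid(sentence, wv):
    v = np.zeros(300)
    count = 0
    for w in sentence:
        if w in wv:  # skip out-of-vocabulary words
            v += wv[w]
            count += 1
    return v / max(count, 1)

def cosine_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy "model": two 300-dim vectors standing in for real embeddings.
toy_wv = {"dog": np.ones(300), "cat": np.full(300, 0.5)}

c1 = sentence_centroid(["dog", "cat"], toy_wv)       # elementwise 0.75
c2 = sentence_centroid(["dog", "oov_word"], toy_wv)  # only "dog" counts

print(round(cosine_sim(c1, c2), 4))  # parallel vectors -> 1.0
```

Note that cosine similarity is scale-invariant, so dividing by the word count versus the sentence length only matters if a sentence can be empty of in-vocabulary words.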


I'm doing the following. First I take my input document and calculate its centroid from my Word2Vec model:

inputv = sentence_centroid(preprocess(lyric_to_compare,stop_words), model)
wmd_distances = []
cosine_similarities = []



I then iterate over the list of documents:

for i in range(len(document_list)):
    l2 = document_list[i]

    # lyrics centroid
    l2v = sentence_centroid(preprocess(l2, stop_words), model)
    # wmd distance
    wmdistance = model.wmdistance(preprocess(lyric_to_compare, stop_words), preprocess(l2, stop_words))
    wmd_distances.append(wmdistance)
    # cosine similarity
    cosine_similarity = cosine_sim(inputv, l2v)
    cosine_similarities.append(cosine_similarity)

So I now have the WMD distances and the cosine similarities of all documents against inputv.
At this point I want to normalize these values.
I first calculate the WMD similarity as 1 - wmd_distance. In the code here I'm normalizing against the max value, so I'm doing wmd_max - i, where i is the i-th WMD distance value;
then I normalize between min and max.

# normalize similarity score
if len(wmd_distances) > 1:
    wmd_max = np.amax(wmd_distances)
    wmd_distances = [(wmd_max - i) for i in wmd_distances]
    wmd_distances_norm = [((x - np.min(wmd_distances)) / (np.max(wmd_distances) - np.min(wmd_distances))) for x in wmd_distances]
    cosine_similarities_norm = [((x - np.min(cosine_similarities)) / (np.max(cosine_similarities) - np.min(cosine_similarities))) for x in cosine_similarities]
else:
    wmd_distances = [(1 - i) for i in wmd_distances]
    wmd_distances_norm = wmd_distances
    cosine_similarities_norm = cosine_similarities
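For reference, the min-max step can be factored into a small helper (a sketch; `normalize_minmax` is a hypothetical name, with a guard for the all-equal case that the list comprehensions above would divide by zero on):

```python
import numpy as np

def normalize_minmax(values):
    """Rescale a list of numbers linearly into [0, 1]."""
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    if hi == lo:  # all values equal: avoid division by zero
        return np.zeros_like(values)
    return (values - lo) / (hi - lo)

print(normalize_minmax([2.0, 3.0, 5.0]))  # smallest -> 0.0, largest -> 1.0
```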

So my output is now a list of cosine similarity and WMD similarity values, normalized as above.

Applying this to different documents I ran into some issues. First of all, I'm not completely sure about using the max value to get the WMD similarity:

wmd_similarity[ i ] = max( wmd_distances) - wmd_distances[ i ]

which could be as simple as wmd_similarity[ i ] = 1 - wmd_distances[ i ], but that would introduce negative values.

The second point is the normalization: assuming it makes sense at all, I cannot reconcile the different scales of the two metrics in order to choose the best option.
Any hint?


Gordon Mohr

Oct 24, 2018, 4:11:29 AM
to Gensim
It's not clear what you mean by 'normalize', or ultimately hope to achieve by this step. Are you sure you need it?

Cosine-similarities will already be in a range from -1.0 to 1.0. Further, when they come from the same model/process, they'll be comparable to each other. For example, for sentences a, b, c, d, e, and f, if cossim(a,b) > cossim(d,e), then it'd be typical/defensible to say that "a and  b are more similar to each other than d and e". 

However, if you also calculated cossim(a,c), and then *scaled* the cossim(a,b) and cossim(a,c) values based on just the min/max seen in those pairings, the scaled version wouldn't necessarily be meaningfully comparable to some values scaled based on a different set of pairings. (And if you didn't care about such longer-range comparability – just ranks – you probably wouldn't be doing scaling at all.)

For WMDistance, the values are positive and vary more – indeed I'm not sure there is an obvious 'max' value to the distance, as longer and more-different texts could get much larger distances. And for some downstream tasks, there's no need to re-scale the values: the raw distances, or sorted rank of results, or relative differences between raw values, may be enough. 

But if you do need some similarity-value that ranges from 0.0 to 1.0, rather than scaling by observed ranges, a common transformation that's used is:

    similarity = 1 / (1 + distance)

Then the re-scaled values don't depend on what max happened to be in the same grouping. (You could also then shift-and-scale that value to be in the -1.0 to 1.0 range, by multiplying by 2 and subtracting 1, but even then, comparing the WMD-derived similarity with the cosine-similarity might be nonsensical, given their very-different methods-of-calculation and typical distributions.)
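A minimal sketch of these two transforms (the function names are mine, not from gensim):

```python
def wmd_to_similarity(distance):
    """Map a non-negative WMD distance into (0, 1]; distance 0 -> 1.0."""
    return 1.0 / (1.0 + distance)

def shift_to_cosine_range(sim):
    """Shift-and-scale a (0, 1] similarity into the (-1, 1] cosine range."""
    return 2.0 * sim - 1.0

for d in [0.0, 0.5, 1.0, 10.0]:
    s = wmd_to_similarity(d)
    print(d, round(s, 4), round(shift_to_cosine_range(s), 4))
```

Larger distances decay smoothly toward similarity 0.0 (shifted: -1.0), with no dependence on the other distances in the batch.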

- Gordon

Loreto Parisi

Oct 30, 2018, 7:56:59 AM
to Gensim
Hey Gordon, thank you very much for your suggestions, this means a lot!
I think you were right: there is no need to scale them, given the [-1, 1] range of cosine similarity, and since in the end a direct comparison of the two metrics' distributions does not help.

Putting it all together, I have modified the code as follows, according to your suggestions :)

inputv = sentence_centroid(preprocess(lyric_to_compare, stop_words), model)
wmd_similarities = []
cosine_similarities = []
for i in range(len(lyrics_list)):
    l2 = lyrics_list[i]['lyrics_body']
    # lyrics centroid
    l2v = sentence_centroid(preprocess(l2, stop_words), model)
    # wmd similarity
    wmdistance = model.wmdistance(preprocess(lyric_to_compare, stop_words), preprocess(l2, stop_words))
    wmsimilarity = 1 / (1 + wmdistance)
    wmd_similarities.append(wmsimilarity)
    # cosine similarity
    cosine_similarity = cosine_sim(inputv, l2v)
    cosine_similarities.append(cosine_similarity)
# re-scale wmd similarities into [-1, 1]
wmd_similarities_norm = [(2 * i - 1) for i in wmd_similarities]

I have added the re-scaling of the WMD similarities into [-1, 1] just for testing purposes.
I will now try it on the real-world data and see what happens, and whether this approach is closer to what I would expect :)
My two cents: maybe add your suggestions to the gensim WMD/CosineSim tutorials, because they were definitely very helpful to me, and hopefully will be for other gensim users :)
Thanks again.