Choosing between Cosine Similarity and WMD (Word Mover's Distance) Similarity


Loreto Parisi

Oct 23, 2018, 12:44:44 PM
to Gensim
I'm using both Cosine Similarity and WMD to compare a list of documents to an input document, where a document has multiple lines separated by one or more '\n'.
I'm using the Word2Vec binary model from the FastText English WikiNews vectors, with embedding dimension 300.

Assume I have defined these simple methods for text pre-processing, centroid calculation, and cosine similarity:

import numpy as np
from nltk.tokenize import word_tokenize

def preprocess(doc, stop_words):
    doc = doc.lower()  # Lower the text.
    doc = word_tokenize(doc)  # Split into words.
    doc = [w for w in doc if w not in stop_words]  # Remove stopwords.
    doc = [w for w in doc if w.isalpha()]  # Remove numbers and punctuation.
    return doc

def sentence_centroid(sentence, wv):
    v = np.zeros(300)
    count = 0
    for w in sentence:
        if w in wv:  # skip out-of-vocabulary words
            v += wv[w]
            count += 1
    # Average over the in-vocabulary words; max() guards against empty input.
    return v / max(count, 1)

def cosine_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
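A quick self-contained sanity check of these helpers with a toy 300-dimensional "model" (the helpers are repeated here, with the centroid averaging only over in-vocabulary words, so the snippet runs on its own; `toy_wv` and its words are made up for illustration):

```python
import numpy as np

def sentence_centroid(sentence, wv):
    v = np.zeros(300)
    count = 0
    for w in sentence:
        if w in wv:  # skip out-of-vocabulary words
            v += wv[w]
            count += 1
    return v / max(count, 1)

def cosine_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy "model": two 300-dim vectors standing in for real embeddings.
toy_wv = {"dog": np.ones(300), "cat": np.full(300, 0.5)}

c1 = sentence_centroid(["dog", "cat"], toy_wv)       # elementwise 0.75
c2 = sentence_centroid(["dog", "oov_word"], toy_wv)  # only "dog" counts

print(round(cosine_sim(c1, c2), 4))  # parallel vectors -> 1.0
```

Note that cosine similarity is scale-invariant, so dividing by the word count versus the sentence length only matters if a sentence can be empty of in-vocabulary words.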


I'm doing the following. First I take my input document and calculate its centroid from my Word2Vec model:

inputv = sentence_centroid(preprocess(lyric_to_compare,stop_words), model)
wmd_distances = []
cosine_similarities = []



I then iterate over the list of documents:

for i in range(len(document_list)):
    l2 = document_list[i]

    # lyrics centroid
    l2v = sentence_centroid(preprocess(l2, stop_words), model)
    # wmd distance
    wmdistance = model.wmdistance(preprocess(lyric_to_compare, stop_words), preprocess(l2, stop_words))
    wmd_distances.append(wmdistance)
    # cosine similarity
    cosine_similarity = cosine_sim(inputv, l2v)
    cosine_similarities.append(cosine_similarity)

So I now have the WMD distances and the cosine similarities of all documents against inputv.
At this point I want to normalize these values.
I first calculate the WMD similarity as 1 - wmd_distance. In the code here I'm normalizing against the max value, so I'm doing wmd_max - i, where i is the i-th WMD distance value;
then I normalize between min and max.

# normalize similarity score
if len(wmd_distances) > 1:
    wmd_max = np.amax(wmd_distances)
    wmd_distances = [(wmd_max - i) for i in wmd_distances]
    wmd_distances_norm = [((x - np.min(wmd_distances)) / (np.max(wmd_distances) - np.min(wmd_distances))) for x in wmd_distances]
    cosine_similarities_norm = [((x - np.min(cosine_similarities)) / (np.max(cosine_similarities) - np.min(cosine_similarities))) for x in cosine_similarities]
else:
    wmd_distances = [(1 - i) for i in wmd_distances]
    wmd_distances_norm = wmd_distances
    cosine_similarities_norm = cosine_similarities
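For reference, the min-max step can be factored into a small helper (a sketch; `normalize_minmax` is a hypothetical name, with a guard for the all-equal case that the list comprehensions above would divide by zero on):

```python
import numpy as np

def normalize_minmax(values):
    """Rescale a list of numbers linearly into [0, 1]."""
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    if hi == lo:  # all values equal: avoid division by zero
        return np.zeros_like(values)
    return (values - lo) / (hi - lo)

print(normalize_minmax([2.0, 3.0, 5.0]))  # smallest -> 0.0, largest -> 1.0
```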

So my output is now a list of cosine similarity and WMD similarity values, normalized as above.

Applying this to different documents I ran into some issues. First of all, I'm not completely sure about using the max value to get the WMD similarity:

wmd_similarity[ i ] = max( wmd_distances) - wmd_distances[ i ]

which could be as simple as wmd_similarity[ i ] = 1 - wmd_distances[ i ], but that would introduce negative values.

The second point is the normalization: assuming it makes sense at all, I cannot reconcile the different scales of the two metrics in order to choose the best option.
Any hint?


Gordon Mohr

Oct 24, 2018, 4:11:29 AM
to Gensim
It's not clear what you mean by 'normalize', or ultimately hope to achieve by this step. Are you sure you need it?

Cosine-similarities will already be in a range from -1.0 to 1.0. Further, when they come from the same model/process, they'll be comparable to each other. For example, for sentences a, b, c, d, e, and f, if cossim(a,b) > cossim(d,e), then it'd be typical/defensible to say that "a and  b are more similar to each other than d and e". 

However, if you also calculated cossim(a,c), and then *scaled* the cossim(a,b) and cossim(a,c) values based on just the min/max seen in those pairings, the scaled version wouldn't necessarily be meaningfully comparable to some values scaled based on a different set of pairings. (And if you didn't care about such longer-range comparability – just ranks – you probably wouldn't be doing scaling at all.)

For WMDistance, the values are positive and vary more – indeed I'm not sure there is an obvious 'max' value to the distance, as longer and more-different texts could get much larger distances. And for some downstream tasks, there's no need to re-scale the values: the raw distances, or sorted rank of results, or relative differences between raw values, may be enough. 

But if you do need some similarity-value that ranges from 0.0 to 1.0, rather than scaling by observed ranges, a common transformation that's used is:

    similarity = 1 / (1 + distance)

Then the re-scaled values don't depend on what max happened to be in the same grouping. (You could also then shift-and-scale that value to be in the -1.0 to 1.0 range, by multiplying by 2 and subtracting 1, but even then, comparing the WMD-derived similarity with the cosine-similarity might be nonsensical, given their very-different methods-of-calculation and typical distributions.)
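A minimal sketch of these two transforms (the function names are mine, not from gensim):

```python
def wmd_to_similarity(distance):
    """Map a non-negative WMD distance into (0, 1]; distance 0 -> 1.0."""
    return 1.0 / (1.0 + distance)

def shift_to_cosine_range(sim):
    """Shift-and-scale a (0, 1] similarity into the (-1, 1] cosine range."""
    return 2.0 * sim - 1.0

for d in [0.0, 0.5, 1.0, 10.0]:
    s = wmd_to_similarity(d)
    print(d, round(s, 4), round(shift_to_cosine_range(s), 4))
```

Larger distances decay smoothly toward similarity 0.0 (shifted: -1.0), with no dependence on the other distances in the batch.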

- Gordon

Loreto Parisi

Oct 30, 2018, 7:56:59 AM
to Gensim
Hey Gordon, thank you very much for your suggestions, this means a lot!
I think you were right: there is no need to scale them, given the [-1, 1] range of cosine similarity, and since in the end a direct comparison of the two metrics' distributions does not help.

Putting it all together, I have modified the code as follows, according to your suggestions :)

inputv = sentence_centroid(preprocess(lyric_to_compare, stop_words), model)
wmd_similarities = []
cosine_similarities = []
for i in range(len(lyrics_list)):
    l2 = lyrics_list[i]['lyrics_body']
    # lyrics centroid
    l2v = sentence_centroid(preprocess(l2, stop_words), model)
    # wmd similarity
    wmdistance = model.wmdistance(preprocess(lyric_to_compare, stop_words), preprocess(l2, stop_words))
    wmsimilarity = 1 / (1 + wmdistance)
    wmd_similarities.append(wmsimilarity)
    # cosine similarity
    cosine_similarity = cosine_sim(inputv, l2v)
    cosine_similarities.append(cosine_similarity)
# re-scale wmd similarities into [-1, 1]
wmd_similarities_norm = [(2 * i - 1) for i in wmd_similarities]

I have added the re-scaling of the WMD similarities into [-1, 1] just for testing purposes.
I will now try it on the real-world data and see what happens, and whether this approach is closer to what I would expect :)
My two cents: maybe add your suggestions to the gensim WMD/CosineSim tutorials, because they were definitely very helpful to me, and hopefully will be for other gensim users :)
Thanks again.