Add the similarity threshold to gensim.models.keyedvectors.KeyedVectors.most_similar


Yan Xu

May 7, 2023, 8:58:37 PM5/7/23
to Gensim
Currently, the signature of most_similar is most_similar(positive=None, negative=None, topn=10, clip_start=0, clip_end=None, restrict_vocab=None, indexer=None)

If one needs to get only the neighbors whose similarity is above a certain threshold, one has to filter the list returned from most_similar. This requires additional code when multiple thresholds need to be explored, which is very useful during training or validation. I propose adding a similarity threshold to most_similar.

For example, the default threshold could be 0.95, and it may be customized if need be.
most_similar(positive=None, negative=None, topn=10, clip_start=0, clip_end=None, restrict_vocab=None, indexer=None, similarity=0.95)
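For illustration, the filtering one currently has to write could be sketched like this (the word list is made up, and `above_threshold` is a hypothetical helper, not part of gensim):

```python
# Hypothetical sketch: filtering most_similar() results by a score floor.
# `results` stands in for the list of (word, cosine_similarity) tuples
# that most_similar() returns; the values here are invented.
results = [("pear", 0.97), ("fruit", 0.93), ("banana", 0.88), ("car", 0.31)]

def above_threshold(results, threshold):
    """Keep only neighbors whose similarity exceeds `threshold`."""
    return [(word, sim) for word, sim in results if sim > threshold]

# Exploring multiple thresholds means repeating this filter each time:
for t in (0.95, 0.90, 0.85):
    print(t, above_threshold(results, t))
```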

Thanks,

Gordon Mohr

May 9, 2023, 1:55:16 PM5/9/23
to Gensim
I can see why that might be useful in some situations, but my initial sense would be that this sort of similarity-floor *shouldn't* be a built-in parameter, because:

* it's very easy to do idiomatically with Python `itertools` as a 1-liner outside the function; and
* making it a parameter might mislead users as to the cost/efficiency, and the proper interpretation, of cosine-similarity values

To explain further, first, here's the way to do it externally:

    import itertools as it
    all_sims = kv_model.most_similar('apple', topn=len(kv_model))
    sims_over_0_95 = list(it.takewhile(lambda sim: sim[1] > 0.95, all_sims))

And note – this is important for points below – that if you then want a larger set, you can re-use `all_sims`:

    sims_over_0_50 = list(it.takewhile(lambda sim: sim[1] > 0.50, all_sims))

So, why would making this even easier, via a parameter, potentially mislead users?

First, note that if you did the above – both 0.95 and 0.50 probes – via two calls to `.most_similar()`, you'd actually be doing the most calculation-intensive step – pairwise similarities with every model word – twice. And, you'd be doing another somewhat-intensive sorting step twice. (Further, there's an optimization in the sorting – avoiding a full sort of items that are surely out of the topn – using `numpy.argpartition`, that might not be as easy to do for a value-threshold, though I might be missing an option to match that in the floor case.)
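To make the `argpartition` point concrete, here's a sketch (with synthetic scores rather than a real model) of how the top-n path avoids a full sort, versus what a value floor would look like:

```python
import numpy as np

# Synthetic stand-in for the pairwise-similarity array that
# most_similar() computes internally (one cosine per vocabulary word);
# a real model isn't needed to show the selection step.
rng = np.random.default_rng(0)
sims = rng.uniform(-1.0, 1.0, size=100_000)
topn = 10

# Top-n without a full sort: argpartition moves the n largest values
# into the last n slots in O(len) time, so only those n need sorting.
part = np.argpartition(sims, -topn)[-topn:]
top_idx = part[np.argsort(-sims[part])]      # best-first order

# A value floor, by contrast, is a plain boolean mask; the number of
# survivors isn't known in advance, so the argpartition trick doesn't
# carry over directly.
floor_idx = np.flatnonzero(sims > 0.95)
```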

If you truly want to compare multiple cutoffs, doing *one* call, returning the max you might possibly need, then carving your alternate-sized results from that may be noticeably faster.

Second – and this is more subtle – because the similarity values max at 1.0, and often only the 0.0-1.0 range is even thought about, there's a tendency for people to view cosine-similarity as if it were some absolute measure of inherent alikeness. They'll think (imprecisely) that 0.90 means either "90% similar" (on some objective basis), or perhaps even confuse it with percentiles, thinking it means "more similar than 90% of other items". But it's neither of these, and in fact the *range* of effective similarities can be strongly affected by other model choices.

For example, if with plentiful training texts, you train two models, one with `vector_size=50` and the other with `vector_size=300`, then check for the top-n words most like `apple`, they may be in very close agreement. The nearest word in each may be the same. For many purposes, the models may be of similar value (with the smaller model far easier to deploy, or capable of modeling more words in a fixed amount of memory). But the reported cosine-similarity for that nearest-neighbor will usually be wildly different, because one of the coordinate-systems is far more 'spacious'.

If you do some initial experiments, and think, "0.80 is the cutoff that works for me", & start hardcoding that in places, but then fail to realize that `0.8` in one model is *practically* no better than `0.3` in another (with different `vector_size`, or `negative`, or whatever), you'll have taken a wrong turn, & likely wasted time or missed a chance at more-robust analysis.

I'm not suggesting you're making this error – that you're talking about probing different similarity cutoffs suggests you may be aware how situational the absolute levels can be. But I see this slightly-off mental model *a lot*, and so view every use of absolute similarities, rather than relative similarities or rank orders, with a little suspicion. 

Of course, if an implementation were efficient enough, and some examples of cases where it really helps were vivid enough, I might change my mind. But maybe also: the 1-liner above could just be mentioned in the method documentation?

- Gordon

Yan Xu

May 18, 2023, 10:15:30 AM5/18/23
to Gensim
Thanks, the one-liner is what I am doing. However, this leaves the question of memory consumption: all similarities are returned, and the cutoff leaves the unwanted results to garbage collection.

Gordon Mohr

May 18, 2023, 8:08:10 PM5/18/23
to Gensim
That's a good point, given the extra memory required to return the list-of-(word, score) tuples.

(Note that even when returning a `topn=1` single nearest-neighbor, the internal operation of `most_similar()` chooses – for efficiency of the underlying BLAS array operations – to calculate the pairwise similarity to every candidate answer, into one array. With common vectors of 100 dimensions or more, this temporary memory use will still be tiny compared to the model, so it generally hasn't been a focus of memory-optimization.)

If that memory expansion is a pressing concern, you could consider using the `topn=0` option, which returns the raw array of pairwise similarities (in the model's native key order). You could then do your own sort-of-indexes, mimicking the method's code, keeping the peak memory to that of…

- the float array of all similarities (as returned by `topn=0`)
- the int array whose entries index the nth-most-similar items

…which avoids the list/tuples overhead. 
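A rough sketch of that `topn=0` approach, using made-up stand-in data so it's self-contained (a real call would supply the model's `index_to_key` list and the array `most_similar(..., topn=0)` returns):

```python
import numpy as np

# Stand-ins: `keys` mimics the model's index_to_key list, `sims` the raw
# per-key similarity array that most_similar(..., topn=0) would return.
keys = ["pear", "fruit", "banana", "car", "road"]
sims = np.array([0.97, 0.31, 0.88, 0.12, 0.93])

# The int array whose entries index the nth-most-similar items...
order = np.argsort(-sims)
# ...cut at a similarity floor of 0.90 via binary search on the sorted
# (negated) scores -- no list of (word, score) tuples is materialized.
n_above = int(np.searchsorted(-sims[order], -0.90))
winners = order[:n_above]
# `winners` now indexes the keys scoring above 0.90, best first.
```

Only the float array and the int index array are ever held, matching the two-array memory bound described above.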

If we then wanted to improve `most_similar()` for such needs – which I'm not sure would be worth the complexity, but could be convinced if the need acute enough & implementation clean enough – some things that might be considered:

- a mode that returns those 2 raw arrays above, instead of anything larger
- a max batch size of pairwise similarities to calculate at once, avoiding even the full-size array-of-floats, at a speed penalty
- a mode that returns an iterator that peels off each (key, score) result from the 2 arrays, skipping the full list-of-tuples memory cost – but letting a consumer read results until they meet whatever count/similarity/etc threshold they prefer
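That iterator idea could look roughly like this, sketched outside the library (the function name and data are invented, not a proposed API):

```python
import itertools
import numpy as np

def iter_most_similar(keys, sims):
    """Lazily yield (key, score) pairs, best first, from the two raw
    arrays -- the full list-of-tuples is never built."""
    order = np.argsort(-sims)   # int array of nth-most-similar indexes
    for i in order:
        yield keys[i], float(sims[i])

# A consumer reads results until its own threshold is met, e.g. a floor:
keys = ["pear", "fruit", "banana", "car"]
sims = np.array([0.97, 0.93, 0.88, 0.31])
over_0_90 = list(itertools.takewhile(lambda pair: pair[1] > 0.90,
                                     iter_most_similar(keys, sims)))
```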

- Gordon