I can see why that might be useful in some situations, but my initial sense would be that this sort of similarity-floor *shouldn't* be a built-in parameter, because:
* it's very easy to do, as an idiomatic 1-liner using Python `itertools`, outside the function; and
* making it a parameter might mislead users about the cost/efficiency of the operation, and about the proper interpretation of cosine-similarity values
To explain further, first, here's the way to do it externally:
```python
import itertools as it

# get ALL words, ranked most- to least-similar
all_sims = kv_model.most_similar('apple', topn=len(kv_model))
# keep only those above the floor (note: it's `takewhile`, no underscore)
sims_over_0_95 = list(it.takewhile(lambda sim: sim[1] > 0.95, all_sims))
```
And note – this is important for points below – that if you then want a larger set, you can re-use `all_sims`:
```python
sims_over_0_50 = list(it.takewhile(lambda sim: sim[1] > 0.50, all_sims))
```
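If it helps to see the idiom end-to-end without a trained model, here's the same carving with an invented, already-sorted toy list standing in for real `most_similar()` output (the words and scores here are made up for illustration):

```python
import itertools as it

# Stand-in for most_similar(..., topn=len(kv_model)) output:
# (word, cosine-similarity) pairs, sorted most- to least-similar.
all_sims = [('pear', 0.97), ('fruit', 0.96), ('banana', 0.80),
            ('cider', 0.55), ('laptop', 0.20)]

# Carve two different cutoffs from the SAME precomputed list.
sims_over_0_95 = list(it.takewhile(lambda sim: sim[1] > 0.95, all_sims))
sims_over_0_50 = list(it.takewhile(lambda sim: sim[1] > 0.50, all_sims))

print(sims_over_0_95)  # [('pear', 0.97), ('fruit', 0.96)]
print(sims_over_0_50)  # the first four pairs, stopping before ('laptop', 0.20)
```

(`takewhile` stops at the *first* item failing the predicate, which is exactly right here because the list is already sorted by descending similarity.)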
So, why would making this even easier, via a parameter, potentially mislead users?
First, note that if you did the above – both the 0.95 and 0.50 probes – via two calls to `.most_similar()`, you'd actually be doing the most calculation-intensive step – pairwise similarities with every model word – twice. And you'd be doing another somewhat-intensive sorting step twice. (Further, there's an optimization in the sorting – avoiding a full sort of items that are surely outside the topn – using `numpy.argpartition`, that might not be as easy to apply to a value-threshold, though I might be missing an option to match that in the floor case.)
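To illustrate that `argpartition` shortcut: a rough sketch using plain numpy arrays of random unit vectors, not gensim's actual internals, just to show how the top-n can be isolated without fully sorting all n similarities:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy stand-in for a model's unit-normalized word vectors.
vecs = rng.normal(size=(10_000, 50))
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
query = vecs[0]

topn = 10
sims = vecs @ query  # the expensive step: similarity vs EVERY word

# Full sort: O(n log n) over all 10,000 similarities.
full_order = np.argsort(-sims)[:topn]

# argpartition: O(n) to isolate the top-n set, then sort only those 10.
part = np.argpartition(-sims, topn)[:topn]
part_order = part[np.argsort(-sims[part])]

assert (full_order == part_order).all()  # same top-10, less sorting work
```

Note the shortcut depends on knowing *n* up front – with a similarity-floor, you don't know how many items will clear it until you've looked, which is why a value-threshold doesn't map onto `argpartition` as neatly.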
If you truly want to compare multiple cutoffs, doing *one* call, returning the max you might possibly need, then carving your alternate-sized results from that may be noticeably faster.
Second – and this is more subtle – because the similarity values max out at 1.0, and often only the 0.0-1.0 range is even thought about, there's a tendency for people to view cosine-similarity as if it were some absolute measure of inherent alikeness. They'll think (imprecisely) that 0.90 means "90% similar" (on some objective basis), or perhaps even confuse it with percentiles, thinking it means "more similar than 90% of other items". But it's neither of these, and in fact the *range* of effective similarities can be strongly affected by other model choices.
For example, if, with plentiful training texts, you train two models, one with `vector_size=50` and the other with `vector_size=300`, then check for the top-n words most like `apple`, they may be in very-close agreement. The nearest word in each may be the same. For many purposes, the models may be of similar value (with the smaller model far easier to deploy, or capable of modeling more words in a fixed amount of memory). But the reported cosine-similarity for that nearest-neighbor will usually be wildly different, because one of the coordinate-systems is far more 'spacious'.
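You can simulate the effect with purely random unit vectors (no trained embeddings at all), just to show that the dimensionality of the coordinate-space, on its own, shifts the typical nearest-neighbor similarity:

```python
import numpy as np

def max_neighbor_sim(dim, n_words=2000, seed=42):
    """Nearest-neighbor cosine similarity of word 0, among random unit vectors."""
    rng = np.random.default_rng(seed)
    vecs = rng.normal(size=(n_words, dim))
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
    sims = vecs[1:] @ vecs[0]  # similarity of word 0 to every other word
    return sims.max()

# Same 'vocabulary' size, same randomness; only the dimensionality differs.
print(max_neighbor_sim(50))   # noticeably larger...
print(max_neighbor_sim(300))  # ...than in the roomier 300-d space
```

Real trained vectors aren't random, of course, but the same pressure applies: the same "nearest" relationship lands at a very different absolute number depending on `vector_size`.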
If you do some initial experiments, and think, "0.80 is the cutoff that works for me", and start hardcoding that in places, but then fail to realize that `0.8` in one model is *practically* no better than `0.3` in another (with a different `vector_size`, or `negative`, or whatever), you'll have taken a wrong turn, and likely wasted time or missed a chance at a more-robust analysis.
I'm not suggesting you're making this error – that you're talking about probing different similarity cutoffs suggests you may be aware of how situational the absolute levels can be. But I see this slightly-off mental model *a lot*, and so view every use of absolute similarities, rather than relative similarities or rank orders, with a little suspicion.
Of course, if an implementation were efficient enough, and some examples of cases where it really helps were vivid enough, I might change my mind. But maybe also: the 1-liner above could just be mentioned in the method documentation?
- Gordon