It sounds like, given a text, you want to predict one or more labels, using other text examples, each with a known set of one or more labels, as training data.
Though it's intuitively tempting to try to do this via a `most_similar()` operation, especially when starting out, that's actually a pretty clumsy way to do what you want.
The term-of-art in machine learning for what you want to do is "multi-label classification". Even before trying `Doc2Vec`, I'd highly suggest working through some online examples using scikit-learn to do, 1st, some simple 'binary classification' of texts. (That's: every text belongs in one or the other class – like 'spam' or 'not-spam', or 'on-topic' or 'not-on-topic', etc.) Don't even try using `Doc2Vec` at 1st – just simpler 'bag of words' models, like those created by the scikit-learn `CountVectorizer` or `TfidfTransformer` classes.
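For concreteness, here's a minimal sketch of that 1st binary-classification step – the toy texts, labels, and the choice of `LogisticRegression` are just illustrative assumptions, not the only (or best) options:

```python
# Minimal binary text classification with scikit-learn.
# Toy spam/not-spam data, purely for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "win free money now", "limited offer click here",
    "meeting agenda for tuesday", "project status report attached",
]
labels = ["spam", "spam", "not-spam", "not-spam"]

# Pipeline: bag-of-words counts -> a simple linear classifier
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(texts, labels)

print(model.predict(["free money offer"]))
```

The same pipeline shape works if you later swap `CountVectorizer` for `TfidfVectorizer`, or the classifier for something else – that interchangeability is a big part of why it's worth learning the scikit-learn way 1st.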
Then, work through some 'multiclass classification' of texts - where every text gets exactly one class. Then, finally, a multi-label problem.
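The multi-label version mostly just changes how the labels are represented: each text gets a binary indicator row, one column per possible label. A sketch, again with made-up toy data, using scikit-learn's `MultiLabelBinarizer` and `OneVsRestClassifier`:

```python
# Multi-label text classification sketch: each text may carry
# several labels. Toy data for illustration only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

texts = [
    "python pandas dataframe question",
    "python web scraping with requests",
    "css layout flexbox issue",
    "javascript css styling question",
]
label_sets = [
    {"python", "pandas"}, {"python", "web"},
    {"css", "web"}, {"javascript", "css"},
]

# Turn the label-sets into a binary indicator matrix,
# one column per distinct label.
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(label_sets)

# One binary classifier per label, trained on tf-idf features
model = make_pipeline(
    TfidfVectorizer(),
    OneVsRestClassifier(LogisticRegression()),
)
model.fit(texts, Y)

# Predict label-sets for a new text
pred = mlb.inverse_transform(model.predict(["python dataframe question"]))
print(pred)
```

With so little training data the predictions won't be reliable, but the shape of the problem – indicator matrix in, indicator matrix out, `inverse_transform()` back to label-sets – is exactly what you'd use at scale.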
Doing those will help you better think about the steps of your task, like converting the text into feature-vectors and then training any one of many possible algorithms to learn to make predictions. Only after doing that in other ways might you consider adding `Doc2Vec` as a way to get feature vectors from text – either as the main way, or an addition to other techniques. On some tasks, it'll help, but on others, its condensation of the whole text into a single small 'dense' vector might discard key aspects (like say a single word that's *always* a reliable indicator some label should be applied) more so than other approaches.
(In particular, the intuition to "find a ranked list of similar vectors & use those to pick the labels" is also part of the foundational classifier algorithm "K-Nearest-Neighbors", which might be a top-performer for you, but can also be a bit slower and memory-hungry than many other algorithms. So it's a good exercise to formulate your task so that you can try it, from a standard library, and then other classifiers, on your task. If you were trying to extend what you've tried to be more like KNN, you'd essentially take your new text of unknown labels, find its 5 nearest *exact document* neighbors, then look at *their* known labels – and use some heuristic to choose which of those candidate labels should be imputed for your new text. But really: you probably want to grab an off-the-shelf implementation, which is likely to be well-tested, offer tunable options, and fit well into other evaluation techniques.)
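For example, the whole "find similar docs & vote on their labels" idea is one line in scikit-learn's `KNeighborsClassifier` – here sketched over tf-idf vectors with a cosine distance, though the toy data and parameter choices are again just illustrative:

```python
# The "nearest neighbors vote on the label" idea, off-the-shelf.
# Toy data for illustration; real tasks need far more examples.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

texts = [
    "win free money now", "limited offer click here",
    "meeting agenda for tuesday", "project status report attached",
]
labels = ["spam", "spam", "not-spam", "not-spam"]

# tf-idf vectors + cosine-distance KNN: the 3 most-similar
# training docs vote on the predicted label
model = make_pipeline(
    TfidfVectorizer(),
    KNeighborsClassifier(n_neighbors=3, metric="cosine"),
)
model.fit(texts, labels)

print(model.predict(["free money offer"]))
```

The `n_neighbors` and `metric` parameters are exactly the kind of tunable options you'd otherwise have to reinvent by hand around `most_similar()`.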
- Gordon