That's a really interesting application! While you'd have to verify via your own full-cycle evaluations, my hunch remains that it'd be better to over-weight the rare-group samples, with interspersed repetition, than to thin out the frequent-group samples, because even the frequent-group samples aren't strict copies of each other, but include some naturally-useful internal variety. (To the extent any training-'texts' are exact duplicates of each other, and not duplicates you've deliberately synthesized to over-weight some sample, those duplications probably *don't* help.)
IIUC, the source data doesn't have a natural ordering: they're "unordered-bags-of-proteins" rather than "ordered-lists-of-proteins". (Is that right?) If so, I'd be wary of modes where an artifact of your source data, namely which tokens happen to land as neighbors, might affect results, like PV-DM (`dm=1`) or enabling skip-gram word-training alongside PV-DBOW (`dm=0, dbow_words=1`). This would especially be a concern if all the "bags" are listed in some sort of lexicographic order and the `window` is usually smaller than the 'text'-length: you'd be learning associations that just reflect the token names. Some possible alternative ways to offset this:
(1) use pure PV-DBOW (`dm=0, dbow_words=0`) - and since this may train faster, you might be able to do more training; OR
(2) if using window-size-sensitive modes, use a humongous `window` (e.g. way, way larger than the length of any training 'text'), so that essentially every training window always includes all neighbors, without over-weighting those that happen-to-arbitrarily-be-close-neighbors; OR
(3) if using window-size sensitive modes, replace each example with several shuffled versions of the same example; OR
(4) use pure PV-DBOW, but add all word-tokens as additional tag-tokens (so every word-token equally trains all its co-occurring word-tokens)
(4) is probably roughly equivalent to (2), but for internal implementation reasons alluded to in the other thread about core-utilization, (2) might in practice run faster with higher thread utilization (using the classic iterable-object corpus interface).
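To make options (3) and (4) concrete, here's a minimal data-prep sketch. The corpus, names, and token strings are made-up placeholders; in real use, the `(words, tags)` pairs would feed gensim's `TaggedDocument`/`Doc2Vec` with the parameters discussed above.

```python
import random

# Hypothetical "unordered bag of proteins" example (placeholder tokens).
tokens = ["prot1", "prot2", "prot3", "prot4"]

def shuffled_copies(tokens, n_copies, seed=0):
    """Option (3): several independently shuffled versions of one example,
    so window-sensitive modes can't learn the arbitrary source ordering."""
    rng = random.Random(seed)
    out = []
    for _ in range(n_copies):
        copy = list(tokens)
        rng.shuffle(copy)
        out.append(copy)
    return out

def words_as_extra_tags(doc_id, tokens):
    """Option (4): for pure PV-DBOW, list every word-token as an additional
    doc-tag, so each token equally trains all its co-occurring tokens.
    Returns a (words, tags) pair suitable for a TaggedDocument."""
    return (list(tokens), [doc_id] + list(tokens))

variants = shuffled_copies(tokens, n_copies=3)
words, tags = words_as_extra_tags("sample_A", tokens)
```

The resulting documents would then go to, for example, `Doc2Vec(docs, dm=0, dbow_words=0)` for option (1), or `Doc2Vec(docs, dm=1, window=1000000)` for option (2).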
Anything which mixes in word-training risks another kind of imbalance, between the model's effort improving the doc-vectors versus the word-vectors. For example, with `dm=0, dbow_words=1, window=10`, there are roughly 10 times as many micro-training examples (input-vector -> target-prediction), with backpropagated nudges to internal weights and input vectors, for the word-vectors as for the doc-vectors. If the doc-vectors are the main output of interest, that means their quality *might* be lower, because of all the attempted word-to-word predictions.
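A rough back-of-envelope for that imbalance, under the simplifying assumption that each word position yields about `window` word->word skip-gram pairs on average, plus exactly one doctag->word pair:

```python
def training_example_ratio(doc_len, window):
    """Approximate ratio of word-vector micro-examples to doc-vector
    micro-examples in dm=0, dbow_words=1 mode (a simplification: assumes
    ~`window` skip-gram pairs per word position, one doctag pair each)."""
    doc_examples = doc_len            # one doctag->word prediction per position
    word_examples = doc_len * window  # ~window word->word predictions each
    return word_examples / doc_examples

print(training_example_ratio(doc_len=200, window=10))  # -> 10.0
```

The doc-length cancels out, so the imbalance is driven almost entirely by the effective `window` size.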
Another off-the-wall technique that might be worth considering: use pure Word2Vec with a giant `window` (tamping down or eliminating neighbor artifacts), and intersperse N copies of a synthetic word into each example, standing in for what would have been the doctags in `Doc2Vec`. Then, much like a `dm=0, dbow_words=1, window=1000000` mode, the doctag-vecs and word-vecs are trained together in comparable ways, and wind up in the same space. But the multiplier N gives you a knob for devoting more model-training effort towards making the doc-vectors predictive, rather than just the word-vectors.
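A minimal sketch of that interleaving step, with placeholder token and doc-ID names (the splice positions are random since the bags are unordered anyway):

```python
import random

def interleave_doc_token(doc_id, tokens, n, seed=0):
    """Splice `n` copies of a synthetic doc-token (standing in for the
    Doc2Vec doctag) into a copy of `tokens`, at random positions."""
    rng = random.Random(seed)
    out = list(tokens)
    for _ in range(n):
        out.insert(rng.randrange(len(out) + 1), doc_id)
    return out

sentence = interleave_doc_token("_DOC_sample_A_", ["prot1", "prot2", "prot3"], n=2)
```

The resulting sentences would then go to something like `Word2Vec(sentences, sg=1, window=1000000, ...)`, where a larger `n` devotes more of the model's training effort to the doc-token's vector.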
- Gordon