balancing samples for Doc2Vec

A Viehweger

Feb 6, 2019, 3:05:30 AM
to Gensim
Hi,

I train a model using Doc2Vec. Each document has a unique ID. However, some documents are of a type that is vastly overrepresented, i.e. for every document of type A I have 1000 of type B. 

In Doc2Vec, is it necessary to downsample such an unbalanced dataset, or more generally what will the effect be during training? 

If downsampling is necessary, do you have any experience on a good ratio, like at most 10:1 for overrepresented documents?

Thank you very much,
Adrian


Gordon Mohr

Feb 6, 2019, 3:42:13 AM
to Gensim
What is your ultimate goal - classification of unknown docs, similar-document retrieval, or something else? 

Assuming you're giving each doc the typical tag of a unique ID, the model is just learning vectors that well-model each document, for the training goal of predicting the document's words. The model is oblivious to the A/B categories, or any category imbalances. And the doc-vectors created might be just fine for whatever downstream use you have in mind. 

But if they prove inadequate, you could try either thinning the overrepresented types, or (perhaps better) expanding the influence of the underrepresented types by repeating those documents, so that (for example) an A doc repeats N times. Which N helps would have to be driven by some project-specific evaluation/optimization of whether this trick is helping or hurting. 

Whether some docs are repeated or not, it's best for the varied document types to be more-or-less randomly intermixed – rather than say all like-type A docs early in the corpus, then all like-type B docs, etc. (And if an A doc is repeated N times, better if those N occurrences are strewn throughout the corpus, than in one contiguous run.)
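The repeat-and-intersperse idea above can be sketched in plain Python. This is a minimal sketch with made-up data: the `oversample` helper and the `(words, tag)` tuples are my own stand-ins for whatever corpus objects you actually use (in gensim this would typically be `TaggedDocument` instances).

```python
import random

# Toy corpus: (words, tag) pairs standing in for gensim TaggedDocument objects.
# One rare type-A doc among 1000 type-B docs.
corpus = [(["p1", "p2", "p3"], "A_0")] + \
         [(["p4", "p5", "p6"], f"B_{i}") for i in range(1000)]

def oversample(corpus, is_rare, n_repeats, seed=42):
    """Repeat each rare doc n_repeats times, then shuffle so the repeats
    are strewn throughout the corpus rather than in one contiguous run."""
    out = []
    for doc in corpus:
        out.extend([doc] * (n_repeats if is_rare(doc) else 1))
    random.Random(seed).shuffle(out)
    return out

balanced = oversample(corpus, lambda d: d[1].startswith("A_"), n_repeats=100)
# The rare A doc now appears 100 times, scattered among the B docs.
```

The shuffle handles both requirements at once: like-type docs are intermixed, and the synthetic repeats are spread out instead of clumped.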

Note also that the existing `sample` parameter already down-samples the most-frequent words, and can be made more aggressive (via smaller-than-default values like `sample=1e-05`), especially in larger corpuses. This might also be a tunable knob to help somewhat with giant type-A vs type-B imbalances, because B-doc words will be more downsampled than A-doc words. 

- Gordon

A Viehweger

Feb 6, 2019, 4:35:45 AM
to Gensim
Thank you Gordon, your comments help me quite a bit.

The use case: We treat proteins in microbial genomes like words in a document and -- surprisingly -- this groups the genomes into what we think are functional groups, see:


Building on this work, we have now recruited many more genomes -- which makes the data unbalanced. After Doc2Vec, I cluster the document vectors using HDBSCAN. We want to investigate the resulting clusters and figure out what makes them "special", e.g. which proteins (words) are overrepresented.

The problem with changing the document frequencies by up- or downsampling is that, before training the model, I have no idea which "functional groups"/HDBSCAN clusters will emerge for each group of overrepresented microbes. For example, I might have 10k genomes (documents) of the species "E. coli", but it could be that they split into several clusters after Doc2Vec training and HDBSCAN. Simply thinning the corpus based on "species" might miss those clusters.

Further downstream, classification of unknown docs is a goal too, as is using the word/document vectors as input to other ML algos. 

Gordon Mohr

Feb 7, 2019, 4:38:01 PM
to Gensim
That's a really interesting application! While you'd have to verify via your own full-cycle evaluations, my hunch remains that it'd be better to over-weight the rare-group samples, with interspersed repetition, than to thin out the frequent-group samples - because even the frequent-group samples aren't strict copies of each other, but include some naturally-useful internal variety. (To the extent any training-'texts' are exact duplicates of each other, and not specifically because you've created synthetic duplicates to over-weight some sample – those duplications probably *don't* help.) 

IIUC, the source data doesn't have a natural ordering - they're "unordered-bags-of-proteins" rather than "ordered-lists-of-proteins". (Is that right?) If so, I'd be wary of modes where the artifact of your source data happening to put tokens as neighbors might affect results - like PV-DM (`dm=1`) or enabling skip-gram word-training for PV-DBOW (`dm=0, dbow_words=1`). This would especially be a concern if all the "bags" are listed in some sort of lexicographic order, and the `window` is usually smaller than the 'text'-length: you'd be learning associations that just reflect the token-names. Some possible alternative ways to offset this:

(1) use pure PV-DBOW (`dm=0, dbow_words=0`) - and since this may train faster, you might be able to do more training; OR
(2) if using window-size sensitive modes, use a humongous `window` (e.g. way larger than the length of any training 'text'), so that essentially every training-window always includes all neighbors, without overweighting those that happen-to-arbitrarily-be-close-neighbors; OR
(3) if using window-size sensitive modes, replace each example with several shuffled versions of the same example; OR
(4) use pure PV-DBOW, but add all word-tokens as additional tag-tokens (so every word-token equally trains all its co-occurring word-tokens)

(4) is probably roughly equivalent to (2), but for internal implementation reasons alluded to in the other thread about core-utilization, (2) might in practice run faster with higher thread utilization (using the classic iterable-object corpus interface). 
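Option (3) is simple enough to sketch in a few lines of plain Python. The `shuffled_copies` helper is a hypothetical name of mine; each shuffled variant would be fed to training under the same doc-tag, so that window-based neighbors become arbitrary rather than reflecting the source ordering.

```python
import random

def shuffled_copies(words, k, seed=0):
    """Return k independently shuffled copies of a token list, so that
    window-based co-occurrence no longer reflects the (possibly
    arbitrary) original token order."""
    rng = random.Random(seed)
    copies = []
    for _ in range(k):
        copy = list(words)
        rng.shuffle(copy)
        copies.append(copy)
    return copies

variants = shuffled_copies(["protA", "protB", "protC", "protD"], k=3)
# Each variant has the same tokens in a different order; all three would
# share one doc-tag during Doc2Vec training.
```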

Anything which mixes in word-training risks another kind of imbalance: between the model's effort spent improving the doc-vectors versus the word-vectors. For example, with `dm=0, dbow_words=1, window=10`, there are 10 times as many micro-training examples (input-vector->target-prediction), with backpropagated nudges to internal weights and input vectors, for the word-vectors as for the doc-vectors. If it's the doc-vectors that are the main output of interest, their quality *might* be lower, because of the attempted word-to-word predictions. 

Another off-the-wall technique that might be worth considering: use pure Word2Vec, giant `window` (tamping-down or eliminating neighbor-artifacts), intersperse N synthetic words into each example matching what were the doctags in `Doc2Vec`. Then, like a `dm=0, dbow_words=1, window=1000000`, the doctag-vecs and word-vecs are trained together in comparable ways, and wind up in the same space. But the multiplier N gives you a knob for devoting more model-training-effort towards making the doc-vectors predictive, rather than just the word-vectors. 
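A rough sketch of the interspersing step in that off-the-wall technique, in plain Python. The `_DOC_` token-naming scheme and the `intersperse_doc_token` helper are hypothetical names of mine; the actual Word2Vec training step appears only as a comment.

```python
import random

def intersperse_doc_token(words, doc_id, n, seed=0):
    """Insert n copies of a synthetic doc-token (standing in for the
    Doc2Vec doctag) at random positions in a token list."""
    out = list(words)
    rng = random.Random(seed)
    token = f"_DOC_{doc_id}"   # hypothetical naming scheme for the synthetic word
    for _ in range(n):
        out.insert(rng.randrange(len(out) + 1), token)
    return out

text = intersperse_doc_token(["p1", "p2", "p3"], doc_id="genome42", n=4)
# `text` would then be fed to plain Word2Vec with a huge window, e.g.
# roughly Word2Vec(sentences, window=1000000, ...), so the doc-token
# co-trains with every protein-token in a comparable way; n is the knob
# tuning the doc-vector's share of the training effort.
```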

- Gordon

A Viehweger

Feb 7, 2019, 6:35:02 PM
to Gensim
Thank you for these awesome suggestions. I am gaining a ton of intuition about *2vec training.

Note that the source data __does__ have a natural ordering. There are all kinds of genetic structures that create "ordered-lists-of-proteins". For example, there are sequences of proteins that are all members of the same pathway to create some metabolite, and they are all co-regulated, meaning that if a bacterium produces one of the proteins, it automatically has to produce the others, too.

I would like to train the word vectors too, if possible, because given a cluster of like documents (genomes) I want to check which words (proteins) are overrepresented. However, I'll also try without word-training to check the effect on the document-vector eval metrics.

(1) In this context, would it make sense to use a larger window size than 10? I read somewhere that smaller windows capture more syntax while larger ones focus on semantics?
(2) Also, I currently use 5 negative samples, but in theory the more the better right?
(3) I will definitely try to upsample less frequent texts. What augmentation factor is reasonable -- like 10? 100? 

Thank you for your help!

Gordon Mohr

Feb 7, 2019, 8:05:30 PM
to Gensim
On Thursday, February 7, 2019 at 3:35:02 PM UTC-8, A Viehweger wrote:
> Thank you for these awesome suggestions. I am gaining a ton of intuition about *2vec training.

> Note that the source data __does__ have a natural ordering. There are all kinds of genetic structures that create "ordered-lists-of-proteins". For example, there are sequences of proteins that are all members of the same pathway to create some metabolite, and they are all co-regulated, meaning that if a bacterium produces one of the proteins, it automatically has to produce the others, too.

But, is this reflected by them appearing next-to-each-other (or within `window` positions) in your training data? If the actual ordering-of-word-tokens is affected by something like lexicographic sort-order, or some other process's clumping-of-related-tokens, then window-sensitive modes will be modeling/mirroring those factors, rather than discovering something inherent in the co-occurrences.

(OTOH, if the word-tokens are alongside each other because some natural, inherent process – like their genetic encodings – also places them alongside each other, neighbor-sensitive methods would learn inherent associations.)
 
> I would like to train the word vectors too, if possible, because given a cluster of like documents (genomes) I want to check which words (proteins) are overrepresented. However, I'll also try without word-training to check the effect on the document-vector eval metrics.

> (1) In this context, would it make sense to use a larger window size than 10? I read somewhere that smaller windows capture more syntax while larger ones focus on semantics?

It has been observed, in the word2vec context, that small windows tend to make a word's nearest neighbors more 'syntactically' or 'functionally' similar – for example, words that could be 1-for-1 swapped in (serving the same part-of-speech). And larger windows allow more 'topical' or 'domain' similar words to become nearest-neighbors. 

> (2) Also, I currently use 5 negative samples, but in theory the more the better right?

Not necessarily. The original word2vec paper observed with regard to `k`, the number-of-negative-samples: "Our experiments indicate that values of k in the range 5–20 are useful for small training datasets, while for large datasets the k can be as small as 2–5." So it seems more samples may help speed training with smaller datasets, but ever-larger datasets can get away with ever-smaller counts of negative samples. (Thinking again of core-utilization under Python: more negative samples allow more computation in one native array-operation, or no-GIL block, so when total training time and core-utilization are considered, more negative samples may be helpful.)

Further, some have observed that usual word2vec training results in a full set of word-vectors which is *not* balanced around the origin point, but rather has a mean vector biased in a particular direction. See: "All-but-the-Top: Simple and Effective Postprocessing for Word Representations" by Mu, Bhat, & Viswanath <https://arxiv.org/abs/1702.01417v2>. (And, they suggest eliminating this mean can improve the quality of the word-vectors for downstream tasks, along with other ways of eliminating commonalities between the vectors, which I interpret as increasing the contrast/variety of the final positions.)

My experiments suggest that this biased-from-origin direction is likely caused by the fact that more than one negative example is trained against a single positive example. (The mean bias grows larger with larger `negative` values, and becomes negligible with just a single negative example.) So while I've not yet used the transformation those authors suggest, if I were on a project with plentiful data, and concerned about eking out every slight advantage, I'd try very-low `negative` values (even down to 1), and/or try their post-processing, to see if restoring an origin-centered mean helps. 
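The paper's post-processing step can be sketched with numpy. This is a simplified sketch of the idea only (subtract the common mean, then project out the top principal component(s)); the function name and toy data are mine, and the paper itself has further details like how many components to remove.

```python
import numpy as np

def remove_mean_and_top_pcs(vecs, n_pcs=1):
    """All-but-the-Top-style post-processing sketch: subtract the common
    mean vector, then project out the top principal component(s)."""
    centered = vecs - vecs.mean(axis=0)
    # Principal directions come from the SVD of the centered matrix.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    for pc in vt[:n_pcs]:
        centered = centered - np.outer(centered @ pc, pc)
    return centered

rng = np.random.default_rng(0)
vecs = rng.normal(size=(100, 8)) + 0.5   # toy word-vectors with a mean bias
cleaned = remove_mean_and_top_pcs(vecs, n_pcs=1)
```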

> (3) I will definitely try to upsample less frequent texts. What augmentation factor is reasonable -- like 10? 100? 

I have no idea - this would have to be guided by some project-specific evaluation of final results quality. None might be fine! (The imbalanced training might still have "enough" influence on the relevant doc-vectors, and internal model weights, to achieve whatever goals are important, even if the model is "coarser" or "weaker" in certain "neighborhoods".) But if some known-groupings are 1000x more common than others, and those rarer ones "should" have as much influence on the model despite the paucity of samples, then maybe 1000x would be appropriate. Most of these parameters/tradeoffs are just chosen by "what works", without strong "proper" choices or stable rules-of-thumb. 

> Thank you for your help!

You're welcome! It's exciting to see algorithms that arose in natural-language processing contribute to biological understanding. 

- Gordon 

A Viehweger

Feb 11, 2019, 5:36:19 AM
to Gensim
> But, is this reflected by them appearing next-to-each-other (or within `window` positions) in your training data?

Well, both. Genes in closely related strains of bacteria are sometimes subject to some "combinatorics", e.g. bacterium A has the token sequence 1-2-2-3 and B 3-2-1-2. They are not, however, artificially sorted in any way, e.g. by lexicographical order.

One more question regarding the number of negative samples:

Do they increase the contrast between non-related vectors? I.e. does neg=20 push two non-related vectors "further apart" than neg=5 would?

Thanks

Gordon Mohr

Feb 11, 2019, 10:04:19 AM
to Gensim
I wouldn't be sure of the "contrast" effects of a higher `negative` count without running experiments. 

As previously mentioned, a higher `negative` parameter might help speed training-to-the-point-of-usefulness (in terms of either "fewer training passes" or "less clock time"), especially on relatively smaller corpuses. That'd be because it's updating more of the model with each micro (input-vector(s)->target-token) training-example, in somewhat larger native/no-GIL blocks. 

However, per the observations in the 'All-but-the-Top' paper (and my own tinkering with offsetting that effect with very-low `negative` values down to `negative=1`), any speedup might come at the cost of somewhat biasing the induced vector coordinates in certain common/principal directions from the origin. My sense is that any such biases would, at least in terms of 'raw angle from the origin', tend to *decrease* contrast. (For example, if you started with a perfectly random ball-of-vectors around the origin, but then shifted them all in the same direction, there'd then be directions-from-the-origin with a comparative paucity-of-vectors, and I believe the pairwise cosine-similarities of the full set would, on balance, decrease.) 

But is such an abstract idea of "contrast" really important in your problem domain? Finding your own domain-specific vector-quality evaluations, to guide the optimization of meta-parameters, may still be the best approach.

(OTOH, if you were obsessively looking for every dial-and-knob that might offer an advantage: gensim 3.5.0 added the ability to adjust the previously-fixed `ns_exponent` factor, controlling the sampling of negative-examples with respect to token frequency. The paper at <https://arxiv.org/abs/1804.04212> suggested that for the domain of predictions-from-event-streams, which is another problem similar-to-but-not-quite-natural-language where word2vec-style vectors are used, markedly different values of this parameter might improve vector usefulness.)
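The effect of `ns_exponent` on the negative-sampling distribution can be illustrated with a toy numpy calculation. This is my own illustrative sketch of the proportionality involved (each token's chance of being drawn as a negative sample scales with frequency raised to the exponent), not gensim's internal implementation.

```python
import numpy as np

def negative_sampling_probs(freqs, ns_exponent=0.75):
    """Probability of drawing each token as a negative sample,
    proportional to frequency ** ns_exponent. 0.75 is the classic
    word2vec default; gensim >= 3.5.0 exposes it as `ns_exponent`."""
    weights = np.asarray(freqs, dtype=float) ** ns_exponent
    return weights / weights.sum()

freqs = [1000, 100, 10, 1]                        # toy token frequencies
p_default = negative_sampling_probs(freqs, 0.75)  # classic smoothing
p_uniform = negative_sampling_probs(freqs, 0.0)   # frequency-agnostic
p_raw     = negative_sampling_probs(freqs, 1.0)   # raw frequencies
```

Exponents below 1.0 flatten the distribution, giving rare tokens more negative-sampling attention than their raw frequency would; the cited paper found that for non-natural-language data, values well outside the usual range can help.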

- Gordon

A Viehweger

Feb 19, 2019, 2:13:58 AM
to Gensim
Just to recap my experiments based on your suggestions.

In my use case, it does not seem necessary to balance the samples (some are 1000 times more common, but with small variations).

To achieve "contrast" between clusters of similar docs, the largest effect could be observed for small window sizes (e.g. `window=5`). The `ns_exponent` and `sample` arguments have some effect, but really window size seems to be the key.

Thanks, Gordon, for your patience with me.