Is there a mechanism to merge two trained Doc2Vec models?

bigdeepe...@gmail.com

unread,
Jan 17, 2017, 10:57:31 AM1/17/17
to gensim
Intuitively I don't see any problem with doing it. Points in vector space for multiple models should be easy to overlay into a single space.
Has any work in this area been done?

Thanks.

Gordon Mohr

unread,
Jan 17, 2017, 3:11:46 PM1/17/17
to gensim
Each model's coordinates/directions are only interpretable with respect to other vectors trained in the same model/session, so vectors from different models can't be directly compared or naively concatenated with each other into a larger combined model.

Why? Given the random initialization, and other during-training sampling-randomization or thread-ordering jitter, even a model trained on the same data can wind up in quite-different end-states that are all equally good for the training word-prediction task – or other downstream tasks involving relative distances and orientations. So even the exact same document, especially if co-trained with different sets of documents, can have very different vectors in different models. (There's no 'true location' for it, or 'true north' in the induced space.) 

To have comparable vectors, the best case is to train them up together. (Or, with the possibility of inference, infer the later documents – or both sets – from the same frozen model. So once you have a 'large-enough', 'representative-enough' canonical model, bringing in more vectors by inference, rather than merging from other Doc2Vec models, may be an adequate strategy for arbitrarily-large sets of comparable vectors.)

In the Word2Vec space, it's been observed that if you have some shared reference points – such as known-same words in different languages – you can later learn a projection from one space to another that moves those known-points in the right way, and retains their relative orientations as much as possible. Then that same projection can be used to convert other vectors between the two spaces – with potential applications in machine-translation or vocabulary-expansion. 

Some code for this is a wishlist item for gensim Word2Vec/Doc2Vec – see <https://github.com/RaRe-Technologies/gensim/wiki/Ideas-%26-Feature-proposals#word2vecdoc2vec-implement-translation-matrix-of-exploiting-similarities-among-languages-for-machine-translation>, where there are pointers to some prior work. Applicability to Doc2Vec might be trickier, unless you were to include some 'anchor' documents in all models as reference points. (Using one of the modes that co-trains words and doc-vectors, and has them in the 'same' space where docs and words can be compared, might work... but still if only using words as the anchors the projection of documents might suffer.)
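As a rough illustration of the projection idea only – not the wishlist gensim feature – here's a least-squares sketch, where `model_a`, `model_b` and `anchor_tags` (doc-tags present in both models) are hypothetical names:

    import numpy as np

    # Same anchor document's vector in each separately-trained model.
    vecs_a = np.vstack([model_a.docvecs[tag] for tag in anchor_tags])
    vecs_b = np.vstack([model_b.docvecs[tag] for tag in anchor_tags])

    # Learn W minimizing ||vecs_a . W - vecs_b||^2 (ordinary least squares).
    W, _, _, _ = np.linalg.lstsq(vecs_a, vecs_b)

    # Project any other model-A vector into model-B's coordinate space.
    projected = np.dot(model_a.docvecs['some_other_tag'], W)

With many more anchors than dimensions, the learned map averages out per-document noise; with too few, it just memorizes the anchors.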

- Gordon

bigdeepe...@gmail.com

unread,
Jan 17, 2017, 5:20:18 PM1/17/17
to gensim
Fair enough. Let's say we do use an anchor document. By computing the difference between the vectors of the same anchor document, we should be able to shift the center of mass of all other documents within one model to make the anchor documents coincide. If we use N anchor documents, where N is our space dimension, we should be able to come up with a matrix that will change the basis of one of the models to coincide with the basis of the other one. It might be doable.

On a more practical note: I am trying to train the model now in batches. After training the first batch for 10 epochs I tried to reload the model and continue training with the new batch, but I got an exception from within word2vec saying I need to sort the vocabulary first. So I added a call to model.sort_vocab(), right after I load the model from disk and before I call model.build_vocab() for the new batch of examples. It fixed the exception, but I am not certain if this was the right way to deal with it.

Was it the right way, or should I have done something else?

Thanks for the help.

Gordon Mohr

unread,
Jan 17, 2017, 6:57:27 PM1/17/17
to gensim
I suspect you might want far more anchor-docs than the number of dimensions for best results. The Mikolov paper using Word2Vec for machine-translation had word-vectors with hundreds of dimensions, but learned the projections using languages' 5000 most-common words – by rough analogy, in Doc2Vec you might want thousands of highly-representative documents shared between the two models. But this is all speculative, so you'd have to try it. 

Hard to comment on your process without seeing the exact code and verbatim exceptions. That exception sounds like what you might get from a model that hadn't even finished all the `build_vocab()` steps – are you sure you saved/reloaded the model *after* training? Calling `build_vocab()` again essentially re-initializes the model from scratch, for the given vocabulary – so it is equivalent to starting a new model, and thus probably not your aim.

For a self-consistent model, you really should be training all available examples in one combined corpus. If you have disjoint batches A and B, after training with A for 10 epochs – AAAAAAAAAA – say at `alpha` descending from 0.025 to 0.0125, you may be approaching a good model for A. If you then start training with different examples B for 10 epochs – BBBBBBBBBB – at `alpha` descending from 0.0125 to 0.001, every epoch is pulling the model towards being better for B examples, but likely relatively worse for A (except to the extent A was already like B, and thus redundant). That is, the influence of examples in A on the weights is being diluted by all training on other examples, and unless the original examples are interleaved, their influence via their earlier presentation could shrink to essentially nothing.

If you want both subsets to have equal influence, better to do 10 epochs of AB, with a single descent of learning rate from its max to min values. And even if you wanted some examples to have more influence, interleaving the examples with some explicit kind of varying weighting (like presenting examples from B twice as often, or with a learning-rate bonus factor) could be more reasonable than just hoping early-vs-late presentation achieves a desirable balance. 
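As a sketch (with `docs_a` and `docs_b` standing in for hypothetical lists of TaggedDocument objects for the two subsets, and parameter names as used by the gensim version current in this thread):

    import random
    from gensim.models.doc2vec import Doc2Vec

    combined = docs_a + docs_b
    random.shuffle(combined)  # interleave, so neither subset dominates the early or late updates

    # One training session over the combined corpus: one vocabulary scan,
    # one descent of the learning-rate from its max to min across all 10 epochs.
    model = Doc2Vec(combined, size=400, min_count=3, iter=10, workers=8)  # workers: pick your core count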

- Gordon

bigdeepe...@gmail.com

unread,
Jan 17, 2017, 7:29:03 PM1/17/17
to gensim
My entire motivation for breaking up the example set is that 1) the model was somehow broken when I was training with the entire set, and it took 10 hours with 8-12 cores engaged for one epoch, and 2) I wanted to save intermediate models in case of problems.

I think I might have to start from scratch and try to see what's going on after 1 epoch. What steps need to happen after the model is saved to continue training? No more build_vocab(), because that resets the model, right? Other than doing model = Doc2Vec.load('filename'), does anything else need to happen to continue training from the same point?

Thanks.

Gordon Mohr

unread,
Jan 17, 2017, 10:42:58 PM1/17/17
to gensim
If an in-memory model is ready to have `train()` called, a `save()` at that point should result in a model that, after `load()`, can also have `train()` called. That is, there should be no extra steps required beside `save()`/`load()`, which aim to preserve the entire model state. 

By default, `build_vocab()` throws away any existing vocabulary and model weights, so would disrupt any training-in-progress. On the other hand, if your later examples include words that weren't in the corpus originally presented to `build_vocab()`, those words will just be skipped as unknown. And, unless the corpus presented to `train()` is the same size as was presented to `build_vocab()`, progress estimates and the automatic linear-decay of the learning-rate won't be calculated right, unless you provide extra size estimates to `train()`. So the best and most-straightforward approach, if at all possible, is to use a unified corpus, presented once to `build_vocab()`, then once to `train()`. 

If runtime is a concern:

You haven't mentioned the size of your corpus in documents/total-words, but be sure your installation is properly using the cython-optimized routines, which can be 100x faster than the pure-Python code. (You probably are – if they weren't being used, there'd be logged warnings, and you wouldn't see 8+ cores highly utilized, because the pure-Python code can't achieve much multithreaded concurrency.) 

Sometimes a runtime bottleneck is any preprocessing or fancy-tokenization that's happening in the (single-thread) corpus iteration, before batches are fed to training threads – so try to do as little there as possible. 

As the vocabulary gets larger, negative-sampling tends to become relatively faster than hierarchical-softmax. Plain DBOW without word-training (`dm=0`) is usually the fastest mode and performs better than you might expect. Smaller values of `size`, `window` (where relevant) and `negative` tend to be faster. A larger `min_count` effectively slims the corpus and thus can speed training. Smaller `sample` values also effectively shrink the corpus – and maybe even improve the resulting vectors (at least for Word2Vec word-vectors). 

So if getting some initial baseline result on a very-large dataset is desired, you could try extreme parameter choices for quickness – for example, `dm=0, negative=2, size=100, sample=1e-7, min_count=50, iter=10`. Then, with confirmation that everything completes and does something of value, consider more typical and time-consuming alternative settings.

- Gordon

bigdeepe...@gmail.com

unread,
Jan 18, 2017, 6:21:28 AM1/18/17
to gensim
The dataset is 27.6 million biomedical titles+abstracts (where there is an abstract), with a list of tags, various identifiers, names, etc. The settings are size=400, min_size=3 (500 and 1 would exceed memory at 300+GB; I have only 256GB), window=9. I am adjusting alpha inside a loop, so iter was not set. I am reducing alpha by 0.002 per epoch (is 0.001 better?); dm and negative were not changed, they are defaults.

My main goal is to be able to find similar documents by plugging in a raw text, to get a list of similar texts, with their tags.

bigdeepe...@gmail.com

unread,
Jan 18, 2017, 6:44:43 AM1/18/17
to gensim
I meant min_count=3, not min_size. After the vocabulary is pruned from about 25 million (because min_count is 3 and not 1), I am left with slightly over 6 million terms.

bigdeepe...@gmail.com

unread,
Jan 18, 2017, 7:06:06 AM1/18/17
to gensim
The only "fancy" processing I do at the training time, is finding parenthesized expressions and moving the inner content to the end of example text. So, it's a regex with re.findall, and re.sub. Doesn't seem it should be too costly in time.



bigdeepe...@gmail.com

unread,
Jan 18, 2017, 7:07:49 AM1/18/17
to gensim
It's processing about 416,000 words per second.

bigdeepe...@gmail.com

unread,
Jan 18, 2017, 10:41:59 AM1/18/17
to gensim
Ok, there must be a bug in my version of gensim (or still in the current version). When I loaded the model with mmap='r' it worked, but loading without mmap='r' I still get the same exception as before: init_sims fails without mmap.

Gordon Mohr

unread,
Jan 18, 2017, 2:03:09 PM1/18/17
to gensim
That's hard to interpret without seeing the exact exception text. If `mmap` is making any difference, it's by keeping more of the model paged out on disk when not recently accessed. But enabling that could be absolutely awful for training speed: the highly-random accesses of training need to have the whole model in RAM for efficiency. (It's better to fail, and fix the memory-needed mismatch, than thrash along paging the model's main weights in and out.) 

Still, this shouldn't be an issue if you have 256GB RAM. (Even 500-dimensional vectors for 25 million vocabulary tokens and 27 million doc-IDs should fit within about 220GB – at least before any `init_sims()` for post-training similarity-rankings.) So something else may be wrong in the setup or how the corpus size is estimated.

Also, while training, no `init_sims()` should be happening – that's only triggered when doing `most_similar()` operations after training. In fact, after you've triggered an `init_sims()`, further training might not affect `most_similar()` calculations, unless you force a recalculation of the unit-normed vectors. It's really best to consider training as a single-shot deal, unless trying advanced tweaks on a process that's already working & well-understood.

Regex processing is exactly the kind of costly bottleneck that's best left out of the iterator, if speed is a major concern. Also, by doing it in the iterator, it will be repeated on each training pass, rather than done just once. (Separately, I wouldn't necessarily expect the transformation you describe – moving parenthetical text to the end – to improve word/doc vectors unless it was shown to do so in a comparative test.)

You are unlikely to be getting much training value from terms that only appear 3 (or even 10+) times, among billions of words, so using a much-higher `min_count` could speed things, with negligible (or even a positive) effect on vector quality. 

I strongly recommend AGAINST calling `train()` multiple times in your own loop, or adjusting `alpha` yourself. It's rarely necessary and there are too many things to get wrong. The default value of `iter` is already 5, meaning that a single call to `train()` will make 5 passes over your data, and glide the `alpha` properly from its max to min values. So if leaving `iter` unset, you are probably already doing 5 times as many passes as you intend.
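For example, a minimal sketch of the recommended pattern (`corpus` here is a hypothetical iterable of TaggedDocument objects; `size`/`iter` are the parameter names of the gensim version current in this thread, renamed `vector_size`/`epochs` in later releases):

    from gensim.models.doc2vec import Doc2Vec

    # One vocabulary scan, one training call: alpha decays automatically from its
    # default max (0.025) to min (0.0001) across all `iter` passes.
    model = Doc2Vec(size=400, window=9, min_count=3, iter=10, workers=8)
    model.build_vocab(corpus)
    model.train(corpus)  # newer gensim requires explicit total_examples=model.corpus_count, epochs=model.iter
    model.save('my_model.doc2vec')  # 'my_model.doc2vec' is just an example filename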

Bigger `window` values aren't necessarily better, but are slower – so I would only up the defaults if testing new settings to try to improve on an established baseline. Note that pure DBOW (`dm=0`, which doesn't even utilize a `window`) is fast and tends to work quite well on short, uniformly-sized documents (like titles, abstracts, and metadata fields). 

- Gordon

bigdeepe...@gmail.com

unread,
Jan 18, 2017, 2:50:35 PM1/18/17
to gensim
It also seems I broke gensim by updating it. It now complains that Doc2Vec has no attribute model_trimmed_post_training.

By the way, I got the train-in-a-loop code from here: https://rare-technologies.com/doc2vec-tutorial/ At least from that tutorial, it appears that model.train does a single pass over the corpus.

bigdeepe...@gmail.com

unread,
Jan 18, 2017, 4:37:11 PM1/18/17
to gensim
model.docvecs.most_similar('tagname') will fail in doc2vec.py at line 409 (it was line 407 before the update). With the updated gensim, it fails whether I use mmap or not.

Gordon Mohr

unread,
Jan 18, 2017, 6:54:27 PM1/18/17
to gensim
Please include verbatim error messages – "complains" or "fail" requires too much guessing about what's happening.

A message about `model_trimmed_post_training` sounds like a new bug related to loading a model from an older version of gensim. You may be able to work around this by manually setting `model.model_trimmed_post_training = False` after loading an older model, but you may also just want to start with a fresh model in your current code, rather than using a reloaded one. 

Unfortunately that ~2-year-old tutorial post is outdated with respect to more recent defaults. It needs a more prominent warning or other updates. The included demo notebooks (in docs/notebooks) are generally better models to follow – but they may also lag the latest defaults/recommendations a bit. 

- Gordon
Message has been deleted

bigdeepe...@gmail.com

unread,
Jan 19, 2017, 6:54:03 AM1/19/17
to gensim
This is the original exception, which still persists after the gensim update and after I reran the Doc2Vec constructor with iter=1, size=500, min_count=3:

>>>
>>>
>>> logger = logging.getLogger('Doc2Vec')
>>> logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
>>> logging.root.setLevel(level=logging.INFO)
>>>
>>>
>>> model = Doc2Vec.load('NIH.doc2vec')
2017-01-19 05:29:39,041: INFO: loading Doc2Vec object from SomePrefix.doc2vec
2017-01-19 05:35:55,556: INFO: loading docvecs recursively from SomePrefix.doc2vec.docvecs.* with mmap=None
2017-01-19 05:35:55,556: INFO: loading doctag_syn0 from SomePrefix.doc2vec.docvecs.doctag_syn0.npy with mmap=None
2017-01-19 05:47:36,557: INFO: loading doctag_syn0_lockf from SomePrefix.doc2vec.docvecs.doctag_syn0_lockf.npy with mmap=None
2017-01-19 05:47:38,213: INFO: loading wv recursively from SomePrefix.doc2vec.wv.* with mmap=None
2017-01-19 05:47:38,213: INFO: loading syn0 from SomePrefix.doc2vec.wv.syn0.npy with mmap=None
2017-01-19 05:48:18,583: INFO: loading syn1neg from SomePrefix.doc2vec.syn1neg.npy with mmap=None
2017-01-19 05:48:57,954: INFO: setting ignored attribute syn0norm to None
2017-01-19 05:48:57,954: INFO: setting ignored attribute cum_table to None
2017-01-19 05:48:57,954: INFO: loaded SomePrefix.doc2vec
>>> model.docvecs.most_similar('ID_SomeTag_3')
2017-01-19 05:50:36,013: INFO: precomputing L2-norms of doc weight vectors
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/gensim/models/doc2vec.py", line 426, in most_similar
    self.init_sims()
  File "/usr/local/lib/python2.7/dist-packages/gensim/models/doc2vec.py", line 409, in init_sims
    self.doctag_syn0norm = empty(self.doctag_syn0.shape, dtype=REAL)
MemoryError

bigdeepe...@gmail.com

unread,
Jan 19, 2017, 1:35:19 PM1/19/17
to gensim
I mistyped; the size was still size=400, min_count=3. The logger reported that the estimated RAM needed was about 186GB for 400 dimensions and 6M unique words. Not sure if that is correct or not, but this is what was displayed.

If you have any ideas on the error it would be great. I tried it with mmap='r' again and it worked, although it took a long time to compute the L2-norms.

Gordon Mohr

unread,
Jan 19, 2017, 2:52:11 PM1/19/17
to gensim
The MemoryError suggests it was unable to allocate memory, specifically for the unit-normalized versions of the doctag vectors (as used for most-similar rankings). 

To understand the inherent size of the model: 

The combined size of all the files beginning "SomePrefix.doc2vec" (given that they are uncompressed) will be roughly the same magnitude as the amount of RAM needed to load the model. The file `SomePrefix.doc2vec.docvecs.doctag_syn0.npy` is specifically the array for which a second unit-normed version is created – so its size is roughly the increment of extra allocated memory required by `init_sims()` before doing `most_similar()` operations. 

The total count of doctag vectors is `len(model.docvecs)`, and the count of those that are named by string tokens is `len(model.docvecs.doctags)`.
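For example, a quick way to inspect those counts and the raw doctag array's size (a sketch; attribute names as discussed elsewhere in this thread):

    print(len(model.docvecs))                              # total count of doctag vectors
    print(len(model.docvecs.doctags))                      # how many are named by string tokens
    print(model.docvecs.doctag_syn0.nbytes / float(2**30)) # GB in the raw array; init_sims() needs about this much more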

One option would be to create an inherently smaller model, so there's more headroom for this operation.

Using a smaller vector-size will have the most direct effect. 

If most/all of the doctags are string doctags (so `model.docvecs.doctags` is large), but you could manage to use plain integer document-IDs instead, switching to that before training could save some memory. As previously noted, using a smaller effective vocabulary (via a larger `min_count`) will save some memory (and often simultaneously improve vector quality), though since it seems you already have far more doctags than vocabulary words, the memory savings here may be relatively minor. 

Another option is to destructively slim the model, leaving only the parts needed for `most_similar()` operations (leaving it broken for other purposes, such as further training or inference). 

For example, after loading, you could call `model.docvecs.init_sims(replace=True)` yourself, before trying any `most_similar()` operations. This will unit-norm the doctag-vectors in-place – at no cost in memory, but destroying the raw trained-magnitude vectors (and thus leaving the state inconsistent for continued training). Inference will still work after this step – since existing doctag-vectors aren't consulted during the current `infer_vector()` process. (There's a wishlist item to make inference take into account known other tags, which could change this, but that's not implemented yet.)

Once you've decided a loaded model won't be needed for further training, you could discard the `model.syn1neg` output weights to save some memory. (Disabling training also disables inference, since inference is a form of training under certain extra constraints.) And depending on whether your mode or end-purpose consults word-vectors, you might be able to throw away `model.syn0` for savings as well. But again given the dominance of doctag-vectors in your model, these savings will likely be small compared to unit-norming in-place or using a smaller vector size. 
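A sketch of that kind of destructive slimming (only on a model, or a copy, that won't be trained or used for inference again; attribute names follow the discussion above, though newer gensim layouts keep the word-vectors under `model.wv`):

    model = Doc2Vec.load('SomePrefix.doc2vec')

    # Replace the raw doctag vectors with unit-normed versions, in place
    # (no extra RAM, but further training is no longer meaningful).
    model.docvecs.init_sims(replace=True)

    # Discard weights only needed for training/inference.
    model.syn1neg = None   # output weights for negative-sampling
    model.syn0 = None      # word-vectors, if word lookups aren't needed (model.wv.syn0 in newer layouts)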

- Gordon

bigdeepe...@gmail.com

unread,
Jan 19, 2017, 3:35:09 PM1/19/17
to gensim
So your assessment appears to be that I am running off the RAM edge, right? I could probably test that by adding temporary swap space. I know that a much smaller model worked (when I was trying to run it in batches). I also determined that with the new gensim, mmap='r' works (after one epoch of training). Why did it work with mmap? Is the memory footprint smaller with mmap?

bigdeepe...@gmail.com

unread,
Jan 19, 2017, 3:56:33 PM1/19/17
to gensim
By the way, what would be the rule of thumb for choosing size?

Gordon Mohr

unread,
Jan 19, 2017, 5:34:05 PM1/19/17
to gensim
Yes, there's insufficient memory, but sharing the sizing information I mentioned would clarify the extent of the shortfall and aspects of the model most responsible. 

As previously mentioned, mmap only pages-in ranges of the array as they are accessed. So potentially less RAM is required – you're getting swap-like benefits of extra addressable space, even without allocating explicit general-purpose swap. But also as previously mentioned, it's likely to be awfully slow for many access patterns, including those of usual training, or the full sweeps done for `most_similar()` results. The `mmap='r'` (read-only) mode you've mentioned wouldn't be appropriate for read-write training – and my test here suggests that the failure, if that's tried, may just be an ugly interpreter crash. Using one of the writeable modes would add other complications. I'd recommend avoiding `mmap` options unless/until necessary as an advanced optimization, on a baseline process that's already working.

Values for `size` I've commonly seen in published work range from 100 to 1000. (One outlier, the paper 'Document Embeddings with Paragraph Vectors', indicates some tests with vectors of 10,000 or more dimensions.) Larger sizes with small datasets don't help and may even hurt vector usefulness (but your dataset seems reasonably large). What's best depends on the corpus and end goals, and must be found by experiment. Starting smaller gets full processes working and helps experiments cycle faster – at which point you can test whether larger sizes offer benefits justifying the extra time/resources/complications. 

- Gordon

bigdeepe...@gmail.com

unread,
Jan 19, 2017, 6:52:29 PM1/19/17
to gensim
I'm currently running yet another training session from scratch with iter=10, size=400, min_count=3 (not using a loop), still 17.6M examples. Once it's done, I'll get the numbers.

bigdeepe...@gmail.com

unread,
Jan 20, 2017, 10:16:34 AM1/20/17
to gensim
Adding 65GB of swap space removed the exception. Computing the L2-norms, however, is extremely slow (maybe I should add 128GB of swap). Is there a good way to preserve them so that they don't need to be recomputed?
It took 20.6 hours to train, with iter=10. The question I have now is: if I want to train some more, one iteration at a time, how do I set "iter" now? Does model.train(stuff) take iter as an argument? If I simply load the model from disk, it's probably still set up with iter=10, right?

To drill a little deeper into size={100,...,1000,10000}: how should it depend on the size of the corpus? Simply experimenting with it seems like insanity, considering it takes 20+ hours to train 10 epochs. Using a small corpus seems unsatisfactory, as it would be impossible to see what one really gets in the end.

Thanks again for help and advice.

Gordon Mohr

unread,
Jan 20, 2017, 6:05:52 PM1/20/17
to gensim
As previously mentioned, if you're relying on swap space for any of these steps, things will likely be frustratingly slow. (Adding more reliance on swap? Slower.)

The unit-norm vectors are never specifically saved, since they can be recalculated (usually quite quickly) from the raw vectors. 

As previously mentioned, you can in-place clobber the raw vectors with the unit-normed vectors by explicitly calling `init_sims(replace=True)`. If you then re-save that model, it already (and only) has the unit-normed vectors – but upon re-load still doesn't realize that (because those already-normed vectors are just stored in the raw-vectors property). So after load you'd want to manually put them in place, with something like `model.docvecs.doctag_syn0norm = model.docvecs.doctag_syn0`, to prevent a redundant re-calc. 
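In code, that workflow might look like this (a sketch of the steps just described; 'slim.d2v' is just an example filename):

    model.docvecs.init_sims(replace=True)   # clobber raw doctag vectors with unit-normed versions
    model.save('slim.d2v')

    # ...later...
    model = Doc2Vec.load('slim.d2v')
    # The raw-vector slot already holds unit-normed vectors; point the normed slot at it
    # so the first most_similar() doesn't trigger a redundant, memory-hungry recalculation.
    model.docvecs.doctag_syn0norm = model.docvecs.doctag_syn0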

But I would recommend working with smaller, faster models that allow you plenty of headroom, and don't require swap space or error-prone extra workaround steps, until you have good baseline results. Then you can decide if adding extra slower, more-memory-consumptive options, and custom workarounds, is worth any theoretically better performance. 

You can just call `train()` again on a model, and it will again use the `iter`/`alpha`/`min_alpha` properties as initially set up (or as later altered by directly changing the model's stored properties, like `model.iter`, etc.). But good choices for such incremental additional epochs are a murky issue; it's much better to get one clearly-characterized session working well so it can be evaluated. 

Yes, some settings and large corpuses can take tens of hours (or days or weeks!) to train. Getting things debugged with small/quick settings can thus help a lot; then, if it looks like slower/larger settings offer incremental benefits, you can make the investment in more testing (such as renting more machines to test a few hundred 20+ hour configurations in parallel in one day).

- Gordon

bigdeepe...@gmail.com

unread,
Jan 20, 2017, 6:32:55 PM1/20/17
to gensim
I want to retain the ability to continue training the model with additional documents, so I don't really want to clobber anything. I appreciate your point about training on a smaller model, but I am suspicious that a smaller model may look "good" with respect to retrieval time and quality of results while a larger one will be a horror show. I am trying to train with a larger corpus for several reasons; some of them are to assess what kind of resources are needed. At least now I know just how large the model ends up being, and that to make it useful, it'll need 512GB just to load the model and compute the norms. 65GB is only enough for doc similarity, not word similarity.

You mentioned that bringing the size down would be most effective for achieving a smaller memory footprint. But my question is: given the size of my corpus, how low a size can I use and still achieve reliable results?

My aim is to be able to plug in a raw document, infer its vector, find similar documents inside the corpus, get their tags, then add the new document to the corpus.

Gordon Mohr

unread,
Jan 20, 2017, 8:00:25 PM1/20/17
to gensim
Regarding "continue training the model with additional documents":

Gensim Doc2Vec doesn't currently support additional training which introduces any new doctags (such as new document-IDs). And any training with new examples, without also re-presenting older examples, just pulls the model to be better at the new documents, at the likely expense of the older examples – diluting their influence, perhaps arbitrarily down to nothing. Murky balance issues with no clear answers would come up. 

So: if you truly need this ability, be sure to budget an R&D project to create the features and then discover best practices. :)

My hunch is that with a model trained on a large representative set of documents, inference on new examples would be enough. Though, if you want to add those inferred results to your similarity-search, you'd need to use some facilities outside the Doc2Vec model, since it doesn't support additions to the docvecs that are similarity-searched over. Other similarity utilities over document-vectors in gensim might be applicable.
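For instance, one minimal do-it-yourself approach (a sketch with hypothetical names, using plain NumPy rather than any particular gensim utility) is to keep inferred vectors in your own arrays and rank by cosine similarity:

    import numpy as np

    new_ids, new_vecs = [], []   # hypothetical store for documents added after training

    def add_document(doc_id, tokens):
        vec = model.infer_vector(tokens)
        new_ids.append(doc_id)
        new_vecs.append(vec / np.linalg.norm(vec))   # keep unit-normed copies

    def most_similar_added(tokens, topn=10):
        q = model.infer_vector(tokens)
        q /= np.linalg.norm(q)
        sims = np.dot(np.vstack(new_vecs), q)        # cosine similarities against all added docs
        best = np.argsort(-sims)[:topn]
        return [(new_ids[i], float(sims[i])) for i in best]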

Unfortunately, only experimentation with your corpus can reveal what settings give sufficiently 'reliable results' for your specific end-goals. 

- Gordon

bigdeepe...@gmail.com

unread,
Jan 21, 2017, 5:22:47 AM1/21/17
to gensim
Gordon, once again, I appreciate your help and insights, but I am having difficulty in reconciling your view of "Gensim Doc2Vec doesn't currently support additional training which introduces any new doctags (such as new document-IDs). And any training with new examples, without also re-presenting older examples, just pulls the model to be better at the new documents, at the likely expense of the older examples – diluting their influence, perhaps arbitrarily down to nothing" with how Doc2Vec is trained.

Certainly it is possible to train Doc2Vec in a loop with model.train(data), right? As I am training, I am constantly adding new documents with new tags, etc. This would imply that the last batch of documents, according to what you said, "dilutes" the influence of all the previous documents. So how is this really different from adding some training later? As we train on the initial corpus, there will already be asymmetry built into the process, as some documents will be first and some documents will be last. 

I suppose also that if the corpus is large enough, whatever "dilution" is introduced will be minor, and one can do a refresher training occasionally to tighten things a bit. I am also thinking that even if there is a drift of docvecs over time from additional documents, as long as there is still consistency between all of them it is not a problem, since as you and Lev mentioned before, the absolute values don't matter, only the relative values do.

With respect to new vocabulary and new tags, would I be wrong in assuming that if the new documents are presented last, after reinitializing the vocabulary with model.build_vocab(), it should not disturb the relationship between the old tokens/terms and the docvecs? It is an inherently sequential process, with no multi-threading at all.

Thanks.

Gordon Mohr

unread,
Jan 23, 2017, 12:48:37 AM1/23/17
to gensim
Some reasoning to prefer interleaving all examples, rather than providing them in other clumped orders, was in my message a few days ago, with the example of the A/B subsets: https://groups.google.com/d/msg/gensim/N5SCiq1F45w/59PxrKCMCAAJ

Presenting varied batches via `train()`, with ad-hoc choices for alpha, might sorta-work, but you could easily wind up with later training essentially undoing/weakening the earlier training, wasting time and discarding any benefits of a larger dataset. This risk that sequential training undoes earlier learning, as compared to interleaved training, has been called 'catastrophic interference' (or 'catastrophic forgetting').


You'd want to carefully track whether an improvised approach offers the improvements you're expecting. 

Calling `build_vocab()` by default wipes the model, so doesn't allow later `train()` batches to include new doctags. There's a new (but essentially experimental) option in `build_vocab()` for Word2Vec, that expands the known-vocabulary for the Word2Vec case, but I don't think it's been designed/tested to work with Doc2Vec models. Even when used for Word2Vec, it leaves open the same questions about whether sequential training on data subsets can give well-balanced results (or even net-improvements).

- Gordon

bigdeepe...@gmail.com

unread,
Jan 23, 2017, 9:08:40 AM1/23/17
to gensim
Looks like a lot of experiments are unavoidable. Yesterday I added some functionality into the class that feeds data to Doc2Vec to produce only a percentage of the corpus. I also cleaned up the punctuation a bit to reduce number of separate terms.

I did get some reasonable results from a single example on a 1% corpus: model.docvecs.most_similar('ischemia') pulled up an article with the word in the title, with several tags associated with it in the similarity list.

I am not sure at the moment what I should expect if I do model.docvecs.most_similar('brain-ischemia'), where "brain-ischemia" is a tag associated with multiple examples.
The desired behavior would be for the similarity list to contain unique tags associated with documents which also have the "brain-ischemia" tag. I'll see, I guess.

Would you have any thoughts on whether there is a more robust way to test a model for accuracy than eyeballing it? If I simply eyeball it, I am prone to confirmation bias, where everything looks good to me, without any objective way to test it.

I will read through the links you gave to try to understand better what's happening under the hood.

Thanks again.

Gordon Mohr

unread,
Jan 23, 2017, 3:04:34 PM1/23/17
to gensim
For getting other tags that most often co-occur with a target tag, you could do an exact count. That is, tag A co-occurs with 'brain-ischemia' 12 times, tag B co-occurs 9 times, etc. 
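For example (a sketch assuming a hypothetical `doc_tags` dict mapping each document-ID to its list of tags):

    from collections import Counter

    target = 'brain-ischemia'
    co_counts = Counter()
    for doc_id, tags in doc_tags.items():
        if target in tags:
            co_counts.update(t for t in tags if t != target)

    print(co_counts.most_common(10))   # tags most often co-occurring with the target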

An automated evaluation technique used in both the original 'Paragraph Vectors' paper and the follow-up 'Document Embeddings with Paragraph Vectors' was to leverage some previous grouping or categorization system to identify subsets of documents that 'should' be closer to each other than others. (In the original paper, the documents were search-result snippets and the subsets were those that appeared in the same set of results from the traditional words/links/etc search engine. In the followup, the documents were Wikipedia or Arxiv articles, and the subsets were those that human curators had put in the same categories.) 

Given those existing groupings, the quality-test of document-vectors was to pick two documents of the same topical grouping, then one more at random from all those *not* in the subset – then consider the doc-vectors good if the pair (associated by prior groupings) are closer to each other than either is to the third vector. 

On the plus side, this can turn some existing categorization info into a lot of individual evaluation test cases. On the negative, it might be exactly those documents that are truly similar to others, but not yet (well-)categorized by prior work, that you're hoping to find with PV-Doc2Vec. So at least some of the test cases where this suggests the doc-vectors have 'failed' might still in fact be real, useful similarities. 

Still, it's apparently useful as a coarse guide to whether the doc-vectors are becoming better able to match human senses of relatedness, across different parameter choices. 
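A rough sketch of that triplet test (assuming a hypothetical `category_to_tags` dict mapping each prior category to a list of trained doc-tags):

    import random

    def triplet_accuracy(model, category_to_tags, trials=10000):
        eligible = [tags for tags in category_to_tags.values() if len(tags) >= 2]
        all_tags = [t for tags in category_to_tags.values() for t in tags]
        correct = 0
        for _ in range(trials):
            tags = random.choice(eligible)
            a, b = random.sample(tags, 2)        # two docs previously grouped together
            c = random.choice(all_tags)          # one doc from outside that group
            while c in tags:
                c = random.choice(all_tags)
            if model.docvecs.similarity(a, b) > model.docvecs.similarity(a, c):
                correct += 1
        return correct / float(trials)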

- Gordon

bigdeepe...@gmail.com

unread,
Jan 23, 2017, 3:48:23 PM1/23/17
to gensim
I wasn't clear enough on tags shared by multiple examples. Some tags are unique to a single document and some tags are descriptive and may be found attached to multiple documents.

If I were to do model.docvecs.most_similar('sharedtag'), I'd like to see a list of unique tags, each associated with their respective document. The unique tags will always have only a single co-occurrence with each shared tag, only for that one document.

I have been looking through the source, trying to figure out how I can take a raw document, do infer_vector, and then find similar documents in the trained corpus. Is there already a way to do that?

Gordon Mohr

unread,
Jan 23, 2017, 6:35:11 PM1/23/17
to gensim
`most_similar()` works on vectors, so it can use the output of `infer_vector()`. 

However, to avoid the single target vector from being confused with a list-of-positive-examples (which `most_similar()` also supports), it should be placed in a list (as if it were a single positive-example). EG:

    similars = model.docvecs.most_similar([model.infer_vector(new_doc_tokens)])

- Gordon

bigdeepe...@gmail.com

unread,
Jan 23, 2017, 8:20:35 PM1/23/17
to gensim
Starting to see some promising results on the small corpus. I trained it 2 more times on the same vocab, with alpha from 0.025 to 0.005 and then 0.002 to 0.0002.

bigdeepe...@gmail.com

unread,
Jan 27, 2017, 6:09:11 PM1/27/17
to gensim
Gordon, I've trained (and then trained a second time) a model based on the entire corpus. As opposed to the 1% corpus model, this one makes no sense to me at all.

After training I used a relatively long abstract (a large portion that fit in a window) and computed the inferred vector with alpha=0.025, min_alpha=0.0, steps=50. Then
I plugged the result into model.docvecs.most_similar([vec]) and got back a list of tags that define a general subject area (that in itself is strange – no unique tags), but it's totally wrong for the vector I plugged in. The text I used deals with oncology; the tags I got back are about dentistry. I expected that most_similar would at least give me back the same abstract that I plugged into infer_vector. This is messed up.

Any thoughts?

Gordon Mohr

unread,
Jan 27, 2017, 7:32:23 PM1/27/17
to gensim
Some things to check:

Is the list-of-words you're passing to `infer_vector()` tokenized in the same way as the training data was?

When you pass a trained tag, or a vector corresponding to one of the trained document-tags, to `most_similar()`, do the results seem sensible then? 

Is the abstract you're using to test `infer_vector()` exactly the same tokens as one of the texts used in training? (I'm not sure what you mean by "a large portion that fit in a window".) If so, and there's been enough training done at reasonable settings, then in the classic PV-Doc2Vec case where a document has a single unique ID, we would hope and expect for the same text's original ID to be one of the most-similar results. 

But returning to observations made up-thread: 

With the many-more tags per training-document that you've also described adding, I'm not sure what a reasonable expectation might be. As previously mentioned, that's different enough from other projects that precedents might not apply. If instead of 1 tag per document you have 10 or more, maybe instead of 20 training epochs to get useful results, you'll need 200. I don't know.

- Gordon

bigdeepe...@gmail.com

unread,
Jan 27, 2017, 7:58:45 PM1/27/17
to gensim
Ok, thanks for the suggestions. I think I will strip all the tags except one unique tag, to get a baseline on what's going on when I train on the entire dataset. This won't be useful for what I want to do, but hopefully I'll learn something from the results. The tokenization is just .split(), so it is identical. I copied and pasted a large portion of an abstract that was visible, so it was a large fraction of the abstract, but not all of it. Even when I use a unique tag the results don't seem sensible; the similarity values were around 0.25. When I used infer_vector() I got values around 0.72, but the tags I got back were not the unique tags but instead a single type of tag shared by multiple examples (it seemed really strange that I got only National Library of Medicine tags and none of any other kind).

Thanks.

Gordon Mohr

unread,
Jan 27, 2017, 8:43:17 PM1/27/17
to gensim
I would use an exact example from the same iterator as was used during training – such as the first document – rather than cutting-and-pasting an excerpt from some longer document. 

It almost sounds like your `most_similar()` operations might be happening on original/untrained values. Perhaps something in your pattern of testing/saving/loading/mmapping has inadvertently left `model.docvecs.doctag_syn0norm` (the source of vectors for `most_similar()` ops) with untrained (or less-trained) values. You may want to double-check related steps for unintended effects, or try `model.clear_sims()` on your trained model, to ensure it's not using older unit-normed values. 

- Gordon

bigdeepe...@gmail.com

unread,
Jan 28, 2017, 1:18:54 PM1/28/17
to gensim
After retraining for 10 epochs with alpha=0.025, min_alpha=0.0, size=400, min_count=3, and a single unique tag, the results don't make any sense.
I inferred a vector from the single word "ischemia" (when I do the word2vec-style model.most_similar('ischemia') it gives synonyms, different spellings, plurals, etc. – it all makes sense), but when
I do model.docvecs.most_similar([vec]) with vec = model.infer_vector(example.split(), alpha=0.025, min_alpha=0.0, steps=100), I get back a bunch of unique tags, and when I look up the associated examples, all of them seem totally unrelated. The only common theme I see so far is that all the tags I get back are associated with very SHORT examples.

So here we are: 1 unique tag per example trains faster (11 hours or so), but the results are pretty awful.

Gordon Mohr

unread,
Jan 28, 2017, 3:04:11 PM1/28/17
to gensim
Note that when you do `model.most_similar('ischemia')`, you are not 'inferring' a vector – you are looking up the pretrained word-vector for 'ischemia'. That you're getting back meaningful words from `most_similar()` indicates effective word-training of some sort is happening.

I suspect something is wrong with other steps of your model handling, or the choice/preparation of your `example`. 

Let's assume that `taggeddoc_corpus` is set up to be your TaggedDocuments with a single unique ID tag per document. Also, that `doc_text[id_tag]` will return the text associated with one of the trained tags. I suggest you run & review/share the output of the following (which uses different options than you've been using, to run a quick minimal trial):

    model = Doc2Vec(taggeddoc_corpus, size=100, dm=0, min_count=50, sample=1e-06, iter=10, workers=cores)
    probe_doc = iter(taggeddoc_corpus).next() # 1st document
    probe_tag = probe_doc.tags[0]
    probe_tokens = probe_doc.words
    vec = model.infer_vector(probe_tokens, alpha=0.025, steps=100)
    similars = model.docvecs.most_similar([vec])
    print("original text: %s" % doc_text[probe_tag])
    print("as tokens: %s" % probe_tokens)
    print("most similars:")
    for i, sim in enumerate(similars):
        print("___ #%i %s\n%s" % (i, sim[0], doc_text[sim[0]]))

(I haven't run this but it should be roughly correct with the above assumptions.) 

If this gives more sensible results than what you've been seeing, adjust the steps incrementally towards your other goals – ready to reverse/debug anything that seems to hurt. 

- Gordon

bigdeepe...@gmail.com

unread,
Jan 28, 2017, 9:00:01 PM1/28/17
to gensim
I mistyped. I did both: looked up words similar to 'ischemia', and used model.infer_vector with it and later with much longer text. Neither of the inferred vectors produced reasonable results when plugged into model.docvecs.most_similar.

I checked what's coming out of the iterator, and it looked perfectly fine – an array of tokens. I am training a model with the settings you suggested below. The one thing I noticed is that it is not training perceptibly faster than the previous training sessions.

bigdeepe...@gmail.com

unread,
Jan 29, 2017, 9:26:12 AM1/29/17
to gensim
The training with the parameters you suggested finished. model.docvecs.most_similar('tagxxxxx') came up with about 4 other tags that were generally within the same field of study.
At the same time, model.most_similar('ischemia') gave me back total junk. Training another one with min_count back at 3.

Gordon Mohr

unread,
Jan 29, 2017, 12:21:34 PM1/29/17
to gensim
The mode I suggested for the sake of speed – pure PV-DBOW, `dm=0` – doesn't train word vectors, so any word-similarity lookups will just give random results. (That's why the example code didn't probe any words as a post-training test.)

To train word vectors, skip-gram training can be added to PV-DBOW – `dm=0, dbow_words=1` – or you can switch back to PV-DM – `dm=1`. Each will be slower than pure PV-DBOW, though, and the doc-vectors might turn out to be better or worse, or peak in usefulness with different other metaparameters. 
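For example, reusing the earlier assumptions about `taggeddoc_corpus` and `cores` (parameter names per the gensim version in this thread):

    # PV-DBOW plus interleaved skip-gram word-training:
    model = Doc2Vec(taggeddoc_corpus, dm=0, dbow_words=1, size=100, min_count=50, iter=10, workers=cores)

    # ...or classic PV-DM, which trains word-vectors as part of its context averaging:
    model = Doc2Vec(taggeddoc_corpus, dm=1, size=100, min_count=50, iter=10, workers=cores)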

Did inference using the exact same tokens as one of the corpus documents, such as the very first, give a new vector that was close to the bulk-trained vector for the same text?

If you now have a model that kinda/sorta works, it'd be good to devise a repeatable quality evaluation for each model revision's doc-vectors, so that you know whether other tinkering (like changing modes, `min_count`, etc) helps or hurts. (Using such a low `min_count` is likely to hurt vector quality, for reasons given in a prior message.)

- Gordon


bigdeepe...@gmail.com

unread,
Jan 30, 2017, 8:15:34 AM1/30/17
to gensim
Can one meaningfully compare similarity values from two models? Let's say the only difference is size=200 versus 100?

Thanks.

Gordon Mohr

unread,
Jan 30, 2017, 11:57:12 PM1/30/17
to gensim
Probably not. As noted in the first reply on this thread, vectors aren't necessarily comparable unless trained together – even with the same parameters and the same corpus. Also, any parameter changes will shift the model to be sensitive in different ways. And, especially with regard to vector-size, a 200-dimensional space has a lot more directions for things to vary in than a 100-dimensional space – so similarities/distances could be distributed very differently. 

- Gordon

bigdeepe...@gmail.com

unread,
Jan 31, 2017, 6:28:01 AM1/31/17
to gensim
The problem I have, then, is that there are no criteria by which I can judge whether things are getting better or not. When things "kinda-sorta" start making sense, which way do I shift parameters to improve the results? Am I trying to convince myself that things are similar when they are just marginally related?

I was looking at an implant-coatings example, and the similar documents made sense as possible contenders, but how can I know there are no better choices? On every possible subject there are probably hundreds of "similar" abstracts.

Gordon Mohr

unread,
Jan 31, 2017, 3:46:03 PM1/31/17
to gensim
There is no absolute definition of "better" or "most similar" – only what works for your purposes, and/or works a little better than what came before. 

If there are "no criteria", that's just because you still need to come up with your own appropriate evaluations. Ideally those are quantitative and repeatable processes, to help rationally and quickly choose between alternate approaches/parameter-values. 

Andrey in your other thread made good suggestions: these kinds of projects usually require some set of human-generated judgements about what results are 'better' or 'worse'.  My message in this thread 8 days ago – https://groups.google.com/d/msg/gensim/N5SCiq1F45w/Mqav7WZXCgAJ – pointed out some techniques from the original Paragraph Vectors papers, for bootstrapping from other available document categorizations, that might be appropriate (up to a point). 

- Gordon

bigdeepe...@gmail.com

unread,
Feb 1, 2017, 8:41:30 AM2/1/17
to gensim
Unfortunately I don't have the manpower to manually classify similarity measures over 27+ million documents. I am considering an automated scheme now; perhaps you can weigh in on what the pitfalls would be. I obviously need to assess how well different documents are clustered around various themes. I am assuming there are clusters based on research areas, since the purpose of any article is to communicate with other researchers.

I am considering computing the dot product between docvec vectors (divided by the dimension) for, let's say, the topn=500-1000 most-similar vectors, and then aggregating the values over those top-N vectors.
My intuition tells me that it would be a good measure of HOW WELL the model clusters related material.

Thoughts?

Gordon Mohr

unread,
Feb 1, 2017, 10:46:14 AM2/1/17
to gensim
I have no idea or even hunch as to whether the calculation you propose would be positively correlated with your other project/end-user goals. It sounds like an idea that itself would need testing, against some human-derived fitness judgements. 

It seems you already have extra topical annotations on the documents – the other non-unique-ID tags – from either humans or some prior system. Such other indications-of-relatedness are exactly what's needed for the evaluation method used in the 'Document Embeddings with Paragraph Vectors' paper mentioned earlier. Is there something you don't like about that method, for your purposes?

- Gordon

Radim Řehůřek

unread,
Feb 2, 2017, 2:44:20 AM2/2/17
to gensim
Also, since it seems you're on an epic and complex task, let me remind you that there's also a commercial support & consulting option behind gensim:

http://radimrehurek.com/gensim/support.html

If you have the budget, this is a perfect opportunity to get more than an occasional mailing list post from the expert team behind gensim.

Best,
Radim

bigdeepe...@gmail.com

unread,
Feb 2, 2017, 8:16:39 AM2/2/17
to gensim
Radim, I appreciate the suggestion. At the moment I am the only person on the project, and I am funding the equipment and my time out of my own pocket, with no idea if that might change. I will keep your offer in mind.

bigdeepe...@gmail.com

unread,
Feb 2, 2017, 8:22:12 AM2/2/17
to gensim
I have browsed through some documents based on one shared tag, and the grouping of documents (at least based on the sample) was nothing to write home about. I was actually hoping to improve clustering with Doc2Vec over what might have been done before. I have to call the National Library of Medicine and find out a little more about how they go about classifying stuff.

Slogging on. :-)

bigdeepe...@gmail.com

unread,
Feb 2, 2017, 4:46:14 PM2/2/17
to gensim
My thought about defining the metric I mentioned was to only assess how tightly a model clusters those points that it considers similar. This is only to assess the differences between, let's say, size=100 versus size=300. In my mind, tighter clusters, with less variance around the cluster centroid, are better.

bigdeepe...@gmail.com

unread,
Feb 6, 2017, 12:10:49 PM2/6/17
to gensim
I wrote this little function to assess how tight clusters are:

def tightness(model, topn):
    total = 0.0    # renamed from 'sum', which shadows the Python built-in
    count = 0
    for v in model.docvecs:
        sims = model.docvecs.most_similar([v], topn=topn)
        for sim in sims:
            total += v.dot(model.docvecs[sim[0]])
        if count % 10000 == 0:
            print("Now at %d" % count)
        count += 1
    return total / (model.vector_size * topn)   # the 'size' parameter is stored as model.vector_size

tightness(model, topn=100)

runs incredibly slowly. Short of instead estimating by taking only 1% of the vectors, are there any more efficient ways
to do something similar?

Thanks.

Lev Konstantinovskiy

unread,
Feb 6, 2017, 12:17:20 PM2/6/17
to gensim
Annoy is a faster way of doing `most_similar`.
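For example, a sketch using the standalone `annoy` package directly on the doctag array (gensim also ships an `AnnoyIndexer` wrapper in `gensim.similarities.index` that can be used with `most_similar()` in recent versions – check your version's docs):

    from annoy import AnnoyIndex   # pip install annoy

    vecs = model.docvecs.doctag_syn0
    index = AnnoyIndex(vecs.shape[1], 'angular')   # angular distance ~ cosine similarity
    for i in range(len(vecs)):
        index.add_item(i, vecs[i])
    index.build(50)                                # more trees: better recall, slower build

    # Approximate nearest neighbours of one doc-vector (returned as integer offsets into doctag_syn0).
    neighbour_offsets = index.get_nns_by_vector(vecs[0], 100)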

Also, would it be possible to start a new mailing-list thread when a message is on a new topic? For example, the tightness of clusters is not related to "MemoryError".
