Doc2Vec on a large amount of data, but querying only a subset of it


Caterina Gallo

Nov 25, 2015, 6:32:43 AM
to gensim
I am using Doc2Vec on a large amount of labelled texts. I have >2 million texts, but for my application I'm interested in querying for the most similar tags within just a subset of them (~400K).
For example, if I ask for the tags most similar to 'TAG_i', which belongs to the ~400K set, I'd like the returned tags to belong to this set as well.
I thought of three ways to achieve my goal, but I don't know if they make sense:
  1. train the model only on the subset — but I'd like to avoid that, since training on more data should improve the precision of my model
  2. train the model on the whole corpus and, after training, eliminate the tags I'm not interested in (but I don't know if that is possible, and from some research it seems not)
  3. train the model on the subset first, then continue training with the remaining corpus. From what I understand, the new tags (and the new words) are not going to be saved, but it should still improve the precision.

If someone has other suggestions, or can tell me which of my solutions can be adopted and whether they make sense, it would be really appreciated.

Thanks in advance for any suggestion (and sorry for my bad English — it's not my native language).

Caterina

Gordon Mohr

Nov 25, 2015, 8:20:38 PM
to gensim
The `most_similar()` checks currently use a full scan over all candidates, then sort and return the top-N. But the good news is that you can limit the set of candidates scanned to a contiguous range of the backing array — see the `clip_start` and `clip_end` parameters of `DocvecsArray.most_similar()`.
So, if you ensure that the 400K wind up at contiguous indexes, calling `most_similar()` clipped to that range should be the easiest way to achieve your goal.
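To make the mechanism concrete, here is a minimal numpy sketch of what a clipped scan amounts to — this is an illustration of the idea, not gensim's actual internals, and it assumes the document vectors and the query vector are already unit-normalized (so a dot product equals cosine similarity):

```python
import numpy as np

def clipped_most_similar(doc_vecs, query_vec, clip_start, clip_end, topn=3):
    """Rank only the rows in [clip_start, clip_end) by cosine similarity.

    Assumes rows of `doc_vecs` and `query_vec` are unit-normalized.
    """
    candidates = doc_vecs[clip_start:clip_end]   # contiguous slice only
    sims = candidates.dot(query_vec)             # cosine similarities
    best = np.argsort(sims)[::-1][:topn]         # top-N within the slice
    # Report absolute indexes into the full array:
    return [(clip_start + int(i), float(sims[i])) for i in best]
```

With the gensim of this era, the equivalent call would be along the lines of `model.docvecs.most_similar('TAG_i', clip_start=0, clip_end=400000)`.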

For example, it should be enough to have your corpus iterable return those 400K in sequence (to the corpus scan that happens pre-training inside `build_vocab()`), either before or after all the others.

But also note, for maximum memory-efficiency, you'd want to use plain Python ints, rather than composed strings like `TAG_i`, as your document tags. In that case you'd just want to make sure whatever external steps you take to assign those ints make them contiguous indexes.
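A pure-Python sketch of such an ordering, under the assumption that the core subset and the rest are available as separate iterables of token lists (with gensim you'd yield `TaggedDocument(words, [tag])` rather than a plain tuple):

```python
def ordered_corpus(core_docs, other_docs):
    """Yield (int_tag, words) with the core subset at tags 0..len(core_docs)-1.

    `core_docs` and `other_docs` are iterables of token lists. Because the
    core docs are yielded first with consecutive integer tags, they occupy
    a contiguous low range of the backing array, ready for clipped queries.
    """
    tag = 0
    for words in core_docs:    # core subset first: tags 0..K-1
        yield tag, words
        tag += 1
    for words in other_docs:   # the rest: tags K..N-1
        yield tag, words
        tag += 1
```

Queries restricted to `clip_start=0, clip_end=<number of core docs>` then only ever return core tags.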

Other thoughts:

Regarding your idea (1), you're probably right that it's better to train on the larger set. But if the other 1.6M examples are somewhat different in their vocabulary/content, it's also possible they dilute rather than improve the modeling of the core 400k. If you have time and an easy way to evaluate the quality of the end results, I might try both including them and excluding them. 

Regarding (2), while there are no built-in methods to support such post-training trimming, it could certainly be done with some extra custom Python to perform surgery on the trained model's internals. (If you never want to look up, or otherwise receive results from, the 1.6M, they're just taking up memory: they're not even required for inferring new document vectors.)
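The core of that surgery would be slicing the trained tag-vector array down to the rows you want to keep. A heavily hedged sketch (the array name `doctag_syn0` is how gensim of this era stores tag vectors, but the stand-in data and counts below are illustrative; a real model also carries tag-to-index mappings that would need the same trimming):

```python
import numpy as np

# Hypothetical stand-in for the trained tag vectors (in gensim of this era,
# model.docvecs.doctag_syn0); assume the core tags occupy the first rows.
n_docs, n_core, dim = 2000, 400, 100          # scaled-down counts for illustration
doctag_vectors = np.random.rand(n_docs, dim).astype(np.float32)

# "Surgery": keep only the core rows. .copy() detaches the slice so the
# large original array can be garbage-collected, reclaiming the memory.
trimmed = doctag_vectors[:n_core].copy()
```

Whether the rest of the model stays consistent after such trimming is untested; treat this as an outline of the idea, not a supported operation.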

Regarding (3), there's not currently any support to "continue" training with more examples. (There are ways you could try to force it, by improvising fix-ups on the internal model state, but whether that could achieve good results is unknown, and in any case the model tends to be most-influenced by the latest examples it has seen... so you wouldn't want the non-core data to come last.) You can ask the model to infer vectors for new texts, but that doesn't change/improve the model at all. 

Be sure to try DBOW mode (`dm=0`) – some of the more interesting Doc2Vec results use that, rather than the default DM (`dm=1`) mode. (It also scores better in my attempts to reproduce the "Paragraph Vectors" IMDB sentiment experiment.)

- Gordon

Yana Volovik

Feb 3, 2016, 8:11:46 AM
to gensim
I use clip_start too, for the same need. Please pay attention, though: when clip_start is not 0, the most_similar function will still report item indexes starting from 0.
I've opened an issue: https://github.com/piskvorky/gensim/issues/601.
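Until that issue is resolved, one workaround is to shift the reported indexes yourself. A small sketch, assuming the clipped `most_similar()` returns (index, similarity) pairs counted from the start of the clipped range rather than from the start of the full array:

```python
def remap_clipped_results(results, clip_start):
    """Shift (index, similarity) pairs from a clipped most_similar() call
    back to absolute indexes, assuming the reported indexes are counted
    from clip_start rather than from 0."""
    return [(index + clip_start, sim) for index, sim in results]
```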

