Best way to train documents which are subjected to expiration


anurag...@bold.com

Nov 15, 2017, 12:09:37 AM
to gensim
Hello All,

We have a system that provides recommendations for an input text document. We have 6-7 million such documents in our corpus (a document is something like a movie review or a job description). The main thing here is that these documents are subject to expiration. On a daily basis we receive events telling us to mark some of these documents as expired, and we also receive events to insert or modify existing documents, so overall the documents in our corpus stay in the above range (6-7 million).

So on any given day, our corpus contains some documents that were created or modified today and some documents that were inserted some days back but have not expired yet.

We follow these steps to build our model -
  • Take a CSV dump of all the non-expired documents we have in our corpus. Each row contains the document-id and the corresponding text.
  • Pass this file and get a TaggedDocument object for each line in the CSV file.
  • Use the above to call modelDoc.build_vocab()
  • Then use this model vocab to train our model (modelDoc_loaded.train())
Once the model is trained, we load it and start giving recommendations by calling gensim's most_similar function. The entire process takes around 10 hours to complete (depending on the number of documents we have on a given day).

Since we are training every single document again and again, it doesn't make sense to start afresh every single day. Is it possible to have a new model that reuses the trained documents (which have not expired) from the earlier model and trains only the new documents that were inserted/modified today? Or is there any other strategy to reduce this run time of 10 hours?

Some specs about my machine - 110 GB RAM, 16-core machine running CentOS. We already have a faster version of gensim installed to train our model faster. We are currently using gensim version 0.12.1.

Thanks in advance.

Ivan Menshikh

Nov 16, 2017, 12:27:17 AM
to gensim
Hi,
Very good question! 
My advice is to train the model once on a large number of documents (it does not matter if some are expired, edited, etc.).
After that, every <UPDATE TIME> you call infer_vector for every "actual" document and store the vectors somewhere (with 110 GB you can keep them even in RAM and use scipy.spatial.distance.cdist instead of most_similar), or use approximate search (to save memory and speed up your search), something like annoy or faiss.

I hope that Gordon will come to this thread too and discuss it.

Gordon Mohr

Nov 30, 2017, 2:51:48 PM
to gensim
As Ivan notes, to get a vector for a new document, you don't need to do a full bulk-training cycle – you can use `infer_vector()`, and you'll get a doc-vector that approximates what that new document would have received, had it been part of the last bulk-training process. That is, it'll be in a compatible coordinate-space.

In fact, it's even possible for the inferred-vectors, if you use a generous number of `steps`, to work better for downstream info-retrieval/similarity/classification tasks than those left in the model by bulk-training. This is likely because the inferred-vector gets all of its inference-adjustments from the final frozen model, rather than (as the bulk-trained vectors) from the model over the full course of moving from untrained-to-fully-trained. And, while inference on a single text is single-threaded, the frozen model can be used by many processes, or replicated to many machines, making inference of large batches of documents somewhat more parallelizable/distributable than original training. 

(For this reason, it could be worth evaluating whether to re-infer vectors for the training data, after training - depending on time/process costs and the achievable lift it might be worth the effort. There's not yet explicit support in gensim for some practices that might be worthwhile here – like fine-tuning the bulk vectors with some extra final inference, or starting inference from a supplied vector rather than a fresh random vector. But they're a matter of discussion for the future, for example at Github issue <https://github.com/RaRe-Technologies/gensim/issues/515>.)

Also, I'd presume that when you mention "expired" documents, you mainly mean that they are no longer interesting as live most-similar results. They might still be useful as training data. (For example, some rare terms might fall below `min_count` occurrences, or otherwise have thin examples of usage/meaning, within a limited sliding window of 'active' documents. But training on a longer history of older documents would model such terms better... even if you don't need to retain the doc-vectors for those expired documents.)

Combining all these observations, I'd lean toward a system that:

* occasionally does gigantic model-builds, on as much data as is thought domain-relevant (so as to have the largest domain vocabulary and richest training data). For example, this might happen weekly or monthly. 

* from those big models, have a separate process to maintain a smaller subset of 'active' doc-vectors – either by extracting the subset from the bulk-trained vectors, or re-inferring as new documents arrive. (The gensim `KeyedVectors` class is an example of storing vectors by lookup keys, and supporting `most_similar()`, but without full training support/model-overhead. You could probably shoehorn your 'active' sets of vectors into it or a class like it.) This active-set might be maintainable daily or hourly, or even (nearly) instantly. 

* when a next "big model" rebuild happens, the resulting vectors won't usually or necessarily be comparable to those from a prior "big model" - the coordinate spaces weren't trained together, so while relative document similarities should be of similar quality, individual docs/neighborhoods will have moved around arbitrarily/randomly. So, you'd throw out any cached vectors in 'active' applications when rolling to a new big model. There might be ways of speeding the big builds, or achieving better quality in the same time, by starting the new model with some state from the previous epoch – but this would be a matter of ad hoc customization; there's no explicit support for this yet. There could also be ways of enforcing coordinate-compatibility, by carrying some state forward, or locking some significant set of reference docs in the training set at fixed coordinates, or some set of domain terms (in input-words or NN output-encodings) at fixed coordinates – but that again would require a lot of research/tinkering to implement and evaluate its potential value. 

- Gordon

Radim Řehůřek

Dec 1, 2017, 8:42:55 AM
to gensim
Hi,

option 3: we have a commercial enterprise product for fast and flexible semantic search in huge datasets, ScaleText.

So if optimizing this kind of thing is not your core expertise, it might make sense to leave it to us, get a license, and focus on your business instead.

Best,
Radim