Train word embeddings with Word2Vec


xx liu

Apr 7, 2021, 4:02:16 AM
to Gensim
I want to train word embeddings on my corpus, which consists of 300+ files of about 3GB each. All the files have already been processed into the right format.
So my code is like this:
-------------------------------------
model = gensim.models.word2vec.Word2Vec.load(init_model)
for file in all_file:
    # update vocab with this file's words
    model.build_vocab(corpus_file=file, update=True)
    # train on this file; train() needs explicit word and epoch counts
    model.train(corpus_file=file,
                total_words=model.corpus_total_words,
                epochs=model.epochs)

# save model for next epoch...
------------------------------------
So my questions are:
1. Is my code correct?
2. How can I accelerate the training?
3. I have set `workers` to 16 or 8, but my CPU usage is only about 500% even though my machine has 16 cores. What can I do?

Thanks!

Gordon Mohr

Apr 7, 2021, 2:47:37 PM
to Gensim
(1) No.

In general you wouldn't want to start by loading some model from disk. If you're starting a new training (on 900GB of data!), just create a new model with explicit parameters.

Also, if at all possible you should only be calling `build_vocab()` once, without the `update` parameter. There are lots of gotchas with incremental updates that you'd only want to deal with as an advanced user. (And, as an advanced user, I personally don't think it's ever a good idea, though clearly some people find it useful.)

Similarly, a single call to `train()` that includes all the data will give the best results. (Also, ideally, the data that appears *late* in the corpus shouldn't be wildly different in vocabulary/usage from the data that appears early.)
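For example, here's a minimal sketch of that one-pass pattern, assuming gensim 4.x and that all the data has already been merged into a single pre-tokenized file (the file name and parameter values are placeholders, not recommendations):
-------------------------------------
from gensim.models.word2vec import Word2Vec

# Fresh model with explicit parameters, instead of loading one from disk.
model = Word2Vec(vector_size=300, window=5, min_count=5, workers=16)

# One vocabulary scan over the whole corpus...
model.build_vocab(corpus_file='all_data.txt')

# ...then one train() call over the same data. With `corpus_file`,
# train() needs explicit word and epoch counts.
model.train(corpus_file='all_data.txt',
            total_words=model.corpus_total_words,
            epochs=model.epochs)

model.save('word2vec.model')
-------------------------------------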

So, if using the `corpus_file` method, all your data should be in one file. (Alternatively, an iterable sequence class like `PathLineSentences`, provided as the `corpus_iterable`, could process multiple pre-tokenized files in one directory, but may not be able to keep all cores as busy as the `corpus_file` method; see the sketch below.)
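A rough sketch of that alternative, again assuming gensim 4.x and a placeholder directory path (`PathLineSentences` expects each file to contain one space-delimited, pre-tokenized sentence per line):
-------------------------------------
from gensim.models.word2vec import Word2Vec, PathLineSentences

# Streams sentences from every file in the directory, in filename order.
sentences = PathLineSentences('/data/tokenized/')

# Illustrative parameters; passing `sentences` does the vocab scan + training.
model = Word2Vec(sentences=sentences, vector_size=300, workers=8)
-------------------------------------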

(2) & (3) With `corpus_file`, you're likely to get the highest training throughput with `workers` equal to the number of available cores. (With a `corpus_iterable`, the actual best throughput will vary based on your other parameters, needs to be experimentally discovered, and may max out with a `workers` value lower than the number of cores.)
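For example (again just a sketch, reusing the placeholder merged file from above):
-------------------------------------
import os
from gensim.models.word2vec import Word2Vec

# With corpus_file, workers equal to the core count is a good starting
# point; with a corpus_iterable, benchmark smaller values as well.
model = Word2Vec(corpus_file='all_data.txt', workers=os.cpu_count())
-------------------------------------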

Tweaking other parameters (`negative`, `window`, `sample`, `min_count`, etc.) will also affect total runtime, but whether their speed-ups are worth whatever other changes in vector quality you observe is something you'd have to test experimentally with regard to your project's goals.

With a corpus that large, especially aggressive `min_count` (larger, to shrink the surviving vocabulary) and `sample` (smaller, to drop more occurrences of highly-repeated words) could offer big reductions in runtime at no cost in final vector quality. (In fact, more aggressively dropping rare words or undersampling frequent words often *improves* the quality of the remaining vectors on important tasks.)
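One cheap way to gauge such settings before committing to a full training run is to build only the vocabulary and see what survives (a sketch; the values are deliberately aggressive and purely illustrative):
-------------------------------------
from gensim.models.word2vec import Word2Vec

model = Word2Vec(min_count=100, sample=1e-5)   # illustrative, aggressive values
model.build_vocab(corpus_file='all_data.txt')  # vocab scan only, no training yet
print(len(model.wv), 'words survive this min_count')
print(model.corpus_total_words, 'total words seen in corpus')
-------------------------------------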

Also, with a corpus that large, if the later texts exhibit the same word-usage patterns as the earlier ones, you may not practically need as many training epochs - the sheer size of the corpus functions much like repeated epochs over a smaller one. (If I recall correctly, the famous circa-2013 'GoogleNews' vectors, trained on ~100B words of news stories, used only 3 epochs, rather than the typical default of 5 or the even larger epoch counts often used on smaller corpora.)

If you can get all your data into a single `corpus_file`, then moving to a system with even more cores (32, 64, etc) should further accelerate training. 

Good luck!

- Gordon

xx liu

Apr 14, 2021, 2:36:26 AM
to Gensim
Thank you so much.
I tried to merge all my files into one and train from it directly, but it failed with "OverflowError: value too large to convert to int".
After I updated gensim to 4.0, everything works well.
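For reference, the merge itself only needs a few lines of Python (paths here are placeholders):
-------------------------------------
import glob, shutil

# Concatenate every pre-tokenized part into one big training file.
with open('all_data.txt', 'wb') as merged:
    for path in sorted(glob.glob('corpus_parts/*.txt')):
        with open(path, 'rb') as part:
            shutil.copyfileobj(part, merged)
-------------------------------------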
Thank you again.