Train word embeddings with Word2Vec


xx liu

Apr 7, 2021, 4:02:16 AM
to Gensim
I want to train word embeddings on my corpus, which consists of 300+ files of about 3GB each. All the files have already been processed into the right format.
So my code is like this:
-------------------------------------
model = gensim.models.word2vec.Word2Vec.load(init_model)
for file in all_file:
    # update vocab with this file's words
    model.build_vocab(corpus_file=file, update=True)
    # train on this file; train() needs explicit word and epoch counts
    model.train(corpus_file=file,
                total_words=model.corpus_total_words,
                epochs=model.epochs)

# save model for next epoch...
------------------------------------
So my questions are:
1. Is my code correct?
2. How can I accelerate the training?
3. I have set `workers` to 16 or 8, but my CPU usage is only about 500% even though my machine has 16 cores. What can I do?

Thanks!

Gordon Mohr

Apr 7, 2021, 2:47:37 PM
to Gensim
(1) No.

In general you wouldn't want to start by loading some model from disk. If you're starting a new training (on 900GB of data!), just create a new model with explicit parameters.

Also, if at all possible you should only be calling `build_vocab()` once, without the `update` parameter. There are lots of gotchas with incremental updates that you'd only want to deal with as an advanced user. (And, as an advanced user, I personally don't think it's ever a good idea, though clearly some people find it useful.)

Similarly, a single call to `train()` that includes all the data will give the best results. (Also, ideally, the data that appears *late* in the corpus shouldn't be wildly different in vocabulary/usage from the data that appears early.)
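For example, here's a minimal sketch of that one-pass pattern, assuming gensim 4.x and that all the data has already been merged into a single pre-tokenized file (the file name and parameter values are placeholders, not recommendations):
-------------------------------------
from gensim.models.word2vec import Word2Vec

# Fresh model with explicit parameters, instead of loading one from disk.
model = Word2Vec(vector_size=300, window=5, min_count=5, workers=16)

# One vocabulary scan over the whole corpus...
model.build_vocab(corpus_file='all_data.txt')

# ...then one train() call over the same data. With `corpus_file`,
# train() needs explicit word and epoch counts.
model.train(corpus_file='all_data.txt',
            total_words=model.corpus_total_words,
            epochs=model.epochs)

model.save('word2vec.model')
-------------------------------------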

So, if using the `corpus_file` method, all your data should be in one file. (Alternatively, an iterable sequence class like `PathLineSentences`, provided as the `corpus_iterable`, could process multiple pre-tokenized files in one directory, but may not be able to keep all cores as busy as the `corpus_file` method; see the sketch below.)
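A rough sketch of that alternative, again assuming gensim 4.x and a placeholder directory path (`PathLineSentences` expects each file to contain one space-delimited, pre-tokenized sentence per line):
-------------------------------------
from gensim.models.word2vec import Word2Vec, PathLineSentences

# Streams sentences from every file in the directory, in filename order.
sentences = PathLineSentences('/data/tokenized/')

# Illustrative parameters; passing `sentences` does the vocab scan + training.
model = Word2Vec(sentences=sentences, vector_size=300, workers=8)
-------------------------------------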

(2) & (3) With `corpus_file`, you're likely to get the highest training throughput with `workers` equal to the number of available cores. (With a `corpus_iterable`, the actual best throughput will vary based on your other parameters, needs to be experimentally discovered, and may max out with a `workers` value lower than the number of cores.)
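For example (again just a sketch, reusing the placeholder merged file from above):
-------------------------------------
import os
from gensim.models.word2vec import Word2Vec

# With corpus_file, workers equal to the core count is a good starting
# point; with a corpus_iterable, benchmark smaller values as well.
model = Word2Vec(corpus_file='all_data.txt', workers=os.cpu_count())
-------------------------------------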

Tweaking other parameters (`negative`, `window`, `sample`, `min_count`, etc.) will also affect total runtime, but whether their speed-ups are worth whatever other changes in vector quality you observe is something you'd have to test experimentally with regard to your project's goals.

With a corpus that large, especially aggressive `min_count` (larger, to shrink the surviving vocabulary) and `sample` (smaller, to drop more occurrences of highly-repeated words) could offer big reductions in runtime at no cost in final vector quality. (In fact, more aggressively dropping rare words or undersampling frequent words often *improves* the quality of the remaining vectors on important tasks.)
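One cheap way to gauge such settings before committing to a full training run is to build only the vocabulary and see what survives (a sketch; the values are deliberately aggressive and purely illustrative):
-------------------------------------
from gensim.models.word2vec import Word2Vec

model = Word2Vec(min_count=100, sample=1e-5)   # illustrative, aggressive values
model.build_vocab(corpus_file='all_data.txt')  # vocab scan only, no training yet
print(len(model.wv), 'words survive this min_count')
print(model.corpus_total_words, 'total words seen in corpus')
-------------------------------------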

Also, with a corpus that large, if the later texts exhibit the same word-usage patterns as the earlier ones, you may not practically need as many training epochs - the sheer size of the corpus functions much like repeated epochs over a smaller one. (If I recall correctly, the famous circa-2013 'GoogleNews' vectors, trained on ~100B words of news stories, used only 3 epochs, rather than the typical default of 5 or the even larger epoch counts often used on smaller corpora.)

If you can get all your data into a single `corpus_file`, then moving to a system with even more cores (32, 64, etc) should further accelerate training. 

Good luck!

- Gordon

xx liu

Apr 14, 2021, 2:36:26 AM
to Gensim
Thank you so much.
I tried to merge all my files into one and train from it directly, but it failed with "OverflowError: value too large to convert to int".
After I updated gensim to 4.0, everything works well.
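For reference, the merge itself only needs a few lines of Python (paths here are placeholders):
-------------------------------------
import glob, shutil

# Concatenate every pre-tokenized part into one big training file.
with open('all_data.txt', 'wb') as merged:
    for path in sorted(glob.glob('corpus_parts/*.txt')):
        with open(path, 'rb') as part:
            shutil.copyfileobj(part, merged)
-------------------------------------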
Thank you again.