gensim's word2vec can't load model from hdfs path


涅言

Aug 31, 2022, 11:52:00 PM
to Gensim
Hi, hello,
Currently I am facing a problem using gensim's word2vec. Since the data sources are all on the cluster's HDFS, the training must be done on the cluster, so the incrementally trained model must be read from and saved to the cluster's HDFS.
However, the load and save APIs of gensim.models.Word2Vec do not support HDFS paths, so I am looking for a solution.

For example, when my model path is:
online_model_path = 'hdfs://com1-hdfs/user/w2v_models/word2vec_200_online.model'
the call Word2Vec.load(online_model_path) doesn't work.

Thank you in advance for your nice work, appreciate it 

Gordon Mohr

Sep 1, 2022, 12:24:30 AM
to Gensim
The suggestions from this thread a couple weeks ago may be helpful: https://groups.google.com/g/gensim/c/EYEo4mHqW9o/m/OBARWZSdAAAJ

(Specifically: either download everything locally first, so you can use working local paths, or try plain Python pickling with the appropriate versions/parameters.)
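
For instance, here's a rough sketch of the first option. It's only an outline: the paths just echo your example, it assumes the `hdfs dfs` command-line client is configured on the training machine, and a model saved with gensim's `.save()` may also have sibling `*.npy` array files alongside it, which the glob tries to copy as well.

```
# Sketch only: copy the model files from HDFS to local disk, load/train/save
# locally, then push the results back. Assumes the `hdfs dfs` CLI works here
# and that local_dir starts out empty.
import glob
import os
import subprocess
from gensim.models import Word2Vec

hdfs_dir = 'hdfs://com1-hdfs/user/w2v_models'
model_name = 'word2vec_200_online.model'
local_dir = '/tmp/w2v_models'
os.makedirs(local_dir, exist_ok=True)

# pull the model (and any sibling .npy array files) down to local disk
subprocess.run(['hdfs', 'dfs', '-get', f'{hdfs_dir}/{model_name}*', local_dir],
               check=True)
model = Word2Vec.load(os.path.join(local_dir, model_name))

# ... incremental training goes here ...

# save locally, then push every produced file back up to HDFS
model.save(os.path.join(local_dir, model_name))
for path in glob.glob(os.path.join(local_dir, model_name) + '*'):
    subprocess.run(['hdfs', 'dfs', '-put', '-f', path, hdfs_dir], check=True)
```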

However, your logic "since the data sources are all on the cluster's hdfs, the training must be done on the cluster, so the incrementally trained model must be read and saved on the cluster's hdfs" may not be correct. 

Gensim's `Word2Vec` has no multi-machine training mode.

It is oblivious to the source of the training texts (as long as they're a re-iterable Python sequence) and to where a model loads/saves from (as long as it loads somehow), and it will only ever be training on a single system (not across a cluster, even if that one system happens to be part of some cluster).
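
For instance, a toy sketch of "any re-iterable sequence works" (the filename is just a placeholder for wherever you've staged text locally, one pre-tokenized sentence per line):

```
from gensim.models import Word2Vec

class LineCorpus:
    """Re-iterable corpus: re-opens the file on every pass over the data."""
    def __init__(self, path):
        self.path = path
    def __iter__(self):
        with open(self.path, encoding='utf-8') as f:
            for line in f:
                yield line.split()

sentences = LineCorpus('/tmp/exported_sentences.txt')  # placeholder path
model = Word2Vec(sentences=sentences, vector_size=200, min_count=5, workers=4)
```

Word2Vec doesn't care whether that file was exported from HDFS, a database, or anything else, as long as the sequence can be iterated over multiple times.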

- Gordon

涅言

Sep 1, 2022, 2:48:58 AM
to Gensim
Thanks for such a detailed and quick answer, but sorry, I still don't know how to use Python pickling to load the model. Can you give some specific example code or a document? Thanks.
PS: Since I need to retrain the model regularly to pick up new vocabulary every day, and data security is involved, this work is limited to the company's cluster environment, so the model update must be done on the cluster.
In addition, I have also found that others have this need (loading data from HDFS via Spark SQL and then updating the gensim word2vec model, all done on cluster machines), but they did not find a suitable method either.

Gordon Mohr

Sep 1, 2022, 12:13:02 PM
to Gensim
There are many online examples showing how to pickle an object to a local file, for example on StackOverflow.

(There are probably HDFS-specific examples, too, in HDFS-specific forums - but I've not used HDFS in around a decade.)
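
As a minimal illustration (the paths are placeholders; plain pickling keeps the whole model in one file, which can be simpler to shuttle around, though it may be slow or large for big models):

```
import pickle
from gensim.models import Word2Vec

# tiny throwaway model just to have something to pickle
model = Word2Vec(sentences=[['hello', 'world']], vector_size=20, min_count=1)

# pickle the model to a single local file
with open('/tmp/word2vec_200_online.pkl', 'wb') as f:
    pickle.dump(model, f, protocol=pickle.HIGHEST_PROTOCOL)

# ...later, load it back...
with open('/tmp/word2vec_200_online.pkl', 'rb') as f:
    model = pickle.load(f)
```

You'd still need some HDFS-specific step (outside my recent experience) to move that single file onto or off of the cluster filesystem.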

If continually updating an older model, keep in mind that a fresh training, using *all* data/vocabulary, may outperform constant incremental updates – at least with regard to repeatability, and often in other harder-to-measure qualities.

Why? 

Imagine you have three separate training corpora: A, B, & C. A model trained from scratch on the mixed combination of [A, B, C] will treat all examples equally, include all relevant words that (across the whole combined corpus) appear `min_count` times, and keep those words in the (usually-most-efficient) most-frequent-to-least-frequent storage order. 

If you instead train on A, use the model a bit, then update-train on B, use the model a bit, then update-train on C, the model is always most-influenced by the examples it has seen most-recently. The influence of the B-session, then the C-session, will (depending on other parameters) tend to dilute/shift words left over from A without the interleaved influence of usage examples in A. New words that never co-appear with earlier words, and further never underwent interleaved training with earlier words, may be trained into positions that aren't fully compatible with the unadjusted earlier words. If there's a word that in each individual training corpus appeared only `min_count-1` times, it'll never get a vector – even though altogether in the combined corpus, it appears a full `(3*min_count)-3` times. And, the incremental appending of new words always happens at the end, & incremental training never changes earlier-session word slots – so the incrementally-updated model no longer reliably has its words in the preferred most-frequent-to-least-frequent ordering. 
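
To make the contrast concrete, here's a hedged sketch of the two approaches using gensim's vocabulary-update API (A, B, & C below are just placeholder lists of tokenized sentences):

```
from gensim.models import Word2Vec

A = [['alpha', 'beta'], ['beta', 'gamma']]
B = [['beta', 'delta']]
C = [['gamma', 'delta', 'epsilon']]

# (1) fresh training on the combined corpus: all words counted together,
#     vocabulary stored in overall most-frequent-to-least-frequent order
full_model = Word2Vec(sentences=A + B + C, vector_size=100, min_count=1)

# (2) incremental updates: each later session has the most recent influence,
#     new words get appended at the end, and a word below min_count within
#     every *individual* batch never receives a vector
inc_model = Word2Vec(sentences=A, vector_size=100, min_count=1)
for batch in (B, C):
    inc_model.build_vocab(batch, update=True)
    inc_model.train(batch, total_examples=len(batch), epochs=inc_model.epochs)
```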

Some projects may manage to get benefits from such incremental updates, but many do it without even realizing (or testing for) the kinds of model weaknesses they may be creating, compared to a full, balanced training session with all data.

- Gordon

涅言

Sep 1, 2022, 9:56:15 PM
to Gensim
Hi Gordon, thank you very much for your kind help; your answer has taught me a lot. Thank you again. What a great job gensim is, I appreciate it.