That is a super-interesting paper, with impressive results that are (to me) somewhat 'spooky', like the original Word2Vec analogy-arithmetic examples. It's surprising to me that the words can be nudged into such alignment by the secondary 'trans-gram' objective (which just uses the full sentence of an aligned 2nd-language sentence as the context for predicting each word of the 1st-language sentence).
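Just to make the idea concrete, here's a rough numpy sketch of my reading of that cross-language objective: every word of the aligned 2nd-language sentence serves as a context word in an ordinary skip-gram negative-sampling update predicting each word of the 1st-language sentence. (All names and sizes here are illustrative, not from the paper or any existing gensim code.)

```python
import numpy as np

rng = np.random.default_rng(0)
dim, vocab_a, vocab_b = 50, 1000, 1000
syn0_b = rng.normal(scale=0.1, size=(vocab_b, dim))   # language-B input vectors
syn1neg_a = np.zeros((vocab_a, dim))                  # language-A output weights

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def transgram_pair_update(sent_a_ids, sent_b_ids, alpha=0.025, negative=5):
    """Nudge language-B word vectors to predict every word of the aligned
    language-A sentence, with the whole B sentence as the 'window'."""
    for target in sent_a_ids:
        for context in sent_b_ids:
            l1 = syn0_b[context]
            neu1e = np.zeros(dim)
            # one positive example plus `negative` random negative samples
            samples = [(target, 1.0)] + [(int(rng.integers(vocab_a)), 0.0)
                                         for _ in range(negative)]
            for out_idx, label in samples:
                f = sigmoid(np.dot(l1, syn1neg_a[out_idx]))
                g = (label - f) * alpha
                neu1e += g * syn1neg_a[out_idx]
                syn1neg_a[out_idx] += g * l1
            syn0_b[context] += neu1e
```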
The use of full-sentence contexts is similar in some ways to FastText in classification mode, or to other word2vec/paragraph-vector variants that use a composited-from-word-vectors text-vector during training. (Another example is the 'Document Vector through Corruption' paper mentioned in <https://github.com/RaRe-Technologies/gensim/issues/1159>.) All of these seem to be examples of the 'adjacent possible' design space around similar algorithms, which should be considered with respect to a potential big unification-refactoring of the existing functionality (being discussed in <https://github.com/RaRe-Technologies/gensim/issues/1623>).
This mixed algorithm might almost be modeled by having two separate models that share certain internal weight arrays, each getting training from its respective examples in an interleaved fashion. (This could be a 1:1 zippering of examples, or even parallel threads pursuing the alternate objectives, asynchronously updating the same backing memory.) The relative balance of within-a-language vs cross-language training cycles might be a metaparameter, to tilt a session more towards 'trans-gram' alignment or traditional word2vec.
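Very roughly, that interleaving could look something like the following sketch, with the per-example update functions as hypothetical placeholders (both assumed to write into the same shared weight arrays), and `cross_ratio` as the proposed balance metaparameter. None of this is existing gensim API:

```python
import itertools
import random

def monolingual_update(sentence):
    """Placeholder for an ordinary within-language word2vec update."""
    pass

def cross_lingual_update(pair):
    """Placeholder for a 'trans-gram' style update over an aligned sentence
    pair, updating the same shared weight arrays as the step above."""
    pass

def interleaved_train(mono_sentences, aligned_pairs, epochs=5, cross_ratio=1.0):
    # cross_ratio in [0, 1]: probability of running a cross-language step
    # alongside each within-language step, i.e. the balance metaparameter.
    for _ in range(epochs):
        for sent, pair in itertools.zip_longest(mono_sentences, aligned_pairs):
            if sent is not None:
                monolingual_update(sent)
            if pair is not None and random.random() < cross_ratio:
                cross_lingual_update(pair)
```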
So if it's possible to do neatly on top of the existing code, sure. But better supporting such new permutations would be a goal of any refactor, so it could also be an important motivating case for that design/benchmarking work.
- Gordon