That is a super-interesting paper, with impressive results that are (to me) somewhat 'spooky', like the original Word2Vec analogy-arithmetic examples. It's surprising to me that the words can be nudged into such alignment by the secondary 'trans-gram' objective (which just uses the full sentence of an aligned 2nd-language sentence as the context for predicting each word of the 1st-language sentence).
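Just to make the idea concrete, here's a rough numpy sketch of my reading of that cross-language objective: every word of the aligned 2nd-language sentence serves as a context word in an ordinary skip-gram negative-sampling update predicting each word of the 1st-language sentence. (All names and sizes here are illustrative, not from the paper or any existing gensim code.)

```python
import numpy as np

rng = np.random.default_rng(0)
dim, vocab_a, vocab_b = 50, 1000, 1000
syn0_b = rng.normal(scale=0.1, size=(vocab_b, dim))   # language-B input vectors
syn1neg_a = np.zeros((vocab_a, dim))                  # language-A output weights

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def transgram_pair_update(sent_a_ids, sent_b_ids, alpha=0.025, negative=5):
    """Nudge language-B word vectors to predict every word of the aligned
    language-A sentence, with the whole B sentence as the 'window'."""
    for target in sent_a_ids:
        for context in sent_b_ids:
            l1 = syn0_b[context]
            neu1e = np.zeros(dim)
            # one positive example plus `negative` random negative samples
            samples = [(target, 1.0)] + [(int(rng.integers(vocab_a)), 0.0)
                                         for _ in range(negative)]
            for out_idx, label in samples:
                f = sigmoid(np.dot(l1, syn1neg_a[out_idx]))
                g = (label - f) * alpha
                neu1e += g * syn1neg_a[out_idx]
                syn1neg_a[out_idx] += g * l1
            syn0_b[context] += neu1e
```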
The use of full-sentence contexts is similar in some ways to FastText in classification mode, or to other word2vec/paragraph-vector variants that use a composited-from-word-vectors text-vector during training. (Another example is the 'Document Vector through Corruption' paper mentioned in <https://github.com/RaRe-Technologies/gensim/issues/1159>.) All of these seem to be examples of the 'adjacent possible' design space around similar algorithms, which should be considered with respect to a potential big unification-refactoring of the existing functionality (being discussed in <https://github.com/RaRe-Technologies/gensim/issues/1623>).
This mixed algorithm might almost be modeled by having two separate models that share certain internal weight arrays, each getting training from its respective examples in an interleaved fashion. (This could be a 1:1 zippering of examples, or even parallel threads pursuing the alternate objectives, asynchronously updating the same backing memory.) The relative balance of within-a-language vs cross-language training cycles might be a metaparameter, to tilt a session more towards 'trans-gram' alignment or traditional word2vec.
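Very roughly, that interleaving could look something like the following sketch, with the per-example update functions as hypothetical placeholders (both assumed to write into the same shared weight arrays), and `cross_ratio` as the proposed balance metaparameter. None of this is existing gensim API:

```python
import itertools
import random

def monolingual_update(sentence):
    """Placeholder for an ordinary within-language word2vec update."""
    pass

def cross_lingual_update(pair):
    """Placeholder for a 'trans-gram' style update over an aligned sentence
    pair, updating the same shared weight arrays as the step above."""
    pass

def interleaved_train(mono_sentences, aligned_pairs, epochs=5, cross_ratio=1.0):
    # cross_ratio in [0, 1]: probability of running a cross-language step
    # alongside each within-language step, i.e. the balance metaparameter.
    for _ in range(epochs):
        for sent, pair in itertools.zip_longest(mono_sentences, aligned_pairs):
            if sent is not None:
                monolingual_update(sent)
            if pair is not None and random.random() < cross_ratio:
                cross_lingual_update(pair)
```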
So if it's possible to do neatly on top of the existing code, sure. But better supporting such new permutations would be a goal of any refactor, so it could also be an important motivating case for that design/benchmarking work.
- Gordon