Multilingual Word Embedding


13j...@gmail.com

unread,
Jan 7, 2019, 6:32:14 AM1/7/19
to Gensim

In order to classify sentences in multiple languages, we need multilingual word embeddings (all languages in a single vector space). Now, why would you want to do this? Why not a separate model for each language? Because if I have little data for one language, it would be beneficial to include data from other languages in order to make the model more effective.

I am finding it hard to find any tool that would help me do this. Yes, I know that we can create word embeddings on the fly while training a network, but then another problem arises: if I don't have enough data for one language, how good will the vectors be? Hence I decided to use something similar to the original data but with many more data points.

There are tools like Facebook's MUSE, but they don't align multiple languages into a single vector space.

It would be helpful if the community could help me here. Any further questions or suggestions are welcome.

I have already looked into fastText vector alignment, but it only allows two languages.
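For context, the fastText alignment mentioned above is, as far as I understand, based on solving the orthogonal Procrustes problem: given a seed dictionary of translation pairs, learn an orthogonal matrix that rotates one embedding space onto the other. A minimal NumPy sketch (the `procrustes_align` helper and the toy data are my own illustration, not any library's API):

```python
import numpy as np

def procrustes_align(src, tgt):
    """Learn an orthogonal map W minimizing ||src @ W - tgt||_F
    (the orthogonal Procrustes problem). src and tgt are row-aligned
    matrices of embeddings for known translation pairs."""
    u, _, vt = np.linalg.svd(src.T @ tgt)
    return u @ vt  # shape (dim, dim); apply as src @ W

# toy example: the "target space" is the source space under a rotation
rng = np.random.default_rng(0)
dim = 5
src = rng.normal(size=(20, dim))
rot = np.linalg.qr(rng.normal(size=(dim, dim)))[0]  # random orthogonal matrix
tgt = src @ rot

W = procrustes_align(src, tgt)
print(np.allclose(src @ W, tgt))  # the learned map recovers the rotation
```

To cover more than two languages with this approach, the usual trick is to pick one pivot language (typically English) and align every other language's space to the pivot pairwise, so that all vectors end up in the pivot's space.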

Jordi Carrera

unread,
Jan 7, 2019, 3:52:36 PM1/7/19
to Gensim
As far as I know, the 2018 state-of-the-art for unsupervised (and I think also supervised) multilingual word embeddings (aligning multiple languages into a single space) was set in this paper by Conneau et al.: https://arxiv.org/abs/1710.04087 (Alexis Conneau, Guillaume Lample, Marc’Aurelio Ranzato, Ludovic Denoyer, and Herve Jegou. 2017. Word translation without parallel data. arXiv preprint arXiv:1710.04087).

They solve a problem similar to the one you describe, but both their methodology and the related literature suggest that their approach is heavily constrained by how comparable/parallel the data is across the languages you want to translate between. Working on Wikipedia, as they do, gives good results because all copies of Wikipedia can be expected to contain highly comparable information for the same entry (provided that the entry's page exists in all the relevant languages), but this assumption seems to fail on other datasets, as reported by research conducted by the lang.ai team: https://building.lang.ai/messing-with-intents-translation-part-i-d605fde30755

They find that, in an intent-classification task based on multilingual embeddings, recall is low across both languages and domains, and that precision is not that good even for cross-language translation when the translations belong to the same intent (e.g. customer service), due to strong topic heterogeneity, which prevents the alignments from occurring.

As far as I remember, they used a fair amount of data, so, if data size is also a constraint for you, you may have to explore an alternative approach.

Anyway, I hope these references are helpful in some way or can at least guide your search further.
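For what it's worth, once two spaces have been aligned (by MUSE or otherwise), word translation in the Conneau et al. setup is essentially nearest-neighbour retrieval in the shared space; their CSLS criterion refines this to mitigate hubness, but plain cosine similarity shows the idea. A hedged sketch; the `nearest_neighbors` helper and the toy data are illustrative only:

```python
import numpy as np

def nearest_neighbors(query_vecs, target_vecs, k=1):
    """For each (already aligned) query embedding, return the indices
    of the k most cosine-similar target-language embeddings."""
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    t = target_vecs / np.linalg.norm(target_vecs, axis=1, keepdims=True)
    sims = q @ t.T  # cosine similarity matrix
    return np.argsort(-sims, axis=1)[:, :k]

# toy check: targets are slightly perturbed copies of the queries,
# so query word i should retrieve target word i
rng = np.random.default_rng(1)
queries = rng.normal(size=(4, 8))
targets = queries + 0.01 * rng.normal(size=(4, 8))
print(nearest_neighbors(queries, targets).ravel())
```

The quality of these retrievals is exactly what degrades when the corpora are not comparable, which is the failure mode the lang.ai write-up describes.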

Radim Řehůřek

unread,
Feb 3, 2019, 11:52:47 AM2/3/19
to Gensim
Facebook also recently released their LASER embeddings; check them out.

HTH,
Radim

carlton radivoyevitch

unread,
Feb 4, 2019, 3:00:06 AM2/4/19
to Gensim
I'm having a somewhat similar issue, and am hoping someone reading this thread can help me out.
I'm working with data that mixes English and Japanese within the same text (e.g. "いつでもI like youだよ"). My current tokenizer, english.pickle from NLTK, doesn't handle the Japanese part.
MeCab can tokenize the Japanese, but I don't know whether it can also tokenize the English. Does anyone have experience with MeCab? Will fastText or LASER allow for this kind of bilingual tokenization?

What I'm gathering from the papers and advice listed here is that they assume texts written entirely in either language A or language B, and are meant for translating and gathering data across languages, not for texts that mix two scripts in the same file. Is this assumption wrong?
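One pragmatic option for the mixed-script problem above is to split each text into script runs first and route each run to the appropriate tokenizer (MeCab for the Japanese spans, an English tokenizer for the Latin spans). A minimal sketch; the `LATIN` pattern and `split_scripts` helper are my own illustration, not part of MeCab or NLTK:

```python
import re

# Runs of ASCII letters, digits, apostrophes and spaces count as "English";
# everything else is treated as Japanese. This is deliberately crude --
# extend the pattern for punctuation, full-width characters, etc.
LATIN = re.compile(r"[A-Za-z0-9' ]+")

def split_scripts(text):
    """Split a mixed string into ("ja"/"en", run) pairs, in order."""
    runs, last = [], 0
    for m in LATIN.finditer(text):
        if m.start() > last:
            runs.append(("ja", text[last:m.start()]))
        runs.append(("en", m.group().strip()))
        last = m.end()
    if last < len(text):
        runs.append(("ja", text[last:]))
    return runs

print(split_scripts("いつでもI like youだよ"))
# [('ja', 'いつでも'), ('en', 'I like you'), ('ja', 'だよ')]
```

That said, it may be worth simply trying MeCab on the raw mixed string first: in my experience it generally passes Latin-script runs through as tokens rather than choking on them, though how it segments them depends on the dictionary you use.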