Using Hindi Unicode characters in wo

Gaurish Thakkar

Dec 25, 2015, 12:49:00 AM
to gensim
I have trained my model on the corpus which is written in Hindi.

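from gensim import models
from gensim.models.word2vec import LineSentence

# LineSentence reads one sentence per line, with tokens separated by whitespace.
sentences = LineSentence('/home/gaurish/Desktop/AllTokens.txt')
model = models.Word2Vec(sentences)

model.save('mymodel')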



But when I load the saved model and look up a word in its vocabulary (as a first step toward computing cosine similarity), I get this error:

gaurish@gaurish-Studio-1457:~/Desktop$ python loadPy.py
2015-12-25 11:06:59,909 : INFO : loading Word2Vec object from mymodel
2015-12-25 11:07:00,082 : INFO : setting ignored attribute syn0norm to None
2015-12-25 11:07:00,082 : INFO : setting ignored attribute cum_table to None
hi
Traceback (most recent call last):
  File "loadPy.py", line 17, in <module>
    print new_model.vocab["अशूभ"]
KeyError: '\xe0\xa4\x85\xe0\xa4\xb6\xe0\xa5\x82\xe0\xa4\xad'
gaurish@gaurish-Studio-1457:~/Desktop$ ^C
gaurish@gaurish-Studio-1457:~/Desktop$

Am I doing something wrong?

The vocab object, when printed on the terminal, looks like this:

word2vec.Vocab object at 0x7fdaa2637ed0>, u'\u0924\u093e\u0917': <gensim.models.word2vec.Vocab object at 0x7fdaa2ac7150>, u'\u0926\u094b\u0928\u0936\u0947': <gensim.models.word2vec.Vocab object at 0x7fdaa2411bd0>, u'\u0930\u0938\u094d\u0924\u094d\u092f\u093e\u0902\u0924': <gensim.models.word2vec.Vocab object at 0x7fdaa2ab9a90>, u'\u0915\u093e\u0933\u0916\u093e\u0902\u0924': <gensim.models.word2vec.Vocab object at 0x7fdaa3020050>, u'\u0928\u093e\u0936\u093e\u0921\u0940': <gensim.models.word2vec.Vocab object at 0x7fdaa2411c50>, u'\u0909\u092a\u0928\u093f\u0937\u0926': <gensim.models.word2vec.Vocab object at 0x7fdaa2411c90>, u'\u0917\u0942': <gensim.models.word2vec.Vocab object at 0x7fdaa2411cd0>, u'\u091c\u092e\u092a': <gensim.models.word2vec.Vocab object at 0x7fdaa2dbd0d0>, u'\u092c\u0938\u0924\u0932\u094b': <gensim.models.word2vec.Vocab object at 0x7fdaa2411d50>, u'\u092a\u093f\u0924\u094d\u0924': <gensim.models.word2vec.Vocab object at 0x7fdaa2411d90>, u'\u0908-\u092e\u0947\u0932': <gensim.models.word2vec.Vocab object at 0x7fdaa2411dd0>, u'\u0935\u093e\u092f\u0942': <gensim.models.word2vec.Vocab object at 0x7fdaa24b4b10>, u'\u0939\u093f\u0902\u0926\u0942\u0902': <gensim.models.word2vec.Vocab object at 0x7fdaa2411e50>, u'\u0935\u094d\u0939\u0921\u092a\u0923': <gensim.models.word2vec.Vocab object at 0x7fdaa2411e90>, u'\u0915\u0930\u092a\u093e\u091a\u0947': <gensim.models.word2vec.Vocab object at 0x7fdaa2411ed0>, u'\u0938\u0902\u0917\u0923\u0915\u0940': <gensim.models.word2vec.Vocab object at 0x7fdaa2411f10>, u'\u091c\u092e\u093e': <gensim.models.word2vec.Vocab object at 0x7fdaa2411f50>, u'\u091a\u0930\u093f\u0924\u094d\u0930\u0935\u093e\u0928': <gensim.models.word2vec.Vocab object at 0x7fdaa2a95050>, u'2551': <gensim.models.word2vec.Vocab object at 0x7fdaa2917e50>, u'\u0932\u0915\u094d\u0937\u094d\u092e\u0940\u091a\u0940': <gensim.models.word2vec.Vocab object at 0x7fdaa241f050>, u'\u0927\u094b\u0902\u092a\u0930\u093e': <gensim.models.word2vec.Vocab object at 
0x7fdaa241f090>, u'10641': <gensim.models.word2vec.Vocab object at 0x7fdaa2c49f50>, u'\u0935\u0916\u0926\u093e\u0915': <gensim.models.word2vec.Vocab object at 0x7fdaa2d83cd0>, u'13826': <gensim.models.word2vec.Vocab object at 0x7fdaa309d2d0>, u'\u0938\u0902\u0935\u0938\u093e\u0930\u093e\u091a\u0947\u0930': <gensim.models.word2vec.Vocab object at 0x7fdaa28fc7d0>, u'\u0915\u093e\u0930\u093e\u0925\u094d\u092f\u093e\u091a\u0940': <gensim.models.word2vec.Vocab object at 0x7fdaa327ee90>, u'\u091b\u0924\u094d\u0930\u092a\u0924\u0940': <gensim.models.word2vec.Vocab object at 0x7fdaa241f210>, u'\u092e\u0941\u0933\u093e\u0935\u094d\u092f\u093e': <gensim.models.word2vec.Vocab object at 0x7fdaa241f250>, u'\u0935\u0947\u0936\u094d\u092f\u093e': <gensim.models.word2vec.Vocab object at 0x7fdaa241f290>, u'\u092a\u0941\u0928\u0930\u093e\u0935\u0943\u0924\u094d\u09

Christopher S. Corley

Dec 25, 2015, 2:16:16 AM
to gensim

Looks like you're using Python 2. Try adding a u to your string literals, e.g. print new_model.vocab[u"अशूभ"]

Without the u prefix, Python 2 treats the literal as a byte string, which is why the KeyError shows \x escapes. The model's vocabulary keys are unicode strings (they show up with \u escapes when you print them), so the byte string never matches.
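To make the mismatch concrete, here is a minimal sketch that does not need gensim. It assumes the source file is UTF-8 encoded; the plain dict `vocab` and the helper `to_unicode` are hypothetical stand-ins for illustration, not part of the gensim API.

```python
# -*- coding: utf-8 -*-
# Hypothetical stand-in for model.vocab: a dict whose keys are unicode strings.
word = u"\u0905\u0936\u0942\u092d"  # the word from the traceback, as unicode
vocab = {word: 0}

# In a UTF-8 encoded Python 2 source file, a plain "..." literal holds these bytes.
# They are the same bytes that appear in the KeyError message.
as_bytes = word.encode("utf-8")


def to_unicode(token, encoding="utf-8"):
    """Return token as a unicode string, decoding it first if it is bytes."""
    if isinstance(token, bytes):
        return token.decode(encoding)
    return token


# The byte string is not a key, but its decoded form is.
found_with_bytes = as_bytes in vocab
found_with_unicode = to_unicode(as_bytes) in vocab
```

Decoding input to unicode once, at the point where it enters the program, is usually less error-prone than remembering the u prefix on every literal.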

Chris.

Gaurish Thakkar

Dec 25, 2015, 7:30:00 AM
to gensim
Thank you, Christopher, that worked like a charm.

Suraj Vantigodi

Jul 18, 2016, 6:37:22 AM
to gensim
Hi,

I am building machine translation models for different languages using TensorFlow. So far I have built them for French and German. However, I am stuck on English to Hindi and vice versa. Can you please tell me what steps are involved in training a model for English-to-Hindi translation? I don't know how to get the data, and Hindi has non-ASCII characters.