UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte


santosh.b...@gmail.com

Jul 18, 2020, 3:22:33 PM

to Gensim

Hello, I just trained a word2vec model using the following code:

model = gensim.models.Word2Vec(
    corpus_sentences,
    min_count=1,
    workers=6,
    size=100,
    window=5,
    iter=50,
)

model.wv.save('word2vec_s100_w5_m1_i50_v2.bin')

It went through without any error.

Now, when I try to load it, I get an error:

model = gensim.models.KeyedVectors.load_word2vec_format(model_file, binary=True, unicode_errors='ignore')

2020-07-18 21:17:45,143 : INFO : loading projection weights from /media/santoshbs/GanWD3T/pCloud/2-Gan-Own/27-Gan-NamedEntity/compustat/word2vec/out/word2vec_s100_w5_m1_i50_v2.bin

Traceback (most recent call last):
  File "<ipython-input-7-7e63ae0c5d11>", line 1, in <module>
    model = gensim.models.KeyedVectors.load_word2vec_format(model_file, binary=True, unicode_errors='ignore')
  File "/anaconda3/envs/env_spacy/lib/python3.7/site-packages/gensim/models/keyedvectors.py", line 1549, in load_word2vec_format
    limit=limit, datatype=datatype)
  File "/anaconda3/envs/env_spacy/lib/python3.7/site-packages/gensim/models/utils_any2vec.py", line 276, in _load_word2vec_format
    header = utils.to_unicode(fin.readline(), encoding=encoding)
  File "/anaconda3/envs/env_spacy/lib/python3.7/site-packages/gensim/utils.py", line 368, in any2unicode
    return unicode(text, encoding, errors=errors)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte

I would appreciate your help in resolving this issue.

Thanks and regards
sbs

Gordon Mohr

Jul 19, 2020, 12:44:35 PM

to Gensim

If you save a model using the gensim native `.save()`, it should be loaded using the gensim native `.load()` on the same class as was saved. (The `.load_word2vec_format()` method is for loading plain vector sets in another format, as if saved by `.save_word2vec_format()` or from other non-gensim tools.)

So try:

w2v_model = gensim.models.Word2Vec.load(model_file)

And if you just want the word-vectors:

kv_model = w2v_model.wv

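(And if you only want to save the word-vectors, you can call `w2v_model.wv.save(FILENAME)` and later load them with `kv_model = KeyedVectors.load(FILENAME)`. Note that the code in the original post called `model.wv.save(...)`, so this second pairing looks like the one that matches that file.)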
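
For anyone wondering where byte 0x80 comes from: as far as I understand, gensim's native `.save()` pickles the object, and pickle protocol 2 and later start the stream with byte 0x80, while the traceback above shows `load_word2vec_format()` decoding the first line of the file as a UTF-8 text header. Here is a minimal standard-library sketch of that mismatch. The dictionary is only a hypothetical stand-in for the saved vectors, and no gensim call is involved.

```python
import pickle

# Hypothetical stand-in for the saved object; this only illustrates the
# pickle byte layout, not gensim's actual file contents.
payload = pickle.dumps({"vector_size": 100}, protocol=2)

# Roughly what a readline() on the file would hand to the header parser.
first_line = payload.split(b"\n", 1)[0]

try:
    first_line.decode("utf-8")
    message = None
except UnicodeDecodeError as err:
    # 0x80 is a UTF-8 continuation byte, so it cannot start a character.
    message = str(err)

print(payload[:1])
print(message)
```

This would also explain why `unicode_errors='ignore'` made no difference: in the traceback, the header line is decoded without that argument being passed along.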