UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte


santosh.b...@gmail.com

Jul 18, 2020, 3:22:33 PM
to Gensim
Hello, I just trained a word2vec model using the following code:

model = gensim.models.Word2Vec(corpus_sentences,
                               min_count=1,
                               workers=6,
                               size=100,
                               window=5,
                               iter=50)
model.wv.save('word2vec_s100_w5_m1_i50_v2.bin')

It went through without any error.

Now, when I try to load it, I get an error:
model = gensim.models.KeyedVectors.load_word2vec_format(model_file, binary=True, unicode_errors='ignore')

2020-07-18 21:17:45,143 : INFO : loading projection weights from /media/santoshbs/GanWD3T/pCloud/2-Gan-Own/27-Gan-NamedEntity/compustat/word2vec/out/word2vec_s100_w5_m1_i50_v2.bin
Traceback (most recent call last):

  File "<ipython-input-7-7e63ae0c5d11>", line 1, in <module>
    model = gensim.models.KeyedVectors.load_word2vec_format(model_file, binary=True, unicode_errors='ignore')

  File "/anaconda3/envs/env_spacy/lib/python3.7/site-packages/gensim/models/keyedvectors.py", line 1549, in load_word2vec_format
    limit=limit, datatype=datatype)

  File "/anaconda3/envs/env_spacy/lib/python3.7/site-packages/gensim/models/utils_any2vec.py", line 276, in _load_word2vec_format
    header = utils.to_unicode(fin.readline(), encoding=encoding)

  File "/anaconda3/envs/env_spacy/lib/python3.7/site-packages/gensim/utils.py", line 368, in any2unicode
    return unicode(text, encoding, errors=errors)

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte

I would appreciate your help in solving this issue.

Thanks and regards
sbs

Gordon Mohr

Jul 19, 2020, 12:44:35 PM
to Gensim
If you save a model using the gensim native `.save()`, it should be loaded using the gensim native `.load()` on the same class as was saved. (The `.load_word2vec_format()` method is for loading plain vector sets in another format, as if saved by `.save_word2vec_format()` or from other non-gensim tools.)
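To see where the `0x80` in the error comes from, here is a minimal standard-library sketch. It assumes, as I understand it, that gensim's native `.save()` is pickle-based; the dictionary below is only a hypothetical stand-in for a saved model. A pickle stream (protocol 2 or later) begins with byte `0x80`, and a reader that expects a word2vec-format text header on the first line fails as soon as it tries to decode that byte as UTF-8.

```python
import io
import pickle

# Stand-in for a natively saved object (hypothetical data; any picklable value works).
payload = pickle.dumps({"vectors": [0.1, 0.2]}, protocol=2)

# Pickle protocol 2 and later start with the PROTO opcode, which is byte 0x80.
first_byte = payload[0]

# A word2vec-format reader treats the first line of the file as a text header.
header_bytes = io.BytesIO(payload).readline()

try:
    header_bytes.decode("utf-8")
    error_message = None
except UnicodeDecodeError as exc:
    # Same message as in the traceback above.
    error_message = str(exc)

print(hex(first_byte))
print(error_message)
```

Note that in the traceback the header is decoded without an `errors` argument, so passing `unicode_errors='ignore'` would not be expected to help with this particular failure.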

So try:

    w2v_model = gensim.models.Word2Vec.load(model_file)

And if you just want the word-vectors:

    kv_model = w2v_model.wv

(And if you only wanted to save the word-vectors, you could have saved them with `w2v_model.wv.save(FILENAME)` and then loaded them with `kv_model = KeyedVectors.load(FILENAME)`.)

- Gordon