How do I load the word2vec models?

Elijah Rippeth

Jul 7, 2021, 11:35:43 AM

to MultiLexNorm

Hi all --

I'm trying to load the pretrained w2v models as provided here. Unfortunately, gensim fails to load them:

>>> from gensim.models import KeyedVectors
>>> model = KeyedVectors.load_word2vec_format('sl.bin', binary=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/erippeth/miniconda3/envs/tmp/lib/python3.7/site-packages/gensim/models/keyedvectors.py", line 1632, in load_word2vec_format
    limit=limit, datatype=datatype, no_header=no_header,
  File "/Users/erippeth/miniconda3/envs/tmp/lib/python3.7/site-packages/gensim/models/keyedvectors.py", line 1910, in _load_word2vec_format
    fin, kv, counts, vocab_size, vector_size, datatype, unicode_errors, binary_chunk_size,
  File "/Users/erippeth/miniconda3/envs/tmp/lib/python3.7/site-packages/gensim/models/keyedvectors.py", line 1805, in _word2vec_read_binary
    kv, counts, chunk, vocab_size, vector_size, datatype, unicode_errors)
  File "/Users/erippeth/miniconda3/envs/tmp/lib/python3.7/site-packages/gensim/models/keyedvectors.py", line 1786, in _add_bytes_to_kv
    word = chunk[start:i_space].decode("utf-8", errors=unicode_errors)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe3 in position 98: unexpected end of data

The models might work directly with the original Google code, but I thought I'd check whether I'm missing something obvious. :-)

Thanks,

Elijah

Elijah Rippeth

Jul 7, 2021, 12:03:05 PM

to MultiLexNorm

It looks like we can just ignore the Unicode decoding errors, and the model then loads fine:

>>> model = KeyedVectors.load_word2vec_format('sl.bin', binary=True, unicode_errors='ignore')
>>> model['haha'].shape
(400,)
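
For readers wondering what `unicode_errors='ignore'` changes: the traceback above shows gensim passing that value as the `errors` argument of `bytes.decode`. The following is a minimal sketch of the three common handlers using only the standard library; the byte string `raw` and the helper `try_decode` are made up for illustration and are not taken from the model file.

```python
# A made-up token whose last character was cut off: 0xe3 opens a
# three-byte UTF-8 sequence, but the two continuation bytes are missing.
raw = "haha".encode("utf-8") + b"\xe3"


def try_decode(data: bytes, errors: str) -> str:
    """Decode bytes as UTF-8 with the given error handler (hypothetical helper)."""
    return data.decode("utf-8", errors=errors)


try:
    try_decode(raw, "strict")          # 'strict' is gensim's default
    strict_failed = False
except UnicodeDecodeError as exc:
    strict_failed = True
    print("strict failed:", exc.reason)

ignored = try_decode(raw, "ignore")    # silently drops the incomplete bytes
replaced = try_decode(raw, "replace")  # substitutes U+FFFD for them

print(repr(ignored), repr(replaced))
```

Note that with `'ignore'` a damaged token loses its broken tail silently, so it is worth spot-checking a few keys after loading; `'replace'` keeps a visible marker instead.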

robvanderg

Jul 8, 2021, 8:37:52 AM

to MultiLexNorm

Hi Elijah, thanks for posting the answer as well! There seems to be a mismatch in character handling between the original word2vec and gensim.
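
One plausible source of such a mismatch, offered here as an assumption and not something confirmed in this thread, is a tool that cuts long tokens at a fixed number of bytes without regard to character boundaries; the gensim documentation for `unicode_errors` mentions tokens truncated in the middle of a multibyte character as a reason to use `'ignore'` or `'replace'`. The sketch below reproduces that effect in isolation. The function `truncate_bytes`, the 98-byte limit, and the sample token are all hypothetical.

```python
def truncate_bytes(token: str, max_bytes: int) -> bytes:
    """Cut a token at a byte limit, ignoring character boundaries (hypothetical)."""
    return token.encode("utf-8")[:max_bytes]


# Each "あ" (U+3042) is three bytes in UTF-8: e3 81 82.
token = "あ" * 40                   # 120 bytes in total
cut = truncate_bytes(token, 98)     # 98 is not a multiple of 3

try:
    cut.decode("utf-8")             # strict decoding
    decodes_cleanly = True
except UnicodeDecodeError:
    decodes_cleanly = False         # the last character is incomplete

# Dropping the incomplete tail recovers the 32 characters that fit fully.
recovered = cut.decode("utf-8", errors="ignore")
print(decodes_cleanly, len(recovered))
```

A byte-oriented writer would treat `cut` as a perfectly valid token, while a strict UTF-8 reader rejects it, which is the kind of disagreement described above.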

On Wednesday, July 7, 2021 at 18:03:05 UTC+2, Elijah Rippeth wrote:
