How do I load the word2vec models?

Elijah Rippeth

Jul 7, 2021, 11:35:43 AM
to MultiLexNorm
Hi all --

I'm trying to load the pretrained word2vec models as provided here. Unfortunately, gensim fails to load them:

>>> from gensim.models import KeyedVectors
>>> model = KeyedVectors.load_word2vec_format('sl.bin', binary=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/erippeth/miniconda3/envs/tmp/lib/python3.7/site-packages/gensim/models/keyedvectors.py", line 1632, in load_word2vec_format
    limit=limit, datatype=datatype, no_header=no_header,
  File "/Users/erippeth/miniconda3/envs/tmp/lib/python3.7/site-packages/gensim/models/keyedvectors.py", line 1910, in _load_word2vec_format
    fin, kv, counts, vocab_size, vector_size, datatype, unicode_errors, binary_chunk_size,
  File "/Users/erippeth/miniconda3/envs/tmp/lib/python3.7/site-packages/gensim/models/keyedvectors.py", line 1805, in _word2vec_read_binary
    kv, counts, chunk, vocab_size, vector_size, datatype, unicode_errors)
  File "/Users/erippeth/miniconda3/envs/tmp/lib/python3.7/site-packages/gensim/models/keyedvectors.py", line 1786, in _add_bytes_to_kv
    word = chunk[start:i_space].decode("utf-8", errors=unicode_errors)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe3 in position 98: unexpected end of data

They might be usable directly with the original Google code, but I thought I'd check to see if I'm missing something obvious. :-)

Thanks,
Elijah

Elijah Rippeth

Jul 7, 2021, 12:03:05 PM
to MultiLexNorm
It looks like we can just ignore the Unicode decoding errors and all is OK.

>>> model = KeyedVectors.load_word2vec_format('sl.bin', binary=True, unicode_errors='ignore')
>>> model['haha'].shape
(400,)

robvanderg

Jul 8, 2021, 8:37:52 AM
to MultiLexNorm
Hi Elijah, thanks for posting the answer as well! There seems to be some mismatch in character handling between the original word2vec and gensim.
