Newbie - type error training model.

Numeric Lee

Jan 23, 2018, 4:02:25 PM

to gensim

Following the post below, I encoded my input as UTF-8, but training the model then raises a TypeError:

https://stackoverflow.com/questions/20362993/how-to-load-sentences-into-python-gensim

When I remove the encoding, it seems to work.

Could you please clarify whether the input should be encoded and, if so, how?

Thanks

document = "That includes payors. These exclude merchants."
# train word2vec on first doc
sent1 = document.split('.')
sentences = [s.encode('utf-8').split() for s in sent1]
print(sentences)

This prints:
[[b'That', b'includes', b'payors'], [b'These', b'exclude', b'merchants'], []]

from gensim.models import Word2Vec as wtv
w2v_model = wtv(sentences,  min_count=1)

This raises the following error:

Traceback (most recent call last):
  File "nlp9a.py", line 11, in <module>
    w2v_model = wtv(sent2,  min_count=1)
  File "/Ajax/.local/lib/python3.5/site-packages/gensim/models/word2vec.py", line 551, in __init__
    self.build_vocab(sentences, trim_rule=trim_rule)
  File "/Ajax/.local/lib/python3.5/site-packages/gensim/models/word2vec.py", line 634, in build_vocab
    self.finalize_vocab(update=update)  # build tables & arrays
  File "/Ajax/.local/lib/python3.5/site-packages/gensim/models/word2vec.py", line 869, in finalize_vocab
    self.reset_weights()
  File "/Ajax/.local/lib/python3.5/site-packages/gensim/models/word2vec.py", line 1304, in reset_weights
    self.wv.syn0[i] = self.seeded_vector(self.wv.index2word[i] + str(self.seed))

TypeError: can't concat bytes to str
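
The last frame of the traceback concatenates a vocabulary word with `str(self.seed)`. The following is a minimal sketch, with no gensim involved and assuming Python 3, of why that expression fails for a bytes token but not for a str token. The `seed` value and the variable names are illustrative only, and the exact wording of the TypeError message differs between Python versions.

```python
# Illustration only: mimics the `word + str(seed)` expression seen in the traceback.
seed = 1  # hypothetical seed value, chosen for the example

str_token = "payors"
bytes_token = str_token.encode("utf-8")

# str + str works as expected.
combined = str_token + str(seed)

# bytes + str raises TypeError in Python 3.
try:
    bytes_token + str(seed)
    failed = False
except TypeError:
    failed = True

print(combined, failed)
```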

Nitheen Rao T

Jan 23, 2018, 5:23:14 PM

to gensim

The tokens in the list you printed are bytes objects, not strings. Please convert them to str before passing them to the model.

[[b'That', b'includes', b'payors'], [b'These', b'exclude', b'merchants'], []]

Ivan Menshikh

Jan 24, 2018, 1:40:56 AM

to gensim

Hello Numeric Lee,

Please replace `sentences = [s.encode('utf-8').split() for s in sent1]` with `sentences = [s.split() for s in sent1]` and everything will work fine (I checked it with Python 3.6).
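
Put together, the suggested change might look like the sketch below. The helper names `tokenize` and `train` are hypothetical and not part of gensim; dropping the empty token list left after the final period is an extra step that the original code does not perform. The `train` function is defined but not called here, since it needs gensim installed.

```python
def tokenize(document):
    """Split a document into sentences on '.', then into str tokens."""
    sentences = [s.split() for s in document.split('.')]
    # Drop empty token lists, such as the one left after the final '.'.
    return [tokens for tokens in sentences if tokens]


def train(sentences):
    """Train Word2Vec on lists of str tokens; requires gensim to be installed."""
    from gensim.models import Word2Vec
    return Word2Vec(sentences, min_count=1)


document = "That includes payors. These exclude merchants."
sentences = tokenize(document)
print(sentences)
```

Calling `train(sentences)` afterwards passes only str tokens to the model, which is the form the reply above recommends.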

Hetal Gandhi

Jan 2, 2019, 6:00:15 AM

to Gensim

The solution works fine if sent1 is a list of sentences that are not encoded.

But if we need word vectors for another language whose text is in Unicode format, how do we handle the error "TypeError: can't concat bytes to str"?

[Error screenshot not preserved]
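
On the question of text in other languages: in Python 3, str is already Unicode, so such text can normally be tokenized as-is. If the data arrives as bytes (for example, read from a file opened in binary mode), decoding it before splitting appears to be the likely remedy. The sketch below assumes the bytes are UTF-8 encoded; `to_text` is a hypothetical helper, not a gensim function, and the sample sentences are made up for illustration.

```python
def to_text(value, encoding="utf-8"):
    """Return `value` as str, decoding it first if it is bytes."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


raw_sentences = [
    "यह एक वाक्य है".encode("utf-8"),  # bytes input (assumed UTF-8)
    "Это второе предложение",          # already str, left unchanged
]

# Every token ends up as str, regardless of how the sentence arrived.
sentences = [to_text(s).split() for s in raw_sentences]
print(sentences)
```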
