Newbie - type error training model.

Numeric Lee

Jan 23, 2018, 4:02:25 PM
to gensim

Following this Stack Overflow post, I encoded my input as UTF-8, but training then raises a TypeError:
https://stackoverflow.com/questions/20362993/how-to-load-sentences-into-python-gensim

When I remove the encoding, it seems to work.
Can you please clarify whether, and how, I should encode the input?
Thanks

document = "That includes payors. These exclude merchants."
# train word2vec on first doc
sent1 = document.split('.')
sentences = [s.encode('utf-8').split() for s in sent1]
print(sentences)

Returns
[[b'That', b'includes', b'payors'], [b'These', b'exclude', b'merchants'], []]

from gensim.models import Word2Vec as wtv
w2v_model = wtv(sentences, min_count=1)


Returns this error

Traceback (most recent call last):
  File "nlp9a.py", line 11, in <module>
    w2v_model = wtv(sent2,  min_count=1)
  File "/Ajax/.local/lib/python3.5/site-packages/gensim/models/word2vec.py", line 551, in __init__
    self.build_vocab(sentences, trim_rule=trim_rule)
  File "/Ajax/.local/lib/python3.5/site-packages/gensim/models/word2vec.py", line 634, in build_vocab
    self.finalize_vocab(update=update)  # build tables & arrays
  File "/Ajax/.local/lib/python3.5/site-packages/gensim/models/word2vec.py", line 869, in finalize_vocab
    self.reset_weights()
  File "/Ajax/.local/lib/python3.5/site-packages/gensim/models/word2vec.py", line 1304, in reset_weights
    self.wv.syn0[i] = self.seeded_vector(self.wv.index2word[i] + str(self.seed))

TypeError: can't concat bytes to str

Nitheen Rao T

Jan 23, 2018, 5:23:14 PM
to gensim
Your tokens are bytes objects; please convert them to strings.

Returns
[[b'That', b'includes', b'payors'], [b'These', b'exclude', b'merchants'], []]
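
To illustrate the point: the last frame of the traceback concatenates each vocabulary token with `str(self.seed)`, and in Python 3 that fails when the token is a `bytes` object. Here is a minimal sketch that reproduces the failure without gensim. The variable names and the seed value are illustrative assumptions, not taken from the library.

```python
token = b"payors"   # a bytes token, as produced by s.encode('utf-8').split()
seed = 1            # stand-in for the model seed (assumed value)

try:
    token + str(seed)          # bytes + str is not allowed in Python 3
    concat_failed = False
except TypeError:
    concat_failed = True

# Decoding the token back to str makes the same concatenation valid.
fixed = token.decode("utf-8") + str(seed)
print(concat_failed, fixed)
```

So the tokens passed to Word2Vec should be `str`, not `bytes`.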

Ivan Menshikh

Jan 24, 2018, 1:40:56 AM
to gensim
Hello Numeric Lee,

Please replace `sentences = [s.encode('utf-8').split() for s in sent1]` with `sentences = [s.split() for s in sent1]` and everything will work fine (I checked it with Python 3.6).
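
For reference, the suggested change applied to the snippet from the question might look like the sketch below. Dropping empty sentences (the trailing one after the final '.') is an extra step assumed here, not part of the suggestion itself, and the gensim call is left commented out so the snippet runs without gensim installed.

```python
document = "That includes payors. These exclude merchants."

# Split into sentences on '.', then into whitespace-separated str tokens.
# Empty sentences (such as the trailing one after the final '.') are dropped.
sentences = [s.split() for s in document.split('.') if s.strip()]
print(sentences)

# With gensim installed, training would then use the same call as in the question:
# from gensim.models import Word2Vec
# w2v_model = Word2Vec(sentences, min_count=1)
```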

Hetal Gandhi

Jan 2, 2019, 6:00:15 AM
to Gensim
The solution works fine if sent1 is a list of sentences that are not encoded.
But if we need word vectors for another language whose text is in Unicode, how do we handle the error `TypeError: can't concat bytes to str`?

Screenshots of the error are attached: Typeerror1.png, Typeerror2.png
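
One possible way to handle this, sketched under the assumption that the corpus arrives as UTF-8 encoded bytes: in Python 3 a `str` is already Unicode, so non-English text needs no `.encode()` call, and any `bytes` input can be decoded to `str` before tokenizing. The helper name `to_str_tokens` and the sample sentences are hypothetical and only serve the illustration.

```python
# Sample non-English input, deliberately stored as UTF-8 bytes (assumed encoding).
raw_sentences = [
    "यह एक वाक्य है".encode("utf-8"),
    "Это предложение".encode("utf-8"),
]

def to_str_tokens(sentence, encoding="utf-8"):
    """Return whitespace-separated str tokens, decoding bytes input first."""
    if isinstance(sentence, bytes):
        sentence = sentence.decode(encoding)
    return sentence.split()

sentences = [to_str_tokens(s) for s in raw_sentences]
print(sentences)
```

The resulting lists of `str` tokens can be passed to Word2Vec in the same way as the English example.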