Following the post linked below, I encoded my input as UTF-8, but doing so raises a TypeError:
https://stackoverflow.com/questions/20362993/how-to-load-sentences-into-python-gensim
When I remove the encoding, it seems to work.
Can you please clarify whether the input should be encoded and, if so, how?
Thanks.
document = "That includes payors. These exclude merchants."
# train word2vec on first doc
sent1 = document.split('.')
sentences = [s.encode('utf-8').split() for s in sent1]
print(sentences)
This prints:
[[b'That', b'includes', b'payors'], [b'These', b'exclude', b'merchants'], []]
from gensim.models import Word2Vec as wtv
w2v_model = wtv(sentences, min_count=1)
This raises the following error:
Traceback (most recent call last):
  File "nlp9a.py", line 11, in <module>
    w2v_model = wtv(sent2, min_count=1)
  File "/Ajax/.local/lib/python3.5/site-packages/gensim/models/word2vec.py", line 551, in __init__
    self.build_vocab(sentences, trim_rule=trim_rule)
  File "/Ajax/.local/lib/python3.5/site-packages/gensim/models/word2vec.py", line 634, in build_vocab
    self.finalize_vocab(update=update) # build tables & arrays
  File "/Ajax/.local/lib/python3.5/site-packages/gensim/models/word2vec.py", line 869, in finalize_vocab
    self.reset_weights()
  File "/Ajax/.local/lib/python3.5/site-packages/gensim/models/word2vec.py", line 1304, in reset_weights
    self.wv.syn0[i] = self.seeded_vector(self.wv.index2word[i] + str(self.seed))
TypeError: can't concat bytes to str
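
For reference, the failure seems to be reproducible without gensim. The last frame of the traceback concatenates a vocabulary token with str(self.seed), and on Python 3 that only works when the token is a str, not bytes. The sketch below is a minimal illustration under that assumption; tokenize and can_concat_with_seed are hypothetical helper names made up for this example, not part of gensim.

```python
# Minimal sketch, no gensim required. It mimics the concatenation on the last
# traceback line, where a vocabulary token is joined with str(self.seed).
document = "That includes payors. These exclude merchants."


def tokenize(text, encode=False):
    """Split text into sentences on '.', then into whitespace tokens.

    Hypothetical helper for illustration only. Empty sentences are dropped.
    """
    sentences = []
    for s in text.split('.'):
        # With encode=True the tokens are bytes, as in the failing code above.
        tokens = s.encode('utf-8').split() if encode else s.split()
        if tokens:
            sentences.append(tokens)
    return sentences


def can_concat_with_seed(token, seed=1):
    """Return True if token + str(seed) succeeds, False on TypeError."""
    try:
        token + str(seed)
        return True
    except TypeError:
        return False


str_sentences = tokenize(document)
bytes_sentences = tokenize(document, encode=True)

print(str_sentences)
print(can_concat_with_seed(str_sentences[0][0]))    # str token
print(can_concat_with_seed(bytes_sentences[0][0]))  # bytes token
```

If this reading is right, it would explain why dropping .encode('utf-8') makes the error go away on Python 3: the tokens stay as str, so the concatenation with the seed succeeds.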