Streaming files - TypeError: can only concatenate tuple (not "str") to tuple


isaacs...@gmail.com

Apr 11, 2018, 7:35:01 PM
to gensim
Hello,

I am trying to stream files in a directory and generate a word embedding model from them.

I have a DataSet class:

class DataSet:

    def __init__(self, dir, verbose, categories):
        self.dir = dir
        self.verbose = verbose
        self.dictionary = None
        self.categories = categories
        self.type = None

    @staticmethod
    def iter_documents():
        """
        Generator: iterate over all relevant documents
        :return: yields one document (=list of utf8 tokens) at a time
        """
        for root, dirs, files in os.walk(DIR_PROCESSED):
            for fname in filter(lambda fname: fname.endswith('.txt'), files):
                document = open(os.path.join(root, fname)).read()
                yield gensim.utils.tokenize(document, errors='ignore')

    def __iter__(self):
        """
        __iter__ is a generator => Dataset is a streamed iterable
        :return: sparse dictionary
        """
        for tokens in DataSet.iter_documents():
            yield self.dictionary.doc2bow(tokens)



The files in 'self.dir' are plain .txt files that I have already pre-processed.

corpus = DataSet(...)
model = gensim.models.Word2Vec(corpus, size=dim, window=5, workers=workers)


I receive this error:

Traceback (most recent call last):
  File "/Users/isaacsultan/Code/MedEmbed/medembed/main.py", line 53, in <module>
    main()
  File "/Users/isaacsultan/Code/MedEmbed/medembed/main.py", line 49, in main
    embedding.generate(dataset, args.model, args.dim, args.workers)
  File "/Users/isaacsultan/Code/MedEmbed/medembed/embedding.py", line 31, in generate
    model = gensim.models.Word2Vec(corpus, size=dim, window=5, workers=workers)  # mincount
  File "/anaconda/envs/py36/lib/python3.6/site-packages/gensim/models/word2vec.py", line 527, in __init__
    fast_version=FAST_VERSION)
  File "/anaconda/envs/py36/lib/python3.6/site-packages/gensim/models/base_any2vec.py", line 335, in __init__
    self.build_vocab(sentences, trim_rule=trim_rule)
  File "/anaconda/envs/py36/lib/python3.6/site-packages/gensim/models/base_any2vec.py", line 486, in build_vocab
    self.trainables.prepare_weights(self.hs, self.negative, self.wv, update=update, vocabulary=self.vocabulary)
  File "/anaconda/envs/py36/lib/python3.6/site-packages/gensim/models/word2vec.py", line 1402, in prepare_weights
    self.reset_weights(hs, negative, wv)
  File "/anaconda/envs/py36/lib/python3.6/site-packages/gensim/models/word2vec.py", line 1419, in reset_weights
    wv.vectors[i] = self.seeded_vector(wv.index2word[i] + str(self.seed), wv.vector_size)
TypeError: can only concatenate tuple (not "str") to tuple

Printing wv.index2word[i] shows that it is a tuple of ints, e.g. (0, 12).

What could be causing this, please?

Ivan Menshikh

Apr 12, 2018, 2:42:24 AM
to gensim
Hello,

The problem here is the "doc2bow" call: Word2Vec expects an iterable of lists of tokens, not documents in BoW format.
Your code will work correctly if you replace "yield self.dictionary.doc2bow(tokens)" with "yield tokens".
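
A minimal sketch of that change, under a few assumptions: the class name TokenStream, the regex tokenizer, and the directory argument are hypothetical stand-ins for the DataSet class, gensim.utils.tokenize, and DIR_PROCESSED from the original post. doc2bow returns (token_id, count) pairs, which is consistent with the (0, 12) tuples seen in wv.index2word, whereas Word2Vec wants the tokens themselves. Each document is also wrapped in list(), on the assumption that the training code may call len() on a sentence, which a lazy generator would not support.

```python
import os
import re


def simple_tokenize(text):
    """Stand-in tokenizer (assumption): runs of alphabetic characters."""
    return re.findall(r"[^\W\d_]+", text)


class TokenStream:
    """Restartable iterable that yields one list of tokens per .txt file."""

    def __init__(self, directory, tokenizer=simple_tokenize):
        self.directory = directory
        self.tokenizer = tokenizer

    def __iter__(self):
        # os.walk is re-run on every pass, so the corpus can be iterated
        # more than once (vocabulary scan, then training epochs).
        for root, _dirs, files in os.walk(self.directory):
            for fname in sorted(files):
                if not fname.endswith(".txt"):
                    continue
                path = os.path.join(root, fname)
                with open(path, encoding="utf8", errors="ignore") as fh:
                    text = fh.read()
                # Yield the tokens themselves, not dictionary.doc2bow(tokens).
                # list() turns a lazy tokenizer result into a real list.
                yield list(self.tokenizer(text))
```

An instance could then be passed where the original code passed the DataSet, for example gensim.models.Word2Vec(TokenStream(DIR_PROCESSED), size=dim, window=5, workers=workers), with gensim.utils.tokenize supplied as the tokenizer argument if its behaviour is preferred.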