Word2Vec with phrases : train() called with an empty iterator

er.pra...@gmail.com

Mar 27, 2017, 9:43:52 AM
to gensim
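    from gensim import models
    from gensim.models.phrases import Phrases, Phraser
    from gensim.models.word2vec import Text8Corpus

    sentences = Text8Corpus('/home/prakhar/text8')
    phrases = Phrases(Text8Corpus('/home/prakhar/text8'), min_count=1, threshold=2)
    bigram = Phraser(phrases)
    model = models.word2vec.Word2Vec(bigram[sentences], size=200, workers=4, min_count=1)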


The logger output while running this code:

2017-03-27 18:33:23,366 : INFO : training model with 4 workers on 677776 vocabulary and 200 features, using sg=0 hs=0 sample=0.001 negative=5 window=5
2017-03-27 18:33:24,319 : INFO : expecting 1701 sentences, matching count from corpus used for vocabulary survey
2017-03-27 18:33:25,170 : WARNING : train() called with an empty iterator (if not intended, be sure to provide a corpus that offers restartable iteration = an iterable).


This is clearly not the desired outcome, as the following results show:

    model.wv.most_similar(positive=['woman', 'king'], negative=['man'])

    [(u'davies_welsh', 0.3605641722679138),
     (u'add_ins', 0.3399544656276703),
     (u'kings_landing', 0.3140672445297241),
     (u'the_cordillera', 0.30870741605758667),
     (u'giant_anteater', 0.30382204055786133),
     (u'analog_clocks', 0.30148613452911377),
     (u'back_together', 0.30050382018089294),
     (u'ionych', 0.2958505153656006),
     (u'be_true', 0.29267528653144836),
     (u'particle_physicists', 0.2917472720146179)]



Gordon Mohr

Mar 27, 2017, 2:50:44 PM
to gensim
The `bigram[sentences]` syntax from Phraser (or even Phrases) only creates an iterator for a single phrase-combining pass over `sentences`.

Word2Vec needs an Iterable object that can be iterated over multiple times – once for vocabulary-discovery, then again for multiple (default 5) training passes. You'll get this error if after making the 1st pass, the iterator you passed in has been exhausted, and can't restart for another pass. 
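
The difference between a one-shot iterator and a restartable iterable can be seen without gensim at all. Below is a minimal sketch in plain Python; `one_shot` and `Restartable` are hypothetical names made up for illustration, standing in for a single-pass result and for a proper corpus class respectively.

```python
def one_shot(texts):
    """A generator: it can be consumed only once."""
    for text in texts:
        yield text


class Restartable(object):
    """An iterable: every call to __iter__ starts a fresh pass."""
    def __init__(self, texts):
        self.texts = texts

    def __iter__(self):
        return one_shot(self.texts)


texts = [["new", "york"], ["machine", "learning"]]

gen = one_shot(texts)
first_pass = list(gen)    # consumes the generator
second_pass = list(gen)   # nothing left: the "empty iterator" situation

corpus = Restartable(texts)
pass_a = list(corpus)     # full pass
pass_b = list(corpus)     # full pass again, because __iter__ restarts
```

Here `second_pass` comes back empty while `pass_b` is complete, which mirrors the vocabulary pass succeeding and the later training passes finding nothing.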

Some options:

(1) For smaller corpuses that fit in memory, you can turn the single iteration into an in-memory list:

    corpus = list(bigram[sentences])

This has the added benefit of only doing the phrase-combining calculations once, which might speed later passes.

(2) For larger corpuses, you might want to write your own iterable wrapper that re-executes the `bigram[sentences]` code to create a single-pass iterator every time a new iteration is requested. Roughly the following should work:

    class PhrasingIterable(object):
        def __init__(self, phrasifier, texts):
            self.phrasifier, self.texts = phrasifier, texts
        def __iter__(self):
            return self.phrasifier[self.texts]

Then you'd pass Word2Vec a corpus of `PhrasingIterable(bigram, sentences)`.

(3) Similarly for larger corpuses, you might want to write the phrase-combined texts to a new text file or files, which are then re-read with a proper IO-based iterable (such as Text8Corpus itself, or the LineSentence class, defined a few lines below Text8Corpus in the same module). This also has the benefit of only doing the phrase-combining once.
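
As a rough illustration of option (3), here is a sketch that uses only the standard library. The names `combine_phrases`, `LineCorpus`, and `known_bigrams` are hypothetical stand-ins invented for this example: `combine_phrases` plays the role of the phrase-combining step, and `LineCorpus` plays the role of a file-backed iterable such as LineSentence.

```python
import os
import tempfile


def combine_phrases(tokens, bigrams):
    """Hypothetical stand-in for phrase-combining: join known pairs with '_'."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in bigrams:
            out.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


class LineCorpus(object):
    """Restartable iterable: re-opens the file on every pass."""
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, "r", encoding="utf-8") as handle:
            for line in handle:
                yield line.split()


sentences = [["i", "love", "new", "york"], ["new", "york", "is", "big"]]
known_bigrams = {("new", "york")}

# Do the phrase-combining exactly once, writing one sentence per line.
path = os.path.join(tempfile.mkdtemp(), "phrased.txt")
with open(path, "w", encoding="utf-8") as handle:
    for tokens in sentences:
        handle.write(" ".join(combine_phrases(tokens, known_bigrams)) + "\n")

corpus = LineCorpus(path)
first_pass = list(corpus)
second_pass = list(corpus)   # same content again: the file is simply re-read
```

With gensim itself, the loop would presumably iterate over `bigram[sentences]` instead of calling the stand-in function, and the written file would be handed to LineSentence before being passed to Word2Vec.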

- Gordon

er.pra...@gmail.com

Mar 27, 2017, 8:23:11 PM
to gensim
Thanks for clarifying

Abhishek Dubey

Aug 23, 2017, 3:45:07 PM
to gensim
Hey Gordon,

But when I use the class below:

class PhrasingIterable(object):

    def __init__(self, phrasifier, texts):
        self.phrasifier, self.texts = phrasifier, texts

    def __iter__(self):
        return self.phrasifier[self.texts]

with Python 3.x, I get:

    TypeError: iter() returned non-iterator of type 'TransformedCorpus'

I know about the `__next__` vs. `next` difference between Python 3.x and 2.x, but how do we fix it here?

Gordon Mohr

Aug 23, 2017, 4:25:15 PM
to gensim
Note that it's usually better to follow the approach numbered (3) above: write the phrase-ified corpus somewhere, then read that for efficiency and simplicity. 

I'd need a lot more context about what you're attempting – or a simple fully-self-contained example of how to trigger it – to know how to interpret the new, different error you're reporting.

- Gordon

Abhishek Dubey

Aug 24, 2017, 3:15:34 AM
to gensim
from __future__ import unicode_literals, print_function
from gensim.parsing import PorterStemmer
from spacy.en import English
from gensim.models import Word2Vec, Phrases, phrases, KeyedVectors
from nltk.corpus import stopwords
from nltk import word_tokenize
from nltk import tokenize
import string
import re
import os


stemmer = PorterStemmer()
stopwords = stopwords.words('english')
nlp = English() #nlp = spacy.load("en")
data_dir_path = "full_path"

base_dir = os.path.dirname(data_dir_path)
os.chdir(base_dir)

class Stemming(object):
    word_lookup = {}
   
    @classmethod
    def stem(cls, word):
        stemmed = stemmer.stem(word)
        if stemmed not in cls.word_lookup:
            cls.word_lookup[stemmed] = {}
        cls.word_lookup[stemmed][word] = (
            cls.word_lookup[stemmed].get(word, 0) + 1)
        return stemmed
 
    @classmethod
    def original_form(cls, word):
        if word in cls.word_lookup:
            return max(cls.word_lookup[word].keys(),
                       key=lambda x: cls.word_lookup[word][x])
        else:
            return word

class SentenceClass(object):
    def __init__(self, dirname):
        self.dirname = dirname
 
    def __iter__(self):
        for fname in os.listdir(self.dirname):
            with open(os.path.join(self.dirname,fname), 'r') as myfile:
                doc = myfile.read().replace('\n', ' ')
                for sent in tokenize.sent_tokenize(doc.lower()):
                    yield [Stemming.stem(word)\
                    for word in word_tokenize(re.sub("[^A-Za-z]", " ",sent))\
                    if word not in stopwords]


class PhrasingIterable(object):
    def __init__(self, phrasifier, texts):
        self.phrasifier, self.texts = phrasifier, texts
    def __iter__(self):
        yield self.phrasifier[self.texts]


my_sentences = SentenceClass(data_dir_path)

my_phrases = Phrases(my_sentences, min_count=1)
my_corpus = PhrasingIterable(my_phrases,my_sentences)
model = Word2Vec(my_corpus, size=100, window=2, min_count=1, workers=2)


Hey Gordon,
Above is my complete code. The error I am currently getting is below; the code seems to be passing a list somewhere when it is supposed to pass a word.


  File "C:/Users/Adubey4/Desktop/rasagit/mycode/error_bigram.py", line 65, in <module>
    model = Word2Vec(my_corpus, size=100, window=2, min_count=1, workers=2)

  File "C:\Anaconda3\lib\site-packages\gensim\models\word2vec.py", line 503, in __init__
    self.build_vocab(sentences, trim_rule=trim_rule)

  File "C:\Anaconda3\lib\site-packages\gensim\models\word2vec.py", line 577, in build_vocab
    self.scan_vocab(sentences, progress_per=progress_per, trim_rule=trim_rule)  # initial survey

  File "C:\Anaconda3\lib\site-packages\gensim\models\word2vec.py", line 601, in scan_vocab
    vocab[word] += 1

TypeError: unhashable type: 'list'

Abhishek Dubey

Aug 24, 2017, 3:19:16 AM
to gensim
To update my actual question:
When I replace `yield` with `return` in the `PhrasingIterable` class, I get the error I mentioned earlier.

Updated function:


    class PhrasingIterable(object):
        def __init__(self, phrasifier, texts):
            self.phrasifier, self.texts = phrasifier, texts
        def __iter__(self):
            return self.phrasifier[self.texts]

Error:


  File "<ipython-input-146-8c8b59b0c842>", line 1, in <module>
    model = Word2Vec(my_corpus, size=100, window=2, min_count=1, workers=4)


  File "C:\Anaconda3\lib\site-packages\gensim\models\word2vec.py", line 503, in __init__
    self.build_vocab(sentences, trim_rule=trim_rule)

  File "C:\Anaconda3\lib\site-packages\gensim\models\word2vec.py", line 577, in build_vocab
    self.scan_vocab(sentences, progress_per=progress_per, trim_rule=trim_rule)  # initial survey

  File "C:\Anaconda3\lib\site-packages\gensim\models\word2vec.py", line 589, in scan_vocab
    for sentence_no, sentence in enumerate(sentences):

Gordon Mohr

Aug 24, 2017, 8:16:00 PM
to gensim
A good minimal example would trigger the error without recourse to any outside dataset, or even other libraries/steps (like the stemming you're doing). 

Additionally, it could help to add code that prints checks confirming each step has done what's expected, before continuing with the next. (As one example, does the `my_phrases` object behave as expected before wrapping it up for later steps?)

In your original message, you mentioned Python 2 vs 3 differences – are you suggesting this code worked in Python 2 but not Python 3? Or have all your tests been in 3?

- Gordon 

Mahmood Kohansal

Sep 26, 2017, 3:18:13 AM
to gensim
Hey Abhishek,

Did you find a solution for this error?
I want to train a model the same way you do: first using phrases, then Word2Vec training.

Gordon Mohr

Sep 26, 2017, 1:48:14 PM
to gensim
From your other post describing the same `TypeError: iter() returned non-iterator of type 'TransformedCorpus'` error after working from the example in this thread, I now see that my example code earlier in this thread does the wrong thing with *its* `__iter__()` return line.

It should not be `return`ing the raw phrasifier result, but one that has already been started as an iterator object, by use of the `iter()` built-in function. That is, the `PhrasingIterable` example up-thread should have read:

    class PhrasingIterable(object):
        def __init__(self, phrasifier, texts):
            self.phrasifier, self.texts = phrasifier, texts
        def __iter__(self):
            return iter(self.phrasifier[self.texts])  # <-- this line fixed
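
The three `__iter__` variants seen in this thread can be compared in plain Python 3, without gensim. This is only a sketch: `Transformed` is a hypothetical stand-in for an object like TransformedCorpus that is iterable but is not itself an iterator, and the three wrapper class names are made up for illustration.

```python
class Transformed(object):
    """Stand-in for TransformedCorpus: iterable, but not an iterator."""
    def __init__(self, texts):
        self.texts = texts

    def __iter__(self):
        for text in self.texts:
            yield text


class ReturnsIterable(object):
    def __iter__(self):
        return Transformed([["a", "b"], ["c"]])        # not an iterator


class ReturnsIterator(object):
    def __iter__(self):
        return iter(Transformed([["a", "b"], ["c"]]))  # wrapped with iter()


class YieldsWhole(object):
    def __iter__(self):
        yield Transformed([["a", "b"], ["c"]])         # one item: the whole corpus


try:
    list(ReturnsIterable())
    error = None
except TypeError as exc:
    error = str(exc)      # Python 3 rejects the non-iterator return value

fixed = list(ReturnsIterator())   # one token list per sentence, as intended
whole = list(YieldsWhole())       # a single "sentence" whose items are token lists
```

The `yield` variant hands over the entire corpus as if it were one sentence, so each "word" in it is a list of tokens. That would be consistent with the `unhashable type: 'list'` error reported up-thread when the vocabulary scan tried to count those "words".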

- Gordon