How to limit the number of lines read from corpus

thistlillo

Sep 29, 2021, 9:37:16 AM
to Gensim
Problem: I am not able to limit the maximum number of lines read from a text corpus and used to train a word2vec model.

Please consider the following code, where txt_file_path points to a text file on the local filesystem.

class MyCorpus:
    def __init__(self, filepath, max_counter=None):
        self.filepath = filepath
        self.counter = 0
        self.max_counter = max_counter
        
    def __iter__(self):
        for line in open(self.filepath):                
            self.counter += 1
            print(f"returning line {self.counter}, max is {self.max_counter}")
            
            if (self.max_counter is not None) and (self.counter == self.max_counter):
                print("returning")
                return  # exit loop
            
            yield line

sentences = MyCorpus(txt_file_path, max_counter=2)
model = gensim.models.Word2Vec(sentences)

The output I get is:

returning line 1, max is 2 
returning line 2, max is 2 
returning
returning line 3, max is 2 
returning line 4, max is 2 
returning line 5, max is 2
...

I also tried:
1) break in place of return
2) raise StopIteration in place of return 
3) return False 
4) return None

without success:

1), 3) and 4) did not change anything
2) raising StopIteration produced an uncaught exception, meaning that it was not caught and handled.

How can I limit the number of lines read from the file in the code above without using any built-in functions or parameters? I could use gensim.models.word2vec.LineSentence(source, max_sentence ..., but I would rather not. What value does gensim.models.Word2Vec expect as a signal to stop?

Is there anyone who can help me? 

Gordon Mohr

Sep 30, 2021, 9:11:13 PM
to Gensim
Gensim's `LineSentence`, with its optional `limit` parameter, will suffice for this. For example:

    sentences = LineSentence(txt_file_path, limit=2)

It will additionally `.split()` the line strings on whitespace, so that `Word2Vec` gets the *individual word-tokens* it expects. If you simply `yield` the full `line` instead, you're passing single strings into `Word2Vec`. That's *not* what it expects, and it will see those as *lists-of-single-character-strings*.
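
To illustrate the difference, here is a small plain-Python sketch (no Gensim required; the sample line is made up) of what a consumer sees when it iterates over a raw line versus a whitespace-split line:

```python
# Hypothetical sample line, standing in for one line of the corpus file.
line = "the quick brown fox\n"

# Iterating over a raw string yields one character at a time.
as_characters = list(line.strip())

# Splitting on whitespace yields the individual word tokens.
as_tokens = line.split()

print(as_characters[:5])  # ['t', 'h', 'e', ' ', 'q']
print(as_tokens)          # ['the', 'quick', 'brown', 'fox']
```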

Why don't you want to use that? 

I think your code has managed to create an *iterator* (with somewhat odd behavior after the 1st iteration hits a `return`) but not a true *iterable* (which restarts every time a caller begins a new iteration). But general Python iterator/iterable design issues would be better raised on a generic Python forum, or Stack Overflow, since in Gensim, `LineSentence` should already meet the typical user need.
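
For anyone who still wants a hand-rolled class, a minimal sketch follows. It rests on one reading of the output above: `self.counter` is stored on the instance and never reset, while `Word2Vec` iterates over the corpus more than once (a vocabulary scan, then the training epochs), so later passes start counting past the maximum and never hit the `==` test. `LimitedCorpus` and its `limit` argument are hypothetical names for illustration, not part of Gensim, and the demo file is a throwaway.

```python
import itertools
import os
import tempfile


class LimitedCorpus:
    """Restartable iterable yielding at most `limit` tokenized lines per pass."""

    def __init__(self, filepath, limit=None):
        self.filepath = filepath
        self.limit = limit

    def __iter__(self):
        # No counter is kept on the instance, so every pass starts fresh.
        with open(self.filepath, encoding="utf-8") as handle:
            # islice stops after `limit` lines; limit=None means no limit.
            for line in itertools.islice(handle, self.limit):
                # Yield a list of word tokens, not the raw string.
                yield line.split()


# Demo on a temporary three-line file.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("a b c\nd e f\ng h i\n")
    demo_path = tmp.name

corpus = LimitedCorpus(demo_path, limit=2)
first_pass = list(corpus)
second_pass = list(corpus)  # a second iteration restarts from line 1
os.remove(demo_path)

print(first_pass)
print(second_pass)
```

Both passes should return the same first two lines, which is the behavior a caller that iterates repeatedly relies on.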

- Gordon