Problem: I am not able to limit the maximum number of lines read from a text corpus and used to train a word2vec model.
Please consider the following code, where txt_file_path points to a text file on the local filesystem.
import gensim

class MyCorpus:
    def __init__(self, filepath, max_counter=None):
        self.filepath = filepath
        self.counter = 0
        self.max_counter = max_counter

    def __iter__(self):
        for line in open(self.filepath):
            self.counter += 1
            print(f"returning line {self.counter}, max is {self.max_counter}")
            if (self.max_counter is not None) and (self.counter == self.max_counter):
                print("returning")
                return  # intended to stop the iteration
            yield line

sentences = MyCorpus(txt_file_path, max_counter=2)
model = gensim.models.Word2Vec(sentences)
The output I get is:
returning line 1, max is 2
returning line 2, max is 2
returning
returning line 3, max is 2
returning line 4, max is 2
returning line 5, max is 2
...
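A possible reading of this output, assuming Word2Vec iterates over the corpus more than once (one pass to build the vocabulary, then one pass per training epoch): self.counter lives on the instance and is never reset, so the == check can only succeed during the first pass. The sketch below reproduces the same pattern without gensim. The class name CountingCorpus and the in-memory list of lines are hypothetical stand-ins for the file-based corpus, and the print calls are left out.

```python
class CountingCorpus:
    """Hypothetical stand-in for MyCorpus that reads from a list instead of a file."""

    def __init__(self, lines, max_counter=None):
        self.lines = lines
        self.counter = 0  # kept on the instance, never reset between passes
        self.max_counter = max_counter

    def __iter__(self):
        for line in self.lines:
            self.counter += 1
            if self.max_counter is not None and self.counter == self.max_counter:
                return  # ends the current pass only
            yield line


corpus = CountingCorpus(["a", "b", "c", "d"], max_counter=2)
first_pass = list(corpus)   # counter reaches 2 and the pass stops there
second_pass = list(corpus)  # counter continues from 3 and never equals 2 again
print(first_pass, second_pass)
```

If that assumption holds, the first pass stops as intended and every later pass reads the whole input, which matches the counter continuing at 3, 4, 5 after "returning".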
I also tried:
1) break in place of return
2) raise StopIteration in place of return
3) return False
4) return None
None of these worked:
- 1), 3) and 4) did not change anything.
- 2) raised an exception that was not caught or handled.
How can I limit the number of lines read from the file in the code above, without using any built-in function or parameter? I know I could use gensim.models.word2vec.LineSentence(source, max_sentence ..., but I would rather not. What does gensim.models.Word2Vec expect from the iterator in order to stop reading?
Is there anyone who can help me?
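One way the limit could be made to hold on every pass, assuming Word2Vec iterates over the corpus several times, is to keep the counter local to __iter__ so that each new iteration starts from zero. This is a minimal sketch, not a confirmed fix: the names LimitedCorpus, max_lines and demo_path are hypothetical, and the temporary file only exists to make the example runnable without gensim.

```python
import os
import tempfile


class LimitedCorpus:
    """Restartable corpus yielding at most max_lines lines per pass (hypothetical name)."""

    def __init__(self, filepath, max_lines=None):
        self.filepath = filepath
        self.max_lines = max_lines

    def __iter__(self):
        count = 0  # local to this pass, so every new iteration starts from zero
        with open(self.filepath) as handle:
            for line in handle:
                if self.max_lines is not None and count >= self.max_lines:
                    return  # a plain return ends the generator; no special value is needed
                count += 1
                yield line


# Demonstration on a throwaway file, iterating twice as a multi-pass consumer would.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("one\ntwo\nthree\nfour\n")
    demo_path = tmp.name

corpus = LimitedCorpus(demo_path, max_lines=2)
first_pass = list(corpus)
second_pass = list(corpus)
os.remove(demo_path)
print(first_pass, second_pass)
```

Both passes yield the same two lines here. Note that the check runs before the yield, so exactly max_lines lines are returned, whereas the original check would stop one line short. Word2Vec also expects each sentence to be a list of tokens, so the yielded line would probably need to be split (for example with line.split()) before training.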