How to limit the number of lines read from corpus

thistlillo

Sep 29, 2021, 9:37:16 AM

to Gensim

Problem: I am not able to limit the maximum number of lines read from a text corpus and used to train a word2vec model.

Please consider the following code. filepath points to a text file in the local filesystem.

class MyCorpus:
    def __init__(self, filepath, max_counter=None):
        self.filepath = filepath
        self.counter = 0
        self.max_counter = max_counter

    def __iter__(self):
        for line in open(self.filepath):
            self.counter += 1
            print(f"returning line {self.counter}, max is {self.max_counter}")
            if (self.max_counter is not None) and (self.counter == self.max_counter):
                print("returning")
                return  # exit loop
            yield line

sentences = MyCorpus(txt_file_path, max_counter=2)
model = gensim.models.Word2Vec(sentences)

The output I get is:

returning line 1, max is 2

returning line 2, max is 2

returning

returning line 3, max is 2

returning line 4, max is 2

returning line 5, max is 2

...

I tried also:

1) break in place of return

2) raise StopIteration in place of return

3) return False

4) return None

without success:

1), 3) and 4) did not change anything

2) raising StopIteration caused an exception, meaning that it was not caught and handled.

How can I limit the number of lines read from the file in the code above, without using any built-in function or parameter? I could use gensim.models.word2vec.LineSentence(source, max_sentence ..., but I would rather not. What value does gensim.models.Word2Vec expect for terminating?

Is there anyone who can help me?

Gordon Mohr

Sep 30, 2021, 9:11:13 PM

to Gensim

Gensim's `LineSentence`, with its optional `limit` parameter, will suffice for this. For example:

sentences = LineSentence(txt_file_path, limit=2)

It will additionally `.split()` the line strings on whitespace, so that `Word2Vec` gets the *individual word-tokens* it expects. If you simply `yield` the full `line` instead, you're passing single strings into `Word2Vec`. That's *not* what it expects, and it will see those as *lists-of-single-character-strings*.
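To illustrate the point about tokens, here is a small sketch in plain Python (no Gensim needed) of what a consumer sees when it iterates over a raw line versus a whitespace-split line. The sample string is made up for the example.

```python
# A made-up sample line, standing in for one line of the corpus file.
line = "the quick brown fox\n"

# Iterating over a raw string yields single characters...
as_characters = list(line)

# ...whereas .split() yields the word tokens (and drops the newline).
as_tokens = line.split()

print(as_characters[:5])  # ['t', 'h', 'e', ' ', 'q']
print(as_tokens)          # ['the', 'quick', 'brown', 'fox']
```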

Why don't you want to use that?

I think your code has managed to create an *iterator* (with somewhat odd behavior after the 1st iteration hits a `return`) but not a true *iterable* (which restarts every time a caller begins a new iteration). But resolving general Python iterator/iterable design issues would be better raised on a generic Python forum, or StackOverflow - since in Gensim, `LineSentence` should already meet the typical user need.
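For readers who still want a hand-written class, the following is a minimal sketch of a restartable iterable, not a definitive implementation. One likely reading of the output in the question is that `self.counter` lives on the instance and is never reset: `Word2Vec` makes several passes over the corpus (a vocabulary scan, then the training epochs), so after the first pass stops at 2, later passes count 3, 4, 5, ... and never equal `max_counter` again. The class name `LimitedCorpus`, the `limit` parameter and the throwaway demo file are all hypothetical names introduced for this example.

```python
import itertools
import os
import tempfile


class LimitedCorpus:
    """Restartable iterable over at most `limit` lines of a text file (hypothetical helper)."""

    def __init__(self, filepath, limit=None):
        self.filepath = filepath
        self.limit = limit

    def __iter__(self):
        # No counter is stored on self, so every call starts a fresh pass.
        with open(self.filepath, encoding="utf-8") as handle:
            # islice stops after `limit` lines; limit=None means "no limit".
            for line in itertools.islice(handle, self.limit):
                # Yield a list of word tokens, not the raw string.
                yield line.split()


# Demo on a throwaway file with made-up content.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("first line here\nsecond line here\nthird line here\n")
    demo_path = tmp.name

corpus = LimitedCorpus(demo_path, limit=2)
first_pass = list(corpus)
second_pass = list(corpus)  # a second pass restarts from the top of the file
os.remove(demo_path)

print(first_pass)                 # [['first', 'line', 'here'], ['second', 'line', 'here']]
print(first_pass == second_pass)  # True
```

An object like this could then be passed to `gensim.models.Word2Vec` in place of `MyCorpus(...)`, though `LineSentence(txt_file_path, limit=2)` already covers the same need.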