Streamed Restartable Iterables - Details (for corpus streaming)

Jeff Winchell

Nov 21, 2023, 12:02:29 PM

to Gensim

The docs say
"Note the sentences iterable must be restartable (not just a generator), to allow the algorithm to stream over your dataset multiple times."

Does this mean that the training method for word2vec (and, I guess, FastText) requires the ability to do a "get previous", or does it mean it needs a "goto first" method, or a "go back (N items)" method?

I need to write a custom iterable class to stream the data, so I need to know which specific methods (including their signatures) the training method will call on that class.

Gordon Mohr

Nov 21, 2023, 6:17:57 PM

to Gensim

It need only implement the Python *iterable* interface: it must be able to start an iteration (return an iterator) multiple times, and each iterator is used for a single forward pass.

No random/backwards/indexed access is required as part of Python's iteration interfaces. See the 1st sentence of the Python docs for iterables here: https://wiki.python.org/moin/Iterator

See the source for the included `LineSentence` class as an example of what an iterable interface to your own data might look like:

https://github.com/piskvorky/gensim/blob/6e7ec6a4d8801095dd6084b18da72db2f2b8ac78/gensim/models/word2vec.py#L2070
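
As a rough illustration of the idea (not a copy of Gensim's source), a file-backed iterable can be sketched as below. The class name `WhitespaceTokenizedLines`, the temporary demo file, and the choice to skip blank lines are all assumptions made for this example; the point is only that `__iter__` re-opens the file on every call, so each pass starts from the first line.

```python
import os
import tempfile


class WhitespaceTokenizedLines:
    """Hypothetical, simplified iterable: one sentence per line of a text file."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Re-open the file on every call, so each pass starts at the first line.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                tokens = line.split()
                if tokens:  # skipping blank lines is a choice made for this sketch
                    yield tokens


# Tiny demo corpus written to a temporary file (illustrative data only).
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("the quick brown fox\njumps over the lazy dog\n")
    corpus_path = tmp.name

corpus = WhitespaceTokenizedLines(corpus_path)
first_pass = list(corpus)
second_pass = list(corpus)
print(first_pass == second_pass)  # both passes see the same sentences

os.remove(corpus_path)
```

Only `__iter__` is defined here; nothing in the sketch supports indexing, rewinding, or stepping backwards.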

The common error here is that people provide a plain single-use *iterator*, which will work once as a sort of degenerate quasi-iterable, but then provide nothing when the Gensim model needs extra front-to-back passes.

If your class were named `MyIterable` (and it should probably have a more-descriptive name), the following code would print the same number twice – but if you make the most-common error, the second number will instead be zero:

my_corpus = MyIterable(...whatever your parameters are...)

print(sum(1 for _ in my_corpus)) # one iteration

print(sum(1 for _ in my_corpus)) # another iteration
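
A minimal sketch of that check, contrasting a bare generator with a class whose `__iter__` starts a fresh generator each time. The names `sentence_generator`, `RestartableSentences`, and `count_twice` are made up for illustration, and the in-memory list stands in for whatever real data source is used.

```python
def sentence_generator(lines):
    """A generator function: each call returns a single-use iterator."""
    for line in lines:
        yield line.split()


class RestartableSentences:
    """Hypothetical wrapper: every __iter__ call starts a fresh generator."""

    def __init__(self, lines):
        self.lines = lines

    def __iter__(self):
        return sentence_generator(self.lines)


def count_twice(corpus):
    """Count the items seen in two successive passes over the same object."""
    return sum(1 for _ in corpus), sum(1 for _ in corpus)


raw_lines = ["a b c", "d e", "f"]

single_use_counts = count_twice(sentence_generator(raw_lines))
restartable_counts = count_twice(RestartableSentences(raw_lines))

print(single_use_counts)   # (3, 0): the generator is exhausted after one pass
print(restartable_counts)  # (3, 3): each pass gets a new iterator
```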

- Gordon

Jeff Winchell

Nov 21, 2023, 6:31:09 PM

to Gensim

This code seemed to work (though I was surprised my generator needed to parse the string into words vs using LineSentence data which seems to operate on a sentence as a single string input).

Gordon, I looked at some of your other messages related to this topic and tried to incorporate them into this short code. Does anything look like it could fail in some obvious usage?

class Sentences(list):
    def __init__(self,cursor,SQL,params):
        self.cursor=cursor
        self.SQL=SQL
        self.params=params
    def __iter__(self):
        self.cursor.execute(self.SQL,self.params)
        return self.MyGenerator()
    def MyGenerator(self):
        row=self.cursor.fetchone()
        while row is not None:
            yield row[0].lower().split()
            row=self.cursor.fetchone()

import gensim
import pyodbc
DB = pyodbc.connect(r'Driver={SQL Server};Server=(local);Database=Corpora;Trusted_Connection=yes;',autocommit=True)
DB_Link=DB.cursor()
params=(42)
SQL='Select Sentence From Sentence Where Corpus_Id=? Order By Sentence_Number'
myiterable=Sentences(DB_Link,SQL,params)
model = gensim.models.Word2Vec(sentences=myiterable, vector_size=300, sg=1, window=10, min_count=5, workers=4, epochs=10)
DB.close()
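
For readers who want to try the same pattern without a SQL Server instance, here is a hedged sketch that swaps in the standard-library `sqlite3` module for `pyodbc`. It is not the code from this message: the class name `SqlSentences`, the in-memory table, and the sample rows are assumptions for illustration. It also opens a new cursor on every pass rather than sharing one, which is one possible design choice and not something the thread prescribes.

```python
import sqlite3


class SqlSentences:
    """Hypothetical restartable iterable over one text column of a query."""

    def __init__(self, connection, sql, params=()):
        self.connection = connection
        self.sql = sql
        self.params = params

    def __iter__(self):
        # A fresh cursor per pass, so one iteration cannot disturb another.
        cursor = self.connection.cursor()
        try:
            cursor.execute(self.sql, self.params)
            for (text,) in cursor:
                yield text.lower().split()
        finally:
            cursor.close()


# Illustrative in-memory data that mimics the table used in the query above.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE Sentence (Corpus_Id INTEGER, Sentence_Number INTEGER, Sentence TEXT)"
)
connection.executemany(
    "INSERT INTO Sentence VALUES (?, ?, ?)",
    [(42, 1, "Hello World"), (42, 2, "Second sentence here"), (7, 1, "Other corpus")],
)

sentences = SqlSentences(
    connection,
    "SELECT Sentence FROM Sentence WHERE Corpus_Id = ? ORDER BY Sentence_Number",
    (42,),
)
pass_counts = (sum(1 for _ in sentences), sum(1 for _ in sentences))
first_sentence = list(sentences)[0]
print(pass_counts)
print(first_sentence)

connection.close()
```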

Gordon Mohr

Nov 23, 2023, 8:14:25 AM

to Gensim

Is the code failing? If so, how?

Is there an error?

Does it pass or fail the test I suggested previously (showing the same count of items in 2 successive test iterations)?

If you write code to examine the first few items from an iteration, are they each of the type the class `Word2Vec` expects (a list of single-word strings)?
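
One way to run that inspection is sketched below with `itertools.islice`, which takes the first few items without consuming the rest of the pass. The helper name `looks_like_token_lists` and the sample data are hypothetical. Note that a bare string is itself iterable character by character in Python, so a corpus of raw strings can slip through unnoticed unless the item type is checked.

```python
from itertools import islice


def looks_like_token_lists(corpus, sample_size=3):
    """Return True if the first few items are lists of str tokens."""
    sample = list(islice(corpus, sample_size))
    return bool(sample) and all(
        isinstance(sentence, list)
        and all(isinstance(token, str) for token in sentence)
        for sentence in sample
    )


good_corpus = [["hello", "world"], ["second", "sentence"]]
bad_corpus = ["hello world", "second sentence"]  # raw strings, not token lists

print(looks_like_token_lists(good_corpus))  # True
print(looks_like_token_lists(bad_corpus))   # False
```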

It's easier to review code with a clear explanation of what's failing. (It's also easier if the usual Python naming/spacing conventions are used, and if the names of variables & types are functionally descriptive.)

- Gordon

Jeff Winchell

Nov 23, 2023, 8:20:51 AM

to gen...@googlegroups.com

I do not understand why you are asking those questions in response to my message and its question.

Gordon Mohr

Nov 23, 2023, 10:11:32 AM

to Gensim

OK, then I suppose I didn't understand why you were asking me for generic review of already-working code, in some non-Gensim database setup with which I'm unfamiliar. That's not really on-topic here, unless there's a describable problem using or understanding Gensim's interfaces.

In my experience, question-askers will sometimes word things like you have when there's still some unmentioned issue they're struggling to remedy. For example, the vague "code seemed to work" (in contrast to just "code worked by criteria X") can often mean code compiled/ran but gave fishy results.

If your code provides what Gensim expects, it'll work.

My genuine advice is:

- It's better to write test cases than ask me, someone who doesn't have a full picture of your project, to speculatively envision possible failure modes by sight-reading code snippets. (The simple "does the iteration successfully report the same number of items when run twice" test is one example of a good test case for your custom data adapters.)

- When asking me to review Python code fragments, it can help me help you if you use descriptive naming & strong conformance to Python conventions.

- Depending on the size/volatility of your data & your performance requirements, it may even be better to instead write code that dumps the relevant textual data to a local plain text file – in the format where `LineSentence` itself can be used for local file IO – rather than to deal with (remote?) database overhead every iteration.
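
The last suggestion can be sketched roughly as follows. The helper name `dump_sentences`, the sample data, and the temporary path are assumptions for illustration; the file layout (one sentence per line, tokens separated by single spaces) is intended to match the whitespace-delimited format that `LineSentence` reads, which should be confirmed against the Gensim documentation for the version in use.

```python
import os
import tempfile


def dump_sentences(sentences, path):
    """Write one sentence per line, tokens separated by single spaces."""
    count = 0
    with open(path, "w", encoding="utf-8") as handle:
        for tokens in sentences:
            handle.write(" ".join(tokens) + "\n")
            count += 1
    return count


# Stand-in for already-selected, already-tokenized rows from the database.
tokenized = [["hello", "world"], ["second", "sentence", "here"]]
dump_path = os.path.join(tempfile.mkdtemp(), "corpus.txt")
written = dump_sentences(tokenized, dump_path)

# Read the file back to confirm the round trip preserves the tokens.
with open(dump_path, encoding="utf-8") as handle:
    round_trip = [line.split() for line in handle]

print(written, round_trip == tokenized)
```

The resulting path could then be handed to `gensim.models.word2vec.LineSentence(dump_path)` and that object passed as the `sentences` argument, so the database is queried once rather than on every training pass.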

- Gordon

Jeff Winchell

Nov 23, 2023, 12:21:20 PM

to gen...@googlegroups.com

On Thu, Nov 23, 2023 at 10:11 AM Gordon Mohr <goj...@gmail.com> wrote:

OK, then I suppose I didn't understand why you were asking me for generic review of already-working code,

in some non-Gensim database setup with which I'm unfamiliar

"import pyodbc
DB = pyodbc.connect(r'Driver={SQL Server};Server=(local);Database=Corpora;Trusted_Connection=yes;',autocommit=True)"

I was not aware that the most widely used database library in Python was too obscure for this listserv.

it may even be better to instead write code that dumps the relevant textual data to a local plain text file – in the format where `LineSentence` itself can be used for local file IO – rather than to deal with (remote?) database overhead every iteration.

"Server=(local)"

I am not sure why you guessed remote. Word2vec normally is used with very large datasets to be useful. So having a very large amount of data stored TWICE on a local disk is generally highly inefficient.

"Driver={SQL Server}"

Using a file system vs the fastest dbms is also generally slower

I noted in the emails in this group that it was over a decade ago when the primary code author talked about wanting to have a DBMS interface to load the data. Nothing was done. From the comments above it seems that Gensim is still concerned with effectively managing huge amounts of data, or my questions and responses would be unnecessary.

The code I gave was not obscure. It was very understandable and I did not hide anything relevant to my questions in my original message so assuming I did rather than just asking questions is uncalled for.

I do not care about arbitrary coding "standards", only that the code is readable and understandable. But if the reader knows very little about DBMSs or why most people use them, then I guess the code may not be understandable (like the obvious local keyword in my code). Nor would it become understandable by using some "Python code naming standards".

Gordon Mohr

Nov 24, 2023, 8:05:12 PM

to Gensim

I suspect if you were to test your conjecture that "Using a file system vs the fastest dbms is also generally slower" with the kinds of simple textual datasets used in Gensim's algorithms, you would discover that your conjecture is very wrong.

In my experience, programmer time/efficiency is worth 10x to 1000x the cost of storing data "TWICE" on disk. And, on the sorts of jobs common with Gensim, keeping interim already-selected, already-tokenized copies of text corpora around is often a big win.

Your code snippet's generic-naming & non-Pythonic spacing made it less "readable and understandable" to the most relevant audience: me, the person you sent the code to, asking for free help. Further, it would not meet the minimum code standards of any paid coding I've ever done, nor any open-source projects to which I've ever contributed.

YMMV, but even if you "do not care about arbitrary coding 'standards'", when collaborating with or requesting advice from others, it is courteous to meet them halfway.