Is word2vec trained on documents or on sentences?


Phil

Jul 27, 2021, 11:01:03 AM
to Gensim

I am reading the Train your model section of the tutorial "Word2Vec Model".

The section says

"All that is required is that the input yields one sentence (list of utf8 words) after another."

Also the variable is called "sentences":

> sentences = MyCorpus()

But, inspecting it, the iterator seems to return a "tokenized document", not a "tokenized sentence".

For instance, the first item returned by the iterator is:

> sentences = MyCorpus()
> sent_list = list(sentences)
> print(sent_list[0])

[ ... 'hill', 'top', 'new', 'blaze', .... ]

While the real document is:

> corpus_path = datapath('lee_background.cor')
> with open(corpus_path) as fh:
>    lines = fh.readlines()
> print(lines[0])

" ... the town of Hill Top. A new blaze ...."

So it looks like utils.simple_preprocess(line) is not splitting on sentences (that is, it does not split on full stops).
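
For reference, the iterator in the tutorial looks roughly like this (quoting from memory, so details may differ slightly):

> from gensim import utils
> from gensim.test.utils import datapath
>
> class MyCorpus:
>     """Yield one tokenized document (a list of str) per line of the corpus."""
>     def __iter__(self):
>         corpus_path = datapath('lee_background.cor')
>         for line in open(corpus_path):
>             # simple_preprocess lowercases and tokenizes the whole line;
>             # it does not split on sentence boundaries
>             yield utils.simple_preprocess(line)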

My question has practical implications, because I have to train a word embedding, and NLTK has a sent_tokenize function, and I don't know whether I need to use it. It looks like I don't.

I think it was called a "document" in the previous tutorial; I don't know why it is now called a "sentence".

So I should return a "tokenized document", right?

Gordon Mohr

Jul 27, 2021, 3:05:24 PM
to Gensim
Word2Vec always trains on lists of string tokens. Various generations of code, docs, or online examples might call each such list-of-string-tokens a 'document' or a 'sentence' or something else, but it always needs a sequence where each item is a list-of-string-tokens.

The algorithm has no innate understanding of sentences, just runs-of-words that have distinct ends and thus don't 'run over' into neighbors. They might be sentences, paragraphs, documents, chapters, subsections, or other fragments – it usually doesn't make a big difference, as long as words appear in natural usage contexts.

For example, the `text8` & `text9` corpora used in some examples are just long, punctuation-free word-token runs from a subset of Wikipedia, with all sentence/paragraph/article breaks lost. Breaking that into arbitrary fixed-length texts still generates word-vectors capable of demonstrating many of their useful qualities.
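
(If I remember right, gensim even ships a small helper that does this sort of arbitrary chunking of the `text8` format – the path here is just a placeholder:)

> from gensim.models.word2vec import Text8Corpus
>
> # yields lists of at most ~10000 tokens each, split at arbitrary points
> texts = Text8Corpus('/path/to/text8')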

Alternatively, some word2vec training setups retain punctuation, including sentence-ending `.` characters, as tokens, turning them into pseudo-words, so those punctuation tokens receive 'word-vectors' too.

The one internal implementation limit to watch out for: no more than 10000 tokens per list. Any tokens beyond that will be ignored.
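
If some of your texts might exceed that, one simple workaround is to chunk them yourself before training. A rough sketch, where `corpus` stands in for your own iterable of token lists:

> def chunked(tokens, max_len=10000):
>     # yield successive slices of at most max_len tokens
>     for i in range(0, len(tokens), max_len):
>         yield tokens[i:i + max_len]
>
> safe_corpus = [chunk for doc in corpus for chunk in chunked(doc)]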

If you mistakenly provide a non-tokenized string instead of a list-of-tokens, Python's treatment of strings as sequences of one-character strings means you'll only get generally-useless vectors for single characters from the model.
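
For example, something like this toy illustration (using `min_count=1` only so the tiny corpus isn't pruned, and Gensim 4.x's `wv.index_to_key`):

> from gensim.models import Word2Vec
>
> # Wrong: a list of plain strings – each string gets iterated character by character
> bad = Word2Vec(["sarah was eating a sandwich"], min_count=1)
> print(bad.wv.index_to_key)   # single-character 'words' like 's', 'a', ' ', ...
>
> # Right: a list of lists-of-tokens
> good = Word2Vec([["sarah", "was", "eating", "a", "sandwich"]], min_count=1)
> print(good.wv.index_to_key)  # actual word tokens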

(The `simple_preprocess()` function is just one fairly-simple word-tokenizer, which is also oblivious to the idea of sentences.)

- Gordon

Phil

Jul 27, 2021, 4:14:56 PM
to Gensim
Thank you Gordon. Just so that I understand.

Suppose I have the following story:

"Sarah was eating a sandwich. Mary was drinking a beer. Michael was driving."

According to the documentation, the constructor for Word2Vec takes an argument called `sentences` that must be an "iterable of iterables".

In order to build `sentences` from the story above, I can do either (just one document)

> sentences = [['Sarah', 'was', 'eating', 'a', 'sandwich', 'Mary', 'was', 'drinking', 'a', 'beer', 'Michael', 'was', 'driving']]

or (three different documents, the actual sentences)

> sentences = [['Sarah', 'was', 'eating', 'a', 'sandwich'], ['Mary', 'was', 'drinking', 'a', 'beer'], ['Michael', 'was', 'driving']]

and, as you explained, there isn't much difference between these two tokenizations, in terms of the resulting word embedding and its performance, right?

Gordon Mohr

Jul 27, 2021, 7:25:52 PM
to Gensim
Yes, over an adequately-sized corpus, there shouldn't be much difference between those two choices – just so long as no item has more than 10000 tokens.

If in fact adjacent texts are logically related – as when, for example, a bunch of short sentences come from the same paragraph, about the same topic – I suppose there might be a *slight* benefit to keeping them together, so that the last words of one sentence intentionally wind up in the context windows of the first words of the next sentence, and vice-versa. But I wouldn't expect a strong effect either way, and as the texts become longer and the overall corpus data more plentiful, there will be plenty of meaningful windows within each text, and these sorts of boundary decisions matter even less.

In the API documentation, `sentences` is just a variable name, not a binding description of what's expected. (And, for consistency with some other changes, the intent was for that variable's name to change from `sentences` to `corpus_iterable` in Gensim 4.0, to be more generic and to mirror the alternative `corpus_file`. But, that renaming wasn't completed before release.)
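
So in practice you can pass your corpus under either framing – a rough sketch, assuming Gensim 4.x parameter names (`min_count=1` only because your example corpus is tiny):

> from gensim.models import Word2Vec
>
> # An in-memory iterable of token lists (the constructor still calls it `sentences`)
> model = Word2Vec(sentences=sentences, vector_size=100, min_count=1)
>
> # Or a preprocessed file on disk: one text per line, tokens separated by whitespace
> # model = Word2Vec(corpus_file='/path/to/corpus.txt', vector_size=100)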

- Gordon