FastText showing vectors for single characters that version 3.2.0 didn't show


Jamie Brandon

Dec 13, 2019, 4:07:56 PM12/13/19
to Gensim
Hi there, 

I noticed that when I was using gensim version 3.2.0, I would get an error if I called

model.wv['xyz']
model.wv['💩']

or other short strings that were not present in the training data. This was especially noticeable with emojis: looking up an emoji would always raise an error if the training data contained only plain text. The error was
KeyError: 'all ngrams for word xyz absent from model'




I've upgraded to version 3.8.1, and I notice that when I train a model on text without emojis, the word-vector lookup now returns a vector even though the emoji was never present in the training data.
model.wv['💩']

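For reference, a fuller reproduction of what I'm seeing (tiny toy corpus just for illustration; my real training data is also emoji-free):

from gensim.models import FastText
from gensim.test.utils import common_texts  # small emoji-free corpus bundled with gensim

model = FastText(common_texts, size=10, min_count=1)  # gensim 3.x API
print(model.wv['💩'])  # raised KeyError under 3.2.0; returns a vector under 3.8.1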



Is this the desired behavior? How does it create this vector if it has no information about this character? 

Thanks in advance,
Jamie 

Gordon Mohr

Dec 13, 2019, 6:20:40 PM12/13/19
to Gensim
This is to match the native Facebook FastText code's behavior, which returns a vector (the origin point) even for OOV words where it couldn't possibly have learned any n-grams, because the token is shorter than the minimum n-gram size. (It will also return non-origin vectors for words containing only n-grams that weren't in the training data, because those n-grams will still map to slots in the collision-oblivious hashtable used for n-grams – and all such slots were either initialized to non-origin values, or trained by colliding n-grams.)
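For intuition, here's a rough sketch of how an OOV lookup assembles a vector from the hashed n-gram buckets. (Illustrative only, not the gensim API; the real implementations extract n-grams at the byte level for multi-byte characters, and `oov_vector` here is a hypothetical helper.)

import numpy as np

def ft_hash(ngram):
    # FNV-1a-style hash used by Facebook's fastText; each UTF-8 byte is
    # XORed in as a *signed* char (a quirk of the C++ original)
    h = 2166136261
    for b in ngram.encode('utf-8'):
        if b >= 128:
            b -= 256  # emulate the int8_t cast
        h = (h ^ (b & 0xFFFFFFFF)) & 0xFFFFFFFF
        h = (h * 16777619) & 0xFFFFFFFF
    return h

def oov_vector(word, ngram_vectors, min_n=3, max_n=6):
    # average the bucket rows that the padded word's character n-grams hash into
    padded = '<' + word + '>'
    ngrams = [padded[i:i + n]
              for n in range(min_n, max_n + 1)
              for i in range(len(padded) - n + 1)]
    if not ngrams:  # token too short to yield any n-grams
        return np.zeros(ngram_vectors.shape[1], dtype=np.float32)  # the origin
    rows = [ft_hash(ng) % len(ngram_vectors) for ng in ngrams]
    return ngram_vectors[rows].mean(axis=0)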

See the doc-comment for FastTextKeyedVectors.__contains__(), explaining that it always returns 'True':


(That should probably be noted in the class doc-comment, as the doc-comments for __special__ methods don't get put in the API docs.) 
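For example, with any trained FastText model:

print('💩' in model.wv)  # always True, by design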

See also the warning generated at:


Prior versions of gensim in some places attempted custom improvements to FastText behavior, which were sometimes unwise, or at least surprising, when they produced results that differed from the Facebook reference implementation. Going forward, matching the FB choices should remain the priority.

- Gordon

Jamie Brandon

Dec 16, 2019, 8:19:31 AM12/16/19
to Gensim
Makes much more sense now, thanks for your help, Gordon. 

Do I understand correctly that every unseen word returns the same vector, the origin? I would've expected the origin to be the zero vector, but perhaps it is a random initialization? Something else?
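Here's roughly the check I'm running:

import numpy as np

vec = model.wv['💩']
print(np.count_nonzero(vec))  # 0 here would mean I'm getting the all-zeros origin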

[Screenshot: Screen Shot 2019-12-16 at 8.11.23 AM.png]



I get the same for single letters that were not present in my training data set. 


[Screenshot: Screen Shot 2019-12-16 at 8.15.01 AM.png]



I can't seem to trigger the warning you linked to. Would you expect it to appear here, given that the corpus used to train the model had no emojis or single-character words?


Thanks again for your help,

Jamie

Gordon Mohr

Dec 16, 2019, 4:52:24 PM12/16/19
to Gensim
Are you sure you're actually testing code from the latest (or a recent) gensim (not some lingering older installation), and that nothing has set the logging-level to be less sensitive than WARNING? The warning code I highlighted was added no later than April 2019, so should definitely be present in gensim-3.7.3 and beyond. (And maybe earlier versions, too.)
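For example, something along these lines at the top of your session would rule both out:

import logging
import gensim

print(gensim.__version__)                   # confirm which gensim is actually loaded
logging.basicConfig(level=logging.WARNING)  # make sure WARNING-level messages are shown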

Looking at the code as it stands in gensim 3.8.1:

* A FastText set of vectors will always return a vector for any string passed to it, and will always return `True` when asked if a string is contained in the set.

* However, if the full string wasn't in training, and the string is so small it has no n-gram substrings (at least `min_n` in length), the vector returned should be the origin vector (all `0.0` dimensions), and a WARNING message should be logged (provided WARNING-level logging is enabled).

* If a string is long enough to have any n-grams, then even if neither the string nor its n-grams were seen during training, whatever n-grams it does have will hash to slots that were at the very least randomly initialized, so a non-origin vector will be returned.

(In each of these behaviors, gensim should be matching what Facebook's FastText code returns, when loading the same set of vectors from disk.)
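Concretely, with a model trained on an emoji-free corpus, I'd expect something like:

import numpy as np

print('💩' in model.wv)                        # always True
vec = model.wv['💩']                           # never raises for FastText vectors
print(np.allclose(vec, 0))                     # True only in the no-n-grams-at-all case (with a logged WARNING)
print(np.allclose(model.wv['unseenword'], 0))  # long enough for n-grams: expect False (non-origin)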

Given your results with the `'x'` and `'q'` letters, are you sure neither of those were supplied as words during training? (Is there any chance training happened with a `min_n=1`?)
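You could check both directly (note that `in model.wv` itself always answers True, so the vocab dict is the thing to inspect):

print(model.wv.min_n, model.wv.max_n)  # effective n-gram size range
print('x' in model.wv.vocab)           # True only if 'x' was a full word in training
print('q' in model.wv.vocab)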

I'm less sure of what might explain any discrepancies with unicode/emoji characters, especially if you're on Python 2.x. I suppose there *might* be a chance that n-grams are being split out of their multi-byte UTF-8 representations, though I'd hope not. (Also, from the spacing in the screenshots, it appears you might be testing the emoji plus one or more trailing whitespace characters. And I know some display-emoji are actually display-combinations of more than one underlying unicode character, which could lead to nonintuitive results, though I don't know if any of the emoji you've shown are such.)
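One quick way to rule those out is to inspect the exact codepoints of the string you're passing:

s = '💩'  # the exact string being looked up
print(len(s), [hex(ord(c)) for c in s])  # reveals trailing whitespace or multi-codepoint emoji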

- Gordon