Convert bigrams to vector form without pruning


Hamman Samuel

Mar 15, 2016, 9:38:49 AM
to gensim
Hi everyone, I'm new to gensim, and the main reason I've begun using it is that I want to get vector representations of phrases. Below is the code I have, which partially works for bigrams:

from gensim.models import word2vec
from gensim.models.phrases import Phrases

def bigram2vec(unigrams, bigram_to_search):
    # Learn bigram phrases from the tokenized sentences
    bigrams = Phrases(unigrams)
    # Train Word2Vec on the phrase-transformed sentences
    model = word2vec.Word2Vec(sentences=bigrams[unigrams], size=20, min_count=1, window=4, sg=1, hs=1, negative=0, trim_rule=None)
    if bigram_to_search in model.vocab:
        return model[bigram_to_search]
    else:
        return None

The problem is that the Word2Vec model does automatic pruning of some of the bigrams, i.e. len(model.vocab.keys()) != len(bigrams.vocab.keys()). I've tried adjusting various parameters such as trim_rule and min_count, but they don't seem to affect the pruning. Any thoughts? If I've missed anything in my explanation, please let me know.

PS - I am aware that bigrams to look up need to be represented using an underscore instead of a space, i.e. the proper way to call my function would be bigram2vec(unigrams, 'this_report').

Gordon Mohr

Mar 15, 2016, 6:28:08 PM
to gensim
The Phrases class has a `vocab` holding as many bigrams as it can remember co-occurrence stats for – *not* just the bigrams that pass the threshold-to-combine test. That is, it includes every bigram it has seen, including ones that its transformation function will never actually join. So you shouldn't expect its `vocab` length to be comparable/identical to that of a Word2Vec model – which only sees the output of the Phrases transformation.
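To illustrate the distinction, here's a rough, self-contained sketch (not gensim's actual code; the function name and defaults are made up for illustration) of how Phrases-style counting differs from the set of bigrams that actually get joined. All observed bigrams are counted, but only those clearing a score threshold – the same shape as the default Phrases score, (count(a,b) - min_count) * N / (count(a) * count(b)) – would be combined:

```python
from collections import Counter

def passing_bigrams(sentences, min_count=5, threshold=10.0):
    """Sketch of Phrases-style scoring: count ALL observed unigrams and
    bigrams, then keep only the bigrams whose score clears the threshold."""
    unigram_counts = Counter()
    bigram_counts = Counter()
    for sent in sentences:
        unigram_counts.update(sent)
        bigram_counts.update(zip(sent, sent[1:]))
    total_words = sum(unigram_counts.values())
    passing = {}
    for (a, b), n_ab in bigram_counts.items():
        # Same shape as the default Phrases score:
        # (count(a,b) - min_count) * N / (count(a) * count(b))
        score = (n_ab - min_count) * total_words / (unigram_counts[a] * unigram_counts[b])
        if n_ab >= min_count and score > threshold:
            passing[(a, b)] = score
    # bigram_counts plays the role of Phrases' full `vocab`;
    # `passing` is the (smaller) set the transformation would actually join.
    return bigram_counts, passing
```

Since Word2Vec only ever sees the joined output, its vocabulary reflects the `passing` set (plus unigrams), never the full `bigram_counts` – hence the length mismatch you observed.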

(As a separate note, the way that Phrases discards low-frequency items when it hits its `max_vocab_size` budget – at <https://github.com/piskvorky/gensim/blob/3d7293f3831250a47cf8e85a7f493a87c7d6ac32/gensim/models/phrases.py#L154> – is fairly crude. If that limit is hit during the initial scan, and especially if it's hit repeatedly, there will be a lot of imprecision in the final counts. If you have the memory, use as large a `max_vocab_size` as possible to minimize this effect. Also, some project pull-requests feature experimental alternative approaches using less-crude probabilistic approximate counts – so if this is an important function in your project, you may want to look at those.)
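The pruning Gordon describes behaves roughly like this (a simplified sketch under my own assumptions, not gensim's exact implementation): whenever the vocab grows past the budget, every entry counted fewer than `min_reduce` times is dropped outright, and `min_reduce` is raised so the next prune cuts deeper:

```python
def prune_vocab(vocab, min_reduce):
    """Crude pruning sketch: drop every entry whose count is below
    min_reduce. Counts of dropped items are lost for good, which is
    why a tight max_vocab_size budget distorts the final statistics."""
    for token in list(vocab):
        if vocab[token] < min_reduce:
            del vocab[token]

def add_sentence(vocab, sentence, max_vocab_size, state):
    # state["min_reduce"] grows each time pruning is triggered,
    # so repeated prunes discard progressively higher-count items.
    for token in sentence:
        vocab[token] = vocab.get(token, 0) + 1
    if len(vocab) > max_vocab_size:
        prune_vocab(vocab, state["min_reduce"])
        state["min_reduce"] += 1
```

A word that is genuinely frequent overall, but whose occurrences are spread out, can be dropped mid-scan and re-counted from zero, which is exactly the imprecision a larger `max_vocab_size` avoids.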

- Gordon