Hi everyone, I'm new to Gensim, and the main reason I've started using it is that I want to get vector representations of phrases. Below is the code I have, which partially works for bigrams:
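from gensim.models import Phrases, word2vec

def bigram2vec(unigrams, bigram_to_search):
    # Collect co-occurrence counts used to detect bigrams in the corpus.
    bigrams = Phrases(unigrams)
    # Train skip-gram Word2Vec on the corpus with detected bigrams joined.
    model = word2vec.Word2Vec(sentences=bigrams[unigrams], size=20, min_count=1, window=4, sg=1, hs=1, negative=0, trim_rule=None)
    # Return the vector only if the bigram made it into the vocabulary.
    if bigram_to_search in model.vocab.keys():
        return model[bigram_to_search]
    else:
        return None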
The problem is that the Word2Vec model appears to prune some of the bigrams automatically, i.e. len(model.vocab.keys()) != len(bigrams.vocab.keys()). I've tried adjusting various parameters such as trim_rule and min_count, but they don't seem to affect the pruning. Any thoughts? If I've left anything out of my explanation, please let me know.
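One check that might help narrow this down is to count the distinct tokens that actually come out of bigrams[unigrams] and compare that with the model vocabulary. As far as I understand, bigrams.vocab stores counts for candidate unigrams and bigrams, not only the tokens that end up in the transformed sentences. The sketch below is plain Python; the helper name distinct_tokens and the toy sentences are made up for illustration and are not part of Gensim.

```python
def distinct_tokens(sentences):
    """Return the set of distinct tokens in an iterable of token lists."""
    tokens = set()
    for sentence in sentences:
        tokens.update(sentence)
    return tokens


# Toy stand-in for the output of bigrams[unigrams] (hypothetical data).
transformed = [
    ["this_report", "is", "short"],
    ["this_report", "is", "long"],
]

emitted = distinct_tokens(transformed)
print(len(emitted))              # 4
print("this_report" in emitted)  # True
```

With the real objects, the comparison would be between distinct_tokens(bigrams[unigrams]) and model.vocab.keys(). If those two agree, the missing bigrams were probably never emitted by Phrases in the first place, and were not dropped by Word2Vec.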
PS - I am aware that the bigram to look up needs to be written with an underscore instead of a space, i.e. the proper way to call my function would be bigram2vec(unigrams, 'this_report').