most_similar: ignore "words not in vocabulary"


Felipe Ferreira de Carvalho

Aug 11, 2021, 1:53:59 PM
to Gensim
Hi guys,

I'm using a CBOW S50 Word2Vec model and applying .most_similar to a split text. But some of the words don't exist in the vocabulary, and in that case the function raises a KeyError saying 'Word not in vocabulary'.

The question is: is there a way to ignore words that are not in the vocabulary and search for similarity using only the words that are present?

Thanks,
hope someone knows.

Gordon Mohr

Aug 11, 2021, 3:16:06 PM
to Gensim
The usual way to do this would be to check, yourself, if the words you are using as the similarity-origin are in the model before attempting to find their neighbors. For example:

    starting_words = ['apple', 'banana', 'cordicarpa']
    positive_words = [w for w in starting_words if w in model]
    similars = model.most_similar(positive=positive_words)

Of course, you might also want to make sure your filtering for in-model words doesn't leave you with no words at all. 
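That guard can be sketched in plain Python. Here a set stands in for the model's vocabulary so the logic runs without gensim; against a real model the membership test would be `w in model`, as above:

```python
# A plain set stands in for the Word2Vec model's vocabulary (illustrative only).
vocab = {"apple", "banana", "cherry"}

def known_words(words, vocab):
    """Filter to in-vocabulary words, failing loudly if nothing survives."""
    known = [w for w in words if w in vocab]
    if not known:
        raise ValueError("none of the input words are in the vocabulary")
    return known

print(known_words(["apple", "banana", "cordicarpa"], vocab))  # ['apple', 'banana']
```

Raising (or otherwise bailing out) when the filtered list is empty avoids passing an empty `positive` list to most_similar, which would fail anyway with a less obvious error.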

- Gordon

Felipe Ferreira de Carvalho

Aug 12, 2021, 12:34:38 PM
to gen...@googlegroups.com
Thanks Gordon. Actually, I made a small modification to the source code of the .most_similar() function, but your code is much easier to implement, so I'll use it. Thanks again for your contribution.

Here is my modification:
    from six import iteritems, itervalues, string_types
    from numpy import exp, log, dot, zeros, outer, random, dtype, float32 as REAL,\
    ndarray, array
    from gensim import matutils

    def most_similar_iguk(self, positive=[], negative=[], topn=10, restrict_vocab=None, indexer=None):  # most_similar adapted to ignore unknown words

        self.init_sims()

        if isinstance(positive, string_types) and not negative:
            # allow calls like most_similar('dog'), as a shorthand for most_similar(['dog'])
            positive = [positive]

        # add weights for each word, if not already present; default to 1.0 for positive and -1.0 for negative words
        positive = [
            (word, 1.0) if isinstance(word, string_types + (ndarray,)) else word
            for word in positive
        ]
        negative = [
            (word, -1.0) if isinstance(word, string_types + (ndarray,)) else word
            for word in negative
        ]

        # compute the weighted average of all words
        all_words, mean, ignr_words = set(), [], []
        for word, weight in positive + negative:
            if isinstance(word, ndarray):
                mean.append(weight * word)
            elif word in self.vocab:
                mean.append(weight * self.vectors_norm[self.vocab[word].index])
                all_words.add(self.vocab[word].index)
            else:  # UPDATE: ignore unknown words and report them afterwards
                ignr_words.append(word)
        print("Words not found in the vocabulary:")
        print(ignr_words)
        if not mean:
            raise ValueError("cannot compute similarity with no input")
        mean = matutils.unitvec(array(mean).mean(axis=0)).astype(REAL)

        if indexer is not None:
            return indexer.most_similar(mean, topn)

        limited = self.vectors_norm if restrict_vocab is None else self.vectors_norm[:restrict_vocab]
        dists = dot(limited, mean)
        if not topn:
            return dists
        best = matutils.argsort(dists, topn=topn + len(all_words), reverse=True)
        # ignore (don't return) words from the input
        result = [(self.index2word[sim], float(dists[sim])) for sim in best if sim not in all_words]
        return result[:topn]
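Rather than patching gensim's source (which has to be redone on every upgrade), the same behavior can be had with a small wrapper around the unmodified method. A sketch, where `model` is anything supporting `word in model` and `.most_similar(...)` (such as a gensim KeyedVectors instance); the function name is illustrative:

```python
def most_similar_ignore_oov(model, positive=(), negative=(), topn=10):
    """OOV-tolerant wrapper around an unmodified most_similar method.

    In-vocabulary words are passed through; out-of-vocabulary words are
    collected and reported instead of raising KeyError.
    """
    pos = [w for w in positive if w in model]
    neg = [w for w in negative if w in model]
    ignored = [w for w in list(positive) + list(negative) if w not in model]
    if ignored:
        print("Words not found in the vocabulary:", ignored)
    if not pos and not neg:
        raise ValueError("cannot compute similarity with no input")
    return model.most_similar(positive=pos, negative=neg, topn=topn)
```

This keeps the library untouched and combines Gordon's filtering with the reporting of ignored words from the patched version above.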