models.phrases.Phrases min_count effect on unigrams


Pogens Porgue

Dec 29, 2019, 5:48:56 PM
to Gensim
models.phrases.Phrases documentation says:

   min_count (float, optional) – Ignore all words and bigrams with total collected count lower than this value.

and while the bigram scoring function original_scorer clearly does use min_count, it does not appear to have any effect on unigrams, so I'm not sure what 'words' refers to in "Ignore all words and bigrams".

I've looked at the code for models.phrases._sentence2token, as called by Phrases.__getitem__. It only uses Phrases.analyze_sentence() to join unigrams into bigrams, then calls new_s.append(words) two lines from the end to build the returned list of all unigrams and bigrams. My reading of min_count as documented above is that words below the min_count value would not be returned, i.e., the code would instead have

      if phrase_class.vocab[words] >= phrase_class.min_count: new_s.append(words)

I do want that functionality and can of course just subclass and redefine _sentence2token, but I'm wondering what I'm misinterpreting in the above documentation. Thanks

Gordon Mohr

Jan 6, 2020, 6:39:46 PM
to Gensim
Yes, that looks like a misleading bit of documentation. The effect of `min_count` is dependent on the `scoring` choice, and doesn't appear to me to ever be as simple as "ignoring" all unigram or bigram tokens below this count. 

(If your actual aim is to simply elide any unigrams/bigrams below a certain count, similar to the operation of `min_count` in the `Word2Vec`/etc classes, that might be cleanest to do as a separate step/pass – even if you leverage the frequency-survey made by the `Phrases` class.)
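A minimal sketch of such a separate pass, assuming the sentences have already been run through the phrase model and are plain lists of string tokens. The helper name `drop_rare_tokens` and the toy corpus are hypothetical, and for simplicity it recounts tokens itself rather than reading the counts collected by `Phrases`:

```python
from collections import Counter


def drop_rare_tokens(sentences, min_count):
    """Return copies of the sentences without tokens seen fewer than min_count times.

    Counts are total occurrences across all sentences, so unigrams and
    already-joined bigrams (e.g. "new_york") are treated the same way.
    """
    counts = Counter(token for sentence in sentences for token in sentence)
    return [
        [token for token in sentence if counts[token] >= min_count]
        for sentence in sentences
    ]


# Hypothetical output of a phrase model: bigrams already joined with "_".
phrased = [
    ["new_york", "is", "big"],
    ["new_york", "is", "busy"],
    ["boston", "is", "small"],
]

filtered = drop_rare_tokens(phrased, min_count=2)
print(filtered)
```

Because the filter runs after phrase detection, it does not change which bigrams get joined; it only removes low-count tokens from the final output.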

- Gordon

Pogens Porgue

Jan 7, 2020, 4:47:52 PM
to Gensim
Thanks for confirming that the documentation is misleading (maybe it can be fixed?).

In any event, I had independently come to the conclusion that a separate step/pass was more sensible, so completely agree.
(In this case am using the min_df parameter in sklearn.feature_extraction.text.CountVectorizer.)
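One caveat worth keeping in mind: with an integer value, min_df thresholds on document frequency (the number of documents containing a term), not on total collected count, so the two can disagree for terms that repeat within a single document. A small stdlib-only sketch of the difference, using a made-up corpus and hypothetical variable names:

```python
from collections import Counter

# Hypothetical tokenized documents.
docs = [
    ["spam", "spam", "spam", "eggs"],
    ["eggs", "ham"],
]

# Total count: every occurrence of a token is counted.
total_count = Counter(token for doc in docs for token in doc)

# Document frequency: each token is counted at most once per document.
doc_freq = Counter(token for doc in docs for token in set(doc))

threshold = 2
keep_by_total = {t for t, c in total_count.items() if c >= threshold}
keep_by_df = {t for t, c in doc_freq.items() if c >= threshold}

print(sorted(keep_by_total))
print(sorted(keep_by_df))
```

Here "spam" passes the total-count threshold but not the document-frequency one, since all of its occurrences are in one document.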

But it was nonetheless worthwhile to have examined the code for the phrases class, since I'm also manipulating my Phraser.phrasegrams.