Threshold for Gensim TF-IDF


Masto Music

Jun 15, 2021, 7:55:10 AM
to Gensim
I am trying to figure out how to use TF-IDF to compare multi-word vectors. I just ran the Gensim TF-IDF on a Wikipedia corpus. Afterwards, I noticed that when common words like 'am', 'like' and 'good' are looked up in the model, they do not return a TF-IDF vector with a low score but rather an empty vector, which throws off my algorithm. Is there a parameter that ensures these words get a small score instead?

Cheers,
Sam

Radim Řehůřek

Jun 15, 2021, 8:33:47 AM
to Gensim
Hi Sam,

in TFIDF, words don't really have vectors – the score for a specific term depends on a specific document.

So not sure what you mean. What gives you an "empty vector", exactly?
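
To illustrate the point that a TF-IDF score belongs to a (term, document) pair rather than to a term alone, here is a minimal sketch in plain Python. It multiplies raw term counts by a base-2 IDF and skips the normalization Gensim applies by default, so the numbers will not match Gensim's output; the three toy documents are made up for illustration.

```python
import math
from collections import Counter

# Toy corpus (hypothetical); each document is a list of tokens.
docs = [
    ["cat", "sat", "cat"],
    ["cat", "dog"],
    ["dog", "barked"],
]

# Document frequency: number of documents that contain each term.
df = Counter(term for doc in docs for term in set(doc))
n_docs = len(docs)

def tfidf(doc):
    """Return {term: tf * log2(N / df)} for one document (no normalization)."""
    tf = Counter(doc)
    return {term: count * math.log2(n_docs / df[term]) for term, count in tf.items()}

scores = [tfidf(doc) for doc in docs]

# The same word, "cat", scores differently in document 0 (tf=2) and document 1 (tf=1).
print(scores[0]["cat"], scores[1]["cat"])
```

The only per-word quantity here is the IDF factor; the final score changes with the document the word appears in.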

Best,
Radim

Masto Music

Jun 17, 2021, 5:55:36 AM
to Gensim
I understand that in TF-IDF words don't really have vectors, but I am trying to multiply fastText word vectors by TF-IDF scores to compute a meaning for an overall phrase.
So the fastText vector for each word is scaled by the importance of the word (which comes from its TF-IDF score).
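
A minimal sketch of that weighting, under stated assumptions: the 3-dimensional vectors are toy stand-ins for fastText vectors, the IDF values are made up, and `phrase_vector` and its `default_idf` fallback are hypothetical names introduced here. The fallback gives a word with no IDF entry a small weight instead of dropping it.

```python
def phrase_vector(tokens, word_vectors, idf, default_idf=0.1):
    """IDF-weighted average of the word vectors for the given tokens."""
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for tok in tokens:
        if tok not in word_vectors:
            continue  # no vector available for this token
        w = idf.get(tok, default_idf)  # small fallback weight if IDF is missing
        for i, x in enumerate(word_vectors[tok]):
            total[i] += w * x
        weight_sum += w
    if weight_sum == 0.0:
        return total  # nothing contributed: all-zero vector
    return [x / weight_sum for x in total]

# Toy data (assumed values, not real fastText vectors or real IDF scores).
word_vectors = {
    "am": [1.0, 0.0, 0.0],
    "good": [0.0, 1.0, 0.0],
    "guitarist": [0.0, 0.0, 1.0],
}
idf = {"good": 1.0, "guitarist": 3.9}  # "am" has no IDF entry

vec = phrase_vector(["am", "good", "guitarist"], word_vectors, idf)
print(vec)
```

With this choice, a very common word still nudges the phrase vector slightly rather than vanishing, which sidesteps the empty-result problem at the cost of one more parameter to tune.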

Masto Music

Jun 17, 2021, 5:56:41 AM
to Gensim
Honestly, though, I am not sure that TF-IDF is the best method for my particular question: https://stackoverflow.com/questions/67869439/how-to-use-tf-idf-to-determine-importance-of-words-for-whole-language

Radim Řehůřek

Jun 20, 2021, 11:31:19 AM
to Gensim
The "word importance" in TF-IDF is its "IDF" part = "inverse document frequency".
Documentation: https://radimrehurek.com/gensim/models/tfidfmodel.html#gensim.models.tfidfmodel.df2idf

If you have a trained TfidfModel in Gensim, the IDF scores are in `model.idfs`, a dictionary that maps integer term ids to IDF values.

IDF is a fairly crude measure, but popular because it's simple and fast and interpretable. Whether it works for your application or not, I don't know. 

Hope that helps,
Radim



