How to get specific tf-idf values from the created tfidf model

MMM

Nov 15, 2017, 9:25:52 AM

to gensim

I am calculating my tf-idf values as follows using gensim.

from gensim import corpora, models

texts = [['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
tfidf = models.TfidfModel(corpus)

Now, I want to get the 3 words that have the highest tf-idf values. How can I get them from the created 'tfidf' model? Please help me!

Ivan Menshikh

Nov 16, 2017, 12:35:41 AM

to gensim

Hi Volka,

The tf-idf value of a word depends on that word's frequency in the current document (i.e. the second value in each tuple returned by the doc2bow method).

You can easily extract the word_index -> idf mapping from the model via tfidf.idfs and use it to calculate the top tf-idf words.
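To illustrate why the score depends on the document and not only on the word, here is a minimal pure-Python sketch of the weighting. It assumes what I understand to be gensim's default scheme (raw term count times log2(N / document frequency), followed by L2 normalization of each document). The function name tfidf_weights is made up for this example and is not part of gensim.

```python
import math
from collections import Counter

texts = [['human', 'interface', 'computer'],
         ['survey', 'user', 'computer', 'system', 'response', 'time'],
         ['eps', 'user', 'interface', 'system'],
         ['system', 'human', 'system', 'eps'],
         ['user', 'response', 'time'],
         ['trees'],
         ['graph', 'trees'],
         ['graph', 'minors', 'trees'],
         ['graph', 'minors', 'survey']]


def tfidf_weights(texts):
    """Hypothetical helper: returns (idf per word, tf-idf dict per document)."""
    n_docs = len(texts)

    # Document frequency: in how many documents does each word occur?
    df = Counter()
    for text in texts:
        df.update(set(text))

    # Assumed default: idf = log2(N / df). This plays the role of tfidf.idfs,
    # except that it is keyed by word instead of by word id.
    idf = {word: math.log2(n_docs / count) for word, count in df.items()}

    weighted = []
    for text in texts:
        tf = Counter(text)
        raw = {word: count * idf[word] for word, count in tf.items()}
        norm = math.sqrt(sum(v * v for v in raw.values()))
        # L2-normalize so every non-empty document vector has length 1.
        weighted.append({w: v / norm for w, v in raw.items()} if norm else {})
    return idf, weighted


idf, weighted = tfidf_weights(texts)

# 'system' has one idf value but a different tf-idf score in each document.
print(idf['system'], weighted[2]['system'], weighted[3]['system'])
```

The idf of 'system' is a single number, while its tf-idf score differs between the third and fourth documents because both the term count and the normalization change.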

Ivan Menshikh

Nov 16, 2017, 12:36:59 AM

to gensim

Useful link: the TfidfModel.__getitem__ code

On Wednesday, November 15, 2017 at 7:25:52 PM UTC+5, Volka wrote:


Volka

Nov 16, 2017, 3:39:11 AM

to gensim

Hi Ivan,

I did not understand what you meant by "You can extract mapping word_index -> idfs from model easily - tfidf.idfs and use it for top-tfidfwords calculation.". Could you please elaborate?

Is it correct to do something like this to get the 3 words that have the highest tf-idf values?

corpus_tfidf = tfidf[corpus]

d = {}
for doc in corpus_tfidf:
    for id, value in doc:
        word = dictionary.get(id)
        d[word] = value

print(sorted(d, key=d.get, reverse=True)[:3])

Thank you very much!

Ivan Menshikh

Nov 17, 2017, 12:29:11 AM

to gensim

If you want to calculate it the same way, replace d (a dict) with a list and append (word, tfidf) tuples instead of overwriting d[word].

Remember that one word can get several different `value`s: you overwrite d[word] = value many times, because the tf-idf score depends on the document, not only on the word.
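The suggestion above can be sketched as follows. To keep the example self-contained, id2word and corpus_tfidf are small hand-made stand-ins (assumptions, with made-up scores) for the gensim dictionary and for tfidf[corpus]. The helper name top_scores is likewise hypothetical.

```python
import heapq

# Stand-in for the gensim Dictionary (word id -> word); values are made up.
id2word = {0: 'computer', 1: 'human', 2: 'system', 3: 'trees'}

# Stand-in for tfidf[corpus]: one list of (word_id, score) pairs per document.
corpus_tfidf = [
    [(0, 0.6), (1, 0.8)],
    [(2, 0.9), (0, 0.44)],
    [(3, 1.0)],
]


def top_scores(corpus_tfidf, id2word, n=3):
    """Collect every (word, score) pair and return the n highest-scoring ones."""
    pairs = [(id2word[word_id], score)
             for doc in corpus_tfidf
             for word_id, score in doc]
    # Nothing is overwritten: a word occurring in several documents
    # contributes one pair per document.
    return heapq.nlargest(n, pairs, key=lambda pair: pair[1])


top3 = top_scores(corpus_tfidf, id2word)
print(top3)
```

With the real objects, the call would presumably be top_scores(tfidf[corpus], dictionary). Note that the same word can appear more than once in the result if it scores highly in several documents.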


Volka

Nov 17, 2017, 9:38:07 AM

to gensim

Hi Ivan,

Thanks a lot for your reply. I think the way I have understood tf-idf is wrong. Thanks for correcting me :)

Please let me know whether what I am doing now is correct.

from collections import namedtuple

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]

analyzedDocument = namedtuple('AnalyzedDocument', 'word tfidf_score')

# Keep one (word, score) entry per word occurrence per document
d = []
for doc in corpus_tfidf:
    for id, value in doc:
        word = dictionary.get(id)
        score = value
        d.append(analyzedDocument(word, score))

However, I want to detect the most important words in my corpus, and I am wondering how to do that with these values. Can you please suggest an approach for finding the most important words in the corpus? Can I do it by considering only the idf value?

Thank you!

Ivan Menshikh

Nov 20, 2017, 4:23:37 AM

to gensim

Hi Volka,

Your code is correct now. For extracting the most important words, I suggest the TextRank algorithm, which is implemented in gensim's summarization submodule. Try the keywords function; tutorial here.

Volka

Nov 21, 2017, 8:29:56 AM

to gensim

Hi Ivan, Thanks a lot for your feedback. I will use TextRank Algorithm! :)
