How to get specific tf-idf values from the created tfidf model


MMM

Nov 15, 2017, 9:25:52 AM
to gensim

I am calculating my tf-idf values as follows using gensim.


from gensim import corpora, models

texts = [['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
tfidf = models.TfidfModel(corpus)

Now I want to get the 3 words that have the highest tf-idf values. How can I get them from the created 'tfidf' model? Please help me!

Ivan Menshikh

Nov 16, 2017, 12:35:41 AM
to gensim
Hi Volka,

The tf-idf value depends on the word's frequency in the current document (i.e., the second value in each tuple returned by the doc2bow method).
You can easily extract the mapping word_index -> idf from the model via tfidf.idfs and use it to calculate the top tf-idf words.
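
To illustrate why the same word can score differently in different documents, here is a minimal pure-Python sketch. It does not call gensim; it assumes gensim's default weighting (raw term frequency, idf = log2(N / document frequency), then scaling each document to unit length). The helper names idf_table and tfidf_for_doc are made up for this example.

```python
import math
from collections import Counter

texts = [['human', 'interface', 'computer'],
         ['survey', 'user', 'computer', 'system', 'response', 'time'],
         ['eps', 'user', 'interface', 'system'],
         ['system', 'human', 'system', 'eps'],
         ['user', 'response', 'time'],
         ['trees'],
         ['graph', 'trees'],
         ['graph', 'minors', 'trees'],
         ['graph', 'minors', 'survey']]


def idf_table(docs, log_base=2.0):
    """Map each word to log(N / document_frequency); one value per word."""
    n_docs = len(docs)
    doc_freq = Counter(word for doc in docs for word in set(doc))
    return {word: math.log(n_docs / df, log_base) for word, df in doc_freq.items()}


def tfidf_for_doc(doc, idfs):
    """Weight one document as tf * idf, then scale it to unit L2 length."""
    counts = Counter(doc)
    weights = {word: tf * idfs[word] for word, tf in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {word: w / norm for word, w in weights.items()} if norm else weights


idfs = idf_table(texts)               # depends only on the word
doc0 = tfidf_for_doc(texts[0], idfs)  # depends on the word AND the document
doc3 = tfidf_for_doc(texts[3], idfs)

# 'human' has a single idf, but a different tf-idf score in each document.
print(idfs['human'], doc0['human'], doc3['human'])
```

The idf is one number per word, while the tf-idf score is one number per (word, document) pair, which is why a "top 3 words" ranking needs a rule for combining or comparing scores across documents.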


Volka

Nov 16, 2017, 3:39:11 AM
to gensim
Hi Ivan,

I did not understand what you meant by "You can extract mapping word_index -> idfs from model easily - tfidf.idfs and use it for top-tfidfwords calculation." Could you please elaborate further?

Is it correct if I do something like this to get the 3 words that have the highest tf-idf values?

corpus_tfidf = tfidf[corpus]
d = {}
for doc in corpus_tfidf:
    for id, value in doc:
        word = dictionary.get(id)
        d[word] = value
print(sorted(d, key=d.get, reverse=True)[:3])


Thank you very much!

Ivan Menshikh

Nov 17, 2017, 12:29:11 AM
to gensim
If you want to calculate it in the same way, replace d (a dict) with a list and append (word, tfidf) tuples instead of overwriting d[word].
Remember that one word can get different values of `value`: you overwrite d[word] = value many times, because tf-idf depends on the document, not only on the word.
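
The difference between the two approaches can be sketched in plain Python. The scores below are hand-written placeholders rather than real gensim output, and the names corpus_scores, last_seen, and pairs are made up for this example; with gensim you would build the pairs from tfidf[corpus] and dictionary.get(id) as in the code above.

```python
import heapq

# Placeholder (word, score) pairs, one inner list per document.
corpus_scores = [
    [("human", 0.58), ("interface", 0.58), ("computer", 0.58)],
    [("system", 0.72), ("human", 0.49), ("eps", 0.49)],
    [("trees", 1.0)],
]

# Dict approach: a later document silently overwrites an earlier score.
last_seen = {}
for doc in corpus_scores:
    for word, score in doc:
        last_seen[word] = score

# List approach: keep every (word, score) occurrence across documents.
pairs = [(word, score) for doc in corpus_scores for word, score in doc]

# The three highest-scoring occurrences in the whole corpus.
top3 = heapq.nlargest(3, pairs, key=lambda pair: pair[1])
print(top3)
```

Here last_seen["human"] ends up holding only the score from the second document, while pairs keeps both occurrences, so the ranking sees every score.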

Volka

Nov 17, 2017, 9:38:07 AM
to gensim
Hi Ivan,

Thanks a lot for your reply. I think the way I have understood tf-idf is wrong. Thanks for correcting me :)

Please let me know whether what I am doing now is correct.

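from collections import namedtuple

from gensim import corpora, models

# texts is the tokenized corpus from the first message
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]

AnalyzedDocument = namedtuple('AnalyzedDocument', 'word tfidf_score')
d = []
for doc in corpus_tfidf:
    for word_id, score in doc:
        # keep every (word, score) occurrence; a word can appear in several documents
        d.append(AnalyzedDocument(dictionary.get(word_id), score))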


However, I want to detect the most important words in my corpus, so I am wondering how to do that with these values. Can you please suggest an approach for finding the most important words in the corpus? Can I do it by considering only the 'idf' value?

Thank you!

Ivan Menshikh

Nov 20, 2017, 4:23:37 AM
to gensim
Hi Volka,

Your code is correct now. For extracting the most important words, I suggest the TextRank algorithm, which is implemented in the summarization submodule of gensim. Try the keywords function; tutorial here.

Volka

Nov 21, 2017, 8:29:56 AM
to gensim
Hi Ivan, thanks a lot for your feedback. I will use the TextRank algorithm! :)