Getting the top documents for a topic (any model)


Aneesha Bakharia

Aug 26, 2012, 8:46:58 AM
to gen...@googlegroups.com
Hi

Is there a quick way to get the top documents for a topic most of
the models (e.g. LDA and LSI)? I need to display both the top words and
the documents together.

Many Thanks

Aneesha

Radim Řehůřek

Aug 27, 2012, 6:29:22 AM
to gensim
Hello Aneesha,

> Is there are quick way to get the top documents for a topic most of
> the models (eg LDA and LSI) ? I need to display both the top words and
> the documents together.

I don't understand. Can you give an example?

Best,
Radim




Aneesha Bakharia

Aug 27, 2012, 8:27:41 AM
to gen...@googlegroups.com
Hi
Sorry, I typed that sentence way too quickly. It should read:
Is there a quick way to get the top documents and words for a topic
using most of the models (e.g. LDA and LSI)?

Here is an example:

Topic 1
Word1 Word2 Word3 Word4 ... WordN

- Doc1
- Doc2
- Doc3
- DocN

Topic 2
Word1 Word2 Word3 Word4 ... WordN

- Doc1
- Doc2
- Doc3
- DocN

Regards

Aneesha

Radim Řehůřek

Aug 27, 2012, 8:55:35 AM
to gensim
Depends how "quick" you need it :)

There is no built-in support for that operation, but you can simulate
it with something like:

>>> tops = sorted(all_documents, reverse=True, key=lambda doc: abs(dict(doc).get(topic_number, 0.0)))
>>> print(tops[:N])

This will print the top N documents that have the greatest value
(~probability, in case of LDA) for the topic `topic_number`.

Best,
Radim



Aneesha Bakharia

Aug 27, 2012, 9:07:35 AM
to gen...@googlegroups.com
Hi

Many thanks Radim.

Aneesha

Pino

Sep 13, 2012, 5:10:00 AM
to gen...@googlegroups.com
Hi,
first of all thanks for your excellent work Radim.

I'm not new to Python, but I was wondering how to use that code to accomplish Aneesha's goal (I guess she did what I'm trying to do...).

The param all_documents is the corpus (bow), right?
And what is 'doc' in the lambda expression?

Maybe I'll start a new topic with a more specific question regarding my problem.

Best,

Pino

Radim Řehůřek

Sep 14, 2012, 5:00:07 AM
to gensim
On Sep 13, 11:10 am, Pino <gder...@gmail.com> wrote:
> Hi,
> first of all thanks for your excellent work Radim.
>
> I'm not new to Python, but I was wondering how to use that code to
> accomplish Aneesha's goal (I guess she did what I'm trying to do...)
>
> The param all_documents is the corpus (bow), right?
> And what is 'doc' in the lambda expression?

You're welcome. Yes, `all_documents` is a corpus of documents
(transformed by LDA, if you're interested in documents most relevant
to an LDA topic).

Best,
Radim


Pino

Sep 14, 2012, 11:23:43 AM
to gen...@googlegroups.com
Thanks for your time. 
Maybe I'm a bit sleepy now but...
Is there a way to get back from the bow vectors to the documents?
I mean, tops[:N] are the first N documents ranked by probability of fitting topic_num, but they are bags of words.
How can I identify the documents?

Best regards,

Pino

Radim Řehůřek

Sep 15, 2012, 11:32:16 AM
to gensim
Hello Pino,

gensim doesn't care about document identification -- that's entirely
up to you. Gensim "identifies" documents only by their position in the
corpus stream (=first, second, ...).

One possible way is to attach your ids before you run the sort:

>>> tops = sorted(zip(my_ids, all_documents), reverse=True, key=lambda my_id, doc: abs(dict(doc).get(topic_number, 0.0)))

HTH,
Radim
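[Editor's note] A runnable version of this id-attaching idea, on invented ids and an invented sparse corpus. Note that on Python 3 the key function receives each `(id, document)` pair as a single argument, so it must index into the pair rather than declare two parameters:

```python
# Invented ids and an invented LDA/LSI-style sparse corpus of
# (topic_id, weight) pairs, standing in for my_ids / all_documents above.
my_ids = ["doc-a", "doc-b", "doc-c"]
all_documents = [
    [(0, 0.2), (1, 0.8)],
    [(1, 0.5), (2, 0.5)],
    [(0, 0.9)],
]
topic_number = 0

# Pair each id with its document, then sort the pairs by the weight the
# document gives to topic_number (pair[1] is the document).
tops = sorted(zip(my_ids, all_documents), reverse=True,
              key=lambda pair: abs(dict(pair[1]).get(topic_number, 0.0)))
print([doc_id for doc_id, _ in tops])
```

The sorted `tops` list then carries your own ids alongside each document vector.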

Pino

Sep 15, 2012, 11:56:22 AM
to gen...@googlegroups.com
Well, now everything is much clearer.
Thank you again. Yes, your answer is very helpful.
Have a nice weekend,

Pino

Ravi Karan

May 7, 2014, 1:27:47 AM
to gen...@googlegroups.com
Hi Radim,

I'm working on a similar implementation. I tried to examine the contents of the BOW corpus vs. lda[bow_corpus] (transformed by an LDA model trained on that corpus with, say, 35 topics).
I found the following output:
DOC 1 : [(1522, 1), (2028, 1), (2082, 1), (6202, 1)]
LDA 1 : [(29, 0.80571428571428572)]
DOC 2 : [(1522, 1), (5364, 1), (6202, 1), (6661, 1), (6983, 1)]
LDA 2 : [(29, 0.83809523809523812)]
DOC 3 : [(3079, 1), (3395, 1), (4874, 1)]
LDA 3 : [(34, 0.75714285714285712)]
DOC 4 : [(1482, 1), (2806, 1), (3988, 1)]
LDA 4 : [(22, 0.50714288283121989), (32, 0.25714283145449457)]
DOC 5 : [(440, 1), (533, 1), (1264, 1), (2433, 1), (3012, 1), (3902, 1), (4037, 1), (4502, 1), (5027, 1), (5723, 1)]
LDA 5 : [(12, 0.075870715371114297), (30, 0.088821329943986921), (31, 0.75219107156801579)]
DOC 6 : [(705, 1), (3156, 1), (3284, 1), (3555, 1), (3920, 1), (4306, 1), (4581, 1), (4900, 1), (5224, 1), (6156, 1)]
LDA 6 : [(6, 0.63896110435842401), (20, 0.18441557445724915), (28, 0.09350643806744402)]
DOC 7 : [(470, 1), (1434, 1), (1741, 1), (3654, 1), (4261, 1)]
LDA 7 : [(5, 0.17142855723258577), (13, 0.17142856888458904), (19, 0.50476192150187316)]
DOC 8 : [(2227, 1), (2290, 1), (2549, 1), (5102, 1), (7651, 1)]
LDA 8 : [(12, 0.16776844589094803), (19, 0.13980868559963203), (22, 0.1728575716782704), (28, 0.37194624921210206)]

where:
DOC N is the document from the BOW corpus
LDA N is the transformation of DOC N by the LDA model

Am I correct in understanding the output for each transformed document "LDA N" to be the topics that document N belongs to? By that understanding, some documents (4, 5, 6, 7 and 8) belong to more than one topic; e.g. DOC 8 belongs to topics 12, 19, 22 and 28 with the respective probabilities.

Could you please explain the output of LDA N and correct my understanding of it, especially since in another thread you've mentioned that a document belongs to ONE topic?
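[Editor's note] Each "LDA N" row above is indeed a sparse distribution over topics: a list of `(topic_id, probability)` pairs for that document. gensim omits topics whose probability falls below a `minimum_probability` threshold (0.01 by default, as far as I know), which is why the listed probabilities need not sum to 1. If a single label is wanted, the usual choice is the most probable topic. A small sketch using the LDA 8 row from the listing:

```python
# The "LDA 8" row from the listing above: (topic_id, probability) pairs.
lda_8 = [(12, 0.16776844589094803), (19, 0.13980868559963203),
         (22, 0.1728575716782704), (28, 0.37194624921210206)]

# The probabilities of the *listed* topics sum to less than 1 because
# low-probability topics are filtered out of the model's output.
listed_mass = sum(prob for _, prob in lda_8)

# If a single "the topic of this document" label is needed, take the
# most probable entry.
dominant_topic, dominant_prob = max(lda_8, key=lambda pair: pair[1])
print(dominant_topic, round(listed_mass, 3))
```

So a document has a full distribution over topics; "belongs to one topic" is only a shorthand for its dominant topic.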

Neil Stoker

Aug 26, 2016, 12:34:45 PM
to gensim, gde...@gmail.com
Hi Pino,

I know it's a while since this was posted, but did this actually work for you?

When I try the version with my_id, I get an error:
TypeError: <lambda>() missing 1 required positional argument: 'doc'

Is there anything obvious I might be overlooking? The earlier version, which returns just the bow, runs fine. I'm using Python 3 (just in case that makes a difference, but I doubt it).
Thanks,
Neil
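[Editor's note] That TypeError is a Python 2/3 difference rather than a gensim issue: PEP 3113 removed tuple parameters, so on Python 3 `sorted()` hands the whole `(id, doc)` pair to the key function as one argument. A minimal reproduction with invented data:

```python
pairs = [("a", 2), ("b", 1)]

# Two-parameter key, as in the Python 2-era snippet: sorted() calls it
# with a single pair argument, so Python 3 raises
# "TypeError: <lambda>() missing 1 required positional argument: 'doc'".
broken_key = lambda my_id, doc: doc
try:
    sorted(pairs, key=broken_key)
    raised = False
except TypeError:
    raised = True

# Working Python 3 form: accept the pair and index into it.
fixed = sorted(pairs, key=lambda pair: pair[1])
print(raised, fixed)
```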

Sharbani Pandit

Oct 6, 2016, 4:35:52 PM
to gensim
Hello Radim,
I am trying to do the same thing (getting the top documents for a certain topic) with LSI.
I tried the following lines:
tops = sorted(complaints, reverse=True, key=lambda doc: abs(dict(doc).get(0, 0.0)))
print(tops[:N])

where "complaints" contains all the documents and I am looking for documents close to topic 0. However, I get this error: "ValueError: dictionary update sequence element #0 has length 4; 2 is required". Can you please help me with this issue?

Thanks,
Sharbani

Lev Konstantinovskiy

Oct 11, 2016, 2:29:41 AM
to gensim
Hi,


Did you convert the document to its LSI representation?

doc = "Human computer interaction"
vec_bow = dictionary.doc2bow(doc.lower().split())
vec_lsi = lsi[vec_bow]
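[Editor's note] The "dictionary update sequence" ValueError above typically means `dict()` was applied to raw documents rather than to LSI-transformed vectors: every element passed to the sort key must be a `(topic_id, weight)` pair. An illustration with invented data:

```python
# A raw tokenised document vs. an (invented) LSI vector for it.
raw_doc = "some complaint text"
lsi_doc = [(0, 0.41), (1, -0.12)]

# dict() on an LSI vector works: every element is a (key, value) pair.
weights = dict(lsi_doc)

# dict() on raw tokens fails: "some" is a 4-character string, producing
# "ValueError: dictionary update sequence element #0 has length 4; ...".
try:
    dict(raw_doc.split())
    raised = False
except ValueError:
    raised = True
print(weights, raised)
```

So the corpus handed to the sort should be `lsi[bow_corpus]`, not the raw documents.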

Elizaveta

Aug 18, 2017, 4:31:44 AM
to gensim
Hello Radim,

I know that it's been a while since this answer was posted.


>>> tops = sorted(zip(my_ids, all_documents), reverse=True, key=lambda my_id, doc: abs(dict(doc).get(topic_number, 0.0)))

Unfortunately it does not work for me; something is missing and I do not know how to fix it.
I've also tried other options, like converting the LDA probabilities into a matrix of documents, topics and their probabilities, so I'd be able to associate topics with documents.
However, I still cannot manage it, so I'd highly appreciate any help.

Regards
Liza

ian.m...@conversica.com

Oct 27, 2017, 5:24:56 PM
to gensim
This solves the problem, btw:
tops = sorted(zip(my_ids, lda[corpus]), reverse=True, key=lambda x: abs(dict(x[1]).get(topic_number, 0.0)))