Getting the top documents for a topic (any model)


Aneesha Bakharia

Aug 26, 2012, 8:46:58 AM
to gen...@googlegroups.com
Hi

Is there a quick way to get the top documents for a topic most of
the models (e.g. LDA and LSI)? I need to display both the top words and
the documents together.

Many Thanks

Aneesha

Radim Řehůřek

Aug 27, 2012, 6:29:22 AM
to gensim
Hello Aneesha,

> Is there are quick way to get the top documents for a topic most of
> the models (eg LDA and LSI) ? I need to display both the top words and
> the documents together.

I don't understand. Can you give an example?

Best,
Radim




Aneesha Bakharia

Aug 27, 2012, 8:27:41 AM
to gen...@googlegroups.com
Hi
Sorry, I typed that sentence way too quickly. It should read:
Is there a quick way to get the top documents and words for a topic
using most of the models (e.g. LDA and LSI)?

Here is an example:

Topic 1
Word1 Word2 Word3 Word4 ... WordN

- Doc1
- Doc2
- Doc3
- DocN

Topic 2
Word1 Word2 Word3 Word4 ... WordN

- Doc1
- Doc2
- Doc3
- DocN

Regards

Aneesha

Radim Řehůřek

Aug 27, 2012, 8:55:35 AM
to gensim
Depends how "quick" you need it :)

There is no built-in support for that operation, but you can simulate
it with something like:

>>> tops = sorted(all_documents, reverse=True, key=lambda doc: abs(dict(doc).get(topic_number, 0.0)))
>>> print(tops[:N])

This will print the top N documents that have the greatest value
(~probability, in case of LDA) for the topic `topic_number`.

Best,
Radim



Aneesha Bakharia

Aug 27, 2012, 9:07:35 AM
to gen...@googlegroups.com
Hi

Many thanks Radim.

Aneesha

Pino

Sep 13, 2012, 5:10:00 AM
to gen...@googlegroups.com
Hi,
first of all thanks for your excellent work Radim.

I'm not new to Python, but I was wondering how to use that code to accomplish Aneesha's goal (I guess she did what I'm trying to do...).

The param all_documents is the corpus (bow), right?
And what is 'doc' in the lambda expression?

Maybe I'll start a new topic with a more specific question regarding my problem.

Best,

Pino

Radim Řehůřek

Sep 14, 2012, 5:00:07 AM
to gensim
On Sep 13, 11:10 am, Pino <gder...@gmail.com> wrote:
> Hi,
> first of all thanks for your excellent work Radim.
>
> I'm not new to Python, but I was wondering how to use that code to
> accomplish Aneesha's goal (I guess she did what I'm trying to do...)
>
> The param all_documents is the corpus (bow), right?
> And what is 'doc' in the lambda expression?

You're welcome. Yes, `all_documents` is a corpus of documents
(transformed by LDA, if you're interested in documents most relevant
to an LDA topic).

Best,
Radim


Pino

Sep 14, 2012, 11:23:43 AM
to gen...@googlegroups.com
Thanks for your time. 
Maybe I'm a bit sleepy now but...
Is there a way to get back from the bow vectors to the documents?
I mean, tops[:N] are the first N documents ranked by probability of fitting topic_num, but they are bags of words.
How can I identify the documents?

Best regards,

Pino

Radim Řehůřek

Sep 15, 2012, 11:32:16 AM
to gensim
Hello Pino,

gensim doesn't care about document identification -- that's entirely
up to you. Gensim "identifies" documents only by their position in the
corpus stream (=first, second, ...).

One possible way is to attach your ids before you run the sort:

>>> tops = sorted(zip(my_ids, all_documents), reverse=True, key=lambda my_id, doc: abs(dict(doc).get(topic_number, 0.0)))

HTH,
Radim
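[Editor's note] A runnable version of this id-attaching idea, on invented ids and an invented sparse corpus. Note that on Python 3 the key function receives each `(id, document)` pair as a single argument, so it must index into the pair rather than declare two parameters:

```python
# Invented ids and an invented LDA/LSI-style sparse corpus of
# (topic_id, weight) pairs, standing in for my_ids / all_documents above.
my_ids = ["doc-a", "doc-b", "doc-c"]
all_documents = [
    [(0, 0.2), (1, 0.8)],
    [(1, 0.5), (2, 0.5)],
    [(0, 0.9)],
]
topic_number = 0

# Pair each id with its document, then sort the pairs by the weight the
# document gives to topic_number (pair[1] is the document).
tops = sorted(zip(my_ids, all_documents), reverse=True,
              key=lambda pair: abs(dict(pair[1]).get(topic_number, 0.0)))
print([doc_id for doc_id, _ in tops])
```

The sorted `tops` list then carries your own ids alongside each document vector.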

Pino

Sep 15, 2012, 11:56:22 AM
to gen...@googlegroups.com
Well, now everything is much clearer.
Thank you again. Yes, your answer is very helpful.
Have a nice weekend,

Pino

Ravi Karan

May 7, 2014, 1:27:47 AM
to gen...@googlegroups.com
Hi Radim,

I'm working on a similar implementation. I tried to examine the contents of the BOW corpus vs. lda[bow_corpus] (transformed by an LDA model trained on that corpus with, say, 35 topics).
I found the following output:
DOC 1 : [(1522, 1), (2028, 1), (2082, 1), (6202, 1)]
LDA 1 : [(29, 0.80571428571428572)]
DOC 2 : [(1522, 1), (5364, 1), (6202, 1), (6661, 1), (6983, 1)]
LDA 2 : [(29, 0.83809523809523812)]
DOC 3 : [(3079, 1), (3395, 1), (4874, 1)]
LDA 3 : [(34, 0.75714285714285712)]
DOC 4 : [(1482, 1), (2806, 1), (3988, 1)]
LDA 4 : [(22, 0.50714288283121989), (32, 0.25714283145449457)]
DOC 5 : [(440, 1), (533, 1), (1264, 1), (2433, 1), (3012, 1), (3902, 1), (4037, 1), (4502, 1), (5027, 1), (5723, 1)]
LDA 5 : [(12, 0.075870715371114297), (30, 0.088821329943986921), (31, 0.75219107156801579)]
DOC 6 : [(705, 1), (3156, 1), (3284, 1), (3555, 1), (3920, 1), (4306, 1), (4581, 1), (4900, 1), (5224, 1), (6156, 1)]
LDA 6 : [(6, 0.63896110435842401), (20, 0.18441557445724915), (28, 0.09350643806744402)]
DOC 7 : [(470, 1), (1434, 1), (1741, 1), (3654, 1), (4261, 1)]
LDA 7 : [(5, 0.17142855723258577), (13, 0.17142856888458904), (19, 0.50476192150187316)]
DOC 8 : [(2227, 1), (2290, 1), (2549, 1), (5102, 1), (7651, 1)]
LDA 8 : [(12, 0.16776844589094803), (19, 0.13980868559963203), (22, 0.1728575716782704), (28, 0.37194624921210206)]

where:
DOC N is the document from the BOW corpus
LDA N is the transformation of DOC N by the LDA model

Am I correct in understanding the output for each transformed document "LDA N" to be the topics that document N belongs to? By that understanding, some documents (4, 5, 6, 7 and 8) belong to more than one topic; e.g. DOC 8 belongs to topics 12, 19, 22 and 28 with the respective probabilities.

Could you please explain the output of LDA N and correct my understanding of it, especially since in another thread you've mentioned that a document belongs to ONE topic?
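[Editor's note] Each "LDA N" row above is indeed a sparse distribution over topics: a list of `(topic_id, probability)` pairs for that document. gensim omits topics whose probability falls below a `minimum_probability` threshold (0.01 by default, as far as I know), which is why the listed probabilities need not sum to 1. If a single label is wanted, the usual choice is the most probable topic. A small sketch using the LDA 8 row from the listing:

```python
# The "LDA 8" row from the listing above: (topic_id, probability) pairs.
lda_8 = [(12, 0.16776844589094803), (19, 0.13980868559963203),
         (22, 0.1728575716782704), (28, 0.37194624921210206)]

# The probabilities of the *listed* topics sum to less than 1 because
# low-probability topics are filtered out of the model's output.
listed_mass = sum(prob for _, prob in lda_8)

# If a single "the topic of this document" label is needed, take the
# most probable entry.
dominant_topic, dominant_prob = max(lda_8, key=lambda pair: pair[1])
print(dominant_topic, round(listed_mass, 3))
```

So a document has a full distribution over topics; "belongs to one topic" is only a shorthand for its dominant topic.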

Neil Stoker

Aug 26, 2016, 12:34:45 PM
to gensim, gde...@gmail.com
Hi Pino,

I know it's a while since this was posted, but did this actually work for you?

When I try the version with my_id, I get an error:
TypeError: <lambda>() missing 1 required positional argument: 'doc'

Is there anything obvious I might be overlooking? The earlier version, which returns just the bow, runs fine. I'm using Python 3 (just in case that makes a difference, but I doubt it).
Thanks,
Neil
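[Editor's note] That TypeError is a Python 2/3 difference rather than a gensim issue: PEP 3113 removed tuple parameters, so on Python 3 `sorted()` hands the whole `(id, doc)` pair to the key function as one argument. A minimal reproduction with invented data:

```python
pairs = [("a", 2), ("b", 1)]

# Two-parameter key, as in the Python 2-era snippet: sorted() calls it
# with a single pair argument, so Python 3 raises
# "TypeError: <lambda>() missing 1 required positional argument: 'doc'".
broken_key = lambda my_id, doc: doc
try:
    sorted(pairs, key=broken_key)
    raised = False
except TypeError:
    raised = True

# Working Python 3 form: accept the pair and index into it.
fixed = sorted(pairs, key=lambda pair: pair[1])
print(raised, fixed)
```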

Sharbani Pandit

Oct 6, 2016, 4:35:52 PM
to gensim
Hello Radim,
I am trying to do the same thing (getting the top documents for a certain topic) with LSI.
I tried the following lines:
tops = sorted(complaints, reverse=True, key=lambda doc: abs(dict(doc).get(0, 0.0)))
print(tops[:N])

where "complaints" contains all the documents and I am looking for documents close to topic 0. However, I get this error: "ValueError: dictionary update sequence element #0 has length 4; 2 is required". Can you please help me with this issue?

Thanks,
Sharbani

Lev Konstantinovskiy

Oct 11, 2016, 2:29:41 AM
to gensim
Hi,


Did you convert the document to its LSI representation?

doc = "Human computer interaction"
vec_bow = dictionary.doc2bow(doc.lower().split())
vec_lsi = lsi[vec_bow]
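[Editor's note] The "dictionary update sequence" ValueError above typically means `dict()` was applied to raw documents rather than to LSI-transformed vectors: every element passed to the sort key must be a `(topic_id, weight)` pair. An illustration with invented data:

```python
# A raw tokenised document vs. an (invented) LSI vector for it.
raw_doc = "some complaint text"
lsi_doc = [(0, 0.41), (1, -0.12)]

# dict() on an LSI vector works: every element is a (key, value) pair.
weights = dict(lsi_doc)

# dict() on raw tokens fails: "some" is a 4-character string, producing
# "ValueError: dictionary update sequence element #0 has length 4; ...".
try:
    dict(raw_doc.split())
    raised = False
except ValueError:
    raised = True
print(weights, raised)
```

So the corpus handed to the sort should be `lsi[bow_corpus]`, not the raw documents.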

Elizaveta

Aug 18, 2017, 4:31:44 AM
to gensim
Hello Radim,

I know that it's been a while since this answer was posted.


>>> tops = sorted(zip(my_ids, all_documents), reverse=True, key=lambda my_id, doc: abs(dict(doc).get(topic_number, 0.0)))

Unfortunately it does not work for me; something is missing and I do not know how to fix it.
I've also tried other options, like converting the LDA probabilities into a matrix of documents, topics and their probabilities, so I'd be able to associate topics with documents.
However, I still cannot manage it, so I'd highly appreciate any help.

Regards
Liza

ian.m...@conversica.com

Oct 27, 2017, 5:24:56 PM
to gensim
This solves the problem, btw:
tops = sorted(zip(my_ids, lda[corpus]), reverse=True, key=lambda x: abs(dict(x[1]).get(topic_number, 0.0)))