Zero probabilities in LDA Topic Models?

sachiniw

Nov 3, 2014, 1:15:16 PM
to gen...@googlegroups.com
Hi,

Thank you very much for maintaining a very useful (and user friendly) forum + website.
I have a question about the output I got while training an LDA model on my corpus.

My corpus consists of 10 PDF documents (research papers converted to text format), which I fed to an LDA model for 100 passes to extract 10 topics.

GENSIM output:
------------------------

topic #0 (0.100): 0.033*source + 0.030*swarm + 0.025*algorithm + 0.013*emission + 0.011*2008 + 0.010*chemical + 0.009*article + 0.009*date + 0.009*surveys + 0.008*april
topic #1 (0.100): 0.020*responsibility + 0.016*laws + 0.013*regarding + 0.013*moral + 0.010*rules + 0.008*already + 0.008*combat + 0.008*geneva + 0.006*artificial + 0.006*army
topic #2 (0.100): 0.034*vector + 0.023*directed + 0.017*direction + 0.016*message + 0.013*multirobot + 0.013*chain + 0.011*travel + 0.011*targets + 0.011*near + 0.010*emergent
topic #3 (0.100): 0.018*article + 0.017*nodes + 0.012*node + 0.011*august + 0.009*surveys + 0.009*algorithm + 0.009*date + 0.009*source + 0.009*inference + 0.009*routing
topic #4 (0.100): 0.018*cognitive + 0.018*ight + 0.015*ooda + 0.011*loop + 0.010*coordination + 0.010*designs + 0.008*controller + 0.008*modules + 0.008*aircraft + 0.007*agentbased
topic #5 (0.100): 0.000*history + 0.000*hungary + 0.000*prasanna + 0.000*visited + 0.000*optimality + 0.000*short + 0.000*acknowledgments + 0.000*changed + 0.000*poses + 0.000*improving
topic #6 (0.100): 0.022*formation + 0.021*ight + 0.016*drones + 0.015*imaging + 0.015*image + 0.013*images + 0.011*projects + 0.010*networked + 0.009*line + 0.009*route
topic #7 (0.100): 0.060*utility + 0.026*distribution + 0.023*sharing + 0.019*policies + 0.012*cost + 0.011*distributions + 0.011*optimality + 0.010*optimal + 0.010*member + 0.010*trail
topic #8 (0.100): 0.000*history + 0.000*hungary + 0.000*prasanna + 0.000*visited + 0.000*optimality + 0.000*short + 0.000*acknowledgments + 0.000*changed + 0.000*poses + 0.000*improving
topic #9 (0.100): 0.024*measures + 0.014*studies + 0.012*experience + 0.012*behaviors + 0.012*users + 0.011*measure + 0.010*understanding + 0.010*metric + 0.009*source + 0.008*field

Eyeballing the topics against the content of the documents, I can say that the discovered topics are "acceptable", but I am not sure why some probabilities appear as zero (topics #5 and #8 above). I was thinking that because the model works with sparse vectors, only non-zero values are considered.

Is it because of the limited size of the corpus? Is there something wrong with my approach?

Python Code
-------------------

from gensim import corpora, models

corpusname = 'path to corpus'
acmcorpus = Corpus(corpusname)  # custom class; overrides get_texts() to extract text from the PDF documents, remove stopwords, etc.

acmdict = corpora.Dictionary(acmcorpus.get_texts())
acmdict.filter_extremes(no_below=2, no_above=0.5, keep_n=50000)
acmdict.compactify()

corpus = [acmdict.doc2bow(t) for t in acmcorpus.get_texts()]
model = models.LdaModel(corpus, id2word=acmdict, passes=100, num_topics=10)

Christopher S. Corley

Nov 3, 2014, 1:57:23 PM
to gensim
Excerpts from sachiniw's message of 2014-11-03 12:15:16 -0600:
> topic #5 (0.100): 0.000*history + 0.000*hungary + 0.000*prasanna + 0.000*visited + 0.000*optimality + 0.000*short + 0.000*acknowledgments + 0.000*changed + 0.000*poses + 0.000*improving
> [...]
> Is it because of the limited size of the corpus? Is there something wrong with my approach?


Nope, nothing wrong here! Having a few topics with "zero" probabilities is perfectly normal. It's likely due to your small corpus size.

If it helps, I like to think of it in this (oversimplified) way: in LDA, you can *allocate* as many topics as you'd like (or as many as memory can afford) -- you'll just end up with a bunch of 'empty' topics (i.e., ones where all words have really low probabilities) because no documents exhibited those topics during training. If you add 10 new documents to your training set, they may exhibit the topics that the previous ones did not.
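
If you want to spot those 'empty' topics programmatically rather than by eye, here's a rough sketch (assuming `model` is the trained LdaModel from your code; the 0.01 cutoff is arbitrary, and depending on your gensim version show_topic() may return (prob, word) pairs instead of (word, prob)):

# Flag topics whose top words all carry near-zero probability.
for topic_id in range(model.num_topics):
    top = model.show_topic(topic_id, topn=10)  # (word, probability) pairs
    mass = sum(prob for _, prob in top)
    if mass < 0.01:  # arbitrary cutoff
        print("topic #%d looks empty (top-10 mass = %.4f)" % (topic_id, mass))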

Hope that helps!

Chris.

sachini weerawardhana

Nov 3, 2014, 3:55:04 PM
to gen...@googlegroups.com
Thanks Chris! :-)



mnf

Mar 14, 2019, 11:10:56 AM
to Gensim
Hello,

I have a corpus of 1M docs. I'm running LDA with 500 topics, and 440 of the 500 topics come out looking like this:

topic #5 (0.100): 0.000*history + 0.000*hungary + 0.000*prasanna + 0.000*visited + 0.000*optimality + 0.000*short + 0.000*acknowledgments + 0.000*changed + 0.000*poses + 0.000*improving
topic #10 (0.100): 0.000*history + 0.000*hungary + 0.000*prasanna + 0.000*visited + 0.000*optimality + 0.000*short + 0.000*acknowledgments + 0.000*changed + 0.000*poses + 0.000*improving
topic #345 (0.100): 0.000*history + 0.000*hungary + 0.000*prasanna + 0.000*visited + 0.000*optimality + 0.000*short + 0.000*acknowledgments + 0.000*changed + 0.000*poses + 0.000*improving

Most of the other parameters are left at their defaults.

I've tried changing the alpha initialization, but nothing changed.
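
For reference, roughly what I tried looked like this (a sketch, not my exact code; `dictionary` is my gensim Dictionary):

# Let gensim learn an asymmetric alpha from the data instead of the symmetric default.
model = models.LdaModel(corpus, id2word=dictionary, num_topics=500, alpha='auto')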

Any hint?

Thanks in advance.

Radim Řehůřek

Mar 14, 2019, 2:16:52 PM
to Gensim
Hm. There was an LDA refactoring merged recently, which decreased the internal matrix precision:

I wonder if that could be related to your problem.

Can you try changing the float32 back to float64 here:

and let us know if that helps?
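
(Depending on your gensim version, the precision may also be exposed directly as a dtype parameter on LdaModel, so you might not need to edit the source at all. A sketch, assuming that parameter is available in your version; `dictionary` stands in for your id2word mapping:)

import numpy as np
from gensim import models

# Train with float64 internal matrices instead of the float32 default.
model = models.LdaModel(corpus, id2word=dictionary, num_topics=500,
                        dtype=np.float64)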

Cheers,
Radim

Marisa Faraggi

Mar 15, 2019, 6:23:20 AM
to gen...@googlegroups.com
Radim,

Thanks a lot! Yes, it was the precision! Now it all runs perfectly!

Regards


Radim Řehůřek

Mar 15, 2019, 7:52:50 AM
to Gensim
No problem.

I opened a bug ticket for this on Github:

mnf, could you add more info to that ticket? What is your vocab size? Which software versions are you on?

Adding the info will help us debug and improve the LdaModel.

Thanks,
Radim

Thomas H.

Sep 30, 2019, 1:30:32 PM
to Gensim
Hi Radim,

We are currently trying the AuthorTopicModel. All we get are 0.000 probabilities, no matter the size of the corpus or the number of topics.

The training also takes just a couple of seconds (independent of the number of passes). This can't be right.

from gensim.models import AuthorTopicModel

num_topics = 10
corpus = [dct.doc2bow(text) for text in texts]
model = AuthorTopicModel(corpus, id2word=dct, num_topics=num_topics,
                         chunksize=2000, passes=500, iterations=20)
print(model)
print(model.print_topics(num_words=10))


AuthorTopicModel(num_terms=304953, num_topics=10, num_authors=0, decay=0.5, chunksize=2000)
[(0, '0.000*"brünstiglich" + 0.000*"feuerhaus" + 0.000*"sonnenhimmels" + 0.000*"allzugrau\u017fame" + 0.000*"gleubts" + 0.000*"freiflüssend" + 0.000*"orlogsmänner" + 0.000*"275" + 0.000*"gesalzne" + 0.000*"jammerkreiß"'), (1, '0.000*"untermenget" + 0.000*"glanzgewimmel" + 0.000*"büchersäle" + 0.000*"wegzudonnern" + 0.000*"harpy" + 0.000*"strohl" + 0.000*"erobrergrösse" + 0.000*"ruhtenbund" + 0.000*"demanttropfen" + 0.000*"hinaufgekommen"'), (2, '0.000*"verjüngender" + 0.000*"furchtbarster" + 0.000*"nderin" + 0.000*"caro\u017f\u017fe" + 0.000*"porte" + 0.000*"wunderholde" + 0.000*"feierzeit" + 0.000*"felses" + 0.000*"volksgetu" + 0.000*"aufgelodert"'), (3, '0.000*"trübseliger" + 0.000*"durchtrabt" + 0.000*"sönnen" + 0.000*"apportiren" + 0.000*"wid" + 0.000*"samstagmorgen" + 0.000*"angeschwirrt" + 0.000*"\u017ftab" + 0.000*"königsgruß" + 0.000*"machangelstrauch"'), (4, '0.000*"sternenkampe" + 0.000*"augenhimmel" + 0.000*"romniz" + 0.000*"grenzweg" + 0.000*"unsinnige" + 0.000*"k\u0119gelbahn" + 0.000*"eue" + 0.000*"meermoß" + 0.000*"wegzureysen" + 0.000*"surrende"'), (5, '0.000*"blocksbergstyl" + 0.000*"stilgefühl" + 0.000*"stummem" + 0.000*"teupe" + 0.000*"dunckelheit" + 0.000*"rei\u017felu\u017ft" + 0.000*"ingedenk" + 0.000*"glossieren" + 0.000*"vielgestalt" + 0.000*"haareinlagen"'), (6, '0.000*"zurückzuspringen" + 0.000*"thôr" + 0.000*"möchten" + 0.000*"himmelstheile" + 0.000*"kente" + 0.000*"bergnebel" + 0.000*"embsiglich" + 0.000*"eingepflanzter" + 0.000*"belagrungszustand" + 0.000*"flammen\u017fchein"'), (7, '0.000*"auszu\u017fchlafen" + 0.000*"raköthe" + 0.000*"jetzgen" + 0.000*"menschenhohen" + 0.000*"abgestochen" + 0.000*"kompromittiret" + 0.000*"auf\u017fchließt" + 0.000*"gestohl" + 0.000*"devotest" + 0.000*"gramlied"'), (8, '0.000*"aufgeweckter" + 0.000*"weibchen" + 0.000*"sonnenglühen" + 0.000*"eingeschrenkt" + 0.000*"mokant" + 0.000*"felsengang" + 0.000*"chüblen" + 0.000*"manheim" + 0.000*"vermachet" + 0.000*"gaukelndes"'), (9, '0.000*"feilgespreizt" + 0.000*"frauenthränen" + 0.000*"beschencken" + 0.000*"siegesjauchzen" + 0.000*"hochzeittanze" + 0.000*"angeworbnem" + 0.000*"gewähr" + 0.000*"grüs" + 0.000*"anwald" + 0.000*"pümmt"')]

AuthorTopicModel(num_terms=304953, num_topics=10, num_authors=0, decay=0.5, chunksize=2000)
[(0, '0.000*"begläntzet" + 0.000*"silberblauen" + 0.000*"rhodopeus" + 0.000*"lilienverwandte" + 0.000*"zurückgreifend" + 0.000*"maulwurff" + 0.000*"betrachter" + 0.000*"gurr" + 0.000*"bluteswallungen" + 0.000*"bejreift"'), (1, '0.000*"hainbaum" + 0.000*"genef" + 0.000*"erquikken" + 0.000*"nachtigalle" + 0.000*"verruffen" + 0.000*"sommerlaue" + 0.000*"waldgeraune" + 0.000*"könte" + 0.000*"pilgerierten" + 0.000*"ver\u017ftumm"'), (2, '0.000*"menschenseyns" + 0.000*"offenherzig" + 0.000*"fluecht" + 0.000*"himmelswege" + 0.000*"gluthgeschwader" + 0.000*"erschauet" + 0.000*"urbusch" + 0.000*"meerentsandt" + 0.000*"begoßner" + 0.000*"sickelt"'), (3, '0.000*"ausruhend" + 0.000*"um\u017fchra" + 0.000*"entrücken" + 0.000*"josiah" + 0.000*"entbittert" + 0.000*"rmet" + 0.000*"duco" + 0.000*"ve\u017fevus" + 0.000*"knick" + 0.000*"cleanth"'), (4, '0.000*"aehrenwald" + 0.000*"weltbekanntem" + 0.000*"hundertklauigen" + 0.000*"mäntlein" + 0.000*"verwilderter" + 0.000*"vnbekand" + 0.000*"obgerungen" + 0.000*"anfachen" + 0.000*"bünden" + 0.000*"behalten"'), (5, '0.000*"zukunftskrone" + 0.000*"götzenbilder" + 0.000*"heldenthrone" + 0.000*"cymbale" + 0.000*"lte\u017fte" + 0.000*"geblöckt" + 0.000*"n\u017ftlich\u017fte" + 0.000*"kanzeli\u017ften" + 0.000*"goldgewand" + 0.000*"verschnarcht"'), (6, '0.000*"umblühn" + 0.000*"pfannkuchen" + 0.000*"spottgeburt" + 0.000*"lästermäulig" + 0.000*"windeskraft" + 0.000*"augengift" + 0.000*"verwelckter" + 0.000*"be\u017fcheert" + 0.000*"niederflammt" + 0.000*"aufmerck\u017fames"'), (7, '0.000*"seelenzucht" + 0.000*"anheimgestellt" + 0.000*"11" + 0.000*"konsole" + 0.000*"faustgroße" + 0.000*"ge\u017fchmiert" + 0.000*"schapställ" + 0.000*"krampfte" + 0.000*"selle" + 0.000*"psalmengesang"'), (8, '0.000*"stralesund" + 0.000*"silberperl" + 0.000*"fürgesetzet" + 0.000*"tempel" + 0.000*"trauer\u017fang" + 0.000*"moskowitern" + 0.000*"kinderjahre" + 0.000*"scanderbeck" + 0.000*"prickt" + 0.000*"uberschwemm"'), (9, '0.000*"durchfuchtelnd" + 0.000*"rittertroß" + 0.000*"gründungssteine" + 0.000*"irreligiös" + 0.000*"spröd" + 0.000*"schätztest" + 0.000*"sälden" + 0.000*"sturben" + 0.000*"jahrszeit" + 0.000*"bezwangen"')]


AuthorTopicModel(num_terms=2766, num_topics=10, num_authors=0, decay=0.5, chunksize=2000)
[(0, '0.000*"trächtig" + 0.000*"zusammenflossen" + 0.000*"unempfindlichkeit" + 0.000*"ge\u017fchnittene" + 0.000*"meer" + 0.000*"wahn" + 0.000*"artus" + 0.000*"wiederkehre" + 0.000*"lustig" + 0.000*"scheut"'), (1, '0.000*"jahren" + 0.000*"freudenbahre" + 0.000*"stärkre" + 0.000*"gewicht" + 0.000*"trillerten" + 0.000*"ber\u017fchrifft" + 0.000*"steine" + 0.000*"flogen" + 0.000*"duft" + 0.000*"priesterwein"'), (2, '0.001*"met" + 0.000*"wasser" + 0.000*"acoluth" + 0.000*"wegen" + 0.000*"reistest" + 0.000*"mirs" + 0.000*"kniet" + 0.000*"\u017fer" + 0.000*"schlüssel" + 0.000*"pereat"'), (3, '0.000*"drehn" + 0.000*"unendlichen" + 0.000*"todeswunden" + 0.000*"48" + 0.000*"lang" + 0.000*"erden" + 0.000*"merkur" + 0.000*"sünden" + 0.000*"rollen" + 0.000*"bins"'), (4, '0.000*"schlage" + 0.000*"fall" + 0.000*"heiteren" + 0.000*"ersieht" + 0.000*"asche" + 0.000*"dumpfe" + 0.000*"verschloßten" + 0.000*"angeschienen" + 0.000*"meißel" + 0.000*"andren"'), (5, '0.001*"end" + 0.000*"klagen" + 0.000*"fal\u017fch" + 0.000*"schön" + 0.000*"nde" + 0.000*"thore" + 0.000*"warm" + 0.000*"tiefen" + 0.000*"blendend" + 0.000*"wurden"'), (6, '0.000*"phönixnest" + 0.000*"gern" + 0.000*"meergöttinnen" + 0.000*"weisen" + 0.000*"entsprossen" + 0.000*"\u017fteigt" + 0.000*"wollt" + 0.000*"lindor" + 0.000*"fährt" + 0.000*"obherrschen"'), (7, '0.000*"sollt" + 0.000*"ehre" + 0.000*"hektor" + 0.000*"topfe" + 0.000*"unglückliche" + 0.000*"undanckbarkeit" + 0.000*"54" + 0.000*"wies" + 0.000*"pitts" + 0.000*"blümlein"'), (8, '0.001*"erfahren" + 0.000*"erweckte" + 0.000*"offen" + 0.000*"morgens" + 0.000*"gesäme" + 0.000*"schenken" + 0.000*"waffensaal" + 0.000*"protegees" + 0.000*"vergeuden" + 0.000*"lege"'), (9, '0.000*"denck" + 0.000*"stern" + 0.000*"vorurtheilen" + 0.000*"opferkuchen" + 0.000*"bäumlein" + 0.000*"taubenschlage" + 0.000*"gutes" + 0.000*"fremde" + 0.000*"blicken" + 0.000*"persepolis"')]


Best,
Thomas

Radim Řehůřek

Oct 1, 2019, 5:36:02 AM
to Gensim
Hi Thomas,

that's not good! Can you open a ticket on GitHub, including a minimal reproducible example?

Thanks,
Radim

SquarePegg

Oct 24, 2019, 11:38:10 AM
to Gensim
I can see why this would cause you alarm! To my knowledge, all term probabilities within a topic must be non-zero (the displayed 0.000 values are presumably just rounded down). That said, given that the highlighted topics consist only of near-zero term probabilities, perhaps this is a way to represent fewer topics than were requested? Just out of curiosity, what happens when you specify 8 topics instead of 10, as sketched below? Could you post that output for us as well?
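
Reusing the code from the first post, that re-run would look roughly like this (a sketch; `corpus` and `acmdict` are from the original message):

# Re-train with 8 topics instead of 10, keeping everything else the same.
model8 = models.LdaModel(corpus, id2word=acmdict, passes=100, num_topics=8)
print(model8.print_topics(num_words=10))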