Seeding Words into LDA model in gensim


tianh...@yale.edu

Jun 28, 2016, 10:55:20 PM
to gensim
hi all, I am doing LDA topic modeling over a huge corpus. I want to seed certain words into the model, meaning that I want to set specific priors for some words. I did find documentation online saying that I should specify 'eta' or 'alpha' in the model, but I could hardly find any examples. Does anyone know how to use that in gensim.models.ldamodel.LdaModel?

Radim Řehůřek

Jun 29, 2016, 12:55:34 AM
to gensim
Hi,

the documentation of `eta` describes the expected input:

What exactly is the problem? Is it giving you errors?

-rr

tianh...@yale.edu

Jun 29, 2016, 1:38:45 AM
to gensim
Hi,

Thanks for replying. Yes, I actually saw that description, but I am confused about what 'eta' should look like. Say my given prior is something like: word xxx with p=XX, word xxx with p=XX, etc. I don't know how to translate those priors into 'eta'.

Lev Konstantinovskiy

Jun 29, 2016, 8:53:36 PM
to gensim

Hi Tian,

The shape of eta is num_topics x num_terms.

The term index is the index of the word in the dictionary (for example, the id returned by the doc2bow method).
The topic index can be whichever topic you wish to seed.

For example: eta[0, word_index] = some_probability
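
Something along these lines should work (a rough sketch; the toy corpus, the seed words and the 0.9 value are purely illustrative, not anything gensim prescribes):

import numpy as np
from gensim.corpora import Dictionary
from gensim.models import LdaModel

texts = [["bank", "loan", "money"], ["river", "bank", "water"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

num_topics = 2
num_terms = len(dictionary)

# Start from a flat prior, then raise the entries for the seed words
# in the topics they should be pulled towards.
eta = np.full((num_topics, num_terms), 1.0 / num_topics)
seed_words = {0: ["loan", "money"], 1: ["river", "water"]}  # topic id -> seed words
for topic_id, words in seed_words.items():
    for word in words:
        eta[topic_id, dictionary.token2id[word]] = 0.9  # illustrative boost

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics, eta=eta)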

tianh...@yale.edu

Jun 30, 2016, 12:01:32 AM
to gensim
Thanks, that's really helpful! In that case, the entries for the words I want to seed will hold their probabilities, but what should I put in the eta matrix for the words I don't want to seed; should I enter 0?

BHARGAV SRINIVASA DESIKAN

Jun 30, 2016, 12:49:20 AM
to gensim
Would this process happen before the training starts? The seeding, I mean.

Lev Konstantinovskiy

Jun 30, 2016, 7:12:49 PM
to gensim
Tian,

I would suggest initialising the rest symmetrically, as in (1 - already_assigned_to_special_words) / num_topics.

Regards
Lev

irfnali

Jul 20, 2016, 1:24:53 PM
to gensim
Hi Lev,

It seems that, regardless of how I set the document-topic prior alpha, after manually setting the topic-word prior eta to a non-uniform (in fact highly peaked) distribution over some hand-picked tokens (6 topics, with 40-125 higher-weighted tokens in each one), the perplexity (as reported by INFO-level logging while fitting LDA) simply oscillates in a very regular pattern in the best of cases, and in the worst cases actually drops. Despite the strongly peaked topic-word prior, the inferred topics still reshuffle the seed words across all topics. This is frustrating, because the complete reshuffling defeats the purpose of trying to guide the topic inference via the priors. I would be happy to accept that inference on this data set simply prefers that solution and overpowers the priors, but with the perplexity behaving as erratically as it does, I instead suspect that the optimization is not converging at all.

Do you have any intuition, experience, or guidance as to how to do this better, or what might be the issue?

I suspect a numerical stability issue, so I will try making the priors smoother/less peaked, though my fear is that with very multi-modal priors (as is the case here) the variational inference may be failing altogether. Or is this concern misplaced?

Kind regards,
Ilan

Myrthe van Dieijen

Feb 3, 2017, 5:41:35 AM
to gensim
Hi Lev, 

Thanks for the helpful suggestions in this thread! I have one additional question regarding your last comment. Based on your suggestion to initialise the rest symmetrically as (1 - already_assigned_to_special_words) / num_topics, the columns would sum to 1 but the rows would not. Is this correct? Each row represents a (Dirichlet) distribution of a topic over the vocabulary, and from what I know that needs to sum to 1. Each column represents a vector of the probabilities with which a particular word occurs in the different topics, and I always thought that a word can occur in multiple topics with high probability (not necessarily summing to 1), and that this is one of the benefits of the LDA model.
Or am I interpreting 'already_assigned_to_special_words' incorrectly? Could you explain your suggestion a bit more? Do you add up the already assigned probabilities in each row or in each column when you compute this 'already_assigned_to_special_words' probability?

Thanks very much in advance for the clarification!

Regards,

Myrthe

Myrthe van Dieijen

Feb 3, 2017, 7:47:04 AM
to gensim
Hi Lev, 

I've discovered in the meantime that the prior for eta is usually set to 1/num_topics for all words, so I guess that answers my question about the row and column sums. My only remaining question, then, is: how is already_assigned_to_special_words computed? Is it the row sum or the column sum of the already assigned probabilities? Or is it the sum of all the already assigned probabilities in the entire prior matrix?

Thanks!

Myrthe

Antonio Velázquez

Feb 21, 2019, 7:55:53 PM
to Gensim

Hi, Myrthe.

Where did you see that using 1/num_topics was typical when using a prior for eta?

My intuition was that it should be 1/num_words, so as to represent a symmetric prior... but after a couple of experiments with Gensim, I believe that using 1/num_topics yields results similar to using a symmetric prior.
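
Roughly, the comparison I have in mind looks like this (a toy sketch; the corpus is made up):

import numpy as np
from gensim.corpora import Dictionary
from gensim.models import LdaModel

texts = [["cat", "dog", "fish"], ["dog", "bone", "park"], ["fish", "water", "cat"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
num_topics = 2

# An explicit constant prior of 1/num_topics for every term in the vocabulary...
lda_const = LdaModel(corpus, id2word=dictionary, num_topics=num_topics,
                     eta=np.full(len(dictionary), 1.0 / num_topics), random_state=0)
# ...versus gensim's built-in 'symmetric' prior.
lda_sym = LdaModel(corpus, id2word=dictionary, num_topics=num_topics,
                   eta='symmetric', random_state=0)

print(lda_const.print_topics())
print(lda_sym.print_topics())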

simon mackenzie

Mar 10, 2019, 2:15:21 PM
to Gensim
Is there anyone here who has got guided LDA working and can give an example? For eta I have tried 1/num_topics, 1/num_words, and 1/(num_words*num_topics). For the latter I increased the probability for the target word/document combinations by a multiplier of 1,000. All produced results almost identical to the unguided model.

Here is my notebook that runs the https://github.com/vi3k6i5/GuidedLDA package demo, and then runs gensim on the guidedlda data for comparison. Maybe I am using eta wrong, or is there a bug in gensim?
