get perplexity?


ak

Aug 19, 2015, 7:50:40 AM
to lda-users
Hi. Is there any way to get perplexity? I believe perplexity is the standard way to evaluate the results of learning an LDA model.

Allen B. Riddell

Aug 19, 2015, 9:13:28 AM
to ak, lda-users
Hi ak,

Calculating held-out perplexity is not trivial, and there are several
ways to do it. Document completion seems to be the most general, and it's
something that would be nice to have.

I've opened an issue on GitHub:

https://github.com/ariddell/lda/issues/44

Best,

Allen

On 08/19, ak wrote:
> Hi. Is there any way to get perplexity? I believe the perplexity is the
> standard way to evaluate the result of LDA model learning.
>
> --
> You received this message because you are subscribed to the Google Groups "lda-users" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to lda-users+...@googlegroups.com.
> To post to this group, send email to lda-...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/lda-users/2e09a8c0-ab0e-44f9-b16a-06f1df2f43e7%40googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

mari...@satalia.com

Aug 4, 2016, 8:17:25 AM
to lda-users
Dear sirs, 

I'm quite new to topic modelling, which I'm using to analyse grocery baskets. So far I have nice results setting an arbitrary number of topics. Is there a better way to identify the best number of topics?

I've also seen other software use perplexity to determine the number of topics; is it possible to get that with lda?

Thank you very much

markus...@wzb.eu

Sep 14, 2017, 6:05:29 AM
to lda-users
Hi there,

I'd like to pick this up again, since I'm also trying to evaluate model performance using held-out documents.
I've seen that by now the transform() method can determine the document-topic distribution theta for unseen documents on a trained model, using the "iterated pseudo-counts" (Wallach et al.) approach. Is it correct that the only thing left is calculating the log likelihood for theta?

Bye,
Markus

ridd...@fastmail.com

Sep 14, 2017, 9:02:02 AM
to lda-...@googlegroups.com
You mean the only thing left is calculating log likelihood for the
held-out documents? If so, I think that's right.


Markus Konrad

Sep 14, 2017, 9:31:59 AM
to lda-...@googlegroups.com
Yes. How can this actually be done? In the Griffiths & Steyvers paper
they evaluate log P(w|T) for different numbers of topics T, but I
don't understand how it is done in practice.
With your lda package I can use the transform() method for held-out
documents, but there seems to be no way to get their log likelihood.
--

Markus Konrad
- DV / Data Science -

fon: +49 30 25491 555
fax: +49 30 25491 558
mail: markus...@wzb.eu

WZB Data Science Blog: https://datascience.blog.wzb.eu/

Raum D 005
WZB – Wissenschaftszentrum Berlin für Sozialforschung
Reichpietschufer 50
D-10785 Berlin

markus...@wzb.eu

Oct 11, 2017, 10:15:29 AM
to lda-users
Would it be reasonable to use the formula for perplexity as in the R implementation by Grün/Hornik [1]? The document-topic distribution theta for unseen documents, which is used in the log(p(w)) function, could be calculated with the LDA.transform() method. They say it's suitable for Gibbs sampling.
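For reference, that perplexity formula (perplexity = exp(-log p(w) / N), with p(w|d) = sum_k theta_dk * phi_kw) can be sketched in a few lines of NumPy; `theta` and `phi` below are toy placeholder arrays, not the output of any particular fit:

```python
import numpy as np

def perplexity(X, theta, phi):
    """Perplexity of word counts X (docs x vocab) under an LDA model with
    document-topic weights theta (docs x topics) and topic-word
    distributions phi (topics x vocab):
        perplexity = exp(-sum_{d,w} n_dw * log p(w|d) / N)
    where p(w|d) = sum_k theta_dk * phi_kw and N is the total token count."""
    p_w = theta @ phi                  # docs x vocab: p(w|d)
    log_lik = np.sum(X * np.log(p_w))  # total log likelihood of the counts
    return np.exp(-log_lik / X.sum())

# toy example: 2 documents, 2 topics, 3 vocabulary words
theta = np.array([[0.9, 0.1], [0.2, 0.8]])
phi = np.array([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]])
X = np.array([[5, 2, 1], [1, 2, 4]])
print(perplexity(X, theta, phi))
```

With theta estimated via transform() for held-out documents, the same function would give held-out perplexity.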


Markus Paff

Mar 15, 2018, 5:52:30 AM
to lda-users
Hi all,

I was wondering if there is any way to approximate the perplexity, as computing it exactly is complex and time-consuming. Is anybody aware of something like a rule of thumb?

Thanks and regards,
Markus

ridd...@fastmail.com

Mar 15, 2018, 8:40:55 AM
to lda-...@googlegroups.com
Yes, you hold out 1/4th of the words at random from _each document_ and
see how well your fitted model predicts those. With this strategy you
don't have to estimate the document-topic weights because you've already
estimated them for each document (using those 3/4ths of the words).
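A rough sketch of that per-document split on a document-term count matrix (plain NumPy; `split_document_words` is a hypothetical helper, not part of the lda package):

```python
import numpy as np

def split_document_words(X, test_frac=0.25, seed=0):
    """Randomly assign ~test_frac of each document's tokens to a held-out
    test matrix; the remaining tokens form the training matrix.
    X is a docs x vocab matrix of word counts."""
    rng = np.random.default_rng(seed)
    X_train = np.zeros_like(X)
    X_test = np.zeros_like(X)
    for d, row in enumerate(X):
        # expand the count vector into individual token occurrences
        tokens = np.repeat(np.arange(len(row)), row)
        rng.shuffle(tokens)
        n_test = int(len(tokens) * test_frac)
        np.add.at(X_test[d], tokens[:n_test], 1)
        np.add.at(X_train[d], tokens[n_test:], 1)
    return X_train, X_test

# fit the model on X_train, then score the held-out counts in X_test
X = np.array([[4, 4, 0], [0, 2, 6]])
X_train, X_test = split_document_words(X)
```

Fitting on X_train and then computing the log likelihood of X_test under the fitted document-topic and topic-word distributions gives document-completion perplexity.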

This is much faster and I don't see any real downside. It's not
implemented in the lda package, however.

Best,

Allen

Markus Paff

Mar 19, 2018, 11:19:17 AM
to lda-users
Hi Allen, 

Thank you very much for your reply! I am pretty new to the topic and I do not really understand the approach. 

I guess the idea is as follows:
  • Run the topic modeling for an ascending number of topics with 3/4ths of the words.
  • Check the overlap with the held-out 1/4th of the words.
  • The best-fitting model is the solution.
Is this the correct process, or have I missed something?

Thanks and regards,
Markus

ridd...@fastmail.com

Mar 27, 2018, 2:00:11 PM
to lda-...@googlegroups.com
I'm not sure what you mean by "overlap".

If I recall correctly, the strategies for held-out perplexity calculation
are described in section 4.3 of

Buntine, Wray L., and Swapnil Mishra. 2014. “Experiments with
Non-Parametric Topic Models.” In Proceedings of the 20th ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining,
881–890. KDD ’14. New York, NY, USA: ACM.
https://doi.org/10.1145/2623330.2623691.

Their approach is the most practical, I think.



Markus Konrad

Apr 12, 2018, 10:39:29 AM
to lda-...@googlegroups.com
Hi Allen and Markus,

From time to time I come back to this problem, and I've been
following your discussion.
Unfortunately, the Buntine paper is not very clear on how the
calculations are actually done, and the same goes for many other papers
on this topic – but that's probably due to my lack of knowledge of the
underlying mathematical concepts.

I'd like to come back to Allen's proposal:

> you hold out 1/4th of the words at random from _each document_ and
> see how well your fitted model predicts those. With this strategy you
> don't have to estimate the document-topic weights because you've
> already estimated them for each document (using those 3/4ths of the words).

My problem is that I don't understand how you actually measure "how well
your fitted model predicts those [the 1/4th held-out words]".

From your explanations, I'd do something like this (in pseudo-Python):

# split documents

X_train = []
X_test = []
for doc in X:
    doc_test = ...   # sample 1/4th of doc's words
    doc_train = ...  # the other 3/4ths of doc's words
    X_test.append(doc_test)
    X_train.append(doc_train)

# fit model with training data
model.fit(X_train)

# evaluate
for d in range(len(X)):
    # document-specific distribution across topics (from training)
    theta_d = model.doc_topic_[d]
    # document-specific word probabilities (from training)
    prob_train = theta_d @ model.topic_word_
    # ratios of word occurrences in the test document
    prob_test = X_test[d] / X_test[d].sum()

    # now compare prob_train with prob_test?

My questions are: 1) Am I on the right track so far? I end up with a
vector of word probabilities for each training document learnt via LDA,
and a vector of word "probabilities" for the respective test document,
so I can check whether the probabilities from the training document are
close to those from the test document.
2) If that's correct so far, how do I proceed? Calculate the KL
divergence between both and average across documents?

I'd be happy to hear your feedback!

Best,
Markus

ridd...@fastmail.com

Apr 16, 2018, 1:13:12 PM
to lda-...@googlegroups.com
I think you're on the right track but the specifics aren't quite right
in the evaluate section.

It would be good to have a tutorial/vignette on this; it's a common problem.
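For what it's worth, a sketch of how the evaluate step could be fixed, assuming `theta` comes from the fit on the 3/4ths of the words and `phi = model.topic_word_`: instead of comparing probability vectors (e.g. via KL divergence), score the held-out counts directly under the model:

```python
import numpy as np

def heldout_log_likelihood(X_test, theta, phi):
    """Log likelihood of held-out word counts X_test (docs x vocab) under
    document-topic weights theta (docs x topics) and topic-word
    distributions phi (topics x vocab)."""
    log_p_w = np.log(theta @ phi)  # docs x vocab: log p(w|d)
    return float(np.sum(X_test * log_p_w))

def heldout_perplexity(X_test, theta, phi):
    """Held-out perplexity: exponentiated negative log likelihood per
    held-out token."""
    return float(np.exp(-heldout_log_likelihood(X_test, theta, phi)
                        / X_test.sum()))
```

Normalizing by the total number of held-out tokens (rather than averaging per document) keeps documents of different lengths comparable.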

Markus Konrad

Apr 25, 2018, 8:28:39 AM
to lda-...@googlegroups.com
I finally found a way. I had a look at the MATLAB code that comes with
the Wallach 2009 paper (https://people.cs.umass.edu/~wallach/code/etm/).
There's a method implemented for estimating the evaluation probability
using held-out documents. I re-implemented it in Python and it gives me
the same results as the MATLAB code.
In Allen's lda package, the "iterated pseudo-counts" method for
estimating theta for unseen documents is included (`transform` method),
which comes from the same paper. So we can split the corpus into
training and test documents, fit the model using the training data and
then estimate theta_test with the `transform` method. Then the
evaluation probability is calculated in the same way as in the MATLAB code.

I implemented it in my project tmtoolkit [1] (see [2] for the
implementation). It requires the package "gmpy2" to be installed.
tmtoolkit uses cross-validation with a given number of folds, so that
every document is part of both a training and a test set.

So this is the strategy of using held-out documents, not held-out words
in documents ("document completion"). I think both strategies are valid.
If I have time, I may implement the latter one too (it's also in the
MATLAB sources).

Anyway, it works but I think a measure like model coherence (as in [3])
is better at estimating the quality of a topic model, since this measure
is justified by experiments with human judgment on topic quality.


[1]: https://github.com/WZBSocialScienceCenter/tmtoolkit
[2]:
https://github.com/WZBSocialScienceCenter/tmtoolkit/blob/master/tmtoolkit/topicmod/evaluate.py#L23
[3]: D. Mimno, H. Wallach, E. Talley, M. Leenders, A. McCallum 2011:
Optimizing semantic coherence in topic models