Hi Allen and Markus,
I also come back to this problem from time to time, and I've been
following your discussions.
Unfortunately, the Buntine paper is not very clear on how the
calculations are actually done, and the same goes for many other papers
on the topic – though that may just be down to my limited knowledge of
the underlying mathematical concepts.
I'd like to come back to Allen's proposal:
> you hold out 1/4th of the words at random from _each document_ and
> see how well your fitted model predicts those. With this strategy you
> don't have to estimate the document-topic weights because you've
> already estimated them for each document (using those 3/4ths of the words).
My problem is that I don't understand how you actually measure "how well
your fitted model predicts those [the 1/4th held-out words]".
From your explanations, I'd do something like this (in pseudo-Python):
# split documents
X_train = []
X_test = []
for doc in X:
    doc_test = ...   # sample 1/4th of the words in doc
    doc_train = ...  # the remaining 3/4th of doc
    X_test.append(doc_test)
    X_train.append(doc_train)

# fit model with training data
model.fit(X_train)

# evaluate
for d in range(len(X)):
    # document-specific distribution over topics (from training)
    theta_d = model.doc_topic_[d]
    # document-specific probability for each word (from training):
    # a mixture of the topic-word distributions, so a dot product
    prob_train = theta_d @ model.topic_word_
    # relative word frequencies in the test document
    prob_test = X_test[d] / X_test[d].sum()
    # now compare prob_train with prob_test?
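Just to make the "sample 1/4th" step concrete, this is how I'd split a
bag-of-words count vector per document with numpy (`split_counts` is a
made-up helper name, not from any library; it samples held-out tokens
without replacement):

```python
import numpy as np

def split_counts(x, frac=0.25, rng=None):
    """Hold out roughly `frac` of the tokens of a bag-of-words
    count vector `x`, sampled without replacement."""
    rng = np.random.default_rng() if rng is None else rng
    n_test = int(x.sum() * frac)
    # draw n_test tokens without replacement from the word counts
    x_test = rng.multivariate_hypergeometric(x, n_test)
    x_train = x - x_test
    return x_train, x_test

# example: a document with 20 tokens over a 5-word vocabulary
x = np.array([8, 5, 4, 2, 1])
x_train, x_test = split_counts(x)
assert (x_train + x_test == x).all()
assert x_test.sum() == 5  # 1/4 of 20 tokens
```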
My questions are now: 1) Am I on the right track so far? I end up with a
vector of word probabilities for each training document (learnt via LDA)
and a vector of empirical word frequencies for the corresponding test
document, so I can check whether the model's probabilities are close to
the test frequencies.
2) If that's correct so far, how do I proceed? Calculate the KL
divergence between the two distributions and average across documents?
I'd be happy to hear your feedback!
Best,
Markus