I'm pretty new to topic modelling, but am learning as I go. So far I
like gensim a lot. Thanks.
Quick question, I ran David Blei's lda-c code over some data as well
as gensim's LdaModel with update_every=0 (I expect a lot of topic
drift). IIUC, this option for batch lda should be very similar to the
lda-c code except that it's computationally a bit more efficient? My
inferred topics were a little different, however, and I'm wondering if
it could be because of the concentration parameters used or maybe just
that my topics aren't all that robust?
Blei's lda-c code has the option to estimate the concentration
parameter alpha via maximum likelihood, which I used, but gensim
doesn't look like it can. Is this something that's desirable in gensim
and just hasn't been implemented or is there something else keeping it
from being used? I'd be happy to code it up and submit a pull request,
if this would be helpful.
Cheers,
Skipper
Ah, ok.
>> inferred topics were a little different, however, and I'm wondering if
>> it could be because of the concentration parameters used or maybe just
>> that my topics aren't all that robust?
>
> Both alpha and eta hyper-parameters affect the resulting topics, as
> does the decay parameter in the online version. In theory, the algo
> always converges to the same solution. In practice, one usually stops
> training well before complete convergence, so the results can differ.
If I may ask one more question, what do these lines in my log indicate?
"132/1000 documents converged within 50 iterations"
Correct me if I'm wrong, but this suggests to me that the
intermediate optimization of the variational parameters wasn't
actually all that good here and therefore my per-document posteriors
over topics aren't all that good. Blei et al. [2003] suggest that the
number of iterations required here is usually on the order of the
number of words in the document, and the Hoffman source uses 100
instead of 50. Would it make sense to change the VAR_MAXITER to #
words + 50 for better performance in my case? My document sizes are
all over the place.
>
>> Blei's lda-c code has the option to estimate the concentration
>> parameter alpha via maximum likelihood, which I used, but gensim
>> doesn't look like it can. Is this something that's desirable in gensim
>> and just hasn't been implemented or is there something else keeping it
>> from being used? I'd be happy to code it up and submit a pull request,
>> if this would be helpful.
>
> It is desirable -- getting rid of free parameters is always a good
> thing :)
>
> This function used to exist in older versions of gensim -- the ones
> based directly on Blei's LDA-C, before I switched to online-lda. See
> `optAlpha` in http://trac.assembla.com/gensim/browser/tags/release-0.7.6/src/gensim/models/ldamodel.py
>
Good to know. Will have a look.
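For anyone following along, the scalar-alpha Newton update from appendix A.4.2 of Blei, Ng & Jordan (2003) can be sketched roughly like this. This is my own sketch, not lda-c's or gensim's actual code; the function name is made up, and the log-space stepping follows what lda-c's `opt_alpha` does, as I recall:

```python
import numpy as np
from scipy.special import psi, polygamma

def update_alpha(gamma, alpha, n_iter=100, tol=1e-6):
    """Newton update for a symmetric (scalar) alpha given the D x K matrix
    of variational gammas, per appendix A.4.2 of Blei, Ng & Jordan (2003).
    The step is taken in log-space to keep alpha positive."""
    D, K = gamma.shape
    # Sufficient statistics: sum over docs/topics of E_q[log theta_dk].
    ss = np.sum(psi(gamma) - psi(gamma.sum(axis=1, keepdims=True)))
    for _ in range(n_iter):
        grad = D * K * (psi(K * alpha) - psi(alpha)) + ss
        hess = D * K * (K * polygamma(1, K * alpha) - polygamma(1, alpha))
        step = grad / (hess * alpha + grad)  # Newton step on log(alpha)
        if abs(step) < tol:
            break
        alpha = alpha * np.exp(-step)
    return alpha

# Toy check on synthetic gammas (values arbitrary, for illustration only):
rng = np.random.default_rng(0)
gamma = rng.gamma(2.0, 1.0, size=(50, 10)) + 1.0
alpha_hat = update_alpha(gamma, 1.0)
```

The gradient here is D*K*(psi(K*alpha) - psi(alpha)) + ss, which is the derivative of the alpha-dependent part of the variational bound for a symmetric prior.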
> But I never found it useful: in my experience, alpha just kept getting
> smaller and smaller with each iteration, and never seemed to converge
> (in both LDA-C and my re-implementation). More importantly, the topics
> coming out of the online-lda algorithm seemed more coherent and made
> more sense, which, frankly, was more important to me than optimizing
> some likelihood quantity. So I never looked back.
Agreed on practicality over purity. Running the lda-c on my corpus
did, however, seem to converge to a reasonable value, which I used in
the online-lda.
> TL;DR: please share your experience with alpha auto-tuning. And with
> topics coming out of lda-c vs. online-lda. If your auto-tune
> implementation behaves well and is helpful, a pull request will be
> most welcome!
Thanks for the detailed answer. My intuition is not yet very strong
wrt the algorithms and why my topics would differ, so I appreciate the
guidance.
Aside: What I'm most interested in is speed right now, and I do have
access to a cluster; hence my interest in gensim. If I can find the
time, I was thinking of doing some profiling and trying to push some
of the loops down into Cython and calling dgemm directly using tokyo
[1]. It's been my experience that repeated calls to numpy.dot even
with ATLAS BLAS add quite a bit of overhead. Will report on my
progress if I ever make any.
Skipper
[1] Up to date fork: https://github.com/wesm/tokyo
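For what it's worth, newer scipy exposes the BLAS routines directly, which skips some of numpy.dot's dispatch overhead in tight loops without dropping to Cython. A minimal sketch (assuming the `scipy.linalg.blas` interface is available):

```python
import numpy as np
from scipy.linalg.blas import dgemm

# Fortran-ordered inputs avoid internal copies in the BLAS call.
rng = np.random.default_rng(0)
a = np.asfortranarray(rng.random((200, 300)))
b = np.asfortranarray(rng.random((300, 100)))

# Direct BLAS call: c = alpha * a @ b; should match np.dot to fp precision.
c = dgemm(alpha=1.0, a=a, b=b)
```

Whether the saved overhead matters depends on matrix sizes; for large multiplies the dgemm itself dominates.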
Hi Skipper,

> If I may ask one more question, what do these lines in my log indicate?
>
> "132/1000 documents converged within 50 iterations"

Depends on where in the log you see them: if near the beginning of
training, it's ok. If near the end, something went wrong.
> Agreed on practicality over purity. Running the lda-c on my corpus
> did, however, seem to converge to a reasonable value, which I used in
> the online-lda.

Oh, interesting. What value was that? Can you post some overall
statistics about your training corpus/training params? There's much
talk of LDA in literature, but practical results are hard to come by.
Correction to my earlier numbers: I had a bug in my change (I summed
the word counts over the wrong axis). With that fixed, on the first
pass I now get convergence of the variational parameters for 40-90% of
my documents, which is a huge improvement over the earlier 5-25%.
>>
>> > Agreed on practicality over purity. Running the lda-c on my corpus
>> > did, however, seem to converge to a reasonable value, which I used in
>> > the online-lda.
>>
>> Oh, interesting. What value was that? Can you post some overall
>> statistics about your training corpus/training params? There's much
>> talk of LDA in literature, but practical results are hard to come by.
>>
>
> I have about 25k documents with about 700k unique words. I'm using a
> vocabulary of the top 10k words by tf-idf. Right now I'm using LDA to do a
> few things: take a first pass at identifying the latent structure of
> topics, do some more stop word pruning, and improve my intuition about the
> algorithms and models. We expect that ultimately the structure of the
> topics is more complex. I'm new to this stuff though, obviously.
>
One other stat is that I have rather long documents. Mean word count
is 8k with a max count of 800k. This could be why I needed to increase
the number of iterations. If you're interested in my change, I can
make it an option to LdaModel and push it.
Yes, this is the batch algorithm.
> There I'm in favour of increasing the default internal accuracy
> parameters (self.VAR_MAXITER and self.VAR_THRESH), as per earlier
> discussion. But changing one magic value to another (50 to 100) seems
> somehow unsatisfactory... can you think of a better way to get rid of
> them?
Let the user set them?
Default would be current defaults:
LdaModel(..., var_maxiter = 50, var_thresh = 1e-3)
Optionally:
LdaModel(..., var_maxiter = None, var_thresh = 1e-5)
Estimate the maxiter based on the per-document word count? Or make
that behavior var_maxiter='est' to be more explicit, and let None
literally iterate until var_thresh is met or some really big upper
bound is hit?
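A hypothetical sketch of how those options might resolve inside LdaModel. All names here are made up for illustration; this is not gensim's API:

```python
def resolve_var_maxiter(var_maxiter, doc_word_count, hard_cap=10000):
    """Hypothetical resolver for the proposed var_maxiter options.

    - an int: use it as-is (current behaviour, default 50)
    - 'est': estimate from document length, per Blei et al.'s observation
      that the iterations needed scale with the number of words
    - None: iterate until var_thresh is met, up to a large upper bound
    """
    if isinstance(var_maxiter, int):
        return var_maxiter
    if var_maxiter == 'est':
        return max(100, int(1.25 * doc_word_count))
    if var_maxiter is None:
        return hard_cap
    raise ValueError("var_maxiter must be an int, 'est', or None")
```

The magic numbers (100, 1.25, the hard cap) are placeholders, of course; the point is only the dispatch on the three option types.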
>
> With the batch algo, we're allowed to go over the training data
> multiple times, so maybe we can sacrifice one pass (or a part of it),
> and estimate some sensible defaults, automatically?
>
Right, I'm currently iterating over mine around 20 times just to be on
the safe side. Ideally I'd like to check how much change there is in
the last few iterations.
I just added a line or two in the inference method.
if update_every == 0:  # sometimes I thought I saw it changed to 1?
    VAR_MAXITER = max(100, int(1.25 * sum(doc, axis=0)[1]))
(I've imported all the methods from numpy to avoid getattribute
overhead ... hopeful micro-optimizations)
Though this chokes if it gets an empty document (user error?). Of
course it could be saved after the first pass and used again, though I
doubt there's too much difference in practice between summing again vs
getting and slicing an attribute list/array.
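A slightly more defensive version of that idea, assuming `doc` is a gensim-style bag-of-words (a list of `(word_id, count)` tuples); the function name and the empty-document guard are mine:

```python
def estimate_var_maxiter(doc, floor=100, scale=1.25):
    """Estimate the variational maxiter from document length.

    doc is assumed to be a gensim-style bag-of-words: a list of
    (word_id, count) tuples. Empty documents fall back to the floor,
    so this doesn't choke on them.
    """
    n_words = sum(count for _, count in doc)  # total tokens, not unique terms
    return max(floor, int(scale * n_words))
```

A plain generator sum sidesteps the axis bookkeeping of the numpy version and handles the empty case for free.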
Skipper
Ha, ok. I'm probably -1 on a config file option. I wonder whether you'd
want to drop LDA if it could find a home in scikit-learn, or if
it's too domain specific. If you want to keep it, maybe it would make
sense to split the model instantiation from the training - then you
could split the options across the two methods, __init__ and train.
>
>> Estimates the maxiter based on per document word count? Or this
>> behavior is var_maxiter='est' to be more explicit and let None
>> literally iterate until var_thresh is met or some really big upper
>> bound is hit?
>
> Sounds good. I'll open a github issue for setting these params
> automatically. Thanks for bringing this up,
I have a branch here with the changes already plus a few other
optimizations, so you can see if it's too much with the params.
https://github.com/jseabold/gensim/tree/speedup-olda
There really should be a distributed, general Python library for
probabilistic modeling. PyMC is close, but not distributed.
I'm just learning probabilistic modeling, and can attest to it not
being for humans, but it's certainly powerful and useful for a broad
range of problems!
Norvig does a nice job of underlining the importance of statistical
modeling here -- http://norvig.com/chomsky.html
Radim -- i would love to see gensim build out the LDA + topic modeling stuff :]
i'm learning --
just found this tutorial which was helpful in general --
http://videolectures.net/ecmlpkdd09_park_sldair/
one thing i'm not sure of -- he claims making alpha asymmetric causes
fitting to take longer
i assume that's true -- but i think alpha's still useful for
increasing/decreasing term-topic sparsity
i.e. -- lowering alpha increases sparsity
i think that's correct --- ????
it seems to jibe w/ this:
> In theory, the algo always converges to the same solution.
> In practice, one usually stops training well before complete
> convergence, so the results can differ.
and this:
https://lists.cs.princeton.edu/pipermail/topic-models/2011-October/001603.html
if that's correct -- it's worth noting that there are benefits to
choosing sparsity at the expense of precision --
http://www.youtube.com/watch?v=ZmNOAtZIgIk
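That matches the usual intuition about symmetric Dirichlet priors. A quick numerical sketch of the effect (parameter values picked arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
K = 10
# Topic proportions theta ~ Dirichlet(alpha): low alpha concentrates mass
# on a few topics per draw (sparse), high alpha spreads it out (dense).
sparse = rng.dirichlet([0.1] * K, size=2000)
dense = rng.dirichlet([10.0] * K, size=2000)

# Average size of the largest topic weight per draw:
sparse_peak = sparse.max(axis=1).mean()
dense_peak = dense.max(axis=1).mean()
print(sparse_peak, dense_peak)  # the low-alpha draws are far more peaked
```

With alpha well below 1 the draws sit near the simplex corners (one or two dominant topics); with alpha well above 1 they hover near the uniform distribution.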
I'm excited to experiment/play-with/bend the model
I'm still messing w/ my db + data collection routines (the boring stuff)
Next on my list is spending more time w/ gensim + the new 'Similarity Server'
Hi Radim,
Can you (or someone) sanity check me here? Looking at the lda-c code,
I see that f here
http://trac.assembla.com/gensim/browser/tags/release-0.7.6/src/gensim/models/ldamodel.py#L414
is the same as their code. Both look correct for alpha as a scalar
*except* the last term. According to A.4.2 in Blei, Ng, and Jordan,
this should be
term3 = np.sum(np.sum((alpha-1)*(gammaln(gamma) -
gammaln(np.sum(gamma,1)[:,None])), axis=1))
which is the same whether alpha is a scalar or a 1d array. This does *not* equal
(alpha - 1) * alpha_suff_stats
when alpha is a scalar, unless I am missing something.
Any thoughts?
Thanks,
Skipper
Doh. gammaln should be digamma. Sorry for the noise.
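For reference, a toy sketch of the corrected term with digamma substituted; the shapes and values below are made up for illustration only:

```python
import numpy as np
from scipy.special import psi  # psi is the digamma function

# Toy stand-ins: gamma is a D x K variational matrix, alpha a scalar.
rng = np.random.default_rng(1)
gamma = rng.gamma(2.0, 1.0, size=(4, 3)) + 1.0
alpha = 0.5

# Third term from A.4.2 with digamma (not gammaln) of the variational gamma:
term3 = np.sum((alpha - 1) * (psi(gamma) - psi(np.sum(gamma, 1)[:, None])))
```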
Looks good to me. I've brought the code back in in a branch in my
fork, except I'm just using a solver from scipy.optimize after doing
inference on all the documents instead of trying to roll my own (bfgs
takes a few seconds at best with one parameter). It seems to be
working very well on the code I've got running. I'm only using it for
batch updating though, I haven't added the online updates yet.
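A sketch of that approach, assuming a symmetric scalar alpha and a D x K matrix of variational gammas. This is my reading of the alpha-dependent part of the bound (appendix A.4 of Blei, Ng & Jordan 2003) handed to a generic scipy solver, not the actual code in the branch:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import gammaln, psi

def fit_alpha(gamma):
    """Fit a symmetric alpha by maximizing the alpha-dependent part of the
    variational bound, using a generic bounded scalar solver instead of a
    hand-rolled Newton loop."""
    D, K = gamma.shape
    # Sufficient statistics: sum over docs/topics of E_q[log theta_dk].
    ss = np.sum(psi(gamma) - psi(gamma.sum(axis=1, keepdims=True)))

    def neg_bound(log_alpha):
        alpha = np.exp(log_alpha)  # optimize in log-space to keep alpha > 0
        return -(D * (gammaln(K * alpha) - K * gammaln(alpha))
                 + (alpha - 1) * ss)

    res = minimize_scalar(neg_bound, bounds=(-10, 10), method='bounded')
    return np.exp(res.x)

# Toy run on synthetic gammas:
rng = np.random.default_rng(1)
gamma = rng.gamma(2.0, 1.0, size=(50, 10)) + 1.0
alpha_fit = fit_alpha(gamma)
```

With a single parameter the solver cost is negligible next to the inference pass, which matches the "a few seconds at best" observation above.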