Random seed in LDA?


izzy

Feb 15, 2013, 4:53:06 PM
to gen...@googlegroups.com
When using LDA on the same corpus, I have noticed a different result each time.  Is there a way to choose the random seed, or is it set from the clock or by some other method? I looked through the code, but maybe I missed it or didn't dig deep enough.  I basically just want to be able to replicate the same set of topics each run. I know that might not be possible when using distributed LDA, because different workers may be assigned different chunks, but in batch mode it ought to be possible (unless I am completely misunderstanding the algorithm).

Radim Řehůřek

Feb 17, 2013, 10:40:03 AM
to gensim
Hello izzy,

I never tried with LDA, but setting the random seed ought to achieve
that.

That is, setting `numpy.random.seed` and `random.seed` to a fixed
value and then following the same sequence of steps ought to produce
the same results.
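For example (a minimal sketch, not from gensim itself — the helper and the seed value are illustrative):

```python
import random

import numpy as np

def reproducible_draws(seed):
    # Illustrative helper: fix both RNGs, then take a couple of draws.
    # Any training code run after these two calls follows the same
    # sequence of random numbers on every run with the same seed.
    random.seed(seed)
    np.random.seed(seed)
    return random.random(), np.random.random()

run1 = reproducible_draws(42)
run2 = reproducible_draws(42)
print(run1 == run2)  # True: identical draws, so identical downstream results
```

With gensim, the same idea would apply: call both seed functions once, before constructing and updating the model.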

Let me know if that worked,
Radim

izzy

Feb 20, 2013, 2:39:57 PM
to gen...@googlegroups.com
Thanks, Radim. I think that is probably what I am looking for. I'll give it a try when I get a chance and let you know.

izzy

Feb 21, 2013, 11:33:39 AM
to gen...@googlegroups.com
Radim,
It does work when you set the seeds as you suggest AND you use the non-distributed version.  I haven't tried it for distributed mode, but since there is going to be randomness over which workers get which chunks, it won't work there.

One issue I have encountered is that I can't change the maximum number of iterations when I use distributed=True.  It seems to override that setting and use 50.

Here's my code:

   mod = models.LdaModel(id2word=dictionary, num_topics=num_topics, passes=num_passes, update_every=0, distributed=True)
   mod.VAR_MAXITER = 100
   mod.VAR_THRESH = 0.001
   mod.update(corpus)

In the worker logs, it says "...documents converged within 50 iterations", but only when distributed=True.

Radim Řehůřek

Feb 23, 2013, 8:38:10 AM
to gensim
Hi izzy,

you're right. The following

>    mod = models.LdaModel(id2word=dictionary, num_topics=num_topics,
> passes=num_passes, update_every=0, distributed=True)
>    mod.VAR_MAXITER = 100
>    mod.VAR_THRESH = 0.001
>    mod.update(corpus)

will only change VAR_MAXITER in the master -- the change does not
affect any workers.

As a simple workaround, I'd suggest changing the default value in
ldamodel.py in your source code:
https://github.com/piskvorky/gensim/blob/develop/gensim/models/ldamodel.py#L234

Btw, is 50 not good with your data, while 100 helps? This value is a
bit of a magic constant, maybe I could increase it to 100 by default
everywhere.

Regards,
Radim

Radim Řehůřek

Feb 23, 2013, 8:39:35 AM
to gensim
Hah, actually there is an open ticket just for that:
https://github.com/piskvorky/gensim/issues/58

If you feel like contributing, getting rid of the magic VAR_MAXITER
constant in gensim would be cool.

Radim

Skipper Seabold

Feb 23, 2013, 9:51:19 AM
to gen...@googlegroups.com
On Thu, Feb 21, 2013 at 11:33 AM, izzy <risra...@gmail.com> wrote:
> Radim,
> It does work when you set the seeds as you suggest AND you use the non distributed version.  I haven't tried it for distributed, but since there is going to be randomness over which workers get which chunks, it won't work.


Hi,

Yes, reproducible distributed random numbers can be tricky business.


Here is some IPython code that shows the problems and a likely suboptimal solution. For gensim, you could possibly have the seed set by the user and/or, depending on what you want, by the hash of the chunk or something along these lines -- provided the chunks are always the same and are hashable (or can be converted to be). I'm not sure of the performance hit here, but it might be negligible or unimportant.

import numpy as np
from IPython.parallel import Client

def random_array(size):
    array = np.random.random(size)
    return array

def random_array_stateful(size, prng):
    array = prng.rand(size)
    return array

def random_array_naive_solution(size, prng, seed):
    prng.seed(seed)
    array = prng.rand(size)
    return array

rc = Client()
dview = rc[:]
dview.execute("import numpy as np")

# not reproducible, doesn't know about global state
np.random.seed(12345)
p_res = dview.map_sync(random_array, [5, 5, 5, 5, 5])
np.random.seed(12345)
p_res2 = dview.map_sync(random_array, [5, 5, 5, 5, 5])

# gives same values for each one
prng = np.random.RandomState(12345)
p_res3 = dview.map_sync(random_array_stateful, [5, 5, 5, 5, 5], [prng]*5)
prng.seed(12345)
p_res4 = dview.map_sync(random_array_stateful, [5, 5, 5, 5, 5], [prng]*5)

# naive "solution"
prng.seed(12345)
seeds = prng.randint(0, 1e6, size=5)
p_res5 = dview.map_sync(random_array_naive_solution, [5, 5, 5, 5, 5], [prng]*5,
                        seeds)
p_res6 = dview.map_sync(random_array_naive_solution, [5, 5, 5, 5, 5], [prng]*5,
                        seeds)

Skipper

Skipper Seabold

Feb 23, 2013, 9:59:06 AM
to gen...@googlegroups.com
On Sat, Feb 23, 2013 at 8:39 AM, Radim Řehůřek <m...@radimrehurek.com> wrote:
> Hah, actually there is an open ticket just for that:
> https://github.com/piskvorky/gensim/issues/58
>
> If you feel like contributing, getting rid of the magic VAR_MAXITER
> constant in gensim would be cool.


You can see how I was thinking about doing this here.


This code is not really fit for public consumption. I got it working for what I needed and never cleaned it up or decided on a sensible API. I just went in and set VAR_MAXITER to max(int(doc.sum(0)[1]*1.25), 100). IIRC this gets around it being reset in the workers.

But my idea was to allow an "est" argument for VAR_MAXITER, which would make it max(int(doc.sum(0)[1]*1.25), SOME_MINIMUM_NUMBER).
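One way that could look (a hypothetical resolver, not gensim's actual API -- resolve_var_maxiter and the default minimum are made up for illustration):

```python
def resolve_var_maxiter(value, doc_word_count, minimum=50):
    # Accept either an explicit integer cap, or the string "est", which
    # estimates the cap from the document's word count, floored at `minimum`.
    if value == "est":
        return max(int(doc_word_count * 1.25), minimum)
    if isinstance(value, int):
        return value
    raise ValueError("expected an int or 'est', got %r" % (value,))

print(resolve_var_maxiter("est", 100))  # 125
print(resolve_var_maxiter("est", 10))   # 50 (falls back to the minimum)
print(resolve_var_maxiter(200, 100))    # 200
```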

I don't like the mixing of strings and integers for an argument though. Just some food for thought,

Skipper

Radim Řehůřek

Feb 24, 2013, 4:52:31 AM
to gensim
Excellent comments, thank you Skipper. The idea of seeding the RNG by
chunk hash is cool, too (chunks fit in RAM, so there's no performance
penalty to speak of there; iterating over a chunk is cheap).
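A minimal sketch of that idea (the helper names are hypothetical, and it assumes a chunk has a deterministic repr):

```python
import hashlib

import numpy as np

def seed_from_chunk(chunk):
    # Hypothetical helper: derive a deterministic 32-bit seed from the
    # chunk's contents, so the same chunk reseeds the RNG the same way
    # no matter which worker happens to process it.
    digest = hashlib.md5(repr(chunk).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")

def process_chunk(chunk):
    # Use a local RandomState seeded from the chunk, instead of the
    # worker's global RNG state.
    prng = np.random.RandomState(seed_from_chunk(chunk))
    return prng.rand(len(chunk))

chunk = [(0, 1.0), (3, 2.0)]   # toy bag-of-words chunk
out_a = process_chunk(chunk)
out_b = process_chunk(chunk)
print((out_a == out_b).all())  # True: same chunk, same random draws
```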

izzy, may I ask why you need exact re-runs of LDA? Where does that
requirement come from?

Cheers,
Radim




Radim Řehůřek

Feb 24, 2013, 4:55:12 AM
to gensim
Ah, I had a vague recollection that VAR_MAXITER had already been
discussed before. Should have known it was you, Skipper :)

You seem to have touched on all the right topics back then, pity we
didn't see the changes through to a pull request!

Radim




Skipper Seabold

Feb 24, 2013, 9:06:01 PM
to gen...@googlegroups.com
I'll likely return to it and finish up. It's poor form that I left it
hanging, but the project I was working on with LDA got put on the
back burner. I'll work on it more sometime this year, probably
during the summer.

izzy

May 29, 2013, 10:26:25 AM
to gen...@googlegroups.com


On Sunday, February 24, 2013 4:52:31 AM UTC-5, Radim Řehůřek wrote:
> izzy, may I ask why you need exact re-runs of LDA? Where does that
> requirement come from?
>
> Cheers,
> Radim


I'm an academic, so I may need to show that I can replicate the exact same run at some point in the future. 