Different similarity after loading

niefpaarschoenen

Apr 24, 2012, 8:52:45 AM
to gensim
Hi all,

I keep getting different document similarities with the same models
when they are computed online or offline. What am I doing wrong?

Here is my code:

from gensim import corpora, models, similarities

# query and corpora to compare
query = 'er rijdt een fiets naast mijn auto'
corpus_dev = ['de auto rijdt over de snelweg']

# building bows
dictionary = corpora.Dictionary.load(base + '.dict')
bow_dev = [dictionary.doc2bow(doc.lower().split()) for doc in corpus_dev]
bow_query = dictionary.doc2bow(query.lower().split())

# training tf-idf + lda model on training bow corpus
bow_train = corpora.MmCorpus(base + '.mm')
model_tfidf = models.TfidfModel(bow_train)
tfidf_train = model_tfidf[bow_train]
model_lda_tfidf = models.LdaModel(tfidf_train, id2word=dictionary,
                                  num_topics=100, update_every=1,
                                  chunksize=1000, passes=1)

# transforming test bows to latent space
lda_query = model_lda_tfidf[bow_query]
index = similarities.MatrixSimilarity(model_lda_tfidf[bow_dev],
                                      num_features=len(dictionary))
sims = index[lda_query]
print sims[0]

# serialize the model
model_lda_tfidf.save(base + '.tfidf.lda')

# rerun the above experiment with the loaded model
model_lda_tfidf = models.LdaModel.load(base + '.tfidf.lda')
lda_query = model_lda_tfidf[bow_query]
index = similarities.MatrixSimilarity(model_lda_tfidf[bow_dev],
                                      num_features=len(dictionary))
sims = index[lda_query]
print sims[0]

In the first case I get a similarity of 0.939869; in the second case
I get 0.939746. The two values are close and both are fine for my
purposes, of course, but I'm wondering what might cause this
difference. Any ideas? Maybe there is some rounding going on when
serializing?

Thanks in advance,

Joris

Radim Řehůřek

Apr 24, 2012, 9:42:17 AM
to gensim
Hi Joris!

Not sure whether it's related, but I noticed you train the LDA model
as lda[tfidf[bow]], but then later transform as lda[bow] (the tfidf
step is missing).

Other than that, the LDA transformation of a document stops after 50
inner (variational) iterations, so it is possible you can get a
slightly different result, if the inference loop fails to converge
within that number.

You can change this parameter in LdaModel.VAR_MAXITER (default is 50),
but the poorer convergence may very well be connected to the missing
tf-idf step mentioned above (train/transform input mismatch), so I'd
start there.
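To illustrate the iteration-cap point above, here is a toy sketch in plain Python. It is only an analogy, not gensim's variational inference: the function `capped_fixed_point`, the update rule `x <- cos(x)`, and the cap values are all made up for illustration. It assumes the iteration starts from a random initial state, and shows that a loop stopped early still depends on where it started, while a loop allowed to converge does not.

```python
import math
import random


def capped_fixed_point(start, max_iter, tol=1e-12):
    """Iterate x <- cos(x) until the update is below tol or max_iter is hit."""
    x = start
    for _ in range(max_iter):
        new_x = math.cos(x)
        if abs(new_x - x) < tol:
            return new_x
        x = new_x
    return x  # cap reached before convergence


# Five different random starting points (hypothetical stand-in for a
# randomly initialised inference state).
rng = random.Random(0)
starts = [rng.uniform(0.0, 1.0) for _ in range(5)]

few = [capped_fixed_point(s, max_iter=5) for s in starts]
many = [capped_fixed_point(s, max_iter=500) for s in starts]

# Spread between the results obtained from different starting points.
spread_few = max(few) - min(few)
spread_many = max(many) - min(many)

print("spread with a low cap: ", spread_few)
print("spread with a high cap:", spread_many)
```

With the low cap the results visibly disagree between starting points; with the high cap they agree to within the tolerance. A small run-to-run difference in the transformed query would be consistent with the first situation.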

Let me know if the issue persists,
Radim

niefpaarschoenen

Apr 25, 2012, 8:21:12 AM
to gensim
Hello Radim,

> Not sure whether it's related, but I noticed you train the LDA model
> as lda[tfidf[bow]], but then later transform as lda[bow] (the tfidf
> step is missing).

Is this really the case? I thought, since I saved the lda transform
based on tfidf values, that it would perform the double
transformation. Some new results seem to indicate that this is a true
assumption:

probabilities
before saving: 0.29011
use both saved transforms: 0.547337
use only last saved transform: 0.29011

As you see, the probabilities are equal now. I achieved this, not by
increasing VAR_MAXITER, but decreasing the lda chunksize, since it was
a bit too high wrt my temporary tiny training corpus (5183 docs).

If I rerun the entire experiment, the probabilities are still always
equal, but they have a different value (often 0.0 by the way). I'm
guessing/hoping that LDA as implemented in gensim always converges to
the same solution, but takes a different random path to get there? And
since VAR_MAXITER might be too small, convergence has not been
reached, hence the probabilities are different?

Have a nice day,

Joris

Radim Řehůřek

Apr 26, 2012, 12:53:39 PM
to gensim
Hi Joris,

> > Not sure whether it's related, but I noticed you train the LDA model
> > as lda[tfidf[bow]], but then later transform as lda[bow] (the tfidf
> > step is missing).
>
> Is this really the case? I thought, since I saved the lda transform
> based on tfidf values, that it would perform the double
> transformation. Some new results seem to indicate that this is a true
> assumption:

Nah, the intermediate transformations are not saved automatically. In
fact, they are not even passed on to LdaModel -- that object never
sees the TfidfModel, only the corpus transformed by it.
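A minimal sketch of that point, in plain Python rather than gensim: `ToyTfidf` and `ToySecondStage` are hypothetical stand-ins (not gensim classes), and the JSON-based `dumps`/`loads` pair is an assumed substitute for a real save/load. The second stage is trained only on vectors the first stage produced, so saving it cannot preserve the first stage, and a query has to go through both stages again after loading.

```python
import json
import math


class ToyTfidf:
    """Hypothetical stand-in for a tf-idf stage over (term_id, count) pairs."""

    def __init__(self, corpus):
        doc_freq = {}
        for doc in corpus:
            for term_id, _ in doc:
                doc_freq[term_id] = doc_freq.get(term_id, 0) + 1
        self.idf = {t: math.log2(len(corpus) / df) for t, df in doc_freq.items()}

    def __getitem__(self, bow):
        return [(t, count * self.idf.get(t, 0.0)) for t, count in bow]


class ToySecondStage:
    """Hypothetical stand-in for a model trained on the first stage's output."""

    def __init__(self, weights):
        self.weights = weights

    @classmethod
    def train(cls, transformed_corpus):
        # Only already-transformed vectors arrive here; the object that
        # produced them is never passed in, so it cannot be stored.
        weights = {}
        for doc in transformed_corpus:
            for t, w in doc:
                weights[t] = weights.get(t, 0.0) + w
        return cls(weights)

    def __getitem__(self, vec):
        return [(t, w * self.weights.get(t, 0.0)) for t, w in vec]

    def dumps(self):
        return json.dumps(self.weights)

    @classmethod
    def loads(cls, text):
        # JSON object keys are strings, so convert them back to int term ids.
        return cls({int(t): w for t, w in json.loads(text).items()})


corpus = [[(0, 1), (1, 2)], [(1, 1), (2, 3)], [(0, 2), (2, 1), (3, 1)]]
tfidf = ToyTfidf(corpus)
second = ToySecondStage.train([tfidf[doc] for doc in corpus])

restored = ToySecondStage.loads(second.dumps())
query = [(0, 1), (2, 2)]

chained = restored[tfidf[query]]   # same kind of input as during training
unchained = restored[query]        # raw counts: first stage skipped

print("chained:  ", chained)
print("unchained:", unchained)
```

The restored second stage gives the same answer as the original one when fed the same input, but that answer changes depending on whether the first stage was applied to the query.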

> probabilities
> before saving: 0.29011
> use both saved transforms: 0.547337
> use only last saved transform: 0.29011

I think this may be because neither of them uses tf-idf (i.e., both
0.29 values are wrong :)

> As you see, the probabilities are equal now. I achieved this, not by
> increasing VAR_MAXITER, but decreasing the lda chunksize, since it was
> a bit too high wrt my temporary tiny training corpus (5183 docs).

Cool.

> If I rerun the entire experiment, the probabilities are still always
> equal, but they have a different value (often 0.0 by the way). I'm
> guessing/hoping that LDA as implemented in gensim always converges to
> the same solution, but takes a different random path to get there? And
> since VAR_MAXITER might be too small, convergence has not been
> reached, hence the probabilities are different?

Yes. The variational online LDA algo is guaranteed to converge,
eventually. Note that even when trained to complete convergence, the
topics can still differ between training runs, because the order of
topics is arbitrary (unlike with LSA).
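A small sketch of what arbitrary topic order means in practice, in plain Python with made-up numbers: the topic weights and the permutation `order` are hypothetical, and run B is assumed to have found exactly the same topics as run A, only in a different order. Similarities computed within one run are unaffected, but vectors from different runs cannot be compared position by position.

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical topic weights for two documents from training run A.
run_a_doc1 = [0.7, 0.2, 0.1]
run_a_doc2 = [0.6, 0.1, 0.3]

# Run B is assumed to contain the same topics in a different order.
order = [2, 0, 1]
run_b_doc1 = [run_a_doc1[i] for i in order]
run_b_doc2 = [run_a_doc2[i] for i in order]

within_a = cosine(run_a_doc1, run_a_doc2)
within_b = cosine(run_b_doc1, run_b_doc2)
across = cosine(run_a_doc1, run_b_doc1)  # same document, different runs

print("doc1 vs doc2 within run A:", within_a)
print("doc1 vs doc2 within run B:", within_b)
print("doc1 (run A) vs doc1 (run B):", across)
```

So a document's vector from one training run should only be compared with vectors produced by that same trained model.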

Best,
Radim