Poor similarity results for LDA and HDP vs. LSI and TF-IDF


fr...@inkwireapp.com

Mar 6, 2014, 1:56:54 AM
to gen...@googlegroups.com
Hi,

First of all, thanks, Radim, for creating such an amazing topic modeling library; it's awesome! :-)

I have a project to assess document similarity and I've been testing with a 1K-document news corpus.
Similarity based on LSI seems very good, and tf-idf-based cosine similarity is also good, while LDA and HDP seem quite off; see the attached results.

The similarity code is essentially the same for all models; for LDA it looks like this:

from gensim import corpora, models, similarities

def getSimilarDocuments(_corpus, document):
    dictionary = corpora.Dictionary(_corpus['texts'])
    corpus = [dictionary.doc2bow(text) for text in _corpus['texts']]
    tfidf = models.TfidfModel(corpus)
    corpus_tfidf = tfidf[corpus]
    model = models.LdaModel(corpus=corpus_tfidf, id2word=dictionary, num_topics=100)
    corpus_model = model[corpus]
    vec_bow = dictionary.doc2bow(document['text'])
    vec_model = model[vec_bow]
    index = similarities.MatrixSimilarity(corpus_model)
    sims = index[vec_model] # perform a similarity query against the corpus
    sims = sorted(enumerate(sims), key=lambda item: -item[1])
    return sims


As can be seen in the attached lda.html, the performance is quite bad, i.e. the top-ranked docs are not very similar,
especially when compared with LSI or TF-IDF.

So, I created a small test script to look at what's happening under the hood with a tiny corpus; see the attached tinyCorpusTest.py.

If you run this script as follows: python tinyCorpusTest.py 4 2 (pick the first 4 texts as the corpus and look for 2 topics),
you get the following log output:

================= INPUTS =================

docs: 

0: Cats and dogs chase each other
1: I love dogs, most dogs
2: I like ice cream
3: Some dogs like ice cream


tokenized docs: 

0: [u'cats', u'dogs', u'chase']
1: [u'love', u'dogs', u'dogs']
2: [u'like', u'ice', u'cream']
3: [u'dogs', u'like', u'ice', u'cream']


dictionary: 

0: cats
1: chase
2: dogs
3: love
4: cream
5: ice
6: like


Transformed corpus: 

0: [(0, 1), (1, 1), (2, 1)]
1: [(2, 2), (3, 1)]
2: [(4, 1), (5, 1), (6, 1)]
3: [(2, 1), (4, 1), (5, 1), (6, 1)]


================= TFIDF =================

tfidf corpus: 

0: [(0, 0.699614836733826), (1, 0.699614836733826), (2, 0.1451831961481918)]
1: [(2, 0.383332888988391), (3, 0.9236102512530996)]
2: [(4, 0.5773502691896258), (5, 0.5773502691896258), (6, 0.5773502691896258)]
3: [(2, 0.23302537487517574), (4, 0.5614561943922499), (5, 0.5614561943922499), (6, 0.5614561943922499)]


================= LSI =================

lsi topics: 

topic #0(2.868): 0.759*"dogs" + 0.341*"like" + 0.341*"ice" + 0.341*"cream" + 0.210*"love" + 0.122*"chase" + 0.122*"cats"
topic #1(2.219): -0.495*"dogs" + 0.459*"cream" + 0.459*"like" + 0.459*"ice" + -0.253*"love" + -0.170*"cats" + -0.170*"chase"


lsi corpus: 

0: [(0, 1.003198675781718), (1, -0.83443922019834349)]
1: [(0, 1.728656133688101), (1, -1.2433457817287863)]
2: [(0, 1.0241492394227165), (1, 1.3783826048323284)]
3: [(0, 1.7833788095551224), (1, 0.88301830648324198)]


================= LDA =================

lda topics: 

topic #1 (0.500): 0.273*dogs + 0.136*chase + 0.135*cats + 0.121*cream + 0.120*like + 0.120*ice + 0.094*love
topic #0 (0.500): 0.228*dogs + 0.179*ice + 0.178*like + 0.177*cream + 0.105*love + 0.066*cats + 0.066*chase


lda corpus: 

0: [(0, 0.15003383354998492), (1, 0.8499661664500151)]
1: [(0, 0.29923854371697478), (1, 0.70076145628302522)]
2: [(0, 0.84085059538101015), (1, 0.15914940461898988)]
3: [(0, 0.85278805727523821), (1, 0.14721194272476182)]


================= HDP =================

hdp topics: 

topic 0: 0.341*cream + 0.251*chase + 0.234*love + 0.080*dogs + 0.044*ice + 0.031*like + 0.020*cats
topic 1: 0.561*ice + 0.212*chase + 0.079*dogs + 0.077*cream + 0.040*cats + 0.029*love + 0.002*like
topic 2: 0.436*cats + 0.205*cream + 0.146*ice + 0.098*like + 0.061*chase + 0.027*love + 0.027*dogs
topic 3: 0.325*cream + 0.283*love + 0.221*chase + 0.100*dogs + 0.050*cats + 0.013*ice + 0.008*like
topic 4: 0.485*cats + 0.183*like + 0.132*love + 0.109*cream + 0.082*chase + 0.009*ice + 0.000*dogs
topic 5: 0.578*cream + 0.152*chase + 0.087*ice + 0.085*dogs + 0.051*love + 0.046*like + 0.001*cats
topic 6: 0.312*love + 0.301*like + 0.129*cats + 0.110*dogs + 0.061*ice + 0.044*chase + 0.043*cream
topic 7: 0.240*dogs + 0.233*chase + 0.225*like + 0.110*cream + 0.104*cats + 0.061*ice + 0.028*love
topic 8: 0.298*cats + 0.197*cream + 0.184*love + 0.147*dogs + 0.111*like + 0.035*ice + 0.028*chase
topic 9: 0.313*like + 0.193*dogs + 0.172*ice + 0.129*cream + 0.103*cats + 0.048*love + 0.041*chase
topic 10: 0.255*cats + 0.172*ice + 0.148*cream + 0.141*chase + 0.141*like + 0.083*dogs + 0.060*love
topic 11: 0.221*like + 0.194*dogs + 0.161*love + 0.136*ice + 0.119*cats + 0.109*chase + 0.059*cream
topic 12: 0.287*ice + 0.171*cats + 0.163*chase + 0.160*cream + 0.093*love + 0.075*dogs + 0.051*like
topic 13: 0.455*cats + 0.177*chase + 0.151*cream + 0.092*love + 0.087*like + 0.024*ice + 0.014*dogs
topic 14: 0.476*love + 0.189*chase + 0.122*like + 0.092*cats + 0.054*cream + 0.046*dogs + 0.022*ice
topic 15: 0.301*like + 0.184*cats + 0.164*chase + 0.151*love + 0.087*cream + 0.063*ice + 0.050*dogs
topic 16: 0.640*dogs + 0.122*ice + 0.100*love + 0.055*cats + 0.038*like + 0.030*cream + 0.015*chase
topic 17: 0.312*like + 0.286*ice + 0.145*cream + 0.087*cats + 0.073*love + 0.060*chase + 0.037*dogs
topic 18: 0.447*chase + 0.168*cats + 0.130*dogs + 0.103*ice + 0.098*cream + 0.051*like + 0.003*love
topic 19: 0.360*cats + 0.233*dogs + 0.190*chase + 0.120*love + 0.043*like + 0.031*cream + 0.022*ice


hdp corpus: 

0: [(0, 0.53889088396006601), (1, 0.059473711000574199), (2, 0.33161559092399945), (3, 0.024298583878098176), (4, 0.015809600845564282), (5, 0.010416687269615013)]
1: [(0, 0.83430096456690561), (1, 0.05842337842545306), (2, 0.037255409095707563), (3, 0.024299041878024532), (4, 0.015809576642428685), (5, 0.010416687269397493)]
2: [(0, 0.60781462207119386), (1, 0.28446110716846756), (2, 0.037707728678144391), (3, 0.024295273705752688), (4, 0.015809638979660354), (5, 0.010416687274697928)]
3: [(0, 0.6370803644395352), (1, 0.27684269575590126), (2, 0.030062942570020318), (3, 0.019436997224766901), (4, 0.01264769649320153)]


I have several questions about these results:
1) Why does LDA pick up the tokens 'cats' and 'cream' in both topics #0 and #1, even though these tokens never appear in the same sentence? Is that a bug?
2) Why does LSI assign negative probabilities to tokens (it seems to work well :-) while LDA does not?
3) Why does HDP create a whole bunch of topics with the same tokens but different weights? That seems a bit redundant, no?

Cheers,

Fred




hdp.html
lda.html
lsi.html
tfidf.html
tinyCorpusTest.py

Radim Řehůřek

Mar 6, 2014, 6:36:55 AM
to gen...@googlegroups.com
Hi Fred,

First of all, your approach is great: testing, comparing, evaluating, printing visual sanity checks.

Re. LDA: it's rather sensitive to corpus preprocessing and training parameters (stopword removal, dictionary.filter_extremes, etc.). How many training passes do you use? Also try training it directly on bow vectors (not tf-idf).

In your code, you train LDA on tf-idf vectors but then index and query documents without the tf-idf transformation. Is this on purpose? Maybe try dropping the tf-idf transformation from the LDA pipeline and increasing the number of training passes.

Also, your code keeps everything in RAM (no document streaming), so I assume memory is not an issue for you.



I have several questions about these results:
1) Why does LDA pick up the tokens 'cats' and 'cream' in both topics #0 and #1, even though these tokens never appear in the same sentence? Is that a bug?

"cats" and "cream" co-occur via "dogs". It's certainly possible for a topic model to put them in the same topic.
 
It's highly unlikely there's a bug in the LDA implementation; it's quite possible, though, that the method as such is not suitable for your task. LDA is mostly useful when you need a "human interpretation/visualization" of the topics (in contrast to LSI topics, which are opaque).



2) Why does LSI assign negative probabilities to tokens (it seems to work well :-) while LDA does not?

The LSI topic matrix represents basis vectors of the latent space (mutually orthogonal directions), not probabilities. LDA topics are probability distributions, so all values lie in [0.0, 1.0] and sum to 1.0. So the answer is: "topics" come down to different things in different models :)
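This can be checked directly against the numbers in the log above: the LSI topic #0 coefficients form a (roughly) unit-length vector, while the LDA topic #1 weights sum to one. (The 2.868 printed in parentheses next to the LSI topic is that topic's singular value, i.e. its scale; it is not part of the coefficient vector.)

```python
import math

# LSI topic #0 coefficients from the log (a direction in word space):
lsi_topic = [0.759, 0.341, 0.341, 0.341, 0.210, 0.122, 0.122]
# LDA topic #1 weights from the log (a probability distribution):
lda_topic = [0.273, 0.136, 0.135, 0.121, 0.120, 0.120, 0.094]

norm = math.sqrt(sum(w * w for w in lsi_topic))
total = sum(lda_topic)

print(round(norm, 2))   # 1.0 -- LSI topics are unit vectors (entries may be negative)
print(round(total, 2))  # 1.0 -- LDA topic weights sum to one
```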

 
3) Why does HDP create a whole bunch of topics with the same tokens but different weights? That seems a bit redundant, no?

I don't have any insights into HDP; I've never used it myself. The model was contributed by Jonathan Esterhazy; perhaps he could assist you here.

Best,
Radim


--
Radim Řehůřek, Ph.D.
consulting @ machine learning, natural language processing, big data







xiaocheng liu

Aug 8, 2015, 4:55:14 AM
to gensim
topic #0(2.868): 0.759*"dogs" + 0.341*"like" + 0.341*"ice" + 0.341*"cream" + 0.210*"love" + 0.122*"chase" + 0.122*"cats"


What does the number '2.868' stand for?  