LDA on movie review dataset


suvir

Mar 11, 2014, 12:56:53 PM
to gen...@googlegroups.com
Hi,

First of all, thanks for the nice documentation that comes with gensim. It made it easy to get started with LDA.

I'm trying to run LDA on a movie review dataset. My understanding is to first obtain LDA-generated topics from movie reviews (say, of 100 movies) and later categorize movies under these topics. I started with a dataset of 1000 positive reviews (I just grabbed the dataset that comes with NLTK).


corpus = MyCorpus('/home/test/nltk_data/corpora/movie_reviews/pos')
tfidf = gensim.models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
model = gensim.models.ldamodel.LdaModel(corpus=corpus_tfidf, id2word=corpus.dictionary, num_topics=50, alpha=None)


Here is the output:
In [171]: model.show_topics(10)
Out[171]:
['0.001*leila + 0.001*matilda + 0.001*truman + 0.001*virgil + 0.001*leigh + 0.001*beth + 0.001*mighty + 0.001*stewart + 0.001*bergman + 0.001*spacey',
 '0.001*truman + 0.001*pauline + 0.001*trekkies + 0.001*cole + 0.001*baby + 0.001*scream + 0.001*mel + 0.001*indian + 0.001*valek + 0.000*reeves',
 '0.001*bulworth + 0.001*memphis + 0.001*scream + 0.001*girls + 0.000*dance + 0.000*jamaican + 0.000*nikki + 0.000*thai + 0.000*patti + 0.000*simon',
 '0.001*flynt + 0.001*gibson + 0.001*toy + 0.001*mel + 0.001*linklater + 0.001*mullen + 0.001*gladiator + 0.001*tarantino + 0.001*ordell + 0.001*jackie',
 '0.001*hrundi + 0.001*flynt + 0.001*dolores + 0.001*bava + 0.001*stempel + 0.001*aliens + 0.001*barlow + 0.001*scream + 0.001*cinque + 0.001*claiborne',
 '0.001*beaumarchais + 0.001*grodin + 0.001*hunting + 0.001*damon + 0.001*endings + 0.001*furtwangler + 0.001*vail + 0.001*dvd + 0.001*chan + 0.001*damme',
 '0.001*lambeau + 0.001*cauldron + 0.001*taran + 0.001*sean + 0.001*quaid + 0.001*brown + 0.001*maximus + 0.001*ryan + 0.000*rosalba + 0.000*tarantino',
 '0.001*quilt + 0.001*faculty + 0.001*nixon + 0.001*li + 0.001*frankenstein + 0.001*vivian + 0.001*hogarth + 0.001*sonny + 0.001*giant + 0.001*pam',
 '0.001*lola + 0.001*maggie + 0.001*frequency + 0.001*smith + 0.001*bacon + 0.001*ronna + 0.001*dragon + 0.001*skinheads + 0.001*lucas + 0.001*derek',
 '0.001*gallo + 0.001*hauer + 0.001*chucky + 0.001*cynthia + 0.001*fingernail + 0.001*titanic + 0.001*doll + 0.001*chan + 0.001*hortense + 0.001*turner']

In [172]: model.show_topic(0,10)
Out[172]:
[(0.00080199922395261867, 'truman'),
 (0.00077141573371340949, 'tarzan'),
 (0.00066999736068072029, 'lola'),
 (0.00066889472058921271, 'chocolat'),
 (0.0006608367820080598, 'gugino'),
 (0.00064430353454066709, 'kissed'),
 (0.00063166619926945361, 'roger'),
 (0.00055236832395473968, 'patlabor'),
 (0.00054165617319329332, 'kurosawa'),
 (0.00053950152038525709, 'spoon')]



The probability values of 0.001/0.000 don't make any sense to me. Am I missing something?

In general, how do I generate topics that make sense for movie categorization? I guess some terms (such as "family drama", "dark comedy", "European-based") carry more weight in describing a movie.
All suggestions are welcome.

Regards
Suvir

Radim Řehůřek

Mar 11, 2014, 1:13:02 PM
to gen...@googlegroups.com
Hello Suvir,

for a start, increase the number of training passes over your data. The default is 1 (i.e. a single pass), which cannot lead to good results on such a tiny dataset.

In fact, gensim prints a warning in that case -- try turning on logging and checking the messages.

After that, it's about tuning the LDA parameters. Play with tokenizing, removing stopwords & frequent words (dictionary.filter_extremes), the number of topics, dropping the tf-idf transformation...

For small datasets, you can also use the recently added gensim.models.LdaMallet class, which uses a different LDA training algorithm (it's in the develop branch of gensim on GitHub).

Best,
Radim
--
Radim Řehůřek, Ph.D.
consulting @ machine learning, natural language processing, big data
 

suvir

Mar 19, 2014, 6:19:11 AM
to gen...@googlegroups.com
Thanks Radim for the reply.

I have now done a lot of preprocessing (stopword removal, tokenizing, POS tagging, custom noun-phrase chunking) and reduced the movie review dataset to only the relevant words from each review document.
In this new dataset I have some noun phrases of 2-3 words (such as "comedy lover", "facial expression").
When running LDA on it, LDA picks the individual words out of them. I want these custom phrases to appear together in topics instead of as separate words, because the phrases carry more meaning than the individual words. Any suggestion on how to generate topics with these custom phrases?

Regards
Suvir

Radim Řehůřek

Mar 19, 2014, 6:36:31 AM
to gen...@googlegroups.com
Sure: just pass phrases into gensim instead of words.

Gensim doesn't care what the strings are -- it just assigns a unique id (integer) to each unique string (word or phrase) in the Dictionary and from then on works with the ids.

HTH,
Radim

suvir

Mar 20, 2014, 6:18:43 AM
to gen...@googlegroups.com
Yup, I got it working with phrases.
I have now increased num_topics to 1000 to see more topics. The results are OK, but topic diff=inf.
I guess that means the model did not converge at all. Do I need to increase my number of passes? (Currently passes=10 for the 50k docs.)

Also, I was trying to run it with HDP but am getting this error with my corpus: object of type 'MyCorpus' has no len()
Of course, my corpus is just the directory with text docs of movie reviews.

Regards
Suvir

suvir

Mar 20, 2014, 6:27:17 AM
to gen...@googlegroups.com
There wasn't any warning, so I kept passes = 10.

suvir

Mar 24, 2014, 11:36:46 AM
to gen...@googlegroups.com
OK. I saved my corpus of reviews in MatrixMarket format with corpora.MmCorpus.serialize('corpus_unsup.mm', corpus).
I ran the HDP model and got 149 topics on the dataset of 50k review docs.
Is 149 the maximum number of topics with HDP?


This is my code, but the result is not as pleasant as with LDA:
corpus_in_mm = corpora.MmCorpus('corpus_unsup.mm')
dictionary = corpora.Dictionary.load('dict_unsup_50k.dict')
hdp = gensim.models.hdpmodel.HdpModel(corpus_in_mm, dictionary)

I guess I'm missing something with HDP.
BTW, I tried LdaMallet; it is pretty fast (17 min for 50k docs).

Regards
Suvir

suvir

Apr 8, 2014, 11:57:48 AM
to gen...@googlegroups.com
After generating 200 topics (they look good enough to proceed), I want to find similarity among documents. I query with one document and get the top 10 most similar results.

corpus = MyCorpus('/home/test/nltk_data/corpora/movie_reviews/neg_pre')  # preprocessed corpus
dictionary = corpus.dictionary  # save the dictionary
corpora.MmCorpus.serialize('testing_corpus.mm', corpus)  # since we need len(), save the corpus in MM format
corpus = corpora.MmCorpus('testing_corpus.mm')  # read back the saved corpus
#model = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=200)
model = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=200, update_every=1, chunksize=10000, passes=15, alpha=None)
doc = read_texts()  # read one sample review
bow = dictionary.doc2bow(utils.simple_preprocess(doc))
vec_lda = model[bow]
#index = similarities.SparseMatrixSimilarity(tfidf[corpus], num_features=20)
index = similarities.MatrixSimilarity(model[corpus])
sims = index[vec_lda]
#print(list(enumerate(sims)))
sims = sorted(enumerate(sims), key=lambda item: -item[1])
print(sims[:10])
=====================
[(84, 0.58722341),
 (156, 0.58722341),
 (257, 0.58722341),
 (622, 0.58722341),
 (705, 0.58722341),
 (474, 0.39097035),
 (855, 0.39097035),
 (978, 0.34163111),
 (623, 0.3177681),
 (629, 0.3177681)]
=============================

The results are just average (OK, it finds some level of comedy when I query with a comedy movie review, but the results are not always reliable).
To find better similarity, I'm now thinking of calculating the KL divergence. For that I would need the distribution of topics for each document. Any hint on how to get that from the model?

Also, in Radim's nearest-neighbour benchmark, LSI is used. Is that because LSI is faster (on such a big corpus as Wikipedia), or is LSI preferred over LDA for finding similarity?
After reading so many older posts regarding LSI vs. LDA, I'm still confused. When is LDA a good choice for finding similarities? At least for human interpretation the topics generated by LDA are quite pleasant, but I don't know if that matters when finding similarity among docs/movies.

Regards
Suvir

Radim Řehůřek

Apr 8, 2014, 1:08:33 PM
to gen...@googlegroups.com

On Tuesday, April 8, 2014 5:57:48 PM UTC+2, suvir wrote:
To find better similarity, I'm now thinking of calculating the KL divergence. For that I would need the distribution of topics for each document. Any hint on how to get that from the model?

Simply use `model[vector]`. The output is a distribution of topics for the document `vector`.
 


Also, in Radim's nearest-neighbour benchmark, LSI is used. Is that because LSI is faster (on such a big corpus as Wikipedia), or is LSI preferred over LDA for finding similarity?
After reading so many older posts regarding LSI vs. LDA, I'm still confused. When is LDA a good choice for finding similarities? At least for human interpretation the topics generated by LDA are quite pleasant, but I don't know if that matters when finding similarity among docs/movies.


LDA has topics that are more pleasant to look at, and a model that is statistically sounder (a generative process etc).

Similarity-wise (i.e. when you don't care what the topics or model look like), I've found LSI comparable while more straightforward to train. LSI's training algo achieves the optimum directly, rather than iterating via EM/sampling like LDA.

HTH,
Radim