A bit of a newbie question, but trying to understand feasibility of LSA

mattyterm

Jun 21, 2016, 5:55:29 PM

to gensim

I'm fairly new to gensim and to LSA/LDA generally, having taken a text mining class last year. I did my due diligence and read the tutorials and many of the docs.

I'm working on a project that mines user reviews from the Yelp dataset. I'm curious whether I can use LSA to topic-model sub-collections of the reviews to get more narrowly tailored topic sets. (LDA is a separate question, but I want to focus on LSA first.)

For example, running LSA on all reviews surfaces topics across Restaurants, Doctors' Offices, etc.

I'm sure I could use LSA on just one of those segments (inputting only documents already classified as Restaurants, for example) to get topics only for that subsection.

My question is: at what point might I reach an impasse as I drill down into the subcategories? For example, let's say I wanted to use LSA to determine topics only for Restaurants/Foodtrucks, and I feed it only those reviews. Is there a minimum number of reviews I MUST have to derive meaningful output?

Also, as the total number of documents in the models for these subcategories drops, should the number of LSI topics be reduced as well? Radim says, "For the toy corpus above we used only 2 latent dimensions, but on real corpora, target dimensionality of 200–500 is recommended as a “golden standard”." But I imagine that wouldn't hold once you got down to a subcategory like Foodtrucks with only several thousand reviews.

So to summarize: 1) how many documents are needed for LSA to be minimally viable, and 2) what are some best-practice suggestions when the total number of documents approaches that threshold?

Broad strokes, or further reading suggestions welcomed. Again, I'm a student and want to understand this stuff.

Thanks!

Lev Konstantinovskiy

Jun 21, 2016, 8:54:41 PM

to gensim

Hi Mattyterm

This really depends on your specific corpus, and you can easily check it for yourself. LSI and LDA are quick to run and to analyse with pyLDAvis, topic coherence (a new gensim feature, see PR), and perplexity.

Just try and see how it works in your case.

Regards
Lev

Radim Řehůřek

Jun 21, 2016, 9:58:53 PM

to gensim

Hi Matt,

for LSA, the topics are nested and there is no lower or upper limit. This means that if you compute 10 topics, and then 100 topics (on the same training corpus), the 10 topics will be just a prefix of the 100 topics = the first ten. You can inspect how much variance of your data topic `i` captures by `lsi.projection.s[i]` (higher `s[i]` = higher singular value for topic #i = more variance captured).
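
The nesting property can be illustrated with a minimal sketch. It uses plain NumPy SVD on a hypothetical toy term-document matrix, not gensim's LsiModel, so the names `X` and `truncated_svd` are assumptions made for illustration only:

```python
import numpy as np

# Hypothetical toy term-document matrix (rows = terms, columns = documents).
# In gensim, this role is played by the corpus the LsiModel is trained on.
X = np.array([
    [2.0, 1.0, 0.0, 0.0, 1.0],
    [1.0, 2.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 3.0, 1.0, 0.0],
    [0.0, 1.0, 1.0, 2.0, 0.0],
])

def truncated_svd(matrix, k):
    """Return the k leading left singular vectors and singular values."""
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :k], s[:k]

u2, s2 = truncated_svd(X, 2)  # "2 topics"
u4, s4 = truncated_svd(X, 4)  # "4 topics"

# Nesting: the 2-topic model is exactly the first two topics of the 4-topic model.
topics_are_nested = bool(np.allclose(u2, u4[:, :2]) and np.allclose(s2, s4[:2]))

# Singular values are sorted in decreasing order, so earlier topics matter more.
# s**2 / sum(s**2) is each topic's share of the matrix's squared Frobenius norm,
# which is the loose sense of "variance captured" used above.
variance_share = s4**2 / np.sum(s4**2)

print("nested:", topics_are_nested)
print("share per topic:", variance_share)
```

With a trained gensim model, the analogous values to inspect would be `lsi.projection.s`, as mentioned above.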

Also, for LSA, the more topics you have, the closer the topical representation is to plain tf-idf. Ultimately, once #topics = min(#features, #train_docs), tf-idf and LSA become equivalent. So it's not really the case that "more topics is always better"; the #topics cut-off point, which is usually between 200-500, is application-dependent.
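
A small sketch of why the representations converge, again using NumPy SVD on a hypothetical random matrix rather than gensim itself: the rank-k reconstruction error shrinks as k grows, and at k = min(#features, #train_docs) the document-to-document dot products in topic space match those of the original vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((6, 4))  # hypothetical: 6 terms (features) x 4 training documents

u, s, vt = np.linalg.svd(X, full_matrices=False)
max_topics = min(X.shape)  # here: 4

def reconstruction_error(k):
    """Frobenius norm of X minus its best rank-k approximation."""
    approx = (u[:, :k] * s[:k]) @ vt[:k, :]
    return float(np.linalg.norm(X - approx))

errors = [reconstruction_error(k) for k in range(1, max_topics + 1)]

# Documents projected into the full-rank topic space (topics x documents).
docs_in_topic_space = u.T @ X

# At full rank, pairwise dot products between documents are unchanged,
# so similarity queries behave the same as on the original vectors.
same_similarities = bool(
    np.allclose(docs_in_topic_space.T @ docs_in_topic_space, X.T @ X)
)

print("errors by number of topics:", errors)
print("similarities preserved at full rank:", same_similarities)
```

Plotting `errors` (or the singular values) against the number of topics is one informal way to look for a cut-off on a small sub-corpus.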

For LDA, the topics are not composable/nested like that, and intuition is much less geometric and hazier. I find that hundreds of documents per topic can be enough, as long as they are focused (consistent word co-occurrence patterns) and the text is well preprocessed.
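
As back-of-the-envelope arithmetic only, the "hundreds of documents per topic" rule of thumb can be turned into a rough ceiling on the LDA topic count for a sub-corpus. The function name and the default of 300 documents per topic are assumptions chosen for illustration, not values from gensim or from this thread:

```python
def max_lda_topics(n_docs, docs_per_topic=300):
    """Rough upper bound on the number of LDA topics.

    Assumes each topic should be backed by at least `docs_per_topic`
    documents. The default of 300 is a hypothetical stand-in for
    "hundreds of documents per topic".
    """
    if n_docs < 0 or docs_per_topic <= 0:
        raise ValueError("n_docs must be >= 0 and docs_per_topic must be > 0")
    return max(1, n_docs // docs_per_topic)

# Example sub-corpus sizes (hypothetical).
for n_docs in (500, 3000, 50000):
    print(n_docs, "documents ->", max_lda_topics(n_docs), "topics at most")
```

This only bounds the search range; the actual choice still depends on how focused and well preprocessed the documents are.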

HTH,

Radim

mattyterm

Jun 22, 2016, 6:14:41 AM

to gensim

Both excellent, enlightening answers, thank you so much. I'm really looking forward to working with gensim, and will be participating in the community. Thanks again!

J

mattyterm

Jul 28, 2016, 7:55:25 PM

to gensim

Lev,

I've gotten pyLDAvis working for my LDA models, but have been less successful trying to do that with LSI. Is it possible to visualize LSI with pyLDAvis? Any insights will be helpful, and sorry if I'm missing something obvious.

Best,

Justin

Bhargav Srinivasa

Jul 29, 2016, 1:53:50 AM

to gensim

Hello,

How exactly are you going about visualising LSI with pyLDAvis? Are you getting stuck setting it up or are you not getting satisfactory results?

mattyterm

Jul 29, 2016, 6:59:09 AM

to gensim

I imagine I'm messing something up. pyLDAvis doesn't seem to be documented as well as gensim, so it has been trial and error.

I'm using pyLDAvis.gensim to take the LSI model I generated and prepare() it with the tfidf corpus that I trained it on. This works fine with the LDA model and the bow corpus.

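import pyLDAvis
import pyLDAvis.gensim as gensimvis
from gensim import corpora, models

# Load the saved dictionary and bag-of-words corpus, then build the tf-idf view.
dictionary = corpora.Dictionary.load(dictionary_path)
bowcorpus = corpora.MmCorpus(corpus_path)
tfidf = models.TfidfModel(bowcorpus)
tfidfcorpus = tfidf[bowcorpus]

# Load the trained LSI model and pass it to pyLDAvis (this is the call that fails).
model = models.LsiModel.load(lsi_model_path)
vis_data = gensimvis.prepare(model, tfidfcorpus, dictionary)
pyLDAvis.save_html(vis_data, lsi_model_path + ".html")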

And then I get the error: 'LsiModel' object has no attribute 'inference'. So I figured I was either missing something, or it wasn't possible.

Traceback (most recent call last):
  File "topics/display.py", line 55, in <module>
    main()
  File "topics/display.py", line 29, in main
    vis_data = gensimvis.prepare(model, bowcorpus, dictionary)
  File "/Applications/anaconda/lib/python3.5/site-packages/pyLDAvis/gensim.py", line 109, in prepare
    opts = fp.merge(_extract_data(topic_model, corpus, dictionary, doc_topic_dist), kwargs)
  File "/Applications/anaconda/lib/python3.5/site-packages/pyLDAvis/gensim.py", line 40, in _extract_data
    gamma, _ = topic_model.inference(corpus)
AttributeError: 'LsiModel' object has no attribute 'inference'

Bhargav Srinivasa

Jul 29, 2016, 7:29:50 AM

to gensim

I'm not very sure myself how the insides of LSI work - but if you have a topic_term probability distribution matrix and a doc_topic probability distribution matrix you can manually feed in these values to help visualise it.

This is the pyLDAvis doc for the same, using the prepare() method - http://pyldavis.readthedocs.io/en/latest/modules/API.html#pyLDAvis.prepare

You can see it allows you to manually feed in matrices to aid in visualisation. topic_term_dists, doc_topic_dists, doc_lengths, vocab, term_frequency are all you need.
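
One possible way to set this up for LSI is sketched below. LSI loadings are signed and are not probability distributions, so this sketch takes absolute values and normalises each row to sum to 1. That conversion is a heuristic assumption, not an established method, and the resulting plot should be read with caution. The helper names are hypothetical. With gensim, the signed inputs could plausibly come from `lsi.projection.u.T` (topics x terms) and `gensim.matutils.corpus2dense(lsi[tfidfcorpus], lsi.num_topics).T` (documents x topics), but that wiring is untested here.

```python
import numpy as np

def to_row_distributions(signed):
    """Heuristic: turn signed weights into rows that sum to 1.

    Takes absolute values, then normalises each row. All-zero rows are
    replaced by a uniform distribution so every row still sums to 1.
    """
    weights = np.abs(np.asarray(signed, dtype=float))
    zero_rows = weights.sum(axis=1) == 0
    weights[zero_rows] = 1.0
    return weights / weights.sum(axis=1, keepdims=True)

def build_vis(topic_term, doc_topic, doc_lengths, vocab, term_frequency):
    """Feed the converted matrices to pyLDAvis.prepare (not called below)."""
    import pyLDAvis  # imported lazily so the helper above works without it
    return pyLDAvis.prepare(
        topic_term_dists=to_row_distributions(topic_term),
        doc_topic_dists=to_row_distributions(doc_topic),
        doc_lengths=doc_lengths,
        vocab=vocab,
        term_frequency=term_frequency,
    )

# Hypothetical signed LSI-style loadings: 2 topics x 3 terms.
example_topic_term = np.array([
    [0.6, -0.3, 0.1],
    [0.0, 0.0, 0.0],
])
example_dists = to_row_distributions(example_topic_term)
print(example_dists)
```

Because the sign information is discarded, two terms that load in opposite directions on a topic end up looking alike; clipping negatives to zero instead of taking absolute values is an alternative with its own trade-offs.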

I've done something similar for the DTM model of gensim in this notebook: link.

It may not be super helpful but should give you some flavour of how to set things up.

mattyterm

Jul 29, 2016, 7:40:27 AM

to gensim

This will help a lot. Thanks for giving me some training wheels :D

Cheers!

Lev Konstantinovskiy

Aug 18, 2016, 9:36:42 AM

to gensim

Hi mattyterm,

Did you get pyLDAvis to cooperate with LSI? If yes, the code would be a useful pull request to Gensim!

Regards

Lev

mattyterm

Aug 19, 2016, 1:35:09 PM

to gensim

Lev,

I have not yet gotten to it. This is a side project for me, but I want to get it working, so once I do, I will reach out.

Thanks!

Doan Helen

Feb 23, 2017, 10:59:32 PM

to gensim

@mattyterm I have the same issue as you. When trying to visualize an LSI model with pyLDAvis, I got the same error you did.

If you get the chance to finish fitting LSI to pyLDAvis, please open a pull request, or it would be great if you could show me how you did it.

Thanks!

Matt Buckley

Apr 11, 2022, 11:48:11 AM

to Gensim

Has anyone managed to solve this and could provide their solution? Thanks!
