I'm a bit new to gensim and LSA/LDA generally, having taken a text mining class last year. I did my due diligence and read the tutorials and much of the documentation.
I'm working on a project mining user reviews from the Yelp dataset. I'm curious whether I can run LSA on sub-collections of the reviews to derive more narrowly tailored topic sets. (LDA is a separate question; I want to focus on LSA first.)
For example, running LSA on all reviews identifies topics spanning many different categories: Restaurants, Doctors' Offices, etc.
I'm sure I could run LSA on just one of those segments (inputting only documents already classified as Restaurants, for example) to get topics for only that subsection.
My question is at what point I might reach an impasse as I drill down into the subcategories. For example, say I wanted to use LSA to find topics only for Restaurants/Foodtrucks and I fed it only those reviews. Is there a minimum number of reviews I MUST have to derive meaningful output?
Also, as the total number of documents in these subcategory models drops, should the number of LSI topics be reduced as well? Radim says, "For the toy corpus above we used only 2 latent dimensions, but on real corpora, target dimensionality of 200–500 is recommended as a 'golden standard'." But I imagine that wouldn't hold once you got down to a subcategory like Foodtrucks with only a few thousand reviews.
So, to summarize: 1) how many documents are needed for LSA to be minimally viable, and 2) what are some best-practice suggestions when the total number of documents approaches that threshold?
Broad strokes or further-reading suggestions are welcome. Again, I'm a student and want to understand this stuff.
Thanks!