Hello -
Thanks very much for gensim - it's a fantastic package. I've got up and running very quickly. I'm trying to do LSA on a corpus of some 1.25 short texts, and I'm struggling a bit with the order in which I should do things to get the most useful results.
I have a text file (actually a one-column CSV output) of all the texts, one line per document.
I can create a corpus and mm file from these, and then do the LSA stuff described in the tutorials. (Resulting in .dict, .mm, and LSI .pkl files.)
However, I haven't removed common words using stoplists and filtering extremes, and I want to go back and do this properly.
Before LSA, I took the basic steps:
from gensim.corpora import TextCorpus, MmCorpus, Dictionary
background_corpus = TextCorpus(input="texts.csv.bz2")
background_corpus.dictionary.save("my_dict.dict")
Should I remove words from the dictionary file alone before or after creating the corpus?
Is the mm corpus created using this dictionary, or are the two separate processes?
Can I remove words from the dictionary file after creating a corpus, or do I need to rebuild it?
In short, I need to understand the ongoing relationship between dictionary and corpus a bit better (both are used to create the LSI, as I understand it), and any advice would be greatly appreciated.
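To make the question concrete, here is a toy version in plain Python of how I picture the two pieces fitting together. This is only my mental model, not gensim's actual code: the documents, the stoplist, and the helper names build_dictionary and to_bow are all made up for illustration. The dictionary maps each token to an integer id, and the corpus stores each document as (id, count) pairs, so the stored vectors only make sense together with the dictionary that produced them.

```python
from collections import Counter

# Toy stand-ins: three tiny "documents" and a made-up stoplist.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "a cat and a dog",
]
stoplist = {"the", "a", "on", "and"}


def build_dictionary(tokenised_docs):
    """Assign consecutive integer ids to tokens in order of first appearance."""
    token2id = {}
    for tokens in tokenised_docs:
        for token in tokens:
            token2id.setdefault(token, len(token2id))
    return token2id


def to_bow(tokens, token2id):
    """Turn one document into sorted (id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


tokenised = [doc.split() for doc in docs]

# Order 1: build the dictionary and vectorise without any filtering.
full_dict = build_dictionary(tokenised)
corpus_unfiltered = [to_bow(tokens, full_dict) for tokens in tokenised]

# Order 2: drop stopwords first, then build the dictionary and vectorise.
filtered = [[t for t in tokens if t not in stoplist] for tokens in tokenised]
small_dict = build_dictionary(filtered)
corpus_filtered = [to_bow(tokens, small_dict) for tokens in filtered]

# The same word gets a different id in the two dictionaries, so vectors
# written with one dictionary cannot be read back with the other.
print(full_dict["cat"], small_dict["cat"])
print(corpus_unfiltered[0])
print(corpus_filtered[0])
```

If that picture is roughly right, then changing the dictionary after the fact would leave the saved mm file pointing at stale ids, and I would need to rebuild the corpus each time. Is that how it works in gensim?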
Thanks!
James