Hello Shivani,
> Each file in the directory is a document containing plain text.
> Lets assume that the text is pruned for stopwords and special
> characters etc.
>
> I will need to write custom over-rides of the get_text() function?
Exactly. All you have to do is inherit from `corpora.TextCorpus` and
override `get_texts()` (note the plural) so that it yields each document
as a list of tokens:

    import gensim

    class MyCorpus(gensim.corpora.TextCorpus):
        def get_texts(self):
            for filename in self.input:  # for each relevant file
                # tokenize() is your own function; it must return a list of tokens
                with open(filename) as f:
                    yield tokenize(f.read())

    mycorpus = MyCorpus(['file1.txt', 'file2.txt', ...])

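
The `tokenize` call in the snippet is left for you to define. Since your text is already stripped of stopwords and special characters, something very simple may be enough. Here is a minimal sketch, assuming you want lowercase tokens made of letters, digits and underscores; the name `TOKEN_RE` and the sample sentence are made up for illustration and are not part of gensim:

```python
import re

# Hypothetical helper, not part of gensim: a token is any run of
# letters, digits or underscores.
TOKEN_RE = re.compile(r"\w+")


def tokenize(text):
    """Return the document as a list of lowercase tokens."""
    return TOKEN_RE.findall(text.lower())


# Made-up sample text, only to show the shape of the result.
tokens = tokenize("Human machine interface, for lab ABC applications!")
print(tokens)
```

Returning a list (rather than one long string) is the important part, because `get_texts()` has to yield one list of tokens per document.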
The dictionary (word->word_id mapping) will then be in
`mycorpus.dictionary`. You can prune it, remove unwanted tokens etc.
`mycorpus` is a proper gensim corpus, so you can pass it into
transformations, store it in different formats, etc.
HTH,
Radim