Building a text corpus in gensim from a directory of text documents


Shivani

Apr 14, 2011, 11:24:52 PM
to gensim
I have a directory of text documents that I want to index and build a
topic model on.
Each file in the directory is a document containing plain text.
Let's assume the text has already been pruned of stopwords, special
characters, etc.


Will I need to write a custom override of the get_texts() function?

Any help is greatly appreciated.

Regards,
Shivani

Radim

Apr 15, 2011, 12:59:23 PM
to gensim
Hello Shivani,

> Each file in the directory is a document containing plain text.
> Let's assume the text has already been pruned of stopwords, special
> characters, etc.
>
> Will I need to write a custom override of the get_texts() function?

exactly, all you have to do is inherit from `corpora.TextCorpus` and
override `get_texts()` so that it yields each document as a list of
tokens.

class MyCorpus(gensim.corpora.TextCorpus):
    def get_texts(self):
        for filename in self.input:  # for each relevant file
            yield tokenize(open(filename).read())

mycorpus = MyCorpus(['file1.txt', 'file2.txt', ...])
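The `tokenize` helper in the snippet above is assumed to exist; it just needs to turn a document's raw text into a list of tokens. A minimal sketch of one possible implementation (this particular regex-based tokenizer is an illustration, not gensim's own):

```python
import re

def tokenize(text):
    # lowercase the text and split on runs of non-alphabetic characters;
    # a simple stand-in for whatever tokenizer you prefer
    return [token for token in re.split(r"[^a-z]+", text.lower()) if token]

print(tokenize("Hello, gensim World!"))  # → ['hello', 'gensim', 'world']
```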

The dictionary (word->word_id mapping) will then be in
`mycorpus.dictionary`. You can prune it, remove unwanted tokens etc.

`mycorpus` is a proper gensim corpus, so you can pass it into
transformations, store in different formats etc.

HTH,
Radim

Shivani

Apr 21, 2011, 1:12:30 PM
to gensim
Hello Radim,

Thanks so much for your reply. That really helped.

I finally used the following to construct a corpus.

def split_line(text):
    # note: this is equivalent to just text.split()
    words = text.split()
    out = []
    for word in words:
        out.append(word)
    return out

import gensim

class MyCorpus(gensim.corpora.TextCorpus):
    def get_texts(self):
        for filename in self.input:
            yield split_line(open(filename).read())

However, I don't know what happens to the dictionary.

How do I create a dictionary while creating the corpus, or, if one is
already created, how do I access it?

Any help is appreciated.

Shivani

Radim

Apr 21, 2011, 1:38:13 PM
to gensim
Hello,

On Apr 21, 6:12 pm, Shivani <raoshiv...@gmail.com> wrote:
> Hello Radim,
>
> Thanks so much for your reply. That really helped.
>
> I finally used the following to construct a corpus.
>
> def split_line(text):
>     words = text.split()
>     out = []
>     for word in words:
>         out.append(word)
>     return out
>
> import gensim
> class MyCorpus(gensim.corpora.TextCorpus):
>     def get_texts(self):
>         for filename in self.input:
>             yield split_line(open(filename).read())
>
> However, I don't know what happens to the dictionary.
>
> How do I create a dictionary while creating the corpus, or, if one is
> already created, how do I access it?

the answer is in my previous reply; do you have a problem with any
part in particular?

Radim

Shivani

Apr 21, 2011, 2:28:20 PM
to gensim
Hello Radim,

I get the following error:

>>> for word in myCorpus.dictionary:
...     print word
...
books
college
community
electronic
library
student
volumes
Afghanistan
Iraq
Soldier
battle
combat
dioxide
field
traders
Exchange
Stock
dollar
market
street
wall
greenhouse
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.6/dist-packages/gensim-0.7.8-py2.6.egg/gensim/corpora/dictionary.py", line 52, in __getitem__
    return self.id2token[tokenid] # will throw for non-existent ids
KeyError: 22


Does this mean there is no mapping?

Thanks a lot for your help,
Shivani


Radim

Apr 22, 2011, 2:03:36 PM
to gensim
Hello Shivani,

the way to access the ids is `myCorpus.dictionary.token2id.values()`.
The way to access the words is `myCorpus.dictionary.token2id.keys()`.
To access (word, id) pairs, use `myCorpus.dictionary.token2id.items()`.
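As a minimal illustration of those three access patterns, here is a hand-built dict standing in for a real `myCorpus.dictionary.token2id` (the tokens and ids below are made up for the example):

```python
# a stand-in for myCorpus.dictionary.token2id, which is a plain
# Python dict mapping each token to its integer id
token2id = {"books": 0, "college": 1, "library": 2}

print(sorted(token2id.values()))  # the ids: [0, 1, 2]
print(sorted(token2id.keys()))    # the words: ['books', 'college', 'library']
print(sorted(token2id.items()))   # (word, id) pairs
```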

But I actually like your idea of using `myCorpus.dictionary` directly,
without the extra `token2id`. In a future release, I will extend the
Dictionary class to fully support such iteration, including the case
where you keep adding more documents to the dictionary dynamically
(which is probably what you did here, and is the cause of your
KeyError exception).

Best,
Radim

Shivani

Apr 22, 2011, 2:48:14 PM
to gensim
That is a useful function to know.
Thanks!
Shivani

lanndo

Nov 8, 2012, 9:14:13 AM
to gen...@googlegroups.com
Hello Radim,
Related to this topic: if I have a directory of text with various subdirectories (or even with no subdirectories, just all files in one folder), is there a Python or gensim function that would circumvent the need to supply the file names within the script? That is, a command that would recurse through a directory, reading each file in sequence until done. I am thinking of directories with a massive number of documents, or with deep subdirectory structures. Thank you!

Anton Bondar

Nov 8, 2012, 2:11:41 PM
to gen...@googlegroups.com
Hello,

Thank you for the code you posted above; it is very useful and saves a lot of work! I would like to know if there is a way to provide an already-made dictionary when processing a list of txt files into a gensim corpus using your code:

class MyCorpus(gensim.corpora.TextCorpus):
    def get_texts(self):
        for filename in self.input:
            yield tokenize(open(filename).read())


Thank you so much!
<3 gensim

Radim Řehůřek

Nov 9, 2012, 2:45:14 AM
to gensim
Hello Anton,


On Nov 8, 8:11 pm, Anton Bondar <anton.bond...@gmail.com> wrote:
> Hello,
>
> Thank you for the code you posted above as it is very useful and it saves a
> lot of work! I would like to know if there is a way to provide an already made
> dictionary to process a list of txt files into a gensim corpus using your
> code:

Yes. And in fact, you don't need to change anything.

Just create the corpus with `c = MyCorpus()` and then set the input
and dictionary attributes with `c.input, c.dictionary = your_input,
already_made_dictionary`.

HTH,
Radim

Radim Řehůřek

Nov 9, 2012, 2:47:22 AM
to gensim
Hello lanndo,

there's already a built-in function for that in Python -- have a look
at `os.walk`.
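For example, a small generator built on `os.walk` can yield every file path under a directory tree, ready to feed into the corpus class from earlier in the thread (the helper name here is just an illustration):

```python
import os

def iter_files(top):
    """Yield the path of every file under `top`, recursing into subdirectories."""
    for dirpath, dirnames, filenames in os.walk(top):
        for name in filenames:
            yield os.path.join(dirpath, name)

# the resulting paths can then be fed straight into MyCorpus:
# mycorpus = MyCorpus(list(iter_files('/path/to/corpus')))
```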

Best,
Radim

Swapnajit Chakraborti

May 27, 2014, 1:23:52 PM
to gen...@googlegroups.com
Hello,

Posting to this old thread with a doubt I recently ran into.
It seems get_texts() is also called when we iterate over the
corpus, e.g.

for vec in mycorpus:
   print vec

Hence if we do something in get_texts() apart from tokenizing,
everything gets repeated.

Is there a way to stop this?

Regards,
Swapnajit 

Swapnajit Chakraborti

Jun 5, 2014, 8:49:44 AM
to gen...@googlegroups.com
Any input on this will be appreciated.

Regards,
Swapnajit

Swapnajit Chakraborti

Jun 5, 2014, 9:39:01 AM
to gen...@googlegroups.com
A little addendum to my earlier query:
It seems get_texts() is also called for corpus2dense() and for model
conversions such as TF-IDF.

Is this expected behavior? It hits performance badly.

Regards,
Swapnajit

Radim Řehůřek

Jun 5, 2014, 11:34:00 AM
to gen...@googlegroups.com
Hello Swapnajit,

the `get_texts` method of TextCorpus is called every time you iterate over the corpus. That's its sole purpose.

If you want to perform computations not related to iteration, move them outside of `get_texts`, for example into a different method.
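One way to avoid repeating expensive work is to do it once and cache the result, so later iterations only replay the cached tokens. A minimal sketch of that idea in plain Python (the class and parameter names here are hypothetical, not part of gensim):

```python
class CachedTexts:
    """Run an expensive preprocessing step once per file, cache the result,
    and serve the cached token lists on every subsequent iteration."""

    def __init__(self, filenames, preprocess):
        self._filenames = filenames
        self._preprocess = preprocess  # expensive callable: filename -> tokens
        self._cache = None

    def __iter__(self):
        if self._cache is None:
            # first pass: do the heavy lifting exactly once
            self._cache = [self._preprocess(name) for name in self._filenames]
        return iter(self._cache)
```

A `get_texts()` override could then simply loop over such a cached object, so repeated iteration of the corpus no longer re-runs the preprocessing.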

HTH,
Radim

Swapnajit Chakraborti

Jun 5, 2014, 1:18:33 PM
to gen...@googlegroups.com
Thanks Radim for confirming this.
I shall check out your suggestion.

Regards,
Swapnajit