Building a corpus

Tom Tourwe

Apr 27, 2012, 9:04:17 AM
to gensim
Hi all,

I've just discovered gensim and would like to use it for LDA. I've
read the documentation, but am still having trouble constructing a
valid corpus. I would like to read a CSV file and treat the third
column of each line as a document in the corpus. Here's my code:

import csv
import logging
from gensim import corpora, models

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

class EventCorpus(corpora.TextCorpus):
    def get_texts(self):
        for line in self.input:
            yield (line[1] + "." + line[2]).lower().split()

r = csv.reader(open("Event_data_small.csv", "r"), dialect='excel',
               delimiter=';', quotechar='"', skipinitialspace=True)
r.next()
corpus = EventCorpus(r)

for document in corpus:
    print document

This doesn't print anything, and it seems the corpus does not contain
any documents? However, the corpus.dictionary does contain all the
words in my documents.

What am I missing here?

Thanks for any answers, Tom

Tom Tourwe

Apr 27, 2012, 10:30:44 AM
to gensim
To avoid any confusion, the code should read like this:

import csv
import logging
from gensim import corpora, models
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

class EventCorpus(corpora.TextCorpus):
    def get_texts(self):
        for line in self.input:
            yield line[2].lower().split()

r = csv.reader(open("Event_data_small.csv", "r"), dialect='excel',
               delimiter=';', quotechar='"', skipinitialspace=True)
r.next()
corpus = EventCorpus(r)

for document in corpus:
    print document

Radim Řehůřek

Apr 27, 2012, 2:03:31 PM
to gensim
Hello Tom,

the problem is that you pass an open csv.reader object to EventCorpus.
The TextCorpus constructor already iterates over it once to build the
dictionary, so any subsequent `for line in self.input` will not yield
anything anymore (the csv stream has been exhausted).
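
To illustrate the exhaustion outside gensim, here is a minimal sketch in
Python 3 using only the standard library. The in-memory CSV text and its
column names are made up for illustration; they only stand in for
Event_data_small.csv.

```python
import csv
import io

# Hypothetical stand-in for Event_data_small.csv (semicolon-delimited).
data = "id;name;description\n1;a;First event\n2;b;Second event\n"

reader = csv.reader(io.StringIO(data), delimiter=";")
next(reader)  # skip the header row (Python 3 spelling of r.next())

# The first pass consumes every remaining row of the reader.
first_pass = [row[2].lower().split() for row in reader]

# The second pass finds nothing left: the reader is a one-shot iterator.
second_pass = [row[2].lower().split() for row in reader]

print(first_pass)
print(second_pass)
```

The second list comes out empty, which is the same effect as the silent
`for document in corpus` loop.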

It is a Python thing, nothing to do with gensim. To fix your CSV
corpus, pass only the filename to the constructor: `corpus =
EventCorpus("Event_data_small.csv")` and change `get_texts()` to
re-open the CSV file on each corpus pass: `for line in
csv.reader(open(self.input), ...)`.
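
A rough sketch of the re-open-on-each-pass pattern, written in Python 3 with
only the standard library so it runs without gensim. The class name
`CsvColumnTexts`, the `column` parameter and the temporary demo file are
hypothetical and exist only for illustration; the reader options are the ones
from the original code.

```python
import csv
import os
import tempfile


class CsvColumnTexts:
    """Re-iterable stream of token lists taken from one CSV column."""

    def __init__(self, filename, column=2):
        # Store only the filename, not an open reader.
        self.filename = filename
        self.column = column

    def __iter__(self):
        # Re-open the file on every pass so each iteration starts fresh.
        with open(self.filename, newline="") as f:
            reader = csv.reader(f, dialect="excel", delimiter=";",
                                quotechar='"', skipinitialspace=True)
            next(reader, None)  # skip the header row, if any
            for row in reader:
                yield row[self.column].lower().split()


# Demo on a throwaway file with made-up contents.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False,
                                 newline="") as tmp:
    tmp.write("id;name;description\n1;a;First event\n2;b;Second event\n")
    path = tmp.name

texts = CsvColumnTexts(path)
first_pass = list(texts)
second_pass = list(texts)  # same documents again, nothing is exhausted
os.remove(path)

print(first_pass == second_pass)
```

In the TextCorpus subclass, the body of `__iter__` above would go into
`get_texts()`, with `self.input` holding the filename.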

Best,
Radim


> class EventCorpus(corpora.TextCorpus):
>     def get_texts(self):
>         for line in self.input:
>             yield line[2].lower().split()

^^ this will