Creating a corpus in Python using text files

Seamus Shanley

Jun 9, 2011, 10:03:03 AM
to nltk-users
Hello, I am currently doing a master's thesis on finding the most
accurate classifier for sentiment analysis. I have decided to study
this using movie reviews, since that domain provides comments with
clear polarity. I have created a number of text files and sorted them
into two folders, positive and negative. My main problem is loading
these files into a corpus and then getting the data into Python to
train and evaluate a classifier.

From a previous thread I found https://github.com/japerk/nltk-trainer,
which looks useful. However, I have had difficulty finding a link to
download the "nltk_trainer.classification" code that
train_classifier.py uses.

I have also found a chapter online that helped me understand how to
create corpus data:
https://www.packtpub.com/sites/default/files/3609-chapter-3-creating-custom-corpora.pdf
I tried to load the text files by first following the lazy corpus
loading section and then attempting to create a concatenated corpus
view, but I had difficulty with the concatenation code.

I found another page, http://streamhacker.com/tag/python/page/2/, which
shows how to measure precision and recall using NaiveBayesClassifier.
However, I do not know how to use my own dataset in this exercise in
place of the line "from nltk.corpus import movie_reviews".

I would like to ask which approach would be of most benefit for my
research: the code mentioned in the second paragraph, or the approach
in the third paragraph. I would also like help getting the Python code
to work in that case.

I would be grateful for any information you can provide.

Alexis Dimitriadis

Jun 9, 2011, 11:42:17 AM
to nltk-...@googlegroups.com
> My main problem is loading these files into a corpus and then getting
> the data into Python to train and evaluate a classifier.

Don't know what nltk-trainer or the code in the Cookbook would buy you,
but starting up an nltk corpus reader is pretty trivial: Supposing your
files are in corpus/pos and corpus/neg, you can just say

    import nltk

    reader = nltk.corpus.reader.PlaintextCorpusReader(
        r"./corpus", r"(pos|neg)/.*\.txt")
    print(reader.sents()[0:3])  # etc.

The first argument is the base directory, the second an RE (not a glob)
matching the filenames to include. But you'll probably want to use a
CategorizedCorpusReader instead, see


http://nltk.googlecode.com/svn/trunk/doc/api/nltk.corpus.reader.api.CategorizedCorpusReader-class.html

The constructor accepts a flag with REs that map filenames to categories.
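
To make the "RE, not a glob" point concrete, here is a minimal sketch
that uses only the standard library. The file names are invented for
illustration, and re.fullmatch is used on the assumption that the
reader requires the whole relative path to match the pattern.

```python
import fnmatch
import re

# Hypothetical file ids, relative to the corpus root.
fileids = [
    "pos/review1.txt",
    "neg/review2.txt",
    "other/review3.txt",
    "pos/notes.doc",
]

# The reader's second argument is a regular expression, so the
# alternation (pos|neg) restricts which folders are accepted.
pattern = r"(pos|neg)/.*\.txt"
regex_hits = [f for f in fileids if re.fullmatch(pattern, f)]
print(regex_hits)

# A shell-style glob such as "*.txt" is a different language: here it
# also accepts the file in the "other" folder.
glob_hits = [f for f in fileids if fnmatch.fnmatch(f, "*.txt")]
print(glob_hits)

# Read as a regular expression, "*.txt" is not even valid, because
# nothing precedes the "*" for it to repeat.
try:
    re.compile("*.txt")
    glob_is_valid_regex = True
except re.error:
    glob_is_valid_regex = False
print(glob_is_valid_regex)
```

In a regex, "." matches any character and ".*" means "any run of
characters", which is why the pattern writes the literal dot before
"txt" as "\.".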

Have fun with it,

Alexis

Jacob Perkins

Jun 10, 2011, 11:31:04 AM
to nltk-users
Hi Seamus,

I've written a series of articles evaluating sentiment classifiers on
the movie_reviews corpus:

http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier/
http://streamhacker.com/2010/05/24/text-classification-sentiment-analysis-stopwords-collocations/
http://streamhacker.com/2010/05/17/text-classification-sentiment-analysis-precision-recall/
http://streamhacker.com/2010/06/16/text-classification-sentiment-analysis-eliminate-low-information-features/

You should be able to copy the code in the articles, then replace
movie_reviews with your own corpus (as Alexis described) to perform
similar evaluations.
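
As a rough sketch of what "replace movie_reviews with your own corpus"
involves, the block below builds (features, label) pairs and a
train/test split from a toy stand-in corpus. The toy_corpus dictionary,
the split_featuresets helper and the three-quarters split are
assumptions made for illustration; with a real categorized reader the
tokens would come from the reader rather than from a literal.

```python
def word_feats(words):
    """Bag-of-words features: every word that occurs maps to True."""
    return {word: True for word in words}


# Hypothetical stand-in for a categorized corpus: label -> tokenized
# documents. A real reader would supply these tokens per file id.
toy_corpus = {
    "pos": [["great", "film"], ["loved", "it"],
            ["a", "fine", "cast"], ["very", "funny"]],
    "neg": [["dull", "plot"], ["hated", "it"],
            ["a", "weak", "script"], ["too", "long"]],
}


def split_featuresets(corpus, train_fraction=0.75):
    """Return (train, test) lists of (feature_dict, label) pairs."""
    train, test = [], []
    for label, docs in corpus.items():
        feats = [(word_feats(doc), label) for doc in docs]
        cutoff = int(len(feats) * train_fraction)
        train.extend(feats[:cutoff])   # first part of each class trains
        test.extend(feats[cutoff:])    # the remainder is held out
    return train, test


train_feats, test_feats = split_featuresets(toy_corpus)
print(len(train_feats), len(test_feats))
```

Lists of (feature dict, label) pairs like these are the input that
NaiveBayesClassifier.train and nltk.classify.util.accuracy work with.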

You can also use nltk-trainer's train_classifier.py to test different
classification methods on your corpus. So if you've set up a corpus
similar to movie_reviews in ~/nltk_data/corpora/sentiment, then you
can do

$ python train_classifier.py sentiment

And it will train a NaiveBayesClassifier and output evaluation
results. You can also add --cross-fold 10 for 10-fold
cross-validation and/or --classifier Maxent to train a Maximum Entropy
classifier.
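
For readers new to the term, the following is a minimal sketch of what
k-fold cross-validation does with the data. It is not nltk-trainer's
implementation; k_fold_indices is a hypothetical helper, and a real
run would normally shuffle the documents before splitting.

```python
def k_fold_indices(n_items, k):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_items))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# 20 documents, 10 folds: each document is held out exactly once.
folds = list(k_fold_indices(20, 10))
for train, test in folds[:2]:
    print(len(train), test)
```

A classifier is trained on each train part and scored on the matching
test part, and the ten scores are then averaged.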

Jacob
---
http://streamhacker.com/
http://text-processing.com/
http://twitter.com/japerk

Seamus Shanley

Jun 10, 2011, 12:14:00 PM
to nltk-users
Hey guys, thanks for the information, very much appreciated. Jacob, I
was indeed looking at your article on testing the precision of
NaiveBayesClassifier, and I am trying to copy in the code now. I've
managed to load my own corpus using the code

    reader = nltk.corpus.reader.PlaintextCorpusReader(
        r"./corpus", r"(pos|neg)/.*\.txt")
    print(reader.sents()[0:3])  # etc.

which Alexis mentioned. I now get an error when I try to enter
negids = reader.fileids('neg'). The error message says

    Traceback (most recent call last):
      File "<pyshell#9>", line 1, in <module>
        negids = reader.fileids('neg')
    AttributeError: 'module' object has no attribute 'fileids'

I presume this error is a result of the way I loaded my data. What
changes do I need to make? Thank you again for your help.

Alexis Dimitriadis

Jun 11, 2011, 6:44:37 AM
to nltk-...@googlegroups.com
> negids = reader.fileids('neg')
> AttributeError: 'module' object has no attribute 'fileids'

Looks like "reader" refers to the module nltk.corpus.reader, not to your
object. Did you use "from nltk.corpus import reader"? (I shouldn't have
suggested "reader" as an object name, sorry). Just change the variable
name and try again:

sentimentcorpus = nltk.corpus.reader.PlaintextCorpusReader(...)

Alexis

PS. Here are some commands you can use to inspect python objects:

type(reader)
dir(reader)
help(reader), help(dir), help(sentimentcorpus.fileids), etc.
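
A small sketch of how those inspection commands expose this kind of
mix-up. To keep it runnable without NLTK, the json module stands in
for the accidentally imported reader module, and TinyCorpus is a
hypothetical stand-in for a real corpus object.

```python
import json as reader  # stand-in: the name "reader" now refers to a module
import types


class TinyCorpus:
    """Hypothetical stand-in for a corpus reader object."""

    def fileids(self):
        return ["pos/a.txt", "neg/b.txt"]


# type() shows at once that "reader" is a module, not a corpus object,
# and hasattr()/dir() confirm that it has no fileids attribute.
print(type(reader))
print(hasattr(reader, "fileids"))
name_is_module = isinstance(reader, types.ModuleType)

# Binding the corpus object to a distinct name avoids the clash.
sentimentcorpus = TinyCorpus()
print(type(sentimentcorpus).__name__)
print("fileids" in dir(sentimentcorpus))
print(sentimentcorpus.fileids())
```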

Seamus Shanley

Jun 11, 2011, 9:31:24 AM
to nltk-users
Hey there, I am still having problems getting my dataset into the
code. I renamed it to "sentimentcorpus" instead of "reader", but when
I enter "from nltk.corpus import sentimentcorpus" I get an error
message saying "ImportError: cannot import name sentimentcorpus". What
code do I need to use to load the data?

Alexis Dimitriadis

Jun 11, 2011, 1:45:17 PM
to nltk-...@googlegroups.com

You load your data using the call to PlaintextCorpusReader. The "import"
command imports python modules, which are code, not data. The command
"from nltk.corpus import reader" was not a solution--it's what probably
caused your problem.

Name your data folders whatever you want, just adjust the reader
arguments. Assuming your files are named ./corpus/pos/*.txt,
./corpus/neg/*.txt, the following is a complete working program:

    import nltk

    sentimentcorpus = nltk.corpus.reader.PlaintextCorpusReader(
        r"./corpus", r"(pos|neg)/.*\.txt")

    print(sentimentcorpus.fileids())

Alexis

Seamus Shanley

Jun 11, 2011, 8:37:15 PM
to nltk-users
OK, I understand that by loading my own corpus I am loading data
rather than importing code. I am still having problems using this in
a program. The following error occurred after I created my dataset
under the name "mysentiment":

    import collections
    import nltk.classify.util, nltk.metrics
    from nltk.classify import NaiveBayesClassifier

    def evaluate_classifier(featx):
        negids = mysentiment.fileids('neg')
        posids = mysentiment.fileids('pos')

    negfeats = [(featx(mysentiment.words(fileids=[f])), 'neg') for f in negids]

    Traceback (most recent call last):
      File "<pyshell#9>", line 1, in <module>
    NameError: name 'negids' is not defined

The system is telling me it does not recognise negids despite the fact
that I defined it just a few lines earlier. I don't understand why
this error occurs: creating "mysentiment" did not give me any errors,
so surely my data must be in the system?

Seamus Shanley

Jun 11, 2011, 8:56:38 PM
to nltk-users
Comparing my dataset with a built-in corpus such as movie_reviews, I
see that when I use the code "print mysentiment.fileids()" the result
I receive is simply [], when it should be a list of all the text files
contained in my corpus. When I used the same code for the
movie_reviews corpus I received a full list of all the text files in
that corpus. I would like to know why this is the case, and how I
could remedy it.

Alexis Dimitriadis

Jun 12, 2011, 4:47:52 AM
to nltk-...@googlegroups.com
On 12/06/2011 02:56, Seamus Shanley wrote:

> when I use the code "print mysentiment.fileids()" the result I
> receive is simply []

You're obviously using the wrong pathname or filenames for the reader,
so it's not finding your files.
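
One way to check which files a given root and pattern would pick up,
sketched with the standard library only. matching_fileids is a
hypothetical helper, not part of NLTK, and it assumes the reader
matches the pattern against the whole path relative to the root,
written with forward slashes. The throwaway folders are invented for
the demonstration.

```python
import os
import re
import tempfile


def matching_fileids(root, pattern):
    """List files under root whose relative path fully matches pattern."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            rel = os.path.relpath(os.path.join(dirpath, name), root)
            rel = rel.replace(os.sep, "/")  # compare with forward slashes
            if re.fullmatch(pattern, rel):
                hits.append(rel)
    return sorted(hits)


# Build a throwaway directory laid out like a small two-folder corpus.
root = tempfile.mkdtemp()
for folder, name in [("positive", "a.txt"), ("negative", "b.txt")]:
    os.makedirs(os.path.join(root, folder), exist_ok=True)
    with open(os.path.join(root, folder, name), "w") as handle:
        handle.write("placeholder review text\n")

# The folder names must match the pattern exactly, or nothing is found.
print(matching_fileids(root, r"(pos|neg)/.*\.txt"))
print(matching_fileids(root, r"(positive|negative)/.*\.txt"))
```

An empty list from the first call and two paths from the second is the
same symptom and cure as an empty fileids() result from the reader.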

> def evaluate_classifier(featx):
>     negids = mysentiment.fileids('neg')
>     posids = mysentiment.fileids('pos')
>
> negfeats = [(featx(mysentiment.words(fileids=[f])), 'neg') for f in negids]

> The system is telling me it does not recognise negids despite the fact
> that I defined it just a few lines earlier.

You defined the variable negids inside a function, so even if you called
the function, the variable would not be visible outside it. Study the
python tutorial to understand python functions and variable scope.
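
A minimal sketch of the scoping rule in question. The function name
and file ids are made up; the point is only that a name assigned
inside a function has to be returned, and bound again by the caller,
before it can be used outside.

```python
def make_ids():
    negids = ["neg/a.txt", "neg/b.txt"]  # local to make_ids
    return negids


make_ids()  # calling the function does not create an outer "negids"

try:
    print(negids)  # the local name is not visible out here
    visible = True
except NameError:
    visible = False
print(visible)

# Bind the returned value to a name at this level to use it afterwards.
negids = make_ids()
print(len(negids))
```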

Alexis

Seamus Shanley

Jun 12, 2011, 2:14:32 PM
to nltk-users
It does appear that I cannot find the correct path for the reader. The
files are located in C:\Python26\nltk_data\corpora\seamus, which
contains two folders, Negative and Positive. Inside these two folders
are text files; for example, Hanna.txt is in the Positive folder and
Arthur.txt is in the Negative folder. I tried to load the data as
follows:

    mysentiment = nltk.corpus.reader.PlaintextCorpusReader(
        r"nltk_data/corpora/seamus", r"(pos|neg)/Positive/Hanna.txt")

This is incorrect; I believe the second argument is where I am
entering the information wrongly. Could you please show me how to
write this code correctly? Apologies for this, I am not trained as a
programmer and am trying to learn this from scratch.

Alexis Dimitriadis

Jun 12, 2011, 3:46:54 PM
to nltk-...@googlegroups.com
With the paths you describe, this should work:

    root = nltk.data.find(r'corpora/seamus')
    mysentiment = nltk.corpus.reader.PlaintextCorpusReader(
        root, r"(Positive|Negative)/.*\.txt")

The two parts of the path (root + the second argument) must add up to
your filenames. The above will match ALL the files, not just one. For
just Hanna, use r"Positive/Hanna.txt". See it?

Sorry but you'll need to become a bit of a programmer so you can help
yourself more. Python is the glue that you must use to string the parts
of the nltk together, so you need to understand how it works or you'll
be permanently stuck. It's not that hard and it's fun to learn-- dive in
to the python tutorial and have fun with it!

Good luck with it all,

Alexis

Seamus Shanley

Jun 12, 2011, 3:50:50 PM
to nltk-users
Disregard the above message. I have realised my error: I should have
typed (Positive|Negative) instead of (pos|neg). I am now trying to
train and test the NaiveBayesClassifier from the article at
http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier/

When I type negids = mysentiment.fileids('Negative') I get a TypeError
saying that fileids() takes exactly 1 argument (2 given). So I cannot
directly copy the code written for the movie_reviews corpus, since
mysentiment seems to count as an argument by itself. Is it therefore
not possible to train a classifier by following this article directly?

Alexis Dimitriadis

Jun 12, 2011, 5:09:23 PM
to nltk-...@googlegroups.com
On 12/06/2011 21:50, Seamus Shanley wrote:
> When i type in the code negids = mysentiment.fileids('Negative') I get
> a TypeError saying that fileids() takes exactly 1 argument (2 given).

For that you need a Categorized corpus reader. See my very first
response to your queries.
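
For the folder layout described in this thread, a categorized reader
might be set up roughly as follows. This is a sketch, not tested
against that corpus: it assumes NLTK's CategorizedPlaintextCorpusReader
class and its cat_pattern keyword, where the first group of the regex
is taken as the category of each file. category_of and
load_sentiment_corpus are hypothetical helper names.

```python
import re

# Which files to include, and how to derive a category from each file id.
FILE_PATTERN = r"(Positive|Negative)/.*\.txt"
CAT_PATTERN = r"(Positive|Negative)/.*"


def category_of(fileid):
    """Mimic the cat_pattern rule: group 1 of the match is the category."""
    match = re.match(CAT_PATTERN, fileid)
    return match.group(1) if match else None


def load_sentiment_corpus(root):
    """Build a categorized reader for the corpus under root."""
    # Imported here so the pattern helper above can be tried without NLTK.
    from nltk.corpus.reader import CategorizedPlaintextCorpusReader
    return CategorizedPlaintextCorpusReader(
        root, FILE_PATTERN, cat_pattern=CAT_PATTERN)


# The pattern can be checked on its own before touching any files.
print(category_of("Negative/Arthur.txt"))
print(category_of("Positive/Hanna.txt"))
```

With a reader built this way, for example
mysentiment = load_sentiment_corpus(nltk.data.find('corpora/seamus')),
a call such as mysentiment.fileids('Negative') should accept the
category argument that the plain reader rejected.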

Alexis

Seamus Shanley

Jun 12, 2011, 6:03:49 PM
to nltk-users
I really appreciate your help, Alexis. I am reading the link about
CategorizedCorpusReader. Could you tell me what code I need to use to
load the dataset with this type of corpus reader, in the same form as
the PlaintextCorpusReader call:

    reader = nltk.corpus.reader.PlaintextCorpusReader(
        r"./corpus", r"(pos|neg)/.*\.txt")

Thanking you

Alexis Dimitriadis

Jun 13, 2011, 5:39:20 AM
to nltk-...@googlegroups.com
Hi Seamus,

You are welcome, I'm always glad to help. But I've already pointed you
to the manual page you need, so I think you can help yourself now.

Good luck with your MA thesis,

Alexis
