I don't know what nltk-trainer or the code in the Cookbook would buy you,
but starting up an nltk corpus reader is pretty trivial. Supposing your
files are in corpus/pos and corpus/neg, you can just say
import nltk
reader = nltk.corpus.reader.PlaintextCorpusReader(r"./corpus",
    r"(pos|neg)/.*\.txt")
print(reader.sents()[0:3])  # etc.
The first argument is the base directory, the second an RE (not a glob)
matching the filenames to include. But you'll probably want to use a
CategorizedCorpusReader instead, see
The constructor accepts a keyword argument (cat_pattern) with an RE that
maps filenames to categories.
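To make the categorized variant concrete, here is a minimal sketch. It assumes the class meant is NLTK's CategorizedPlaintextCorpusReader and that categories are taken from the directory names through its cat_pattern keyword. The temporary directory, the file names and the file contents are invented for the demonstration; point the reader at your own folder instead.

```python
import os
import tempfile

from nltk.corpus.reader import CategorizedPlaintextCorpusReader

# Hypothetical demo files; substitute your own corpus directory.
DEMO_FILES = {
    "pos/good.txt": "A fine film.",
    "neg/bad.txt": "A dull film.",
}

with tempfile.TemporaryDirectory() as root:
    # Write the demo files under root/pos and root/neg.
    for relpath, text in DEMO_FILES.items():
        path = os.path.join(root, *relpath.split("/"))
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w", encoding="utf8") as handle:
            handle.write(text)

    corpus = CategorizedPlaintextCorpusReader(
        root,
        r"(pos|neg)/.*\.txt",          # RE selecting the files to include
        cat_pattern=r"(pos|neg)/.*",   # first group becomes the category
    )
    categories = corpus.categories()
    pos_ids = corpus.fileids("pos")
    neg_ids = corpus.fileids(categories="neg")

print(categories)
print(pos_ids, neg_ids)
```

With a reader like this, calls such as fileids('neg') or words(categories='pos') select files by category rather than by name.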
Have fun with it,
Alexis
Looks like "reader" refers to the module nltk.corpus.reader, not to your
object. Did you use "from nltk.corpus import reader"? (I shouldn't have
suggested "reader" as an object name, sorry). Just change the variable
name and try again:
sentimentcorpus = nltk.corpus.reader.PlaintextCorpusReader(...)
Alexis
PS. Here are some commands you can use to inspect Python objects:
type(reader)
dir(reader)
help(reader), help(dir), help(sentimentcorpus.fileids), etc.
You load your data using the call to PlaintextCorpusReader. The "import"
command imports Python modules, which are code, not data. The command
"from nltk.corpus import reader" was not a solution; it is probably what
caused your problem.
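The effect is easy to reproduce without nltk. The sketch below uses the standard library's email.parser module purely as a stand-in: the name "parser" and the dictionary are hypothetical, chosen only to show how a later "from ... import name" silently replaces an object that was already bound to that name.

```python
import types

# First, bind the name to our own object (a stand-in for a corpus reader).
parser = {"kind": "my own object"}
was_module_before = isinstance(parser, types.ModuleType)

# A later "from package import name" rebinds the same name to a module.
from email import parser  # stdlib module, used here only as a stand-in

is_module_after = isinstance(parser, types.ModuleType)

print(was_module_before)  # the name held our dictionary
print(is_module_after)    # the name now holds the module email.parser
print(type(parser))       # type() shows what the name refers to now
```

Running type() on the name, as suggested above, is the quickest way to notice that it no longer refers to your own object.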
Name your data folders whatever you want, just adjust the reader
arguments. Assuming your files are named ./corpus/pos/*.txt,
./corpus/neg/*.txt, the following is a complete working program:
import nltk
sentimentcorpus = nltk.corpus.reader.PlaintextCorpusReader(r"./corpus",
    r"(pos|neg)/.*\.txt")
print(sentimentcorpus.fileids())
Alexis
> when i use the code "print mysentiment.fileids()" the answerset I
> receive is just simply [ ]
You're obviously using the wrong pathname or filenames for the reader,
so it's not finding your files.
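One way to track this down is to list what is actually on disk and test the pattern against it. The helper below is a rough diagnostic sketch using only the standard library. It is not nltk's own code, and it assumes the reader compares the RE against each file's path relative to the root, written with forward slashes. The name candidate_fileids and the "./corpus" root are hypothetical.

```python
import os
import re


def candidate_fileids(root, pattern):
    """Return (all relative paths under root, those the RE matches in full)."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            rel = os.path.relpath(os.path.join(dirpath, name), root)
            found.append(rel.replace(os.sep, "/"))  # always use "/" separators
    found.sort()
    matched = [f for f in found if re.fullmatch(pattern, f)]
    return found, matched


root = "./corpus"  # hypothetical; use the root you pass to the reader
print("Looking in:", os.path.abspath(root))
print("Directory exists:", os.path.isdir(root))

found, matched = candidate_fileids(root, r"(pos|neg)/.*\.txt")
print("Files on disk:", found)
print("Files the pattern accepts:", matched)
```

If the first list is empty, the root is wrong; if only the second is empty, the pattern does not fit the file names.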
> def evaluate_classifier(featx):
>     negids=mysentiment.fileids('neg')
>     posids=mysentiment.fileids('pos')
>
>     negfeats = [(featx(mysentiment.words(fileids=[f])), 'neg') for f in
>                 negids
> The system is telling me it does not recognise negids despite the fact
> i have just created the file name a few lines earlier.
You defined the variable negids inside a function, so even if you called
the function, the variable would not be visible outside it. Study the
Python tutorial to understand Python functions and variable scope.
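A small illustration of the scope rule, with nothing from nltk in it. The function collect_ids and the file names are made up; the point is only that a name assigned inside a function stays there unless the function returns the value.

```python
def collect_ids():
    negids = ["neg/a.txt", "neg/b.txt"]  # local to collect_ids
    return negids                        # hand the value back to the caller


try:
    print(negids)                # no such name exists at this level yet
except NameError as err:
    outside_error = str(err)
    print("NameError:", outside_error)

negids = collect_ids()           # now the name exists at this level too
print(negids)
```

Returning the value and assigning it where you need it is the usual way to get data out of a function.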
Alexis
root = nltk.data.find(r'corpora/seamus')
mysentiment = nltk.corpus.reader.PlaintextCorpusReader(root,
r"(Positive|Negative)/.*\.txt")
The two parts of the path (root + the second argument) must add up to
your filenames. The above will match ALL the files, not just one. For
just Hanna, use r"Positive/Hanna\.txt" (with the dot escaped, since the
second argument is an RE). See it?
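You can try patterns out with the re module before handing them to the reader. This sketch assumes the pattern has to match the whole path relative to the root. The list of file ids and the select helper are hypothetical. It also shows why a glob such as "*.txt" will not do: it is not even a valid RE.

```python
import re

# Hypothetical file ids, relative to the corpus root.
fileids = [
    "Positive/Hanna.txt",
    "Positive/Other.txt",
    "Negative/Hanna.txt",
    "notes.txt",
]


def select(pattern, names):
    """Keep the names that the RE matches in full."""
    return [n for n in names if re.fullmatch(pattern, n)]


all_reviews = select(r"(Positive|Negative)/.*\.txt", fileids)
just_hanna = select(r"Positive/Hanna\.txt", fileids)

try:
    select(r"*.txt", fileids)    # a glob, not an RE
    glob_failed = False
except re.error:
    glob_failed = True           # "*" has nothing to repeat

print(all_reviews)
print(just_hanna)
print("Glob rejected as an RE:", glob_failed)
```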
Sorry, but you'll need to become a bit of a programmer so you can help
yourself more. Python is the glue that you must use to string the parts
of the nltk together, so you need to understand how it works or you'll
be permanently stuck. It's not that hard and it's fun to learn: dive
into the Python tutorial and have fun with it!
Good luck with it all,
Alexis
For that you need a Categorized corpus reader. See my very first
response to your queries.
Alexis
You are welcome, I'm always glad to help. But I've already pointed you
to the manual page you need, so I think you can help yourself now.
Good luck with your MA thesis,
Alexis