...
Below is the complete code as I type it at the IDLE prompt:
IDLE 2.6.1
>>> import nltk.data
>>> import nltk.tokenize
>>> classifier = nltk.data.load('classifiers/weekahead_NaiveBayes.pickle')
>>> f = open('/Users/georgeorton/Documents/nlpexport02062011.txt')
>>> raw = f.read()
>>> tokenized_sentences = nltk.sent_tokenize('raw')
Hi Tim, Thank you for your response. Unfortunately the file is still
being read as a single sentence, or more likely a single blank file.
Note that after I run the following commands, which I believe should
allow me to access the contents of my file:
>>> f = open('/Users/georgeorton/Documents/nlpexport02062011.txt')
>>> raw = f.read()
and then run the following command:
>>> f.read()
''
The response I get is just a set of empty quotes. I believe this means
that my file is being read as an empty file. This doesn't make sense
to me, since if I open the file in Finder I see the file as I would
expect it to be: a text document with several sentences. Sincerely,
George
The first time, f.read() returns the whole content and the end of the
file is reached, so reading a second time returns an empty string.
You need to save the file content in a variable:
#assigning file content
text = f.read()
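A minimal sketch of the fix, using io.StringIO as a stand-in for George's file (the real path only exists on his machine):

```python
import io

# stand-in for open('/Users/georgeorton/Documents/nlpexport02062011.txt')
f = io.StringIO("A first sentence. A second sentence.")

text = f.read()          # read once and keep the content in a variable
print(f.read() == "")    # the file object is now at EOF, so read() gives ''
print(text)              # ...but the saved variable still holds everything
```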
Thanks
Vineet Yadav
So it looks like there are a couple of different things going on:
1) The f.read() function in Python slurps your entire file as a single
string, so your loop will only ever iterate once. If your file is
"organized" into sentences (i.e. one per line), then you should look
into using the f.readlines() function, which will return a list of
"sentences", specifically the contents of your file as split on
newlines ("\n").
2) When you use the f.read() function, it iterates over the contents
of your file---you can intuitively think of it as something like a
cursor---until it reaches the end of the file. In order to use it a
second time, you have to explicitly tell Python to start "looking"
from the beginning of the file again; the function you want to look
into in that case is f.seek().
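Both points can be sketched in a few lines (again with an in-memory toy file via io.StringIO, since the original file is on George's machine):

```python
import io

f = io.StringIO("Sentence one.\nSentence two.\n")

whole = f.read()       # point 1: read() slurps the whole file as one string
again = f.read()       # point 2: the cursor is now at EOF, so this is ''
f.seek(0)              # rewind the cursor to the beginning of the file
lines = f.readlines()  # a list of lines -- one "sentence" per line here
print(repr(again), len(lines))
```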
Hope this helps,
Fred.
OK, so the error you got is because you fed the tokenizer a list,
rather than a sentence (the f.readlines() function returns a list of
sentences, e.g. ["sentence1", "sentence2", ...]), and the tokenizer
(presumably) doesn't know what to do with it.
Since your file is not organized into line-separated sentences, you've
got a bit more work to do. You'll have to use f.read() to get the
contents of your file, and then figure out how to split it into
sentences. A naive approach would be to split on periods ("."), but
that will fail if you have any sentences that contain substrings
like "Mr. X", "Mrs. X", etc. If, on the other hand, you're confident
that none of the sentences in your corpus include a period anywhere
other than at the end, then you can
raw = f.read()
sentences = raw.split(".")
for s in sentences:
    # do stuff to s here
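To illustrate the caveat with a made-up string containing an abbreviation:

```python
raw = "Mr. Smith arrived. He sat down."

# naive period split: the "." in the abbreviation "Mr." is treated
# as a sentence boundary, producing a bogus one-word fragment
sentences = [s.strip() for s in raw.split(".") if s.strip()]
print(sentences)  # ['Mr', 'Smith arrived', 'He sat down']
```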
Good luck!
Fred.
On Tuesday, 8 February 2011 at 9:24 AM, Fred Mailhot wrote:
[...]
In a Python shell, e.g. your IDLE session, can you please try the following for me?
>>> import glob
>>> glob.glob('/Users/georgeorton/Documents/nlp*')
Your file should be in that list. If that doesn't work, could you please also try:
>>> import os
>>> for path is os.walk('/Users/georgeorton'):
...     glob.glob(path + "nlp*")
This may take a little while, but should be enlightening.
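For reference, here is roughly what a corrected version of that check does, demonstrated against a throwaway temporary directory (the real search would of course run over /Users/georgeorton):

```python
import glob
import os
import tempfile

# a throwaway directory standing in for George's home folder
root = tempfile.mkdtemp()
open(os.path.join(root, "nlpexport02062011.txt"), "w").close()

# glob matches a filename pattern within a single directory
matches = glob.glob(os.path.join(root, "nlp*"))
print(matches)

# os.walk visits every subdirectory; the third element of each tuple
# is the list of filenames, so "nlp" can be searched for there
hits = []
for dirpath, dirnames, filenames in os.walk(root):
    hits += [os.path.join(dirpath, f) for f in filenames if "nlp" in f]
print(hits)
```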
On 7 February 2011 15:48, Bio <Sel...@bioasys.net> wrote:
> [...] Unfortunately I still got an error message. Here is a copy
> of my IDLE output:
>
> [...]
>>>> tokenized_sentences = nltk.sent_tokenize(sentences)
Here's the issue... Once again, the variable "sentences" is a *list* of
sentences, and the tokenizer (I think?) just wants sentences/strings.
[...]
As for the rest of your message, I see that Tim has addressed (i) an
error in my assumption about how sent_tokenize() works (my bad), and
(ii) pointed out a way to assess the accessibility of your file
(although I would assume that, since the call to open() isn't failing,
your file is where you think it is). In that case perhaps there is,
indeed, a problem with the file's contents. Are the contents there
when you open it in a text editor?
Cheers,
Fred.
Ok, so there are a couple of typos. "for path is
os.walk('/Users/georgeorton')" should be "for path in
os.walk('/Users/georgeorton')", and then also "if nlp in '
'.join(path[2])" should be "if 'nlp' in ' '.join(path[2])".
I would recommend taking a week or at least a couple of days to just
learn the basics of Python to avoid these kinds of issues. I like to do
a few problems from http://www.projecteuler.net/ when learning a new
programming language. The official Python website has a nice tutorial at
http://docs.python.org/tutorial/ which should be useful as well.
John
==
John K Pate
http://homepages.inf.ed.ac.uk/s0930006/
--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.
This data is in Unicode, in the UTF-16 format.
The byte order is "little-endian".
I suggest that you change your file to Unicode, UTF-8 format and try again.
The immediately following "\r" also suggests that the file uses Mac OS 9 line endings. Are you really using Mac OS 9?
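A minimal sketch of that conversion, with an in-memory example instead of the real file (the byte string stands in for the file's contents):

```python
# bytes as they might appear in the file: UTF-16 little-endian, "\r" line ends
data = "First line.\rSecond line.\r".encode("utf-16-le")

text = data.decode("utf-16-le")    # decode the little-endian UTF-16
text = text.replace("\r", "\n")    # normalize Mac OS 9 line endings
utf8_bytes = text.encode("utf-8")  # re-encode as UTF-8
print(utf8_bytes)
```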
--Jeremy
...who knows too much about unicode