How to use a file as input for sentence tokenizer?


Bio

Feb 7, 2011, 10:42:00 AM2/7/11
to nltk-users
Hello, I am trying to use a file as the input source for
'nltk.tokenize.word_tokenize(sentence)' sentence tokenizer command. I
have a file on my system at:
/Users/georgeorton/Documents/nlpexport02062011.txt. This document is
several sentences long. If I click on the file while in Finder the
document appears in the text editor just as I would expect. However,
when I attempt to use the document as the input source for the
sentence tokenizer 'nltk.tokenize.word_tokenize(each_sentence)' using
the following commands:

>>> f = open('/Users/georgeorton/Documents/nlpexport02062011.txt')
>>> raw = f.read()

it seems that my multi-sentence document is just being read as a
single sentence. When I enter:

>>> f.read()

I get the response:

''

Below is the complete code as I type it at the IDLE prompt:

IDLE 2.6.1
>>> import nltk.data
>>> import nltk.tokenize
>>> classifier = nltk.data.load('classifiers/weekahead_NaiveBayes.pickle')
>>> f = open('/Users/georgeorton/Documents/nlpexport02062011.txt')
>>> raw = f.read()
>>> tokenized_sentences = nltk.sent_tokenize('raw')
>>> for each_sentence in tokenized_sentences:
        words = nltk.tokenize.word_tokenize(each_sentence)
        feats = dict([(word, True) for word in words])
        classifier.classify(feats)


'typeone'
>>> f.read()
''
>>>

If anybody has any ideas on what I am doing wrong I would greatly
appreciate your input. Sincerely, George

Tim McNamara

Feb 7, 2011, 1:14:14 PM2/7/11
to nltk-...@googlegroups.com
On Tue, Feb 8, 2011 at 4:42 AM, Bio <Sel...@bioasys.net> wrote:
...

Below is the complete code as I type it at the IDLE prompt:

IDLE 2.6.1
>>> import nltk.data
>>> import nltk.tokenize
>>> classifier = nltk.data.load('classifiers/weekahead_NaiveBayes.pickle')
>>> f = open('/Users/georgeorton/Documents/nlpexport02062011.txt')
>>> raw = f.read()
>>> tokenized_sentences = nltk.sent_tokenize('raw')

nltk.sent_tokenize('raw') will tokenise the string 'raw'. Remove the quotes.
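The difference is easy to demonstrate even without NLTK's tokenizer data installed. Below is a sketch using a toy splitter as a stand-in for nltk.sent_tokenize; the function toy_sent_tokenize and the sample text are made up for illustration, but the real tokenizer behaves the same way on this point:

```python
# Toy stand-in for nltk.sent_tokenize, used here only to illustrate the
# quoting bug; the real tokenizer behaves analogously.
def toy_sent_tokenize(text):
    return [s.strip() + '.' for s in text.split('.') if s.strip()]

raw = "One sentence. Another sentence. A third."

print(toy_sent_tokenize('raw'))  # tokenizes the 3-character string 'raw'
print(toy_sent_tokenize(raw))    # tokenizes the variable's contents
```

With the quotes, the tokenizer sees the literal word "raw" and returns a single item; without them it sees the file contents and returns one item per sentence.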

 

Bio

Feb 7, 2011, 1:28:27 PM2/7/11
to nltk-users
Hi Tim, Thank you for your response. Unfortunately the file is still
being read as a single sentence, or more likely a single blank file.
Note that if after I run the following commands:

>>> f = open('/Users/georgeorton/Documents/nlpexport02062011.txt')
>>> raw = f.read()

which I believe should allow me to access the contents of my file, I
run the following command:

>>> f.read()
''

The response I get is just a set of empty quotes. I believe this means
that my file is being read as an empty file. This doesn't make sense
to me since if I open the file in finder I see the file as I would
expect it to be, a text document with several sentences. Sincerely,
George

Tim McNamara

Feb 7, 2011, 1:35:01 PM2/7/11
to nltk-...@googlegroups.com
On Tue, Feb 8, 2011 at 7:28 AM, Bio <Sel...@bioasys.net> wrote:
Hi Tim, Thank you for your response. Unfortunately the file is still
being read as a single sentence, or more likely a single blank file.
Note that if after I run the following commands:

>>> f = open('/Users/georgeorton/Documents/nlpexport02062011.txt')
>>> raw = f.read()

which I believe should allow me to access the contents of my file, I
run the following command:

>>> f.read()
''

The response I get is just a set of empty quotes. I believe this means
that my file is being read as an empty file. This doesn't make sense
to me since if I open the file in finder I see the file as I would
expect it to be, a text document with several sentences.   Sincerely,
George

Check that the file you wish to read is correct. The likelihood that you have entered it incorrectly far exceeds the likelihood that Python cannot open files. However, you may wish to check things like file permissions, just in case.

Tim

vineet yadav

Feb 7, 2011, 1:39:59 PM2/7/11
to nltk-...@googlegroups.com
Hi George,
I think you have used f.read() two times.
>>> f.read()
'This is the entire file.\n'
#end of file is reached
>>> f.read()
''

The first time, f.read() returns the whole content and the end of file
is reached. So reading a second time returns an empty string.
You need to save the file content in a variable.
#assigning file content
text = f.read()
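Vineet's point can be sketched with a throwaway temp file (the path and contents here are invented for the example):

```python
import os
import tempfile

# Throwaway file standing in for the poster's document.
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
with open(path, 'w') as out:
    out.write('This is the entire file.\n')

f = open(path)
text = f.read()    # first read: returns the whole file
again = f.read()   # the file position is now at EOF, so this returns ''
f.close()

print(repr(text))
print(repr(again))
```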
Thanks
Vineet Yadav


vineet yadav

Feb 7, 2011, 1:40:59 PM2/7/11
to nltk-...@googlegroups.com
On Tue, Feb 8, 2011 at 12:09 AM, vineet yadav
<vineet.y...@gmail.com> wrote:
Hi George,
I think you have used f.read() two times.
>>> f.read()
'This is the entire file.\n'
#end of file is reached
>>> f.read()
''

The first time, f.read() returns the whole content and the end of file is reached.

Bio

Feb 7, 2011, 2:04:38 PM2/7/11
to nltk-users
Hi Tim, Thanks for your response. When I first experienced this
problem my first thought was I must have entered the wrong path or
file name but I've checked my spelling about 10 times now and I keep
having the same difficulty. I'm the only person who uses this computer
so I should have full administrator permissions but I'm not sure how
to check to see what my file permissions are. Thanks, George

Bio

Feb 7, 2011, 2:18:49 PM2/7/11
to nltk-users
Hi Vineet, Thanks for your response. Based upon your explanation I can
see how typing f.read() at the end of my code would elicit an empty
string response, but if you'll notice, in my code I define the variable
raw = f.read() and then use the variable raw in the
tokenized_sentences = nltk.sent_tokenize(raw) command. So I believe my
code already incorporates your suggestion.

Below is the complete code as I type it at the IDLE prompt:
IDLE 2.6.1
>>> import nltk.data
>>> import nltk.tokenize
>>> classifier = nltk.data.load('classifiers/weekahead_NaiveBayes.pickle')
>>> f = open('/Users/georgeorton/Documents/nlpexport02062011.txt')
>>> raw = f.read()
>>> tokenized_sentences = nltk.sent_tokenize(raw)
>>> for each_sentence in tokenized_sentences:

        words = nltk.tokenize.word_tokenize(each_sentence)
        feats = dict([(word, True) for word in words])
        classifier.classify(feats)
'typeone'
>>> f.read()
''

Please note that even though I use the variable raw in the
tokenized_sentences = nltk.sent_tokenize(raw) command, I still only get
one response to my code, which is 'typeone'. Since my text document is
several sentences long I would expect to receive several responses. If
I perform the entire tokenize sentence/word and classification code on
a few typed-in sentences, rather than trying to use it with an
accessed .txt file, I do get one response per tokenized sentence. If I
perform the open file command and the variable assignment independent
of the sentence tokenizing/text classifying code and then just print
raw, I also get an empty string as my response:

IDLE 2.6.1
>>> f = open('/Users/georgeorton/Documents/nlpexport02062011.txt')
>>> raw = f.read()
>>> print raw

>>>

So at this point I am still at a loss as to what my problem is.
Sincerely, George


Fred Mailhot

Feb 7, 2011, 2:25:09 PM2/7/11
to nltk-...@googlegroups.com
Hi George...

So it looks like there are a couple of different things going on:

1) The f.read() function in Python slurps your entire file as a single
string, so your loop will only ever iterate once. If your file is
"organized" into sentences (i.e. one per line), then you should look
into using the f.readlines() function, which will return a list of
"sentences", specifically the contents of your file as split on
newlines ("\n").

2) When you use the f.read() function, it iterates over the contents
of your file---you can intuitively think of it as something like a
cursor---until it reaches the end of the file. In order to use it a
second time, you have to explicitly tell Python to start "looking"
from the beginning of the file again; the function you want to look
into in that case is f.seek().
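Both points can be sketched with a throwaway file (the path and contents here are invented for the example):

```python
import os
import tempfile

# A small two-line file standing in for the real document.
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
with open(path, 'w') as out:
    out.write('Line one.\nLine two.\n')

f = open(path)
whole = f.read()       # one big string; a loop over it iterates only once
f.seek(0)              # move the "cursor" back to the start of the file
lines = f.readlines()  # list of lines, split on newlines
f.close()

print(repr(whole))
print(lines)
```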

Hope this helps,
Fred.

Bio

Feb 7, 2011, 2:42:16 PM2/7/11
to nltk-users
Hi Fred, Thanks for your help. My file is organized into sentences by
which I mean it follows the standard grammar rules where a sentence
starts with a capital letter and ends with a punctuation mark like a
period (.) or a question mark (?). But my file is not organized into one
sentence per line. The file is a text document that I copied from a
web page. Even though my file is not organized into one sentence per
line I tried using f.readlines instead of f.read but I got an error
message as soon as I entered the tokenize sentence command. Here is my
IDLE interface output:

IDLE 2.6.1
>>> import nltk.data
>>> import nltk.tokenize
>>> classifier = nltk.data.load('classifiers/weekahead_NaiveBayes.pickle')
>>> f = open('/Users/georgeorton/Documents/nlpexport02062011.txt')
>>> raw = f.readlines()
>>> tokenized_sentences = nltk.sent_tokenize(raw)

Traceback (most recent call last):
  File "<pyshell#5>", line 1, in <module>
    tokenized_sentences = nltk.sent_tokenize(raw)
  File "/Library/Python/2.6/site-packages/nltk/tokenize/__init__.py", line 44, in sent_tokenize
    return tokenizer.tokenize(text)
  File "/Library/Python/2.6/site-packages/nltk/tokenize/punkt.py", line 1124, in tokenize
    return list(self.sentences_from_text(text, realign_boundaries))
  File "/Library/Python/2.6/site-packages/nltk/tokenize/punkt.py", line 1140, in _sentences_from_text
    for match in self._lang_vars.period_context_re().finditer(text):
TypeError: expected string or buffer
>>>

So unfortunately I am still at a loss as to how to proceed. Just as a
matter of course I checked my file permissions and I do have read and
write permissions. Thanks, George

Fred Mailhot

Feb 7, 2011, 3:24:35 PM2/7/11
to nltk-...@googlegroups.com
Hi again, George...

OK, so the error you got is because you fed the tokenizer a list,
rather than a sentence (the f.readlines() function returns a list of
sentences, e.g. ["sentence1", "sentence2", ...]), and the tokenizer
(presumably) doesn't know what to do with it.

Since your file is not organized into line-separated sentences, you've
got a bit more work to do. You'll have to use f.read() to get the
contents of your file, and then figure out how to split it into
sentences. A naive approach would be to split on periods ("."), but
that will fail if you have any sentences that contain substrings
like "Mr. X", "Mrs. X", etc. If, on the other hand, you're confident
that none of the sentences in your corpus include a period anywhere
other than at the end, then you can

raw = f.read()
sentences = raw.split(".")

for s in sentences:
    # do stuff to s here
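The abbreviation failure Fred warns about is easy to see with a made-up line of text:

```python
# Made-up text: two sentences, but a naive split on "." yields three
# pieces because of the abbreviation "Mr.".
text = "Mr. Smith arrived. He sat down."

pieces = [s.strip() for s in text.split('.') if s.strip()]
print(pieces)
```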


Good luck!
Fred.

Bio

Feb 7, 2011, 3:48:03 PM2/7/11
to nltk-users
Hi Fred, Even though my file contains abbreviations such that a
"sentences = raw.split(".")" would separate my file into more
sentences than would be accurate, I went ahead and tried your
suggestion. Unfortunately I still got an error message. Here is a copy
of my IDLE output:

IDLE 2.6.1
>>> import nltk.data
>>> import nltk.tokenize
>>> classifier = nltk.data.load('classifiers/weekahead_NaiveBayes.pickle')
>>> f = open('/Users/georgeorton/Documents/nlpexport02062011.txt')
>>> raw = f.read()
>>> sentences = raw.split(".")
>>> tokenized_sentences = nltk.sent_tokenize(sentences)

Traceback (most recent call last):
  File "<pyshell#6>", line 1, in <module>
    tokenized_sentences = nltk.sent_tokenize(sentences)
  File "/Library/Python/2.6/site-packages/nltk/tokenize/__init__.py", line 44, in sent_tokenize
    return tokenizer.tokenize(text)
  File "/Library/Python/2.6/site-packages/nltk/tokenize/punkt.py", line 1124, in tokenize
    return list(self.sentences_from_text(text, realign_boundaries))
  File "/Library/Python/2.6/site-packages/nltk/tokenize/punkt.py", line 1140, in _sentences_from_text
    for match in self._lang_vars.period_context_re().finditer(text):
TypeError: expected string or buffer
>>>

If while in IDLE I type the following, where I am opening my file and
then assigning the opened file to a variable, shouldn't I be able to
see the opened file by printing the variable:

IDLE 2.6.1
>>> f = open('/Users/georgeorton/Documents/nlpexport02062011.txt')
>>> raw = f.read()
>>> print raw

>>>

Because I try to print raw and get an empty string as a response,
doesn't that mean that my file is initially being read as empty, as
opposed to just one long string that does not differentiate
sentences? Also, I thought that the purpose behind the
"tokenized_sentences = nltk.sent_tokenize(raw)" command was to take
one long string and perform a "raw.split(".")" type function, only at
a sophisticated enough level to account for abbreviations, question
marks etc. Thanks, George

Tim McNamara

Feb 7, 2011, 3:52:12 PM2/7/11
to nltk-...@googlegroups.com
In a Python shell, e.g. your IDLE session, can you please try the following for me?

>>> import glob
>>> glob.glob('/Users/georgeorton/Documents/nlp*')

Your file should be in that list. If that doesn't work, could you please also try:

>>> import os
>>> for path is os.walk('/Users/georgeorton'):
...    glob.glob(path + "nlp*")

This may take a little while, but should be enlightening.

Tim McNamara

Feb 7, 2011, 3:53:43 PM2/7/11
to nltk-...@googlegroups.com

On Tuesday, 8 February 2011 at 9:24 AM, Fred Mailhot wrote:

Hi again, George...

OK, so the error you got is because you fed the tokenizer a list,
rather than a sentence (the f.readlines() function returns a list of
sentences, e.g. ["sentence1", "sentence2", ...]), and the tokenizer
(presumably) doesn't know what to do with it.
...


raw = f.read()
sentences = raw.split(".")

Fred, this isn't strictly necessary. George is using nltk.sent_tokenize to split sentences.

Tim

Tim McNamara

Feb 7, 2011, 4:01:07 PM2/7/11
to nltk-...@googlegroups.com
On Tuesday, 8 February 2011 at 9:52 AM, Tim McNamara wrote:
In a Python shell, e.g. your IDLE session, can you please try the following for me?

>>> import glob
>>> glob.glob('/Users/georgeorton/Documents/nlp*')

Your file should be in that list. If that doesn't work, could you please also try:

>>> import os
>>> for path is os.walk('/Users/georgeorton'):
...    glob.glob(path + "nlp*")

This may take a little while, but should be enlightening.

Excuse me, this will fail. Try this in an interactive prompt:

for path is os.walk('/Users/georgeorton'):
    if 'nlp' in ' '.join(path[2]):
        path[0], path[2]

Fred Mailhot

Feb 7, 2011, 4:01:28 PM2/7/11
to nltk-...@googlegroups.com
Hi George,

On 7 February 2011 15:48, Bio <Sel...@bioasys.net> wrote:
> [...] Unfortunately I still got an error message. Here is a copy
> of my IDLE output:
>
> [...]
>>>> tokenized_sentences = nltk.sent_tokenize(sentences)

Here's the issue... Once again, the variable "sentences" is a *list* of
sentences, and the tokenizer (I think?) just wants sentences/strings.

[...]

As for the rest of your message, I see that Tim has (i) addressed an
error in my assumption about how sent_tokenize() works (my bad), and
(ii) pointed out a way to assess the accessibility of your file
(although I would assume that since the call to open() isn't failing,
your file is where you think it is). In which case perhaps there is,
indeed, a problem with the file's contents. Are the contents there
when you open it in a text editor?


Cheers,
Fred.

Bio

Feb 7, 2011, 4:06:04 PM2/7/11
to nltk-users
Hi Tim, Thanks for your response. Your first suggestion worked fine
and returned my file. Your second suggestion returned a syntax error.
Here is my IDLE output:

IDLE 2.6.1
>>> import glob
>>> glob.glob('/Users/georgeorton/Documents/nlp*')
['/Users/georgeorton/Documents/nlpexport02062011.txt']
>>> import os
>>> for path is os.walk('/Users/georgeorton'):

SyntaxError: invalid syntax
>>> if nlp in ' '.join(path[2]):
path[0], path[2]



Traceback (most recent call last):
  File "<pyshell#7>", line 1, in <module>
    if nlp in ' '.join(path[2]):
NameError: name 'nlp' is not defined
>>>
Sincerely, George


Bio

Feb 7, 2011, 4:07:59 PM2/7/11
to nltk-users
Hi Fred, If I go into Finder and click on the file it opens in
TextEdit and appears just as I would expect it to. Sincerely, George

John K Pate

Feb 7, 2011, 5:09:25 PM2/7/11
to nltk-...@googlegroups.com
On Mon, 2011-02-07 at 13:06 -0800, Bio wrote:
> Hi Tim, Thanks for your response. Your first suggestion worked fine
> and returned my file. Your second suggestion returned a syntax error.
> Here is my IDLE output:
>
> IDLE 2.6.1
> >>> import glob
> >>> glob.glob('/Users/georgeorton/Documents/nlp*')
> ['/Users/georgeorton/Documents/nlpexport02062011.txt']
> >>> import os
> >>> for path is os.walk('/Users/georgeorton'):
>
> SyntaxError: invalid syntax
> >>> if nlp in ' '.join(path[2]):
> path[0], path[2]
>
>
>
> Traceback (most recent call last):
> File "<pyshell#7>", line 1, in <module>
> if nlp in ' '.join(path[2]):
> NameError: name 'nlp' is not defined
> >>>
> Sincerely, George

Ok, so there are a couple of typos. "for path is
os.walk('/Users/georgeorton')" should be "for path in
os.walk('/Users/georgeorton')", and then also "if nlp in '
'.join(path[2])" should be "if 'nlp' in ' '.join(path[2])".
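With those two typos fixed, a working version of the search might look like the following. This is a sketch against a throwaway directory tree rather than /Users/georgeorton, but the loop is the same:

```python
import os
import tempfile

# Throwaway directory tree standing in for the poster's home directory.
root = tempfile.mkdtemp()
docs = os.path.join(root, 'Documents')
os.makedirs(docs)
open(os.path.join(docs, 'nlpexport02062011.txt'), 'w').close()

# "in", not "is", and 'nlp' quoted as a string literal.
hits = []
for dirpath, dirnames, filenames in os.walk(root):
    for name in filenames:
        if 'nlp' in name:
            hits.append(os.path.join(dirpath, name))

print(hits)
```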

I would recommend taking a week or at least a couple of days to just
learn the basics of python to avoid these kinds of issues. I like to do
a few problems from http://www.projecteuler.net/ when learning a new
programming language. The official Python website has a nice tutorial at
http://docs.python.org/tutorial/ which should be useful as well.

John

==
John K Pate
http://homepages.inf.ed.ac.uk/s0930006/



Bio

Feb 7, 2011, 5:53:43 PM2/7/11
to nltk-users
Hi John, I figured that your suggestion probably had a typo but just
to be sure I followed your output exactly. I've managed to work
through about half of Learning Python by Lutz so I'm starting to get a
handle on Python. However I'm not familiar with glob. I assume because
the glob.glob command returned my file name that my file is present
in the path I have been using in my code. Unfortunately I am still at
a loss as to why my file is being read as an empty string. Sincerely,
George

Bio

Feb 8, 2011, 10:31:43 AM2/8/11
to nltk-users
Hi Fred,
If while in IDLE I enter the following code:

f = open('/Users/georgeorton/Documents/nlpexport02062011.txt')
f.read()

I get the following output:

\xff\xfe\r\x00\r\x00M\x00a\x00r\x00k\x00e\x00t\x00s\x00 \x00a\x00r\x00e
\x00 \x00p\x00r\x00i\x00c\x00i\x00n\x00g\x00 \x00i\x00n\x00 \x00t\x00h
\x00e\x00 \x00v\x00i\x00e\x00w\x00 \x00t\x00h\x00a\x00t\x00 \x00t\x00h
\x00e\x00 \x00U\x00.\x00S\x00.\x00e\x00c\x00o\x00n\x00o\x00m\x00i\x00c
\x00 \x00r\x00e\x00c\x00o\x00v\x00e\x00r\x00y\x00 \x00i\x00s\x00 \x00g
\x00a\x00i\x00n\x00i\x00n\x00g\x00 \x00s\x00t\x00r\x00e\x00n\x00g\x00t
\x00h\x00,\x00 \x00e\x00v\x00e\x00n\x00 \x00a\x00s\x00 \x00i\x00n\x00v
\x00e\x00s\x00t\x00o\x00r\x00s\x00 \x00k\x00e\x00e\x00p\x00 \x00a\x00
\x00w\x00a\x00t\x00c\x00h\x00f\x00u\x00l\x00 \x00e\x00y\x00e\x00 \x00o
\x00n\x00 \x00t\x00h\x00e\x00 \x00M\x00i\x00d\x00d\x00l\x00e\x00 \x00E
\x00a\x00s\x00t\

This is just a partial sample of the beginning of the output. I first
thought maybe this was unicode but looking more closely I noticed that
the last character of each of the \x00e\ type sequences spells out the
words of my text. I initially generated this file from a database
program I use called FileMaker Pro Advanced. I initially copy the text
from a web page then parse the text to eliminate all the extraneous
commercials, headers, links etc. Once I have the cleaned up the text I
use an export file command within FileMaker to generate the text
document file that I am trying to run the nltk classifier on.

The next thing I did was try to run my code in Terminal instead of
IDLE. It was interesting that when I run the following code on
terminal:

f = open('/Users/georgeorton/Documents/nlpexport02062011.txt')
raw = f.read()
print raw

I get the output I would expect to get which is the text document in
readable form:

Markets are pricing in the view that the U.S.economic recovery is
gaining strengReports that showed surprisingly strong activity in
manufacturing and the services industry and gains in productivity sent
buyers into the dollar, sellers into the bond market and helped push
stocks to new multi-year highs in the past week. Even Friday's deeply
disappointing January jobs report, which showed scant job creation,
was written off as an etc...

So for some reason my IDLE interface is reading the file differently
than the terminal interface. However, if I then continue with the
remainder of my classification code I still only get one response back
instead of one response per sentence. Here is my code as typed in
Terminal:

import nltk.data
import nltk.tokenize
classifier = nltk.data.load('classifiers/weekahead_NaiveBayes.pickle')
f = open('/Users/georgeorton/Documents/nlpexport02062011.txt')
raw = f.read()
tokenized_sentences = nltk.sent_tokenize(raw)
for each_sentence in tokenized_sentences:
... words = nltk.tokenize.word_tokenize(each_sentence)
... feats = dict([(word, True) for word in words])
... classifier.classify(feats)
...
'typethree'

The next thing I thought I'd try was instead of using FileMaker's
export file command I would copy my text file inside of Filemaker and
then paste it into the pico editor in Terminal and then try opening
the file in IDLE to see what the output looks like. This procedure
worked correctly and gave the desired readable text file output as
well as the desired one classification per sentence from the
classifier code. Here is the output from my IDLE prompt, this time
using the copy and pasted text file nlpexportfile.txt rather than my
initial export file command version of the text file:

import nltk.data
import nltk.tokenize
classifier = nltk.data.load('classifiers/weekahead_NaiveBayes.pickle')
tokenized_sentences = nltk.sent_tokenize(raw)
for each_sentence in tokenized_sentences:
    words = nltk.tokenize.word_tokenize(each_sentence)
    feats = dict([(word, True) for word in words])
    classifier.classify(feats)


'typethree'
'typethree'
'typethree'
'typethree'
'typethree'
'typetwo'
'typethree'
'typethree'
'typeone'
'typethree'
'typethree'
'typeone'
'typeone'
'typethree'
'typethree'
'typethree'
'typethree'
etc...

I believe there was some type of encoding issue which apparently is
being caused by Filemaker Pro's export file command. Thanks again.
Sincerely, George

Jeremy Kahn

Feb 8, 2011, 1:58:08 PM2/8/11
to nltk-...@googlegroups.com
the telltale \xff\xfe\r tells me two things:

this data is in Unicode, in the UTF-16 format.
the byte order is "little-endian".

I suggest that you change your file to be Unicode, UTF-8 format and try again.

the immediately following "\r" also suggests that the file is in Mac OS9 line endings. Are you really using Mac OS9?

--Jeremy
...who knows too much about unicode
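Jeremy's diagnosis can be checked directly: the 'utf-16' codec consumes the \xff\xfe byte-order mark and then reads two bytes per character. A sketch using the first few bytes of George's dump:

```python
# First bytes of the dump: UTF-16 little-endian byte-order mark, two
# '\r' characters (old Mac line endings), then two-byte characters
# spelling "Markets".
data = b'\xff\xfe\r\x00\r\x00M\x00a\x00r\x00k\x00e\x00t\x00s\x00'

print(repr(data.decode('utf-16')))
```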

Bio

Feb 8, 2011, 2:29:43 PM2/8/11
to nltk-users
Hi Jeremy, Thank you for the information. Because I have been able to
circumvent the problem by copying the file and then pasting it into a
text editor I have decided not to try and rework the code to
accommodate the encoding type. But the data encoding type is nice to
know in case I run into some insurmountable problem around the copy
file strategy. At that point I can try to convert from UTF-16 to
UTF-8. I'm using Mac OS X Snow Leopard. Thanks again for the
information. Sincerely, George
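For reference, such a conversion is only a few lines. Here is a sketch using a throwaway file in place of the FileMaker export (the paths and contents are invented; codecs.open also works on the Python 2.6 used in this thread):

```python
import codecs
import os
import tempfile

# Throwaway UTF-16 file with old-Mac '\r' line endings, standing in
# for the FileMaker export.
src_path = os.path.join(tempfile.mkdtemp(), 'export_utf16.txt')
with codecs.open(src_path, 'w', encoding='utf-16') as out:
    out.write(u'First sentence.\rSecond sentence.\r')

# Read it back as UTF-16, normalise line endings, write it out as UTF-8.
with codecs.open(src_path, 'r', encoding='utf-16') as src:
    text = src.read().replace('\r\n', '\n').replace('\r', '\n')

dst_path = src_path.replace('utf16', 'utf8')
with codecs.open(dst_path, 'w', encoding='utf-8') as out:
    out.write(text)

print(repr(text))
```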

Ekaterina Kuznetsova

Sep 24, 2016, 4:03:33 PM9/24/16
to nltk-users
Hi Bio, what I found out is that you need to re-save your TXT file in UTF-8 encoding instead of Unicode. But what is really strange, other txt files that are saved in Unicode are fine..

On Wednesday, 9 February 2011 at 02:31:43 UTC+11, Bio wrote:

Bio

Sep 24, 2016, 4:39:14 PM9/24/16
to nltk-users
Thanks for your interest in this issue. Fortunately all the strange text encoding issues I was experiencing disappeared when I switched from Python 2.7 to Python 3.5. I also no longer use IDLE but have switched to PyCharm. Enjoy, George