Hi Tim,
Thanks for your answer. There are still a few things that I would
appreciate it if anybody could clarify for me.
<snip>
> If you give the tokeniser bytes in Latin-1, it should send out Latin-1.
But that's not what seems to have happened. Here's my code:
import codecs
import nltk
from nltk.tokenize import *

# read the whole document, tokenize it, and write the token list out
with codecs.open('/Volumes/DATA/Documents/workspace/OUTPUTS/tokens.txt',
                 'w') as out_tokens:
    with codecs.open('/Volumes/DATA/Documents/workspace/INPUT/document.txt',
                     'r') as in_tokens:
        s = in_tokens.read()
        out_tokens.write(str(WordPunctTokenizer().tokenize(s)))
The document called 'document.txt' is encoded in Latin-1 so, according
to what you say, the tokenizer should have sent out Latin-1, and that's
what I should expect to find in the document called 'tokens.txt'. Yet
when I open that document in my editor, what I see is an escape code
for every character that has an accent.
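To make that concrete, here is a tiny made-up example (not my real
data) that shows the same thing I see in tokens.txt:

tokens = ['caf\xe9', 'au', 'lait']   # Latin-1 bytes for "café au lait"
print(str(tokens))       # what I get in the file: ['caf\xe9', 'au', 'lait']
print(' '.join(tokens))  # what I expected to see: café au lait
                         # (at least in a Latin-1 terminal/editor)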
> You can encode from Unicode to Latin-1 fairly easily:
>
> >>> u'hi there'.encode('latin-1')
> 'hi there'
>
> Extending this, you could try:
>
> >>> import nltk
> >>> raw = u'interesting piece of text'.encode('latin-1')
> >>> nltk.wordpunct_tokenize(raw)
> ['interesting', 'piece', 'of', 'text']
OK, I see. The thing is, if there is a way to do this without having
to go through that extra encoding step, that would be preferable,
right? I'm just trying to understand why the tokenizer spits out
strings containing escape codes for the accented characters rather
than the characters themselves. This might not pose a problem for
processing the list of strings, but somewhere down the line, if I have
to revise things, readability might be a concern, and I would prefer
that all the words be visible as they are supposed to be seen. I guess
I'm showing my greenness, and I hope I don't sound too stupid, but if
somebody would care to clarify this for me, I would really appreciate
it.
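In case it helps make the question concrete, here is roughly the kind
of output I am after (just a sketch, reusing the same file paths as
above; I don't know whether this is the proper way to do it):

import codecs
from nltk.tokenize import WordPunctTokenizer

# decode the Latin-1 input to Unicode, tokenize, then write the tokens
# back out as Latin-1 so they stay readable in an editor
with codecs.open('/Volumes/DATA/Documents/workspace/INPUT/document.txt',
                 'r', encoding='latin-1') as in_tokens:
    s = in_tokens.read()
tokens = WordPunctTokenizer().tokenize(s)
with codecs.open('/Volumes/DATA/Documents/workspace/OUTPUTS/tokens.txt',
                 'w', encoding='latin-1') as out_tokens:
    out_tokens.write(u'\n'.join(tokens))   # one token per line, accents intact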
Josep M.