nltk.tokenize and non-ascii characters


josep_m

Dec 26, 2010, 4:19:27 PM
to nltk-users
A couple of days ago I posted a question about the treatment of non-
ascii characters with regular expressions. I found an answer for that
question but now I'm finding a similar problem when trying to use
word_tokenize() or TreebankWordTokenizer with some non-English texts.

I see that the same issue I'm having was raised in:

http://code.google.com/p/nltk/issues/detail?id=477

Basically, when I tokenize a text in Spanish or Catalan (but it would
happen with any text containing non-ascii characters), accented
characters are treated as punctuation and become tokens themselves, so
'qué' becomes 'qu' 'é' and so on.
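
For example, in a Python 2 session (the exact output depends on the
NLTK version, but this is the kind of splitting I see):

   >>> from nltk.tokenize import word_tokenize
   >>> word_tokenize('qu\xe9 pasa')   # Latin-1 bytes passed straight in
   ['qu', '\xe9', 'pasa']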

Can anybody suggest a good workaround for this problem? As I mentioned
in my other message, I'm working with texts that are encoded in
ISO-8859-1 (Latin 1). Thanks.

Josep M.

Jacob Perkins

Dec 27, 2010, 10:22:45 AM
to nltk-users
Hi Josep,

Try wordpunct_tokenize() or the WordPunctTokenizer instead. That's
what I use for the tagging demo at http://text-processing.com/demo/tag/.
Also check out my tokenizing demo to see how each tokenizer works:
http://text-processing.com/demo/tokenize/.
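
For instance, something like this (shown as Python 2 reprs, so the
accented characters appear as escapes, but they stay inside the word
tokens):

   >>> from nltk.tokenize import wordpunct_tokenize
   >>> wordpunct_tokenize(u'\xbfQu\xe9 tal?')
   [u'\xbf', u'Qu\xe9', u'tal', u'?']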

Jacob
---
http://streamhacker.com/
http://twitter.com/japerk

josep_m

Dec 28, 2010, 4:56:47 PM
to nltk-users
Hi Jacob,

Thanks for the prompt response. WordPunctTokenizer seems to do the job,
except that it returns the accented characters converted to Unicode
escapes. So now the accented characters do not become tokens
themselves, but they come out as strings such as '\xe8' or '\xf3'.
Is that the way it is supposed to be?

Josep M.



> Try wordpunct_tokenize() or the WordPunctTokenizer instead. That's
> what I use for the tagging demo at http://text-processing.com/demo/tag/.
> Also check out my tokenizing demo to see how each tokenizer works:
> http://text-processing.com/demo/tokenize/.
>
> Jacob
> ---
> http://streamhacker.com/
> http://twitter.com/japerk

Jacob Perkins

Dec 29, 2010, 11:24:03 AM
to nltk-users
Hi Josep,

It's normal - if you print the strings in the console (instead of
looking at them raw), the characters should look correct.
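
For example (assuming your console encoding can display the character):

   >>> token = u'qu\xe9'
   >>> token
   u'qu\xe9'
   >>> print token
   qué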

josep_m

Dec 30, 2010, 10:19:44 AM
to nltk-users
Hi Jacob,

Thanks again Jacob. Is there any parameter or flag that can be
modified in WordPunctTokenizer to change the character encoding to
ISO-8859-1? I'm working with some tools in addition to NLTK that
cannot work with Unicode.

JM


> It's normal - if you print the strings in the console (instead of
> looking at them raw), the characters should look correct.
>
> Jacob

Tim McNamara

Dec 30, 2010, 4:04:38 PM
to nltk-...@googlegroups.com
On Fri, Dec 31, 2010 at 4:19 AM, josep_m <josep.m...@gmail.com> wrote:
> Hi Jacob,
>
> Thanks again Jacob. Is there any parameter or flag that can be
> modified in WordPunctTokenizer to change the character encoding to
> ISO-8859-1? I'm working with some tools in addition to NLTK that
> cannot work with Unicode.
>
> JM

If you give the tokeniser bytes in Latin-1, it should send out Latin-1. You can encode from Unicode to Latin-1 fairly easily:

   >>> u'hi there'.encode('latin-1')
   'hi there'

Extending this, you could try:

   >>> import nltk
   >>> raw = u'interesting piece of text'.encode('latin-1')
   >>> nltk.wordpunct_tokenize(raw)
   ['interesting', 'piece', 'of', 'text']

In general, though, you shouldn't have any problems with applications that don't accept the full Unicode range. ASCII and Latin-1 are subsets of Unicode, so any Latin-1 text can be represented in Unicode and encoded back without loss. However, the encode will complain if the text contains characters that are not in ISO-8859-1. You'll also get strange results if the source material is lying about its encoding.
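
For example, encoding back to Latin-1 only fails when a character
falls outside that range:

   >>> u'qu\xe9'.encode('latin-1')
   'qu\xe9'
   >>> u'\u20ac'.encode('latin-1')   # the euro sign is not in ISO-8859-1
   Traceback (most recent call last):
     ...
   UnicodeEncodeError: 'latin-1' codec can't encode character u'\u20ac' in position 0: ordinal not in range(256)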


Tim McNamara
@timClicks

josep_m

Jan 1, 2011, 12:22:40 PM
to nltk-users
Hi Tim,

Thanks for your answer. There are still a few things I hope somebody
can clarify for me.

<snip>
> If you give the tokeniser bytes in Latin-1, it should send out Latin-1.

But that's not what seems to have happened. Here's my code:

import codecs
import nltk
from nltk.tokenize import *

with codecs.open('/Volumes/DATA/Documents/workspace/OUTPUTS/tokens.txt', 'w') as out_tokens:
    with codecs.open('/Volumes/DATA/Documents/workspace/INPUT/document.txt', 'r') as in_tokens:
        s = in_tokens.read()
        out_tokens.write(str(WordPunctTokenizer().tokenize(s)))

The document called 'document.txt' is encoded in Latin-1, so, according
to what you say, the tokenizer should have sent out Latin-1, and that's
what I should expect to find in 'tokens.txt'. Yet when I open that
document in my editor, what I see is an escape sequence for every
character that has an accent.

> You can encode from Unicode to Latin-1 fairly easily:
>
>    >>> u'hi there'.encode('latin-1')
>    'hi there'
>
> Extending this, you could try:
>
>    >>> import nltk
>    >>> raw = u'interesting piece of text'.encode('latin-1')
>    >>> nltk.wordpunct_tokenize(raw)
>    ['interesting', 'piece', 'of', 'text']

OK, I see. The thing is, if there is a way to do this without going
through the extra step, that would be preferable, right? I'm just
trying to understand why the tokenizer spits out strings containing
the escape codes for the accented characters rather than the
characters themselves. This might not pose a problem for processing
the list of strings, but somewhere down the line, if I have to revise
things, readability might be a concern, and I would prefer that all
the words be visible as they are supposed to be seen. I guess I'm
showing my greenness, and I hope I don't sound too stupid, but if
somebody would care to clarify this for me, I would really appreciate it.
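
In case it helps, this is roughly what I am hoping to end up with
(simplified paths, and just a sketch of the intent rather than tested
code):

   import codecs
   import nltk

   # read the Latin-1 file and decode it to Unicode
   with codecs.open('document.txt', 'r', encoding='latin-1') as f:
       text = f.read()

   # tokenize the Unicode text
   tokens = nltk.wordpunct_tokenize(text)

   # write the tokens back out as Latin-1, one per line, so the
   # accented characters stay readable in an editor
   with codecs.open('tokens.txt', 'w', encoding='latin-1') as f:
       f.write(u'\n'.join(tokens))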

Josep M.