Issue 748 in nltk: nltk.tag.pos_tag() fails with UnicodeDecodeError


nl...@googlecode.com

Jan 19, 2014, 10:07:34 AM
to nltk-...@googlegroups.com
Status: New
Owner: ----
Labels: Type-Defect Priority-Medium

New issue 748 by samboosa...@gmail.com: nltk.tag.pos_tag() fails with
UnicodeDecodeError
http://code.google.com/p/nltk/issues/detail?id=748

nltk.__version__
'3.0a2'

# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
to reproduce...

import nltk
text = 'the dog eats'
tokens = nltk.tokenize.word_tokenize(text)
tags = nltk.tag.pos_tag(tokens)

should be...

[('the', 'DT'), ('dog', 'NN'), ('eats', 'NNS')]

is...

/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/site-packages/nltk-3.0a2-py3.3.egg/nltk/data.py in load(resource_url, format, cache, verbose, logic_parser, fstruct_parser, encoding)
    640 elif format == 'pickle':
--> 641 resource_val = pickle.load(opened_resource)
    642 elif format == 'yaml':
    643 import yaml

UnicodeDecodeError: 'ascii' codec can't decode byte 0xcb in position 0: ordinal not in range(128)

# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #

I fixed it locally by adding encoding='ISO-8859-1' to the pickle.load()
call on line 641 of data.py (the file named in the error message).

I don't know whether this breaks other things, whether it is the right
way to do it, or what the root cause is (e.g. pickling across Python
versions).
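
For anyone trying to see the effect of that encoding argument in isolation, here is a minimal sketch. It assumes the cause is a pickle written by Python 2 that holds a non-ASCII byte string, which is one plausible reading and is not confirmed in this thread. The byte string PY2_PICKLE and the helper load_py2_pickle are made up for illustration and are not part of nltk; only pickle.load() and its encoding keyword come from the standard library.

```python
import io
import pickle

# Hand-built protocol-2 pickle of a one-byte Python 2 str '\xcb'
# (an assumption standing in for the tagger data, not the real file):
# PROTO 2, SHORT_BINSTRING with length 1 and byte 0xcb, STOP.
PY2_PICKLE = b"\x80\x02U\x01\xcb."


def load_py2_pickle(data, encoding="ASCII"):
    """Unpickle from a file-like object, the way data.py calls pickle.load()."""
    return pickle.load(io.BytesIO(data), encoding=encoding)


# Python 3 decodes Python 2 str objects as ASCII by default, so this fails.
try:
    load_py2_pickle(PY2_PICKLE)
    default_failed = False
except UnicodeDecodeError:
    default_failed = True

# ISO-8859-1 maps every byte to a code point, so decoding cannot fail,
# but the resulting text is only right if the data really was Latin-1.
as_latin1 = load_py2_pickle(PY2_PICKLE, encoding="ISO-8859-1")

# encoding="bytes" leaves Python 2 str objects as bytes instead of decoding.
as_bytes = load_py2_pickle(PY2_PICKLE, encoding="bytes")
```

ISO-8859-1 always succeeds because it accepts every byte value, which is probably why the local fix works; whether the decoded strings are meaningful depends on what the pickle actually contains.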


nl...@googlecode.com

Jan 20, 2014, 2:14:57 AM
to nltk-...@googlegroups.com

Comment #1 on issue 748 by samboosa...@gmail.com: nltk.tag.pos_tag() fails
with UnicodeDecodeError
http://code.google.com/p/nltk/issues/detail?id=748

Operating system: OS X 10.7.5