Getting 'UnicodeDecodeError'. Please help

Nishu

Oct 14, 2008, 12:44:47 AM
to Google App Engine
Hello,

I am trying to develop a screen-scraping application using the Google
webapp framework. The application parses the HTML output of another
page to extract the required data and then builds a string from that
data. Sometimes the application works well, but at times it raises the
following error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0x95 in
position 100: ordinal not in range(128)

After googling around for some time I tried the following:

sys.setdefaultencoding("UTF-8")

As a result, the default encoding was set to 'UTF-8', but even this did
not solve the problem, and now the application raises the following
error:

UnicodeDecodeError: 'utf8' codec can't decode byte..........

So please help me solve this problem. Thanking you in advance.

Nishant

yejun

Oct 14, 2008, 3:13:50 AM
to Google App Engine
iso8859-1 should be able to decode any byte, but I guess there is a bug
in the code that causes an implicit unicode conversion.
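
To illustrate the point above, here is a minimal sketch that reproduces the error with an explicit decode. The byte string `raw` and the helper `try_decode` are made up for illustration and are not from the original application. It is written for current Python, where the conversion has to be spelled out rather than happening implicitly.

```python
# Hypothetical scraped bytes. 0x95 is not valid ASCII and, on its own,
# is not a valid UTF-8 sequence either.
raw = b"Results \x95 page 1"


def try_decode(data, encoding):
    """Return the decoded text, or an error description if decoding fails."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return "error: %s" % exc


ascii_result = try_decode(raw, "ascii")        # fails on byte 0x95
utf8_result = try_decode(raw, "utf-8")         # also fails on byte 0x95
latin1_result = try_decode(raw, "iso-8859-1")  # never fails

print(ascii_result)
print(utf8_result)
print(repr(latin1_result))
```

ISO-8859-1 maps every byte to the code point with the same number, so it cannot fail, but that does not make the result meaningful: here 0x95 becomes the control character U+0095, which is probably not what the page author intended.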

Nishu

Oct 14, 2008, 3:35:59 AM
to Google App Engine
Thanks for replying

Actually, the HTML data that I am parsing is rendered with UTF-8
encoding, so I tried setting the default encoding to UTF-8. For your
information, I am trying to parse Google's search results, which are
rendered with UTF-8 encoding. Is there any other way to get Google's
search results instead of parsing the HTML, such as an API that can be
used with Python?

Your reply is highly appreciated, but I would be thankful if you could
send me a code snippet or a link to other sources where I can find a
clearer solution to my original problem.

Thanks
Nishant

yejun

Oct 14, 2008, 4:04:41 AM
to Google App Engine
Google search has a JSON-P interface, which is used by the AJAX Search
API. But the terms of use for the data and protocol are very restrictive.

http://googleajaxsearchapi.blogspot.com/

kang

Oct 14, 2008, 7:27:27 AM
to google-a...@googlegroups.com
a.decode('utf8','ignore')
--
Stay hungry,Stay foolish.

Nishu

Oct 15, 2008, 2:36:48 AM
to Google App Engine
Thanks, it worked. At least the application is not raising such errors
any more. I am a novice as far as Python is concerned, so can you please
give a short explanation of the solution you provided? With this code,
I noticed that the character that was causing the problem was not
included in the final result string. So I tried the following:

a.decode('utf8','replace')

Instead of removing the character, this statement replaced it with some
other character. So please suggest which one I should use, the one
with 'ignore' or the one with 'replace', and why. Can you also suggest
a good book for learning Python?

Thanks once again.

Nishant
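
The difference between the two error handlers described above can be seen side by side in a small sketch. The input bytes are hypothetical and the example is written for current Python.

```python
# Hypothetical input: mostly ASCII text with one stray byte (0x95)
# that is not valid UTF-8.
raw = b"Results \x95 page 1"

# 'ignore' silently drops the bytes that cannot be decoded.
ignored = raw.decode("utf-8", "ignore")

# 'replace' substitutes U+FFFD (the Unicode replacement character)
# for each undecodable sequence, so the damage stays visible.
replaced = raw.decode("utf-8", "replace")

print(repr(ignored))
print(repr(replaced))
```

Both handlers lose the original character. 'replace' at least leaves a marker showing where data went missing, which tends to help when debugging, while 'ignore' may be acceptable when a cleaner-looking output matters more than noticing the loss.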

Kang

Oct 15, 2008, 3:47:30 AM
to google-a...@googlegroups.com
This happens because the string you want to process contains some bytes that are not valid UTF-8. With decode("UTF-8","ignore"), those bytes are skipped, so you can decode the string without errors.
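
If keeping the offending character matters, one possible alternative is to try a strict decode first and fall back to a second encoding. The sketch below is only an assumption about the data, not a diagnosis of the original page: it guesses that the stray bytes come from Windows-1252, where 0x95 happens to be the bullet character. The function name `decode_html` and its `encodings` parameter are hypothetical.

```python
def decode_html(raw, encodings=("utf-8", "cp1252")):
    """Decode bytes by trying each encoding strictly, in order.

    The fallback to cp1252 is an assumption about where the stray bytes
    come from; adjust the list for the pages actually being scraped.
    """
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: keep going, but mark the undecodable bytes with U+FFFD.
    return raw.decode("utf-8", "replace")


bullet_text = decode_html(b"Results \x95 page 1")  # falls back to cp1252
utf8_text = decode_html(b"caf\xc3\xa9")            # valid UTF-8, decoded as such
unknown = decode_html(b"\x81")                     # undefined in cp1252 too

print(repr(bullet_text))
print(repr(utf8_text))
print(repr(unknown))
```

A page that mixes encodings can still come out partly wrong with this approach, since the fallback re-decodes the whole input with the second encoding.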

I think "Dive into Python" is a good choice.

P.S. I am new to Python, too. I am Chinese, so I always have to deal with decode errors, because GAE does not support Chinese well.
