UnicodeEncodeError.


James W

Jan 24, 2012, 12:22:34 AM
to beautifulsoup
Hi All,

I am using BeautifulSoup 3.2 on Python 2.7.1.

I have recently been trying to get something simple to work, but it
seems rather tricky:

I do the following:

import urllib2
from BeautifulSoup import BeautifulSoup
for i in range(0, 10):
    temp3 = BeautifulSoup(urllib2.urlopen(urlList[i], None, 15))

However, I get the error:

***********************************************************************************
  File "/home/foo/k/kat/BeautifulSoup.py", line 1519, in __init__
    BeautifulStoneSoup.__init__(self, *args, **kwargs)
  File "/home/foo/k/kat/BeautifulSoup.py", line 1144, in __init__
    self._feed(isHTML=isHTML)
  File "/home/foo/k/kat/BeautifulSoup.py", line 1186, in _feed
    SGMLParser.feed(self, markup)
  File "/usr/lib/python2.7/sgmllib.py", line 104, in feed
    self.goahead(0)
  File "/usr/lib/python2.7/sgmllib.py", line 143, in goahead
    k = self.parse_endtag(i)
  File "/usr/lib/python2.7/sgmllib.py", line 320, in parse_endtag
    self.finish_endtag(tag)
  File "/usr/lib/python2.7/sgmllib.py", line 358, in finish_endtag
    method = getattr(self, 'end_' + tag)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfa' in position 4: ordinal not in range(128)
***********************************************************************************

If I run the same loop again, I sometimes get this instead:

***********************************************************************************
  File "/home/foo/k/kat/BeautifulSoup.py", line 1519, in __init__
    BeautifulStoneSoup.__init__(self, *args, **kwargs)
  File "/home/foo/mtorrents/kat/BeautifulSoup.py", line 1144, in __init__
    self._feed(isHTML=isHTML)
  File "/home/foo/k/kat/BeautifulSoup.py", line 1186, in _feed
    SGMLParser.feed(self, markup)
  File "/usr/lib/python2.7/sgmllib.py", line 104, in feed
    self.goahead(0)
  File "/usr/lib/python2.7/sgmllib.py", line 143, in goahead
    k = self.parse_endtag(i)
  File "/usr/lib/python2.7/sgmllib.py", line 320, in parse_endtag
    self.finish_endtag(tag)
  File "/usr/lib/python2.7/sgmllib.py", line 358, in finish_endtag
    method = getattr(self, 'end_' + tag)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 4-5: ordinal not in range(128)
***********************************************************************************

How do I avoid this error?

I tried some solutions from stackoverflow:

*] Tried soup = BeautifulSoup(page, fromEncoding=<encoding of the
page>). Result: doesn't work, same errors.

*] Tried replacing the sgmllib.py in my 2.7.1 install with the one
from 2.7.2. Result: doesn't work, same errors.

*] Tried html = BeautifulSoup(page.encode('utf-8')). Result: doesn't
work, same errors.


I would appreciate any suggestions as to how to solve this encode
error.



Thomas Kluyver

Jan 30, 2012, 6:54:28 AM
to beautifulsoup
On Jan 24, 5:22 am, James W <s031507...@gmail.com> wrote:
> However, I get the error:
>
> ***********************************************************************************
>   File "/home/foo/k/kat/BeautifulSoup.py", line 1519, in __init__
>     BeautifulStoneSoup.__init__(self, *args, **kwargs)
>   File "/home/foo/k/kat/BeautifulSoup.py", line 1144, in __init__
>     self._feed(isHTML=isHTML)
>   File "/home/foo/k/kat/BeautifulSoup.py", line 1186, in _feed
>     SGMLParser.feed(self, markup)
>   File "/usr/lib/python2.7/sgmllib.py", line 104, in feed
>     self.goahead(0)
>   File "/usr/lib/python2.7/sgmllib.py", line 143, in goahead
>     k = self.parse_endtag(i)
>   File "/usr/lib/python2.7/sgmllib.py", line 320, in parse_endtag
>     self.finish_endtag(tag)
>   File "/usr/lib/python2.7/sgmllib.py", line 358, in finish_endtag
>     method = getattr(self, 'end_' + tag)
> UnicodeEncodeError: 'ascii' codec can't encode character u'\xfa' in
> position 4: ordinal not in range(128)
> ***********************************************************************************

What HTML are you trying to parse with it? It looks like there might
be a non-ascii character in a tag name (like <ä> instead of <a>).
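
One way to check this is sketched below. The traceback fails on 'end_' + tag, and since 'end_' occupies positions 0-3, position 4 is the first character of the tag name. The helper names non_ascii_tag_names and end_handler_name are hypothetical and are not part of BeautifulSoup or sgmllib; the regular expression is only a rough approximation of how a parser finds tag names. The explicit .encode('ascii') stands in for the implicit conversion Python 2 performs when getattr is given a unicode attribute name, so the sketch also runs on Python 3.

```python
# -*- coding: utf-8 -*-
import re

# Hypothetical helpers for illustration; not part of BeautifulSoup or sgmllib.
# Rough match for the name in an opening or closing tag: "<name" or "</name".
TAG_NAME = re.compile(r'</?([^\s<>/!?]+)')


def non_ascii_tag_names(markup):
    """Return the distinct tag names in `markup` that contain non-ASCII characters."""
    suspects = []
    for name in TAG_NAME.findall(markup):
        if any(ord(ch) > 127 for ch in name) and name not in suspects:
            suspects.append(name)
    return suspects


def end_handler_name(tag):
    """Build the handler name as finish_endtag does, then force it to ASCII."""
    return ('end_' + tag).encode('ascii')


# Made-up sample markup with a tag name starting with u'\xfa'.
sample = u'<p>ok</p><\xfaltimo>text</\xfaltimo>'
print(non_ascii_tag_names(sample))

try:
    end_handler_name(u'\xfaltimo')
except UnicodeEncodeError as exc:
    # exc.start is the index of the first character that could not be encoded.
    print(exc.start)
```

Running non_ascii_tag_names on the decoded page source for each entry in urlList should show which pages, if any, contain such tag names.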

Thomas

Gary Ma

Jan 30, 2012, 7:51:29 AM
to beauti...@googlegroups.com
Try modifying sitecustomize.py in your Python lib/site-packages
directory so that it contains the following:

import sys
sys.setdefaultencoding('utf-8')

