attempting to parse French Wikipedia pages


Jim Tittsler

Jun 20, 2007, 9:31:32 PM
to beautifulsoup
If I try to parse the text of http://fr.wikipedia.org/wiki/Paris with
BeautifulSoup v3.0.4, I get a traceback. It appears to be trying to
decode an attribute value containing a character entity as ASCII
rather than UTF-8.

<span class="romain" title="Nombre&#160;écrit en chiffres romains">X</span>

I have tried explicitly setting fromEncoding="utf-8", but the result
was the same.

BeautifulSoup 2.1.1 appears to be able to make soup of it.

What am I doing wrong?

>>> soup = BeautifulSoup(s, fromEncoding="utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/jim/Projects/exelearning/dev/exe/exe/engine/beautifulsoup.py", line 1282, in __init__
    BeautifulStoneSoup.__init__(self, *args, **kwargs)
  File "/Users/jim/Projects/exelearning/dev/exe/exe/engine/beautifulsoup.py", line 946, in __init__
    self._feed()
  File "/Users/jim/Projects/exelearning/dev/exe/exe/engine/beautifulsoup.py", line 971, in _feed
    SGMLParser.feed(self, markup)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/sgmllib.py", line 99, in feed
    self.goahead(0)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/sgmllib.py", line 133, in goahead
    k = self.parse_starttag(i)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/sgmllib.py", line 285, in parse_starttag
    self._convert_ref, attrvalue)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)
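
One way to read the final line of that traceback (an interpretation, not something confirmed in this thread): byte 0xa0 is decimal 160, the same number as in &#160;. That suggests the parser has already turned the character reference into a single raw byte, and the failure happens when that byte is implicitly decoded as ASCII while being combined with Unicode text. The sketch below reproduces only that last decoding step, in current Python. The names charref_to_byte and describe_decode are hypothetical helpers made up for illustration; they are not part of sgmllib or BeautifulSoup.

```python
def charref_to_byte(ref):
    """Turn a decimal reference such as '&#160;' into one raw byte.

    Assumption: this mimics what chr(160) produced in Python 2.
    """
    codepoint = int(ref[2:-1])          # strip the leading '&#' and trailing ';'
    return bytes(bytearray([codepoint]))


def describe_decode(raw, encoding):
    """Decode raw bytes, returning the error message instead of raising."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError as exc:
        return str(exc)


raw = charref_to_byte("&#160;")
print(describe_decode(raw, "ascii"))            # the ASCII codec rejects the byte
print(repr(describe_decode(raw, "latin-1")))    # Latin-1 maps it to U+00A0
```

Under this reading, the encoding of the page itself does not matter much: the offending byte comes from the numeric reference, not from the UTF-8 text around it.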

Yichun

Jun 21, 2007, 4:50:21 PM
to beautifulsoup
Where did you get BeautifulSoup from? I vaguely remember running into
a problem similar to yours. The 3.0.4 source contains the following
lines:

# Autodetects character encodings.
# Download from http://chardet.feedparser.org/
try:
    import chardet
#    import chardet.constants
#    chardet.constants._debug = 1
except:
    chardet = None
chardet = None

So chardet is disabled by default even if it is installed. You need to
comment out the last line, "chardet = None", to enable it.
I do not have time to try it myself; tell us what you get after you
enable chardet.

HTH,

-Yichun
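
For anyone who prefers not to edit the library source, the same optional-import pattern can be tried outside BeautifulSoup. This is a minimal sketch, assuming the third-party chardet package and its detect() function are available; guess_encoding and its "utf-8" default are hypothetical names chosen for illustration, not BeautifulSoup's own code.

```python
try:
    import chardet  # third-party and optional
except ImportError:
    chardet = None


def guess_encoding(raw, default="utf-8"):
    """Return chardet's guess for raw bytes, or `default` when unavailable."""
    if chardet is None or not raw:
        return default
    guess = chardet.detect(raw)          # dict with an 'encoding' key
    return guess.get("encoding") or default


sample = u"Nombre\u00a0\u00e9crit en chiffres romains".encode("utf-8")
print(guess_encoding(sample))
```

The guess could then be passed along as fromEncoding. Detection on short inputs is unreliable, so treat the result as a hint rather than a fact.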

Jim Tittsler

Jun 21, 2007, 6:45:53 PM
to beautifulsoup
On Jun 22, 8:50 am, Yichun <yichun....@gmail.com> wrote:
> Where did you get BeautifulSoup from?

http://www.crummy.com/software/BeautifulSoup/#Download

Is there a more up to date place?

> So chardet is disabled by default even if it is installed. You need to
> comment out the last line "chardet = None" to enable it.
> Do not have time to try it, tell us what you get after you enable
> chardet.

Ok. Thanks. I'll give that a try.

My current solution is just to brute force:
content = content.replace('&#160;', '&nbsp;')
before trying to make soup out of it.
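
The same brute-force idea can be generalized beyond &#160;. The following is a minimal sketch, assuming the failure is triggered only by decimal references in the 128-255 range; name_high_charrefs is a hypothetical helper name, and whether BeautifulSoup 3.0.4 then handles every resulting named entity as hoped is untested here.

```python
import re

try:
    from html.entities import codepoint2name    # Python 3
except ImportError:
    from htmlentitydefs import codepoint2name   # Python 2

_DECIMAL_REF = re.compile(r"&#(\d+);")


def name_high_charrefs(markup):
    """Rewrite decimal references for code points 128-255 as named entities."""
    def _swap(match):
        codepoint = int(match.group(1))
        name = codepoint2name.get(codepoint)
        if 128 <= codepoint <= 255 and name:
            return "&%s;" % name
        return match.group(0)                    # leave everything else alone
    return _DECIMAL_REF.sub(_swap, markup)


content = '<span title="Nombre&#160;&#233;crit">X</span>'
print(name_high_charrefs(content))
```

References outside that range, such as &#65; or &#8364;, are left untouched.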

Yichun

Jun 21, 2007, 8:44:40 PM
to beautifulsoup

Good if this works for you. See
http://www.crummy.com/software/BeautifulSoup/documentation.html#Sanitizing%20Bad%20Data%20with%20Regexps
if you need to massage troublesome markup.

-Yichun

rdesh

Jul 8, 2007, 11:45:43 PM
to beautifulsoup
Hi All,

I am also having similar problems. I tried commenting out the
chardet = None line, but I still get the same errors. Any ideas?

>>> a = open('/home/rdesh/Desktop/default.html').read()
>>> n = BeautifulSoup.BeautifulSoup(a)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "BeautifulSoup.py", line 1282, in __init__
    BeautifulStoneSoup.__init__(self, *args, **kwargs)
  File "BeautifulSoup.py", line 946, in __init__
    self._feed()
  File "BeautifulSoup.py", line 971, in _feed
    SGMLParser.feed(self, markup)
  File "/usr/lib/python2.5/sgmllib.py", line 99, in feed
    self.goahead(0)
  File "/usr/lib/python2.5/sgmllib.py", line 133, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib/python2.5/sgmllib.py", line 291, in parse_starttag
    self.finish_starttag(tag, attrs)
  File "/usr/lib/python2.5/sgmllib.py", line 340, in finish_starttag
    self.handle_starttag(tag, method, attrs)
  File "/usr/lib/python2.5/sgmllib.py", line 376, in handle_starttag
    method(attrs)
  File "BeautifulSoup.py", line 1372, in start_meta
    self._feed(self.declaredHTMLEncoding)
  File "BeautifulSoup.py", line 971, in _feed
    SGMLParser.feed(self, markup)
  File "/usr/lib/python2.5/sgmllib.py", line 99, in feed
    self.goahead(0)
  File "/usr/lib/python2.5/sgmllib.py", line 133, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib/python2.5/sgmllib.py", line 285, in parse_starttag
    self._convert_ref, attrvalue)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc0 in position 0: ordinal not in range(128)

>>> unicode(a)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4445: ordinal not in range(128)

>>> unicode(a, 'utf-8')
...works fine, spits out all the html....

>>> n = BeautifulSoup.BeautifulSoup(unicode(a, 'utf-8'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "BeautifulSoup.py", line 1282, in __init__
    BeautifulStoneSoup.__init__(self, *args, **kwargs)
  File "BeautifulSoup.py", line 946, in __init__
    self._feed()
  File "BeautifulSoup.py", line 971, in _feed
    SGMLParser.feed(self, markup)
  File "/usr/lib/python2.5/sgmllib.py", line 99, in feed
    self.goahead(0)
  File "/usr/lib/python2.5/sgmllib.py", line 133, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib/python2.5/sgmllib.py", line 285, in parse_starttag
    self._convert_ref, attrvalue)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc0 in position 0: ordinal not in range(128)

>>> n = BeautifulSoup.BeautifulSoup(unicode(a, 'utf-8'), fromEncoding='utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "BeautifulSoup.py", line 1282, in __init__
    BeautifulStoneSoup.__init__(self, *args, **kwargs)
  File "BeautifulSoup.py", line 946, in __init__
    self._feed()
  File "BeautifulSoup.py", line 971, in _feed
    SGMLParser.feed(self, markup)
  File "/usr/lib/python2.5/sgmllib.py", line 99, in feed
    self.goahead(0)
  File "/usr/lib/python2.5/sgmllib.py", line 133, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib/python2.5/sgmllib.py", line 291, in parse_starttag
    self.finish_starttag(tag, attrs)
  File "/usr/lib/python2.5/sgmllib.py", line 340, in finish_starttag
    self.handle_starttag(tag, method, attrs)
  File "/usr/lib/python2.5/sgmllib.py", line 376, in handle_starttag
    method(attrs)
  File "BeautifulSoup.py", line 1372, in start_meta
    self._feed(self.declaredHTMLEncoding)
  File "BeautifulSoup.py", line 971, in _feed
    SGMLParser.feed(self, markup)
  File "/usr/lib/python2.5/sgmllib.py", line 99, in feed
    self.goahead(0)
  File "/usr/lib/python2.5/sgmllib.py", line 133, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib/python2.5/sgmllib.py", line 285, in parse_starttag
    self._convert_ref, attrvalue)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc0 in position 0: ordinal not in range(128)

Really confused. Any ideas?

Thanks,
rdesh
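
The session above shows unicode(a) failing while unicode(a, 'utf-8') succeeds, which is consistent with the file being UTF-8 and the default codec being ASCII. Below is a minimal sketch of decoding explicitly before parsing, written for current Python; decode_markup and the candidate encoding list are assumptions made for illustration. As the tracebacks show, decoding first did not by itself stop the BeautifulSoup error in this case.

```python
def decode_markup(raw, encodings=("utf-8", "latin-1")):
    """Return (text, encoding) using the first encoding that decodes cleanly."""
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of %r could decode the markup" % (encodings,))


text, used = decode_markup(b"Nombre\xc2\xa0\xc3\xa9crit")
print(used)
```

Latin-1 accepts any byte sequence, so it only makes sense as the last candidate in the list.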


rdesh

Jul 9, 2007, 4:41:12 PM
to beautifulsoup
Fixed by following this advice:

http://mail.python.org/pipermail/python-bugs-list/2007-February/037082.html

Hopefully this gets fixed soon!

Cheers.
