<span class="romain" title="Nombre écrit en chiffres romains">X</span>
I have tried explicitly setting fromEncoding="utf-8", but the result was the same.
BeautifulSoup 2.1.1 appears to be able to make soup of it.
What am I doing wrong?
>>> soup = BeautifulSoup(s, fromEncoding="utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/jim/Projects/exelearning/dev/exe/exe/engine/beautifulsoup.py", line 1282, in __init__
    BeautifulStoneSoup.__init__(self, *args, **kwargs)
  File "/Users/jim/Projects/exelearning/dev/exe/exe/engine/beautifulsoup.py", line 946, in __init__
    self._feed()
  File "/Users/jim/Projects/exelearning/dev/exe/exe/engine/beautifulsoup.py", line 971, in _feed
    SGMLParser.feed(self, markup)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/sgmllib.py", line 99, in feed
    self.goahead(0)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/sgmllib.py", line 133, in goahead
    k = self.parse_starttag(i)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/sgmllib.py", line 285, in parse_starttag
    self._convert_ref, attrvalue)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)
# Autodetects character encodings.
# Download from http://chardet.feedparser.org/
try:
    import chardet
#    import chardet.constants
#    chardet.constants._debug = 1
except:
    chardet = None
chardet = None
So chardet is disabled by default even if it is installed. You need to
comment out the last line "chardet = None" to enable it.
I do not have time to try it myself, so let us know what you get after you enable
chardet.
HTH,
-Yichun
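For anyone who would rather not edit the library source, here is a rough sketch of doing the detection yourself before handing the markup to BeautifulSoup. It is written for Python 3, `to_text` is a made-up helper name (not part of BeautifulSoup), and the order of attempts (strict UTF-8, then chardet, then latin-1) is just one reasonable assumption.

```python
# Optional dependency: carry on without it if chardet is not installed.
try:
    import chardet
except ImportError:
    chardet = None


def to_text(data, fallback="latin-1"):
    """Decode raw bytes to text (hypothetical helper, not a BeautifulSoup API)."""
    # 1. Strict UTF-8 only succeeds on valid UTF-8, so try it first.
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        pass
    # 2. Ask chardet for a guess when it is available.
    if chardet is not None:
        guess = chardet.detect(data).get("encoding")
        if guess:
            try:
                return data.decode(guess)
            except (UnicodeDecodeError, LookupError):
                pass
    # 3. The default fallback, latin-1, maps every byte to a code point,
    #    so this last step cannot raise with the default argument.
    return data.decode(fallback)
```

The result is already text, so the parser no longer has to guess the encoding. Whether that avoids the sgmllib error above is a separate question, since that error is raised while converting references inside attribute values.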
http://www.crummy.com/software/BeautifulSoup/#Download
Is there a more up to date place?
> So chardet is disabled by default even if it is installed. You need to
> comment out the last line "chardet = None" to enable it.
> Do not have time to try it, tell us what you get after you enable
> chardet.
Ok. Thanks. I'll give that a try.
My current solution is just to brute force:
content = content.replace(' ', ' ')
before trying to make soup out of it.
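The exact string being replaced did not survive in the post as shown. Since the traceback complains about byte 0xa0, which is the non-breaking space, a guess is that it was one of the references for that character. A small sketch under that assumption (Python 3; `replace_nbsp_refs` is a made-up name):

```python
import re

# Matches the named, decimal, and hexadecimal references for U+00A0.
NBSP_REF = re.compile(r"&(?:nbsp|#0*160|#[xX]0*[aA]0);")


def replace_nbsp_refs(markup, replacement=" "):
    """Replace every reference to a non-breaking space with `replacement`."""
    return NBSP_REF.sub(replacement, markup)
```

For example, `replace_nbsp_refs('<span title="a&#160;b">')` gives `'<span title="a b">'`. Other references, such as `&amp;`, are left alone.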
Glad to hear that works for you. See
http://www.crummy.com/software/BeautifulSoup/documentation.html#Sanitizing%20Bad%20Data%20with%20Regexps
if you need to massage troublesome markup.
-Yichun
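As a rough illustration of the regexp approach: keep a list of (pattern, fix-up function) pairs and run the markup through each pair in turn before parsing. The two rules below are only examples, and `massage` is a made-up helper. As far as I understand the linked documentation, BeautifulSoup 3 accepts a list of the same shape through its markupMassage argument, but check the docs for your version.

```python
import re

# Illustrative rules: (compiled pattern, function from match to replacement).
MASSAGE_RULES = [
    # Put a space before the slash of a self-closing tag: <br/> -> <br />
    (re.compile(r"(<[^<>]*[^<>\s/])/>"), lambda m: m.group(1) + " />"),
    # Repair a comment opened with a single dash: <!-x -> <!--x
    (re.compile(r"<!-([^-])"), lambda m: "<!--" + m.group(1)),
]


def massage(markup, rules=MASSAGE_RULES):
    """Apply each (pattern, fix) pair to the markup, in order."""
    for pattern, fix in rules:
        markup = pattern.sub(fix, markup)
    return markup
```

Rules run in list order, so a later rule sees the output of the earlier ones.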
I am also having similar problems. I tried commenting out the "chardet
= None" line, but I still get the same error. Any ideas?
>>> a = open('/home/rdesh/Desktop/default.html').read()
>>> n = BeautifulSoup.BeautifulSoup(a)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "BeautifulSoup.py", line 1282, in __init__
    BeautifulStoneSoup.__init__(self, *args, **kwargs)
  File "BeautifulSoup.py", line 946, in __init__
    self._feed()
  File "BeautifulSoup.py", line 971, in _feed
    SGMLParser.feed(self, markup)
  File "/usr/lib/python2.5/sgmllib.py", line 99, in feed
    self.goahead(0)
  File "/usr/lib/python2.5/sgmllib.py", line 133, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib/python2.5/sgmllib.py", line 291, in parse_starttag
    self.finish_starttag(tag, attrs)
  File "/usr/lib/python2.5/sgmllib.py", line 340, in finish_starttag
    self.handle_starttag(tag, method, attrs)
  File "/usr/lib/python2.5/sgmllib.py", line 376, in handle_starttag
    method(attrs)
  File "BeautifulSoup.py", line 1372, in start_meta
    self._feed(self.declaredHTMLEncoding)
  File "BeautifulSoup.py", line 971, in _feed
    SGMLParser.feed(self, markup)
  File "/usr/lib/python2.5/sgmllib.py", line 99, in feed
    self.goahead(0)
  File "/usr/lib/python2.5/sgmllib.py", line 133, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib/python2.5/sgmllib.py", line 285, in parse_starttag
    self._convert_ref, attrvalue)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc0 in position 0: ordinal not in range(128)
>>> unicode(a)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4445: ordinal not in range(128)
>>> unicode(a, 'utf-8')
...works fine, spits out all the html....
>>> n = BeautifulSoup.BeautifulSoup(unicode(a, 'utf-8'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "BeautifulSoup.py", line 1282, in __init__
    BeautifulStoneSoup.__init__(self, *args, **kwargs)
  File "BeautifulSoup.py", line 946, in __init__
    self._feed()
  File "BeautifulSoup.py", line 971, in _feed
    SGMLParser.feed(self, markup)
  File "/usr/lib/python2.5/sgmllib.py", line 99, in feed
    self.goahead(0)
  File "/usr/lib/python2.5/sgmllib.py", line 133, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib/python2.5/sgmllib.py", line 285, in parse_starttag
    self._convert_ref, attrvalue)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc0 in position 0: ordinal not in range(128)
>>> n = BeautifulSoup.BeautifulSoup(unicode(a, 'utf-8'), fromEncoding='utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "BeautifulSoup.py", line 1282, in __init__
    BeautifulStoneSoup.__init__(self, *args, **kwargs)
  File "BeautifulSoup.py", line 946, in __init__
    self._feed()
  File "BeautifulSoup.py", line 971, in _feed
    SGMLParser.feed(self, markup)
  File "/usr/lib/python2.5/sgmllib.py", line 99, in feed
    self.goahead(0)
  File "/usr/lib/python2.5/sgmllib.py", line 133, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib/python2.5/sgmllib.py", line 291, in parse_starttag
    self.finish_starttag(tag, attrs)
  File "/usr/lib/python2.5/sgmllib.py", line 340, in finish_starttag
    self.handle_starttag(tag, method, attrs)
  File "/usr/lib/python2.5/sgmllib.py", line 376, in handle_starttag
    method(attrs)
  File "BeautifulSoup.py", line 1372, in start_meta
    self._feed(self.declaredHTMLEncoding)
  File "BeautifulSoup.py", line 971, in _feed
    SGMLParser.feed(self, markup)
  File "/usr/lib/python2.5/sgmllib.py", line 99, in feed
    self.goahead(0)
  File "/usr/lib/python2.5/sgmllib.py", line 133, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib/python2.5/sgmllib.py", line 285, in parse_starttag
    self._convert_ref, attrvalue)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc0 in position 0: ordinal not in range(128)
Really confused. Any ideas?
Thanks,
rdesh
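On the `unicode(a)` versus `unicode(a, 'utf-8')` difference in the session above: in Python 2, `unicode()` with no encoding argument decodes as ASCII, so the first non-ASCII byte stops it. A small Python 3 sketch of the same behavior (`try_decode` is a made-up helper and the sample string is invented):

```python
raw = "café".encode("utf-8")  # bytes as read from a UTF-8 file: b'caf\xc3\xa9'


def try_decode(data, encoding):
    """Return (text, None) on success or (None, error message) on failure."""
    try:
        return data.decode(encoding), None
    except UnicodeDecodeError as exc:
        return None, str(exc)


text_ascii, err_ascii = try_decode(raw, "ascii")  # fails on byte 0xc3
text_utf8, err_utf8 = try_decode(raw, "utf-8")    # succeeds
```

Note that the parser error reports byte 0xc0 at position 0, not 0xc3 at position 4445, so it is probably raised on a different, much shorter string than the file contents.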
On Jun 21, 8:44 pm, Yichun <yichun....@gmail.com> wrote:
> On Jun 21, 3:45 pm, Jim Tittsler <jtitts...@gmail.com> wrote:
>
>
>
> > On Jun 22, 8:50 am, Yichun <yichun....@gmail.com> wrote:
>
> > > Where did you get BeautifulSoup from?
>
> > http://www.crummy.com/software/BeautifulSoup/#Download
>
> > Is there a more up to date place?
>
> > > So chardet is disabled by default even if it is installed. You need to
> > > comment out the last line "chardet = None" to enable it.
> > > Do not have time to try it, tell us what you get after you enable
> > > chardet.
>
> > Ok. Thanks. I'll give that a try.
>
> > My current solution is just to brute force:
> > content = content.replace(' ', ' ')
> > before trying to make soup out of it.
>
> Nice if this works for you. Read http://www.crummy.com/software/BeautifulSoup/documentation.html#Sanit...
http://mail.python.org/pipermail/python-bugs-list/2007-February/037082.html
Hopefully this gets fixed soon!
Cheers.