Re: UTF8 BOM mark in unicode results in empty soup

Leonard Richardson

May 6, 2013, 7:16:18 PM

to beauti...@googlegroups.com

> With Python 2.7.2, I then decoded the string to unicode, with the following
> result:
>
> u'\xef\xbb\xbf<!DOCTYPE HTML PUBLIC'

I believe this is where things went wrong for you. The byte order mark
is a single character: U+FEFF.

>>> '\xef\xbb\xbf<!DOCTYPE HTML PUBLIC'.decode("utf8")
u'\ufeff<!DOCTYPE HTML PUBLIC'

I'm guessing you decoded it as latin-1 instead. That turns it into the
3-character string "ï»¿".
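
A minimal sketch of the difference, written in Python 3 syntax rather than the Python 2.7 used in this thread. The variable names are illustrative only; `utf-8-sig` is the standard-library codec that drops a leading BOM while decoding.

```python
# Raw bytes as they would arrive from the server: UTF-8 BOM, then the markup.
raw = b'\xef\xbb\xbf<!DOCTYPE HTML PUBLIC'

# Correct codec: the three BOM bytes collapse into the single character U+FEFF.
as_utf8 = raw.decode("utf-8")

# Wrong codec: each BOM byte is mapped to its own Latin-1 character.
as_latin1 = raw.decode("latin-1")

# "utf-8-sig" decodes as UTF-8 and discards a leading BOM if one is present.
as_utf8_sig = raw.decode("utf-8-sig")

print(repr(as_utf8[:1]))
print(repr(as_latin1[:3]))
print(repr(as_utf8_sig[:9]))
```

Decoding with `utf-8-sig` avoids having to slice the BOM off by hand afterwards.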

But that seems tangential to your real problem.

> When I pass this string to BeautifulSoup, it chokes and gives me back an
> essentially empty soup.
>
> >>> soup = BeautifulSoup(pageData, ["lxml", "xml"])
> >>> soup
> <?xml version="1.0" encoding="utf-8"?>
>
> If however, I do the following:
>
> if pageData.startswith(u"\xef\xbb\xbf"):
>     pageData = pageData[3:]
> soup = BeautifulSoup(pageData, ["lxml", "xml"])
>
> then everything is fine.
>
> I had to use lxml's XML parser via ["lxml", "xml"] instead of just the lxml
> HTML parser because in the same program another web site seemed to give me
> very weird results otherwise.
>
> It appears when using lxml HTML parser the BOM bytes are correctly ignored.

I don't know what the rest of your document looks like, but there does
seem to be an lxml-specific problem when a document starts with a
byte-order mark. When I run the markup through lxml directly I get
this error:

lxml.etree.XMLSyntaxError: Misplaced DOCTYPE declaration, line 1, column 4

i.e. it treats the byte order mark as the first three "columns", and
then complains that there's something ahead of the DOCTYPE. To me this
points to a problem on the lxml side.
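
One possible workaround is to drop a leading byte order mark before the markup reaches the parser. A minimal sketch, assuming Python 3; `strip_bom` is a hypothetical helper, not part of Beautiful Soup or lxml, and it only handles the UTF-8 BOM and the decoded U+FEFF character.

```python
import codecs


def strip_bom(markup):
    """Return markup without a leading byte order mark (hypothetical helper)."""
    if isinstance(markup, bytes):
        # Undecoded input: look for the three-byte UTF-8 BOM.
        if markup.startswith(codecs.BOM_UTF8):
            return markup[len(codecs.BOM_UTF8):]
        return markup
    # Decoded input: a correctly decoded BOM is the single character U+FEFF.
    if markup.startswith("\ufeff"):
        return markup[1:]
    return markup


cleaned_text = strip_bom("\ufeff<!DOCTYPE HTML PUBLIC")
cleaned_bytes = strip_bom(b"\xef\xbb\xbf<!DOCTYPE HTML PUBLIC")
```

The cleaned value can then be handed to the parser in place of the original markup.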

Leonard

TP

May 7, 2013, 12:51:09 AM

to beauti...@googlegroups.com

On Mon, May 6, 2013 at 4:16 PM, Leonard Richardson <leon...@segfault.org> wrote:

> > With Python 2.7.2, I then decoded the string to unicode, with the following
> > result:
> >
> > u'\xef\xbb\xbf<!DOCTYPE HTML PUBLIC'
>
> I believe this is where things went wrong for you. The byte order mark
> is a single character: U+FEFF.
>
> >>> '\xef\xbb\xbf<!DOCTYPE HTML PUBLIC'.decode("utf8")
> u'\ufeff<!DOCTYPE HTML PUBLIC'
>
> I'm guessing you decoded it as latin-1 instead. That turns it into the
> 3-character string "ï»¿".

Oops. Yeah, that was a stupid mistake I made while messing with my program, trying to figure out what was going on.

> I don't know what the rest of your document looks like, but there does
> seem to be an lxml-specific problem when a document starts with a
> byte-order mark. When I run the markup through lxml directly I get
> this error:
>
> lxml.etree.XMLSyntaxError: Misplaced DOCTYPE declaration, line 1, column 4
>
> i.e. it treats the byte order mark as the first three "columns", and
> then complains that there's something ahead of the DOCTYPE. To me this
> points to a problem on the lxml side.