Leonard Richardson
May 6, 2013, 7:16:18 PM
to beauti...@googlegroups.com
> With Python 2.7.2, I then decoded the string to unicode, with the following
> result:
>
> u'\xef\xbb\xbf<!DOCTYPE HTML PUBLIC'
I believe this is where things went wrong for you. The byte order mark
is a single character: U+FEFF.
>>> '\xef\xbb\xbf<!DOCTYPE HTML PUBLIC'.decode("utf8")
u'\ufeff<!DOCTYPE HTML PUBLIC'
I'm guessing you decoded it as latin-1 instead. That turns it into the
3-character string "ï»¿" (u'\xef\xbb\xbf').
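To make the difference concrete, here is a small sketch comparing the two decodings of the same bytes. It is written in Python 3 syntax (an assumption on my part; on Python 2.7 the byte string would be written without the b prefix), and the variable names are only for illustration. It also shows the standard "utf-8-sig" codec, which drops a leading byte order mark while decoding.

```python
# The first bytes of the page: the UTF-8 byte order mark, then the DOCTYPE.
raw = b'\xef\xbb\xbf<!DOCTYPE HTML PUBLIC'

# Decoding as UTF-8 turns the three BOM bytes into one character, U+FEFF.
as_utf8 = raw.decode("utf-8")

# Decoding as latin-1 maps every byte to its own character, so the BOM
# becomes three separate characters: U+00EF, U+00BB, U+00BF.
as_latin1 = raw.decode("latin-1")

# The "utf-8-sig" codec decodes as UTF-8 and removes a leading BOM.
as_utf8_sig = raw.decode("utf-8-sig")

print(repr(as_utf8))
print(repr(as_latin1))
print(repr(as_utf8_sig))
```

If the string you end up with starts with u'\xef\xbb\xbf', the data was most likely decoded as latin-1 (or some other single-byte encoding) somewhere along the way.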
But that seems tangential to your real problem.
> When I pass this string to BeautifulSoup, it chokes and gives me back an
> essentially empty soup.
>
> >>> soup = BeautifulSoup(pageData, ["lxml", "xml"])
> >>> soup
> <?xml version="1.0" encoding="utf-8"?>
>
> If however, I do the following:
>
> if pageData.startswith(u"\xef\xbb\xbf"):
>     pageData = pageData[3:]
> soup = BeautifulSoup(pageData, ["lxml", "xml"])
>
> then everything is fine.
>
> I had to use lxml's XML parser via ["lxml", "xml"] instead of just the lxml
> HTML parser because in the same program another web site seemed to give me
> very weird results otherwise.
>
> It appears when using lxml HTML parser the BOM bytes are correctly ignored.
I don't know what the rest of your document looks like, but there does
seem to be an lxml-specific problem when a document starts with a
byte-order mark. When I run the markup through lxml directly I get
this error:
lxml.etree.XMLSyntaxError: Misplaced DOCTYPE declaration, line 1, column 4
i.e. it treats the byte order mark as the first three "columns", and
then complains that there's something ahead of the DOCTYPE. To me this
points to a problem on the lxml side.
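Until that is sorted out, stripping the byte order mark before parsing is a reasonable workaround. Below is a minimal sketch of a more general version of your check. The helper name strip_bom and the constant MOJIBAKE_BOM are hypothetical, not part of Beautiful Soup or lxml; the sketch only assumes the standard library's codecs.BOM_UTF8 constant.

```python
import codecs

# What the UTF-8 BOM looks like after being wrongly decoded as latin-1:
# the three characters u'\xef\xbb\xbf'.
MOJIBAKE_BOM = codecs.BOM_UTF8.decode("latin-1")


def strip_bom(markup):
    """Hypothetical helper: return markup without a leading UTF-8 BOM.

    Handles raw bytes, correctly decoded text (U+FEFF), and text that
    was decoded as latin-1 by mistake.
    """
    if isinstance(markup, bytes):
        if markup.startswith(codecs.BOM_UTF8):
            return markup[len(codecs.BOM_UTF8):]
        return markup
    if markup.startswith(u"\ufeff"):
        return markup[1:]
    if markup.startswith(MOJIBAKE_BOM):
        return markup[len(MOJIBAKE_BOM):]
    return markup
```

The result can then be passed on as before, e.g. BeautifulSoup(strip_bom(pageData), ["lxml", "xml"]). Markup without a byte order mark comes back unchanged, so the call is safe to apply to every page.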
Leonard