This is the original:
<a href="http://www.microsoft.com/usability/enroll.mspx"
id="L_75998"
title="<!--http://www.microsoft.com/usability/information.mspx->"
onclick="return MS_HandleClick(this,'C_32179', true);">
Help us improve our products
</a>
And this is what comes back after parsing with BeautifulSoup
and using "prettify":
<a href="http://www.microsoft.com/usability/enroll.mspx"
id="L_75998"
title="<!--http://www.microsoft.com/usability/information.mspx->">
<br clear="all" style="line-height: 1px; overflow: hidden" />
<table id="msviFooter" width="100%" cellpadding="0"
cellspacing="0">
<tr valign="bottom">
<td id="msviFooter2"
style="filter:progid:DXImageTransform.Microsoft.Gradient(startColorStr='#FFFFFF',
endColorStr='#3F8CDA', gradientType='1')">
<div id="msviLocalFooter">
<nobr>
</nobr>
</div>
</td>
</tr>
</table>
</a>
All that other stuff is in the neighborhood, but not in that <a> tag.
Strictly speaking, it's Microsoft's fault.
title="<!--http://www.microsoft.com/usability/information.mspx->"
is supposed to be an HTML comment. But it's improperly terminated.
It should end with "-->". So all that following stuff is from what
follows the next "-->" which terminates a comment.
It's so Microsoft.
Unfortunately, even Firefox accepts bad comments like that.
Anyway, a BeautifulSoup question. "findAll(text=True)" collects comments,
processing instructions, etc. as well as real text. What's the right way
to collect ordinary text only?
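In BeautifulSoup itself the usual idiom is to filter out Comment instances
from the results. For what it's worth, the standard-library parser makes the
same distinction: character data and comments arrive through separate
callbacks, so comments drop out just by not handling them. A sketch with the
stdlib (not the BeautifulSoup API):

```python
from html.parser import HTMLParser

class TextOnly(HTMLParser):
    """Collects character data only; comments and processing
    instructions go to other callbacks we leave unimplemented."""
    def __init__(self):
        super().__init__()
        self.chunks = []
    def handle_data(self, data):
        self.chunks.append(data)

p = TextOnly()
p.feed('<p>real <!-- a comment --> text</p>')
print("".join(p.chunks))  # prints "real  text" -- the comment is dropped
```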
John Nagle
> Strictly speaking, it's Microsoft's fault.
>
> title="<!--http://www.microsoft.com/usability/information.mspx->"
>
> is supposed to be an HTML comment. But it's improperly terminated.
> It should end with "-->". So all that following stuff is from what
> follows the next "-->" which terminates a comment.
It is an attribute value, and unescaped angle brackets are valid in
attributes. It looks to me like a bug in BeautifulSoup.
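A quick sanity check of that claim (not the poster's code) with Python's
stdlib html.parser, which follows the same tokenization rule: a quoted
attribute value may contain unescaped angle brackets, so the comment-looking
string should come back as plain attribute text:

```python
from html.parser import HTMLParser

class AttrGrabber(HTMLParser):
    """Records the attributes of every start tag it sees."""
    def __init__(self):
        super().__init__()
        self.attrs = []
    def handle_starttag(self, tag, attrs):
        self.attrs.extend(attrs)

p = AttrGrabber()
# the "<!--...->" inside the quotes is attribute text, not a comment
p.feed('<a title="<!--not a comment->">link</a>')
print(p.attrs)  # the value comes back intact, angle brackets and all
```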
FWIW, see http://tinyurl.com/yjtzjz
new fan of BeautifulSoup here as it helped me parse "BAD" XML
(although my client would disagree with that description)
Hmm, not quite right.
Or:
http://www.crummy.com/software/BeautifulSoup/documentation.html#Customizing%20the%20Parser
I'm right behind BeautifulSoup's ability to parse bad HTML, but I still
think it should give priority to being able to parse valid HTML without
messing it up.
No, that comment is inside a quoted string, so it should be ok.
If you are just trying to extract <a href=...> tags, this pyparsing
scraper gets them, including this problematic one:
import urllib
from pyparsing import makeHTMLTags

pg = urllib.urlopen("http://support.microsoft.com/contactussupport/?ws=support")
htmlSrc = pg.read()
pg.close()

# only take first tag returned from makeHTMLTags, not interested in
# closing </a> tags
anchorTag = makeHTMLTags("A")[0]
for a in anchorTag.searchString(htmlSrc):
    if "title" in a:
        print "Title:", a.title
        print "HREF:", a.href
        # or use this statement to dump the complete tag contents
        # print a.dump()
        print
Prints:
Title: <!--http://www.microsoft.com/usability/information.mspx->
HREF: http://www.microsoft.com/usability/enroll.mspx
Title: Print this page
HREF: /gp/noscript/
Title: Print this page
HREF: /gp/noscript/
Title: E-mail this page
HREF: mailto:?subject=Help%20and%20Support&body=http%3a%2f%2fsupport.microsoft.com%2fdefault.aspx%2fcontactussupport%2f%3fws%3dsupport
Title: E-mail this page
HREF: mailto:?subject=Help%20and%20Support&body=http%3a%2f%2fsupport.microsoft.com%2fdefault.aspx%2fcontactussupport%2f%3fws%3dsupport
Title: Microsoft Worldwide
HREF: /common/international.aspx?rdPath=0
Title: Microsoft Worldwide
HREF: /common/international.aspx?rdPath=0
Title: Save to My Support Favorites
HREF: /gp/noscript/
Title: Save to My Support Favorites
HREF: /gp/noscript/
Title: Go to My Support Favorites
HREF: /gp/noscript/
Title: Go to My Support Favorites
HREF: /gp/noscript/
Title: Send Feedback
HREF: /gp/noscript/
Title: Send Feedback
HREF: /gp/noscript/
-- Paul
I think you're right. The HTML 4 spec,
http://www.w3.org/TR/html4/intro/sgmltut.html
says "Note that comments are markup". So recognizing comment syntax
inside an attribute is, in fact, an error in BeautifulSoup.
The source HTML on the Microsoft page is thus syntactically correct,
although meaningless. That's the only place on that page with a
comment-type form in an attribute.
John Nagle