This is the original:
<a href="http://www.microsoft.com/usability/enroll.mspx"
id="L_75998"
title="<!--http://www.microsoft.com/usability/information.mspx->"
onclick="return MS_HandleClick(this,'C_32179', true);">
Help us improve our products
</a>
And this is what comes back after parsing with BeautifulSoup
and using "prettify":
<a href="http://www.microsoft.com/usability/enroll.mspx"
id="L_75998"
title="<!--http://www.microsoft.com/usability/information.mspx->">
<br clear="all" style="line-height: 1px; overflow: hidden" />
<table id="msviFooter" width="100%" cellpadding="0"
cellspacing="0">
<tr valign="bottom">
<td id="msviFooter2"
style="filter:progid:DXImageTransform.Microsoft.Gradient(startColorStr='#FFFFFF',
endColorStr='#3F8CDA', gradientType='1')">
<div id="msviLocalFooter">
<nobr>
</nobr>
</div>
</td>
</tr>
</table>
</a>
All that other stuff is in the neighborhood, but not in that <a> tag.
Strictly speaking, it's Microsoft's fault.
title="<!--http://www.microsoft.com/usability/information.mspx->"
is supposed to be an HTML comment. But it's improperly terminated.
It should end with "-->". So all that following stuff is from what
follows the next "-->" which terminates a comment.
It's so Microsoft.
Unfortunately, even Firefox accepts bad comments like that.
Anyway, a BeautifulSoup question. "findAll(text=True)" collects comments,
processing instructions, etc. as well as real text. What's the right way
to collect ordinary text only?
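In BeautifulSoup itself the usual idiom is to filter out Comment instances
from the results. For what it's worth, the standard-library parser makes the
same distinction: character data and comments arrive through separate
callbacks, so comments drop out just by not handling them. A sketch with the
stdlib (not the BeautifulSoup API):

```python
from html.parser import HTMLParser

class TextOnly(HTMLParser):
    """Collects character data only; comments and processing
    instructions go to other callbacks we leave unimplemented."""
    def __init__(self):
        super().__init__()
        self.chunks = []
    def handle_data(self, data):
        self.chunks.append(data)

p = TextOnly()
p.feed('<p>real <!-- a comment --> text</p>')
print("".join(p.chunks))  # prints "real  text" -- the comment is dropped
```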
John Nagle
> Strictly speaking, it's Microsoft's fault.
>
> title="<!--http://www.microsoft.com/usability/information.mspx->"
>
> is supposed to be an HTML comment. But it's improperly terminated.
> It should end with "-->". So all that following stuff is from what
> follows the next "-->" which terminates a comment.
It is an attribute value, and unescaped angle brackets are valid in
attributes. It looks to me like a bug in BeautifulSoup.
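A quick sanity check of that claim (not the poster's code) with Python's
stdlib html.parser, which follows the same tokenization rule: a quoted
attribute value may contain unescaped angle brackets, so the comment-looking
string should come back as plain attribute text:

```python
from html.parser import HTMLParser

class AttrGrabber(HTMLParser):
    """Records the attributes of every start tag it sees."""
    def __init__(self):
        super().__init__()
        self.attrs = []
    def handle_starttag(self, tag, attrs):
        self.attrs.extend(attrs)

p = AttrGrabber()
# the "<!--...->" inside the quotes is attribute text, not a comment
p.feed('<a title="<!--not a comment->">link</a>')
print(p.attrs)  # the value comes back intact, angle brackets and all
```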
FWIW, see http://tinyurl.com/yjtzjz
new fan of BeautifulSoup here as it helped me parse "BAD" XML
(although my client would disagree with that description)
Hmm, not quite right.
Or:
http://www.crummy.com/software/BeautifulSoup/documentation.html#Customizing%20the%20Parser
I'm right behind BeautifulSoup's ability to parse bad HTML, but I still
think it should give priority to being able to parse valid HTML without
messing it up.
No, that comment is inside a quoted string, so it should be ok.
If you are just trying to extract <a href=...> tags, this pyparsing
scraper gets them, including this problematic one:
import urllib
from pyparsing import makeHTMLTags

pg = urllib.urlopen("http://support.microsoft.com/contactussupport/?ws=support")
htmlSrc = pg.read()
pg.close()

# only take first tag returned from makeHTMLTags, not interested in
# closing </a> tags
anchorTag = makeHTMLTags("A")[0]
for a in anchorTag.searchString(htmlSrc):
    if "title" in a:
        print "Title:", a.title
        print "HREF:", a.href
        # or use this statement to dump the complete tag contents
        # print a.dump()
        print
Prints:
Title: <!--http://www.microsoft.com/usability/information.mspx->
HREF: http://www.microsoft.com/usability/enroll.mspx
Title: Print this page
HREF: /gp/noscript/
Title: Print this page
HREF: /gp/noscript/
Title: E-mail this page
HREF: mailto:?subject=Help%20and%20Support&body=http%3a%2f%2fsupport.microsoft.com%2fdefault.aspx%2fcontactussupport%2f%3fws%3dsupport
Title: E-mail this page
HREF: mailto:?subject=Help%20and%20Support&body=http%3a%2f%2fsupport.microsoft.com%2fdefault.aspx%2fcontactussupport%2f%3fws%3dsupport
Title: Microsoft Worldwide
HREF: /common/international.aspx?rdPath=0
Title: Microsoft Worldwide
HREF: /common/international.aspx?rdPath=0
Title: Save to My Support Favorites
HREF: /gp/noscript/
Title: Save to My Support Favorites
HREF: /gp/noscript/
Title: Go to My Support Favorites
HREF: /gp/noscript/
Title: Go to My Support Favorites
HREF: /gp/noscript/
Title: Send Feedback
HREF: /gp/noscript/
Title: Send Feedback
HREF: /gp/noscript/
-- Paul
I think you're right. The HTML 4 spec,
http://www.w3.org/TR/html4/intro/sgmltut.html
says "Note that comments are markup". So recognizing comment syntax
inside an attribute is, in fact, an error in BeautifulSoup.
The source HTML on the Microsoft page is thus syntactically correct,
although meaningless. That's the only place on that page with a
comment-type form in an attribute.
John Nagle