parsing invalid href

Krisztián Pintér

Feb 28, 2024, 7:58:23 AM
to beautifulsoup
i need to extract data from a website. the website contains unescaped links in href attributes. e.g.:

    <a href="/something?a=1&registration=0">

i bet you see the issue: &reg is immediately recognized by BS, and turned into (R).

the problem is of course that it is not my website, so i can't fix it. browsers apparently go out of their way to please users, so they'll happily take the url literally when it appears in an href.

i'm kind of at a loss. meaningfully preprocessing the html looks next to impossible, and once i feed it into BS, the content is ruined.

is there anything to do here?

leonardr

Feb 28, 2024, 9:44:50 AM
to beautifulsoup
It looks like this is a strategy for parsing ambiguous HTML specific to Python's built-in HTML parser. The other supported parsers--lxml and html5lib--choose different strategies that should work better for you.

To diagnose this, I used the diagnose() function on your provided markup:

    from bs4.diagnose import diagnose
    markup = '<a href="/something?a=1&registration=0">'
    diagnose(markup)

Here's the relevant output of diagnose():

    Trying to parse your markup with html.parser
    Here's what html.parser did with the markup:
    <a href="/something?a=1®istration=0">
    </a>

    --------------------------------------------------------------------------------
    Trying to parse your markup with html5lib
    Here's what html5lib did with the markup:
    <html>
     <head>
     </head>
     <body>
      <a href="/something?a=1&amp;registration=0">
      </a>
     </body>
    </html>

    --------------------------------------------------------------------------------
    Trying to parse your markup with lxml
    Here's what lxml did with the markup:
    <html>
     <body>
      <a href="/something?a=1&amp;registration=0">
      </a>
     </body>
    </html>

The ® problem only shows up when html.parser is used, so switching to lxml or html5lib should solve the problem.
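
A minimal sketch of that switch, assuming the lxml package is installed (pass "html5lib" as the parser name if you have that one instead). The helper name extract_hrefs is made up for illustration and is not part of Beautiful Soup:

```python
from bs4 import BeautifulSoup


def extract_hrefs(markup, parser="lxml"):
    """Return the href of every <a> tag, parsed with the given parser.

    extract_hrefs is a hypothetical helper, not a Beautiful Soup API.
    """
    soup = BeautifulSoup(markup, parser)
    # find_all("a", href=True) skips anchors that have no href attribute.
    return [a["href"] for a in soup.find_all("a", href=True)]


markup = '<a href="/something?a=1&registration=0">'
print(extract_hrefs(markup))
```

With lxml or html5lib the href should come back as the literal /something?a=1&registration=0, which matches what those two parsers show in the diagnose() output.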

Leonard