parsing invalid href

Krisztián Pintér

Feb 28, 2024, 7:58:23 AM

to beautifulsoup

i need to extract data from a website. the website contains unescaped links in href attributes. e.g.:

i bet you see the issue: &reg is immediately recognized by BS, and turned into (R).

the problem is of course that it is not my website, so i can't fix it. browsers apparently go out of their way to please users, so they'll happily take the url literally when it appears in a href.

i'm kind of at a loss. meaningfully preprocessing the html looks next to impossible, and once i feed it into BS, the content is ruined.

is there anything to do here?

leonardr

Feb 28, 2024, 9:44:50 AM

to beautifulsoup

It looks like this is a strategy for parsing ambiguous HTML specific to Python's built-in HTML parser. The other supported parsers--lxml and html5lib--choose different strategies that should work better for you.

To diagnose this, I used the diagnose() function on your provided markup:

from bs4.diagnose import diagnose

markup = '<a href="/something?a=1&registration=0">'
diagnose(markup)

Here's the relevant output of diagnose():

Trying to parse your markup with html.parser
Here's what html.parser did with the markup:
<a href="/something?a=1®istration=0">
</a>

--------------------------------------------------------------------------------
Trying to parse your markup with html5lib
Here's what html5lib did with the markup:
<html>
<head>
</head>
<body>
<a href="/something?a=1&registration=0">
</a>
</body>
</html>

--------------------------------------------------------------------------------
Trying to parse your markup with lxml
Here's what lxml did with the markup:
<html>
<body>
<a href="/something?a=1&registration=0">
</a>
</body>
</html>

The ® problem only shows up when html.parser is used, so switching to lxml or html5lib should solve the problem.
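
As a rough sketch of that switch, the snippet below parses the same markup with lxml and falls back to html5lib if lxml is not installed. The helper name extract_hrefs and the fallback order are my own choices for illustration, not part of Beautiful Soup, and both parsers are separate installs (pip install lxml html5lib).

```python
from bs4 import BeautifulSoup, FeatureNotFound


def extract_hrefs(markup, parsers=("lxml", "html5lib")):
    """Return every href in the markup, using the first parser that is installed.

    Hypothetical helper: the name and the fallback order are assumptions.
    """
    for name in parsers:
        try:
            soup = BeautifulSoup(markup, name)
        except FeatureNotFound:
            # Beautiful Soup raises this when the requested parser is missing.
            continue
        return [a["href"] for a in soup.find_all("a", href=True)]
    raise RuntimeError("neither lxml nor html5lib is installed")


markup = '<a href="/something?a=1&registration=0">'
print(extract_hrefs(markup))
```

Going by the diagnose() output above, the href should come back with &registration intact. html.parser is deliberately left out of the fallback list, since silently falling back to it would bring the ® problem back.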