i need to extract data from a website. the website contains unescaped links in href attributes. e.g.:
<a href="/something?a=1&registration=0">
i bet you see the issue: &reg is immediately recognized by BS as a character reference and turned into ®.
the problem is of course that it is not my website, so i can't fix it. browsers apparently go out of their way to please users, so they'll happily take the url literally when it appears in an href.
i'm kind of at a loss. meaningfully preprocessing the html looks next to impossible, and once i feed this into BS, the content is ruined.
is there anything to do here?
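
one angle that might be workable is a narrow preprocessing pass that only touches href values. the sketch below is a rough idea rather than a tested solution: it assumes the hrefs are double-quoted, and the names `HREF_RE`, `BARE_AMP_RE` and `escape_bare_ampersands` are made up for illustration. it rewrites every `&` that does not start a semicolon-terminated character reference into `&amp;`, so a parser should decode it back to a plain `&`.

```python
import re
from html import unescape

# assumption: href values are double-quoted; single-quoted or
# unquoted attribute values are not handled by this sketch
HREF_RE = re.compile(r'(href\s*=\s*")([^"]*)(")', re.IGNORECASE)

# an ampersand that does NOT begin a semicolon-terminated reference
# such as &amp; or &#38; or &#x26;
BARE_AMP_RE = re.compile(
    r'&(?!#\d+;|#[xX][0-9a-fA-F]+;|[A-Za-z][A-Za-z0-9]*;)'
)


def escape_bare_ampersands(markup: str) -> str:
    """Escape bare ampersands inside double-quoted href values only."""
    def fix(match: re.Match) -> str:
        value = BARE_AMP_RE.sub('&amp;', match.group(2))
        return match.group(1) + value + match.group(3)
    return HREF_RE.sub(fix, markup)


broken = '<a href="/something?a=1&registration=0">link</a>'
fixed = escape_bare_ampersands(broken)

# the stdlib decoder shows the same lenient behaviour described above:
# the raw value gets mangled, the escaped one survives
mangled = unescape('/something?a=1&registration=0')
intact = unescape('/something?a=1&amp;registration=0')
```

the output of `escape_bare_ampersands` could then be fed to BS in place of the raw page. references that already end in a semicolon are left alone, so correctly escaped links on the same page should not get double-escaped. a url that genuinely contains something like `&reg;` with the semicolon would still be decoded, which this sketch makes no attempt to guess around.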