In fact the code you're trying to parse is not valid, as &val; is not
a defined entity.
But this is an error in your code, not in the parser (of course you can
change the parser code to handle such invalid input the way you want).
If you want to use & in a URL you should encode it as &amp; (both in
the href and inside the tag).
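A minimal sketch of that encoding step, assuming Python's standard
html module (the URL below is made up for illustration):

```python
from html import escape

# Hypothetical URL with a raw ampersand between two GET variables.
url = "http://example.com/?sth=0&val=1"

# escape() turns & into &amp; (and also escapes <, > and quotes),
# so the result can be used both in the href and in the tag content.
encoded = escape(url)
link = '<a href="{0}">{0}</a>'.format(encoded)
print(link)
```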
But raw ampersands in URLs are still dangerous. Imagine a script that
uses a GET variable called "amp", and a page with a link to it that is
not entity-encoded:
<html><body>
<a href="http://jaboja.pl/?sth=0&amp=1">http://jaboja.pl/?sth=0&amp=1</a>
</body></html>
When I display the above HTML document in Firefox, it interprets both
the <a> tag content and the href as http://jaboja.pl/?sth=0&=1
In fact it is parsed this way by all major web browsers (but only for
the defined entities!).
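The "only for the defined entities" rule can be seen with Python's
html.unescape, which, as far as I know, follows the HTML5 rules for
text content. It knows nothing about attribute context, so treat this
as a rough sketch and not as a model of any particular browser:

```python
from html import unescape

# "amp" is a defined entity, so it is decoded even without the semicolon.
decoded = unescape("http://jaboja.pl/?sth=0&amp=1")

# "val" is not a defined entity, so the text is left untouched.
untouched = unescape("&val;")

print(decoded)
print(untouched)
```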
When I parse it with BeautifulSoup, it interprets &amp in the content
as a malformed entity,
but does not do so for the href (which is different from what browsers
do). See:
<html><body>
<a href="http://jaboja.pl/?sth=0&amp;amp=1">http://jaboja.pl/?sth=0&amp;=1</a>
</body></html>
--
Jakub Jagiełło
IMHO the problem is that BeautifulSoup does not check whether an entity
is defined, and treats all entity-like substrings as entities.
I think a check should be added for whether an entity is defined, and
the correction should be performed only then.
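A minimal sketch of such a check, assuming the entity table from
Python's html.entities (name2codepoint). Here fix_entities is a made-up
helper for illustration, not BeautifulSoup code:

```python
import re
from html.entities import name2codepoint

# Matches "&name" with an optional trailing semicolon.
_ENTITY_LIKE = re.compile(r"&([A-Za-z][A-Za-z0-9]*)(;?)")

def fix_entities(text):
    """Hypothetical helper: correct only entity-like substrings
    whose name is a defined entity."""
    def repl(match):
        name, semicolon = match.groups()
        if name in name2codepoint:
            # Defined entity: keep it, adding the semicolon if missing.
            return "&" + name + ";"
        # Undefined entity: treat the ampersand as literal text.
        return "&amp;" + name + semicolon
    return _ENTITY_LIKE.sub(repl, text)

print(fix_entities("&val;"))
print(fix_entities("?sth=0&amp=1"))
```

Note that the second call still rewrites &amp=1 to &amp;=1, so the
"amp" variable is lost; a check like this only helps with undefined
names such as &val;.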