&GT encoding puzzle


Yangming

Oct 24, 2022, 10:00:58 AM
to beautifulsoup
I have this code:
>>> import bs4
>>> sp = bs4.BeautifulSoup('<iframe src="http://xxx/?a=&b=&GT"></iframe>', 'html.parser')
>>> str(sp)

The output is:
'<iframe src="http://xxx/?a=&amp;b=&gt;"></iframe>'

My question: why does &GT become &gt;?

Isaac Muse

Oct 24, 2022, 2:43:27 PM
to beautifulsoup

This seems to be parser-specific:

>>> from bs4 import BeautifulSoup
>>> str(BeautifulSoup('<iframe src="http://xxx/?a=&b=&GT"></iframe>', 'html.parser'))
'<iframe src="http://xxx/?a=&amp;b=&gt;"></iframe>'
>>> str(BeautifulSoup('<iframe src="http://xxx/?a=&b=&GT"></iframe>', 'lxml'))
'<html><body><iframe src="http://xxx/?a=&amp;b=&amp;GT"></iframe></body></html>'
>>> str(BeautifulSoup('<iframe src="http://xxx/?a=&b=&GT"></iframe>', 'html5lib'))
'<html><head></head><body><iframe src="http://xxx/?a=&amp;b=&gt;"></iframe></body></html>'

That said, Chrome also turns &GT into &gt;, at least in my tests.
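
For what it's worth, the html.parser result can be reproduced with the standard library alone. "GT" without a trailing semicolon is one of the legacy named character references in HTML5, so the tokenizer decodes it to ">" inside the attribute value, and the ">" is escaped again on output. The sketch below is only an illustration: `AttrCollector` is a hypothetical helper, and using `html.escape` to stand in for BeautifulSoup's output formatting is an assumption, not bs4's actual code.

```python
from html import escape
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Hypothetical helper: records the attribute values html.parser reports."""

    def __init__(self):
        super().__init__()
        self.attrs = {}

    def handle_starttag(self, tag, attrs):
        # Attribute values arrive with character references already decoded.
        self.attrs.update(attrs)


collector = AttrCollector()
collector.feed('<iframe src="http://xxx/?a=&b=&GT"></iframe>')
collector.close()

decoded = collector.attrs["src"]
# Assumption: on output, a bare "&" and ">" are escaped again, as html.escape does.
reserialized = escape(decoded, quote=False)

print(decoded)       # http://xxx/?a=&b=>
print(reserialized)  # http://xxx/?a=&amp;b=&gt;
```

Here "&b=" is left alone because "b" is not a named reference, while "&GT" is decoded, which matches the html.parser and html5lib output above.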

Yangming

Oct 24, 2022, 9:15:56 PM
to beautifulsoup
Yeah, and the default parser gives this:

>>> str(bs4.BeautifulSoup('<iframe src="http://xxx/?a=&b=&GT"></iframe>'))
'<html><body><iframe src="http://xxx/?a=&amp;b=&amp;GT"></iframe></body></html>'

Isaac Muse

Oct 26, 2022, 9:20:01 AM
to beautifulsoup
When no parser is specified, bs4 uses whatever it determines to be the best parser installed on your system, so the default can differ between systems that don't have the same parsers installed. I usually prefer to state explicitly which parser I'm using.
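
As a rough way to see which default a given machine would get, the sketch below checks which parser libraries are importable. It assumes the preference order described in the Beautiful Soup documentation (lxml, then html5lib, then the built-in html.parser); `guess_default_parser` is a hypothetical helper, not part of bs4, and it only guesses rather than asking bs4 itself.

```python
from importlib.util import find_spec


def guess_default_parser():
    """Hypothetical helper: guess which parser bs4 would choose by default.

    Assumes the documented preference order: lxml, then html5lib,
    then the built-in html.parser.
    """
    for module_name, parser_name in (("lxml", "lxml"), ("html5lib", "html5lib")):
        # find_spec returns None when the top-level module is not installed.
        if find_spec(module_name) is not None:
            return parser_name
    return "html.parser"


parser_name = guess_default_parser()
print(parser_name)
```

Passing the name explicitly, as in BeautifulSoup(markup, 'html.parser'), avoids depending on this guess at all.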