Re: 4.4.0 Bug parsing with html5lib (simple example attached)


Fanny Dwargee

Aug 27, 2015, 8:55:36 AM
to beautifulsoup
My apologies, but I was quite wrong:

According to the W3C Validator, self-closing syntax on a non-void HTML element (such as the iframe in the example) must be treated as a start tag.
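
If the input cannot be fixed at the source, one possible workaround is to expand self-closing non-void tags into explicit open/close pairs before parsing. This is only a minimal sketch: the helper name `expand_self_closing` and the regular expression are illustrative assumptions, not part of Beautiful Soup or html5lib, and a regex like this will misfire on unusual markup (for example, attribute values containing "<" or ">").

```python
import re

# Void elements as listed in the HTML standard; these never take an end tag.
VOID_ELEMENTS = frozenset({
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
})

# Naive pattern (an assumption for this sketch): a tag name, optional
# attributes, then "/>". Not a full HTML tokenizer.
SELF_CLOSING = re.compile(r"<([A-Za-z][A-Za-z0-9]*)((?:\s[^<>]*?)?)\s*/>")


def expand_self_closing(html):
    """Rewrite <tag .../> as <tag ...></tag> for non-void elements only."""
    def replace(match):
        name, attrs = match.group(1), match.group(2)
        if name.lower() in VOID_ELEMENTS:
            return match.group(0)  # leave <br/>, <img/>, etc. untouched
        return "<{0}{1}></{0}>".format(name, attrs)
    return SELF_CLOSING.sub(replace, html)


html = '<html><iframe src="http://www.google.es"/><div id="test"></div></html>'
print(expand_self_closing(html))
```

The expanded string can then be passed to `bs4.BeautifulSoup(..., "html5lib")` so that the iframe is closed explicitly before the div starts.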

Sorry.


On Wednesday, August 26, 2015, 15:42:43 (UTC+2), Fanny Dwargee wrote:
Hi,
Parsing with the 'html5lib' parser builds an erroneous document if the HTML input string includes "self-closing" HTML elements.

Tested on Debian Jessie with BeautifulSoup v4.4.0 and Python v2.7.9.

Take a look at the following example (notice the "self-closing" iframe):

root@14c0ede392c2:/tmp# python
Python 2.7.9 (default, Mar  1 2015, 12:57:24)
[GCC 4.9.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import bs4
>>>
>>> bs4.__version__
'4.4.0'
>>>
>>> html = '<html><iframe src="http://www.google.es"/><div id="test"></div></html>'
>>>
>>> bs4.BeautifulSoup( html, "html5lib" )
<html><head></head><body><iframe src="http://www.google.es">&lt;div id="test"&gt;&lt;/div&gt;&lt;/html&gt;</iframe></body></html>
>>>

So, as you can see, the "div" element was lost by the parser: it ended up as escaped text inside the iframe. With 'html.parser', on the other hand, everything is parsed correctly.
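
The different result with 'html.parser' may be easier to see by looking at the events that Python's standard-library parser emits for the same input. The sketch below assumes Python 3 (where the module is `html.parser`; the session above uses Python 2.7) and uses a hypothetical `EventRecorder` class that only logs tag events. It does not involve Beautiful Soup at all.

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Hypothetical helper: records which tag callbacks the parser fires."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_startendtag(self, tag, attrs):
        # Called for XHTML-style tags written as <tag ... />
        self.events.append(("startend", tag))


html = '<html><iframe src="http://www.google.es"/><div id="test"></div></html>'
recorder = EventRecorder()
recorder.feed(html)
recorder.close()

for event in recorder.events:
    print(event)
```

Here the iframe should be reported as a single start-and-end event, so the div that follows is seen as a normal tag. An HTML5 tokenizer such as html5lib instead ignores the trailing slash on a non-void element, which leaves the iframe open and turns everything after it into text.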

Regards,

Fanny