Re: why <br> is not regarded as <br /> in some cases

Aaron DeVore

Feb 11, 2013, 11:05:01 PM
to beauti...@googlegroups.com
> BeautifulSoup 3 gets this right, BeautifulSoup 4 gets it wrong (<br> is a void element that should never have any content).

BeautifulSoup 4 gets it wrong only with Python's built-in parser (html.parser). BS4 with the lxml HTML backend handles <br> as expected, and BS4 picks the lxml backend automatically when lxml is installed and no parser is named.
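
If you would rather not depend on the automatic choice, you can name the parser yourself as the second argument to the BeautifulSoup constructor. Here is a rough sketch of that, assuming Python 3 and an installed beautifulsoup4. The helper names pick_parser and br_tags_are_empty are made up for illustration and are not part of BS4; the preference order (lxml, then html5lib, then html.parser) is the one the BS4 documentation describes.

```python
import importlib.util

# Parser names accepted by the BeautifulSoup constructor, paired with the
# module that must be importable, in the order BS4 prefers them.
PARSER_MODULES = [("lxml", "lxml"), ("html5lib", "html5lib")]


def pick_parser():
    """Hypothetical helper: best installed parser name, else html.parser."""
    for module_name, parser_name in PARSER_MODULES:
        if importlib.util.find_spec(module_name) is not None:
            return parser_name
    return "html.parser"


def br_tags_are_empty(html, parser=None):
    """Hypothetical helper: True if no <br> in the tree has child nodes."""
    from bs4 import BeautifulSoup  # imported lazily; needs beautifulsoup4

    # Name the parser explicitly so the result does not depend on what
    # happens to be installed on the machine.
    soup = BeautifulSoup(html, parser or pick_parser())
    return all(not br.contents for br in soup.find_all("br"))
```

Calling br_tags_are_empty(html, "lxml") and br_tags_are_empty(html, "html.parser") on the same markup should show whether the two backends disagree in your installation.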

-Aaron DeVore


On Mon, Feb 11, 2013 at 9:35 AM, <dgsp...@gmail.com> wrote:
BeautifulSoup 3 gets this right, BeautifulSoup 4 gets it wrong (<br> is a void element that should never have any content).

My question is: "Is BeautifulSoup 4 fundamentally broken in this regard, or is there some option I'm missing that gets it to treat unterminated <br> tags the same way as BeautifulSoup 3 did?" (I would like to use BeautifulSoup 4, but it's currently making my life difficult in a way that BeautifulSoup 3 doesn't.)

Here's an example: given two soups (soup3, soup4) constructed from the same HTML using BeautifulSoup 3 and BeautifulSoup 4, respectively, notice the "nested" contents of snippet4 as compared to the "flat" contents of snippet3:

>>> snippet3 = soup3.find(text=re.compile('^Date')).parent.parent
>>> snippet4 = soup4.find(text=re.compile('^Date')).parent.parent

>>> snippet3
<p>
<strong>Date &amp; Time: </strong>02/02/2013 1:02pm<br />
<strong>Invoice Number: </strong><br />
<strong>Auth #: </strong><br />
<strong>Customer Name: </strong>Cards</p>

>>> snippet4
<p>
<strong>Date &amp; Time: </strong>02/02/2013 1:02pm<br>
<strong>Invoice Number: </strong><br>
<strong>Auth #: </strong><br>
<strong>Customer Name: </strong>Cards</br></br></br></p>

>>> snippet3.contents
[u'\n', <strong>Date &amp; Time: </strong>, u'02/02/2013 1:02pm', <br />, u'\n', <strong>Invoice Number: </strong>, <br />, u'\n', <strong>Auth #: </strong>, <br />, u'\n', <strong>Customer Name: </strong>, u'Cards']

>>> snippet4.contents
[u'\n', <strong>Date &amp; Time: </strong>, u'02/02/2013 1:02pm', <br>
<strong>Invoice Number: </strong><br>
<strong>Auth #: </strong><br>
<strong>Customer Name: </strong>Cards</br></br></br>]

>>> snippet4.contents[3].contents
[u'\n', <strong>Invoice Number: </strong>, <br>
<strong>Auth #: </strong><br>
<strong>Customer Name: </strong>Cards</br></br>]



