BeautifulSoup 3 gets this right; BeautifulSoup 4 gets it wrong (<br> is a void element and should never have any content).
My question is: "Is BeautifulSoup 4 fundamentally broken in this regard, or is there some option I'm missing that gets it to treat unterminated <br> tags the same way as BeautifulSoup 3 did?" (I would like to use BeautifulSoup 4, but it's currently making my life difficult in a way that BeautifulSoup 3 doesn't.)
Here's an example: given two soups (soup3, soup4) constructed from the same HTML using BeautifulSoup 3 and BeautifulSoup 4, respectively, notice the "nested" contents of snippet4 as compared to the "flat" contents of snippet3:
>>> snippet3 = soup3.find(text=re.compile('^Date')).parent.parent
>>> snippet4 = soup4.find(text=re.compile('^Date')).parent.parent
>>> snippet3
<p>
<strong>Date & Time: </strong>02/02/2013 1:02pm<br />
<strong>Invoice Number: </strong><br />
<strong>Auth #: </strong><br />
<strong>Customer Name: </strong>Cards</p>
>>> snippet4
<p>
<strong>Date & Time: </strong>02/02/2013 1:02pm<br>
<strong>Invoice Number: </strong><br>
<strong>Auth #: </strong><br>
<strong>Customer Name: </strong>Cards</br></br></br></p>
>>> snippet3.contents
[u'\n', <strong>Date & Time: </strong>, u'02/02/2013 1:02pm', <br />, u'\n', <strong>Invoice Number: </strong>, <br />, u'\n', <strong>Auth #: </strong>, <br />, u'\n', <strong>Customer Name: </strong>, u'Cards']
>>> snippet4.contents
[u'\n', <strong>Date & Time: </strong>, u'02/02/2013 1:02pm', <br>
<strong>Invoice Number: </strong><br>
<strong>Auth #: </strong><br>
<strong>Customer Name: </strong>Cards</br></br></br>]
>>> snippet4.contents[3].contents
[u'\n', <strong>Invoice Number: </strong>, <br>
<strong>Auth #: </strong><br>
<strong>Customer Name: </strong>Cards</br></br>]
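
To make the "void element" point concrete without depending on either BeautifulSoup version, here is a minimal sketch using only the standard library's html.parser. It is not BeautifulSoup's code, and it does not claim to be the cause of the behaviour above; TinyTreeBuilder, Node, build and first are hypothetical names made up for this illustration. The standard HTMLParser reports an unterminated <br> as a start tag with no end tag, so whether the result is "flat" or "nested" depends on whether the tree builder closes void elements itself:

```python
from html.parser import HTMLParser

# Void elements per the HTML spec: they can never have children.
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
                 "link", "meta", "source", "track", "wbr"}


class Node:
    def __init__(self, name):
        self.name = name
        self.children = []  # child Nodes and text strings, in document order


class TinyTreeBuilder(HTMLParser):
    """Toy tree builder for illustration only (hypothetical, not bs4)."""

    def __init__(self, respect_void):
        super().__init__()
        self.respect_void = respect_void
        self.root = Node("[document]")
        self.stack = [self.root]  # currently open elements

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        # An unterminated <br> arrives here with no matching end-tag event,
        # so the builder has to decide for itself not to leave it open.
        if not (self.respect_void and tag in VOID_ELEMENTS):
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this name, if there is one.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].name == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        self.stack[-1].children.append(data)


def build(html, respect_void):
    builder = TinyTreeBuilder(respect_void)
    builder.feed(html)
    builder.close()
    return builder.root


def first(node, name):
    """Depth-first search for the first element called `name`."""
    for child in node.children:
        if isinstance(child, Node):
            if child.name == name:
                return child
            found = first(child, name)
            if found is not None:
                return found
    return None


# Shortened version of the markup in the question.
HTML = ("<p><strong>Invoice Number: </strong><br>\n"
        "<strong>Auth #: </strong><br>\nCards</p>")

flat = first(build(HTML, respect_void=True), "br")
nested = first(build(HTML, respect_void=False), "br")

print("void-aware builder, first <br> children:", len(flat.children))
print("naive builder, first <br> children:", len(nested.children))
```

With respect_void=True the first <br> ends up empty, like snippet3; with respect_void=False everything after it is swallowed into the <br>, like snippet4.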