BeautifulSoup 4 gets this wrong when it uses Python's built-in parser
(html.parser). BS4 with the lxml HTML backend works as expected, and BS4
picks lxml automatically when it is installed and you don't name a parser.
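
If you'd rather not depend on whichever parser happens to be installed, you
can name it explicitly. Here is a rough sketch for comparing backends; the
HTML string is a made-up sample modelled on the markup quoted below, and
br_child_counts is a hypothetical helper of mine, not part of bs4. The result
for html.parser depends on your Python and bs4 versions, so I haven't listed
expected output.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Made-up sample modelled on the quoted markup (an assumption, not the real page).
HTML = (
    "<p>\n"
    "<strong>Date &amp; Time: </strong>02/02/2013 1:02pm<br>\n"
    "<strong>Invoice Number: </strong><br>\n"
    "<strong>Auth #: </strong><br>\n"
    "<strong>Customer Name: </strong>Cards</p>"
)


def br_child_counts(html, parser):
    """Return len(br.contents) for every <br> found in the parsed tree.

    A list of zeros means the parser treated <br> as a void element;
    any non-zero entry means later siblings were nested inside a <br>.
    """
    soup = BeautifulSoup(html, parser)
    return [len(br.contents) for br in soup.find_all("br")]


if __name__ == "__main__":
    for parser in ("html.parser", "lxml"):
        try:
            print(parser, br_child_counts(HTML, parser))
        except FeatureNotFound:
            # bs4 raises this when the requested backend is not installed.
            print(parser, "is not installed")
```

Passing "lxml" as the second argument pins the backend, so the tree you get
no longer changes when lxml is added to or removed from the environment.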
On Mon, Feb 11, 2013 at 9:35 AM, <dgspot...@gmail.com> wrote:
> BeautifulSoup 3 gets this right, BeautifulSoup 4 gets it wrong (<br> is a
> void element that should never have any content).
> My question is: "Is BeautifulSoup 4 fundamentally broken in this regard,
> or is there some option I'm missing that gets it to treat unterminated
> <br> tags the same way as BeautifulSoup 3 did?" (I would like to use
> BeautifulSoup 4, but it's currently making my life difficult in a way that
> BeautifulSoup 3 doesn't.)
> Here's an example: given two soups (soup3, soup4) constructed from the
> same HTML using BeautifulSoup 3 and BeautifulSoup 4, respectively, notice
> the "nested" contents of snippet4 as compared to the "flat" contents of
> snippet3:
> >>> snippet3 = soup3.find(text=re.compile('^Date')).parent.parent
> >>> snippet4 = soup4.find(text=re.compile('^Date')).parent.parent
> >>> snippet3
> <p>
> <strong>Date & Time: </strong>02/02/2013 1:02pm<br />
> <strong>Invoice Number: </strong><br />
> <strong>Auth #: </strong><br />
> <strong>Customer Name: </strong>Cards</p>
> >>> snippet4
> <p>
> <strong>Date & Time: </strong>02/02/2013 1:02pm<br>
> <strong>Invoice Number: </strong><br>
> <strong>Auth #: </strong><br>
> <strong>Customer Name: </strong>Cards</br></br></br></p>
> >>> snippet3.contents
> [u'\n', <strong>Date & Time: </strong>, u'02/02/2013 1:02pm', <br />,
> u'\n', <strong>Invoice Number: </strong>, <br />, u'\n', <strong>Auth #:
> </strong>, <br />, u'\n', <strong>Customer Name: </strong>, u'Cards']
> >>> snippet4.contents
> [u'\n', <strong>Date & Time: </strong>, u'02/02/2013 1:02pm', <br>
> <strong>Invoice Number: </strong><br>
> <strong>Auth #: </strong><br>
> <strong>Customer Name: </strong>Cards</br></br></br>]
> >>> snippet4.contents[3].contents
> [u'\n', <strong>Invoice Number: </strong>, <br>
> <strong>Auth #: </strong><br>
> <strong>Customer Name: </strong>Cards</br></br>]