BeautifulSoup.find_all not working with anchors containing a line-break


Sergei Bykov

Aug 11, 2014, 5:44:51 PM
to beauti...@googlegroups.com
Hello!

I'm working with HTML/XHTML links in Beautiful Soup 4.3.2 and have run into some strange behavior when a <br> occurs inside an <a> element.

import re
from bs4 import BeautifulSoup

html = BeautifulSoup('<html><head></head><body><a href="/track?no=ABCD0000000">ABCD0000000<br /></a></body></html>')
html.find_all('a', text=re.compile('ABCD0000000', re.IGNORECASE))

Gives an empty list.
It seems to be caused by the <br> tag inside the <a> tag. Hmm. Well, let's replace it with a newline:

html.find('br').replaceWith('\n')
html.find_all('a', text=re.compile('ABCD0000000', re.IGNORECASE))

Again an empty list, damn.
Maybe,

html.find('br').replaceWith('')
html.find_all('a', text=re.compile('ABCD0000000', re.IGNORECASE))

The same result.
But

html = BeautifulSoup('<html><head></head><body><a href="/track?no=ABCD0000000">ABCD0000000</a></body></html>')
html.find_all('a', text=re.compile('ABCD0000000', re.IGNORECASE))

[<a href="/track?no=ABCD0000000">ABCD0000000</a>]

- Works fine.

Is this a bug or do I misunderstand something?

As far as I can see, the fastest and simplest workaround is to strip or replace the <br> tags before feeding the data to bs4:

import re
# Raw string so that \s reaches the regex engine unchanged
re.sub(r'<br\s*/>', '\n', '<html><head></head><body><a href="/track?no=ABCD0000000">ABCD0000000<br /></a></body></html>', flags=re.IGNORECASE)

Or is there another way?
Thanks for any suggestions and comments!
Best regards,
~S.

Aaron DeVore

Aug 11, 2014, 8:17:55 PM
to beauti...@googlegroups.com
If you replace <br> with a \n, you get this tree:

<html>
  <head></head>
  <body>
    <a href="/track?no=ABCD0000000">
      "ABCD0000000"
      "\n"
    </a>
  </body>
</html>

text=... only matches a tag that has exactly one child, and only when that child is a string. Because <a> has two children here, text=... refuses to do any matching.

The cleanest way to remove <br> tags is something like this:

    for br in soup.find_all('br'):
        br.extract()

Then you don't have to deal with weird HTML corner cases.
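
The one-child rule described above can be checked through the .string attribute, which is None as soon as a tag has more than one child. The following is a minimal sketch, assuming the standard-library html.parser backend. The helper name find_links_by_text is hypothetical, and matching against get_text() is an alternative swapped in for illustration, not something proposed in this thread.

```python
import re
from bs4 import BeautifulSoup

MARKUP = '<a href="/track?no=ABCD0000000">ABCD0000000<br /></a>'
PATTERN = re.compile('ABCD0000000', re.IGNORECASE)


def find_links_by_text(soup, pattern):
    """Hypothetical helper: match <a> tags on their combined text content."""
    return soup.find_all(
        lambda tag: tag.name == 'a'
        and pattern.search(tag.get_text()) is not None
    )


soup = BeautifulSoup(MARKUP, 'html.parser')
link = soup.find('a')

# The <a> tag has two children (a string and a <br/> tag),
# so .string cannot pick a single string and returns None.
single_string = link.string

# get_text() concatenates every string below the tag,
# so the number of children no longer matters.
matches = find_links_by_text(soup, PATTERN)
```

Because the filter function receives each tag and looks at all of its text, it does not depend on how the parser split the tag's contents into children.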



Sergei Bykov

Aug 12, 2014, 9:45:18 AM
to beauti...@googlegroups.com
Aaron,

I don't think that's quite correct.
This code doesn't yield '\n' as a separate child, maybe because the node type of "\n" is text (not an element).

html = BeautifulSoup("""<html><head></head><body><a href="/track?no=ABCD0000000">ABCD0000000\n\n</a>text</body></html>""")
for i in html.find_all():
    print("%s: %s" % (i.name, i))

html: <html><head></head><body><a href="/track?no=ABCD0000000">ABCD0000000

</a>text</body></html>
head: <head></head>
body: <body><a href="/track?no=ABCD0000000">ABCD0000000

</a>text</body>
a: <a href="/track?no=ABCD0000000">ABCD0000000

</a>


This also works fine:

html.find_all('a', text=re.compile('ABCD0000000'))
[<a href="/track?no=ABCD0000000">ABCD0000000

</a>]


Sergei Bykov

Aug 12, 2014, 12:21:24 PM
to beauti...@googlegroups.com
To be more precise, let's take this example:

html = [
    BeautifulSoup("""<html><head></head><body><a href="/track?no=ABCD0000000">ABCD0000000<br/>\n</a>text</body></html>"""),
    BeautifulSoup("""<html><head></head><body><a href="/track?no=ABCD0000000">ABCD0000000\n</a>text</body></html>""")
]


Search result:
html[0].find_all('a', text=re.compile('ABCD0000000'))
[]
html[1].find_all('a', text=re.compile('ABCD0000000'))
[<a href="/track?no=ABCD0000000">ABCD0000000
</a>]


Extract the line break and search again, still no result:
for br in html[0].find_all('br'):
    br.extract()
    # br.decompose() gives the same result, i.e. no result

html[0].find_all('a', text=re.compile('ABCD0000000'))
[]


Let's see the difference between source data:
html
[

    <html><head></head><body><a href="/track?no=ABCD0000000">ABCD0000000
</a>text</body></html>,
    <html><head></head><body><a href="/track?no=ABCD0000000">ABCD0000000
</a>text</body></html>
]


I don't see any difference, and neither does the Python 3 interpreter:

html[0].encode() == html[1].encode()
True


But it seems that something is still present somewhere in the BS object tree:

html[0].find_all('a', text=re.compile('ABCD0000000'))
[]
html[1].find_all('a', text=re.compile('ABCD0000000'))
[<a href="/track?no=ABCD0000000">ABCD0000000
</a>]
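
The identical serialization can hide a difference in the tree: removing <br/> leaves the strings on either side of it as two separate children, so the <a> tag still has more than one child. Below is a sketch that shows this, assuming the html.parser backend. The last step additionally assumes Beautiful Soup 4.8.0 or later, where smooth() merges adjacent strings; it is not available in the 4.3.2 release discussed here.

```python
from bs4 import BeautifulSoup

with_br = BeautifulSoup(
    '<a href="/track?no=ABCD0000000">ABCD0000000<br/>\n</a>', 'html.parser')
without_br = BeautifulSoup(
    '<a href="/track?no=ABCD0000000">ABCD0000000\n</a>', 'html.parser')

# Remove the <br/> tag; the strings before and after it stay separate.
for br in with_br.find_all('br'):
    br.extract()

# Both trees serialize to the same markup...
same_markup = str(with_br) == str(without_br)

# ...but the number of children under <a> differs.
children_after_extract = len(with_br.a.contents)
children_plain = len(without_br.a.contents)

# Assumption: bs4 >= 4.8.0. smooth() merges adjacent strings in place.
with_br.smooth()
children_after_smooth = len(with_br.a.contents)
merged_string = with_br.a.string
```

After smoothing, the <a> tag has a single string child again, which is the shape that a text=... search expects.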


By the way, simply extracting the line break is not suitable for all cases, because 'A<br/>B' would become a single word, although originally there were two (A and B).
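
One possible way to keep A and B apart is to leave the <br/> in place and read the text with a separator. This is a sketch under the assumption that the html.parser backend is used; the surrounding <a href="#"> markup is made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the 'A<br/>B' case wrapped in an <a> tag.
soup = BeautifulSoup('<a href="#">A<br/>B</a>', 'html.parser')
link = soup.a

# Plain get_text() glues the two strings together.
joined = link.get_text()

# A separator is inserted between the individual strings.
separated = link.get_text(separator=' ')

# stripped_strings yields each non-empty string on its own.
words = list(link.stripped_strings)
```

Reading the text this way leaves the tree untouched, so no adjacent strings are created by removing or replacing tags.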

Best regards,
~S.