I have a badly formed HTML file that contains something like:
<a style="text-decoration: none;" class="smalltitle" href="xxxxxxxxxx">
<font style="margin: 0px; padding: 0px;" color="#000000">
U.R.P.</font></a>
First I pass the whole HTML to the prettify() function and obtain:
<a class="smalltitle" href="xxxxxxxxxx" style="text-decoration: none;">
<font color="#000000" style="margin: 0px; padding: 0px;">
U.R.P.
</font>
</a>
Then I need to search for *URP* or *U.R.P.* and check that it's inside an <a>
(hyperlink) tag. The problem is that find_all() does an exact (==) match on
a plain string, but prettify() always adds whitespace for indentation.
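To make the mismatch concrete, here is a minimal sketch. It assumes an inline HTML string in place of my real index.html, the built-in html.parser instead of lxml, and the newer string= keyword (the same filter as text=):

```python
from bs4 import BeautifulSoup

# Inline stand-in for the real page (assumption: the actual input is index.html).
raw = ('<a style="text-decoration: none;" class="smalltitle" href="xxxxxxxxxx">'
       '<font style="margin: 0px; padding: 0px;" color="#000000">'
       'U.R.P.</font></a>')

# Prettify, then parse the prettified markup again, as in my code.
pretty = BeautifulSoup(raw, "html.parser").prettify()
soup = BeautifulSoup(pretty, "html.parser")

# The text node now carries the indentation that prettify() inserted.
text_node = soup.find("font").string
print(repr(text_node))

# An exact string filter no longer matches the bare text...
exact = soup.find_all("font", string="U.R.P.")
print(exact)

# ...although the content is unchanged once the whitespace is stripped.
stripped = text_node.strip()
print(stripped)
```

The exact amount of whitespace depends on the nesting depth, so hard-coding it in the filter seems fragile.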
This is what I do without a regular expression, just to check that my
code is right:
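from bs4 import BeautifulSoup
import re
# Parse, prettify, then parse the prettified markup again.
html = BeautifulSoup(BeautifulSoup(open("index.html"), "lxml").prettify(), "lxml")
# Walk up from the first matching <font> and look for an enclosing <a>.
for parent in html.find_all("font", text="\n U.R.P.\n ")[0].parents:
    if parent.name == "a":
        print('found!')
        break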
The next step is to replace that ugly text="\n U.R.P.\n " with a
regular expression:
from bs4 import BeautifulSoup
import re
html = BeautifulSoup(BeautifulSoup(open("index.html"), "lxml").prettify(), "lxml")
# Raw string, so the backslashes reach the regex engine unchanged.
pattern = re.compile(r"\s*U\.R\.P\.\s*")
print(html.find_all(text=pattern))  # first try
print(html.find_all("a", text=pattern))  # second try
The first find_all() correctly finds the string, BUT I need the tag, not
the string.
The second find_all() should work exactly the way I need, BUT it doesn't
find anything.
What am I doing wrong?
In case that was not clear: I need to search for *URP* or *U.R.P.* in the whole
page (allowing for any surrounding whitespace and tabs) and check that it's
inside an <a> (hyperlink) tag. Is there a better way to do this than the one
I'm trying?
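One direction that might work, as a rough sketch only: match the text node itself with a regular expression and then walk up to the nearest enclosing <a> with find_parent(). The helper name links_containing, the inline sample markup, and the pattern (which also accepts mixed forms such as "U.RP") are my own assumptions, and I use html.parser and the string= keyword here rather than lxml and text=:

```python
import re
from bs4 import BeautifulSoup

# Inline stand-in for the real page, plus one match outside any link.
raw = ('<a class="smalltitle" href="xxxxxxxxxx"><font color="#000000">\n'
       'U.R.P.</font></a><p>URP outside a link</p>')
soup = BeautifulSoup(raw, "html.parser")

# Matches "URP" or "U.R.P." anywhere inside a text node.
pattern = re.compile(r"U\.?R\.?P\.?")


def links_containing(soup, pattern):
    """Return the <a> tags that enclose a text node matching pattern."""
    links = []
    seen = set()
    for text_node in soup.find_all(string=pattern):
        # find_parent() returns None when the text is not inside an <a>.
        link = text_node.find_parent("a")
        if link is not None and id(link) not in seen:
            seen.add(id(link))
            links.append(link)
    return links


links = links_containing(soup, pattern)
for link in links:
    print(link["href"])
```

Because the pattern is applied with a search rather than an exact comparison, this would not need the prettify() step or any knowledge of the indentation. It also does not rely on the <a> tag's own .string, which as far as I understand is None whenever the tag has more than one child.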
Cheers