find('p', text='&nbsp;') and &nbsp; within tags within p

AdamAtNCPC

Dec 3, 2008, 2:39:46 PM
to beautifulsoup
I'm writing a script to remove <p>&nbsp;</p>.

My original code (cleanHTML_methodOne below) was a little too
aggressive and would remove other <p> tags as well.

I thought it could be solved with regex (cleanHTML_methodTwo below),
but no dice.

Finally, I tried going by the notion that a <p> containing a tag and
&nbsp; would find a child (cleanHTML_methodThree below). The code's
not pretty, and doesn't work perfectly (I'm still stripping an
emphasized non-breaking space), but it should be good enough for now.

If anyone has encountered this and has a better solution (some kind of
differentiation between childText and descendantText?), please let me
know.

Thanks,
Adam

# Every result should contain foo, bar, and boz. If bar is removed, I
# have a problem.
dataSet = [
    "<p>foo</p><p>&nbsp;</p><p>bar</p><p>&nbsp;</p><p>boz</p>",
    "<p>foo</p><p>&nbsp;</p><p>bar&nbsp;</p><p>&nbsp;</p><p>boz</p>",
    "<p>foo</p><p>&nbsp;</p><p>&nbsp;bar</p><p>&nbsp;</p><p>boz</p>",
    "<p>foo</p><p>&nbsp;</p><p>bar<br />&nbsp;</p><p>&nbsp;</p><p>boz</p>",
    "<p>foo</p><p>&nbsp;</p><p>&nbsp;<br />bar</p><p>&nbsp;</p><p>boz</p>",
    "<p>foo</p><p>&nbsp;</p><p>bar<em>&nbsp;</em></p><p>&nbsp;</p><p>boz</p>",
    "<p>foo</p><p>&nbsp;</p><p>&nbsp;<em>bar</em></p><p>&nbsp;</p><p>boz</p>",
    "<p>foo</p><p>&nbsp;<em>bar</em></p><p>boz</p>"
]

from BeautifulSoup import BeautifulSoup

import re


def cleanHTML_methodOne (data):
    soup = BeautifulSoup(data)
    while soup.find('p', text='&nbsp;'):
        soup.find('p', text='&nbsp;').parent.replaceWith('')
    return str(soup)

def cleanHTML_methodTwo (data):
    soup = BeautifulSoup(data)
    while soup.find('p', text=re.compile('^&nbsp;$')):
        soup.find('p', text='&nbsp;').parent.replaceWith('')
    return str(soup)

def cleanHTML_methodThree (data):
    soup = BeautifulSoup(data)
    while None in [result.parent.findChild()
                   for result in soup.findAll('p', text='&nbsp;')]:
        for paragraph in soup.findAll('p', text='&nbsp;'):
            if not paragraph.parent.findChild():
                paragraph.parent.replaceWith('')
    return str(soup)


soupSetOne = [cleanHTML_methodOne(data) for data in dataSet]
soupSetTwo = [cleanHTML_methodTwo(data) for data in dataSet]
soupSetThree = [cleanHTML_methodThree(data) for data in dataSet]

output = ''

output += "soupSetOne:\n"
for soup in soupSetOne:
    output += soup + '\n'

output += "soupSetTwo:\n"
for soup in soupSetTwo:
    output += soup + '\n'

output += "soupSetThree:\n"
for soup in soupSetThree:
    output += soup + '\n'

print output

OUTPUT:
soupSetOne:
<p>foo</p><p>bar</p><p>boz</p>
<p>foo</p><p>bar&nbsp;</p><p>boz</p>
<p>foo</p><p>&nbsp;bar</p><p>boz</p>
<p>foo</p><p>boz</p>
<p>foo</p><p>boz</p>
<p>foo</p><p>bar</p><p>boz</p>
<p>foo</p><p>boz</p>
<p>foo</p><p>boz</p>
soupSetTwo:
<p>foo</p><p>bar</p><p>boz</p>
<p>foo</p><p>bar&nbsp;</p><p>boz</p>
<p>foo</p><p>&nbsp;bar</p><p>boz</p>
<p>foo</p><p>boz</p>
<p>foo</p><p>boz</p>
<p>foo</p><p>bar</p><p>boz</p>
<p>foo</p><p>boz</p>
<p>foo</p><p>boz</p>
soupSetThree:
<p>foo</p><p>bar</p><p>boz</p>
<p>foo</p><p>bar&nbsp;</p><p>boz</p>
<p>foo</p><p>&nbsp;bar</p><p>boz</p>
<p>foo</p><p>bar<br />&nbsp;</p><p>boz</p>
<p>foo</p><p>&nbsp;<br />bar</p><p>boz</p>
<p>foo</p><p>bar</p><p>boz</p>
<p>foo</p><p>&nbsp;<em>bar</em></p><p>boz</p>
<p>foo</p><p>&nbsp;<em>bar</em></p><p>boz</p>

Aaron DeVore

Dec 4, 2008, 12:59:40 AM
to beauti...@googlegroups.com
There is, indeed, a better way.

def cleanHTML(data):
    soup = BeautifulSoup(data)
    for text in soup.findAll(text="&nbsp;"):
        if text.parent.name == 'p' and len(text.parent.contents) == 1:
            text.parent.extract()
    return unicode(soup)

That finds every &nbsp; text node, then removes its parent if the
parent is a <p> tag and the &nbsp; is its only child.
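
For anyone reading this later with Beautiful Soup 4 on Python 3, here is a rough port of the same idea. It is only a sketch and rests on a few assumptions that are not from this thread: the bs4 package with the built-in "html.parser" backend, which turns &nbsp; into the character U+00A0 while parsing, and a hypothetical function name clean_html. It walks the <p> tags rather than the text nodes, but the test is the same: remove a <p> only when its single child is a lone non-breaking space.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Assumption: bs4 with html.parser converts "&nbsp;" to U+00A0 on parse.
NBSP = "\u00a0"


def clean_html(data):
    """Drop every <p> whose only child is a single non-breaking space."""
    soup = BeautifulSoup(data, "html.parser")
    # find_all() returns a list, so removing tags while looping is safe.
    for p in soup.find_all("p"):
        only_child = p.contents[0] if len(p.contents) == 1 else None
        # A direct text child equal to NBSP; <p><em>&nbsp;</em></p> is kept
        # because its only child is a tag, not a string.
        if isinstance(only_child, NavigableString) and only_child == NBSP:
            p.extract()
    # formatter="html" asks bs4 to write named entities back out.
    return soup.decode(formatter="html")


if __name__ == "__main__":
    sample = ("<p>foo</p><p>&nbsp;</p><p>bar<em>&nbsp;</em></p>"
              "<p>&nbsp;</p><p>boz</p>")
    print(clean_html(sample))
```

The emphasized non-breaking space survives here because the check looks at the direct children of each <p>, not at every descendant.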

Here are some problems that I spotted:
- When find* gets text="..." as an argument it completely ignores name
(the first argument). That's why an <em> tag is getting caught.
- In the first and second examples the function looks for every
&nbsp;, then *always* replaces its parent with an empty string. In
some examples the parent included other contents.
'len(text.parent.contents) == 1' makes sure the &nbsp; is a single
child.
- Use extract() instead of replaceWith('') if you're removing an element.
- There is no need to use a while statement (I don't even totally
understand why it works in cleanHTML_methodThree). Just iterate over
the list returned by findAll. Using a while statement with find incurs
a significant overhead. Beautiful Soup must create a SoupStrainer and
scan every single node until it hits a &nbsp; node each time find() is
called. Ouch!
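
To make the cost argument in the last point concrete, here is a toy model. It is not Beautiful Soup's internals: a flat list of strings stands in for the parse tree, and both function names are made up for illustration. It only counts how many nodes get looked at when the scan restarts from the top after every removal, compared with a single pass.

```python
# Toy model only: a flat list of strings stands in for the parse tree.

def remove_with_repeated_find(nodes, target):
    """Mimic `while soup.find(...)`: restart the scan from the top each time."""
    nodes = list(nodes)
    visits = 0
    while True:
        hit = None
        for i, node in enumerate(nodes):
            visits += 1
            if node == target:
                hit = i
                break
        if hit is None:
            # The final scan walks the whole list just to learn it is clean.
            return nodes, visits
        del nodes[hit]


def remove_with_single_pass(nodes, target):
    """Mimic `for x in soup.findAll(...)`: look at every node exactly once."""
    kept = []
    visits = 0
    for node in nodes:
        visits += 1
        if node != target:
            kept.append(node)
    return kept, visits


if __name__ == "__main__":
    nodes = ["foo", "&nbsp;", "bar", "&nbsp;", "boz"]
    print(remove_with_repeated_find(nodes, "&nbsp;"))  # (['foo', 'bar', 'boz'], 8)
    print(remove_with_single_pass(nodes, "&nbsp;"))    # (['foo', 'bar', 'boz'], 5)
```

The gap grows with the number of matches, and the original loops call find() twice per iteration (once in the while condition, once in the body), so the real overhead is larger than this model suggests.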

Incidentally, findChild() is just an alias for find().

Good luck!
Aaron

AdamAtNCPC

Dec 4, 2008, 9:09:32 AM
to beautifulsoup
Thanks for the quick reply! That makes sense...I _now_ remember
reading that the "name" argument is ignored when using the "text"
argument.

FWIW, I was using a while loop because I figured it was best to keep
scrubbing the incoming HTML until it shone. That is, if one of the
pages had <p>&nbsp;<p>&nbsp;</p></p>, I thought it would take multiple
passes to wipe it out. But, now that the while loop has been
questioned, I tested that and I can see BeautifulSoup is smart enough
to refactor that use-case to <p>&nbsp;</p><p>&nbsp;</p> ...thus, I
can't think of any reason to have the while loop, and I should've put
more faith in this beautiful BeautifulSoup from the start. Thanks
again!

In BeautifulSoup I trust!

Adam

Aaron DeVore

Dec 4, 2008, 10:20:54 PM
to beauti...@googlegroups.com
Excellent! Happy writing.

-Aaron
