find('p',text='&nbsp;') and &nbsp; within tags within p

AdamAtNCPC

Dec 3, 2008, 2:39:46 PM

to beautifulsoup

I'm writing a script to remove <p>&nbsp;</p>.

My original code (cleanHTML_methodOne below) was a little too
aggressive and would remove other tags as well.

I thought it could be solved with regex (cleanHTML_methodTwo below),
but no dice.

Finally, I tried going by the notion that calling findChild() on a <p>
containing a tag as well as &nbsp; would find a child
(cleanHTML_methodThree below). The code's not pretty, and doesn't work
perfectly (I'm still stripping an emphasized non-breaking space), but it
should be good enough for now.

If anyone has encountered this and has a better solution (some kind of
differentiation between childText and descendantText?), please let me
know.

Thanks,
Adam

# Every result should contain foo, bar, and boz.
# If bar is removed, I have a problem.
dataSet = [
"foo bar boz",
"foo bar  boz",
"foo  bar boz",
"foo bar   boz",
"foo   bar boz",
"foo bar  boz",
"foo  bar boz",
"foo barboz"
]

from BeautifulSoup import BeautifulSoup
import re

def cleanHTML_methodOne(data):
    soup = BeautifulSoup(data)
    while soup.find('p', text='&nbsp;'):
        soup.find('p', text='&nbsp;').parent.replaceWith('')
    return str(soup)

def cleanHTML_methodTwo(data):
    soup = BeautifulSoup(data)
    while soup.find('p', text=re.compile('^&nbsp;$')):
        soup.find('p', text='&nbsp;').parent.replaceWith('')
    return str(soup)

def cleanHTML_methodThree(data):
    soup = BeautifulSoup(data)
    while None in [result.parent.findChild()
                   for result in soup.findAll('p', text='&nbsp;')]:
        for paragraph in soup.findAll('p', text='&nbsp;'):
            if not paragraph.parent.findChild():
                paragraph.parent.replaceWith('')
    return str(soup)

soupSetOne = [cleanHTML_methodOne(data) for data in dataSet]
soupSetTwo = [cleanHTML_methodTwo(data) for data in dataSet]
soupSetThree = [cleanHTML_methodThree(data) for data in dataSet]

output = ''

output += "soupSetOne:\n"
for soup in soupSetOne:
    output += soup + '\n'

output += "soupSetTwo:\n"
for soup in soupSetTwo:
    output += soup + '\n'

output += "soupSetThree:\n"
for soup in soupSetThree:
    output += soup + '\n'

print output

OUTPUT:
soupSetOne:
foobarboz
foobar boz
foo barboz
fooboz
fooboz
foobarboz
fooboz
fooboz
soupSetTwo:
foobarboz
foobar boz
foo barboz
fooboz
fooboz
foobarboz
fooboz
fooboz
soupSetThree:
foobarboz
foobar boz
foo barboz
foobar  boz
foo  barboz
foobarboz
foo barboz
foo barboz

Aaron DeVore

Dec 4, 2008, 12:59:40 AM

to beauti...@googlegroups.com

There is, indeed, a better way.

def cleanHTML(data):
    soup = BeautifulSoup(data)
    for text in soup.findAll(text="&nbsp;"):
        if text.parent.name == 'p' and len(text.parent.contents) == 1:
            text.parent.extract()
    return unicode(soup)

That finds every &nbsp;, then removes its parent if the parent is a <p>
tag and the &nbsp; is an only child.
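For readers who want to try the same "only child" rule without Beautiful Soup installed, here is a minimal sketch that swaps in the standard library's html.parser. It is not the code from this thread: the names EmptyParagraphStripper and strip_empty_paragraphs are made up for illustration, comments and doctypes are silently dropped, and end tags are re-emitted in lowercase.

```python
from html.parser import HTMLParser


class EmptyParagraphStripper(HTMLParser):
    """Re-emit HTML, dropping every <p> whose only content is one &nbsp;."""

    def __init__(self):
        # convert_charrefs=False reports &nbsp; through handle_entityref
        # instead of folding it into the surrounding text.
        super().__init__(convert_charrefs=False)
        self.out = []        # pieces that are definitely kept
        self.pending = None  # pieces held back since an undecided <p>

    def _write(self, piece):
        # Any other content proves the held-back <p> is not empty: keep it.
        if self.pending is not None:
            self.out.extend(self.pending)
            self.pending = None
        self.out.append(piece)

    def handle_starttag(self, tag, attrs):
        self._write(self.get_starttag_text())
        if tag == 'p':
            # Hold the <p> back until we know what it contains.
            self.pending = [self.out.pop()]

    def handle_startendtag(self, tag, attrs):
        self._write(self.get_starttag_text())

    def handle_entityref(self, name):
        if name == 'nbsp' and self.pending is not None and len(self.pending) == 1:
            self.pending.append('&nbsp;')
        else:
            self._write('&%s;' % name)

    def handle_charref(self, name):
        self._write('&#%s;' % name)

    def handle_data(self, data):
        self._write(data)

    def handle_endtag(self, tag):
        if tag == 'p' and self.pending is not None and len(self.pending) == 2:
            self.pending = None  # exactly <p>&nbsp;</p>: drop all three pieces
        else:
            self._write('</%s>' % tag)


def strip_empty_paragraphs(html):
    """Hypothetical helper: run the stripper over one HTML string."""
    parser = EmptyParagraphStripper()
    parser.feed(html)
    parser.close()
    # An unclosed trailing <p> is kept rather than guessed at.
    return ''.join(parser.out + (parser.pending or []))


print(strip_empty_paragraphs('foo<p>&nbsp;</p>bar<p><em>&nbsp;</em></p>boz'))
# foobar<p><em>&nbsp;</em></p>boz
```

The <p> that wraps an <em> survives because the <em> start tag arrives before any &nbsp;, which is the same distinction the len(text.parent.contents) == 1 check draws.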

Here are some problems that I spotted:
- When find* gets text="..." as an argument it completely ignores name
(the first argument). That's why an <em> tag is getting caught.
- In the first and second examples the function looks for every
&nbsp;, then *always* replaces its parent with an empty string. In
some examples the parent included other contents.
'len(text.parent.contents) == 1' makes sure the &nbsp; is an only
child.
- Use extract() instead of replaceWith('') if you're removing an element.
- There is no need to use a while statement (I don't even totally
understand why it works in cleanHTML_methodThree). Just iterate over
the list returned by findAll. Using a while statement with find incurs
a significant overhead. Beautiful Soup must create a SoupStrainer and
scan every single node until it hits a &nbsp; node each time find() is
called. Ouch!
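To make the cost in that last point concrete, here is a toy model, assuming the document is just a flat list of nodes. It says nothing about Beautiful Soup's internals; both function names are invented for this sketch. It only counts how many nodes get looked at when every removal restarts the search, compared with one pass over the list.

```python
def remove_with_repeated_find(nodes, target):
    """Restart the scan after every removal, like `while soup.find(...)`."""
    nodes = list(nodes)
    visits = 0
    while True:
        for index, node in enumerate(nodes):
            visits += 1
            if node == target:
                del nodes[index]
                break  # start over from the first node
        else:
            # A full scan found nothing left to remove.
            return nodes, visits


def remove_in_single_pass(nodes, target):
    """Look at each node exactly once, like iterating over findAll()."""
    kept = []
    visits = 0
    for node in nodes:
        visits += 1
        if node != target:
            kept.append(node)
    return kept, visits


sample = ['x', 'x', 'x', '&nbsp;', '&nbsp;']
print(remove_with_repeated_find(sample, '&nbsp;'))  # (['x', 'x', 'x'], 11)
print(remove_in_single_pass(sample, '&nbsp;'))      # (['x', 'x', 'x'], 5)
```

Both give the same result, but the restart version revisits the leading nodes once per removal plus one final scan that finds nothing. The original while loops also call find() twice per iteration, so they would do even more work than this model shows.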

Incidentally, findChild() is just an alias for find().

Good luck!
Aaron

AdamAtNCPC

Dec 4, 2008, 9:09:32 AM

to beautifulsoup

Thanks for the quick reply! That makes sense...I _now_ remember
reading that the "name" argument is ignored when using the "text"
argument.

FWIW, I was using a while loop because I figured it was best to keep
scrubbing the incoming HTML until it shone. That is, if one of the
pages had   , I thought it would take multiple
passes to wipe it out. But, now that the while loop has been
questioned, I tested that and I can see BeautifulSoup is smart enough
to refactor that use-case to    ...thus, I
can't think of any reason to have the while loop, and I should've put
more faith in this beautiful BeautifulSoup from the start. Thanks
again!

In BeautifulSoup I trust!

Adam

Aaron DeVore

Dec 4, 2008, 10:20:54 PM

to beauti...@googlegroups.com

Excellent! Happy writing.

-Aaron
