AdamAtNCPC
Dec 3, 2008, 2:39:46 PM
to beautifulsoup
I'm writing a script to remove <p>&nbsp;</p> (paragraphs that contain
nothing but a non-breaking space).
My original code (cleanHTML_methodOne below) was a little too
aggressive and would remove other <p> tags as well.
I thought it could be solved with regex (cleanHTML_methodTwo below),
but no dice.
Finally, I tried going by the notion that a <p> containing both a tag
and &nbsp; would find a child (cleanHTML_methodThree below). The code's
not pretty, and doesn't work perfectly (I'm still stripping an
emphasized non-breaking space), but it should be good enough for now.
If anyone has encountered this and has a better solution (some kind of
differentiation between childText and descendantText?), please let me
know.
Thanks,
Adam
# Every result should contain foo, bar, and boz. If bar is removed, I
# have a problem.
dataSet = [
    "<p>foo</p><p>&nbsp;</p><p>bar</p><p>&nbsp;</p><p>boz</p>",
    "<p>foo</p><p>&nbsp;</p><p>bar&nbsp;</p><p>&nbsp;</p><p>boz</p>",
    "<p>foo</p><p>&nbsp;</p><p>&nbsp;bar</p><p>&nbsp;</p><p>boz</p>",
    "<p>foo</p><p>&nbsp;</p><p>bar<br />&nbsp;</p><p>&nbsp;</p><p>boz</p>",
    "<p>foo</p><p>&nbsp;</p><p>&nbsp;<br />bar</p><p>&nbsp;</p><p>boz</p>",
    "<p>foo</p><p>&nbsp;</p><p>bar<em>&nbsp;</em></p><p>&nbsp;</p><p>boz</p>",
    "<p>foo</p><p>&nbsp;</p><p>&nbsp;<em>bar</em></p><p>&nbsp;</p><p>boz</p>",
    "<p>foo</p><p>&nbsp;<em>bar</em></p><p>boz</p>"
]

from BeautifulSoup import BeautifulSoup
import re

def cleanHTML_methodOne(data):
    soup = BeautifulSoup(data)
    # Remove the parent of every text node that is exactly '&nbsp;'.
    while soup.find('p', text='&nbsp;'):
        soup.find('p', text='&nbsp;').parent.replaceWith('')
    return str(soup)

def cleanHTML_methodTwo(data):
    soup = BeautifulSoup(data)
    # Same as method one, but the loop condition matches with an anchored regex.
    while soup.find('p', text=re.compile('^&nbsp;$')):
        soup.find('p', text='&nbsp;').parent.replaceWith('')
    return str(soup)

def cleanHTML_methodThree(data):
    soup = BeautifulSoup(data)
    # Only remove a parent that has no child tags, i.e. one that holds
    # nothing but the '&nbsp;' text node.
    while None in [result.parent.findChild()
                   for result in soup.findAll('p', text='&nbsp;')]:
        for paragraph in soup.findAll('p', text='&nbsp;'):
            if not paragraph.parent.findChild():
                paragraph.parent.replaceWith('')
    return str(soup)

soupSetOne = [cleanHTML_methodOne(data) for data in dataSet]
soupSetTwo = [cleanHTML_methodTwo(data) for data in dataSet]
soupSetThree = [cleanHTML_methodThree(data) for data in dataSet]

output = ''
output += "soupSetOne:\n"
for soup in soupSetOne:
    output += soup + '\n'
output += "soupSetTwo:\n"
for soup in soupSetTwo:
    output += soup + '\n'
output += "soupSetThree:\n"
for soup in soupSetThree:
    output += soup + '\n'
print output
OUTPUT:
soupSetOne:
<p>foo</p><p>bar</p><p>boz</p>
<p>foo</p><p>bar&nbsp;</p><p>boz</p>
<p>foo</p><p>&nbsp;bar</p><p>boz</p>
<p>foo</p><p>boz</p>
<p>foo</p><p>boz</p>
<p>foo</p><p>bar</p><p>boz</p>
<p>foo</p><p>boz</p>
<p>foo</p><p>boz</p>
soupSetTwo:
<p>foo</p><p>bar</p><p>boz</p>
<p>foo</p><p>bar&nbsp;</p><p>boz</p>
<p>foo</p><p>&nbsp;bar</p><p>boz</p>
<p>foo</p><p>boz</p>
<p>foo</p><p>boz</p>
<p>foo</p><p>bar</p><p>boz</p>
<p>foo</p><p>boz</p>
<p>foo</p><p>boz</p>
soupSetThree:
<p>foo</p><p>bar</p><p>boz</p>
<p>foo</p><p>bar&nbsp;</p><p>boz</p>
<p>foo</p><p>&nbsp;bar</p><p>boz</p>
<p>foo</p><p>bar<br />&nbsp;</p><p>boz</p>
<p>foo</p><p>&nbsp;<br />bar</p><p>boz</p>
<p>foo</p><p>bar</p><p>boz</p>
<p>foo</p><p>&nbsp;<em>bar</em></p><p>boz</p>
<p>foo</p><p>&nbsp;<em>bar</em></p><p>boz</p>
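
For comparison, one possible alternative is to skip the parse tree and match the empty paragraph on the raw string, which leaves an emphasized non-breaking space alone. This is only a sketch under assumptions, not a BeautifulSoup technique: it assumes the non-breaking space is still written as the literal entity &nbsp; in the markup and that the <p> tags carry no attributes. The names EMPTY_P and strip_nbsp_paragraphs are hypothetical.

```python
import re

# Assumption: the raw markup spells the non-breaking space as "&nbsp;"
# and the paragraphs are bare <p> tags without attributes.
EMPTY_P = re.compile(r'<p>\s*&nbsp;\s*</p>', re.IGNORECASE)

def strip_nbsp_paragraphs(data):
    """Hypothetical helper: drop <p> elements whose whole content is &nbsp;."""
    return EMPTY_P.sub('', data)

samples = [
    "<p>foo</p><p>&nbsp;</p><p>bar</p><p>&nbsp;</p><p>boz</p>",
    "<p>foo</p><p>&nbsp;</p><p>bar<em>&nbsp;</em></p><p>&nbsp;</p><p>boz</p>",
    "<p>foo</p><p>&nbsp;<em>bar</em></p><p>boz</p>",
]

for sample in samples:
    print(strip_nbsp_paragraphs(sample))
```

Because the pattern requires the opening and closing <p> tags to surround nothing except &nbsp; and optional whitespace, paragraphs that mix &nbsp; with text or other tags are not touched. It would miss paragraphs with attributes such as <p class="...">, so it is a narrower tool than a tree-based cleanup.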