I'm trying to remove all content-less tags from a document. After a node
is removed, its parent should be re-checked, since it may now be content-less too.
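import re

from bs4 import BeautifulSoup as bs


def normalize(soup):
    # Matches text made up only of non-word characters (punctuation, whitespace).
    no_content_patt = re.compile(r'^[^\w]+$', re.U | re.S)
    # Repeat until a pass finds nothing, so emptied parents get re-checked.
    while True:
        bad_blocks = soup(True, text=no_content_patt)
        if bad_blocks:
            for item in bad_blocks:
                item.decompose()
        else:
            break


# ... s: see below
soup = bs(s)
normalize(soup)
print(soup.body.decode_contents())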
What confuses me is how it reacts to a "\n" inside a tag's contents:
s = """
<p>Some content</p> <-- keeps (ok)
<p>...</p> <-- removes (ok)
<p>!!<span>...</span></p> <-- removes (ok)
"""
...
s = """
<p>Some content</p> <-- keeps (ok)
<p>...</p> <-- removes (ok)
<p>!!<span>...</span> <-- keeps <p>!!\n</p> untouched (why?)
</p>
"""
So the question is: how can a "\n" break my regex or the soup's behavior?
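
One way to narrow this down (a minimal sketch, not a confirmed diagnosis): as far as I can tell from the Beautiful Soup docs, a `text=` filter combined with a tag filter is compared against each tag's `.string`, and `.string` is `None` when the tag has more than one child. The probe below checks the regex on its own and then looks at `.string` after the `<span>` is removed. The helper name `string_after_span_removed` and the explicit `"html.parser"` argument are mine, not part of the code above.

```python
import re

from bs4 import BeautifulSoup

# Same pattern as in normalize() above.
no_content_patt = re.compile(r'^[^\w]+$', re.U | re.S)

# Check the regex in isolation against text that ends with a newline.
regex_matches_newline = bool(no_content_patt.match("!!\n"))


def string_after_span_removed(html):
    """Remove the <span> from the first <p> and return that <p>'s .string."""
    # Assumption: "html.parser" is used only to keep the probe dependency-free.
    soup = BeautifulSoup(html, "html.parser")
    soup.p.span.decompose()
    return soup.p.string


one_line = string_after_span_removed("<p>!!<span>...</span></p>")
two_lines = string_after_span_removed("<p>!!<span>...</span>\n</p>")

print(regex_matches_newline)
print(repr(one_line))
print(repr(two_lines))
```

If the regex check passes but the second call returns `None`, then the leftover "!!" and "\n" are two separate text nodes inside the `<p>`, which would explain why that `<p>` is never matched by the `text=` filter.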