Remove/delete/extract empty tags.


tezro

Sep 22, 2009, 11:18:24 AM
to beautifulsoup
I want to remove empty P tags from the HTML code. Aaron DeVore
suggested this:

empty_tags = soup.findAll(lambda tag: tag.name == 'p' and not
tag.contents and (tag.string is None or not tag.string.strip()))

I tested it on this:

<p>21.09.2009</p>
<p> </p>
<p><img src="http://www.www.com/"></p>
<p></p>

It finds only the last tag, <p></p>, even though "tag.string.strip()"
of the second one is an empty string.

Please help.

Jakub Jagiełło

Sep 22, 2009, 3:32:42 PM
to beauti...@googlegroups.com
The problem is "not tag.contents", which is True only for completely
empty tags and False for tags that contain whitespace or other tags.
So even though you check whether the string is empty after stripping
whitespace, that check is useless: it only runs for nodes that have
already been found to contain nothing at all.
tag.contents does not help in this situation. Truly empty tags can
be detected with "tag.string is None", but "not tag.contents" cannot be
used to rule out tags containing other tags, because it would rule out
tags containing only whitespace too.
You should check only for child tags, not for any child nodes:

s.findAll(lambda tag: tag.name == 'p' and tag.find(True) is None and
(tag.string is None or tag.string.strip()==""))
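For anyone trying this out, here is a minimal sketch of the same test run against the sample HTML from the question. It assumes the current bs4 package (with find_all and the "html.parser" backend) rather than the BeautifulSoup 3 API used in this thread; is_empty_p and HTML are just illustrative names, and using decompose() to do the actual removal is my assumption about what is wanted.

```python
# Assumes the modern bs4 package, not the BeautifulSoup 3 used in this thread.
from bs4 import BeautifulSoup

# Sample markup from the original question.
HTML = (
    '<p>21.09.2009</p>\n'
    '<p> </p>\n'
    '<p><img src="http://www.www.com/"></p>\n'
    '<p></p>'
)


def is_empty_p(tag):
    """Hypothetical helper: True for a <p> with no child tags and no visible text."""
    if tag.name != "p":
        return False
    if tag.find(True) is not None:  # contains another tag, e.g. <img>
        return False
    return tag.string is None or tag.string.strip() == ""


soup = BeautifulSoup(HTML, "html.parser")

# Why "not tag.contents" missed <p> </p>: its contents list holds one
# whitespace string, so the list is not empty.
contents_lengths = [len(p.contents) for p in soup.find_all("p")]

empty_tags = soup.find_all(is_empty_p)
removed = len(empty_tags)
for tag in empty_tags:
    tag.decompose()  # remove the tag from the tree

remaining = soup.find_all("p")
print(removed, len(remaining))
```

The tag.find(True) check is what keeps the <p> holding only an <img>: that paragraph has no text of its own, so a test on tag.string alone would treat it as empty.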

--
Jakub Jagiełło




tezro

Sep 25, 2009, 11:52:44 AM
to beautifulsoup
Thanks a lot. That really fits.
