remove html and body tags


eNJoy

May 14, 2013, 11:57:03 PM
to beauti...@googlegroups.com
Hello all!

I am calling bs4 from another script for some code-cleansing tasks. I'd like bs4 to output only what is *inside* the body tag. I already know that using "html.parser" would do this, but that has other complications.
Is there a workaround?

Thanks,

Leonard Richardson

May 17, 2013, 4:28:29 PM
to beauti...@googlegroups.com
> I came across the same challenge and went through BS' source. Looks like we
> can use the `hidden` attribute of any wrapping tag in order to prevent it
> from being included upon "exporting" it to unicode:
>
>>>> from bs4 import BeautifulSoup
>>>> html = u"<div><footer><h3>foot</h3></footer></div><div>foo</div>"
>>>> soup = BeautifulSoup(html, "html5lib")
>>>> soup.body.hidden=True
>>>> print soup.body.prettify()
> <div>
> <footer>
> <h3>
> foot
> </h3>
> </footer>
> </div>
> <div>
> foo
> </div>
>
>
> To the developers: can we rely on the `hidden` attribute being stable in its
> current function?

'hidden' is a hack that allows the BeautifulSoup object to act just
like a Tag, but to not show up in representations. I'm not going to
change its behavior, but it's conceivable I might get rid of it.

If you just want to get the contents of a tag as a string, you can
call encode_contents(). Or you can call unwrap() on the <body> tag to
get rid of it and replace it with its contents.
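
A minimal sketch of the first suggestion, for illustration only. It assumes a complete document parsed with "html.parser" so it runs without html5lib installed; the same calls should apply to a soup built with html5lib. The sample markup is made up for the example.

```python
from bs4 import BeautifulSoup

# Made-up sample document; any markup with a <body> would do.
html = ("<html><head><link href='foo'/></head>"
        "<body><div><h3>foot</h3></div><div>foo</div></body></html>")
soup = BeautifulSoup(html, "html.parser")

# encode_contents() renders the children of <body> as a bytestring
# (UTF-8 by default), without the <body> tag itself.
body_bytes = soup.body.encode_contents()

# decode_contents() is the text counterpart and returns a str.
body_text = soup.body.decode_contents()

print(body_text)
```

Neither call modifies the tree, so `soup` still contains the html, head, and body tags afterwards.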

Leonard

Jan-Philip Gehrcke

May 22, 2013, 3:35:02 PM
to beauti...@googlegroups.com
Thanks for your valuable input, Leonard. I have a follow-up question,
please see below.
The encode_contents() concept is good and clear and I will use it, thanks.

Regarding unwrap(), I need to understand what you meant specifically. In
order to extract e.g. body's contents we need to delete other elements
on the same tree level (head in this case), don't we? Working example:


>>> html = u"<link href='foo'><div><footer><h3>foot</h3></footer></div><div>foo</div>"
>>> soup = BeautifulSoup(html, "html5lib")

# html5lib added html, head, and body tags:
>>> soup
<html><head><link href="foo"/></head><body><div><footer><h3>foot</h3></footer></div><div>foo</div></body></html>

# Remove outer html tag from tree, replace it with its contents:
>>> soup.html.unwrap()
<html></html>

# Remove head tag from tree:
>>> soup.head.extract()
<head><link href="foo"/></head>

# Remove body tag from tree, replace it with its contents:
>>> soup.body.unwrap()
<body></body>

# What's left over is only the content of the body tag:
>>> soup
<div><footer><h3>foot</h3></footer></div><div>foo</div>
>>>

The final `soup` object is what I want but the method is ugly. In this
case we know that there is only one neighboring element -- the head --
but in many other cases the neighboring elements are not predictable.
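
One way to cope with unpredictable neighbors might be to remove every sibling of the body tag in a loop before unwrapping. This is only a sketch: `strip_to_body_contents` is a hypothetical helper name, and it assumes the body tag sits directly inside the html tag (as html5lib produces). It uses "html.parser" on a complete document so it runs without html5lib.

```python
from bs4 import BeautifulSoup


def strip_to_body_contents(soup):
    """Hypothetical helper: leave only the children of <body> in `soup`."""
    body = soup.body
    # Remove every sibling of <body>, whatever it happens to be.
    # Iterate over a copy because extract() mutates parent.contents.
    for sibling in list(body.parent.contents):
        if sibling is not body:
            sibling.extract()
    # <html> now holds only <body>, so unwrapping both leaves
    # just the body's children at the top level of the soup.
    soup.html.unwrap()
    soup.body.unwrap()
    return soup


html = ("<html><head><link href='foo'/><title>t</title></head>"
        "<body><div><footer><h3>foot</h3></footer></div><div>foo</div></body></html>")
soup = strip_to_body_contents(BeautifulSoup(html, "html.parser"))
print(soup)
```

Anything that sits next to the html tag at the top level (a doctype, for instance) is not touched by this sketch.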

Another thing I tried, which would be a more obvious way, is extracting
the body tag and then unwrapping it. But this does not work:

>>> soup = BeautifulSoup(html, "html5lib")
>>> b = soup.body.extract()
>>> b
<body><div><footer><h3>foot</h3></footer></div><div>foo</div></body>
>>> b.unwrap()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python27\lib\site-packages\bs4\element.py", line 147, in unwrap
my_index = self.parent.index(self)
AttributeError: 'NoneType' object has no attribute 'index'
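
The traceback suggests the cause: extract() detaches the tag from the tree, so it no longer has a parent into which unwrap() could move its children. A small sketch to check this, using made-up markup and "html.parser" so it runs without html5lib:

```python
from bs4 import BeautifulSoup

html = "<html><head></head><body><div>foo</div></body></html>"
soup = BeautifulSoup(html, "html.parser")

# extract() removes <body> from the tree and returns it.
b = soup.body.extract()

# The detached tag has no parent, which is what unwrap() needs.
print(b.parent)

# Its children are still reachable, e.g. rendered as a string.
print(b.decode_contents())
```

The exact exception raised by unwrap() on a detached tag may differ between bs4 versions.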


So, what do you think is the best way to go from here:

>>> html = u"<link href='foo'><div><footer><h3>foot</h3></footer></div><div>foo</div>"
>>> soup = BeautifulSoup(html, "html5lib")

to here:

>>> o
<div><footer><h3>foot</h3></footer></div><div>foo</div>

with `o` being a tag or soup instance?
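
For illustration, one possible route (an assumption on my part, not something confirmed in this thread) is to move the body's children into a fresh, empty soup. The sketch uses "html.parser" on a complete document so it runs without html5lib; with html5lib the html, head, and body tags would be added by the parser instead.

```python
from bs4 import BeautifulSoup

html = ("<html><head><link href='foo'/></head>"
        "<body><div><footer><h3>foot</h3></footer></div><div>foo</div></body></html>")
soup = BeautifulSoup(html, "html.parser")

# Start from an empty soup and move each child of <body> into it.
o = BeautifulSoup("", "html.parser")
# Iterate over a copy because extract() mutates body.contents.
for child in list(soup.body.contents):
    o.append(child.extract())

print(o)
```

This leaves the original `soup` with an empty body tag, and nothing needs to be known about the head or other neighbors.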

Thanks for help,

Jan-Philip

Jan-Philip Gehrcke

May 28, 2013, 4:45:42 AM
to beauti...@googlegroups.com
Thanks for your response, Richard.

On 05/26/2013 10:45 PM, Richard Brooksby wrote:
> This works well for me. Just before returning the result (for feeding
> to a Jinja template) I do:
>
> soup.html.unwrap()
> soup.body.unwrap()
>

Could you please show a self-contained example? Did you have a link,
script or style tag in your html string? Did you parse with html5lib? See:

>>> from bs4 import BeautifulSoup
>>> html = u"<link href='foo'/><div></div>"
>>> soup = BeautifulSoup(html, "html5lib")
>>> soup.html.unwrap()
<html></html>
>>> soup.body.unwrap()
<body></body>
>>> soup
<head><link href="foo"/></head><div></div>

The challenge is to end up with a soup or tag object representing
<div></div> (the body content) rather than <head><link
href="foo"/></head><div></div>.

Cheers,

Jan-Philip
