Thanks for your valuable input, Leonard. I have a follow-up question;
please see below.
The encode_contents() concept is good and clear and I will use it, thanks.
Regarding unwrap(), I would like to understand what you meant specifically.
In order to extract, e.g., the body's contents, we need to remove the other
elements on the same tree level (head in this case), don't we? Working example:
>>> html = u"<link href='foo'><div><footer><h3>foot</h3></footer></div><div>foo</div>"
>>> soup = BeautifulSoup(html, "html5lib")
# html5lib added html, head, and body tags:
>>> soup
<html><head><link href="foo"/></head><body><div><footer><h3>foot</h3></footer></div><div>foo</div></body></html>
# Remove outer html tag from tree, replace it with its contents:
>>> soup.html.unwrap()
<html></html>
# Remove head tag from tree:
>>> soup.head.extract()
<head><link href="foo"/></head>
# Remove body tag from tree, replace it with its contents:
>>> soup.body.unwrap()
<body></body>
# What's left over is only the content of the body tag:
>>> soup
<div><footer><h3>foot</h3></footer></div><div>foo</div>
>>>
The final `soup` object is what I want but the method is ugly. In this
case we know that there is only one neighboring element -- the head --
but in many other cases the neighboring elements are not predictable.
Another thing I tried, which would be a more obvious way, is extracting
the body tag and then unwrapping it. But this does not work:
>>> soup = BeautifulSoup(html, "html5lib")
>>> b = soup.body.extract()
>>> b
<body><div><footer><h3>foot</h3></footer></div><div>foo</div></body>
>>> b.unwrap()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\site-packages\bs4\element.py", line 147, in unwrap
    my_index = self.parent.index(self)
AttributeError: 'NoneType' object has no attribute 'index'
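As far as I can tell, the failure comes from extract() detaching the tag
from the tree: afterwards its parent is None, and unwrap() needs the parent
to re-insert the children. Here is a minimal sketch of that behaviour. As
assumptions that differ from my session above, it uses the built-in
"html.parser" with explicit html/head/body tags in the input, and it catches
two exception types because I believe newer bs4 releases raise a different
error than the AttributeError shown in the traceback:

```python
from bs4 import BeautifulSoup

# Assumption: explicit html/head/body tags and the built-in "html.parser",
# so the sketch does not depend on html5lib being installed.
html = "<html><head><link href='foo'/></head><body><div>foo</div></body></html>"
soup = BeautifulSoup(html, "html.parser")

# extract() removes the tag from the tree and returns it.
body = soup.body.extract()
parent_after_extract = body.parent  # the detached tag no longer has a parent

# unwrap() needs the parent in order to re-insert the children, so it fails here.
# Assumption: the exception type depends on the installed bs4 version.
error_type = None
try:
    body.unwrap()
except (AttributeError, ValueError) as exc:
    error_type = type(exc).__name__

print(parent_after_extract, error_type)
```

So the order of operations seems to matter: unwrap() only works while the
tag is still attached to a tree.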
So, what do you think is the best way to go from here:
>>> html = u"<link href='foo'><div><footer><h3>foot</h3></footer></div><div>foo</div>"
>>> soup = BeautifulSoup(html, "html5lib")
to here:
>>> o
<div><footer><h3>foot</h3></footer></div><div>foo</div>
with `o` being a tag or soup instance?
Thanks for your help,
Jan-Philip