Why does soup.contents return 3 types of objects?


Julius Hamilton

Oct 22, 2021, 7:54:18 AM
to beautifulsoup

When I call:

for c in soup.children:

I get just "html". There is only one child for the top-level element, "html".

But when I call

for c in soup.children:

I get:

<class 'bs4.element.Doctype'>
<class 'bs4.element.NavigableString'>
<class 'bs4.element.Tag'>
<class 'bs4.element.NavigableString'>

Why does it appear that there are four elements that are children of "soup", and why do they have different types?



Oct 22, 2021, 8:18:01 AM
to beautifulsoup
I can't reproduce your output exactly without seeing your code and markup, but I'm guessing it was something like this:

from bs4 import BeautifulSoup

soup = BeautifulSoup("<!doctype html><html>some content</html>   ")

There are three top-level things in this HTML document: a doctype declaration, an <html> tag, and a string at the end containing some extra whitespace. When the BeautifulSoup object is created, one object is created for each of those top-level items, and those objects become the .contents of the BeautifulSoup object:

print(soup)
# <!DOCTYPE html>
# <html><body><p>some content</p></body></html>

print([type(x) for x in soup.children])
# [<class 'bs4.element.Doctype'>, <class 'bs4.element.Tag'>, <class 'bs4.element.NavigableString'>]

print([x.name for x in soup.children])
# [None, 'html', None]

The different types of things that might show up in an HTML document are given different Python classes because they play different roles in the document. The part of the documentation that covers this is "Kinds of Objects". There are two main classes: Tag for tags and NavigableString for strings. The more obscure classes like Doctype are subclasses of NavigableString, and they're covered (briefly) in "Comments and other special strings".
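If it helps, here is a small sketch for checking those class relationships yourself. The markup is the guessed example from above, and "html.parser" is my assumption (it ships with Python); a different parser may build a slightly different tree.

```python
from bs4 import BeautifulSoup
from bs4.element import Doctype, NavigableString, Tag

# Assumed markup and parser, chosen only for illustration.
markup = "<!doctype html><html>some content</html>   "
soup = BeautifulSoup(markup, "html.parser")

# For each top-level child, show its class and which of the two
# main classes (Tag or NavigableString) it belongs to.
for child in soup.children:
    print(type(child).__name__,
          "tag" if isinstance(child, Tag) else "not a tag",
          "string" if isinstance(child, NavigableString) else "not a string")

# Doctype is a special kind of string, not a kind of tag.
print(issubclass(Doctype, NavigableString))
print(issubclass(Doctype, Tag))
```

The isinstance checks are usually the most convenient way to tell the two families apart, since every special string class passes the NavigableString check.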

Since there's only one top-level HTML tag in the document, there's only one Tag object in soup.contents. Tags are the only objects that have a meaningful .name. The other classes, like NavigableString, do implement .name so that iterating over a mixed list like soup.contents doesn't crash, but for anything other than a Tag, .name is always None. That's why only one name showed up in your list.
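If you only want the tags, you can filter the mixed list. This is a rough sketch using the same guessed markup, again assuming "html.parser"; the variable names are only for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Assumed markup and parser, chosen only for illustration.
markup = "<!doctype html><html>some content</html>   "
soup = BeautifulSoup(markup, "html.parser")

# Reading .name is safe on every child; it is None unless the child is a Tag.
names = [child.name for child in soup.children]

# Option 1: keep only the children that are Tag objects.
tags_by_type = [child for child in soup.children if isinstance(child, Tag)]

# Option 2: find_all(True) matches every tag, and recursive=False
# limits the search to direct children.
tags_by_search = soup.find_all(True, recursive=False)

print(names)
print([tag.name for tag in tags_by_type])
print([tag.name for tag in tags_by_search])
```

Both options skip the doctype and the whitespace string, so a loop over the result only ever sees tags.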