Heading tags can't be child elements?

Jessica Richards

May 21, 2013, 12:10:29 PM
to beauti...@googlegroups.com
I've noticed that BeautifulSoup will not pick a heading tag as a child element of a P. For example:

<P ALIGN="LEFT">
<a name="x">
</a>
<A NAME="xml">
</a>
<h2>
XML
</h2>
eXtensible Mark-up Language, used especially for web pages. XML allows designers to create their own customized tags to provide functionality not available with HTML. It is especially useful for pages that come from database information and parts of the page are standardized and need to appear the same many times.
</P>

If I do 

for p in parent.find("p"):
   print p.h2

It won't find h2. It also won't list h2 or anything after it if I use p.get_text(), p.contents, p.strings, etc.

If I change h2 to something like <b> it finds it just fine. 

Jessica Richards

May 21, 2013, 12:12:50 PM
to beauti...@googlegroups.com
Correction: that loop should read for p in parent.find_all("p"), but either way it still doesn't find the h2.

Leonard Richardson

May 21, 2013, 1:54:17 PM
to beauti...@googlegroups.com
Jessica,

> I've noticed that BeautifulSoup will not pick a heading tag as a child
> element of a P. For example:
>
> <P ALIGN="LEFT">
> <a name="x">
> </a>
> <A NAME="xml">
> </a>
> <h2>
> XML
> </h2>
> eXtensible Mark-up Language, used especially for web pages. XML allows
> designers to create their own customized tags to provide functionality not
> available with HTML. It is especially useful for pages that come from
> database information and parts of the page are standardized and need to
> appear the same many times.
> </P>

One of the quirks of HTML parsers is their tendency to rearrange tags
to make the HTML more valid. This means that a Beautiful Soup tree may
not correspond exactly to the original document.

In this case, the rules of HTML forbid an <h2> tag from showing up
inside a <p> tag. You can see how different parsers deal with this by
using the diagnose() function
(http://www.crummy.com/software/BeautifulSoup/bs4/doc/#diagnose).

My code:

===
from bs4.diagnose import diagnose
data = """<P ALIGN="LEFT">
<a name="x">
</a>
<A NAME="xml">
</a>
<h2>
XML
</h2>
eXtensible Mark-up Language...
</P>"""
diagnose(data)
===

The output:

===
Found lxml version 2.3.2.0
Found html5lib version 0.90

Trying to parse your markup with html.parser
Here's what html.parser did with the markup:
<p align="LEFT">
<a name="x">
</a>
<a name="xml">
</a>
<h2>
XML
</h2>
eXtensible Mark-up Language...
</p>
--------------------------------------------------------------------------------
Trying to parse your markup with html5lib
Here's what html5lib did with the markup:
<html>
<head>
</head>
<body>
<p align="LEFT">
<a name="x">
</a>
<a name="xml">
</a>
</p>
<h2>
XML
</h2>
eXtensible Mark-up Language...
<p>
</p>
</body>
</html>
--------------------------------------------------------------------------------
Trying to parse your markup with lxml
Here's what lxml did with the markup:
<html>
<body>
<p align="LEFT">
<a name="x">
</a>
<a name="xml">
</a>
</p>
<h2>
XML
</h2>
eXtensible Mark-up Language...
</body>
</html>
===

As you can see, html5lib and lxml both move the <h2> tag outside of
the <p> tag, while html.parser leaves it where it was. With html5lib
or lxml, that's why you can't access the <h2> tag as a child of the
<p> tag. You can find the <h2> tag on its own, or as a sibling of the
<p> tag, but not as a child.

For more on the differences between parsers, see:
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#differences-between-parsers
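
If you'd rather check this from your own code than read the diagnose()
output, something like the following might work. It is only a sketch:
MARKUP is a trimmed copy of the sample above, and locate_h2() and
compare_parsers() are hypothetical helper names, not part of Beautiful
Soup. Parsers that aren't installed are reported rather than treated as
errors.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Trimmed copy of the markup from the original question.
MARKUP = """<P ALIGN="LEFT">
<a name="x">
</a>
<h2>
XML
</h2>
eXtensible Mark-up Language...
</P>"""


def locate_h2(markup, parser):
    """Say where `parser` put the first <h2> relative to the first <p>."""
    soup = BeautifulSoup(markup, parser)
    p = soup.find("p")
    if p is None:
        return "no <p>"
    if p.find("h2") is not None:
        return "child"      # the parser left the <h2> inside the <p>
    if p.find_next_sibling("h2") is not None:
        return "sibling"    # the parser closed the <p> before the <h2>
    return "elsewhere"


def compare_parsers(markup, parsers=("html.parser", "lxml", "html5lib")):
    """Run locate_h2() with each parser, noting the ones that are missing."""
    results = {}
    for name in parsers:
        try:
            results[name] = locate_h2(markup, name)
        except FeatureNotFound:
            # Beautiful Soup raises this when the parser library is absent.
            results[name] = "not installed"
    return results


if __name__ == "__main__":
    for name, where in compare_parsers(MARKUP).items():
        print(name, "->", where)
```

Going by the diagnose() output above, html.parser should report "child"
and the other two "sibling", but the exact behavior can vary with the
parser versions you have installed.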

One alternative is to parse the document as XML instead of HTML. XML
doesn't have any rules about which tags can go inside which other
tags, so it will gladly put an <h2> tag inside a <p> tag. Here's the
rest of the output from diagnose():

===
Trying to parse your markup with ['lxml', 'xml']
Here's what ['lxml', 'xml'] did with the markup:
<?xml version="1.0" encoding="utf-8"?>
<P ALIGN="LEFT">
<a name="x">
</a>
<A NAME="xml">
</A>
<h2>
XML
</h2>
eXtensible Mark-up Language...
</P>
===

But as you can see, XML is also case-sensitive. The HTML parsers
convert tag names to lowercase, but the XML parser leaves them as
written. Considered as XML, this document has an <h2> tag inside a <P>
tag, and there is no <p> tag to find.
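
A small sketch of what that lookup could look like, assuming lxml is
installed (Beautiful Soup's "xml" feature depends on it). XML_MARKUP is
a simplified, well-formed stand-in for the sample document, and
xml_lookup() is a hypothetical helper name used only for illustration.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Simplified, well-formed stand-in for the sample document.
XML_MARKUP = '<P ALIGN="LEFT"><h2>XML</h2>eXtensible Mark-up Language...</P>'


def xml_lookup(markup):
    """Parse as XML and report what case-sensitive lookups find.

    Returns None if the XML parser (lxml) is not installed.
    """
    try:
        soup = BeautifulSoup(markup, "xml")
    except FeatureNotFound:
        return None
    upper = soup.find("P")  # tag name exactly as written in the document
    return {
        "p": soup.find("p") is not None,  # lowercase name does not match
        "P": upper is not None,
        "h2_text": upper.h2.get_text(strip=True) if upper is not None else None,
    }


if __name__ == "__main__":
    print(xml_lookup(XML_MARKUP))
```

The point is that with the XML parser you would search for "P" rather
than "p", and the <h2> is then reachable as a child of that tag.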

Hope this helps,
Leonard