Possible bug?

Andres Riancho

Oct 6, 2008, 11:50:18 AM
to beautifulsoup
List,

Hi! First of all, thanks for a great piece of software, which I use for
most of my projects that relate to HTML parsing in one way or another.
Now... for the interesting part of the email ;) I think I found a bug
in beautifulsoup (Beautiful Soup version 3.0.7a, released July 3,
2008). Here is how to reproduce it:

import BeautifulSoup

html = '''
<table width="500" border="0" cellspacing="0" cellpadding="0"
align="center" style="border-left: 1px solid #cccccc; border-right:
1px solid #cccccc;" id="fpri" bgcolor="#ffffff">
<form method="post" action="login.php">
<tr>
<td align="center">

<table width="320" border="0" cellspacing="5" cellpadding="1"
style="border: 1px solid #666666">
<tr><td background="img/logo_fon.gif" height="55" colspan="2"><img
src="img/logo.gif" border="0" width="159" height="53"></td></tr>
<tr>
<td>Usuario</td>
<td><input type="text" name="usuario" value="" id="usuario"></td>
</tr>
<tr>
<td>&nbsp;</td>
<td><input type="submit" name="submit" value="Ingresar"></td>
</tr>
</table>

</td>
</tr>
</form>
'''

print BeautifulSoup.BeautifulSoup(html)


The result will show you something like:

...
<form method="post" action="login.php">
</form><tr>
...
<td><input type="text" name="usuario" value="" id="usuario" /></td>
...
<td><input type="submit" name="submit" value="Ingresar" /></td>

In other words: The input tags are being left out of the form! Is this
really a bug?

Cheers,

Andres Riancho

Oct 8, 2008, 2:53:36 PM
to beautifulsoup

I tried with MinimalSoup and ICantBelieveItsBeautifulSoup and they
also "break" the html :(

--
Andres Riancho
http://w3af.sourceforge.net/
Web Application Attack and Audit Framework

Aaron DeVore

Oct 14, 2008, 5:59:06 AM
to beautifulsoup
On Oct 6, 8:50 am, Andres Riancho <andres.rian...@gmail.com> wrote:

> In other words: The input tags are being left out of the form! Is this
> really a bug?

From what I can tell the problem is that Beautiful Soup is having
trouble dealing with where the <form> tag is.

This is what is happening:

<table>
<form>
<tr><td></td></tr>
</form>
</table>

Comes out as:
<table>
<form></form>
<tr><td></td></tr>
</table>

The following is what the author should have written:
<form>
<table>
<tr><td></td></tr>
</table>
</form>
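
To check a page for this pattern before handing it to a parser, one option is to scan for a <form> that opens directly inside table structure. The following is only a rough sketch using Python 3's standard html.parser module rather than Beautiful Soup; the names MisnestedFormFinder and find_misnested_forms are made up for illustration, and the list of void tags is deliberately partial.

```python
from html.parser import HTMLParser

# Tags that may only contain other table parts, so a <form> opened
# directly inside one of them is misnested.
TABLE_CONTEXT = {"table", "thead", "tbody", "tfoot", "tr"}
# Partial list of tags that never get a closing tag (assumption: enough
# for simple pages like the one in this thread).
VOID_TAGS = {"img", "input", "br", "hr", "meta", "link"}


class MisnestedFormFinder(HTMLParser):
    """Records where a <form> opens directly inside table structure."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # stack of currently open tag names
        self.misnested = []   # (line, column) of each offending <form>

    def handle_starttag(self, tag, attrs):
        if tag == "form" and self.open_tags and self.open_tags[-1] in TABLE_CONTEXT:
            self.misnested.append(self.getpos())
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Ignore stray closing tags; otherwise pop back to the match.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass


def find_misnested_forms(html):
    finder = MisnestedFormFinder()
    finder.feed(html)
    finder.close()
    return finder.misnested


bad = "<table><form><tr><td></td></tr></form></table>"
good = "<form><table><tr><td></td></tr></table></form>"
print(len(find_misnested_forms(bad)), len(find_misnested_forms(good)))
```

A non-empty result only says the markup is likely to be rearranged by a tree-building parser; it does not repair anything.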

In valid HTML the <form> tag surrounds the <table> tag, not the other
way around. No one seems to have figured out exactly how to deal with
tags that improperly surround table children (<tr>, <tbody>, <thead>).
According to Mozilla's DOM Inspector, the behavior of Mozilla's DOM
matches BeautifulSoup's handling. Mozilla seems to have some way to
track forms that is independent of its DOM tree. html5lib moves the
<form> tag out of the table entirely, leaving it as an empty sibling
of <table>.
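
To illustrate what tracking forms independently of the tree could look like, here is a minimal sketch that reads the tags in document order and remembers which <form> is open when each <input> appears. It uses Python 3's standard html.parser, not Beautiful Soup, and it is not a claim about how Mozilla actually does it; FormOwnerTracker and form_inputs are hypothetical names.

```python
from html.parser import HTMLParser


class FormOwnerTracker(HTMLParser):
    """Associates each <input> with the <form> open at that point in the source."""

    def __init__(self):
        super().__init__()
        self.current_form = None  # action of the open form, or None
        self.owners = {}          # form action -> list of input names

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.current_form = attrs.get("action", "")
            self.owners.setdefault(self.current_form, [])
        elif tag == "input" and self.current_form is not None:
            self.owners[self.current_form].append(attrs.get("name"))

    def handle_endtag(self, tag):
        if tag == "form":
            self.current_form = None


def form_inputs(html):
    tracker = FormOwnerTracker()
    tracker.feed(html)
    tracker.close()
    return tracker.owners


sample = (
    '<table><form method="post" action="login.php">'
    '<tr><td><input type="text" name="usuario"></td></tr>'
    '<tr><td><input type="submit" name="submit"></td></tr>'
    '</form></table>'
)
print(form_inputs(sample))
```

Because ownership is decided by source order rather than by ancestry, the inputs stay attached to login.php even though a tree builder would close the form before the first <tr>.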

MinimalSoup has an entirely different issue. The <form> tag stays
inside of the outer table, just like in the HTML. The inner table then
jumps out of the outer table and becomes the next sibling of the outer
table.

<table id="outer-table">
<form><tr><td><!-- former location of inner-table --></td></tr></form>
</table>
<table id="inner-table">
<tr>...input tags...</tr>
</table>

-Aaron DeVore

Andres Riancho

Oct 14, 2008, 8:35:53 AM
to beauti...@googlegroups.com
Aaron,

Thanks for the really detailed analysis! One important question is
still unanswered: is beautifulsoup going to be fixed? And another
one is... should it be fixed?
