bug parsing <a><div ></a> nesting


Daemmon

Feb 17, 2011, 11:25:38 AM
to tagsoup-friends
I believe I have found a bug that happens when a <div> is nested in an
<a>. Here is a Groovy script that demonstrates:

#!/usr/bin/env groovy

xml = '<a class="parent"><div class="child">the div text </div> the link text</a>'
xml2 = '<a class="parent"><span class="child">the span text </span> the link text</a>'

// Newer Groovy versions (4.x) need `import groovy.xml.XmlSlurper` at the top of the script.
@Grab(group='org.ccil.cowan.tagsoup', module='tagsoup', version='1.2')
XmlSlurper slurper = new XmlSlurper(new org.ccil.cowan.tagsoup.Parser());

document = slurper.parseText(xml);
document2 = slurper.parseText(xml2);

def parent = document.'**'.grep{ it.@class == 'parent' }[0];
// These lines will not print out the text
println "text():"+parent.text()
println "toString():"+parent.toString()
def parent2 = document2.'**'.grep{ it.@class == 'parent' }[0];
// These lines will print out the text
println "text():"+parent2.text()
println "toString():"+parent2.toString()

John Cowan

Feb 17, 2011, 12:22:39 PM
to Daemmon, tagsoup-friends
Daemmon scripsit:

> I believe I have found a bug that happens when a <div> is nested in an
> <a>. Here is a Groovy script that demonstrates:

It's because div is block-level, so it implicitly closes the a, which is
inline. Since a will not automatically be reopened later (as b, i, etc.
will be), that's that.

--
John Cowan co...@ccil.org http://ccil.org/~cowan
The whole of Gaul is quartered into three halves.
--Julius Caesar
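
The rule described above (a block-level start tag implicitly closes an open inline element, and only some inline elements are reopened afterwards) can be illustrated with a small toy model. This is a sketch under stated assumptions, not TagSoup's actual code: the names `Elem` and `ToyParser`, the token format, and the three hard-coded tag sets are all hypothetical and exist only for this illustration.

```groovy
// Toy model only; NOT TagSoup's implementation. All names are hypothetical.
class Elem {
    String name
    List children = []   // child Elems and text Strings, in document order

    // Concatenated text of this element and its descendants.
    String text() { children.collect { it instanceof Elem ? it.text() : it }.join('') }

    String toString() { "<${name}>${children.join('')}</${name}>".toString() }
}

class ToyParser {
    static final Set BLOCK = ['div', 'p'] as Set
    static final Set INLINE = ['a', 'span', 'b', 'i'] as Set
    static final Set RESTARTABLE = ['b', 'i'] as Set   // assumed: reopened after an implicit close

    static Elem open(List<Elem> stack, String name) {
        Elem e = new Elem(name: name)
        stack.last().children << e
        stack << e
        return e
    }

    // tokens: a list of [kind, value] pairs, where kind is 'start', 'end' or 'text'
    static Elem parse(List tokens) {
        Elem root = new Elem(name: 'body')
        List<Elem> stack = [root]
        for (tok in tokens) {
            String kind = tok[0]
            String value = tok[1]
            if (kind == 'text') {
                stack.last().children << value
            } else if (kind == 'start') {
                List<String> restart = []
                if (value in BLOCK) {
                    // A block-level start tag implicitly closes every open inline element.
                    while (stack.last().name in INLINE) {
                        String closed = stack.removeAt(stack.size() - 1).name
                        if (closed in RESTARTABLE) restart.add(0, closed)
                    }
                }
                open(stack, value)
                restart.each { open(stack, it) }   // b, i, ... are reopened inside the block; a is not
            } else if (kind == 'end') {
                // An end tag for an element that is no longer open is ignored.
                int idx = stack.findLastIndexOf { it.name == value }
                if (idx > 0) {
                    while (stack.size() > idx) stack.removeAt(stack.size() - 1)
                }
            }
        }
        return root
    }
}

def divCase = ToyParser.parse([['start', 'a'], ['start', 'div'], ['text', 'the div text '],
                               ['end', 'div'], ['text', ' the link text'], ['end', 'a']])
def spanCase = ToyParser.parse([['start', 'a'], ['start', 'span'], ['text', 'the span text '],
                                ['end', 'span'], ['text', ' the link text'], ['end', 'a']])
def boldCase = ToyParser.parse([['start', 'b'], ['start', 'div'], ['text', 'bold?'],
                                ['end', 'div'], ['end', 'b']])

println divCase    // the <a> ends up empty; the div and the trailing text become its siblings
println spanCase   // the <a> keeps both the span and the trailing text
println boldCase   // the <b> is closed, then reopened inside the div
```

In this model the `<a>` in the div case is closed before the `<div>` opens, so its `text()` is empty, which matches the symptom reported in the first message. The span case is untouched because `span` is not in the block set.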

Daemmon

Feb 17, 2011, 2:53:12 PM
to tagsoup-friends
Thanks for your response. Yes, I suspected it was related to divs being
block-level, which is why I added the span demo in the test script. I
understand what you are saying but I thought TagSoup was meant to help
with these kinds of issues. I noticed that running my test script
using the default parser in XmlSlurper does not have this problem.

An aside: I ran into this while trying to parse the pagination links
on pages like this:
http://maps.google.com/maps/place?cid=14484478826722327285&view=feature&mcsrc=google_reviews&num=10&start=10
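
On the observation that the default parser in XmlSlurper does not show the problem: the test strings happen to be well-formed XML, so a strict XML parser accepts them and applies no HTML nesting rules at all. The sketch below uses only the standard Groovy XmlSlurper (no TagSoup); the import line assumes Groovy 3 or later, and the `malformed` sample is an assumption taken from the thread title.

```groovy
// Groovy 3+ location of XmlSlurper; on Groovy 2.x drop this import
// (groovy.util.XmlSlurper is imported by default there).
import groovy.xml.XmlSlurper

def xml = '<a class="parent"><div class="child">the div text </div> the link text</a>'

// The no-argument constructor uses a strict XML parser, which knows nothing
// about block-level or inline elements, so the <div> stays inside the <a>.
def document = new XmlSlurper().parseText(xml)
def parent = document.'**'.grep { it.@class == 'parent' }[0]
println "text():" + parent.text()

// The same strict parser rejects markup that is not well-formed, which is
// the kind of input a lenient parser such as TagSoup is meant to handle.
def malformed = '<a><div></a>'
def rejected = false
try {
    new XmlSlurper().parseText(malformed)
} catch (org.xml.sax.SAXException e) {
    rejected = true
    println "strict parser rejected the input: ${e.message}"
}
```

So the two parsers are not interchangeable: the default one only works here because the sample input is already valid XML.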

Juan Carlos Garcia Segovia

Jan 17, 2013, 5:21:58 AM
to tagsoup...@googlegroups.com, Daemmon, co...@mercury.ccil.org
On 2011-02-17 17:22:39 UTC, John Cowan wrote:
> Daemmon scripsit:
>
> > I believe I have found a bug that happens when a <div> is nested in an
> > <a>. Here is a Groovy script that demonstrates:
>
> It's because div is block-level, so it implicitly closes the a, which is
> inline.  Since a will not automatically be reopened later (as b, i, etc.
> will be), that's that.

TagSoup has a bug as it is not following the HTML Parsing standard:
http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html

"div" should behave just like "span" inside "a".
Just fire up a current browser that follows the HTML Parsing standard and see for yourself.

John Cowan

Jan 17, 2013, 1:55:14 PM
to Juan Carlos Garcia Segovia, tagsoup...@googlegroups.com, Daemmon
Juan Carlos Garcia Segovia scripsit:

> TagSoup has a bug as it is not following the HTML Parsing standard:
> http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html

TagSoup does not follow any standard. Someday it may, but right now if
you want WhatWG parsing, you need to use a different parser.

--
Long-short-short, long-short-short / Dactyls in dimeter, John Cowan
Verse form with choriambs / (Masculine rhyme): co...@ccil.org
One sentence (two stanzas) / Hexasyllabically
Challenges poets who / Don't have the time. --robison who's at texas dot net

Juan Carlos Garcia Segovia

Jan 18, 2013, 4:22:15 AM
to tagsoup...@googlegroups.com, Juan Carlos Garcia Segovia, Daemmon, co...@mercury.ccil.org
On 2013-01-17 18:55:14 UTC, John Cowan wrote:
> Juan Carlos Garcia Segovia scripsit:
>
> > TagSoup has a bug as it is not following the HTML Parsing standard:
> > http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html
>
> TagSoup does not follow any standard.  Someday it may, but right now if
> you want WhatWG parsing, you need to use a different parser.

Do you mean the parsing rules are just made up and not following the standard is not a bug?

You should state on the home page that TagSoup neither follows nor tries to follow the HTML Parsing standard.

John Cowan

Jan 18, 2013, 10:17:54 AM
to Juan Carlos Garcia Segovia, tagsoup...@googlegroups.com, Daemmon
Juan Carlos Garcia Segovia scripsit:

> Do you mean the parsing rules are just made up and not following the
> standard is not a bug?

A standard cannot claim to control the behavior of an application that
doesn't claim to support that standard. If something more readable than
the WhatWG document ever emerges, I'll consider switching TagSoup to it.
Possibly in version 3.0.

--
John Cowan http://ccil.org/~cowan co...@ccil.org
Mr. Henry James writes fiction as if it were a painful duty. --Oscar Wilde

Juan Carlos Garcia Segovia

Jan 19, 2013, 7:19:23 AM
to tagsoup...@googlegroups.com, Juan Carlos Garcia Segovia, Daemmon, co...@mercury.ccil.org
On 2013-01-18 15:17:54 UTC, John Cowan wrote:
> Juan Carlos Garcia Segovia scripsit:
>
> > Do you mean the parsing rules are just made up and not following the
> > standard is not a bug?
>
> A standard cannot claim to control the behavior of an application that
> doesn't claim to support that standard.  If something more readable than
> the WhatWG document ever emerges, I'll consider switching TagSoup to it.
> Possibly in version 3.0.

Would you accept patches to make TagSoup follow the HTML Parsing standard?

There is already a test suite for the standard at:
and also used by:
if you do not want to read the standard.

John Cowan

Jan 19, 2013, 9:36:41 AM
to Juan Carlos Garcia Segovia, tagsoup...@googlegroups.com, Daemmon
Juan Carlos Garcia Segovia scripsit:

> Would you accept patches to make TagSoup follow the HTML Parsing standard?

Arbitrary code patches, no. Changes to html.tssl, yes.

--
The work of Henry James has always seemed divisible by a simple dynastic
arrangement into three reigns: James I, James II, and the Old Pretender.
--Philip Guedalla