New issue 45 by portman....@gmail.com: HTML that doesn't parse correctly
(but doesn't fail either)
http://code.google.com/p/fizzler/issues/detail?id=45
I've been using Fizzler with great success, but today I came across some
HTML that silently failed to parse correctly.
I was selecting all of the <a> elements and noticed that one was being
ignored. Here are the repro steps:
1. Load the HTML from http://pastebin.com/T1Lsr6w6 (this is the "View Source" for
http://www.diapers.com/product/productdetail.aspx?productid=16913)
2. Try to query the selector "#pdp"
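3. Run the following code (assuming the string html holds the HTML above):
using HtmlAgilityPack;
using Fizzler.Systems.HtmlAgilityPack; // provides the QuerySelector extension method
var doc = new HtmlDocument();
doc.LoadHtml(html);
var dom = doc.DocumentNode;
var pdpElement = dom.QuerySelector("#pdp"); // expected: the <a id="pdp"> element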
What is the expected output? What do you see instead?
I expect pdpElement to be the HtmlNode for <a
href="http://c1.diapers.com/images/products/p/pg/pg-256_1z.jpg"
class="MagicZoomPlus" id="pdp" title="Pampers Sensitive Thick Baby Wipes
Refill 360ct." target="_blank">
Instead, it doesn't find a match.
What version of the product are you using? On what operating system?
Fizzler 0.9
Please provide any additional information below.
I narrowed down the error slightly.
Using VisualFizzler (neat tool!) I can see that everything up to line 282
is selectable (for example "#siteNav").
But I can't select anything after line 283 (for example "div.topToolBox").
So the issue seems to be related to very long lines, like line 283 of that
pastebin example.
Sure enough, when I remove this line (#283) from the HTML, everything works
perfectly. The line is pathologically long (51,553 characters!), so this
is probably a defect in one of the underlying framework classes that
Fizzler is using.
In the meantime, I've changed my code to chop long lines at 1024 characters
before handing off to Fizzler, and everything is working again. But you
still might want to investigate what precisely is going wrong on that long
line, so I'll keep the issue open.
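For anyone who needs the same stopgap, the line-chopping workaround could look roughly like the sketch below. It is not the exact code from my project: the class name LongLineChopper and the method ChopLongLines are made up for illustration, and only the 1024-character limit comes from the description above.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of Fizzler or HtmlAgilityPack.
static class LongLineChopper
{
    // Returns a copy of html in which no line is longer than maxLength
    // characters, by inserting '\n' into lines that exceed the limit.
    public static string ChopLongLines(string html, int maxLength = 1024)
    {
        if (html == null) throw new ArgumentNullException("html");
        if (maxLength <= 0) throw new ArgumentOutOfRangeException("maxLength");

        var result = new StringBuilder(html.Length + html.Length / maxLength + 1);
        var column = 0; // characters written so far on the current line

        foreach (var ch in html)
        {
            if (ch == '\n' || ch == '\r')
            {
                column = 0; // an existing line break resets the count
            }
            else
            {
                if (column == maxLength)
                {
                    // The current line is full: break before this character.
                    result.Append('\n');
                    column = 0;
                }
                column++;
            }
            result.Append(ch);
        }

        return result.ToString();
    }
}
```

It would then be called just before loading the document, e.g. doc.LoadHtml(LongLineChopper.ChopLongLines(html)). Note that the break position is blind to the markup, so a newline can land inside a tag, an attribute value, or a script block. That was acceptable for my selectors, but it may not be safe for every page.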
We're using HTMLAgilityPack so it's probably an issue there, but it should
be fairly trivial to swap out HTMLAgilityPack for another parser. It could
also be that this issue has been fixed by a more recent version of
HTMLAgilityPack than the one in the download.