htmlparser.js - bug in handling of CDATA sections in style and script elements
From: Alex Robinson <solidgold...@gmail.com>
Date: Sun, 12 Oct 2008 15:29:55 -0700 (PDT)
Local: Sun, Oct 12 2008 6:29 pm
Subject: htmlparser.js - bug in handling of CDATA sections in style and script elements
John,

Excellent news to see movement on env.js and htmlparser.js.

The main problem I found when trying to use the original env.js to
scrape stuff was that env.js demands the response be well-formed,
valid html. I did originally try to use htmlparser.js to massage the
response but found myself adding on so many kludges that I resorted to
simply shelling out to TagSoup, like so:


        // Pipe the response through TagSoup using the Rhino shell's runCommand.
        var cleanup = {
            input: self.responseText,
            output: '',
            args: ['-jar', '/usr/local/java/tagsoup-1.2.jar']
        };
        var cleanupsoak = runCommand('java', cleanup);
        self.responseText = cleanup.output;

Now that things are moving, I actually bothered to look to see what
htmlparser.js was doing and realised that there were a couple of
problems with the regex dealing with style and script elements.

First, the use of .* - that needs to be [\s\S]*.
(see http://blog.stevenlevithan.com/archives/singleline-multiline-confusing
)
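
To illustrate the first point, here is a minimal sketch of why `.*` fails on multi-line script content while `[\s\S]*` does not. It is not the actual htmlparser.js code; the sample string and the pattern names are made up for the example.

```javascript
// Hypothetical input: the remainder of a document after an opening <script>
// tag, where the script body spans two lines.
var body = "var a = 1;\nvar b = 2;</script><p>after</p>";

// '.' does not match line terminators in JavaScript, so an anchored .*
// cannot reach the closing tag when the content contains a newline.
var dotPattern = /^(.*)<\/script[^>]*>/;

// [\s\S] matches any character at all, newlines included.
var anyPattern = /^([\s\S]*)<\/script[^>]*>/;

var dotMatch = dotPattern.exec(body); // no match
var anyMatch = anyPattern.exec(body); // captures both lines of the script
```

JavaScript regexes of this era have no single-line (dotall) flag, which is why the character class is the usual workaround.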

Second, the backslash in the regex object to match the closing tag
itself needs to be escaped for it to work.
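
One reading of the second point, sketched under the assumption that the pattern is built from a string with `new RegExp(...)`: in a JavaScript string literal a lone backslash is consumed by the string parser, so `"\s"` reaches the regex engine as a plain `s`. The `tagName` variable and both patterns below are illustrative, not taken from the patch.

```javascript
var tagName = "script"; // hypothetical stand-in for the currently open element
var sample = "line one\nline two</script>";

// "\s" and "\S" collapse to "s" and "S" inside the string literal, so this
// pattern is really ^([sS]*)</script[^>]*> and cannot match the sample.
var unescaped = new RegExp("^([\s\S]*)</" + tagName + "[^>]*>");

// Doubling the backslashes keeps them intact for the regex engine.
var escaped = new RegExp("^([\\s\\S]*)</" + tagName + "[^>]*>");

var unescapedMatch = unescaped.exec(sample);
var escapedMatch = escaped.exec(sample);
```

Regex literals (`/.../`) do not have this problem, but they cannot interpolate a tag name, which is presumably why the constructor form is used.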

I've put a patch which fixes these problems here:

http://fu2k.org/alex/javascript/htmlparser/htmlparser.patch.20081012.js

As far as I can tell, these two problems are the cause of these reports:

http://ejohn.org/blog/pure-javascript-html-parser/#comment-308850
http://ejohn.org/blog/pure-javascript-html-parser/#comment-310521

The patch also includes a fix for a different problem - namely that
htmlparser.js falls over when presented with a string that contains a
doctype declaration. All I've done is strip the doctype out. I'm sure
that something better can be done than that, but I thought it better
to do that for now, rather than just have things fall over.

The html += ''; is because, well, rhino claims that the doctype regex
is ambiguous sometimes - it seems that it needs an extra nudge to make
sure that it's treated as a javascript string rather than a java
string object.
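
A rough sketch of the doctype workaround described above. The regex and the `stripDoctype` name are assumptions for illustration, not the contents of the patch, and the pattern does not cope with a doctype whose internal subset contains `>`. The `html += ''` line mirrors the coercion mentioned above: appending an empty string yields a primitive JavaScript string even when the input is a string-like object.

```javascript
function stripDoctype(html) {
  // Coerce to a primitive JavaScript string. Under Rhino the input may be a
  // wrapped java.lang.String, which is the situation the post describes.
  html += '';
  // Drop a leading doctype declaration, if any; leave everything else alone.
  return html.replace(/^\s*<!DOCTYPE[^>]*>/i, '');
}

// A String object stands in here for a non-primitive string value.
var cleaned = stripDoctype(new String('<!DOCTYPE html>\n<html><body></body></html>'));
```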

I've not provided a patch for another problem that I've noticed with
in-the-wild html - script elements within the body element. That's
because, to be honest, I can't totally get my head around how
htmlparser.js works.

Anyhow, fix that and it would be trivial to change env.js like so:

728a729
>               var responseText = HTMLtoXML(self.responseText);
732c733
<                          self.responseText)).getBytes("UTF8")));
---
>                          responseText)).getBytes("UTF8")));

If you wanted to, that is ;)

 
From: Alex Robinson <solidgold...@gmail.com>
Date: Mon, 13 Oct 2008 02:49:35 -0700 (PDT)
Local: Mon, Oct 13 2008 5:49 am
Subject: Re: htmlparser.js - bug in handling of CDATA sections in style and script elements
On Oct 12, 11:29 pm, Alex Robinson <solidgold...@gmail.com> wrote:

> Anyhow, fix that and it would be trivial to change env.js like so:

Er, sorry I think I got ahead of myself on the triviality. I meant to
say something along the lines of - "Of course, there probably remain
all sorts of horrible edge cases that have taken tools like TagSoup
and Hpricot many many man hours to identify and fix".

However, I presume that the reason for htmlparser.js's existence is so
as not to rely on such external libraries, right? And that it is being
lined up to handle the parsing of non-perfect html for env.js. Unless
told to use some other faster parser that the user has lying around?
</fish>


 
From: "John Resig" <jere...@gmail.com>
Date: Mon, 13 Oct 2008 09:35:38 -0400
Local: Mon, Oct 13 2008 9:35 am
Subject: Re: htmlparser.js - bug in handling of CDATA sections in style and script elements
Alex -

> However, I presume that the reason for htmlparser.js's existence is so
> as not to rely on such external libraries, right? And that it is being
> lined up to handle the parsing of non-perfect html for env.js. Unless
> told to use some other faster parser that the user has lying around?

Yes, that is correct. And thank you for all your contributions - I'm
going to try and land them and remove the old Java XML parser. I'll
let you know how it goes!

--John


 
From: "chris thatcher" <thatcher.christop...@gmail.com>
Date: Mon, 13 Oct 2008 10:33:41 -0400
Local: Mon, Oct 13 2008 10:33 am
Subject: Re: htmlparser.js - bug in handling of CDATA sections in style and script elements

I think it's important to allow the parsers to live together and base the
choice on the contentType, e.g. the htmlparser is not going to help me if I
use the xhr to grab an atom feed.  So maybe we can use a switch statement or
the like?  You can still get rid of the jax parser and use e4x, but that's
its own can of worms.
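
The kind of switch suggested here might look like the sketch below. The function name, the returned labels, and the list of content types are all hypothetical; nothing here is taken from env.js.

```javascript
function chooseParser(contentType) {
  // Strip parameters such as "; charset=utf-8" and normalise case.
  var mime = String(contentType || '')
    .split(';')[0]
    .replace(/^\s+|\s+$/g, '')
    .toLowerCase();

  switch (mime) {
    case 'application/atom+xml':
    case 'application/rss+xml':
    case 'application/xml':
    case 'text/xml':
      return 'xml';   // hand off to a strict XML parser
    case 'text/html':
    default:
      return 'html';  // fall back to a forgiving parser such as htmlparser.js
  }
}

var feedParser = chooseParser('application/atom+xml; charset=utf-8');
var pageParser = chooseParser('text/html');
```

Defaulting to the forgiving parser is a design choice for the sketch: a missing or unknown content type is more likely to be tag soup than well-formed XML.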

--
Christopher Thatcher

 
From: "John Resig" <jere...@gmail.com>
Date: Mon, 13 Oct 2008 10:53:25 -0400
Local: Mon, Oct 13 2008 10:53 am
Subject: Re: htmlparser.js - bug in handling of CDATA sections in style and script elements
"htmlparser is not going to help me if I use the xhr to grab an atom feed."

How so? It's able to parse both HTML and XML - and generate a DOM for both.

--John



 
From: "chris thatcher" <thatcher.christop...@gmail.com>
Date: Mon, 13 Oct 2008 10:55:21 -0400
Local: Mon, Oct 13 2008 10:55 am
Subject: Re: htmlparser.js - bug in handling of CDATA sections in style and script elements

oh oops.  my misunderstanding, I thought it was specifically for html.  I
hadn't had a chance to look under the hood yet.

--
Christopher Thatcher

 