John,
Excellent news to see movement on env.js and htmlparser.js.
The main problem I found when trying to use the original env.js to
scrape pages was that env.js demands that the response be
well-formed, valid HTML. I did originally try to use htmlparser.js to
massage the response, but found myself adding so many kludges that I
resorted to simply shelling out to TagSoup, like so:
var cleanup = {
  input: self.responseText,
  output: '',
  args: ['-jar', '/usr/local/java/tagsoup-1.2.jar']
};
var cleanupsoak = runCommand('java', cleanup);
self.responseText = cleanup.output;
<<<
Now that things are moving, I actually bothered to look to see what
htmlparser.js was doing and realised that there were a couple of
problems with the regex dealing with style and script elements.
First, the use of .* - that needs to be [\s\S]*, because in JavaScript
the dot does not match newlines.
(see http://blog.stevenlevithan.com/archives/singleline-multiline-confusing)
Second, the backslash in the RegExp object that matches the closing
tag needs to be escaped itself for it to work.
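
A minimal sketch of both points, in case it helps. The patterns and
the names (rest, tag, dotPattern, anyPattern) are made up for
illustration and are not copied from htmlparser.js:

```javascript
// Illustrative input: the text following an opening <script> tag,
// with a body that spans several lines.
var rest = '\nvar a = 1;\n</script><p>after</p>';
var tag = 'script'; // hypothetical stand-in for the open element's name

// "." never matches a newline in JavaScript, so (.*) cannot span lines.
var dotPattern = new RegExp('(.*)<\\/' + tag + '[^>]*>');

// [\s\S] matches any character, newlines included.
var anyPattern = new RegExp('([\\s\\S]*)<\\/' + tag + '[^>]*>');

var dotText = dotPattern.exec(rest)[1]; // empty: the script body is missed
var anyText = anyPattern.exec(rest)[1]; // the whole script body

// In a string literal a single backslash is consumed by the string
// parser, so '[\s\S]' reaches RegExp as '[sS]'.
var unescaped = '[\s\S]';
var escaped = '[\\s\\S]';
```

The doubled backslashes only matter when the pattern is built from a
string; a regex literal such as /[\s\S]*/ needs no extra escaping.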
I've put a patch that fixes these problems here:
http://fu2k.org/alex/javascript/htmlparser/htmlparser.patch.20081012.js
As far as I can tell, these problems are the cause of these reports:
http://ejohn.org/blog/pure-javascript-html-parser/#comment-308850
http://ejohn.org/blog/pure-javascript-html-parser/#comment-310521
The patch also includes a fix for a different problem - namely that
htmlparser.js falls over when presented with a string that contains a
doctype declaration. All I've done is strip the doctype out. I'm sure
that something better can be done than that, but I thought it better
to do that for now, rather than just have things fall over.
The html += ''; is there because, well, Rhino sometimes claims that
the doctype regex is ambiguous - it seems that it needs an extra nudge
to make sure that html is treated as a JavaScript string rather than a
Java String object.
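
Roughly, the doctype handling amounts to something like the sketch
below. This is not the patch itself: stripDoctype is a hypothetical
helper name, and the pattern is an assumption that only covers
doctypes without an internal subset.

```javascript
// Hypothetical helper: remove a doctype declaration before parsing.
function stripDoctype(html) {
  // Concatenating with '' coerces the value to a JavaScript string; under
  // Rhino this matters when the value started life as a java.lang.String.
  html += '';
  // Assumed pattern: a doctype with no internal subset ("[...]").
  return html.replace(/<!DOCTYPE[^>]*>\s*/i, '');
}

var page = '<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">\n' +
           '<html><body></body></html>';
var cleaned = stripDoctype(page);
```

Input without a doctype passes through unchanged, so the helper is
safe to call on every response.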
I've not provided a patch for another problem that I've noticed with
in-the-wild HTML - script elements within the body element. That's
because, to be honest, I can't totally get my head around how
htmlparser.js works.
Anyhow, fix that and it would be trivial to change env.js like so:
728a729
> var responseText = HTMLtoXML(self.responseText);
732c733
< self.responseText)).getBytes("UTF8")));
---
> responseText)).getBytes("UTF8")));
If you wanted to, that is ;)