Way to recover from errors in XML/HTML parsing ?

Raffaele Sena

Oct 3, 2013, 1:58:41 AM
to golan...@googlegroups.com
Here is my problem: I need to parse a bunch of HTML files and extract some data, and I am using xml.Decoder in non-strict mode (actually I am using this great XPath package, http://godoc.org/launchpad.net/xmlpath, which internally uses encoding/xml).

Everything seemed to be working well, but a bunch of files were getting rejected. Looking at those files, I found the following snippet of code, which makes the parser fail:

<script language="javascript">
function ordnum(i) {
var s = "" + i;
if (i > 10 && i < 20) {
s += "th";
} else if (s.search(/1$/) >= 0) {
s += "st";
} else if (s.search(/2$/) >= 0) {
s += "nd";
} else if (s.search(/3$/) >= 0) {
s += "rd";
} else {
s += "th";
}
document.write("The " + s);
}
</script>

(please excuse the formatting :)

The problem is the first "if" (i > 10 && i < 20). The parser explodes because the "<" is not the start of an element and it's not escaped.
In the good old days the content of <script> would be wrapped in <![CDATA[ ... ]]>, but it looks like in HTML5 that is no longer required (there is a note here: http://wiki.whatwg.org/wiki/HTML_vs._XHTML#Element-specific_parsing)
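
The failure can be reproduced with the standard library alone. The following is a minimal sketch, not the code from the original program: firstError is a hypothetical helper, and the decoder settings (Strict = false together with xml.HTMLAutoClose and xml.HTMLEntity) are the ones the encoding/xml documentation describes for HTML-like input. The well-formed document should tokenize cleanly, while the one with an unescaped "<" inside <script> should come back with a syntax error.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// firstError tokenizes doc with a non-strict decoder configured for
// HTML-like input and returns the first error other than io.EOF, or nil
// if the whole document was tokenized.
func firstError(doc string) error {
	d := xml.NewDecoder(strings.NewReader(doc))
	d.Strict = false
	d.AutoClose = xml.HTMLAutoClose
	d.Entity = xml.HTMLEntity
	for {
		_, err := d.Token()
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
	}
}

func main() {
	// No markup-like characters inside the text: tokenizes without error.
	fmt.Println(firstError(`<html><body>hello</body></html>`))
	// The "<" in "i < 20" is read as the start of a tag, which fails.
	fmt.Println(firstError(`<html><script>if (i > 10 && i < 20) { f(); }</script></html>`))
}
```

Note that the "&&" is tolerated in non-strict mode; it is the bare "<" that stops the decoder.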

Anyway, my attempts to resolve this:

- I could read the file as a string and "strip" (maybe via regexp) the content of <script> </script> but I refuse to do that :)

- I have tried using the go.net/html package, which does parse the file just fine, but then I don't know what to do with the result (the xpath package expects an xml.Decoder). And I really don't want to walk the tree generated by the HTML parser (though in theory I could write a modified xpath package that uses go.net/html).

- I would love to write a "bridge" that scans the HTML tree and generates XML events, but the Decoder type doesn't allow me to do that (I don't want to say it, but I miss Java and SAX :) Is there a way that I am missing to implement my own Decoder.Token() or Decoder.RawToken() and pass it to xmlpath ?

- Actually all I wanted was a way to "restart" xml.Decoder after an error. I didn't try, but I suspect that adding a Decoder.ClearError() that sets Decoder.err=nil would probably do the trick. Then, after an error like the one I am getting, I would clear it, skip to the end of the last open element (the script) and continue as if nothing had happened.
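
The reason a "restart" is needed at all is that the decoder's error is sticky: once Token has failed, later calls keep returning the same error instead of resuming. A small sketch of that behaviour, assuming only the standard encoding/xml package (tokenErrors is a hypothetical helper written for this illustration):

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// tokenErrors calls Token on a non-strict decoder until it returns an
// error, then calls Token once more and returns both errors.
func tokenErrors(doc string) (first, second error) {
	d := xml.NewDecoder(strings.NewReader(doc))
	d.Strict = false
	for first == nil {
		_, first = d.Token()
	}
	_, second = d.Token()
	return first, second
}

func main() {
	// The bare "<" in "1 < 2" triggers the error; <b> is never reached.
	first, second := tokenErrors(`<a>1 < 2</a><b>still here</b>`)
	fmt.Println(first)
	fmt.Println(second)
}
```

Both lines should print the same syntax error, which is why clearing the stored error (and skipping the offending bytes) is the only way to continue with the same Decoder.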

Would this be a useful feature worth considering ?

Is my current problem a "bug" of non-strict mode ? (i.e. should non-strict mode treat the content of script, and possibly the other tags described in the previous link, as CDATA, or at least ignore parsing errors inside those tags ?)

Is there any plan to extend encoding/xml (and maybe encoding/json) so that it will be possible to implement a "decoder" that generates tokens "out of thin air" ? (again, I still miss the possibilities of a SAX parser :)
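
For readers coming to this thread later: Go 1.10 added xml.TokenReader and xml.NewTokenDecoder to encoding/xml, which let a Decoder consume tokens from any source rather than from bytes. The sketch below only illustrates that mechanism. sliceReader, elem, textOf and demo are hypothetical names invented here, and a real bridge would emit tokens while walking an HTML tree instead of replaying a fixed slice. Whether a given XPath package accepts such a decoder is a separate question that this sketch does not address.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
)

// sliceReader is a hypothetical xml.TokenReader that replays a fixed list
// of tokens; a real bridge would produce them from an HTML tree.
type sliceReader struct {
	toks []xml.Token
}

// Token returns the next queued token, or io.EOF when none are left.
func (s *sliceReader) Token() (xml.Token, error) {
	if len(s.toks) == 0 {
		return nil, io.EOF
	}
	t := s.toks[0]
	s.toks = s.toks[1:]
	return t, nil
}

// elem wraps the child tokens in a matching start/end element pair.
func elem(name string, children ...xml.Token) []xml.Token {
	n := xml.Name{Local: name}
	out := []xml.Token{xml.StartElement{Name: n}}
	out = append(out, children...)
	return append(out, xml.EndElement{Name: n})
}

// textOf drains a decoder fed by synthetic tokens and concatenates the
// character data it reports.
func textOf(toks []xml.Token) (string, error) {
	d := xml.NewTokenDecoder(&sliceReader{toks: toks})
	var text string
	for {
		tok, err := d.Token()
		if err == io.EOF {
			return text, nil
		}
		if err != nil {
			return text, err
		}
		if cd, ok := tok.(xml.CharData); ok {
			text += string(cd)
		}
	}
}

// demo builds <html><body>hello</body></html> purely from tokens.
func demo() (string, error) {
	body := elem("body", xml.CharData("hello"))
	return textOf(elem("html", body...))
}

func main() {
	fmt.Println(demo())
}
```

The Decoder still checks that start and end elements nest properly, so the token source has to emit balanced pairs.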

Did I miss any obvious way to fix my specific problem ?

Thanks!

-- Raffaele
 


Kyle Lemons

Oct 3, 2013, 3:11:41 AM
to Raffaele Sena, golang-nuts
I think you'd need to make a go.net/html version of xpath.  go.net/html is using the HTML5 state machine, not the XML parser, so making them talk will be very difficult.



Andy Balholm

Oct 3, 2013, 10:23:32 AM
to golan...@googlegroups.com
How complicated is what you are doing in XPath? Could it be handled by CSS selectors? If so, you could try code.google.com/p/cascadia.

Nigel Tao

Oct 3, 2013, 6:57:59 PM
to Raffaele Sena, golang-nuts
On Thu, Oct 3, 2013 at 3:58 PM, Raffaele Sena <raf...@gmail.com> wrote:
> - I could read the file as a string and "strip" (maybe via regexp) the
> content of <script> </script> but I refuse to do that :)

That doesn't sound so bad to me. You could strip <script>s via the
go.net/html tokenizer instead of a regexp:

--------
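package main

import (
	"io"
	"log"
	"os"
	"strings"

	// These packages lived at code.google.com/p/go.net/html when this was
	// posted; they have since moved to golang.org/x/net/html.
	"golang.org/x/net/html"
	"golang.org/x/net/html/atom"
)

// stripScripts copies the HTML read from r to w, dropping every <script>
// element along with its contents.
func stripScripts(w io.Writer, r io.Reader) error {
	script := false // true while inside a <script> element
	buf := make([]byte, 0, 4096)
	z := html.NewTokenizer(r)
	for {
		tt := z.Next()
		if tt == html.ErrorToken {
			break
		}

		// Save the raw bytes of the token. This is needed because calling
		// z.TagName can change z.Raw's returned slice's contents.
		buf = append(buf[:0], z.Raw()...)

		if !script && tt == html.StartTagToken {
			tn, _ := z.TagName()
			script = atom.Lookup(tn) == atom.Script
		}
		if !script {
			if _, err := w.Write(buf); err != nil {
				return err
			}
		}
		if script && tt == html.EndTagToken {
			tn, _ := z.TagName()
			script = atom.Lookup(tn) != atom.Script
		}
	}
	// The tokenizer reports io.EOF at the normal end of input.
	if err := z.Err(); err != io.EOF {
		return err
	}
	return nil
}

func main() {
	r := strings.NewReader(`<html><script>if (i > 10 && i < 20) {
alert("asdf"); }</script>` +
		`<body>hello</body></html>`)
	if err := stripScripts(os.Stdout, r); err != nil {
		log.Fatal(err)
	}
}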
--------

$ go run main.go
<html><body>hello</body></html>

Hooking this up to an io.Pipe and then an xml parser is left as an
exercise for the reader.
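
One possible shape for that exercise, as a sketch rather than a finished solution: run the filter in a goroutine that writes into an io.Pipe, and hand the read end to a non-strict xml.Decoder. The names filterFunc, pipeThrough, passThrough and startElements are invented for this illustration. passThrough is only a stand-in so the example compiles with the standard library alone; the real program would pass stripScripts, which has the same signature.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// filterFunc has the same shape as stripScripts: read markup from r and
// write the cleaned-up version to w.
type filterFunc func(w io.Writer, r io.Reader) error

// pipeThrough runs filter in its own goroutine and returns the read end of
// the pipe. Closing the returned reader unblocks the goroutine if the
// consumer stops reading early.
func pipeThrough(r io.Reader, filter filterFunc) io.ReadCloser {
	pr, pw := io.Pipe()
	go func() {
		// CloseWithError(nil) behaves like Close, so the reader sees io.EOF
		// on success and the filter's own error otherwise.
		pw.CloseWithError(filter(pw, r))
	}()
	return pr
}

// passThrough is a stand-in filter for this sketch; substitute stripScripts.
func passThrough(w io.Writer, r io.Reader) error {
	_, err := io.Copy(w, r)
	return err
}

// startElements pipes doc through filter and lists the element names that
// the non-strict XML decoder sees.
func startElements(doc string, filter filterFunc) ([]string, error) {
	pr := pipeThrough(strings.NewReader(doc), filter)
	defer pr.Close()

	d := xml.NewDecoder(pr)
	d.Strict = false
	d.AutoClose = xml.HTMLAutoClose
	d.Entity = xml.HTMLEntity

	var names []string
	for {
		tok, err := d.Token()
		if err == io.EOF {
			return names, nil
		}
		if err != nil {
			return names, err
		}
		if se, ok := tok.(xml.StartElement); ok {
			names = append(names, se.Name.Local)
		}
	}
}

func main() {
	fmt.Println(startElements(`<html><body>hello</body></html>`, passThrough))
}
```

The pipe keeps memory use flat for large files, at the cost of one extra goroutine per document.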

Raffaele Sena

Oct 3, 2013, 9:49:43 PM
to Andy Balholm, golan...@googlegroups.com
Thanks! I ended up using go.net/html and cascadia. The queries were
pretty simple so the changes turned out to be very close to just
"renaming" the packages.

Last night I also tried my "trick" of resetting the error in
xml.Decoder and actually that worked too (3 lines of code there to add
a ClearError method, and 3 more in the xpath module to call it). But
that requires "patching" the libraries on every machine I use.

I am still left with the question of whether xml.Decoder should do a
better job in non-strict mode (after all, I was parsing a valid HTML
document), and whether it would be useful to have a parser that can
recover from errors. (In a previous project, written in Ruby because Go
didn't exist yet, I had to deal with corrupted XML documents, where I
needed to skip the "broken" part in order to process as much of the
document as possible.)

-- Raffaele



On Thu, Oct 3, 2013 at 7:23 AM, Andy Balholm <andyb...@gmail.com> wrote:
> How complicated is what you are doing in XPath? Could it be handled by CSS
> selectors? If so, you could try code.google.com/p/cascadia.
>

Nigel Tao

Oct 4, 2013, 1:13:51 AM
to Raffaele Sena, Andy Balholm, golang-nuts
On Fri, Oct 4, 2013 at 11:49 AM, Raffaele Sena <raf...@gmail.com> wrote:
> I am still left with the questions if xml.Decoder should do a better
> job when in non-strict mode (after all I was parsing a valid html
> document) and if it would be useful to have a parser that can recover
> from errors (in a previous project, written in Ruby because Go didn't
> exist, I had to deal with corrupted XML documents, where I needed to
> skip the "broken" part in order to process as much as possible of the
> document)

I'm not sure if it's worth it. HTML isn't XML; parse HTML with an HTML
parser, not with a non-strict XML parser.

As for dealing with corrupted XML documents, I'm also not convinced that
adventuring off-spec will give enough benefits to justify the costs. A
one-off hack might be better than trying to extend the standard
library. To mangle Tolstoy, happy XML documents are all alike, every
unhappy XML document is unhappy in its own way.

Rick

Oct 4, 2013, 2:35:53 AM
to golan...@googlegroups.com, Raffaele Sena, Andy Balholm
> ... To mangle Tolstoy, happy XML documents are all alike, every
> unhappy XML document is unhappy in its own way.

Now this is the reason I love this group.

John Nagle

Oct 4, 2013, 3:30:16 AM
to golan...@googlegroups.com
On 10/2/2013 10:58 PM, Raffaele Sena wrote:
> Here is my problem: I need to parse a bunch of HTML files and extract some
> data, and I am using xml.Decoder in non strict mode (actually i am using
> this great XPath package http://godoc.org/launchpad.net/xmlpath that
> internally uses encoding/xml).

Wrong tool for the job.

> Everything seemed to be working well but I had a bunch of files getting
> rejected. Looking at the files I found the following snippet of code, that
> makes the parser fail:
>
> <script language="javascript">
> function ordnum(i) {
> var s = "" + i;
> if (i > 10 && i < 20) {
...
>
> The problem is the first "if" (i > 10 && i < 20). The parser explodes
> because the "<" is not a start element and it's not escaped.

That's valid HTML5. You used to have to escape JavaScript
with comment brackets ("<!-- text -->") or a CDATA section, but
that's no longer required. Once in <script> mode, the parser
is supposed to stay in script mode until "</" or "<!" is seen. See

http://www.w3.org/html/wg/drafts/html/master/syntax.html#script-data-less-than-sign-state

But it's not valid XML. In XML (and XHTML) you have to put
such things in a CDATA section or comment brackets. That
won't parse with an XML parser.

So use an HTML parser.

With the Go "net/html" parser, though, you have to implement
the code that looks at the first 1024 characters, figures out
what character set is being used, and restarts parsing from
the beginning. The rules for that are in the W3C spec.
You can't assume incoming documents are UTF-8. Most of
the older ones aren't. Is there a package for that yet?
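
As a rough illustration of the "look at the first 1024 bytes, then restart from the beginning" step, here is a standard-library-only sketch. It is a deliberate simplification and not the algorithm from the spec: it checks for a byte order mark and then for a charset declaration in a <meta> tag, using a regular expression. sniffCharset and sniffString are hypothetical names, and the sketch only reports the declared name; converting the bytes to UTF-8 would need a separate encoding package (golang.org/x/net/html/charset is one place to look).

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"io"
	"regexp"
	"strings"
)

// metaCharset finds charset=... inside a <meta> tag. It is a rough
// approximation of the spec's prescan, not a full implementation.
var metaCharset = regexp.MustCompile(`(?i)<meta[^>]*charset\s*=\s*["']?([a-z0-9_-]+)`)

// sniffCharset peeks at up to 1024 bytes without consuming them, so the
// returned reader still starts at the beginning of the document. It
// returns "" when nothing was declared and the caller must pick a default.
func sniffCharset(r io.Reader) (string, io.Reader) {
	br := bufio.NewReaderSize(r, 1024)
	// For documents shorter than 1024 bytes Peek returns what is there
	// plus an error, which can be ignored for sniffing purposes.
	head, _ := br.Peek(1024)
	switch {
	case bytes.HasPrefix(head, []byte{0xEF, 0xBB, 0xBF}):
		return "utf-8", br
	case bytes.HasPrefix(head, []byte{0xFE, 0xFF}):
		return "utf-16be", br
	case bytes.HasPrefix(head, []byte{0xFF, 0xFE}):
		return "utf-16le", br
	}
	if m := metaCharset.FindSubmatch(head); m != nil {
		return strings.ToLower(string(m[1])), br
	}
	return "", br
}

// sniffString reports the declared charset of doc and whether the reader
// returned by sniffCharset still yields the complete document.
func sniffString(doc string) (string, bool) {
	name, r := sniffCharset(strings.NewReader(doc))
	rest, err := io.ReadAll(r)
	return name, err == nil && string(rest) == doc
}

func main() {
	doc := `<html><head><meta http-equiv="Content-Type" ` +
		`content="text/html; charset=ISO-8859-1"></head><body>hi</body></html>`
	fmt.Println(sniffString(doc))
}
```

Because Peek does not advance the reader, no explicit "restart" is needed: the HTML parser can be given the returned reader directly.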

John Nagle

Johann Höchtl

Oct 4, 2013, 4:56:46 AM
to golan...@googlegroups.com, Andy Balholm


On Friday, October 4, 2013 at 05:49:43 UTC+4, Raffaele Sena wrote:

> and if it would be useful to have a parser that can recover
> from errors

That would also be my wish for parsing JSON: either to continue parsing after a failure, or to get much more context about where the error happened.
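
For the "more context" half of that wish, encoding/json already exposes a byte offset on *json.SyntaxError, which can be turned into a line and column. A minimal sketch, assuming the input is held in memory; locate and syntaxErrorAt are hypothetical helpers, and the column is only approximate because it counts bytes rather than characters. It does not address continuing after the failure.

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
	"fmt"
)

// locate converts the byte offset carried by a *json.SyntaxError into a
// 1-based line and a byte-based column. ok is false for any other error,
// including nil.
func locate(data []byte, err error) (line, col int, ok bool) {
	var se *json.SyntaxError
	if !errors.As(err, &se) {
		return 0, 0, false
	}
	off := int(se.Offset)
	if off > len(data) {
		off = len(data)
	}
	line = 1 + bytes.Count(data[:off], []byte("\n"))
	// LastIndexByte returns -1 on the first line, which makes col == off.
	col = off - (bytes.LastIndexByte(data[:off], '\n') + 1)
	return line, col, true
}

// syntaxErrorAt unmarshals doc and reports where a syntax error occurred.
func syntaxErrorAt(doc string) (line, col int, ok bool) {
	var v interface{}
	err := json.Unmarshal([]byte(doc), &v)
	return locate([]byte(doc), err)
}

func main() {
	doc := "{\"a\": 1,\n \"b\": oops}"
	if line, col, ok := syntaxErrorAt(doc); ok {
		fmt.Printf("syntax error near line %d, column %d\n", line, col)
	}
}
```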