Way to recover from errors in XML/HTML parsing?

Raffaele Sena

Oct 3, 2013, 1:58:41 AM

to golan...@googlegroups.com

Here is my problem: I need to parse a bunch of HTML files and extract some data, and I am using xml.Decoder in non-strict mode (actually I am using this great XPath package http://godoc.org/launchpad.net/xmlpath that internally uses encoding/xml).

Everything seemed to be working well but I had a bunch of files getting rejected. Looking at the files I found the following snippet of code that makes the parser fail:

Kyle Lemons

Oct 3, 2013, 3:11:41 AM

to Raffaele Sena, golang-nuts

I think you'd need to make a go.net/html version of xpath. go.net/html is using the HTML5 state machine, not the XML parser, so making them talk will be very difficult.

Andy Balholm

Oct 3, 2013, 10:23:32 AM

to golan...@googlegroups.com

How complicated is what you are doing in XPath? Could it be handled by CSS selectors? If so, you could try code.google.com/p/cascadia.

Nigel Tao

Oct 3, 2013, 6:57:59 PM

to Raffaele Sena, golang-nuts

On Thu, Oct 3, 2013 at 3:58 PM, Raffaele Sena <raf...@gmail.com> wrote:
> - I could read the file as a string and "strip" (maybe via regexp) the
> content of <script> </script> but I refuse to do that :)

That doesn't sound so bad to me. You could strip <script>s via the
go.net/html tokenizer instead of a regexp:

--------
package main

import (
	"io"
	"os"
	"strings"

	// At the time of this thread these packages lived under
	// code.google.com/p/go.net/html; they have since moved here.
	"golang.org/x/net/html"
	"golang.org/x/net/html/atom"
)

// stripScripts copies r to w, dropping every <script> element.
func stripScripts(w io.Writer, r io.Reader) error {
	script := false
	buf := make([]byte, 0, 4096)
	z := html.NewTokenizer(r)
	for {
		tt := z.Next()
		if tt == html.ErrorToken {
			break
		}

		// Save the raw bytes of the token. This is needed because calling
		// z.TagName can change z.Raw's returned slice's contents.
		buf = append(buf[:0], z.Raw()...)

		if !script && tt == html.StartTagToken {
			tn, _ := z.TagName()
			script = atom.Lookup(tn) == atom.Script
		}
		if !script {
			if _, err := w.Write(buf); err != nil {
				return err
			}
		}
		if script && tt == html.EndTagToken {
			tn, _ := z.TagName()
			script = atom.Lookup(tn) != atom.Script
		}
	}
	// The tokenizer reports io.EOF at the normal end of input.
	if err := z.Err(); err != nil && err != io.EOF {
		return err
	}
	return nil
}

func main() {
	r := strings.NewReader(`<html><script>if (i > 10 && i < 20) {
alert("asdf"); }</script>` +
		`<body>hello</body></html>`)
	if err := stripScripts(os.Stdout, r); err != nil {
		os.Exit(1)
	}
}
--------

$ go run main.go
<html><body>hello</body></html>

Hooking this up to an io.Pipe and then an xml parser is left as an
exercise for the reader.

Raffaele Sena

Oct 3, 2013, 9:49:43 PM

to Andy Balholm, golan...@googlegroups.com

Thanks! I ended up using go.net/html and cascadia. The queries were
pretty simple so the changes turned out to be very close to just
"renaming" the packages.

Last night I also tried my "trick" of resetting the error in
xml.Decoder and actually that worked too (3 lines of code there to add
a ClearError method, and 3 more in the xpath module to call it). But
that requires "patching" the libraries on every machine I use.

I am still left with the questions of whether xml.Decoder should do a
better job when in non-strict mode (after all I was parsing a valid html
document) and whether it would be useful to have a parser that can recover
from errors (in a previous project, written in Ruby because Go didn't
exist, I had to deal with corrupted XML documents, where I needed to
skip the "broken" part in order to process as much as possible of the
document).

-- Raffaele

On Thu, Oct 3, 2013 at 7:23 AM, Andy Balholm <andyb...@gmail.com> wrote:
> How complicated is what you are doing in XPath? Could it be handled by CSS
> selectors? If so, you could try code.google.com/p/cascadia.
>

Nigel Tao

Oct 4, 2013, 1:13:51 AM

to Raffaele Sena, Andy Balholm, golang-nuts

On Fri, Oct 4, 2013 at 11:49 AM, Raffaele Sena <raf...@gmail.com> wrote:
> I am still left with the questions of whether xml.Decoder should do a
> better job when in non-strict mode (after all I was parsing a valid html
> document) and whether it would be useful to have a parser that can recover
> from errors (in a previous project, written in Ruby because Go didn't
> exist, I had to deal with corrupted XML documents, where I needed to
> skip the "broken" part in order to process as much as possible of the
> document).

I'm not sure if it's worth it. HTML isn't XML; parse HTML with an HTML
parser, not with a non-strict XML parser.

As for dealing with corrupted XML documents, I'm also not convinced that
adventuring off-spec will give enough benefits to justify the costs. A
one-off hack might be better than trying to extend the standard
library. To mangle Tolstoy, happy XML documents are all alike, every
unhappy XML document is unhappy in its own way.

Rick

Oct 4, 2013, 2:35:53 AM

to golan...@googlegroups.com, Raffaele Sena, Andy Balholm

> ... To mangle Tolstoy, happy XML documents are all alike, every
> unhappy XML document is unhappy in its own way.

Now this is the reason I love this group.

John Nagle

Oct 4, 2013, 3:30:16 AM

to golan...@googlegroups.com

On 10/2/2013 10:58 PM, Raffaele Sena wrote:
> Here is my problem: I need to parse a bunch of HTML files and extract some
> data, and I am using xml.Decoder in non-strict mode (actually I am using
> this great XPath package http://godoc.org/launchpad.net/xmlpath that
> internally uses encoding/xml).

Wrong tool for the job.

> Everything seemed to be working well but I had a bunch of files getting
> rejected. Looking at the files I found the following snippet of code that
> makes the parser fail:
>
> <script language="javascript">
> function ordnum(i) {
> var s = "" + i;
> if (i > 10 && i < 20) {

...

>
> The problem is the first "if" (i > 10 && i < 20). The parser explodes
> because the "<" is not a start element and it's not escaped.

That's valid HTML5. You used to have to escape JavaScript
with comment brackets ("<!-- text -->") or a CDATA section, but
that's no longer required. Once in <script> mode, the parser
is supposed to stay in script mode until "</" or "<!" is seen. See

http://www.w3.org/html/wg/drafts/html/master/syntax.html#script-data-less-than-sign-state

But it's not valid XML. In XML (and XHTML) you have to put
such things in a CDATA section or comment brackets. That
won't parse with an XML parser.

So use an HTML parser.

With the Go "net/html" parser, though, you have to implement
the code that looks at the first 1024 characters, figures out
what character set is being used, and restarts parsing from
the beginning. The rules for that are in the W3C spec.
You can't assume incoming documents are UTF-8. Most of
the older ones aren't. Is there a package for that yet?

John Nagle

Johann Höchtl

Oct 4, 2013, 4:56:46 AM

to golan...@googlegroups.com, Andy Balholm

On Friday, October 4, 2013 at 05:49:43 UTC+4, Raffaele Sena wrote:

> and if it would be useful to have a parser that can recover
> from errors

That would also be my wish for parsing JSON: either to continue parsing after a failure, or to get much more context information about where the error happened.