exercising xml.Parser, style

nigel.kerr

unread,

Nov 4, 2010, 8:45:40 AM11/4/10

to golang-nuts

Good Folk,

To learn about Go, I thought I'd start with a topic I knew something
about elsewhere, XML-parsing. With a copy of the XML Conformance Test
Suites (the 20080827 suite from http://www.w3.org/XML/Test/) and go-
lang.org in hand, I have produced the following program,
parsefiles.go. It attempts to read a list of XML files, and tries to
decide if the file should successfully parse, and if it did.

I chose the Conformance Suite because it presents a lot of different
ways XML can be valid and invalid, and I wanted to try to use that to
see how xml.Parser behaves. I'm a little unclear as to whether I'm
using xml.Parser correctly for this exact purpose: RawToken() may not
be what to use for "parse this file to see if it violates any
constraints". Uses of the xml package I see elsewhere are usually
Unmarshal() to populate a data-structure (src/cmd/godoc/codewalk.go
for instance). The xml package's tests also are of a different
flavour that what I've written here. So I'm wary of this that I've
written and the results.

Is there any interest in the full results (off list, say, since
there's quite a lot...)?
I'm pretty sure I've not gotten Go idioms yet. I solicit commentary
on the code itself.

cheers,
Nigel Kerr
nigel...@gmail.com

package main

// parsefiles.go, attempt to run files past xml.Parser, checking
// expectations as to whether the file should parse or not.

import (
"bufio"
"flag"
"fmt"
"os"
"strings"
"xml"
)

func parserPassesTest(fname string, resultType string) (should bool,
did bool, mesg string) {

var errorstring string = ""
var shoulderror bool = false
if resultType != "valid" {
shoulderror = true
}
var sawerror bool = false

defer func() {
if r := recover(); r != nil {
mesg = fmt.Sprintf("recovered from a panic: %v", r)
should = shoulderror
did = true
}
}()

f, err := os.Open(fname, 0, 0)
defer f.Close()
if err != nil {
panic(err)
}
p := xml.NewParser(f)

for {
pt, perr := p.RawToken()
if perr != nil {
if perr != os.EOF {
errorstring = fmt.Sprintf("%v", perr)
sawerror = true
}
break
}
if pt == nil {
break
}
}
return shoulderror, sawerror, errorstring
}

func main() {
flag.Parse()

if flag.NArg() != 1 {
fmt.Fprintf(os.Stderr, "parsefile: need one file on command line.
\n")
os.Exit(1)
}

dat, derr := os.Open(flag.Arg(0), 0, 0)
if derr != nil {
panic(derr)
}
defer dat.Close()
br := bufio.NewReader(dat)

fmt.Fprintf(os.Stdout, "FILE\tSHOULD_ERROR\tDID_ERROR\tMESG\n")

for {
line, err := br.ReadString('\n')
if err != nil {
if err != os.EOF {
fmt.Fprintf(os.Stderr, "reading %s: %s\n", flag.Arg(0), err)
}
break
}
pieces := strings.Split(strings.TrimSpace(line), "\t", -1)
shouldError, didError, errMesg := parserPassesTest(pieces[0],
pieces[1])
if shouldError != didError {
fmt.Fprintf(os.Stdout, "%s\t%v\t%v\t%s\n", pieces[0], shouldError,
didError, errMesg)
}
}
}
// end of parsefiles.go

which takes as its sole filename argument a filename for a file that
contains lines like this:

eduni/errata-2e/E57.xml error
eduni/errata-2e/E60.xml valid
eduni/errata-2e/E61.xml not-wf
eduni/errata-3e/E05a.xml valid
eduni/errata-3e/E05b.xml valid

(relative path to a test file in the conformance suite TAB test-type
value for that test file (valid, invalid, error, not-wf)) The test
suite has 2,411 such test files by my count, and I don't include my
whole file here. The results for those above five lines, showing that
three behaved as expected, and two of them were expected to fail, but
did not:

FILE SHOULD_ERROR DID_ERROR MESG
eduni/errata-2e/E57.xml true false
eduni/errata-2e/E61.xml true false

Russ Cox

unread,

Nov 4, 2010, 3:45:15 PM11/4/10

to nigel.kerr, golang-nuts

I think you want to use p.Token, not p.RawToken.
Token considers the state of the stream, so it will
reject "<a>" and "<a></b>" and "</b><a>".
RawToken only checks the validity of the single token
it is reading, without regard to context, so it won't
reject any of those.

I'd be interested to see one or two examples that
are not handled properly (using p.Token) before
deciding whether I want to see hundreds. :-)

Russ

nigel.kerr

unread,

Nov 4, 2010, 7:58:14 PM11/4/10

to golang-nuts

Hello Russ, et al,

Making the recommended change from RawToken() to Token() improves
things some: more files pass per expectations; some files fail
expectations for new reasons.

I've looked at some tens here, and choosing some (I hope useful)
samples:

1. File from suite xmltest/valid/sa/024.xml:

<!DOCTYPE doc [
<!ELEMENT doc (foo)>
<!ELEMENT foo (#PCDATA)>
<!ENTITY e "<foo></foo>">
]>
<doc>&e;</doc>

about which the author of the test case says:

<TEST TYPE="valid" ENTITIES="none" ID="valid-sa-024"
URI="valid/sa/024.xml" SECTIONS="3.1 4.1 [43] [66]"
OUTPUT="valid/sa/out/024.xml">
Test demonstrates that Entity References are valid element content
and also demonstrates a valid Entity Declaration. </TEST>

The call to parser.Token() has this to say:

XML syntax error on line 4: unexpected end element </foo>

2. File from suite oasis/p03fail16.xml:

0x12<doc/>

(that's a byte 0x12 at the beginning) about which the author of the
test case says:

<TEST TYPE='not-wf' SECTIONS='2.3 [3]'
ID='o-p03fail16' URI='p03fail16.xml'>
Use of illegal character within XML document. </TEST>

The call to parser.Token() did not apparently error, or at least
didn't error the way I am using it.

3. File from suite eduni/errata-4e/ibm89n05.xml:

<!DOCTYPE animal [
<!ELEMENT animal ANY>
<?_ٟ an only legal per 5th edition extender #x65f in PITarget ?>
]>
<animal/>

about which the author of the test case says:

<TEST VERSION="1.0" URI="ibm89n05.xml" TYPE="valid" ID="ibm-valid-
P89-ibm89n05.xml" ENTITIES="none" SECTIONS="B." RECOMMENDATION="XML1.0-
errata4e" EDITION="5">
Tests Extender with an only legal per 5th edition character. The
character #x065F
occurs as the second character in the PITarget in the PI in the
DTD.
</TEST>

xml.Parser says "XML syntax error on line 3: invalid XML name: _ 0xD9
0x9F"
(bytes 0xd9 and 0x9f)

The Conformance Suite, the more i poke at it, contains quite a lot of
cases from the different editions (1-5 of XML 1.0, at least the 1e of
XML 1.1, a couple of Namespaces), some of which may be of various
currency (my third sample, for instance, is valid per 5e...). The
more tedious cases to examine are those that the CS asserts should
fail, when xml.Parser does not fail them.

On the up side, there are some wins like over half the files in the CS
met expectaions, and most of the files in the japanese/ folder
(including some interesting encodings) of the CS passed.

These data can readily be made into unit tests. What is helpful at
this juncture?

cheers,
Nigel

Russ Cox

unread,

Nov 4, 2010, 10:34:25 PM11/4/10

to nigel.kerr, golang-nuts

There are a bunch of complex things we're just not trying to do
in the xml parser, mainly because no one has needed it.
One is handle markup declarations in the <!DOCTYPE>.
Anything that uses that syntax is going to fail. That covers
your #1 and #3.

#2, the raw 0x12 (DC2) byte, is a bug. The raw parser should
be ensuring that the input only contains Unicode code points
drawn from {0x09, 0x0A, 0x0D, 0x20-0xD7FF, 0xE000-0xFFFD, 0x10000-0x10FFFF}.
If you'd like to fix that, feel free to send us a code review.
http://golang.org/doc/contribute.html

If you're excited about playing with the XML parser and finding
things like this, then by all means have fun. If not, no big deal.
You said:

> The
> more tedious cases to examine are those that the CS asserts should
> fail, when xml.Parser does not fail them.

It might suffice to collect the test descriptions for those.
That should be easy to extract, since you can use the
xml parser to read the xml <test> clauses.

Russ

nigel.kerr

unread,

Nov 6, 2010, 10:54:19 PM11/6/10

to golang-nuts

On Nov 4, 10:34 pm, Russ Cox <r...@golang.org> wrote:

> #2, the raw 0x12 (DC2) byte, is a bug. The raw parser should
> be ensuring that the input only contains Unicode code points
> drawn from {0x09, 0x0A, 0x0D, 0x20-0xD7FF, 0xE000-0xFFFD, 0x10000-0x10FFFF}.
> If you'd like to fix that, feel free to send us a code review.http://golang.org/doc/contribute.html

Reading xml.go, I think to propose at least a naive start:

add a new []unicode.Range describing the legal characters (XML Rec
section 2.2)

add to xml.text(), just preceding the line-end handling ("Must rewrite
\r and \r\n into \n."),
rune-by-run treatment of the slice of bytes read.

if a given rune isn't in a new []unicode.Range, make a new SyntaxError
to p.err, and return nil.

The disadvantages are certainly that we are reading the bytes an extra
time to see if they encode runes we don't want. The rest of the code
gets quite a long ways without hardly treating runes at all. This
naive solution can probably be improved upon.

Tests easy to write.

cheers,
Nigel

Russ Cox

unread,

Nov 15, 2010, 3:10:09 PM11/15/10

to nigel.kerr, golang-nuts

>> #2, the raw 0x12 (DC2) byte, is a bug. The raw parser should
>> be ensuring that the input only contains Unicode code points
>> drawn from {0x09, 0x0A, 0x0D, 0x20-0xD7FF, 0xE000-0xFFFD, 0x10000-0x10FFFF}.
>> If you'd like to fix that, feel free to send us a code review.http://golang.org/doc/contribute.html
>
> Reading xml.go, I think to propose at least a naive start:
>
> add a new []unicode.Range describing the legal characters (XML Rec
> section 2.2)
>
> add to xml.text(), just preceding the line-end handling ("Must rewrite
> \r and \r\n into \n."),
> rune-by-run treatment of the slice of bytes read.
>
> if a given rune isn't in a new []unicode.Range, make a new SyntaxError
> to p.err, and return nil.

Ugh. I didn't realize that package xml didn't check
any of the bytes for validity. I'm inclined to leave
this one alone. Be liberal in what you accept and all.

Russ

Reply all

Reply to author

Forward