XML Parsing Nested Elements

les...@gmail.com

Nov 7, 2017, 6:36:21 AM
to golang-nuts
I am really struggling to access nested elements of an XML string and suspect it is down to the namespaces.  This string is obtained from a larger document and is the "innerXML" of some elements.  A simplified version is at...

I could probably do this with multiple structs but want to have this in a single struct.


I can seem to read things at the root but cannot get them using the ">" syntax at all.  What am I doing wrong?  Can I "insert" a namespace element to assist it at all?

I have manually removed the namespaces from this example to show what I think should happen!?


Konstantin Khomoutov

Nov 7, 2017, 11:07:02 AM
to golang-nuts, les...@gmail.com
The chief problem with your approach is lack of error checking.
The encoding/xml.Unmarshal() function returns an error value.
Had you checked it for being set (not nil), it would have given you an
immediate idea of what was wrong with your approach.

Regarding namespaces, your hunch is correct: since your XML document is
a fragment extracted from another document by a seemingly "textual"
method, all those "XML namespace prefixes" — parts in the names of the
elements which come before the ':' characters — have no meaning to the
XML parser since they are not defined in the document itself.

Unfortunately, currently there's no way to somehow explicitly define
them anywhere (say, in an instance of encoding/xml.Decoder) before
decoding, so you basically have three options:

- Somehow textually stick their definitions onto the top element of your
XML document fragment so that it reads something like

<fdm:trackInformation xmlns:fdm="urn:whatever:ns1"
xmlns:nxcm="http://example.com/another/namespace/uri/"
...>

…and then parse the resulting document into a value of a struct
type whose field tags include the full namespace of each XML
element they are supposed to decode.

- Use an iterative approach by creating an instance of encoding/xml.Decoder
and calling its Token() method.

When it returns a token of type StartElement or EndElement,
the token's Name field can be examined to see what its "Space" and
"Local" fields are.

- Ignore the XML namespace prefixes completely.

In your case this appears to be the simplest solution as the
names of the elements appear to be unique anyway.
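
To make the second option a bit more concrete, here is a minimal sketch of the Token() loop. The helper name elementNames and the shortened fragment are mine, made up for illustration; only Decoder.Token, StartElement and Name come from encoding/xml.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"log"
	"strings"
)

// elementNames (a hypothetical helper) walks the document token by token
// and collects the Name of every start element it encounters.
func elementNames(doc string) ([]xml.Name, error) {
	dec := xml.NewDecoder(strings.NewReader(doc))
	var names []xml.Name
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return names, nil // end of input reached
		}
		if err != nil {
			return nil, err
		}
		if se, ok := tok.(xml.StartElement); ok {
			names = append(names, se.Name)
		}
	}
}

func main() {
	// Shortened fragment with undeclared prefixes, as in the original post.
	const fragment = `<fdm:trackInformation><nxcm:speed>245</nxcm:speed></fdm:trackInformation>`

	names, err := elementNames(fragment)
	if err != nil {
		log.Fatal(err)
	}
	for _, n := range names {
		fmt.Printf("space=%q local=%q\n", n.Space, n.Local)
	}
}
```

Since the prefixes are not declared anywhere, the decoder should report the bare prefix ("fdm", "nxcm") in the Space field, so a loop like this can still tell the elements apart.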

The variant which checks for errors, ignores the XML namespace prefixes
and also defines the field named "XMLName" on the type to check the
name of the element it's supposed to unmarshal can be implemented
as follows:

--------------------------------8<--------------------------------
package main

import (
	"encoding/xml"
	"log"
)

type TrackInformation struct {
	// XMLName makes the decoder verify the local name of the root element.
	XMLName struct{} `xml:"trackInformation"`

	TimeAtPosition string `xml:"timeAtPosition"`
	Speed          int    `xml:"speed"`

	// The "a>b>c" paths match on local element names only.
	DepApt string `xml:"qualifiedAircraftId>departurePoint>airport"`
	ArrApt string `xml:"qualifiedAircraftId>arrivalPoint>airport"`
	Gufi   string `xml:"qualifiedAircraftId>gufi"`
}

func main() {

	xmlToParse := `
<fdm:trackInformation>
	<nxcm:qualifiedAircraftId>
		<nxce:aircraftId>TEST</nxce:aircraftId>
		<nxce:gufi>KR32642300</nxce:gufi>
		<nxce:departurePoint>
			<nxce:airport>KJFK</nxce:airport>
		</nxce:departurePoint>
		<nxce:arrivalPoint>
			<nxce:airport>KJFK</nxce:airport>
		</nxce:arrivalPoint>
	</nxcm:qualifiedAircraftId>
	<nxcm:speed>245</nxcm:speed>
	<nxcm:timeAtPosition>2017-11-07T11:20:43Z</nxcm:timeAtPosition>
</fdm:trackInformation>`

	var trackInfo TrackInformation
	// Always check the error returned by Unmarshal.
	err := xml.Unmarshal([]byte(xmlToParse), &trackInfo)
	if err != nil {
		log.Fatal(err)
	}
	log.Println(trackInfo)
}
--------------------------------8<--------------------------------

Playground [1].
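
For comparison, the first option (textually declaring the namespaces on the top element) might look roughly like the sketch below. Everything specific in it is an assumption: the three "urn:example:..." URIs are invented placeholders, declareNamespaces and parse are hypothetical helpers, and the string replacement only works if the root start tag is spelled exactly "<fdm:trackInformation>" with no attributes. Nested struct types are used because the "a>b>c" notation cannot carry namespaces.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"log"
	"strings"
)

// Invented placeholder URIs; the real ones would come from the
// document the fragment was extracted from.
const (
	nsFDM  = "urn:example:fdm"
	nsNXCM = "urn:example:nxcm"
	nsNXCE = "urn:example:nxce"
)

// Struct tags must be literals, so the URIs are repeated here.
type aircraftID struct {
	Gufi string `xml:"urn:example:nxce gufi"`
}

type trackInformation struct {
	XMLName  xml.Name   `xml:"urn:example:fdm trackInformation"`
	Speed    int        `xml:"urn:example:nxcm speed"`
	Aircraft aircraftID `xml:"urn:example:nxcm qualifiedAircraftId"`
}

// declareNamespaces (hypothetical) rewrites the first root start tag so
// that it declares all three prefixes. It assumes the tag has no attributes.
func declareNamespaces(fragment string) string {
	const root = "<fdm:trackInformation>"
	decl := fmt.Sprintf(`<fdm:trackInformation xmlns:fdm=%q xmlns:nxcm=%q xmlns:nxce=%q>`,
		nsFDM, nsNXCM, nsNXCE)
	return strings.Replace(fragment, root, decl, 1)
}

// parse (hypothetical) declares the namespaces and then unmarshals.
func parse(fragment string) (trackInformation, error) {
	var ti trackInformation
	err := xml.Unmarshal([]byte(declareNamespaces(fragment)), &ti)
	return ti, err
}

func main() {
	const fragment = `<fdm:trackInformation>
	<nxcm:qualifiedAircraftId>
		<nxce:gufi>KR32642300</nxce:gufi>
	</nxcm:qualifiedAircraftId>
	<nxcm:speed>245</nxcm:speed>
</fdm:trackInformation>`

	ti, err := parse(fragment)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(ti.Aircraft.Gufi, ti.Speed)
}
```

This is more brittle than ignoring the prefixes, so it probably only pays off when two elements share a local name and differ only by namespace.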


A couple more notes.

- You can't use namespaces when defining the names of the nested
elements. The wording of the documentation is a bit vague, but it does
explicitly state this: «If the XML element contains a sub-element
whose name matches the prefix of a tag formatted as "a" or "a>b>c"…».
Notice the "the prefix of a tag" bit, which actually means "the local
name of an element".

So when you need to match on full names of the elements, you'd have to
use nested structs so that each field stands for an element without
nesting, and the nesting is defined via your types rather than
tags on their fields.

- The XML decoder implements a "strict" mode, which is "on" by default.

What's interesting about it is that even when it's on, it turns a
blind eye to undefined XML namespace prefixes: «Strict mode does not
enforce the requirements of the XML name spaces TR. In particular it
does not reject name space tags using undefined prefixes. Such tags
are recorded with the unknown prefix as the name space URL.»

This means that you can use your undefined namespace prefixes "as is"
when decoding. [2] demonstrates this approach applied to the top-level
XML elements. You can't do this for that "a>b>c" notation in the tags
but you still can apply it when implementing parsing using the nested
types.

- Another trick up the sleeve of the XML decoder is support for custom
unmarshaling functions for your custom types.

Any of your types (such as TrackInformation) can implement a function

UnmarshalXML(d *xml.Decoder, start xml.StartElement) error

to make that type implement the encoding/xml.Unmarshaler interface.

When the decoder sees that a type implements this interface, it calls
the UnmarshalXML function instead of dealing with the element's contents
itself.

It follows that you can have a hierarchy of low-level unexported
types and a top-level "facade" type defining UnmarshalXML which
internally first unmarshals the element using that hierarchy of types
and then populates your "facade" type with the information that ended
up in that hierarchy of values.


Hope this helps.

1. https://play.golang.org/p/KJvvWg9apu
2. https://play.golang.org/p/AR5vDTKX0Q

les...@gmail.com

Nov 7, 2017, 11:19:29 AM
to golang-nuts
Thank you for the incredibly detailed response.  It has really helped to understand the situation.

I actually started with an iterative approach with a Decoder object and this got very complex, very quickly.  It worked but the code was unworkable going forwards. I thought it might be worth trying this approach with an Unmarshaller.

I didn't think of ignoring the namespace prefixes.  You are right and after checking over the definitions there are no conflicting names at all so this works well.

Once again many thanks, the detailed write up I'm sure will help others.