xml.Unmarshal and order-dependent XML


Florian Weimer

Jun 23, 2012, 4:04:27 PM
to golan...@googlegroups.com
I would like to use xml.Unmarshal to parse XML documents which contain
elements like this:

<p><b>Note:</b> This is an <i>important</i> example</p>

The contents of <p> elements could be represented as a slice of
interface type, where the concrete types represent character data, <b>
elements and <i> elements. But the xml package wouldn't know which
types to create, so this won't work. (Datatypes/enums-with-values
would help with that, by the way.)

Another approach would be a type

type Hmaterial struct {
    XMLName xml.Name
    I       []Hmaterial
    B       []Hmaterial
    Text    string
}

where only one of the I, B, Text elements has a non-zero value. But
as far as I understand it, it is only possible to match fields against
sub-elements.

Is there something else I could try?
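
To illustrate the ordering problem, here is a minimal sketch, assuming a hypothetical Para type (not from the original post) that maps <b>, <i>, and character data to separate fields with struct tags. xml.Unmarshal fills every field, but nothing records where each piece sat relative to the others.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// Para is a hypothetical mapping of the <p> element; the type and
// field names are illustrative assumptions, not part of the question.
type Para struct {
	B    []string `xml:"b"`         // contents of all <b> children
	I    []string `xml:"i"`         // contents of all <i> children
	Text string   `xml:",chardata"` // character data directly inside <p>
}

// parsePara unmarshals a single element into a Para.
func parsePara(doc string) (Para, error) {
	var p Para
	err := xml.Unmarshal([]byte(doc), &p)
	return p, err
}

func main() {
	p, err := parsePara(`<p><b>Note:</b> This is an <i>important</i> example</p>`)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// The three fields are populated independently, so the original
	// interleaving of text, <b>, and <i> cannot be reconstructed.
	fmt.Printf("B=%q I=%q Text=%q\n", p.B, p.I, p.Text)
}
```

The character data before and after <i> ends up concatenated in Text, which is why this representation cannot round-trip mixed content.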

Kyle Lemons

Jun 24, 2012, 1:54:45 AM
to Florian Weimer, golan...@googlegroups.com
Check out exp/html at tip.

Florian Weimer

Jun 24, 2012, 2:09:12 AM
to Kyle Lemons, golan...@googlegroups.com
* Kyle Lemons:

> Check out exp/html at tip.

Sorry, I should have been clear about this: I'm not trying to parse
HTML, and the tags are quite different. But the dependency on order is
similar.

Patrick Mylund Nielsen

Jun 24, 2012, 2:12:15 AM
to Florian Weimer, Kyle Lemons, golan...@googlegroups.com
exp/html really just parses arbitrary tags, getting their attributes and values (in order). I think it would be sufficient for your purpose.

Florian Weimer

Jun 24, 2012, 2:25:46 AM
to Patrick Mylund Nielsen, Kyle Lemons, golan...@googlegroups.com
* Patrick Mylund Nielsen:

> exp/html really just parses arbitrary tags, getting their attributes and
> values (in order.) I think it would be sufficient for your purpose.

Some tag names are hard-coded in the Tokenizer from the exp/html
package. Is this really a better fit than xml.Decoder?

Russ Cox

Jun 24, 2012, 6:34:31 PM
to Florian Weimer, golan...@googlegroups.com
On Sat, Jun 23, 2012 at 4:04 PM, Florian Weimer <f...@deneb.enyo.de> wrote:
> Is there something else I could try?

Unmarshal is really for data structures, which, as
http://golang.org/pkg/encoding/xml/#bugs points out, have different
properties than pure XML. If you need to preserve ordering, then
Unmarshal isn't going to be of any help, as you discovered.

I think you need to process the XML stream yourself. Note that you can
create an xml.Decoder and pull individual xml tokens out one at a time
in your own parser. Then at least you don't have to deal with all the
tokenization and interpretation of character sequences.

Russ
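
A minimal sketch of the token-pulling approach described above, under stated assumptions: the Node type and the parseChildren and parse helpers are hypothetical names introduced for illustration, while xml.NewDecoder, Decoder.Token, xml.StartElement, xml.EndElement, and xml.CharData are the encoding/xml calls and types it relies on. It builds a tree in which children appear in document order.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// Node is a hypothetical ordered representation of mixed content:
// either an element (Name set) or a run of character data (Text set).
type Node struct {
	Name     string // element name; empty for character data
	Text     string // character data; empty for elements
	Children []Node // child nodes in document order
}

// parseChildren reads tokens until the end of the enclosing element
// (or EOF at the top level) and returns the nodes seen, in order.
func parseChildren(d *xml.Decoder) ([]Node, error) {
	var nodes []Node
	for {
		tok, err := d.Token()
		if err == io.EOF {
			return nodes, nil
		}
		if err != nil {
			return nil, err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			kids, err := parseChildren(d)
			if err != nil {
				return nil, err
			}
			nodes = append(nodes, Node{Name: t.Name.Local, Children: kids})
		case xml.EndElement:
			// The decoder has already checked that the tags match.
			return nodes, nil
		case xml.CharData:
			// Copy the bytes: the token is only valid until the next Token call.
			nodes = append(nodes, Node{Text: string(t)})
		}
	}
}

// parse is a convenience wrapper over a string input.
func parse(doc string) ([]Node, error) {
	return parseChildren(xml.NewDecoder(strings.NewReader(doc)))
}

func main() {
	nodes, err := parse(`<p><b>Note:</b> This is an <i>important</i> example</p>`)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, n := range nodes[0].Children {
		fmt.Printf("name=%q text=%q\n", n.Name, n.Text)
	}
}
```

Attributes, comments, and processing instructions are ignored here; a real parser would likely also map element names to concrete types instead of a generic Node.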

Patrick Mylund Nielsen

Jun 25, 2012, 9:19:01 PM
to Florian Weimer, Kyle Lemons, golan...@googlegroups.com
Didn't mean to leave you hanging. I don't know; I haven't actually tried using the HTML tokenizer for XML. I think Russ's suggestion (use xml.Decoder and pull out tokens) is your best bet, and it makes more sense too. It is basically what I was thinking you could do with exp/html.