Event driven based XML parsing

Tong Sun

Oct 28, 2015, 11:39:57 AM
to golang-nuts
Hi, 

I'm facing a problem that I think can only be solved with event-driven XML parsing, i.e., I can't make the existing "encoding/xml" unmarshalling work for me. Of course, most probably that is not really the case, so I need help.

I am parsing Microsoft webtest files. A sample file looks like this:

I.e., in the Microsoft webtest XML file, there are a bunch of requests, and these are what I'm processing. The challenge is that such webtest requests may be buried under other XML tags (like transactions or conditions), instead of sitting strictly at the WebTest>Items>Request level.

For a condition, for example, the only thing I care about is `xml:"ConditionalRule>RuleParameters>RuleParameter"`. However, if I process it like this --
https://play.golang.org/p/ryq3l3MKwy

Then all the webtest requests buried beneath the condition tag will be eaten, by the `xml.Unmarshal` I think, so all "interesting" stuff would be gone. 

I need an event-driven XML parsing approach so that, on seeing an interesting XML start tag, *regardless of where it is*, I can process the tags I'm interested in, without side effects like eating up sub-elements, and without them being ignored because they sit under certain tags. How is that possible?

Thanks


Daniel Skinner

Oct 28, 2015, 11:54:32 AM
to Tong Sun, golang-nuts
Sounds like the tags you're interested in are arbitrarily located within the document? Not sure if it's the best choice, but at the very least you could manually decode the XML document and handle each token yourself. https://play.golang.org/p/QnY9_p0-NF

Konstantin Khomoutov

Oct 28, 2015, 12:04:36 PM
to Tong Sun, golang-nuts
On Wed, 28 Oct 2015 08:39:57 -0700 (PDT)
Tong Sun <sunto...@gmail.com> wrote:

[...]
> For condition, for e.g., the only thing I care about is
> `xml:"ConditionalRule>RuleParameters>RuleParameter"`, however, if I
> process it like this --
> https://play.golang.org/p/ryq3l3MKwy
>
> Then all the webtest requests buried beneath the condition tag will
> be eaten, by the `xml.Unmarshal` I think, so all "interesting" stuff
> would be gone.
>
> I need a event driven base XML parsing approach so that, when seeing
> the interesting XML starting tag, **regardless where they are**,
> process my interested tags, without side-effects like eating up
> sub-xmls, or being ignored because of they are under certain tags.
> How is that possible?

The xml.Decoder of the encoding/xml package supports mixed-mode
decoding, where you iterate over the XML document using the decoder's
RawToken() method until you find an "interesting" start element,
and once one is found, you can use the decoder's DecodeElement() method
to get the same effect as if you had called Unmarshal() on exactly the
part of the XML document's tree your start element points at.
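
A minimal sketch of such a mixed-mode loop, for illustration only. It uses Token() rather than RawToken(), which a later reply in this thread reports as working fine with DecodeElement(). The `RuleParameter`, `Request` and `Found` types, the attribute names and the sample document are assumptions made up for this example, not taken from a real webtest file.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// RuleParameter and Request are hypothetical types; the attribute names
// are assumptions, not taken from a real webtest file.
type RuleParameter struct {
	Name  string `xml:"Name,attr"`
	Value string `xml:"Value,attr"`
}

type Request struct {
	URL string `xml:"Url,attr"`
}

// Found holds everything of interest seen while scanning a document.
type Found struct {
	Params   []RuleParameter
	Requests []Request
}

// collect scans doc token by token and decodes each interesting element
// where it is found, at any nesting depth. Elements that are not decoded
// (such as Condition) are simply walked through, so their children are
// still visited.
func collect(doc string) (Found, error) {
	dec := xml.NewDecoder(strings.NewReader(doc))
	var f Found
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return f, nil
		}
		if err != nil {
			return f, err
		}
		se, ok := tok.(xml.StartElement)
		if !ok {
			continue
		}
		switch se.Name.Local {
		case "RuleParameter":
			var rp RuleParameter
			if err := dec.DecodeElement(&rp, &se); err != nil {
				return f, err
			}
			f.Params = append(f.Params, rp)
		case "Request":
			var rq Request
			if err := dec.DecodeElement(&rq, &se); err != nil {
				return f, err
			}
			f.Requests = append(f.Requests, rq)
		}
	}
}

// sample is an invented, heavily shrunken stand-in for a webtest file.
const sample = `<WebTest><Items>
  <Condition>
    <ConditionalRule><RuleParameters>
      <RuleParameter Name="A" Value="1"/>
    </RuleParameters></ConditionalRule>
    <Then><Items><Request Url="http://example.com/"/></Items></Then>
  </Condition>
</Items></WebTest>`

func main() {
	found, err := collect(sample)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", found)
}
```

Because only the leaf elements are handed to DecodeElement(), the request nested under `Condition>Then>Items` is not swallowed by decoding the condition.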

Tong Sun

Oct 28, 2015, 12:11:25 PM
to golang-nuts, sunto...@gmail.com

Yep, that's what I'm doing now. 

I've collected how to manually decode the xml document and handle each token myself at,
https://github.com/suntong/lang/blob/master/lang/Go/src/xml/parser02.go

and I'm using the following mixed approach when dealing with each interesting XML tag:
https://github.com/suntong/lang/blob/master/lang/Go/src/xml/parser03.go

The missing piece for me is that, with the `Condition` tag, I need to deal with the part in `xml:"ConditionalRule>RuleParameters>RuleParameter"` while ignoring the other parts. How would that be possible? I've put a complete yet shrunken XML Condition sample at https://play.golang.org/p/CzYClRz551

in which I'm interested in `Condition>ConditionalRule>RuleParameters>RuleParameter`, while I'd like to leave `Condition>Then>Items>Request` alone.

Tong Sun

Oct 28, 2015, 12:21:31 PM
to golang-nuts, sunto...@gmail.com
Good idea, that's what I'm trying to get to, but I'm stuck now.

I'm using the following mixed approach when dealing with each interesting XML tag.


Would you show me how to output the timestamp for each revision while still outputting that revision's text, please?

thanks

Tong Sun

Oct 28, 2015, 12:40:32 PM
to golang-nuts
 
I meant, I am using DecodeElement() already. Once I change `decoder.Token()` to `decoder.RawToken()`, the code no longer works as expected.

 
Would you show me how to output the timestamp for each revision while still outputting that revision's text, please?

I meant that, given the nature of my goal, I need to deal with the `revision` and `text` tags separately, as they are not arranged under `page` as nicely as in the sample code.

 
thanks

Konstantin Khomoutov

Oct 28, 2015, 12:51:16 PM
to Tong Sun, golang-nuts
On Wed, 28 Oct 2015 09:11:24 -0700 (PDT)
Tong Sun <sunto...@gmail.com> wrote:

> Yep, that's what I'm doing now.
>
> I've collected how to manually decode the xml document and handle
> each token myself at,
> https://github.com/suntong/lang/blob/master/lang/Go/src/xml/parser02.go
>
> and I've using the following mixed approach when dealing with each
> interesting xml tags.
> https://github.com/suntong/lang/blob/master/lang/Go/src/xml/parser03.go
>
> The missing dot for me is that, with the `Condition` tag, I need to
> deal with the part in
> `xml:"ConditionalRule>RuleParameters>RuleParameter"`, while ignoring
> other parts. How would that be possible. I've put full-yet-shrinked
> xml Condition code in https://play.golang.org/p/CzYClRz551
>
> in which I'm interested in
> `Condition>ConditionalRule>RuleParameters>RuleParameter`, while I
> like to leave `Condition>Then>Items>Request` alone.

What about maintaining a stack of past start elements?
Sketched out here: https://play.golang.org/p/T1-boS14XO

The basic idea is that you push the names of all encountered start
elements onto a stack and pop them back when processing their
respective end elements. Now you only look at start
elements with the (local) name "RuleParameter": once you've found one,
you inspect its *context* by looking at the stack of recorded names of
the enclosing start elements. Since in both cases "RuleParameter"
elements seem to be enclosed in a "RuleParameters" element, you need
to look two levels deep.

My sketch presumably misses some corner cases (clearly, it
assumes any "RuleParameter" element is enclosed in at least two parent
elements) but appears to do what you need.
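
One possible way to write the stack idea down, as a hedged sketch rather than the code behind the playground link. The `RuleParameter` type, the helper names `hasSuffix` and `conditionParams`, the attribute names and the sample document (including the `ValidationRule` branch) are assumptions invented for illustration.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// RuleParameter is a hypothetical type; the attribute names are assumed.
type RuleParameter struct {
	Name  string `xml:"Name,attr"`
	Value string `xml:"Value,attr"`
}

// hasSuffix reports whether the stack of open element names ends with
// want. A stack shorter than want never matches, which covers the case
// of a RuleParameter with fewer than two parents.
func hasSuffix(stack []string, want ...string) bool {
	if len(stack) < len(want) {
		return false
	}
	tail := stack[len(stack)-len(want):]
	for i := range want {
		if tail[i] != want[i] {
			return false
		}
	}
	return true
}

// conditionParams decodes only those RuleParameter elements whose two
// enclosing elements are ConditionalRule > RuleParameters.
func conditionParams(doc string) ([]RuleParameter, error) {
	dec := xml.NewDecoder(strings.NewReader(doc))
	var stack []string // local names of the currently open elements
	var out []RuleParameter
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return out, nil
		}
		if err != nil {
			return nil, err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			if t.Name.Local == "RuleParameter" &&
				hasSuffix(stack, "ConditionalRule", "RuleParameters") {
				var rp RuleParameter
				if err := dec.DecodeElement(&rp, &t); err != nil {
					return nil, err
				}
				out = append(out, rp)
				// DecodeElement consumed the matching end element,
				// so this element is never pushed.
				continue
			}
			stack = append(stack, t.Name.Local)
		case xml.EndElement:
			// Token() checks that tags match, so every end element seen
			// here closes the element on top of the stack.
			stack = stack[:len(stack)-1]
		}
	}
}

// sample is an invented document with RuleParameter in two contexts.
const sample = `<WebTest><Items>
  <Request Url="http://example.com/">
    <ValidationRules><ValidationRule><RuleParameters>
      <RuleParameter Name="skip" Value="0"/>
    </RuleParameters></ValidationRule></ValidationRules>
  </Request>
  <Condition>
    <ConditionalRule><RuleParameters>
      <RuleParameter Name="keep" Value="1"/>
    </RuleParameters></ConditionalRule>
  </Condition>
</Items></WebTest>`

func main() {
	params, err := conditionParams(sample)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", params)
}
```

Checking the context with a suffix match keeps the rule independent of how deeply the `Condition` itself is nested.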

Konstantin Khomoutov

Oct 28, 2015, 1:03:01 PM
to Tong Sun, golang-nuts
On Wed, 28 Oct 2015 09:40:32 -0700 (PDT)
Tong Sun <sunto...@gmail.com> wrote:

[...]
> >> The xml.Decoder of the encoding/xml package supports mixed-mode
> >> decoding, where you iterate over the XML document using the
> >> decoder's RawToken() method until you find an "interesting" start
> >> element, and once such found, you can use the decoder's
> >> DecodeElement() method to get the same effect as you would call
> >> Unmarshal() on exactly that part of the XML document's tree your
> >> start element points at.
[...]
> I meant, I am using DecodeElement() already. Once I've change the
> `decoder.Token()` to `decoder.RawToken()`, the code is not working as
> expected.

Disregard what I said: xml.Decoder's Token() appears to work just OK
with DecodeElement(), so I think you're doing it right.

[...]

Tong Sun

Oct 28, 2015, 8:35:05 PM
to golang-nuts
I've rewritten the above into a short demo


It emphasizes the challenge that I'm facing: my goal is to output the content of every `revision` tag and, meanwhile, output every Page Title as well. Currently I can get either one, but not both.

But I think I'm really close. If I can get line 119 working, I think I should be good. Can somebody fix line 119 for me, please? Currently, if I comment out line 119, all revision contents are printed; if I then also comment out line 118, the "decoder.DecodeElement(&p, &se)" line, all Page contents are printed, but the `revision` contents no longer are.

Please help.


thanks

Daniel Skinner

Oct 28, 2015, 9:32:03 PM
to Tong Sun, golang-nuts
Only looked briefly, but line 119 fails because line 115 returns a plain `xml.Token`. You need to type-assert it back to `xml.StartElement`:

`tc := xml.CopyToken(t).(xml.StartElement)`
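
For context, a small self-contained sketch of that assertion. It is not the code from the demo being discussed; the helper name `firstStart` and the sample document are made up for illustration. xml.CopyToken() returns the interface type xml.Token, so the copy has to be asserted back to xml.StartElement before its address can be passed to DecodeElement().

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// firstStart returns a copy of the first start element found in doc.
// It is a hypothetical helper written only to show the type assertion.
func firstStart(doc string) (xml.StartElement, error) {
	dec := xml.NewDecoder(strings.NewReader(doc))
	for {
		t, err := dec.Token()
		if err != nil {
			return xml.StartElement{}, err
		}
		if _, ok := t.(xml.StartElement); !ok {
			continue
		}
		// CopyToken returns an xml.Token; the comma-ok assertion turns
		// it back into a concrete xml.StartElement without panicking.
		tc, ok := xml.CopyToken(t).(xml.StartElement)
		if !ok {
			return xml.StartElement{}, fmt.Errorf("copy is not a start element")
		}
		return tc, nil
	}
}

func main() {
	se, err := firstStart(`<page id="7"><title>Go</title></page>`)
	if err != nil {
		panic(err)
	}
	fmt.Println(se.Name.Local, se.Attr[0].Name.Local, se.Attr[0].Value)
}
```

The single-value form `xml.CopyToken(t).(xml.StartElement)` from the reply works the same way but panics if the token is not a start element, so the comma-ok form is used here.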

C Banning

Dec 14, 2015, 5:47:30 AM
to golang-nuts
Solved with new (alternative) XML-map[string]interface{} decoder/encoder in https://github.com/clbanning/mxj.
See examples gonuts11seq.go and gonuts12seq.go.