XML syntax error when input contains &#8;

hannson

Sep 6, 2012, 4:39:29 PM
to golan...@googlegroups.com
I'm parsing a 2GB XML file that, for some reason, contains the illegal entity &#8;. I don't have access to the source data, and the file is regenerated weekly, so I need to be lenient about those entities (replace them with a space or with nothing).

Here's an example of the error: http://play.golang.org/p/PQNusIo_Ix

First, should it cause an error? 
Second, do you have any tips on how to remove those entities on the fly? 

I was thinking of writing an XmlCharReader that implements io.Reader and filters out the entities, but I could use some tips on how to do that.

Ideas?

Kamil Kisiel

Sep 6, 2012, 4:45:00 PM
to golan...@googlegroups.com
I encountered the same problem in a project I was recently working on. I solved it by implementing a Reader that filters out invalid UTF-8 characters from the stream.

hannson

Sep 6, 2012, 5:11:05 PM
to golan...@googlegroups.com
It's probably not the exact same problem. I still get the same result when using your code. The thing is, the XML parser decodes the entity into 0x08 after the input has been read. I tried a similar solution myself before I figured out that the culprit was an XML entity rather than an illegal byte.

I tried adding [#8 = ""] to the Entity map in xml.Decoder, but nothing changed, so I see no option other than writing a filter that searches for and replaces those illegal entities.
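A minimal sketch reproducing the behavior described above, assuming a current Go toolchain and the standard encoding/xml package. The `doc` type and `decode` helper are hypothetical names introduced only for illustration. Numeric character references appear to be resolved before the Entity map is consulted, which would explain why an entry keyed "#8" has no effect:

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// doc captures the character data of whatever the root element is.
type doc struct {
	Text string `xml:",chardata"`
}

// decode parses input with a strict xml.Decoder, optionally using a
// custom Entity map, and returns the root element's character data.
func decode(input string, entities map[string]string) (string, error) {
	d := xml.NewDecoder(strings.NewReader(input))
	d.Entity = entities
	var v doc
	err := d.Decode(&v)
	return v.Text, err
}

func main() {
	// &#8; (backspace) is outside the XML 1.0 character range.
	_, err := decode("<a>x&#8;y</a>", nil)
	fmt.Println("plain decoder:", err)

	// Adding "#8" to the Entity map does not change the outcome.
	_, err = decode("<a>x&#8;y</a>", map[string]string{"#8": ""})
	fmt.Println("with Entity map:", err)

	// &#9; (tab) is an allowed character, so this one decodes.
	text, err := decode("<a>x&#9;y</a>", nil)
	fmt.Printf("tab reference: %q, err=%v\n", text, err)
}
```

Both of the first two calls should report a syntax error, while the tab reference decodes normally.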

Kamil Kisiel

Sep 6, 2012, 6:03:18 PM
to golan...@googlegroups.com
Ah sorry, I misunderstood the problem. I see what's happening now: the decoder sees it as a character entity, but one that's outside the valid range: http://www.xml.com/axml/testaxml.htm (section 2.2 - Characters). You'd have to either modify the decoder to ignore these instead of returning an error, or filter them out somehow before decoding.

hannson

Sep 6, 2012, 7:12:40 PM
to golan...@googlegroups.com
Yeah I think I'll rip out the entity code from xml.Decoder and use it in a filter. I can't modify the decoder because I might share the code later and I'd rather not have to hack every release of Go to work for this particular file.

For now I'll just remove the entity from the file and see what happens.

Jan Mercl

Sep 7, 2012, 5:38:01 AM
to hannson, golan...@googlegroups.com

On Sep 6, 2012 10:39 PM, "hannson" <han...@gmail.com> wrote:
> Ideas?

sed?

-j

Mike Samuel

Sep 7, 2012, 2:17:08 PM
to golan...@googlegroups.com, hannson
It's possible using sed, but not trivial to get correct for arbitrary markup.  Consider that the "&#8;" sequence in

    <![CDATA[ --> &#8;]]> <![CDATA[ ... ]]>

should not be fixed, but the one in

    <!-- <![CDATA[ -->&#8; <![CDATA[ ... ]]>

should be fixed.

To handle it in the general case, you have to write a SAX parser in sed and that still won't handle illegal codepoints introduced via external entity inclusion.

That said, well-formed but numerically invalid character references inside CDATA sections are probably rare, and CDATA section boundary tokens inside comments are probably rarer still.