how is xml.Decoder.CharsetReader supposed to be held?


Dan Kortschak

May 5, 2022, 11:03:46 PM
to golang-nuts
I'm in the situation of needing to provide cross-platform xml decoding.
So I thought that xml.Decoder.CharsetReader would be the right approach
in conjunction with golang.org/x/text/encoding. However, the xml decoder
needs to be able to understand the text in order to be able to read the
proc inst to get the charset out to hand to CharsetReader.

So it seems that we need to get the proc inst out from the io.Reader
input, deduce the charset and convert it to UTF-8 and then reinject it
into the io.Reader so that the charset can then be passed to
CharsetReader. This can't be the right way to do things.

I'm wondering what is the use of CharsetReader if it can't be used to
determine the charset without already having determined the charset.
How should it be used?

Dan


Diego Joss

May 6, 2022, 5:23:21 AM
to Dan Kortschak, golang-nuts
Does this work for you?

https://go.dev/play/p/xLRawVhcRtF

-- Diego

Dan Kortschak

May 6, 2022, 6:07:19 AM
to golan...@googlegroups.com
On Fri, 2022-05-06 at 11:22 +0200, Diego Joss wrote:
> Does this work for you?
>
> https://go.dev/play/p/xLRawVhcRtF
>

Thanks. No, the documents are in UTF-16, and the procinst will be too.
So it looks more like this https://go.dev/play/p/4IcXNI3yd2M. If I pull
the proc inst out of the UTF-16, then I can get it to work;
https://go.dev/play/p/kHwkVWtxbNO. But this leads to the issue where at
that point I could just decode the whole message and pass it through.
So I don't really see the point of using CharsetReader (at least not
with UTF-16).

Dan


Ian Lance Taylor

May 6, 2022, 6:56:26 PM
to Dan Kortschak, golan...@googlegroups.com
Yeah, that's not the kind of thing that CharsetReader can help with.
You'll need a plain io.Reader that converts from UTF-16 to UTF-8.

CharsetReader only works if the character set name is available in
plain ASCII in the first XML definitions, but the data doesn't use
UTF-8. It can be used with the kinds of encodings found in the
subdirectories of https://pkg.go.dev/golang.org/x/text/encoding.

Ian
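
For readers following along, here is a minimal sketch of the case Ian
describes, where the declaration is readable as plain ASCII but the body
is not UTF-8. It is not code from this thread. The names latin1Reader and
decodeLatin1 are made up for illustration, and the hand-written ISO-8859-1
converter is a standard-library stand-in for the decoders under
golang.org/x/text/encoding, which is what one would more likely use.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
	"unicode/utf8"
)

// latin1Reader (hypothetical name) converts ISO-8859-1 bytes to UTF-8.
// Every Latin-1 byte maps to the Unicode code point with the same value.
type latin1Reader struct{ src io.Reader }

func (r latin1Reader) Read(p []byte) (int, error) {
	if len(p) < 2 {
		return 0, io.ErrShortBuffer
	}
	// One input byte becomes at most two output bytes, so read at
	// most half of the space available in p.
	buf := make([]byte, len(p)/2)
	n, err := r.src.Read(buf)
	out := 0
	for _, b := range buf[:n] {
		out += utf8.EncodeRune(p[out:], rune(b))
	}
	return out, err
}

// decodeLatin1 (hypothetical name) returns the character data of the
// root element. The decoder calls CharsetReader with the label taken
// from the encoding declaration.
func decodeLatin1(doc string) (string, error) {
	dec := xml.NewDecoder(strings.NewReader(doc))
	dec.CharsetReader = func(label string, input io.Reader) (io.Reader, error) {
		if strings.EqualFold(label, "iso-8859-1") {
			return latin1Reader{src: input}, nil
		}
		return nil, fmt.Errorf("unsupported charset %q", label)
	}
	var v struct {
		Text string `xml:",chardata"`
	}
	err := dec.Decode(&v)
	return v.Text, err
}

func main() {
	// \xe9 is the Latin-1 byte for the letter e with an acute accent.
	doc := "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><msg>caf\xe9</msg>"
	fmt.Println(decodeLatin1(doc))
}
```

This works because the bytes of the declaration are identical in ASCII,
ISO-8859-1 and UTF-8, so the decoder can parse it before any conversion
happens. That is not true of UTF-16 input.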

Dan Kortschak

May 6, 2022, 7:05:43 PM
to golan...@googlegroups.com
Thanks, Ian.

It might be moot, because it looks like the encoding declaration in the
XML that I have is lying. But in general the solution would need to
sniff the first line and then try for finding the encoding declaration.
I suspect that this is what other languages do in this situation.

Dan
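
A rough sketch of the sniffing approach mentioned above, using only the
standard library. It is an illustration and not the code behind any of
the playground links. The names toUTF8, decodeSniffed and encodeUTF16LE
are hypothetical. It assumes the whole document fits in memory, looks
only for a UTF-16 byte order mark or a UTF-16 encoded "<?", and leaves
anything else untouched. Because the declaration still says UTF-16 after
transcoding, CharsetReader is set to pass the input through unchanged.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"encoding/xml"
	"errors"
	"fmt"
	"io"
	"strings"
	"unicode/utf16"
)

// toUTF8 (hypothetical name) sniffs the first bytes of data and, if
// they look like UTF-16, transcodes the whole input to UTF-8.
func toUTF8(data []byte) ([]byte, error) {
	var order binary.ByteOrder
	switch {
	case bytes.HasPrefix(data, []byte{0xFF, 0xFE}): // little-endian BOM
		order, data = binary.LittleEndian, data[2:]
	case bytes.HasPrefix(data, []byte{0xFE, 0xFF}): // big-endian BOM
		order, data = binary.BigEndian, data[2:]
	case bytes.HasPrefix(data, []byte{'<', 0, '?', 0}): // "<?" without BOM
		order = binary.LittleEndian
	case bytes.HasPrefix(data, []byte{0, '<', 0, '?'}):
		order = binary.BigEndian
	default:
		return data, nil // assume an ASCII-compatible encoding
	}
	if len(data)%2 != 0 {
		return nil, errors.New("odd number of bytes in UTF-16 input")
	}
	units := make([]uint16, len(data)/2)
	for i := range units {
		units[i] = order.Uint16(data[2*i:])
	}
	return []byte(string(utf16.Decode(units))), nil
}

// decodeSniffed (hypothetical name) returns the character data of the
// root element of a document that may be UTF-16 encoded.
func decodeSniffed(data []byte) (string, error) {
	utf8Data, err := toUTF8(data)
	if err != nil {
		return "", err
	}
	dec := xml.NewDecoder(bytes.NewReader(utf8Data))
	dec.CharsetReader = func(label string, input io.Reader) (io.Reader, error) {
		if strings.EqualFold(label, "utf-16") {
			return input, nil // already transcoded by toUTF8
		}
		return nil, fmt.Errorf("unsupported charset %q", label)
	}
	var v struct {
		Text string `xml:",chardata"`
	}
	err = dec.Decode(&v)
	return v.Text, err
}

// encodeUTF16LE builds sample input: a BOM followed by little-endian
// UTF-16 code units.
func encodeUTF16LE(s string) []byte {
	raw := []byte{0xFF, 0xFE}
	for _, u := range utf16.Encode([]rune(s)) {
		raw = append(raw, byte(u), byte(u>>8))
	}
	return raw
}

func main() {
	doc := `<?xml version="1.0" encoding="UTF-16"?><msg>héllo</msg>`
	fmt.Println(decodeSniffed(encodeUTF16LE(doc)))
}
```

As noted earlier in the thread, once the input has been transcoded up
front like this, CharsetReader does no real work. It only stops the
decoder from rejecting the UTF-16 label in the declaration.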

