How to handle UTF-16 LE XML

Tobias S.

Sep 1, 2015, 9:01:48 AM
to golang-nuts
Hello Gophers, 


In my little Go program I grab a UTF-16 LE XML file from a Windows server. I found a few hints on the internet on how to handle the parsing with Go:

import (
    "encoding/xml"
    "golang.org/x/net/html/charset"
)

decoder := xml.NewDecoder(reader)
decoder.CharsetReader = charset.NewReaderLabel
err = decoder.Decode(&parsed)


This was posted on Stack Overflow by user moraes. But I still don't understand the basic procedure for handling UTF-16 files.

1. Do I have to convert the UTF-16 encoded file to UTF-8 before decoding it with the above commands? Or will "NewDecoder" handle this?

2. How do I handle the line in the XML file which denotes the encoding? Do I have to manually replace it with <?xml version="1.0" encoding="UTF-8"?>




Thanks for your input. 


Tobias

Giulio Iotti

Sep 1, 2015, 2:13:35 PM
to golang-nuts
On Tuesday, September 1, 2015 at 4:01:48 PM UTC+3, Tobias S. wrote:
> 1. Do I have to convert the UTF-16 encoded file to UTF-8 before decoding it with the above commands? Or will "NewDecoder" handle this?

You have to do it in CharsetReader; in the example you pasted, charset.NewReaderLabel handles the decoding/encoding for you. Internally, encoding/xml only handles UTF-8.
 
> 2. How do I handle the line in the XML file which denotes the encoding? Do I have to manually replace it with <?xml version="1.0" encoding="UTF-8"?>

I think this should be the real encoding; in your case utf-16. The value found in the <?xml?> tag is passed to CharsetReader as first argument. The second argument is the reader itself.

CharsetReader must return the Reader of the utf-8 encoded contents, or an error. This is exactly (and not incidentally) what NewReaderLabel does :)

-- 
Giulio Iotti 
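
As a rough illustration of the CharsetReader contract described above (label in, UTF-8 reader out), here is a minimal standard-library-only sketch. It deliberately uses ISO-8859-1 rather than UTF-16, because the XML declaration must itself be readable as UTF-8 for the label to reach CharsetReader at all. The names latin1Reader and decodeName are hypothetical helpers invented for this example; they stand in for what charset.NewReaderLabel would normally do.

```go
package main

import (
	"bytes"
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// latin1Reader is a hypothetical helper: it reads the remaining ISO-8859-1
// bytes and re-encodes them as UTF-8. Every Latin-1 byte maps to the Unicode
// code point with the same value.
func latin1Reader(input io.Reader) (io.Reader, error) {
	raw, err := io.ReadAll(input)
	if err != nil {
		return nil, err
	}
	var sb strings.Builder
	for _, b := range raw {
		sb.WriteRune(rune(b))
	}
	return strings.NewReader(sb.String()), nil
}

// decodeName parses a small document and returns the text of its <name> child.
func decodeName(data []byte) (string, error) {
	dec := xml.NewDecoder(bytes.NewReader(data))
	// The first argument is the label from the <?xml ... encoding="..."?>
	// declaration, the second is the reader positioned after that declaration.
	dec.CharsetReader = func(label string, input io.Reader) (io.Reader, error) {
		if strings.EqualFold(label, "ISO-8859-1") {
			return latin1Reader(input)
		}
		return nil, fmt.Errorf("unsupported charset %q", label)
	}
	var doc struct {
		Name string `xml:"name"`
	}
	if err := dec.Decode(&doc); err != nil {
		return "", err
	}
	return doc.Name, nil
}

func main() {
	// "\xe9" is the single Latin-1 byte for the letter e with an acute accent.
	data := []byte("<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><doc><name>caf\xe9</name></doc>")
	name, err := decodeName(data)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(name)
}
```

The sketch reads the whole input up front for simplicity; a streaming transformer would be preferable for large files.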

Tobias S.

Sep 3, 2015, 10:04:21 AM
to golang-nuts

Thanks for your clarification. My problem now seems to be the BOM. The code looks like this:

b, _ := ioutil.ReadAll(xmlFile)
text := strings.NewReader(string(b))
decoder := xml.NewDecoder(text)
decoder.CharsetReader = charset.NewReaderLabel

When I print out the text variable I get:

&{??<?xml version="1.0" encoding="UTF-16"?>    


At the start of the file. The two leading question marks are probably the BOM marks. I get the error message:

XML syntax error on line 1: invalid UTF-8 

from the decoder....





 

andrey mirtchovski

Sep 3, 2015, 11:36:32 AM
to Tobias S., golang-nuts
charset encoding should be able to handle BOM (because the unicode
transforms it uses do). two things to try: see what DetermineEncoding
says about your text, and then add your test file to the
sniffTestCases inside the charset package's charset_test.go.

if the latter sounds like too much, just print the first line of your
"text" variable fmt'd using %q and let's see exactly what those bytes are.
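
A small sketch of the %q suggestion above, using only the standard library. The helper name quoteHead and the sample bytes are assumptions made for illustration; the sample mimics a UTF-16 LE file that starts with a byte order mark.

```go
package main

import "fmt"

// quoteHead formats at most the first n bytes of data with %q, so that
// non-printable and non-UTF-8 bytes show up as \x.. escapes.
func quoteHead(data []byte, n int) string {
	if n > len(data) {
		n = len(data)
	}
	return fmt.Sprintf("%q", data[:n])
}

func main() {
	// Hypothetical sample: BOM (ff fe) followed by "<?xml" in UTF-16 LE.
	data := []byte("\xff\xfe<\x00?\x00x\x00m\x00l\x00")
	fmt.Println(quoteHead(data, 6))
}
```

Unlike %s or %v, which render the BOM bytes as question marks or replacement characters on most terminals, %q makes the individual bytes visible.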

Konstantin Khomoutov

Sep 3, 2015, 11:39:35 AM
to Tobias S., golang-nuts
On Thu, 3 Sep 2015 07:04:20 -0700 (PDT)
"Tobias S." <tobias....@gmail.com> wrote:

>
> Thanks for your clarification. My problem now seems to be the BOM.
> The code looks like this:
>
> b, _ := ioutil.ReadAll(xmlFile)
> text := strings.NewReader(string(b))
> decoder := xml.NewDecoder(text)
> decoder.CharsetReader = charset.NewReaderLabel

Overengineered. os.File already implements io.Reader,
so just do

decoder := xml.NewDecoder(xmlFile)
decoder.CharsetReader = charset.NewReaderLabel

> When I print out the text variable I get:
>
> &{??<?xml version="1.0" encoding="UTF-16"?>
>
> At the start of the file. The two leading question marks are probably
> the BOM marks. I get the error message:
>
> XML syntax error on line 1: invalid UTF-8
>
> from the decoder....

OK, so I'd then employ buffering and its ability to "peek" at the data,
literally, and discard it, if needed:
http://play.golang.org/p/zGrNnYRkPF

Nigel Tao

Sep 3, 2015, 8:49:45 PM
to Tobias S., Andy Balholm, Marcel van Lohuizen, golang-nuts
Adding Andy and Marcel for their thoughts re how a UTF-16 charset
reader from golang.org/x/net/html/charset should handle BOMs.

Tobias S.

Sep 4, 2015, 10:23:31 AM
to golang-nuts, tobias....@gmail.com
Hi Andrey,

here is the printout with %q of the first few characters:

&{"\xff\xfe<\x00?\x00x\x00m\x00l\x00 \x00v\x00e\x00r\x00s\x00i\x00o\x00n\x00=\x00\"\x001\x00.\


I will try your suggestions later on and let you know what's going on.

Andy Balholm

Sep 4, 2015, 11:30:46 AM
to Nigel Tao, Tobias S., Marcel van Lohuizen, golang-nuts
As I understand the WHATWG spec, the BOMs are to be left in the decoded output, to be ignored by the tokenizer. But I can’t imagine what problem it would cause for the decoder to remove them.

andrey mirtchovski

Sep 4, 2015, 3:34:37 PM
to Tobias S., golang-nuts
The problem lies with encoding/xml's design: in order to use the
charset reader the xml library needs to examine the first line of text
from the xml file (where the encoding is specified). Unfortunately
that first line is already invalid UTF-8, and encoding/xml barfs
before it even figures out what the encoding should be to pass it to
our charset reader.

To solve this we can cheat and pass the input through the charset
reader before it goes to encoding/xml. Unfortunately in that case the
XML library will find a UTF-16 encoding specified in the xml headers
and will complain that it has no charset reader to convert that. We
solve this by supplying a dummy charset reader.

This is inelegant but I don't see another solution, at least not a
quick one. There is talk of redesigning xml, so if you file a bug
there may be a chance to get this fixed somehow in the library.

I've attached the test program and the test utf-16le encoded file I
created from your initial input. Sample run below:

$ hexdump -C bomtest.txt
00000000 ff fe 3c 00 3f 00 78 00 6d 00 6c 00 20 00 76 00 |..<.?.x.m.l. .v.|
00000010 65 00 72 00 73 00 69 00 6f 00 6e 00 3d 00 22 00 |e.r.s.i.o.n.=.".|
00000020 31 00 2e 00 30 00 22 00 20 00 65 00 6e 00 63 00 |1...0.". .e.n.c.|
00000030 6f 00 64 00 69 00 6e 00 67 00 3d 00 22 00 55 00 |o.d.i.n.g.=.".U.|
00000040 54 00 46 00 2d 00 31 00 36 00 22 00 3f 00 3e 00 |T.F.-.1.6.".?.>.|
00000050 20 00 20 00 0d 00 0a 00 3c 00 4f 00 75 00 74 00 | . .....<.O.u.t.|
00000060 65 00 72 00 3e 00 3c 00 49 00 6e 00 6e 00 65 00 |e.r.>.<.I.n.n.e.|
00000070 72 00 3e 00 74 00 65 00 73 00 74 00 3c 00 2f 00 |r.>.t.e.s.t.<./.|
00000080 49 00 6e 00 6e 00 65 00 72 00 3e 00 3c 00 2f 00 |I.n.n.e.r.>.<./.|
00000090 4f 00 75 00 74 00 65 00 72 00 3e 00 0d 00 0a 00 |O.u.t.e.r.>.....|
000000a0 3c 00 2f 00 78 00 6d 00 6c 00 3e 00 0d 00 0a 00 |<./.x.m.l.>.....|
000000b0
$ go run t.go
test
$
bomtest.txt
t.go
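
The attached t.go is not reproduced in this thread. The following is a hedged, standard-library-only sketch of the workaround as described above: transcode the UTF-16 input to UTF-8 before encoding/xml sees it, then install a pass-through CharsetReader so the decoder accepts the encoding="UTF-16" declaration. The original presumably used the charset package for the transcoding step; here unicode/utf16 is swapped in to keep the example self-contained. The names utf16ToUTF8, parseUTF16XML, encodeUTF16LE and the struct outer are hypothetical.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"encoding/xml"
	"errors"
	"fmt"
	"io"
	"unicode/utf16"
)

// utf16ToUTF8 converts BOM-prefixed UTF-16 bytes to UTF-8 and drops the BOM.
func utf16ToUTF8(data []byte) ([]byte, error) {
	if len(data) < 2 || len(data)%2 != 0 {
		return nil, errors.New("not a BOM-prefixed UTF-16 byte sequence")
	}
	var order binary.ByteOrder
	switch {
	case data[0] == 0xFF && data[1] == 0xFE:
		order = binary.LittleEndian
	case data[0] == 0xFE && data[1] == 0xFF:
		order = binary.BigEndian
	default:
		return nil, errors.New("missing UTF-16 byte order mark")
	}
	units := make([]uint16, 0, len(data)/2-1)
	for i := 2; i < len(data); i += 2 {
		units = append(units, order.Uint16(data[i:]))
	}
	return []byte(string(utf16.Decode(units))), nil
}

// outer mirrors the <Outer><Inner>..</Inner></Outer> shape of the test file.
type outer struct {
	Inner string `xml:"Inner"`
}

// parseUTF16XML transcodes first, then decodes with a dummy CharsetReader.
func parseUTF16XML(data []byte) (string, error) {
	utf8Data, err := utf16ToUTF8(data)
	if err != nil {
		return "", err
	}
	dec := xml.NewDecoder(bytes.NewReader(utf8Data))
	// Dummy charset reader: the bytes are already UTF-8, so the declared
	// label is ignored and the input is handed back unchanged.
	dec.CharsetReader = func(label string, input io.Reader) (io.Reader, error) {
		return input, nil
	}
	var doc outer
	if err := dec.Decode(&doc); err != nil {
		return "", err
	}
	return doc.Inner, nil
}

// encodeUTF16LE builds BOM-prefixed UTF-16 LE test data from a Go string.
func encodeUTF16LE(s string) []byte {
	out := []byte{0xFF, 0xFE}
	for _, u := range utf16.Encode([]rune(s)) {
		out = append(out, byte(u), byte(u>>8))
	}
	return out
}

func main() {
	sample := encodeUTF16LE("<?xml version=\"1.0\" encoding=\"UTF-16\"?>\r\n<Outer><Inner>test</Inner></Outer>\r\n")
	inner, err := parseUTF16XML(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(inner)
}
```

Because the dummy reader ignores the label, this sketch is only safe when the input is known to be UTF-16; a document declaring some other encoding would be passed through unconverted.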

Tobias S.

Sep 7, 2015, 5:59:42 AM
to golang-nuts, tobias....@gmail.com
Thank you very much for your workaround code, it works!

Also many thanks to all who posted in order to solve the problem. You are a special community, always willing to help.