Converting utf16 to utf8


Athiwat Chunlakhan

Jul 9, 2011, 1:19:13 PM
to golan...@googlegroups.com
How do I do this? The file I'm reading from is UTF-16 encoded, and I want to compare its contents with a string in my program. I also want to use it as a map key, like this:

output = test[some-UTF16-string]

which doesn't work unless I convert it.

-- 
Athiwat Chunlakhan

hka...@gmail.com

Jul 9, 2011, 1:29:47 PM
to golan...@googlegroups.com
"UTF16ToString()" from the syscall package should do the job.

Ostsol

Jul 9, 2011, 1:31:25 PM
to golan...@googlegroups.com
Check out the utf16 and utf8 packages.

-Daniel

Athiwat Chunlakhan

Jul 9, 2011, 1:35:38 PM
to golan...@googlegroups.com
I know about those two packages already, but the question here is how do I use them to get my result?

-- 
Athiwat Chunlakhan

Athiwat Chunlakhan

Jul 9, 2011, 1:46:03 PM
to golan...@googlegroups.com
UTF16ToString doesn't exist in the weekly release :(
Do you have any idea where it has been moved to?

-- 
Athiwat Chunlakhan

On Sunday, July 10, 2011 at 12:29 AM, hka...@gmail.com wrote:

UTF16ToString()

zhai

Jul 9, 2011, 1:55:59 PM
to Athiwat Chunlakhan, golan...@googlegroups.com
http://golang.org/src/pkg/syscall/syscall_windows.go?h=UTF16ToString#L59

func UTF16ToString(s []uint16) string {
    for i, v := range s {
        if v == 0 {
            s = s[0:i]
            break
        }
    }
    return string(utf16.Decode(s))
}

Vida

Jul 9, 2011, 2:00:38 PM
to golang-nuts
Please take a look, http://goneat.org/pkg/syscall/
I can recreate that function, but I'm asking where it has gone.

On Jul 10, 12:55 am, zhai <qyz...@gmail.com> wrote:
> http://golang.org/src/pkg/syscall/syscall_windows.go?h=UTF16ToString#L59
>
> func UTF16ToString(s []uint16) string {
>     for i, v := range s {
>         if v == 0 {
>             s = s[0:i]
>             break
>         }
>     }
>     return string(utf16.Decode(s))
> }
>
> On Sun, Jul 10, 2011 at 1:46 AM, Athiwat Chunlakhan <athiw...@googlemail.com

Vida

Jul 9, 2011, 2:02:02 PM
to golang-nuts
Sorry, it looks like the function somehow doesn't exist on my Mac.

On Jul 10, 1:00 am, Vida <athiw...@googlemail.com> wrote:
> Please take a look, http://goneat.org/pkg/syscall/

brainman

Jul 9, 2011, 7:01:20 PM
to golan...@googlegroups.com
It is a Windows-specific function. It only exists if GOOS=windows. Use the utf16 package instead.

Alex
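
For readers who need the same behavior on any GOOS, here is a minimal sketch built only on the utf16 package. The helper name utf16ToString is hypothetical, and the import path unicode/utf16 is the one used since Go 1 (it was plain "utf16" in the 2011 weekly releases). Like the Windows function, it assumes the input may be NUL-terminated and stops at the first zero code unit.

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

// utf16ToString is a hypothetical, portable stand-in for the
// Windows-only syscall.UTF16ToString: it cuts the input at the first
// NUL code unit and decodes what remains into a Go (UTF-8) string.
func utf16ToString(s []uint16) string {
	for i, v := range s {
		if v == 0 {
			s = s[:i]
			break
		}
	}
	return string(utf16.Decode(s))
}

func main() {
	// Everything after the zero code unit is ignored.
	units := []uint16{'G', 'o', 0, 'x'}
	fmt.Println(utf16ToString(units))
}
```

If the data is not NUL-terminated, the loop is unnecessary and string(utf16.Decode(s)) alone is enough.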

Ostsol

Jul 9, 2011, 9:05:44 PM
to golan...@googlegroups.com
On Saturday, 9 July 2011 17:01:20 UTC-6, brainman wrote:
It is a Windows-specific function. It only exists if GOOS=windows. Use the utf16 package instead.

Alex


I'm confused as to why that function is in the syscall package at all.  I see that it and related functions are needed by some Windows syscalls, but if it's going to import the utf16 package anyway, why not have those functions in the utf16 package itself?

-Daniel

shiwei xu

Jul 9, 2011, 9:55:35 PM
to golan...@googlegroups.com
syscall.UTF16ToString converts UTF-16 text to Unicode code points and then converts those code points to UTF-8 text. It's an expensive, temporary solution. I think that's why it is not in the utf16 package.
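
To make the cost concrete: utf16.Decode allocates a []rune, and the string conversion then allocates the UTF-8 bytes. The sketch below shows one way to skip the intermediate []rune by encoding each code point as soon as it is decoded. The helper name utf16ToUTF8 is hypothetical, and utf8.AppendRune requires Go 1.18 or later; this is an illustration under those assumptions, not what the syscall package does.

```go
package main

import (
	"fmt"
	"unicode/utf16"
	"unicode/utf8"
)

// utf16ToUTF8 (hypothetical name) converts UTF-16 code units to a
// UTF-8 string in one pass, without building a []rune first.
func utf16ToUTF8(units []uint16) string {
	// Each code unit expands to at most 3 UTF-8 bytes.
	out := make([]byte, 0, len(units)*3)
	for i := 0; i < len(units); i++ {
		r := rune(units[i])
		if utf16.IsSurrogate(r) {
			// A valid high/low pair yields one code point; a lone
			// or misordered surrogate becomes U+FFFD.
			pair := utf8.RuneError
			if i+1 < len(units) {
				pair = utf16.DecodeRune(r, rune(units[i+1]))
			}
			if pair != utf8.RuneError {
				i++ // the low surrogate was consumed as well
			}
			r = pair
		}
		out = utf8.AppendRune(out, r)
	}
	return string(out)
}

func main() {
	units := []uint16{'H', 0x00E9, 0x4E16, 0xD83D, 0xDE00}
	// Compare the one-pass result against the two-step conversion.
	fmt.Println(utf16ToUTF8(units) == string(utf16.Decode(units)))
}
```

Whether the saved allocation matters depends on input size; for short strings the two-step form is simpler and likely fast enough.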

John Arbash Meinel

Jul 9, 2011, 10:03:16 PM
to Athiwat Chunlakhan, golan...@googlegroups.com
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 7/9/2011 7:35 PM, Athiwat Chunlakhan wrote:
> I know about those two packages already, but the question here is how do I use them to get my result?
>

It looks clumsy to me, but something like:

import (
    "encoding/binary"
    "unicode/utf16" // plain "utf16" before Go 1
    "unicode/utf8"  // plain "utf8" before Go 1
)

// Converts little-endian UTF-16 bytes into a UTF-8 string.
func decode_utf16_to_utf8_string(data []byte) string {
    if len(data)%2 != 0 {
        return "" // Error, an odd byte count can't be UTF-16
    }
    as_uint16 := make([]uint16, len(data)/2)
    for i := 0; i < len(data); i += 2 {
        as_uint16[i/2] = binary.LittleEndian.Uint16(data[i : i+2])
    }
    as_rune_slice := utf16.Decode(as_uint16)
    // I think 4 bytes is enough, worst case, but UCS-2 is actually
    // only 3 bytes
    // http://stackoverflow.com/questions/6466071/whats-the-longest-in-bytes-utf-8-character-which-is-present-in-ucs-2
    as_bytes := make([]byte, len(as_rune_slice)*4)
    offset := 0
    for _, r := range as_rune_slice {
        offset += utf8.EncodeRune(as_bytes[offset:], r)
    }
    return string(as_bytes[:offset])
}


I haven't verified it, but that should at least convert the types
correctly at each step. If you wanted, you could probably change the
algorithm a bit to avoid some of the intermediate buffers. However,
because of characters outside the Basic Multilingual Plane, which
UTF-16 encodes as surrogate pairs, not every char can be encoded in a
single uint16. If you're sure the data doesn't contain that extended
set, then you could change the first as_uint16 loop to decode each unit
straight to a rune. Something like:

// Only valid for data with no surrogate pairs (UCS-2).
func decode_utf16_to_utf8_string(data []byte) string {
    if len(data)%2 != 0 {
        return "" // Error, an odd byte count can't be UTF-16
    }
    as_uint16 := make([]uint16, 1)
    // UCS-2 into UTF-8 has a worst case of growing by 1 byte per rune
    as_bytes := make([]byte, len(data)*3/2)
    offset := 0
    for i := 0; i < len(data); i += 2 {
        as_uint16[0] = binary.LittleEndian.Uint16(data[i : i+2])
        // Decode returns a slice; one code unit in gives one rune out.
        r := utf16.Decode(as_uint16)[0]
        offset += utf8.EncodeRune(as_bytes[offset:], r)
    }
    return string(as_bytes[:offset])
}
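
The caveat about the extended set can be seen in a small sketch. It assumes the current import path unicode/utf16, and the helper decodePerUnit is hypothetical, written only to mimic the one-unit-at-a-time loop. U+1F600 needs the surrogate pair 0xD83D 0xDE00; decoding the halves separately loses the character and produces U+FFFD (the replacement character) for each half.

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

// decodePerUnit (hypothetical helper) decodes every UTF-16 code unit
// in isolation, the way a one-unit-at-a-time loop would.
func decodePerUnit(units []uint16) []rune {
	out := make([]rune, 0, len(units))
	for _, u := range units {
		out = append(out, utf16.Decode([]uint16{u})...)
	}
	return out
}

func main() {
	// U+1F600 lies outside the Basic Multilingual Plane, so UTF-16
	// represents it with two code units.
	units := utf16.Encode([]rune{0x1F600})
	fmt.Printf("code units: %#x\n", units)

	// Decoding the pair together recovers the code point.
	fmt.Printf("whole:      %U\n", utf16.Decode(units))

	// Decoding unit by unit cannot reassemble the pair.
	fmt.Printf("per unit:   %U\n", decodePerUnit(units))
}
```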

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk4ZCGQACgkQJdeBCYSNAAM01gCeJ6URqFV1f/KnUKr/xlrk01Mn
dycAn1ODfPyuub2674QTgFy1xm8+Aesd
=ICKG
-----END PGP SIGNATURE-----

brainman

Jul 10, 2011, 3:15:19 AM
to golan...@googlegroups.com
On Sunday, 10 July 2011 11:05:44 UTC+10, Ostsol wrote:

I'm confused as to why that function is in the syscall package at all.  I see that it and related functions are needed by some Windows syscalls, ...

You have answered your question yourself.
 
... why not have those functions in the utf16 package itself?


I think it is too trivial to belong there; anyone who needs it can implement it easily. Also, I'm not sure where this function would be applicable other than in the Windows syscall package.

Alex

Athiwat Chunlakhan

Jul 10, 2011, 3:58:25 AM
to John Arbash Meinel, golan...@googlegroups.com
Thanks

-- 
Athiwat Chunlakhan


Ostsol

Jul 10, 2011, 11:28:30 AM
to golan...@googlegroups.com

On Sunday, 10 July 2011 01:15:19 UTC-6, brainman wrote:
I think it is too trivial to belong there; anyone who needs it can implement it easily. Also, I'm not sure where this function would be applicable other than in the Windows syscall package.

Alex

I was just thinking that since it's exposed, it is intended to be used.  However, those functions do not require the syscall package, so it seems odd that they are exposed there.  If they are meant to be library functions, they should be somewhere more appropriate, like the utf16 package (though I realize that the specific implementation of these functions assumes a zero-terminated string).  On the other hand, if they are merely trivial helper functions with no use outside of certain Windows syscalls, they should not be exposed.

It's not a particularly important matter; it just seemed odd to me.

-Daniel

brainman

Jul 10, 2011, 7:36:11 PM
to golan...@googlegroups.com
On Monday, 11 July 2011 01:28:30 UTC+10, Ostsol wrote:

...  On the other hand, if they are merely trivial helper functions with no use outside of certain Windows syscalls, they should not be exposed.


The syscall.UTF16ToString function is used in a few places in the Go tree:

# cd $GOROOT/src
# grep syscall.UTF16ToString * -r | sed 's/^\(.*\.go\):.*/\1/' | sort | uniq
pkg/net/lookup_windows.go
pkg/os/env_windows.go
pkg/os/stat_windows.go
pkg/time/zoneinfo_windows.go
#

So it has to be exported.

Alex

Russ Cox

Jul 11, 2011, 4:47:15 PM
to John Arbash Meinel, Athiwat Chunlakhan, golan...@googlegroups.com
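// binary.Read fills a slice of known length; it will not grow a nil one,
// so size u16 first. n (the number of UTF-16 code units) is assumed known.
u16 := make([]uint16, n)
binary.Read(f, binary.LittleEndian, u16)
s := string(utf16.Decode(u16)) // s is UTF-8, like any Go string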
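
A fuller sketch of the binary.Read approach, tied back to the original question about comparisons and map keys. The helper name decodeUTF16LE, the sample bytes, and the map contents are all hypothetical; the sketch assumes little-endian input with no byte order mark and uses the current import path unicode/utf16.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"unicode/utf16"
)

// decodeUTF16LE (hypothetical helper) turns little-endian UTF-16
// bytes into a Go string. It does not look for a byte order mark.
func decodeUTF16LE(raw []byte) (string, error) {
	if len(raw)%2 != 0 {
		return "", fmt.Errorf("odd byte count %d: not UTF-16", len(raw))
	}
	// binary.Read fills exactly len(u16) values, so size it up front.
	u16 := make([]uint16, len(raw)/2)
	if err := binary.Read(bytes.NewReader(raw), binary.LittleEndian, u16); err != nil {
		return "", err
	}
	return string(utf16.Decode(u16)), nil
}

func main() {
	// "key" encoded as UTF-16LE, standing in for the file contents.
	raw := []byte{'k', 0, 'e', 0, 'y', 0}
	test := map[string]string{"key": "value"}

	s, err := decodeUTF16LE(raw)
	if err != nil {
		panic(err)
	}
	// Once decoded, s compares and indexes like any other string.
	fmt.Println(s == "key", test[s])
}
```

For big-endian data, binary.BigEndian can be passed instead; detecting the byte order from a BOM is left out of this sketch.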