Decoding UTF-16-BE into UTF-8 []byte


Maxim Khitrov

May 16, 2012, 10:24:45 AM
to golang-nuts
Hi,

I have an idea of how to solve this problem, but wanted to check if
there is a better way. I have a base64-encoded []byte that contains a
UTF-16-BE string. The Base64 part is easy, but what is the best way of
converting UTF-16-BE to UTF-8 and returning that result as []byte?

My current idea is below, let me know if I can make any improvements
in performance or code simplicity. It would be good if there was a way
of going directly from []rune to []byte without the intermediate
string step, but I couldn't find any existing functions that would do
this. There is a bytes.Runes function that does the reverse.

http://play.golang.org/p/jQc6QF5pfo
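
For readers who cannot open the playground link, the conversion described above can be sketched roughly as follows. This is not the code behind the link; the function name decodeViaString and the sample input are assumptions made for illustration, and a trailing odd byte is simply ignored.

```go
package main

import (
	"encoding/base64"
	"fmt"
	"unicode/utf16"
)

// decodeViaString converts big-endian UTF-16 bytes to UTF-8 by going
// []byte -> []uint16 -> []rune -> string -> []byte.
// The string conversion is the intermediate step the post asks about.
func decodeViaString(b []byte) []byte {
	u16 := make([]uint16, len(b)/2)
	for i := range u16 {
		// Big-endian: the high byte comes first.
		u16[i] = uint16(b[2*i])<<8 | uint16(b[2*i+1])
	}
	runes := utf16.Decode(u16) // handles surrogate pairs
	return []byte(string(runes))
}

func main() {
	// "AGgAaQ==" is Base64 for the bytes 00 68 00 69 ("hi" in UTF-16-BE).
	raw, err := base64.StdEncoding.DecodeString("AGgAaQ==")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s\n", decodeViaString(raw)) // prints "hi"
}
```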

- Max

Russ Cox

May 22, 2012, 1:53:44 AM
to Maxim Khitrov, golang-nuts
http://play.golang.org/p/jQc6QF5pfo

Russ Cox

May 22, 2012, 1:54:12 AM
to Maxim Khitrov, golang-nuts
On Tue, May 22, 2012 at 1:53 AM, Russ Cox <r...@golang.org> wrote:
> http://play.golang.org/p/jQc6QF5pfo

Well that's not very helpful. I meant http://play.golang.org/p/q5n5T6CHgr

Maxim Khitrov

May 22, 2012, 4:02:24 AM
to r...@golang.org, golang-nuts
Thanks, but this version seems to be a bit slower than the original.
Even if buf is pre-allocated, running 1,000,000 iterations gives me
1.0940626s (original) vs. 1.2350706s (new). Without pre-allocation, the
new version takes 1.4740843s.

- Max

Kevin Ballard

May 22, 2012, 3:55:46 PM
to Maxim Khitrov, r...@golang.org, golang-nuts
Counting up the size of buf and allocating at once makes it faster than your original.
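
The code for this version is not reproduced in the thread. A minimal sketch of the idea (one pass to count the encoded size, one allocation, then a second pass to encode) might look like this; the name runesToBytes is hypothetical.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runesToBytes encodes runes as UTF-8 using exactly one allocation.
// Invalid runes are encoded as U+FFFD, matching utf8.EncodeRune.
func runesToBytes(runes []rune) []byte {
	// First pass: count the bytes needed.
	n := 0
	for _, r := range runes {
		if l := utf8.RuneLen(r); l > 0 {
			n += l
		} else {
			// RuneLen reports -1 for invalid runes; EncodeRune
			// writes the replacement character instead.
			n += utf8.RuneLen(utf8.RuneError)
		}
	}

	// Second pass: encode into the pre-sized buffer.
	buf := make([]byte, n)
	i := 0
	for _, r := range runes {
		i += utf8.EncodeRune(buf[i:], r)
	}
	return buf
}

func main() {
	fmt.Printf("%s\n", runesToBytes([]rune("héllo, 世界")))
}
```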


BenchmarkOriginal 2000000       876 ns/op
BenchmarkNew 2000000       778 ns/op

-Kevin

Kevin Ballard

May 22, 2012, 4:42:43 PM
to Maxim Khitrov, r...@golang.org, golang-nuts
Skipping the intermediate slices can shave off even more time, especially if we use unsafe to do the utf16be->utf16 conversion in-place.


BenchmarkOriginal 2000000       876 ns/op
BenchmarkNew 5000000       536 ns/op

-Kevin


Maxim Khitrov

May 22, 2012, 4:56:42 PM
to Kevin Ballard, r...@golang.org, golang-nuts
Cool :)

I haven't played with the unsafe package yet, thanks for a nice demo.
In reality, I was just hoping that there is already a function in the
standard library that would help me do part of this conversion. I
think your []rune -> []byte code would be a useful addition to the
bytes package.

- Max

Kevin Ballard

May 22, 2012, 5:51:48 PM
to Maxim Khitrov, r...@golang.org, golang-nuts
Without the unsafe bit, it runs at somewhere in the vicinity of 640 ns/op (on my machine). So, while still worthwhile, the unsafe part can be safely omitted and you'll still see a speedup.

In any case, it does seem like there's a good argument for adding a []byte(runes) conversion that does the same thing as []byte(string(runes)) (except without the intermediate string).
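
As a hedged illustration of what such a conversion would do: Go releases from 1.18 onward provide utf8.AppendRune, so a []rune to []byte helper without the intermediate string can be written as below. The helper name appendRunes is hypothetical, and this assumes a toolchain newer than the one available when this thread was written.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// appendRunes appends the UTF-8 encoding of each rune to dst and
// returns the extended slice. Requires Go 1.18+ for utf8.AppendRune.
func appendRunes(dst []byte, runes []rune) []byte {
	for _, r := range runes {
		dst = utf8.AppendRune(dst, r)
	}
	return dst
}

func main() {
	runes := []rune("naïve")
	// Reserving the worst case (4 bytes per rune) avoids regrowth.
	buf := make([]byte, 0, len(runes)*utf8.UTFMax)
	fmt.Printf("%s\n", appendRunes(buf, runes))
}
```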

Of course, in your case, it turned out to be faster to translate directly from []uint16 -> []byte using both the utf16 and utf8 packages together, and that's not something that seems as generally utilitarian.
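
A rough sketch of that direct translation, reading big-endian code units straight from the input bytes and skipping both the []uint16 and []rune slices, could look like the following. It does not use unsafe, it is not the benchmarked code from this thread, and the names utf16BEToUTF8 and unitAt are assumptions for illustration.

```go
package main

import (
	"fmt"
	"unicode/utf16"
	"unicode/utf8"
)

// unitAt returns the i-th big-endian UTF-16 code unit of b as a rune.
func unitAt(b []byte, i int) rune {
	return rune(b[2*i])<<8 | rune(b[2*i+1])
}

// utf16BEToUTF8 converts big-endian UTF-16 bytes directly to UTF-8.
// Unpaired surrogates become U+FFFD; a trailing odd byte is ignored.
func utf16BEToUTF8(b []byte) []byte {
	n := len(b) / 2
	// Each code unit needs at most 3 bytes of UTF-8; a surrogate
	// pair (2 units) needs 4, so 3*n is always enough.
	out := make([]byte, 0, 3*n)
	var tmp [utf8.UTFMax]byte
	for i := 0; i < n; i++ {
		r := unitAt(b, i)
		if utf16.IsSurrogate(r) {
			dec := utf8.RuneError
			if i+1 < n {
				dec = utf16.DecodeRune(r, unitAt(b, i+1))
			}
			if dec != utf8.RuneError {
				i++ // valid pair: consume the low surrogate too
			}
			r = dec
		}
		k := utf8.EncodeRune(tmp[:], r)
		out = append(out, tmp[:k]...)
	}
	return out
}

func main() {
	in := []byte{0x00, 0x48, 0x00, 0x69, 0xD8, 0x3D, 0xDE, 0x00}
	fmt.Printf("%s\n", utf16BEToUTF8(in))
}
```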

-Kevin

