casting slice of rune to string picks up extra characters for some inputs


Fraser Hanson

Feb 28, 2017, 12:29:07 PM
to golang-nuts
https://play.golang.org/p/05wZM9BhfB

I'm working on some code that reads UTF-32 and converts it to Go strings.
I'm finding some surprising behavior when casting slices of runes to strings.

 runes := []rune{'©'}
 fmt.Printf(" cast to string: (%s)\n", string(runes))
 fmt.Printf("bytes in string: (%x)\n", string(runes))
Output:
 cast to string: (©)
bytes in string: (c2a9) // <-- where's the C2 byte coming from??

The weird part is that casting the rune slice to a string causes it to pick up an additional leading character.

Runes 0x00-0x7f get nothing prepended.
Runes 0x80-0xbf get a leading c2 byte, as seen above.
Runes 0xc0-0xff get a leading c3 byte.
Rune 0x100 gets a leading c4 byte.  Seems like a pattern here.

The same thing happens if I add the runes into a bytes.Buffer with WriteRune(), then print it out with bytes.Buffer.String().

Can anyone explain this?  
What's the correct way to convert a slice of runes into a string?

Volker Dobler

Feb 28, 2017, 12:48:55 PM
to golang-nuts
Strings in Go are UTF-8, so it's expected.
See e.g. http://www.fileformat.info/info/unicode/char/00a9/index.htm
What bytes for © _would_ you expect? And why?

V.

Rob Pike

Feb 28, 2017, 12:49:45 PM
to Fraser Hanson, golang-nuts

Ian Lance Taylor

Feb 28, 2017, 12:49:45 PM
to Fraser Hanson, golang-nuts
When you convert []rune to string, the runes are encoded into UTF-8
and the resulting bytes are the contents of the string. That is what
you are seeing. I don't know what you expect to see.

Ian
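
To make the point above concrete, here is a minimal sketch of the conversion in both directions. The helper names runesToString and viaBuffer are made up for this illustration; string(runes), []rune(s) and bytes.Buffer.WriteRune are the standard pieces. Both paths should produce the same UTF-8 bytes, and converting back recovers the original rune.

```go
package main

import (
	"bytes"
	"fmt"
)

// runesToString (hypothetical helper) converts a rune slice to a string.
// The conversion encodes every rune as UTF-8.
func runesToString(runes []rune) string {
	return string(runes)
}

// viaBuffer (hypothetical helper) builds the same string with
// bytes.Buffer.WriteRune, which also writes the UTF-8 encoding.
func viaBuffer(runes []rune) string {
	var buf bytes.Buffer
	for _, r := range runes {
		buf.WriteRune(r)
	}
	return buf.String()
}

func main() {
	runes := []rune{'©'} // U+00A9

	s := runesToString(runes)
	fmt.Printf("string %q holds %d bytes: % x\n", s, len(s), []byte(s))
	fmt.Println("same result via bytes.Buffer:", viaBuffer(runes) == s)

	// Converting back decodes the UTF-8 bytes into code points again.
	back := []rune(s)
	fmt.Printf("decoded back to %d rune(s): %U\n", len(back), back)
}
```

So string(runes) is already the correct way to turn a rune slice into a string; the extra byte is part of the encoding rather than an extra character.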

Fraser Hanson

Feb 28, 2017, 12:49:54 PM
to golang-nuts
I understand now, it's just the UTF-8 representation of these runes.

Even though the code points 128-255 would each fit in a single byte (e.g. 0x80), UTF-8 doesn't encode them that way: it uses two bytes, starting with a leading byte.
The results seen in my output are shown as the UTF-8 representation in the unicode tables:


As described in the Go docs, converting a slice of runes to a string yields the UTF-8 encoding of those runes.
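
The c2/c3/c4 pattern from the first message falls out of the two-byte UTF-8 layout (110xxxxx 10xxxxxx). The sketch below encodes a few code points with utf8.EncodeRune from the standard library and, for comparison, by hand. encodeTwoByte is a hypothetical helper written for this illustration, and it is only meaningful for code points in the range 0x80-0x7FF.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encodeTwoByte (hypothetical helper) hand-encodes a code point in the
// range 0x80-0x7FF using the two-byte layout 110xxxxx 10xxxxxx.
func encodeTwoByte(r rune) [2]byte {
	return [2]byte{
		0xC0 | byte(r>>6),   // leading byte: marker 110 plus the top 5 bits
		0x80 | byte(r&0x3F), // continuation byte: marker 10 plus the low 6 bits
	}
}

func main() {
	for _, r := range []rune{0x7F, 0x80, 0xA9, 0xBF, 0xC0, 0xFF, 0x100} {
		buf := make([]byte, utf8.UTFMax)
		n := utf8.EncodeRune(buf, r)
		fmt.Printf("%U -> % x", r, buf[:n])
		if r >= 0x80 {
			hand := encodeTwoByte(r)
			fmt.Printf(" (by hand: % x)", hand[:])
		}
		fmt.Println()
	}
}
```

Because the leading byte carries the code point's bits above the low six, 0x80-0xbf share the leading byte c2, 0xc0-0xff share c3, and 0x100 starts the c4 group.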


Robert Johnstone

Feb 28, 2017, 12:50:57 PM
to golang-nuts
Hello,

Strings are encoded using UTF-8, which is a multi-byte encoding.  Different runes require different lengths to be encoded, and the prefix you noted is how that length is transmitted (although the ranges in your message don't seem to be correct).

Robert
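
As a rough illustration of how the leading byte signals the sequence length, the sketch below reads the length from the high bits of the first byte. seqLen is a hypothetical helper, not a standard library function, and it is deliberately simplified: it only looks at the bit pattern and does not reject leading bytes that real UTF-8 forbids (such as 0xc0, 0xc1, or 0xf5 and above).

```go
package main

import "fmt"

// seqLen (hypothetical helper) reports how many bytes a UTF-8 sequence
// occupies, judged only from the bit pattern of its first byte.
// It returns 0 for a byte that cannot start a sequence.
func seqLen(b byte) int {
	switch {
	case b&0x80 == 0x00: // 0xxxxxxx: single byte (ASCII)
		return 1
	case b&0xE0 == 0xC0: // 110xxxxx: two-byte sequence
		return 2
	case b&0xF0 == 0xE0: // 1110xxxx: three-byte sequence
		return 3
	case b&0xF8 == 0xF0: // 11110xxx: four-byte sequence
		return 4
	default: // 10xxxxxx is a continuation byte, not a leading byte
		return 0
	}
}

func main() {
	s := "a©€😀" // 1-, 2-, 3- and 4-byte encodings
	for i := 0; i < len(s); {
		n := seqLen(s[i])
		fmt.Printf("leading byte %#x -> %d byte(s): % x\n", s[i], n, s[i:i+n])
		i += n
	}
}
```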

gary.wi...@victoriaplumb.com

Mar 3, 2017, 10:10:29 AM
to golang-nuts
Go strings are UTF-8 encoded, as others have mentioned. This means that each human-readable character in the string is really a cluster of one or more runes. Some characters are made up of one rune, some are made up of many. Some runes combine with others to create different characters. Also, runes don't have a fixed size in bytes once encoded: some take one byte, others take up to four.

In your example, the character © is made up of one rune, which is encoded as two bytes with the values 0xc2 and 0xa9 respectively.
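
A small sketch of the bytes/runes/characters distinction, using only the standard library (len, range over a string, and the unicode/utf8 package). The counts helper is a made-up name for this example. The second sample is a plain 'e' followed by the combining acute accent U+0301, which renders as one visible character but consists of two runes.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// counts (hypothetical helper) returns the byte length and the rune
// count of s.
func counts(s string) (nBytes, nRunes int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	samples := []string{
		"©",       // one rune, encoded as two bytes
		"e\u0301", // two runes that display as a single character
	}
	for _, s := range samples {
		b, r := counts(s)
		fmt.Printf("%q: %d bytes, %d runes\n", s, b, r)
		// Ranging over a string decodes one rune per iteration;
		// i is the byte offset at which that rune starts.
		for i, c := range s {
			fmt.Printf("  offset %d: %U (%d bytes)\n", i, c, utf8.RuneLen(c))
		}
	}
}
```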