String slices are byte slices?


Ben Bullock

Nov 17, 2009, 10:22:36 PM
to golang-nuts
I was trying to solve the following problem:

http://stackoverflow.com/questions/1752414/how-to-reverse-a-string-in-go

and I came up with the following solution:

package main
func main () {
	input := "The quick brown 狐 jumped over the lazy 犬";
	output := "";
	for i := len (input); i > 0; i-- {
		output += input[i-1:i];
	}
	print (output,"\n");
}

(Here I've deliberately added two characters in kanji to test the
UTF-8.)

I was thinking that this would give me something like Perl's reverse
function which reverses without corrupting UTF-8. However, the above
Go mangles the UTF-8 characters in the string. It seems like string
slices are bytewise slices. Is this intentional or is it a bug?

Ben Bullock

Nov 17, 2009, 10:29:35 PM
to golang-nuts
I read the language specification:

"A string type represents the set of string values. Strings behave
like arrays of bytes but are immutable: once created, it is impossible
to change the contents of a string. The predeclared string type is
string.

The elements of strings have type byte and may be accessed using the
usual indexing operations. It is illegal to take the address of such
an element; if s[i] is the ith byte of a string, &s[i] is invalid. The
length of string s can be discovered using the built-in function len.
The length is a compile-time constant if s is a string literal."

OK, my mistake.

Adam Langley

Nov 17, 2009, 10:29:44 PM
to Ben Bullock, golang-nuts
On Tue, Nov 17, 2009 at 7:22 PM, Ben Bullock <benkasmi...@gmail.com> wrote:
> Go mangles the UTF-8 characters in the string. It seems like string
> slices are bytewise slices. Is this intentional or is it a bug?

It's intentional.

You can iterate over the code points in a string with:
for byteIndex, codePoint := range s { ...

Then you can use EncodeRune to write from the end of a newly allocated
byte slice (of the same length as the string), backwards.

Of course, you'll still mangle the Unicode since combining characters
and other code points should be reordered. Implementing the grapheme
boundary algorithm is rather more involved.


AGL
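
A minimal sketch of the approach described above, assuming present-day Go (package unicode/utf8, where EncodeRune takes the destination slice first). The function name reverseRunes and its boolean result are assumptions made for this illustration. It rejects input that is not valid UTF-8 so that the output buffer can safely be the same length as the input.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// reverseRunes reverses s code point by code point by filling a byte
// slice of the same length from the end towards the start.
// It reports false if s is not valid UTF-8.
func reverseRunes(s string) (string, bool) {
	if !utf8.ValidString(s) {
		return "", false
	}
	buf := make([]byte, len(s))
	pos := len(buf) // the next rune is written so that it ends at pos
	for _, r := range s {
		pos -= utf8.RuneLen(r)
		utf8.EncodeRune(buf[pos:], r)
	}
	return string(buf), true
}

func main() {
	out, ok := reverseRunes("The quick brown 狐 jumped over the lazy 犬")
	fmt.Println(out, ok)
}
```

This only reorders code points; it makes no attempt to keep combining sequences together.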

Rob 'Commander' Pike

Nov 17, 2009, 10:33:59 PM
to Adam Langley, Ben Bullock, golang-nuts

On Nov 17, 2009, at 7:29 PM, Adam Langley wrote:

> On Tue, Nov 17, 2009 at 7:22 PM, Ben Bullock <benkasmi...@gmail.com> wrote:
>> Go mangles the UTF-8 characters in the string. It seems like string
>> slices are bytewise slices. Is this intentional or is it a bug?
>
> It's intentional.
>
> You can iterate over the code points in a string with:
> for byteIndex, codePoint := range s { ...
>
> Then you can use EncodeRune to write from the end of a newly allocated
> byte slice (of the same length as the string), backwards.

Except when you can't. If the input string is not valid UTF-8, it can grow when you do this because the range operation will generate a (1, U+FFFD) pair for each invalid byte.
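
To illustrate the growth described above, here is a short sketch in present-day Go. The helper name reencodedLen is hypothetical, and the round trip through []rune stands in for the decode-then-encode loop.

```go
package main

import "fmt"

// reencodedLen returns the byte length of s after decoding it to code
// points and encoding those code points back to UTF-8.
func reencodedLen(s string) int {
	return len(string([]rune(s)))
}

func main() {
	bad := "a\xffb" // the byte 0xff never appears in valid UTF-8

	// range advances one byte past the invalid byte but yields U+FFFD.
	for i, r := range bad {
		fmt.Printf("%d %U\n", i, r)
	}

	// U+FFFD needs three bytes when encoded, so the string grows.
	fmt.Println(len(bad), reencodedLen(bad)) // 3 5
}
```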

> Of course, you'll still mangle the Unicode since combining characters
> and other code points should be reordered. Implementing the grapheme
> boundary algorithm is rather more involved.

As Adam knows, "rather more involved" is an understatement. UTF-8-encoded Unicode strings do not like to be reversed.
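
A small sketch of what goes wrong with combining characters, assuming present-day Go. The function name reverseByCodePoint is hypothetical. It walks the string from the end with utf8.DecodeLastRuneInString, which is one of several ways to reverse by code point.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// reverseByCodePoint reverses s one code point at a time, reading from
// the end of the string. It knows nothing about grapheme clusters.
func reverseByCodePoint(s string) string {
	var b strings.Builder
	b.Grow(len(s))
	for len(s) > 0 {
		r, size := utf8.DecodeLastRuneInString(s)
		b.WriteRune(r)
		s = s[:len(s)-size]
	}
	return b.String()
}

func main() {
	// "e" followed by U+0301 COMBINING ACUTE ACCENT, then "x".
	s := "e\u0301x"
	// After reversal the accent follows "x" instead of "e".
	fmt.Printf("%+q\n", reverseByCodePoint(s)) // "x\u0301e"
}
```

The result is still valid UTF-8, but the accent has moved to a different letter, so the text is wrong even though no bytes were corrupted.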

-rob


Ben Bullock

Nov 17, 2009, 10:47:50 PM
to golang-nuts
On Nov 18, 12:29 pm, Adam Langley <a...@golang.org> wrote:

> You can iterate over the code points in a string with:
>   for byteIndex, codePoint := range s { ...

Thanks for the suggestion.

> Then you can use EncodeRune to write from the end of a newly allocated
> byte slice (of the same length as the string), backwards.

I am trying something like this:

package main
import "utf8";
func main () {
	var dummy [5]byte;
	input := "The quick brown 狐 jumped over the lazy 犬";
	output := "";
	for byteIndex, codePoint := range input {
		n_bytes := utf8.EncodeRune (codePoint, dummy);
		output = input[byteIndex:byteIndex + n_bytes] + output
	}
	print (output,"\n");
}

However, I don't see the syntax for how to declare "dummy" correctly.
I am just getting compiler errors or segmentation faults with
everything I try. Although this isn't the way you suggest, can someone
tell me how to declare dummy so that it works here?

> Of course, you'll still mangle the Unicode since combining characters
> and other code points should be reordered. Implementing the grapheme
> boundary algorithm is rather more involved.

Do other languages currently do that? It seems like it would involve
checking every character against a database.

Adam Langley

Nov 17, 2009, 11:01:33 PM
to Ben Bullock, golang-nuts
On Tue, Nov 17, 2009 at 7:47 PM, Ben Bullock <benkasmi...@gmail.com> wrote:
> However, I don't see the syntax for how to declare "dummy" correctly.
> I am just getting compiler errors or segmentation faults with

var dummy [utf8.UTFMax]byte;
utf8.EncodeRune (codePoint, &dummy);



AGL
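
For readers on current Go releases, the same idea looks slightly different: the package is imported as unicode/utf8, and EncodeRune takes the destination byte slice first and the rune second. A minimal sketch follows; the helper name encode is hypothetical.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encode returns the UTF-8 bytes of r, using a fixed-size scratch array
// that is large enough for any single code point.
func encode(r rune) []byte {
	var scratch [utf8.UTFMax]byte
	n := utf8.EncodeRune(scratch[:], r) // scratch[:] views the array as a slice
	return append([]byte(nil), scratch[:n]...)
}

func main() {
	fmt.Println(len(encode('a')), len(encode('狐'))) // 1 3
}
```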

Rob 'Commander' Pike

Nov 18, 2009, 12:17:18 AM
to Ben Bullock, golang-nuts

On Nov 17, 2009, at 7:47 PM, Ben Bullock wrote:

> Do other languages currently do that? It seems like it would involve
> checking every character against a database.

Yes, it's truly horrible. The database is huge too. There may be
languages that do this but as far as I know a library - and a big one
- is involved.

-rob

Russ Cox

Nov 18, 2009, 1:45:09 AM
to Ben Bullock, golang-nuts
Continuing to ignore combining characters, you can do the code point
reversal without importing "utf8", by using range over the string
to pick apart the code points and then the string([]int) conversion
to convert back to UTF-8:

package main

func main() {
	input := "The quick brown 狐 jumped over the lazy 犬";

	// Get Unicode code points.
	n := 0;
	rune := make([]int, len(input));
	for _, r := range input {
		rune[n] = r;
		n++;
	}
	rune = rune[0:n];

	// Reverse
	for i := 0; i < n/2; i++ {
		rune[i], rune[n-1-i] = rune[n-1-i], rune[i]
	}

	// Convert back to UTF-8.
	output := string(rune);
	print(output, "\n");
}

Zhai

Apr 26, 2011, 7:31:33 AM
to golan...@googlegroups.com, Ben Bullock, r...@golang.org
It's easy to convert a string to runes: []int(string),
just as you convert runes back to a string: string([]int)

package main

import "fmt"
func reverse(str string) string {
	ls := []int(str)
	ilen := len(ls)
	for i:=0 ; i< ilen /2 ; i++ {
		ls[i],ls[ilen-i-1] = ls[ilen-i-1],ls[i]
	}
	return string(ls)
}

func main() {
	input := "The quick brown 狐 jumped over the lazy 犬";
	output := reverse(input)
	fmt.Println(output)
}
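
A note for readers on Go 1 or later: code points are now represented by the rune type rather than int, so the []int(str) and string([]int) conversions used in this thread no longer compile. A sketch of the same function with that one substitution is below; the name reverse is kept from the post above.

```go
package main

import "fmt"

// reverse returns str with its code points in reverse order.
// Combining characters are still not handled.
func reverse(str string) string {
	rs := []rune(str) // decode UTF-8 into code points
	for i, j := 0, len(rs)-1; i < j; i, j = i+1, j-1 {
		rs[i], rs[j] = rs[j], rs[i]
	}
	return string(rs) // encode back to UTF-8
}

func main() {
	fmt.Println(reverse("The quick brown 狐 jumped over the lazy 犬"))
}
```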
