How to substring for utf-8 encoding


SingoWong

Jul 17, 2011, 11:32:50 PM
to golang-nuts
Hi all,

I want to cut a string that contains both Chinese and English text, but
each Chinese character takes up 3 bytes in UTF-8.
If I slice the string directly, the result can be broken. Is there an
existing package that handles this?

Regards,
Singo

Evan Shaw

Jul 17, 2011, 11:42:52 PM
to SingoWong, golang-nuts

How are you deciding the substring? If you're doing something like:

searchStr := "a string"
i := strings.Index(str, searchStr)
if i < 0 {
	// string wasn't found; handle appropriately
}
substr := str[i : i+len(searchStr)]

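This is always safe for any searchStr: strings.Index returns a byte
offset and len(searchStr) is a byte length, so the slice covers exactly
the matched bytes.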

If you want to do something like extract a substring containing the
first 20 characters of another string, then you have to worry about
UTF-8. You could use utf8.String or convert the string to its code
points with a conversion to []int.

- Evan
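
For the "first 20 characters" case, one option that needs no conversion is to range over the string, since each iteration yields the byte offset where a character starts. The following is only a minimal sketch; the helper name firstN is made up for illustration and is not part of any package.

```go
package main

import "fmt"

// firstN (hypothetical helper) returns the first n characters
// (code points) of s. Ranging over a string yields the byte offset
// at which each character starts, so no conversion is needed.
func firstN(s string, n int) string {
	if n <= 0 {
		return ""
	}
	count := 0
	for i := range s {
		if count == n {
			return s[:i] // i is the byte offset of character number n
		}
		count++
	}
	return s // s has n or fewer characters
}

func main() {
	fmt.Println(firstN("中文字符abc", 2))
	fmt.Println(firstN("中文字符abc", 20))
}
```

The loop stops as soon as it reaches character n, so the cost depends on n rather than on the length of the whole string.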

SingoWong

Jul 18, 2011, 12:36:12 AM
to golang-nuts
Erm, I want to cut the string as shown below.

s := "中文字符abc"

If I slice this directly, s[0:1] returns a broken string; to cut out the
first character I need to write s[0:3].
The string contains both Chinese and English.

by := []byte(s)
fmt.Printf("%#v\n", by)
// []uint8{0xe4, 0xb8, 0xad, 0xe6, 0x96, 0x87, 0xe5, 0xad, 0x97,
//         0xe7, 0xac, 0xa6, 0x61, 0x62, 0x63}

0xe4, 0xb8, 0xad, => 中
0xe6, 0x96, 0x87, => 文
0xe5, 0xad, 0x97, => 字
0xe7, 0xac, 0xa6, => 符
0x61, => a
0x62, => b
0x63 => c


zhai

Jul 18, 2011, 12:43:24 AM
to SingoWong, golang-nuts

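package main

import "fmt"

func main() {
	s := "中文字符abc"
	// The original post used []int(s), the pre-Go 1 spelling;
	// current Go uses []rune for code points.
	runes := []rune(s)
	s1 := string(runes[1:5]) // characters 1 through 4
	fmt.Println(s1)
}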

Evan Shaw

Jul 18, 2011, 12:44:06 AM
to SingoWong, golang-nuts
On Mon, Jul 18, 2011 at 4:36 PM, SingoWong <singo...@gmail.com> wrote:
> Erm, I want to cut the string as shown below.
>
> s := "中文字符abc"
>
> If I slice this directly, s[0:1] returns a broken string; to cut out the
> first character I need to write s[0:3].
> The string contains both Chinese and English.

Sure, I understand the problem, but to get better advice you need to
give more details. If we consider a constant string, the solution is
obvious because the string is known ahead of time. You probably want a
more general solution than that.

What substring do you want? How will you find it?

- Evan

SingoWong

Jul 18, 2011, 12:44:15 AM
to golang-nuts
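package main

import "fmt"

func main() {
	s := "中文字符abc"

	fmt.Printf("%#v\n", substring(s, 1))
}

// substring returns the first n bytes of s, extended to the end of the
// character that starts at byte n-1. It only yields whole characters
// when byte n-1 is a lead byte or ASCII, as with n == 1 here.
func substring(s string, n int) string {
	by := []byte(s)

	if by[n-1] >= 240 { // lead byte of a 4-byte sequence
		n += 3
	} else if by[n-1] >= 224 { // lead byte of a 3-byte sequence
		n += 2
	} else if by[n-1] >= 192 { // lead byte of a 2-byte sequence
		n += 1
	}

	return s[0:n]
}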

Looks like this solves it.
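
A more general take on the byte-inspection idea is to back up from the requested byte position until it lands on the start of a character. The sketch below does that with utf8.RuneStart from the standard unicode/utf8 package. Note that it rounds down to a character boundary rather than up, which is a different choice from the function above, and that the name truncateBytes is made up for illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// truncateBytes (hypothetical helper) returns the longest prefix of s
// that is at most n bytes long and does not end in the middle of a
// character.
func truncateBytes(s string, n int) string {
	if n <= 0 {
		return ""
	}
	if n >= len(s) {
		return s
	}
	// Back up until s[n] is the first byte of a character, so that
	// s[:n] ends on a character boundary.
	for n > 0 && !utf8.RuneStart(s[n]) {
		n--
	}
	return s[:n]
}

func main() {
	s := "中文字符abc"
	fmt.Println(truncateBytes(s, 4))
	fmt.Println(truncateBytes(s, 13))
}
```

This form suits limits expressed in bytes (for example a fixed-size column), whereas limits expressed in characters need a character-based approach.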

zhai

Jul 18, 2011, 12:52:14 AM
to SingoWong, golang-nuts
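package main

import "fmt"

// substr returns up to length characters of s, starting at character pos.
func substr(s string, pos, length int) string {
	runes := []rune(s) // originally []int(s), the pre-Go 1 spelling
	l := pos + length
	if l > len(runes) {
		l = len(runes)
	}
	return string(runes[pos:l])
}

func main() {
	s := "中文字符abc"
	s1 := substr(s, 2, 3)
	s2 := substr(s, 2, 8)
	fmt.Println(s1, s2)
}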
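
The same pos/length substring can also be computed without converting the whole string, by walking it once and recording the byte offsets of the start and end characters. This is only a sketch of that alternative; the name substrByOffset is made up for illustration, and unlike the conversion-based version it returns an empty string instead of panicking when pos is out of range.

```go
package main

import "fmt"

// substrByOffset (hypothetical helper) returns up to length characters
// of s, starting at character pos, by locating byte offsets with a
// single range loop.
func substrByOffset(s string, pos, length int) string {
	if pos < 0 || length <= 0 {
		return ""
	}
	// Default to the end of the string in case pos or pos+length
	// lies beyond the last character.
	start, end := len(s), len(s)
	count := 0
	for i := range s {
		if count == pos {
			start = i
		}
		if count == pos+length {
			end = i
			break
		}
		count++
	}
	return s[start:end]
}

func main() {
	s := "中文字符abc"
	fmt.Println(substrByOffset(s, 2, 3), substrByOffset(s, 2, 8))
}
```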

SingoWong

Jul 18, 2011, 1:00:18 AM
to golang-nuts
Thanks zhai, I used your solution.

SingoWong

Jul 18, 2011, 1:03:32 AM
to golang-nuts
Thanks a lot, the substring I want is like zhai's function :)


Andrew Gerrand

Jul 18, 2011, 2:09:48 AM
to golan...@googlegroups.com
I think what you want is the utf8.String type, which will perform better than converting to []int and back.


s := "中文字符abc"
us := utf8.NewString(s)
us.Slice(0, 1) == "中"

Andrew

Mauro Trajber

May 11, 2015, 12:14:16 PM
to golan...@googlegroups.com
The utf8.NewString function doesn't exist anymore. How can I do it now?

Ian Lance Taylor

May 11, 2015, 12:30:01 PM
to Mauro Trajber, golang-nuts
On Mon, May 11, 2015 at 9:08 AM, Mauro Trajber <tra...@gmail.com> wrote:
> The utf8.NewString function doesn't exist anymore. How can I do it now?

It was moved to golang.org/x/exp/utf8string, which is available via go
get. As far as I know it still works.

Or you could convert to []int and back.

Ian



> On Monday, July 18, 2011 at 3:09:48 AM UTC-3, Andrew Gerrand wrote:
>>
>> I think what you want is the utf8.String type, which will perform better
>> than converting to []int and back.
>>
>> http://golang.org/pkg/utf8/#String
>>
>> s := "中文字符abc"
>> us := utf8.NewString(s)
>> us.Slice(0, 1) == "中"
>>
>> Andrew

Mauro Trajber

May 11, 2015, 12:48:23 PM
to golan...@googlegroups.com, tra...@gmail.com
What if I convert it to []rune?
http://play.golang.org/p/wKQ36VHwa_

Ian Lance Taylor

May 11, 2015, 1:09:31 PM
to Mauro Trajber, golang-nuts
On Mon, May 11, 2015 at 9:48 AM, Mauro Trajber <tra...@gmail.com> wrote:
> What if I convert it to []rune?
> http://play.golang.org/p/wKQ36VHwa_

Yes, sorry. That is what I should have said.

Ian

Mauro Trajber

May 11, 2015, 1:13:18 PM
to golan...@googlegroups.com, tra...@gmail.com
Thanks Ian,

Just one more question: what's the complexity of converting a string to []rune? O(n)?

minux

May 11, 2015, 1:19:46 PM
to Mauro Trajber, golang-nuts
On Mon, May 11, 2015 at 1:13 PM, Mauro Trajber <tra...@gmail.com> wrote:
> Just one more question: what's the complexity of converting a string to []rune? O(n)?

O(n) time and O(n) space. 
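
To make the space cost concrete: a []rune holds one 4-byte element per character, so the converted slice can be larger than the original string. When only the character count is needed, utf8.RuneCountInString scans the string without building a slice. The sketch below illustrates both; the helper names toRunes and countChars are made up for illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// toRunes (hypothetical helper) converts s to its code points.
// This takes O(n) time and produces a new slice with one 4-byte
// element per character.
func toRunes(s string) []rune {
	return []rune(s)
}

// countChars (hypothetical helper) counts characters in O(n) time
// by scanning s, without building a slice.
func countChars(s string) int {
	return utf8.RuneCountInString(s)
}

func main() {
	s := "中文字符abc"
	r := toRunes(s)
	fmt.Println(len(s), "bytes in the string")
	fmt.Println(len(r), "runes,", len(r)*4, "bytes of rune data")
	fmt.Println(countChars(s), "characters counted by scanning")
}
```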