Re: [go-nuts] the best way to copy a string in go

10878 views
Skip to first unread message
Message has been deleted

Maxim Khitrov

unread,
Jun 4, 2013, 9:25:00 AM6/4/13
to beevik, golan...@googlegroups.com
On Tue, Jun 4, 2013 at 2:42 AM, beevik <br...@beevik.com> wrote:
> Go gives us the built-in copy function to make a copy of a slice's contents,
> but it doesn't (to my knowledge) offer anything similar for strings. This
> means that in order to copy a substring (and thereby allow the larger string
> to be released by the garbage collector once it is no longer referenced),
> you have to copy the substring to a byte slice and then back to a string.
> This seems less than optimal. It's also kind of ugly.
>
> Is there a faster way to "deep copy" a string that doesn't require
> conversion to and from a slice first? Wouldn't a built-in copy function
> that accepts a destination string be useful in this regard?

You can do this with reflect and unsafe, but I think a better solution
would be for the compiler to recognize code like string([]byte(s)) and
optimize it to a single copy operation. Right now, the version below
is about 2x faster:

func Copy(s string) string {
var b []byte
h := (*reflect.SliceHeader)(unsafe.Pointer(&b))
h.Data = (*reflect.StringHeader)(unsafe.Pointer(&s)).Data
h.Len = len(s)
h.Cap = len(s)
return string(b)
}

Carlos Castillo

unread,
Jun 4, 2013, 8:01:53 PM6/4/13
to golan...@googlegroups.com
I have two that don't require reflect: http://play.golang.org/p/3rVsMTaHUz

The first appends a character to the string (forcing a copy) and then immediately returns the string with the extra character sliced off. This wastes 1 byte.

The second appends two different sub-strings of the string together (again forcing a copy). This can panic if used on the empty string so I added the simple test, this does mean that if passed the empty string you might get aliases to the same empty string.

Unfortunately, you can't use unsafe on the playground, so there's no way of knowing if the strings were "copied". I have checked it offline though.

beevik

unread,
Jun 4, 2013, 9:06:05 PM6/4/13
to golan...@googlegroups.com
Thanks for the answers.  Those are pretty clever solutions.

It's odd that the language forces us to jump through these kinds of hoops just to copy a string's contents.  String copying is a pretty common operation, and the only way to do it in go is to (1) use unsafe, (2) use string concatenation, or (3) copy the string to a byte slice and back to a string.  None of them is very elegant, and the most straightforward approach (#3) is slow.  Clearly, the designers recognized the importance of convenience when copying slices and decided to offer the built-in copy() function.  But they didn't offer a similar solution for strings.  I wonder how difficult it would have been to allow a string as the first parameter of the copy() function.  It already allows a special case where the destination parameter is a slice of bytes and the source parameter is a string. Perhaps I'm missing something that would make this approach fraught with difficulty (maybe UTF-8 parsing?).

I really love programming in go, but for some reason this feels to me like a flaw in the language.

Nigel Tao

unread,
Jun 4, 2013, 9:13:22 PM6/4/13
to beevik, golang-nuts
On Wed, Jun 5, 2013 at 11:06 AM, beevik <br...@beevik.com> wrote:
> String copying is a pretty common operation

Is it really? It's already trivial to slice a string. You say you want
to make a copy for GC purposes, but that isn't free. Sure, you avoid
referencing the source string, but you are allocating a new copy.
What's the circumstance where you have large input strings that you
want to make small sub-copies of?

Jens Alfke

unread,
Jun 4, 2013, 9:19:00 PM6/4/13
to golan...@googlegroups.com


On Tuesday, June 4, 2013 6:06:05 PM UTC-7, beevik wrote:
It's odd that the language forces us to jump through these kinds of hoops just to copy a string's contents.  String copying is a pretty common operation

Strings in Go are immutable, so strictly speaking there's no need to copy them. (Similarly, in Objective-C sending -copy to an immutable NSString is a no-op. Lua goes further, interning all strings so there is never more than one instance of a particular string in memory.)

It sounds as though there is detail of the implementation that means that a substring object points to the same backing store as the containing string and will keep the entire containing string in memory. There's nothing about the Go language that implies that, though, and a hypothetical sufficiently-smart implementation of strings could optimize away that problem.

So this problem isn't an issue with the language, it's a quirk of the runtime that may need to be worked around in some programs. Going to a byte array and back to a string seems like a reasonable way to do it.

--Jens

beevik

unread,
Jun 4, 2013, 9:20:51 PM6/4/13
to golan...@googlegroups.com, beevik
I'm parsing a long string and storing a substring of that parsed string in another object, after which I no longer need the long string.  I'd rather that the garbage collector not hold onto the long string, as I parse a lot of them.

Nigel Tao

unread,
Jun 4, 2013, 9:24:44 PM6/4/13
to beevik, golang-nuts
On Wed, Jun 5, 2013 at 11:20 AM, beevik <br...@beevik.com> wrote:
> I'm parsing a long string and storing a substring of that parsed string in
> another object, after which I no longer need the long string. I'd rather
> that the garbage collector not hold onto the long string, as I parse a lot
> of them.

Presuming that the input eventually comes from an io.Reader, perhaps
it'd be better to parse a long []byte instead of a long string. In
other words, delay the []byte to string conversion until after you
trim.

Nigel Tao

unread,
Jun 4, 2013, 9:29:13 PM6/4/13
to beevik, golang-nuts
On Wed, Jun 5, 2013 at 11:24 AM, Nigel Tao <nige...@golang.org> wrote:
> Presuming that the input eventually comes from an io.Reader, perhaps
> it'd be better to parse a long []byte instead of a long string. In
> other words, delay the []byte to string conversion until after you
> trim.

I should add that a benefit of keeping things as []bytes as long as
possible means that you can re-use buffers (and hence allocate less
garbage). If you currently do:

for moreInput {
s := readString() // This will allocate a large garbage string.
save(copyOf(s[i:j]))
}

do this instead:

b := bytes.NewBuffer(nil)
for moreInput {
b.Reset()
readIntoBuffer(b) // Re-use the buffer.
save(string(b[i:j]) // This will allocate a small string.
}

Carlos Castillo

unread,
Jun 4, 2013, 9:43:35 PM6/4/13
to beevik, golang-nuts
Strings are meant to be immutable. The benefit this entails is that it is essentially free to get a substring of any size. The downside as you mention is that they aren't as "controllable" as byte slices, meaning you don't directly decide when to copy them, and of course you can't mutate them (a copy is made instead). Most of the time, if you need the control to optimize for size / speed, you should be using a []byte. Nearly every operation that can be done on a string can be done to a slice (eg: see how similar the string & bytes packages are). The purpose of strings is that they're easier to use, and due to immutability guarentee, are safe to share with other code.

String copy is common because there are many languages that don't have immutable strings, or don't have automatic memory management. In C for example, you'd use strdup to make a copy of a string that you are about to pass to a function to prevent the function from being able to modify your string. If you have a large string, where you only need a piece of it, it again makes sense to copy the piece and free the large string so you don't have to keep track of the large string, and free it once you are no longer using the piece. In both these situations, the copy is being made to ensure or simplify correctness. Neither is needed for go to work correctly w.r.t. strings, as the GC will ensure no longer accessible memory is freed/reused, and the immutable property of strings means you don't have to worry about strings you hand to foreign code..

IMHO, unless you know for certain that one of the small sub-strings will pin your large string "forever", I usually don't worry about this kind of micro-optimization, especially since you are trading already allocated space for speed (it takes time to copy, even only once) and complexity. 


--
You received this message because you are subscribed to a topic in the Google Groups "golang-nuts" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/golang-nuts/naMCI9Jt6Qg/unsubscribe?hl=en-US.
To unsubscribe from this group and all its topics, send an email to golang-nuts...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.
 
 



--
Carlos Castillo

Kevin Gillette

unread,
Jun 4, 2013, 10:56:40 PM6/4/13
to golan...@googlegroups.com
On Tuesday, June 4, 2013 6:01:53 PM UTC-6, Carlos Castillo wrote:
I have two that don't require reflect: http://play.golang.org/p/3rVsMTaHUz

You can't rely on copys2 producing a copy, since strings are immutable and the compiler may eventually notice that `a[:1] + a[1:] == a`. An optimizing compiler could also notice that `(a + " ")[:len(a)] == a` also, and so copys1 also can't be relied upon -- both are based on implementation details.

`string([]byte("something"))` is trivial for a compiler to optimize, and is worth the wasted allocation until that time.

PS: the compiler could also make the above string -> []byte -> string conversion result in a reference to the same memory, but it's also a lot more obvious what it's trying to achieve.

Ian Lance Taylor

unread,
Jun 5, 2013, 12:00:31 AM6/5/13
to beevik, golan...@googlegroups.com
On Tue, Jun 4, 2013 at 6:20 PM, beevik <br...@beevik.com> wrote:
> I'm parsing a long string and storing a substring of that parsed string in
> another object, after which I no longer need the long string. I'd rather
> that the garbage collector not hold onto the long string, as I parse a lot
> of them.

Where did the long string come from?

Ian

beevik

unread,
Jun 5, 2013, 12:04:56 PM6/5/13
to golan...@googlegroups.com, beevik
The strings passed into this API tend to be of of two classes: string literals and URL query strings supplied by another package.

Nigel Tao

unread,
Jun 5, 2013, 8:17:41 PM6/5/13
to beevik, golang-nuts
On Thu, Jun 6, 2013 at 2:04 AM, beevik <br...@beevik.com> wrote:
> The strings passed into this API tend to be of of two classes: string
> literals and URL query strings supplied by another package.

Ah, I'm guessing that that's tens or hundreds of bytes. When you said
large, I was thinking tens of thousands. I'd be surprised if making
string copies instead of string slices would affect performance that
much, but if you want to measure, it's not hard to insert the 4-line
copys2 function that Carlos wrote and see. The current compiler should
allocate a new string for copys2. Future implementations might
'optimize' that, but for now you'd be able to see if it actually makes
a measurable difference.

beevik

unread,
Jun 5, 2013, 9:10:57 PM6/5/13
to golan...@googlegroups.com, beevik
You're probably right that in the grand scheme of things, this particular case isn't going to kill my application.  I guess it just bothered me that the runtime could be doing something "not so obvious" like this, and that my only recourse is to do a double-string-copy or play around with unsafe.

Interestingly enough, Java's String.substring function used to work in similarly to go's, but Oracle changed it last year to make copies instead:

Eric Lippert also had an interesting take on this topic when it came up in .NET (which automatically makes copies of substrings even though .NET strings are also immutable):


I don't know if I agree with either of their approaches; I actually like that current go implementations do O(1) string slicing and hope this never changes. But I can see where people might get stung by it, since it's not obvious that the original string sticks around.  (Similar precautions have to be taken with slices, but it's more of a known issue with slices, since the specification makes it clear that slices of slices share the same backing store.)

beevik

unread,
Jun 5, 2013, 9:31:23 PM6/5/13
to golan...@googlegroups.com, beevik
On Wednesday, June 5, 2013 6:10:57 PM UTC-7, beevik wrote: 
I guess it just bothered me that the runtime could be doing something "not so obvious" like this.

Oops, I meant compiler.  Not runtime.
 

Ziad Hatahet

unread,
Jun 5, 2013, 11:51:27 PM6/5/13
to beevik, golan...@googlegroups.com
I wonder if this issue is partly due to the fact that Go lacks a way to enforce runtime constants. Wouldn't one solution be to return a constant []byte when casting string into a []byte?


--
Ziad


Reply all
Reply to author
Forward
0 new messages