Clarification on unsafe conversion between string <-> []byte

Francis

Sep 18, 2019, 5:42:31 AM
to golang-nuts
I am looking at the correct way to convert from a byte slice to a string and back with no allocations. All very unsafe.

I think these two cases are fairly symmetrical. So to simplify the discussion below I will only talk about converting from a string to []byte.

func StringToBytes(s string) (b []byte)

From what I have read it is currently not clear how to perform this correctly.

When I say correctly I mean that the function returns a `[]byte` which contains all of and only the bytes in the string and never confuses the garbage collector. We fully expect that the `[]byte` returned will contain the same underlying memory as the string and modifying its contents will modify the string, making the string dangerously mutable. We are comfortable with the dangerously mutable string.
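
For contrast, here is a minimal sketch of the ordinary, safe conversion. It copies the bytes, so writes to the slice never reach the string; that copy is the cost the unsafe versions discussed in this thread try to avoid. The name `safeStringToBytes` is hypothetical and only used for illustration.

```go
package main

import "fmt"

// safeStringToBytes is a hypothetical name for the plain conversion.
// The conversion copies the string's bytes into a fresh backing array.
func safeStringToBytes(s string) []byte {
	return []byte(s)
}

func main() {
	s := "hello"
	b := safeStringToBytes(s)
	b[0] = 'J' // modifies only the copy; s is untouched
	fmt.Println(s, string(b))
}
```

Because the slice owns its own memory, the string stays immutable, which is the property the unsafe versions knowingly give up.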

Following the directions in unsafe you _might_ think that this would be a good solution.

func StringToBytes(s string) []byte {
        return *(*[]byte)(unsafe.Pointer(&reflect.SliceHeader{
                Data: (*(*reflect.StringHeader)(unsafe.Pointer(&s))).Data,
                Len:  len(s),
                Cap:  len(s),
        }))
}

Here is a really carefully commented example of this approach from github

The line

Data: (*(*reflect.StringHeader)(unsafe.Pointer(&s))).Data,

seems to satisfy unsafe rule 5 about converting uintptr to and from unsafe.Pointer in a single expression.

However, it clearly violates rule 6 which states `...SliceHeader and StringHeader are only valid when interpreting the content of an actual slice or string value.`. The `[]byte` we are returning here is built from a `&reflect.SliceHeader` and not based on an existing `[]byte`.

So we can switch to

func StringToBytes(s string) (b []byte) {
        stringHeader := (*reflect.StringHeader)(unsafe.Pointer(&s))
        sliceHeader := (*reflect.SliceHeader)(unsafe.Pointer(&b))
        sliceHeader.Data = stringHeader.Data
        sliceHeader.Len = stringHeader.Len
        sliceHeader.Cap = stringHeader.Len
        return b
}

Now we are using an existing []byte to build `sliceHeader` which is good. But we end up with a new problem. sliceHeader.Data and stringHeader.Data are both uintptr. So by creating them in one expression and then writing them in another expression we violate the rule that `uintptr cannot be stored in variable`.

There is a possible sense that we are protected because both of our `uintptr`s are actually real pointers inside a real string and []byte. This seems to be indicated by the line `In this usage hdr.Data is really an alternate way to refer to the underlying pointer in the string header, not a uintptr variable itself.`

This feels very unclear to me.

In particular the code example in the unsafe package

var s string
hdr := (*reflect.StringHeader)(unsafe.Pointer(&s)) // case 1
hdr.Data = uintptr(unsafe.Pointer(p))              // case 6 (this case)
hdr.Len = n

is not the same as the case we are dealing with here. Specifically, the unsafe package documentation converts a real pointer straight into the Data field, whereas we are copying one uintptr field into another. They are probably very similar in practice, but it isn't obvious and in my experience subtly incorrect code often comes from relying on vague understandings of important documents.
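
The uintptr rule under discussion is easiest to see in the pointer-arithmetic pattern from the unsafe documentation. The sketch below is only an illustration: the `pair` type and `secondField` function are hypothetical names. It keeps the conversion to uintptr, the addition, and the conversion back to a pointer inside one expression, which is the form the documentation describes as valid.

```go
package main

import (
	"fmt"
	"unsafe"
)

// pair is a hypothetical struct used only for this illustration.
type pair struct {
	a int32
	b int32
}

// secondField returns a pointer to p.b using pointer arithmetic.
// Pointer -> uintptr -> Pointer all happens in a single expression.
func secondField(p *pair) *int32 {
	return (*int32)(unsafe.Pointer(uintptr(unsafe.Pointer(p)) + unsafe.Offsetof(p.b)))
}

// The form below is what the documentation marks as invalid, because the
// uintptr is stored in a variable before being converted back:
//
//	u := uintptr(unsafe.Pointer(p))
//	q := (*int32)(unsafe.Pointer(u + unsafe.Offsetof(p.b)))

func main() {
	p := &pair{a: 1, b: 2}
	fmt.Println(*secondField(p))
}
```

The open question in this post is whether copying one header's Data field into another counts as the stored-in-a-variable case or not.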

If we assume that our uintptrs are safe because they are backed by real pointers then there is another issue with our string being garbage collected.

func StringToBytes(s string) (b []byte) {
        stringHeader := (*reflect.StringHeader)(unsafe.Pointer(&s))
        // Our string is no longer referenced anywhere and could potentially be garbage collected
        sliceHeader := (*reflect.SliceHeader)(unsafe.Pointer(&b))
        sliceHeader.Data = stringHeader.Data
        sliceHeader.Len = stringHeader.Len
        sliceHeader.Cap = stringHeader.Len
        return b
}


There is a discussion where this potential problem is raised


we also see this issue mentioned in


The solution of 

func StringToBytes(s string) (b []byte) {
        stringHeader := (*reflect.StringHeader)(unsafe.Pointer(&s))
        sliceHeader := (*reflect.SliceHeader)(unsafe.Pointer(&b))
        sliceHeader.Data = stringHeader.Data
        sliceHeader.Len = stringHeader.Len
        sliceHeader.Cap = stringHeader.Len
        // Keep s reachable until the slice header has been fully written
        runtime.KeepAlive(&s)
        return b
}


is proposed. This _probably_ works. But a survey of the implementations of unsafe string/[]byte conversions in Go projects that we depend on at work (this operation is very common), didn't show a single example of anyone using the KeepAlive trick.

In particular the person who initiated the conversation in golang-nuts where the KeepAlive was suggested has implemented this conversion without it


A workaround of writing 

func StringToBytes(s string) (b []byte) {
        stringHeader := (*reflect.StringHeader)(unsafe.Pointer(&s))
        sliceHeader := (*reflect.SliceHeader)(unsafe.Pointer(&b))
        sliceHeader.Data = stringHeader.Data
        sliceHeader.Len = len(s)
        sliceHeader.Cap = len(s)
        // Maybe we managed to hold onto s until here?
        return b
}


was proposed. I think the reasoning here is that the references to `len(s)` keep the string alive. I am not totally convinced because I think the compiler is free to write this as 

func StringToBytes(s string) (b []byte) {
        stringHeader := (*reflect.StringHeader)(unsafe.Pointer(&s))
        sliceHeader := (*reflect.SliceHeader)(unsafe.Pointer(&b))
        sliceHeader.Len = len(s)
        sliceHeader.Cap = len(s)
        // Compiler has reordered our code, and s might be garbage collected
        sliceHeader.Data = stringHeader.Data
        return b
}

but, maybe this modification can never happen.

At this point I don't think we have any clear answers about how to write this code correctly.

If we look inside the Go codebase we see a few interesting approaches

going from []byte to string we can just

return *(*string)(unsafe.Pointer(&b))

This approach is used in 


So we can see that the Go std-lib isn't using Slice/StringHeader to perform these conversions, which seems like a shame.

Looking at how this is done on github reveals a variety of different approaches, and discussions on the topic never seem to be conclusive.



In general it seems that 

func StringToBytes(s string) (b []byte) {
        stringHeader := (*reflect.StringHeader)(unsafe.Pointer(&s))
        sliceHeader := (*reflect.SliceHeader)(unsafe.Pointer(&b))
        sliceHeader.Data = stringHeader.Data
        sliceHeader.Len = len(s)
        sliceHeader.Cap = len(s)
        return b
}


is probably pretty good, and here is a very carefully commented example of it from github.


Is there an authoritative correct way to do this?

Ian Lance Taylor

Sep 18, 2019, 4:46:44 PM
to Francis, golang-nuts
On Wed, Sep 18, 2019 at 2:42 AM Francis <francis...@gmail.com> wrote:
>
> I am looking at the correct way to convert from a byte slice to a string and back with no allocations. All very unsafe.
>
> I think these two cases are fairly symmetrical. So to simplify the discussion below I will only talk about converting from a string to []byte.
>
> func StringToBytes(s string) (b []byte)

No reason to use SliceHeader, and avoiding SliceHeader avoids the
problems you discuss.

func StringToBytes(s string) []byte {
        const max = 0x7fff0000
        if len(s) > max {
                panic("string too long")
        }
        return (*[max]byte)(unsafe.Pointer((*reflect.StringHeader)(unsafe.Pointer(&s)).Data))[:len(s):len(s)]
}

Of course, as you say, you must not mutate the returned []byte.

Ian

Francis

Sep 20, 2019, 4:30:58 PM
to golang-nuts
Thanks Ian, that's a very interesting solution.

Is there a solution for going in the other direction? Although I excluded it from the initial post, it was only to reduce the size of the discussion. I would also like to implement

func BytesToString(b []byte) string

I don't clearly see how to avoid using the StringHeader in this case.

F

Dan Kortschak

Sep 21, 2019, 2:11:08 AM
to Francis, golang-nuts
func bytesToString(b []byte) string {
        return *(*string)(unsafe.Pointer(&b))
}

https://play.golang.org/p/azJPbl946zj
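
A short sketch of what this cast implies in practice, with the function repeated so it runs on its own. The `main` function and its values are illustrative assumptions. Because no copy is made, a later write to the slice is visible through the string.

```go
package main

import (
	"fmt"
	"unsafe"
)

// bytesToString reinterprets the slice header as a string header.
// The first two words of a slice (pointer, length) line up with a string.
func bytesToString(b []byte) string {
	return *(*string)(unsafe.Pointer(&b))
}

func main() {
	b := []byte("hello")
	s := bytesToString(b)
	b[0] = 'J' // s shares b's memory, so it observes this write
	fmt.Println(s)
}
```

This is the "dangerously mutable string" from the first post: the caller has to guarantee that the slice is not written to while the string is still in use.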

fra...@adeven.com

Sep 23, 2019, 5:37:00 AM
to golang-nuts
That would work Kortschak.

But this relies on a string's representation being the same as, but a bit smaller than, a []byte. I would prefer to use the Slice/StringHeader.

It's worth noting that _most_ of the problems I described in my initial post are hypothetical at this stage. The issue with strings being garbage collected mid-conversion is (I think) certain not to happen in Go 1.12, and the problems with storing a uintptr in a variable are mostly related to moving garbage collectors (so long as the location your uintptr points to doesn't get garbage collected). But both of these facts have a reasonable likelihood of changing in the future.

The thing I would like the most is a pair of unsafe string <-> []byte conversion implementations which satisfy the rules of unsafe as they are written now which will _not_ compile if the representation of []byte/string changes in the future.

This was what I thought I would get using the reflect.Slice/StringHeader structs and some unsafe. It's surprising that this doesn't work in a straight-forward way.

Jan Mercl

Sep 23, 2019, 5:42:55 AM
to fra...@adeven.com, golang-nuts
On Mon, Sep 23, 2019 at 11:37 AM <fra...@adeven.com> wrote:

> ... and the problems with storing a uintptr in a variable are mostly related to moving garbage collectors ...

Please note that even though, AFAIK, none of the Go compilers so far
uses a moving GC, it's not the only source of objects being moved
around. For a long time the gc Go compiler has had movable stacks, and
things may reside on the stack whenever the compiler decides it can and
wants to put them there.

fra...@adeven.com

Sep 23, 2019, 5:45:03 AM
to golang-nuts
That's super interesting. Thanks for the pointer Jan :bow:

Dan Kortschak

Sep 23, 2019, 6:43:34 AM
to fra...@adeven.com, golang-nuts
Any particular reason for that? Neither is safer than the other and
it's not clear to me that you can actually achieve the goal of having a
compile-time check for the correctness of this type of conversion.

Francis

Sep 23, 2019, 11:09:40 AM
to golang-nuts
So I think the current state of unsafe conversions of string <-> []byte is roughly

1. Use the reflect Slice/StringHeader struct. These structs give you clear fields to set and read from. If the runtime representation of a string or []byte ever changes then these structs should change to reflect this (they have a non-backwards compatibility carve out in the comments). But this also means that you run into all these exotic problems because these two structs use a `uintptr` rather than an `unsafe.Pointer`, so for a short time the GC won't realise you are reading/writing a pointer. This makes correct use of these structs very difficult.
2. You can just cast between these two types going through `unsafe.Pointer` on the way. This works, because these two types have almost identical layouts. We don't use any uintptr at all and so the GC probably won't get confused. But, if the representations of string or []byte ever change then your code breaks silently, and could have very weird/hard to track down problems.

So I don't think `neither is safer than the other` is quite the right description in this context. They both have problems, so they are both not-perfect. But their problems are quite distinct. At the least if we choose one over the other we can describe clearly which set of problems we want to have.

My hope was that someone had thought through these problems and could indicate the right way to do it.

On a related note. I was trying to track down where the Slice/StringHeader was first introduced. It was a long time ago 

<Rob Pike> (10 years ago) 29e6eb21ec  (HEAD)

make a description of the slice header public

R=rsc
DELTA=18  (3 added, 0 deleted, 15 changed)
OCL=31086
CL=31094

Although I couldn't open that CL in gerrit (I assume user-error). From reading the commit I think the intention was for these header structs to be used for this or similar things. But the data was represented as a uintptr and a comment explicitly states that these structs are of no use without `unsafe.Pointer`. I have seen roughly three other CLs which try to change the data field to `unsafe.Pointer` but were rejected because they change the reflect package's API.

There is also this issue


Which proposes that Slice/StringHeader be moved/duplicated in unsafe and use `unsafe.Pointer`. As far as I can tell once we have this then all the subtle problems disappear and lovingly crafted examples like


just become the right way to do it.

Until then maybe we should just rely on the structural similarities between the two types and cast between them. This seems especially appealing as Jan pointed out above that at least one of the hypothetical problems isn't hypothetical at all.
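
For readers on newer toolchains: this thread predates them, but Go 1.20 added `unsafe.String`, `unsafe.StringData`, and `unsafe.SliceData` (alongside `unsafe.Slice` from Go 1.17), which express both conversions without any header struct or uintptr. A minimal sketch, assuming Go 1.20 or newer; the function names simply mirror the ones used earlier in the thread and the empty-input guards are a defensive choice of this sketch.

```go
package main

import (
	"fmt"
	"unsafe"
)

// StringToBytes returns a []byte that shares memory with s.
// The result must be treated as read-only.
func StringToBytes(s string) []byte {
	if len(s) == 0 {
		return nil
	}
	return unsafe.Slice(unsafe.StringData(s), len(s))
}

// BytesToString returns a string that shares memory with b.
// b must not be modified while the string is in use.
func BytesToString(b []byte) string {
	if len(b) == 0 {
		return ""
	}
	return unsafe.String(unsafe.SliceData(b), len(b))
}

func main() {
	b := StringToBytes("gopher")
	fmt.Println(len(b), cap(b), BytesToString(b))
}
```

The mutability hazards are unchanged; only the GC and layout concerns raised above go away.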

Robert Engels

Sep 23, 2019, 12:38:27 PM
to Francis, golang-nuts
As someone that has worked with a lot of similar libraries in the HFT space - things like UnsafeString or FastString in Java I would caution against doing this in Go - especially as proposed here. Taking an immutable object like string and making it mutable by accident is a recipe for disaster. You are almost always better mapping a struct with accessors and letting Go escape analysis perform the work on the stack and keep the safety. 
--
You received this message because you are subscribed to the Google Groups "golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/golang-nuts/422ca2bd-d6c8-4ebe-9578-8dd3cd8317e9%40googlegroups.com.

Keith Randall

Sep 23, 2019, 7:40:02 PM
to golang-nuts
In the runtime we use structs like these, but with unsafe.Pointer data fields (runtime.stringStruct and runtime.slice). They are much safer to use than reflect's types with uintptr Data fields. Unfortunately we can't change reflect's types because of the Go 1 compatibility guarantee.

You can do the same thing the runtime does. Something like this would work, and hopefully catch any future changes which would break the implementation:

type stringHeader struct {
        data unsafe.Pointer
        len  int
}
type sliceHeader struct {
        data unsafe.Pointer
        len  int
        cap  int
}

func init() {
        // Check to make sure string header is what reflect thinks it is.
        // They should be the same except for the type of the Data field.
        if unsafe.Sizeof(stringHeader{}) != unsafe.Sizeof(reflect.StringHeader{}) {
                panic("string layout has changed")
        }
        x := stringHeader{}
        y := reflect.StringHeader{}
        x.data = unsafe.Pointer(y.Data)
        y.Data = uintptr(x.data)
        x.len = y.Len

        // same for slice
}
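
To make the idea concrete, here is a minimal sketch of how the two conversions might look on top of these private header types. The conversion functions and the `main` function are assumptions of this sketch, not code from the message above, and the `init` check is reduced to the size comparisons.

```go
package main

import (
	"fmt"
	"reflect"
	"unsafe"
)

type stringHeader struct {
	data unsafe.Pointer
	len  int
}

type sliceHeader struct {
	data unsafe.Pointer
	len  int
	cap  int
}

func init() {
	// Fail loudly at start-up if the layouts drift from reflect's view.
	if unsafe.Sizeof(stringHeader{}) != unsafe.Sizeof(reflect.StringHeader{}) {
		panic("string layout has changed")
	}
	if unsafe.Sizeof(sliceHeader{}) != unsafe.Sizeof(reflect.SliceHeader{}) {
		panic("slice layout has changed")
	}
}

// StringToBytes shares s's memory; the result must not be written to.
func StringToBytes(s string) (b []byte) {
	sh := (*stringHeader)(unsafe.Pointer(&s))
	bh := (*sliceHeader)(unsafe.Pointer(&b))
	bh.data = sh.data // pointer-typed field, so the GC always sees a pointer
	bh.len = sh.len
	bh.cap = sh.len
	return b
}

// BytesToString shares b's memory; b must not change while s is in use.
func BytesToString(b []byte) (s string) {
	bh := (*sliceHeader)(unsafe.Pointer(&b))
	sh := (*stringHeader)(unsafe.Pointer(&s))
	sh.data = bh.data
	sh.len = bh.len
	return s
}

func main() {
	b := StringToBytes("gopher")
	fmt.Println(len(b), cap(b), BytesToString(b))
}
```

Since the data fields are `unsafe.Pointer`, no uintptr is ever stored, which sidesteps the rule discussed at the start of the thread.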


fra...@adeven.com

Sep 24, 2019, 5:13:44 AM
to golang-nuts
Keith, I think you've cracked it. It seems very simple in retrospect. Thanks.

fra...@adeven.com

Sep 24, 2019, 5:44:46 AM
to golang-nuts
Robert Engels, I am not familiar with the two libraries you named. But from your description I think (I'm not sure) that we have different uses in mind.

The escape analysis that would be required for us to avoid using unsafe is _possible_, but does not yet exist in the Go compiler. The compiler facilities needed to negate our need for unsafe are best described in this issue


So the pattern that exists in many Go programs is that you have a variable of type string([]byte) and you have a function which takes type []byte(string) and the function only reads from []byte(string) argument.

Often this comes in the form of 'I have a string, and I want to write its contents via some interface which takes []byte'. Most often it is about code-deduplication: we want to read from some kind of sequence of bytes, and []byte and string would both do fine.

While this does open us up to a class of bugs which are both dangerous and potentially hard to diagnose, the places I see it used are usually very self contained, and the benefits, if that read of the string/[]byte lies on a hot path, are potentially significant.

Although I don't know what UnsafeString/FastString are used for I _suspect_ they are for high-performance unsafe manipulation of strings. I have never seen anyone try to use unsafe to do this in Go. Someone probably does, but the overwhelmingly most common use case that I see is 'turn this string into a []byte and use this function to read from it'; going the other way is less common.

Personally I would _love_ to see the read-only bytes escape analysis built into the compiler so we can throw away all of this unsafe code.

k...@golang.org

Sep 24, 2019, 8:36:41 PM
to golang-nuts


On Tuesday, September 24, 2019 at 2:44:46 AM UTC-7, fra...@adeven.com wrote:
> Personally I would _love_ to see the read-only bytes escape analysis built into the compiler so we can throw away all of this unsafe code.

I would really love that also. I think we could do it for some cases, by propagating information about writes to []byte args up the call tree.
Unfortunately, probably the most common case would be passing through an interface (io.Writer), which would defeat that analysis.

Serhat Şevki Dinçer

Sep 26, 2019, 4:59:08 PM
to golang-nuts
Hi,

I wrote a string utility library with many tests to ensure the conversion assumptions are correct. It could be helpful.

Cheers..

fra...@adeven.com

Oct 4, 2019, 5:38:22 AM
to golang-nuts
Serhat,

That implementation looks very tidy. But it still uses uintptr. So it doesn't solve the GC problems discussed above.

fra...@adeven.com

Oct 4, 2019, 6:01:15 AM
to golang-nuts
I wrote a self contained implementation of the solution described by Keith Randall.

It seemed at the end of this long thread it would be nice to have something concrete to look at.

Serhat Şevki Dinçer

Oct 7, 2019, 4:04:05 AM
to golang-nuts
Turns out it did not really need uintptr so I switched to unsafe.Pointer. Also added more tests.

Thanks Francis..

fra...@adeven.com

Oct 7, 2019, 4:35:38 AM
to golang-nuts
Thank you Serhat :)