Go string and UTF-8


Pierre Durand

Aug 20, 2019, 4:12:20 AM
to golang-nuts
I know that by convention Go strings contain UTF-8 encoded text.

Is it recommended or good practice to store invalid bytes in a string?

The use case:
- compute a hash => get a []byte
- convert the []byte to string (this string is not UTF-8 valid)
- use the string as a map key

In my case, the hash algorithm is md5.
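The use case above can be sketched as follows. This is only an illustration, not the code under review: the helper name `keyFor` and the sample inputs are assumptions made for the example.

```go
package main

import (
	"crypto/md5"
	"fmt"
	"unicode/utf8"
)

// keyFor is a hypothetical helper: it returns the raw 16-byte MD5
// digest of data converted to a string. The result is binary data,
// not text, and is generally not valid UTF-8.
func keyFor(data []byte) string {
	sum := md5.Sum(data) // sum has type [16]byte
	return string(sum[:])
}

func main() {
	seen := map[string]int{}
	seen[keyFor([]byte("hello"))]++
	seen[keyFor([]byte("hello"))]++
	seen[keyFor([]byte("world"))]++

	k := keyFor([]byte("hello"))
	fmt.Println(len(k), seen[k], len(seen)) // 16 2 2
	fmt.Println(utf8.ValidString(k))        // false for this input
}
```

The map behaves normally because string comparison and hashing work on bytes, regardless of whether those bytes form valid UTF-8.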

djego...@gmail.com

Aug 20, 2019, 4:34:55 AM
to golang-nuts
On Tue, Aug 20, 2019 at 10:12 AM Pierre Durand wrote:
>
> I know that by convention Go string contain UTF-8 encoded text.

To my understanding this is not entirely true (see https://blog.golang.org/strings#TOC_2.): a string is simply a read-only slice of bytes. However, there are at least two places where the language spec uses UTF-8 for strings: source files are expected to be UTF-8 (so string literals are partially affected), and the `for range` construct decodes UTF-8 when iterating over a string. Beyond that, various packages (e.g. unicode/utf8) expect UTF-8 encoded strings as arguments.
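A small sketch of that distinction, assuming a made-up three-byte input and a hypothetical helper `countRuneErrors`: indexing and `len` see raw bytes, while `for range` decodes UTF-8 and yields U+FFFD for each byte it cannot decode.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// countRuneErrors ranges over s and counts the runes that come back as
// utf8.RuneError (U+FFFD). Note that a correctly encoded U+FFFD in the
// input would be counted as well.
func countRuneErrors(s string) int {
	n := 0
	for _, r := range s {
		if r == utf8.RuneError {
			n++
		}
	}
	return n
}

func main() {
	s := string([]byte{0xff, 'a', 0xfe}) // 0xff and 0xfe never occur in UTF-8

	fmt.Println(len(s))              // 3: len counts bytes
	fmt.Println(s[0] == 0xff)        // true: indexing returns the raw byte
	fmt.Println(utf8.ValidString(s)) // false
	fmt.Println(countRuneErrors(s))  // 2: one U+FFFD per undecodable byte
}
```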


> Is it recommended/a good practice to store invalid bytes in a string ?

Thus the concept of _invalid bytes in a string_ doesn't really exist ;-).

> The use case:
> - compute a hash => get a []byte
> - convert the []byte to string (this string is not UTF-8 valid)
> - use the string as a map key

I don't see any issues with this.

Pierre Durand

Aug 20, 2019, 6:01:50 AM
to golang-nuts
OK, thank you !

Sam Whited

Aug 20, 2019, 7:00:29 AM
to Pierre Durand, golang-nuts
I personally wouldn't do this. If you're going to incur the overhead of a heap allocation, you might as well incur a bit more and encode the hash, e.g. using hex.EncodeToString [1], just so that you don't forget and try to print or decode the string as UTF-8 later.
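A minimal sketch of that suggestion; the helper name `hexKey` is an assumption for the example. The resulting key is plain ASCII, so it is safe to print or log.

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
)

// hexKey is a hypothetical helper: it returns the MD5 digest of data
// as a 32-character lowercase hexadecimal string.
func hexKey(data []byte) string {
	sum := md5.Sum(data)
	return hex.EncodeToString(sum[:])
}

func main() {
	index := map[string]string{}
	index[hexKey([]byte("hello"))] = "greeting"

	fmt.Println(hexKey([]byte("hello"))) // 5d41402abc4b2a76b9719d911017c592
	fmt.Println(index[hexKey([]byte("hello"))])
}
```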

—Sam

[1]: https://godoc.org/encoding/hex#EncodeToString

Rob Pike

Aug 20, 2019, 7:51:28 AM
to Sam Whited, Pierre Durand, golang-nuts
Printf can print hexadecimal just fine. Never understood the point of encoding/hex.
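For comparison, a sketch of the fmt route; the helper name `hexViaFmt` is an assumption for the example. The `%x` verb formats a byte array or slice as a lowercase hexadecimal string.

```go
package main

import (
	"crypto/md5"
	"fmt"
)

// hexViaFmt is a hypothetical helper: it formats the MD5 digest of
// data as hexadecimal using only the fmt package.
func hexViaFmt(data []byte) string {
	sum := md5.Sum(data)
	return fmt.Sprintf("%x", sum)
}

func main() {
	fmt.Println(hexViaFmt([]byte("hello"))) // 5d41402abc4b2a76b9719d911017c592

	// Printing directly, without building a string first.
	fmt.Printf("%x\n", md5.Sum([]byte("hello")))
}
```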

Meanwhile, for questions about strings, see blog.golang.org/strings.

-rob


Sam Whited

Aug 20, 2019, 8:01:38 AM
to golan...@googlegroups.com


On August 20, 2019 11:50:54 AM UTC, Rob Pike <r...@golang.org> wrote:
>Printf can print hexadecimal just fine. Never understood the point of
>encoding/hex.

I always thought that the C style format strings were unreadable and the hex package methods were much clearer, personally.

—Sam

Pierre Durand

Aug 20, 2019, 9:17:29 AM
to golang-nuts
Well, in my case I don't want to convert the []byte to a hexadecimal string, because that uses twice as much memory.
The code contains a huge map where the key is an MD5 hash.
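The size difference is easy to check. In this sketch the helper name `keySizes` is an assumption, and only the key bytes are measured, not the map's own bookkeeping.

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
)

// keySizes is a hypothetical helper: it returns the length in bytes of
// the raw digest used as a string and of its hexadecimal encoding.
func keySizes(data []byte) (raw, hexLen int) {
	sum := md5.Sum(data)
	return len(string(sum[:])), len(hex.EncodeToString(sum[:]))
}

func main() {
	raw, hexLen := keySizes([]byte("hello"))
	fmt.Println(raw, hexLen) // 16 32
}
```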

Please note that I'm not personally working on this.
I was reviewing the code written by a coworker, and I noticed that there was a string variable containing "invalid UTF-8 bytes".
It felt very strange to have a string containing invalid text.

So I have another question: since md5.Sum() returns a [16]byte, is it better to use [16]byte as the map key?
Or should I use a string containing invalid text as the map key?

Jan Mercl

Aug 20, 2019, 9:33:03 AM
to Pierre Durand, golang-nuts
On Tue, Aug 20, 2019 at 3:17 PM Pierre Durand <pierre...@gmail.com> wrote:
>
> Well, in my case I don't want to convert the []byte to hexadecimal string, because it uses 2x more memory.
> The code contains a huge map where the key is an MD5 hash.

The value returned by md5.Sum is an array, and arrays can be used as map keys directly:
https://golang.org/pkg/crypto/md5/#Sum

Example: https://play.golang.org/p/qp-LFWh2Jln