Sanitising a UTF-8 string

Juliusz Chroboczek

Oct 22, 2017, 11:22:03 AM
to golan...@googlegroups.com
I'm probably missing something obvious, but I've looked through the
standard library to no avail. How do I sanitise a []byte to make sure
it's a UTF-8 string by replacing all incorrect sequences by the
replacement character (or whatever)?

I've found unicode/utf8.Valid, which tells me if a []byte is a UTF-8
string, but I don't see a convenient function that I can use on the string
before I pass it to the frontend that requires well-formed UTF-8.

Thanks,

-- Juliusz

andrey mirtchovski

Oct 22, 2017, 11:57:17 AM
to Juliusz Chroboczek, golang-nuts
See the section "For statements with range clause" in the spec:
https://golang.org/ref/spec#For_statements

"For a string value, the "range" clause iterates over the Unicode code
points in the string starting at byte index 0. On successive
iterations, the index value will be the index of the first byte of
successive UTF-8-encoded code points in the string, and the second
value, of type rune, will be the value of the corresponding code
point. If the iteration encounters an invalid UTF-8 sequence, the
second value will be 0xFFFD, the Unicode replacement character, and
the next iteration will advance a single byte in the string."
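
A minimal sketch of how that range behaviour could be used to sanitise a []byte, assuming one replacement character per invalid byte is acceptable. The name sanitizeRange is hypothetical, not a standard library function, and strings.Builder needs Go 1.10 or later.

```go
package main

import (
	"fmt"
	"strings"
)

// sanitizeRange is a hypothetical helper (not part of the standard library).
// It ranges over the input as a string; for every invalid byte the range
// clause yields U+FFFD and advances one byte, so re-encoding each rune
// produces well-formed UTF-8.
func sanitizeRange(b []byte) string {
	var sb strings.Builder
	sb.Grow(len(b)) // a lower bound: each U+FFFD takes 3 bytes when encoded
	for _, r := range string(b) {
		sb.WriteRune(r)
	}
	return sb.String()
}

func main() {
	// 0xff, 0xc0 and a lone 0xaf are all invalid in UTF-8.
	in := []byte("a\xffb\xc0\xafc")
	fmt.Printf("%q\n", sanitizeRange(in))
}
```

Valid multi-byte sequences pass through unchanged, since WriteRune re-encodes the same code point that the range clause decoded.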

Jakob Borg

Oct 22, 2017, 12:46:43 PM
to Juliusz Chroboczek, golan...@googlegroups.com
Converting a string to a slice of runes gives you the individual code points, with the replacement character substituted where necessary. Converting a slice of runes into a string gives you the UTF-8 representation. So sanitising a string should be as simple as string([]rune(someString)). This will be O(n) and incur allocations. Going to and from []byte adds another conversion and copy.

There may be a more efficient way directly on a byte slice. 

//jb

Sam Whited

Oct 22, 2017, 1:04:35 PM
to golan...@googlegroups.com
On Sun, Oct 22, 2017, at 09:29, Juliusz Chroboczek wrote:
> I'm probably missing something obvious, but I've looked through the
> standard library to no avail. How do I sanitise a []byte to make sure
> it's a UTF-8 string by replacing all incorrect sequences by the
> replacement character (or whatever)?

The golang.org/x/text/transform package can be used to do very efficient
transformations on byte slices and strings.
A transformer called ReplaceIllFormed
(https://godoc.org/golang.org/x/text/runes#ReplaceIllFormed) exists in
the golang.org/x/text/runes package to do what you want (replace invalid
UTF-8 sequences with utf8.RuneError).
If you have a string already and don't mind a bit of extra allocation
overhead, you can do something simple like this:

runes.ReplaceIllFormed().String("mystring")

—Sam