Is []rune(invalidUTF8str) underspecified?


David Anderson

Jul 9, 2024, 3:42:51 PM
to golang-nuts
I've been going over the spec to clarify finer points of how string vs. []byte behave, and I think there may be an unnecessary degree of freedom that could be removed. Either that, or I missed a load-bearing statement that constrains implementations.

In https://go.dev/ref/spec#Conversions, `[]rune(str)` is specified as: "Converting a value of a string type to a slice of runes type yields a slice containing the individual Unicode code points of the string."

This does not specify the behavior if the string contains invalid UTF-8 byte sequences. If my reading is correct, a compliant implementation would be free to panic() on such a conversion, or implement the conversion in an arbitrary way of its choosing.

This is in contrast to for...range over a string, which strictly specifies how invalid UTF-8 byte sequences are handled. https://go.dev/ref/spec#For_statements says: "For a string value [...] If the iteration encounters an invalid UTF-8 sequence, the second value will be 0xFFFD, the Unicode replacement character, and the next iteration will advance a single byte in the string." This is in line with current Unicode recommendations for input processing, and (IMO) is the only reasonable thing to do when decoding invalid UTF-8.
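For reference, a minimal sketch of the range behavior quoted above. The step type and the rangeSteps helper are names made up for this illustration, and the input is an arbitrary string with two stray bytes in the middle.

```go
package main

import "fmt"

// step records one iteration of a range loop over a string.
// (Hypothetical type, used only for this example.)
type step struct {
	Offset int  // byte offset reported by range
	Rune   rune // rune reported by range
}

// rangeSteps collects every (offset, rune) pair that ranging over s produces.
func rangeSteps(s string) []step {
	var out []step
	for i, r := range s {
		out = append(out, step{i, r})
	}
	return out
}

func main() {
	// 'x', two bytes that can never appear in valid UTF-8, then 'y'.
	s := "x\xff\xfey"
	for _, st := range rangeSteps(s) {
		fmt.Printf("offset %d: %U\n", st.Offset, st.Rune)
	}
	// Prints:
	// offset 0: U+0078
	// offset 1: U+FFFD
	// offset 2: U+FFFD
	// offset 3: U+0079
}
```

Each invalid byte gets its own iteration, and the offset advances by exactly one byte each time, which is what the spec text describes.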

Empirically, the reference Go compiler does the sensible thing: string to []rune conversions behave consistently with the for...range behavior. I haven't checked, but I presume that gccgo et al. do the same: they must implement the range behavior anyway, so doing something different for []rune conversion would be more work just to introduce gratuitously surprising behavior.
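One way to observe this on a given toolchain is to compare the two decodings directly. This is only a sketch: runesByRange and sameAsConversion are hypothetical helpers, and agreement on a handful of inputs shows what one implementation does, not what the spec requires.

```go
package main

import "fmt"

// runesByRange decodes s using a for...range loop.
func runesByRange(s string) []rune {
	var out []rune
	for _, r := range s {
		out = append(out, r)
	}
	return out
}

// sameAsConversion reports whether []rune(s) yields exactly the runes
// that ranging over s yields, in the same order.
func sameAsConversion(s string) bool {
	a, b := []rune(s), runesByRange(s)
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

func main() {
	inputs := []string{
		"",                 // empty string
		"hello",            // plain ASCII
		"h\u00e9llo",       // valid multi-byte UTF-8
		"\xff",             // a byte that is never valid in UTF-8
		"x\xe2\x82y",       // truncated three-byte sequence
		"\xc0\x80\xed\xa0", // overlong form and surrogate prefix
	}
	for _, s := range inputs {
		fmt.Printf("%q: same=%v\n", s, sameAsConversion(s))
	}
}
```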

But, unless I missed a clarification in the spec, a contrarian implementation _could_ implement novel behavior for []rune conversion of invalid UTF-8. Did I miss anything?

If not, I'll file a proposal to spell out the required behavior in the spec, since I don't think there are any compatibility concerns or reasonable arguments for allowing []rune conversion alone to behave strangely in this respect.

- Dave

peterGo

Jul 9, 2024, 6:22:54 PM
to golang-nuts
On Tuesday, July 9, 2024 at 3:42:51 PM UTC-4 David Anderson wrote:
> I've been going over the spec to clarify finer points of how string vs. []byte behave, and I think there may be an unnecessary degree of freedom that could be removed. Either that, or I missed a load-bearing statement that constrains implementations.
>
> In https://go.dev/ref/spec#Conversions, `[]rune(str)` is specified as: "Converting a value of a string type to a slice of runes type yields a slice containing the individual Unicode code points of the string."
>
> This does not specify the behavior if the string contains invalid UTF-8 byte sequences. If my reading is correct, a compliant implementation would be free to panic() on such a conversion, or implement the conversion in an arbitrary way of its choosing.
>
> - Dave

A run-time panic requires explicit mention.

UTF-8 is defined by the Unicode Standard: https://www.unicode.org/versions/latest. Does the Unicode Standard allow arbitrary conversion behavior?

peter

peterGo

Jul 9, 2024, 7:02:52 PM
to golang-nuts
On Tuesday, July 9, 2024 at 3:42:51 PM UTC-4 David Anderson wrote:
> I've been going over the spec to clarify finer points of how string vs. []byte behave,
>
> In https://go.dev/ref/spec#Conversions, `[]rune(str)` is specified as: "Converting a value of a string type to a slice of runes type yields a slice containing the individual Unicode code points of the string."
>
> - Dave

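The Go runtime implementation of []rune(str):

func stringtoslicerune(buf *[tmpStringBufSize]rune, s string) []rune {
    // ...
    n = 0
    // The runes are produced by a range loop, so invalid UTF-8 gets
    // exactly the treatment that for...range gives it.
    for _, r := range s {
        a[n] = r
        n++
    }
    return a
}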


peter

David Anderson

Jul 9, 2024, 8:24:22 PM
to golan...@googlegroups.com
Not arbitrary, but behaviors other than what Go currently implements, yes. I've been reading the standard to try to get a precise answer; everything below is my current understanding based on a small amount of time with the standard document. I may be missing other relevant parts of the standard or its annexes.

Section 3.9[1] defines UTF-8 in terms of non-overlapping well-formed and ill-formed byte sequences. A conformant implementation must not mistakenly decode ill-formed sequences as valid runes, and must decode all well-formed sequences to the correct runes.

All correct implementations will identify the same well-formed and ill-formed byte ranges, and produce the same rune sequences for the well-formed ranges. Ill-formed sequences are less nailed down, aside from the rule that you must not unintentionally map them to valid Unicode characters. You can abort with an error, silently ignore the ill-formed sequence (in practice nobody does, because it causes security problems), or produce context-appropriate replacement characters, conventionally (but not necessarily) U+FFFD.
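To make the three options concrete, here is a rough sketch of a decoder that can be switched between them. The policy type, its constants, errIllFormed, and decode are all invented for this example; only utf8.DecodeRuneInString and utf8.RuneError come from the standard library.

```go
package main

import (
	"errors"
	"fmt"
	"unicode/utf8"
)

// policy selects what decode does with an ill-formed byte.
type policy int

const (
	policyError   policy = iota // abort with an error
	policyDrop                  // silently discard the byte
	policyReplace               // substitute U+FFFD
)

var errIllFormed = errors.New("ill-formed UTF-8")

// decode converts s to runes, handling ill-formed bytes according to p.
// Well-formed sequences decode identically under every policy.
func decode(s string, p policy) ([]rune, error) {
	var out []rune
	for len(s) > 0 {
		r, size := utf8.DecodeRuneInString(s)
		// (RuneError, 1) signals an invalid byte. A genuine U+FFFD
		// in the input decodes with size 3 and is kept as-is.
		if r == utf8.RuneError && size == 1 {
			switch p {
			case policyError:
				return nil, errIllFormed
			case policyDrop:
				// Emit nothing for this byte.
			case policyReplace:
				out = append(out, utf8.RuneError)
			}
		} else {
			out = append(out, r)
		}
		s = s[size:]
	}
	return out, nil
}

func main() {
	for _, p := range []policy{policyError, policyDrop, policyReplace} {
		out, err := decode("x\xffy", p)
		fmt.Printf("policy %d: %q, err=%v\n", p, out, err)
	}
}
```

All three policies agree on the well-formed parts of the input and differ only in what they emit for the stray byte.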

But even if you assume replacement with U+FFFD, a run of 3 invalid bytes can be decoded as 1, 2 or 3 U+FFFD characters. All are valid replacements according to the spec[2] as long as you don't break decoding of valid characters on either side of the invalid bytes.
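As an illustration of how the count can vary, the sketch below decodes the same three invalid bytes two ways: one U+FFFD per byte, and one U+FFFD for the whole run. The replaceInvalid helper and its collapse flag are made up for this example, and the collapsing variant is just one simple choice, not the algorithm from any particular standard.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// replaceInvalid decodes s, substituting U+FFFD for invalid bytes.
// With collapse=false every invalid byte yields its own U+FFFD.
// With collapse=true a run of consecutive invalid bytes yields one U+FFFD.
func replaceInvalid(s string, collapse bool) []rune {
	var out []rune
	prevBad := false
	for len(s) > 0 {
		r, size := utf8.DecodeRuneInString(s)
		bad := r == utf8.RuneError && size == 1
		// Skip only when collapsing and the previous byte was also invalid.
		if !(bad && prevBad && collapse) {
			out = append(out, r)
		}
		prevBad = bad
		s = s[size:]
	}
	return out
}

func main() {
	s := "x\xff\xff\xffy"
	fmt.Println(len(replaceInvalid(s, false))) // 5: x, three U+FFFD, y
	fmt.Println(len(replaceInvalid(s, true)))  // 3: x, one U+FFFD, y
}
```

Both variants leave the valid characters on either side of the run untouched, which is the constraint that matters here.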

The standard does point at a specific unambiguous algorithm for handling ill-formed sequences[3]: "U+FFFD Substitution of Maximal Subparts", on page 127 of v15.1. This references a W3C spec[4] which defines a single mapping of an ill-formed sequence to one or more U+FFFD characters. However, the standard explicitly says that following W3C's algorithm is not required for conformance, and doesn't use the word "recommends" either, although I feel it invites you to come to your own conclusions from its prominent placement in the main standard document.

I do not believe that for...range as specified implements the algorithm offered by W3C. The W3C algorithm emits a single U+FFFD for runs of ill-formed bytes (although not always a single one per run - but it's deterministic). Range iteration is specified to always advance the input by one byte per U+FFFD produced. That's fine, it's a conformant behavior, and the spec describes it sufficiently to implement. It just means the Go spec has to describe its behavior explicitly, rather than by reference to W3C or Unicode documents.
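A small case where the difference shows up is a multi-byte sequence that is cut short. The sketch below uses the encoding of U+20AC with its last byte removed; countByRange is a hypothetical helper. As I read the Unicode text, the two remaining bytes form a single maximal subpart and would get one U+FFFD under that practice, whereas the one-byte-per-U+FFFD rule produces two.

```go
package main

import "fmt"

// countByRange reports how many iterations a range loop over s performs,
// which is also the number of runes it yields.
func countByRange(s string) int {
	n := 0
	for range s {
		n++
	}
	return n
}

func main() {
	euro := "\xe2\x82\xac"  // complete encoding of U+20AC
	truncated := "\xe2\x82" // the same sequence with its last byte missing

	fmt.Println(countByRange(euro))      // 1
	fmt.Println(countByRange(truncated)) // 2: one U+FFFD per byte
	fmt.Printf("%U\n", []rune(truncated))
}
```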


To bring it back to the Go spec: as currently specified, if panics are off the table, I believe a conformant implementation could implement "[]rune(notUTF8)" by silently discarding the ill-formed bytes, or by producing U+FFFD replacements in the same way range iteration does, or by producing U+FFFD or any other replacement characters in any other amount.

In practice, as you point out, the original Go implementation does the obvious thing and reuses the range iteration behavior. Would it be reasonable to nail down that `[]rune(foo)` must behave the same as range iteration in all implementations?

- Dave

[2]: https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf page 126 "Constraints on Conversion Processes"
[3]: https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf page 127 "U+FFFD Substitution of Maximal Subparts"





--
You received this message because you are subscribed to the Google Groups "golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts...@googlegroups.com.

Ian Lance Taylor

Jul 9, 2024, 10:46:59 PM
to David Anderson, golan...@googlegroups.com
On Tue, Jul 9, 2024 at 5:24 PM 'David Anderson' via golang-nuts
<golan...@googlegroups.com> wrote:
>
> In practice, as you point out, the original Go implementation does the obvious thing and reuses the range iteration behavior. Would it be reasonable to nail down that `[]rune(foo)` must behave the same as range iteration in all implementations?

It seems reasonable at first glance. Want to open an issue for this?
Or just send a patch? Thanks.

Ian

David Anderson

Jul 9, 2024, 10:52:13 PM
to Ian Lance Taylor, golan...@googlegroups.com
Sure, I'll send a patch in a bit, once I knock the rust off my Gerrit knowledge.

- Dave


