Requiring strings to be valid UTF-8

1,553 views
Skip to first unread message

Elly

unread,
Feb 8, 2024, 1:37:34 PM2/8/24
to chromium-mojo
Hello mojo enjoyers!

I've been investigating https://issues.chromium.org/issues/40509710, which says that we should require mojo strings to be UTF-8. For a while we had a histogram called Mojo.InvalidUTF8String which tracked how frequently we have such strings at deserialization time, and the answer is "extremely often".

It would be good, imo, if Mojo strings were always valid UTF-8, but the question is how to get to that happy state. I tried applying the attached patch, which causes Serialize() to DCHECK when asked to serialize a non-UTF8 string, and not as many unit tests as I would have expected broke - only a couple of dozen localized to a few suites.

I started digging into exactly what causes this and I found 2 fairly common causes, including one that I suspect accounts (in one case) for a LOT of the UMA reports:

1. protobuf MessageLite::SerializeToString(), which produces a non-UTF-8 std::string; I have found (just by grepping) quite a bit of code that jams the result into Mojo messages.
2. pickle data which is passed around as non-UTF-8 std::strings; again there are quite a few places where this is jammed into Mojo messages. In particular Blink's PageState type does this, and we stick the result into CommitNavigationParams on every single navigation.

So I propose this plan of attack:

1. Reintroduce the old histogram. I strongly suspect that the current volume of failures is dominated by a few very very common problems and then there is an extremely long tail of other sources of invalid strings; if we introduce a DumpWithoutCrashing before fixing the very common problems I suspect that a) the crash volume will be too high and b) we'll never see anything other than the head cases. Having the histogram will let us know how burning down the volume is going.

2. Repeatedly fix, or ask teams to fix, cases until we're able to commit a patch that DCHECKs that strings are valid UTF-8 at serialization time. This will mean that all the tests that run with DCHECKs, plus the DCHECK-enabled canary, have flushed out all the common cases.

3. Augment the DCHECK with a DumpWithoutCrashing to start catching the long tail cases.

4. Once we're happy with the rate of long-tail cases, change the DWC and DCHECK to a CHECK and declare victory.

Anyone have any opinions? Yea/nay?

-- elly
serialize-utf8-check.patch

Thomas Sepez

unread,
Feb 12, 2024, 5:14:58 PM2/12/24
to chromium-mojo, Elly
Yeah, pickle shouldn't return std::string representations now that we have, say, HeapArray<uint8_t>.  Maybe you could file a bug against pickle to remove the std::string rep of the data?

Daniel Cheng

unread,
Feb 13, 2024, 3:23:04 AM2/13/24
to Thomas Sepez, chromium-mojo, Elly
This sounds like a reasonable plan to me as well. We've definitely been a bit lax about using ByteString for encoded protobufs; hopefully it won't be too hard to narrow those down.

Is this something that would be suitable for distributing out, if we changed it to a DCHECK() and collected the failing tests from the bots and grouped them together?

Daniel

--
You received this message because you are subscribed to the Google Groups "chromium-mojo" group.
To unsubscribe from this group and stop receiving emails from it, send an email to chromium-moj...@chromium.org.
To view this discussion on the web visit https://groups.google.com/a/chromium.org/d/msgid/chromium-mojo/5783e74e-7396-449f-958d-247a8661ee82n%40chromium.org.

Elly

unread,
Feb 13, 2024, 11:24:03 AM2/13/24
to Daniel Cheng, Thomas Sepez, chromium-mojo
Yes, we could split it up (via code health rotation or any other mechanism) if we CQed the DCHECK and collected the set of failing tests. I am happy to gather that data.

-- elly

Elly

unread,
Mar 20, 2024, 11:04:23 AM3/20/24
to Daniel Cheng, Thomas Sepez, chromium-mojo
I have learned an unfortunate fact today which will have a major impact on how hard this project is: WebIDL DOMStrings are not, in the general case, required to be valid UTF-16, and therefore do not necessarily translate to valid UTF-8 (https://webidl.spec.whatwg.org/#idl-USVString). That means that *every* place we take a DOMString and ultimately pass it over Mojo to another process is suspect and will need to be represented as a ByteString instead of a Mojo string. :(

-- elly

Marijn Kruisselbrink

unread,
Mar 20, 2024, 11:22:37 AM3/20/24
to Elly, Daniel Cheng, Thomas Sepez, chromium-mojo
I thought the blink/WTF::String mojo string traits (mojo/public/cpp/bindings/lib/string_traits_wtf.cc) already enforce UTF8 strings though? So wouldn't this only be a problem for cases where a blink string is somehow passed to content (without being forced to be UTF-8) and then passed over mojo? Or is the default "lenient utf-8 conversion mode" too lenient in that it still allows invalid utf-8 in the output?

Jeremy Roman

unread,
Mar 20, 2024, 11:37:26 AM3/20/24
to Marijn Kruisselbrink, Elly, Daniel Cheng, Thomas Sepez, chromium-mojo
Lenient conversion can result in invalid UTF-8 if this input is not valid UTF-16.

Elly

unread,
Mar 20, 2024, 1:12:08 PM3/20/24
to Jeremy Roman, Marijn Kruisselbrink, Daniel Cheng, Thomas Sepez, chromium-mojo
Reply all
Reply to author
Forward
0 new messages