Hello mojo enjoyers!
I've been investigating
https://issues.chromium.org/issues/40509710, which says that we should require mojo strings to be UTF-8. For a while we had a histogram called Mojo.InvalidUTF8String which tracked how frequently we have such strings at deserialization time, and the answer is "extremely often".
It would be good, imo, if Mojo strings were always valid UTF-8, but the question is how to get to that happy state. I tried applying the attached patch, which causes Serialize() to DCHECK when asked to serialize a non-UTF8 string, and not as many unit tests as I would have expected broke - only a couple of dozen localized to a few suites.
I started digging into exactly what causes this and I found 2 fairly common causes, including one that I suspect accounts (in one case) for a LOT of the UMA reports:
1. protobuf MessageLite::SerializeToString(), which produces a non-UTF-8 std::string; I have found (just by grepping) quite a bit of code that jams the result into Mojo messages.
2. pickle data which is passed around as non-UTF-8 std::strings; again there are quite a few places where this is jammed into Mojo messages. In particular Blink's PageState type does this, and we stick the result into CommitNavigationParams on every single navigation.
So I propose this plan of attack:
1. Reintroduce the old histogram. I strongly suspect that the current volume of failures is dominated by a few very very common problems and then there is an extremely long tail of other sources of invalid strings; if we introduce a DumpWithoutCrashing before fixing the very common problems I suspect that a) the crash volume will be too high and b) we'll never see anything other than the head cases. Having the histogram will let us know how burning down the volume is going.
2. Repeatedly fix, or ask teams to fix, cases until we're able to commit a patch that DCHECKs that strings are valid UTF-8 at serialization time. This will mean that all the tests that run with DCHECKs, plus the DCHECK-enabled canary, have flushed out all the common cases.
3. Augment the DCHECK with a DumpWithoutCrashing to start catching the long tail cases.
4. Once we're happy with the rate of long-tail cases, change the DWC and DCHECK to a CHECK and declare victory.
Anyone have any opinions? Yea/nay?
-- elly