PaintOpReader invalid inputs: recovery, CHECK or ImmediateCrash?

31 views
Skip to first unread message

Xianzhu Wang

unread,
Feb 27, 2025, 5:18:22 PMFeb 27
to graphics-dev
Hi,

We have seen a lot of clusterfuzz bugs about crashes in PaintOpReader. Some of them never happen in a valid environment because our serializer can never produce such data. We have fixed some of such bugs, while I closed some as WAI because I think crashing is a better choice (but clusterfuzz will sometimes create another test case and file another bug for the same crash).

Now PaintOpReader handles invalid inputs in the following ways:
1. Normal control flow using the valid_ flag.
2. CHECK/NOTREACHED.
3. Crash in c++ library, Skia etc.

According to the CHECK guide, because PaintOpReader handles "data across security boundaries" in a high privilege process, we should use normal control flow (#1). However, I'm wondering if the effort is worth it. IMHO crashing is better because it entirely avoids the impact of an invalid input. If we can make the GPU process an exception to the rule in the CHECK guide,PaintOpReader should use ImmediateCrash to check for invalid inputs.

UMA data about PaintOpReader failures shows that 99.99+% of the failures were caused by several top reasons which are not all about invalid inputs. If we made the remaining mere invalid inputs crash, the GPU crash rate would not have a visible change.

WDYT?

Xianzhu Wang

unread,
Mar 25, 2025, 2:02:44 PMMar 25
to graphics-dev, Peter Kasting
+pkasting who has been recently working on PaintOpReader.

Xianzhu Wang

unread,
Mar 31, 2025, 5:20:38 PMMar 31
to Peter Kasting, graphics-dev


On Mon, Mar 31, 2025 at 1:48 PM Peter Kasting <pkas...@chromium.org> wrote:
On Tue, Mar 25, 2025 at 11:02 AM Xianzhu Wang <wangx...@chromium.org> wrote:
+pkasting who has been recently working on PaintOpReader.

I haven't, actually -- I think I checked in a cleanup change to it in passing and the fuzzer has been furiously assigning bugs to me as a result, which I've been ignoring :( 

On Thu, Feb 27, 2025 at 2:18 PM Xianzhu Wang <wangx...@chromium.org> wrote:
We have seen a lot of clusterfuzz bugs about crashes in PaintOpReader. Some of them never happen in a valid environment because our serializer can never produce such data. We have fixed some of such bugs, while I closed some as WAI because I think crashing is a better choice (but clusterfuzz will sometimes create another test case and file another bug for the same crash).

If we have CHECKs that say that certain preconditions are expected on the caller side, we should update fuzzers to not pass inputs that violate those CHECKs.

If there are no CHECKs, then IMO code should either handle those inputs, or else add such CHECKs and prevent them caller-side (and then not fuzz such input ranges). Doing neither (e.g. just crashing on shouldn't-happen inputs to public APIs) is unclear to readers. 

Now PaintOpReader handles invalid inputs in the following ways:
1. Normal control flow using the valid_ flag.
2. CHECK/NOTREACHED.
3. Crash in c++ library, Skia etc.

According to the CHECK guide, because PaintOpReader handles "data across security boundaries" in a high privilege process, we should use normal control flow (#1). However, I'm wondering if the effort is worth it. IMHO crashing is better because it entirely avoids the impact of an invalid input. If we can make the GPU process an exception to the rule in the CHECK guide,PaintOpReader should use ImmediateCrash to check for invalid inputs.

Why not CHECK (option 2)?

Actually I prefer CHECK (which is the case of the bugs I closed as WAI), but that doesn't seem to strictly follow the the CHECK guide. I feel our PaintOpReader code will be much simpler and even more robust with CHECKs instead of the current option 1 in many code paths.
 

PK

Peter Kasting

unread,
Mar 31, 2025, 5:47:20 PMMar 31
to Xianzhu Wang, graphics-dev
Do they not follow the guide because we can't actually enforce those preconditions from the caller side, or some other reason? The high-level intent of the guide is "express preconditions as checks and don't handle them", so the only things that should really be outside that are cases where external data needs to be safely processed directly. It is acceptable to CHECK if data from another process is outside the range it was supposed to be in (due to a security breach, for example). Now, if we're directly reading network data and CHECKing that the values are "as expected per the file spec" or something, then no, that wouldn't be a valid use of CHECK.

PK

Peter Kasting

unread,
Mar 31, 2025, 5:47:20 PMMar 31
to Xianzhu Wang, graphics-dev
On Tue, Mar 25, 2025 at 11:02 AM Xianzhu Wang <wangx...@chromium.org> wrote:
+pkasting who has been recently working on PaintOpReader.

I haven't, actually -- I think I checked in a cleanup change to it in passing and the fuzzer has been furiously assigning bugs to me as a result, which I've been ignoring :( 

On Thu, Feb 27, 2025 at 2:18 PM Xianzhu Wang <wangx...@chromium.org> wrote:
We have seen a lot of clusterfuzz bugs about crashes in PaintOpReader. Some of them never happen in a valid environment because our serializer can never produce such data. We have fixed some of such bugs, while I closed some as WAI because I think crashing is a better choice (but clusterfuzz will sometimes create another test case and file another bug for the same crash).

If we have CHECKs that say that certain preconditions are expected on the caller side, we should update fuzzers to not pass inputs that violate those CHECKs.

If there are no CHECKs, then IMO code should either handle those inputs, or else add such CHECKs and prevent them caller-side (and then not fuzz such input ranges). Doing neither (e.g. just crashing on shouldn't-happen inputs to public APIs) is unclear to readers.

Now PaintOpReader handles invalid inputs in the following ways:
1. Normal control flow using the valid_ flag.
2. CHECK/NOTREACHED.
3. Crash in c++ library, Skia etc.

According to the CHECK guide, because PaintOpReader handles "data across security boundaries" in a high privilege process, we should use normal control flow (#1). However, I'm wondering if the effort is worth it. IMHO crashing is better because it entirely avoids the impact of an invalid input. If we can make the GPU process an exception to the rule in the CHECK guide,PaintOpReader should use ImmediateCrash to check for invalid inputs.

Why not CHECK (option 2)?

PK

Xianzhu Wang

unread,
Mar 31, 2025, 6:12:28 PMMar 31
to Peter Kasting, graphics-dev
The rule in the guide that "CHECK in PaintOpReader" doesn't strictly follow is (under this section):
  • Data across security boundaries: A compromised renderer should not be able to bring down the browser process (higher privilege). Bad IPC messages should be safely rejected by Chromium without the use of base::ImmediateCrash() or CHECK() etc. as part of normal control flow.
PaintOpReader resides in the GPU process which is a high privilege process. It processes input produced by renderer processes.

In the cases of the closed WAI bugs, the preconditions are enforced from the renderer process (can we treat them as the caller?), except for the fuzzer. It sounds complex to let the fuzzer meet the preconditions, so I'm wondering if there is a way to let clusterfuzz know that a crash is intentional and ignore it. 

 
PK

Peter Kasting

unread,
Mar 31, 2025, 7:11:25 PMMar 31
to Xianzhu Wang, graphics-dev
On Mon, Mar 31, 2025 at 3:12 PM Xianzhu Wang <wangx...@chromium.org> wrote:
The rule in the guide that "CHECK in PaintOpReader" doesn't strictly follow is (under this section):
  • Data across security boundaries: A compromised renderer should not be able to bring down the browser process (higher privilege). Bad IPC messages should be safely rejected by Chromium without the use of base::ImmediateCrash() or CHECK() etc. as part of normal control flow.
PaintOpReader resides in the GPU process which is a high privilege process. It processes input produced by renderer processes.

I don't think a compromised renderer bringing down the GPU process violates that rule, but the security people would know definitively. Especially if the GPU process is auto-restarted on crashing, the consequences of that are much less user-visible than a browser crash.

In the cases of the closed WAI bugs, the preconditions are enforced from the renderer process (can we treat them as the caller?), except for the fuzzer. It sounds complex to let the fuzzer meet the preconditions, so I'm wondering if there is a way to let clusterfuzz know that a crash is intentional and ignore it.

Complex because the conditions are complicated to encode (or unclear), or complex because you aren't sure how the fuzzer works and where to encode such things?

PK

Xianzhu Wang

unread,
Mar 31, 2025, 8:49:55 PMMar 31
to Peter Kasting, graphics-dev, tis...@chromium.org, Khushal Sagar
On Mon, Mar 31, 2025 at 4:11 PM Peter Kasting <pkas...@chromium.org> wrote:
On Mon, Mar 31, 2025 at 3:12 PM Xianzhu Wang <wangx...@chromium.org> wrote:
The rule in the guide that "CHECK in PaintOpReader" doesn't strictly follow is (under this section):
  • Data across security boundaries: A compromised renderer should not be able to bring down the browser process (higher privilege). Bad IPC messages should be safely rejected by Chromium without the use of base::ImmediateCrash() or CHECK() etc. as part of normal control flow.
PaintOpReader resides in the GPU process which is a high privilege process. It processes input produced by renderer processes.

I don't think a compromised renderer bringing down the GPU process violates that rule, but the security people would know definitively. Especially if the GPU process is auto-restarted on crashing, the consequences of that are much less user-visible than a browser crash.

Agreed.

In the cases of the closed WAI bugs, the preconditions are enforced from the renderer process (can we treat them as the caller?), except for the fuzzer. It sounds complex to let the fuzzer meet the preconditions, so I'm wondering if there is a way to let clusterfuzz know that a crash is intentional and ignore it.

Complex because the conditions are complicated to encode (or unclear), or complex because you aren't sure how the fuzzer works and where to encode such things?

The current paint_op_buffer_fuzzer simply passes a buffer from libfuzzer to cc::PaintOpBuffer::MakeFromMemory(). At this level, even a simple condition (e.g. DrawTextBlobOp should not exist in the serialized data) seems hard to meet. On the other hand, we still need the invalid inputs to test the crash behavior. I can imagine that such a fuzzer will need many times more lines of code than the current paint_op_buffer_fuzzer.


PK

Peter Kasting

unread,
Mar 31, 2025, 8:54:05 PMMar 31
to Xianzhu Wang, graphics-dev, tis...@chromium.org, Khushal Sagar
It certainly depends on where we draw the attack surface line. If we think CHECKing on invalid data from the renderer is OK, then we need to change the fuzzer to construct the buffer field-by-field and constrain each field's value to something that fits the preconditions. If, OTOH, we need to be defensive against a compromised renderer, the current fuzzer is correct, but the paint op code needs to detect and handle errors at the renderer-level entry points, and only use CHECKs below that.

I don't think we should try and have a third world where we use a trivial fuzzer, CHECK everywhere, and try to annotate for the fuzzer that "some failures are expected".

PK 

Colin Blundell

unread,
Apr 1, 2025, 2:41:44 AMApr 1
to Peter Kasting, Elly, Xianzhu Wang, graphics-dev, tis...@chromium.org, Khushal Sagar
On Tue, Apr 1, 2025 at 2:54 AM Peter Kasting <pkas...@chromium.org> wrote:
On Mon, Mar 31, 2025 at 5:49 PM Xianzhu Wang <wangx...@chromium.org> wrote:
On Mon, Mar 31, 2025 at 4:11 PM Peter Kasting <pkas...@chromium.org> wrote:
On Mon, Mar 31, 2025 at 3:12 PM Xianzhu Wang <wangx...@chromium.org> wrote:
The rule in the guide that "CHECK in PaintOpReader" doesn't strictly follow is (under this section):
  • Data across security boundaries: A compromised renderer should not be able to bring down the browser process (higher privilege). Bad IPC messages should be safely rejected by Chromium without the use of base::ImmediateCrash() or CHECK() etc. as part of normal control flow.
PaintOpReader resides in the GPU process which is a high privilege process. It processes input produced by renderer processes.

I don't think a compromised renderer bringing down the GPU process violates that rule, but the security people would know definitively. Especially if the GPU process is auto-restarted on crashing, the consequences of that are much less user-visible than a browser crash.

Agreed.

+Elly I'm pretty sure that we do not want compromised renderers bringing down the GPU process - there are implications of repeated GPU process crashes such as switching to software compositing mode, which attackers can potentially exploit. Curious for Elly's thoughts from the security team POV.

Peter Kasting

unread,
Apr 1, 2025, 1:17:42 PMApr 1
to Elly, Colin Blundell, Xianzhu Wang, graphics-dev, tis...@chromium.org, Khushal Sagar
On Tue, Apr 1, 2025 at 8:33 AM Elly <elly...@chromium.org> wrote:
It would be better if the GPU process did not crash in this situation, and instead killed the misbehaving renderer.

I don't know offhand whether the second part of this is easy (is there some mojo hook to say "the other end of the pipe is compromised"?), but the first part sounds like "most of these invariants should be condition-checked-and-handled, not CHECKed; the fuzzer should stay as-is".

PK 

Khushal Sagar

unread,
Apr 2, 2025, 1:14:43 PMApr 2
to Peter Kasting, Elly, Colin Blundell, Xianzhu Wang, graphics-dev, tis...@chromium.org, Khushal Sagar
+1 to handling invalid renderer input safely and failing to de-serialize. If it's easy to bubble up the error to a point which can terminate the mojo connection, that's even better.

I assumed that's how it already works, PaintOpReader sets a bit to indicate invalid input and fails safely instead of asserting. If there's a case where we're CHECK-ing, that would be a bug we should fix. Are you seeing such cases when we deserialize Skia data structures?

Xianzhu Wang

unread,
Apr 2, 2025, 1:33:30 PMApr 2
to Khushal Sagar, Peter Kasting, Elly, Colin Blundell, graphics-dev, tis...@chromium.org
On Wed, Apr 2, 2025 at 10:14 AM Khushal Sagar <khusha...@chromium.org> wrote:
+1 to handling invalid renderer input safely and failing to de-serialize. If it's easy to bubble up the error to a point which can terminate the mojo connection, that's even better.

I assumed that's how it already works, PaintOpReader sets a bit to indicate invalid input and fails safely instead of asserting. If there's a case where we're CHECK-ing, that would be a bug we should fix. Are you seeing such cases when we deserialize Skia data structures?

I have closed many assertion failure bugs as WAI: https://issues.chromium.org/issues?q=paint_op_buffer_fuzzer%20status:intended_behavior (which shows much fewer bugs closed by me; Now sure how to query all of them).

Based on the discussions, I think the most compelling benefit of choice 1 (normal control flow using the valid_ flag, see my first email in this thread) is the simple and thorough fuzzing, and I agree to use it instead of choice 2. My concern was about its complex control flow and less security (in case of incomplete handling/recovering) than crashing, but hopefully the simple and thorough fuzzing can help us ensure the correctness and security of the control flow.

We should probably revisit the WAI bugs to convert the assertions to control flows.

Elly

unread,
Apr 2, 2025, 5:46:56 PMApr 2
to Colin Blundell, Peter Kasting, Xianzhu Wang, graphics-dev, tis...@chromium.org, Khushal Sagar
It would be better if the GPU process did not crash in this situation, and instead killed the misbehaving renderer. It is not strictly a security boundary that renderers can't cause the GPU process to crash. They often can anyway by causing it to run out of resources, and we can't feasibly prevent these crashes without impairing its functionality on heavy workloads. That said, it's nicer if the GPU process does not crash and instead nicely handles invalid data coming in from renderers. Even absent any security risk, the user experience of a GPU crash is a lot worse than the user experience of a renderer crash, and we should avoid it when feasible.

-- elly

Kai Ninomiya

unread,
Apr 3, 2025, 8:33:24 PMApr 3
to Xianzhu Wang, Khushal Sagar, Peter Kasting, Elly, Colin Blundell, graphics-dev, tis...@chromium.org
On Wed, Apr 2, 2025 at 2:46 PM Elly <elly...@chromium.org> wrote:
Even absent any security risk, the user experience of a GPU crash is a lot worse than the user experience of a renderer crash, and we should avoid it when feasible.

IIUC we're only talking about crashing when there's a compromised renderer. Does a slightly degraded user experience really matter in this case?
If they're actual stability problems, then of course we'll still be getting crash reports for these crashes in the wild, and we'd monitor them like any other stability problem.

Generally I share Xianzhu's concern about increasing code complexity to handle things that aren't supposed to ever happen. We have a bit of this problem in Dawn, but generally our architecture avoids it being too onerous. If you're finding graceful recovery in PaintOpReader to be complex, maybe Dawn can give ideas on how to refactor.

- In Dawn Wire server deserialization, most or all fatal errors are supposed to be impossible without a compromised renderer, so we could probably crash. However this code has been structured in a way that there isn't much depth to the call stack, so it's usually as simple as `return WireResult::FatalError;`.
- In Dawn Native validation, most errors _are_ possible without a compromised renderer, so this is a bit different, but the approach may still be informative. This code is much more complex than the wire, but we use "Rust style" error handling (Result<T, Error> and DAWN_TRY), which kind of looks like exceptions, so it's easy to stop an operation as soon as there's an error.

-Kai (he/they)

Colin Blundell

unread,
Apr 4, 2025, 7:26:36 AMApr 4
to Kai Ninomiya, Xianzhu Wang, Khushal Sagar, Peter Kasting, Elly, Colin Blundell, graphics-dev, tis...@chromium.org
On Fri, Apr 4, 2025 at 2:33 AM Kai Ninomiya <kai...@chromium.org> wrote:
On Wed, Apr 2, 2025 at 2:46 PM Elly <elly...@chromium.org> wrote:
Even absent any security risk, the user experience of a GPU crash is a lot worse than the user experience of a renderer crash, and we should avoid it when feasible.

IIUC we're only talking about crashing when there's a compromised renderer. Does a slightly degraded user experience really matter in this case?

I'm not 100% sure whether it's only compromised renderers or not. If it is only compromised renderers, I think that a relevant question here is whether it's possible to crash the compromised renderer in this case. If so (e.g. if we can do something like report a bad Mojo message or whatnot which is what the browser uses to crash renderers in certain cases IIRC), then it definitely seems strictly better to me from the user's POV for the compromised renderer to crash than for the GPU process to crash.

If it's not only compromised renderers, then I think we definitely shouldn't be crashing the GPU process in this case. But in general my understanding is that in the GPU process we try to handle invalid input from other processes gracefully rather than crashing. If others have different experiences/evidence, I'd be interested in hearing that of course.

Vasiliy Telezhnikov

unread,
Apr 4, 2025, 10:19:37 AMApr 4
to Colin Blundell, Kai Ninomiya, Xianzhu Wang, Khushal Sagar, Peter Kasting, Elly, graphics-dev, tis...@chromium.org
> Even absent any security risk, the user experience of a GPU crash is a lot worse than the user experience of a renderer crash, and we should avoid it when feasible.

If we're talking in general and not only a compromised renderer case, then it's quite the opposite (and e.g we configure OOM-killer priorities in a way that it would more likely to kill the gpu process rather than foreground renderer). 

The reason for that is that gpu process doesn't have user state, on some platforms (e.g Android) gpu process crash is practically unobservable from the user perspective (no flickering, no missing content, just slight jank that user would see if they were scrolling), on others it might flicker and any video playback would pause. Renderer process on the other hand has a user state (user might have been typing a comment or filling a form) which is gone and crash is quite visible (user has to reload page).

But I don't think it's important in this context, as we're talking about invalid inputs on the IPC boundary that should never happen in the first place.

It's totally possible to surface error to the IPC level and to close the mojo channel, though I'm not sure it would crash the renderer process. But it would require a bit more reasoning about error cases than now (i.e we can't just return error here, which already does report error over mojo to the renderer). 

In general the gpu process tries to not crash on the invalid IPC input (as it comes from an untrusted process). It's reasonable to be resilient in this case, but on the other hand it's also a constant source of complexity, because we have to handle things that are not supposed to happen. The problem is not only that code now needs to handle everything, the problem is also that it often hides real problems. 

e.g renderer due to a bug sent invalid input to gpu process, gpu process being resilient handled it gracefully and didn't do anything bad, renderer process also handled it gracefully and did fallback to another path. And everything "works", but not really. It also makes debugging much harder. It would be great to be able to crash at least in tests.

- Vasiliy

Colin Blundell

unread,
Apr 7, 2025, 7:58:38 AMApr 7
to Vasiliy Telezhnikov, Colin Blundell, Kai Ninomiya, Xianzhu Wang, Khushal Sagar, Peter Kasting, Elly, graphics-dev, tis...@chromium.org
Thanks Vasiliy, this insight is invaluable! Do you have a concrete recommendation for what you would have us do in this particular case? e.g. should we just leave these as CHECKs?

Zooming out a bit: Given the tradeoffs that you're highlighting here, is it actually worth preserving the high-level design that the GPU process not crash on invalid input from renderers (i.e. input that should only come from a compromised renderer or a programming error) given that (a) a compromised renderer can cause the GPU process to crash with enough work in any case and (b) we would want to detect and resolve invalid input due to programming errors rather than adding code complexity to handle it gracefully on both sides? 


 

- Vasiliy

Reply all
Reply to author
Forward
0 new messages