Re: [chromium-dev] Question about long UI “stall” reports with stacks in Chromium message loop

Hewro Hewei

Mar 31, 2026, 6:52:45 AM

to Chromium-dev, Joe Mason, Chromium-dev, scheduler-dev, ihe...@gmail.com

Thanks for the reply.

Our detector is heartbeat-based: a periodic timer on the UI thread updates a timestamp, and a monitor thread reports a stall if that heartbeat is overdue.
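For readers who want something concrete, here is a minimal sketch of the shared state behind such a heartbeat. It is not our production code and not a Chromium API: `HeartbeatState`, `Beat`, and `IsOverdue` are hypothetical names, and timestamps are passed in explicitly so the logic can be exercised without real threads.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>

using Clock = std::chrono::steady_clock;

// Hypothetical state shared by the UI thread (writer) and monitor (reader).
class HeartbeatState {
 public:
  HeartbeatState(Clock::duration threshold, Clock::time_point start)
      : threshold_ns_(ToNs(threshold)),
        last_beat_ns_(ToNs(start.time_since_epoch())) {
    assert(threshold > Clock::duration::zero());
  }

  // Called from the periodic timer on the UI thread.
  void Beat(Clock::time_point now) {
    last_beat_ns_.store(ToNs(now.time_since_epoch()),
                        std::memory_order_release);
  }

  // Called from the monitor thread: true if the last beat is older than
  // the configured threshold.
  bool IsOverdue(Clock::time_point now) const {
    const std::int64_t last = last_beat_ns_.load(std::memory_order_acquire);
    return ToNs(now.time_since_epoch()) - last > threshold_ns_;
  }

 private:
  static std::int64_t ToNs(Clock::duration d) {
    return std::chrono::duration_cast<std::chrono::nanoseconds>(d).count();
  }

  const std::int64_t threshold_ns_;
  std::atomic<std::int64_t> last_beat_ns_;
};
```

In a real monitor, the UI-thread timer would call `Beat(Clock::now())` and the monitor thread would poll `IsOverdue(Clock::now())`. A monotonic clock is assumed so that wall-clock adjustments do not produce false stalls.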

I looked through Chromium’s existing mechanisms (`HangWatcher`, `GpuWatchdogThread`, `JankMonitorImpl`, responsiveness watcher), and my understanding is that they mostly reason about task/event/message-pump work boundaries rather than a heartbeat signal.

So for a heartbeat-based detector, it seems expected that some reports will capture only the normal message-loop wait path: the signal reflects overall UI-thread responsiveness, not necessarily the runtime of one specific task/event/message. By the time we sample, the UI thread may already have returned to the run loop. The idea of discounting reports when the monitor thread itself appears delayed is also very helpful for us.
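A rough sketch of how that discounting could look on the monitor side follows. The poll interval and lateness tolerance are made-up values, and `MonitorConfig`, `Verdict`, and `ClassifyTick` are hypothetical names for illustration only. The idea is that if the monitor itself woke up much later than its own poll interval, the whole process was probably starved or suspended, so the report is tagged rather than treated as a UI stall.

```cpp
#include <cassert>
#include <chrono>

using Millis = std::chrono::milliseconds;

enum class Verdict { kHealthy, kStall, kDiscounted };

// All values are illustrative assumptions except the 6 s stall threshold.
struct MonitorConfig {
  Millis stall_threshold{6000};
  Millis poll_interval{1000};
  Millis max_monitor_lateness{500};
};

// since_last_heartbeat: age of the newest UI-thread heartbeat.
// since_last_monitor_wake: time since the monitor's previous poll.
Verdict ClassifyTick(const MonitorConfig& cfg,
                     Millis since_last_heartbeat,
                     Millis since_last_monitor_wake) {
  assert(cfg.poll_interval.count() > 0);
  if (since_last_heartbeat <= cfg.stall_threshold) {
    return Verdict::kHealthy;
  }
  // The monitor should wake roughly every poll_interval. If it was much
  // later than that, it was likely descheduled together with the UI thread.
  const Millis lateness = since_last_monitor_wake - cfg.poll_interval;
  if (lateness > cfg.max_monitor_lateness) {
    return Verdict::kDiscounted;
  }
  return Verdict::kStall;
}
```

Keeping `kDiscounted` as its own bucket, instead of dropping those reports, would still let us count how often system-level starvation happens.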

Separately, in production we sometimes see very shallow Crashpad stacks for the UI thread, e.g. only 2 frames on macOS:

```text
_mach_msg2_trap
_mach_msg2_internal
```

and on Windows sometimes similarly shallow stacks such as:

```text
NtUserCalcMenuBar
NcGetFrameMetrics
```

Has the Chromium team seen this kind of very shallow stack in apparent UI-hang reports? If so, any guidance on common causes or ways to improve diagnostic quality would be much appreciated.

Thanks again!

On Friday, March 27, 2026 at 3:03:29 AM UTC+8 Joe Mason wrote:

Is your app registering a HangWatcher? (See https://source.chromium.org/chromium/chromium/src/+/main:base/threading/hang_watcher.h;l=144;drc=70e6ba389f518889549997d14ab0659de18a8b1d). Your heartbeat sounds like it would detect the same situations that the hang watcher already covers, so I'm curious if there's a discrepancy.

On Thu, Mar 26, 2026 at 2:54 PM Joe Mason <joenot...@google.com> wrote:
+scheduler-dev

On Thu, Mar 26, 2026 at 11:18 AM hewro <ihe...@gmail.com> wrote:
Hi Chromium folks,

I’m working on responsiveness monitoring for a Chromium-based application. We report a UI stall when the browser main thread has no heartbeat for more than 6 seconds. The heartbeat is application-defined: the UI thread periodically posts to a watchdog thread.

A noticeable portion of these reports capture a stack that is only in the normal message-loop wait path, with no product/business logic above it.

On Windows, stacks often look like:

- `ZwUserMsgWaitForMultipleObjectsEx`
- `user32!RealMsgWaitForMultipleObjectsEx`
- `base::MessagePumpForUI::WaitForWork`
- `base::MessagePumpForUI::DoRunLoop`

On macOS, they often look like:

- `_mach_msg`
- `CFRunLoopRunSpecific`
- `-[NSApplication run]`
- `base::MessagePumpNSApplication::DoRun`

Our current interpretation is mainly:

1. the UI thread/process was not scheduled for a long time due to system-level reasons (high load, suspension, etc.), or
2. the sampled stack is not the actual hang site, and the thread had already returned to the message loop by the time we captured it.
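To bucket these reports separately while we investigate, a heuristic along the following lines could tag stacks whose innermost frame is a wait primitive. This is only a sketch: `LooksLikeIdleWait` and `WaitPrimitiveMarkers` are hypothetical helpers, the marker substrings come from the stacks quoted in this thread and are not exhaustive, and `frames[0]` is assumed to be the innermost frame.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Substrings of wait primitives seen in the stacks quoted in this thread.
// Not exhaustive; real symbol names vary by OS version and symbolizer.
const std::vector<std::string>& WaitPrimitiveMarkers() {
  static const std::vector<std::string> kMarkers = {
      "MsgWaitForMultipleObjectsEx",  // Windows message-pump wait
      "mach_msg",                     // macOS Mach message wait
  };
  return kMarkers;
}

// Returns true if the innermost frame (frames[0]) matches a known wait
// primitive, i.e. the thread was sampled while idle in its message loop.
bool LooksLikeIdleWait(const std::vector<std::string>& frames) {
  if (frames.empty()) {
    return false;
  }
  for (const std::string& marker : WaitPrimitiveMarkers()) {
    assert(!marker.empty());  // An empty marker would match every frame.
    if (frames.front().find(marker) != std::string::npos) {
      return true;
    }
  }
  return false;
}
```

A match says only that the thread was idle when sampled. It cannot distinguish interpretation 1 from interpretation 2 above, so it is useful for bucketing, not for root-causing.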

We also account for possible watchdog-thread delay on our side, with separate telemetry to detect that.

For this kind of pure wait-path stack, is that the right way to think about it from the Chromium team’s perspective? Are there other common explanations we should consider?

Also, are there any recommended diagnostics, heuristics, or practical mitigation tricks for investigating and reducing this kind of report?

Thanks!
--
--
Chromium Developers mailing list: chromi...@chromium.org
View archives, change email options, or unsubscribe:
http://groups.google.com/a/chromium.org/group/chromium-dev
---
You received this message because you are subscribed to the Google Groups "Chromium-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to chromium-dev+unsubscribe@chromium.org.
To view this discussion visit https://groups.google.com/a/chromium.org/d/msgid/chromium-dev/894b3e2c-4353-4c33-84a3-8d158012f6b9n%40chromium.org.

Olivier Li

Apr 7, 2026, 11:09:33 AM

to Chromium-dev, Hewro Hewei, Joe Mason, Chromium-dev, scheduler-dev, ihe...@gmail.com

Hello,

We've indeed often seen such examples of shallow stacks, or of reports blaming the mechanisms that wait for work, and HangWatcher is specifically implemented to avoid those.

By monitoring the message pump's view of work items, and by blocking work items from exiting once stack collection has started, we make sure to discard reports where the code would already have moved on.
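For an external detector, a much simplified variant of this idea can be sketched as below. To be clear, this is not HangWatcher's implementation: it does not block a work item from exiting during capture, it only detects after the fact that the work item changed and then discards the sample. `WorkItemTracker` and `CaptureIfStillInSameWorkItem` are hypothetical names, and nested work items are not modeled.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Sequence counter bumped by the message pump around each work item.
// Odd value: a work item is running. Even value: the pump is idle.
class WorkItemTracker {
 public:
  void BeginWorkItem() {
    assert(!InWorkItem(Snapshot()));  // Nesting is not modeled here.
    seq_.fetch_add(1, std::memory_order_release);
  }
  void EndWorkItem() {
    assert(InWorkItem(Snapshot()));
    seq_.fetch_add(1, std::memory_order_release);
  }
  std::uint64_t Snapshot() const {
    return seq_.load(std::memory_order_acquire);
  }
  static bool InWorkItem(std::uint64_t snapshot) {
    return (snapshot & 1u) != 0;
  }

 private:
  std::atomic<std::uint64_t> seq_{0};
};

// Runs `capture` (e.g. a stack sampler) from the monitor thread and returns
// true only if one and the same work item was running for the whole capture.
template <typename CaptureFn>
bool CaptureIfStillInSameWorkItem(const WorkItemTracker& tracker,
                                  CaptureFn capture) {
  const std::uint64_t before = tracker.Snapshot();
  if (!WorkItemTracker::InWorkItem(before)) {
    return false;  // Pump is idle: the stack would show only the wait path.
  }
  capture();
  return tracker.Snapshot() == before;  // Discard if the work item moved on.
}
```

A report kept by this check is attributable to the work item that was running throughout the capture. Hangs outside any work item are invisible to it.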

The only drawback is that if you are hanging specifically in code outside of these work items, you won't have visibility into it. That's a tradeoff we made knowingly.

I've remained confident in this decision in general because, as the current conversation shows, it's very close to impossible to tell from a heartbeat-based crash report whether you're blaming the message pump mechanism for the wrong reason. This also covers cases of long queues of multiple tasks, where a crash report will randomly blame one of the tasks, which is misleading unless aggregated over a large number of reports.

We use system tracing to examine user actions that took too long in the hope of catching the scenarios not covered by HangWatcher.

Hewro Hewei

Apr 14, 2026, 12:34:20 AM

to Chromium-dev, Olivier Li, Hewro Hewei, Joe Mason, Chromium-dev, scheduler-dev, ihe...@gmail.com

Thanks, Pratyush Mohanty and Olivier Li, for your helpful replies. I appreciate your suggestions and will dig deeper into the information you shared. Thanks again!
