Kernel signaling boosts are potentially hurting Chrome


Gabriel Charette

Aug 20, 2018, 7:37:21 PM
to scheduler-dev, v8-dev, chromi...@chromium.org, Bruce Dawson
Hello scheduler devs (and v8/chromium-mojo friends -- sorry for cross-posting; see related note below).

Some kernels give a boost to a thread when the resource it was waiting on is signaled (a lock, event, pipe, file I/O, etc.). Some platforms document this; on others we've anecdotally observed behavior that strongly suggests they do the same.

I think this might be hurting Chrome's task system.

The Chrome semantics when signaling a thread are usually "hey, you have work, you should run soon", not "hey, please do this work ASAP". That's certainly the case for TaskScheduler use cases; I'm less sure about input use cases (IIRC it takes ~16 thread hops to respond to input, so the boost probably helps that chain a lot..?).
But when there are many messages (e.g. mojo), the boost means many context switches (send one message; switch; process one message; switch back; and so on).

https://crbug.com/872248#c4 suggests that MessageLoop::ScheduleWork() is really expensive (though there may be sampling bias there -- investigation in progress).

https://crbug.com/872248 also suggests that the Blink main thread gets descheduled while it's trying to signal workers to help it with a parallel task (I observed this firsthand when working in v8 this winter but didn't know what to make of it at the time: trace1, trace2).

On Windows we can tweak this with ::SetProcessPriorityBoost/SetThreadPriorityBoost(). Not sure about POSIX. I might try to experiment with this (feels scary..!).
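
If anyone else wants to try this, here's a minimal sketch of the Windows side (both are documented APIs; passing TRUE means "disable the boost"; error handling elided):

  #include <windows.h>

  // Opt out of dynamic priority boosts, process-wide and/or per-thread.
  // The pseudo-handles from GetCurrentProcess()/GetCurrentThread() work here.
  void DisablePriorityBoosts() {
    // Applies to all current and future threads in the process.
    ::SetProcessPriorityBoost(::GetCurrentProcess(),
                              /*bDisablePriorityBoost=*/TRUE);
    // Or opt out just the calling thread.
    ::SetThreadPriorityBoost(::GetCurrentThread(),
                             /*bDisablePriorityBoost=*/TRUE);
  }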

In the meantime I figured it would at least be good to inform all of you, so you no longer scratch your heads over these occasional unexplained delays in traces.

Cheers!
Gab

Bruce Dawson

Aug 21, 2018, 8:15:51 PM
to Gabriel Charette, schedu...@chromium.org, v8-...@googlegroups.com, chromium-mojo
I've definitely been bitten by this. On one game engine I worked on, the scheduler would signal all of the worker threads when a task was ready. Due to the priority boosting, all of them would wake up and try to acquire the scheduler lock. That lock was held by the thread that had just signaled them, which was by then reliably no longer running. And, oh by the way, it was a spin lock, so the waiters spun while the holder couldn't run to release it. The call to SetEvent() would frequently take 20 ms to return.

There were really two underlying problems here:
  • signaling all of the worker threads when there's just one task, and
  • using a spin lock.
The priority boosting made these issues critical, but it wasn't the root cause. A sketch of the fixed version follows below.
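
For the record, here's roughly what the fixed version looks like (purely illustrative, using std:: primitives rather than whatever the engine actually had): wake exactly one worker per task, and block on a real mutex instead of spinning:

  #include <condition_variable>
  #include <deque>
  #include <functional>
  #include <mutex>

  class TaskQueue {
   public:
    void Post(std::function<void()> task) {
      {
        // A blocking lock: waiters sleep instead of spinning while the
        // (possibly descheduled) poster holds it.
        std::lock_guard<std::mutex> lock(mutex_);
        tasks_.push_back(std::move(task));
      }  // Lock released before signaling.
      cv_.notify_one();  // Wake one worker per task, not the whole pool.
    }

    std::function<void()> Take() {
      std::unique_lock<std::mutex> lock(mutex_);
      cv_.wait(lock, [this] { return !tasks_.empty(); });
      std::function<void()> task = std::move(tasks_.front());
      tasks_.pop_front();
      return task;
    }

   private:
    std::mutex mutex_;
    std::condition_variable cv_;
    std::deque<std::function<void()>> tasks_;
  };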

I commented on the bug. I do think this is worth exploring, but there are probably cases where we rely on this priority boost to avoid starvation or improve response times. It's possible that we'd see better results by instead reducing the number of cross-thread/cross-process messages we send.

Also, note that on systems with enough cores the priority boost can become irrelevant - two communicating threads will migrate to different cores and both will continue running. So, our workstations will behave fundamentally differently from customer machines. Yay.

Sami Kyostila

Aug 28, 2018, 12:26:23 PM
to bruce...@chromium.org, Gabriel Charette, scheduler-dev, v8-...@googlegroups.com, chromium-mojo
I think I've seen instances of this problem even with the old IPC system: the sending thread is likely to get descheduled because the receiving thread is woken up before the sender has finished running. We once kicked around the idea of buffering message sends and only flushing them when the current task finishes -- maybe it's time to revisit something like that?
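
Off the top of my head, something like this (all names here are made up for illustration; the point is one write and one wake-up per task instead of per message):

  #include <utility>
  #include <vector>

  struct Message { /* payload */ };

  // Hypothetical transport; SendBatch() writes everything and signals the
  // receiving thread once.
  class Channel {
   public:
    virtual ~Channel() = default;
    virtual void SendBatch(std::vector<Message> batch) = 0;
  };

  class BufferedSender {
   public:
    explicit BufferedSender(Channel* channel) : channel_(channel) {}

    // Called any number of times during the current task; no signal yet,
    // so the sender can't be preempted mid-task by a boosted receiver.
    void Send(Message msg) { pending_.push_back(std::move(msg)); }

    // Called by the task runner once the current task returns.
    void Flush() {
      if (pending_.empty())
        return;
      channel_->SendBatch(std::move(pending_));
      pending_.clear();
    }

   private:
    Channel* channel_;
    std::vector<Message> pending_;
  };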

- Sami 

Bruce Dawson

Aug 29, 2018, 4:30:25 PM
to Sami Kyostila, Gabriel Charette, schedu...@chromium.org, v8-...@googlegroups.com, chromium-mojo
BTW, another issue the signal boost can cause is around locks. If the scheduling thread is holding a lock when it signals that a task is available, and there aren't enough cores free, then the receiving thread gets boosted, takes the CPU from the scheduling thread, tries to acquire the lock, fails, and blocks on an event; the scheduling thread then (one hopes) gets scheduled again and releases the lock; finally the receiving thread wakes up, grabs the lock, grabs the task, and starts running. Whenever we hit this pattern, scheduling a single task costs three context switches.
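
One common mitigation (illustrative only, not a claim about what Chrome currently does) is to release the lock before signaling, so the boosted receiver can acquire it immediately instead of bouncing back to sleep:

  #include <windows.h>
  #include <queue>

  std::queue<int> g_tasks;  // Guarded by g_lock.
  CRITICAL_SECTION g_lock;  // InitializeCriticalSection() at startup.
  HANDLE g_task_ready;      // CreateEvent(nullptr, FALSE, FALSE, nullptr).

  void PostTask(int task) {
    EnterCriticalSection(&g_lock);
    g_tasks.push(task);
    LeaveCriticalSection(&g_lock);  // Release BEFORE signaling...
    ::SetEvent(g_task_ready);       // ...so the boosted receiver can take the
                                    // lock right away instead of blocking on it.
  }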

Gabriel and I brainstormed a few ways to investigate both the consequences of priority boosting and why Chrome does so many context switches.

Sami Kyostila

Aug 30, 2018, 6:37:23 AM
to bruce...@chromium.org, Gabriel Charette, scheduler-dev, v8-...@googlegroups.com, chromium-mojo
This came up again in a different context today, so I filed a bug to track the investigation (unless you already had one, Gab?): crbug.com/879097.

- Sami
