Hello scheduler devs (and v8/chromium-mojo friends -- sorry for cross-posting; see related note below).
Some kernels temporarily boost a thread's priority when the resource it was waiting on is signaled (lock, event, pipe, file I/O, etc.). Some platforms document this; on others we've anecdotally observed behavior that makes us believe they do the same.
I think this might be hurting Chrome's task system.
I think the Chrome semantics when signaling a thread are often "hey, you have work, you should run soon", not "hey, please do this work ASAP". This is certainly the case for TaskScheduler use cases; I'm not so sure about input use cases (e.g. 16 thread hops to respond to input IIRC; the boost probably helps that chain a lot..?).
But in a case where there are many messages (e.g. mojo), this means many context switches (send one message; switch; process one message; switch back; etc.).
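To make that ping-pong pattern concrete, here is a toy sketch (not Chrome's MessageLoop or mojo code; `RunHandoff` and `HandoffStats` are hypothetical names invented for illustration). It forces the worst case by having the sender wait for the receiver to drain the queue after every signal, which is roughly what happens if the woken thread preempts the signaling thread each time. It then counts how many times the receiver had to be woken up.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <thread>

// Hypothetical stats for this toy model only.
struct HandoffStats {
  std::size_t processed = 0;  // messages handled by the receiver
  std::size_t wakeups = 0;    // times the receiver woke up with work to do
};

// Sends num_messages to a receiver thread, signaling once per batch_size
// messages and then yielding until the receiver has drained the queue.
// batch_size == 1 models "send one; switch; process one; switch back".
HandoffStats RunHandoff(std::size_t num_messages, std::size_t batch_size) {
  assert(batch_size > 0);
  std::mutex mu;
  std::condition_variable cv;
  std::deque<int> queue;
  bool done = false;
  HandoffStats stats;

  std::thread receiver([&] {
    std::unique_lock<std::mutex> lock(mu);
    for (;;) {
      cv.wait(lock, [&] { return !queue.empty() || done; });
      if (queue.empty()) return;  // done, and nothing left to process
      ++stats.wakeups;
      while (!queue.empty()) {
        queue.pop_front();
        ++stats.processed;
      }
      cv.notify_all();  // hand control back to the sender
    }
  });

  {
    std::unique_lock<std::mutex> lock(mu);
    std::size_t sent = 0;
    while (sent < num_messages) {
      for (std::size_t i = 0; i < batch_size && sent < num_messages; ++i, ++sent)
        queue.push_back(static_cast<int>(sent));
      cv.notify_all();                               // "you have work"
      cv.wait(lock, [&] { return queue.empty(); });  // sender is switched out
    }
    done = true;
  }
  cv.notify_all();
  receiver.join();
  return stats;
}
```

In this model `RunHandoff(1000, 1)` wakes the receiver 1000 times (each wakeup costs at least a switch in and a switch back), while `RunHandoff(1000, 1000)` wakes it once. Real schedulers are not this strict, so treat the numbers as an upper bound on the effect rather than a measurement.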
https://crbug.com/872248#c4 suggests that MessageLoop::ScheduleWork() is really expensive (though there may be sampling bias there -- investigation in progress).
https://crbug.com/872248 also suggests that the Blink main thread is descheduled while it's trying to signal workers to help it on a parallel task (I observed this firsthand when working in v8 this winter but didn't know what to make of it then: trace1, trace2).
On Windows we can tweak this with ::SetProcessPriorityBoost/SetThreadPriorityBoost(). Not sure about POSIX. I might try to experiment with this (feels scary..!).
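For reference, a minimal sketch of what the per-thread Windows experiment could look like. `DisableBoostForCurrentThread` and `kHasPriorityBoostApi` are hypothetical names, not Chrome APIs; the Win32 calls are `SetThreadPriorityBoost()` and `GetThreadPriorityBoost()`. Note the slightly confusing polarity: the BOOL argument means "disable boosting", so TRUE turns the boost off. The non-Windows branch does nothing, since this sketch assumes no POSIX equivalent.

```cpp
#ifdef _WIN32
#include <windows.h>
constexpr bool kHasPriorityBoostApi = true;
#else
constexpr bool kHasPriorityBoostApi = false;
#endif

// Hypothetical helper: asks the kernel to stop dynamically boosting the
// calling thread's priority. Returns true only if the request succeeded and
// the kernel reports boosting as disabled afterwards.
bool DisableBoostForCurrentThread() {
#ifdef _WIN32
  // Second argument is bDisablePriorityBoost: TRUE means "no boost".
  if (!::SetThreadPriorityBoost(::GetCurrentThread(), TRUE))
    return false;
  BOOL disabled = FALSE;
  return ::GetThreadPriorityBoost(::GetCurrentThread(), &disabled) &&
         disabled != FALSE;
#else
  return false;  // Nothing attempted on other platforms in this sketch.
#endif
}
```

`::SetProcessPriorityBoost()` takes a process handle and the same "disable" flag, and would cover every thread in the process, which is a much bigger hammer than opting out individual threads.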
In the meantime I figured it would at least be good to inform all of you so you no longer scratch your heads at these occasional unexplained delays in traces.
Cheers!
Gab