High run-queue debugging


Ivan Valeriani

Jan 29, 2018, 5:36:38 AM
to mechanical-sympathy
Hello,

Running vmstat on a production box, I am seeing a suspiciously high run queue, occasionally spiking up into double digits. Say:

Avg = 7.1
Std = 5.33
Min = 4
Max = 31
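
A rough sketch of one way to collect a summary like this from vmstat's "r" column (assuming the usual procps layout, with two header lines and "r" as the first field):

# sample "r" once per second for 5 minutes and summarize it
vmstat 1 300 | awk 'NR > 2 {
    n++; s += $1; ss += $1 * $1
    if (min == "" || $1 < min) min = $1
    if ($1 > max) max = $1
} END {
    printf "avg=%.2f std=%.2f min=%d max=%d\n", s/n, sqrt(ss/n - (s/n)^2), min, max
}'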

top(1) shows 80% idle CPU.

This box runs 6 FIX engines, 1 spinning thread each. It's running 16x2 hyperthreaded cores, Sandy Bridge.

I would like to understand where these tasks are coming from and why they can’t be processed. I do not have root access.

I have a few theories, but the question is rather about how I could collect evidence.

Many thanks
Ivan

Wojciech Kudla

Jan 29, 2018, 5:54:12 AM
to mechanica...@googlegroups.com

I'd just run stap or ftrace to capture sched_switch events to see what's causing the scheduling pressure.
Are there any anomalies in voluntary and involuntary context switching during the run queue length spikes?
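
A rough sketch of the ftrace route (assuming tracefs is mounted under /sys/kernel/debug/tracing and you can get write access to it, which usually needs root), plus two ways to watch context-switch counts for a given task (<pid> is a placeholder):

cd /sys/kernel/debug/tracing
echo 0 > tracing_on
echo > trace                                # clear the buffer
echo 1 > events/sched/sched_switch/enable
echo 1 > tracing_on
sleep 10
echo 0 > tracing_on
head -50 trace                              # prev_comm/prev_pid ==> next_comm/next_pid, per CPU, with timestamps

pidstat -w -p <pid> 1                       # cswch/s and nvcswch/s, if sysstat is installed
grep ctxt_switches /proc/<pid>/status       # voluntary_ctxt_switches / nonvoluntary_ctxt_switches
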
Also, depending on your load, running several spinning FIX engine threads may cause resource saturation. Most FIX traffic I deal with is more bursty than steady.
You may also get some hints on what the OS is doing by monitoring /proc/interrupts and /proc/softirqs.
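
For example (watch -d highlights the counters that changed between refreshes):

watch -n 1 -d 'cat /proc/interrupts'
watch -n 1 -d 'cat /proc/softirqs'
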
Why are you running with HT enabled? Does it offer better performance characteristics in your case?



Gil Tene

Jan 29, 2018, 10:56:19 AM
to mechanical-sympathy
When you say "run-queue" below, I assume you are referring to the "r" column in vmstat output, right? That's not the run queue depth...

While the vmstat man pages (e.g. https://linux.die.net/man/8/vmstat) misleadingly describe the "r" column as "r: The number of processes waiting for run time.", vmstat (see source e.g. here https://gitlab.com/procps-ng/procps/blob/master/vmstat.c) actually picks up the value for the "r" column from the procs_running value in /proc/stat, with one very important tweak: it subtracts one from that number so as not to count the vmstat process itself (see the actual scraping and -1 math here: https://gitlab.com/procps-ng/procps/blob/master/proc/sysinfo.c#L577).

The actual meaning of procs_running (per https://www.kernel.org/doc/Documentation/filesystems/proc.txt) is "the total number of threads that are running or ready to run (i.e., the total number of runnable threads)." 
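
You can see the raw values vmstat scrapes with something as simple as this (the numbers in the comments are illustrative):

grep -E '^procs_(running|blocked)' /proc/stat
# procs_running 7    <- vmstat's "r" is this value minus 1 (to exclude vmstat itself)
# procs_blocked 0    <- vmstat's "b"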

Annoying, but "that's how it's always been". Someone should really clean up that vmstat man page (e.g. https://gitlab.com/procps-ng/procps/blob/master/vmstat.8) (he said lazily, without volunteering).

So, back to your question, and assuming my logic above is correct:

- Given your description of having at least 6 hot-spinning threads, I'd expect the "r" column to never drop below 6 unless your spinning threads use some blocking logic (e.g. locking). So the fact that you see a min of 4 is actually the only "strange" thing I see.

- Occasional spikes to much higher procs_running numbers are very normal. Each runnable thread counts, and there are all sorts of things on healthy Linux systems that spike that number up temporarily. The point in time at which vmstat sampled the procs_running number may just have fallen within the short period during which some multi-threaded process did 42 milliseconds worth of work using 25 active threads, and then went idle again.

- For a system with 32 or 64 vcores (is it 16 x 2 hyperthreaded cores, or 16 hyperthreaded cores?), you seem to have enough run queues to handle the peak number of runnable threads you have observed (31 + the 1 vmstat thread doing the reporting). However, I would note that vmstat only samples this number once per reporting cycle (so e.g. once per second), which means it is much more likely than not to miss spikes that are large in amplitude but very short in duration. Your actual max may be much higher than 31.

- If you want to study your actual peak number of runnable processes, you want to watch procs_running in /proc/stat at a much finer time granularity. You can take a look at my example LoadMeter tool for something that does that (a rough sketch of the idea follows after this list). It samples this number every millisecond or so, and reports interval histograms of the procs_running level in HdrHistogram logs (a "now common format" which can be plotted and analyzed with various tools, including e.g. https://github.com/HdrHistogram/HistogramLogAnalyzer). LoadMeter's entire purpose is to better track the behavior of the max procs_running at any given time, such that spikes longer than 1-2 milliseconds won't hide. I find it very useful when trying to triage and e.g. confirm or disprove the common "do I have temporary spikes of lots of very short running threads causing latency spikes even at low reported CPU% levels" question.

- In addition, keep in mind that without special treatment, "enough run queues for the peak number of running threads" works within the reaction time of cross-core load balancing in the scheduler, and that means there can be many milliseconds during which one core has lots of threads waiting in its run queue while other cores are idle (and have not yet chosen to steal work from other run queues). Scheduler load balancing (across cores) is a long subject on its own, and for a good contemporary set of gripes about cross-core load balancing, you can read Adrian Colyer's nice summary of The Linux Scheduler: a Decade of Wasted Cores (and then read the full paper if you want).
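
As referenced above, a rough shell sketch of the idea behind that finer-grained sampling (this is not LoadMeter itself; a shell loop will only manage a few hundred samples per second, where a real tool samples every millisecond or so and records full histograms):

while true; do
  peak=0
  end=$(( $(date +%s%N) + 1000000000 ))       # one-second reporting window
  while [ "$(date +%s%N)" -lt "$end" ]; do
    r=$(awk '/^procs_running/ {print $2; exit}' /proc/stat)
    [ "$r" -gt "$peak" ] && peak=$r
  done
  echo "$(date +%T) peak procs_running over last ~1s: $peak"
done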

There is much you can do to control which cores do what if you want to avoid temporary load spikes on a single core causing embarrassing hiccups in your low-latency spinners (as threads that wake up on the same core as the spinning thread steal CPU away from it before being load-balanced to some other core). E.g. when you have 6 hot-spinning threads, a common practice is to use isolcpus (or some other keep-these-cores-away-from-others mechanism like cpusets) and assign each of your spinners to a specific core (with e.g. taskset or some API calls), such that no other thread will compete with them**.
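
A sketch of that setup, with made-up core numbers (the important part being one spinner thread per isolated core, pinned by thread id rather than tasksetting the whole process):

# kernel boot parameter, keeping cores 2-7 out of general scheduling:
#   isolcpus=2-7
# then pin each spinning thread (by tid, not the whole pid) to its own isolated core:
taskset -cp 2 <tid_of_spinner_1>
taskset -cp 3 <tid_of_spinner_2>
# ...and so on, one isolated core per spinner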

-- Gil.

** Helpful "may save you some serious frustration" hint for when you use isolcpus: avoid the common pitfall of assigning anything other than a single thread to a single isolcpus core at a time. [I run across the "taskset my process (as a whole) to this group of 6 isolcpus cores" mistake at least 5 times a year, and we've covered it elsewhere on this group before. It simply doesn't do what a large % of people try to do with it.]

Ivan Valeriani

Jan 29, 2018, 6:13:05 PM
to mechanical-sympathy
Thanks to both of you. I'll digest your suggestions and come back with some findings.