When you say "run-queue" below, I assume you are referring to the "r" column in vmstat output, right? That's not the run queue depth...
So, back to your question, and assuming my logic above is correct:
- Given your description of having at least 6 hot-spinning threads, I'd expect the "r" column to never drop below 6 unless your spinning threads use some blocking logic (e.g. locking). So the fact that you see a min of 4 is actually the only "strange" thing I see.
- Occasional spikes to much higher procs_running numbers are very normal. Each runnable thread counts, and there are all sorts of things on healthy Linux systems that temporarily spike that number up. The point in time at which vmstat sampled the procs_running number may simply have fallen within a short window during which some multi-threaded process did 42 milliseconds' worth of work using 25 active threads, and then went idle again.
- For a system with 32 or 64 vcores (is it 16 x 2 hyperthreaded cores, or 16 hyperthreaded cores?), you seem to have enough run queues to handle the peak number of runnable threads you have observed (31 + the 1 vmstat thread doing the reporting). However, note that vmstat only samples this number once per reporting cycle (so e.g. once per second), which means it is much more likely than not to miss spikes that are large in amplitude but very short in duration. Your actual max may well be much higher than 31.
- If you want to study your actual peak number of runnable processes, you want to watch procs_running in /proc/stat at a much finer time granularity. You can take a look at my example
LoadMeter tool for something that does that. It samples this number every millisecond or so, and reports interval histograms of the procs_running level in HdrHistogram logs (a now-common format which can be plotted and analyzed with various tools, including e.g.
https://github.com/HdrHistogram/HistogramLogAnalyzer). LoadMeter's entire purpose is to better track the behavior of the max procs_running at any given time, such that spikes longer than 1-2 milliseconds can't hide. I find it very useful when trying to triage and e.g. confirm or disprove the common "do I have temporary spikes of lots of very short-running threads causing latency spikes even at low reported CPU% levels?" question.
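For illustration, here is a minimal Python sketch of the same idea (this is not LoadMeter itself, which also records interval histograms; it just shows the fine-grained /proc/stat sampling):

```python
# Sketch: sample procs_running from /proc/stat at ~1 ms granularity
# and keep the peak value, so short spikes that vmstat's once-per-second
# sampling would miss still show up. Not LoadMeter itself, just the idea.
import time

def parse_procs_running(stat_text):
    # /proc/stat contains a line like "procs_running 7"
    for line in stat_text.splitlines():
        if line.startswith("procs_running"):
            return int(line.split()[1])
    return 0

def peak_procs_running(duration_s=1.0, sample_interval_s=0.001):
    # Sample roughly every millisecond and keep the max observed.
    peak = 0
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        with open("/proc/stat") as f:
            peak = max(peak, parse_procs_running(f.read()))
        time.sleep(sample_interval_s)
    return peak
```

Note that even at 1 ms sampling you can still miss sub-millisecond spikes; the point is that the window you can't see shrinks from ~1 second to ~1 millisecond.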
- In addition, keep in mind that without special treatment, "enough run queues for the peak number of running threads" only works within the reaction time of cross-core load balancing in the scheduler, which means there can be many milliseconds during which one core has lots of threads waiting in its run queue while other cores sit idle (and have not yet chosen to steal work from other run queues). Scheduler load balancing (across cores) is a long subject on its own, and for a good contemporary set of gripes about cross-core load balancing, you can read
Adrian Colyer's nice summary of
The Linux Scheduler: a Decade of Wasted Cores (and then read the full paper if you want).
There is much you can do to control which cores do what if you want to avoid temporary load spikes on a single core causing embarrassing hiccups in your low-latency spinners (as threads that wake up on the same core as a spinning thread steal CPU away from it before being load balanced to some other core). E.g. when you have 6 hot-spinning threads, a common practice is to use isolcpus (or some other keep-these-cores-away-from-others mechanism, like cpusets) and to assign each of your spinners to a specific core (with e.g. taskset or some API calls), such that no other thread will compete with them**.
-- Gil.
** Helpful "may save you some serious frustration" hint for when you use isolcpus: avoid the common pitfall of assigning anything other than a single thread to a single isolcpus core at a time. I run across the "taskset my process (as a whole) to this group of 6 isolcpus cores" mistake at least 5 times a year, and we've covered it elsewhere on this group before. It simply doesn't do what a large % of people try to do with it: it sets every thread's affinity to the whole group, leaving the threads free to land on (and fight over) the same core.