On queuing theory


Michaël REMOND

Sep 28, 2019, 9:22:12 AM
to mechanical-sympathy
Hello,

I have known about Martin Thompson's excellent work for quite a while now, and recently I wanted to better understand some queuing theory he discussed on the Arrested DevOps podcast.

He gave the example of a service with an average service time of 100 ms that receives on average 9 requests/second.
If the service time is halved (50 ms), Martin said that the service becomes 20x more "reactive".

I was not sure exactly what he meant by "reactive", but I tried to work out where this 20x came from.

Here is what I came up with, and I would like some experienced people to confirm whether my reasoning is correct.

I assume that Martin is considering the M/M/1 queue model. The theory then says that the average waiting time in the queue (not the sojourn time) is ρ/(μ − λ), where λ is the arrival rate, μ is the service rate, and ρ = λ/μ is the utilisation.
Plugging in the numbers:
 - case 1 (service time = 100 ms): waiting time = (9/10) / (10 − 9) = 0.9 s = 900 ms
 - case 2 (service time = 50 ms): waiting time = (9/20) / (20 − 9) ≈ 0.041 s ≈ 41 ms

And then 900 / 41 ≈ 22. So I think that when Martin says the system is more "reactive", he is talking about the waiting time.
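
For anyone who wants to check the arithmetic, here is a small Python sketch of the formula (the helper name mm1_wait is just something I made up):

# Mean waiting time in an M/M/1 queue: W_q = rho / (mu - lambda), rho = lambda / mu.

def mm1_wait(arrival_rate, service_time):
    """Mean time a request waits in the queue (excluding service), in seconds."""
    mu = 1.0 / service_time        # service rate, requests/second
    rho = arrival_rate / mu        # utilisation; the queue is only stable if rho < 1
    if rho >= 1.0:
        raise ValueError("unstable queue: utilisation >= 1")
    return rho / (mu - arrival_rate)

w1 = mm1_wait(9, 0.100)           # 0.900 s
w2 = mm1_wait(9, 0.050)           # ~0.041 s
print(f"100 ms service time: wait = {w1 * 1000:.0f} ms")
print(f" 50 ms service time: wait = {w2 * 1000:.0f} ms")
print(f"ratio = {w1 / w2:.1f}x")  # ~22x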

So the first takeaway for me is that our systems should not run at a utilisation higher than 80% (see https://www.johndcook.com/blog/2009/01/30/server-utilization-joel-on-queuing/#comment-13511).
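
The same formula makes that rule of thumb concrete: since μ = 1/service_time, W_q = ρ/(μ − λ) can be rewritten as service_time × ρ/(1 − ρ), which explodes as utilisation approaches 1. A quick sketch, keeping the 100 ms service time from case 1:

# M/M/1 waiting time as a function of utilisation, service time fixed at 100 ms.
# W_q = rho / (mu - lambda) = service_time * rho / (1 - rho)

service_time = 0.100  # seconds
for rho in (0.5, 0.8, 0.9, 0.95, 0.99):
    wait_ms = service_time * rho / (1.0 - rho) * 1000
    print(f"utilisation {rho:.0%}: mean wait = {wait_ms:7.1f} ms")
    # 50% -> 100 ms, 80% -> 400 ms, 90% -> 900 ms, 95% -> 1900 ms, 99% -> 9900 ms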

So my next question is: should I consider this result to hold for our CPUs and memory as well?

I am particularly interested in this question because I read in the Google SRE Book that, for optimal resource utilization (and therefore optimal cost), they try to keep their CPUs almost always full of tasks.

Thank you in advance for your responses and thoughts on this subject.

Michaël

Martin Thompson

Sep 29, 2019, 4:12:34 AM
to mechanical-sympathy
Thanks Michaël

I may have been a little muffled in the podcast, but I was saying "responsive" rather than "reactive". You are correct that the average wait time is what reduces by ~20x as a result of halving the service time at 90% utilisation; the system offering the service is therefore more responsive to user requests.

This is why utilisation plays such an important role in system performance: it drives response time, i.e. latency. This applies to all computing resources. The higher the utilisation, the higher the probability the resource is already in use, and thus queues can form. This can be seen with CPUs when you have more threads requiring a CPU to run on than CPUs available. The model for CPUs is more complicated when preemption and priority come into play.
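
As a rough illustration (numbers will vary with the OS scheduler and background load, so treat this purely as a sketch), running more busy processes than cores makes each one observe gaps where it was runnable but not running:

# Run more busy processes than cores and watch each one get descheduled,
# observed as gaps in a tight timing loop.
import multiprocessing as mp
import os
import time

def spin(duration_s, gaps):
    # Busy-loop for duration_s, recording the worst gap between iterations.
    deadline = time.monotonic() + duration_s
    last = time.monotonic()
    worst = 0.0
    while time.monotonic() < deadline:
        now = time.monotonic()
        worst = max(worst, now - last)  # a large gap means we were off-CPU
        last = now
    gaps.put(worst)

if __name__ == "__main__":
    cores = os.cpu_count()
    for n in (cores, cores * 4):
        q = mp.Queue()
        procs = [mp.Process(target=spin, args=(1.0, q)) for _ in range(n)]
        for p in procs:
            p.start()
        results = [q.get() for _ in range(n)]
        for p in procs:
            p.join()
        print(f"{n} busy processes on {cores} cores: worst gap = {max(results) * 1000:.2f} ms")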

The cloud model is about reducing cost by increasing utilisation. We can see VMs get sluggish on oversubscribed systems. If you want predictable response times then you need dedicated resources and low utilisation, but this comes at a cost. It is an engineering trade-off.

The really interesting question is what time interval the average utilisation is measured over. In financial trading, measuring average utilisation in units of seconds is pointless. The really interesting intervals are bursts of a few microseconds. It is the average utilisation within those bursts that makes the difference, which is why batching is a key technique: it reduces utilisation by amortising costs. Creating a batch of 2 can halve the utilisation.
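
A toy model of that amortisation, assuming each operation has a fixed overhead plus a marginal cost per item (the numbers are purely illustrative):

# Amortising a fixed per-operation cost over a batch: handling n items costs
# fixed + n * per_item, so the effective service time per item is fixed/n + per_item.
# Since rho = arrival_rate * service_time, halving the per-item service time
# also halves utilisation at a fixed arrival rate.

fixed = 10.0     # e.g. one network round trip or syscall (illustrative units)
per_item = 0.0   # marginal cost per item, assumed negligible here

for n in (1, 2, 4, 8):
    per_item_service = fixed / n + per_item
    print(f"batch of {n}: effective service time per item = {per_item_service:5.2f} units")
    # With negligible marginal cost, a batch of 2 exactly halves the per-item service time.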


Martin...

Michaël REMOND

Sep 29, 2019, 9:15:11 AM
to mechanical-sympathy
Hi Martin,

Thank you very much for your clear and detailed answer.
You are right, you did say "responsive" in that podcast, sorry. But I would have done the maths anyway :)

Michaël

Martin Thompson

Sep 29, 2019, 9:29:38 AM
to mechanical-sympathy
Correction to my earlier message: "Creating a batch of 2 can halve the utilisation" should read "can halve the service time".