Good talk about capacity planning


benikenobi

Oct 16, 2017, 11:29:04 AM
to Guerrilla Capacity Planning
Hi,

At the Strange Loop 2017 conference, there was a pretty interesting talk, "Stop Rate Limiting! Capacity Management Done Right", by Jon Moore. The link is:

It uses Little's law to study the best way to apply back pressure (concurrency limit vs. rate limit) on client requests when the origin servers are overloaded.

I would like to know what the people of this group think about it.

Regards

DrQ

Oct 16, 2017, 1:53:27 PM
to Guerrilla Capacity Planning
I don't see anything new here (from the standpoint of performance analysis), but it may be news to that "StrangeLoop"-y audience. 

Regarding rate throttling vs. thread concurrency in an application, choose your poison. Performance is about trade-offs. 

Just as a reminder, there are actually THREE versions of Little's law. The presenter is referring to the first of these: the average number of requests in residence. The second version (the average number of requests in waiting) is the one John Little proved in 1961 and thus appears in the title of his paper. The third one is the average number of requests in service. Most people don't understand that these apparently different equations are just versions of the same thing.

Guerrilla alum, Jeff P. (2016), came up with a cute way to remember the three versions of LL. 
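
In case it helps to see them side by side, here's a toy Python sketch of the three versions (the numbers are made up, just to show how the pieces relate):

# Three versions of Little's law, in steady state.
# lam = arrival rate (req/s), S = mean service time (s), W = mean waiting time (s)
lam, S, W = 2.0, 0.25, 0.50   # made-up values
R = W + S                     # residence time = waiting + service

Q = lam * R   # version 1: mean number of requests in residence
L = lam * W   # version 2: mean number waiting (Little's 1961 result)
U = lam * S   # version 3: mean number in service (utilization of a single server)

print(Q, L, U)   # 1.5 1.0 0.5 -- and Q = L + U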

benikenobi

Oct 16, 2017, 3:14:32 PM
to Guerrilla Capacity Planning
Please, could you elaborate a bit more on that trade-off? Or could you point me to where you talk about it?


Thanks in advance

DrQ

Oct 16, 2017, 3:44:47 PM
to Guerrilla Capacity Planning
Certainly, although I don't think it's anything very deep. I present a lot of examples in my Guerrilla classes.

By "trade-off," I'm referring to the notion that proper performance analysis is about optimization of one kind or another. Basically, you are varying one performance metric against another, or several others, to achieve an optimal goal. Identifying the appropriate performance metrics to vary is one thing, but not the most important thing. The more important point is that all performance metrics are related ... and related nonlinearly. That's what makes performance analysis both interesting and difficult. Queueing theory provides an appropriate rigorous framework by which to characterize those nonlinearities, including Little's law and the USL.

By way of a very simple example, if you are more concerned about user impatience when they're interacting with an application, you might tend to focus on optimizing the user waiting-time metric as a (nonlinear) function of message queue-lengths or resource utilization or whatever other metrics. Conversely, if you were a weather forecaster and needed to crunch a lot of simulation data on a supercomputer, you might tend to focus on optimizing the throughput metric as a function of the number of cores or memory size or whatever other relevant metrics. All the metric relationships are nonlinear. Those two performance goals, represented respectively by the waiting time and throughput, can be thought of as being at opposite ends of the performance spectrum. 

And then there's the matter of the cost of your selected performance solution. The "best" (optimal) choice mathematically might not be affordable so, you have to fall back to a goal that may be technically sub-optimal but less expensive. 

In general, there is no single "right way" or one size fits all. It depends on your performance goals.

Scott Johnson

Oct 16, 2017, 4:22:11 PM
to guerrilla-cap...@googlegroups.com
Just remember, once you decide to dance with the bear, you don't stop dancing when you get tired :-)


benikenobi

Oct 16, 2017, 7:00:06 PM
to Guerrilla Capacity Planning
Thanks very much.

I think I understand your explanation, but I am not sure :( how to apply it to what the presenter is claiming, i.e., that for an API proxy it seems more appropriate to limit the number of concurrent requests (capacity) rather than to rate-limit (throughput).

DrQ

Oct 16, 2017, 7:20:22 PM
to Guerrilla Capacity Planning
I confess I didn't watch the entire video but basically I believe he's saying this.

If we take version 1 of LL, viz., N = X * R, you can hold N at a fixed value, e.g., N = 1000. Then, X (mean throughput) will vary as the inverse of R (mean response time). Notice that this inverse relationship is nonlinear—as I mentioned previously. In other words, a plot of X vs. R is  a curve (not a straight line).
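
Here's a quick numerical illustration of that curve (my own toy numbers, with N held at 1000):

# Little's law version 1 with N held constant: X = N / R.
N = 1000.0                       # fixed concurrency, e.g., a thread-pool cap
for R in (0.5, 1.0, 2.0, 4.0):   # mean response time in seconds
    X = N / R                    # mean throughput in requests per second
    print(f"R = {R:3.1f} s  ->  X = {X:6.1f} req/s")
# Doubling R halves X: the X-vs-R plot is a hyperbola, not a straight line.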

In practice, fixing N might correspond to setting a limit on the size of the thread pool (the max level of concurrency), for example. I think that's what he's suggesting, rather than trying to fix X by throttling the arrival rate of requests. But that's his choice and it may be a good one for that Comcast app. Apparently, Twitter believes the latter is the best choice for them. Rate-limiting is also done on the Internet.

benikenobi

Oct 17, 2017, 11:44:20 AM
to Guerrilla Capacity Planning
Thanks again for your reply because I love this kind of explanation.

My final point was to find out the pros and cons related to your last comment. Normally, what I have seen on the Internet is that people use rate limiting (the clearest info I have found is at https://stripe.com/blog/rate-limiters). However, the author of this talk claims that for an API Gateway it is better to limit concurrency than to throttle requests.

Regards

DrQ

Oct 17, 2017, 1:08:01 PM
to Guerrilla Capacity Planning
I looked at some more of the video. Personally, I find his presentation a bit concocted and muddled. Otherwise, everything I've said here still holds (IMHO).

At the outset, he seems to be saying that rate-limiting is "bad" b/c the "queue" length, Q (actually he means the waiting-line length, L) can become "infinite" or, more likely, overflow some internal buffer. Elementary queueing theory tells us that the arrival rate of requests (λ) cannot exceed the service rate (μ), i.e., λ < μ, always. I'm using the ubiquitous Greek notation here (no point fighting it now). He tries to convince you of that by using some simple (concocted) arithmetic and completely misses the third version of LL: U = λ * S or, more commonly using the Greek notation, ρ = λ * S.

The ratio of λ and μ is the server or "worker" utilization (ρ, above). That is, ρ = λ / μ, or yet another version of LL. BTW, ρ is the Greek 'r' for "ratio". Now, you can see that if λ > μ, then ρ = λ / μ > 1 or, what is the same thing, the "worker" utilization would be greater than 100% busy, which is impossible (for a single resource).

All that said, you'd like to know how to determine the pros and cons of these various schemes. The answer to that, however, is very much dependent on what architecture you are looking at in your shop. The Comcast guy's demos play around with very specific things: origin proxy, nginx settings, hidden algorithms and God knows what else. Are his toys exactly your toys? If yes, maybe you can use his approach. If not, you need to analyze your situation thoroughly to decide. That's all part of the capacity planning game.

On a related note (not mentioned in his presentation), cloud-based systems like AWS provide capacity management schemes, e.g., "auto scaling", where a rate limit is set and then additional instances (capacity) are spun up on demand. In other words, it's kind of a combination of both and all the details of how it's done are hidden from you. You just stipulate the capacity requirement as part of a predefined protocol that AWS understands.


Then he goes on to demonstrate fixing N (what he calls "capacity") in an attempt to convince you that it's the better choice. In the course of doing all that, however, he monkeys with the values of the worker R (effectively the service time) and the number of threads, N. In other words, unlike his rate-limited example, his "better" demo has become rather dynamic. Conversely, I could also create scenarios where the rate limit is a dynamic value and possibly achieve the same performance from the user perspective.

BTW, I was surprised to see a reference to Van Jacobson's TCP slow-start paper, which is precisely what I was referring to when I mentioned rate-limiting on the Internet.

DrQ

Oct 17, 2017, 6:38:01 PM
to Guerrilla Capacity Planning
It just dawned on me that he actually does refer to all three versions of LL, but in such a strange way that I missed it. He completely overloads the notation.


This slide shows precisely the "QLU" mnemonic that I mentioned previously. Here's how it breaks down:
  1. His "N = XR" (top line) is my Q = λ * R, which is the average queue length or the mean number of requests in all the system queues.
  2. His "Nq = XqRq" (I dunno how to write subscripts in this silly environment, so I'll use lower case) is my L = λ * W, where W is the waiting time, which gives the waiting-line length (as used by John Little).
  3. His "Nr = XrRr", which is very confusing b/c he's using 'R' to stand for both (elephant) "Rides" and "Response time." Ugh!  And Nr is ... well ... what is it? 
Since he wants to be able to fire up an arbitrary number of "workers", his "Nr" can be any positive number, e.g., Nr = 50. But there's a problem here.

The "workers" are a resource, like the cashiers in a grocery store. However, the box he's labeled "Queue" is actually the waiting line (in proper queue-theoretic speak): just like the customers waiting in a grocery checkout line. In other words, he's actually drawn a grocery checkout lane with a single waiting line and multiple cashiers servicing those customers. I've actually seen such a configuration for the "Express Lane" at a Safeway store in Melbourne, Australia. There, Nr went up to 6 cashiers during peak traffic periods. Australia is a highly advanced civilization.

So what's the problem? The problem is his "Rr" in that box is NOT a response time or residence time, even though it's labeled by an R. He refers to it as the "time on the ride" (previous slide), which tells us that it's actually the service time, in QT parlance. This disambiguation is very important. If I adjust his notation, it should be something like "Nr = XrSr", where Sr is the service time on the ride. Now, Sr is the inverse of the service rate (μ in my notation) and I stated previously that λ < μ in order that the queue doesn't blow up (to infinite length). That led to the definition of utilization as ρ = λ/μ, which must be less than 100% busy. As a ratio, we can write that constraint arithmetically as ρ < 1. In his notation, therefore, it would read Nr < 1. But he wants to be able to have MORE than 1 worker! And what does LESS than 1 worker mean, anyway?

Note that 1/μ is the same thing as the service time S. Hence, ρ = λ/μ is the same thing as ρ = λS. This is the third version of LL.

The resolution is that his Nr refers to the TOTAL average utilization of all the workers. I usually write this as U = m * ρ, where 'm' would be the integer number of workers. If m = 50 (as above) then U = Nr = 50 if and only if all 50 workers are running at 100% busy. In other words, at 5000%. That only makes sense when you have more than 1 worker. The actual measured value might be more like 2500% if each worker is only running at 50% busy, on average.

Note that, in general, we would write the above as: m ρ = λ/μ so that the per-server (or per-worker) utilization metric fulfills the necessary condition ρ < 100%. 

This leads to a very important and little-known conclusion: the utilization metric is also a measure of the average number of requests in service. So, with m = 1 worker or cashier,  ρ = 25% (i.e., ρ < 1) means there is an average of a quarter of a request in service. QT provides this kind of capacity planning insight.

In the "QLU" mnemonic, we wrote it assuming m =1 just to keep it simple. Once again, that mnemonic gives the relationships b/w the various queue-length (or size) components and the corresponding component times in the queueing system. That's why all the queue parts get special names.
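
Here's the same arithmetic as a couple of lines of Python (my toy values), just to keep the per-worker and total utilizations straight:

# Per-worker vs. total utilization: U = m * rho (the third version of LL).
m   = 50      # number of workers (threads, cashiers, ...)
rho = 0.50    # per-worker utilization: each worker is 50% busy on average
U = m * rho   # total utilization = mean number of requests in service
print(U)      # 25.0, i.e., 2500% aggregate busy across the 50 workers
# With m = 1 and rho = 0.25, U = 0.25: a quarter of a request in service, on average.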



Mohan Radhakrishnan

Oct 18, 2017, 11:31:16 AM
to guerrilla-cap...@googlegroups.com
I think rate-limiting a webservice means something completely different from what I used to think. I was confusing that with call quotas.

Thanks,
Mohan


benikenobi

Oct 24, 2017, 10:33:20 AM
to Guerrilla Capacity Planning
Hi,

In this short video (5 mins), a guy talks about that.


Hope it helps


benikenobi

Oct 24, 2017, 10:33:20 AM
to Guerrilla Capacity Planning
Hi again,

I have read your explanation very carefully (I had to review some old posts where you talk about this subject) and I agree with your comments (including that using and maintaining the right notation is key).

But I would like to bring the discussion to the important point (from my point of view). Suppose I cannot put more nodes online when needed (i.e., I cannot add more capacity to my system), and my goal is to avoid entering a death spiral because a lot of requests come into my system. Should I limit by max throughput or by max concurrency? Intuitively, if I think in terms of TCP connections between my API GW and downstream services: if one of the downstream services starts to give bad response times while the throughput (or arrival rate) is maintained, the number of requests in flight (in this case, the number of connections between the API GW and the downstream server) grows, so it seems a good idea to limit it, i.e., limiting by concurrency seems like a good idea.
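
Just to put numbers on that intuition (my own figures, using LL1): if the arrival rate into the gateway stays constant while a downstream service slows down, the number of in-flight requests grows in direct proportion.

# In-flight requests (open connections) between the API GW and a downstream service.
lam = 1000.0                             # sustained arrival rate, req/s (held constant)
for R in (0.050, 0.200, 0.500, 2.000):   # downstream response time in seconds
    N = lam * R                          # mean number of requests in flight (LL1)
    print(f"R = {R*1000:6.0f} ms  ->  ~{N:6.0f} open connections")
# At 1000 req/s, a slowdown from 50 ms to 2 s means ~50 -> ~2000 open connections.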

Could USL help with this analysis?

I have to admit that I am not 100% sure about what I am saying because intuition in this kind of analysis is a bad friend.

On the other hand, as DrQ says, most of the platforms are doing rate limiting (throughput limiting) rather than concurrency limiting, but it is also true that they don't explain the reasons (at least I have not found them). That is why I have brought the discussion to the group.

Regards.

benikenobi

Oct 24, 2017, 12:48:02 PM
to Guerrilla Capacity Planning
Hi,

I forgot to include one more intuition described in the talk. If an origin server is overloaded (e.g., a lot of requests were routed to it), rate limiting with a high threshold (you do not discard if you don't need to) will not solve the problem, because that origin server continues receiving a lot of requests. It seems that it would be more effective to limit by concurrency: while the origin is in that mode (overloaded), a new request only gets into the origin when another request has gone out (the same situation as when a parking lot is full of cars).

Of course I am also assuming that it is allowed to reject the requests.

Does it make sense? 

Regards.

DrQ

Oct 24, 2017, 1:35:32 PM
to Guerrilla Capacity Planning
Actually, I did NOT say nor intend to say, "most of the platforms are doing rate limit rather than concurrency limit". I was merely offering a counterpoint to what the Comcast guy was saying in his presentation. He is promoting rate limiting, while I merely pointed out that another big company (Twitter) is using rate limiting for some of its apps. I did not intend to imply that I favored one approach over the other. I really don't know, b/c it depends fundamentally on what your application performance goals are and what technologies (and budgets) are available to achieve them.

More importantly, one should not just blindly follow what someone says just because they have a big logo behind them or have some slick demo on YouTube. It might be correct for them but not correct for you. As my undergrad adviser kept telling me, "No matter what you read in some book or scientific journal, you still have to convince yourself that it's true." As an undergrad, I was still in the mental mode of receiving information uncritically, i.e., glugging from various founts of wisdom (lecturers, books, journals, etc.). When you start to do research (as I was then), simply accepting what you're told (no matter by whom or what) is not good enough. You have to critically analyze everything; especially your OWN work. As Feynman said, YOU are the easiest person to fool. It's fine to listen to experts (including me) but at some point you have to perform your own analysis and decide whether or not you believe your own conclusions.

Making that transition is by no means easy. And as far as I've seen, most (if not all) educational institutions (including my own) singularly fail to provide any useful means for developing the rigor that can help you to get there. Somehow, you're just supposed to absorb it out of the vacuum.

From what you've said so far, I think you are sneaking up on that realization wrt capacity planning, and I quote: "I have to admit that I am not 100% sure about what I am saying because intuition in this kind of analysis is a bad friend." Quite right. As I've been known to tweet: Intuition is a seductive siren. Avoid a wreck by tying yourself to the mast of math.

Part of developing the rigor comes through the use of modeling tools. They provide a framework by which to make and assess your CaP conclusions. The USL might be one of these but, that relies on having scalability data of some sort, which you may not have if the app doesn't exist yet. On the other hand, it looks to me like PDQ might be a better option in that you can just use some wild-arsed guesses for input parameters, such as service times, etc. Very quickly you'll see where you suddenly bump into failures of intuition.

Perhaps most importantly, starting to apply such tools forces you to think rigorously about what data is needed as inputs, long before you ever get to constructing any actual performance models. It's the start of a long and on-going process. There are no short cuts.

DrQ

Oct 24, 2017, 1:37:54 PM
to Guerrilla Capacity Planning
Gark!  I meant: "He is promoting thread limiting"

DrQ

Oct 24, 2017, 2:46:09 PM
to Guerrilla Capacity Planning
There are many subtle issues in your question and this lame, hand-cuffed, environment is no place to try to address them at the needed level of detail. That said, and at the risk of oversimplifying, let's see if I can provide some insights based on queueing theory.

Suppose we have m = 1000 threads (capacity). In QT, those are the service facilities; like a grocery-store cashier. They are the things that do work (at some level). 
  • If the rate of arrivals is "low" the threads will not be very busy and incoming requests will get serviced (with high probability). 
  • What happens when the arrival rate has increased, during the biz day, to the point where all service facilities are generally busy? 
  • A user would then get a 500 error presumably and have to retry.
  • Alternatively, those un-serviced requests could be put in a waiting line ("queue up") to be serviced eventually. This is why queues or buffers are a good thing—in general. Of course, the waiting time will add to their response time and the user may defect from the website.
  • In some sense, you don't want the waiting line to become too long.
  • The number of active threads is limited by the size of the thread pool (one way or another).
  • Practically speaking, at some point the total number of requests (in service + waiting) will reach the finite capacity limit of the O/S or whatever.
  • On the other hand, we might possibly add more nodes or VMs and thereby increase the total available thread pool.
It's all about trade-offs.

Conversely, suppose we limit the rate of arrivals to some threshold value associated with the point where all service facilities are generally busy. 
  • The effect of that constraint will be to limit the maximum size of the waiting line (which could still be big). 
  • Of course, the waiting time will add to their response time and the user may defect from the website.
  • The total number of requests (in service + waiting) will reach a finite limit.
  • On the other hand, we might possibly add more nodes or VMs and thereby increase the total available thread pool.
Modulo the ugly details of how these alternative constraints might be implemented, they are constraints that correspond to the finite capacity of any real system. Theoretically, how you choose to match the capacity constraints looks like six of one, half dozen of the other. The "best choice" (optimal architecture) will depend on all those ugly details.

It's all about trade-offs.
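
To make that symmetry a little more concrete, here's a back-of-the-envelope sketch (my own numbers and assumptions, not a recommendation) showing that both knobs end up pointing at the same physical limit:

# Where thread-limiting and rate-limiting meet: the saturation point.
m = 1000          # thread-pool size (the service facilities)
S = 0.200         # assumed mean service time per request, in seconds

lam_sat = m / S   # arrival rate at which all m threads are ~100% busy
print(lam_sat)    # 5000.0 req/s

# Thread-limited: cap in-flight work at m. Above lam_sat the excess requests
#                 wait (or get rejected) and the waiting line grows.
# Rate-limited:   cap admissions near lam_sat. The same waiting line simply
#                 forms in front of the limiter instead of behind it.
# Either way, the finite capacity (m and S) is what actually sets the limit.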

Uri Carrasquilla

Oct 24, 2017, 3:13:31 PM
to guerrilla-cap...@googlegroups.com
I apologize for my trivial questions.
Assuming the load is the number of TCP Connection requests from the clients, the throughput is Connections/sec and the delay is the time spent from the proxy (proxies) to the origin and back, are the two options being discussed to limit by Connections/sec or by number of clients?  Did I get this right?
Is it possible to account for persistent connections?
Sometimes it is very difficult to know if the client is one user or multiple users, making the per-client option difficult to implement. Isn't it?
Uri

--
You received this message because you are subscribed to the Google Groups "Guerrilla Capacity Planning" group.
To unsubscribe from this group and stop receiving emails from it, send an email to guerrilla-capacity-...@googlegroups.com.

DrQ

Oct 24, 2017, 3:36:18 PM
to Guerrilla Capacity Planning
Plz don't apologize. Your questions are not "trivial" and your struggle is just as genuine as that of anyone else who tries to understand the subtleties of CaP.
As I've said many times: the only dumb question is the one you never ask (pun intended). :)



Baron Schwartz

Oct 24, 2017, 3:37:07 PM
to guerrilla-cap...@googlegroups.com
One example of limiting concurrency, in a different context, was the work Facebook did years ago on "admission control" for MySQL. MySQL's internals were not implemented scalably, and if you asked MySQL to do a lot of work at once, it would get much less efficient. Limiting the amount of work in progress -- in this case "active queries" -- avoided the death trap: "MySQL with admission control is much better at sustaining peak QPS when the server is saturated." In other words, limiting concurrency enabled higher throughput. As in the USL, throughput is the dependent  variable, not the independent variable.
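
The mechanism itself is nothing exotic. A rough sketch of that style of admission control (not Facebook's actual patch, just the general idea in Python):

import threading

MAX_ACTIVE = 64                       # cap on concurrent "active queries" (assumed value)
_slots = threading.BoundedSemaphore(MAX_ACTIVE)

def admit_and_run(query, run_query):
    """Run the query only if an active-query slot is free; otherwise shed it."""
    if not _slots.acquire(blocking=False):
        return None                   # reject (or queue/retry) rather than pile work onto the server
    try:
        return run_query(query)      # the server never sees more than MAX_ACTIVE queries at once
    finally:
        _slots.release()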

DrQ

Oct 24, 2017, 4:27:28 PM
to Guerrilla Capacity Planning

>> Did I get this right?

Not sure if this helps or hinders but, if we look at it from the standpoint of LL as usually stated (as per the Comcast video), i.e., LL1 in my terminology, things look rather confusing. 

X = connections/sec is fine. It's a rate metric.
N = avg number of requests in the system. 
R = avg time spent in the system or the response time as seen by a user.

LL1:  N = λ * R for the system.
  1. Rate-limited: λ = const. means R ~ N.
  2. Thread-limited: N = const. means R ~ 1 / λ.
???

Try again. But this time look at the service facility, i.e., LL3.

m = number of threads, which may equal the number of client-users if there's 1 user per thread. 

LL3:  U = λ * S / m

For simplicity, assume a fixed avg service time S = 1 time unit.
  1. Thread-limited, i.e., m = const., means  U ~ λ. Total thread utilization is proportional to the arrival rate; up to thread saturation (i.e., U ≤  m * 100%).
  2. Rate-limited, i.e., λ ≤ const., means U ~ 1 / m. Total thread utilization is INVERSELY proportional to the number of active threads; up to thread exhaustion (m = pool size).
The avg response time seen by a client-user (due to waiting line growth) is given by
R = 1 / (1 - (U/m)^m)
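
Plugging toy numbers into LL3 (my values, with S = 1 time unit as above):

# U = lam * S / m, with S = 1 time unit.
S = 1.0

# 1. Thread-limited: m fixed, vary the arrival rate -> U grows in proportion to lam.
m = 100
for lam in (10, 50, 100):
    print(f"m = {m}, lam = {lam:3d}  ->  U = {lam * S / m:.2f}")

# 2. Rate-limited: lam fixed, vary the thread count -> U shrinks as 1/m.
lam = 50
for m in (50, 100, 200):
    print(f"lam = {lam}, m = {m:3d}  ->  U = {lam * S / m:.2f}")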

benikenobi

Oct 27, 2017, 5:57:00 PM
to Guerrilla Capacity Planning
DrQ, I am sorry for the misunderstanding about "Actually, I did NOT say nor intend to say, 'most of the platforms are doing rate limit rather than concurrency limit'". The "most of the platforms ..." part is the conclusion that I reached after my survey of what some platforms do in this regard (e.g., Amazon API Gateway, Google Ads, Twitter). I repeat what I said earlier: "it is also true that they don't explain the reasons (at least I have not found them)".

I agree with what you said, especially "It's fine to listen to experts (including me) but at some point you have to perform your own analysis and decide whether or not you believe your own conclusions." But with the tons of information that we currently have about any subject, I normally start by following the experts until something "clicks" in my head and I take a step forward in my learning (it is not easy, especially with this class of material). For that to happen, I need to iterate between theory and practice.

Thanks for giving your input comparing both scenarios. I need to look carefully at the comparison and see how to check it. Maybe I can extract some data and build a USL model to see how the throughput behaves as a function of N. It may give me some hint about a suitable value of N.

Thanks 

DrQ

Oct 27, 2017, 6:57:42 PM
to Guerrilla Capacity Planning
No worries. As you find out more or try to explore things with the USL or whatever, feel free to report back here and ask more questions. We all stand to learn something (especially me).

BTW, maybe I missed it but I still didn't quite get your conclusion: "The 'most of the platforms ...' part is the conclusion that I reached after ..."
Most do which: 
  1. rate limit? 
  2. thread limit?

benitocm

Oct 28, 2017, 11:23:12 AM
to guerrilla-cap...@googlegroups.com
Hi,

Most of the ones that I have checked do rate limiting. Why? I thought that I knew, but now I don't really know. One takeaway for me has been to think about the two options, because before watching the talk I had rate limiting as the only option.
Anyway, I think there are no writings or talks on this subject, or at least I have not found them. And references on such a complicated subject are always good for trying to reproduce the same experiment, or a small variant of it, or to try a completely different approach.

Regards



DrQ

Oct 28, 2017, 4:57:27 PM
to Guerrilla Capacity Planning
I'm not an expert on the implementation details but I can find any number of blog posts that discuss rate-limited APIs. 

Why do they use rate-limiting? (you ask)

It seems to me that it's used b/c various APIs (e.g., Redis), or IDEs or whatever, provide a library of functions to measure the rate of incoming requests into whatever resource and allow certain optional actions to be taken. Looking at some of the code, they simply seem to implement the PDQ definition of arrival rate, viz., 

λ = A / T

where A is the count of incoming requests (measured by some probe) during a measurement time T which is the sample period (or bucket). The ratio λ is the time-averaged arrival rate (or steady-state throughput). Similarly, one can use other functions from the same library to set a threshold value. When λ exceeds that threshold, those requests are either dropped on the floor or put into a queue (buffer) to be serviced eventually or dealt with by some other action.
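
In code, that bucketed measurement plus a threshold check amounts to something like this (a sketch of the idea only, not any particular library's API):

import time

class FixedWindowLimiter:
    """Measures lam = A / T over a fixed sample period T and caps it at a threshold."""
    def __init__(self, T=1.0, max_rate=500.0):
        self.T, self.max_rate = T, max_rate       # sample period (s), threshold (req/s)
        self.window_start, self.A = time.monotonic(), 0

    def admit(self):
        now = time.monotonic()
        if now - self.window_start >= self.T:     # start a new measurement bucket
            self.window_start, self.A = now, 0
        if self.A / self.T >= self.max_rate:      # lam = A / T has reached the threshold
            return False                          # drop, queue, or back off
        self.A += 1
        return True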



benikenobi

Nov 8, 2017, 4:19:24 PM
to Guerrilla Capacity Planning
Hi,

What I have seen is that this rate-limit feature is mostly implemented using the token bucket algorithm (https://en.wikipedia.org/wiki/Token_bucket) and sometimes the leaky bucket algorithm (https://en.wikipedia.org/wiki/Leaky_bucket). Both of them, I think, rely on what you are describing.
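
For reference, the token bucket boils down to a few lines (my own sketch following the Wikipedia description, not any particular gateway's code):

import time

class TokenBucket:
    """Admit a request if a token is available; tokens refill at a steady rate."""
    def __init__(self, rate=100.0, burst=200.0):
        self.rate, self.burst = rate, burst       # refill rate (tokens/s) and bucket depth
        self.tokens, self.last = burst, time.monotonic()

    def admit(self):
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True    # under the rate limit
        return False       # over the limit: reject, queue, or delay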

My question :) is not really why they are using rate limiting; the final reason is to avoid overloading the service. My question is why nobody seems to talk about limiting concurrency (maybe the reason is that it is not effective, I don't know), which I think is more related to the amount of resources that a server has. Because it is clear that when a service starts to increase its response time (e.g., network problems), you need more resources in the caller services to handle the same amount of traffic (with the rate limit constant).

Regards

DrQ

Nov 8, 2017, 5:40:38 PM
to Guerrilla Capacity Planning
To tell you the truth (not that I haven't been), I have more often come across applications that are thread-bound, not rate-bound. Most assuredly, thread-limited capacity is not new. Eventually, you end up dealing with Little's law with a fixed N.

At one point in performance history, there was a hot-and-heavy argument about whether to pre-fork service threads or fork-on-demand. See §12.2.2 "HTTP Analysis Using PDQ" for the performance analysis. Perhaps the more recent variant would be something like Linux OS threads vs. Apache Tomcat, or whatever.

As I said before, from a queue-theoretic standpoint, it's six of one; half-dozen of the other (to 1st order). I think what you're asking about now has more to do with particular implementation details and ease of use for a given type of application architecture.