Sizing and Capacity Planning

Avinash Agrawal

Dec 13, 2013, 9:33:47 AM
to mechanica...@googlegroups.com
Hi All,


How do I get started with sizing and capacity planning, assuming we don't know the software's behavior and it's a completely new product to deal with?



Cheers.
Avinash

Gil Tene

Dec 20, 2013, 1:38:11 AM
to mechanica...@googlegroups.com
Start with requirements. I see way too many "capacity planning" exercises that go off spending weeks measuring some irrelevant metrics about a system (like how many widgets per hour can this thing do) without knowing what they actually need it to do.

There are two key sets of metrics to state here: the "How Much" set and the "How Bad" set:

In the "How Much" part, you need to establish, based on expected business needs, Numbers for things (like connections, users, streams, transactions or messages per second) that you expect to interact with at the peak time of normal operations, and the growth rate of those metrics that you need to be able keep up with. Also state expected things like data set size, data interaction rates,  and data set growth rates.

For the "How Bad" part, you need to make sure your metrics include a description of what acceptable behavior is, remembering that without describing what is not acceptable, you have not described what acceptable is. Be specific. Saying "always fast" is not nearly as useful as "never slower than X", and saying "mostly" (or "on the average") is usually a way to avoid facing (or even considering) potential consequences of non typical behavior. The best approach here is to think of how often it is ok to have certain levels of bad things happen. Don't get too greedy and ask for "perfect" here, or you'll get a big bill at the end. So consider things like how often it is ok for the system to be out of commission for longer than X (for multiple values of X like a year, a week, a day, an hour, a minute, etc.). Also consider how often it is ok for the system react in longer than T (for multiple values of T, like an hour, a minute, a second, 50msec, etc.). Both of these are usually best stated as levels at percentiles, with availability being stated at percentiles of time, and responsiveness stated at percentiles of actual interactions. Don't forget to state the worst acceptable case for each.

Once you have a feel for business-driven requirements stated with "how much" and "how bad" metrics, design a set of experiments to test "how much" (measured in whatever capacity metrics your requirements use) the system can handle without EVER showing even the slightest hint of failing your "how bad" availability and responsiveness requirements. This will invariably include repeated testing under a wide range of "how much" levels to see how far things go before they start to fail.
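For example, such a sweep might look like the following sketch. The rates, limits, and the runAt() driver are hypothetical stand-ins for your own harness; the histogram calls use HdrHistogram (https://github.com/HdrHistogram/HdrHistogram), which records per-interaction latencies and reads them back at percentiles:

```java
// Sketch of the "how much" search: ramp offered load and find the highest
// rate that never violates the "how bad" checks. runAt() is a hypothetical
// stand-in for a real load driver; all rates and limits are placeholders.
import org.HdrHistogram.Histogram;

public class CapacitySearch {
    public static void main(String[] args) {
        long ceiling = 0;
        for (long rate = 1_000; rate <= 100_000; rate += 1_000) {
            Histogram latencies = runAt(rate);
            // "How bad" checks (nanoseconds): one violation fails the whole level.
            boolean ok = latencies.getValueAtPercentile(99.0)  <= 50_000_000L     // p99 <= 50 ms
                      && latencies.getValueAtPercentile(99.99) <= 500_000_000L    // p99.99 <= 500 ms
                      && latencies.getMaxValue()               <= 5_000_000_000L; // worst <= 5 s
            if (!ok) break;
            ceiling = rate;
        }
        System.out.println("Highest rate meeting all requirements: " + ceiling + "/s");
    }

    // Hypothetical driver: sustain the given rate for a test interval and
    // record every interaction's latency (in nanoseconds) into a histogram.
    static Histogram runAt(long ratePerSecond) {
        Histogram h = new Histogram(3_600_000_000_000L, 3); // track up to 1 hour, 3 sig digits
        // ... drive load at ratePerSecond and call h.recordValue(latencyNanos) per op ...
        return h;
    }
}
```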

Then run your experiments...

The rest, like padding for business requirements that underestimate reality, and for measurements that turn out to be optimistically wrong in various ways, is a relatively easy exercise of arm wrestling between wanting to sleep well at night and wanting to have more beans left to count at the end.
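To make that arm wrestling concrete (with entirely hypothetical numbers), the trade can be reduced to a small calculation: how many years of growth does a given safety pad buy before the measured ceiling is reached?

```java
public class HeadroomMath {
    public static void main(String[] args) {
        // All numbers hypothetical: measured safe ceiling of 40k ops/s, expected
        // peak of 10k ops/s growing 50% per year, and a 2x sleep-well safety pad.
        double ceiling = 40_000, peak = 10_000, growthPerYear = 1.5, pad = 2.0;
        double years = Math.log(ceiling / (peak * pad)) / Math.log(growthPerYear);
        System.out.printf("Padded headroom lasts ~%.1f years%n", years); // ~1.7
    }
}
```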

An important note: Before you run the actual experiments and start considering your results, validate that the experimental setup can actually measure what you want. The best way to do that is by artificially introducing certain conditions and verifying that the setup correctly reports on what you know to have actually happened. My favorite tools for this step are the physical pulling out of network cables and power cords, and using ^Z (or equivalent signals). You may find yourself spending a good amount of time calibrating the experimental setup so that you can actually trust the results, but that is time well spent, as wasting your time (and risking your business) by analyzing and relying on badly measured data is a very expensive proposition.
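A minimal sketch of the signal-based variant of that calibration, assuming a Unix-like host (the PID argument and the 10-second stall length are placeholders): ^Z at a terminal sends SIGTSTP, and kill -STOP / kill -CONT deliver an equivalent pause from a script.

```java
// Calibration sketch: freeze the system under test for a known ~10 s with
// SIGSTOP/SIGCONT (a scriptable equivalent of ^Z), then verify the test
// report shows a worst case at least that large. If it doesn't, the setup
// is under-reporting and its results cannot be trusted.
// The PID argument and the 10 s stall length are hypothetical placeholders.
import java.util.concurrent.TimeUnit;

public class StallInjector {
    public static void main(String[] args) throws Exception {
        String pid = args[0]; // PID of the system under test

        new ProcessBuilder("kill", "-STOP", pid).inheritIO().start().waitFor();
        TimeUnit.SECONDS.sleep(10); // the known, deliberately injected outage
        new ProcessBuilder("kill", "-CONT", pid).inheritIO().start().waitFor();

        // Now check the measurement side: the reported max latency (or
        // unavailability window) must be >= the 10 s we just injected.
    }
}
```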

Avinash Agrawal

Dec 15, 2013, 9:05:45 PM
to mechanica...@googlegroups.com
Thanks Gil,

Really valuable explanation.

Andy Doshi

Dec 18, 2013, 3:00:56 PM
to mechanica...@googlegroups.com
Really helpful input, thanks a ton... you actually saved me a couple of weeks of capacity planning work by putting me in the right direction, thanks again!!!

Thomas Eichberger

Dec 19, 2013, 2:04:33 PM
to mechanica...@googlegroups.com
Yes, one of the best explanations ever :-)