Here are some thoughts about how Amazon's auto-scaling features differ
from Lifeguard's auto-scaling. I'd appreciate any feedback on this, as
I learn these new features and consider where and how (and if) to use
them.
AWS Auto Scaling supports infrastructure-level metrics as the basis
for triggering a scaling event. These metrics include CPU utilization,
bytes received or sent over the network, bytes written to or read from
the disk, and the number of disk read or write operations. If you also
use the Elastic Load Balancing feature, your trigger can additionally
measure latency and request count for the pool.
These fabric-level measurements are the digital equivalent of
measuring blood pressure, heart rate, temperature, and testing
neurological reflexes. These metrics, when they exceed a certain
threshold, can indicate a need for more resources, but they are not
always the best way to characterize the need, and they do not provide
enough information by themselves to determine the appropriate
response.
Often the metrics you want to observe are higher in the stack:
parameters such as database transactions per second or message queue
length. Such higher-level metrics are generally not expressible in
terms of lower-level metrics.
Lifeguard's trigger for scaling is the length of the SQS message queue
containing the work requests: when the PoolManager detects that the
number of messages has exceeded <QueueSizeFactor>, it launches up to
<RampUpInterval> new service instances (until <MaxSize> for the pool
is reached).
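To make that rule concrete, here is a minimal sketch of the ramp-up
decision as I have described it. This is not Lifeguard's actual code:
the function name and the current_size argument are hypothetical, and
the three configuration values simply mirror <QueueSizeFactor>,
<RampUpInterval>, and <MaxSize> above.

```python
def instances_to_launch(queue_length, current_size,
                        queue_size_factor, ramp_up_interval, max_size):
    """Return how many new service instances to start on this check.

    Hypothetical helper: mirrors the rule described in the text, not
    Lifeguard's real PoolManager implementation.
    """
    # No scaling event unless the queue length exceeds the threshold.
    if queue_length <= queue_size_factor:
        return 0
    # Never grow the pool beyond its configured maximum size.
    headroom = max(0, max_size - current_size)
    # Launch at most ramp_up_interval instances per scaling event.
    return min(ramp_up_interval, headroom)
```

For example, with a threshold of 20 messages, a ramp-up of 3, and a
maximum pool size of 10, a pool of 9 instances facing 50 queued
messages would grow by only one instance.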
Message queue length may very well be a better indicator of
application performance than CPU utilization or latency; this depends
on your application and its service-level requirements.
My observation is that "Fabric-level metrics are not necessarily
expressed in the same language as your application's SLA".
The business purpose served by auto-scaling is the ability to meet a
predefined SLA. To meet the SLA, the service-level
measurements must be observable by the scaling mechanism. But
application-level service agreements are not expressed solely in terms
of the fabric-level metrics of CPU utilization, network I/O, disk I/O,
requests per second, and request/response latency. Application-level
service agreements may be expressed in terms of database transactions,
message queues, memory usage, index size, etc. Unless you can measure
these application-level metrics in your scaling mechanism, you may not
be able to design a scaling strategy to deliver your application's
SLA.
Lifeguard currently measures the message queue length, and it can be
extended to support other application-specific triggers.
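As an illustration of what an application-specific trigger might look
like, here is a rough sketch. Everything in it is an assumption of
mine (the Trigger class, the fired_triggers function, and the example
metric names); it does not reflect Lifeguard's actual extension
points.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Trigger:
    """Hypothetical trigger: a named metric reader plus a threshold."""
    name: str
    read_metric: Callable[[], float]  # returns the current metric value
    threshold: float

    def fired(self) -> bool:
        # The trigger fires when the metric exceeds its threshold.
        return self.read_metric() > self.threshold


def fired_triggers(triggers: List[Trigger]) -> List[str]:
    """Return the names of all triggers currently over threshold."""
    return [t.name for t in triggers if t.fired()]


# Example wiring with stand-in metric readers (fixed values here; a
# real reader would query the queue, the database, and so on).
triggers = [
    Trigger("queue_length", lambda: 120.0, threshold=100.0),
    Trigger("db_transactions_per_sec", lambda: 40.0, threshold=250.0),
]
```

The point of the design is that each threshold is stated in the same
units as the SLA (messages waiting, transactions per second) rather
than in fabric-level units.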
I think that there will continue to be a need for Lifeguard and
Lifeguard-like frameworks that can be configured using the same
vocabulary as the application's SLA.
.. Shlomo