I have implemented ping-style instance monitoring, where the
PoolManager responds on a socket (arbitrarily, port 8010). The
pool manager pings instances somewhat randomly, and makes a more
determined effort when an instance doesn't respond. After N (5)
non-responses over time, that instance is terminated and replaced.
That's OK, but if lifeguard isn't running within the cloud, it
requires that port to be open on the security group and possibly in
your own firewall.
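
As a rough sketch of the miss-counting rule described above (this is
not the actual lifeguard code; the names PingTracker and MAX_MISSES are
made up for illustration, and resetting the count after a successful
ping is an assumption):

```python
from dataclasses import dataclass

MAX_MISSES = 5  # the "N (5)" non-responses mentioned above


@dataclass
class PingTracker:
    """Counts ping non-responses for one instance (hypothetical helper)."""

    misses: int = 0

    def record(self, responded: bool) -> bool:
        """Record one ping result; return True if the instance should be replaced."""
        if responded:
            # Assumption: a successful ping clears the miss count.
            self.misses = 0
        else:
            self.misses += 1
        return self.misses >= MAX_MISSES
```

The pool manager would keep one tracker per instance and terminate and
replace the instance the first time record() returns True.
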
Another idea is to rely on instance status as an indicator of instance
health. Currently, instances report status up to twice a minute.
Mostly, this happens when switching from idle to busy, or busy to
idle. The sparse reporting is so that fast services don't flood the
pool manager with status messages. The instance duty cycle metric is
used so that the actual busy/idle state isn't required; a measure of
busyness is used instead. The problem with this has been that a busy
or idle instance sends no status if it remains in that state for any
significant amount of time.
So, there are a couple of problems here.
1. need to monitor instance health so it will be practical to replace
broken service instances.
2. need a better indication of instance state so scaling can be more accurate.
To fix this, I propose dropping the ping on a separate port for health
checks and instead using regular status reporting via SQS to give
better instance health and activity indications. I still like the idea
of instance and
pool duty cycle as an indication of pool capacity. I want to think
about what the instance should report and how often. I think if an
instance reports status every 30 seconds, that should be often enough
(1 minute should also work). If the instance reports duty cycle (most
recent busy time / (most recent idle + most recent busy times)), that
would simplify some things. The pool manager can also cease
auto-incrementing idle time for non-reporting instances (which solves
another problem someone reported).
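
To make the proposal a bit more concrete, here is a minimal sketch of
the duty cycle calculation and of how the pool manager might spot
non-reporting instances. The function names are hypothetical, and the
rule that three missed reports means unhealthy is an assumption of
mine, not something that is decided:

```python
from typing import Dict, List

REPORT_INTERVAL = 30.0  # seconds between status reports, as proposed above
# Assumption: an instance is suspect after three missed reports.
STALE_AFTER = 3 * REPORT_INTERVAL


def duty_cycle(busy_seconds: float, idle_seconds: float) -> float:
    """Most recent busy time / (most recent idle + most recent busy time)."""
    total = busy_seconds + idle_seconds
    # An instance with no recorded time yet is treated as fully idle.
    return busy_seconds / total if total > 0 else 0.0


def stale_instances(last_report: Dict[str, float], now: float) -> List[str]:
    """Return the ids of instances whose last report is older than STALE_AFTER."""
    return sorted(
        instance_id
        for instance_id, reported_at in last_report.items()
        if now - reported_at > STALE_AFTER
    )
```

For example, an instance that was busy for 45 seconds and idle for 15
would report duty_cycle(45.0, 15.0), which is 0.75. The pool manager
would record the arrival time of each SQS status message and call
stale_instances() periodically to find replacement candidates.
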
Separate from this would be a pluggable scaling module.
Thoughts?
David