I shall make an effort to organically grow and clarify the docs as time
and feedback come in. That said, settings.py and the docs have quite extensive
docstrings related to the new variables for the new apps.
Mirage:
There are Mirage settings related to the runtime variables for Mirage.
However, to enable a metric namespace to be pushed to Mirage when Analyzer finds
a metric in that namespace anomalous, you only need to add one parameter to the
normal ALERTS tuples: define a SECOND_ORDER_RESOLUTION_HOURS for the metric,
e.g.:
ALERTS = (
    ("skyline", "smtp", 1800),
    ("stats_counts.http.rpm.publishers.*", "smtp", 300, 168),
)
Here we have enabled the "stats_counts.http.rpm.publishers.*" namespace for
Mirage analysis by adding 168. This means that if Analyzer determines that, say,
"stats_counts.http.rpm.publishers.tenfold" is anomalous at a FULL_DURATION of
86400, instead of alerting it will add the metric for a Mirage check.
Mirage will then surface 168 hours of data from Graphite for the
"stats_counts.http.rpm.publishers.tenfold" metric and analyse that against the
MIRAGE_ALGORITHMS (which by default are the same as the Analyzer ALGORITHMS).
If Mirage finds the metric to currently be anomalous against the 168 hours of
timeseries data, it will alert -
http://earthgecko-skyline.readthedocs.io/en/latest/mirage.html
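To make that routing concrete, here is a minimal sketch of the logic described
above (this is not the actual Analyzer code, and route_anomalous_metric is just
an illustrative name): an ALERTS tuple with a fourth element is treated as
SECOND_ORDER_RESOLUTION_HOURS and the metric is handed to Mirage rather than
alerted on directly.

from fnmatch import fnmatch

ALERTS = (
    ("skyline", "smtp", 1800),
    ("stats_counts.http.rpm.publishers.*", "smtp", 300, 168),
)

def route_anomalous_metric(metric_name):
    # Decide whether an anomalous metric alerts now or is pushed to Mirage
    for alert in ALERTS:
        namespace = alert[0]
        if metric_name.startswith(namespace) or fnmatch(metric_name, namespace):
            if len(alert) > 3:
                # fourth element present - SECOND_ORDER_RESOLUTION_HOURS
                return ('mirage', alert[3])
            return ('alert', None)
    return (None, None)

# The wildcard namespace matches, so this metric would go to Mirage to be
# checked against 168 hours of Graphite data
print(route_anomalous_metric('stats_counts.http.rpm.publishers.tenfold'))
# ('mirage', 168)
print(route_anomalous_metric('skyline.analyzer.run_time'))
# ('alert', None)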
Boundary:
Boundary also has its own settings, very similar to Analyzer, however Boundary
is very different in what it does. Boundary is really for your mission critical
metrics only. So let us say you have a metric called
"stats_counts.http.tenfold.impression.per.minute"
which is always greater than 600, and if it drops to anything under 100 you
KNOW that there is a problem; you would do this:
BOUNDARY_METRICS = (
    # ('metric', 'algorithm', EXPIRATION_TIME, MIN_AVERAGE, MIN_AVERAGE_SECONDS, TRIGGER_VALUE, ALERT_THRESHOLD, 'ALERT_VIAS'),
    ("stats_counts.http.tenfold.impression.per.minute", 'less_than', 900, 0, 0, 100, 3, 'smtp|hipchat'),
)
So this means that if the metric value was less than 100 (the TRIGGER_VALUE)
3 times in a row (the ALERT_THRESHOLD), Boundary would alert via SMTP and
HipChat, and it would not alert again on that metric for 900 seconds (the
EXPIRATION_TIME).
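As a rough illustration of that behaviour, the less_than tuple above boils down
to something like the following toy sketch (this is not Boundary's actual
implementation; the LessThanCheck class is purely hypothetical):

import time

class LessThanCheck:
    # Toy stand-in for Boundary's less_than check on a single metric
    def __init__(self, expiration_time, trigger_value, alert_threshold):
        self.expiration_time = expiration_time
        self.trigger_value = trigger_value
        self.alert_threshold = alert_threshold
        self.trigger_count = 0
        self.last_alert_at = 0.0

    def check(self, value):
        # Count consecutive datapoints below TRIGGER_VALUE
        self.trigger_count = self.trigger_count + 1 if value < self.trigger_value else 0
        if self.trigger_count >= self.alert_threshold:
            # Only alert if the last alert was more than EXPIRATION_TIME ago
            if time.time() - self.last_alert_at > self.expiration_time:
                self.last_alert_at = time.time()
                return True
        return False

check = LessThanCheck(expiration_time=900, trigger_value=100, alert_threshold=3)
for value in (700, 80, 70, 60):
    print(value, check.check(value))
# 700 False, 80 False, 70 False, 60 True - the third consecutive value under
# 100 alerts, and nothing else alerts for the next 900 seconds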
Now let us say you have a metric "stats_counts.mysql.queries.per.minute" which
is normally fairly constant at about 300 per minute, but sometimes you clear
the cache and it spikes to 6000 per minute, and sometimes it goes down to 20
per minute at quieter times. However, even when there is a problem it never
gets to 0, so you cannot threshold it low without it being noisy. Yet you have
seen problems where it goes from 400 per minute to 4, 5, 5, 3 and stays down
there because there is some problem with a release, while a few administration
and other services occasionally still make requests. Enter
detect_drop_off_cliff. So we add the following:
BOUNDARY_METRICS = (
    # ('metric', 'algorithm', EXPIRATION_TIME, MIN_AVERAGE, MIN_AVERAGE_SECONDS, TRIGGER_VALUE, ALERT_THRESHOLD, 'ALERT_VIAS'),
    ("stats_counts.http.tenfold.impression.per.minute", 'less_than', 900, 0, 0, 100, 3, 'smtp|hipchat'),
    ("stats_counts.mysql.queries.per.minute", 'detect_drop_off_cliff', 600, 30, 3600, 0, 2, 'smtp|pagerduty|hipchat'),
)
And set the other BOUNDARY related settings.py variables as appropriate for your
environment.
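For the idea behind detect_drop_off_cliff, a deliberately simplified
illustration follows. This is not the actual Boundary algorithm (the real one
is more involved), and dropped_off_cliff and its parameters are hypothetical,
but it shows the kind of behaviour being caught: a metric that normally
averages well above a minimum has recently collapsed to a small fraction of
its baseline without going to 0.

def dropped_off_cliff(timeseries, min_average=30, recent_points=4, drop_factor=10):
    # timeseries is a list of (timestamp, value) tuples, oldest first
    if len(timeseries) <= recent_points:
        return False
    baseline = [value for _, value in timeseries[:-recent_points]]
    recent = [value for _, value in timeseries[-recent_points:]]
    baseline_avg = sum(baseline) / len(baseline)
    recent_avg = sum(recent) / len(recent)
    if baseline_avg < min_average:
        # Metric too low or sparse to assess reliably, do not trigger
        return False
    # Trigger when the recent average has dropped off a cliff relative to
    # the baseline average
    return recent_avg < baseline_avg / drop_factor

# e.g. queries per minute around 300 that suddenly sit at 4, 5, 5, 3
history = [(i, 300 + (i % 7)) for i in range(60)]
history += [(60, 4), (61, 5), (62, 5), (63, 3)]
print(dropped_off_cliff(history))  # True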
http://earthgecko-skyline.readthedocs.io/en/latest/boundary.html#configuration-and-running-boundary
Webapp UI screenshots:
I may as well run through all the views.
Skyline.now:
Like the original Skyline screen, this shows the current anomalies that are in the
ANOMALY_DUMP = 'webapp/static/dump/anomalies.json'
https://drive.google.com/uc?id=0BwzV5s9wP71eREZXUTNjOEkwTDg
Skyline.Panorama.Anomaly.Search:
The Panorama entry view; by default the latest 10 anomalies in the DB load here.
We can search the anomalies in the DB by various parameters; here we use the
metric stats.statsd.packets_received and click Search.
https://drive.google.com/uc?id=0BwzV5s9wP71eeVhsaHFRVXB3eWs
Skyline.Panorama.Anomalies.Found:
The search returns the results in a table and has a dynamic lower view, with a
list which can be moused over; it loads data from Graphite, not Redis (which
the now view uses). It is always likely that it is surfacing historic and most
probably aggregated data from Graphite, so the ACTUAL anomalous data point may
not be in the timeseries, but aggregated to a similar value. The graph will
report this and highlight where the anomaly occurred. It also timeshifts the
timeseries so that it is visible and not right at the edge of the graph. The
original anomalous data is reported too.
https://drive.google.com/uc?id=0BwzV5s9wP71eUUxaWHFldWhDTVU
Skyline.rebrow.login:
Login to the Redis instance - the awesome rebrow
https://drive.google.com/uc?id=0BwzV5s9wP71eOGVjRXM5cGFhOFU
Skyline.rebrow.Server.Status:
The Redis server status
https://drive.google.com/uc?id=0BwzV5s9wP71eYzdVeGZCMjB6N1U
Skyline.rebrow.Server.Status.Command.Statistics:
https://drive.google.com/uc?id=0BwzV5s9wP71eV21yT25BUFN3b3c
Skyline.rebrow.Keys:
A list of all the keys and a search option.
https://drive.google.com/uc?id=0BwzV5s9wP71eZDRWNVp4SkpscDA
Skyline.rebrow.Keys.Search.metrics.stats.statsd:
Search for a namespace
https://drive.google.com/uc?id=0BwzV5s9wP71edHQ0bGl6S1hZcUk
Skyline.rebrow.Keys.key.metrics.stats.statsd.packets_received:
View the key and value, e.g. timeseries data or alert keys, etc.
https://drive.google.com/uc?id=0BwzV5s9wP71eM0xPVThrWXQxZ3M
Skyline.docs:
The Webapp serves its own copy of the docs :)
https://drive.google.com/uc?id=0BwzV5s9wP71eRmo2eXAxaTZqNHM
Hope that helps, sorry I could not put those inline.
Regards
Gary