I shall make an effort to organically grow and clarify the docs as time
and feedback come in. That said, settings.py and the docs have quite extensive
docstrings related to the new variables for the new apps.
Mirage:
There are Mirage settings related to the runtime variables for Mirage.
However, to enable a metric namespace to be pushed to Mirage when Analyzer finds
a metric in that namespace anomalous, you only need to add one parameter to the
normal ALERTS tuples: define a SECOND_ORDER_RESOLUTION_HOURS for the metric,
e.g.:
ALERTS = (
    ("skyline", "smtp", 1800),
    ("stats_counts.http.rpm.publishers.*", "smtp", 300, 168),
)
Here we have enabled the "stats_counts.http.rpm.publishers.*" namespace for
Mirage analysis by adding 168. This means that if Analyzer determines that, say,
"stats_counts.http.rpm.publishers.tenfold" is anomalous at a FULL_DURATION of
86400, instead of alerting it will add the metric for a Mirage check.
Mirage will then surface 168 hours of data from Graphite for the
"stats_counts.http.rpm.publishers.tenfold" metric and analyse that against the
MIRAGE_ALGORITHMS (which by default are the same as the Analyzer ALGORITHMS).
If Mirage finds the metric to currently be anomalous against the 168 hours of
timeseries data, it will alert -
http://earthgecko-skyline.readthedocs.io/en/latest/mirage.html
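To make that routing concrete, here is a minimal sketch of the logic described
above (this is not the actual Analyzer code, and route_anomalous_metric is just
an illustrative name): an ALERTS tuple with a fourth element is treated as
SECOND_ORDER_RESOLUTION_HOURS and the metric is handed to Mirage rather than
alerted on directly.

from fnmatch import fnmatch

ALERTS = (
    ("skyline", "smtp", 1800),
    ("stats_counts.http.rpm.publishers.*", "smtp", 300, 168),
)

def route_anomalous_metric(metric_name):
    # Decide whether an anomalous metric alerts now or is pushed to Mirage
    for alert in ALERTS:
        namespace = alert[0]
        if metric_name.startswith(namespace) or fnmatch(metric_name, namespace):
            if len(alert) > 3:
                # fourth element present - SECOND_ORDER_RESOLUTION_HOURS
                return ('mirage', alert[3])
            return ('alert', None)
    return (None, None)

# The wildcard namespace matches, so this metric would go to Mirage to be
# checked against 168 hours of Graphite data
print(route_anomalous_metric('stats_counts.http.rpm.publishers.tenfold'))
# ('mirage', 168)
print(route_anomalous_metric('skyline.analyzer.run_time'))
# ('alert', None)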
Boundary:
Boundary also has its own settings, very similar to Analyzer, however Boundary
is very different in what it does. Boundary is really for your mission critical
metrics only. So let us say you have a metric called
"stats_counts.http.tenfold.impression.per.minute"
which is always greater than 600, and if it drops to anything under 100 you
KNOW that there is a problem; you would do this:
BOUNDARY_METRICS = (
    # ('metric', 'algorithm', EXPIRATION_TIME, MIN_AVERAGE, MIN_AVERAGE_SECONDS, TRIGGER_VALUE, ALERT_THRESHOLD, 'ALERT_VIAS'),
    ("stats_counts.http.tenfold.impression.per.minute", 'less_than', 900, 0, 0, 100, 3, 'smtp|hipchat'),
)
So this means that if the metric value was less than 100 (the TRIGGER_VALUE)
3 times in a row (the ALERT_THRESHOLD), Boundary would alert via SMTP and
HipChat, and it would not alert again on that metric for 900 seconds (the
EXPIRATION_TIME).
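As a rough illustration of that behaviour, the less_than tuple above boils down
to something like the following toy sketch (this is not Boundary's actual
implementation; the LessThanCheck class is purely hypothetical):

import time

class LessThanCheck:
    # Toy stand-in for Boundary's less_than check on a single metric
    def __init__(self, expiration_time, trigger_value, alert_threshold):
        self.expiration_time = expiration_time
        self.trigger_value = trigger_value
        self.alert_threshold = alert_threshold
        self.trigger_count = 0
        self.last_alert_at = 0.0

    def check(self, value):
        # Count consecutive datapoints below TRIGGER_VALUE
        self.trigger_count = self.trigger_count + 1 if value < self.trigger_value else 0
        if self.trigger_count >= self.alert_threshold:
            # Only alert if the last alert was more than EXPIRATION_TIME ago
            if time.time() - self.last_alert_at > self.expiration_time:
                self.last_alert_at = time.time()
                return True
        return False

check = LessThanCheck(expiration_time=900, trigger_value=100, alert_threshold=3)
for value in (700, 80, 70, 60):
    print(value, check.check(value))
# 700 False, 80 False, 70 False, 60 True - the third consecutive value under
# 100 alerts, and nothing else alerts for the next 900 seconds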
Now let us say you have a metric "stats_counts.mysql.queries.per.minute" which
is normally fairly constant at about 300 per minute, but sometimes you clear
the cache and it spikes to 6000 per minute, and sometimes it goes down to 20
per minute at quieter times. However, even when there is a problem it never
gets to 0, so you cannot threshold it low without it being noisy. Yet you have
seen problems where it goes from 400 per minute to 4, 5, 5, 3 and stays down
there because there is some problem with a release, while a few administration
and other services occasionally still make requests. Enter
detect_drop_off_cliff. So we add the following:
BOUNDARY_METRICS = (
    # ('metric', 'algorithm', EXPIRATION_TIME, MIN_AVERAGE, MIN_AVERAGE_SECONDS, TRIGGER_VALUE, ALERT_THRESHOLD, 'ALERT_VIAS'),
    ("stats_counts.http.tenfold.impression.per.minute", 'less_than', 900, 0, 0, 100, 3, 'smtp|hipchat'),
    ("stats_counts.mysql.queries.per.minute", 'detect_drop_off_cliff', 600, 30, 3600, 0, 2, 'smtp|pagerduty|hipchat'),
)
And set the other BOUNDARY related settings.py variables as appropriate for your
environment.
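For the idea behind detect_drop_off_cliff, a deliberately simplified
illustration follows. This is not the actual Boundary algorithm (the real one
is more involved), and dropped_off_cliff and its parameters are hypothetical,
but it shows the kind of behaviour being caught: a metric that normally
averages well above a minimum has recently collapsed to a small fraction of
its baseline without going to 0.

def dropped_off_cliff(timeseries, min_average=30, recent_points=4, drop_factor=10):
    # timeseries is a list of (timestamp, value) tuples, oldest first
    if len(timeseries) <= recent_points:
        return False
    baseline = [value for _, value in timeseries[:-recent_points]]
    recent = [value for _, value in timeseries[-recent_points:]]
    baseline_avg = sum(baseline) / len(baseline)
    recent_avg = sum(recent) / len(recent)
    if baseline_avg < min_average:
        # Metric too low or sparse to assess reliably, do not trigger
        return False
    # Trigger when the recent average has dropped off a cliff relative to
    # the baseline average
    return recent_avg < baseline_avg / drop_factor

# e.g. queries per minute around 300 that suddenly sit at 4, 5, 5, 3
history = [(i, 300 + (i % 7)) for i in range(60)]
history += [(60, 4), (61, 5), (62, 5), (63, 3)]
print(dropped_off_cliff(history))  # True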
http://earthgecko-skyline.readthedocs.io/en/latest/boundary.html#configuration-and-running-boundary
Webapp UI screenshots:
I may as well run through all the views.
Skyline.now:
Like the original Skyline screen, this shows the current anomalies that are in the
ANOMALY_DUMP = 'webapp/static/dump/anomalies.json'
https://drive.google.com/uc?id=0BwzV5s9wP71eREZXUTNjOEkwTDg
Skyline.Panorama.Anomaly.Search:
The Panorama entry view; by default the latest 10 anomalies in the DB load here.
We can search the anomalies in the DB by various parameters; here we use the
metric stats.statsd.packets_received and click Search.
https://drive.google.com/uc?id=0BwzV5s9wP71eeVhsaHFRVXB3eWs
Skyline.Panorama.Anomalies.Found:
The search returns the results in a table and has a dynamic lower view, with a
list which can be moused over; it loads data from Graphite, not Redis (which
the now view uses). It is always likely that it is surfacing historic and most
probably aggregated data from Graphite, so the ACTUAL anomalous data point may
not be in the timeseries, but aggregated to a similar value. The graph will
report this and highlight where the anomaly occurred. It also timeshifts the
timeseries so that it is visible and not right at the edge of the graph. The
original anomalous data is reported too.
https://drive.google.com/uc?id=0BwzV5s9wP71eUUxaWHFldWhDTVU
Skyline.rebrow.login:
Login to the Redis instance - the awesome rebrow
https://drive.google.com/uc?id=0BwzV5s9wP71eOGVjRXM5cGFhOFU
Skyline.rebrow.Server.Status:
The Redis server status
https://drive.google.com/uc?id=0BwzV5s9wP71eYzdVeGZCMjB6N1U
Skyline.rebrow.Server.Status.Command.Statistics:
https://drive.google.com/uc?id=0BwzV5s9wP71eV21yT25BUFN3b3c
Skyline.rebrow.Keys:
A list of all the keys and a search option.
https://drive.google.com/uc?id=0BwzV5s9wP71eZDRWNVp4SkpscDA
Skyline.rebrow.Keys.Search.metrics.stats.statsd:
Search for a namespace
https://drive.google.com/uc?id=0BwzV5s9wP71edHQ0bGl6S1hZcUk
Skyline.rebrow.Keys.key.metrics.stats.statsd.packets_received:
View the key and value, e.g. timeseries data or alert keys, etc.
https://drive.google.com/uc?id=0BwzV5s9wP71eM0xPVThrWXQxZ3M
Skyline.docs:
The Webapp serves its own copy of the docs :)
https://drive.google.com/uc?id=0BwzV5s9wP71eRmo2eXAxaTZqNHM
Hope that helps, sorry I could not put those inline.
Regards
Gary