Simple idea for using KairosDB for alerting


Kevin Burton

Mar 25, 2015, 5:34:28 PM
to kairosd...@googlegroups.com
Internally we use KairosDB for alerting.  We upload all our metrics to KairosDB and then have external APIs that read the metrics.

I was thinking about how to make it EASIER to tie in systems like PagerDuty, WebSitePulse, and Nagios, and I had the following idea.

I believe KairosDB already supports GET for metric query.  At least that was a recent patch that was committed. Not sure if it was released yet.

If we were to extend the protocol to support a range of values for a query, we could have the HTTP request return HTTP 500 if the values aren't in a specific range.

So say you had % CPU utilization.

If you ran a 5 minute avg call on CPU utilization, and it came back at 95%, that might be an error.  

So you could specify a range in the query like:

"assert": {
    "range": {
        "min": 0,
        "max": 70
    }
}

(this is just a proposal on the syntax, just thinking out loud).

This would mean that you could take an ENTIRE metric assertion, encode it into one URL, then stick that URL into your monitoring system.

99% of the time it would return 200 OK... but when the range assertion fails, the servlet could return HTTP 500 with a json message.

We're doing this internally now, but our system is somewhat complex and I don't like having a basic/simple monitoring system embedded in our API server.  

It would return the normal JSON payload as well, just with some error messages explaining that the value fell outside the range, and the HTTP response status would be 500.

I admit that it's a bit of an abuse to use HTTP 500 but I don't know of another decent status code we could use.

I know some people sometimes use HTTP 420 for this (which is an undefined HTTP status code). Maybe we could use "HTTP 420 Dude, not cool!"
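
A rough sketch of the range assertion proposed above, in Python. The function name `evaluate_assertion` and the shape of the error body are made up for illustration; nothing like this exists in KairosDB itself.

```python
from typing import Iterable, Tuple


def evaluate_assertion(values: Iterable[float], assertion: dict) -> Tuple[int, dict]:
    """Return (http_status, body) for a hypothetical "assert" clause.

    `assertion` follows the syntax proposed in this thread:
    {"range": {"min": 0, "max": 70}}
    """
    bounds = assertion["range"]
    low, high = bounds["min"], bounds["max"]
    # Collect every value that falls outside the inclusive [min, max] range.
    violations = [v for v in values if not (low <= v <= high)]
    if violations:
        errors = [f"value {v} outside range [{low}, {high}]" for v in violations]
        return 500, {"errors": errors}
    return 200, {"errors": []}
```

With the CPU example, a 5 minute average of 95 against a max of 70 gives status 500, while anything between 0 and 70 gives 200.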


Brian Hawkins

Mar 25, 2015, 7:11:02 PM
to kairosd...@googlegroups.com
This is interesting.  We use Nagios and have written Perl code (gah!) that queries Kairos and returns 0, 1, or 2 to Nagios for an alert.

I could see this as a potential plugin.

Brian
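
For reference, 0, 1, and 2 are the standard Nagios plugin exit codes for OK, WARNING, and CRITICAL. A minimal sketch of that mapping in Python (not the Perl script mentioned above; the function name and the "higher is worse" threshold semantics are assumptions):

```python
OK, WARNING, CRITICAL = 0, 1, 2  # standard Nagios plugin exit codes


def nagios_status(value: float, warn: float, crit: float) -> int:
    """Map a metric value to a Nagios exit code, assuming higher is worse."""
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK
```

A real check would fetch the value from Kairos first and then finish with `sys.exit(nagios_status(value, warn, crit))`.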

Kevin Burton

Mar 25, 2015, 7:35:41 PM
to kairosd...@googlegroups.com
Wouldn't the JSON query need to be changed?  That seems like core...

I like the idea of being able to use anything to alert against.  I might just take this and make a servlet on our end that forwards the queries (for now).

Kevin

Brian Hawkins

Mar 26, 2015, 10:07:59 AM
to kairosd...@googlegroups.com
Plugins can add additional api endpoints.  You could add a plugin that adds /alert that does what you describe.

Brian

Loic Coulet

Mar 26, 2015, 2:40:33 PM
to kairosd...@googlegroups.com
Yep, adding API endpoints via plugins is incredibly easy and we have made extensive use of it.

Your design leveraging Jersey + Guice is terribly efficient for implementing powerful add-ons.

BTW, Kevin, this idea sounds very interesting to me.

Kevin Burton

Mar 26, 2015, 5:13:05 PM
to kairosd...@googlegroups.com
Maybe it's /alert with a query param which is just the encoded JSON, plus range.min and range.max parameters.
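
A sketch of what such a URL could look like, using only the Python standard library. The /alert endpoint and the `query`, `range.min`, and `range.max` parameter names are just the proposal from this thread, not an existing KairosDB API, and both function names are hypothetical.

```python
import json
from urllib.parse import parse_qs, urlencode, urlparse


def build_alert_url(base_url: str, query: dict, range_min: float, range_max: float) -> str:
    """Encode a whole metric assertion into a single URL (hypothetical /alert endpoint)."""
    params = {
        "query": json.dumps(query, separators=(",", ":")),
        "range.min": range_min,
        "range.max": range_max,
    }
    return f"{base_url}/alert?{urlencode(params)}"


def parse_alert_url(url: str):
    """Decode the URL again, as the receiving endpoint would have to."""
    params = parse_qs(urlparse(url).query)
    query = json.loads(params["query"][0])
    return query, float(params["range.min"][0]), float(params["range.max"][0])
```

The resulting URL is what you would paste into the external monitoring system.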

Unfortunately, I don't think I have time in the short term to implement this properly but maybe sometime this month.

Derek Farren

Mar 31, 2015, 3:17:23 PM
to kairosd...@googlegroups.com
This is actually what I use KairosDB for. I have a machine learning algo written in Python that looks for anomalies in real time. It also infers the underlying graph that relates all time series in KairosDB.

Erol Merdanović

May 11, 2015, 4:00:19 AM
to kairosd...@googlegroups.com
Hi all

How would you implement this? Should datapoints be stored in memory as a queue? Every new datapoint evicts the oldest one and then avg/sum is recalculated...?  Right now I get data chunks every 5 minutes, but I want to improve this to 5 seconds. I cannot run a query to retrieve datapoints every 5 seconds (multiplied by 100 alerts) because sooner or later it will fail under load.

Can someone please share a simple idea for the implementation? In my custom logic, I define which tag to monitor, the comparator, and the duration (start=now-duration, end=now). I could use some background processing (akka) to load datapoints for each alert and store them in a queue. When new datapoints come in, I just remove datapoints where datapoint.time < now-duration. Then I calculate the avg.

Does this make sense? I'm only worried that I would be loading too many datapoints into memory. Max duration is 1 week, so 1 week of datapoints (every 5 seconds) means around 120k records.
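
The queue idea above can be sketched like this in Python (the class and method names are invented for illustration). Keeping a running total means the average does not have to be recomputed over the whole window on every new datapoint.

```python
from collections import deque


class SlidingWindowAverage:
    """Average over the datapoints of the last `duration_seconds`."""

    def __init__(self, duration_seconds: float):
        self.duration = duration_seconds
        self.points = deque()  # (timestamp, value) pairs, oldest first
        self.total = 0.0       # running sum of the values currently in the window

    def add(self, timestamp: float, value: float) -> None:
        """Append a new datapoint and drop everything older than the window."""
        self.points.append((timestamp, value))
        self.total += value
        cutoff = timestamp - self.duration
        # Remove datapoints where datapoint.time < now - duration.
        while self.points and self.points[0][0] < cutoff:
            _, old_value = self.points.popleft()
            self.total -= old_value

    def average(self):
        """Return the current average, or None if the window is empty."""
        return self.total / len(self.points) if self.points else None
```

On the memory question: one week at one datapoint every 5 seconds is 604800 / 5 = 120960 entries per alert, which matches the 120k estimate.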

Lance N.

May 23, 2015, 7:30:27 PM
to kairosd...@googlegroups.com
This is very clever! I haven't followed closely, but does KairosDB support scripting language plugins like JS? That (with a SecurityManager object!) would be the general-purpose way of handling this use case.

Raz Baluchi

May 24, 2015, 11:45:26 AM
to kairosd...@googlegroups.com
Wouldn't it be best to integrate KairosDB with something like Bosun? 

Bosun is a time series alerting framework developed by Stack Exchange. (bosun.org)

Bosun already supports querying OpenTSDB, Graphite, and Elasticsearch. Adding support for querying KairosDB would cover the alerting needs.

Brian Hawkins

May 28, 2015, 4:10:56 PM
to kairosd...@googlegroups.com, raz.b...@gmail.com
I like the Bosun option.  The problem with adding scripts to the server is that in a load balanced setup not all datapoints that you are alerting on may go through the same kairos node.  You would have to send them to a single machine for alerting.

Brian

Erol Merdanović

Sep 28, 2015, 2:17:47 PM
to KairosDB, raz.b...@gmail.com
Is anyone using Bosun with KairosDB?

I'm looking for a solution for alerting. My idea is to use RabbitMQ to ingest data into KairosDB. I would just create a duplicate queue that forwards the messages to some alerting service, which does the checking in real time.

I think this solves the KairosDB node problem.

Kevin Burton

Sep 28, 2015, 8:12:04 PM
to KairosDB, raz.b...@gmail.com
We ended up writing our own called artemis-healthcheck ... for lack of a better name.

You basically deploy templates; each one has a given range and an embedded KairosDB query.

We're probably going to open source it, but I need to refactor the way we use our Maven repository.

Erol Merdanović

Sep 29, 2015, 7:20:53 AM
to KairosDB, raz.b...@gmail.com
Can you tell me more about how it works?

What we are looking for is to be able to calculate
- avg
- min
- max
- count

on a dataset and compare (<, >, ==) the result with a defined value for a selected time range (5min, 15min, 1h, 1d, 1w) in real time (max 5 seconds delay).
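
To make the requirement concrete, here is a small Python sketch of such a rule: one aggregate, one comparator, one threshold. All names are made up, and fetching the values for the selected time range is left out.

```python
import operator

AGGREGATES = {
    "avg": lambda xs: sum(xs) / len(xs),
    "min": min,
    "max": max,
    "count": len,
}
COMPARATORS = {"<": operator.lt, ">": operator.gt, "==": operator.eq}


def rule_fires(values, aggregate: str, comparator: str, threshold: float) -> bool:
    """Aggregate `values` and compare the result with `threshold`."""
    if not values and aggregate != "count":
        return False  # avg/min/max are undefined without data
    result = AGGREGATES[aggregate](values)
    return COMPARATORS[comparator](result, threshold)
```

For example, `rule_fires(values, "avg", ">", 70)` is true when the average over the window exceeds 70.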

Kevin Burton

Sep 30, 2015, 6:16:56 PM
to KairosDB, raz.b...@gmail.com


On Tuesday, September 29, 2015 at 4:20:53 AM UTC-7, Erol Merdanović wrote:
Can you tell me more about how it works?


Was that directed at me?

What we are looking for is to be able to calculate
- avg
- min
- max
- count

on a dataset and compare (<, >, ==) the result with a defined value for a selected time range (5min, 15min, 1h, 1d, 1w) in real time (max 5 seconds delay).


So in our configuration we have a template like this (see below).

We have warning levels such as ERROR and FATAL.

ERROR just sends email... FATAL calls our phones.

It's exposed via REST which outputs something like this: 

{
  "timestamp" : 1443651197762,
  "timestampISO8601" : "2015-09-30T22:13:17Z",
  "pending" : false,
  "comment" : null,
  "healthchecks" : {
    "dead-letter-queue" : {
      "state" : "HEALTHY",
      "value" : 340.0,
      "level" : "FATAL"
    }
  }
}

... when you request it.

If any of the healthchecks are failing, it returns HTTP 500 so that we can tie it into NodePing. NodePing then alerts on != HTTP 200 and sends to PagerDuty, which calls our phones.
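
In Python terms, the status decision amounts to something like the sketch below. This is not the artemis-healthcheck code; the function name is hypothetical, and since the thread only shows the "HEALTHY" state, any other state is treated as failing here.

```python
def http_status_for(report: dict) -> int:
    """Return 200 if every healthcheck in the report is HEALTHY, else 500."""
    states = [check["state"] for check in report["healthchecks"].values()]
    # Assumption: any state other than "HEALTHY" counts as a failure.
    return 200 if all(state == "HEALTHY" for state in states) else 500
```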

We set up the aggregators to return 1 value.

If you use nested aggregators you can compute things like stddev, min, max, etc.

And you can control time range.

Then there's a range, which is an interval between min and max.

If the values aren't within the range then the healthcheck fails.

I'm willing to OSS it if you want... it might take me a while though, because I think it might open up a Pandora's box for us :-P... we're also in the middle of a datacenter migration :-P


 {

  "name": "robots-running",
  "description": "Verify that we have robots up and running.  Content should fall first though. ",
  "level": "ERROR",
  "query": {
    "metrics": [
      {
        "tags": {
          "role": [
            "robot"
          ]
        },
        "name": "com.spinn3r.artemis.metrics.init.uptime.UptimeMetricsService.uptime",
        "aggregators": [
          {
            "name": "sum",
            "align_sampling": true,
            "sampling": {
              "value": "5",
              "unit": "minutes"
            }
          }
        ]
      }
    ],
    "cache_time": 0,
    "start_relative": {
      "value": "20",
      "unit": "minutes"
    }
  },
  "range": {
    "min": 15,
    "max": 2147483647
  }

}
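
A sketch of how a template like this could be evaluated against a KairosDB query response, in Python. The function name is hypothetical and this is not how artemis-healthcheck is implemented. It assumes the usual response layout (`queries` → `results` → `values` as `[timestamp, value]` pairs), a single result series, and that only the most recent value is checked against the range.

```python
def evaluate_template(template: dict, response: dict) -> dict:
    """Check the latest value of a query response against the template's range."""
    bounds = template["range"]
    # Assumption: one query with one result series.
    values = response["queries"][0]["results"][0]["values"]
    latest = values[-1][1] if values else None
    # Assumption: no data at all is treated as a failed healthcheck.
    healthy = latest is not None and bounds["min"] <= latest <= bounds["max"]
    return {
        "name": template["name"],
        "level": template["level"],
        "value": latest,
        "healthy": healthy,
    }
```

With the "robots-running" template, a latest 5 minute sum below 15 would mark the check as failed at level ERROR.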

Erol Merdanović

Oct 5, 2015, 2:21:42 AM
to KairosDB, raz.b...@gmail.com
Hi Kevin. Yes :)

This might be what we are looking for. Do you use cron or something similar to run the queries periodically?

Brian Hawkins

Oct 8, 2015, 11:33:53 AM
to KairosDB, raz.b...@gmail.com
We use Nagios checks that query Kairos for alerts.  Basically it is a Perl (yuck) script that queries Kairos.

Here are some of the checks we do:
- Make sure data rates don't fall below a certain level for hosts
- Check for dropped events
- Check queue sizes
- Check the number of hosts reporting metrics

In some cases there is a bit of scripting work to analyze the data, but for queue size it is a pretty simple query that does a max over some period of time and checks the value.

I know these aren't real time, but they give us a lot more than we had before Kairos.

Brian

Kevin Burton

Oct 8, 2015, 11:12:38 PM
to KairosDB, raz.b...@gmail.com
It just uses a background task scheduler in Java. Every 60 seconds it re-polls.. nothing too fancy.

It's designed to be 100% Java and self-contained. I'll try to OSS it here soon.
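
For comparison, the same re-polling pattern as a minimal Python sketch (the tool described above is Java; `run_every` and its parameters are made-up names). Waiting on the stop event instead of sleeping lets the loop shut down promptly.

```python
import threading


def run_every(interval_seconds: float, task, stop_event: threading.Event) -> None:
    """Call task() right away and then once per interval until stop_event is set."""
    while not stop_event.is_set():
        task()
        # wait() returns early if stop_event is set during the pause.
        stop_event.wait(interval_seconds)
```

To poll every 60 seconds in the background, run it in a daemon thread: `threading.Thread(target=run_every, args=(60, poll, stop), daemon=True).start()`, where `poll` is whatever function runs the healthchecks.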