Internally we use KairosDB for alerting. We upload all our metrics to KairosDB and then have external APIs that read the metrics.
I was thinking about how to make it EASIER to tie in systems like PagerDuty and WebSitePulse, and also systems like Nagios, and I had the following idea.
I believe KairosDB already supports GET for metric queries. At least, a patch for that was committed recently. I'm not sure whether it has been released yet.
If we were to extend the protocol to support a range of acceptable values for a query, we could have the HTTP request return HTTP 500 if the values aren't in that range.
So say you had % CPU utilization.
If you ran a 5 minute avg call on CPU utilization, and it came back at 95%, that might be an error.
So you could specify a range in the query like:
"assert": {
    "range": {
        "min": 0,
        "max": 70
    }
}
(This is just a proposal for the syntax; I'm just thinking out loud.)
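To make the idea concrete, here is a rough sketch of how a range assertion like the one above might be evaluated on the server side. Nothing here is existing KairosDB code: the function name `find_range_violations` and the `(timestamp, value)` tuple layout are assumptions made for illustration only.

```python
def find_range_violations(datapoints, minimum, maximum):
    """Return the (timestamp, value) pairs whose value is outside [minimum, maximum].

    Hypothetical helper: `datapoints` is assumed to be an iterable of
    (timestamp, value) pairs, e.g. the output of a 5 minute avg aggregation.
    """
    return [(ts, value) for ts, value in datapoints
            if not (minimum <= value <= maximum)]


# The proposed "assert" block, as it might look once parsed from the query.
assertion = {"range": {"min": 0, "max": 70}}

# Two made-up CPU utilization samples: one healthy, one too high.
datapoints = [(1000, 42.0), (2000, 95.0)]

violations = find_range_violations(
    datapoints,
    assertion["range"]["min"],
    assertion["range"]["max"],
)
```

With the sample data above, only the 95% reading is reported as a violation; an empty list would mean the assertion passed.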
This would mean that you could take an ENTIRE metric assertion, encode it into one URL, then stick that URL into your monitoring system.
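As a sketch of what "one URL" could look like: the snippet below URL-encodes a whole query, including the proposed `assert` block, into a single GET URL. It assumes the GET endpoint accepts the JSON query in a `query` parameter, and that the `assert` block would sit inside the metric entry; both are assumptions, as are the host name and the metric name `cpu.utilization`.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit


def build_assertion_url(base_url, query):
    """Encode a JSON query into a single GET URL (assumed `query` parameter)."""
    encoded = urlencode({"query": json.dumps(query, separators=(",", ":"))})
    return base_url + "?" + encoded


# 5 minute average of CPU utilization, plus the proposed (hypothetical) assertion.
query = {
    "start_relative": {"value": 5, "unit": "minutes"},
    "metrics": [{
        "name": "cpu.utilization",  # made-up metric name
        "aggregators": [{"name": "avg",
                         "sampling": {"value": 5, "unit": "minutes"}}],
        "assert": {"range": {"min": 0, "max": 70}},  # proposed extension
    }],
}

# Placeholder host; substitute your own KairosDB server.
url = build_assertion_url(
    "http://kairosdb.example.com:8080/api/v1/datapoints/query", query)

# Round trip: decoding the URL gives back the original query.
decoded = json.loads(parse_qs(urlsplit(url).query)["query"][0])
```

The resulting `url` is what you would paste into the monitoring system as a plain HTTP check.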
99% of the time it would return 200 OK... but when the range assertion fails, the servlet could return HTTP 500 with a JSON message.
We're doing this internally now, but our system is somewhat complex and I don't like having a basic/simple monitoring system embedded in our API server.
A failing response would still return the normal JSON payload, just with some added error messages explaining that the value fell outside the range, and the HTTP status would be 500.
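Roughly, the status selection could work like the sketch below. This is not how the servlet is written today; the function `build_response`, the `errors` field name, and the message wording are all assumptions for illustration.

```python
import json


def build_response(payload, violations, failure_status=500):
    """Return (status_code, body) for a query that carried a range assertion.

    Hypothetical sketch: `payload` is the normal query result as a dict and
    `violations` is a list of (timestamp, value) pairs outside the range.
    """
    if not violations:
        return 200, json.dumps(payload)
    body = dict(payload)  # keep the normal payload intact
    body["errors"] = [
        f"value {value} at {ts} is outside the asserted range"
        for ts, value in violations
    ]
    return failure_status, json.dumps(body)


payload = {"queries": []}  # stand-in for the real query result
ok_status, ok_body = build_response(payload, [])
bad_status, bad_body = build_response(payload, [(2000, 95.0)])
```

Keeping the failure status as a parameter means the choice of status code stays a one-line change, whichever code ends up being used.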
I admit that it's a bit of an abuse to use HTTP 500, but I don't know of another decent status code we could use.
I know some people sometimes use HTTP 420 for this (which is not a standard HTTP status code). Maybe we could use "HTTP 420 Dude, not cool!"