Derive duration metrics from start-stop events

Michael Bravo

Mar 17, 2021, 7:27:16 AM
to Riemann Users
Hi,

I have been a sort of secret Riemann admirer for years, though mostly from afar. I have now run into a practical problem that I feel should be easily solvable with Riemann, but I can't quite grasp how. Details follow; any and all pointers are gratefully accepted.

The problem is with deriving duration metrics from start-stop events. Imagine you need to track build or deployment durations.

If you drive the process from some kind of CI/CD, like Jenkins, you have a contiguous line of execution in a controlling process, which can measure the duration and submit the metric to whichever time series solution you are using, e.g. Prometheus or Datadog (or you can decorate your code with @timed and it will automagically do the same for you)

If, however, you have something which is more of a state machine, e.g. ArgoCD deploying to k8s, you don't have a contiguous line of execution; instead, you have hooks to send start-stop events. If you still need your durations/intervals, you need a middleware that would catch the events, compute the intervals for matching start-stop events (per service, deployment name, what have you) and then submit them to your timeseries solution.

The latter seems to be a natural fit for Riemann, conceptually, but I can't seem to put that into an actionable config.

Sanel Zukan

Mar 17, 2021, 11:37:48 AM
to Michael Bravo, Riemann Users
Hi,

Michael Bravo <mike....@gmail.com> writes:
> If you drive the process from some kind of CI/CD, like Jenkins, you have a
> contiguous line of execution in a controlling process, which can measure
> the duration and submit the metric to whichever time series solution you
> are using, e.g. Prometheus or Datadog (or you can decorate your code with
> @timed and it will automagically do the same for you)
>
> If, however, you have something which is more of a state machine, e.g.
> ArgoCD deploying to k8s, you don't have a contiguous line of execution,
> instead, you have hooks to send start-stop events. If you still need your
> durations/intervals, you need a middleware that would catch the events,
> compute the intervals for matching start-stop events (per service,
> deployment name, what have you) and then submit to your timeseries solution.
>
> The latter seems to be a natural fit for Riemann, conceptually, but I can't
> seem to put that into an actionable config.

You can search the Riemann index for previous events, so you could
try this:

1. When X is started, have it send an event with this payload:
{:metric <timestamp> :state "started"}

2. When it arrives at Riemann, set its ttl to some large number
(assuming your configuration expires events). This ensures the
"started" event is not deleted.

3. When X is stopped, have it send an event with this payload:
{:metric <timestamp> :state "stopped"}

4. On the Riemann side, look in the index for the matching event with
"started" state. Subtract the timestamps and you will have how long it
took for something to run.

Note that I typed this quickly without actually testing it, so
there might be errors. But you'll get the idea behind it.

Pseudocode:

(defn try-calc-elapsed [& children]
  (fn [event]
    (let [;; look up the indexed event with the same host/service pair;
          ;; riemann.index/lookup returns a single event or nil
          found (riemann.index/lookup (:index @core)
                                      (:host event)
                                      (:service event))]
      ;; go on only if the indexed event is the 'started' one
      (when (= (:state found) "started")
        ;; 'found' is event with 'started' state, 'event' should be
        ;; the same event but with 'stopped' state
        (let [elapsed (- (:metric event) (:metric found))]
          ;; delete 'started' event from the index
          (riemann.index/delete (:index @core) found)

          ;; call children with the new event; it is derived from
          ;; 'stopped' event so it keeps elements like host, ttl...
          (call-rescue (merge event
                              {:metric elapsed
                               :service "X-elapsed"
                               :state "ok"})
                       children))))))

(let [index (index)]
  (streams
    (split
      (and (service "X")
           (state "started"))
      ;; change event ttl to some big value you think
      ;; will not expire before 'stopped' arrives; store it in index
      (with {:ttl 1234567} index)

      ;; If we receive 'stopped' event, try to calculate elapsed time.
      ;; If it fails (no event found), it will not store anything in
      ;; index or database.
      (and (service "X")
           (state "stopped"))
      (try-calc-elapsed index influx)

      ;; anything else: send to index and database; split takes a
      ;; single default stream, so the two are wrapped in sdo
      ;; ('influx' is assumed to be a stream you defined elsewhere)
      (sdo index influx))))
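
To check the matching logic at a REPL without a running Riemann, the same idea can be sketched as a pure function. This is only an illustration under assumptions, not Riemann API: a plain map keyed by [host service] stands in for the index, and pair-event, the host "web-1" and the sample timestamps are all made-up names and values.

```clojure
;; Stand-in for the Riemann index: a plain map keyed by [host service].
;; pair-event and the sample events below are hypothetical.
(defn pair-event
  "Takes the pending 'started' events and one incoming event. Returns
  [new-pending derived], where derived is an elapsed-time event or nil."
  [pending event]
  (let [k     [(:host event) (:service event)]
        found (get pending k)]
    (cond
      ;; remember the start so a later stop can find it
      (= (:state event) "started")
      [(assoc pending k event) nil]

      ;; matching start exists: emit the duration and forget the start
      (and (= (:state event) "stopped") found)
      [(dissoc pending k)
       (merge event {:metric  (- (:metric event) (:metric found))
                     :service (str (:service event) "-elapsed")
                     :state   "ok"})]

      ;; stop without a start, or any other state: nothing to derive
      :else
      [pending nil])))

;; Usage: a run on host "web-1" that starts at t=1000 and stops at t=1090.
(def start {:host "web-1" :service "X" :state "started" :metric 1000})
(def stop  {:host "web-1" :service "X" :state "stopped" :metric 1090})

(let [[pending _]       (pair-event {} start)
      [pending elapsed] (pair-event pending stop)]
  (println (:metric elapsed) (:service elapsed) (count pending)))
```

With these sample events the derived event carries :metric 90 and :service "X-elapsed", and nothing is left pending. Note that both events need the same :host and :service for the lookup to match, just as in the index-based version.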


Best regards,
Sanel
