On Sat, 27 Oct 2018 at 00:06, <tgr...@gmail.com> wrote:
>
> A concrete example is an airflow operator. While airflow
> itself stores some job level data that a proper pull based
> exporter can surface, the business logic code is by definition
> short running and has no native place to store metrics for the
> exporter. That means any instrumentation of the code internals
> *must* be persisted outside the process and outside the
> scheduler. Airflow simply has no way to store granular detail
> like this on the job's behalf. Hadoop and kubernetes jobs have
> the same issue. Something external must store the data.
I'd say that if each instance of these short-lived jobs is a
single-shot “unique” task, i.e. “at” style rather than “cron”
style, with a certain amount of information to be persisted
outside the process as the task reaches the end of its short
lifetime (as I understand you), then this really looks like
event-logging. I
don't have all the information for a detailed analysis, but in
general Prometheus is utterly inappropriate for
event-logging. The Prometheus community has always advocated for
the use of event-logging and metrics-based monitoring in
parallel, complementing each other (plus not to forget
distributed tracing as the third pillar of
observability). Prometheus is a good choice for metrics. There
are a number of solutions available for event-logging. Prometheus
has no ambitions to “land-grab” into territories of other tools.
If users try to shoehorn the Pushgateway into a push-based
event-logging system, they usually push every event with a
different grouping key. Since the Pushgateway accumulates the
events forever, and the Prometheus server scrapes them all each
time, a need for a TTL naturally arises. The fallacy, however,
is not the request for a TTL; the fallacy is the (ab-)use of the
Pushgateway as an event-logging system. If you don't do the
latter, you don't need the former.
But perhaps what you are trying to push is actually legitimately
metrics-based. I see two categories for that.
The first category is where the instances of a job are all
doing the same thing sequentially, pushing into the same group,
with the proverbial example of a daily backup job. For
illustration, check out the documentation of the Go client:
- A simple example:
https://godoc.org/github.com/prometheus/client_golang/prometheus/push#example-Pusher-Push
- A complex example:
https://godoc.org/github.com/prometheus/client_golang/prometheus/push#example-Pusher-Add
In this case, the proposed benefit of a TTL on the Pushgateway is
that no additional DELETE step is required after a job is
deliberately decommissioned: the pushed metric group would simply
disappear once the TTL has passed.
That's a trap, however. With the TTL, you get the worst of both
worlds. Let's assume it's indeed a daily job. You alert if 25
hours pass without the job completing successfully. Your TTL,
obviously, needs to be significantly larger than 24 hours as you
only push once daily anyway. If the backup job fails, you want to
see when the last backup job succeeded for a couple of days, at
least, let's say for longer than a weekend, hence about three
days. Let's look at the two cases:
1. A job is deliberately removed, no backup is running
anymore. Your alert will fire 25h after the last backup has
run. It will continue to fire for ~2d until the TTL kicks in and
removes the metric. This is bad because an alert has fired for
no reason.
2. A job fails to run. As intended, after 25h the alert
fires. Let's assume it happens Friday evening. The engineer
on-call decides to work on it during work hours on Monday
because the backup is not critical enough right now to justify a
weekend work session. However, on Monday morning the TTL has
kicked in and the alert ceased to fire. Let's say the Friday
on-call engineer forgot about it, or a different engineer is
on-call now. They see the alert has resolved and move on to
work on other things.
Conclusion: You get noisy alerts in the one case. You miss a real
outage in the other case.
Solution: Do _not_ implement a TTL. Instead, make metrics
deletion (a simple DELETE call to the RESTful API) part of
decommissioning a job.
The second category is a larger workload distributed over many
smaller short-lived tasks. This use-case is becoming
increasingly popular with the serverless paradigm, and perhaps
that's similar to what you accomplish with your Airflow
setup. There are several interesting metrics I can see here, most
obviously the number of work units processed in aggregate by all
the short-lived tasks. The truth is that there is currently no
good way of doing this within the core Prometheus ecosystem. If
users try to shoehorn the Pushgateway into it, they usually push
to different groupings again (as above) and then try to aggregate
the metrics on the Prometheus server after scraping, e.g. a sum
over all the pushed counts of work units processed would give
you an idea of the total. I see how a
TTL would seemingly come in handy, but it should be easy to see
how brittle this setup is. Again, the right course of action is
to use the right tool instead of asking for a feature that would
make the wrong tool slightly easier to shoehorn into
place. What you need in this case is a way to do distributed
counting. This is the case explained in the first paragraph of
the non-goals, including possible tools to accomplish what you need:
https://github.com/prometheus/pushgateway#non-goals
I'm fairly certain that the Prometheus community will have to
think more about the “distributed counter” use-case, as it will
become more common (see above, serverless etc.). Perhaps, a
solution will happen within the Pushgateway (but I personally
don't think so, as it cannot provide a good HA-story). More
likely it will be solved very differently. But the solution is
definitely not a TTL for metrics pushed to the Pushgateway.
Having said all that, if your use-case doesn't fit any of the
categories above, please explain how it is different.
> 1) Would you be willing to accept PR for this type of functionality?
To avoid misunderstandings: The problem here is not that a TTL
would be difficult to implement (which it is not). It would not
even be difficult to maintain (which often is a reason to be
cautious about adding other features of limited use). The concern
here is that adding such a feature is considered positively
harmful by those who created the Pushgateway and
Prometheus as a whole.
There needs to be a consensus within the community that this
feature is useful rather than harmful before even thinking about
an implementation.
> 2) If not, would you accept a PR that put this functionality
> into a sidecar type process that uses the DELETE api? We've
> discussed writing this ourselves and it sounds like others
> would benefit.
If the feature is harmful within the PGW, it is also harmful as a
sidecar. Of course, nobody will stop you from implementing
it. You could even fork the PGW and add such a feature. You might
even prove us wrong in that way. Personally, I'm happy to stand
corrected. I see forking as fair game in the open-source
world. However, you cannot expect the developers to recommend or
support a feature that they believe to be harmful.