Sending job metrics to pushgateway and expiring them

juho.m...@gmail.com

Apr 9, 2019, 6:51:43 AM
to Prometheus Developers
I'm running a system where I run Spark jobs on Kubernetes. Each Spark job has a unique ID. The jobs send a bunch of metrics to the Prometheus pushgateway with the job ID as one of the labels (another important label is the name of the job).

So during a single day I might have a job "foobar" which is executed every hour. Thus every hour I will send, for example, a metric spark_job_duration{job_name="foobar", uuid="123"}, where the uuid changes every hour (in addition I'm sending JVM metrics and Spark-specific metrics, which all share these labels). Once the job is done, no other job will ever send metrics with the same label combination.

It seems that the pushgateway doesn't expire the pushed metrics in any way, so my pushgateway fills up with distinct metrics until there are so many of them that Prometheus can't even fetch all of them before the HTTP request times out.

I've searched the GitHub issues and this mailing list, and there have been a few mentions of a "TTL for pushgateway", but the use case has been a bit different.

I'm wondering whether I'm doing something wrong, or whether there is simply a toggle to clear old, inactive metrics from the pushgateway?

Thanks.

Brian Brazil

Apr 9, 2019, 7:02:45 AM
to juho.m...@gmail.com, Prometheus Developers
You should not send the uuid label, so that the single push you're sending at the end of the job replaces the previous push. 
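
For illustration, a rough sketch with the Python client (metric and label names just follow your example; none of this is meant as the exact code you run):

# Illustrative sketch only (prometheus_client; names follow the example above).
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

registry = CollectorRegistry()
duration = Gauge('spark_job_duration', 'Duration of the Spark job in seconds',
                 ['job_name'], registry=registry)
duration.labels(job_name='foobar').set(3600)

# With the uuid in the grouping key, every hourly run creates a brand-new group
# that the Pushgateway then keeps around forever:
#   push_to_gateway('pushgateway:9091', job='foobar',
#                   grouping_key={'uuid': '123'}, registry=registry)

# Without it, every run pushes to the same group, so each push simply replaces
# the previous run's metrics:
push_to_gateway('pushgateway:9091', job='foobar', registry=registry)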


juho.m...@gmail.com

Apr 9, 2019, 7:17:02 AM
to Prometheus Developers
> You should not send the uuid label, so that the single push you're sending at the end of the job replaces the previous push. 

Thanks for your feedback.

I'm just not seeing that as a satisfactory approach, for a number of reasons:

- In practice I might have multiple parallel runs of the same job at the same time, so I need to separate them by their uuid.
- I want to be able to graph the full execution of a single run easily, so that I can be sure no other run interferes with the graph.
- With the uuid as a label I can easily sum across all instances, or pick out just the one I'm specifically interested in.

Sending the uuid solves all my cases really well, except that the pushgateway cannot handle it.

Bjoern Rabenstein

Apr 10, 2019, 6:16:31 AM
to juho.m...@gmail.com, Prometheus Developers
On 09.04.19 04:17, juho.m...@gmail.com wrote:
>
> Sending the uuid solves all my cases really well, except that the pushgateway cannot handle it.

Except that it makes all your metrics unique. If I understand
correctly, each of the unique metrics will only have one sample (the
one pushed at the end of each job). Arguably, those aren't really
metrics anymore, those are events, and you are venturing deep into
event-logging territory here.

Prometheus is a very bad event-logging store, as in: the collection
of events is cumbersome, as you have to shoehorn it into metrics
collection. PromQL and the evaluation model are not well suited to
it. And finally, it is _very_ expensive to store those events,
i.e. single-sample metrics, in Prometheus. Prometheus has a
per-sample cost of not much more than a byte. The per-metric cost is
on the order of kilobytes.

Applying band-aids to parts of this set of problems (like introducing
the infamously discussed timeout for metrics pushed to the
Pushgateway) will make the shoehorning easier, but it will still
remain the wrong approach, causing more harm than benefit in the
end. (That's why the Prometheus developers are so reluctant to add
those features.)

The fundamentally correct solution in your case is probably to not use
Prometheus.

In terms of shoehorning, you could go for something weird like finding
a label that is not unique (like the uuid) but rotates in some
meaningful way. For example, if each of your jobs is processing things
for a particular hour of the day, that "hour of the day" could be used
as a label (and as part of the grouping key for pushing), and it would
repeat itself after a day, thus not completely exploding the number of
unique metrics. At the same time, an overlap would be unlikely. But as
I said, this is all shoehorning, and ideally you would avoid it in the
first place.
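
A sketch of that with the Python client, purely for illustration (metric names are made up, and I'm simply deriving the hour from the current time here):

# Sketch only: "hour of the day" as a rotating grouping key
# (prometheus_client; names are illustrative).
import datetime
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

registry = CollectorRegistry()
duration = Gauge('spark_job_duration', 'Duration of the Spark job in seconds',
                 registry=registry)
duration.set(3600)

hour = datetime.datetime.utcnow().strftime('%H')  # "00".."23", repeats daily
push_to_gateway('pushgateway:9091', job='foobar',
                grouping_key={'hour': hour}, registry=registry)
# At most 24 groups per job ever exist on the Pushgateway; each day's push for
# a given hour overwrites the previous day's group.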

--
Björn Rabenstein
[PGP-ID] 0x851C3DA17D748D03
[email] bjo...@rabenste.in

juho.m...@gmail.com

Apr 10, 2019, 6:45:33 AM
to Prometheus Developers
Thank you Björn for continuing the discussion.

Pardon me for being unclear in the first place: you were correct that the "spark_job_duration" metric I used as an example is indeed a single value sent to one unique metric.

Besides this metric I am sending about 200 other metrics every 15 seconds which are very well suited to Prometheus: JVM metrics such as memory usage, garbage collection times, network usage and so on.

So in practice I have a job which starts, does a lot of processing, sends repeated values to a set of ~200 metrics every 15 seconds and then stops. The execution time can be anything from one minute to several hours. Each job is a single instance identified by its unique ID.

To me this is very much like how Prometheus is normally used with Kubernetes: start an application in a pod, expose metrics from that application and, most importantly, attach the pod name, which is unique, to the metrics as a label, so that one can look at the metrics of an individual pod during its lifetime. Another example would be adding the ephemeral instance ID or instance IP as a label, where these instances might be short-lived (for example, just a day).

Please correct me if I'm wrong on this: Prometheus has a finite timespan for how long each unique metric lives in storage, and once new values are no longer sent for a particular unique metric+label combination, the metric gets erased from Prometheus. What I am asking is that the Pushgateway respect the same operating principle.

In my use case I'm just not exposing the metrics from the application via an internal HTTP server, but via the pushgateway. The reason is that my application doesn't have an embedded Prometheus endpoint, and also that each instance of my job consists of a single controller pod and a swarm of short-lived executor pods.

I was able to solve this by creating a pushgateway instance for every job as a sidecar container, so the pushgateway is terminated when the job exits.
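
Roughly, each job's pod looks something like this (a sketch only; names, image and port are illustrative, and I've left out how the sidecar gets shut down when the driver finishes):

# Sketch of the per-job sidecar setup; names, image and port are illustrative.
apiVersion: v1
kind: Pod
metadata:
  name: spark-job-foobar-123
spec:
  restartPolicy: Never
  containers:
    - name: spark-driver
      image: my-spark-job:latest      # pushes its metrics to localhost:9091
    - name: pushgateway
      image: prom/pushgateway
      ports:
        - name: metrics               # scraped by Prometheus via pod discovery
          containerPort: 9091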

Matthias Rampke

Apr 10, 2019, 7:19:38 AM
to Juho Mäkinen, Prometheus Developers
The comparison to the Kubernetes model is apt – and Kubernetes pods don't push to pushgateway either but expose their own metrics, if necessary via sidecars. Prometheus is then responsible for scraping them, as long as they live.

What prevents you from instrumenting the controller pod?

/MR


Bjoern Rabenstein

Apr 11, 2019, 8:16:26 AM
to juho.m...@gmail.com, Prometheus Developers
What MR said.

To add 2¢: Your scenario now really seems like one where you either
want to change things to fit the pull-based Prometheus collection
model, or you want to switch to a push-based monitoring
system. Turning Prometheus into a push-based monitoring system is
going to hit you with most of the combined problems that either
approach has, while it will give you very little of the
benefits. That's why the Prometheus developers don't recommend it and
why we are reluctant to add features that will mostly serve that
discouraged use case.

On 10.04.19 03:45, juho.m...@gmail.com wrote:
>
> Please correct me if I'm wrong on this: Prometheus has a finite
> timespan for how long each unique metric lives in storage, and once
> new values are no longer sent for a particular unique metric+label
> combination, the metric gets erased from Prometheus. What I am asking
> is that the Pushgateway respect the same operating principle.

So yeah, that timespan after which an old metric gets erased is the
retention time. It's 15d by default, and the Prometheus server
contains a lot of engineering effort to keep and manage that much data
on a single node. The Pushgateway is not meant to _also_ implement a
TSDB.

--
Björn Rabenstein
[PGP-ID] 0x851C3DA17D748D03
[email] bjo...@rabenste.in

Ilya Esin

Aug 14, 2019, 8:02:48 AM
to Prometheus Developers
Hello,

I am one of those who dream about having an expiration mechanism in the pushgateway.

My scenario is pretty simple: a cron task sends updates about work done on feeds. The task itself is not very complex and works well. It produces a lot of metrics for the pushgateway. Currently the task does a delete/put cycle on every iteration. Because there are no transaction mechanics, in some rare cases we receive an incomplete/empty set of metrics, which produces false-positive triggers. Auto-cleanup via an expiration mechanism would allow me to avoid deleting whole series and to just send updates without taking care of outdated/inactive dimension values. I'm totally fine with a global parameter like --expire-after.

-- 
Best Regards,
Ilya


Bjoern Rabenstein

Aug 14, 2019, 8:23:57 AM
to Ilya Esin, Prometheus Developers
On 14.08.19 05:02, Ilya Esin wrote:
>
> My scenario is pretty simple: a cron task sends updates about work done on
> feeds. The task itself is not very complex and works well. It produces a lot
> of metrics for the pushgateway. Currently the task does a delete/put cycle on
> every iteration. Because there are no transaction mechanics, in some rare
> cases we receive an incomplete/empty set of metrics, which produces
> false-positive triggers. Auto-cleanup via an expiration mechanism would allow
> me to avoid deleting whole series and to just send updates without taking
> care of outdated/inactive dimension values. I'm totally fine with a global
> parameter like --expire-after.

It sounds like you could solve your problem in a cleaner and more
"Promethean" way by picking the right grouping keys. All metrics in a
group are automatically replaced by a new push so that you don't need
to delete anything manually.
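
For illustration, with the Python client that could look roughly like this (metric and label names are made up; the point is only that the feed ID is part of the grouping key):

# Sketch only (prometheus_client; metric and label names are made up).
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

def report_feed(feed_id, items_updated):
    registry = CollectorRegistry()
    Gauge('feed_last_updated_timestamp_seconds',
          'Unix time of the last successful feed update',
          registry=registry).set_to_current_time()
    Gauge('feed_items_updated',
          'Number of items updated in the last run',
          registry=registry).set(items_updated)
    # The whole group {job="feed_updater", feed=<feed_id>} is replaced by this
    # one push, so there is no separate delete/put cycle and no window in which
    # an incomplete set of metrics is visible.
    push_to_gateway('pushgateway:9091', job='feed_updater',
                    grouping_key={'feed': feed_id}, registry=registry)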

Ilya

Aug 14, 2019, 9:53:14 AM
to Bjoern Rabenstein, Prometheus Developers
Well… ok.

If so, please recommend the most "Promethean" way to solve my issue, different from how it's implemented already:
- we have a virtually unlimited set of feeds
- feeds are updated by users
- most feed updates are automated by users, but some are not
- a cron task processes a queue of updates and updates the feed storage
- the cron task has no idea about existing or not-yet-existing feeds
- the cron task only has information about the feeds that were processed during the current run
- the cron task reports two values per feed to the pushgateway: last_updated_time and items_updated
- we need to trigger an alert if a feed wasn't updated for some time (2 days)
- if a feed wasn't updated for more than 2 weeks, we need to discard it

Any ideas?

Bjoern Rabenstein

Aug 15, 2019, 12:01:35 PM
to Ilya, Prometheus Developers
On 14.08.19 15:53, Ilya wrote:
>
> If so, please recommend the most "Promethean" way to solve my issue,
> different from how it's implemented already:
> - we have a virtually unlimited set of feeds
> - feeds are updated by users
> - most feed updates are automated by users, but some are not
> - a cron task processes a queue of updates and updates the feed storage
> - the cron task has no idea about existing or not-yet-existing feeds
> - the cron task only has information about the feeds that were processed
>   during the current run
> - the cron task reports two values per feed to the pushgateway:
>   last_updated_time and items_updated
> - we need to trigger an alert if a feed wasn't updated for some time (2 days)
> - if a feed wasn't updated for more than 2 weeks, we need to discard it

I guess for full consulting on how to monitor your setup with
Prometheus I'm still lacking details - and also time. (There are
people who do Prometheus consulting as their day job; see the
"commercial support" section of https://prometheus.io/community/ .)

Having said that, there are a few general concerns I can throw in
here:

- If your set of feeds is really unbounded, monitoring per feed
should probably be implemented via some kind of event-logging
solution. Prometheus is inherently bad for unbounded cardinality.

- It's a fundamental problem if you rely on pushes to the Pushgateway
to tell Prometheus which feeds exist at all. If a feed never sees a
single successful push, it will be completely ignored by your
alerts, which is certainly bad. Or in other words: To reliably detect
broken feed updates, you need a different solution anyway.

- In the same area as the previous item is the general problem that
the Pushgateway is in no way designed for HA or data durability. If
the machine your Pushgateway runs on goes down and you reschedule
the Pushgateway somewhere else, it will forget about all its
persisted metrics. The way your alerting works should be able to
cope with the metrics disappearing rather than just not changing
anymore. However, this is at odds with your idea of using removal
from the Pushgateway as a signal that the feed is legitimately gone
and doesn't need to be alerted on anymore.

- Taking out a feed from monitoring should be an informed process, not
just an expiration. Even if that's part of your business logic, it
has a bad smell to replicate the business logic of an expiration
period in your monitoring system. Whatever process discards a feed
after two weeks of inactivity should also clean up metrics from the
Pushgateway (or even better from whatever kind of target discovery
you have). An implicit removal after a certain expiration period has
the problem that it can mask legitimate update failures (just ignore
the alert that kicks in after 2d for another 12d, and it will
finally cease to fire) and that it looks the same as an (accidental
or deliberate) reset of the Pushgateway metrics storage.

In general, I guess you could use the feed ID as part of the grouping
key if the cardinality doesn't kill you, with some process to
garbage-collect groups for discarded feeds. However, as you can see by
the other concerns, a clean solution probably requires a fundamentally
different approach.
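
For the garbage-collection part, a sketch with the Python client (again with made-up names), assuming whatever retires a feed can also call the Pushgateway's delete endpoint for that feed's group:

# Sketch only (prometheus_client; names are made up).
from prometheus_client import delete_from_gateway

def discard_feed(feed_id):
    # ... whatever actually retires the feed after two weeks of inactivity ...
    # Also drop the feed's group from the Pushgateway so it stops being scraped:
    delete_from_gateway('pushgateway:9091', job='feed_updater',
                        grouping_key={'feed': feed_id})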

Ilya Esin

Oct 29, 2019, 4:06:50 PM
to Prometheus Developers
Thank you for detailed response.

You mentioned "most "Promethean" way" of using Prometheus. Unfortunately, I wasn't able to find this term and any reference about Prometheus ideology. That's why I asked you for recommendation. Not for consultancy or support.