Push gateway on Kubernetes


Khusro Jaleel

Jan 23, 2018, 5:32:33 PM
to Prometheus Users
Hi, I know that the push gateway accumulates metrics forever unless you clear them, either by restarting it, or using an API call.

We have several pods that send metrics to our push gateway periodically, but what would be the best mechanism for "clearing" it out on Kubernetes, where it's running as its own pod?

We could give it a small amount of memory, and when it runs out, Kubernetes would automatically restart it, but this might mean that we will lose some metrics (if they have not yet been scraped). 

I could create a Kubernetes CronJob that somehow calls the API and clears it out, but again, how will I know I'm not clearing something that hasn't been scraped yet and might still be needed?
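(For reference, a minimal sketch of what such a cleanup could call, using the Python client's delete support; the gateway address, job name and grouping labels here are placeholders, and it has exactly the problem described above: it cannot tell whether the group has already been scraped.)

# Hypothetical cleanup script that a Kubernetes CronJob could run.
# Deleting a group removes ALL metrics pushed under it, scraped or not.
from prometheus_client import delete_from_gateway

GATEWAY = "pushgateway.monitoring.svc:9091"  # placeholder in-cluster address

# Drop every metric pushed under job="example_batch", instance="pod-abc123".
delete_from_gateway(
    GATEWAY,
    job="example_batch",
    grouping_key={"instance": "pod-abc123"},
)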

What would be the best approach for this? Thanks.

Brian Brazil

Jan 23, 2018, 6:10:41 PM
to Khusro Jaleel, Prometheus Users
It sounds like you're trying to use the pushgateway for something it's not intended for; see https://prometheus.io/docs/practices/pushing/


khusro...@holidayextras.com

Jan 23, 2018, 6:19:32 PM
to Prometheus Users
Hi Brian,

This has been mentioned to me before, and I'm not sure I fully understand. If I'm running a batch job or some processes are pushing metrics to the push gateway, is there something fundamentally wrong with that? If thousands of batch jobs suddenly push metrics to the push gateway, we still need some sort of mechanism to delete stuff from its memory, is that not correct?

In addition, I noticed in the past that when the push gateway had a lot of metrics in memory, the Prometheus scrape took a long time and caused the CPU usage of the Prometheus process to climb.

Brian Brazil

Jan 23, 2018, 6:25:58 PM
to khusro...@holidayextras.com, Prometheus Users
On 23 January 2018 at 23:19, khusro.jaleel via Prometheus Users <promethe...@googlegroups.com> wrote:
Hi Brian,

This has been mentioned to me before, and I'm not sure I fully understand. If I'm running a batch job or some processes are pushing metrics to the push gateway, is there something fundamentally wrong with that? If thousands of batch jobs suddenly push metrics to the push gateway, we still need some sort of mechanism to delete stuff from its memory, is that not correct?

The only reason to remove data from the pushgateway is if the batch job is never expected to run again, so deleting the data is one more step in your turndown docs.

Beyond that you want the pushed data to stay there forever.


Why do you have thousands of service-level batch jobs?

Brian
 


Khusro Jaleel

Jan 23, 2018, 6:40:52 PM
to Brian Brazil, Prometheus Users, khusro...@holidayextras.com
We don’t actually have thousands of batch jobs; I was just using that as an example.

What we would like to do is capture metrics from our microservices running in Kubernetes right before they terminate. These are last-gasp metrics, so to speak. These processes will disappear and not come back; however, another pod may take their place, of course.

So that’s the only use case for the push gateway for us. What we found in the past, however, is that if we never deleted the metrics from the push gateway or flushed it somehow, it would adversely affect the Prometheus process that was scraping it, resulting in higher and higher scrape times, CPU usage and timeouts. That’s why we need to keep clearing it, but only after we are sure that those metrics have been scraped.



Brian Brazil

Jan 24, 2018, 4:45:14 AM
to Khusro Jaleel, Prometheus Users, khusro...@holidayextras.com
On 23 January 2018 at 23:40, Khusro Jaleel <kerne...@gmail.com> wrote:
We don’t actually have thousands of batch jobs; I was just using that as an example.

What we would like to do is capture metrics from our microservices running in Kubernetes right before they terminate. These are last-gasp metrics, so to speak. These processes will disappear and not come back; however, another pod may take their place, of course.

That's not what the pushgateway is meant for, nor is it required. Scrape them normally, and rate() will do the right thing.

Brian
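(To make that suggestion concrete, here is a minimal sketch of the "scrape them normally" approach, assuming the Python client; the metric name and port are illustrative, not from this thread. Each pod exposes its own /metrics endpoint, Prometheus discovers and scrapes the pods directly, and rate() over the counter copes with pods coming and going.)

# Sketch: a long-lived service pod exposing its own metrics for scraping.
import time
from prometheus_client import Counter, start_http_server

REQUESTS = Counter("myapp_requests_total", "Requests handled by this pod")

if __name__ == "__main__":
    # Serve /metrics on :8000; Prometheus scrapes this for the pod's lifetime.
    start_http_server(8000)
    while True:
        REQUESTS.inc()   # stand-in for real work
        time.sleep(1)

# In Prometheus, rate(myapp_requests_total[5m]) handles counter resets and
# pod replacement, so no last-gasp push or cleanup step is needed.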
 


mhernand...@gmail.com

Jun 26, 2018, 12:57:17 PM
to Prometheus Users
What is the pushgateway actually meant for? From the docs, it says that the PGW is meant for short-lived jobs. Why not push short-lived job metrics to a regular /metrics endpoint rather than to the push gateway? What is the definition of a short-lived batch job? It's unclear to me, especially since I'm dealing with Python Celery workers/tasks. The workers live long, but the tasks themselves are units of execution that we would still like to collect metrics for. Are there any articles that go in depth into the implications of using the push gateway and why exactly one would need to use it instead of the regular pull model?
Thanks in advance.



Meier

Jun 26, 2018, 2:12:49 PM
to Prometheus Users
"regular /metrics endpoint" only support get and respond with metrics, also which instance-endpoint would that be. if its a dedicated service to receive metrics from other instances and make them available via a "regular /metrics endpoint", well there you have defined exactly what the pushgateway is. citing from the readme:

The Prometheus Pushgateway exists to allow ephemeral and batch jobs to expose their metrics to Prometheus. Since these kinds of jobs may not exist long enough to be scraped, they can instead push their metrics to a Pushgateway. The Pushgateway then exposes these metrics to Prometheus.
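(For a concrete picture of that intended use, here is a minimal sketch with the Python client; the gateway address, job name and metric are illustrative. The batch job pushes once, on completion, and the value then remains available for Prometheus to scrape from the Pushgateway after the job process is gone.)

# Sketch: a short-lived batch job recording its result via the Pushgateway.
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

registry = CollectorRegistry()
last_success = Gauge(
    "example_batch_last_success_timestamp_seconds",
    "Unixtime the example batch job last succeeded",
    registry=registry,
)

def run_batch_job():
    pass  # the actual work goes here

if __name__ == "__main__":
    run_batch_job()
    last_success.set_to_current_time()
    # One push per run; the group stays in the Pushgateway until the job is
    # decommissioned and its group is explicitly deleted.
    push_to_gateway("pushgateway.monitoring.svc:9091", job="example_batch",
                    registry=registry)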

mhernand...@gmail.com

Jun 26, 2018, 2:49:57 PM
to Prometheus Users
Thank you for your response. I am quite new to Prometheus, so I'm still trying to wrap my head around this. Could you read the following and give some feedback, please? I would really appreciate it.

Let's assume that we have a web app that runs asynchronous tasks in the background. There is a long-lived worker running in the background that schedules and runs the assigned tasks (Python functions).
Suppose that we would like to collect metrics on those background tasks. Since we are dealing with Python functions, the Prometheus client library would allow us to send metrics to an endpoint rather than to the push gateway, even though the "tasks (Python functions)" are short-lived. Would this eliminate the need for a push gateway?

"Since these kinds of jobs may not exist long enough to be scraped, they can instead push their metrics to a Pushgateway"

Now, referencing the quote above, the jobs may not exist long enough to be scraped. Does this mean that when Prometheus scrapes an endpoint, the endpoint won't contain older metrics (e.g. from 5 minutes ago), so for a short-lived job a metric sent to this endpoint in plain text won't be picked up by Prometheus at all? If that is the case, then I can see why we would need the pushgateway as a "metrics cache".

Meier

Jun 26, 2018, 2:58:29 PM
to Prometheus Users


On Tuesday, June 26, 2018 at 8:49:57 PM UTC+2, mhernand...@gmail.com wrote:
Thank you for your response. I am quite new to Prometheus, so I'm still trying to wrap my head around this. Could you read the following and give some feedback, please? I would really appreciate it.

Let's assume that we have a web app that runs asynchronous tasks in the background. There is a long-lived worker running in the background that schedules and runs the assigned tasks (Python functions).
Suppose that we would like to collect metrics on those background tasks. Since we are dealing with Python functions, the Prometheus client library would allow us to send metrics to an endpoint rather than to the push gateway, even though the "tasks (Python functions)" are short-lived. Would this eliminate the need for a push gateway?

The Prometheus client library doesn't send metrics. Your application will have to provide an endpoint from which the metrics can be fetched (scraped).


"Since these kinds of jobs may not exist long enough to be scraped, they can instead push their metrics to a Pushgateway"

Now, referencing the quote above, the jobs may not exist long enough to be scraped. Does this mean that when Prometheus scrapes an endpoint, the endpoint won't contain older metrics (e.g. from 5 minutes ago), so for a short-lived job a metric sent to this endpoint in plain text won't be picked up by Prometheus at all? If that is the case, then I can see why we would need the pushgateway as a "metrics cache".

As you have probably noticed by now, this is a "pull" model. Usually your application's (think daemon's) metrics endpoint provides the latest value; that value is as old as its last update to the metrics registry in the client library, so there is no need per se for a metrics cache. Your application exposes its metrics on an HTTP endpoint and Prometheus regularly scrapes that endpoint.
In the case of jobs, when your job is done it terminates and thus cannot provide an endpoint for scraping any more. So in that case it has to "push" its metrics somewhere to remain scrapable...
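(To tie this back to the Celery question above: because the worker process itself is long-lived, it can expose per-task metrics on its own endpoint and be scraped normally, with no Pushgateway involved. A minimal sketch, assuming the Python client; the metric names, the task, and the port are illustrative.)

# Sketch: a long-lived worker exposing per-task metrics for normal scraping.
import random
import time
from prometheus_client import Counter, Histogram, start_http_server

TASKS_RUN = Counter("worker_tasks_total", "Tasks executed", ["task_name"])
TASK_SECONDS = Histogram("worker_task_duration_seconds", "Task runtime",
                         ["task_name"])

def run_task(name):
    with TASK_SECONDS.labels(task_name=name).time():
        time.sleep(random.random())   # stand-in for the real task body
    TASKS_RUN.labels(task_name=name).inc()

if __name__ == "__main__":
    start_http_server(9100)  # /metrics lives as long as the worker does
    while True:
        run_task("send_email")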