The possibility of multiple push gateways


Andy Pan

Nov 13, 2020, 3:34:08 AM11/13/20
to Prometheus Users
I've been investigating the Pushgateway for the past few days, and I read this page https://prometheus.io/docs/practices/pushing/ which describes some pitfalls of using it. I noticed that the Pushgateway is a single point of failure, but I was confused: wouldn't deploying multiple machines and load-balancing from the business side solve this problem? Or is it impossible for Prometheus to collect metrics from multiple push gateways, only from a single one?

Stuart Clark

Nov 13, 2020, 4:17:42 AM11/13/20
to promethe...@googlegroups.com
> or it is impossible for Prometheus to collect metrics from multiple push gateways, only for single point?


If you have multiple Pushgateway servers behind a load balancer you
would quickly get meaningless data back.

For example, say you have 2 servers with Prometheus scraping through the
load balancer. Prometheus would probably alternate between scraping each
one (assuming round-robin balancing). If a system pushes a set of new
metrics it would only update one of the two servers. From then on,
every other scrape would return the new data, while the old data (on
the other server) would be returned the rest of the time.

You could have the scrapes come via the load balancer and then have the
metrics-creating process push to both servers, but there would be quite
a bit of complexity, as you'd need to handle things like service
discovery (how do you know which servers to push to, which might include
dynamic changes if one totally fails and is removed from the load
balancer pool) and retries (if there is a temporary failure you need to
retry so the servers don't contain inconsistent data, again causing
meaningless data on the Prometheus side).
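To make that complexity concrete, here is a minimal sketch of such a fan-out pusher in Python. Everything in it (the gateway URLs, the injected `send` callable, the retry counts) is a hypothetical illustration, not part of any official client library:

```python
import time


def push_to_all(gateways, payload, send, retries=3, backoff=0.5):
    """Push the same payload to every Pushgateway instance.

    `send(url, payload)` is any callable that performs the actual HTTP
    request and raises on failure (e.g. a thin wrapper around
    urllib.request). Returns the list of gateways that could not be
    updated even after retrying -- if that list is non-empty the
    gateways are now inconsistent and need reconciliation, which is
    exactly the failure mode described above.
    """
    failed = []
    for url in gateways:
        for attempt in range(retries):
            try:
                send(url, payload)
                break  # this gateway is up to date
            except Exception:
                time.sleep(backoff * (attempt + 1))
        else:
            failed.append(url)  # all retries exhausted
    return failed
```

Note that even with retries, a gateway that stays down leaves the fleet inconsistent, so the caller still has to detect and repair that state somehow.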

The Push gateway does allow you to persist the data stored to disk, so
in the event of a failure a restart wouldn't lose anything, just have an
availability gap (which could of course mean some pushes of new data are
missed). That sort of failure can be fairly easily detected and
rectified by many orchestration systems automatically - for example
liveness probes failing in Kubernetes causing a pod to be rescheduled.

I have also tied the Push gateway more closely to the source of the
metrics. So instead of having a single central service which has to have
100% uptime, have several instances which are used by different pieces
of functionality (e.g. one per namespace or per type of non-scrapable
metrics source). This then reduces the impact of a temporary failure.
There is a small overhead of multiple instances, but it is fairly
lightweight.
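As an illustration of pushing to such a per-namespace instance, here is a small Python helper that builds the push URL following the Pushgateway's documented path convention (`/metrics/job/<job>{/<label>/<value>}`); the hostnames and the helper itself are hypothetical:

```python
from urllib.parse import quote


def push_url(gateway, job, grouping=None):
    """Build the Pushgateway push URL for a job's metric group.

    An HTTP PUT of a text-format metrics body to this URL replaces
    the whole group identified by the job name and grouping labels.
    """
    parts = [gateway.rstrip("/"), "metrics", "job", quote(job, safe="")]
    for label, value in (grouping or {}).items():
        parts += [quote(label, safe=""), quote(value, safe="")]
    return "/".join(parts)


# Each namespace/team pushes to its own lightweight gateway instance,
# so a failure only affects that one slice of metrics:
url = push_url("http://pushgateway.team-a.svc:9091", "nightly_backup",
               {"instance": "db1"})
```

(The official `prometheus_client` libraries wrap this same API, so in practice you would normally use their push functions rather than hand-building URLs.)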

Andy Pan

Nov 15, 2020, 9:51:06 PM11/15/20
to Prometheus Users
Can't Prometheus absorb metrics from multiple pushgateways?
My initial thought was to deploy multiple pushgateways and put a load-balancer in front of them,
then have Prometheus consume all of the pushgateways. Does Prometheus support consuming multiple pushgateways?

Andy Pan

Nov 15, 2020, 9:57:55 PM11/15/20
to Prometheus Users
Is there a right path forward for Prometheus to collect metrics from multiple push gateways directly instead of going through a load-balancer?

Laurent Dumont

Nov 16, 2020, 3:47:57 PM11/16/20
to Andy Pan, Prometheus Users
I don't think there would be any issues with multiple jobs with one target OR one job with multiple targets.

Is the data being exposed by each PG different?
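For instance, a single scrape job with multiple Pushgateway targets might look like the fragment below (the hostnames are placeholders; `honor_labels: true` is the setting that preserves the `job`/`instance` labels set at push time instead of overwriting them with the scrape target's):

```yaml
scrape_configs:
  - job_name: pushgateway
    honor_labels: true   # keep the labels that were set when metrics were pushed
    static_configs:
      - targets:
          - pushgateway-a.example.com:9091
          - pushgateway-b.example.com:9091
```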


Stuart Clark

Nov 16, 2020, 4:12:16 PM11/16/20
to Andy Pan, Prometheus Users
On 16/11/2020 02:57, Andy Pan wrote:
> Is there a right path forward for Prometheus to collect metrics from
> multiple push gateways directly instead of going through a load-balancer?
>

There is nothing special about the push gateway as far as Prometheus is
concerned. It is just a target to scrape.

As mentioned previously you can't use a load balancer in front of a set
of push gateway instances, as otherwise the data will be meaningless.

Andy Pan

Nov 16, 2020, 10:42:30 PM11/16/20
to Prometheus Users
I think you might have misunderstood what I said: the load-balancer is not for Prometheus collecting metrics (Prometheus collects metrics from all push gateways) but for business servers pushing metrics.
In this case, would this be a workaround?

Stuart Clark

Nov 19, 2020, 12:29:52 PM11/19/20
to Andy Pan, Prometheus Users
On 17/11/2020 03:42, Andy Pan wrote:
> I think you might misunderstand what I said, the load-balancer is not
> for Prometheus collection metrics (Prometheus collects metrics from
> all push gateways) but for business servers pushing metrics.
> In this case, would this be the workaround?


If you have 2 push gateways and one thing pushing metrics (for example),
then as you say you could easily scrape both push gateways. In steady
state you now have 2 copies of every metric (differing just by the
instance label).

If you have a load balancer between the pushing app & the two push
gateways whenever you push it will update one of the two push gateways.

You now have one push gateway containing the old metrics and one
containing the new.

You continue to scrape both push gateways and now Prometheus has
ingested those metrics.

How would you query this data?

If you just ignore the 2nd push gateway (e.g. by adding an instance label
filter in your query) then you will miss some of the changes (because
the push gateway you are ignoring was the one that got updated).

If you don't ignore one, you'd end up with two sets of answers to your
query, one based on the data from push gateway 1 & the other from push
gateway 2. At various points in time different answers will be correct,
but you wouldn't know which one.

So you are now in a situation where you have twice the number of metrics
that you actually want, while also not knowing which values are correct.
So not a good situation to be in.

Andy Pan

Nov 20, 2020, 7:25:12 AM11/20/20
to Prometheus Users
Got it! Thank you very much!
About the inconsistent metrics: is it possible to dispatch metrics by hash, putting metrics with the same labels into the same push gateway, thereby avoiding the inconsistency?
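A sketch of that hashing idea in Python (the gateway list and the choice of grouping key are hypothetical): route each grouping key deterministically so pushes for the same job always land on the same gateway. This shards load and keeps each job's data consistent, but as discussed it does not add high availability, since a dead shard's metrics stay unavailable until that gateway restarts:

```python
import hashlib


def pick_gateway(grouping_key, gateways):
    """Deterministically map a grouping key (e.g. the job name plus
    grouping labels) to one Pushgateway, so every push for the same
    key updates the same instance and no gateway holds a stale copy."""
    digest = hashlib.sha256(grouping_key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(gateways)
    return gateways[index]


gateways = ["http://pg-0:9091", "http://pg-1:9091", "http://pg-2:9091"]
```

Using a cryptographic hash (rather than Python's built-in `hash`, which is salted per process) keeps the mapping stable across restarts and across different pushing hosts.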

Stuart Clark

Nov 20, 2020, 7:55:02 AM11/20/20
to Andy Pan, Prometheus Users
On 2020-11-20 12:25, Andy Pan wrote:
> Got it! Thank you very much!
> About the inconsistent metrics, is it possible to dispatch metrics by
> hash and put those metrics with the same labels into the same push
> gateway, avoiding the inconsistent metrics?

You were originally talking about having multiple push gateway servers
to remove single points of failure. Having different instances of push
gateway for different subsets of usage is a fairly normal pattern,
either due to performance, error domain or location reasons, but it
doesn't create any form of high availability.

A "normal" push gateway setup, with auto-restart in case of failure
(e.g. Kubernetes, systemd, auto scaling group), optionally with state
storage on disk is usually adequate to handle failures (which should be
pretty rare). If there was a failure there would be a temporary outage
for however long it takes to restart the application (shouldn't be more
than a few minutes). You would miss data from any attempted pushes during
that outage period.

In general Prometheus doesn't guarantee that metric delivery will
always succeed. Scrape failures do occasionally happen, due to
networking blips, timeouts or other transient issues.

One alternative to the push gateway that some use is the textfile
collector of the node exporter. This is particularly useful for cron
jobs that run on a server. One advantage is that the failure domains are
more tightly coupled - if the server fails the cron job will also fail
alongside the node exporter. Equally if the node exporter fails or has a
blip it doesn't stop the cron job from recording metrics, as that is
just a file creation operation on the server rather than requiring an
API call.
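As a sketch of that textfile-collector approach at the end of a cron job (the metric name and the `.prom` filename are made-up examples; the node exporter must be started with `--collector.textfile.directory` pointing at the chosen directory):

```python
import os
import tempfile
import time


def write_success_metric(directory,
                         name="my_cron_last_success_timestamp_seconds"):
    """Atomically write a .prom file for the node exporter's textfile
    collector. Writing to a temporary file first and then renaming it
    ensures the exporter never reads a half-written file."""
    body = "# TYPE %s gauge\n%s %d\n" % (name, name, int(time.time()))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        f.write(body)
    os.replace(tmp, os.path.join(directory, "my_cron.prom"))
```

A common pattern is to alert when `time() - my_cron_last_success_timestamp_seconds` exceeds the expected cron interval, which catches both job failures and jobs that silently stopped running.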

--
Stuart Clark

Harald Koch

Nov 20, 2020, 11:39:05 AM11/20/20
to Prometheus Users
On Fri, Nov 20, 2020, at 07:54, Stuart Clark wrote:
> A "normal" push gateway setup, with auto-restart in case of failure
> (e.g. Kubernetes, systemd, auto scaling group), optionally with state
> storage on disk is usually adequate to handle failures (which should be
> pretty rare). If there was a failure there would be a temporary outage
> for however long it takes to restart the application (shouldn't be more
> than a few minutes). You would miss data from any attempted pushes during
> that outage period.

Just to emphasize - this is an excellent model for any HA service that can tolerate short outages caused by (rare) application restarts. It's vastly easier to architect, deploy, and manage than a multi-node application.

--
Harald