> metrics from multiple push gateways, only for single point? --
If you have multiple Pushgateway servers behind a load balancer, you
would quickly get meaningless data back.
For example, with two servers and Prometheus scraping through the load
balancer, Prometheus would probably alternate between them (assuming
round-robin balancing). If a system pushes a set of new metrics, only
one of the two servers gets updated. From then on, every other scrape
returns the new data, while the remaining scrapes return the stale data
still held by the other server.
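To make that concrete, here is a small simulation of the scenario, with made-up gateway names and metric values: a push lands on only one of two backends, and round-robin scraping then alternates between the fresh and stale value.

```python
from itertools import cycle

# Simulated metric value held by each Pushgateway instance.
# A push reached only "pushgateway-a", so the two backends disagree.
gateways = {"pushgateway-a": 42.0, "pushgateway-b": 17.0}  # b still has old data

# Round-robin load balancing: successive scrapes hit alternate backends.
backends = cycle(gateways)

scrapes = [gateways[next(backends)] for _ in range(6)]
print(scrapes)  # alternates between the new and the stale value
```

Graphed in Prometheus, that series would flap between the two values on every scrape, which is exactly the "meaningless data" problem.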
You could have the scrapes come via the load balancer and have the
metric-producing process push to both servers directly, but that adds
quite a bit of complexity: you would need to handle service discovery
(how do you know which servers to push to, which might change
dynamically if one fails entirely and is removed from the load balancer
pool) and retries (a temporary failure must be retried, otherwise the
servers end up with inconsistent data, again producing meaningless
results on the Prometheus side).
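A minimal sketch of that fan-out-with-retries logic might look like the following. The gateway names and the `push_fn` callable are placeholders; in real use `push_fn` could wrap something like `prometheus_client.push_to_gateway`, and a permanent failure would need to feed into alerting rather than just be returned.

```python
import time

def push_to_all(gateways, push_fn, retries=3, backoff=0.01):
    """Push the same metrics to every Pushgateway instance, retrying
    transient failures so the backends don't drift out of sync.
    Returns the gateways that still failed after all retries."""
    failed = []
    for gw in gateways:
        for attempt in range(retries):
            try:
                push_fn(gw)
                break  # this gateway is up to date
            except OSError:
                time.sleep(backoff * 2 ** attempt)  # exponential backoff
        else:
            failed.append(gw)  # permanent failure: surface for alerting
    return failed

# Demo with a stub that fails once on one backend, then recovers.
calls = {"gw-a": 0, "gw-b": 0}

def flaky_push(gw):
    calls[gw] += 1
    if gw == "gw-b" and calls[gw] == 1:
        raise OSError("temporary network failure")

failed = push_to_all(["gw-a", "gw-b"], flaky_push)
print(failed)  # empty: the retry absorbed the transient failure
```

Even this toy version shows where the complexity creeps in: backoff policy, distinguishing transient from permanent failures, and deciding what to do when one backend stays inconsistent.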
The Pushgateway does allow you to persist its data to disk (via the
--persistence.file flag), so in the event of a failure a restart
wouldn't lose anything, just leave an availability gap (which could of
course mean some pushes of new data are missed). That sort of failure
can be fairly easily detected and rectified automatically by many
orchestration systems - for example, a failing liveness probe in
Kubernetes causing the container to be restarted.
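As a sketch of that setup, a Kubernetes container spec combining on-disk persistence with a liveness probe might look like this (the volume name and mount path are arbitrary; the /-/healthy endpoint and --persistence.file flag come from the Pushgateway's documentation):

```yaml
containers:
  - name: pushgateway
    image: prom/pushgateway
    args: ["--persistence.file=/data/pushgateway.data"]
    ports:
      - containerPort: 9091
    volumeMounts:
      - name: data
        mountPath: /data
    livenessProbe:
      httpGet:
        path: /-/healthy
        port: 9091
```

With persistence on a volume that survives restarts, a probe-triggered restart costs only the availability gap, not the stored metrics.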
I have also tied the Pushgateway more closely to the source of the
metrics. Instead of a single central service that has to have 100%
uptime, run several instances, each used by a different piece of
functionality (e.g. one per namespace, or one per type of non-scrapable
metrics source). That reduces the impact of any temporary failure.
Running multiple instances carries a small overhead, but the
Pushgateway is fairly lightweight.