Hi Rafał, hi Giedrius,
Thanks for your interest.
Any provable performance improvement that doesn't come with a huge
increase in code complexity will certainly be welcome.
However, I have a few comments below, with the possible conclusion
that the effort isn't really worth it in this case (or that the effort
would be way more involved than you currently anticipate).
On 06.01.25 22:03, Rafał Dowgird wrote:
>
> The current logic for the duplicates check is in client_golang in the
> Gather() method:
>
> https://github.com/prometheus/client_golang/blob/aea1a5996a9d8119592baea7310810c65dc598f5/prometheus/registry.go#L424
> Unfortunately this API can only take a whole set of metrics and answer if
> it's consistent. It does so by calculating hashes for the whole set, which
> in case of Pushgateway leads to quadratic complexity. Pushgateway keeps a
> dynamic set of metrics and needs to keep track of its consistency and you
> cannot do it efficiently using the current Gather() API.
I wrote this code (both the PGW side and the client_golang side) long
ago. My memory might be patchy, but I'll try to recall the rationale
from back then.
Whenever the PGW (or in fact any program instrumented with
prometheus/client_golang) is scraped, the logic implemented in the
Gather() method runs, i.e. at that moment the current state of the
metrics to be exposed is checked for self-consistency. As implemented,
this is linear in the number of metrics (O(n)), so it is generally an
accepted burden and hasn't really been perceived as a problem except
in very specialized edge cases (kube-state-metrics is an (in-)famous
example). Part of the reason is that a scrape happens relatively
rarely (a few times per minute), so the resources needed to serve
metrics are usually negligible compared to the resources needed for
the actual primary task the instrumented program is doing.
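To make the shape of that check concrete, here is a minimal sketch in
Go (my own simplification for this mail, not the actual client_golang
code): hash each metric's identity, i.e. its name plus its sorted
label pairs, and bail out on the first duplicate. One pass over the
set, hence O(n):

    package main

    import (
        "fmt"
        "hash/fnv"
        "sort"
    )

    // metric is a stand-in for the identity of one gathered sample.
    type metric struct {
        name   string
        labels map[string]string
    }

    // separatorByte keeps "ab"+"c" from hashing like "a"+"bc".
    const separatorByte = byte(255)

    // identityHash folds the metric name and its sorted label pairs
    // into a single hash.
    func identityHash(m metric) uint64 {
        h := fnv.New64a()
        h.Write([]byte(m.name))
        h.Write([]byte{separatorByte})
        keys := make([]string, 0, len(m.labels))
        for k := range m.labels {
            keys = append(keys, k)
        }
        sort.Strings(keys)
        for _, k := range keys {
            h.Write([]byte(k))
            h.Write([]byte{separatorByte})
            h.Write([]byte(m.labels[k]))
            h.Write([]byte{separatorByte})
        }
        return h.Sum64()
    }

    // checkConsistency walks the whole set once, so its cost is
    // linear in the number of metrics.
    func checkConsistency(ms []metric) error {
        seen := make(map[uint64]struct{}, len(ms))
        for _, m := range ms {
            h := identityHash(m)
            if _, ok := seen[h]; ok {
                return fmt.Errorf("duplicate metric %q with labels %v", m.name, m.labels)
            }
            seen[h] = struct{}{}
        }
        return nil
    }

    func main() {
        ms := []metric{
            {name: "job_duration_seconds", labels: map[string]string{"job": "backup"}},
            {name: "job_duration_seconds", labels: map[string]string{"job": "backup"}},
        }
        fmt.Println(checkConsistency(ms)) // prints the duplicate error
    }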
So what happens during pushing? What we do in the current code is to
essentially simulate what would happen if the PGW gets scraped with
the newly pushed metrics added to the already existing metrics. This
appears quite costly, but the rationale here is that the same cost
will be paid again when the PGW is scraped for real. While I said
above that scrapes are relatively rare (a few times per minute),
pushes happen even less often.
This means in turn that the effort spent on the consistency check at
push time is small compared to the effort required during scraping,
so making it less expensive buys you little.
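In the same spirit as the sketch above, the push side could be
pictured like this (again my own simplification, reusing the metric
type and checkConsistency from before; simulateScrapeAfterPush is a
made-up name, not anything from the PGW code base): merge the pushed
group into the existing state and run the very same whole-set check,
so each push pays the full linear cost once more.

    // simulateScrapeAfterPush pretends the PGW is scraped right
    // after the push: it merges the freshly pushed metrics into the
    // existing ones and runs the same whole-set consistency check.
    func simulateScrapeAfterPush(existing, pushed []metric) error {
        merged := make([]metric, 0, len(existing)+len(pushed))
        merged = append(merged, existing...)
        merged = append(merged, pushed...)
        return checkConsistency(merged)
    }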
While you are technically right that the consistency check is O(n*m),
with n being the total number of metrics in the PGW and m being the
number of pushes, I doubt that this is the relevant metric to look
at. In the same way, you could say that scraping is O(n*m), with n
being the total number of metrics and m being the number of
scrapes. As long as you scrape more often than you push, you also
have to change the whole way scraping works to actually make a
dent. (This is what kube-state-metrics did. They removed all layers
of abstraction and are now rendering the metrics output directly. In
the PGW case, you could probably follow a less radical approach, but
you would still break contracts like Gather() being responsible for
the final self-consistency check.)
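To make that concrete with made-up numbers: with 10,000 metrics in
the PGW, one push per minute, and a scrape every 15 seconds, the
push-time checks hash 10,000 metric identities per minute while the
scrapes hash 40,000 in the same minute, so even dropping the
push-time check entirely would remove only about a fifth of the
hashing work.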
Unless of course you are using the PGW for a use case it is not
designed for. You quoted me saying "Pushgateway is not meant to be
high performance", but that's not what I said. Pushgateway performs
just fine for the use case it was designed for. If you really use it
in a situation where you push more often than you scrape, I would be
concerned about more things than just performance. You are now
funneling a whole lot of metrics through a SPOF. The PGW has no HA
story whatsoever, following the idea that it is for metrics that only
update a few times a day or so, so that nothing bad will happen if it
has some downtime. If a huge number of your frequently updating
metrics are lost while the PGW is down, you have a bigger problem than
performance.
--
Björn Rabenstein
[PGP-ID] 0x851C3DA17D748D03
[email] bjo...@rabenste.in