On 24.06.19 19:25, nick noto wrote:
> It also might be sufficient for us if we can just exclude older results from a
> PromQL query, rather than delete them entirely.
> For example, if we can use the push_time_seconds metric to find Pushers that
> have pushed within X minutes using a query like: push_time_seconds{exported_job
> =~"WorkerPusher.+"} > (time() - X)
>
> Is there any way to construct a PromQL query that uses the output of the query
> above to grab the Pusher names from the exported_job label and then use them to
> query the "JobStats" metric with each label? (JobStats{exported_job=
> WorkerPusher1}, JobStats{exported_job=WorkerPusher2, etc)
You could probably craft such a query, but it would be a complicated one.
I'd rather try to set up things in a way that you don't have to depend
on the push time.
Another way to look at it is that excluding metrics based on push time
is just a more complicated way of implementing the infamous TTL of
metrics. See
https://groups.google.com/forum/#!topic/prometheus-developers/9IyUxRvhY7w
for the related discussion on prometheus-developers.
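That said, here is a sketch of how such a query could look, using the
"and" set operator matched on the exported_job label (untested, and
assuming a freshness window of 5 minutes and that JobStats carries the
same exported_job label as push_time_seconds):

```promql
JobStats
  and on(exported_job)
    (push_time_seconds{exported_job=~"WorkerPusher.+"} > time() - 300)
```

This keeps only the JobStats series whose exported_job also appears in
a recent push, but you'd have to repeat that pattern for every metric
you query, which is why I'd avoid depending on push time.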
> Each worker pod that's launched for the job needs to create a Pusher, and
> each Pusher requires a unique "Pusher Job" ID.
> For example, if we launch three workers for the job, we need three Pushers
> "WorkerPusher1", "WorkerPusher2", and "WorkerPusher3" that are declared in
> each worker's code respectively.
> If we query the "JobStats" metric, we can see the data for all three pods.
> The data was pushed to the Push Gateway, and Prometheus was able to pull
> the data successfully.
> If on a subsequent run, we happen to only need two workers for the job,
> we'd only need two Pushers, "WorkerPusher1" and "WorkerPusher2".
>
> However, since the metrics are not deleted from the Push Gateway, it seems
> like Prometheus will continue to pull the data that was pushed to the Push
> Gateway by "WorkerPusher3" indefinitely.
> I had thought that it might only pull the values once, so if the 3-worker
> job and 2-worker job were two weeks apart, we wouldn't see the data for
> "WorkerPusher3" if we only queried for the time frame around the 2-worker
> job, but it seems that Prometheus continues to pull the data for
> "WorkerPusher3" as if it's new data.
>
> Each time our Kubernetes job is run, it has no knowledge of the number of
> workers used in the previous job, so it isn't possible to issue delete
> calls to remove the data from the Push Gateway from previous job runs (not
> that the Golang package seems to support the individual Delete calls
> anyway).
I think the problem is that you have a distributed workload, and the
PGW is not really suited for aggregations of any kind (see the first
paragraph about the non-goals in the README.md). You might want to
consider the detour via the statsd_exporter or the aggregation gateway
mentioned in that paragraph.
Another way out would be to let your job scheduler (whatever manages
the various workers) report the results to the PGW in one push, once
it has concluded that all workers are done.
An ugly work-around would be to make the worker-pusher name part of
the metric name so that you can use POST pushes to add to the same
grouping key. You would still need to issue a DELETE call before
starting a new batch job, and dealing with the differently named
metrics would be cumbersome.
> Ideally, we could just clear the Push Gateway out before we start up a new
> job. Is there any command or any way to completely clear the Push Gateway
> that isn't too hacky?
You can shut down the PGW, delete the persistence file, and start it
up again.
I guess this qualifies as "hacky".
I think there is no harm in adding an endpoint to wipe the storage. I
have just filed
https://github.com/prometheus/pushgateway/issues/264 .
However, I would always see that as some maintenance or repair task,
not as something that should be part of your regular routine.
--
Björn Rabenstein
[PGP-ID] 0x851C3DA17D748D03
[email]
bjo...@rabenste.in