Best option for short-lived jobs instead of pushgateway?

Lucas Lobosque

Mar 12, 2022, 3:54:35 PM
to Prometheus Users
Hi, I have zero to many crawlers running at any given time, where each crawler is a Docker container. I have a lot of metrics related to crawling, but let's stick to downloaded bytes.

Metrics are sent just before the process shuts down.

I want to use Prometheus + Grafana to build dashboards and alerts for this metric. I thought that Pushgateway was perfect for my use case here, since it acts as a proxy to aggregate and expose metrics from short-lived processes.
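
For concreteness, the push-at-shutdown pattern looks roughly like this with the official Python client (the gateway address, job name, and metric name here are just illustrative placeholders):

    from prometheus_client import CollectorRegistry, Counter, push_to_gateway

    registry = CollectorRegistry()
    # The Python client exposes this counter as crawler_downloaded_bytes_total
    downloaded = Counter('crawler_downloaded_bytes',
                         'Bytes downloaded during this crawl',
                         registry=registry)

    downloaded.inc(1234567)  # accumulated over the lifetime of the crawl

    # Pushed once, just before the process exits; Pushgateway then keeps
    # exposing this value to every later Prometheus scrape.
    push_to_gateway('pushgateway:9091', job='crawler', registry=registry)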

However, I noticed that once the job finishes, the downloaded-bytes value for that crawler never goes away: Pushgateway keeps exposing it, so it shows up as a continuing line instead of a single data point.

I came across an issue on Pushgateway concluding that this behavior is by design and will not change: https://github.com/prometheus/pushgateway/issues/19

So, for my specific use case, what should I use to aggregate metrics from these different jobs, so that data points are generated only while the job is alive, and not forever?

Matthias Rampke

Mar 12, 2022, 4:13:52 PM
to Lucas Lobosque, Prometheus Users
Prometheus does not really deal in single points; many queries won't work on them. You can record the finished crawl as an event, in a system of your choice that handles events (any database, or a log aggregator).

Or, if your crawlers live for a while, treat them as "long" running: make them expose metrics continuously using the appropriate client library, and have Prometheus discover them as they come and go. The limitation here is how fast you churn through instance labels, and what the overall cardinality is. If a crawler lives for hours, that's going to work fine; minutes, maybe; seconds, probably not.
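
A minimal sketch of that with the Python client (the port and metric name are illustrative, and fetch_next_page is a hypothetical stand-in for your actual crawl step):

    from prometheus_client import Counter, start_http_server

    downloaded = Counter('crawler_downloaded_bytes',
                         'Bytes downloaded by this crawler')

    start_http_server(8000)  # Prometheus scrapes http://<container>:8000/metrics

    while True:
        chunk = fetch_next_page()   # hypothetical: one unit of crawl work
        downloaded.inc(len(chunk))  # grows continuously while the crawler lives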

If you have a way of identifying successions of crawlers, you could use relabeling to model these as "instances" that just happen to be different containers over time. For example, if a given container crawls a specific category of … somethings (even if the "category" is only a sharding key), and later another container will do the same thing, you can relabel that category into the instance label, making sure not to have any other per-crawler-container labels that blow up the cardinality. This way, even though the individual crawler process is short-lived, you treat a slightly higher level as the "instance". This very much depends on the specifics of your crawling process, though, which you did not specify.
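
As a sketch of that relabeling (assuming Docker service discovery and a hypothetical container label "crawl_category" carrying the sharding key):

    scrape_configs:
      - job_name: crawlers
        docker_sd_configs:
          - host: unix:///var/run/docker.sock
        relabel_configs:
          # Promote the stable category to the instance label, so successive
          # containers crawling the same category continue the same series.
          - source_labels: [__meta_docker_container_label_crawl_category]
            target_label: instance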

/MR

Brian Candler

Mar 13, 2022, 5:28:34 AM
to Prometheus Users
If what you're interested in is the total number of download jobs and the total number of downloaded bytes - and not which particular job downloaded how many bytes - then you could use statsd_exporter. It's like Pushgateway, but it can add values to a counter, rather than just replacing them. Then Prometheus can scrape the statsd counter. This works in many more scenarios, including when multiple download jobs occur between a pair of scrapes.
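
As a sketch, a finishing job could report its bytes with a plain StatsD counter line over UDP (statsd_exporter listens on port 9125 by default; the exporter address and metric name here are illustrative):

    import socket

    def report_downloaded_bytes(n, host='statsd-exporter', port=9125):
        # StatsD counter syntax: <name>:<value>|c - statsd_exporter adds
        # this to its running total rather than replacing it.
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.sendto(f'crawler_downloaded_bytes:{n}|c'.encode(), (host, port))

    report_downloaded_bytes(1234567)  # once per finished download job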

What you *don't* want is a separate Prometheus timeseries per download job; that way lies cardinality explosion, plus the problems you've already identified about where the timeseries "starts" and "ends". It also makes it very hard to do aggregate calculations for reporting.

If you do need to report individually on each job and its number of downloaded bytes, then you're better off using an event logging system such as Loki or Elasticsearch.
