Efficient way to query non-active time series's last value

Peter S

Aug 31, 2020, 1:45:13 PM
to Prometheus Users
Hi,

We've been using quite a lot of time series to represent the states of some services. The values of these time series don't change often, but we have to export and scrape the same value all the time to keep the series active and usable in alerts.

Is there a way we could optimize this? Ideally, there should be an efficient query to return a time series's last recorded value, going back a range of time. If there is a way to do this, it would be great.

Thanks,

Regards,

Peter

Brian Candler

Aug 31, 2020, 3:01:49 PM
to Prometheus Users
There's no need to optimise this.  Just keep scraping the same value repeatedly.  Prometheus' delta compression is highly efficient, extremely so when the values don't change: the delta between adjacent values is zero, and the delta between scrape timestamps is roughly constant.  Besides, storage these days is very cheap.
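
To make the compression argument concrete, here is a minimal Python sketch of the two ideas mentioned above: differences between timestamps, and differences between adjacent values. It is a simplified illustration, not Prometheus's actual encoder; the function names and the sample series are hypothetical, and the XOR of float bit patterns stands in for "the delta between adjacent values".

```python
import struct

def delta_of_delta(timestamps):
    """Return the second-order differences of a list of timestamps."""
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return [b - a for a, b in zip(deltas, deltas[1:])]

def xor_values(values):
    """XOR the IEEE 754 bit patterns of adjacent float samples."""
    bits = [struct.unpack(">Q", struct.pack(">d", v))[0] for v in values]
    return [a ^ b for a, b in zip(bits, bits[1:])]

# Hypothetical series: one unchanged state value scraped every 15 s.
timestamps_ms = [1598900000000 + i * 15000 for i in range(6)]
values = [1.0] * 6

print(delta_of_delta(timestamps_ms))  # all zeros for a perfectly regular interval
print(xor_values(values))             # all zeros when the value never changes
```

A run of zeros like this is what an encoder can store in very few bits per sample, which is why repeated identical scrapes cost little on disk.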

It's also semantically useful to keep all the scrapes.  There is a difference between "the value was known to be X at time T", and "the value at time T was not recorded; maybe it was the same as at time T-1, maybe not"

P Shan

Aug 31, 2020, 8:55:15 PM
to Brian Candler, Prometheus Users
Thanks. Unfortunately, exporting and scraping the same values has become costly for us. We have metrics endpoints of 50MB+, and scrapes have begun to time out more and more often.

--
You received this message because you are subscribed to a topic in the Google Groups "Prometheus Users" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/prometheus-users/LTP8_hOgfz0/unsubscribe.
To unsubscribe from this group and all its topics, send an email to prometheus-use...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/9f5c94dc-1047-49df-a839-b14f4b997d4ao%40googlegroups.com.

Brian Candler

Sep 1, 2020, 5:01:47 AM
to Prometheus Users
On Tuesday, 1 September 2020 01:55:15 UTC+1, Peter S wrote:
Thanks. Unfortunately, exporting and scraping the same values has become costly for us. We have metrics endpoints of 50MB+, and scrapes have begun to time out more and more often.


Sorry, can you explain what you mean by "metrics endpoints of 50MB+" ?  Where are you measuring 50MB exactly?

If you have 50 million timeseries, that's huge.  But I don't think that's what you mean.

If you are returning 50MB of prometheus line-format data in a single scrape, that's quite a lot, but it will compress to very little in the TSDB if the values are not changing.

What's important to prometheus is not the volume of the scrape, but the number of active timeseries.  Timeseries are active if they're in the head, which means a sample has been seen in the last ~2 hours.  Leaving gaps in the timeseries, when the gaps are less than 2 hours, is not going to save you any TSDB resources at all, but will cause you problems with staleness at query time.
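
The staleness problem can be illustrated with a small Python sketch. It assumes a lookback window of 300 seconds, mirroring Prometheus's default 5-minute lookback for instant queries, and it deliberately ignores staleness markers and exact boundary handling. The function name and sample data are hypothetical.

```python
import bisect

LOOKBACK_SECONDS = 300  # assumed to mirror the default 5-minute lookback

def instant_value(samples, t, lookback=LOOKBACK_SECONDS):
    """Return the newest sample value no older than `lookback` at time t.

    `samples` is a list of (timestamp_seconds, value) sorted by timestamp.
    Returns None when the series has no sufficiently recent sample.
    """
    timestamps = [ts for ts, _ in samples]
    i = bisect.bisect_right(timestamps, t) - 1  # newest sample at or before t
    if i < 0:
        return None
    ts, value = samples[i]
    return value if t - ts <= lookback else None

# Hypothetical series that stopped being exported after t=30.
samples = [(0, 1.0), (15, 1.0), (30, 1.0)]

print(instant_value(samples, 200))  # 1.0, the sample at t=30 is 170 s old
print(instant_value(samples, 400))  # None, the sample is now 370 s old
```

Under these assumptions, a series that is only exported on change disappears from instant queries and alert expressions once its last sample falls outside the window, even though the underlying state has not changed.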

What are you trying to optimise: the volume of TSDB storage, or the volume of network traffic?  If it's network traffic then you might be better off having a local prometheus server right next to where the data is collected.  You can either query it directly, or via promxy, or use something like Thanos.  In any of these cases, the only traffic will be the query request/response.

You could also use remote_write to forward data to a central server such as VictoriaMetrics, although I have not measured how the volume of remote_write traffic compares with the volume of prometheus line protocol traffic.

Another option to consider would be to use statsd_exporter or possibly pushgateway, and have those local to your prometheus server.  The remote metrics updates would be done via statsd or pushgateway updates, and when they don't change, prometheus just scrapes the same value.

Finally, it would be pretty easy to write a proxy which is tailored to your requirements: incoming scrape performs outbound scrape, merges the results into a cache, and then returns the whole cache contents.
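
The merge-into-a-cache step of such a proxy might look like the following Python sketch. It is only an outline under stated assumptions: the upstream exporter is assumed to return just the series that changed, lines are assumed to carry no trailing timestamps, and the HTTP fetching and serving are left out. `parse_exposition` and `ScrapeCache` are hypothetical names, not part of any Prometheus library.

```python
def parse_exposition(text):
    """Parse simple text-format lines into {series: value_string}.

    Skips blank lines and '#' comment lines. Assumes each sample line is
    'series value' with no trailing timestamp (a simplification).
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        series, _, value = line.rpartition(" ")
        samples[series] = value
    return samples

class ScrapeCache:
    """Holds the last known value of every series seen so far."""

    def __init__(self):
        self._samples = {}

    def merge(self, upstream_text):
        # Newer values overwrite older ones; untouched series are kept.
        self._samples.update(parse_exposition(upstream_text))

    def render(self):
        # Return the whole cache in text format, one sample per line.
        return "".join(f"{s} {v}\n" for s, v in sorted(self._samples.items()))

cache = ScrapeCache()
cache.merge('service_state{name="a"} 1\nservice_state{name="b"} 0\n')
cache.merge('service_state{name="b"} 1\n')  # only the changed series arrives
print(cache.render(), end="")
```

Each incoming scrape would call `merge` with the upstream response and then return `render()`, so Prometheus keeps seeing every series on every scrape while the upstream side only sends changes.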

Peter S

Sep 1, 2020, 11:37:49 AM
to Prometheus Users
We measured it with `curl <metrics_endpoint> | wc`. Also, `scrape_samples_scraped` reports that 400k metrics are exported and scraped.

TSDB is fine. Sorry I wasn't clear. Network traffic has become the bottleneck. Even though the exporter and Prometheus are colocated on the same machine, scrapes have begun timing out more and more often. Next, we are thinking of increasing the scrape interval (15s) to buy us some time.

What we really want, in an ideal world, is for only state changes to be exported and scraped, with an efficient way to query the last reported state. That would save all this network traffic (and storage, although that's not an issue for us) and make the system much more scalable.

Thanks,

Peter

Brian Candler

Sep 1, 2020, 1:35:27 PM
to Prometheus Users
On Tuesday, 1 September 2020 16:37:49 UTC+1, Peter S wrote:
TSDB is fine. Sorry I wasn't clear. Network traffic has become the bottleneck. Even though the exporter and Prometheus are colocated on the same machine, scrapes have begun timing out more and more often. Next, we are thinking of increasing the scrape interval (15s) to buy us some time.


prometheus can run multiple scrapes in parallel - so the solution may be as simple as breaking the data into (say) 16 chunks, and having a job which scrapes 16 endpoints.  There is usually some natural way in which the data breaks down into smaller groups which logically belong together.

Failing that, you could look at a "push" based approach rather than "pull" based, for example VictoriaMetrics supports this.
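
The chunking idea above (one large endpoint split into, say, 16 smaller ones) can be sketched in Python as follows. Where no natural grouping exists, a stable hash of the series identifier is one fallback; that choice, the function names, and the sample lines are assumptions for illustration. How each shard is then exposed as its own scrape target is left out.

```python
import zlib

def shard_for(series, num_shards=16):
    """Map a series identifier to a stable shard index in [0, num_shards)."""
    return zlib.crc32(series.encode("utf-8")) % num_shards

def split_into_shards(lines, num_shards=16):
    """Distribute 'series value' lines across num_shards lists."""
    shards = [[] for _ in range(num_shards)]
    for line in lines:
        series = line.rpartition(" ")[0]  # everything before the value
        shards[shard_for(series, num_shards)].append(line)
    return shards

# Hypothetical exporter output: 1000 state series.
lines = [f'service_state{{id="{i}"}} 1' for i in range(1000)]
shards = split_into_shards(lines)

print(len(shards))                  # 16
print(sum(len(s) for s in shards))  # 1000, every line lands in exactly one shard
```

Because the hash is stable, a given series always lands in the same shard, so each of the 16 endpoints serves a consistent subset and Prometheus can scrape them in parallel.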

Ben Kochie

Sep 1, 2020, 4:39:19 PM
to Peter S, Prometheus Users
Prometheus attempts to use gzip http compression by default, but as you say, your exporter is local.

Your 400k samples per scrape is pretty far out of bounds for a normal setup. Prometheus scales by scraping many small requests in parallel. Typically I recommend 50k samples per scrape as an absolute maximum, and more than 10k samples per scrape is "large but OK".

It sounds like you've either got some metrics with excessive cardinality, or you're pre-aggregating data for Prometheus. Both go against best practices and are going to lead you into trouble long-term.

Without more understanding of what you're really doing, it's hard to say. But it's definitely not how Prometheus is designed to be used and you're suggesting workarounds for problems you shouldn't have in the first place.
