Guaranteed ingestion of metrics with historical timestamps

Jeremy Collette

Jun 18, 2022, 6:47:49 AM
to Prometheus Users
Hello,

We have written a custom exporter that exposes metrics with explicit timestamps, which Prometheus periodically scrapes. In the case where Prometheus becomes temporarily unavailable, these metric samples will be cached in the exporter until they are scraped, causing affected metrics to age. 

I understand that if a metric is older than a certain threshold, it will be rejected by Prometheus with the message: "Error on ingesting samples that are too old or are too far into the future".

I'm trying to understand if there are any guarantees surrounding the ingestion of historical metrics. Is there some metric sample age that is guaranteed to be recent enough to be ingested? For example, are samples with timestamps within the last hour always going to be considered recent? Within the last five minutes?

According to this previous thread: Error on ingesting samples that are too old, MR seems to indicate that samples with timestamps even one second in the past can be dropped for being too old. Is this interpretation correct? If so, is there any way to ensure that metrics with timestamps won't be dropped for being too old?


Cheers,

Jeremy

Stuart Clark

Jun 18, 2022, 7:13:47 AM
to Jeremy Collette, Prometheus Users

Timestamps in metrics are not something that should be used except in some very specific cases. The main use case for adding a timestamp is when you are scraping metrics into Prometheus that have been sourced from another existing metrics system (for example things like the Cloudwatch Exporter). You also mention something about your exporter caching things until they are scraped, which also sounds like something that is not advisable. The behaviour of an exporter shouldn't really change depending on the requests being received (or not received).

An exporter is expected to return the various metrics that reflect "now", in the same way that a directly instrumented application would be expected to return the current state of the metrics being maintained in memory.

For a simple exporter the normal mechanism is for a request to be received which then triggers some mechanism to generate the metrics. For example with something like the MySQL Exporter a request would trigger a query on the connected database which then returns various information that is converted into Prometheus metrics and returned.

In some situations the process to fetch information from the underlying system can be quite resource intensive or slow. In that case a common design is to decouple the information fetching process from the request handling process. One example is to perform the information fetching process on a periodic timer, with the information fetched then stored in memory. The request process then reads and returns that information - returning the same values for every request until the next cycle of the information fetching process.

In none of these standard scenarios would you expect timestamps to be attached to the returned metrics.
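
To make that concrete, here is a minimal sketch of that decoupled pattern using the Python client library (the metric name and the fetch function are invented for illustration):

import threading
import time

from prometheus_client import Gauge, start_http_server

# The cached state lives in an ordinary client metric; scrapes just read it.
QUEUE_DEPTH = Gauge('myapp_queue_depth', 'Current queue depth (hypothetical metric)')

def fetch_from_backend():
    # Placeholder for the slow/expensive call to the underlying system.
    return 42.0

def refresh_loop(interval_seconds=30):
    # Refresh the cached value on a timer, independent of scrape requests.
    while True:
        QUEUE_DEPTH.set(fetch_from_backend())
        time.sleep(interval_seconds)

threading.Thread(target=refresh_loop, daemon=True).start()
start_http_server(8000)  # every scrape returns the latest cached value, with no timestamps
while True:
    time.sleep(60)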

It would be good to hear a bit more about what you are trying to do, as it is highly likely that timestamps are not the right option for your use case and they should just be dropped.

-- 
Stuart Clark

Ben Kochie

Jun 18, 2022, 7:29:42 AM
to Stuart Clark, Jeremy Collette, Prometheus Users
For this use case, it's likely what they want is Prometheus in agent mode, which uses remote write, which can buffer and catch up.
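
Roughly (a sketch, assuming Prometheus 2.32 or later; the target and remote write URL are placeholders):

prometheus --enable-feature=agent --config.file=prometheus.yml

# prometheus.yml (agent mode): scraping plus remote_write only;
# no local querying, rules, or alerting are available in this mode.
scrape_configs:
  - job_name: custom-exporter            # hypothetical job
    static_configs:
      - targets: ['localhost:9100']      # placeholder target
remote_write:
  - url: https://central-prometheus.example/api/v1/write   # placeholder URL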


Jeremy Collette

Jun 21, 2022, 8:26:38 PM
to Prometheus Users
Hi Stuart and sup (email appears to be truncated),

Thanks for your responses. 

> Timestamps in metrics are not something that should be used except in some very specific cases.

Stuart, after reading through your reply I did some research and talked with my team to understand why we are emitting metrics with timestamps. It seems that we are using gauges incorrectly in an effort to simplify our exporter. For example, we are using gauges for HTTP request duration, where each sample value is a request duration in milliseconds. My colleagues were under the assumption that if we had multiple requests to the same endpoint during the scraping interval, we would need to expose two separate gauge samples or we would have data loss. For example, if we just take the last latency value, there could be an intermediate value (during the same scraping interval) with a higher latency (that may have triggered an Alerting rule) that we might have missed. This is why we are exporting metrics with timestamps: to expose multiple gauge samples with the same labels in the same scrape response.

After reading some more Prometheus documentation, this appears to be poor practice. Instead, I now understand that we should be using a histogram or summary in this scenario.
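
For reference, this is the shape we're now considering (a sketch using the Python client; the metric name, bucket defaults and event structure are hypothetical):

from prometheus_client import Histogram, start_http_server

REQUEST_DURATION = Histogram(
    'http_request_duration_seconds',
    'HTTP request duration in seconds',
    ['endpoint'],
)

def on_partner_event(event):
    # Observe every request as it arrives: all samples within a scrape
    # interval land in the histogram buckets, so nothing is lost.
    REQUEST_DURATION.labels(endpoint=event['endpoint']).observe(
        event['duration_ms'] / 1000.0)  # store seconds, per convention

start_http_server(8000)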

> You also mention something about your exporter caching things until they are scraped, which also sounds like something that is not advisable.

Our metric samples are based on events that are emitted by partner team components. To collect these samples, we built an exporter that listens for partner team events and caches them. Upon being scraped, these events are turned into metrics that are ingested into Prometheus. The reason we purged the scraped metrics was that they were all being emitted with distinct timestamps. However, if we implement support for histograms and summaries as I mentioned above, we can remove the timestamps from our metrics and thus continuously emit all metrics, taking the last known sample as the value.

> For this use case, it's likely what they want is Prometheus in agent mode, which uses remote write, which can buffer and catch up.

sup, we require local Alerting / Querying to be available in our Prometheus instance. I believe "Agent" mode does not support this. 


Cheers,

Jeremy

Ben Kochie

Jun 22, 2022, 3:28:15 AM
to Jeremy Collette, Prometheus Users
On Wed, Jun 22, 2022 at 2:26 AM Jeremy Collette <jeremy.c...@gmail.com> wrote:
> Hi Stuart and sup (email appears to be truncated),
>
> Thanks for your responses.
>
> > Timestamps in metrics are not something that should be used except in some very specific cases.
>
> Stuart, after reading through your reply I did some research and talked with my team to understand why we are emitting metrics with timestamps. It seems that we are using gauges incorrectly in an effort to simplify our exporter. For example, we are using gauges for HTTP request duration, where each sample value is a request duration in milliseconds. My colleagues were under the assumption that if we had multiple requests to the same endpoint during the scraping interval, we would need to expose two separate gauge samples or we would have data loss. For example, if we just take the last latency value, there could be an intermediate value (during the same scraping interval) with a higher latency (that may have triggered an Alerting rule) that we might have missed. This is why we are exporting metrics with timestamps: to expose multiple gauge samples with the same labels in the same scrape response.
>
> After reading some more Prometheus documentation, this appears to be poor practice. Instead, I now understand that we should be using a histogram or summary in this scenario.

Definitely use a Histogram. Summaries are, IMO, legacy. They cannot be aggregated over multiple instances; Histograms can be aggregated. It's also best practice to convert all event durations to seconds. Don't worry, Prometheus uses float64, which will not give you any precision loss. You can convert back to nano, micro, milli, or years later in Grafana.
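
For example, assuming a http_request_duration_seconds histogram, a fleet-wide 99th percentile can be computed by aggregating the buckets across all instances:

histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))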
 

> > You also mention something about your exporter caching things until they are scraped, which also sounds like something that is not advisable.
>
> Our metric samples are based on events that are emitted by partner team components. To collect these samples, we built an exporter that listens for partner team events and caches them. Upon being scraped, these events are turned into metrics that are ingested into Prometheus. The reason we purged the scraped metrics was that they were all being emitted with distinct timestamps. However, if we implement support for histograms and summaries as I mentioned above, we can remove the timestamps from our metrics and thus continuously emit all metrics, taking the last known sample as the value.
>
> > For this use case, it's likely what they want is Prometheus in agent mode, which uses remote write, which can buffer and catch up.
>
> sup, we require local Alerting / Querying to be available in our Prometheus instance. I believe "Agent" mode does not support this.

Yes, that's correct. Agent mode cannot run queries/rules. It's meant to be as lightweight as possible, acting as a forwarder.

My second guess was going to be that you had events that should be aggregated into a histogram. :-)
 



Omar Khazamov

Oct 7, 2022, 5:18:20 AM
to Prometheus Users
Hi Stuart, 

I can see that support for timestamps was discontinued around November 21st, 2021. Indeed, when I try

C02G74F9Q6LR bat-datapipeline % echo "test_metric_with_timestamp 33 1665623039" | curl --data-binary @- https://<pushgatewayURL>/metrics/job/pushgateway-job

I get "pushed metrics are invalid or inconsistent with existing metrics: pushed metrics must not have timestamps".

Could you please explain how you use timestamps in metrics? Thanks

Stuart Clark

Oct 7, 2022, 7:30:22 AM
to Omar Khazamov, Prometheus Users

As mentioned before, timestamps in general should not be used.

You should always be publishing the "latest" value of any metric when Prometheus scrapes the endpoint (or the push gateway in this case).
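
For example, your earlier push should be accepted once the timestamp is dropped (same placeholder URL):

echo "test_metric_with_timestamp 33" | curl --data-binary @- https://<pushgatewayURL>/metrics/job/pushgateway-job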

-- 
Stuart Clark

Omar Khazamov

Oct 25, 2022, 1:25:23 PM
to Stuart Clark, Prometheus Users
Thanks, I'm importing metrics from our internal metrics system. Do you have any advice on how to push with explicit timestamps?
--
Thanks,
Omar Khazamov

Stuart Clark

Oct 25, 2022, 2:21:34 PM
to Omar Khazamov, Prometheus Users
If you are trying to interface with another metrics system the Push Gateway isn't the right tool. The main use case for the Push Gateway is for batch jobs that aren't able to be directly scraped, but still have useful metrics. For systems which are constantly running you should instead look at direct instrumentation or the use of exporters.
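
As a sketch of that batch job pattern with the Python client (the job name and gateway address are hypothetical):

from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

registry = CollectorRegistry()
last_success = Gauge(
    'my_batch_job_last_success_timestamp_seconds',
    'Unixtime the batch job last succeeded',
    registry=registry,
)
last_success.set_to_current_time()
# Push once at the end of the job; Prometheus then scrapes the gateway.
push_to_gateway('pushgateway.example:9091', job='my-batch-job', registry=registry)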

Is this a custom metrics system, or something off the shelf and common? If so, there might already be an exporter available.

If you do need to make a custom exporter, I'd suggest looking at some of the similar existing ones (for example the Cloudwatch exporter) to see how they are made - but basically, when a scrape request is received, API calls are made to your other metrics system to fetch the latest values, which are converted to Prometheus format (including the timestamp of that latest value from the other metrics system) and returned. Prometheus would regularly scrape that exporter and add new values on a regular basis.
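
A very rough sketch of that shape with the Python client's custom collector API (the external fetch call and metric name are invented):

import time

from prometheus_client import start_http_server
from prometheus_client.core import REGISTRY, GaugeMetricFamily

def fetch_latest_from_other_system():
    # Placeholder for an API call to the other metrics system, returning
    # the latest value and its original unix timestamp in seconds.
    return 1.0, time.time() - 30

class OtherSystemCollector:
    def collect(self):
        # Runs on every scrape: fetch the latest value and expose it
        # together with the timestamp it had in the source system.
        value, ts = fetch_latest_from_other_system()
        g = GaugeMetricFamily('other_system_some_metric',
                              'Latest value fetched from the external system')
        g.add_metric([], value, timestamp=ts)
        yield g

REGISTRY.register(OtherSystemCollector())
start_http_server(8000)
while True:
    time.sleep(60)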

Alternatively, if the existing metric system already has extensive historical data which you'd like to be able to query (for dashboards and alerts) take a look at the remote read system. With this option Prometheus would use the remote system as an additional data source, running queries as needed (based on the PromQL queries it receives), combining the data with local information as needed. There are already remote read integrations available for some data stores.
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

Omar Khazamov

Oct 28, 2022, 4:44:26 PM
to Stuart Clark, Prometheus Users
Thank you. 

> Alternatively, if the existing metric system already has extensive historical data which you'd like to be able to query (for dashboards and alerts) take a look at the remote read system.

This is probably a silly question, but is it also true for remote write? I may use the Prometheus-compatible remote storage VictoriaMetrics, and it looks like it supports only remote write.
--
Thanks,
Omar Khazamov

Stuart Clark

Oct 30, 2022, 4:12:41 AM
to Omar Khazamov, Prometheus Users

Remote read & remote write are complementary but different.

Remote write will send a copy of the metrics you have just scraped to an external system. This could be some sort of metrics storage system, but could also be something like a machine learning analytics tool.

Remote read allows Prometheus to query an external system any time a PromQL request is made. Whatever data is returned is merged into any local data and presented to the requester. Again this could be some sort of metrics store, but could also be something different like a forecasting system or an event store.

Support for remote read & write is up to the external system. While for the use case of an external metrics store (for long term or global storage) it makes sense to support both, there are plenty of use cases which only require one or the other.
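
Both are configured in prometheus.yml; the URLs below are placeholders, and whether a given backend accepts them depends on what it implements:

remote_write:
  - url: https://metrics-store.example/api/v1/write   # placeholder
remote_read:
  - url: https://metrics-store.example/api/v1/read    # placeholder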

-- 
Stuart Clark