Metrics timeseries lag on master prometheus of federated metrics


Raghavendra k

Jun 10, 2020, 2:23:15 AM
to Prometheus Users

We have an Azure Kubernetes cluster with the applications running as pods. On the same cluster, in a different namespace, we run Prometheus to monitor the health of the applications.

We also have a centralized monitoring service (Prometheus again), hosted on AWS as Kubernetes pods.

With two different hosted solutions, Azure (for applications) and AWS (for centralized monitoring), we federate the Prometheus running on the Azure cluster from the Prometheus running on the AWS cluster.
Federation is working fine, but there is a small lag (somewhere between 2-10s) in the timeseries values of the federated metrics on the central Prometheus.

I'm not able to figure out the actual cause of this time lag.

Any help understanding the lag, and whether it can be fixed, would be appreciated.

sayf eddine Hammemi

Jun 10, 2020, 2:31:22 AM
to Raghavendra k, Prometheus Users
Do you set `honor_timestamps` when scraping the federated metrics? Are the servers time-synchronized (ntp/chrony)?
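For reference, a minimal federation scrape job on the central server might look like the sketch below; the job name, `match[]` selector, and target are placeholders, not taken from the original post. `honor_timestamps` defaults to true, which keeps the timestamps assigned by the source Prometheus:

```yaml
scrape_configs:
  - job_name: 'federate-azure'     # hypothetical job name
    scrape_interval: 15s           # samples can appear up to this much late
    honor_labels: true
    honor_timestamps: true         # the default: keep source sample timestamps
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job=~".+"}'            # placeholder selector; narrow this down
    static_configs:
      - targets:
          - 'azure-prometheus.example.com:9090'   # placeholder target
```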

--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-use...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/ffb0ad5e-5fc3-445b-b93b-b1512fb5b0cbo%40googlegroups.com.

Raghavendra k

Jun 10, 2020, 6:03:15 AM
to Prometheus Users
I have verified NTP sync between the servers, and `honor_timestamps` is enabled by default.
Is there anything else that I'm missing here?

ntpstat from two servers:

AWS
synchronised to NTP server (64.79.100.196) at stratum 3
   time correct to within 89 ms
   polling server every 1024 s


AZURE
synchronised to NTP server (216.126.233.109) at stratum 3
   time correct to within 26 ms
   polling server every 128 s

On Wednesday, June 10, 2020 at 12:01:22 PM UTC+5:30, sayf eddine Hammemi wrote:
Do you `honor_timestamps` when reading the federated metrics? are the servers synchronized (ntp/chrony)?


Bjoern Rabenstein

Jun 15, 2020, 2:41:44 PM
to Raghavendra k, Prometheus Users
On 09.06.20 23:23, Raghavendra k wrote:
>
> With two different hosted solutions Azure (for applications) & AWS (for
> centralized monitoring), we are trying to federate the Prometheus running on
> the Azure cluster from the Prometheus running on the AWS cluster.
> Federation is working absolutely fine, but there is a small lag in timeseries
> values of the federated metrics on the master prometheus (somewhere between
> 2-10s).

If you are referring to metrics arriving 2-10s "late", then I'd guess
your scrape interval for the federation setup is 10s.

Or in other words: federation is essentially a normal scrape, but it
keeps the timestamps from the source instead of attaching the scrape
timestamp. Metrics only arrive with each scrape, so they can appear
late by up to one scrape interval.

--
Björn Rabenstein
[PGP-ID] 0x851C3DA17D748D03
[email] bjo...@rabenste.in
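Bjoern's point can be sketched with a toy model (plain Python, not Prometheus code): with a 10s federation scrape interval, a sample written on the source at time T only becomes visible on the central server at the next federation scrape, i.e. anywhere from 0 to just under 10s later.

```python
import math

def visibility_lag(sample_ts, scrape_interval, first_scrape_at=0.0):
    """Seconds between a sample's source timestamp and the next
    federation scrape that picks it up (toy model, regular scrapes)."""
    # index of the next scrape at or after the sample's timestamp
    k = math.ceil((sample_ts - first_scrape_at) / scrape_interval)
    next_scrape = first_scrape_at + k * scrape_interval
    return next_scrape - sample_ts

# With a 10s interval the lag ranges over [0, 10):
print(visibility_lag(3.0, 10))   # sample at t=3, scraped at t=10 -> 7.0
print(visibility_lag(10.0, 10))  # sample lands exactly on a scrape -> 0.0
```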

Abhirama Mallela

Jun 15, 2020, 11:36:26 PM
to Prometheus Users
I'm attaching a couple of screenshots to help understand the problem better (I work with the OP). The top one is the source and the bottom one is the instance that's scraping the source. If you look at the time shown for the point value, you'll understand what the problem is.

If we look at the timestamps (in Unix epoch) for the time series, they are identical. What we don't understand is why there is a difference when the metric is graphed.

source_prometheus.png

federated_prometheus.png

Brian Candler

Jun 16, 2020, 7:17:33 AM
to Prometheus Users
> If we looked at the timestamps (in Unix epoch) for the time series, they are identical.

How did you look at them - what command exactly?

To see the raw timestamps you need a range vector query. e.g. all recorded values for metric foo with label bar="baz" over the last 5 minutes:

curl -Ssg 'localhost:9090/api/v1/query?query=foo{bar="baz"}[5m]' | python3 -mjson.tool

And remember that a graph resamples the data, although if you zoom in far enough the effect is negligible.
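The same comparison can be scripted against both servers' HTTP APIs; this is a sketch, and the hostnames and the `foo{bar="baz"}` selector are placeholders:

```python
import json
import urllib.parse
import urllib.request

def range_query_url(base_url, selector, window="5m"):
    """Build the /api/v1/query URL for a range-vector query."""
    q = urllib.parse.urlencode({"query": f"{selector}[{window}]"})
    return f"{base_url}/api/v1/query?{q}"

def raw_samples(base_url, selector, window="5m"):
    """Return {labels: [[ts, value], ...]} for each matching series."""
    with urllib.request.urlopen(range_query_url(base_url, selector, window)) as resp:
        data = json.load(resp)
    return {tuple(sorted(s["metric"].items())): s["values"]
            for s in data["data"]["result"]}

# Placeholder hosts; with honor_timestamps the timestamp columns should match:
# src = raw_samples("http://source-prom:9090", 'foo{bar="baz"}')
# fed = raw_samples("http://central-prom:9090", 'foo{bar="baz"}')
```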

Abhirama Mallela

Jun 16, 2020, 12:15:27 PM
to Prometheus Users
When I run a PromQL range query, like "http_server_requests_seconds_count[1m]", I get a list of (value, timestamp) pairs.

Example:

42 @1592323936.961
44 @1592323938.301
...

What I meant was that the same query on the source Prometheus and on the one scraping it returned identical values (which I think is the result of `honor_timestamps` being true).

What is puzzling is: when the underlying timeseries values are identical, why does the graph show a discrepancy in times?

Brian Candler

Jun 16, 2020, 3:00:55 PM
to Prometheus Users
The graph is resampling the data, like a subquery.

Consider for example a graph which covers 1000 seconds and has 500 data points. Each data point shows the value of the metric at time t, t-2, t-4, t-6, etc., where t is the current time.

You may not be scraping the data at the same interval, so the point at t-6 will show whatever the most recent value of the timeseries was *at or before* t-6.

I'm not saying this is the reason for the discrepancy you observe.  But it's one reason why viewing the same timeseries graph at different times might show slightly different timestamps.

(I can't see any discrepancy myself, because the two screenshots you posted look absolutely identical to me)
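Brian's "most recent value at or before t" lookup can be illustrated with a toy example (plain Python, an illustration rather than Prometheus's actual evaluation code). The same irregular samples, evaluated on two grids offset by one second, produce different-looking series:

```python
import bisect

# Raw samples as (timestamp, value), irregularly spaced.
samples = [(0.0, 1), (3.3, 2), (7.1, 3), (11.9, 4)]
timestamps = [ts for ts, _ in samples]

def value_at(t):
    """Most recent sample value at or before t (None before the first sample)."""
    i = bisect.bisect_right(timestamps, t) - 1
    return None if i < 0 else samples[i][1]

# Graph A rendered at one moment, graph B one second later:
print([value_at(t) for t in (3, 7, 11)])   # -> [1, 2, 3]
print([value_at(t) for t in (4, 8, 12)])   # -> [2, 3, 4]
```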