Federated Prometheus - Clarifications

Natraj Rams

Feb 9, 2021, 9:11:23 AM

to Prometheus Users

I have a Prometheus server installed in each of my environments; say there are 5 environments. One federated Prometheus scrapes specific metrics from all 5 servers. The retention period is 10 days on each individual Prometheus server and 1 year on the federated Prometheus.

Let the individual Prometheus servers be named:

1) Prometheus-1

2) Prometheus-2

3) Prometheus-3

4) Prometheus-4

5) Prometheus-5

Up to a certain date, say 01/02/2021, all those metrics are visible in my federated Prometheus. Let the metric value as of 01/02/2021 be "x" from all 5 Prometheus servers.

Now suppose one of the 5 servers, say Prometheus-3, lost all its data on 02/02/2021 for some reason and the server itself stopped. My questions are:

1) If I query a metric in the federated Prometheus before Prometheus-3 becomes active again, will I get any data from Prometheus-3?

2) As mentioned, Prometheus-3 has lost all its data. If it becomes active again on 03/02/2021 with metric value "y" (which will be very less, since all historical data are lost) and I then query the Prometheus-3 data in the federated Prometheus, what metric value will I get? Will it be x+y or just y?

Thanks

R.Natarajan

Stuart Clark

Feb 9, 2021, 11:15:09 AM

to Natraj Rams, Prometheus Users

The server that is federating metrics from the 5 servers has its own
TSDB and isn't dependent on those servers in any way for queries.
Normally you would federate only certain metrics (not everything), so the
central server wouldn't have all the details and you would still want to
query the 5 servers as needed.
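
As a rough illustration of federating only selected series: the central server scrapes each source server's /federate endpoint and passes one or more match[] selectors. The Python sketch below only builds such a URL; the host name and the two selectors are made up for the example and are not taken from the setup described in this thread.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def federate_url(base_url, selectors):
    """Build a /federate URL with one match[] parameter per selector."""
    query = urlencode([("match[]", s) for s in selectors])
    return f"{base_url.rstrip('/')}/federate?{query}"


# Hypothetical source server and selectors, for illustration only.
url = federate_url(
    "http://prometheus-3.example:9090",
    ['{job="node"}', "up"],
)
print(url)

# Decoding the query string gives back the original selectors.
print(parse_qs(urlsplit(url).query)["match[]"])
```

Only series matching one of the selectors are exposed to the central server, which is why the detailed data still has to be queried on the individual servers.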

If you stopped scraping one of the servers (e.g. because it failed)
nothing would change regarding the data the central server has already
ingested. From that point onward the scrape would fail for the missing
server, so any queries would have a gap. Once the failed server returns
the scrapes would work again and the gap would end.

So for (1) if you query the central server before server 3 is back what
you get depends on your query - if the query is for a time period before
server 3 failed then you get the full data, but after server 3 failed it
would be missing.
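
A small Python sketch of that behaviour, under simplifying assumptions: the central server's stored samples for one federated series are modelled as (timestamp in seconds, value) pairs, and an instant query returns the newest sample inside a 5 minute lookback window. The sample data and the function name are invented for the example, and staleness markers (which in a real server make a series disappear sooner after a failed scrape) are ignored.

```python
LOOKBACK_SECONDS = 300  # 5 minute lookback window


def instant_value(samples, t, lookback=LOOKBACK_SECONDS):
    """Return the newest sample value in (t - lookback, t], or None."""
    candidates = [(ts, v) for ts, v in samples if t - lookback < ts <= t]
    if not candidates:
        return None
    return max(candidates, key=lambda sample: sample[0])[1]


# Samples the central server ingested from "server 3" (made-up data):
# scrapes succeed until t=120, fail while the server is down,
# and succeed again from t=900.
server3_samples = [(0, 10.0), (60, 11.0), (120, 12.0), (900, 1.0), (960, 2.0)]

print(instant_value(server3_samples, 150))  # before the failure
print(instant_value(server3_samples, 600))  # inside the gap
print(instant_value(server3_samples, 930))  # after the server is back
```

The samples ingested before the failure stay where they are; only query times that fall inside the outage come back empty.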

For (2) it depends what you mean by "which will be very less, since all
historical data are lost". Federation fetches the current value of the
matched metrics each time the central server makes a scrape of the 5
servers. Historical data is never queried (Prometheus will look back for
a maximum of 5 minutes to find the latest value for each metric). If the
metric is a gauge it is totally normal for the value to fluctuate. If
the metric is a counter then you will get occasional counter resets, but
that is down to the metric source and not the Prometheus server -
counters reset when an application restarts or start from 0 if a new pod
is created.
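
To make the counter case concrete, here is a simplified Python sketch of reset-aware increase calculation. It follows the general idea behind rate() and increase() (a drop in value is treated as a restart from 0) but leaves out the extrapolation the real functions perform; the function name and the sample values are invented for the example.

```python
def increase_with_resets(values):
    """Total increase of a counter over consecutive samples.

    A sample lower than its predecessor is treated as a counter reset,
    i.e. the counter is assumed to have restarted from 0.
    """
    total = 0.0
    for prev, cur in zip(values, values[1:]):
        if cur < prev:
            total += cur         # reset: count everything since the restart
        else:
            total += cur - prev  # normal monotonic growth
    return total


# Made-up counter samples as stored on the central server:
# the source application restarts between 120 and 5.
stored = [100, 110, 120, 5, 15]

print(stored[-1])                    # the raw value after the restart
print(increase_with_resets(stored))  # growth across the restart
```

In other words, the raw value stored after a restart is simply whatever the source reports at that moment; nothing is added to the earlier value, and queries that compute increases account for the reset themselves.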

So in summary, the only impact of server 3 breaking would be a gap in
your query (or lower than expected aggregate values) while it was
unavailable. There is no impact for any historical data before that time
or data once the server is back.

--
Stuart Clark
