Not able to push the metrics from Prometheus on a Kubernetes cluster to another Prometheus server


Saurabh Vartak

Feb 24, 2021, 3:37:02 PM
to Prometheus Users
Hello all,

I am trying to push the metrics from the Prometheus installed on my Kubernetes cluster to another centralized Prometheus server.

In my Kubernetes cluster, I have configured the remote_write section in the ConfigMap used by Prometheus:

  prometheus.yml: |
    global:
      evaluation_interval: 1m
      scrape_interval: 1m
      scrape_timeout: 10s
    remote_write:


In my centralized Prometheus server, I have the below configuration in the prometheus.yml file:

global:
  scrape_interval: 10s

scrape_configs:
  - job_name: 'prometheus_metrics'
    scrape_interval: 5s
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node_exporter_metrics'
    scrape_interval: 5s
    static_configs:
      - targets: ['localhost:9100']
  - job_name: 'Pushgateway'
    scrape_interval: 5s
    static_configs:
      - targets: ['localhost:9091']

Am I doing anything wrong here? I am browsing my portal, but I am not able to work out how to query the metrics which have been forwarded from my Kubernetes cluster ... if they are being forwarded at all.

Any help would be greatly appreciated here.

Regards,
Saurabh

Stuart Clark

Feb 24, 2021, 4:55:52 PM
to Saurabh Vartak, Prometheus Users
Prometheus doesn't support sending metrics to another Prometheus server
via remote write, other than via an experimental option in the latest
version (2.25.0).
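
For reference, that experimental receiver is enabled with a feature flag on the central server, and the source server then points remote_write at its write endpoint. A sketch, assuming Prometheus >= 2.25.0 and a hypothetical central server address of central-prometheus:9090 (check the documentation for your version before relying on this):

```yaml
# Central server: start Prometheus with
#   prometheus --enable-feature=remote-write-receiver ...
# which exposes a write endpoint at /api/v1/write.

# Source (in-cluster) Prometheus: send samples to that endpoint.
remote_write:
  - url: http://central-prometheus:9090/api/v1/write
```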

The normal mechanism would be for the central server to fetch metrics
via federation, although this is designed for a limited subset of
aggregate metrics and not a complete copy. Alternatively, a system such
as Thanos could be used as a global metrics store which can be accessed
by all your Prometheus servers.
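
For context, federation works by having the central server scrape the /federate endpoint of each local server, selecting series with match[] parameters. A minimal sketch for the central server's scrape_configs, where the k8s-prometheus:9090 target and the job selector are hypothetical placeholders:

```yaml
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 30s
    honor_labels: true        # keep labels as set by the source server
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="kubernetes-nodes"}'  # example selector; prefer aggregate series
    static_configs:
      - targets: ['k8s-prometheus:9090']
```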

--
Stuart Clark

Saurabh Vartak

Feb 25, 2021, 1:56:07 AM
to Stuart Clark, Prometheus Users
Hi Stuart,

Thanks for all the help on this. I have created a Grafana cloud account and their documentation states that it is possible to send the metrics from one Prometheus server to the Prometheus instance on Grafana cloud: https://grafana.com/docs/grafana-cloud/metrics/prometheus/ 

remote_write:
- url: https://prometheus-us-central1.grafana.net/api/prom/push
  basic_auth:
    username: <Your Metrics instance ID>
    password: <Your Grafana.com API Key>

I thought this could be replicated for my on-premises setup. I am assuming it is the Pushgateway to which the metrics are pushed from the source Prometheus server to the destination Prometheus server. Any ideas on this?

Regards,
Saurabh

Stuart Clark

Feb 25, 2021, 3:54:05 AM
to Saurabh Vartak, Prometheus Users

Grafana may be using the experimental feature I mentioned or other custom code.

Pushgateway is not useful here - it is designed for short-lived processes (such as cron jobs) which can't be scraped directly, and it uses a different API (not the remote write API).

Maybe it would help if you could describe what you are trying to do from a non-technical perspective? Why are you trying to send metrics from one server to another?

-- 
Stuart Clark

Saurabh Vartak

Feb 25, 2021, 8:40:42 AM
to Stuart Clark, Prometheus Users
Hi Stuart,

Thanks for the prompt response and all the guidance to date.

The setup we are looking for is one where the user of the Grafana portal does not need access to any other piece of infrastructure (including the other Kubernetes clusters which are scraped for metrics).
So what we have thought is to have all the Kubernetes clusters push their metrics to a centralized Prometheus ... and have Grafana sitting on top of only that centralized Prometheus server.

I was able to set up the Prometheus server-to-server communication using Prometheus federation, as you correctly suggested. However, I am still reading up on which metrics I may miss if I use federation. In all, I have the below three queries:

  1. Are all the metrics forwarded using Prometheus Federation? Or is it that only a few are forwarded?
  2. The metrics that are forwarded using Prometheus Federation, do they get stored in the TSDB of the destination Prometheus Server?
  3. What would be the best way to take a backup of the Centralized Prometheus server? Do we need to use an external system like Thanos? Or are disk backups of the Centralized Prometheus Server enough?

Regards,
Saurabh

Stuart Clark

Feb 25, 2021, 2:13:29 PM
to Saurabh Vartak, Prometheus Users

Trying to bring all data into a single central server isn't recommended - resource requirements can quickly get very high as the number of time series would likely be huge.

For your use case it sounds like a solution such as Cortex or Thanos would be a good fit.

Instead of running a central Prometheus server, each one sends data to an object store (e.g. an S3 bucket). That store is then presented in a Prometheus-compatible way to allow queries from Grafana.


With federation, one method is to produce aggregate metrics within each Prometheus using recording rules (e.g. summing a metric together to remove instance or pod labels), which are then selected for federation (possibly at a lower scrape frequency than the source server uses). That way you have the full-resolution metrics in the local servers, which can be used for per-pod queries, and aggregate metrics in the central system, which can be used for "global" dashboards (services that span clusters, or showing different geographic regions).
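
As an illustration, a recording rule of this shape pre-aggregates a per-pod series into one series per job, which the federation match[] selector can then pick up. The metric name and recorded name here are hypothetical:

```yaml
groups:
  - name: federation-aggregates
    rules:
      - record: job:http_requests_total:sum   # aggregated series to federate
        expr: sum without (instance, pod) (http_requests_total)
```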

With that setup you could either run Grafana locally alongside each Prometheus (which has the advantage of allowing dashboards to be viewed even if the network or central server is broken), a single central Grafana, or a combination of both. As well as querying the central Prometheus server, the central Grafana could be configured with additional Prometheus data sources for each of the local servers, allowing both aggregated and specific queries.

-- 
Stuart Clark

Saurabh Vartak

Feb 25, 2021, 3:29:34 PM
to Stuart Clark, Prometheus Users
Hi Stuart,

Thanks again for your help and continued guidance. So, to summarise your suggestions in a nutshell:
1. If there is a requirement to have aggregated metrics in place, Prometheus Federation would be the way to go.
2. If there is a requirement for long term retention (either for a single Prometheus server or a bunch of Prometheus servers) an external storage solution like Cortex or Thanos can be used. 

I hope I am correct with the above two points. 

Also, I needed your help on the below 2 questions to wrap this thread:
1. When we use Prometheus Federation, the metrics sent from a Prometheus server to a Centralized Prometheus server do get stored in the TSDB of the Centralized Prometheus Server. Is the understanding correct?
2. When we use Prometheus Federation, all the metrics scraped by a Prometheus server can be sent to the Centralized Prometheus server. However as a best practice, it is always recommended to send only the aggregated metrics to the Centralized Prometheus server. Is the understanding correct?

Thank you again Stuart for all the help.

Regards,
Saurabh

Stuart Clark

Feb 25, 2021, 5:24:17 PM
to Saurabh Vartak, Prometheus Users
On 25/02/2021 20:29, Saurabh Vartak wrote:
> Hi Stuart,
>
> Thanks again for your help and continued guidance. So if I am able to
> summarise your suggestions in a nutshell:
> 1. If there is a requirement to have aggregated metrics in place,
> Prometheus Federation would be the way to go.
> 2. If there is a requirement for long term retention (either for a
> single Prometheus server or a bunch of Prometheus servers) an external
> storage solution like Cortex or Thanos can be used.
>
> I hope I am correct with the above two points.
>
> Also, I needed your help on the below 2 questions to wrap this thread:
> 1. When we use Prometheus Federation, the metrics sent from a
> Prometheus server to a Centralized Prometheus server do get stored in
> the TSDB of the Centralized Prometheus Server. Is the understanding
> correct?

That is correct. The central server sees the federation with the other
server in exactly the same way as any other scrape target.

So whatever storage duration and any remote write configuration would
apply (in the same way as any other targets the central server scrapes).

> 2. When we use Prometheus Federation, all the metrics scraped by a
> Prometheus server can be sent to the Centralized Prometheus server.
> However as a best practice, it is always recommended to send only the
> aggregated metrics to the Centralized Prometheus server. Is the
> understanding correct?

Federation is different from "sending" metrics around. In particular,
when a server scrapes the federation endpoint it returns the latest
value for all metrics that have been selected at that point in time.
For example, if the scrape period of a target was 30s but the period
for the federation was 120s, then the local server would hold 4 values
for every 2-minute period, but the central server would only contain 1.

While you could try to use federation to fetch all metrics (remembering
that you wouldn't necessarily get all the values scraped by the local
server), you may quickly hit resource limitations. The text format used
is not as efficient as the remote write protocol, for example, so you
might see high network or CPU usage on both servers. Equally, depending
on the quantity of metrics and the scrape interval chosen for the
central server, you could find the volume so great that it fails to
ingest within the timeout period (at most the same as the scrape
period).

This would be in addition to the multiplication effect of trying to
store all metrics in a central server (a server could handle 1 million
time series, but trying to federate all metrics from 100 such servers
centrally would need the central server to handle 100 million time
series, which would likely require a lot more resources than would
reasonably be available).
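
The arithmetic above can be sketched as a toy calculation, using the figures from this thread:

```python
# Samples held per 2-minute window at each scrape interval.
target_scrape_s = 30       # local server scrapes the target every 30s
federation_scrape_s = 120  # central server federates every 120s
window_s = 120

local_samples = window_s // target_scrape_s        # values per window locally
central_samples = window_s // federation_scrape_s  # values per window centrally

# Series multiplication when federating everything centrally.
series_per_server = 1_000_000
servers = 100
central_series = series_per_server * servers

print(local_samples, central_samples, central_series)
```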

--
Stuart Clark

Saurabh Vartak

Feb 27, 2021, 10:43:14 AM
to Stuart Clark, Prometheus Users
Hi Stuart,

Thanks to all the knowledge and guidance you have imparted to me, I have decided to go with the below approach:

1. For the scenarios where aggregation of metrics is desired, I will implement Prometheus Federation
2. For viewing the metrics of multiple Kubernetes clusters individually, I will implement a Central Grafana dashboard with individual AKS clusters added as datasources
3. For long-term retention or backup of the metrics, I will use remote_write to write all the metrics from the individual Kubernetes clusters to an InfluxDB instance. In case of any data loss, I can create a new Prometheus server instance and point its remote_read at this InfluxDB instance, so that the same Grafana dashboards with the same PromQL queries can be used.
If a remote_write-based backup is not desired for any reason, then the simple option of taking disk snapshots of the Prometheus server can be used ... although the snapshots would have to be taken at a higher frequency if loss of metrics data is to be minimized.
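
For point 3, a sketch of the source-side configuration, assuming an InfluxDB 1.x instance at the hypothetical address influxdb:8086 with a database named prometheus (verify the endpoint paths against your InfluxDB version's documentation):

```yaml
# On each in-cluster Prometheus: write samples to InfluxDB.
remote_write:
  - url: "http://influxdb:8086/api/v1/prom/write?db=prometheus"

# On a rebuilt Prometheus server: read historical data back.
remote_read:
  - url: "http://influxdb:8086/api/v1/prom/read?db=prometheus"
```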

Does this sound like a plan? 

Regards,
Saurabh

Stuart Clark

Feb 28, 2021, 8:32:53 AM
to Saurabh Vartak, Prometheus Users

That sounds perfectly reasonable.

I hope you get it all working and it does what you are hoping for :-)

-- 
Stuart Clark

Saurabh Vartak

Feb 28, 2021, 9:28:36 AM
to Stuart Clark, Prometheus Users
Hi Stuart,

I hope so too. I have got points 1 and 2 working; point 3 is a work in progress.
Thank you again for all the support.

Regards,
Saurabh