Prometheus Redundancy, Federation, and Recording Rules

Jack Neely

Mar 8, 2017, 11:21:25 AM
to Prometheus Users
Greetings,

I have a question regarding the duplicate metrics that can be produced with redundant Prometheus instances and federation.

Some background about how we've deployed Prometheus: each team has its own Prometheus server/instance. That team may add recording rules for important aggregations, and we encourage building Grafana dashboards from recording rules. These "local" Prometheus instances are mostly Aurora jobs and are therefore ephemeral.

At a "global" level on real hardware we discover all Prometheus instances and federate metrics matching the recording rule format.  Grafana queries are (in most cases) done against the global Prometheus servers because dashboards will continue to present historical data even if your local Aurora Prometheus job has moved, restarted, or reconfigured (and lost its storage).

When a second Prometheus instance for a team is brought online for redundancy (or even testing), the global Prometheus servers pick up another copy of the same recording rule metrics. At that point Grafana dashboards become confusing because they receive duplicate series.

I need to figure out a way to address folks "seeing double" without passing out strong glasses. I can filter specific Prometheus server instances out of the queries, but that doesn't seem like a scalable solution. Either version of the time series would probably work, just not both.

Do folks have any techniques for managing this?

Thanks!

--
Jack Neely
Operations Engineer
42 Lines, Inc.

Brian Brazil

Mar 8, 2017, 12:11:54 PM
to Jack Neely, Prometheus Users
The usual approach would be to filter at the graphing stage, likely using Grafana templating. The redundant Prometheus servers should have slightly different external labels (e.g. dc-1 and dc-2), which will be present in the federated data, so you can filter on those.
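A minimal sketch of what I mean, assuming a label name of datacenter (the label name and the example rule metric are placeholders):

    # On each team-level Prometheus server (the value differs per replica):
    global:
      external_labels:
        datacenter: "dc-1"   # "dc-2" on the redundant instance

    # A Grafana template variable over that label then lets a dashboard pin one
    # copy, e.g.:  some_rule:metric{datacenter="$datacenter"}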


Jack Neely

Mar 8, 2017, 12:23:54 PM
to Brian Brazil, Prometheus Users
Presently that label is "instance". If the recording rule metric doesn't have an instance label, Prometheus adds one that identifies the Prometheus instance it was federated from. (And of course there is more confusion if instance labels are present.) So this configuration needs to be fixed on my end. Are there suggested label names / semantics here?

Brian Brazil

Mar 8, 2017, 12:29:37 PM
to Jack Neely, Prometheus Users
On 8 March 2017 at 17:23, Jack Neely <jjn...@42lines.net> wrote:


Presently that label is "instance". 

It'd usually be something like "datacenter". Metrics with an instance label shouldn't generally be making it to a global Prometheus, and they'd be identical across both of the HA pair anyway.
 
If the recording rule metric doesn't have an instance label, Prometheus adds one that identifies the Prometheus instance it was federated from. (And of course there is more confusion if instance labels are present.) So this configuration needs to be fixed on my end.

It sounds like you've not set honor_labels: true, which should be set for federation.
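It goes on the federation scrape config itself, roughly (a sketch, not your exact config):

    - job_name: "federate"
      metrics_path: "/federate"
      honor_labels: true   # keep the labels exactly as the federated server exposes them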

Brian
 

Jack Neely

Mar 9, 2017, 11:00:38 AM
to Brian Brazil, Prometheus Users
Did some testing with this. I do have honor_labels set to true. The issue I have is that Prometheus always attempts to attach an 'instance' label. So if one exists, it persists (in the very few recording rules I have that are aggregated by instance). Most rules get an instance label slapped on equal to the Prometheus instance the metric was federated from.

I can relabel that out, in the following tricksome way.  See below.

This behaves the way users expect with regard to the instance label and applies a 'federation' label to clearly identify that the metric was federated and from which Prometheus instance. (We already apply a data_center label via service discovery, and some of our deployments have a pair of identical Prometheus servers per data center or Mesos/Aurora cluster.) The federation label should probably be set as an external label on each Prometheus server -- I think this is what I originally missed. Identically configured Prometheus servers don't have a unique value in their external label sets in my configuration.

      "job_name": "federate",                                                      
      "file_sd_configs": [                                                         
        {                                                                          
          "files": [                                                               
            "/etc/prometheus/cmdb/federate.json"                                   
          ]                                                                        
        }                                                                          
      ],                                                                           
      "relabel_configs": [                                                                                                                  
        {                                                                          
          "action": "replace",                                                     
          "source_labels": [                                                       
            "__address__"                                                          
          ],                                                                       
          "target_label": "federation",                                            
          "regex": "([\\w.-]+):(\\d+)",                                            
          "replacement": "$1"                                                      
        },                                                                         
        {                                                                          
          "action": "replace",                                                     
          "target_label": "instance",                                              
          "replacement": "federation"                                              
        }                                                                          
      ],                                                                           
      "metric_relabel_configs": [                                                  
        {                                                                          
          "action": "replace",                                                     
          "source_labels": [                                                       
            "instance"                                                             
          ],                                                                       
          "target_label": "instance",                                              
          "regex": "federation",                                                   
          "replacement": ""                                                        
        }                                                                          
      ],                                                                           
      "scrape_interval": "15s",                                                    
      "honor_labels": true,                                                        
      "metrics_path": "/federate",

Brian Brazil

Mar 9, 2017, 11:12:05 AM
to Jack Neely, Prometheus Users
Hmm, that's a bug. We should be explicitly exporting the instance label as empty if it's not set.

Brian




afdra...@gmail.com

May 11, 2017, 9:32:01 AM
to Prometheus Users, jjn...@42lines.net
Jack,

Have you considered using HAProxy to hide Prometheus redundancy from Grafana? I'm currently looking into this myself and would love to hear from other folks who have implemented it, since I'm still in the research phase.

Julius Volz

May 11, 2017, 9:41:04 AM
to afdra...@gmail.com, Prometheus Users, Jack Neely
While you can do this, it might get a tad annoying to have graphs slightly change on every reload (even for the same graph time range), as you alternate between backends, because both Prometheus servers are slightly phase-shifted in their data collection. Maybe with some kind of affinity where it always takes one server, unless it becomes unavailable...
 


Brian Brazil

May 11, 2017, 9:56:22 AM
to Julius Volz, Aliesha, Prometheus Users, Jack Neely

The "backup" feature of HAProxy would help with this, as long as the first server is hard down when it fails.

Brian
 


Ben Kochie

May 11, 2017, 10:02:04 AM
to Brian Brazil, Julius Volz, Aliesha, Prometheus Users, Jack Neely
Yes, this is how we use HAProxy at GitLab[0], although the config option we use is "balance first", which sends traffic to the first backend unless it's down.
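For anyone finding this later, a minimal haproxy.cfg sketch of the idea (addresses and names are made up, and the health-check path may differ by Prometheus version):

    # "balance first" sends traffic to the first usable server in the list, so
    # prom-2 only receives queries when prom-1 fails its health check.
    backend prometheus
        balance first
        option httpchk GET /-/healthy
        server prom-1 10.0.0.1:9090 check
        server prom-2 10.0.0.2:9090 check

Marking the second server with HAProxy's "backup" keyword, as Brian mentioned, is another way to get similar behaviour.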


afdra...@gmail.com

May 11, 2017, 10:13:37 AM
to Prometheus Users, brian....@robustperception.io, juliu...@gmail.com, afdra...@gmail.com, jjn...@42lines.net
Thanks for all your feedback. This is very helpful.

hcb...@gmail.com

Dec 1, 2018, 9:08:46 AM
to Prometheus Users
This is an old thread but very much related to what I'm facing right now.
I'm using Grafana dashboards pulling from a global Prometheus instance as the data source, with two separate site-level Prometheus servers federating metrics up to the global one.
Only when I used Brian's Prometheus Benchmark dashboard template to assess the global Prometheus's performance did I realize it had an abnormally high ingestion rate of ~40mil samples/s.
This seems to be because I'm federating the raw metrics from the site level up to the global level without using recording rules to aggregate them first.

It seems like a chicken-and-egg thing. If I should only federate aggregated metrics, then I'd need to know in advance which Grafana dashboards and queries I'll need, aggregate those at the site level with recording rules, and only then would the federated data at the global level be what Grafana queries; something like the sketch below.
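A rough example of what I mean (rule and metric names are made up):

    # Site-level recording rule file (Prometheus 2.x rule format), aggregating
    # away the per-instance dimension before anything is federated:
    groups:
      - name: site_aggregations
        rules:
          - record: job:http_requests_total:rate5m
            expr: sum by (job) (rate(http_requests_total[5m]))

    # The global federation job would then match only the aggregated series,
    # e.g. a match[] parameter of '{__name__=~"job:.+"}'.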

Is my understanding accurate?

Ben Kochie

Dec 1, 2018, 9:35:51 AM
to hcb...@gmail.com, Prometheus Users
Repeat after me: Federation is not replication. :-)

It's explicitly not designed to gather all data from every server.

If you want a global query "single pane of glass" layer, try Thanos[0].

hcb...@gmail.com

Dec 1, 2018, 9:40:12 AM
to Prometheus Users

Haha. Thanks Ben. Yes, I get that now.
Yes, we've used the Thanos sidecar too, when we needed long-term storage.

Thanks for the advice!