Demerits of federated prometheus with multiple shards


dineshnithy...@gmail.com

Aug 28, 2020, 3:01:41 AM
to Prometheus Users
Hi Team,

We have implemented a large number of Prometheus shards, which together scrape 1000 target nodes across the enterprise.

We are currently running load tests to pin down the unknowns below, and need some guidance on them:

  1. In what cases can the sharding approach fail?
  2. How do we quantify the load/scale requirements?
  3. Are there any optimized settings/configurations or best practices we should follow for sharding?
  4. Are there any low-level tunables (recording rules, metric_relabel_configs, scrape_interval, the number of samples we ingest) that we need to account for, and any pointers on them?
Please let us know whether we are on the right track, and correct me in case of any misinterpretation.

Regards,
Dinesh
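
For readers following the thread, here is a minimal sketch of how targets can be split across shards by hashing, which is similar in spirit to the `hashmod` relabel action in Prometheus. The function name `shard_for`, the target names, and the shard count of 4 are hypothetical, and the hashing shown is an assumption that is not guaranteed to match what Prometheus computes internally.

```python
import hashlib
from collections import Counter


def shard_for(target: str, num_shards: int) -> int:
    """Map a target address to a shard index in [0, num_shards)."""
    # Hash the address so the assignment is stable across restarts.
    digest = hashlib.md5(target.encode("utf-8")).digest()
    # Interpret the last 8 bytes as an unsigned integer (assumed layout).
    return int.from_bytes(digest[-8:], "big") % num_shards


# Hypothetical fleet of 1000 node targets, as in the question above.
targets = [f"node-{i}.example.internal:9100" for i in range(1000)]

# Count how many targets land on each of 4 hypothetical shards.
counts = Counter(shard_for(t, 4) for t in targets)
for shard in sorted(counts):
    print(f"shard {shard}: {counts[shard]} targets")
```

Hashing balances the number of targets per shard, not the number of series, so a few very large targets can still overload one shard.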

Dinesh N

Aug 30, 2020, 10:16:33 AM
to Prometheus Users
Can someone share their knowledge about the trade-offs of sharding Prometheus across multiple nodes?

Any hints/pointers would definitely be helpful.

--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-use...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/3307c7af-6fac-4e5a-a422-d2eee13c9fccn%40googlegroups.com.

dineshnithy...@gmail.com

Sep 4, 2020, 12:17:30 PM
to Prometheus Users
I have explored a couple of observations and wanted to share them and seek the necessary guidance.

One observation on shards: they are worth it only if you have reached a limit on max targets (i.e. total scrape time is taking longer than the intended scrape interval) or if you can't vertically scale the pod/machine anymore.
The cons I see are that sharding complicates your setup (you need more components to get a unified view), adds to query latency (due to fanout), and increases your costs.

In that case, do we have any approach/methodology for determining the number of shards based on the number of targets and the scrape load?

Brian Candler

Sep 4, 2020, 2:55:58 PM
to Prometheus Users
Guidelines I have seen:
- no more than 2M metrics per Prometheus server
- aim for <10k metrics per scrape target if possible

If you're scaling really large then you might want to look at Thanos, Cortex etc.
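
As a rough worked example of these guidelines, the sketch below estimates a shard count from the number of targets and the series per target. It assumes that "metrics" above means active time series and that load spreads evenly across shards. The function name `estimate_shards` and the `headroom` parameter are hypothetical, introduced only for illustration.

```python
import math


def estimate_shards(num_targets: int,
                    series_per_target: int,
                    max_series_per_server: int = 2_000_000,
                    headroom: float = 0.0) -> int:
    """Estimate how many Prometheus servers are needed for a given load.

    headroom is the fraction of each server's budget kept in reserve
    (an assumption for illustration, not part of the guideline).
    """
    if not 0.0 <= headroom < 1.0:
        raise ValueError("headroom must be in [0, 1)")
    total_series = num_targets * series_per_target
    usable_per_server = max_series_per_server * (1.0 - headroom)
    # Round up: a partially filled server still has to exist.
    return max(1, math.ceil(total_series / usable_per_server))


# 1000 targets at the 10k-per-target ceiling from the guideline.
print(estimate_shards(1000, 10_000))
# Same load, keeping 25% of each server's budget in reserve.
print(estimate_shards(1000, 10_000, headroom=0.25))
```

With 1000 targets at the 10k ceiling this gives 5 shards with no headroom, or 7 with a quarter of each server's budget held back. Real sizing should be checked against measured series counts and memory use.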

Dinesh N

Sep 5, 2020, 11:19:09 AM
to Brian Candler, Prometheus Users
Thanks, Brian, for the guidance.

1) We are leveraging Thanos Querier to consolidate the metrics across the shards.

2) We are using Thanos Store for long-term storage, which is serving our needs.


The only concern I have here is that with shards we always have to consider the "single point of failure" problem and how to address it technically.

Likewise, if any shard goes down we would drop "x" minutes of metrics, and because of that we can't achieve the so-called four nines (99.99% availability) of fault tolerance.

Any thoughts along these lines on how we can become more resilient?
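
To put numbers on the availability concern, the sketch below works out the downtime budget for a given availability target and the combined availability of several identical replicas of one shard. It assumes replica failures are independent, which is optimistic if replicas share a node, rack, or network path. The function names are hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def allowed_downtime_minutes(availability: float,
                             period_minutes: int = MINUTES_PER_YEAR) -> float:
    """Downtime budget, in minutes, for an availability target over a period."""
    return (1.0 - availability) * period_minutes


def combined_availability(single: float, replicas: int) -> float:
    """Availability of a shard that is up if at least one replica is up.

    Assumes replicas fail independently of each other.
    """
    return 1.0 - (1.0 - single) ** replicas


print(f"{allowed_downtime_minutes(0.9999):.2f} minutes/year at 99.99%")
print(f"{combined_availability(0.99, 2):.4%} with two 99% replicas")
```

Under these assumptions, 99.99% leaves roughly 52.6 minutes of downtime per year, and two independent replicas that are each 99% available reach about 99.99% together. One common way to get such replicas is to run two identical Prometheus servers per shard and let Thanos Querier deduplicate their series by a replica label.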
