Demerits of federated Prometheus with multiple shards

dineshnithy...@gmail.com

Aug 28, 2020, 3:01:41 AM

to Prometheus Users

Hi Team,

We have implemented a large number of Prometheus shards, which together scrape 1000 target nodes across the enterprise.

We are currently running load tests to pin down the unknowns listed below, and we need some guidance on them:

In what cases can the sharding approach fail?
How do we quantify the load/scale requirements?
Are there any optimized settings/configurations or best practices we need to follow for sharding?
Are there any low-level tunables (recording rules, metric_relabel_configs, scrape_interval, the number of samples we ingest) that we need to account for? Any pointers would help.

Please let us know whether we are thinking along the right lines, and correct me in case of any misinterpretation.
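
For readers following the thread: one common way to split targets across Prometheus shards is the `hashmod` relabel action, where each server keeps only the targets whose hashed address falls into its own bucket. The Python sketch below only illustrates that idea and shows how evenly a hash spreads targets. The target names, the shard count of 5, and the MD5-based hash are assumptions made for illustration, and the buckets it produces are not guaranteed to match what Prometheus itself would assign.

```python
import hashlib
from collections import Counter


def shard_for(address: str, num_shards: int) -> int:
    """Map a target address to a shard index in [0, num_shards)."""
    digest = hashlib.md5(address.encode("utf-8")).digest()
    # Interpret the last 8 bytes of the digest as an unsigned big-endian
    # integer, then reduce it modulo the number of shards.
    return int.from_bytes(digest[8:], "big") % num_shards


# Hypothetical inventory: 1000 node targets spread over 5 shards.
NUM_SHARDS = 5
targets = [f"node-{i:04d}.example.internal:9100" for i in range(1000)]

per_shard = Counter(shard_for(t, NUM_SHARDS) for t in targets)
for shard in sorted(per_shard):
    print(f"shard {shard}: {per_shard[shard]} targets")
```

Because the assignment depends only on the address and the shard count, every shard can evaluate it independently. The flip side is that changing the shard count moves many targets to a different shard.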

Regards,

Dinesh

Dinesh N

Aug 30, 2020, 10:16:33 AM

to Prometheus Users

Can someone share their knowledge of the trade-offs of sharding Prometheus across multiple nodes?

Any hints/pointers would definitely be helpful.

--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-use...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/3307c7af-6fac-4e5a-a422-d2eee13c9fccn%40googlegroups.com.

dineshnithy...@gmail.com

Sep 4, 2020, 12:17:30 PM

to Prometheus Users

I have explored this a bit and wanted to share a couple of observations and seek guidance.

One con I see with shards applies when you have reached a limit on max targets (i.e. the total scrape time takes longer than the intended scrape interval) or when you can no longer scale the pod/machine vertically:
sharding complicates your setup (you need more components to get a unified view), adds query latency (due to fan-out), and increases your costs.

In that case, is there an approach or methodology for determining the number of shards based on the number of targets/scrapes?

Brian Candler

Sep 4, 2020, 2:55:58 PM

to Prometheus Users

Guidelines I have seen:
- no more than 2m metrics per prometheus server
- aim for <10k metrics per scrape target if possible

If you're scaling really large then you might want to look at Thanos, Cortex etc.
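
As a rough illustration of how those two guidelines could turn into a shard count for the 1000 targets mentioned earlier in the thread, here is a back-of-the-envelope sketch. It assumes "metrics" means active time series, treats 10k per target as a worst case, and adds a hypothetical `headroom` parameter for growth. None of these are official limits, so treat the result as a starting point for load testing rather than an answer.

```python
import math


def estimate_shards(num_targets: int,
                    series_per_target: int,
                    max_series_per_server: int = 2_000_000,
                    headroom: float = 0.0) -> int:
    """Estimate how many Prometheus servers are needed.

    headroom is the fraction of each server's capacity kept free
    (an assumption for illustration, not a Prometheus setting).
    """
    total_series = num_targets * series_per_target
    usable_per_server = max_series_per_server * (1 - headroom)
    return max(1, math.ceil(total_series / usable_per_server))


# 1000 targets at up to 10k series each is 10M series in total.
print(estimate_shards(1000, 10_000))                # 5
print(estimate_shards(1000, 10_000, headroom=0.3))  # 8
```

Real targets rarely expose the same number of series, so measuring actual series counts per target gives a better input than the 10k ceiling.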

Dinesh N

Sep 5, 2020, 11:19:09 AM

to Brian Candler, Prometheus Users

Thanks, Brian, for the guidance.

1) We are using Thanos Querier to consolidate the metrics across the shards.

2) We are using Thanos Store for long-term storage, which is serving our needs.

The only concern I have is that with shards we always have to consider the "single point of failure" problem and how to address it technically.

If any shard goes down, we lose "x" minutes of metrics, and because of that we can't achieve the fault tolerance of the so-called four nines (99.99% availability).

Any thoughts along these lines on how we can become more resilient?
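
To put numbers on the four-nines concern, the sketch below works out the yearly downtime budget for a given availability target and the combined availability of a shard that is run as several identical replicas. A commonly described setup is to run each shard as a pair of Prometheus servers scraping the same targets and let Thanos Querier deduplicate them. The formula assumes replica failures are independent, which is optimistic when replicas share a node, zone, or rollout, and the 99% single-replica figure is purely hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_budget_minutes(availability: float) -> float:
    """Minutes per year a service may be down at the given availability."""
    return (1 - availability) * MINUTES_PER_YEAR


def combined_availability(single: float, replicas: int) -> float:
    """Availability of a shard that is up if at least one replica is up.

    Assumes replica failures are independent of each other.
    """
    return 1 - (1 - single) ** replicas


# Four nines leaves roughly 52.6 minutes of downtime per year.
print(round(downtime_budget_minutes(0.9999), 1))

# Hypothetical: a single replica that is up 99% of the time.
for n in (1, 2, 3):
    print(n, "replica(s):", round(combined_availability(0.99, n), 6))
```

Under these assumptions, a single 99% replica falls well short of the target, while two independent replicas of the same shard reach roughly 99.99%.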
