Scalability Limits of Single Prometheus Instance


mohan garden

Jun 11, 2024, 1:24:19 AM
to Prometheus Users
Hi All,
I am reaching out to gather quantitative insights and experiences regarding the scalability of a single Prometheus instance. I understand that performance and scalability can vary significantly based on different aspects of the infrastructure, such as whether the backend storage is local disk or NFS, network bandwidth, the number of targets and metrics per target, scrape interval, etc.
For the following questions you can assume ideal conditions, i.e. optimal hardware, maximum network bandwidth, and local disk (say, SSD). Alternatively, if you can share information on your hardware and the corresponding performance metrics, that would help.

Here are the questions:
Q1: What is the practical limit on the number of active series which Prometheus can handle, or what is the maximum number of active series to which Prometheus can scale?

      https://www.robustperception.io/scaling-and-federating-prometheus
      This article was written in 2015 and mentions that a single instance can scale to at least 1M series (1000 servers x 1000 metrics each).
      https://prometheus.io/docs/introduction/faq/#i-was-told-prometheus-doesnt-scale
      Here I assume that, under ideal conditions, it can scale to between 20M and 90M series.
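To put rough numbers on Q1, here is a back-of-the-envelope sketch. The ~3 KB of head memory per active series is only a commonly quoted ballpark (an assumption of mine, not an official Prometheus figure), and the `estimate` helper is hypothetical:

```python
# Rough capacity estimate for a single Prometheus instance.
# NOTE: the bytes-per-series default is an assumed ballpark for
# head-block memory, not an official Prometheus number.

def estimate(targets: int, series_per_target: int,
             bytes_per_series: int = 3 * 1024) -> tuple[int, float]:
    """Return (active series count, estimated head memory in GiB)."""
    active_series = targets * series_per_target
    mem_gib = active_series * bytes_per_series / 2**30
    return active_series, mem_gib

# The 2015 article's example: 1000 servers x 1000 metrics each.
series, mem = estimate(1000, 1000)
print(f"{series:,} active series, ~{mem:.1f} GiB head memory")
```

Scaling the same inputs toward the 20M-90M range from the FAQ link shows why memory, not disk, tends to be the first constraint.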

Q2: What are the practical limits for storage and data retention in a single instance?

Q3: What is the highest number of targets and total metrics per target that can be efficiently scraped by a single instance?

Q4: How does query performance (latency) scale with an increasing number of metrics and targets?

Any shared experiences, benchmarks, or references to relevant documentation would help.

Thank you 
Regards,


Ben Kochie

Jun 11, 2024, 3:39:15 AM
to mohan garden, Prometheus Users
To start, NFS is not supported. Only local disk storage.

Q1:
Prometheus today can scale to about 100M series, but operates a bit better below 50M series.

Q2:
Infinite; the 2.0 TSDB has no practical storage limit.

Q3:
I've heard about instances with upwards of 50k targets.

Q4:
Query performance can start to degrade when you try to query over about 10M series or a large time range. A good proxy for this is how many samples must be loaded to solve a single query. There is a flag, --query.max-samples, with a default of 50M, that cancels queries that are too large. This is enough to query 1400 metrics over a one-year time range with 15s scrape intervals.

Of course, this can easily be increased depending on your infra.
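The samples-loaded proxy above can be sketched numerically. The formula below (series x range / scrape interval) is a simplification that applies to queries touching every raw sample in range; the 50M constant is the default --query.max-samples value mentioned above, and the helper name is mine:

```python
# Estimate how many raw samples a query loads and compare against
# Prometheus's default --query.max-samples budget.
# samples ~= series * (time range / scrape interval) is a
# simplification for queries that read every raw sample in range.

MAX_SAMPLES = 50_000_000  # default --query.max-samples

def samples_loaded(series: int, range_seconds: int,
                   scrape_interval_seconds: int = 15) -> int:
    return series * (range_seconds // scrape_interval_seconds)

# e.g. a rate(...[1h]) over 10,000 series at a 15s scrape interval
# loads 240 samples per series:
one_hour = samples_loaded(series=10_000, range_seconds=3600)
print(one_hour, one_hour <= MAX_SAMPLES)
```

This is why wide-but-short queries can stay under the budget while a long time range over many series gets cancelled.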

We run benchmarks with every Prometheus release with https://github.com/prometheus/test-infra
