Estimating capacity needs

lefthan...@gmail.com

Dec 31, 2016, 9:16:20 PM
to Prometheus Users
I would like to estimate how many servers I would need to monitor our network infrastructure. I would be using the snmp_exporter to poll the devices and expose the metrics to Prometheus. 

Estimate of setup:

2,000 network devices
3,500 metrics per device
   50 alarms that would cover just about every metric/label
    8 labels per metric
    1 minute polling interval


I saw the benchmark data in the FAQ (525k samples/sec in 1.4M time series, 1650 targets), but I am new to time series and don't know how to translate my setup to compare it with the benchmark. Is the number of time series in my case (total metrics) * (labels)?

I would also like to estimate disk space usage over a period of time, to get an idea of how long I can keep data without aggregating. Is there a formula that people use to estimate disk space (average or worst case)? If not, would it be fair to monitor a single device and extrapolate? Is there a metric exposed by Prometheus itself that I could use to watch disk utilization over time if I set up a single device?


Julius Volz

Jan 1, 2017, 5:11:28 AM
to lefthan...@gmail.com, Prometheus Users
On Sun, Jan 1, 2017 at 3:16 AM, <lefthan...@gmail.com> wrote:
I saw the benchmark data in the FAQ (525k samples/sec in 1.4M time series, 1650 targets), but I am new to time series and don't know how to translate my setup to compare it with the benchmark. Is the number of time series in my case (total metrics) * (labels)?

Each unique combination of labels (the metric name is also just a special label) represents one time series. On beefy hardware, Prometheus can handle up to a couple of million time series, so that's your budget per Prometheus server.

The number of label names on a metric matters less than the number of values those labels can take on (especially in combination with other labels), since that is what determines the number of time series. Still, 8 labels per metric sounds on the high side. Even if each has only a few values, the total cardinality could easily get quite big.
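
As a rough worked example of the series budget, the arithmetic for the setup above might look like the sketch below. It assumes that "3,500 metrics per device" means 3,500 distinct label combinations (time series) per device, which the thread does not confirm, and it reads "a couple of million" as 2,000,000 series per server. The function and parameter names are illustrative only.

```python
import math

def estimate_load(devices, series_per_device, scrape_interval_s, series_budget_per_server):
    """Return (total series, samples per second, servers needed) for a polling setup."""
    total_series = devices * series_per_device
    # Each series yields one sample per scrape interval.
    samples_per_second = total_series / scrape_interval_s
    servers = math.ceil(total_series / series_budget_per_server)
    return total_series, samples_per_second, servers

# Device count, series per device, and interval come from the original post.
# The 2,000,000 budget is an assumed reading of "a couple of million" series.
total, rate, servers = estimate_load(2000, 3500, 60, 2_000_000)
print(f"{total:,} series, {rate:,.0f} samples/sec, about {servers} server(s)")
```

Under these assumptions the setup comes to 7 million series but only about 117k samples per second, so compared with the FAQ benchmark (1.4M series at 525k samples/sec) the series count, not the ingestion rate, would be the limiting factor.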
 
I would also like to estimate disk space usage over a period of time, to get an idea of how long I can keep data without aggregating. Is there a formula that people use to estimate disk space (average or worst case)? If not, would it be fair to monitor a single device and extrapolate? Is there a metric exposed by Prometheus itself that I could use to watch disk utilization over time if I set up a single device?

There's some overhead for indexing time series, but after that, you can get an idea of average bytes-per-sample from the table in https://prometheus.io/docs/operating/storage/#chunk-encoding. It depends a bit on the exact shape of your data though.
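
A back-of-the-envelope version of that estimate could be sketched as follows. It deliberately ignores the indexing overhead mentioned above, and the bytes-per-sample value and retention period are placeholders rather than figures from the linked table or from any measurement; the 7,000,000 series figure assumes 2,000 devices with 3,500 series each.

```python
def estimate_disk_bytes(total_series, scrape_interval_s, retention_days, bytes_per_sample):
    """Estimate sample storage in bytes, ignoring index and other overhead."""
    samples_per_second = total_series / scrape_interval_s
    retention_seconds = retention_days * 24 * 60 * 60
    return samples_per_second * retention_seconds * bytes_per_sample

# bytes_per_sample=2.0 and retention_days=30 are placeholders, not measured values.
# Take the real bytes-per-sample figure from the chunk-encoding table or from a
# small test deployment with your own data.
size = estimate_disk_bytes(7_000_000, 60, 30, 2.0)
print(f"{size / 1e9:.1f} GB")
```

With these placeholder inputs the result is roughly 605 GB for 30 days. Since the real bytes-per-sample depends on the shape of the data, monitoring a single device and extrapolating, as suggested in the question, would be a reasonable way to pin that parameter down.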

Depending on what kind of network devices you are monitoring, the network devices may actually be your bottleneck when you want a scrape interval of 1m. Some devices take *minutes* each to return all the requested metrics over SNMP (because production-grade network device manufacturers seemingly haven't seen pressure to improve their metrics situation yet). I don't have any experience with that myself, but have seen other people run into that problem multiple times.
 


--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-users+unsubscribe@googlegroups.com.
To post to this group, send email to prometheus-users@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/bec1acdb-e225-44a5-887e-bd53504bedc8%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

californi...@gmail.com

Jan 11, 2017, 7:38:11 PM
to Prometheus Users, lefthan...@gmail.com
The Prometheus server seems to have enough performance to handle thousands of metrics from network devices. What about the capacity of the SNMP exporter? Given the bottleneck mentioned above, the SNMP exporter has to keep more UDP requests in flight while retrieving MIB data, because slow network devices take longer to return it, and that consumes more memory to track the outstanding requests. I might have to scale out to several SNMP exporters per Prometheus server to get enough performance, but I don't know how many network devices a single exporter can handle.
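
To put a rough number on that concern, the average count of SNMP walks in progress at once can be estimated with Little's law (concurrency = arrival rate x duration). This is only a sketch: it assumes scrapes are spread evenly across the interval, the walk durations are hypothetical, and it says nothing about how much memory the exporter actually needs per walk.

```python
def average_inflight_walks(devices, walk_duration_s, scrape_interval_s):
    """Average number of SNMP walks in progress at once (Little's law)."""
    scrapes_per_second = devices / scrape_interval_s
    return scrapes_per_second * walk_duration_s

# 2,000 devices and a 60 s interval come from the thread.
# The per-device walk durations (1 s, 10 s, 30 s) are hypothetical.
for walk_s in (1, 10, 30):
    inflight = round(average_inflight_walks(2000, walk_s, 60))
    print(f"{walk_s} s per walk -> about {inflight} walks in flight")
```

The point of the sketch is that concurrency grows linearly with walk duration, so measuring how long a typical device takes to answer is the first step in sizing the exporter.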

Julius Volz

Jan 12, 2017, 9:51:14 AM
to californi...@gmail.com, Prometheus Users, jeffrey mcauley
I don't know all the details of how the SNMP exporter works, so I'm not sure whether a single instance would scale to your network of 2000 devices, but it should at least be trivial to scale. Since it's completely stateless, you could put multiple instances of it behind a load balancer, or even just scrape them as multiple jobs from Prometheus.
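
One possible way to prepare the "multiple jobs" variant is to split the device list into stable shards, one per exporter instance, and give each scrape job one shard. The sketch below is an illustration only: the device names and the shard count of 4 are hypothetical, and how the resulting lists are fed into the Prometheus configuration is left open.

```python
import hashlib

def shard_targets(targets, exporter_count):
    """Assign each device to one exporter instance by a stable hash of its name."""
    shards = [[] for _ in range(exporter_count)]
    for target in targets:
        digest = hashlib.sha256(target.encode("utf-8")).digest()
        # Use the first 8 bytes of the digest as an integer and map it to a shard.
        index = int.from_bytes(digest[:8], "big") % exporter_count
        shards[index].append(target)
    return shards

# Hypothetical device names; in practice these would be the real SNMP targets.
devices = [f"switch-{n:04d}.example.net" for n in range(2000)]
shards = shard_targets(devices, 4)
print([len(s) for s in shards])
```

Hashing the name rather than slicing the list by position keeps each device on the same exporter when other devices are added or removed, although changing the number of exporters reshuffles most assignments.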


californi...@gmail.com

Jan 12, 2017, 11:15:10 PM
to Prometheus Users, californi...@gmail.com, lefthan...@gmail.com
Thank you for your reply. I don't want to assign lots of CPU/RAM resources to multiple SNMP exporters, so I hope that won't be necessary in our environment.