Common tag to metrics


ritesh patel

Jan 21, 2022, 3:44:23 AM
to promethe...@googlegroups.com
Hello Team,

I want to monitor metrics from 10,000 servers via Telegraf. Prometheus is running standalone, not in Docker or Kubernetes. What is the ideal host size for running Prometheus without failures?

Currently I have a 2-core CPU, 16 GB of memory, and a 100 GB disk. But when I add 2,000 hosts as targets, memory utilization goes very high and then the Prometheus service goes down.

Please help me set up Prometheus.

Thanks and Regards
Ritesh patel 

Brian Candler

Jan 21, 2022, 4:56:04 AM
to Prometheus Users
What matters is the number of timeseries, not the number of hosts.  Try scraping a single host, and then see how many timeseries you get.  e.g. use this promql query:

count({__name__=~".+",instance="xxxxxxx"})

where xxxxxxx is the instance label that you scraped; it may be ip:port.  Then you'll get an idea of how many timeseries telegraf exposes for a single host.  (I don't use telegraf.  Personally I'd suggest node_exporter instead).
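
If you would rather script this check than type the query into the web interface, here is a minimal Python sketch. It only builds the URL for the Prometheus instant-query endpoint (/api/v1/query) and parses the JSON that endpoint returns; fetching the URL is left to you. The base URL and the instance value in the usage note are hypothetical placeholders, and the helper names are made up for illustration.

```python
import json
from urllib.parse import urlencode


def build_count_url(base_url: str, instance: str) -> str:
    """Return the instant-query URL that counts all series for one instance.

    Assumes the instance value contains no double quotes or backslashes.
    """
    promql = 'count({__name__=~".+",instance="%s"})' % instance
    return base_url.rstrip("/") + "/api/v1/query?" + urlencode({"query": promql})


def parse_count(body: str) -> int:
    """Extract the series count from an instant-query JSON response body."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError("query failed: %r" % payload.get("error"))
    result = payload["data"]["result"]
    # count() over an empty selection returns no samples at all.
    if not result:
        return 0
    # Each sample value is a pair: [unix_timestamp, "value-as-string"].
    return int(float(result[0]["value"][1]))
```

For example, build_count_url("http://localhost:9090", "10.0.0.1:9273") gives a URL you can fetch with curl or urllib, and parse_count() turns the response body into a plain integer.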

You can also get useful information by going to the Prometheus web interface and going to Status > TSDB Status.  The "Number of series" in "Head stats" is what you're looking for.  Try this with say 1 target, then 101 targets, and see how much it increases.
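
The 1-target versus 101-target comparison boils down to a simple difference. A minimal sketch, assuming you read the two "Number of series" values off the TSDB Status page by hand; the figures in the usage note are invented for illustration, not measurements:

```python
def series_per_target(head_series_before: int,
                      head_series_after: int,
                      targets_added: int) -> float:
    """Average number of head series contributed by each newly added target."""
    if targets_added <= 0:
        raise ValueError("targets_added must be positive")
    return (head_series_after - head_series_before) / targets_added
```

With hypothetical readings of 1,500 series at 1 target and 101,500 series at 101 targets, series_per_target(1500, 101500, 100) comes out at 1000.0 series per target. Give Prometheus a few scrape intervals after adding targets before taking the second reading, so that every new target has actually been scraped.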

Once you know roughly how many timeseries you expect to collect, then there's a memory estimator here:

Suppose you're generating 1000 metrics per host.  Then 2000 hosts would be 2 million timeseries, which is fairly high - this is the point at which you typically start thinking about splitting into multiple prometheus servers, each scraping a subset of targets.
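
The arithmetic above can be sketched in a few lines of Python. The 1000-series-per-host figure is the supposition from this post, and the per-server budget passed to servers_needed() is purely an assumption you would replace with whatever your own hardware turns out to handle:

```python
def total_series(series_per_host: int, hosts: int) -> int:
    """Projected number of active timeseries across all hosts."""
    return series_per_host * hosts


def servers_needed(series: int, budget_per_server: int) -> int:
    """Number of Prometheus servers needed if each one takes at most
    budget_per_server series (integer ceiling division)."""
    if budget_per_server <= 0:
        raise ValueError("budget_per_server must be positive")
    return (series + budget_per_server - 1) // budget_per_server
```

total_series(1000, 2000) is 2,000,000 and total_series(1000, 10000) is 10,000,000. If you assumed, hypothetically, a budget of 2 million series per server, servers_needed(10_000_000, 2_000_000) would give 5 servers, each scraping a subset of the targets.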

16GB may be able to handle this, but as you can see from the estimator, it's also quite sensitive to the number of labels per timeseries and the number of unique values per label.  Make sure you're running a recent version of prometheus; newer versions are more RAM-efficient.  And make sure you're using a block filesystem for storage (e.g. local disk or EBS), not a shared filesystem (definitely *not* NFS or SMB).

Also beware: if telegraf is configured to do something stupid, like expose a label with high cardinality which keeps changing, then it can cause the number of active timeseries to explode.  It's up to you to manage this risk.  node_exporter is a safer bet in my opinion, but that's only because I haven't used telegraf with prometheus.
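
One rough way to spot such a label is to count the distinct values each label takes across the series of a single target. A minimal sketch, assuming you have already collected the label sets as a list of dicts (for example from a query result); the label names in the usage note are hypothetical:

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def label_cardinality(series: Iterable[Mapping[str, str]]) -> Dict[str, int]:
    """Count the distinct values seen for each label name."""
    values = defaultdict(set)
    for labels in series:
        for name, value in labels.items():
            values[name].add(value)
    return {name: len(seen) for name, seen in values.items()}
```

Given series labelled {"cpu": "cpu0", "request_id": "a1"}, {"cpu": "cpu0", "request_id": "b2"} and {"cpu": "cpu1", "request_id": "c3"}, the result is {"cpu": 2, "request_id": 3}. A label whose count keeps growing from one scrape to the next is the kind that makes the number of active timeseries explode.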