Prometheus HA Setup


yagyans...@gmail.com

Oct 10, 2020, 3:06:25 AM10/10/20
to Prometheus Users
Hi. I have a vanilla Prometheus setup with 50 days of retention and a data size of around 4.6TB for that retention. I want to move to an HA setup to avoid a single point of failure.
I'm a little confused about how to approach the points below:

a) With an HA pair, does the Prometheus data necessarily have to be local to both Prometheus instances? That would require me to provision two 5TB disks, one for each instance.
Is it a good idea to have these 2 Prometheus instances write to an NFS disk instead?

b) With both halves of the HA pair scraping the same targets, how do I build a global view across these local Prometheus instances? Is federation preferable, or is there a better way to approach this?

c) Since 2 instances will be scraping the same targets, does that add any overhead on the target side?

Thanks in advance.

yagyans...@gmail.com

Oct 10, 2020, 3:43:32 AM10/10/20
to Prometheus Users
d) If we do use 2 separate disks for the 2 instances, how will we manage the config files? Is there any way to make changes on one instance and have them replicated to the other automatically, or will we have to do that manually?

Stuart Clark

Oct 10, 2020, 3:54:50 AM10/10/20
to yagyans...@gmail.com, Prometheus Users

a) Yes, you would have two disks. NFS is not recommended for a number of reasons, including performance. It would also create a single point of failure that could break both machines at once. And with NFS it is easy to accidentally point both instances at the same storage location, which would cause data corruption.

b) There are a number of solutions, but one option would be to run promxy in front of Prometheus (so things like Grafana would query promxy), which queries both servers to create a single view, removing any gaps.
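For illustration, a minimal promxy config along these lines should work (hostnames are placeholders, not from this thread; anti_affinity tells promxy to treat near-simultaneous points from the two replicas as duplicates):

promxy:
  server_groups:
    - static_configs:
        - targets:
            - prometheus-a:9090
            - prometheus-b:9090
      # merge the HA pair: points within 10s of each other count as the same sample
      anti_affinity: 10s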

c) Exporters and client libraries are designed to be low impact when used correctly, so additional scraping should have minimal impact.

d) Control of your config files is down to whatever container/configuration management solution you use. Generally the only difference between the servers might be the external label settings.
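As a sketch, that per-server difference is typically a single block in prometheus.yml, with everything else identical (label names and values here are illustrative):

global:
  external_labels:
    cluster: prod
    replica: A   # the only per-server difference; set to B on the second instance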

Ben Kochie

Oct 10, 2020, 5:02:14 AM10/10/20
to yagyans...@gmail.com, Prometheus Users
4.6TB for 50 days seems like a lot. How many metrics and how many samples per second are you collecting? Just estimating from the data size, it sounds like you might have more than 10 million series and 600-700 thousand samples per second. This might be the time to start thinking about sharding.

You can check for sure with these queries:

prometheus_tsdb_head_series

rate(prometheus_tsdb_head_samples_appended_total[1h])
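As a rough sanity check of the disk math (assuming the ~1-2 bytes per sample that TSDB compression typically achieves; the 650k figure is just the midpoint of the estimate above):

650,000 samples/s * 86,400 s/day * 50 days ≈ 2.8e12 samples
2.8e12 samples * ~1.6 bytes/sample ≈ 4.5 TB, which is in line with the 4.6TB observed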

For handling HA clustering and sharding, I recommend looking into Thanos. It can be added to your existing Prometheus and rolled out incrementally.

> d) If we do use 2 separate disks for the 2 instances, how will we manage the config files?

If you don't have any configuration management, I recommend using https://github.com/cloudalchemy/ansible-prometheus. It's very easy to get going.




yagyans...@gmail.com

Oct 11, 2020, 4:36:46 AM10/11/20
to Prometheus Users
Thanks a lot, Stuart.

What is your opinion on using VictoriaMetrics on top of an HA Prometheus setup for long-term storage? That would allow me to reduce the retention period on the individual Prometheus instances and offload the majority of the data to VictoriaMetrics. This way I'll have to manage only a single huge persistent storage disk.
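A minimal sketch of that wiring, assuming a single-node VictoriaMetrics on its default port (the hostname is a placeholder): both HA replicas get the same remote_write block in prometheus.yml:

remote_write:
  - url: http://victoriametrics.example.internal:8428/api/v1/write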

yagyans...@gmail.com

Oct 11, 2020, 4:44:54 AM10/11/20
to Prometheus Users
On Saturday, October 10, 2020 at 2:32:14 PM UTC+5:30 sup...@gmail.com wrote:
4.6TB for 50 days seems like a lot. How many metrics and how many samples per second are you collecting? Just estimating from the data size, it sounds like you might have more than 10 million series and 600-700 thousand samples per second. This might be the time to start thinking about sharding.
You can check for sure with these queries:
prometheus_tsdb_head_series
rate(prometheus_tsdb_head_samples_appended_total[1h])
>>>>
Hi Ben, my time series count hasn't touched 10 million yet; it's around 5.5 million as of now. But my sample rate is quite steep, sitting at approximately 643,522 samples per second. Since my time series are still manageable by a single Prometheus instance, I am avoiding sharding for now because it would complicate the entire setup. What are your thoughts on this?

For handling HA clustering and sharding, I recommend looking into Thanos. It can be added to your existing Prometheus and rolled out incrementally.
>>>>
Yes, I looked at Thanos, but my only problem is that Thanos uses object storage for long-term retention, which adds latency when querying old data. That is why I am inclined towards VictoriaMetrics. What's your view on going with VictoriaMetrics?

> d) If we do use 2 separate disks for the 2 instances, how will we manage the config files?
If you don't have any configuration management, I recommend using https://github.com/cloudalchemy/ansible-prometheus. It's very easy to get going.
>>>>
Thanks. I'll check it out.

Ben Kochie

Oct 11, 2020, 9:16:56 AM10/11/20
to yagyans...@gmail.com, Prometheus Users
On Sun, Oct 11, 2020 at 10:45 AM yagyans...@gmail.com <yagyans...@gmail.com> wrote:
On Saturday, October 10, 2020 at 2:32:14 PM UTC+5:30 sup...@gmail.com wrote:
4.6TB for 50 days seems like a lot. How many metrics and how many samples per second are you collecting? Just estimating from the data size, it sounds like you might have more than 10 million series and 600-700 thousand samples per second. This might be the time to start thinking about sharding.
You can check for sure with these queries:
prometheus_tsdb_head_series
rate(prometheus_tsdb_head_samples_appended_total[1h])
>>>>
Hi Ben, my time series count hasn't touched 10 million yet; it's around 5.5 million as of now. But my sample rate is quite steep, sitting at approximately 643,522 samples per second. Since my time series are still manageable by a single Prometheus instance, I am avoiding sharding for now because it would complicate the entire setup. What are your thoughts on this?

I usually recommend thinking about a sharding plan when you hit this level. You don't need to yet, but it's worth thinking about how you would.
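If it helps, the usual Prometheus-native sharding pattern is hashmod relabelling, where each server keeps a deterministic slice of a shared target list; a sketch for a two-way split (the job name is illustrative):

scrape_configs:
  - job_name: node
    relabel_configs:
      - source_labels: [__address__]
        modulus: 2              # total number of shards
        target_label: __tmp_hash
        action: hashmod
      - source_labels: [__tmp_hash]
        regex: "0"              # this server's shard number; 1 on the other shard
        action: keep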
 

For handling HA clustering and sharding, I recommend looking into Thanos. It can be added to your existing Prometheus and rolled out incrementally.
>>>>
Yes, I looked at Thanos, but my only problem is that Thanos uses object storage for long-term retention, which adds latency when querying old data. That is why I am inclined towards VictoriaMetrics. What's your view on going with VictoriaMetrics?

You don't need to use Thanos for long-term storage. It works just fine as a query-proxy-only setup. This is how we got into using Thanos: we had an existing sharded fleet of Prometheus HA instances and had been using multiple Grafana data sources with a simple nginx reverse proxy for HA querying. We added Thanos Query/Sidecar just to provide a single query interface. It wasn't until some time later that we started to use object storage.

Thanos object storage is optional; it can use the Prometheus TSDB as the backend.

That said, Thanos object storage latency isn't a huge problem. It depends a bit on which object storage provider/software you use, but it works just fine.
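A minimal sketch of that query-proxy-only rollout (no object storage; ports are the Thanos defaults, hostnames are placeholders, and the replica label matches whatever external label distinguishes your HA pair):

# next to each Prometheus replica:
thanos sidecar \
  --prometheus.url http://localhost:9090 \
  --tsdb.path /var/lib/prometheus

# one query layer fanning out to both sidecars, deduplicating by replica label:
thanos query \
  --store prometheus-a:10901 \
  --store prometheus-b:10901 \
  --query.replica-label replica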

I don't recommend VictoriaMetrics. I would go with Thanos or Cortex, as these are maintained by core Prometheus community contributors.
 

> d) If we do use 2 separate disks for the 2 instances, how will we manage the config files?
If you don't have any configuration management, I recommend using https://github.com/cloudalchemy/ansible-prometheus. It's very easy to get going.
>>>>
Thanks. I'll check it out.


yagyans...@gmail.com

Oct 11, 2020, 12:29:52 PM10/11/20
to Prometheus Users
Thanks, Ben.

One thing: I don't want to maintain two 5TB disks for Prometheus HA, i.e. one 5TB disk on each instance. That is why I want to put a single huge disk in my VictoriaMetrics instance and maintain one persistent disk rather than two. Can Thanos also store the data on a persistent disk rather than in object storage? From the docs I have seen so far, I haven't found this feature in Thanos. This is the sole reason I am inclined towards VictoriaMetrics.

Ben Kochie

Oct 11, 2020, 3:27:17 PM10/11/20
to yagyans...@gmail.com, Prometheus Users
I don't think this is something you can or should be optimizing for. You are on the edge of needing to shard, which means you will need to manage many individual instance disks.

But if you really want to have a single big disk for storage, you can use Minio[0] for simple object storage if you aren't already on a platform that provides object storage.

[0] https://min.io/
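A sketch of the Thanos objstore config pointed at a MinIO serving that single big disk (bucket, endpoint, and credentials are placeholders):

type: S3
config:
  bucket: thanos
  endpoint: minio.example.internal:9000
  access_key: <access-key>
  secret_key: <secret-key>
  insecure: true   # plain HTTP inside the network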


yagyans...@gmail.com

Oct 13, 2020, 1:45:02 PM10/13/20
to Prometheus Users
Thanks, Ben.

What happens to alerting in the case of HA Prometheus when using Thanos/VictoriaMetrics/Cortex on top of the 2 Prometheus instances?
