> You would need additional tools like corosync or keepalived to achieve this.
Thanks for pointing me to those tools. I did a quick search and didn't find much on setting up Prometheus HA with corosync/keepalived. That said, Amazon Managed Service for Prometheus appears to use a similar leader-failover approach for deduplication [1]:
> When you set up deduplication, Amazon Managed Service for Prometheus makes one Prometheus instance a leader replica and ingests data samples only from that replica. If the leader replica stops sending data samples to Amazon Managed Service for Prometheus for 30 seconds, Amazon Managed Service for Prometheus automatically makes another Prometheus instance a leader replica and ingests data from the new leader.
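
To make sure I'm reading that right, here is a minimal Python sketch of the failover behavior the quote describes. It is only a model of the described behavior, not AWS's implementation: `LeaderTracker`, `accept`, and the replica names are hypothetical, and the only value taken from the quote is the 30-second timeout.

```python
from dataclasses import dataclass
from typing import Optional

FAILOVER_TIMEOUT_S = 30.0  # the 30 seconds mentioned in the quoted docs


@dataclass
class LeaderTracker:
    """Hypothetical model: ingest from one leader replica, fail over on silence."""

    timeout_s: float = FAILOVER_TIMEOUT_S
    leader: Optional[str] = None
    last_seen: float = 0.0  # arrival time of the leader's last accepted sample

    def accept(self, replica: str, now: float) -> bool:
        """Return True if a sample from `replica` arriving at `now` is ingested."""
        leader_silent = now - self.last_seen >= self.timeout_s
        if self.leader is None or (replica != self.leader and leader_silent):
            # No leader yet, or the leader has been quiet too long: promote sender.
            self.leader = replica
        if replica != self.leader:
            return False  # duplicate from a non-leader replica, dropped
        self.last_seen = now
        return True


if __name__ == "__main__":
    tracker = LeaderTracker()
    # (replica, arrival time in seconds); replica-a goes quiet after t=15
    arrivals = [("replica-a", 0.0), ("replica-b", 5.0), ("replica-a", 15.0),
                ("replica-b", 20.0), ("replica-b", 45.0), ("replica-a", 50.0)]
    for replica, t in arrivals:
        print(t, replica, "ingested" if tracker.accept(replica, t) else "dropped")
```

Note that in this model every replica still scrapes and sends everything; only the ingestion side picks a leader.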
I'm a bit surprised that this approach isn't used more.
Re using multiple replicas: my concern is the cost in bigger clusters. Running multiple replicas per shard adds up, especially when we know the duplicate data will just be deduplicated downstream.