I would like to seek some advice on the design approach we should take.
Given the problem described below, how can we set up Prometheus cost-effectively for a large cluster?
Variables:
Installation: kube-prometheus-stack Helm chart
Autoscale: yes
No. of Nodes: 1000 to 1300
Mesh: Istio
Memory Usage: 50GB (Still gets OOM)
Installed: 1 Prometheus, 1 Kiali, 1 Grafana and 1 Jaeger
Issue:
1. We cannot move Prometheus to a larger node, as 60GB of memory is already expensive (the cost was not approved by management).
2. Removing unnecessary metrics is not yet an option because we do not know which of the Istio, Jaeger and Kiali metrics are actually needed.
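For point 2, one thing we could do without dropping anything yet is rank metric names by series count, to see where the memory is going. Below is a minimal sketch, assuming the Prometheus HTTP API is reachable and using its `/api/v1/status/tsdb` endpoint, which reports `seriesCountByMetricName` for the head block. The function names and the base URL are our own placeholders, and the endpoint returns only the highest-cardinality entries.

```python
import json
import urllib.request


def top_metrics(tsdb_status, limit=10):
    """Return (metric_name, series_count) pairs, largest first.

    `tsdb_status` is the parsed JSON body of GET /api/v1/status/tsdb.
    """
    entries = tsdb_status["data"]["seriesCountByMetricName"]
    ranked = sorted(entries, key=lambda e: int(e["value"]), reverse=True)
    return [(e["name"], int(e["value"])) for e in ranked[:limit]]


def fetch_tsdb_status(base_url):
    """Fetch head-block cardinality statistics from a Prometheus server.

    `base_url` is an assumed address such as "http://localhost:9090".
    """
    url = f"{base_url}/api/v1/status/tsdb"
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)
```

With a port-forward to the Prometheus pod in place, `top_metrics(fetch_tsdb_status("http://localhost:9090"))` would list the heaviest metric names. This only shows which metrics are large, not whether Kiali or Jaeger depend on them.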
Tried solution:
We have federated the single Prometheus instance with Thanos Receivers. However, the issue is still there because Kiali queries its data directly from Prometheus, which eventually gets OOM-killed.
Question:
We are thinking of running a separate Prometheus for each namespace, each with a Thanos sidecar and the same scrape config, since Thanos will deduplicate any duplicated metrics. This approach would solve the issue for Grafana queries but not for Kiali.
How can we set up multiple low-cost Prometheus instances while still giving Kiali a single Prometheus endpoint that covers the whole cluster?
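To make the per-namespace idea concrete, here is a rough sketch of the relabel rule each instance might carry so that it only scrapes its own namespaces. It assumes Kubernetes service discovery (which exposes the `__meta_kubernetes_namespace` label) and plain Prometheus `relabel_configs` syntax; the helper name and the namespace names are hypothetical, and with the operator-based chart the same effect would more likely come from namespace selectors.

```python
import json


def namespace_keep_rule(namespaces):
    """Build a relabel rule that keeps only targets in the given namespaces.

    Namespace names are assumed to be valid DNS labels, so they are joined
    into the regex without escaping.
    """
    if not namespaces:
        raise ValueError("at least one namespace is required")
    return {
        "source_labels": ["__meta_kubernetes_namespace"],
        "regex": "|".join(sorted(namespaces)),
        "action": "keep",
    }


def render(rule):
    """Render the rule as JSON, which is also valid YAML for a scrape config."""
    return json.dumps([rule], indent=2)
```

For example, `render(namespace_keep_rule(["team-a", "team-b"]))` would give the `relabel_configs` entry for an instance covering those two hypothetical namespaces. Each instance would also need its own external labels so that the Thanos sidecars can tell them apart.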
Appreciate any help. Thank you.