I have been using a Thanos-Prometheus stack and am running into high-cardinality issues. When high-cardinality queries are fired, CPU usage climbs to ~80% and then drops, the queries fail with an "Http superfluous" exception, and the Prometheus instance then goes down.
We are looking at the following options:
1) We are running only 2 instances of Prometheus behind Thanos Querier and need guidance on where we can add more capacity to handle huge queries.
2) Can a front-end cache, such as the Cortex cache, help here with high-cardinality queries?
3) Are there any optimal Linux parameters, such as hugepages, that would help with high-cardinality issues?
RCA so far: I have observed that CPU usage was climbing to ~80% and the Prometheus server was going down. I also see a lot of memory residing in cache memory.
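To narrow down which metrics contribute most to the cardinality, something like the sketch below could be used. It assumes the Prometheus TSDB status endpoint (/api/v1/status/tsdb) is reachable on the instance; the base URL, the helper names and the sample payload are hypothetical and not taken from our setup.

```python
import json
from urllib.request import urlopen


def fetch_tsdb_status(base_url):
    """Fetch head-block cardinality statistics from a Prometheus instance.

    base_url is an assumption, e.g. "http://localhost:9090".
    """
    with urlopen(f"{base_url}/api/v1/status/tsdb") as resp:
        return json.load(resp)


def top_series_by_metric(tsdb_status, limit=5):
    """Return (metric_name, series_count) pairs, largest count first."""
    entries = tsdb_status.get("data", {}).get("seriesCountByMetricName", [])
    ranked = sorted(entries, key=lambda e: int(e["value"]), reverse=True)
    return [(e["name"], int(e["value"])) for e in ranked[:limit]]


# Hypothetical payload in the shape returned by the endpoint,
# used here so the sketch runs without a live server.
sample_status = {
    "status": "success",
    "data": {
        "seriesCountByMetricName": [
            {"name": "http_requests_total", "value": 120000},
            {"name": "node_cpu_seconds_total", "value": 4800},
            {"name": "container_memory_usage_bytes", "value": 310000},
        ]
    },
}

for name, count in top_series_by_metric(sample_status, limit=2):
    print(f"{name}: {count} series")
```

Against a live instance, the sample payload would be replaced by the result of fetch_tsdb_status() for each Prometheus instance, so the largest metrics can be compared between the two.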
Even with the above options, we are not sure whether we are looking in the right direction, so we need another pair of eyes. Any pointers would be greatly appreciated here, Brain.
Thanks and Regards
Dinesh