Need a PromQL query or metric to see the HTTP request count or API calls against the Prometheus endpoint

Moksha Reddy G

Sep 12, 2024, 9:38:00 PM
to Prometheus Users
Hi Everyone,

We are facing a strange and serious issue with our Prometheus pods running on Azure instances: the pods restart frequently whenever the load balancer or DNS points to Prometheus. I need a metric or a PromQL query that shows the incoming request count and API calls against the Prometheus endpoint at the time of the pod restarts. I would really appreciate your help with this, thank you!

A little background on what we have done so far:

  • I tried the scrape_samples_scraped metric. It shows a spike once a day, BUT the pods are restarted more than 20 times a day.
  • I tried the http_requests_total metric mentioned in https://prometheus.io/docs/prometheus/latest/querying/examples/ BUT it did not show any spike in requests at all.
  • I also found the prometheus_http_requests_total metric, and I don't see any spike there either.
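
For reference, a minimal sketch of how the request rate against the Prometheus endpoint could be graphed. prometheus_http_requests_total is a counter, so its raw value only ever rises and spikes show up only once it is wrapped in rate(). The job="prometheus" selector and the 5m window are assumptions and need to match the actual scrape configuration. Note also that http_requests_total on the documentation page is an example metric name that a given server may not expose, and that scrape_samples_scraped counts samples ingested per scrape, not incoming HTTP requests.

```promql
# Run each expression separately in the expression browser or a dashboard.

# Per-second rate of HTTP requests served by Prometheus, by handler and status code.
# job="prometheus" is an assumed label value.
sum by (handler, code) (rate(prometheus_http_requests_total{job="prometheus"}[5m]))

# Same rate, restricted to the instant and range query API endpoints.
sum by (handler) (
  rate(prometheus_http_requests_total{job="prometheus", handler=~"/api/v1/query(_range)?"}[5m])
)

# 99th percentile request latency per handler, from the request duration histogram.
histogram_quantile(
  0.99,
  sum by (le, handler) (rate(prometheus_http_request_duration_seconds_bucket{job="prometheus"}[5m]))
)
```

If the rate stays flat around the restarts, the request count alone is probably not the trigger, and the cost of individual requests is worth a look instead.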

To address the pod restarts, we took the actions below to understand their cause, but with no luck. Do you have any recommendation or solution to stop these restarts?

  1. We spun up new pods without pointing any LB or DNS at them, and those pods did NOT restart. But this was only a test, as we cannot go live without an LB or DNS!
  2. We checked the access logs to see whether any HTTP requests from applications are causing the issue. We are seeing many readiness probe failures for Prometheus. As a workaround, we increased the readiness and liveness probe timeouts, but this didn't help.
  3. We tried deleting the /wal directory to clear any broken files and avoid the Prometheus pod restarts, but the issue comes back after a few hours. At least there were NO immediate restarts!
  4. We scaled up the Azure instance type so the pods have enough resources to handle the load (which is invisible in Prometheus). Azure monitoring does not show any spike or high usage, yet the Prometheus pods are still getting restarted.
  5. We had a call with the app teams to cross-check whether our Prometheus is being hit by any applications/services or a load test. We still see the restarts even after suspending some apps.
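
To line the restarts up against other metrics, the restart times themselves can be graphed. The following is only a sketch under assumptions: job="prometheus" and container="prometheus" are assumed label values, and the second expression only works if kube-state-metrics is deployed and scraped.

```promql
# Run each expression separately.

# How often the process start time changed over the last day, which roughly
# counts the restarts that were observed. Restarts that happen between two
# scrapes are missed, so this can undercount.
changes(process_start_time_seconds{job="prometheus"}[1d])

# Container restarts as reported by kube-state-metrics (assumed to be available).
increase(kube_pod_container_status_restarts_total{container="prometheus"}[1d])
```

A Prometheus server that keeps restarting has gaps in its own data, so these are more reliable when evaluated on a second, stable Prometheus that scrapes the affected one.
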
Best,
Moksh

Julius Volz

Sep 17, 2024, 5:35:47 AM
to Moksha Reddy G, Prometheus Users
Hi Moksha,

If you don't see query rates spiking up at the problematic times, here are a few ideas:

* Have you confirmed that the Prometheus pods die with an OOM (out-of-memory failure) and not for some other reason (e.g. do the logs of the killed pods show any crash errors)?

* What do the various memory metrics ("process_resident_memory_bytes" / "process_virtual_memory_bytes" / "container_memory_rss") look like for the Prometheus processes before they die? Do they increase suddenly before an OOM, or do they just gradually creep up until the server dies?

* It could still be that a single large query takes out your Prometheus server, although the general rate of queries doesn't increase. You can use Prometheus' active query log feature to figure out what query was running while your Prometheus server crashed / got killed. See https://training.promlabs.com/training/monitoring-and-debugging-prometheus/logs/active-queries-log/

* You can also generally log all PromQL queries that a Prometheus server receives to a file, see: https://training.promlabs.com/training/monitoring-and-debugging-prometheus/logs/query-log/ - the limitation of this approach is that it will only log completed queries, so if your server dies while processing a query, it will not show up in that log (for that you will need to use the active query log approach above).
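
The checks above can be sketched as PromQL, with some assumptions: job="prometheus" and container="prometheus" are assumed label values, the kube_pod_* metric requires kube-state-metrics, and container_memory_rss requires cAdvisor/kubelet metrics to be scraped.

```promql
# Run each expression separately.

# Was the last container termination an OOM kill? Returns 1 per matching container.
kube_pod_container_status_last_terminated_reason{container="prometheus", reason="OOMKilled"}

# Resident memory of the Prometheus process, and its per-second trend over 15 minutes.
# A sudden jump hints at one expensive request; a slow climb hints at gradual growth.
process_resident_memory_bytes{job="prometheus"}
deriv(process_resident_memory_bytes{job="prometheus"}[15m])

# Container-level RSS as seen by cAdvisor.
container_memory_rss{container="prometheus"}

# Queries currently executing, and the 99th percentile query evaluation time.
prometheus_engine_queries{job="prometheus"}
prometheus_engine_query_duration_seconds{job="prometheus", slice="inner_eval", quantile="0.99"}
```

These only show what was scraped before the process died, so the last data points before each gap are the interesting ones.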

Cheers,
Julius

--
Julius Volz
PromLabs - promlabs.com