I've opened a feature request that I plan to implement for my org and submit upstream to the Prometheus client for Python.
The link is here:
https://github.com/prometheus/client_python/issues/953
The proposal is to equip the Prometheus client with the ability to concurrently respond with an HTTP 200 on a /healthz page that can be *easily* integrated into control planes (e.g. AWS ECS and K8s both use z-pages).
Control planes use z-pages (pioneered by Google, but widely adopted by most load balancers) to determine whether an application is alive and functioning properly based on the HTTP response code. If an application fails to return an HTTP 200 after a configured number of intervals, the container is terminated and a new container is spun up. The Python client for Prometheus has its own web server internally, so I'm proposing to implement z-pages capability in the client.
To be clear: I'm NOT proposing adding functionality to Prometheus. As far as Prometheus core is concerned, these are ordinary "Counters", nothing special.
GitHub user roidelapluie asked me to submit my proposal to the broader Prometheus community for discussion and feedback, which is welcome!
My proposed design:
The existing Prometheus Counter implementation is the superclass, and the proposed "HealthzCounter" will inherit all of its capabilities and behaviors.
Application developers implementing east/west telemetry in their applications can then place HealthzCounters at one or more critical points in the codepath to track whether an application is working (i.e. is the main loop running or blocked).
For applications using Python asyncio, it would be appropriate to implement one HealthzCounter per critical event loop.
The intended behavior: *if*, for example, an MQ or DB client disconnects and doesn't or can't reconnect, the application is running either an error path or a success path, and either can be attributed to any number of potential underlying reasons. Whatever the reason, the application can be easily terminated in a standard way by the container orchestrator.
The intention is to place these counters in the critical codepath. A HealthzCounter requires a heartbeat, so if no heartbeat arrives within an interval it trips a deadman switch. This would then cause the /healthz URL to return a non-200 HTTP response, informing the cluster orchestration software to perform a Roy from The IT Crowd solution ("hello IT dept., have you tried turning it off and on again!?").
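To make the heartbeat and deadman-switch idea concrete, here is a rough sketch using only the Python standard library. It is not the proposed implementation: the `Heartbeat` class is a hypothetical stand-in for HealthzCounter (the real one would subclass the client's Counter), and the names `healthz_status` and `make_handler`, the timeout parameter, and the choice of HTTP 503 for the failure case are all my assumptions for illustration.

```python
import time
from http.server import BaseHTTPRequestHandler


class Heartbeat:
    """Hypothetical stand-in for the proposed HealthzCounter.

    The real class would inherit from the Prometheus Counter; this sketch
    only models the heartbeat / deadman-switch part.
    """

    def __init__(self, name, timeout_seconds, clock=time.monotonic):
        self.name = name
        self.timeout_seconds = timeout_seconds  # assumed parameter name
        self._clock = clock
        self._count = 0
        self._last_beat = clock()

    def inc(self):
        # Incrementing the counter doubles as the heartbeat.
        self._count += 1
        self._last_beat = self._clock()

    def is_alive(self):
        # The deadman switch trips once the last heartbeat is too old.
        return (self._clock() - self._last_beat) <= self.timeout_seconds


def healthz_status(heartbeats):
    """Return 200 if every heartbeat is fresh, otherwise 503 (assumed code)."""
    return 200 if all(hb.is_alive() for hb in heartbeats) else 503


def make_handler(heartbeats):
    """Build a request handler class that serves /healthz for these heartbeats."""

    class HealthzHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/healthz":
                self.send_error(404)
                return
            status = healthz_status(heartbeats)
            body = b"ok\n" if status == 200 else b"stale heartbeat\n"
            self.send_response(status)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format, *args):
            pass  # keep the sketch quiet

    return HealthzHandler
```

In an application, the main loop would call `inc()` on each iteration, and the handler could be served with something like `HTTPServer(("", 8000), make_handler([hb])).serve_forever()` in a background thread (the port is arbitrary). With asyncio, one such object per critical event loop would be passed in the list.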
The business case:
My organization is in the process of implementing all our applications with east/west metrics and I have gotten tentative approval to develop this feature and upstream the work.
While it would be better to fix the bugs in the app, the reset itself is often the first "routine" step in troubleshooting. The control plane will keep a log of the resets, and because an application's counter is also reset to zero when it is restarted, tracking the entire maneuver is quite easy and obvious in a tool like Grafana using PromQL.
I will be faster to respond on the GitHub issue, so if you don't mind, please respond there with feedback or ideas, but I will try to keep an eye on this group as well for the next few days.
Also since this is my first time posting to the prometheus devs list -- need to say: thank you for what you do & what you have done!!
Cheers,
-Brian Horakh
Software Engineer