python client support for z-pages


Brian Horakh

Sep 4, 2023, 11:14:24 AM
to Prometheus Developers
I've opened a feature request that I plan to implement for my organization and submit upstream to the Prometheus client for Python.

The link is here:
https://github.com/prometheus/client_python/issues/953

The proposal is to equip the Prometheus client with the ability to concurrently serve an HTTP 200 /healthz page that can be *easily* integrated into control planes (e.g. AWS ECS and K8s both use z-pages).

Control planes use z-pages (pioneered by Google, but widely adopted by most load balancers) to determine whether an application is alive and functioning properly based on the HTTP response code. If an application fails to return an HTTP 200 for a configured number of intervals, the container is terminated and a new container is spun up. The Python client for Prometheus has its own web server internally, so I'm proposing to implement z-pages capability in the client.
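
As a rough illustration of the control-plane side, the restart decision can be sketched as a failure counter. The class name `ProbeState`, the default threshold of 3, and the choice to count only consecutive failures are assumptions for illustration, not the behavior of any particular orchestrator:

```python
from dataclasses import dataclass


@dataclass
class ProbeState:
    """Tracks health-check failures for one container (hypothetical helper)."""

    failure_threshold: int = 3  # assumed value; normally set in the control plane config
    consecutive_failures: int = 0

    def record(self, status_code: int) -> bool:
        """Record one probe result; return True when a restart is due."""
        if status_code == 200:
            # Any healthy response clears the failure streak.
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
        return self.consecutive_failures >= self.failure_threshold
```

With a threshold of 2, a single non-200 response followed by a 200 does not trigger a restart; two non-200 responses in a row do.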

To be clear: I'm NOT proposing adding functionality to Prometheus. As far as Prometheus core is concerned, these are ordinary Counters, nothing special.

GitHub user roidelapluie asked me to submit my proposal to the broader Prometheus community for discussion and feedback, which is welcome!

My proposed design:
The current Prometheus Counter implementation is the superclass, and the proposed "HealthzCounter" will inherit all of its capabilities and behaviors.

Application developers implementing east/west telemetry in their applications can then place HealthzCounters at one or more critical points in the code path to track whether an application is working (i.e. is the main loop running or blocked).
For applications using Python asyncio, it would be appropriate to implement one HealthzCounter per critical event loop.

The behavior: *if*, for example, an MQ or DB client disconnects and doesn't or can't reconnect, the application is running either an error path or a success path, and either can have many potential underlying causes. Whatever the cause, the application can be easily terminated in a standard way by the container orchestrator.

The intention is to place these counters in the critical code path. A HealthzCounter requires a heartbeat: if no heartbeat arrives within an interval, it trips a deadman switch. This then causes the /healthz URL to return a non-200 HTTP status, informing the cluster orchestration software to perform a Roy from The IT Crowd solution ("Hello, IT, have you tried turning it off and on again?").
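
A minimal sketch of the deadman-switch idea, using only the standard library. `HeartbeatCounter` is a hypothetical stand-in for the proposed HealthzCounter and does not subclass the real prometheus_client Counter; the 503 status, the `timeout_seconds` parameter, and the injectable `clock` are likewise assumptions for illustration:

```python
import time
from http.server import BaseHTTPRequestHandler
from typing import Callable, List


class HeartbeatCounter:
    """Stand-in for the proposed HealthzCounter (hypothetical, not prometheus_client API)."""

    def __init__(self, name: str, timeout_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.name = name
        self.timeout_seconds = timeout_seconds
        self._clock = clock
        self.value = 0
        self._last_beat = clock()  # creation counts as the first heartbeat

    def inc(self) -> None:
        """Count one pass through the critical code path and reset the deadman timer."""
        self.value += 1
        self._last_beat = self._clock()

    def is_alive(self) -> bool:
        """True while the last heartbeat is no older than the timeout."""
        return (self._clock() - self._last_beat) <= self.timeout_seconds


def healthz_status(counters: List[HeartbeatCounter]) -> int:
    """Return 200 if every counter is alive, otherwise 503 (assumed failure code)."""
    return 200 if all(c.is_alive() for c in counters) else 503


def make_handler(counters: List[HeartbeatCounter]):
    """Build a request handler class that serves /healthz for the given counters."""

    class HealthzHandler(BaseHTTPRequestHandler):
        def do_GET(self) -> None:
            if self.path != "/healthz":
                self.send_error(404)
                return
            status = healthz_status(counters)
            body = b"ok\n" if status == 200 else b"stale heartbeat\n"
            self.send_response(status)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    return HealthzHandler
```

The application would call `inc()` once per iteration of its main loop, and the handler class returned by `make_handler` could be passed to `http.server.HTTPServer` running in a background thread. If the loop blocks for longer than `timeout_seconds`, /healthz starts returning 503.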

The business case:
My organization is in the process of instrumenting all our applications with east/west metrics, and I have gotten tentative approval to develop this feature and upstream the work.

While it would be better to fix the bugs in the app, the reset itself is often the first "routine" step in troubleshooting. The control plane will keep a log of the resets, and because an application's counter is also reset to zero when it restarts, tracking the entire maneuver is quite easy and obvious in a tool like Grafana using PromQL.
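
To illustrate why a restart is visible in the data: a counter only goes up, so any drop between consecutive samples marks a reset. PromQL's `resets()` function is built around the same observation. The sketch below shows the idea in plain Python; `count_resets` and the sample values are hypothetical:

```python
from typing import Sequence


def count_resets(samples: Sequence[float]) -> int:
    """Count how many times a counter series decreases between consecutive samples."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if cur < prev)


# Hypothetical scraped values of one counter across two restarts.
scraped = [0, 5, 9, 0, 3, 1, 4]
restarts = count_resets(scraped)
```

Here the drops from 9 to 0 and from 3 to 1 are each counted as one reset.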

I will be faster to respond on the GitHub issue, so please respond there with feedback or ideas if you don't mind, but I will try to keep an eye on this group as well for the next few days.

Also, since this is my first time posting to the Prometheus developers list, I need to say: thank you for what you do and what you have done!

Cheers,

-Brian Horakh
Software Engineer
Habitat.Energy Australia



Chris Marchbanks

Sep 6, 2023, 2:27:36 PM
to Brian Horakh, Prometheus Developers
Hello Brian,

First of all, thank you for the proposal. My initial thought is that since this functionality would not be used by Prometheus or a Prometheus-related component, it is beyond the scope of a Prometheus client library. I do like the idea of being able to use metrics as a signal for the result of a /healthz endpoint though, so if it is challenging to get the current value of a metric, that is something I would consider improving.

Thanks again for the proposal and I am curious what others think as well,
Chris
