Liveness vs readiness probe implementations


Emmanuel Bernard

unread,
Nov 3, 2019, 7:47:57 AM11/3/19
to Quarkus Development mailing list
Hey there,

I get custom readiness probes, the app should be ready to answer
requests.
But what would be an example of a liveness probe written by a user? I
can't think of anything, as as soon as the app answers on the HTTP port
we are "live".

Thoughts?

Emmanuel

Michal Szynkiewicz

unread,
Nov 3, 2019, 8:05:29 AM11/3/19
to Emmanuel Bernard, Quarkus Development mailing list
Maybe e.g. db connection pool exhausted?


--
You received this message because you are subscribed to the Google Groups "Quarkus Development mailing list" group.
To unsubscribe from this group and stop receiving emails from it, send an email to quarkus-dev...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/quarkus-dev/2C906B7B-2745-477C-960B-1406E695316D%40redhat.com.

Sanne Grinovero

unread,
Nov 3, 2019, 8:26:24 AM11/3/19
to Emmanuel Bernard, Quarkus Development mailing list
It's meant to verify the application is "up" - but the definition
varies with context.

Often I see people just checking that the homepage loads - that's the
basic step and might be good enough for most cases, provided this
implicitly tests several internal components, such as perhaps loading
something from the database, thereby also verifying that the database
is still reachable.

In some cases I've seen people running more comprehensive periodic
tests; for example an airline booking company might actually place a
booking order for testing accounts.

Sanne Grinovero

unread,
Nov 3, 2019, 8:36:29 AM11/3/19
to mszy...@redhat.com, Emmanuel Bernard, Quarkus Development mailing list
On Sun, 3 Nov 2019 at 13:05, Michal Szynkiewicz <mszy...@redhat.com> wrote:
>
> Maybe e.g. db connection pool exhausted?

I'd be careful with that. Having exhausted connection pools is not a
good reason to fail the liveness checks: it will result in the node
being killed, while actually it was perhaps working very well - but at
capacity.

When the application is at capacity, it should be scaled up rather
than down - so failing the liveness probe check would get you the
opposite of what you need.

Exhausting the DB connection pool is actually a tricky case, as here
even scaling up might not be a good idea: the DB has a maximum number
of connections it can take in total from all the applications, so
scaling individual instances of a scalable application up or down
might not help. It's clear though that killing one instance which is
otherwise working fine is not the best course of action, as the
physical connections in its active connection pool are expensive to
replace; this will put more load on the DB - and on the system as a
whole.

In short: don't kill the node, but try scaling up - but not over the
ideal limit the DB can take.

If that's not good enough still, one will need to look at scalable
databases, and/or offload at least some of the most intensive work to
a cloud-native NoSQL solution such as an Infinispan data grid server.

Martin Stefanko

unread,
Nov 3, 2019, 8:38:29 AM11/3/19
to Quarkus Development mailing list
For me, the difference is that if the liveness probe fails, the app is restarted; so it depends on the app context, but the use case is different.

However, thanks all for the examples. Very useful for us in the spec.

V. Sevel

unread,
Nov 3, 2019, 4:09:27 PM11/3/19
to Quarkus Development mailing list
one way I like to look at it is to realize that if your liveness probe fails, your pod will get killed and restarted. so if you do not think that a pod restart would help with a specific situation, do not include that situation in your liveness probe. so by default just return a 200, and fail only in cases where your app cannot recover without a restart.
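That default ("return 200, fail only when a restart would help") is simple enough to sketch; the names below are illustrative, not from any Quarkus API:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative sketch, not a Quarkus API: a liveness status that is
// healthy by default and fails only for states a restart would fix.
class Liveness {
    // set only when the app detects an unrecoverable condition,
    // e.g. corrupted in-memory state or a deadlocked worker thread
    private final AtomicBoolean unrecoverable = new AtomicBoolean(false);

    void markUnrecoverable() {
        unrecoverable.set(true);
    }

    // the HTTP status a /health/live endpoint would return
    int status() {
        return unrecoverable.get() ? 503 : 200;
    }

    public static void main(String[] args) {
        Liveness probe = new Liveness();
        System.out.println(probe.status()); // 200: live by default
        probe.markUnrecoverable();          // e.g. deadlock detected
        System.out.println(probe.status()); // 503: ask Kubernetes to restart us
    }
}
```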

Emmanuel Bernard

unread,
Nov 4, 2019, 5:09:42 AM11/4/19
to Sanne Grinovero, Quarkus Development mailing list
What you describe as tests, I would classify as readiness status more
than liveness, I think.

Emmanuel Bernard

unread,
Nov 4, 2019, 5:12:04 AM11/4/19
to V. Sevel, Quarkus Development mailing list

OK, that makes sense. I wish we had some concrete examples, even if we have to explain the context.
If the framework does not try to restart the connection to the DB, the DB connection is a liveness thing - but then that's an uncommon case compared to the framework retrying.


Roland Huss

unread,
Nov 4, 2019, 5:27:55 AM11/4/19
to Emmanuel Bernard, V. Sevel, Quarkus Development mailing list
The 'liveness probe is for healing by restart' rule of thumb is really very useful. E.g. it doesn't make sense to fail a liveness check if a dependent service (e.g. the DB) is failing, as a restart of your service won't fix that. Since your service can't provide its functionality without its dependent services running, its readiness probe should flip to false, and flip back to true once all those dependencies are operable again.

Another scenario for not using the same liveness and readiness probes is for services that take a long time to start up, but with a readiness probe kicking in quickly (faster than the startup phase). Having the same liveness probe would restart your pod before it even could initialize itself (resulting in a crash loop).

Of course, if your service doesn't have any dependencies (and starts up quickly enough), liveness and readiness can be the same.

regards ...
... roland

Erin Schnabel

unread,
Nov 4, 2019, 10:53:43 AM11/4/19
to Quarkus Development mailing list
I like to say that liveness == "I am not a zombie process". 

Anything more complicated can be done in the readiness check, including having the process stop itself if it realizes it can't get back to a ready state: i.e. quiesce existing work and quit, or detect a deadlock and quit, or realize your DB connection pool is not recovering and quit, or... 

I don't see any reason to do those kinds of checks in two different probes that will both be called at separate (but perhaps overlapping) intervals. 
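That "stop yourself if you can't get back to ready" idea could be sketched like this (hypothetical names; a real service would quiesce in-flight work and then call System.exit):

```java
// Hypothetical sketch of the pattern above: a single readiness check,
// and the process gives up on its own after too many consecutive
// failures instead of waiting for a liveness probe to kill it.
class ReadinessMonitor {
    private final int maxConsecutiveFailures;
    private int consecutiveFailures = 0;

    ReadinessMonitor(int maxConsecutiveFailures) {
        this.maxConsecutiveFailures = maxConsecutiveFailures;
    }

    // record one probe result, e.g. "could I borrow a DB connection?"
    void record(boolean ready) {
        consecutiveFailures = ready ? 0 : consecutiveFailures + 1;
    }

    // true once the app should quiesce its work and exit by itself
    boolean shouldQuit() {
        return consecutiveFailures >= maxConsecutiveFailures;
    }
}
```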


V. Sevel

unread,
Nov 4, 2019, 11:37:54 AM11/4/19
to Quarkus Development mailing list
>> Another scenario for not using the same liveness and readiness probes is for services that take a long time to start up

an interesting pattern I have seen around that situation is to use the same endpoint (eg /health) for both readiness and liveness, but specify a different initialDelaySeconds. For instance 20 secs for readiness (if the app is supposed to start in 10 secs), and 60 seconds for liveness.
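That pattern (one /health endpoint for both probes, with different delays) would look roughly like this in the container spec of a Deployment; the port and numbers are illustrative, matching the figures above:

```yaml
# one /health endpoint serves both probes; only the timing differs
readinessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 20   # app is expected to start in ~10s
  periodSeconds: 5
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 60   # generous, so a slow boot never triggers a restart
  periodSeconds: 10
  failureThreshold: 3
```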


Mattia Mascia

unread,
Nov 4, 2019, 11:41:30 AM11/4/19
to Quarkus Development mailing list
This discussion can be summed up with this blog:
https://srcco.de/posts/kubernetes-liveness-probes-are-dangerous.html

V. Sevel

unread,
Nov 4, 2019, 11:55:43 AM11/4/19
to Quarkus Development mailing list
here is an example:

that is the reason why I suggested https://issues.jboss.org/browse/AG-116 back when I started working on vault.

this was in 2017, and it seems they have solved this situation:

if your application fetches secrets/properties at startup only, then your only way to recover is to shut down and restart. so you either shut down properly, like somebody was suggesting, and exit by yourself, or let k8s kill you based on the liveness probe.

V. Sevel

unread,
May 11, 2020, 8:11:51 AM5/11/20
to Quarkus Development mailing list
in the "DON'T" it says "do not depend on external dependencies (like data stores) for your Readiness/Liveness checks as this might lead to cascading failures", with examples such as:
 - dependency on db (not to be confused with db migration, which is a "DO")
 - dependency on other services

it is my understanding that the quarkus behavior is to include extension health checks (unless disabled) as readiness probes:

I think it is a good thing that extensions providing integrations with external systems provide custom health checks to help with diagnostics, but those health checks should not be part of the readiness probe, unless we are absolutely certain that the condition affects only the pod itself, and not the entire replica set.
and in the vast majority of cases, a failure on an external system is likely to be a problem on the server, rather than on one particular client. if a pod cannot connect to a db, most likely the other pods won't be able to either. plus it is extremely difficult to assess from a given pod that a condition affects only that pod.

so by default I would not add the extension health checks to the readiness probe for external systems (agroal, mongo, neo4j, vault, ...) except for flyway.
and I would add some config params to add them to the readiness probe if I need to.

Loïc MATHIEU

unread,
May 11, 2020, 8:30:18 AM5/11/20
to vvs...@gmail.com, Quarkus Development mailing list
Hello,

While I agree that liveness probe should not depend on external system, I don't agree for readiness probe.

Readiness probe should be OK only when your application can handle the request.
If a database that is necessary to fulfill a request is down, your readiness probe should be DOWN.

In the blog post you're referring to, it mixes up liveness and readiness:
e.g. a stateful REST service with 10 pods which depends on a single Postgres database: when your probe depends on a working DB connection, all 10 pods will be "down" if the database/network has a hiccup --- this usually makes the impact worse than it should

This statement is not true: in the case of a readiness probe, your pod will not restart, it will just not receive traffic that it is not able to handle.
And I don't understand why it would make the impact worse; it is a kind of "circuit breaker": if your database is not able to handle requests, it will not be used by your application, so your DB will have time to recover ...

Moreover, a network hiccup will not make a liveness/readiness probe take down your pod, as it takes several failed probes to actually restart (for liveness) or blacklist (for readiness) a pod.

When we added health checks to Quarkus, we discussed this extensively and agreed that providing readiness probes enabled by default for all datastores is a good thing.
We decided to disable it by default for the Kafka client, as an application can still receive requests without any issue when a Kafka cluster is down: the default client is capable of buffering the messages.

I discussed this with my devops friends (some are certified Kubernetes admins and Kubernetes trainers) and they agree with me ...
So I think Quarkus is right to enable readiness probes by default for all the datastores it supports.
Each can be easily disabled, so if some don't agree it's easy to change ;)

Regards,

Loïc
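As a reference for the "easily disabled" part above: the per-extension switches live in application.properties. The property names below are from memory and should be checked against each extension's documentation:

```properties
# disable the readiness check contributed by the Agroal datasource extension
quarkus.datasource.health.enabled=false

# the Kafka client check is opt-in, as discussed above
quarkus.kafka.health.enabled=true
```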



V. Sevel

unread,
May 11, 2020, 8:49:29 AM5/11/20
to Quarkus Development mailing list
the cascading failure happens if you generalize adding external systems such as dbs and services to your readiness probe.
if a db is unavailable, all the microservices using that db will become unavailable, and any call to those microservices will return a 503.
then if clients of that microservice have added it to their readiness probe, they will become unavailable as well.
and if you repeat that pattern, it will happen in cascade to clients of those clients, ...
the extra inconvenience is that once an entire replica set has been made unavailable, you will not be able to reach it on /health, making the diagnostic even more difficult, unless you are using tools to target the pods individually, without going through the k8s service object.

Georgios Andrianakis

unread,
May 11, 2020, 9:51:04 AM5/11/20
to Loïc MATHIEU, vvs...@gmail.com, Quarkus Development mailing list


On Mon, May 11, 2020, 16:47 Loïc MATHIEU <loik...@gmail.com> wrote:
> [...]

Very well articulated, +1

On Mon, 11 May 2020 at 14:11, V. Sevel <vvs...@gmail.com> wrote:
> [...]
On Monday, 4 November 2019 at 17:41:30 UTC+1, Mattia Mascia wrote:
This discussion can be summed up with this blog:
https://srcco.de/posts/kubernetes-liveness-probes-are-dangerous.html



Loïc MATHIEU

unread,
May 11, 2020, 10:02:34 AM5/11/20
to Georgios Andrianakis, vvs...@gmail.com, Quarkus Development mailing list
> If clients of that microservice have added that microservice to their readiness probe

This is always a design choice ... and we don't have a REST client health check. 
For accessing another web service, I recommend using fault-tolerance capabilities to avoid such cascading failures.

But it's always a question of "what is the minimal dependency for my service to be ready": if it's a DB, it must be in the readiness check; if it's an external service (Vault for example - we have a readiness check for it, I think), it must be in the readiness check. If you don't add them to the readiness check, your clients will still get exceptions (500), but you will have tons of failing requests on your service (which can lead to a DDoS) and on your dependency (which can lead to a situation where your failing external datastore cannot get back to a steady state due to too many dangling connections or too many open sessions, etc ...)

Ioannis Canellos

unread,
May 12, 2020, 2:51:16 AM5/12/20
to loik...@gmail.com, vvs...@gmail.com, Quarkus Development mailing list

V. Sevel

unread,
May 12, 2020, 1:03:50 PM5/12/20
to Quarkus Development mailing list
I have gone through some research and found a few articles and blogs [1], in addition to mattia's, all usually converging around the idea of cascading failures and the fact that an external shared dependency that is not there for one pod is probably not there for all pods (and I have also found some [4] with the opposite proposition, including one from google).

it is interesting also to note that SB is working actively in that area for 2.3, and they decided to not include by default any of the health checks in the readiness probe, as stated here [2] and also explained in the PR [3] (with some reasoning about auto-scaling). SB in itself is not an absolute proof, but they have the exact same issue to solve, so it is always interesting to challenge quarkus when they take a different approach.

by experience, it is a little bit of a pain to investigate a 503 in k8s, because you have several layers to check: load balancer, ingress router, ingress, network policy. adding external dependencies to readiness will just return a 503 with no context, as opposed to a 500 with some context.
it is true that clients with retry strategies may decide to hammer the service, but if they have retries, I am hoping they have circuit breakers too. and if I want to protect myself from a shaky external dependency, maybe I should have a circuit breaker too.

I agree it is a tough question, and some applications will want to have this. but understand that if you lose an external dependency, then your entire application is gone, and you will be down to periodSeconds and failureThreshold tuning. it is not a right or wrong answer, but rather a question of what is the less harmful default.











Loïc MATHIEU

unread,
May 12, 2020, 1:23:29 PM5/12/20
to vvs...@gmail.com, Quarkus Development mailing list
Almost all Spring Boot applications I have seen in production use the actuator endpoint for readiness.
Maybe this changes with 2.3, but almost all that I saw have a health check with a database check.

But as you said, there is no consensus on this ...


Ladislav Thon

unread,
May 13, 2020, 3:02:58 AM5/13/20
to vvs...@gmail.com, Quarkus Development mailing list
I tend to think that there should be a readiness check for each external service that is absolutely vital to the functioning of the application. If 99% of the requests to the application hit a database, then database check should be part of readiness. But if the app reaches out to some external service for 10% of requests, then probably not.

LT
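The "vital vs. non-vital dependency" rule above can be sketched as a composite readiness check; all class and method names here are made up for illustration, this is not a Quarkus or MicroProfile API:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BooleanSupplier;

// Sketch with made-up names: only dependencies registered as "vital"
// can take the pod out of rotation; the rest are reported for
// diagnostics but never gate readiness.
class CompositeReadiness {
    private final Map<String, BooleanSupplier> vital = new LinkedHashMap<>();
    private final Map<String, BooleanSupplier> informative = new LinkedHashMap<>();

    void addVital(String name, BooleanSupplier check) {
        vital.put(name, check);
    }

    void addInformative(String name, BooleanSupplier check) {
        informative.put(name, check);
    }

    // ready as long as every vital dependency responds
    boolean ready() {
        return vital.values().stream().allMatch(BooleanSupplier::getAsBoolean);
    }

    // diagnostic view over every check, vital or not
    Map<String, Boolean> report() {
        Map<String, Boolean> r = new LinkedHashMap<>();
        vital.forEach((n, c) -> r.put(n, c.getAsBoolean()));
        informative.forEach((n, c) -> r.put(n, c.getAsBoolean()));
        return r;
    }
}
```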

On Tue, 12 May 2020 at 19:03, V. Sevel <vvs...@gmail.com> wrote:

Erin Schnabel

unread,
May 13, 2020, 10:11:18 AM5/13/20
to lad...@gmail.com, vvs...@gmail.com, Quarkus Development mailing list
Exactly. If the service can keep functioning w/o something, it shouldn't be part of the readiness check.

For spring boot, I always treat the actuator health endpoint as the readiness check, and make my own stupid/brain-dead/I-am-not-a-zombie endpoint for liveness (that's all it is really for, to catch zombie containers where the container is alive, but the process that does stuff is dead).



--
An eye for eye only ends up making the whole world blind.  ~Mahatma Gandhi

clement escoffier

unread,
May 13, 2020, 11:49:02 AM5/13/20
to schn...@us.ibm.com, lad...@gmail.com, vvs...@gmail.com, Quarkus Development mailing list
For readiness, it's debatable. But as you are not in control of the number of instances of the service you are calling (or where they are), it's challenging to configure the probe. What would be the timeout? What would be the number of failures? All these are unknown to you. Imagine that 3 of 5 instances of the called service are dead: 3 failures in a row won't make you ready, but imagine you hit a working instance during the probe and a failing one on the first real request. Do you need to wait until all the instances of the called service are alive? That's not really possible. Instances may not be co-located, so the timeout may vary from instance to instance. Basically, it's almost impossible to configure correctly.

For liveness, it's a no-go... The failing service is responsible for that, not you as the caller. Saying "Down" will get you rebooted, which has no impact on the remote service you are calling.

Clement



Erin Schnabel

unread,
May 13, 2020, 2:08:48 PM5/13/20
to clement....@gmail.com, lad...@gmail.com, quark...@googlegroups.com, vvs...@gmail.com
Which I think is what I said. ;)
 
Thanks,
Erin
 
------
Erin Schnabel <schn...@us.ibm.com>
Senior Technical Staff Member (STSM)
Emerging Application Runtimes and Frameworks
845.435.5648 / 8.295.5648