Question about interaction between readiness probe and liveness probe


Brendan Hatton

Oct 31, 2018, 7:49:10 PM
to Kubernetes developer/contributor discussion
Hi,

I am using a Kubernetes 1.10 cluster with a number of basic Spring Boot based microservices. We have a couple of separate clusters with quite different profiles in terms of computing power etc. As a result of this (and possibly of using Helm to release multiple new pods simultaneously), we see a wide range of startup times for the same services in different environments - ranging from 10 seconds up to 90 seconds. While the slow startups point to issues that are probably outside the scope of Kubernetes, my question is whether I am misunderstanding how to use liveness and readiness probes.

The problem I encounter is that sometimes the liveness probe's initial delay expires before the application is 'ready', and the liveness probe then restarts the container - and this can turn into a loop: while the cluster is busy booting all the services, they all remain slow and keep getting terminated.

I would have expected the liveness probe to wait until the readiness probe succeeds before it starts running, but my experiments show this is not the case. My attempt to handle a wide range of startup times was to give the readiness probe a low initial delay but a high failure threshold, so it starts checking the service after ten seconds but keeps trying for up to two minutes, ensuring that even the slow startup scenarios can succeed. However, combining this with a liveness probe is proving difficult - it seems I have to set the liveness probe's initial delay to the longest possible startup time of the service, because otherwise it will try to restart the container before it has even been marked as ready.

The easiest solution is simply to add a long initial delay to both, but I am not satisfied with this because it means that when a container does actually crash and needs to be restarted, the pod will not be considered ready until this long initial delay has expired - even though it could potentially be ready after a couple of seconds.
What I have settled on for now looks like this:

        livenessProbe:
          httpGet:
            path: {{ .Values.probeEndpoint }}
            port: {{ .Values.service.internalPort }}
          initialDelaySeconds: 120
          periodSeconds: 3
        readinessProbe:
          httpGet:
            path: {{ .Values.probeEndpoint }}
            port: {{ .Values.service.internalPort }}
          initialDelaySeconds: 10
          periodSeconds: 3
          failureThreshold: 40


But I don't like that if an app starts up in ten seconds, the liveness probe will not be checking it for the next 110 seconds.

I would be surprised if I am the first to encounter this situation, but I also haven't had much luck finding similar stories / scenarios. Am I misunderstanding the use / intention of these probes? Or is the underlying cause (deploying to multiple disparate clusters) quite unusual such that others haven't encountered similar issues?

Cheers,
Brendan

Brendan Burns

Oct 31, 2018, 11:52:13 PM
to Brendan Hatton, Kubernetes developer/contributor discussion
You are correct - as far as I know, readiness and liveness are entirely separate probes.

It seems to me that you are using the same endpoint for both /liveness and /readiness?

I think the general practice is to use two different endpoints.

That way you go /live quite quickly, and /ready at a slower pace when you are actually ready to serve.

Basically, you need to be able to serve your /liveness probe, even if you're not ready to serve your /readiness probe...

If your /liveness probe depends on your server being /ready, then you're going to run into the problems that you describe.

So it looks something like this:

Pod starts
<short-initial-delay>
/liveness-probe becomes active
<more-delay>
/readiness-probe becomes active
<everything's happy>

I would hope that the /liveness probe can be coded in such a way that the time to become /live is not variable, regardless of environment.
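
In probe config terms, something like this against your chart (just a sketch - the /live and /ready paths and the numbers are illustrative, and assume the app exposes those two endpoints):

        livenessProbe:
          httpGet:
            path: /live                  # hypothetical cheap "process is up" endpoint
            port: {{ .Values.service.internalPort }}
          initialDelaySeconds: 5
          periodSeconds: 3
          failureThreshold: 3
        readinessProbe:
          httpGet:
            path: /ready                 # hypothetical "ready to serve traffic" endpoint
            port: {{ .Values.service.internalPort }}
          initialDelaySeconds: 10
          periodSeconds: 3
          failureThreshold: 40           # still tolerates a slow start (10s + 40 x 3s)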

Hope that helps.
--brendan



Han Kang

Nov 1, 2018, 12:39:49 PM
to Kubernetes developer/contributor discussion
Hi Brendan Hatton,

To clarify, are you using the same health endpoint for liveness and readiness? To answer your first question: yes, my understanding is that there is no internal relationship between liveness and readiness probes from the perspective of the kubelet. A liveness probe can be configured independently of a readiness probe (and vice versa), or together with one. If you want a certain type of interaction between the two probes, you have to configure it yourself. At least, this is how I understand it, but Brendan Burns likely knows this (and everything else about Kubernetes) better than I do.

One thing in your configuration does stand out to me; specifically, it is strange that you have set your readiness failureThreshold to 40. The readiness probe dictates whether you want traffic routed to that container, so generally it is desirable for the readiness probe to fail faster. As for the liveness probe, yes, it is unfortunate that you are encountering variable startup times. Two ways of addressing this come to mind. You can set a long initial delay, which has the downside you mentioned (if the container starts up fast, the liveness probe will not be checked until the initial delay has passed). Alternatively, you can increase the failure threshold on your liveness probe, which means the probe can start checking sooner, but the container will not get restarted unless a greater number of consecutive health-check failures is encountered (which is less desirable in the steady state).

I actually think your current setup is mostly fine, but I do think you should probably decrease the failureThreshold on your readiness probe. Having a long initial delay for the liveness probe is okay; liveness probes are specifically intended to be used to restart containers. If your boot sequence can vary in how long it takes, it seems okay to me to set a longer initial delay. I say this because your liveness probe does not dictate whether you receive traffic, but rather when the kubelet decides that a restart might fix a broken container. Deciding whether traffic should be routed to your container is the function of the readiness probe. So if you have your readiness endpoint properly configured (with a short initial delay), then you should be able to route traffic to your container before your first liveness probe is executed, if your container comes up quickly.
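
For illustration, that second option might look something like this (the numbers are made up; 10s + 30 x 3s tolerates roughly a 100-second startup, at the cost of a genuinely broken container taking up to ~90 seconds to be restarted in the steady state):

        livenessProbe:
          httpGet:
            path: {{ .Values.probeEndpoint }}
            port: {{ .Values.service.internalPort }}
          initialDelaySeconds: 10
          periodSeconds: 3
          failureThreshold: 30    # ~90s of consecutive failures tolerated before a restart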

Daniel Smith

Nov 1, 2018, 1:01:02 PM
to Han Kang, kuberne...@googlegroups.com
If you don't mind building some logic into your application, you can build a better version of the initial delay into the app itself and just not use that feature (rough sketch after the list below):
* make your liveness probe return true while the app is initializing.
* false if initialization is taking too long
* false if initialization failed
* if initialization completed, do the actual liveness check.
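
A rough sketch of that, Spring-Boot-flavoured since that's your stack (the endpoint name, the 90-second budget, and how the initialized/initFailed flags get flipped are all up to your app):

    import java.time.Duration;
    import java.time.Instant;
    import java.util.concurrent.atomic.AtomicBoolean;

    import org.springframework.http.ResponseEntity;
    import org.springframework.web.bind.annotation.GetMapping;
    import org.springframework.web.bind.annotation.RestController;

    @RestController
    public class LivenessController {

        private static final Duration INIT_BUDGET = Duration.ofSeconds(90); // illustrative

        private final Instant startedAt = Instant.now();
        // Your startup code flips these when it finishes (or fails).
        private final AtomicBoolean initialized = new AtomicBoolean(false);
        private final AtomicBoolean initFailed = new AtomicBoolean(false);

        @GetMapping("/live")
        public ResponseEntity<String> live() {
            if (initFailed.get()) {
                return ResponseEntity.status(500).body("init failed");           // false: init failed
            }
            if (!initialized.get()) {
                boolean tooSlow =
                        Duration.between(startedAt, Instant.now()).compareTo(INIT_BUDGET) > 0;
                return tooSlow
                        ? ResponseEntity.status(500).body("init taking too long") // false: too slow
                        : ResponseEntity.ok("still initializing");                // true while initializing
            }
            return actualLivenessCheck(); // initialization done: do the real check
        }

        private ResponseEntity<String> actualLivenessCheck() {
            return ResponseEntity.ok("live"); // replace with your real internal health check
        }
    }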

Of course, if there's an error in any of that logic it has pretty severe implications for your service, so test it very carefully. I'd hesitate to call it a best practice for this reason, but it's an option.

Just to repeat what others said, liveness and readiness should probably not be the same check, as they tell the system very different things.
* Liveness=false: please restart me ASAP!
* Readiness=true: please send me more traffic!



William Denniss

Nov 1, 2018, 2:05:45 PM
to Daniel Smith, Han Kang, kuberne...@googlegroups.com
Liveness to probe the internal state of the container (is the process running normally, etc?)
Readiness to probe if the container is ready to receive traffic (are all external dependencies connected?)

The reason they are different is that you may not wish to restart the container immediately just because the DB connection is down (the app is likely trying to re-establish that connection anyway, and a restart won't help).
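
As a rough sketch of that split in Spring Boot terms (the endpoint names are illustrative, and the JDBC check is just one example of an 'external dependency'):

    import java.sql.Connection;

    import javax.sql.DataSource;

    import org.springframework.http.ResponseEntity;
    import org.springframework.web.bind.annotation.GetMapping;
    import org.springframework.web.bind.annotation.RestController;

    @RestController
    public class ProbeController {

        private final DataSource dataSource;

        public ProbeController(DataSource dataSource) {
            this.dataSource = dataSource;
        }

        // Liveness: internal state only -- the process is up and answering HTTP.
        @GetMapping("/live")
        public ResponseEntity<Void> live() {
            return ResponseEntity.ok().build();
        }

        // Readiness: external dependencies -- only accept traffic while the DB is reachable.
        // A failure here stops traffic but does not restart the container.
        @GetMapping("/ready")
        public ResponseEntity<Void> ready() {
            try (Connection conn = dataSource.getConnection()) {
                if (conn.isValid(1)) {
                    return ResponseEntity.ok().build();
                }
                return ResponseEntity.status(503).build();
            } catch (Exception e) {
                return ResponseEntity.status(503).build();
            }
        }
    }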

I cover this topic a bit in this talk, if you're interested: https://youtu.be/2ZP4M6UdH8s?t=767


King'ori Maina

Nov 2, 2018, 7:32:31 AM
to Kubernetes developer/contributor discussion, Daniel Smith, William Denniss, Han Kang
Related. There's an issue (still open) with some discussion of this exact problem:

Might be interesting to some.

King.


Brendan Hatton

Nov 3, 2018, 2:22:28 AM
to Kubernetes developer/contributor discussion
Hi Brendan,

Thanks for your feedback. I see that I didn't quite understand the liveness probe completely - it makes more sense now!

Cheers,
Brendan

Brendan Hatton

Nov 3, 2018, 2:25:03 AM
to Kubernetes developer/contributor discussion
Hi Han,

Thanks for your comment. The reason I set such a high failure threshold on the readiness probe is to handle the variable start time - Kubernetes will start checking the app after 10 seconds, but it will keep checking for up to about two minutes (10s + 40 x 3s) without taking any action, because this could still be a normal (but slow) startup.

Cheers
Brendan

Brendan Hatton

Nov 3, 2018, 2:30:08 AM
to Kubernetes developer/contributor discussion
Hi Daniel,
That's an interesting idea - but as you point out it would be a relatively brave move (especially if the switch to the actual liveness check malfunctioned!). That's a nice simplification of the difference between the probes - even though I did understand that, I got a little muddled in the details. Thanks for your response.

Cheers
Brendan

Brendan Hatton

Nov 3, 2018, 2:33:53 AM
to Kubernetes developer/contributor discussion
Hi William,

The database link is a good example - rebooting the consuming service is unlikely to help the database recover :) Thanks for the link - I haven't used the Google Kubernetes tooling much yet, so this looks like a handy intro to it.

Cheers
Brendan

Brendan Hatton

Nov 3, 2018, 2:38:02 AM
to Kubernetes developer/contributor discussion
Thanks King - I can only wonder how I missed that thread when I was looking for others with similar experiences! At least it shows I wasn't the only one to get confused / make this mistake with regard to the probes. Shame it's over two years old with no sign of progress.

Cheers
Brendan

EJ Campbell

Nov 3, 2018, 11:40:40 AM
to Brendan Hatton, Kubernetes developer/contributor discussion
This blog post does an excellent job showing the lifecycle of a pod:
For a lot of applications, it seems like the postStart hook can be used to handle the initial initialization delay (e.g. waiting for the app to start up), so that liveness and readiness probes don't run until the app is "up".
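
Something along these lines (a sketch; it assumes the image has a shell and curl available, and reuses the chart's probe values - worth verifying on your own cluster that the probes really do hold off until the hook returns):

        lifecycle:
          postStart:
            exec:
              command:
                - /bin/sh
                - -c
                # the container only reaches Running once this loop exits
                - "until curl -sf http://localhost:{{ .Values.service.internalPort }}{{ .Values.probeEndpoint }}; do sleep 2; done"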

EJ

Zach Hanna

Nov 6, 2018, 11:07:10 AM
to ej...@oath.com, brendan...@gmail.com, kuberne...@googlegroups.com
From my perspective, having used this for a while now, the current behavior is the expected and desired behavior.
The two checks should never be the same thing; each necessarily answers a different question.
Think of it in the context of old-school Nagios - the difference between a TCP port check or ICMP ping and a service check. One meant the node was actually down or seriously unhealthy; the other showed that the service on that node needed attention or had not started up yet. If the node is not up at the OS level (pod level in this case, right?) then of course the service couldn't possibly be up or be expected to be up. But we shouldn't kill or restart the pod just because the service is not ready.
I think it's a very well thought out design and dependency the way it is.


EJ Campbell

Nov 6, 2018, 6:19:26 PM
to Zach Hanna, brendan...@gmail.com, kuberne...@googlegroups.com
One thing to be very cautious of with your readiness or liveness probes is to be conservative about what you consider "unhealthy". We have seen cases where an application's fancy liveness check depended on a backend service; when that backend failed, liveness failed everywhere, and all pods restarted at once. Readiness, too, can create a global outage if you aren't careful.

For more invasive health checks (e.g. ensuring all backends can be reached), we are going to use K8s to disrupt unhealthy pods, so that the pod disruption budget will prevent us from disrupting all pods within a service at once.
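
For reference, a minimal PodDisruptionBudget looks roughly like this (the label and the budget are illustrative):

    apiVersion: policy/v1beta1     # the PDB API version available on 1.10-era clusters
    kind: PodDisruptionBudget
    metadata:
      name: my-service-pdb
    spec:
      minAvailable: 2              # evictions are refused if they would drop below this
      selector:
        matchLabels:
          app: my-service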

EJ

Han Kang

Nov 7, 2018, 4:08:30 PM
to Kubernetes developer/contributor discussion
What do you mean by 'taking any action'? Failures from the readiness probe do not cause restarts; when the threshold is reached, traffic simply stops being routed to the container. A high failureThreshold is generally problematic because, once a running container goes bad, traffic will keep being routed to it for a long time (failureThreshold * periodSeconds - with your settings, 40 x 3s = 120 seconds) before the container is marked as unready.

Matthias Bertschy

Nov 14, 2018, 2:11:52 AM
to Kubernetes developer/contributor discussion
Not much progress here... we have discussed some potential solutions here, but it seems nobody wants to make a real decision.

For me (and probably many others) this is a real problem for Kubernetes adoption and I'm willing to help...

Tim Hockin

Nov 14, 2018, 2:29:50 AM
to Matthias Bertschy, Kubernetes developer/contributor discussion
I responded on the bug. We don't need to discuss it in two places.
