Hi,
I am using a Kubernetes 1.10 cluster with a number of basic Spring Boot based microservices. We have a couple of separate clusters with quite different profiles in terms of computing power etc. As a result of this (and possibly of using Helm to release multiple new pods simultaneously) we see a wide range of startup times for the same services in different environments - ranging from 10 seconds up to 90 seconds. While the slow startup itself is probably an issue outside the scope of Kubernetes, my question is whether I am misunderstanding how to use liveness and readiness probes.
The problem I encounter is that sometimes the liveness probe's initial delay expires before the application is 'ready', and the liveness probe then restarts the service. This can turn into a loop: because the cluster is busy booting all the services at once, they all remain slow and keep getting terminated.
I would have expected the liveness probe to wait until the readiness probe succeeds before it starts running, but my experiments show this is not the case. My attempt to handle the wide range of startup times was to give the readiness probe a low initial delay but a high failure threshold, so it starts checking the service after ten seconds but keeps trying for up to two minutes, ensuring even the slow startup scenarios can succeed. However, combining this with a liveness probe is proving difficult - it seems I have to set the initial delay of the liveness probe to the longest possible startup time of the service, because otherwise it will restart the service even before it has been marked as ready.
The easiest solution is simply to add a long initial delay to both, but I am not satisfied with this because it means that if a container does actually crash and need to be restarted, the pod will not be considered ready until this long initial delay has expired - even though it could potentially be ready again after a couple of seconds.
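For reference, that rejected variant would look something like this (numbers purely illustrative, same httpGet/template values as the config below):

livenessProbe:
  # ... same httpGet as below ...
  initialDelaySeconds: 120
readinessProbe:
  # ... same httpGet as below ...
  initialDelaySeconds: 120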
What I have settled on for now looks like this:
livenessProbe:
  httpGet:
    path: {{ .Values.probeEndpoint }}
    port: {{ .Values.service.internalPort }}
  initialDelaySeconds: 120
  periodSeconds: 3
readinessProbe:
  httpGet:
    path: {{ .Values.probeEndpoint }}
    port: {{ .Values.service.internalPort }}
  initialDelaySeconds: 10
  periodSeconds: 3
  failureThreshold: 40
But I don't like that if an app starts up in ten seconds, the liveness probe will not be checking it for the next 110 seconds.
I would be surprised if I am the first to encounter this situation, but I also haven't had much luck finding similar stories or scenarios. Am I misunderstanding the use / intention of these probes? Or is the underlying cause (deploying to multiple disparate clusters) unusual enough that others haven't run into similar issues?
Cheers,
Brendan