I'm trying to make sure that, as I deploy new services on our cluster,
failures/restarts get handled in the way that's best for
resiliency/uptime.
I'm simplifying things a bit, but if a piece of code running inside a
container crashes, there are more or less two possibilities: 1) a bug in
the code (and/or it's trying to process data that causes an error), or
2) a problem with the hardware/network (full disk, bad disk, network
outage, etc.). If the issue is #1, then it doesn't matter whether you
restart the container or the pod. But if the issue is #2, then
restarting the pod (i.e., on another host) would fix the problem, while
restarting the container in place probably wouldn't.
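
To make that concrete, here's roughly the kind of single-container pod
I have in mind (the name, image, and port are made up); as I understand
it, with restartPolicy: Always the kubelet just restarts the container
in place on the same node:

apiVersion: v1
kind: Pod
metadata:
  name: my-service               # hypothetical name
spec:
  restartPolicy: Always          # container restarts happen on the same node
  containers:
  - name: app
    image: example.com/my-service:1.0   # hypothetical image
    livenessProbe:               # so crashes/hangs get detected and restarted
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 10

That seems fine for case #1, but it doesn't help with case #2, since
the restarts never leave the node.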
So I guess this is sort of leading to a bigger question, then: does
k8s have any ability to detect if a host is having hardware problems
and, if so, avoid scheduling new pods on it, move pods off of it if
their containers are crashing, etc.?
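
For reference, the services themselves run as Deployments, roughly like
this (again, names are made up), so "restarting the pod on another host"
really means the ReplicaSet creating a replacement pod and the scheduler
placing it somewhere; what I'm asking is whether that placement will
steer clear of a flaky node:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-service               # hypothetical
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-service
  template:
    metadata:
      labels:
        app: my-service
    spec:
      containers:
      - name: app
        image: example.com/my-service:1.0   # hypothetical image
        ports:
        - containerPort: 8080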
I've done a lot of work with big data systems previously and, IIRC,
Hadoop (for example) used to employ procedures to detect whether a disk
was bad, whether many tasks on a particular node kept crashing, etc.,
and it would start to blacklist those nodes. My thinking was that k8s
worked similarly - i.e., if all the containers in a pod terminate
unsuccessfully, then terminate the pod; if a particular node has many
pods terminating unsuccessfully, then stop launching new pods on it,
etc. Perhaps I'm misunderstanding / assuming incorrectly, though.
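
In case it helps clarify what I mean by "blacklisting": I was imagining
something along the lines of a taint being put on the suspect node, e.g.
(the node name and key here are made up, and I don't know whether k8s
ever sets anything like this on its own - that's essentially my
question):

apiVersion: v1
kind: Node
metadata:
  name: flaky-node-01                    # hypothetical node name
spec:
  taints:
  - key: example.com/suspect-hardware    # made-up key, just for illustration
    effect: NoSchedule                   # keep new pods off this node
  # (or effect: NoExecute, which as I understand it also evicts pods
  #  that don't tolerate the taint)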
Thanks,
DR
On 2017-10-27 4:35 pm, 'Tim Hockin' via Kubernetes user discussion and