The PR[1] is ready for review and is proven to work on an OpenShift 4.9 CI version. It detects the upgrade and prevents remediating node(s), and once the upgrade is not progressing anymore, it continues. The logic checks the ClusterVersion condition type 'Progressing' being set to 'True'.

Some thoughts/doubts while working on it:
- if we skip remediation, do we still want to mark a node as unhealthy in the status?
- if we err during an attempt to check if the cluster is upgrading, I lean towards continuing with remediation. I left a comment suggesting that in time, finer error handling will make this decision better. My thought process is "the cluster is unstable, perhaps some pressure, failing hardware, better try remediation than prevent it" - more thoughts here?
- I think I need to add a status condition to say 'cluster is upgrading, remediation is disabled'
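For reference, the upgrade check boils down to something like this (a minimal sketch assuming the openshift/api config/v1 types; names here are illustrative, not the exact code from the PR):

package upgrade

import (
	"context"

	configv1 "github.com/openshift/api/config/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// clusterUpgrading returns true when the ClusterVersion singleton ("version")
// reports Progressing=True, i.e. an upgrade is in flight.
func clusterUpgrading(ctx context.Context, c client.Client) (bool, error) {
	cv := &configv1.ClusterVersion{}
	if err := c.Get(ctx, client.ObjectKey{Name: "version"}, cv); err != nil {
		return false, err
	}
	for _, cond := range cv.Status.Conditions {
		if cond.Type == configv1.OperatorProgressing {
			return cond.Status == configv1.ConditionTrue, nil
		}
	}
	return false, nil
}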
On Tue, 5 Oct 2021 at 00:27, Roy Golan <rgo...@redhat.com> wrote:
> The PR[1] is ready for review and is proven to work on an OpenShift 4.9 CI version. It detects the upgrade and prevents remediating node(s), and once the upgrade is not progressing anymore, it continues. The logic checks the ClusterVersion condition type 'Progressing' being set to 'True'.
>
> Some thoughts/doubts while working on it:
> - if we skip remediation, do we still want to mark a node as unhealthy in the status?

@Andrew Beekhof replied on slack that yes. I too think like that, and my current PR represents that.
> - if we err during an attempt to check if the cluster is upgrading, I lean towards continuing with remediation. I left a comment suggesting that in time, finer error handling will make this decision better. My thought process is "the cluster is unstable, perhaps some pressure, failing hardware, better try remediation than prevent it" - more thoughts here?
> - I think I need to add a status condition to say 'cluster is upgrading, remediation is disabled'

Conditions, phases, and all that jazz is not trivial. There is no single guideline, and I want us to make a good decision.
For a start, users and tools don't have a short way to understand the status of remediation. Today, only after I see that the status has a report of the number of nodes observed do I know the operator is active on that resource.
The overall status of a resource can now be many things:
- all nodes are OK - ready (or active?)
- some nodes are unhealthy, no active remediation
- some nodes are unhealthy, some/none have in-flight remediation
- some nodes are unhealthy, remediation is disabled - reason is cluster upgrading - might have in-flight remediations as well

Can we cram all of this down into a top-level status.phase = ready|active|degraded? Probably yes.
Can we include a condition with a reason for each one? It's more complicated, but doable.
Are we expecting tools or top-level operators to consult the remediation status, or to add conditions to it? I'm not aware of any, but if you have thoughts on this, please share.
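Something like this is what I have in mind (a hypothetical sketch using the standard metav1.Condition helpers; the condition type and reason names are made up for illustration):

package status

import (
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// setDisabledCondition records whether remediation is disabled, with the
// Reason field explaining why (here: the cluster is upgrading).
func setDisabledCondition(conditions *[]metav1.Condition, upgrading bool) {
	cond := metav1.Condition{
		Type:    "Disabled", // hypothetical condition type
		Status:  metav1.ConditionFalse,
		Reason:  "RemediationEnabled",
		Message: "remediation is enabled",
	}
	if upgrading {
		cond.Status = metav1.ConditionTrue
		cond.Reason = "ClusterUpgrading"
		cond.Message = "cluster is upgrading, remediation is disabled"
	}
	meta.SetStatusCondition(conditions, cond)
}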
Good read on the subject -> https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#spec-and-status

Finally, I think we can move on with this and not wait for 164 as contingency.
Worst case scenarios can be that a) the upgrade is progressing slowly and hence remediation is skipped for a long time, or b) like in the bug, remediation is occurring while the upgrade is trying to restart nodes. The latter would mean I miserably failed to understand how the upgrade is signaled and I need to fix it ASAP (but it should be fixable).
On Tue, Oct 5, 2021 at 7:50 AM Roy Golan <rgo...@redhat.com> wrote:
> On Tue, 5 Oct 2021 at 00:27, Roy Golan <rgo...@redhat.com> wrote:
>> The PR[1] is ready for review and is proven to work on an OpenShift 4.9 CI version. It detects the upgrade and prevents remediating node(s), and once the upgrade is not progressing anymore, it continues. The logic checks the ClusterVersion condition type 'Progressing' being set to 'True'.
>>
>> Some thoughts/doubts while working on it:
>> - if we skip remediation, do we still want to mark a node as unhealthy in the status?
>
> @Andrew Beekhof replied on slack that yes. I too think like that, and my current PR represents that.

+1

>> - if we err during an attempt to check if the cluster is upgrading, I lean towards continuing with remediation. I left a comment suggesting that in time, finer error handling will make this decision better. My thought process is "the cluster is unstable, perhaps some pressure, failing hardware, better try remediation than prevent it" - more thoughts here?

In general I agree, but maybe retry getting the upgrade status before going on with remediation? A failure might be caused by an actual upgrade (e.g. the GET request reaches the apiserver node which is just getting updated).
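Roughly like this (a minimal sketch assuming the clusterUpgrading helper from earlier; the backoff values are illustrative):

package upgrade

import (
	"context"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/util/retry"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// clusterUpgradingWithRetry retries the upgrade check a few times before
// giving up, so a transient apiserver blip (e.g. the node serving the GET
// is itself being updated) doesn't wrongly unblock remediation.
func clusterUpgradingWithRetry(ctx context.Context, c client.Client) (bool, error) {
	var upgrading bool
	backoff := wait.Backoff{Steps: 3, Duration: 2 * time.Second, Factor: 2.0}
	err := retry.OnError(backoff, func(error) bool { return true }, func() error {
		var err error
		upgrading, err = clusterUpgrading(ctx, c)
		return err
	})
	return upgrading, err
}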
>> - I think I need to add a status condition to say 'cluster is upgrading, remediation is disabled'
>
> Conditions, phases, and all that jazz is not trivial. There is no single guideline, and I want us to make a good decision.
> For a start, users and tools don't have a short way to understand the status of remediation. Today, only after I see that the status has a report of the number of nodes observed do I know the operator is active on that resource.
> The overall status of a resource can now be many things:
> - all nodes are OK - ready (or active?)
> - some nodes are unhealthy, no active remediation
> - some nodes are unhealthy, some/none have in-flight remediation
> - some nodes are unhealthy, remediation is disabled - reason is cluster upgrading - might have in-flight remediations as well
>
> Can we cram all of this down into a top-level status.phase = ready|active|degraded? Probably yes.

No "phase" please, see https://github.com/kubernetes/kubernetes/issues/7856 :)
Conditions are the way to go.
> Can we include a condition with a reason for each one? It's more complicated, but doable.
> Are we expecting tools or top-level operators to consult the remediation status, or to add conditions to it? I'm not aware of any, but if you have thoughts on this, please share.

I also don't have a good idea at the moment why other operators would check or even update conditions. I think however that a good status is useful for users / administrators.

> Good read on the subject -> https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#spec-and-status
>
> Finally, I think we can move on with this and not wait for 164 as contingency.

164?
> Worst case scenarios can be that a) the upgrade is progressing slowly and hence remediation is skipped for a long time, or b) like in the bug

which bug?