Cluster upgrade-aware remediation


Roy Golan

Oct 4, 2021, 5:28:16 PM
to med...@googlegroups.com
The PR[1] is ready for review and is proven to work on an OpenShift 4.9 CI build.
It detects the upgrade and prevents remediation of nodes, and once
the upgrade is no longer progressing it continues.

The logic checks whether the ClusterVersion condition of type 'Progressing' is set to 'True'.
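
For reference, a minimal sketch of that check (not the PR's actual code) could look roughly like the snippet below, assuming the openshift/api config types and a controller-runtime client; the helper name is made up:

package controllers

import (
	"context"

	configv1 "github.com/openshift/api/config/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// clusterUpgrading is a hypothetical helper: it reads the cluster-scoped
// ClusterVersion named "version" and reports whether its Progressing
// condition is currently True.
func clusterUpgrading(ctx context.Context, c client.Client) (bool, error) {
	cv := &configv1.ClusterVersion{}
	if err := c.Get(ctx, client.ObjectKey{Name: "version"}, cv); err != nil {
		return false, err
	}
	for _, cond := range cv.Status.Conditions {
		if cond.Type == configv1.OperatorProgressing && cond.Status == configv1.ConditionTrue {
			return true, nil
		}
	}
	return false, nil
}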

Some thoughts/doubts while working on it:
- If we skip remediation, do we still want to mark a node as unhealthy in the status?

- If we hit an error while checking whether the cluster is upgrading, I lean towards continuing with remediation. I left a comment suggesting that, in time, finer error handling will make this decision better. My thought process is: "the cluster is unstable, perhaps under some pressure or with failing hardware, better to try remediation than to prevent it" - more thoughts here?

- I think I need to add a status condition to say 'cluster is upgrading, remediation is disabled'
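
To make that concrete, here is a rough sketch of setting such a condition with the standard apimachinery helpers; the condition type and reason strings are placeholders, not necessarily what the PR will use:

package controllers

import (
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// setUpgradeCondition is illustrative only; "RemediationDisabled" and the
// reasons below are placeholder names.
func setUpgradeCondition(conditions *[]metav1.Condition, upgrading bool) {
	cond := metav1.Condition{
		Type:    "RemediationDisabled",
		Status:  metav1.ConditionFalse,
		Reason:  "NoOngoingClusterUpgrade",
		Message: "remediation is enabled",
	}
	if upgrading {
		cond.Status = metav1.ConditionTrue
		cond.Reason = "ClusterUpgrading"
		cond.Message = "cluster is upgrading, remediation is disabled"
	}
	// SetStatusCondition updates LastTransitionTime only when the status changes.
	meta.SetStatusCondition(conditions, cond)
}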

Finally, I think we can move on with this and not wait for 164 as a contingency.
Worst-case scenarios are that a) the upgrade is progressing slowly and hence remediation
is skipped for a long time, or b) as in the bug, remediation occurs while the upgrade is trying to restart
nodes. The latter would mean I miserably failed to understand how the upgrade is signaled and need to fix it ASAP (but it should be fixable).




Roy Golan

Oct 5, 2021, 1:50:29 AM
to med...@googlegroups.com, Andrew Beekhof
On Tue, 5 Oct 2021 at 00:27, Roy Golan <rgo...@redhat.com> wrote:
The PR[1] is ready for review and is proven to work on an OpenShift 4.9 CI build.
It detects the upgrade and prevents remediation of nodes, and once
the upgrade is no longer progressing it continues.

The logic checks whether the ClusterVersion condition of type 'Progressing' is set to 'True'.

Some thoughts/doubts while working on it:
- If we skip remediation, do we still want to mark a node as unhealthy in the status?

@Andrew Beekhof replied on Slack that yes. I think the same, and my current PR reflects that.

- If we hit an error while checking whether the cluster is upgrading, I lean towards continuing with remediation. I left a comment suggesting that, in time, finer error handling will make this decision better. My thought process is: "the cluster is unstable, perhaps under some pressure or with failing hardware, better to try remediation than to prevent it" - more thoughts here?

- I think I need to add a status condition to say 'cluster is upgrading, remediation is disabled'

Conditions, phases, and all that jazz are not trivial. There is no single guideline, and I want us to make a good decision.
For a start, users and tools don't have a short way to understand what the status of remediation is.
Today, only after I see that the status reports the number of observed nodes do I know the operator is active on that resource.
The overall status of a resource can now be many things:
- all nodes are OK - ready (or active?)
- some nodes are unhealthy, no active remediation
- some nodes are unhealthy, some/none have in-flight remediation
- some nodes are unhealthy, remediation is disabled - the reason is that the cluster is upgrading - might have in-flight remediations as well
Can we cram all of this down to a top-level status.phase = ready|active|degraded? Probably yes. Can we include a condition with a reason for each one? It's more complicated but doable.
Are we expecting tools or top-level operators to consult the remediation status, or add conditions to it? I'm not aware of any, but if you have thoughts on this, please share.
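
Just to illustrate the trade-off, one possible shape for such a status - a coarse top-level phase plus conditions carrying the finer reasons - could be something like the following; the field and value names are assumptions, not our actual API:

package v1alpha1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

type RemediationPhase string

const (
	PhaseReady    RemediationPhase = "Ready"    // all nodes healthy
	PhaseActive   RemediationPhase = "Active"   // unhealthy nodes observed, remediation allowed/in flight
	PhaseDegraded RemediationPhase = "Degraded" // remediation disabled, e.g. while the cluster upgrades
)

type NodeHealthCheckStatus struct {
	ObservedNodes int              `json:"observedNodes,omitempty"`
	HealthyNodes  int              `json:"healthyNodes,omitempty"`
	Phase         RemediationPhase `json:"phase,omitempty"`
	// Conditions carry the per-state reasons, e.g. Reason=ClusterUpgrading.
	Conditions []metav1.Condition `json:"conditions,omitempty"`
}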

Marc Sluiter

Oct 5, 2021, 6:18:12 AM
to Roy Golan, medik8s, Andrew Beekhof
On Tue, Oct 5, 2021 at 7:50 AM Roy Golan <rgo...@redhat.com> wrote:


On Tue, 5 Oct 2021 at 00:27, Roy Golan <rgo...@redhat.com> wrote:
The PR[1] is ready for review and is proven to work on an OpenShift 4.9 CI build.
It detects the upgrade and prevents remediation of nodes, and once
the upgrade is no longer progressing it continues.

The logic checks whether the ClusterVersion condition of type 'Progressing' is set to 'True'.

Some thoughts/doubts while working on it:
- If we skip remediation, do we still want to mark a node as unhealthy in the status?

@Andrew Beekhof replied on Slack that yes. I think the same, and my current PR reflects that.

+1
 

- If we hit an error while checking whether the cluster is upgrading, I lean towards continuing with remediation. I left a comment suggesting that, in time, finer error handling will make this decision better. My thought process is: "the cluster is unstable, perhaps under some pressure or with failing hardware, better to try remediation than to prevent it" - more thoughts here?

In general I agree, but maybe retry getting the upgrade status before going on with remediation? A failure might be caused by the actual upgrade (e.g. the GET request reaches an apiserver node that is just being updated).
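
Something like the following bounded retry could express that, using client-go's retry helper; the backoff numbers are arbitrary examples and clusterUpgrading is the helper sketched earlier in the thread:

package controllers

import (
	"context"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/util/retry"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// clusterUpgradingWithRetry retries the upgrade check a few times before the
// caller falls back to "continue with remediation".
func clusterUpgradingWithRetry(ctx context.Context, c client.Client) (bool, error) {
	var upgrading bool
	backoff := wait.Backoff{Steps: 3, Duration: 500 * time.Millisecond, Factor: 2.0}
	err := retry.OnError(backoff, func(error) bool { return true }, func() error {
		var err error
		upgrading, err = clusterUpgrading(ctx, c)
		return err
	})
	return upgrading, err
}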
 

- I think I need to add a status condition to say 'cluster is upgrading, remediation is disabled'

Conditions, phases, and all that jazz are not trivial. There is no single guideline, and I want us to make a good decision.
For a start, users and tools don't have a short way to understand what the status of remediation is.
Today, only after I see that the status reports the number of observed nodes do I know the operator is active on that resource.
The overall status of a resource can now be many things:
- all nodes are OK - ready (or active?)
- some nodes are unhealthy, no active remediation
- some nodes are unhealthy, some/none have in-flight remediation
- some nodes are unhealthy, remediation is disabled - the reason is that the cluster is upgrading - might have in-flight remediations as well
Can we cram all of this down to a top-level status.phase = ready|active|degraded? Probably yes.

Conditions are the way to go.
 
Can we include a condition with a reason for each one? It's more complicated but doable.
Are we expecting tools or top-level operators to consult the remediation status, or add conditions to it? I'm not aware of any, but if you have thoughts on this, please share.

I also don't have a good idea at the moment of why other operators would check or even update conditions. I do think, however, that a good status is useful for users/administrators.
 




Finally, I think we can move on with this and not wait for 164 as a contingency.

164?
 

Worst-case scenarios are that a) the upgrade is progressing slowly and hence remediation
is skipped for a long time, or b) as in the bug

which bug?
 
, remediation occurs while the upgrade is trying to restart
nodes. The latter would mean I miserably failed to understand how the upgrade is signaled and need to fix it ASAP (but it should be fixable).





Roy Golan

Oct 5, 2021, 8:04:28 AM
to Marc Sluiter, medik8s, Andrew Beekhof
On Tue, 5 Oct 2021 at 13:18, Marc Sluiter <mslu...@redhat.com> wrote:


On Tue, Oct 5, 2021 at 7:50 AM Roy Golan <rgo...@redhat.com> wrote:


On Tue, 5 Oct 2021 at 00:27, Roy Golan <rgo...@redhat.com> wrote:
The PR[1] is ready for review and is proven to work on an OpenShift 4.9 CI build.
It detects the upgrade and prevents remediation of nodes, and once
the upgrade is no longer progressing it continues.

The logic checks whether the ClusterVersion condition of type 'Progressing' is set to 'True'.

Some thoughts/doubts while working on it:
- If we skip remediation, do we still want to mark a node as unhealthy in the status?

@Andrew Beekhof replied on Slack that yes. I think the same, and my current PR reflects that.

+1
 

- If we hit an error while checking whether the cluster is upgrading, I lean towards continuing with remediation. I left a comment suggesting that, in time, finer error handling will make this decision better. My thought process is: "the cluster is unstable, perhaps under some pressure or with failing hardware, better to try remediation than to prevent it" - more thoughts here?

In general I agree, but maybe retry getting the upgrade status before going on with remediation? A failure might be caused by the actual upgrade (e.g. the GET request reaches an apiserver node that is just being updated).

Maybe a retry - though I don't want to stall the reconciliation loop for too long. How much time should we wait?
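
One way to avoid blocking the reconcile loop at all would be to return early with a requeue instead of retrying in place. A sketch, where the requeue delays and the helper name are illustrative assumptions and clusterUpgrading is the helper sketched earlier:

package controllers

import (
	"context"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// upgradeGate decides, near the top of Reconcile, whether to proceed with
// remediation; proceed==false means "skip remediation for now and requeue".
func upgradeGate(ctx context.Context, c client.Client) (result ctrl.Result, proceed bool) {
	upgrading, err := clusterUpgrading(ctx, c)
	if err != nil {
		// Could not tell whether an upgrade is in progress; re-check shortly
		// instead of stalling this reconcile with in-place retries.
		return ctrl.Result{RequeueAfter: 30 * time.Second}, false
	}
	if upgrading {
		// Skip remediation and re-check once a minute until the upgrade settles.
		return ctrl.Result{RequeueAfter: time.Minute}, false
	}
	return ctrl.Result{}, true
}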

 

- I think I need to add a status condition to say 'cluster is upgrading, remediation is disabled'

Conditions, phases, and all that jazz are not trivial. There is no single guideline, and I want us to make a good decision.
For a start, users and tools don't have a short way to understand what the status of remediation is.
Today, only after I see that the status reports the number of observed nodes do I know the operator is active on that resource.
The overall status of a resource can now be many things:
- all nodes are OK - ready (or active?)
- some nodes are unhealthy, no active remediation
- some nodes are unhealthy, some/none have in-flight remediation
- some nodes are unhealthy, remediation is disabled - the reason is that the cluster is upgrading - might have in-flight remediations as well
Can we cram all of this down to a top-level status.phase = ready|active|degraded? Probably yes.

Conditions are the way to go.
 
OpenShift uses phases as well. I don't think they recommend against using phases today (I'm not saying we should or shouldn't).
Luckily our API is v1alpha1, so we can experiment as we like.

Can we include a condition with a reason for each one? It's more complicated but doable.
Are we expecting tools or top-level operators to consult the remediation status, or add conditions to it? I'm not aware of any, but if you have thoughts on this, please share.

I also don't have a good idea at the moment of why other operators would check or even update conditions. I do think, however, that a good status is useful for users/administrators.
 




Finally, I think we can move on with this and not wait for 164 as a contingency.

164?
 

Worst-case scenarios are that a) the upgrade is progressing slowly and hence remediation
is skipped for a long time, or b) as in the bug

which bug?
 
searching for it...