VolumeAttachment handling


Dan Yasny

Sep 8, 2021, 9:53:20 AM
to medik8s
Hi all, 

First of all, thank you for this project; it is very useful for anything stateful on k8s. I'm also glad to run into folks from good old oVirt again :)

I wanted to ask, since I didn't see an obvious answer in the docs, whether anything is done about VolumeAttachments beyond actually deleting and recreating the nodes. If the VolumeAttachment is removed early, that would avoid the dreaded Multi-Attach error which prevents workloads that consume RWO volumes from being rescheduled. 

Things work as they are, but the rescheduling takes much less time if a VA is removed early on in the fencing flow. 
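
To make the mechanics concrete: each VolumeAttachment pins a PV to a node, and an RWO volume can't attach anywhere else until that object is gone. A quick way to see what is held where (just a sketch with the Python kubernetes client, nothing to do with the operator itself):

from kubernetes import client, config

config.load_kube_config()
storage = client.StorageV1Api()

# Each VolumeAttachment pins one PV to one node; an RWO PV listed here
# cannot be attached to another node until this object is deleted.
for va in storage.list_volume_attachment().items:
    pv = va.spec.source.persistent_volume_name
    print(f"{va.metadata.name}: PV {pv} on node {va.spec.node_name}")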

Cheers, 
Dan

Andrew Beekhof

Sep 8, 2021, 11:19:47 PM
to medik8s
On Wednesday, 8 September 2021 at 11:53:20 pm UTC+10 dya...@gmail.com wrote:
Hi all, 

First of all, thank you for this project; it is very useful for anything stateful on k8s. I'm also glad to run into folks from good old oVirt again :)

Glad you're finding it useful :)
 

I wanted to ask, since I didn't see an obvious answer in the docs, whether anything is done about VolumeAttachments beyond actually deleting and recreating the nodes. If the VolumeAttachment is removed early, that would avoid the dreaded Multi-Attach error which prevents workloads that consume RWO volumes from being rescheduled. 

Things work as they are, but the rescheduling takes much less time if a VA is removed early on in the fencing flow. 

Correct me if I'm wrong, but my understanding is that once the Node CR is removed, k8s ignores any existing volume attachments, in the same way that it understands that any Pods associated with the Node are no longer running.
The Node CR is removed as soon as the node is put into a safe state, so I'd not expect there to be an earlier opportunity.
 


Marc Sluiter

Sep 9, 2021, 3:28:46 AM
to Andrew Beekhof, medik8s
Hi. It's a bit more complicated: pods stay in "Terminating" status for some time after the node has been deleted, and during that time their volumes are not released yet.
We released a new Poison Pill operator version just recently, which at least waits long enough before re-creating the node so that all pods terminate and volumes are released.
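
To make that visible: the stuck pods are the ones that still have a deletionTimestamp set after their node is gone. Roughly, with the Python kubernetes client ("worker-1" is only a placeholder node name):

from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

# A pod shows as "Terminating" while metadata.deletionTimestamp is set;
# until the pod object is fully gone, its volumes are still considered in use.
pods = core.list_pod_for_all_namespaces(field_selector="spec.nodeName=worker-1")
for pod in pods.items:
    if pod.metadata.deletion_timestamp is not None:
        print(f"{pod.metadata.namespace}/{pod.metadata.name} is still terminating")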

Cheers,

Marc

  
 



Dan Yasny

Sep 9, 2021, 8:27:29 PM
to Marc Sluiter, Andrew Beekhof, medik8s
On Thu, Sep 9, 2021 at 3:28 AM Marc Sluiter <mslu...@redhat.com> wrote:


Hi. It's a bit more complicated: pods stay in "Terminating" status for some time after the node has been deleted, and during that time their volumes are not released yet.
We released a new Poison Pill operator version just recently, which at least waits long enough before re-creating the node so that all pods terminate and volumes are released.

This looks very good, but my point is about something else. When I use a network-attached RWO volume, the deployment gets rescheduled very quickly once the node is detected as NotReady. That is where the new workload gets stuck, unable to start because of the dreaded Multi-Attach error, which persists until the VA is removed, and that takes a pretty long time. From my days at oVirt I remember us making sure a typical VM downtime wouldn't exceed 2 minutes, and even that was considered borderline excessive for a proper HA system. Here we are talking about triple that time, and easily more. 

I've tried a very simple test: a script that monitors node status, and when a node has been NotReady for over 1 minute, it deletes the VolumeAttachments on it. With this in place, my workloads get rescheduled nicely in under 90 seconds, keeping my A reasonably H. 
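
Roughly, the idea was this (a simplified sketch with the Python kubernetes client, not the exact script; the 60 second threshold matches the test above, the polling interval is arbitrary):

import time
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()
storage = client.StorageV1Api()

NOT_READY_TIMEOUT = 60  # seconds a node must be NotReady before we act

def not_ready_since(node):
    # Return the Ready condition's last transition time if the node is NotReady, else None.
    for cond in node.status.conditions or []:
        if cond.type == "Ready" and cond.status != "True":
            return cond.last_transition_time
    return None

while True:
    for node in core.list_node().items:
        since = not_ready_since(node)
        if since and time.time() - since.timestamp() > NOT_READY_TIMEOUT:
            # Drop the VolumeAttachments pinned to the dead node so its RWO
            # volumes can be re-attached where the workloads get rescheduled.
            for va in storage.list_volume_attachment().items:
                if va.spec.node_name == node.metadata.name:
                    storage.delete_volume_attachment(va.metadata.name)
    time.sleep(10)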

So I'm arguing that it shouldn't be enough to just delete and recreate the node; locking mechanisms such as VolumeAttachments need to be explicitly removed as soon as possible in order to unblock stateful workload rescheduling quickly. When a node is deleted, the VA does eventually get removed, but it takes a very long time. 

Hope it all makes sense to you guys?
 


Marc Sluiter

Sep 13, 2021, 10:25:33 AM
to Dan Yasny, Andrew Beekhof, medik8s
On Fri, Sep 10, 2021 at 2:27 AM Dan Yasny <dya...@gmail.com> wrote:



Hope it all makes sense to you guys?

Yes, in general it makes sense IMHO.
However, we might want to consider such changes carefully. We need to keep the balance between speed, adding manual actions (with potential unwanted side effects), and just waiting for kubernetes to do its thing automatically, even when it's slower.
Do you mind creating an issue, so we can keep track and evaluate? https://github.com/medik8s/poison-pill/issues/new

Thanks!

Marc  
 
 


Nir Yehia

Sep 13, 2021, 2:49:59 PM
to Marc Sluiter, Dan Yasny, Andrew Beekhof, medik8s
Interesting discussion.

Indeed, we currently only delete the node, and we use that to signal to the cluster that the node has been fenced and it's safe to reschedule the workload.
For pods, for example, it can take up to 60 seconds from the time the node is deleted until the cluster releases the pod.

Deleting the pod(s) directly could possibly speed up recovery time. I guess a similar technique could be used with volumes.
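
Sketched with the Python kubernetes client (the node name is a placeholder, and this is only safe once we already know the node has been fenced):

from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

fenced_node = "worker-1"  # placeholder; in practice, the node that was just fenced
pods = core.list_pod_for_all_namespaces(field_selector=f"spec.nodeName={fenced_node}")
for pod in pods.items:
    # grace_period_seconds=0 removes the pod object immediately instead of
    # waiting for the (unreachable) kubelet to confirm termination.
    core.delete_namespaced_pod(
        name=pod.metadata.name,
        namespace=pod.metadata.namespace,
        grace_period_seconds=0,
    )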

Long term, I think we should aim for a real API from Kubernetes to signal that a node is dead and that all its resources should be released immediately.
We have already hit some issues around node deletion that are not optimal and could be avoided if we simply had an API call to do that.

Filing an issue, as Marc suggested, sounds like a good idea at this point.




Dan Yasny

Sep 24, 2021, 9:31:52 AM
to medik8s
RFE submitted: https://github.com/medik8s/poison-pill/issues/103

Apologies for the delay, I just got back from PTO.