I am not convinced this fix is safe for the cases where the kubelet is down yet the node is still up, and so are the pods that are using the volume.
Per VMware manuals, the multi-writer option is disabled by default.
@kubernetes/sig-storage-pr-reviews
@BaluDontu
As we discussed in the team, another area of concern is vSphere HA with VM Monitoring enabled. If the VM crashes or guest tools stop responding, the VM will be reset. Let's test out that scenario.
@BaluDontu: The following tests failed, say /retest to rerun them all:
| Test name | Commit | Details | Rerun command |
|---|---|---|---|
| pull-kubernetes-e2e-kops-aws | cfdff1a | link | /test pull-kubernetes-e2e-kops-aws |
| pull-kubernetes-e2e-gce-etcd3 | cfdff1a | link | /test pull-kubernetes-e2e-gce-etcd3 |
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
/retest
An alternative solution is to return false for vSphere in the isMultiAttachForbidden() function,
and make a change in DisksAreAttached in vsphere.go - https://github.com/kubernetes/kubernetes/blob/master/pkg/cloudprovider/providers/vsphere/vsphere.go#L567
In the DisksAreAttached function we can check the node VM's power state; if the node is powered off, detach the disks and then set false for the detached disks in the returned map.
This way we will not need to wait 6 minutes for pods to be recreated on another node.
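For illustration, a minimal sketch of this proposed flow. The types here (`nodeVM`, `powerState`, `detach`) are hypothetical stand-ins for the real govmomi/vCenter calls, and the signature is simplified relative to the actual vsphere.go code:

```go
package main

import "fmt"

// powerState mirrors the idea of vSphere's VirtualMachinePowerState values.
type powerState string

const (
	poweredOn  powerState = "poweredOn"
	poweredOff powerState = "poweredOff"
)

// nodeVM is a stand-in for the node's VM; real code would query vCenter.
type nodeVM struct {
	state    powerState
	attached map[string]bool // volume path -> currently attached
}

// detach is a stand-in for the vSphere DetachDisk call.
func (vm *nodeVM) detach(volPath string) {
	vm.attached[volPath] = false
}

// disksAreAttached sketches the proposal: if the node VM is powered off,
// detach each disk and report it as not attached, so the attach/detach
// controller can immediately attach it to another node. The detach side
// effect inside a "verify" function is exactly what this thread debates.
func disksAreAttached(vm *nodeVM, volPaths []string) map[string]bool {
	result := make(map[string]bool)
	for _, p := range volPaths {
		if vm.state == poweredOff {
			vm.detach(p)
			result[p] = false
			continue
		}
		result[p] = vm.attached[p]
	}
	return result
}

func main() {
	vm := &nodeVM{
		state:    poweredOff,
		attached: map[string]bool{"[vsanDatastore] kubevols/pvc-1.vmdk": true},
	}
	fmt.Println(disksAreAttached(vm, []string{"[vsanDatastore] kubevols/pvc-1.vmdk"}))
}
```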
@divyenpatel, we cannot do a Detach call in the DisksAreAttached() function. DisksAreAttached is only meant to check whether disks are attached or not; we cannot do any detach there.
@rootfs: #50239 is a different issue altogether. It only addresses the scenario where the node is not found in vCenter, but in the present case the node is still in vCenter, which is what causes this problem.
When a node is deleted from vCenter, volumes are detached automatically, but when a node is powered off in vCenter, volumes remain attached to it.
@BaluDontu you can check the VM power state there
/lgtm cancel
let's assess alternatives and impacts
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: BaluDontu, saad-ali
Associated issue: 50944
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these OWNERS files. You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment
@rootfs: Yes, if we check the power state, do you want to detach the volume at that point, while we are checking DisksAreAttached()? The problem is that DisksAreAttached() should report whether the volume is attached or not irrespective of power state; it should not depend on the power state. If you return that a volume is not attached to the powered-off node, then Kubernetes will never attempt to remove the volume from that node.
If you think there is any other way we can handle this, please let me know.
@BaluDontu I like the approach @divyenpatel proposed. In addition, in DisksAreAttached, check the power state of the VM. If the VM is powered off, detach the volume there, so that if the VM is restarted, it won't grab the volume.
@rootfs: The question is whether you would want to perform an action different from what the function is intended to do, in this case calling Detach() inside the DisksAreAttached() function. I think that's quite misleading.
> I am not convinced this fix is safe for the cases where the kubelet is down yet the node is still up, and so are the pods that are using the volume.
> Per VMware manuals, the multi-writer option is disabled by default.
> As we discussed in the team, another area of concern is vSphere HA with VM Monitoring enabled. If the VM crashes or guest tools stop responding, the VM will be reset. Let's test out that scenario.
It sounds like multi-attach is not safe for vSphere volumes. Removing approval.
> @BaluDontu I like the approach @divyenpatel proposed. In addition, in DisksAreAttached, check the power state of the VM. If the VM is powered off, detach the volume there, so that if the VM is restarted, it won't grab the volume.
Having a volume implicitly detach during a DisksAreAttached(...) operation, whose sole purpose is to verify that a volume is attached, not to mutate anything, is really ugly. But if it works and doesn't cause any regressions, I'm fine with it. Make sure to consider the case where the node the volume is attached to is powered down, and two or more pods referencing the same volume are scheduled to different nodes. I think that case should be safe, since they will all fight to detach the volume from the powered-down node, and then the attach/detach controller should allow only one to succeed in attaching (but you should verify).
I do not agree with adding a Detach operation inside DisksAreAttached(). A Detach operation also involves internal cache (actual state) updates which are not part of the VerifyVolumesAreAttached function. If you allow detach in VerifyVolumesAreAttached, how do you define the behavior of this function? It will cause a lot of confusion, and this function is already complicated due to the bulk-volume change.
Currently, powering off a VM instance causes a detach delay for all volume types, because based on the current node status the controller cannot tell whether it is safe to detach yet (it is not sure whether the volume is still mounted).
I am working on a proposal for possibly using VerifyVolumesAreAttached before issuing a detach, to solve some other problems. In this case, if we can determine that the mounts are gone because the node is powered off, we can safely detach the volume.
@jingxu97: I have one question. Once we detach the volume in DisksAreAttached() and return that this set of volumes is no longer attached to the node, wouldn't it update the state in that case?
@BaluDontu actually I need to take my comment back. Yes, it will update the actual state if volumes are detached. But VerifyVolumesAreAttached is called periodically, every 1 minute by default, and the user can change the frequency or even disable it entirely. I don't think this is a reliable way to solve the problem.
Spoke with @jingxu97 offline, and I agree: detaching in DisksAreAttached(...) may result in undesired behavior if the node is powered on again very quickly, and possibly in other scenarios.
You can still go down that route if you've tested it and are comfortable with it. Otherwise, this PR as is (whitelisting vSphere for multi-attach) might be a less risky workaround, considering this was the behavior for vSphere in 1.6 and earlier.
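A minimal sketch of what "whitelisting vSphere for multi-attach" amounts to. The volume-name prefix check and the standalone signature here are simplified assumptions for illustration, not the actual attach/detach controller code:

```go
package main

import (
	"fmt"
	"strings"
)

// isMultiAttachForbidden sketches the whitelist approach from this PR:
// for most volume plugins, report that attaching the same read-write
// volume to a second node is forbidden; for vSphere volumes, return
// false so the controller may attach the volume to a new node before
// the detach from a powered-off node completes (the pre-1.7 behavior).
// volumeName here is a simplified "<plugin>/<path>" stand-in.
func isMultiAttachForbidden(volumeName string) bool {
	if strings.HasPrefix(volumeName, "kubernetes.io/vsphere-volume/") {
		return false // vSphere: allow multi-attach handling to proceed
	}
	// Default: forbid attaching the volume to a second node.
	return true
}

func main() {
	fmt.Println(isMultiAttachForbidden("kubernetes.io/vsphere-volume/[vsanDatastore] kubevols/pvc-1.vmdk"))
	fmt.Println(isMultiAttachForbidden("kubernetes.io/aws-ebs/vol-123"))
}
```

The trade-off discussed above is that this leaves the window where a quickly restarted VM and the new node could both see the disk, which is why the thread also examines the power-off scenarios below.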
We analyzed the following scenarios, and the current PR looks safer than the alternative approach of detaching the disk in DisksAreAttached():
- The user powers off the VM.
- The kubelet is not responsive on the node.
- The node VM hangs with vSphere HA and VM Monitoring enabled.

Also, the user might disable the periodic VerifyVolumesAreAttached calls (run every 1 minute by default). In that case, DisksAreAttached() would not be called at all and would not promptly detach the volume from the powered-off node.
I have discussed this with @divyenpatel offline and he agrees with the current approach.
We would like to go ahead with it. @saad-ali @jingxu97, can you lgtm it if you feel it is OK?
/retest
/lgtm.
One question: after the VM was powered off, how long did it take for the pods to get recreated on other nodes?
@jingxu97 : PR still says lgtm label is not present. Can you try it again?
@jingxu97: It took a minute or so for Kubernetes to recognize that the VM was powered off.
After that, it deleted all the pods on the powered-off node (node1 in this case) and immediately started the volume attach on a new node (node3 in this case), after which the pod became available. In total it took around 40-50 seconds for the pod to become available.
```
W0821 19:27:12.632675  1 vsphere.go:1446] VM node1, is not in poweredOn state
I0821 19:27:33.459577  1 controller_utils.go:285] Recording status change NodeNotReady event message for node node1
I0821 19:27:33.459653  1 controller_utils.go:203] Update ready status of pods on node [node1]
I0821 19:27:33.459791  1 event.go:217] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"node1", UID:"c32a5114-86a4-11e7-bff5-005056bd2502", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'NodeNotReady' Node node1 status is now: NodeNotReady
W0821 19:27:33.553165  1 vsphere.go:611] VM node1, is not in poweredOn state
I0821 19:27:33.553213  1 nodecontroller.go:752] Deleting node (no longer present in cloud provider): node1
I0821 19:27:33.553220  1 controller_utils.go:274] Recording Deleting Node node1 because it's not present according to cloud provider event message for node node1
I0821 19:27:33.553384  1 event.go:217] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"node1", UID:"c32a5114-86a4-11e7-bff5-005056bd2502", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'DeletingNode' Node node1 event: Deleting Node node1 because it's not present according to cloud provider
I0821 19:27:38.553459  1 nodecontroller.go:623] NodeController observed a Node deletion: node1
I0821 19:27:38.553518  1 controller_utils.go:274] Recording Removing Node node1 from NodeController event message for node node1
I0821 19:27:38.553811  1 event.go:217] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"node1", UID:"c32a5114-86a4-11e7-bff5-005056bd2502", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'RemovingNode' Node node1 event: Removing Node node1 from NodeController
I0821 19:27:53.531262  1 gc_controller.go:157] Found orphaned Pod redis-server-2448813087-m62zw assigned to the Node node1. Deleting.
I0821 19:27:53.531267  1 gc_controller.go:62] PodGC is force deleting Pod: default:redis-server-2448813087-m62zw
I0821 19:27:53.557972  1 gc_controller.go:161] Forced deletion of orphaned Pod redis-server-2448813087-m62zw succeeded
I0821 19:27:56.771960  1 reconciler.go:208] Started AttachVolume for volume "kubernetes.io/vsphere-volume/[vsanDatastore] kubevols/kubernetes-dynamic-pvc-208563c0-86a5-11e7-bff5-005056bd2502.vmdk" to node "node3"
I0821 19:27:57.804622  1 operation_generator.go:290] AttachVolume.Attach succeeded for volume "kubernetes.io/vsphere-volume/[vsanDatastore] kubevols/kubernetes-dynamic-pvc-208563c0-86a5-11e7-bff5-005056bd2502.vmdk" (spec.Name: "pvc-208563c0-86a5-11e7-bff5-005056bd2502") from node "node3".
I0821 19:27:57.880295  1 node_status_updater.go:143] Updating status for node "node3" succeeded. patchBytes: "{\"status\":{\"volumesAttached\":[{\"devicePath\":\"/dev/disk/by-id/wwn-0x6000c291fb5e79d81ea1958e888c9d37\",\"name\":\"kubernetes.io/vsphere-volume/[vsanDatastore] kubevols/kubernetes-dynamic-pvc-208563c0-86a5-11e7-bff5-005056bd2502.vmdk\"}]}}" VolumesAttached: [{kubernetes.io/vsphere-volume/[vsanDatastore] kubevols/kubernetes-dynamic-pvc-208563c0-86a5-11e7-bff5-005056bd2502.vmdk /dev/disk/by-id/wwn-0x6000c291fb5e79d81ea1958e888c9d37}]
```
/test all [submit-queue is verifying that this PR is safe to merge]
Merged #51066.
Automatic merge from submit-queue
Commit found in the "release-1.7" branch appears to be this PR. Removing the "cherrypick-candidate" label. If this is an error find help to get your PR picked.
When this change was applied/discussed, what was the vSphere integration expected to do in the "normal" case? When a pod is restarted on another node, this change makes the volume get attached on the new node first, before the detach on the previous node is executed.
We see kernel panics where the volume is attached on one node before being detached from another, so I am asking to find out whether there is a bug lurking elsewhere, even though this particular change was reverted. Looking at the discussion, it seems the normal case would be handled by the vSphere cloud provider, but what exactly is the expected behavior here? Should it report an error when asked to attach a volume, without attempting the attach, if the volume is attached to a healthy node? Should it detach as part of attach?
To clarify this for future users who may stumble on this: the change made here was reverted in #67825. The Kubernetes versions that no longer include this change are:
- `v1.10.8` (and later)
- `v1.11.4` (and later)
- `v1.12+`