Re: [kubernetes/kubernetes] PVC status: "Lost" (AWS volumes) (#60298)


k8s-ci-robot

Feb 23, 2018, 7:20:27 AM
to kubernetes/kubernetes, k8s-mirror-storage-bugs, Team mention

@mthoretton: Reiterating the mentions to trigger a notification:
@kubernetes/sig-storage-bugs

In response to this:

@kubernetes/sig-storage-bugs

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.



Jan Šafránek

Feb 23, 2018, 7:35:53 AM
to kubernetes/kubernetes, k8s-mirror-storage-bugs, Team mention

"Lost" means that something deleted PVs that were bound to the PVCs. Check your audit logs what and when did that.

Mathieu Thoretton

Feb 23, 2018, 8:41:11 AM
to kubernetes/kubernetes, k8s-mirror-storage-bugs, Team mention

Do you mean AWS audit logs (CloudTrail)? If so, we don't have that service enabled :/ On the k8s side I've checked various logs; I found this line in the kube-apiserver logs which doesn't say much but could give the time when it happened:

Mathieu Thoretton

Feb 23, 2018, 10:44:04 AM
to kubernetes/kubernetes, k8s-mirror-storage-bugs, Team mention

For now, our best guess about what happened is: the AWS API was not working properly that night and returned no or incomplete data. k8s then decided to remove some PVs and never checked again later whether the volumes actually exist on AWS.
Another interesting fact: it only affected volumes created 3 days ago, not the ones created 2 days ago.

Michelle Au

Feb 23, 2018, 11:33:27 AM
to kubernetes/kubernetes, k8s-mirror-storage-bugs, Team mention

Can you search master apiserver and controller manager logs for the PV name (pvc-750...)?

Mathieu Thoretton

Feb 23, 2018, 6:00:03 PM
to kubernetes/kubernetes, k8s-mirror-storage-bugs, Team mention

apiserver logs: no occurrence of "pvc-"
controller manager logs (I've filtered to lines containing the data-kafka-N PVCs): https://gist.github.com/mthoretton/42f28ed2b8e7c04a098220deb686f197

Tell me if you need the full (unfiltered) logs or data from other sources. Thanks for your help.

Michelle Au

Feb 23, 2018, 6:42:30 PM
to kubernetes/kubernetes, k8s-mirror-storage-bugs, Team mention

Claims become lost if one of the following occurs:

  • PVC.Spec.VolumeName gets set to ""
  • The PV got deleted
  • The PV.Spec.ClaimRef.UID doesn't match the PVC's UID

Can you check which condition it was?
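For example, a quick way to check each of the three (a sketch using the namespace and PVC names that appear in the logs in this thread; substitute the full PV name where it is truncated):

# 1) is spec.volumeName still set on the claim?
kubectl -n gygops get pvc data-kafka-1 -o jsonpath='{.spec.volumeName}{"\n"}'
# 2) does that PV still exist?
kubectl get pv pvc-750...                                               # replace with the full PV name
# 3) do the UIDs match?
kubectl get pv pvc-750... -o jsonpath='{.spec.claimRef.uid}{"\n"}'      # replace with the full PV name
kubectl -n gygops get pvc data-kafka-1 -o jsonpath='{.metadata.uid}{"\n"}'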

Mathieu Thoretton

Feb 26, 2018, 5:15:37 AM
to kubernetes/kubernetes, k8s-mirror-storage-bugs, Team mention

In our case, the PVs got deleted. We managed to recover by recreating the deleted PVs from manifests in which we set fields like the PVC name and awsElasticBlockStore.volumeID manually.
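For anyone hitting the same thing, a minimal sketch of such a recovery manifest, assuming the in-tree AWS EBS plugin (the PV name, size, zone and volume ID below are placeholders, not values from this cluster):

apiVersion: v1
kind: PersistentVolume
metadata:
  name: pvc-750-recovered            # placeholder: reuse the original PV name so the PVC's spec.volumeName still matches
spec:
  capacity:
    storage: 100Gi                   # placeholder: must match the original volume size
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: gp2              # assumption: whatever class the claim uses
  claimRef:                          # pre-bind to the existing "Lost" claim
    namespace: gygops
    name: data-kafka-1
  awsElasticBlockStore:
    volumeID: aws://eu-central-1a/vol-0123456789abcdef0   # placeholder EBS volume ID
    fsType: ext4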

Michelle Au

Feb 26, 2018, 11:19:26 AM
to kubernetes/kubernetes, k8s-mirror-storage-bugs, Team mention

Hm, if the PV got deleted then there must be some logs in the apiserver that show who deleted it. How long ago did it happen? Is it possible it was too long ago and the apiserver logs got rotated out?

Mathieu Thoretton

Feb 28, 2018, 8:22:23 AM
to kubernetes/kubernetes, k8s-mirror-storage-bugs, Team mention

It most likely happened on February 23rd. Logs have been rotated, but I've kept a dump from a few days ago that covers the period back to when the kube-apiserver pods were created. Unfortunately I don't find anything interesting in those logs. Maybe we should change the log level :/ (see the audit-policy sketch after the log excerpt below)
This is the only "relevant" part from kube-apiserver I was able to find:

I0223 03:03:23.149630       1 trace.go:76] Trace[663513196]: "Get /api/v1/namespaces/gygops/persistentvolumeclaims/data-zk-0" (started: 2018-02-23 03:03:22.555886371 +0000 UTC) (total time: 593.176637ms):
Trace[663513196]: [208.542665ms] [208.535432ms] About to write a response
Trace[663513196]: [593.176637ms] [384.633972ms] END
I0223 03:03:23.760834       1 trace.go:76] Trace[1403341013]: "Get /api/v1/namespaces/gygops/persistentvolumeclaims/data-kafka-1" (started: 2018-02-23 03:03:23.101517193 +0000 UTC) (total time: 659.28669ms):
Trace[1403341013]: [659.229155ms] [659.221499ms] About to write a response
I0223 03:03:24.624111       1 trace.go:76] Trace[1452965996]: "GuaranteedUpdate etcd3: *api.Node" (started: 2018-02-23 03:03:22.520753728 +0000 UTC) (total time: 2.103326426s):
Trace[1452965996]: [377.655554ms] [377.585304ms] Transaction prepared
Trace[1452965996]: [2.103199717s] [1.725544163s] Transaction committed
I0223 03:03:24.624539       1 trace.go:76] Trace[750969696]: "List /api/v1/nodes" (started: 2018-02-23 03:03:23.061341761 +0000 UTC) (total time: 1.563181811s):
Trace[750969696]: [855.113714ms] [855.105893ms] Listing from storage done
Trace[750969696]: [1.563180137s] [708.040799ms] Writing http response done (12 items)
I0223 03:03:24.677393       1 trace.go:76] Trace[2000486475]: "List /api/v1/nodes" (started: 2018-02-23 03:03:22.521314968 +0000 UTC) (total time: 2.156055333s):
Trace[2000486475]: [539.212233ms] [539.205709ms] Listing from storage done
Trace[2000486475]: [2.156052418s] [1.616813167s] Writing http response done (12 items)
I0223 03:03:24.751598       1 trace.go:76] Trace[1384904510]: "List /api/v1/nodes" (started: 2018-02-23 03:03:22.83348263 +0000 UTC) (total time: 1.91808473s):
Trace[1384904510]: [468.834112ms] [468.819ms] Listing from storage done
Trace[1384904510]: [1.918080777s] [1.449222275s] Writing http response done (12 items)
I0223 03:03:24.896638       1 trace.go:76] Trace[894557933]: "Get /api/v1/nodes/ip-10-21-3-38.eu-central-1.compute.internal" (started: 2018-02-23 03:03:22.792555404 +0000 UTC) (total time: 2.104044913s):
Trace[894557933]: [2.104044913s] [2.103962564s] END
I0223 03:03:24.980804       1 trace.go:76] Trace[699219671]: "List /api/v1/nodes" (started: 2018-02-23 03:03:22.519967391 +0000 UTC) (total time: 2.460813182s):
Trace[699219671]: [1.182727915s] [1.182720262s] Listing from storage done
Trace[699219671]: [2.46080943s] [1.278053869s] Writing http response done (12 items)
I0223 03:03:25.075838       1 trace.go:76] Trace[920597048]: "List /api/v1/nodes" (started: 2018-02-23 03:03:22.521501531 +0000 UTC) (total time: 2.554311957s):
Trace[920597048]: [580.518267ms] [580.513175ms] Listing from storage done
Trace[920597048]: [2.554308532s] [1.973767496s] Writing http response done (12 items)
I0223 03:03:25.316771       1 trace.go:76] Trace[714806411]: "List /api/v1/nodes" (started: 2018-02-23 03:03:23.074751009 +0000 UTC) (total time: 2.241960993s):
Trace[714806411]: [1.700847836s] [1.696369814s] Listing from storage done
Trace[714806411]: [2.24195771s] [541.086708ms] Writing http response done (12 items)
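Regarding changing the log level mentioned above: higher verbosity alone rarely records who issued a delete, but an API server audit policy does. A minimal sketch (the apiVersion may need to be audit.k8s.io/v1beta1 on older clusters, and the file paths/flags below are assumptions):

# audit-policy.yaml
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  # record metadata (who, what, when) for every delete of PVs/PVCs
  - level: Metadata
    verbs: ["delete", "deletecollection"]
    resources:
      - group: ""
        resources: ["persistentvolumes", "persistentvolumeclaims"]
  # keep everything else out of the audit log
  - level: None

# kube-apiserver flags (assumed paths):
#   --audit-policy-file=/etc/kubernetes/audit-policy.yaml
#   --audit-log-path=/var/log/kube-apiserver-audit.log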

fejta-bot

May 29, 2018, 9:32:28 AM
to kubernetes/kubernetes, k8s-mirror-storage-bugs, Team mention

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

fejta-bot

Jun 28, 2018, 10:18:00 AM
to kubernetes/kubernetes, k8s-mirror-storage-bugs, Team mention

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.

/lifecycle rotten
/remove-lifecycle stale

k8s-ci-robot

Jul 28, 2018, 11:05:47 AM
to kubernetes/kubernetes, k8s-mirror-storage-bugs, Team mention

Closed #60298.

fejta-bot

Jul 28, 2018, 11:05:51 AM
to kubernetes/kubernetes, k8s-mirror-storage-bugs, Team mention

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.


Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

gseshagirirao

May 29, 2020, 4:31:34 PM
to kubernetes/kubernetes, k8s-mirror-storage-bugs, Team mention

Hi,

While trying to restore the EBS volumes from a snapshot, the restored PVC's status is reported as Lost. We are using AWS KMS CMK keys with a policy granting kms:* permissions. The backup operation went fine; the restore operation is able to restore all k8s resources except the PVC.

k get pvc -n nginx-example
NAME         STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
nginx-logs   Lost     pvc-bda55207-a1e5-11ea-b7e6-02b82f6b7f4e   0                         gp2-encrypt    4m22s

k get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS     CLAIM                      STORAGECLASS   REASON   AGE
pvc-bda55207-a1e5-11ea-b7e6-02b82f6b7f4e   1Gi        RWO            Retain           Released   nginx-example/nginx-logs   gp2-encrypt             33m

We noticed the UIDs of the PV and the PVC do not match after the PVC is restored.
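One possible way to reconcile that, sketched here and not necessarily the fix velero intends: clear the stale UID (and resourceVersion) in the Released PV's claimRef so the PV controller can re-bind it to the restored claim:

kubectl patch pv pvc-bda55207-a1e5-11ea-b7e6-02b82f6b7f4e \
  -p '{"spec":{"claimRef":{"uid":null,"resourceVersion":null}}}'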

The service account used by the velero pod has the below policy:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "VisualEditor0",
      "Effect": "Allow",
      "Action": [
        "ec2:DeleteSnapshot",
        "kms:Decrypt",
        "ec2:CreateTags",
        "kms:GenerateDataKeyWithoutPlaintext",
        "s3:ListBucket",
        "kms:GenerateDataKeyPairWithoutPlaintext",
        "ec2:DescribeSnapshots",
        "kms:GenerateDataKeyPair",
        "kms:ReEncryptFrom",
        "ec2:CreateVolume",
        "s3:ListMultipartUploadParts",
        "s3:PutObject",
        "s3:GetObject",
        "s3:AbortMultipartUpload",
        "ec2:DescribeVolumes",
        "ec2:CreateSnapshot",
        "kms:GenerateDataKey",
        "kms:ReEncryptTo",
        "s3:DeleteObject"
      ],
      "Resource": "*"
    }
  ]
}

We are using the below YAML to define the StorageClass and PVC:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp2-encrypt
parameters:
  type: gp2
  encrypted: "true"
  fsType: ext4
  kmsKeyId: arn:aws:kms:us-east-XXXXXX
provisioner: kubernetes.io/aws-ebs
reclaimPolicy: Retain


kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: nginx-logs
  namespace: nginx-example
  labels:
    app: nginx
spec:
  # Optional:
  storageClassName: gp2-encrypt
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 50Mi

Below are logs from the velero pod:

time="2020-05-29T19:59:04Z" level=info msg="Starting restore of backup cluster-addons/nginx-backup-5" logSource="pkg/restore/restore.go:394" restore=cluster-addons/nginx-backup-5-20200529155858
time="2020-05-29T19:59:04Z" level=info msg="Restoring cluster level resource 'persistentvolumes'" logSource="pkg/restore/restore.go:779" restore=cluster-addons/nginx-backup-5-20200529155858
time="2020-05-29T19:59:04Z" level=info msg="Getting client for /v1, Kind=PersistentVolume" logSource="pkg/restore/restore.go:821" restore=cluster-addons/nginx-backup-5-20200529155858
time="2020-05-29T20:09:04Z" level=info msg="Restoring resource 'persistentvolumeclaims' into namespace 'nginx-example'" logSource="pkg/restore/restore.go:777" restore=cluster-addons/nginx-backup-5-20200529155858
time="2020-05-29T20:09:04Z" level=info msg="Getting client for /v1, Kind=PersistentVolumeClaim" logSource="pkg/restore/restore.go:821" restore=cluster-addons/nginx-backup-5-20200529155858
time="2020-05-29T20:09:04Z" level=info msg="Executing item action for persistentvolumeclaims" logSource="pkg/restore/restore.go:1030" restore=cluster-addons/nginx-backup-5-20200529155858
time="2020-05-29T20:09:04Z" level=info msg="Executing AddPVFromPVCAction" cmd=/velero logSource="pkg/restore/add_pv_from_pvc_action.go:44" pluginName=velero restore=cluster-addons/nginx-backup-5-20200529155858
time="2020-05-29T20:09:04Z" level=info msg="Adding PV pvc-bda55207-a1e5-11ea-b7e6-02b82f6b7f4e as an additional item to restore" cmd=/velero logSource="pkg/restore/add_pv_from_pvc_action.go:66" pluginName=velero restore=cluster-addons/nginx-backup-5-20200529155858
time="2020-05-29T20:09:04Z" level=info msg="Skipping persistentvolumes/pvc-bda55207-a1e5-11ea-b7e6-02b82f6b7f4e because it's already been restored." logSource="pkg/restore/restore.go:910" restore=cluster-addons/nginx-backup-5-20200529155858
time="2020-05-29T20:09:04Z" level=info msg="Executing item action for persistentvolumeclaims" logSource="pkg/restore/restore.go:1030" restore=cluster-addons/nginx-backup-5-20200529155858
time="2020-05-29T20:09:04Z" level=info msg="Executing ChangeStorageClassAction" cmd=/velero logSource="pkg/restore/change_storageclass_action.go:63" pluginName=velero restore=cluster-addons/nginx-backup-5-20200529155858
time="2020-05-29T20:09:04Z" level=info msg="Attempting to restore PersistentVolumeClaim: nginx-logs" logSource="pkg/restore/restore.go:1136" restore=cluster-addons/nginx-backup-5-20200529155858
time="2020-05-29T20:09:04Z" level=info msg="Done executing ChangeStorageClassAction" cmd=/velero logSource="pkg/restore/change_storageclass_action.go:74" pluginName=velero restore=cluster-addons/nginx-backup-5-20200529155858

CloudTrail does not have much information. Would you please let us know if any additional settings are needed here?


