@mthoretton: Reiterating the mentions to trigger a notification:
@kubernetes/sig-storage-bugs
"Lost" means that something deleted PVs that were bound to the PVCs. Check your audit logs what and when did that.
Do you mean AWS audit logs (CloudTrail)? If yes, we don't have this service enabled :/ On the k8s side I've checked different logs and found a line from the kube-apiserver logs which doesn't say much by itself, but could give the time when it happened:
For now our best guess about what happened is: the AWS API was not working properly that night and returned no/incomplete data. Then k8s decided to remove some PVs and never checked again later whether the volumes actually exist on AWS.
Another interesting fact: it only affected volumes created 3d ago, and not the ones created 2d ago.
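One way to double-check the AWS side is to look the volumes up by the tags the in-tree kubernetes.io/aws-ebs provisioner adds to dynamically provisioned volumes (a sketch; <pv-name> is a placeholder for the full PV name):

# List EBS volumes that were provisioned for a given PV, if they still exist
aws ec2 describe-volumes \
  --filters "Name=tag:kubernetes.io/created-for/pv/name,Values=<pv-name>" \
  --query 'Volumes[].{Id:VolumeId,State:State,Created:CreateTime}'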
Can you search master apiserver and controller manager logs for the PV name (pvc-750...)?
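Something along these lines would cover both components (a sketch; it assumes the control-plane components run as pods in kube-system, <master-node> and <pv-name> are placeholders, and journalctl only applies if they log to journald instead):

# Search control-plane logs on the master for the PV name
kubectl -n kube-system logs kube-apiserver-<master-node> | grep <pv-name>
kubectl -n kube-system logs kube-controller-manager-<master-node> | grep <pv-name>
# Or, on the master host itself, if the components log to journald:
journalctl -u kube-apiserver -u kube-controller-manager | grep <pv-name>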
apiserver logs: no occurrence of "pvc-"
controller manager logs (I've filtered lines containing the data-kafka-N PVCs): https://gist.github.com/mthoretton/42f28ed2b8e7c04a098220deb686f197
Tell me if you need the full (unfiltered) logs or data from other sources. Thanks for your help.
Claims become lost if one of the following occurs:
Can you check which condition it was?
In our case, the PV got deleted. We managed to recover by recreating the deleted PVs: we applied manifests where we set fields like the claim name and awsElasticBlockStore.volumeID manually.
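For reference, recreating one of those PVs by hand looks roughly like this (a sketch; the PV name, size, zone, volume ID, and storage class are placeholders, and the PV name has to match the spec.volumeName still recorded on the Lost PVC):

# Recreate a deleted PV and pre-bind it to the surviving PVC
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pvc-750xxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx   # placeholder; must match the PVC's spec.volumeName
spec:
  capacity:
    storage: 100Gi                                  # placeholder size
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: gp2                             # placeholder storage class
  awsElasticBlockStore:
    volumeID: aws://eu-central-1a/vol-0xxxxxxxxxxxxxxxx   # placeholder EBS volume ID
    fsType: ext4
  claimRef:
    namespace: gygops
    name: data-kafka-1
EOF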
Hm, if the PV got deleted then there must be some logs in the apiserver that show who deleted it. How long ago did it happen? Is it possible it was too long ago and the apiserver logs got rotated out?
It most likely happened on February 23rd. The logs have been rotated, but I've kept a dump from a few days ago that covers back to when the kube-apiserver pods were created. Unfortunately I don't find anything interesting in those logs. Maybe we should change the log level :/
This is the only "relevant" part from kube-apiserver I was able to find:
I0223 03:03:23.149630 1 trace.go:76] Trace[663513196]: "Get /api/v1/namespaces/gygops/persistentvolumeclaims/data-zk-0" (started: 2018-02-23 03:03:22.555886371 +0000 UTC) (total time: 593.176637ms):
Trace[663513196]: [208.542665ms] [208.535432ms] About to write a response
Trace[663513196]: [593.176637ms] [384.633972ms] END
I0223 03:03:23.760834 1 trace.go:76] Trace[1403341013]: "Get /api/v1/namespaces/gygops/persistentvolumeclaims/data-kafka-1" (started: 2018-02-23 03:03:23.101517193 +0000 UTC) (total time: 659.28669ms):
Trace[1403341013]: [659.229155ms] [659.221499ms] About to write a response
I0223 03:03:24.624111 1 trace.go:76] Trace[1452965996]: "GuaranteedUpdate etcd3: *api.Node" (started: 2018-02-23 03:03:22.520753728 +0000 UTC) (total time: 2.103326426s):
Trace[1452965996]: [377.655554ms] [377.585304ms] Transaction prepared
Trace[1452965996]: [2.103199717s] [1.725544163s] Transaction committed
I0223 03:03:24.624539 1 trace.go:76] Trace[750969696]: "List /api/v1/nodes" (started: 2018-02-23 03:03:23.061341761 +0000 UTC) (total time: 1.563181811s):
Trace[750969696]: [855.113714ms] [855.105893ms] Listing from storage done
Trace[750969696]: [1.563180137s] [708.040799ms] Writing http response done (12 items)
I0223 03:03:24.677393 1 trace.go:76] Trace[2000486475]: "List /api/v1/nodes" (started: 2018-02-23 03:03:22.521314968 +0000 UTC) (total time: 2.156055333s):
Trace[2000486475]: [539.212233ms] [539.205709ms] Listing from storage done
Trace[2000486475]: [2.156052418s] [1.616813167s] Writing http response done (12 items)
I0223 03:03:24.751598 1 trace.go:76] Trace[1384904510]: "List /api/v1/nodes" (started: 2018-02-23 03:03:22.83348263 +0000 UTC) (total time: 1.91808473s):
Trace[1384904510]: [468.834112ms] [468.819ms] Listing from storage done
Trace[1384904510]: [1.918080777s] [1.449222275s] Writing http response done (12 items)
I0223 03:03:24.896638 1 trace.go:76] Trace[894557933]: "Get /api/v1/nodes/ip-10-21-3-38.eu-central-1.compute.internal" (started: 2018-02-23 03:03:22.792555404 +0000 UTC) (total time: 2.104044913s):
Trace[894557933]: [2.104044913s] [2.103962564s] END
I0223 03:03:24.980804 1 trace.go:76] Trace[699219671]: "List /api/v1/nodes" (started: 2018-02-23 03:03:22.519967391 +0000 UTC) (total time: 2.460813182s):
Trace[699219671]: [1.182727915s] [1.182720262s] Listing from storage done
Trace[699219671]: [2.46080943s] [1.278053869s] Writing http response done (12 items)
I0223 03:03:25.075838 1 trace.go:76] Trace[920597048]: "List /api/v1/nodes" (started: 2018-02-23 03:03:22.521501531 +0000 UTC) (total time: 2.554311957s):
Trace[920597048]: [580.518267ms] [580.513175ms] Listing from storage done
Trace[920597048]: [2.554308532s] [1.973767496s] Writing http response done (12 items)
I0223 03:03:25.316771 1 trace.go:76] Trace[714806411]: "List /api/v1/nodes" (started: 2018-02-23 03:03:23.074751009 +0000 UTC) (total time: 2.241960993s):
Trace[714806411]: [1.700847836s] [1.696369814s] Listing from storage done
Trace[714806411]: [2.24195771s] [541.086708ms] Writing http response done (12 items)
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale
Closed #60298.
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
Hi,
While trying to restore the EBS volumes from the snapshot, the restored PVC comes back with status Lost. We are using AWS KMS CMK keys with a policy having kms* permissions. The backup operation went fine; the restore operation is able to restore all k8s resources except the PVC.
k get pvc -n nginx-example
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
nginx-logs Lost pvc-bda55207-a1e5-11ea-b7e6-02b82f6b7f4e 0 gp2-encrypt 4m22s
k get pv
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
pvc-bda55207-a1e5-11ea-b7e6-02b82f6b7f4e 1Gi RWO Retain Released nginx-example/nginx-logs gp2-encrypt 33m
We noticed that the UID recorded in the PV's claimRef and the UID of the restored PVC do not match after the restore.
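That mismatch is usually what keeps the Released PV from binding again: its claimRef still carries the old PVC's uid. One way to let it rebind to the restored PVC (a sketch, not an official Velero procedure) is to clear the stale uid:

# Drop the stale uid/resourceVersion from the PV's claimRef so the
# controller can re-bind it to the newly restored nginx-logs PVC
kubectl patch pv pvc-bda55207-a1e5-11ea-b7e6-02b82f6b7f4e \
  -p '{"spec":{"claimRef":{"uid":null,"resourceVersion":null}}}'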
The service account used by the velero pod has the below policy:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "VisualEditor0",
      "Effect": "Allow",
      "Action": [
        "ec2:DeleteSnapshot",
        "kms:Decrypt",
        "ec2:CreateTags",
        "kms:GenerateDataKeyWithoutPlaintext",
        "s3:ListBucket",
        "kms:GenerateDataKeyPairWithoutPlaintext",
        "ec2:DescribeSnapshots",
        "kms:GenerateDataKeyPair",
        "kms:ReEncryptFrom",
        "ec2:CreateVolume",
        "s3:ListMultipartUploadParts",
        "s3:PutObject",
        "s3:GetObject",
        "s3:AbortMultipartUpload",
        "ec2:DescribeVolumes",
        "ec2:CreateSnapshot",
        "kms:GenerateDataKey",
        "kms:ReEncryptTo",
        "s3:DeleteObject"
      ],
      "Resource": "*"
    }
  ]
}
We are using the below YAML to define the StorageClass and PVC:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp2-encrypt
parameters:
  type: gp2
  encrypted: "true"
  fsType: ext4
  kmsKeyId: arn:aws:kms:us-east-XXXXXX
provisioner: kubernetes.io/aws-ebs
reclaimPolicy: Retain
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: nginx-logs
  namespace: nginx-example
  labels:
    app: nginx
spec:
  storageClassName: gp2-encrypt
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 50Mi
Below are the logs from the velero pod:
time="2020-05-29T19:59:04Z" level=info msg="Starting restore of backup cluster-addons/nginx-backup-5" logSource="pkg/restore/restore.go:394" restore=cluster-addons/nginx-backup-5-20200529155858
time="2020-05-29T19:59:04Z" level=info msg="Restoring cluster level resource 'persistentvolumes'" logSource="pkg/restore/restore.go:779" restore=cluster-addons/nginx-backup-5-20200529155858
time="2020-05-29T19:59:04Z" level=info msg="Getting client for /v1, Kind=PersistentVolume" logSource="pkg/restore/restore.go:821" restore=cluster-addons/nginx-backup-5-20200529155858
time="2020-05-29T20:09:04Z" level=info msg="Restoring resource 'persistentvolumeclaims' into namespace 'nginx-example'" logSource="pkg/restore/restore.go:777" restore=cluster-addons/nginx-backup-5-20200529155858
time="2020-05-29T20:09:04Z" level=info msg="Getting client for /v1, Kind=PersistentVolumeClaim" logSource="pkg/restore/restore.go:821" restore=cluster-addons/nginx-backup-5-20200529155858
time="2020-05-29T20:09:04Z" level=info msg="Executing item action for persistentvolumeclaims" logSource="pkg/restore/restore.go:1030" restore=cluster-addons/nginx-backup-5-20200529155858
time="2020-05-29T20:09:04Z" level=info msg="Executing AddPVFromPVCAction" cmd=/velero logSource="pkg/restore/add_pv_from_pvc_action.go:44" pluginName=velero restore=cluster-addons/nginx-backup-5-20200529155858
time="2020-05-29T20:09:04Z" level=info msg="Adding PV pvc-bda55207-a1e5-11ea-b7e6-02b82f6b7f4e as an additional item to restore" cmd=/velero logSource="pkg/restore/add_pv_from_pvc_action.go:66" pluginName=velero restore=cluster-addons/nginx-backup-5-20200529155858
time="2020-05-29T20:09:04Z" level=info msg="Skipping persistentvolumes/pvc-bda55207-a1e5-11ea-b7e6-02b82f6b7f4e because it's already been restored." logSource="pkg/restore/restore.go:910" restore=cluster-addons/nginx-backup-5-20200529155858
time="2020-05-29T20:09:04Z" level=info msg="Executing item action for persistentvolumeclaims" logSource="pkg/restore/restore.go:1030" restore=cluster-addons/nginx-backup-5-20200529155858
time="2020-05-29T20:09:04Z" level=info msg="Executing ChangeStorageClassAction" cmd=/velero logSource="pkg/restore/change_storageclass_action.go:63" pluginName=velero restore=cluster-addons/nginx-backup-5-20200529155858
time="2020-05-29T20:09:04Z" level=info msg="Attempting to restore PersistentVolumeClaim: nginx-logs" logSource="pkg/restore/restore.go:1136" restore=cluster-addons/nginx-backup-5-20200529155858
time="2020-05-29T20:09:04Z" level=info msg="Done executing ChangeStorageClassAction" cmd=/velero logSource="pkg/restore/change_storageclass_action.go:74" pluginName=velero restore=cluster-addons/nginx-backup-5-20200529155858
CloudTrail does not have much information. Would you please let us know if any additional settings are needed here?
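Independent of extra IAM/KMS settings, the restore's own status usually says why the claim ended up Lost; the velero CLI can surface it (a sketch; the restore name and namespace are taken from the log lines above):

# Inspect the restore for warnings/errors about the PVC and PV
velero -n cluster-addons restore describe nginx-backup-5-20200529155858 --details
velero -n cluster-addons restore logs nginx-backup-5-20200529155858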