CC @kubernetes/sig-storage-bugs
Mitigation: I can verify that rebooting the machine ("sudo shutdown -r") cleans up the transient mounts.
I ran some tests locally and systemd considers these commands as mount units, hence they show up in systemctl list-units, but once the secret has been unmounted the unit disappears from the list. @saad-ali can you verify whether the unit remains in the list even after the secret has been unmounted?
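Something along these lines should be enough to check (the "secret" filter below is just illustrative):
$ systemctl list-units --all --type=mount | grep -i secret   # list the mount units systemd is tracking
$ mount | grep -i secret                                     # compare against the actual mount table
If a unit still shows up in the first command but has no matching entry in the second, it has leaked.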
@saad-ali was this cluster started via ./cluster/local-up-cluster.sh? If yes, this environment has a default of KeepTerminatedVolumes set to true, which will cause the mount point to persist even after the pod using it has been terminated...
I have opened a PR to change the default for local clusters - #57355
Is kubelet started by systemd too? What's the resource limit in the kubelet unit file?
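For what it's worth, assuming kubelet runs as a systemd service literally named kubelet (the unit name is an assumption), something like this should show the unit file and its limits:
$ systemctl cat kubelet                                            # unit name "kubelet" is an assumption
$ systemctl show kubelet -p TasksMax -p LimitNOFILE -p LimitNPROC  # a few of the relevant resource limits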
@gnufied No, I was able to repro against a standard GKE/GCE cluster. Once the secret is unmounted the unit appears to stick around as loaded inactive dead in systemctl list-units --all. The key to reproducing was setting up a cluster with 1 node and a simple Kubernetes cron job (which causes a new container to be created every minute).
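For anyone else trying to reproduce this, a minimal CronJob along these lines should work; the manifest is illustrative (names are made up, batch/v1beta1 was the CronJob API version of the 1.8/1.9 era), not the exact one I used. Each pod it spawns gets the default service account token secret mounted:
cat <<'EOF' | kubectl apply -f -
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: mount-churn              # illustrative name
spec:
  schedule: "*/1 * * * *"        # new pod every minute
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: noop
            image: busybox
            command: ["true"]    # the default token secret is mounted into the pod automatically
EOF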
@saad-ali do you still have that cluster around? Can you check if those secrets are still mounted, or if they just show up in systemctl list-units but the mount points are gone?
They don't appear to be mounted; they just show up in systemctl list-units.
Example entry in systemctl list-units --all:
var-lib-kubelet-pods-4cb56507\x2de42f\x2d11e7\x2da1b6\x2d42010a80022d-volumes-kubernetes.io\x7esecret-default\x2dtoken\x2d5zqkr.mount loaded inactive dead /var/lib/kubelet/pods/4cb56507-e42f-11e7-a1b6-42010a80022d/volumes/kubernetes.io~secret/default-token-5zqkr
And no associated entry in the mount table:
$ mount | grep -i "4cb56507-e42f-11e7-a1b6-42010a80022d"
The number of entries in the mount table remains static:
$ mount | grep -i kube | wc -l
35
Also, I'm seeing the systemd transient units growing uncontrollably in Kubernetes 1.6 as well. Over the course of an hour:
$ systemctl list-units --all | wc -l
369
$ systemctl list-units --all | wc -l
549
So my hypothesis is that this has been happening for a while, but the reason it's becoming an issue now is that in k8s 1.8+ (with PR #49640), all k8s mount operations are executed as scoped transient units, and once the max number of units is hit, all subsequent Kubernetes-triggered mounts fail.
It might be a platform issue - @mtanino tested it with 16K transient mounts on Fedora 25 without hitting this issue.
mkdir -p /mnt/test; for i in $(seq 1 16384); do echo $i; mkdir /mnt/test/$i; systemd-run --scope -- mount -t tmpfs tmpfs /mnt/test/$i; done
It would be better to test on different platforms with the full Kubernetes stack, i.e. the cron job example. It could be something in Kubernetes or Docker that is causing the leak.
I think this is indeed a GKE/GCE problem. The /home/kubernetes/containerized_mounter/ directory appears to recursively bind mount rootfs, including /var/lib/kubelet. All mounts inside /var/lib/kubelet also propagate into the /home/kubernetes/containerized_mounter/xxx directory (because containerized_mounter uses shared mounts, I guess).
I can confirm that recursively bind mounting a directory with the shared option in another place and then mounting tmpfs inside the directory causes the tmpfs to propagate to the bind mount directory as well (and you end up with 2 systemd units per mount). But on umount, both systemd units are removed from the unit listing. So the bug isn't entirely because of rootfs being mounted in multiple places, but it does exacerbate the problem somewhat, because for each mount you have 2 systemd units being created.
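A rough repro of that setup, with illustrative paths and assuming / is mounted shared (the systemd default):
sudo mkdir -p /mnt/orig /mnt/bindcopy           # made-up paths for illustration
sudo mount --bind /mnt/orig /mnt/bindcopy       # bindcopy joins the same shared peer group as /
sudo mkdir -p /mnt/orig/inner
sudo mount -t tmpfs tmpfs /mnt/orig/inner       # propagates to /mnt/bindcopy/inner as well
mount | grep inner                              # two tmpfs entries
systemctl list-units --all | grep inner         # two mount units
sudo umount /mnt/orig/inner                     # after this, both units drop out of the listing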
The GCE image also appears to be using overlay2 and has a weird bind mount of /var/lib/docker on itself. Things to investigate next:
In isolation, check whether this is somehow related to overlay2. The mount error that @saad-ali posted above is because of the way layers are mounted in overlay2; overlay2 uses symlinks to reduce the number of arguments supplied to the mount command.
@msau42 I tested the full stack locally on Ubuntu and then the full stack on EBS, and I can't reproduce the problem.
The limit, based on what @wonderfly dug up, is 131072 transient units (https://github.com/systemd/systemd/blob/v232/src/core/unit.c#L229). So you won't hit the issue with 16k units.
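For reference, a rough way to keep an eye on how close a node gets to that ceiling (assuming watch is available on the image):
$ watch -n 60 'systemctl list-units --all --type=mount | wc -l'   # mount-unit count should stay roughly flat on a healthy node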
That said, it does look like the containerized_mounter is causing the leaks:
$ sudo mkdir -p /home/kubernetes/containerized_mounter/rootfs/var/lib/kubelet/testmnt/alpha; sudo systemd-run --scope -- /home/kubernetes/containerized_mounter/mounter mount -t tmpfs tmpfs /var/lib/kubelet/testmnt/alpha
Running scope as unit: run-r5bde6edc9a5d4529bae2a560d81c8025.scope
$ systemctl list-units --all | grep -i "testmnt"
home-kubernetes-containerized_mounter-rootfs-var-lib-kubelet-testmnt-alpha.mount loaded inactive dead /home/kubernetes/containerized_mounter/rootfs/var/lib/kubelet/testmnt/alpha
var-lib-kubelet-testmnt-alpha.mount loaded inactive dead /var/lib/kubelet/testmnt/alpha
$ mount | grep -i "testmnt"
tmpfs on /home/kubernetes/containerized_mounter/rootfs/var/lib/kubelet/testmnt/alpha type tmpfs (rw,relatime)
tmpfs on /var/lib/kubelet/testmnt/alpha type tmpfs (rw,relatime)
tmpfs on /var/lib/kubelet/testmnt/alpha type tmpfs (rw,relatime)
tmpfs on /home/kubernetes/containerized_mounter/rootfs/var/lib/kubelet/testmnt/alpha type tmpfs (rw,relatime)
$ sudo umount /var/lib/kubelet/testmnt/alpha/
$ mount | grep -i "testmnt"
$ systemctl list-units --all | grep -i "testmnt"
home-kubernetes-containerized_mounter-rootfs-var-lib-kubelet-testmnt-alpha.mount loaded inactive dead /home/kubernetes/containerized_mounter/rootfs/var/lib/kubelet/testmnt/alpha
var-lib-kubelet-testmnt-alpha.mount loaded inactive dead /var/lib/kubelet/testmnt/alpha
Mounts created directly with the host mount utility do not appear to have the same issue.
That said, it does look like the containerized_mounter is causing the leaks:
@gnufied pointed out offline that this is a bit misleading:
[12:45] saad, I think the problem isn't that containerized_mounter is causing the leak. the problem is anything that gets mounted in /home/kubernetes/containerized_mounter/ is creating an additional inactive/dead systemd unit
[12:45] even if you use regular host mount
[12:46] or specifically - if you mount anything in /var/lib/kubelet it propagates and creates another unit for /home/kubernetes/containerized_mounter
[12:46] just using regular mount command, I am not even using systemd-run
So to be clear, the problem is with the way the containerized_mounter is set up (specifically with mount propagation), not with containerized_mounter triggering the mounting.
To expand on this, the mount does not need to be created by containerized_mounter for the inactive dead systemd transient mount unit to be created. Any mount created in the /var/lib/kubelet/ dir will do this:
$ sudo mount -t tmpfs tmpfs /var/lib/kubelet/testmntsaad1/
$ mount | grep -i "testmntsaad"
tmpfs on /var/lib/kubelet/testmntsaad1 type tmpfs (rw,relatime)
tmpfs on /var/lib/kubelet/testmntsaad1 type tmpfs (rw,relatime)
tmpfs on /home/kubernetes/containerized_mounter/rootfs/var/lib/kubelet/testmntsaad1 type tmpfs (rw,relatime)
tmpfs on /home/kubernetes/containerized_mounter/rootfs/var/lib/kubelet/testmntsaad1 type tmpfs (rw,relatime)
saadali@gke-cluster-1-default-pool-63faa0d6-k4f6 ~ $ systemctl list-units --all | grep -i "testmntsaad"
home-kubernetes-containerized_mounter-rootfs-var-lib-kubelet-testmntsaad1.mount loaded inactive dead /home/kubernetes/containerized_mounter/rootfs/var/lib/kubelet/testmntsaad1
var-lib-kubelet-testmntsaad1.mount loaded inactive dead /var/lib/kubelet/testmntsaad1
$ sudo umount /var/lib/kubelet/testmntsaad1/
$ mount | grep -i "testmntsaad"
$ systemctl list-units --all | grep -i "testmntsaad"
home-kubernetes-containerized_mounter-rootfs-var-lib-kubelet-testmntsaad1.mount loaded inactive dead /home/kubernetes/containerized_mounter/rootfs/var/lib/kubelet/testmntsaad1
var-lib-kubelet-testmntsaad1.mount loaded inactive dead /var/lib/kubelet/testmntsaad1
Any mounts created in /var/lib/docker will also do this:
$ sudo mkdir /var/lib/docker/saaddockertest1
$ mount | grep -i "saaddockertest1"
$ systemctl list-units --all | grep -i "saaddockertest1"
$ sudo mount -t tmpfs tmpfs /var/lib/docker/saaddockertest1
$ mount | grep -i "saaddockertest1"
tmpfs on /var/lib/docker/saaddockertest1 type tmpfs (rw,relatime)
tmpfs on /var/lib/docker/saaddockertest1 type tmpfs (rw,relatime)
$ systemctl list-units --all | grep -i "saaddockertest1"
var-lib-docker-saaddockertest1.mount loaded inactive dead /var/lib/docker/saaddockertest1
$ sudo umount /var/lib/docker/saaddockertest1
$ mount | grep -i "saaddockertest1"
$ systemctl list-units --all | grep -i "saaddockertest1"
var-lib-docker-saaddockertest1.mount loaded inactive dead /var/lib/docker/saaddockertest1
Mounts created outside those directories do not appear to have this issue:
$ sudo mkdir /tmp/mnttestsaad1tmp
$ sudo mount -t tmpfs tmpfs /tmp/mnttestsaad1tmp/
$ mount | grep -i "mnttestsaad1tmp"
tmpfs on /tmp/mnttestsaad1tmp type tmpfs (rw,relatime)
$ systemctl list-units --all | grep -i "mnttestsaad1tmp"
tmp-mnttestsaad1tmp.mount loaded active mounted /tmp/mnttestsaad1tmp
$ sudo umount /tmp/mnttestsaad1tmp/
$ mount | grep -i "mnttestsaad1tmp"
$ systemctl list-units --all | grep -i "mnttestsaad1tmp"
These directories are both set up with shared mount propagation:
/var/lib/docker                                                 /dev/sda1[/var/lib/docker]   shared
/var/lib/kubelet                                                /dev/sda1[/var/lib/kubelet]  shared
/home/kubernetes/containerized_mounter/rootfs/var/lib/kubelet   /dev/sda1[/var/lib/kubelet]  shared
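(Output like the above can be gathered with something along these lines:)
$ findmnt -o TARGET,SOURCE,PROPAGATION | grep -E '/var/lib/(docker|kubelet)'   # show propagation flags for the suspect mounts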
Will follow up with COS team.
BTW an easier mitigation might be systemctl daemon-reload, which will remove those dead/inactive units.
Ya, that's the mitigation we are using at the moment.
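For anyone needing a stop-gap until a proper fix lands, something like the following (purely illustrative; it assumes cron is available on the node, and a systemd timer would do the same job) reloads systemd's unit state hourly so the leaked inactive/dead units get dropped:
echo '0 * * * * root /bin/systemctl daemon-reload' | sudo tee /etc/cron.d/systemd-mount-leak-workaround   # file name is illustrative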
Is your expected command line like this?
for i in $(seq 1 16384); do echo $i; mkdir -p /mnt/test/$i; mkdir -p /mnt/test-bind/$i; systemd-run --scope -- mount --make-shared -t tmpfs tmpfs /mnt/test/$i; systemd-run --scope -- mount --bind /mnt/test/$i /mnt/test-bind/$i; done
for i in $(seq 1 16384); do sudo umount /mnt/test/$i; sudo umount /mnt/test-bind/$i; done
@mtanino actually, just the /mnt/test directory should be bind mounted shared. The underlying mounts are normal mounts.
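Roughly like this (paths illustrative; assumes / is mounted shared, the systemd default, so mounts under /mnt/test propagate into the bind copy):
sudo mkdir -p /mnt/test /mnt/test-bind
sudo mount --bind /mnt/test /mnt/test-bind
sudo mount --make-shared /mnt/test-bind
for i in $(seq 1 100); do sudo mkdir -p /mnt/test/$i; sudo mount -t tmpfs tmpfs /mnt/test/$i; done   # plain mounts, no systemd-run
systemctl list-units --all | grep -c 'mnt-test'   # count of mount units systemd is tracking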
Hi, good catch, I'm seeing this in my cluster too:
kube-3 ~ # systemctl list-units --all | wc -l
131079
The problem is that this is starving systemd resources, causing errors like
kube-3 ~ # systemctl daemon-reload
Failed to reload daemon: No buffer space available
It seems that waiting a bit allows the daemon-reload to work though:
kube-3 ~ # systemctl list-units --all | wc -l
1095
Another consequence is that socket-activated services, like sshd in CoreOS Container Linux, can't start anymore.
Sorry for my late response.
I tried a shared bind mount using the following command, as @msau42 mentioned, but I didn't see any "loaded inactive dead" entries in my systemctl list-units.
for i in $(seq 1 32767); do echo $i; mkdir -p /var/lib/kubelet/test/$i; mkdir -p /var/lib/kubelet/test-sharedbind/$i; mount -t tmpfs tmpfs /var/lib/kubelet/test/$i; systemd-run --scope -- mount --make-shared --bind /var/lib/kubelet/test/$i /var/lib/kubelet/test-sharedbind/$i; done
Here are my test results.
I tried the same steps as @saad-ali but I didn't see loaded inactive dead.
root# uname -a
Linux bl-k8sbuild 4.13.12-100.fc25.x86_64 #1 SMP Wed Nov 8 18:13:25 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
root# mkdir /var/lib/kubelet/testmntsaad1/
root# sudo mount -t tmpfs tmpfs /var/lib/kubelet/testmntsaad1/
root# mount | grep -i "testmntsaad"
tmpfs on /var/lib/kubelet/testmntsaad1 type tmpfs (rw,relatime)
root# systemctl list-units --all | grep -i "testmntsaad"
var-lib-kubelet-testmntsaad1.mount loaded active mounted /var/lib/kubelet/testmntsaad1
root# sudo umount /var/lib/kubelet/testmntsaad1/
root# mount | grep -i "testmntsaad"
root# systemctl list-units --all | grep -i "testmntsaad"
root#
Also, I tried the shared bind mount with 32767 mount points but didn't see the problem either.
root# uname -a
Linux bl-k8sbuild 4.13.12-100.fc25.x86_64 #1 SMP Wed Nov 8 18:13:25 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
root# cat transient_mount_v2.sh
for i in $(seq 1 32767); do echo $i; mkdir -p /var/lib/kubelet/test/$i; mkdir -p /var/lib/kubelet/test-sharedbind/$i; mount -t tmpfs tmpfs /var/lib/kubelet/test/$i; systemd-run --scope -- mount --make-shared --bind /var/lib/kubelet/test/$i /var/lib/kubelet/test-sharedbind/$i; done
root# ./transient_mount_v2.sh
1
Running scope as unit: run-r8dd78c69477f4a5d99fa327575769464.scope
2
Running scope as unit: run-r3f44a91e916f4b869484232af8c017aa.scope
3
....
root# systemctl list-units --all | grep -i var-lib-kubelet-test | less
root# systemctl list-units --all | grep -i var-lib-kubelet-test | wc
65534 327670 11479005
root# systemctl list-units --all | grep -i var-lib-kubelet-test | grep inactive
root# systemctl list-units --all | grep -i var-lib-kubelet-test | head -n 10
var-lib-kubelet-test-1.mount loaded active mounted /var/lib/kubelet/test/1
var-lib-kubelet-test-10.mount loaded active mounted /var/lib/kubelet/test/10
var-lib-kubelet-test-100.mount loaded active mounted /var/lib/kubelet/test/100
var-lib-kubelet-test-1000.mount loaded active mounted /var/lib/kubelet/test/1000
var-lib-kubelet-test-10000.mount loaded active mounted /var/lib/kubelet/test/10000
var-lib-kubelet-test-10001.mount loaded active mounted /var/lib/kubelet/test/10001
var-lib-kubelet-test-10002.mount loaded active mounted /var/lib/kubelet/test/10002
var-lib-kubelet-test-10003.mount loaded active mounted /var/lib/kubelet/test/10003
var-lib-kubelet-test-10004.mount loaded active mounted /var/lib/kubelet/test/10004
var-lib-kubelet-test-10005.mount loaded active mounted /var/lib/kubelet/test/10005
root#
Just to confirm this is still happening on GKE 1.8.5 nodes.
At this point, we believe this to be a systemd issue tracked by systemd/systemd#7798:
In systemd if a directory is bind mounted to itself (it doesn't matter how this is done) and then something is mounted at some subdirectory of that directory (by something other than systemd-mount) systemd will create a bad mount unit for that mount point (the mount point at the subdirectory). This mount unit appears to only be cleaned up when either the parent bind mount is unmounted or when systemctl daemon-reload is run.
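A minimal sketch of that sequence (paths illustrative; whether it actually reproduces depends on the systemd version on the node, since the fix landed later):
sudo mkdir -p /srv/selfbind/sub                # made-up path
sudo mount --bind /srv/selfbind /srv/selfbind  # directory bind mounted onto itself
sudo mount -t tmpfs tmpfs /srv/selfbind/sub    # mount in a subdirectory, not via systemd-mount
sudo umount /srv/selfbind/sub
systemctl list-units --all | grep selfbind     # the stale "loaded inactive dead" unit lingers here
sudo systemctl daemon-reload                   # clears it; unmounting the parent bind mount does too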
The suggested mitigation is to run systemctl daemon-reload periodically to clear the bad transient mounts.
Once it is fixed in systemd, the changes must be picked up in the OS you are using.
Closed #57345.
For the record, the systemd issue has been addressed with systemd/systemd#7811 and included in the tag v237.
Does anyone know when systemd would be updated on GKE? We have that issue every second month :(
If I'm not mistaken, the update has been done in GKE, at least since version 1.8.8-gke.0. @artemyarulin
Thank you @honnix, but we have v1.9.2-gke.1 and just got an issue today :(
Interesting, then I'm not sure. Regression? Or maybe ported to later 1.9.x than 1.9.2?
Yeah, I'll try to migrate to the latest one today and see how it behaves, thank you!
@honnix this is fixed in gke 1.9.3+
oops and @artemyarulin ^
@msau42 This is awesome, thank you! Doing migration now :)
I had the issue on Azure with Kubernetes 1.9.9. I restarted the node and the issue disappeared. It might appear again after a few days.
@martinnaughton You need to upgrade your systemd version to v237+ or patch your current systemd with the change from systemd PR 7811.
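A quick way to check which systemd version a node is running:
$ systemctl --version | head -n1   # the fix is in v237 and later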
I have tested the command "systemctl daemon-reload"; it doesn't work.
My issue is that frequent creation and termination of pods via Jobs causes the time to create and stop pods to gradually increase.