Description
-----------
When several runsc sandboxes are started concurrently on the same node
(e.g. multiple Pods scheduled together by Kubernetes during a scale-up),
a subset of the sandboxes never reach a running state. Instead they
appear to be busy-spinning inside the sandbox boot path: exactly one
thread per sandbox stays in state R consuming ~50% CPU, indefinitely.
The remaining threads are parked in futex_wait_queue or
hrtimer_nanosleep. The sandbox never becomes ready, kubelet's readiness
probes fail, and the Pod stays in PodInitializing / ContainerCreating
for tens of minutes.
Other sandboxes started concurrently on the same node can boot
perfectly (~3-4% CPU steady state), so the issue is not workload-,
image-, or distro-specific.
Reproduces in different environments
------------------------------------
- EKS cluster / kops-managed cluster, AL2023 worker nodes
- Ubuntu 22.04 / AL2023 nodes
- Cilium CNI / AWS VPC CNI
Same symptom, same fingerprint in every combination; independent of
CNI and AMI family.
Reproduction (heuristic)
------------------------
1. Schedule N Pods (N >= 4) using a runsc RuntimeClass on a single
fresh node within a few seconds of each other (autoscaler bringing
up a new node from zero is the easiest trigger).
2. Observe runsc-sandbox processes on the node:
ps -eLo pid,tid,stat,pcpu,wchan,comm | grep runsc
3. A subset of sandboxes will show a single thread:
<tid> Rsl ~50.0 - exe
and never progress beyond that, while sibling sandboxes started in
the same batch finish booting normally.
The probability of getting a stuck sandbox scales with how many
sandboxes are started in the same few seconds on one node.
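The grep in step 2 can be narrowed into a small detector for this
fingerprint. A sketch; the 40% CPU threshold and the comm pattern
("exe", the name runsc re-execs under) are our heuristics, not anything
authoritative from runsc:

```shell
#!/bin/sh
# Flag threads matching the stuck-sandbox fingerprint described above:
# runnable (state R), no wait channel, pegging a core.
# Heuristics: CPU threshold 40% (tuned to catch the ~50% spinner) and
# comm matching either "runsc" or "exe".
ps -eLo pid,tid,stat,pcpu,wchan,comm | awk '
  $6 ~ /runsc|exe/ && $3 ~ /^R/ && ($4 + 0) > 40 && $5 == "-" {
    printf "possible stuck sandbox: pid=%s tid=%s cpu=%s%%\n", $1, $2, $4
  }'
```

On an affected node this prints one line per stuck sandbox; on a healthy
node it prints nothing.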
Expected behavior
-----------------
All sandboxes either (a) boot successfully or (b) fail with a clear
error returned to the runtime so the container manager can recover.
Actual behavior
---------------
- Stuck sandboxes never report failure; the runsc-sandbox process keeps
  spinning at ~50% CPU on a single thread, indefinitely.
- Cleanup paths do not terminate the spinning process. containerd logs
show:
"runsc" did not terminate successfully: sandbox is not running
Forcibly stopping sandbox "<id>"
StopPodSandbox ... returns successfully
but the runsc-sandbox PID for <id> remains alive and continues to
consume CPU.
- When the higher-level orchestrator retries (deletes the Pod and
recreates it), a new runsc-sandbox is started while the previous
stuck one is still running. Repeated retries accumulate orphaned
sandboxes; we have observed 10+ orphan runsc-sandbox processes on a
single 8-core node, collectively consuming ~60% of total node CPU
even after their parent Pods/sandbox IDs were removed from the
runtime's view.
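To count these orphans we use a sweep along the following lines. This is
a sketch under two assumptions we verified locally but which may not
hold everywhere: the sandbox ID is the final argv entry of the
"runsc ... boot" command line, and crictl is configured against the same
containerd:

```shell
#!/bin/sh
# List runsc sandbox processes whose sandbox ID is no longer known to
# the runtime. ASSUMPTIONS (verify on your nodes): the sandbox ID is
# the last argv entry of "runsc ... boot", and crictl talks to this
# node's containerd.
known=$(crictl pods -q)
for pid in $(pgrep -f 'runsc.*boot'); do
  # argv entries in /proc/<pid>/cmdline are NUL-separated
  id=$(tr '\0' '\n' < "/proc/$pid/cmdline" | tail -n 1)
  if ! printf '%s\n' "$known" | grep -q "^$id"; then
    echo "orphan runsc-sandbox: pid=$pid id=$id"
  fi
done
```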
Per-thread snapshot of one stuck sandbox (representative):
TID STATE CPU% WCHAN
X Rsl 50.2 - <-- runaway
X+1 Ssl 1.5 hrtimer_nanosleep
X+2 Ssl 0.0 futex_wait_queue
X+3 Ssl 0.0 futex_wait_queue
... (remaining threads all futex_wait_queue)
/proc/<pid>/status of the stuck sandbox shows State: S (sleeping) at
the process level, low VmRSS (~25 MB), and a healthy
voluntary_ctxt_switches count -- consistent with a single thread
spinning while the rest of the sandbox idles.
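The per-thread snapshot above can be captured directly from /proc
(useful when ps truncates wchan names). A small helper, using only
standard procfs files:

```shell
#!/bin/sh
# Print state and wait channel for every thread of a PID, straight from
# /proc. Parses the "State:" line of each task's status file, which is
# robust against comm names containing spaces (unlike field-counting
# on /proc/<pid>/task/<tid>/stat).
thread_states() {
  for t in /proc/"$1"/task/*; do
    tid=${t##*/}
    state=$(awk '/^State:/ {print $2}' "$t/status")
    wchan=$(cat "$t/wchan" 2>/dev/null || echo "-")
    echo "tid=$tid state=$state wchan=$wchan"
  done
}
```

Running `thread_states <stuck-sandbox-pid>` reproduces the table above:
the runaway thread reports state=R with no meaningful wait channel while
its siblings sit in futex_wait_queue.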
Environment
-----------
- runsc release: release-20260420.0 (also reproduced on 20260316.0)
- containerd-shim-runsc-v1: same release as runsc
- containerd: 1.7.28 / 2.0.x
- Kubernetes: 1.33
- Node sizing approx.: 8 vCPU / 16 GiB RAM
- Host kernel: 6.8.0-1021-aws (Ubuntu 22.04.5 LTS) on kops side;
AL2023 default kernel on EKS side
- CNI: aws-vpc-cni; Cilium (kops). Same symptom on both.
- Platform: runsc default (systrap); KVM is not used.
- Networking: netstack (default).
Configurations and Logs
-----------------------
Please find attached gvisor-configs-and-logs.txt
Please let us know if there is a known issue we might be running into;
we can provide more debug logs on request.
Thanks