runsc sandbox boot hangs (busy-spin) when many sandboxes start concurrently on the same node; orphans persist after cleanup

Sanket Nadkarni

Apr 28, 2026, 3:23:36 AM
to gVisor Users [Public]
Description
-----------
When several runsc sandboxes are started concurrently on the same node
(e.g. multiple Pods scheduled together by Kubernetes during a scale-up),
a subset of the sandboxes never reaches a running state. Instead, they
appear to be busy-spinning inside the sandbox boot path: exactly one
thread per sandbox stays in state R, consuming ~50% CPU indefinitely.
The remaining threads are parked in futex_wait_queue or
hrtimer_nanosleep. The sandbox never becomes ready, kubelet's readiness
probes fail, and the Pod stays in PodInitializing / ContainerCreating
for tens of minutes.

Other sandboxes started concurrently on the same node can boot
perfectly (~3-4% CPU steady state), so the issue is not workload-,
image-, or distro-specific.

Reproduces in different environments
------------------------------------
- EKS cluster / kops-managed cluster, AL2023 worker nodes
- Ubuntu 22.04 / AL2023 nodes
- Cilium CNI / AWS VPC CNI

Same symptom and the same fingerprint in both environments, independent
of CNI and AMI family.

Reproduction (heuristic)
------------------------
1. Schedule N Pods (N >= 4) using a runsc RuntimeClass on a single
   fresh node within a few seconds of each other (autoscaler bringing
   up a new node from zero is the easiest trigger).
2. Observe runsc-sandbox processes on the node:
     ps -eLo pid,tid,stat,pcpu,wchan,comm | grep runsc
3. A subset of sandboxes will show a single thread:
     <tid>  Rsl  ~50.0  -    exe
   and never progress beyond that, while sibling sandboxes started in
   the same batch finish booting normally.

The probability of getting a stuck sandbox scales with how many
sandboxes are started in the same few seconds on one node.
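
A quick way to flag the runaway threads is something along these lines
(a rough heuristic, not anything runsc provides; it simply lists
runsc/exe threads that are runnable at roughly half a core, matching
the fingerprint above):

     # threads in state R burning more than ~40% of a core
     ps -eLo pid,tid,stat,pcpu,wchan:20,comm \
       | awk '/runsc|exe/ && $3 ~ /^R/ && $4+0 > 40'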

Expected behavior
-----------------
All sandboxes either (a) boot successfully or (b) fail with a clear
error returned to the runtime so the container manager can recover.

Actual behavior
---------------
- Stuck sandboxes never report failure; runsc-sandbox process keeps
  spinning at ~50% CPU on a single thread, indefinitely.
- Cleanup paths do not terminate the spinning process. containerd logs
  show:
    "runsc" did not terminate successfully: sandbox is not running
    Forcibly stopping sandbox "<id>"
    StopPodSandbox ... returns successfully
  but the runsc-sandbox PID for <id> remains alive and continues to
  consume CPU.
- When the higher-level orchestrator retries (deletes the Pod and
  recreates it), a new runsc-sandbox is started while the previous
  stuck one is still running. Repeated retries accumulate orphaned
  sandboxes; we have observed 10+ orphan runsc-sandbox processes on a
  single 8-core node, collectively consuming ~60% of total node CPU
  even after their parent Pods/sandbox IDs were removed from the
  runtime's view.
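
One crude way to spot these orphans is a check along the following
lines (a sketch, not something runsc or containerd provides; it assumes
the spinning boot process is reparented to PID 1 once its shim exits,
and that the runsc boot invocation is still visible in the process
arguments):

     # runsc boot processes whose shim parent is gone (reparented to PID 1)
     ps -eo pid,ppid,pcpu,etimes,args | awk '$2 == 1 && /runsc/ && /boot/'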

Per-thread snapshot of one stuck sandbox (representative):
  TID    STATE  CPU%  WCHAN
  X      Rsl    50.2  -                       <-- runaway
  X+1    Ssl    1.5   hrtimer_nanosleep
  X+2    Ssl    0.0   futex_wait_queue
  X+3    Ssl    0.0   futex_wait_queue
  ... (remaining threads all futex_wait_queue)

/proc/<pid>/status of the stuck sandbox shows State: S (sleeping) at
the process level, low VmRSS (~25 MB), and a healthy
voluntary_ctxt_switches count -- consistent with a single thread
spinning while the rest of the sandbox idles.
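
For reference, a snapshot like the one above can be taken with roughly
the following commands (replace <pid> with the stuck sandbox PID):

     # per-thread state, CPU and wait channel for one sandbox
     ps -L -o tid,stat,pcpu,wchan:24 -p <pid>
     # process-level state, RSS and context-switch counters
     grep -E 'State|VmRSS|ctxt_switches' /proc/<pid>/status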

Environment
-----------
- runsc release: release-20260420.0 (also reproduced on 20260316.0)
- containerd-shim-runsc-v1: same release as runsc
- containerd: 1.7.28 / 2.0.x
- Kubernetes: 1.33
- Node sizing (approx.): 8 vCPU / 16 GiB RAM
- Host kernel: 6.8.0-1021-aws (Ubuntu 22.04.5 LTS) on kops side;
  AL2023 default kernel on EKS side
- CNI: aws-vpc-cni; Cilium (kops). Same symptom on both.
- Platform: runsc default (systrap); KVM is not used.
- Networking: netstack (default).

Configurations and logs
-----------------------
Please find attached gvisor-configs-and-logs.txt.

Please let us know if there is a known issue we are running into; we
can provide more debug logs on request.

Thanks