Preferred Format: 25+5 mins
While systemd is not a kernel subsystem, it is the first userspace process (PID 1) on virtually every modern Linux distribution — the bridge between the kernel and the rest of the system. A crash in systemd is, for all practical purposes, as catastrophic as a kernel panic. This talk presents a real-world debugging journey of exactly such a crash, emphasizing the distributive nature of Linux: how independently developed components — the kernel, systemd, service daemons, and application scripts — must compose correctly across version boundaries, and how subtle breakages emerge when they don't.
When we rolled out Azure Linux 3.0 across 1,600+ cloud VMs, 29 machines froze hard — PID 1 crashed with `assert_not_reached()` in `service_sigchld_event()` at `service.c:3863`. No OOM, no kernel panic — systemd killed itself. We traced the crash from vague "Transport endpoint is not connected" errors to a race condition in systemd's daemon-reload serialization path, introduced in systemd 254: a symlink alias (`syslog.service` → `rsyslog.service`) causes the unit's deserialized state to be silently overwritten to DEAD while its `main_pid` remains set, leading to an assertion failure when SIGCHLD arrives.
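The failure mode can be illustrated with a toy model. This is a minimal sketch, not systemd code: `ToyService`, `apply_record`, `replay`, and `on_sigchld` are hypothetical names, and the record layout is an assumption made only for illustration. It assumes that an alias record carries a `dead` state but no PID field, so applying it after the real record overwrites the state and leaves `main_pid` stale.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyService:
    # Hypothetical stand-in for a service unit; not systemd's real struct.
    state: str = "dead"
    main_pid: Optional[int] = None


def apply_record(unit: ToyService, record: dict) -> None:
    # Assumption: only fields present in a record are written,
    # so a record without "main_pid" leaves the old PID in place.
    if "state" in record:
        unit.state = record["state"]
    if "main_pid" in record:
        unit.main_pid = record["main_pid"]


def replay(records: list) -> ToyService:
    # Apply records in the given order, as a reload would after serialization.
    unit = ToyService()
    for record in records:
        apply_record(unit, record)
    return unit


def on_sigchld(unit: ToyService, pid: int) -> str:
    # Toy SIGCHLD handler: a tracked main PID is only expected
    # while the unit is in a live state.
    if unit.main_pid != pid:
        return "ignored"
    if unit.state == "running":
        unit.state, unit.main_pid = "dead", None
        return "handled"
    raise AssertionError(f"unreachable: main_pid set but state is {unit.state}")


real = {"state": "running", "main_pid": 1234}  # record under the real name
alias = {"state": "dead"}                      # record under the alias name

good = replay([alias, real])  # alias applied first: state ends up running
bad = replay([real, alias])   # alias applied last: state dead, PID stale
```

In this model the outcome depends only on the order in which the two records are applied, which is why a reproducer has to pin that order down rather than rely on whatever order iteration happens to produce.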
The talk covers journal log forensics and initial triage; source-level root cause analysis of systemd's serialization and deserialization paths (`service.c`, `manager.c`); construction of a deterministic reproducer that defeats hashmap ordering non-determinism; and the mitigation we shipped, together with our upstream engagement (systemd/systemd#14141, #38817, PR #39703).
Attendees will walk away with: practical techniques for debugging PID 1 crashes using journal logs and systemd source code; an understanding of how daemon-reload serialization works internally; awareness of how unit symlink aliases can corrupt service state; and broader lessons on the risks of version upgrades in a distributive ecosystem where no single component owns the integration contract.
Presenter: I am one of the maintainers of Azure Linux, Microsoft's Linux distribution. With 15 years of experience in Linux systems engineering, I have worked across the networking subsystem, scheduler, KVM, BSP, and device drivers at organizations including Flipkart, Hewlett Packard Enterprise, and Sony. I previously presented a talk on the EEVDF Scheduler at the Linux Kernel Meetup.