Hoi folks,
Halloumi and Gouda saw a 24 minute partial outage on Tuesday, Jan 27th,
as our frontends got hammered and two of the four frontends ran out of
file descriptors.
Start: 2026-01-27 17:10 UTC
Alerted: 2026-01-27 17:12 UTC
Resolved: 2026-01-27 17:34 UTC
Impact across all static log shards:
- gouda served 6.8k 500s on the write path, and 4.8k 499s (nginx's
non-standard status for a client that closed the connection before
the response was sent) on the read path
- halloumi served 7.4k 500s on the write path, and 3.4k 499s on the
read path
Root Cause: the nginx workers hit their open file limit and could no
longer open sockets to the upstreams. nginx logs this to
/var/log/nginx/error.log as follows:
2026/01/27 17:34:40 [alert] 476981#476981: *9002191315 socket() failed (24: Too many open files) while connecting to upstream, client: 2a03:4000:29:38:d4c8:c3ff:fe18:4132, server: gouda2026h1.log.ct.ipng.ch, request: "POST /ct/v1/add-chain HTTP/2.0", upstream: "http://[2001:678:d78:504::a]:6420/ct/v1/add-chain", host: "gouda2026h1.log.ct.ipng.ch"
2026/01/27 17:34:40 [alert] 476981#476981: *9002198155 socket() failed (24: Too many open files) while connecting to upstream, client: 2a03:4000:29:38:d4c8:c3ff:fe18:4132, server: halloumi2026h1.log.ct.ipng.ch, request: "POST /ct/v1/add-chain HTTP/2.0", upstream: "http://[2001:678:d78:510::e]:6901/ct/v1/add-chain", host: "halloumi2026h1.log.ct.ipng.ch"
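For anyone who wants to size a similar event on their own frontends,
here is a minimal sketch of tallying these alerts per log. It assumes
the error log has one alert per line, as nginx writes it, and the
function name count_fd_alerts is made up for this illustration.

```python
import re
from collections import Counter

# Matches the FD exhaustion alert and captures the value of the
# "server:" field that follows it on the same line.
FD_ALERT = re.compile(r"\(24: Too many open files\).*?server: ([^,\s]+)")


def count_fd_alerts(lines):
    """Count 'Too many open files' alerts per nginx server name."""
    counts = Counter()
    for line in lines:
        match = FD_ALERT.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    # Path taken from the text above; adjust for your own setup.
    with open("/var/log/nginx/error.log", errors="replace") as fh:
        for server, hits in count_fd_alerts(fh).most_common():
            print(f"{hits:8d} {server}")
```

Lines that do not carry the errno 24 message are ignored, so the
script can be pointed at the whole error log.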
Remediation: Raise the FD limit from the default of 1k to 64k on each
nginx worker, and set the ulimit to infinity on nginx itself.
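To double-check that a raised limit actually reached a running
process, one option is to read its limits from /proc. The sketch below
assumes a Linux host; the helper names are hypothetical and the PID is
whatever your nginx master or worker happens to have.

```python
import sys


def parse_nofile_limit(limits_text):
    """Return (soft, hard) for 'Max open files' from /proc/<pid>/limits.

    A value of None stands for 'unlimited'.
    """
    prefix = "Max open files"
    for line in limits_text.splitlines():
        if line.startswith(prefix):
            soft, hard = line[len(prefix):].split()[:2]
            return tuple(
                None if value == "unlimited" else int(value)
                for value in (soft, hard)
            )
    raise ValueError("no 'Max open files' line found")


def nofile_limit_of(pid):
    """Read the open file limits of a running process (Linux only)."""
    with open(f"/proc/{pid}/limits") as fh:
        return parse_nofile_limit(fh.read())


if __name__ == "__main__":
    # Usage: pass one or more PIDs on the command line.
    for pid in sys.argv[1:]:
        soft, hard = nofile_limit_of(pid)
        print(f"pid {pid}: soft={soft} hard={hard}")
```

Running it against each worker PID before and after the change shows
whether the workers really picked up the new value.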
We have a pending action item to reconfigure nginx to reuse
connections, and thus FDs, towards the backends. Currently these do
not use HTTP keepalive, so every proxied request opens a fresh
upstream socket. I'll have to test that a little bit on Rennet and
Lipase before making the changes to the production logs.
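To illustrate why keepalive matters for the FD budget, here is a small
self-contained sketch that counts how many TCP connections a server
sees for the same number of requests, with and without connection
reuse. It uses Python's standard library HTTP server and client on
localhost as stand-ins. It is not our nginx setup, and the names
CountingHandler and count_connections are made up for the example.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class CountingHandler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # permits persistent connections
    connections = 0                # TCP connections accepted so far
    lock = threading.Lock()

    def setup(self):
        # setup() runs once per accepted connection, not per request.
        with CountingHandler.lock:
            CountingHandler.connections += 1
        super().setup()

    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def count_connections(requests, keepalive):
    """Return how many TCP connections the server saw for the GETs."""
    CountingHandler.connections = 0
    server = ThreadingHTTPServer(("127.0.0.1", 0), CountingHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    port = server.server_address[1]
    try:
        conn = None
        for _ in range(requests):
            if conn is None:
                conn = http.client.HTTPConnection("127.0.0.1", port)
            headers = {} if keepalive else {"Connection": "close"}
            conn.request("GET", "/", headers=headers)
            conn.getresponse().read()
            if not keepalive:
                # Without keepalive each request needs a new socket.
                conn.close()
                conn = None
        if conn is not None:
            conn.close()
    finally:
        server.shutdown()
        server.server_close()
    return CountingHandler.connections


if __name__ == "__main__":
    print("with keepalive:   ", count_connections(5, keepalive=True))
    print("without keepalive:", count_connections(5, keepalive=False))
```

With reuse, the five requests share a single connection; without it,
each request costs its own socket, which is the pattern that eats FDs
on a busy frontend.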
May our certificates continue to log and our pagers be silent!
groet,
Pim obo IPng CT Ops.
--
Pim van Pelt <p...@ipng.ch>
PBVP1-RIPE https://ipng.ch/