Minor incident on Halloumi and Gouda

Pim van Pelt

Jan 28, 2026, 5:55:55 AM
to ct-p...@chromium.org
Hoi folks,

Halloumi and Gouda saw a 24-minute partial outage on Tuesday, Jan 27th: our frontends got hammered, and two of the four ran out of file descriptors.

Start: 2026-01-27 17:10 UTC 
Alerted: 2026-01-27 17:12 UTC
Resolved: 2026-01-27 17:34 UTC
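
For anyone who wants to double-check the timeline, here is a minimal sketch that recomputes the time to alert and the outage duration from the three timestamps above. The helper name `parse_utc` is just for illustration.

```python
from datetime import datetime, timezone

FMT = "%Y-%m-%d %H:%M"


def parse_utc(stamp: str) -> datetime:
    """Parse a 'YYYY-MM-DD HH:MM' stamp and mark it as UTC."""
    return datetime.strptime(stamp, FMT).replace(tzinfo=timezone.utc)


# Timestamps copied from the incident timeline.
start = parse_utc("2026-01-27 17:10")
alerted = parse_utc("2026-01-27 17:12")
resolved = parse_utc("2026-01-27 17:34")

minutes_to_alert = int((alerted - start).total_seconds() // 60)
minutes_to_resolve = int((resolved - start).total_seconds() // 60)

print(f"time to alert: {minutes_to_alert} min")
print(f"outage duration: {minutes_to_resolve} min")
```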

Impact across all static log shards:
- gouda served 6.8k 500s on the write path, and 4.8k 499s (nginx's non-standard status for a client that closed the connection before the response was sent) on the read path.
- halloumi served 7.4k 500s on the write path, and 3.4k 499s on the read path.

Root Cause: nginx could not open new sockets to the backends, and logs this to /var/log/nginx/error.log as follows:
2026/01/27 17:34:40 [alert] 476981#476981: *9002191315 socket() failed (24: Too many open files) while connecting to upstream, client: 2a03:4000:29:38:d4c8:c3ff:fe18:4132, server: gouda2026h1.log.ct.ipng.ch, request: "POST /ct/v1/add-chain HTTP/2.0", upstream: "http://[2001:678:d78:504::a]:6420/ct/v1/add-chain", host: "gouda2026h1.log.ct.ipng.ch"
2026/01/27 17:34:40 [alert] 476981#476981: *9002198155 socket() failed (24: Too many open files) while connecting to upstream, client: 2a03:4000:29:38:d4c8:c3ff:fe18:4132, server: halloumi2026h1.log.ct.ipng.ch, request: "POST /ct/v1/add-chain HTTP/2.0", upstream: "http://[2001:678:d78:510::e]:6901/ct/v1/add-chain", host: "halloumi2026h1.log.ct.ipng.ch"
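
To see which log shards were hit, lines like the two above can be tallied per `server:` field. This is a rough sketch, not the tooling we actually run: the function name `count_emfile_by_server` is made up for this example, and it assumes every relevant line carries the literal text `(24: Too many open files)` in the same layout as shown.

```python
import re
from collections import Counter

# Marker nginx prints when socket() fails with EMFILE, as seen in the log above.
EMFILE = "(24: Too many open files)"
# The first "server: <name>," field on the line names the virtual host.
SERVER_RE = re.compile(r"server: ([^,]+),")


def count_emfile_by_server(lines):
    """Count 'Too many open files' alerts per nginx server name."""
    counts = Counter()
    for line in lines:
        if EMFILE not in line:
            continue
        match = SERVER_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# The two log lines quoted in this post.
sample = [
    '2026/01/27 17:34:40 [alert] 476981#476981: *9002191315 socket() failed (24: Too many open files) while connecting to upstream, client: 2a03:4000:29:38:d4c8:c3ff:fe18:4132, server: gouda2026h1.log.ct.ipng.ch, request: "POST /ct/v1/add-chain HTTP/2.0", upstream: "http://[2001:678:d78:504::a]:6420/ct/v1/add-chain", host: "gouda2026h1.log.ct.ipng.ch"',
    '2026/01/27 17:34:40 [alert] 476981#476981: *9002198155 socket() failed (24: Too many open files) while connecting to upstream, client: 2a03:4000:29:38:d4c8:c3ff:fe18:4132, server: halloumi2026h1.log.ct.ipng.ch, request: "POST /ct/v1/add-chain HTTP/2.0", upstream: "http://[2001:678:d78:510::e]:6901/ct/v1/add-chain", host: "halloumi2026h1.log.ct.ipng.ch"',
]

counts = count_emfile_by_server(sample)
for server, hits in sorted(counts.items()):
    print(server, hits)
```

In practice one would feed it the open error log file instead of `sample`, since a file object iterates line by line.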

Remediation: Raise the FD limit from the default of 1k to 64k on each worker, and set the ulimit to infinity on nginx itself.
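
One way to confirm that a raised limit actually reached a running process is to read its `/proc/<pid>/limits` file. The sketch below assumes a Linux host with procfs; the helper names `parse_nofile` and `nofile_for_pid` are made up for this example, and the sample text is illustrative rather than copied from our frontends.

```python
import re

# Matches the "Max open files" row of /proc/<pid>/limits: soft limit, then hard limit.
NOFILE_RE = re.compile(r"^Max open files\s+(\S+)\s+(\S+)", re.MULTILINE)


def parse_nofile(limits_text):
    """Return (soft, hard) open-file limits; None stands for 'unlimited'."""
    match = NOFILE_RE.search(limits_text)
    if match is None:
        raise ValueError("no 'Max open files' row found")

    def to_limit(value):
        return None if value == "unlimited" else int(value)

    return to_limit(match.group(1)), to_limit(match.group(2))


def nofile_for_pid(pid):
    """Read the limits of a live process (Linux only)."""
    with open(f"/proc/{pid}/limits") as handle:
        return parse_nofile(handle.read())


# Illustrative input, not taken from the production frontends.
SAMPLE = """Limit                     Soft Limit           Hard Limit           Units
Max processes             63000                63000                processes
Max open files            1024                 524288               files
"""

print(parse_nofile(SAMPLE))
```

Calling `nofile_for_pid` with the PID of an nginx worker would show whether that worker still sits at the old soft limit.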

We have a pending action item to reconfigure nginx to reuse FDs towards the backends, as these connections currently do not use HTTP keepalive. I'll have to test that a little bit on Rennet and Lipase before making the changes to production logs.

May our certificates continue to log and our pagers be silent!

groet,
Pim obo IPng CT Ops.
-- 
Pim van Pelt <p...@ipng.ch>
PBVP1-RIPE https://ipng.ch/