halloumi2026h1 issue


Pim van Pelt

Mar 22, 2026, 8:31:14 AM
to Ct Policy, Ct Ops
Hoi folks,

On 2026-03-22 at approximately midnight, Halloumi 2026h1 started issuing quite a few 500s (orange in the graph below).


I noticed that its antispam database (backed by BadgerDB) had stopped updating. Its 'mem' and 'vlog' files were still being written to, but very slowly, even though there is no disk shortage, while the tesseract-posix process was emitting:

Mar 22 09:53:58 ctlog1 tesseract-posix[793346]: W0322 09:53:58.927757 793346 handlers.go:240] halloumi2026h1.log.ct.ipng.ch: AddChain handler error: couldn't store the leaf: error waiting for Tessera index future and its integration: context canceled 
Mar 22 09:53:58 ctlog1 tesseract-posix[793346]: W0322 09:53:58.929338 793346 handlers.go:240] halloumi2026h1.log.ct.ipng.ch: AddChain handler error: couldn't store the leaf: error waiting for Tessera index future and its integration: context canceled 
Mar 22 09:53:58 ctlog1 tesseract-posix[793346]: W0322 09:53:58.936277 793346 handlers.go:240] halloumi2026h1.log.ct.ipng.ch: AddChain handler error: couldn't store the leaf: error waiting for Tessera index future and its integration: context canceled 
Mar 22 09:53:58 ctlog1 tesseract-posix[793346]: W0322 09:53:58.938424 793346 handlers.go:240] halloumi2026h1.log.ct.ipng.ch: AddChain handler error: couldn't store the leaf: error waiting for Tessera index future and its integration: context canceled
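
To make that error a bit more concrete, here's a minimal sketch in Go of the pattern behind it, with made-up names (this is not the actual Tessera API): the handler enqueues the leaf and then waits on a future that only resolves once the entry has been sequenced and integrated into the tree; if the request context is cancelled before that happens, the wait returns the 'context canceled' seen in the journal.

package main

import (
    "context"
    "fmt"
    "time"
)

// indexFuture resolves to the assigned sequence number once the background
// integrator has folded the entry into the tree (hypothetical type).
type indexFuture func(ctx context.Context) (uint64, error)

// enqueueLeaf stands in for handing a leaf to the sequencer; the done
// channel fires when integration completes.
func enqueueLeaf(done <-chan uint64) indexFuture {
    return func(ctx context.Context) (uint64, error) {
        select {
        case idx := <-done:
            return idx, nil
        case <-ctx.Done():
            // The failure mode in the journal: integration never signalled
            // completion before the request context went away.
            return 0, fmt.Errorf("error waiting for index future and its integration: %w", ctx.Err())
        }
    }
}

func main() {
    stalled := make(chan uint64) // simulate a stalled integrator: never fires
    future := enqueueLeaf(stalled)

    ctx, cancel := context.WithTimeout(context.Background(), 500*time.Millisecond)
    defer cancel()

    if _, err := future(ctx); err != nil {
        fmt.Println("AddChain handler error:", err)
    }
}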

At 2026-03-22 09:55 UTC I restarted TesseraCT for this logshard, but BadgerDB did not recover. I decided to move the antispam database out of the way and restart TesseraCT one more time. It's now rebuilding its antispam database from the tree - it's about 10% done after ~50min (the green line was at 1.41B and is now back up to 139M). One working theory is that we are memory-bound on the VM; we will look into this shortly.
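
Back-of-the-envelope: 139M entries in ~50 minutes is roughly 46K entries/second, so a naive linear extrapolation would put the full 1.41B-entry rebuild at around 8-9 hours - assuming the rate holds, which it may not.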




By the way, our cross-poster friends are not very well behaved - after restarting TesseraCT it started serving 429s, at which point they decided to issue ~2.5Kqps of writes, so I've rate-limited them - at least temporarily - on Halloumi as well (the blue line with the marker is the cross-poster; rate limit applied at 11:54 UTC):
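
For illustration only - the actual throttle lives in our front-end configuration and the numbers below are made up - this is the shape of the write-path rate limit, sketched with Go's golang.org/x/time/rate token bucket:

package main

import (
    "fmt"
    "log"
    "net/http"

    "golang.org/x/time/rate"
)

func main() {
    // Hypothetical numbers: allow ~100 write calls/s with a burst of 200.
    limiter := rate.NewLimiter(rate.Limit(100), 200)

    mux := http.NewServeMux()
    mux.HandleFunc("/ct/v1/add-chain", func(w http.ResponseWriter, r *http.Request) {
        if !limiter.Allow() {
            // Over the limit: serve a 429 instead of accepting the write.
            http.Error(w, "rate limited", http.StatusTooManyRequests)
            return
        }
        fmt.Fprintln(w, "accepted") // stand-in for the real add-chain handling
    })
    log.Fatal(http.ListenAndServe(":8080", mux))
}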



Overnight, Halloumi served the following response codes (website~=halloumi.*log.ct.ipng.ch AND uri~=/ct):
#    STATUS        COUNT
1    200      21 255 442
2    500      21 210 825
3    429      17 852 987
4    499       2 520 935
5    403       1 907 523
6    400         588 011
7    502           8 939
8    410           6 128
9    408           2 020
10   404             493
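
For scale: that's about 65M requests to the /ct endpoints overnight, of which roughly a third were 500s and a bit over a quarter were 429s.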


A typical 24h period rarely has 500s for TesseraCT logs, so this is a notable event. We will continue to monitor Halloumi as it recovers, and may reach out to the Google TrustFabric folks (thanks already to Al for taking a look with me) if, once the antispam database is rebuilt to 1.41B entries, it acts up again. I saved the old BadgerDB directory in case anyone would like to do forensics.

groet,
Pim
-- 
Pim van Pelt <p...@ipng.ch>
PBVP1-RIPE https://ipng.ch/

Pim van Pelt

Mar 23, 2026, 12:41:52 AM
to ct-p...@chromium.org
Hoi folks,

An update from IPng on the Halloumi logshards:

- At 2026-03-22 09:55 UTC I removed the antispam database, suspecting
corruption, as writes to SST tables were no longer occurring.
- We suspected RAM pressure, as the halloumi2026h1 shard had grown
substantially and the VM was provisioned with 16GB of memory.
- At 12:48 UTC Jeroen rebooted the VM, but unfortunately it did not
come up because of a libvirt/qemu + hypervisor issue. He rebooted the
hypervisor, the VM came back up with 64GB of memory, and at 13:25 UTC
it was serving again.
- Joe pointed out that at 13:23:28, some of the halloumi and lipase
shards served a stale checkpoint. This can be explained by the read-path
being brought up (using nginx) before the write-path, which requires a
crypto key to unlock the filesystem with the private keys.
- As an aside, the cross-poster folks briefly sent ~2Kqps of writes,
which I throttled with a rate limiter.
- TesseraCT will rebuild the antispam database from the tree when it
goes missing; this is by design. It doesn't go very quickly, however:
it took until 2026-03-23 03:01 UTC to rebuild, just about 17.5hrs in
total.
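
Back-of-the-envelope, that works out to roughly 1.41B / 17.5h, i.e.
about 22K entries/second (~80M entries/hour) rebuilt on average.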

Unfortunately, as soon as it rebuilt, it stopped serving 429s but started
serving equal amounts of 200s and 500s, with the exact same symptoms
and journal entries:
AddChain handler error: couldn't store the leaf: error waiting for
Tessera index future and its integration: context canceled

I spent another hour looking at BadgerDB and how Tessera uses it,
trying to understand why it writes to its mem files but does not write
new SSTs or its vlog. The current state, as I write this at 04:35
UTC, is:
-rw-r--r-- 1 ctlog ctlog   37597532 Mar 23 03:32 210547.sst
-rw-r--r-- 1 ctlog ctlog          6 Mar 23 03:37 LOCK
-rw------- 1 ctlog ctlog     144438 Mar 23 03:37 MANIFEST
-rw-r--r-- 1 ctlog ctlog     183467 Mar 23 03:37 210548.sst
-rw-r--r-- 1 ctlog ctlog 2147483646 Mar 23 03:37 000008.vlog
-rw-r--r-- 1 ctlog ctlog         20 Mar 23 03:37 000007.vlog
drwx------ 2 ctlog ctlog       3097 Mar 23 03:37 .
-rw-r--r-- 1 ctlog ctlog  134217728 Mar 23 04:35 01690.mem
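
For anyone who wants to poke at a copy of one of these directories
offline (e.g. the BadgerDB we saved earlier), here is a minimal
read-only inspection sketch using the stock BadgerDB Go API
(github.com/dgraph-io/badger/v4). The path is a placeholder, and this
is not how TesseraCT itself opens the database:

package main

import (
    "fmt"
    "log"

    badger "github.com/dgraph-io/badger/v4"
)

func main() {
    // Open a *copy* of the saved antispam directory read-only, so we
    // don't mutate the evidence.
    opts := badger.DefaultOptions("/path/to/saved/badgerdb").WithReadOnly(true)
    db, err := badger.Open(opts)
    if err != nil {
        log.Fatalf("open: %v", err)
    }
    defer db.Close()

    // Size() reports the on-disk bytes held by the LSM tree (SSTs) and
    // by the value log - a quick sanity check against the ls output above.
    lsm, vlog := db.Size()
    fmt.Printf("LSM (SSTs): %d bytes, value log: %d bytes\n", lsm, vlog)
}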

I took some stack traces from tesseract-posix (sending it SIGQUIT) and
tried to analyze why the log writer is blocked, but I'm out of my depth
on this, unfortunately. I'm going to seek help from the Google team
tomorrow. Until then, Halloumi2026h1 will continue to deteriorate by
serving 500s. The other shards - with the exception of the 37min
downtime while the VM + hypervisor rebooted - continue to serve fine.

I'll update this thread as soon as I have pertinent information.

Pim van Pelt

Mar 23, 2026, 2:34:02 PM
to ct-p...@chromium.org
Hoi folks,

An update from IPng on the Halloumi logshards:
- At 2026-03-23 03:37 UTC I restarted TesseraCT on Halloumi2026h1, and it wrote one SST file
- As a side note, the blue line of 429s vanishes, because after restarting there is no /metrics timeseries for 429s => they are completely gone.
- Nothing operational changed overnight, but at 13:08:35 UTC BadgerDB recovered by itself, and all 500s are gone. Halloumi is fully recovered at that time.
- Al, Philippe and I spent an hour today on a call discussing what could have happened. Philippe has a theory about the 200s/500s overnight, but none of us could put a finger on why it recovered by itself at 13:08.
- We looked at the /debug/vars endpoint on TesseraCT, which exposes a bunch of BadgerDB metrics, but a brief comparison between Halloumi2026h1 (affected) and Halloumi2026h2a (not affected) did not reveal anything obvious (see the sketch after this list).
- Philippe has a few introspection / observability improvements for TesseraCT which he'll release today in a new version.
 => I have just updated Lipase (our staging logs) with tesseract-posix@0a4c590 and plan to update Halloumi2026h1 with this new version later today (2026-03-23 at around 23:00 UTC)
- Al is downloading the Halloumi2026h1 tree with 1.41B entries, and I've also given him access to the original BadgerDB database that jammed when this first started
 => Al will attempt some analysis offline to see if he can find out why the database froze the first time.
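
For anyone who wants to repeat the /debug/vars comparison, here's a quick sketch that pulls Go's expvar JSON from both shards and prints the BadgerDB-looking variables side by side; the hostnames/ports are placeholders, and matching variables on the substring 'badger' is an assumption about how they're named:

package main

import (
    "encoding/json"
    "fmt"
    "log"
    "net/http"
    "strings"
)

func fetchVars(url string) map[string]any {
    resp, err := http.Get(url)
    if err != nil {
        log.Fatalf("get %s: %v", url, err)
    }
    defer resp.Body.Close()
    vars := map[string]any{}
    if err := json.NewDecoder(resp.Body).Decode(&vars); err != nil {
        log.Fatalf("decode %s: %v", url, err)
    }
    return vars
}

func main() {
    affected := fetchVars("http://halloumi2026h1.example:8080/debug/vars")
    healthy := fetchVars("http://halloumi2026h2a.example:8080/debug/vars")

    // Print side-by-side values for anything that looks BadgerDB-related.
    for k, v := range affected {
        if strings.Contains(strings.ToLower(k), "badger") {
            fmt.Printf("%-50s affected=%v healthy=%v\n", k, v, healthy[k])
        }
    }
}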

Here's a succinct graph of HTTP response codes over the last 16hrs or so (blue lines were 429s served while TesseraCT rebuilt its antispam DB; green+orange lines were equal amounts of 200s and 500s; a recovery at around 12:15; two brief spikes of 500s; and a recovery at 13:08 with only 200s served; throughout, a low level of HTTP 400s due to invalid submissions, missing CAs, invalid certs, etc.):



What's somewhat troubling is that we do not yet have a root cause. We do have a tonne of data though, and we're discussing it (the data, the issue, and possible root causes) in #cheese on Slack. Many many thanks to Al and Philippe and Jeroen for all your help getting to the bottom of this.
Screenshot 2026-03-23 at 08.56.10.png