Hoi folks,
An update from IPng on the Halloumi logshards:
- At 2026-03-22 09:55 UTC I removed the antispam database, suspecting
corruption, as writes to its SST files were no longer occurring.
- We suspected memory pressure: the halloumi2026h1 shard had grown
substantially, while the VM was provisioned with only 16GB of memory.
- At 12:48 UTC Jeroen rebooted the VM, but unfortunately it did not come
up, due to a libvirt/qemu + hypervisor issue. He then rebooted the
hypervisor, and the VM came back with 64GB of memory; at 13:25 UTC it was
serving again.
- Joe pointed out that at 13:23:28, some of the halloumi and lipase
shards served a stale checkpoint. This is explained by the read path
(nginx) being brought up before the write path, which first needs a
crypto key to unlock the filesystem holding the private keys.
- As an aside, the cross-poster folks briefly sent ~2Kqps of writes,
which I throttled with a rate limiter.
- TesseraCT will rebuild the antispam database from the tree when it
goes missing; this is by design. It doesn't go very quickly, however: the
rebuild took until 2026-03-23 03:01 UTC, just about 17.5hrs in total.
Unfortunately, as soon as it finished, the shard stopped serving 429s but
started serving a roughly equal mix of 200s and 500s, with the exact same
symptoms and journal entries, which I'll unpack just below:
AddChain handler error: couldn't store the leaf: error waiting for
Tessera index future and its integration: context canceled
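To unpack that error a bit: as I understand the write path, the AddChain
handler hands the leaf to Tessera and gets back a future, which it blocks
on until the leaf has been sequenced and integrated into the tree. If the
pipeline behind that future is wedged (say, because the antispam store
stops accepting writes), the request context is eventually canceled and
the handler surfaces exactly this error as a 500. A minimal sketch of
that shape, with illustrative names rather than Tessera's actual API:

  package main

  import (
      "context"
      "fmt"
      "time"
  )

  // indexFuture resolves to the assigned leaf index once the leaf has
  // been integrated into the tree.
  type indexFuture func() (uint64, error)

  // add queues a leaf and returns a future for its index.
  func add(ctx context.Context, leaf []byte, integrated <-chan uint64) indexFuture {
      return func() (uint64, error) {
          select {
          case idx := <-integrated:
              return idx, nil
          case <-ctx.Done():
              // A wedged pipeline means we sit here until the request
              // context dies, yielding "context canceled".
              return 0, fmt.Errorf("waiting for index future and its integration: %w", ctx.Err())
          }
      }
  }

  func main() {
      ctx, cancel := context.WithCancel(context.Background())
      time.AfterFunc(2*time.Second, cancel) // e.g. the client gives up
      stalled := make(chan uint64)          // never delivers: a blocked writer
      if _, err := add(ctx, []byte("leaf"), stalled)(); err != nil {
          fmt.Println("AddChain handler error:", err)
      }
  }

In other words, "context canceled" is the handler giving up on a stalled
pipeline, not the root cause itself; the interesting question is what
keeps the integration from completing.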
I spent another hour looking at BadgerDB and how Tessera uses it, trying
to understand why it keeps writing to its memtable (.mem) files but
writes no new SSTs and nothing to its value log. The current state of the
database directory, as I write this at 04:35 UTC, is:
-rw-r--r-- 1 ctlog ctlog 37597532 Mar 23 03:32 210547.sst
-rw-r--r-- 1 ctlog ctlog 6 Mar 23 03:37 LOCK
-rw------- 1 ctlog ctlog 144438 Mar 23 03:37 MANIFEST
-rw-r--r-- 1 ctlog ctlog 183467 Mar 23 03:37 210548.sst
-rw-r--r-- 1 ctlog ctlog 2147483646 Mar 23 03:37 000008.vlog
-rw-r--r-- 1 ctlog ctlog 20 Mar 23 03:37 000007.vlog
drwx------ 2 ctlog ctlog 3097 Mar 23 03:37 .
-rw-r--r-- 1 ctlog ctlog 134217728 Mar 23 04:35 01690.mem
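Two things stand out to me in that listing. First, the 01690.mem memtable
is still being written (04:35) while the newest .sst and .vlog entries
date from 03:37. Second, 000008.vlog sits at 2147483646 bytes, two bytes
shy of 2GiB (2^31), which, if I read Badger's options right, is the hard
ceiling it enforces on value log file size; whether that is cause or
coincidence I can't yet say. For those unfamiliar with Badger's on-disk
layout, a hedged sketch of where writes land (path and keys illustrative,
not TesseraCT's actual configuration):

  package main

  import (
      "log"

      badger "github.com/dgraph-io/badger/v4"
  )

  func main() {
      // Writes first land in an in-memory memtable backed by a
      // write-ahead log: those are the NNNNN.mem files.
      db, err := badger.Open(badger.DefaultOptions("/var/lib/ctlog/antispam"))
      if err != nil {
          log.Fatal(err)
      }
      defer db.Close()

      if err := db.Update(func(txn *badger.Txn) error {
          // Small values go into the LSM tree itself; values above
          // Options.ValueThreshold are appended to a value log
          // (NNNNNN.vlog) with only a pointer kept in the tree.
          return txn.Set([]byte("leaf-hash"), []byte("index"))
      }); err != nil {
          log.Fatal(err)
      }
      // When a memtable fills up, it is rotated out and flushed in the
      // background to a new NNNNNN.sst file.
  }

A growing .mem file with no new .sst or .vlog activity therefore smells
like the background flush / value-log machinery being blocked, rather
than writes failing outright, which is what I tried to confirm with the
stack traces below.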
I took some stack traces from tesseract-posix (by sending it SIGQUIT) and
tried to analyze why the log writer is blocked, but I'm out of my depth
here, unfortunately. I'm going to seek help from the Google team
tomorrow. Until then, halloumi2026h1 will continue to deteriorate,
serving 500s. The other shards - with the exception of the 37min of
downtime while the VM and hypervisor rebooted - continue to serve fine.
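For the archives: sending a Go binary SIGQUIT makes the runtime print
every goroutine's stack and then exit, which is fine for a log that is
already broken, but less fine otherwise. A hedged sketch (a generic
debugging pattern, not an existing tesseract-posix feature) of getting
the same dump on demand without killing the process:

  package main

  import (
      "os"
      "os/signal"
      "runtime/pprof"
      "syscall"
  )

  // dumpStacksOn prints all goroutine stacks to stderr whenever sig
  // arrives, leaving the process running.
  func dumpStacksOn(sig os.Signal) {
      c := make(chan os.Signal, 1)
      signal.Notify(c, sig)
      go func() {
          for range c {
              // debug=2 prints every goroutine with its full stack,
              // which is what you want for "why is the writer blocked".
              pprof.Lookup("goroutine").WriteTo(os.Stderr, 2)
          }
      }()
  }

  func main() {
      dumpStacksOn(syscall.SIGUSR1)
      select {} // stand-in for the real serving loop
  }

With such a handler built in, a kill -USR1 would yield the same stacks
while the read path keeps serving.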
I'll update this thread as soon as I have pertinent information.