Elephant2026h1 index corruption and repair


Rob Stradling

Apr 2, 2026, 3:22:52 PM
to Certificate Transparency Policy
At 2026-04-01 07:15 UTC I received a report from a monitor operator that https://elephant2026h1.ct.sectigo.com/ct/v1/get-entries?start=227980800&end=227981055, as well as get-entries calls for just the first or last of those entries, were consistently producing HTTP 500 errors.  It was also observed that get-entries calls for various other entry ranges worked without any problems.
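For what it's worth, the failing range is exactly one block of 256 entries, aligned to a 256-entry boundary (get-entries start/end indices are inclusive per RFC 6962), which fits the observation that other ranges were unaffected. A quick check:

```python
# Quick arithmetic on the failing get-entries range.
# Indices are inclusive per RFC 6962, so the count is end - start + 1.
start, end = 227980800, 227981055

count = end - start + 1
print(count)              # 256 entries
print(start % 256 == 0)   # True: aligned to a 256-entry boundary
```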

Errors of the following form were spotted in the CTFE logs:
GetEntries handler error: backend GetLeavesByRange request failed: rpc error: code = Unknown desc = ERROR: could not open file "base/16386/16484.166" (target block 519901635): previous segment is only 88541 blocks (SQLSTATE XX000)

Executing the query "SELECT relname, relkind FROM pg_class WHERE relfilenode = 16484;" showed us that the affected object was "leafdata_pkey", the primary key index on the "leafdata" table.  Since this object is an index rather than a table, we concluded that rebuilding the index should resolve the problem.
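For readers unfamiliar with PostgreSQL's on-disk layout: a path like "base/16386/16484.166" encodes the database OID, the relation's relfilenode, and a segment number (relations larger than 1 GB are split into numbered segment files; segment 0 has no suffix). A hypothetical helper to decode such a path — illustrative only, not part of PostgreSQL:

```python
import re

def parse_relation_path(path: str):
    """Split a PostgreSQL relation file path like 'base/16386/16484.166'
    into (database OID, relfilenode, segment number).  Segment 0 has no
    suffix.  Illustrative helper only."""
    m = re.fullmatch(r"base/(\d+)/(\d+)(?:\.(\d+))?", path)
    if not m:
        raise ValueError(f"unrecognized relation path: {path}")
    db_oid, relfilenode, segment = m.groups()
    return int(db_oid), int(relfilenode), int(segment or 0)

print(parse_relation_path("base/16386/16484.166"))  # (16386, 16484, 166)
```

With the relfilenode in hand, the pg_class lookup quoted above maps it back to a relation name.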

A "REINDEX CONCURRENTLY" operation began at 2026-04-01 07:58 UTC and eventually completed at 2026-04-02 06:37 UTC.  Since then, the previously problematic get-entries calls have consistently worked correctly.
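Using the timestamps above (seconds omitted), that works out to a rebuild time of roughly 22 hours 39 minutes:

```python
from datetime import datetime, timezone

# Start and end of the REINDEX CONCURRENTLY run, as reported above.
started  = datetime(2026, 4, 1, 7, 58, tzinfo=timezone.utc)
finished = datetime(2026, 4, 2, 6, 37, tzinfo=timezone.utc)

elapsed = finished - started
print(elapsed)  # 22:39:00
```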

The first and last CTFE errors indicating the index corruption occurred at 2026-03-30 04:11 and 2026-04-02 06:36 respectively.

https://www.gstatic.com/ct/compliance/endpoint_uptime_24h.csv shows poor availability for Elephant2026h1 over the past 24 hours.  We suspect this was due to the performance impact of the reindexing operation, and so we expect those numbers to look healthy again within the next 24 hours.

We think it's likely that our recent Proxmox outage was the root cause of the index corruption.  That incident ended roughly six days before the first evidence of corruption appeared, but our access logs show that no requests for https://elephant2026h1.ct.sectigo.com/ct/v1/get-entries?start=227980800&end=227981055 were received during that six-day window, and we have found no evidence of other entry ranges being affected.

We will continue to monitor this log for any recurrence of this issue.

Rob Stradling

Apr 13, 2026, 10:07:13 AM
to Certificate Transparency Policy, Rob Stradling
> We will continue to monitor this log for any recurrence of this issue.

We've been made aware that the recreated index is now showing signs of corruption.  The error message we're seeing this time in the CTFE logs is as follows:
GetEntries handler error: backend GetLeavesByRange request failed: rpc error: code = Unknown desc = ERROR: invalid page in block 10421895 of relation base/16386/4155290161 (SQLSTATE XX001)
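Assuming the default 8 KB block size and 1 GB (131072-block) segment files, block 10421895 of that relation lands in segment file 4155290161.79, a little over 500 MB into the file.  A quick sketch of the arithmetic:

```python
BLOCK_SIZE = 8192    # PostgreSQL default BLCKSZ (8 KB)
SEG_BLOCKS = 131072  # blocks per 1 GB segment file (default RELSEG_SIZE)

block = 10421895     # from the "invalid page in block ..." error

segment = block // SEG_BLOCKS          # which segment file holds the block
block_in_segment = block % SEG_BLOCKS  # block index within that segment
byte_offset = block_in_segment * BLOCK_SIZE

print(segment, block_in_segment, byte_offset)  # 79 67207 550559744
```

Pinpointing the on-disk location like this can be useful when checking the underlying storage for faults.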

Since it's now happened twice, we're beginning to suspect that something in the underlying system (RAM? filesystem? disk controller?) may be at fault.  We plan to migrate Elephant2026h1 to a different hypervisor at 14:00 UTC tomorrow (Tuesday April 14th), after which we will initiate another index rebuild.

Rob Stradling

Apr 14, 2026, 11:15:43 AM
to Certificate Transparency Policy, Rob Stradling
After successfully completing the hypervisor migration, we've seen no further "invalid page in block" errors logged.  This better-than-expected outcome implies that another index rebuild is not necessary.  :-)