Gouda Availability


Pim van Pelt

Mar 12, 2026, 11:04:26 AM
to Ct Policy
Hoi folks,

Last week the Chrome folks sent us a heads-up that IPng has dipped just under 99% availability on the write path of Gouda2026h1. We have been investigating proximate and symptomatic causes, but we have not yet found a root cause. Our ZFS disk pool (three Samsung MZILT3T8HBLS (SAS-3) drives in raidz-1) has slowed down considerably, while at the same time the load on Gouda is considerably higher than on other logs, due to some cross posters and Tor (write) traffic.

I had previously loadtested [1] Sunlight to at least an order of magnitude more writes, and it is as yet unclear what the root cause is. The pattern is intermittent 503s (the blue spikes below) -


Due to sequencing pool overruns -


Gouda is loaded a fair bit higher than other Sunlight logs; Filippo showed a comparison (Gouda in green) -


We have identified three next steps to take:
1) Block writes from the cross poster ASNs, at least temporarily.
2) Roll out Sunlight with https://github.com/FiloSottile/sunlight/pull/56 to evict low-priority entries from the pool under load.
3) Switch to different (bare metal) hardware, to rule out hypervisor and VM configuration issues.
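
For step 2, the pool-eviction idea can be sketched roughly as follows. This is an illustrative Python sketch, not Sunlight's actual Go implementation; the capacity of 750 matches the clipping value we observed, and the priority scheme is assumed:

```python
import heapq
import itertools

class SequencingPool:
    """Illustrative bounded sequencing pool that, when full, evicts the
    lowest-priority pending entry instead of rejecting every newcomer
    with a 503. Sunlight's actual behavior in PR 56 may differ."""

    def __init__(self, capacity=750):   # 750 matches the clipping we observed
        self.capacity = capacity
        self._heap = []                 # min-heap of (priority, seq, entry)
        self._seq = itertools.count()   # tie-breaker: FIFO within a priority

    def add(self, entry, priority):
        """Accept `entry`; return the evicted entry if one was displaced.
        Raises OverflowError (caller serves 503) if the pool is full and
        `entry` is itself lowest priority."""
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, (priority, next(self._seq), entry))
            return None
        if priority <= self._heap[0][0]:
            raise OverflowError("pool full, entry priority too low")
        _, _, evicted = heapq.heapreplace(
            self._heap, (priority, next(self._seq), entry))
        return evicted
```

Under sustained overload this sheds the least important pending work rather than turning away everything at the door with 503s.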

I will try to provide an update after the weekend. Bear with us as we figure this out.

groet,
Pim
[1] https://ipng.ch/s/articles/2025/08/10/certificate-transparency-part-2-sunlight/

-- 
Pim van Pelt <p...@ipng.ch>
PBVP1-RIPE https://ipng.ch/

Joe DeBlasio - Google

Mar 12, 2026, 1:54:00 PM
to Certificate Transparency Policy, Pim van Pelt
Thanks very much for the update and transparency, Pim!

Joe

Pim van Pelt

Mar 17, 2026, 9:49:27 AM
to Pim van Pelt, Ct Policy
Hoi,

A quick update and a few questions from IPng's Gouda / Halloumi logs.
- On 2026-03-12 at 16:35 UTC I ratelimited two cross posters (or one
cross poster from two different networks):
- We observed submissions drop from ~400/sec to ~45/sec immediately
- Of all traffic to Halloumi + Gouda (93M total), 39M were from the
cross poster(s):

# LABEL COUNT
1 2a01:4f9:4b::/48 36 031 259
2 2a03:4000:29::/48 13 279 765
3 40.75.145.0/24 3 822 027
4 48.204.59.0/24 3 672 889
5 240d:c000:f05f::/48 3 427 313

- Focusing on the write path: of the 56M submissions in the last 24 hrs,
27M were from the cross poster(s) (rejected with HTTP 403 or HTTP 429)

# LABEL COUNT
1 403 24 723 966
2 429 2 695 745
3 404 113 159
4 410 10 226
5 499 6 135
6 500 1 559

- That means of the 39M requests, 27M were rejected and 12M made it
through.
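
For the curious, the kind of per-prefix limiter described above can be sketched as a token bucket keyed by /24 (IPv4) or /48 (IPv6) source prefix. This is an illustrative Python sketch only; the limiter actually deployed in front of Gouda is not described here, and the rate and burst values are assumptions:

```python
import ipaddress
import time

class PrefixRateLimiter:
    """Token-bucket rate limiter keyed by source prefix (/24 for IPv4,
    /48 for IPv6). Rate and burst values are illustrative."""

    def __init__(self, rate=45.0, burst=90.0, clock=time.monotonic):
        self.rate = rate      # tokens (requests) refilled per second
        self.burst = burst    # bucket capacity
        self.clock = clock
        self.buckets = {}     # prefix -> (tokens, last_refill)

    def _prefix(self, addr):
        ip = ipaddress.ip_address(addr)
        plen = 24 if ip.version == 4 else 48
        return ipaddress.ip_network(f"{addr}/{plen}", strict=False)

    def allow(self, addr):
        """Return True to accept the submission, False to serve HTTP 429."""
        key = self._prefix(addr)
        now = self.clock()
        tokens, last = self.buckets.get(key, (self.burst, now))
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens < 1.0:
            self.buckets[key] = (tokens, now)
            return False
        self.buckets[key] = (tokens - 1.0, now)
        return True
```

All hosts inside a given /48 or /24 share one bucket, so a cross poster rotating addresses within a prefix is limited as a single source.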

However, once I placed the limiter on the write path:
- We saw latency return to normal, and 500s mostly vanish on regular
traffic (only ~450 500s out of 93M queries served).
- Shortly after applying the ratelimit to the cross posters, Matthew
showed latency as seen from Let's Encrypt submissions to Gouda markedly
improve.
- The Sunlight pool size (which regularly clipped at 750 before, causing
submissions to be rejected with HTTP 503) is now consistently under 40.
- After I notified the hosting providers, they both relayed the message
to their customer (could be one customer, or two, I do not know), but
the customer(s) have not stopped sending ~400 qps of submissions to
Gouda and Halloumi.

I've left Halloumi (the TesseraCT log) untouched. It, too, is receiving
~400 submissions/sec from these cross posters.

For Gouda, the current share of HTTP 400s is quite high: about 11% of
responses are HTTP 400, and about 88.9% are HTTP 200. The 400s on the
write path are fairly evenly distributed. For example, the query
'website~=gouda.* AND uri~=/ct AND status=400' yields this top 5:
# LABEL COUNT
1 172.71.164.0/24 37 761
2 162.158.202.0/24 37 170
3 2a0a:4cc0:c0::/48 35 301
4 172.70.242.0/24 31 959
5 172.71.172.0/24 17 893
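
The aggregation behind these tables is essentially a group-by on source prefix. A minimal sketch in Python, assuming records of (source_ip, status) pairs rather than our actual log schema:

```python
from collections import Counter
import ipaddress

def top_prefixes(records, status=400, n=5):
    """Group request records by source prefix (/24 for IPv4, /48 for
    IPv6) for a given HTTP status and return the top-n. `records` is an
    iterable of (source_ip, status_code) pairs; the real log schema and
    query language differ."""
    counts = Counter()
    for addr, code in records:
        if code != status:
            continue
        ip = ipaddress.ip_address(addr)
        plen = 24 if ip.version == 4 else 48
        net = ipaddress.ip_network(f"{addr}/{plen}", strict=False)
        counts[str(net)] += 1
    return counts.most_common(n)
```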

On Halloumi, they are heavily skewed toward the cross poster (#1 and
#2, not rate limited there), per the query 'website~=halloumi.* AND
uri~=/ct AND status=400':
# LABEL COUNT
1 2a01:4f9:4b::/48 422 588
2 2a0a:4cc0:c0::/48 44 192
3 172.71.164.0/24 37 762
4 162.158.202.0/24 37 172
5 172.70.243.0/24 31 436

I've e-mailed the two hosting providers. One of them responded fairly
quickly, noting that their customer was rather uncooperative. I've
followed up with both hosting providers' abuse teams to see if I can
arrange a dialog. So far, though, it's not looking great.

Question:
- Does Chrome offer anything shorter than a 90-day rolling average for
log availability? I see the performance still at ~98.89% despite us
having served almost no 500s in the last ~4 days, per
https://www.gstatic.com/ct/compliance/endpoint_uptime.csv. From the CT
Log Policy:
```Log availability is measured on a per-endpoint basis over a 90-day
rolling average from all requests made to the log by the Chrome team’s
compliance monitoring infrastructure. The log’s overall availability is
represented by the minimum of all per-endpoint availabilities.```
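
To illustrate the arithmetic of that rolling average: one fully-down day in a 90-day window yields 89/90 ~= 98.89%, and the figure stays pinned there until that day ages out, no matter how many perfect days follow. A sketch (the exact averaging Chrome uses is an assumption on my part):

```python
def rolling_availability(daily_uptime, window=90):
    """Mean of the most recent `window` daily uptime fractions; an
    assumed model of the per-endpoint figure described in the quoted
    policy text."""
    recent = daily_uptime[-window:]
    return sum(recent) / len(recent)

# One fully-down day in the window pins the figure at 89/90 ~= 98.89%:
history = [0.0] + [1.0] * 89
print(round(rolling_availability(history) * 100, 2))  # 98.89

# One more perfect day and the bad day ages out; the figure snaps back:
history.append(1.0)
print(rolling_availability(history))                  # 1.0
```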

- Cloudflare shows average uptime at 100% in
https://radar.cloudflare.com/certificate-transparency/log/gouda2026h1?dateRange=24w

I'd like to get a signal from CAs (and possibly monitors, although the
read path was performant throughout for both Gouda and Halloumi) on
whether submissions to Gouda have improved. I believe we are in the
clear.

groet,
Pim


Joe DeBlasio

Mar 17, 2026, 3:55:43 PM
to Pim van Pelt, Ct Policy
Thanks for the update, Pim.

> Does Chrome offer anything shorter than a 90-day rolling average for
> log availability? I see the performance still at ~98.89% despite us
> having served almost no 500s in the last ~4 days

We don't presently (though never say never).

When issues are resolved and a log is back to perfect availability, we expect the 90-day rolling numbers to remain flat, exactly as you're seeing, until the period of poor availability ages out, so you're seeing evidence of recovery from our vantage point.

In particular, we've seen perfect availability from our perspective from 2026-03-06 through now. Though we encourage you to keep investigating ways to avoid this situation in the future (by addressing the cross-posting, by helping Sunlight get more robust, or something else), we consider the proximate availability incident resolved. Do let us know if there's anything we can do to help, and thanks for the work!

Best,
Joe, on behalf of the Chrome CT team
