Altering the CloudFront WAF rate limiting error from 403 to 429


Onni Hakala

Nov 26, 2025, 10:29:53 AM
to Common Crawl
Hey!

First of all, thanks to the awesome people maintaining Common Crawl 😊

Greg and Sebastian have been very helpful in the Discord when I have been exploring the different services you provide.

I'm using DuckDB with the columnar index.

I checked the https://status.commoncrawl.org/ page and learned that:
CloudFront: 4xx errors are mostly Greg's explicit rate limit rules.

I noticed that DuckDB fails outright when the Common Crawl WAF starts returning 403 errors.

To test this, I deployed my own separate CloudFront distribution in front of the s3://commoncrawl/ bucket today and added a WAF with rate limiting, using the default block action. With those defaults, the rate-limiting rules did indeed produce the same 403 errors:

$ duckdb -f test.sql
HTTP Error:
HTTP GET error on 'https://XXXXXXX.cloudfront.net/cc-index/table/cc-main/warc/crawl=CC-MAIN-2014-41/subset=warc/part-00103-e1115d40-d2cc-4445-873c-2b206f427726.c000.gz.parquet' (HTTP 403)

But then I tried to alter the WAF-based rate limiting with a custom response:

[Screenshot: Screenshot 2025-11-26 at 14.30.27 1.png]

I added the 429 status code with a Retry-After header. This is also the recommendation in RFC 6585:

[Screenshot: Screenshot 2025-11-26 at 14.40.06.png]

After this change the rate limiting still works, but DuckDB doesn't fail. It simply retries, as you can see from the MITM proxy logs:

[Screenshot: Screenshot 2025-11-26 at 14.58.07.png]
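
For readers who cannot see the screenshots, here is a rough sketch of what a rate-based rule with a custom 429 response might look like, written as a Python dict in the shape I understand the WAFv2 rule JSON to take. The rule name, the limit of 1000 requests, and the 60-second Retry-After value are all made-up placeholders, not the values from the screenshots or from Common Crawl's setup, so please check the field names against the AWS documentation before relying on it.

```python
import json


def build_rate_limit_rule(limit=1000, retry_after_seconds=60):
    """Build a WAFv2-style rate-based rule that blocks with HTTP 429.

    Every name and number here is an illustrative assumption, not a value
    taken from the screenshots or from Common Crawl's configuration.
    """
    return {
        "Name": "rate-limit-429",  # hypothetical rule name
        "Priority": 0,
        "Statement": {
            "RateBasedStatement": {
                "Limit": limit,  # requests allowed per evaluation window
                "AggregateKeyType": "IP",  # count requests per client IP
            }
        },
        "Action": {
            "Block": {
                # Replace the default 403 block response with a 429.
                "CustomResponse": {
                    "ResponseCode": 429,
                    "ResponseHeaders": [
                        {"Name": "Retry-After", "Value": str(retry_after_seconds)}
                    ],
                }
            }
        },
        "VisibilityConfig": {
            "SampledRequestsEnabled": True,
            "CloudWatchMetricsEnabled": True,
            "MetricName": "rate-limit-429",
        },
    }


rule = build_rate_limit_rule()
print(json.dumps(rule, indent=2))
```

The only part that differs from a default blocking rule is the CustomResponse section; the rate limit itself is unchanged.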

I'm wondering whether you would be willing to change the default WAF rules for your CloudFront distribution so that rate-limited requests get a 429 instead of a 403?

This way many tools would be able to retry and continue automatically, and error logs would be easier to read because rate limiting would no longer show up as 403 errors.
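
To illustrate why a 429 with Retry-After is easier for clients to handle than a 403, here is a minimal sketch of the kind of retry loop a tool could run. It is not DuckDB's actual implementation. The `fetch` callable, the function names, and the default values are hypothetical; `fetch` is assumed to return a `(status, headers, body)` tuple.

```python
import time
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def parse_retry_after(value, now=None):
    """Return the wait in seconds for a Retry-After value, or None if unusable.

    Retry-After may be a number of seconds or an HTTP date.
    """
    if value is None:
        return None
    value = value.strip()
    if value.isdigit():
        return float(value)
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        # Treat a date without zone information as UTC.
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


def get_with_retry(fetch, max_attempts=5, default_wait=1.0, sleep=time.sleep):
    """Call fetch() until it returns a non-429 status or attempts run out.

    Only 429 is retried. A 403 is returned to the caller immediately,
    because it normally means "forbidden", not "slow down".
    """
    for attempt in range(1, max_attempts + 1):
        status, headers, body = fetch()
        if status != 429 or attempt == max_attempts:
            return status, body
        wait = parse_retry_after(headers.get("Retry-After"))
        sleep(default_wait if wait is None else wait)
```

With a 403 the loop has no way to tell rate limiting apart from a real permission problem, so it gives up on the first response; with a 429 it knows both that retrying is appropriate and how long to wait.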

Thanks in advance,
Onni Hakala

Greg Lindahl

Nov 26, 2025, 11:15:42 AM
to common...@googlegroups.com
Please see the solution I gave you on Discord. I realize that you disagree with it. This is not the place for this discussion.

Thank you.

-- greg

