Hey!
First of all, thanks to the awesome people maintaining Common Crawl 😊
Greg and Sebastian have been very helpful in the Discord when I have been exploring the different services you provide.
I'm using DuckDB with the columnar index.
I checked the https://status.commoncrawl.org/ page and learned that:
"CloudFront: 4xx errors are mostly Greg's explicit rate limit rules."
I noticed that DuckDB fails miserably when the Common Crawl WAF starts returning 403 errors.
To test this, today I deployed my own separate CloudFront distribution in front of the s3://commoncrawl/ bucket and added a WAF with rate limiting using the default block rules. The rate-limiting WAF rules did indeed result in the same 403 errors: