I'm seeing a lot of crawler traffic to my public Gerrit server. Digging into the access logs, the top offenders by request count are:
4213 "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Amazonbot/0.1; +
https://developer.amazon.com/support/amazonbot) Chrome/119.0.6045.214 Safari/537.36"
7967 "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/137.0.7151.103 Mobile Safari/537.36 (compatible; GoogleOther)"
12090 "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2; +
https://openai.com/gptbot)"
19437 "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +
clau...@anthropic.com)"
My metrics say serving that traffic used ~180 GB of egress and generated a fair amount of CPU load. I'm running this on GCP, so it ends up actually costing money too.
Has anyone got best practices here? In principle I like the models learning from the code, but they should be able to get that from GitHub, where the repos are mirrored. About 90% of the traffic is scraping Gitiles, too.
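
For now I'm leaning towards a robots.txt along these lines. The User-agent tokens come from each vendor's published crawler docs, so treat this as an untested sketch rather than something I've verified against all four:

    # Block the four heaviest crawlers from the whole site.
    # Tokens per each vendor's crawler documentation; grouping several
    # User-agent lines over one Disallow is valid per RFC 9309.
    User-agent: Amazonbot
    User-agent: GoogleOther
    User-agent: GPTBot
    User-agent: ClaudeBot
    Disallow: /

Of course that only helps with bots that actually honor robots.txt; for anything that ignores it, the fallback would presumably be rejecting those User-Agent strings at the reverse proxy or load balancer before requests ever reach Gerrit/Gitiles.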