I'm seeing a lot of crawler traffic to my public Gerrit server. Digging into the access logs, the top offenders by request count are:
4213 "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Amazonbot/0.1; +
https://developer.amazon.com/support/amazonbot) Chrome/119.0.6045.214 Safari/537.36"
7967 "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/137.0.7151.103 Mobile Safari/537.36 (compatible; GoogleOther)"
12090 "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2; +
https://openai.com/gptbot)"
19437 "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +
clau...@anthropic.com)"
My metrics say serving that traffic used ~180 GB of egress and generated a fair amount of CPU load. I'm running this on GCP, so it ends up actually costing money too.
Has anyone got best practices here? In principle I like the models learning from the code, but they should be able to get that from GitHub, where the repos are mirrored. About 90% of the traffic is scraping Gitiles, too.
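
For now I'm leaning towards a robots.txt along these lines. The User-agent tokens come from each vendor's published crawler docs, so treat this as an untested sketch rather than something I've verified against all four:

    # Block the four heaviest crawlers from the whole site.
    # Tokens per each vendor's crawler documentation; grouping several
    # User-agent lines over one Disallow is valid per RFC 9309.
    User-agent: Amazonbot
    User-agent: GoogleOther
    User-agent: GPTBot
    User-agent: ClaudeBot
    Disallow: /

Of course that only helps with bots that actually honor robots.txt; for anything that ignores it, the fallback would presumably be rejecting those User-Agent strings at the reverse proxy or load balancer before requests ever reach Gerrit/Gitiles.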