Combating bot scraping


Nic Stanton-Roark

May 13, 2025, 4:37:16 PM
to Archivesspace_Users_Group
Hi all, 

For those of you who use a PUI, how are you combating LLM training bots scraping your database? We self-host, and even after blocking a wide range of IP addresses we're seeing significant slowdown whenever the PUI is enabled.

Blake Carver

May 20, 2025, 7:42:05 AM
to Archivesspace_Users_Group
I bet bots are a HUGE problem for all ArchivesSpace sites. There's a group here that is working on it in a general way:
Not ArchivesSpace focused, but we're all fighting the same thing.




Blake Carver

May 20, 2025, 10:22:17 AM
to Archivesspace_Users_Group
Speaking of bots...

I'm curious whether anyone else is seeing this botnet/scraper behavior, which hides behind a huge range of what appear to be residential IPs. The traffic all looks something like this:

"GET /filter_fields%5B%5D=subjects&filter_values%5B%5D=Montgomery+%28Ala.%29&filter_values%5B%5D=Religion&page=2"

"GET /agents/corporate_entities/255?filter_fields%5B%5D=subjects"

It's mostly hits to URLs containing "filter_fields", usually on agents and/or subjects. I assume it's an AI training scraper harvesting metadata? They're doing their best to avoid detection.

I see just 2 hits from each IP in the past few hours, and only to URLs like the above. The ranges are all over the place, but they all look like "normal" home IPs and user agents. Right now most of the IPs seem to be in Canada (e.g. 99.233.165.128), though they move around.
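
For anyone who wants to check their own logs for the same pattern, here is a minimal Python sketch. It is an illustration only, not something posted in this thread. It assumes a combined-style access log with the client IP as the first field and the request line in double quotes. The function names, the sample lines, and the documentation-range IPs in them are all made up. It decodes the percent-encoded request target, looks for a `filter_fields[]` parameter, and counts such requests per IP, which should show whether the traffic is spread thinly across many addresses.

```python
import re
from collections import Counter
from urllib.parse import unquote_plus

# Assumption: client IP is the first field and the request line is quoted,
# as in the common/combined log formats. Adjust for your proxy's format.
LINE_RE = re.compile(r'^(?P<ip>\S+) .*?"(?:GET|HEAD) (?P<target>\S+)')

# Made-up sample lines; the IPs come from reserved documentation ranges.
SAMPLE_LINES = [
    '203.0.113.5 - - [20/May/2025:10:00:00 +0000] '
    '"GET /agents/corporate_entities/255?filter_fields%5B%5D=subjects HTTP/1.1" 200 512',
    '203.0.113.5 - - [20/May/2025:10:00:05 +0000] '
    '"GET /repositories/2/resources/1 HTTP/1.1" 200 512',
    '198.51.100.7 - - [20/May/2025:10:00:09 +0000] '
    '"GET /agents/corporate_entities/255?filter_fields%5B%5D=subjects&page=2 HTTP/1.1" 200 512',
]


def is_facet_request(target: str) -> bool:
    """Return True if the decoded request target has a filter_fields[] parameter."""
    return "filter_fields[]" in unquote_plus(target)


def facet_hits_by_ip(lines):
    """Count filter_fields[] requests per client IP."""
    hits = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match and is_facet_request(match.group("target")):
            hits[match.group("ip")] += 1
    return hits


def summarize(lines, max_hits_per_ip=2):
    """Summarize how thinly the facet traffic is spread across IPs."""
    hits = facet_hits_by_ip(lines)
    low_volume = [ip for ip, count in hits.items() if count <= max_hits_per_ip]
    return {
        "facet_requests": sum(hits.values()),
        "distinct_ips": len(hits),
        "low_volume_ips": len(low_volume),
    }


if __name__ == "__main__":
    print(summarize(SAMPLE_LINES))
```

If nearly every IP lands in `low_volume_ips`, per-IP rate limiting or blocking is unlikely to help much, which fits the slowdown persisting after blocking wide IP ranges. Matching on the request pattern itself may be a more useful starting point.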




James Truitt

May 21, 2025, 10:35:11 AM
to Archivesspace_Users_Group, Blake Carver
There's been a lot of discussion of anti-bot techniques over on the #bots channel of the Code4Lib Slack. Some folks from that group have also written up a wiki page: https://wiki.code4lib.org/Blocking_Bots