Aggressive crawler bots


Yuyun W

Apr 15, 2025, 9:15:43 PM
to Dataverse Users Community
Hi everyone, 

Our installation was heavily crawled by aggressive bots on 15 Apr, to the point that we restarted the system and paused it for 2 hours. That did not help: the crawlers resumed and overloaded the system as soon as we brought it back up.

Within 10 minutes, we recorded 1,535 HTTP requests, each coming from a unique IP address. Our current solution, blacklisting specific IPs, won't work in this scenario. The bots might be AI crawlers that can spoof IP addresses (https://arstechnica.com/ai/2025/03/devs-say-ai-crawlers-dominate-traffic-forcing-blocks-on-entire-countries/). 

Has anyone encountered a similar issue? Are there any application settings we can use to mitigate it? 

I hope this issue is discussed at the Dataverse Community Meeting as well. 

Best,
Yuyun 

Bethany Seeger

Apr 16, 2025, 9:42:40 AM
to Dataverse Users Community
Hi Yuyun, 

We've been experiencing this in the last few weeks as well. 

We did adjust our robots.txt file to match more of what is detailed here: https://guides.dataverse.org/en/latest/installation/config.html#ensure-robots-txt-is-not-blocking-search-engines. But that only helps with nice bots that actually pay attention to it. 
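For anyone who hasn't touched theirs yet, here is a minimal sketch of the allow-list style of robots.txt the guide describes. The exact paths and delay value are illustrative assumptions on my part; check the linked guide for the current recommended file:

```
User-agent: *
# Allow the homepage, dataset landing pages, and the sitemap
Allow: /$
Allow: /dataset.xhtml
Allow: /sitemap/
# Disallow everything else (faceted search, API endpoints, etc.)
Disallow: /
# Polite bots may honor a crawl delay; aggressive ones ignore it
Crawl-delay: 20
```

Note that Crawl-delay is non-standard and many crawlers (including Googlebot) ignore it, so this is only a first line of defense.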
 
When it first started happening, it took our system down.  We gave the system more resources and ended up blocking one IP that was particularly aggressive.  Things have settled down a bit, though we are still figuring out what to do in general.  We will probably try to apply some rate limiting. Does anyone have recommendations for what the limits should be set to? 

I hope this helps some. I look forward to learning what others are doing as well. 

Best,
Bethany


Night Owl

Apr 16, 2025, 11:03:41 AM
to dataverse...@googlegroups.com

We’re in the same boat and have been battling the bots for over a year now. On our institutional repository it has been a massive issue, and we have done a great deal of investigation and configuration to try to limit the chaos. We have AWS WAF enabled to help with specific bots, and we also have Cloudflare rules in place to set rate limits. We block specific JA3/JA4 fingerprints when we can tell exactly where they are coming from. For a while we blocked the entire country of China. I would be all in for having a discussion about this at DCM2025. 
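For installations without Cloudflare or a WAF, a similar per-IP rate limit can be approximated at the reverse proxy. A minimal nginx sketch; the zone size, rate, burst, and backend address are illustrative assumptions, not tuned recommendations:

```
# Shared zone keyed by client IP: 10 MB of state,
# averaging 2 requests/second per IP.
limit_req_zone $binary_remote_addr zone=perip:10m rate=2r/s;

server {
    listen 443 ssl;
    server_name dataverse.example.edu;

    location / {
        # Allow short bursts of 20 requests, then return 429
        limit_req zone=perip burst=20 nodelay;
        limit_req_status 429;
        proxy_pass http://localhost:8080;
    }
}
```

This only helps against crawlers that reuse IPs, of course; swarms that rotate through thousands of addresses (as Yuyun describes) will slip under any per-IP limit.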


--
You received this message because you are subscribed to the Google Groups "Dataverse Users Community" group.
To unsubscribe from this group and stop receiving emails from it, send an email to dataverse-commu...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/dataverse-community/41e9e114-7f51-43e7-aa7f-483be6d0e2dan%40googlegroups.com.

Yuyun W

Apr 17, 2025, 12:42:55 AM
to Dataverse Users Community
We tried adjusting robots.txt and changed Crawl-delay to 60, but it was no use; impolite bots simply disregard it. 
We will try blocking another IP that has been flagged as "Bad IP: HTTP spammer". 

I can't attend DCM, but yes, please discuss (and share)! 
The Confederation of Open Access Repositories has "Dealing with AI Bots in Repositories" as the first agenda item for its upcoming meeting in May (https://coar-repositories.org/news-updates/coar-annual-conference-2025/). 

Don Sizemore

Apr 17, 2025, 9:40:42 AM
to dataverse...@googlegroups.com
I'll ask about wedging such a discussion into the current conference agenda, though glancing over the schedule I don't see a good place to include it.
This is a technical problem requiring a technical solution, and might well take over the hackathon on Friday?

Don

Yuyun W

Apr 21, 2025, 1:28:05 AM
to Dataverse Users Community
How about using a CAPTCHA?

We have a WAF, but no CAPTCHA at the WAF level. Perhaps at the application level?

Don Sizemore

Apr 21, 2025, 8:32:22 AM
to dataverse...@googlegroups.com
Jim implemented a CAPTCHA at QDR, which should stop the crawlers entirely.

If you're on v6.2 or above you can implement tiered rate limiting (I'm looking into this today). Tier 0 applies to anonymous users, Tier 1 to logged-in users, and superusers are exempt from rate-limiting.
Our recent swarms of crawlers have focused on our collection page, but I don't see the collection analog of GetLatestPublishedDatasetVersionCommand just yet.
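For reference, the tiered rate limiting is configured through database settings via the admin API. A sketch of what that might look like, based on my reading of the 6.2+ guides; the setting names, tier values, and JSON shape should be verified against the Installation Guide for your version:

```
# Default hourly command capacities per tier:
# tier 0 (anonymous) gets 30/hour, tier 1 (logged-in) gets 60/hour.
curl -X PUT -d '30,60' \
  http://localhost:8080/api/admin/settings/:RateLimitingDefaultCapacityTiers

# Override the limit for specific commands, e.g. the dataset-page
# command that crawlers tend to hammer.
curl -X PUT \
  http://localhost:8080/api/admin/settings/:RateLimitingCapacityByTierAndAction \
  -d '[{"tier": 0, "limitPerHour": 10, "actions": ["GetLatestPublishedDatasetVersionCommand"]}]'
```

Superusers bypass these limits, so admin workflows should be unaffected.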

Don

Yuyun Wirawati

Apr 22, 2025, 11:12:04 PM
to dataverse...@googlegroups.com
Most useful, thanks for the info, Don.


Sherry Lake

Apr 25, 2025, 11:03:07 AM
to Dataverse Users Community
This might explain the aggressive bots earlier this month:


o.be...@fz-juelich.de

Apr 29, 2025, 12:42:58 PM
to Dataverse Users Community
I am currently implementing countermeasures in our installation and am happy to include the learnings in one of my talks.

o.be...@fz-juelich.de

Apr 29, 2025, 12:43:29 PM
to Dataverse Users Community
Also, feel free to reach out on chat.dataverse.org and open a topic about this.