--
All messages to this mailing list should adhere to the Code of Conduct: https://lyrasis.org/code-of-conduct/
---
You received this message because you are subscribed to the Google Groups "DSpace Technical Support" group.
To unsubscribe from this group and stop receiving emails from it, send an email to dspace-tech...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/dspace-tech/fba039e0-21fc-40d5-a13d-5a6af09d5d1cn%40googlegroups.com.
Dear all,
AI bots are a mess. We calculate, fight, discuss, firewall them out … daily. Now I found an interesting article to rate limit aggressive access easily with the NGINX in front of our DSpace. E.g. this should help:
limit_req_zone $binary_remote_addr zone=req_per_ip:20m rate=5r/s;
limit_req zone=req_per_ip burst=20;
limit_req_status 429;
see https://joshtronic.com/2025/11/09/rate-limiting-nginx/
I prefer to block them at the firewall—before they reach the application. But I should probably give it a try!?!
Does anyone here have any experience with this?
THX
Jens
--
Jens Witzel
Universität Zürich
Zentrale Informatik
Pfingstweidstrasse 60B
CH-8005 Zürich
mail: jens....@uzh.ch
phone: +41 44 63 56777
http://www.zi.uzh.ch
Dear Jens,
By chance, I recently implemented such measures on our site because bots have lately become more and more aggressive. For example, for a few weeks now, a complete subnet identified as FB/Meta bots has been trying to scrape our site.
Up to 60 different IPs send requests at the same time, and they don't give up. We started by manually blocking them for a few days, but as soon as we unblocked them, they started again.
Thus, we have implemented rate limiting and automatic blocking of IPs.
I was planning to write a blog post about this, but I haven't had the time, so here's a summary.
Keep in mind that our policy is very aggressive, and may affect users and indexing.
First, remember that most of the backend traffic generated by bots comes through SSR (server-side rendering), so from the backend's point of view it comes from the IP of your frontend and not from the bots' IPs.
A single request to the frontend by a bot will trigger dozens of requests on the backend by the SSR (which is the actual problem). Also, consider that returning a 429 status code to the SSR engine may result in returning to the bot a broken page with 200 status code rather than a 429 (but I haven't investigated this further, this is a guess based on https://github.com/angular/angular/issues/50829).
In any case, you should consider using the "Real IP" module (https://nginx.org/en/docs/http/ngx_http_realip_module.html), e.g.:
set_real_ip_from <FRONTEND_IP>;
real_ip_header X-Forwarded-For;
real_ip_recursive on;
This way, $binary_remote_addr will contain the bot's IP rather than our frontend IP.
Notice that your proxy must set the appropriate headers (see proxy_set_header below).
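To picture what the three Real IP directives above do, here is a minimal Python sketch of the selection logic as I understand it from the module documentation: if the connecting peer is trusted, the client address is taken from X-Forwarded-For, walking from right to left and skipping trusted hops. The function name pick_client_ip and the example addresses are made up for illustration; this is a model of the behavior, not NGINX's actual code.

```python
import ipaddress


def pick_client_ip(peer, xff, trusted):
    """Model of set_real_ip_from + real_ip_recursive on (illustrative only).

    peer    -- address of the TCP peer (e.g. the frontend)
    xff     -- raw X-Forwarded-For header value
    trusted -- addresses or CIDR ranges listed in set_real_ip_from
    """
    nets = [ipaddress.ip_network(t) for t in trusted]

    def is_trusted(ip):
        addr = ipaddress.ip_address(ip)
        return any(addr in net for net in nets)

    # An untrusted peer is never allowed to override its own address.
    if not is_trusted(peer):
        return peer

    hops = [h.strip() for h in xff.split(",") if h.strip()]
    if not hops:
        return peer

    # Walk the header from right to left and stop at the first
    # address that is not one of our own proxies.
    for hop in reversed(hops):
        if not is_trusted(hop):
            return hop

    # Assumption: if every hop is trusted, fall back to the leftmost one.
    return hops[0]


print(pick_client_ip("10.0.0.5", "203.0.113.9", ["10.0.0.5"]))
```

The point of the trust check is that a bot connecting directly cannot spoof its address by sending its own X-Forwarded-For header.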
In our case, we are also counting requests per /24 subnet rather than per IP. To do so, we map $binary_remote_addr to another variable.
map $binary_remote_addr $binary_remote_addr_slash24 {
    "~^(?P<net>...).$" "$net";
}
limit_req_zone $binary_remote_addr_slash24 zone=abusers:10m rate=5r/s;
Notice that this may block legitimate users if they typically come from the same organization, but we are a low-traffic site, and typically only bots send lots of requests from the same subnet at the same time. You may need to whitelist your organization's subnet to avoid false positives.
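The per-/24 grouping described above can be pictured with a small Python sketch. For IPv4, $binary_remote_addr is the 4-byte packed address, and the map keeps the first three bytes as the key, so every address in the same /24 shares one counter. The helper name slash24_key and the sample addresses are illustrative assumptions, and the sketch covers IPv4 only.

```python
import socket


def slash24_key(ip: str) -> bytes:
    """Return the first three bytes of the packed IPv4 address.

    This mirrors what the named capture 'net' keeps in the map above:
    three bytes of network prefix, with the last byte dropped.
    """
    packed = socket.inet_aton(ip)  # 4 bytes for an IPv4 address
    return packed[:3]


same_net_a = slash24_key("192.0.2.17")
same_net_b = slash24_key("192.0.2.200")
other_net = slash24_key("192.0.3.17")

print(same_net_a == same_net_b)  # True: both count against one limit
print(same_net_a == other_net)   # False: a different /24 gets its own limit
```

Since the pattern expects exactly four bytes, 16-byte IPv6 addresses would presumably not match and would end up with an empty key, which is worth checking if your site is reachable over IPv6.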
We have also implemented different limits for the frontend and the backend:
location /dspace.server {
    limit_req zone=abusers burst=30 delay=10;
    limit_req_status 429;
    add_header Retry-After $retry_after always;
    proxy_pass ...;
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-Proto $scheme;
    proxy_set_header X-Forwarded-For $remote_addr;
    ...
}
location / {
    limit_req zone=abusers burst=20 delay=10;
    limit_req_status 429;
    add_header Retry-After $retry_after always;
    proxy_pass ...;
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-Proto $scheme;
    proxy_set_header X-Forwarded-For $remote_addr;
    ...
}
To set the Retry-After header we use the following map:
# Request a delay of 30 seconds before the next request
# if a 429 status code is returned
map $status $retry_after {
    default '';
    429 '30';
}
Also, notice that a single "normal" page request will trigger tens of requests (I have measured around 20 requests to the frontend, and 30 to the backend), so be careful when setting the "burst" parameter.
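To get a feeling for why the "burst" value matters, here is a deliberately simplified Python model of the two-stage limit (burst plus delay) for requests that arrive at practically the same moment. It assumes that up to "delay" excess requests pass immediately, the rest up to "burst" are slowed down, and anything beyond "burst" is rejected. It ignores the refill at 5r/s and the exact off-by-one behavior of NGINX, so treat the numbers as rough; the function name classify_burst is made up.

```python
def classify_burst(n_requests, burst, delay):
    """Split n near-simultaneous requests into (immediate, delayed, rejected).

    Simplified model: no token refill during the burst, and the request
    that is within the rate is not counted separately.
    """
    immediate = min(n_requests, delay)
    delayed = min(max(n_requests - delay, 0), burst - delay)
    rejected = max(n_requests - burst, 0)
    return immediate, delayed, rejected


# One page load hitting the backend with about 30 requests (burst=30, delay=10).
print(classify_burst(30, burst=30, delay=10))  # (10, 20, 0)

# Two visitors from the same /24 loading a page at the same moment.
print(classify_burst(60, burst=30, delay=10))  # (10, 20, 30)
```

In this model a single page load just fits, while two simultaneous page loads from the same subnet already produce 429 responses, which is why the per-/24 counting and the burst value have to be tuned together.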
Finally, we have implemented automatic IP blocking by using Fail2Ban (because, as I said, we are being "attacked" by complete subnets, and even reaching the point where a 429 status code is returned is costly in terms of CPU).
This is (a summary of) our "/etc/fail2ban/jail.d/nginx-limit-req.conf" (assuming a default Debian installation of Fail2Ban):
[nginx-limit-req]
enabled = true
# "bantime.increment" allows to use database for searching of previously banned ip's to increase a
# default ban time using special formula, default it is banTime * 1, 2, 4, 8, 16, 32...
bantime.increment = true
# "bantime" is the amount of time that a host is banned, integer in seconds or
# time abbreviation format (m - minutes, h - hours, d - days, w - weeks, mo - months, y - years).
# This is to consider as an initial time if bantime.increment gets enabled.
bantime = 2m
# A host is banned if it has generated "maxretry" during the last "findtime"
# seconds.
findtime = 30s
# "maxretry" is the number of failures before a host get banned.
maxretry = 10
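As a rough illustration of what "bantime.increment" means with the values above, this Python sketch lists the ban durations for repeated offenses, assuming the default multipliers 1, 2, 4, 8, ... mentioned in the comment and an initial ban of 2 minutes. It ignores any cap or randomization Fail2Ban may apply on top; the function name ban_durations is made up.

```python
def ban_durations(initial_seconds, bans):
    """Ban length in seconds for the 1st, 2nd, ... ban of the same IP,
    doubling each time (the default sequence quoted in the comment)."""
    return [initial_seconds * 2 ** i for i in range(bans)]


# bantime = 2m -> 120 seconds for the first ban
print(ban_durations(120, 6))  # [120, 240, 480, 960, 1920, 3840]
```

So an IP that keeps coming back goes from a 2-minute ban to more than an hour after six bans, while a legitimate user who trips the limit once is locked out only briefly.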
And, in summary, that's all. Also, as I mentioned, our settings are VERY strict, and this may block legitimate users and may affect indexing.
Use this advice at your own risk, and adjust the configuration parameters to your needs and hardware configuration.
Hope this helps, and if I have time to write it in a more structured way, I'll post it here.
Best regards,
Abel
-- Abel Gómez Llana, PhD ab...@gomez.llana.me https://abel.gomez.llana.me