Managing repository server performance and stability issues, and bot traffic


Marijka Azzopardi

Sep 23, 2024, 8:06:20 AM
to DSpace Community

Hi all,

I'd like to ask about your experience in managing repository server performance and stability issues, particularly those caused by high bot traffic to your DSpace repository.

To provide some background, at UNSW Sydney, our DSpace7 repository has been experiencing an increase in performance and stability issues due to the heavy load being placed on our repository from several contributing factors, including increased bot and crawler traffic.

We plan to upgrade from DSpace v7.0 to v7.6.2 to optimise our server performance and gain access to vital bug fixes and functionality that address this, such as caching of server-side rendered pages. I am aware, though, that performance and scalability issues are still being reported by owners of DSpace v7.6.2 and v8 repositories and that, as a result, solutions are being prioritised for a future DSpace 9 release (tentatively Apr 2025).

In the interim, although this may have limited impact, we're looking to update our robots.txt "disallow" rules to prevent bot crawling of unnecessary repository pages, and to reduce the number of requests made to our server by directing 'compliant' search engine crawlers straight to repository metadata and files.
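For illustration, we have rules along these lines in mind (the hostname is a placeholder and the Disallow paths are typical DSpace 7 UI routes; they would need to be checked against a given install). The idea is to point compliant crawlers at the sitemaps and keep them off dynamic, high-cost pages:

```
# Illustrative only -- verify paths against your own DSpace 7 deployment
User-agent: *

# Send compliant crawlers straight to the sitemaps
Sitemap: https://repository.example.edu/sitemap_index.xml

# Keep crawlers out of dynamic, expensive-to-render pages
Disallow: /search
Disallow: /browse/
Disallow: /statistics
Disallow: /mydspace
```

Non-compliant bots ignore robots.txt entirely, of course, which is why we expect this to help only with the well-behaved crawlers.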

I would be very interested to know how your institution is managing server performance and stability issues: whether you have updated your robots.txt to direct crawler traffic and block bots, or have implemented any other solutions, e.g. integration with third-party software such as Redis (used by Jagiellonian University) to cache server-side rendered pages in DSpace, or a Content Delivery Network such as Cloudflare, etc.

Looking forward to hearing from you, and thanks in advance for your time!

Kind regards,

Marijka Azzopardi

Repository Librarian

Scholarly Communications & Repositories, UNSW Library
UNSW SYDNEY 2052

pierre...@bibl.ulaval.ca

Sep 24, 2024, 10:28:56 AM
to DSpace Community
Hi Marijka,

We experienced the same problem with bot traffic. It was so intense that our server was not able to handle it. We decided to use a commercial service offered by our university IT office, BIG-IP: https://www.f5.com/products/big-ip-services/advanced-firewall-manager

Basically, it filters all requests targeted at our DSpace server before they reach it. It required a few weeks of fine-tuning (during the first days, even the API calls from the Angular UI to the backend were blocked).

It solves the bot traffic problems, but as you pointed out, there are still performance issues in DSpace that will need to be tackled. 

Thanks!
Best, 
Pierre

Mehmet Demirel

Oct 3, 2024, 2:55:31 PM
to DSpace Community
As a result of our research, we found that the Bytespider bot was causing the excessive CPU usage. You will see this when you examine the traffic and logs on the firewall side.

This Bytespider bot (spider-feedback(at)bytedance.com) is very aggressive.
It constantly makes requests from China and Singapore.
Moreover, it never respects robots.txt and seems to hit the DSpace server every second of every day.
Bytedance appears to use a different IP every second, so we needed to block all of their IP ranges through the firewall.
Also, examine the Apache/Nginx logs. You can start by trying to block IP blocks starting with 47.128.
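To illustrate the log-examination step, here is a minimal sketch (hypothetical, not our actual tooling) that tallies requests per /16 IP prefix in a combined-format Apache/Nginx access log, so that aggressive ranges like 47.128.x.x stand out before you write firewall rules:

```python
import re
from collections import Counter

# Combined log format starts with the client IP
LOG_LINE = re.compile(r'^(\d+\.\d+\.\d+\.\d+) ')

def top_prefixes(log_lines, n=5):
    """Count requests per /16 IP prefix and return the n busiest."""
    counts = Counter()
    for line in log_lines:
        m = LOG_LINE.match(line)
        if m:
            a, b, _, _ = m.group(1).split('.')
            counts[f'{a}.{b}.'] += 1
    return counts.most_common(n)

# Synthetic sample lines for illustration
sample = [
    '47.128.12.3 - - [03/Oct/2024] "GET /browse HTTP/1.1" 200 512',
    '47.128.99.8 - - [03/Oct/2024] "GET /search HTTP/1.1" 200 128',
    '203.0.113.7 - - [03/Oct/2024] "GET / HTTP/1.1" 200 1024',
]
print(top_prefixes(sample))  # the 47.128. prefix dominates
```

In practice you would feed this the real access log (e.g. `open('/var/log/nginx/access.log')`) and cross-check the busiest prefixes against the user-agent field before blocking anything.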

SelenSoft Consulting
Turkey
