
BrzI Channel

Apr 29, 2024, 2:29:48 PM
to AtoM Users
Hi folks,

I noticed my AtoM 2.6.4 instance has started using a lot of CPU time; please see the screenshots below.
The last time records were loaded was on April 11th.
The site works fine, but I would like to get to the bottom of this.
I looked at the SQL threads and they all looked good; not a single job hanging.
I would appreciate any suggestions.

Thanks

[Attachments: CPU.png, HTOP.png]

Dan Gillean

Apr 30, 2024, 7:45:08 AM
to ica-ato...@googlegroups.com
Hi there, 

I don't know for sure if it's related, but as a start I would recommend reviewing the Nginx access logs. At Artefactual we have recently seen a large increase in badly behaved bots and web crawlers repeatedly hitting pages and causing slowdowns. This thread has a query our team has used to check the access logs for such behavior: 
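The exact query from the linked thread isn't reproduced here, but a typical one-liner for tallying user agents in Nginx's default "combined" log format looks like the sketch below. The inline sample log is illustrative only; in practice you would point the `awk` at your real access log (often /var/log/nginx/access.log).

```shell
# Build a tiny sample log in Nginx's "combined" format, purely for illustration.
cat > /tmp/sample_access.log <<'EOF'
198.51.100.10 - - [30/Apr/2024:11:00:00 +0000] "GET / HTTP/1.1" 200 1234 "-" "ClaudeBot/1.0"
198.51.100.10 - - [30/Apr/2024:11:00:01 +0000] "GET /a HTTP/1.1" 200 512 "-" "ClaudeBot/1.0"
203.0.113.20 - - [30/Apr/2024:11:00:02 +0000] "GET /b HTTP/1.1" 200 512 "-" "bingbot/2.0"
EOF

# In the combined format the user agent is the 6th double-quote-delimited field;
# tally requests per user agent, busiest first.
awk -F'"' '{print $6}' /tmp/sample_access.log | sort | uniq -c | sort -rn
```

Run against a real log, this produces exactly the kind of ranked user-agent listing shown later in this thread.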
Please let us know what you find!

Cheers, 

Dan Gillean, MAS, MLIS
AtoM Program Manager
Artefactual Systems, Inc.
604-527-2056
@accesstomemory
he / him



BrzI Channel

Apr 30, 2024, 2:20:04 PM
to AtoM Users
Here is the output of that command. A couple of funny email addresses in here...

 221450 Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +clau...@anthropic.com)
 116694 Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) Chrome/116.0.1938.76 Safari/537.36
  44625 facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)
  38971 Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)
  36336 Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/)
   9950 Mozilla/5.0 (compatible; Bytespider; spider-...@bytedance.com) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.0.0 Safari/537.36
   3276 Mozilla/5.0 (compatible; SemrushBot/7~bl; +http://www.semrush.com/bot.html)
   2764 Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36
   2680 Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.67 Safari/537.36
   2652 Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36
   2638 Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36
   2531 Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.64 Safari/537.36 Edg/101.0.1210.47
   1366 Mozilla/5.0 (Linux; Android 5.0) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; Bytespider; spider-...@bytedance.com)
   1340 Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Safari/605.1.15 (Applebot/0.1; +http://www.apple.com/go/applebot)
   1081 Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0
    823 Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)
    605 Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/124.0.6367.60 Safari/537.36
    586 Python/3.8 aiohttp/3.9.5
    450 Go-http-client/1.1
    411 Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.6261.94 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

 11:15:53 up 4 days, 23:17,  1 user,  load average: 7.88, 4.44, 3.45
php-fpm processes 26

Thanks

BrzI Channel

Apr 30, 2024, 3:17:28 PM
to AtoM Users
I see you have a robots.yml file in the link you provided.
The ClaudeBot entry is probably the biggest contributor to my CPU cycles...
How do I enable that in my nginx sites-enabled conf file?
I've read about quite a few different ways of doing it, but none quite match my setup (that I could see).
Thanks

Jim Adamson

May 1, 2024, 7:24:25 AM
to ica-ato...@googlegroups.com
Hi,

robots.txt goes in the root of your site, so typically /usr/share/nginx/atom
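For reference, a minimal robots.txt along these lines might look like the sketch below. The bot names are taken from the log output earlier in the thread, not from any official AtoM example, and note that robots.txt compliance is entirely voluntary: well-behaved crawlers honour it, while the worst offenders often ignore it.

```
# /usr/share/nginx/atom/robots.txt (path per the note above)
User-agent: ClaudeBot
Disallow: /

User-agent: GPTBot
Disallow: /

# Everyone else: allowed, but asked to slow down.
# Crawl-delay is non-standard and only honoured by some crawlers.
User-agent: *
Crawl-delay: 10
```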

You'll probably need a bigger hammer though. What I've done is create the file /etc/nginx/conf.d/user-agent-rules.conf and populate it with:

map_hash_bucket_size 500;
map $http_user_agent $badagent {
        default         0;
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36"      1;
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36"      1;
        ~*bingbot      1;
}


The lines ending with 1; are the user agent strings you want to block. These can be literal strings or regular expressions.

Then in your server { } blocks — typically in the file /etc/nginx/sites-available/atom — you can add:

if ($badagent) {
    return 444;
}
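Putting the two pieces together, the check sits near the top of the server block. A sketch follows; the server_name and root values are placeholders, not taken from this thread, so substitute your own:

```nginx
server {
    listen 80;
    server_name atom.example.org;      # placeholder; use your real hostname
    root /usr/share/nginx/atom;

    # $badagent comes from the map in /etc/nginx/conf.d/user-agent-rules.conf.
    # 444 is Nginx-specific: it closes the connection without sending any
    # response to the client.
    if ($badagent) {
        return 444;
    }

    # ... your existing AtoM location blocks continue below ...
}
```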

Reload your nginx config with systemctl reload nginx for your changes to take effect.

I suggest you invent some user agent strings for testing:
  • add them to user-agent-rules.conf
  • systemctl reload nginx
  • then use curl -v -L -A "your-invented-user-agent-string" https://atom... to test that the server is behaving as expected, i.e. closing the connection without a response (Nginx's special 444 status; curl reports this as an empty reply)
You might find that you need to adjust the map_hash_bucket_size, as I did. You will know if, after reloading the nginx config, you see:

Job for nginx.service failed.
See "systemctl status nginx.service" and "journalctl -xe" for details.


I hope that helps.

Thanks, Jim



--
Jim Adamson
Systems Administrator/Developer
Facilities Management Systems
IT Services
LFA/023 | Harry Fairhurst building | University of York | Heslington | York | YO10 5DD

Dan Gillean

May 1, 2024, 8:00:03 AM
to ica-ato...@googlegroups.com
Hi Jim, 

Thanks so much for jumping in and sharing your configuration examples!

The GitHub repository with the bad-robots sample block I shared is part of an Ansible playbook that automates the deployment and configuration of Nginx, which our Support team uses for client AtoM deployments. It is very similar to the base example included in the documentation, but with a few more things configured, and instead of being created manually, Ansible handles the deployment. There are further details on its use, including how to configure protection against bots, in the README: 
You will also find a number of external Nginx configuration examples and articles linked in that README. Here is one more that our Support team told us they have been referencing and exploring a lot recently: 
For those using Apache, the same person has an Apache configuration example here for reference: 
Hopefully, between those links and Jim's helpful tips, you can get this configured. Let us know how it goes!

Dan Gillean, MAS, MLIS
AtoM Program Manager
Artefactual Systems, Inc.
604-527-2056
@accesstomemory
he / him

BrzI Channel

May 1, 2024, 3:12:01 PM
to AtoM Users
Until I have fully worked through the comments above, I have firewall-blocked the IPs of the worst offenders:

claudebot
bingbot
facebook

This has brought CPU usage down from an average of 90+% to single digits or low teens.
I am aware that the IPs will change, so yes, a proper bot protection scheme will need to be put in place.
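Since bot IPs rotate, it can help to re-derive the current worst offenders from the access log before refreshing firewall rules. A hedged sketch follows, using a tiny inline sample log; in practice, substitute your real access log path and feed the resulting IPs to your firewall of choice:

```shell
# Tiny sample log for illustration; replace with /var/log/nginx/access.log.
cat > /tmp/offenders.log <<'EOF'
198.51.100.10 - - [01/May/2024:10:00:00 +0000] "GET / HTTP/1.1" 200 100 "-" "ClaudeBot/1.0"
198.51.100.10 - - [01/May/2024:10:00:01 +0000] "GET /a HTTP/1.1" 200 100 "-" "ClaudeBot/1.0"
198.51.100.10 - - [01/May/2024:10:00:02 +0000] "GET /b HTTP/1.1" 200 100 "-" "ClaudeBot/1.0"
203.0.113.20 - - [01/May/2024:10:00:03 +0000] "GET /c HTTP/1.1" 200 100 "-" "bingbot/2.0"
EOF

# The client IP is the first whitespace-delimited field; list the top talkers.
# Each resulting IP can then be blocked, e.g. with `ufw deny from <ip>`.
awk '{print $1}' /tmp/offenders.log | sort | uniq -c | sort -rn | head -10
```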

Thanks for the help everyone. You can consider this resolved.