Crawlers keep crashing our Elasticsearch - how are others dealing with this?


Carolyn Sullivan

Jan 4, 2024, 3:27:52 PM
to AtoM Users
Hello,

We've been having issues with crawlers overloading our site and crashing Elasticsearch, resulting in browser errors such as

Elasticsearch error: Elastica\Exception\Connection\HttpException

Our excellent IT team configured the server to refuse connections from ByteDance crawlers, but we're still getting 9-10 crawler requests a second from GoogleOther and Googlebot. We added Google Analytics just before the holidays, and my IT folks are wondering if this isn't at least partly due to that (though admittedly, I need to read more of the documentation on whether those bots are necessary for Google Analytics to function). I was wondering if anyone else has had issues with bots crashing Elasticsearch, or information on whether there is a correlation with the use of Google Analytics.

Thank you so much for your time and expertise, and a very happy new year to those who celebrate,

Cheers,
Carolyn.

Jim Adamson

Jan 5, 2024, 8:01:05 AM
to ica-ato...@googlegroups.com
Hi Carolyn,

Yes, we see this Elasticsearch error from time to time, and typically need to restart Elasticsearch for search to recover. Have you configured a robots.txt file in the root of your site? This should at least make honourable bots go away, though I'm pretty sure ByteDance's won't.
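
(For what it's worth, on a systemd-based install that restart is just something like

sudo systemctl restart elasticsearch

with the service name adjusted if your distribution packages it differently.)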

Our robots.txt is as follows, which "allows" those listed and "disallows" any others:

User-agent: Googlebot
Disallow:

User-agent: DuckDuckBot
Disallow:

User-agent: BingBot
Disallow:

User-agent: Msnbot
Disallow:

User-agent: Yahoo-slurp
Disallow:

User-agent: *
Disallow: /


Just to reiterate: bots don't have to honour these rules; robots.txt is advisory only, with no guarantees. But I think it should help, so it's definitely worth doing.

In terms of aggressively disallowing bots in Nginx, there is the Nginx Ultimate Bad Bot Blocker, though I haven't tried it. As it's a comprehensive solution, it might generate false positives and block some real users, or have other unforeseen effects on AtoM. Another (perhaps safer) option would be to create some simple rules to catch persistent offenders to your site, using something like this guide.
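
As a rough sketch of the simple-rules approach (illustrative only; the user-agent patterns and rate limit below are examples, not recommendations), you could flag offending user agents and throttle everything else in your Nginx config:

# In the http context: flag known-offender user agents (example patterns).
map $http_user_agent $bad_bot {
    default         0;
    ~*bytespider    1;
    ~*petalbot      1;
}

# Limit each client IP to 5 requests/second (shared-memory zone).
limit_req_zone $binary_remote_addr zone=crawlers:10m rate=5r/s;

server {
    # ... your existing AtoM server block ...

    location / {
        # Refuse flagged bots outright.
        if ($bad_bot) {
            return 403;
        }

        # Allow short bursts, then answer further requests with 429.
        limit_req zone=crawlers burst=20 nodelay;
        limit_req_status 429;

        # ... your existing proxy/PHP-FPM config for AtoM ...
    }
}

The nice thing about limit_req is that it caps aggressive crawlers without needing to identify them by name, so even bots that lie about their user agent get slowed down.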

I really don't think Google Analytics would cause the problems you're experiencing. The GA JavaScript code runs on legitimate clients' browsers and wouldn't add much, if any, load on the AtoM web server. Most (all?) bots won't be running JavaScript at all, but the sheer volume of their requests can consume resources to the point of causing outages.

I hope that helps a bit.

Thanks, Jim



--
Jim Adamson
Systems Administrator/Developer
Facilities Management Systems
IT Services
LFA/023 | Harry Fairhurst building | University of York | Heslington | York | YO10 5DD
