Hi all:
Well, things have *mostly* been going smoothly since the upgrade.
However, last night around 8:20 Eastern time, our database server ran out of memory and killed off all existing database connections. This is not particularly new behaviour: a quasi-denial-of-service attack (poorly behaved robots crawling our site extremely rapidly, either trying to index it or looking for vulnerabilities) increases the number of requests to the database until it runs out of memory and kills off all database connections.
In the past, this state would have been almost immediately obvious, as common actions such as running a search query would have resulted in a server error.
Now, however, running a search query in this state results in a seemingly successful "0 results" message, rather than reporting a server error. The scripts that currently monitor the Evergreen services don't see a server error and therefore don't report a problem to me.
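To illustrate the kind of check that would catch this, here is a minimal sketch in Python. It assumes the monitor runs a "canary" search that should always return at least one hit, and that something else has already fetched the page and extracted the HTTP status and hit count. The names `ProbeResult` and `classify_probe` are hypothetical and are not part of Evergreen or of our current scripts.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProbeResult:
    """Outcome of one canary search (hypothetical structure)."""
    http_status: int
    hit_count: Optional[int]  # None if the hit count could not be parsed


def classify_probe(result: ProbeResult, min_expected_hits: int = 1) -> str:
    """Classify a canary search so that an empty result is not mistaken for success."""
    if result.http_status >= 500:
        return "server_error"       # the old, obvious failure mode
    if result.http_status != 200:
        return "unexpected_status"
    if result.hit_count is None:
        return "unparseable"
    if result.hit_count < min_expected_hits:
        # HTTP 200 with "0 results" for a query known to have hits:
        # the failure mode described above.
        return "suspicious_empty"
    return "ok"
```

The point of the sketch is only that a 200 response is not enough evidence of health; the monitor also has to know what a correct answer looks like for at least one query.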
Determining the state of the system is even more confusing because a subset of the services (those coded in C) automatically reconnect to the database and continue working properly--so you can successfully log into the staff client, for example. But those services that have been coded in Perl do not successfully reconnect.
In the short term, I plan to improve our system-monitoring scripts so that they properly detect this problem and report it to me, so that I can manually restart the services if needed. In the slightly longer run, I hope to make the scripts themselves restart the services if they detect that the condition has continued for, say, five minutes. Ultimately, I'm hopeful that Evergreen's Perl services can be taught to try reconnecting to the database instead of simply hanging on to the no-longer-functioning connection, as that would be a much more robust way of handling the underlying error.
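The "restart only if the condition has continued for five minutes" idea could be sketched roughly as follows. This is an illustration under assumptions, not the actual monitoring script: `RestartPolicy` is a hypothetical name, the health check itself is left to the caller, and the timestamps are plain seconds so the logic can be tested without waiting.

```python
from typing import Optional


class RestartPolicy:
    """Decide when a continuous failure has lasted long enough to restart services."""

    def __init__(self, grace_seconds: float = 300.0) -> None:
        self.grace_seconds = grace_seconds          # five minutes by default
        self.failing_since: Optional[float] = None  # start of the current failure run

    def observe(self, healthy: bool, now: float) -> bool:
        """Record one health check; return True if a restart is now warranted."""
        if healthy:
            # Any successful check ends the failure run.
            self.failing_since = None
            return False
        if self.failing_since is None:
            self.failing_since = now
        return now - self.failing_since >= self.grace_seconds
```

A monitoring loop would call `observe()` after each check, passing something like `time.monotonic()` as `now`, and trigger the service restart (and a notification) when it returns True. Requiring an unbroken run of failures keeps a single slow response from causing a restart.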
As for the poorly behaved robots, in the past we have tried blocking their IP addresses, specific search queries, or user agents. But it's a game of whack-a-mole: there are always more that pop up. And ideally Evergreen should just handle the pressure of many simultaneous requests more gracefully.
Thanks to all those who reported problems this morning, and my apologies for taking until 9:30 to determine what the problem was and take action.
Best,
Dan
--
Dan Scott
Associate Librarian / Bibliothécaire agrégé
Laurentian University / Université Laurentienne