504 Gateway timeouts and concurrent connections with long running EAD exports

zonedef...@gmail.com

Jun 17, 2016, 2:23:32 PM
to ICA-AtoM Users
Good morning Atom User group;

We've noticed that our monitoring software fairly often reports that AtoM is not responding. It always starts responding again very soon, however (5-10 minute intervals).

This seems to happen sometimes when the following URL is accessed on our site:
<site-ip-redacted>/index.php/photographic-material;ead?sf_format=xml

I believe this is an EAD XML export of a large number of records (please correct me if I'm wrong).

The problem is that this particular export takes so long that NGINX was timing out (504 Gateway Timeout). I managed to at least stop the timeouts by increasing the "fastcgi_read_timeout" parameter (thank you, Artefactual, for posting that solution). The bigger problem is that the export seems to BLOCK other requests coming to the web server while it is being processed (up to 3-4 minutes). I think that is why the monitoring software is picking up outages: even simple queries are not answered until the long-running export has finished.

I can reproduce this problem at will on my DEV server.
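
For anyone who wants to reproduce the blocking behaviour described above, a minimal Python sketch along these lines might help. It starts the slow export in a background thread, then times an ordinary page request while the export is still running. The function names, the one-second head start, and the example host in the comment are illustrative assumptions, not part of AtoM.

```python
import threading
import time
import urllib.request


def fetch(url, timeout=300):
    """Download a URL and return how many seconds the request took."""
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=timeout) as response:
        response.read()
    return time.monotonic() - start


def time_probe_during_export(export_url, probe_url, fetch=fetch, head_start=1.0):
    """Time a simple request made while a slow export is in progress."""
    result = {}

    def run_export():
        result["export_seconds"] = fetch(export_url)

    export_thread = threading.Thread(target=run_export)
    export_thread.start()
    time.sleep(head_start)  # give the export time to reach the server first
    result["probe_seconds"] = fetch(probe_url)
    export_thread.join()
    return result


# Example call (hypothetical host name):
# time_probe_during_export(
#     "http://atom.example.org/index.php/photographic-material;ead?sf_format=xml",
#     "http://atom.example.org/",
# )
```

If the probe time comes out close to the export time, the simple request really is waiting behind the export rather than being served in parallel.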

Can anyone provide some insight into what is happening here, and how it might be fixed? Is there a way to configure NGINX to allow another AtoM thread to be spawned?

It doesn't seem like slow MySQL queries are to blame: I have turned on slow-query logging, and only rarely does a query take more than 0.5 seconds.

We are using AtoM version 2.2, Nginx 1.10.1, PHP 5.5.9 and MySQL 5.5.49 on Ubuntu 14.04. The host is running on VMware with 16GB of RAM and 4 vCPUs. I've tried moving the MySQL database to an ext3 disk, which did not help. I've also tried a few other things to get NGINX to spawn new threads for new requests, but none of them seemed to fix the problem.

Thank you for your consideration!

Brad, Library Systems Administrator

David at Artefactual

Jun 20, 2016, 5:31:30 PM
to ICA-AtoM Users, zonedef...@gmail.com
Hi Brad,

In general we recommend against increasing the PHP timeout limit for public AtoM sites because of the problem you are experiencing: allowing longer-running requests creates more load on the server. In many cases, requests for large EAD files (e.g. photographic-material;ead?sf_format=xml) are made by search crawlers, which exacerbates the problem. In a future version of AtoM we would like to move the expensive work of generating large EAD documents to a background process instead of generating the document each time it is requested, but this enhancement will require community sponsorship to become a reality.

In most cases where we've noticed significant slowdowns in AtoM performance, the bottleneck is MySQL maxing out the CPU. You can check whether this is the bottleneck in your case by running "uptime" on your MySQL server and checking the server load: for 4 CPU cores, a load of 4 or higher means that processes are waiting for CPU time. You can confirm that MySQL is the process using the majority of the CPU by using "top" or "htop" to check the resources being used by the mysqld process. While it's true that AtoM doesn't usually show very long query times, it does make a lot of queries, especially for long operations like generating large EAD finding aids.
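
The rule of thumb above (load at or above the number of cores means processes are waiting for CPU) can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and os.getloadavg() is available on Unix-like systems only.

```python
import os


def cpu_saturated(load_average, cpu_cores):
    """Return True when the load average is at or above the core count."""
    return load_average >= cpu_cores


def describe_load():
    """Summarize the 1, 5 and 15 minute load averages of this machine."""
    load_1min, load_5min, load_15min = os.getloadavg()  # Unix only
    cores = os.cpu_count() or 1
    status = "saturated" if cpu_saturated(load_1min, cores) else "ok"
    return (
        f"load {load_1min:.2f} / {load_5min:.2f} / {load_15min:.2f} "
        f"on {cores} cores: {status}"
    )


# Example: run print(describe_load()) on the MySQL server while an export is in progress.
```

As with "uptime", a single reading says little; the number is most useful when sampled while the slow export is actually running.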

I think it's unlikely that the number of available PHP processes is your bottleneck. Assuming you followed our installation instructions, the php-fpm "pm" settings listed there (e.g. pm.max_children, pm.start_servers) should allow up to 30 PHP processes to serve requests at the same time. However, you could try increasing "pm.max_children" to see if it helps (make sure to restart php5-fpm after changes). You can count the number of php-fpm processes running with "ps fauxwwww | grep -c php5-fp[m]".

You may also want to analyze your web logs - in many cases where we've seen significant slowdowns in AtoM it's due to search engine web crawlers making a lot of requests in a short amount of time - often tens of thousands of requests a day.  We've had good results with adding a robots.txt Crawl-delay to slow down requests (30 is a good initial value to try) and blocking particularly demanding or unwanted web crawlers.
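
As a rough starting point for that kind of log analysis, here is a small Python sketch that counts requests per user agent and separately counts requests for EAD exports. It assumes the access log uses the common "combined" log format, which may not match your Nginx configuration; the function and pattern names are illustrative only.

```python
import re
from collections import Counter

# Assumed "combined" log format:
# ip - user [time] "METHOD path protocol" status bytes "referer" "user agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def summarize_agents(lines):
    """Count all requests and EAD export requests per user agent."""
    all_requests = Counter()
    ead_requests = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that do not fit the assumed format
        agent = match.group("agent")
        all_requests[agent] += 1
        if ";ead" in match.group("path"):
            ead_requests[agent] += 1
    return all_requests, ead_requests


# Example (the log path is an assumption):
# with open("/var/log/nginx/access.log") as log:
#     totals, ead = summarize_agents(log)
#     print(totals.most_common(10))
```

Sorting the totals with most_common() should make the busiest crawlers stand out, and the second counter shows which of them are requesting the expensive exports.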

I hope that helps!


Best regards,
David

--

David Juhasz
Director, AtoM Technical Services
Artefactual Systems Inc.
www.artefactual.com

zonedef...@gmail.com

Jun 20, 2016, 7:11:25 PM
to ICA-AtoM Users, zonedef...@gmail.com
Thanks for your reply David;

The load on the server is generally not too much for it: right now the load in htop is listed as "0.02", which is not high. MySQL performance doesn't usually seem to be a problem for regular traffic. When running htop while executing that long-running export, I see MySQL and PHP both using a lot of processor time, which makes sense if AtoM is making many small queries.

I have noticed in the access logs many web crawlers coming to our site, and I've updated my robots.txt file as you recommend. I also have been reading about the ELK stack which would help with log file analysis - do you find that to be a helpful approach to analyzing AtoM's logs?
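
For reference, a robots.txt along the lines David suggested can be sanity-checked with Python's standard urllib.robotparser module. This is only a sketch: the Crawl-delay of 30 comes from David's suggestion, while "ExampleBadBot" is a hypothetical crawler name standing in for whichever bot one wants to block.

```python
import urllib.robotparser

# Illustrative robots.txt: slow all crawlers down, block one unwanted crawler.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 30

User-agent: ExampleBadBot
Disallow: /
"""


def load_rules(robots_text):
    """Parse robots.txt text into a RobotFileParser for querying."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)
export_path = "/index.php/photographic-material;ead?sf_format=xml"

print(rules.crawl_delay("Googlebot"))                 # delay requested of other crawlers
print(rules.can_fetch("ExampleBadBot", export_path))  # whether the blocked bot may fetch
```

Note that Crawl-delay is advisory and not every crawler honours it, so the access logs are still worth watching after the change.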

The PHP thread settings are set as the installation document listed above recommends.

I have set Nginx's PHP execution timeout down to 60s. Hopefully that strikes a balance between allowing the exports and keeping the site responsive to visitors.

Thanks David!

Brad

David at Artefactual

Jun 20, 2016, 7:28:52 PM
to ICA-AtoM Users, zonedef...@gmail.com
Hi Brad,

I haven't used ELK myself, but it looks like a very powerful and flexible stack for web analytics and visualization.  I've used https://piwik.org/ for web analytics, and it worked well for breaking out robot traffic vs. real users.  ELK sounds more flexible than Piwik, but it also seems more complicated to set up and configure.  I'd like to test out ELK in the future to really compare though.

I'm surprised you are having speed issues with a CPU load of 0.02! It certainly suggests that CPU is not your bottleneck. Have you checked the server resources while generating a large EAD finding aid? htop is great for real-time monitoring of CPU and RAM utilization, as well as for identifying which processes are using the most resources.


Cheers,
David