Nginx error (110: Connection timed out)

TJ Keemon

Jul 30, 2015, 3:50:06 PM

to Open edX operations

Hey everybody-

I'm running into a bit of trouble with a customer. Studio is crashing every few minutes during course authoring. The logs are showing the following upstream timeout error:

2015/07/30 19:13:14 [error] 30216#0: *5 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 12.34.59.108, server: ~^((stage|prod)-)?studio.*, request: "GET /xblock/outline/block-v1:org+dwnc01+2015+type@chapter+block@ea1a130c5e3e4fb89b836da10fe8eded HTTP/1.1", upstream: "http://127.0.0.1:8010/xblock/outline/block-v1:org+dwnc01+2015+type@chapter+block@ea1a130c5e3e4fb89b836da10fe8eded", host: "studio.sitename.com", referrer: "http://studio.sitename.com/course/course-v1:org+dwnc01+2015"

and the corresponding error from /edx/var/log/supervisor/cmstderr.log:

2015-07-30 15:11:42 [1684] [CRITICAL] WORKER TIMEOUT (pid:29641)
2015-07-30 15:11:42,114 CRITICAL 1684 [gunicorn.error] glogging.py:204 - WORKER TIMEOUT (pid:29641)
2015-07-30 15:11:42 [1684] [CRITICAL] WORKER TIMEOUT (pid:29641)
2015-07-30 15:11:42,142 CRITICAL 1684 [gunicorn.error] glogging.py:204 - WORKER TIMEOUT (pid:29641)
2015-07-30 15:11:42 [31118] [INFO] Booting worker with pid: 31118
2015-07-30 15:11:42,148 INFO 31118 [gunicorn.error] glogging.py:213 - Booting worker with pid: 31118
2015-07-30 15:14:39 [1684] [CRITICAL] WORKER TIMEOUT (pid:30245)
2015-07-30 15:14:39,320 CRITICAL 1684 [gunicorn.error] glogging.py:204 - WORKER TIMEOUT (pid:30245)
2015-07-30 15:14:39 [1684] [CRITICAL] WORKER TIMEOUT (pid:30245)
2015-07-30 15:14:39,336 CRITICAL 1684 [gunicorn.error] glogging.py:204 - WORKER TIMEOUT (pid:30245)
2015-07-30 15:14:39 [31977] [INFO] Booting worker with pid: 31977

I've tried tweaking the Nginx settings:

uwsgi_read_timeout
proxy_connect_timeout
proxy_read_timeout

in /edx/app/nginx/sites-available/cms

based on what I came across here:

http://stackoverflow.com/questions/16141610/nginx-timeouts-when-uwsgi-takes-long-to-process-request
http://stackoverflow.com/questions/6816215/gunicorn-nginx-timeout-problem
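
One thing worth noting here: the WORKER TIMEOUT lines in cmstderr.log are written by gunicorn, which has its own `timeout` setting (30 seconds by default) that is independent of Nginx's proxy timeouts. Raising the Nginx values alone would not stop gunicorn from killing a slow worker. Gunicorn config files are plain Python, so a minimal sketch might look like the following. It assumes the CMS gunicorn process reads such a file (where that file lives on this install is not shown in the thread), and the value 300 is only illustrative.

```python
# Hypothetical gunicorn config for the CMS; the values are illustrative,
# not taken from this installation.

# Address taken from the "upstream:" field of the Nginx error above.
bind = "127.0.0.1:8010"

# Seconds a worker may stay silent before the master kills and restarts it
# (this is what gets logged as "WORKER TIMEOUT"). Gunicorn's default is 30.
timeout = 300
```

If this value is raised, Nginx's proxy_read_timeout (60 seconds by default) probably needs to be at least as large, otherwise Nginx gives up first and logs the 110 error.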

Has anybody else encountered this? I'm still trying to track down the exact actions that cause Nginx to stop responding, but I can't find any pattern based on the error logs.
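
One way to look for a pattern is to tally the upstream timeouts in the Nginx error log by request path. The following is a rough sketch only: the function name `count_timeouts` and the choice to group by the first two path segments are assumptions made for illustration, and the regular expression is based solely on the single error line quoted above.

```python
import re
from collections import Counter

# Pulls the HTTP method and path out of an Nginx "upstream timed out" line.
TIMEOUT_RE = re.compile(
    r'upstream timed out .*?request: "(?P<method>[A-Z]+) (?P<path>\S+) HTTP/[\d.]+"'
)


def count_timeouts(lines):
    """Count upstream timeouts per (method, URL prefix)."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match is None:
            continue  # not an upstream timeout entry
        # Keep only the first two path segments (e.g. /xblock/outline) so
        # that requests for different blocks are grouped together.
        segments = match.group("path").split("/")[1:3]
        prefix = "/" + "/".join(segments)
        counts[(match.group("method"), prefix)] += 1
    return counts
```

Passing an open file handle for the error log to `count_timeouts` and printing `.most_common()` on the result would show which Studio endpoints time out most often.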

I'm also seeing this warning pop up in the LMS logs. I'm not sure whether it's related:

Jul 30 19:11:51 body-mind-server [service_variant=lms][xblock.plugin][env:sandbox] WARNING [body-mind-server 3208] [plugin.py:138] - Unable to load XBlockAside 'thumbs_aside'
Traceback (most recent call last):
  File "/edx/app/edxapp/venvs/edxapp/src/xblock/xblock/plugin.py", line 136, in load_classes
    yield (class_.name, cls._load_class_entry_point(class_))
  File "/edx/app/edxapp/venvs/edxapp/src/xblock/xblock/plugin.py", line 73, in _load_class_entry_point
    class_ = entry_point.load()
  File "/edx/app/edxapp/venvs/edxapp/local/lib/python2.7/site-packages/pkg_resources.py", line 2092, in load
    raise ImportError("%r has no %r attribute" % (entry,attr))
ImportError: <module 'sample_xblocks.thumbs' from '/edx/app/edxapp/venvs/edxapp/src/xblock-sdk/sample_xblocks/thumbs/__init__.py'> has no 'ThumbsAside' attribute

Thanks.

-TJ

TJ Keemon

Jul 31, 2015, 1:06:29 PM

to Open edX operations, kee...@gmail.com

A bit more info as it trickles in: Studio seems to crash when the sections of a course are renamed and/or reordered.

David Baumgold

Jul 31, 2015, 4:01:26 PM

to Open edX operations, kee...@gmail.com

I'm not sure if this is helpful, but did you recently try to upgrade your installation? I know that Feanil discovered a problem with Celery workers blocking indefinitely, and suggested that upgrading could fix it, but that you need to kill all your Celery workers pre-emptively for them to pick up the fix. See this page: https://openedx.atlassian.net/wiki/display/OpenOPS/Potential+Problems+Migrating+from+Birch+to+Cypress

DB

TJ Keemon

Aug 5, 2015, 10:54:16 AM

to Open edX operations, kee...@gmail.com

I'm still on Birch.1, so I don't think that's the issue.

I have a video of our customer demonstrating how he's causing the error. I'm going to spend some time trying to reproduce it locally, then I'll post any new info or solutions I come up with.