Open edX re-run hangs?

1,866 views
Skip to first unread message

Ronny Cook

unread,
Mar 2, 2015, 10:00:34 PM3/2/15
to opene...@googlegroups.com
Short version: When re-running an imported course (or the Demo course) on a new Open edX install, attempting a course re-run returns a server error, then the re-run shows up on the console but runs indefinitely. Judging from the logs, this seems to be some form of internal communication error (see cms/edx.log extract below).

Can anyone suggest where I should look to possibly fix this?

Now the long version:

I've been trying to get Open edX running, with mixed results. Initially I used an EC2 instance with a preconfigured AMI; this required half a dozen or so fixes and workaround during the build process. I then tried a new install from scratch on an Ubuntu 12.04 instance. The install for this went much more smoothly, but in the end I wound up with exactly the same symptoms. I have made some changes to the CMS and LMS JSON files (mainly adding host names and fixing email addresses) and have done a manual theme install using the standard set of instructions.

The basic problem is: I can import a course and view it; I can also view the demo course included with the base install. However, when I try to re-run a course, the system throws a server error (as shown on the re-run page) and the new re-run course is then shown on the main console with "Configuring re-run..." Rather than terminating after 5-10 minutes, this continues indefinitely (72+ hours at this point).

This seems to suggest I've screwed up something in my config, but I haven't been able to find anything

My interpretation of the following is that it's trying to communicate with something (via AMPQ?) to get the rerun processed but that something isn't communicating. All daemons seem to be running, although "certs" seems to restart on a regular basis.

Snippets of various log files follow...

$ sudo /edx/bin/supervisorctl -c /edx/etc/supervisord.conf status
certs                            RUNNING   pid 13571, uptime 0:00:11
edxapp:cms                       RUNNING   pid 1311, uptime 3 days, 20:43:16
edxapp:lms                       RUNNING   pid 1310, uptime 3 days, 20:43:16
edxapp_worker:cms_default_4      RUNNING   pid 4536, uptime 3 days, 20:23:54
edxapp_worker:cms_high_1         RUNNING   pid 4505, uptime 3 days, 20:23:56
edxapp_worker:cms_low_3          RUNNING   pid 4482, uptime 3 days, 20:23:57
edxapp_worker:lms_default_3      RUNNING   pid 4504, uptime 3 days, 20:23:56
edxapp_worker:lms_high_4         RUNNING   pid 4547, uptime 3 days, 20:23:54
edxapp_worker:lms_high_mem_2     RUNNING   pid 4483, uptime 3 days, 20:23:57
edxapp_worker:lms_low_1          RUNNING   pid 4527, uptime 3 days, 20:23:55
forum                            RUNNING   pid 1305, uptime 3 days, 20:43:16
xqueue                           RUNNING   pid 1308, uptime 3 days, 20:43:16
xqueue_consumer                  RUNNING   pid 1304, uptime 3 days, 20:43:16

supervisor/certstdout.log:
2015-02-27 09:08:40,987 INFO 23570 [requests.packages.urllib3.connectionpool] connectionpool.py:180 - Starting new HTTP connection (1): localhost
2015-02-27 09:08:58,073 CRITICAL 23570 [openedx_certificates.queue_xqueue] queue_xqueue.py:62 - Unable to get queue length: No JSON object could be decoded

supervisor/certstderr.log:
PdfReadWarning: Xref table not zero-indexed. ID numbers for objects will not be corrected. [pdf.py:1503]
Traceback (most recent call last):
  File "/edx/app/certs/certificates/certificate_agent.py", line 197, in <module>
    main()
[...]
  File "/usr/lib/python2.7/json/decoder.py", line 384, in raw_decode
    raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded
PdfReadWarning: Xref table not zero-indexed. ID numbers for objects will not be corrected. [pdf.py:1503]

nginx/error.log:
2015/02/27 05:47:43 [error] 11246#0: *4 upstream prematurely closed connection while reading response header from upstream, client: 127.0.0.1, server: , request: "GET /xqueue/get_queuelen/?queue_name=certificates HTTP/1.1", upstream: "http://127.0.0.1:8040/xqueue/get_queuelen/?queue_name=certificates", host: "localhost:18040"
2015/02/27 05:47:53 [error] 11246#0: *13 connect() failed (111: Connection refused) while connecting to upstream, client: 127.0.0.1, server: , request: "POST /xqueue/login/ HTTP/1.1", upstream: "http://127.0.0.1:8040/xqueue/login/", host: "localhost:18040"

From cms/edx.log:
Feb 27 05:47:26 localhost [service_variant=cms][gunicorn.error][env:sandbox] INFO [online  11341] [glogging.py:213] - Listening at: http://127.0.0.1:8010 (11341)
Feb 27 05:47:26 localhost [service_variant=cms][gunicorn.error][env:sandbox] INFO [online  11341] [glogging.py:213] - Using worker: sync
Feb 27 05:47:26 localhost [service_variant=cms][gunicorn.error][env:sandbox] INFO [online  11594] [glogging.py:213] - Booting worker with pid: 11594
[ several more worker process boots]
Feb 27 05:57:53 localhost [service_variant=cms][edxmako.shortcuts][env:sandbox] WARNING [online  11597] [shortcuts.py:56] - Cannot find corresponding link for name: ROOT
Feb 27 05:57:53 localhost [service_variant=cms][edxmako.shortcuts][env:sandbox] WARNING [online  11597] [shortcuts.py:56] - Cannot find corresponding link for name: TOS
Feb 27 05:57:53 localhost [service_variant=cms][edxmako.shortcuts][env:sandbox] WARNING [online  11597] [shortcuts.py:56] - Cannot find corresponding link for name: PRIVACY
[...]
Feb 27 05:59:45 localhost [service_variant=cms][root][env:sandbox] ERROR [online  11594] [exceptions.py:9] - Uncaught exception from <class 'django.core.handlers.wsgi.WSGIHandler'>
Traceback (most recent call last):
  File "/edx/app/edxapp/venvs/edxapp/local/lib/python2.7/site-packages/django/core/handlers/base.py", line 109, in get_response
    response = callback(request, *callback_args, **callback_kwargs)
  File "/edx/app/edxapp/venvs/edxapp/local/lib/python2.7/site-packages/django/contrib/auth/decorators.py", line 20, in _wrapped_view
    return view_func(request, *args, **kwargs)
  File "/edx/app/edxapp/edx-platform/cms/djangoapps/contentstore/views/course.py", line 256, in course_handler
    return _create_or_rerun_course(request)
  File "/edx/app/edxapp/edx-platform/common/djangoapps/util/json_request.py", line 27, in parse_json_into_request
    return view_function(request, *args, **kwargs)
[...]
  File "/edx/app/edxapp/venvs/edxapp/local/lib/python2.7/site-packages/kombu/transport/pyamqp.py", line 115, in establish_connection
    conn = self.Connection(**opts)
  File "/edx/app/edxapp/venvs/edxapp/local/lib/python2.7/site-packages/amqp/connection.py", line 151, in __init__
    (10, 30),  # tune
  File "/edx/app/edxapp/venvs/edxapp/local/lib/python2.7/site-packages/amqp/abstract_channel.py", line 69, in wait
    self.channel_id, allowed_methods)
  File "/edx/app/edxapp/venvs/edxapp/local/lib/python2.7/site-packages/amqp/connection.py", line 204, in _wait_method
    self.method_reader.read_method()
  File "/edx/app/edxapp/venvs/edxapp/local/lib/python2.7/site-packages/amqp/method_framing.py", line 192, in read_method
    raise m
error: [Errno 104] Connection reset by peer
Feb 27 05:59:45 localhost [service_variant=cms][django.request][env:sandbox] ERROR [online  11594] [base.py:213] - Internal Server Error: /course/

[...]

$ netstat -an | grep tcp.*LISTEN
tcp        0      0 0.0.0.0:80              0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:4369            0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:15672           0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:55672           0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:18040           0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:25              0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:18010           0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.1:8000          0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:18080           0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:8099            0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.1:5672          0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.1:8040          0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.1:4040          0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:43209           0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:27017           0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.1:9001          0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.1:8010          0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.1:3306          0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.1:9002          0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.1:11211         0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:18060           0.0.0.0:*               LISTEN
tcp6       0      0 :::9200                 :::*                    LISTEN
tcp6       0      0 :::9300                 :::*                    LISTEN
tcp6       0      0 :::22                   :::*                    LISTEN
tcp6       0      0 :::25                   :::*                    LISTEN

Eugene Medvedev

unread,
Mar 3, 2015, 5:39:01 AM3/3/15
to opene...@googlegroups.com
On Tuesday, 3 March 2015 06:00:34 UTC+3, Ronny Cook wrote:
My interpretation of the following is that it's trying to communicate with something (via AMPQ?) to get the rerun processed but that something isn't communicating. All daemons seem to be running, although "certs" seems to restart on a regular basis.

...wait, I think I know this one. A shot in the dark, sure, but it'll be easy to confirm if that is the reason. This is from my list of edX gotchas:

rabbitmq goes insane if /etc/hosts contains a line like 127.0.0.1 localhost real-host-name. Symptoms are odd installation/update failures due to failing to start up services. To get it to work, /etc/hosts must look like this:

127.0.0.1 localhost 
127.0.0.1 real-host-name 

Ronny Cook

unread,
Mar 3, 2015, 9:00:14 PM3/3/15
to opene...@googlegroups.com

This was indeed how I had my /etc/hosts configured, but changing it doesn’t seem to have helped… the certs server is restarting every minute or so as well, which I suspect should not be happening. I haven’t configured SSL at all, if that matters.

cms/edx.log:

[… same leadup as last time …]

  File "/edx/app/edxapp/venvs/edxapp/local/lib/python2.7/site-packages/amqp/connection.py", line 204, in _wait_method

    self.method_reader.read_method()

  File "/edx/app/edxapp/venvs/edxapp/local/lib/python2.7/site-packages/amqp/method_framing.py", line 192, in read_method

    raise m


supervisor/certstderr.log has:

Traceback (most recent call last):

  File "/edx/app/certs/certificates/certificate_agent.py", line 197, in <module>

    main()

  File "/edx/app/certs/certificates/certificate_agent.py", line 64, in main

    if manager.get_length() == 0:

  File "/edx/app/certs/certificates/openedx_certificates/queue_xqueue.py", line 57, in get_length

    response = json.loads(request.text)

  File "/usr/lib/python2.7/json/__init__.py", line 326, in loads

    return _default_decoder.decode(s)

  File "/usr/lib/python2.7/json/decoder.py", line 366, in decode

    obj, end = self.raw_decode(s, idx=_w(s, 0).end())

  File "/usr/lib/python2.7/json/decoder.py", line 384, in raw_decode

    raise ValueError("No JSON object could be decoded")

ValueError: No JSON object could be decoded

 

It looks like the certs server is crashing and restarting…

supervisor/supervisord.log:

2015-03-04 01:01:30,355 INFO spawned: 'certs' with pid 3982

2015-03-04 01:01:32,030 INFO success: certs entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)

2015-03-04 01:01:48,140 INFO exited: certs (exit status 1; not expected)

 

xqueue/edx.log:

Mar  4 01:18:09 online [service_variant=xqueue][pika.adapters.base_connection][env:sandbox] ERROR [online  1461] [base_connection.py:260] - Socket Error on fd 13: 104

Mar  4 01:18:13 online [service_variant=xqueue][queue.management.commands.run_consumer][env:sandbox] INFO [online  1302] [run_consumer.py:118] -  [182] Worker failed

Mar  4 01:18:13 online [service_variant=xqueue][queue.management.commands.run_consumer][env:sandbox] INFO [online  1302] [run_consumer.py:124] -  [183] Starting worker

Mar  4 01:18:13 online [service_variant=xqueue][queue.consumer][env:sandbox] INFO [online  6130] [consumer.py:270] -  [183] Starting consumer for queue edX-Open_DemoX

Mar  4 01:18:13 online [service_variant=xqueue][pika.adapters.base_connection][env:sandbox] ERROR [online  1461] [base_connection.py:260] - Socket Error on fd 13: 104

 

supervisor/xqueue_consumertderr.log:

Process Worker-233:

Traceback (most recent call last):

  File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap

    self.run()

  File "/edx/app/xqueue/xqueue/queue/consumer.py", line 275, in run

    self.connection.ioloop.start()

  File "/edx/app/xqueue/venvs/xqueue/src/pika/pika/adapters/select_connection.py", line 101, in start

    self.poller.start()

  File "/edx/app/xqueue/venvs/xqueue/src/pika/pika/adapters/select_connection.py", line 385, in start

    self.poll()

  File "/edx/app/xqueue/venvs/xqueue/src/pika/pika/adapters/select_connection.py", line 440, in poll

    self._handler(fileno, event, write_only=write_only)

  File "/edx/app/xqueue/venvs/xqueue/src/pika/pika/adapters/base_connection.py", line 283, in _handle_events

    self._handle_read()

  File "/edx/app/xqueue/venvs/xqueue/src/pika/pika/adapters/base_connection.py", line 299, in _handle_read

    return self._handle_error(error)

  File "/edx/app/xqueue/venvs/xqueue/src/pika/pika/adapters/base_connection.py", line 263, in _handle_error

    self._handle_disconnect()

  File "/edx/app/xqueue/venvs/xqueue/src/pika/pika/adapters/base_connection.py", line 212, in _handle_disconnect

    self._adapter_disconnect()

  File "/edx/app/xqueue/venvs/xqueue/src/pika/pika/adapters/base_connection.py", line 126, in _adapter_disconnect

    self._check_state_on_disconnect()

 File "/edx/app/xqueue/venvs/xqueue/src/pika/pika/adapters/base_connection.py", line 141, in _check_state_on_disconnect

    raise exceptions.ProbableAuthenticationError

ProbableAuthenticationError

 

nginx/access.log seems to show an error when querying the queue:

- - 127.0.0.1 - edx [04/Mar/2015:01:21:47 +0000]  "POST /xqueue/login/ HTTP/1.1" 200 72 0.063 "-" "python-requests/2.3.0 CPython/2.7.3 Linux/3.2.0-69-virtual"

- - 127.0.0.1 - edx [04/Mar/2015:01:22:04 +0000]  "GET /xqueue/get_queuelen/?queue_name=certificates HTTP/1.1" 500 28 17.018 "-" "python-requests/2.3.0 CPython/2.7.3 Linux/3.2.0-69-virtual"

 

For completeness, the first two lines of /etc/hosts:

127.0.0.1 localhost

127.0.0.1 online online.it.mydomain

--
You received this message because you are subscribed to a topic in the Google Groups "Open edX operations" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/openedx-ops/EHYvndwmHZI/unsubscribe.
To unsubscribe from this group and all its topics, send an email to openedx-ops...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Eugene Medvedev

unread,
Mar 4, 2015, 1:27:17 AM3/4/15
to opene...@googlegroups.com
On Wednesday, 4 March 2015 05:00:14 UTC+3, Ronny Cook wrote:

Mar  4 01:18:13 online [service_variant=xqueue][pika.adapters.base_connection][env:sandbox] ERROR [online  1461] [base_connection.py:260] - Socket Error on fd 13: 104


This one pretty clearly indicates that xqueue can't connect to rabbitmq, and the most obvious cause would be that rabbitmq isn't listening... What happens if you run rabbitmqctl cluster_status or rabbitmqctl list_queues -- or in general try to directly communicate with it?


Ronny Cook

unread,
Mar 4, 2015, 2:19:24 AM3/4/15
to opene...@googlegroups.com

Rabbitmqctl does return results, but to my untrained eye they don’t look like good ones. The rabbitmq logs are showing “edx” and “celery” have invalid credentials.

 

How and where are those set ? I couldn’t find anything in the ansible playbook setting any usernames or passwords.

 

ubuntu@online:~$ sudo rabbitmqctl cluster_status

Cluster status of node rabbit@online ...

[{nodes,[{disc,[rabbit@online]}]},

{running_nodes,[rabbit@online]},

{partitions,[]}]

...done.

ubuntu@online:~$ sudo rabbitmqctl list_queues

Listing queues ...

...done.

 

The latter should return at least one queue, correct?

 

Extracts from /var/log/rabbitmq/rab...@online.log:

 

_____________________________________________________________________________________

=ERROR REPORT==== 4-Mar-2015::06:41:42 ===

closing AMQP connection <0.9173.1> (127.0.0.1:33251 -> 127.0.0.1:5672):

{handshake_error,starting,0,

                 {amqp_error,access_refused,

                             "PLAIN login refused: user 'edx' - invalid credentials",

                             'connection.start_ok'}}

 

=INFO REPORT==== 4-Mar-2015::06:41:43 ===

accepting AMQP connection <0.9177.1> (127.0.0.1:33252 -> 127.0.0.1:5672)

=ERROR REPORT==== 4-Mar-2015::06:42:01 ===

closing AMQP connection <0.9200.1> (127.0.0.1:33262 -> 127.0.0.1:5672):

{handshake_error,starting,0,

                 {amqp_error,access_refused,

                             "AMQPLAIN login refused: user 'celery' - invalid credentials",

                             'connection.start_ok'}}

 

=ERROR REPORT==== 4-Mar-2015::06:42:01 ===

closing AMQP connection <0.9203.1> (127.0.0.1:33264 -> 127.0.0.1:5672):

{handshake_error,starting,0,

                 {amqp_error,access_refused,

                             "AMQPLAIN login refused: user 'celery' - invalid credentials",

                             'connection.start_ok'}}

 

=INFO REPORT==== 4-Mar-2015::06:42:01 ===

accepting AMQP connection <0.9224.1> (127.0.0.1:33270 -> 127.0.0.1:5672)

_____________________________________________________________________________________

 

So it looks like access is being refused due to faulty credentials – but where are those credentials set?

 

                                …Ronny

--

Eugene Medvedev

unread,
Mar 4, 2015, 2:40:44 AM3/4/15
to opene...@googlegroups.com
On Wednesday, 4 March 2015 10:19:24 UTC+3, Ronny Cook wrote:

Rabbitmqctl does return results, but to my untrained eye they don’t look like good ones. The rabbitmq logs are showing “edx” and “celery” have invalid credentials.

Yep, these aren't good. For what it's worth, here's what good ones look like on a live, working system:

root@new-edx:~# rabbitmqctl cluster_status
Cluster status of node 'rabbit@new-edx' ...
[{nodes,[{disc,['rabbit@new-edx']}]},
 {running_nodes,['rabbit@new-edx']},
 {partitions,[]}]
...done.
root@new-edx:~# rabbitmqctl list_queues
Listing queues ...
certificates    0
edX-Open_DemoX  0
edx.cms.core.default    0
edx.cms.core.default.%h.celery.pidbox   0
edx.cms.core.high       0
edx.cms.core.high.%h.celery.pidbox      0
edx.cms.core.low        0
edx.cms.core.low.%h.celery.pidbox       0
edx.lms.core.default    0
edx.lms.core.default.%h.celery.pidbox   0
edx.lms.core.high       0
edx.lms.core.high.%h.celery.pidbox      0
edx.lms.core.high_mem   0
edx.lms.core.high_mem.%h.celery.pidbox  0
edx.lms.core.low        0
edx.lms.core.low.%h.celery.pidbox       0
open-ended      0
open-ended-message      0
...done.
root@new-edx:~#

As far as I know, the queues aren't actually created during installation, they are created on demand by edX once it starts up. I.e. edx/celery user accounts have never had access to that rabbitmq node to create queues. (Or at least, never had it since rabbitmq was last restarted...) Which implies the accounts probably don't exist, or at least, don't exist on this particular host name.

How and where are those set ? I couldn’t find anything in the ansible playbook setting any usernames or passwords.

One would expect to override the variables defined in https://github.com/edx/configuration/blob/master/playbooks/roles/rabbitmq/defaults/main.yml through server_vars.yml during installation, (they do include some passwords) but it's a safe bet you didn't do anything of the sort and left the values at their defaults -- which should work.  But rabbitmq is a complicated beast (kind of like ejabberd, which I'm personally much more familiar with and which is also written in Erlang...) which has, among other things, the notion of virtual hosts. Those can complicate things considerably.

root@new-edx:~# rabbitmqctl list_permissions
Listing permissions in vhost "/" ...
admin   .*      .*      .*
celery  .*      .*      .*
edx     .*      .*      .*
...done.

Do the user accounts actually exist? For that matter, is the vhost that is active by default the "/", i.e. the default vhost, and not something else that rabbtimq decided during installation? If the answer to either is negative, you might try to reinstall rabbitmq role on it's own.

zwcum...@gmail.com

unread,
Apr 4, 2016, 2:53:27 PM4/4/16
to Open edX operations
FWIW I was having the same problem. I added the 'edx' user to rabbitmq's list of users. 

sudo rabbitmqctl add_user edx edx
sudo rabbitmqctl set_permissions ".*" ".*" ".*"
sudo service rabbitmq-server restart
sudo reboot

This stopped the re-run from hanging, although the queue seems to have a hidden orphan process in it somewhere because while the course has reappeared in the course list, the course also appears at the top of the page and shows the 'configure as re-run' spinner.

It's puzzling, for sure and I am eager to find a solid solution.

-Zach 

Iaroslav Vassiliev

unread,
Jun 29, 2018, 9:55:12 AM6/29/18
to Open edX operations
I've met the similar problem recently after cloning a virtual server with installed Open edX instance. RabbitMQ config got corrupted, no users were listed in response to
sudo rabbitmqctl list_permissions
command.

In order to solve this issue I've found passwords for admin, edx and celery users in my-passwords.yml edX file:
RABBIT_USERS:
  - name: 'admin'
    password: "{{ RABBIT_ADMIN_PASSWORD }}"
  - name: 'edx'
    password: "{{ XQUEUE_RABBITMQ_PASS }}"
  - name: 'celery'
    password: "{{ EDXAPP_CELERY_PASSWORD }}"

And then:
sudo rabbitmqctl add_user admin <password>
sudo rabbitmqctl set_permissions admin ".*" ".*" ".*"
sudo rabbitmqctl add_user edx <password>
sudo rabbitmqctl set_permissions edx ".*" ".*" ".*"
sudo rabbitmqctl add_user celery <password>
sudo rabbitmqctl set_permissions celery ".*" ".*" ".*"
sudo service rabbitmq-server restart
sudo rabbitmqctl list_permissions

Then restart edxapp.


понедельник, 4 апреля 2016 г., 21:53:27 UTC+3 пользователь zwcum...@gmail.com написал:

Business Finance

unread,
Jul 20, 2019, 11:25:25 PM7/20/19
to Open edX operations
How to solve this problem 

New Picture.png

Reply all
Reply to author
Forward
0 new messages