CMS Celery worker: memory leakage


Liubov Fomicheva

Jun 24, 2016, 1:04:43 PM
to Open edX operations
Hello colleagues,

  In our fullstack installation we've had the following issue several times: the default CMS celery workers tend to consume too much RAM, up to 12% per process (the name of the process: /edx/app/edxapp/venvs/edxapp/bin/python /edx/app/edxapp/edx-platform/manage.py cms --settings=aws celery worker --loglevel=info --queues=edx.cms.core.default --hostname=edx.cms.core.default.%h --concurrency=4). There were 4 such processes, and they prevented all the other processes from allocating memory.
  I've checked the logs for a worker (/edx/var/log/supervisor/cms_default_4-stderr.log), but no errors are registered. The logs just show that the worker constantly performs its usual tasks:
  • (constantly) updating course structure (openedx.core.djangoapps.content.course_structures.tasks.update_course_structure) and search index (contentstore.tasks.update_search_index);
  • (rarely) update_credit_course_requirements
  • (rarely) rerun_course.
  Last time I temporarily hotfixed this issue by limiting the number of tasks per child process for the CMS default celery queue (following the existing example for the LMS high_mem queue), so that each worker child would be restarted after a reasonable number of tasks. But after the update to upstream (post-Cypress, in January) the fix was forgotten, and a couple of days ago our server ran out of memory again. I suppose the uptime of the process at that moment was about a week.
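
  For reference, the limit in question is Celery's max-tasks-per-child option (also exposed as the --maxtasksperchild flag on the worker command line). A minimal sketch of the hotfix in Django-settings form, with an illustrative value rather than the one we actually used:

# Sketch only: CELERYD_MAX_TASKS_PER_CHILD is the Celery 3.x setting behind the
# --maxtasksperchild worker flag; 100 is an illustrative value, not the one we used.
CELERYD_MAX_TASKS_PER_CHILD = 100  # recycle each worker child after 100 tasks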

  So my question is: has anybody seen such an issue too? Maybe you can suggest what the reason is and how to fix it? And is limiting the number of tasks per child an appropriate solution for such a case?

I'm not an operations specialist and would be glad to hear any thoughts about possible ways of solving this problem.

Kind regards,
Liubov

Ed Zarecor

Jun 27, 2016, 9:33:39 PM
to Open edX operations
Are you using a named release? If so, there were some changes recently merged to master that fixed a long-standing memory leak related to request-scoped caches used in asynchronous workers, i.e., outside of an HTTP request scope.

See https://github.com/edx/edx-platform/pull/12481
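
The actual change is in the PR above, but the general shape of this kind of fix is to reset any request-scoped cache when a task finishes, e.g. via Celery's task_postrun signal. A self-contained illustration (the dict here is only a stand-in for the platform's real request cache, not the code in the PR):

from celery.signals import task_postrun

# Stand-in for a request-scoped cache: a module-level dict that is normally
# cleared at the end of an HTTP request, but never inside a long-lived worker.
REQUEST_CACHE = {}

@task_postrun.connect
def clear_request_cache(**kwargs):
    # Without a hook like this, entries pile up across tasks and the worker's
    # resident memory grows until the machine runs out of RAM.
    REQUEST_CACHE.clear()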

Ed.

Pierre Mailhot

Jun 28, 2016, 11:12:07 AM
to Open edX operations
Ed,

I saw the "fix" in the notes for the devops hangout later today.

Do you know if this "fix" is backward compatible with Dogwood? I've encountered a problem similar to the one described by Liubov in the past, and we worked around it by restarting elasticsearch, edxapp and edxapp_worker on a regular basis (at least once a month). We then recover at least 30 to 40% of RAM on our production server. As we do have a workaround, I can wait until Eucalyptus.

Liubov Fomicheva

Jun 29, 2016, 10:23:03 AM
to Open edX operations
Ed, thanks a lot for the answer!
  I'll try to merge the fix into our code. As I mentioned earlier, we currently run a post-Cypress version, not yet Dogwood, so I'm not sure whether our problem is the same as the one fixed in the pull request.
I don't like the idea of restarting servers on a schedule, so if the fix doesn't help, I'll fall back to the previous solution of limiting tasks per worker process.

Liubov

On Tuesday, June 28, 2016 at 4:33:39 UTC+3, Ed Zarecor wrote:

Nikolay Yurchenko

Aug 14, 2018, 1:26:31 PM
to Open edX operations
Hi Liubov, Ed, Pierre,

I'm currently installing the latest Open edX release (hawthorn.1) and have the same problem as Liubov mentioned earlier: several Python workers consume all the system memory (around 15% each, across 5-7 processes). The process names look like the following:

/edx/app/edxapp/venvs/edxapp/bin/python /edx/app/edxapp/edx-platform/manage.py cms --settings=aws celery worker --loglevel=info --queues=edx.cms.core.low --hostname=edx.cms.core.low.%h --concurrency=1

These processes even caused the Open edX installation to crash due to lack of memory.

Operating system: Ubuntu 16.04.5 LTS (GNU/Linux 3.10.0-693.21.1.vz7.46.7 x86_64)

Could you please suggest how to fix the problem?

Thanks in advance,
Nikolay Yurchenko

On Wednesday, June 29, 2016 at 17:23:03 UTC+3, Liubov Fomicheva wrote:

Nikolay Yurchenko

Aug 14, 2018, 2:18:28 PM
to Open edX operations
I tried adding the option "--maxtasksperchild 1" to all workers (normally it is set only for the high_mem worker), but it didn't change the picture.

Typical picture of CPU and memory consumption by worker processes (from top command):

KiB Mem :  1258288 total,      168 free,  1143516 used,   114604 buff/cache
KiB Swap:   524288 total,      244 free,   524044 used.        0 avail Mem
  PID  %CPU %MEM COMMAND
 5946  23.9 18.8 /edx/app/edxapp/venvs/edxapp/bin/python /edx/app/edxapp/edx-platform/manage.py lms --settings=aws celery worker --loglevel=info --queues=edx.lms.core.low --hostname=edx.lms.core.low.%h --concurrency=1 --maxtasksperchild 1
 7161   0.3 18.7 /edx/app/edxapp/venvs/edxapp/bin/python /edx/app/edxapp/edx-platform/manage.py lms --settings=aws celery worker --loglevel=info --queues=edx.lms.core.low --hostname=edx.lms.core.low.%h --concurrency=1 --maxtasksperchild 1
 5947  23.9 18.3 /edx/app/edxapp/venvs/edxapp/bin/python /edx/app/edxapp/edx-platform/manage.py cms --settings=aws celery worker --loglevel=info --queues=edx.cms.core.high --hostname=edx.cms.core.high.%h --concurrency=1 --maxtasksperchild 1
 7160   0.0 18.2 /edx/app/edxapp/venvs/edxapp/bin/python /edx/app/edxapp/edx-platform/manage.py cms --settings=aws celery worker --loglevel=info --queues=edx.cms.core.high --hostname=edx.cms.core.high.%h --concurrency=1 --maxtasksperchild 1
 6258  55.5 15.8 /edx/app/edxapp/venvs/edxapp/bin/python /edx/app/edxapp/edx-platform/manage.py lms --settings=aws celery worker --loglevel=info --queues=edx.lms.core.high --hostname=edx.lms.core.high.%h --concurrency=1 --maxtasksperchild 1
 6814  52.8 14.3 /edx/app/edxapp/venvs/edxapp/bin/python /edx/app/edxapp/edx-platform/manage.py cms --settings=aws celery worker --loglevel=info --queues=edx.cms.core.default --hostname=edx.cms.core.default.%h --concurrency=1 --maxtasksperchild 1
 4787   0.0  7.4 /edx/app/edxapp/venvs/edxapp/bin/python /edx/app/edxapp/edx-platform/manage.py cms --settings=aws celery worker --loglevel=info --queues=edx.cms.core.low  --hostname=edx.cms.core.low.%h --concurrency=1 --maxtasksperchild 1
 5167   0.0  6.8 /edx/app/edxapp/venvs/edxapp/bin/python /edx/app/edxapp/edx-platform/manage.py cms --settings=aws celery worker --loglevel=info --queues=edx.cms.core.low  --hostname=edx.cms.core.low.%h --concurrency=1 --maxtasksperchild 1
 7157 154.8  6.0 /edx/app/edxapp/venvs/edxapp/bin/python /edx/app/edxapp/edx-platform/manage.py lms --settings=aws celery worker --loglevel=info --queues=edx.lms.core.high_mem --hostname=edx.lms.core.high_mem.%h --concurrency=1 --maxtasksperchild 1
 7131  72.8  5.7 /edx/app/edxapp/venvs/edxapp/bin/python /edx/app/edxapp/edx-platform/manage.py lms --settings=aws celery worker --loglevel=info --queues=edx.lms.core.default --hostname=edx.lms.core.default.%h --concurrency=1 --maxtasksperchild 1


On Tuesday, August 14, 2018 at 20:26:31 UTC+3, Nikolay Yurchenko wrote:

Pierre Mailhot

Aug 14, 2018, 2:25:50 PM
to Open edX operations
Your mileage may vary, but the way we solved this issue is through the use of max_requests in /edx/app/edxapp/cms_gunicorn.py and /edx/app/edxapp/lms_gunicorn.py


For the cms we set max_requests = 500

For the lms we set max_requests = 750


We define max_requests before the number of workers.


You will have to restart and check how your system is reacting. 


See the documentation here: http://docs.gunicorn.org/en/stable/settings.html#max-requests
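
For concreteness, a sketch of what the top of /edx/app/edxapp/cms_gunicorn.py looks like with this change; the worker formula is the stock CMS one (see below), and your generated file may differ:

import multiprocessing

max_requests = 500  # recycle each gunicorn worker after 500 requests
workers = (multiprocessing.cpu_count() - 1) * 2 + 2  # stock CMS worker count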

Nikolay Yurchenko

Aug 14, 2018, 3:26:49 PM
to Open edX operations
I've added the suggested lines to /edx/app/edxapp/cms_gunicorn.py and /edx/app/edxapp/lms_gunicorn.py. Now lms_gunicorn.py starts with the following lines:

import multiprocessing

preload_app = False
timeout = 300
bind = "127.0.0.1:8000"
pythonpath = "/edx/app/edxapp/edx-platform"

max_requests = 750
workers = (multiprocessing.cpu_count()-1) * 4 + 4


After a server restart nothing substantially changed: the workers still consume all memory and CPU.

I've finally checked the worker logs: tail -f /edx/var/log/supervisor/cms_default_1-stderr.log

The stderr log contains the following messages:

[2018-08-14 15:13:13,833: ERROR/MainProcess] consumer: Cannot connect to amqp://celery:**@127.0.0.1:5672//: [Errno 104] Connection reset by peer.
Trying again in 2.00 seconds...

[2018-08-14 15:13:18,878: ERROR/MainProcess] consumer: Cannot connect to amqp://celery:**@127.0.0.1:5672//: [Errno 104] Connection reset by peer.
Trying again in 4.00 seconds...

[2018-08-14 15:13:25,904: ERROR/MainProcess] consumer: Cannot connect to amqp://celery:**@127.0.0.1:5672//: [Errno 104] Connection reset by peer.
Trying again in 6.00 seconds...

[2018-08-14 15:13:34,942: ERROR/MainProcess] consumer: Cannot connect to amqp://celery:**@127.0.0.1:5672//: [Errno 104] Connection reset by peer.
Trying again in 8.00 seconds...

I'm not sure what the reason for this is, but remember that the Open edX installation didn't finish due to an out-of-memory error (caused by these workers). Back in the worker's stderr log, after several "Cannot connect" lines, several out-of-memory errors appear:

OSError: [Errno 12] Cannot allocate memory
Traceback (most recent call last):
  File "/edx/app/edxapp/edx-platform/manage.py", line 118, in <module>
    startup.run()
  File "/edx/app/edxapp/edx-platform/cms/startup.py", line 19, in run
    django.setup()
  File "/edx/app/edxapp/venvs/edxapp/local/lib/python2.7/site-packages/django/__init__.py", line 27, in setup
    apps.populate(settings.INSTALLED_APPS)
...

Then everything repeats.

Any suggestions on what to check/modify?

On Tuesday, August 14, 2018 at 21:25:50 UTC+3, Pierre Mailhot wrote:

Pierre Mailhot

Aug 14, 2018, 5:25:30 PM
to Open edX operations
It will not happen automagically.

There need to be a lot of connections to your servers before you see some workers restart and free some memory. It may take a few hours or it may take a few minutes.

How much memory do you have? The recommended minimum is 8 GB, if I remember correctly, and more is better. We use 16 GB on an m4.xlarge server in AWS, with the databases on another server.


--
Salutations / Regards ,
Pierre Mailhot, M.Sc., CISSP, CEH

Nikolay Yurchenko

Aug 15, 2018, 10:28:16 PM
to Open edX operations
I had 1.2 GB on my VPS. I've upgraded to 2 GB (not much spare physical memory was available). This stopped the constant crashes due to "Cannot allocate memory".

This still leaves the question of why each of the 4 LMS workers (low, default, high, high_mem) needs ~250 MB and each of the 3 CMS workers (low, default, high) needs ~150 MB while the Open edX server as a whole is idle and not serving any substantial load. Is there a way to limit memory consumption by the workers?

I've also noticed that the workers tend to be duplicated and use twice as much memory (you can see examples of this in my previous emails), so there are 2 cms.low and 2 lms.low workers running at the same time (and the same for cms.high and lms.high). Sometimes I can kill one of the duplicated workers and it doesn't auto-restart (so only one worker of each kind is left), but sometimes after I kill either of the duplicated workers both are restarted.

So it may be that 2 workers of the same kind are up (and consuming memory) at the same time:
/edx/app/edxapp/venvs/edxapp/bin/python /edx/app/edxapp/edx-platform/manage.py lms --settings=aws celery worker --loglevel=info --queues=edx.lms.core.low --hostname=edx.lms.core.low.%h --concurrency=1
/edx/app/edxapp/venvs/edxapp/bin/python /edx/app/edxapp/edx-platform/manage.py lms --settings=aws celery worker --loglevel=info --queues=edx.lms.core.low --hostname=edx.lms.core.low.%h --concurrency=1

Probably the worker duplication happens after re-running the installation script. Is there a way to prevent it?

On Wednesday, August 15, 2018 at 0:25:30 UTC+3, Pierre Mailhot wrote:

Nikolay Yurchenko

Aug 15, 2018, 10:37:08 PM
to Open edX operations
To check whether the worker duplication was indeed caused by re-running the installation script, I rebooted the Linux server and saw that each worker is duplicated on server startup (so there are 8 LMS workers instead of 4 and 6 CMS workers instead of 3). So the problem must be somewhere else.

On Thursday, August 16, 2018 at 5:28:16 UTC+3, Nikolay Yurchenko wrote:

Pierre Mailhot

Aug 15, 2018, 11:06:48 PM
to Open edX operations
Nikolay,

Check this
  • Ubuntu 16.04 amd64 (oraclejdk required). It may seem like other versions of Ubuntu will be fine, but they are not.  Only 16.04 is known to work.
  • Minimum 8GB of memory
  • At least one 2.00GHz CPU or EC2 compute unit
  • Minimum 25GB of free disk, 50GB recommended for production servers

You said: I've had 1.2 GB on my VPS. I've upgraded to 2 GB.

2 GB is way less than the recommended minimum of 8 GB of memory. You will definitely encounter other problems, especially when recompiling assets.

You can find the total number of workers used in /edx/app/edxapp/lms_gunicorn.py and /edx/app/edxapp/cms_gunicorn.py. It is a function of the number of processors.
LMS : workers = (multiprocessing.cpu_count()-1) * 4 + 4
CMS : workers = (multiprocessing.cpu_count()-1) * 2 + 2

How they are distributed is another story; I believe there are at least 2 of each for redundancy, but I could be wrong.
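
For reference, plugging a hypothetical single-CPU machine into those formulas:

import multiprocessing

# Assuming cpu_count() == 1, e.g. a small 1-vCPU VPS:
cpus = multiprocessing.cpu_count()
lms_workers = (cpus - 1) * 4 + 4  # 4 gunicorn workers for the LMS
cms_workers = (cpus - 1) * 2 + 2  # 2 gunicorn workers for the CMS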


Nikolay Yurchenko

Aug 15, 2018, 11:39:20 PM
to Open edX operations
Pierre, yes, I understand that by using less than 8 GB I'm asking for problems. But I'm installing a proof-of-concept server and I cannot have more than 2 GB of RAM for now, so I'll proceed with care.

Speaking about the number of workers, I found the configuration in /var/tmp/configuration/playbooks/roles/edxapp/defaults/main.yml and replaced "EDXAPP_WORKERS: !!null" with these lines:
EDXAPP_WORKERS:
  lms: 1
  cms: 1

So after re-running the installation script, my /edx/app/edxapp/lms_gunicorn.py and cms_gunicorn.py both contain "workers = 1". But after rebooting there are still 8 LMS and 6 CMS workers instead of 4 and 3 respectively, so the number of redundant workers wasn't affected by the "workers" variable inside *_gunicorn.py.

On Thursday, August 16, 2018 at 6:06:48 UTC+3, Pierre Mailhot wrote:

ji...@opencraft.com

Aug 16, 2018, 9:21:32 PM
to Open edX operations
Hi Pierre, Nikolay,

OpenCraft has had issues with memory leaks from the celery workers in Ginkgo too, and so we use the following default configuration to run small Ginkgo instances on 4 GB VMs.
We adjust these (and the instance size) for clients with higher request counts.

EDXAPP_WORKERS:
   lms: 3
   cms: 2

EDXAPP_LMS_MAX_REQ: 20000
EDXAPP_WORKER_DEFAULT_STOPWAITSECS: 1200


This month, our (largely idle) Hawthorn test instances started crashing too, so we're trialing this additional setting to also restart the CMS celery workers:

EDXAPP_CMS_MAX_REQ: 1000

We haven't had any issues with the EDXAPP_WORKERS setting not doing its job of limiting the number of workers run, though, so I'm not sure why that's not working for you. But as Pierre noted, this number isn't absolute, due to the redundant celery threads and extra web worker threads.

Cheers,
--
Jill
@OpenCraft