Cluster config and deaggregation


fgdg...@gmail.com

Sep 18, 2017, 11:05:51 PM
to OpenQuake Users

Hello,

We have a cluster set up with 4 compute nodes (4 cores per node) without any shared dir, as the data to share wasn't big.
It has worked fine for hazard calculations so far, but we recently started some deaggregation computations and ran into multiple errors.
As we have no shared dir, some computations were run on the head node, which was not sized like the compute nodes memory-wise and ran into memory issues; this was solved by sizing the head node like the compute nodes, following the 2 GB memory/core rule.
But now we run into a Python error which we do not really understand:

INFO:root:Using celery@oq01, celery@oq02, celery@oq03, celery@oq04, 16 cores
[2017-09-13 14:07:06,049 #131 INFO] Using engine version 2.4.0-1
[2017-09-13 14:07:06,098 #131 INFO] Using hazardlib version 0.24.0-1
[2017-09-13 14:07:06,166 #131 INFO] Read 1 hazard site(s)
[2017-09-13 14:07:06,219 #131 INFO] Instantiating the source-sites filter
[2017-09-13 14:07:06,283 #131 INFO] Parsing /home/franckg/tmp/oqtest2/KSHM_source_model-bk_Kaik2117-fl_F1C0517-mmin5.xml
[2017-09-13 14:07:20,286 #131 INFO] Processed source model 1 with 224 potential gsim path(s) and 4958 sources
[2017-09-13 14:07:20,342 #131 INFO] Parsing /home/franckg/tmp/oqtest2/KSHM_source_model-bk_Kaik2117-fl_F1C0517S-mmin5.xml
[2017-09-13 14:07:31,623 #131 INFO] Processed source model 2 with 224 potential gsim path(s) and 4960 sources
[2017-09-13 14:07:31,745 #131 INFO] Filtering composite source model
[2017-09-13 14:07:43,767 #131 INFO] Using a maxweight of 6811
[2017-09-13 14:07:43,813 #131 INFO] Sending source group #1 of 6 (Subduction Interface, 7 sources)
[2017-09-13 14:07:43,862 #131 INFO] Submitting  "classical" tasks
[2017-09-13 14:07:43,935 #131 INFO] Sending source group #2 of 6 (Volcanic, 24 sources)
[2017-09-13 14:07:44,009 #131 INFO] Sending source group #3 of 6 (Active Shallow Crust, 4677 sources)
[2017-09-13 14:07:45,395 #131 INFO] Sending source group #4 of 6 (Subduction Interface, 9 sources)
[2017-09-13 14:07:45,460 #131 INFO] Sending source group #5 of 6 (Volcanic, 24 sources)
[2017-09-13 14:07:45,535 #131 INFO] Sending source group #6 of 6 (Active Shallow Crust, 4677 sources)
[2017-09-13 14:07:47,230 #131 INFO] Sent 14.9 MB of data in 86 task(s)
[2017-09-13 14:07:47,292 #131 INFO] classical   1%
[...]
[2017-09-13 15:11:25,481 #131 INFO] classical 100%
[2017-09-13 15:11:25,574 #131 INFO] Received 1.15 MB of data, maximum per task 19.01 KB
[2017-09-13 15:11:25,871 #131 WARNING] Reducing the logic tree of KSHM_source_model-bk_Kaik2117-fl_F1C0517-mmin5.xml from 224 to 56 realizations
[2017-09-13 15:11:25,974 #131 WARNING] Reducing the logic tree of KSHM_source_model-bk_Kaik2117-fl_F1C0517S-mmin5.xml from 224 to 56 realizations
[2017-09-13 15:11:27,844 #131 WARNING] Task `build_hcurves_and_stats` will be run on the controller node only, since no `shared_dir` has been specified
[2017-09-13 15:11:27,893 #131 INFO] Submitting  "build_hcurves_and_stats" tasks
[2017-09-13 15:11:27,983 #131 INFO] Sent 23.19 KB of data in 1 task(s)
[2017-09-13 15:11:28,189 #131 INFO] build_hcurves_and_stats 100%
[2017-09-13 15:11:28,256 #131 INFO] Received 1.99 KB of data, maximum per task 1.99 KB
[2017-09-13 15:11:28,307 #131 INFO] 1 epsilon bins from -3.0 to 3.0
[2017-09-13 15:11:28,608 #131 INFO] 20 mag bins from 5.0 to 9.0
[2017-09-13 15:11:28,659 #131 INFO] 20 dist bins from 0.0 to 400.0
[2017-09-13 15:11:28,708 #131 INFO] 1 lon bins from 169.7 to 177.568805985
[2017-09-13 15:11:28,757 #131 INFO] 1 lat bins from -44.7972720057 to -38.8807830741
[2017-09-13 15:11:28,841 #131 INFO] 20 dist bins from 0.0 to 400.0
[2017-09-13 15:11:28,890 #131 INFO] 1 lon bins from 169.7 to 177.568805985
[2017-09-13 15:11:28,939 #131 INFO] 1 lat bins from -44.7972720057 to -38.8807830741
[2017-09-13 15:11:29,092 #131 INFO] 20 dist bins from 0.0 to 400.0
[2017-09-13 15:11:29,142 #131 INFO] 1 lon bins from 169.7 to 177.568805985
[2017-09-13 15:11:29,191 #131 INFO] 1 lat bins from -44.7972720057 to -38.8807830741
[2017-09-13 15:11:35,223 #131 INFO] 20 mag bins from 5.0 to 9.0
[2017-09-13 15:11:35,270 #131 INFO] 20 dist bins from 0.0 to 400.0
[2017-09-13 15:11:35,317 #131 INFO] 1 lon bins from 169.7 to 177.568805985
[2017-09-13 15:11:35,363 #131 INFO] 1 lat bins from -44.7972720057 to -38.8807830741
[2017-09-13 15:11:35,424 #131 INFO] 20 dist bins from 0.0 to 400.0
[2017-09-13 15:11:35,470 #131 INFO] 1 lon bins from 169.7 to 177.568805985
[2017-09-13 15:11:35,516 #131 INFO] 1 lat bins from -44.7972720057 to -38.8807830741
[2017-09-13 15:11:35,602 #131 INFO] 20 dist bins from 0.0 to 400.0
[2017-09-13 15:11:35,648 #131 INFO] 1 lon bins from 169.7 to 177.568805985
[2017-09-13 15:11:35,694 #131 INFO] 1 lat bins from -44.7972720057 to -38.8807830741
[2017-09-13 15:11:40,784 #131 INFO] Submitting 68 "compute_disagg" tasks
[2017-09-13 16:00:11,397 #131 CRITICAL]
Traceback (most recent call last):
  File "/opt/openquake/lib/python2.7/site-packages/openquake/calculators/base.py", line 203, in run
  File "/opt/openquake/lib/python2.7/site-packages/openquake/calculators/disaggregation.py", line 119, in post_execute
  File "/opt/openquake/lib/python2.7/site-packages/openquake/calculators/disaggregation.py", line 238, in full_disaggregation
  File "/opt/openquake/lib/python2.7/site-packages/openquake/baselib/parallel.py", line 592, in reduce
  File "/opt/openquake/lib/python2.7/site-packages/openquake/baselib/parallel.py", line 633, in submit_all
  File "/opt/openquake/lib/python2.7/site-packages/openquake/baselib/parallel.py", line 548, in submit
  File "/opt/openquake/lib/python2.7/site-packages/openquake/baselib/parallel.py", line 555, in _submit
  File "/opt/openquake/lib/python2.7/site-packages/celery/app/task.py", line 453, in delay
  File "/opt/openquake/lib/python2.7/site-packages/celery/app/task.py", line 565, in apply_async
  File "/opt/openquake/lib/python2.7/site-packages/celery/app/base.py", line 349, in send_task
  File "/opt/openquake/lib/python2.7/site-packages/celery/backends/rpc.py", line 32, in on_task_call
  File "/opt/openquake/lib/python2.7/site-packages/kombu/common.py", line 112, in maybe_declare
  File "/opt/openquake/lib/python2.7/site-packages/kombu/common.py", line 129, in _imaybe_declare
  File "/opt/openquake/lib/python2.7/site-packages/kombu/connection.py", line 457, in _ensured
  File "/opt/openquake/lib/python2.7/site-packages/kombu/connection.py", line 369, in ensure_connection
  File "/opt/openquake/lib/python2.7/site-packages/kombu/utils/__init__.py", line 246, in retry_over_time
  File "/opt/openquake/lib/python2.7/site-packages/kombu/connection.py", line 237, in connect
  File "/opt/openquake/lib/python2.7/site-packages/kombu/connection.py", line 741, in connection
  File "/opt/openquake/lib/python2.7/site-packages/kombu/connection.py", line 696, in _establish_connection
  File "/opt/openquake/lib/python2.7/site-packages/kombu/transport/pyamqp.py", line 116, in establish_connection
  File "/opt/openquake/lib/python2.7/site-packages/amqp/connection.py", line 165, in __init__
  File "/opt/openquake/lib/python2.7/site-packages/amqp/connection.py", line 186, in Transport
  File "/opt/openquake/lib/python2.7/site-packages/amqp/transport.py", line 299, in create_transport
  File "/opt/openquake/lib/python2.7/site-packages/amqp/transport.py", line 87, in __init__
AttributeError: 'NoneType' object has no attribute 'close'
[2017-09-13 16:00:11,400 #131 CRITICAL] Traceback (most recent call last):
  File "/opt/openquake/lib/python2.7/site-packages/openquake/engine/engine.py", line 198, in run_calc
  File "/opt/openquake/lib/python2.7/site-packages/openquake/engine/engine.py", line 231, in _do_run_calc
  File "/opt/openquake/lib/python2.7/site-packages/openquake/baselib/performance.py", line 156, in __exit__
  File "/opt/openquake/lib/python2.7/site-packages/openquake/baselib/performance.py", line 118, in measure_mem
  File "/opt/openquake/lib/python2.7/site-packages/psutil/__init__.py", line 349, in __init__
  File "/opt/openquake/lib/python2.7/site-packages/psutil/__init__.py", line 375, in _init
  File "/opt/openquake/lib/python2.7/site-packages/psutil/__init__.py", line 636, in create_time
  File "/opt/openquake/lib/python2.7/site-packages/psutil/_pslinux.py", line 810, in wrapper
  File "/opt/openquake/lib/python2.7/site-packages/psutil/_pslinux.py", line 943, in create_time
  File "/opt/openquake/lib/python2.7/site-packages/psutil/_pslinux.py", line 138, in open_binary
IOError: [Errno 24] Too many open files: '/proc/3829/stat'

Traceback (most recent call last):
  File "/opt/openquake/lib/python2.7/site-packages/openquake/engine/engine.py", line 198, in run_calc
  File "/opt/openquake/lib/python2.7/site-packages/openquake/engine/engine.py", line 231, in _do_run_calc
  File "/opt/openquake/lib/python2.7/site-packages/openquake/baselib/performance.py", line 156, in __exit__
  File "/opt/openquake/lib/python2.7/site-packages/openquake/baselib/performance.py", line 118, in measure_mem
  File "/opt/openquake/lib/python2.7/site-packages/psutil/__init__.py", line 349, in __init__
  File "/opt/openquake/lib/python2.7/site-packages/psutil/__init__.py", line 375, in _init
  File "/opt/openquake/lib/python2.7/site-packages/psutil/__init__.py", line 636, in create_time
  File "/opt/openquake/lib/python2.7/site-packages/psutil/_pslinux.py", line 810, in wrapper
  File "/opt/openquake/lib/python2.7/site-packages/psutil/_pslinux.py", line 943, in create_time
  File "/opt/openquake/lib/python2.7/site-packages/psutil/_pslinux.py", line 138, in open_binary
IOError: [Errno 24] Too many open files: '/proc/3829/stat'
Traceback (most recent call last):
  File "/usr/bin/oq", line 35, in <module>
    main.oq()
  File "/opt/openquake/lib/python2.7/site-packages/openquake/commands/__main__.py", line 49, in oq
    parser.callfunc()
  File "/opt/openquake/lib/python2.7/site-packages/openquake/baselib/sap.py", line 186, in callfunc
    return self.func(**vars(namespace))
  File "/opt/openquake/lib/python2.7/site-packages/openquake/baselib/sap.py", line 245, in main
    return func(**kw)
  File "/opt/openquake/lib/python2.7/site-packages/openquake/commands/engine.py", line 178, in engine
    exports, hazard_calculation_id=hc_id)
  File "/opt/openquake/lib/python2.7/site-packages/openquake/commands/engine.py", line 65, in run_job
    hazard_calculation_id=hazard_calculation_id, **kw)
  File "/opt/openquake/lib/python2.7/site-packages/openquake/engine/engine.py", line 209, in run_calc
    logs.LOG.critical(tb)
  File "/usr/lib64/python2.7/logging/__init__.py", line 1194, in critical
    self._log(CRITICAL, msg, args, **kwargs)
  File "/usr/lib64/python2.7/logging/__init__.py", line 1268, in _log
    self.handle(record)
  File "/usr/lib64/python2.7/logging/__init__.py", line 1278, in handle
    self.callHandlers(record)
  File "/usr/lib64/python2.7/logging/__init__.py", line 1318, in callHandlers
    hdlr.handle(record)
  File "/usr/lib64/python2.7/logging/__init__.py", line 749, in handle
    self.emit(record)
  File "/opt/openquake/lib/python2.7/site-packages/openquake/commonlib/logs.py", line 131, in emit
    record.getMessage())
  File "/opt/openquake/lib/python2.7/site-packages/openquake/commonlib/logs.py", line 52, in dbcmd
    raise RuntimeError('Cannot connect on %s:%s' % config.DBS_ADDRESS)
RuntimeError: Cannot connect on localhost:1908

I'm quite confused by the
AttributeError: 'NoneType' object has no attribute 'close'
and the
IOError: [Errno 24] Too many open files: '/proc/3829/stat'

Which one is the root of the issue?

Best regards.
Franck

Daniele Viganò

Sep 19, 2017, 4:07:00 AM
to openqua...@googlegroups.com

Dear Franck,


To better understand your issue we need some further information:

  • Which OS are you using? Ubuntu, CentOS?
  • If it's CentOS, is SELinux enabled?
  • Are you running OQ on bare metal or in a VM/container (like LXC)?
  • Is there any other software running on the same hosts (excluding base services, I mean)?

I would also suggest trying to upgrade to v2.5.0 first to see if the issue is reproducible.

If the situation does not change you can send us your input files at engine....@openquake.org so we can try to reproduce your issue.


I suspect anyway that the root cause is

IOError: [Errno 24] Too many open files: '/proc/3829/stat'

Everything else is a consequence of it (failures in rabbitmq and the dbserver). Can you check the default limits by running `ulimit -a` from a user shell?


Cheers,
Daniele


--
DANIELE VIGANÒ | System Administrator | Skype dennyv85 | +39 0382 5169882
GLOBAL EARTHQUAKE MODEL | working together to assess risk

fgdg...@gmail.com

Sep 19, 2017, 5:30:36 PM
to OpenQuake Users
Thanks Daniele for your answer.
We're running CentOS 7 (CentOS Linux release 7.3.1611), SELinux is disabled and we're running on virtual machines (VMware-based); this VM cluster is exclusively for OpenQuake usage, no other processing is done.

$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 31220
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 4096
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

I'll update to 2.5 and try the processing again to see if it fails again. I'll also try to raise the ulimit for the user and see how it goes.
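In case it helps, I plan to raise it along these lines (the value is a guess on my side, and it only takes effect for new login sessions):

$ echo 'franckg soft nofile 65536' | sudo tee -a /etc/security/limits.conf
$ echo 'franckg hard nofile 65536' | sudo tee -a /etc/security/limits.conf

Services started by systemd (rabbitmq-server, openquake-dbserver...) ignore limits.conf, so those would need a LimitNOFILE= override in a systemd drop-in instead.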

Best regards.
Franck

Michele Simionato

Sep 20, 2017, 12:20:19 AM
to OpenQuake Users
It looks like the calculation you are running is too large for your infrastructure. If this is the case, the only thing you can do is to reduce it.
You have a small cluster, and disaggregation calculations are very big, orders of magnitude bigger than an equivalent classical calculation.
Improving the disaggregation has been on our wish list for years and sooner or later it will be done, but do not hold your breath. It is difficult and it will happen in the long term.

fgdg...@gmail.com

Sep 20, 2017, 9:31:34 PM
to OpenQuake Users
I updated to 2.5... the RPM overwrote /etc/openquake/openquake.cfg, which is annoying. You may want to check the RPM update process.

Rerunning the same job threw the same error. I changed the open files limit for my user before submitting the job and it seems to run now; at least it hasn't crashed yet.
It seems that disaggregation tasks are processed by the head node and run by the user-owned process. I'm monitoring the file descriptors associated with this process and the behavior looks like a sawtooth profile: the number of open FDs grows slowly to around 1100, then falls to 40, then grows slowly again... rinse and repeat.
The funny thing about those file descriptors is that they are not associated with actual files but with sockets, and every second or two you can see an entry about a connection in the rabbitmq log file.
I'm just reporting it here.
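(For reference, I'm simply counting the entries under the process' fd directory in /proc every couple of seconds, with something like:

$ watch -n 2 'ls /proc/<oq PID>/fd | wc -l'

where <oq PID> is the pid of the oq process on the head node.)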

Supposedly, it's going to run...
About the sizing issue raised by Michele, the good thing about VMs is that they are elastic both in size and number. But I need to know whether I need more CPU power or more memory per node... The issue that I see here is that disaggregation tasks are only run by the head node; is it due to the lack of shared_dir?

Best regards.
Franck


Michele Simionato

Sep 20, 2017, 11:20:59 PM
to OpenQuake Users
The behavior you observe is interesting and new to us. If you could send your input files to engine....@openquake.org it would help us a lot in understanding what is going on.
 

Michele Simionato

Sep 21, 2017, 8:20:34 AM
to OpenQuake Users
I am running the computation on our cluster and everything works fine. There must be something wrong with your virtual machines, your configuration or your version of rabbitmq.

> It seems that disaggregation tasks are processed by the head node and run by the user-owned process.

This is absurd. Are you sure that in the openquake.cfg file you have set oq_distribute = celery?

Anyway, the calculation is big but not huge, so it can be done on a single server if you wait long enough (probably less than 1 day). If your goal is to run it you can go that way: set
oq_distribute = futures and you will not use celery nor rabbitmq. If your goal is to debug rabbitmq/celery you should wait until next week, when our sysadmin will be back ;-)
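For instance, in openquake.cfg on the master this is a one-line change (the section name below is the one used in the default config; double-check your own file):

[distribution]
oq_distribute = futures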

Michele Simionato

Sep 21, 2017, 9:01:43 AM
to OpenQuake Users


On Thursday, September 21, 2017 at 2:20:34 PM UTC+2, Michele Simionato wrote:
Anyway, the calculation is big but not huge, so it can be done on a single server if you wait long enough (probably less than 1 day).

I take back what I said: I was looking at a simplified version. Your logic tree has 112 realizations, so it is a very big computation that will likely take more than 1 day on a single machine.
The first step would be to reduce the logic tree to a single realization and then see how long it takes.

fgdg...@gmail.com

Sep 21, 2017, 7:17:44 PM
to OpenQuake Users
Thanks Michele for having a look.
The time is not really an issue; on our 500-core HPC we have jobs running for weeks. We'll scale out the OpenQuake cluster once I know it's working as intended.


>> It seems that disaggregation tasks are processed by the head node and run by the user-owned process.
>This is absurd. Are you sure that in the openquake.cfg file you have set oq_distribute = celery?

Yes, I was surprised as well, but it is the case. During the classical calculation phase, tasks are distributed across the nodes, as you can see in the header of the job log in the first post:

INFO:root:Using celery@oq01, celery@oq02, celery@oq03, celery@oq04, 16 cores
And the celery-status script shows all the workers are busy; we can hop onto a compute node and see the 4 python processes running the computation... everything looks perfect.
But when you reach the disagg phase,

[2017-09-13 15:11:40,784 #131 INFO] Submitting 68 "compute_disagg" tasks
only the head node runs something:
$ top
top - 11:04:20 up 2 days, 41 min,  4 users,  load average: 1.26, 1.23, 1.26
Tasks: 170 total,   3 running, 167 sleeping,   0 stopped,   0 zombie
%Cpu(s): 50.4 us,  2.7 sy,  0.0 ni, 45.9 id,  0.0 wa,  0.0 hi,  1.0 si,  0.0 st
KiB Mem :  8010840 total,  3604860 free,  1220092 used,  3185888 buff/cache
KiB Swap:  3904508 total,  3904508 free,        0 used.  6476560 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
26746 franck    20   0 1068024 335768  16720 R  81.4  4.2   1221:53 oq
  831 rabbitmq  20   0 1802372 119836   2952 S  23.6  1.5 365:31.75 beam.smp
 2888 rabbitmq  20   0   39396   1072    776 S   1.3  0.0  16:25.20 inet_gethost
   10 root      rt   0       0      0      0 S   0.3  0.0   0:00.77 watchdog/0
  601 root      20   0  231352   6140   4756 S   0.3  0.1   1:57.89 vmtoolsd
 2866 rabbitmq  20   0   33028    712    496 S   0.3  0.0   3:38.24 inet_gethost
26688 root      20   0       0      0      0 S   0.3  0.0   0:00.21 kworker/1:2
30324 franck    20   0  161960   2424   1648 R   0.3  0.0   0:00.01 top
    1 root      20   0   43408   3848   2484 S   0.0  0.0   0:04.19 systemd
The compute nodes are idle:
$ celery-status
==========
Host: celery@oq04
Status: Online
Worker processes: 4
Active tasks: 0
==========
Host: celery@oq03
Status: Online
Worker processes: 4
Active tasks: 0
==========
Host: celery@oq02
Status: Online
Worker processes: 4
Active tasks: 0
==========
Host: celery@oq01
Status: Online
Worker processes: 4
Active tasks: 0
==========

Total workers:       16
Active tasks:        0
Cluster utilization: 0.00%


The computation I started yesterday is still running... increasing the ulimit -n value did the trick.
But yes, there is something going wrong when you run disagg in a cluster (rabbitmq/celery) context, so I guess we'll have to wait until next week for your sysadmin to be back :)

Best regards
Franck

Michele Simionato

Sep 22, 2017, 2:24:19 AM
to OpenQuake Users
I can confirm that it is a problem related to your infrastructure and not to the application code. Yesterday I ran the full computation on our cluster with 256 cores in 6h 30m without any issue.
The memory occupation is well below 2 GB per core, there is nothing strange with the file descriptors and the only limit is the CPU power. Maybe our sysadmin will have some idea of what is wrong in your situation. Of course, without having access to your infrastructure it is difficult.

Daniele Viganò

Sep 22, 2017, 12:06:08 PM
to openqua...@googlegroups.com

Dear Franck,

I haven't had time yet to read the full thread; I hope to have more time next week. Anyway, I'm setting up a testbed made of CentOS 7 VMs (unfortunately our cluster runs Ubuntu 14.04 with a tweaked version of Python 3.5, and CentOS testing is done using demo calculations inside a Docker container). I'm using a 4-core master and a 16-core worker. I'm using a single worker, but this should not change the picture since the number of connections depends on the number of workers, not nodes.

See some considerations below:

This may be correct: the computation workflow is the following:
  1. Classical tasks are generated on the master (no load on the workers). This phase is usually relatively quick for classical PSHA
  2. Tasks are sent to the workers and results are collected; here you should see load on the workers and a few cycles on the master too
  3. Once the classical phase is done, the same cycle is repeated for the disaggregation
  4. Disaggregation tasks are generated on the master and sent to the workers (no load on them until at least one task is sent)

It looks like the computation fails when the master tries to submit tasks to the workers: it seems an issue on the RabbitMQ side, which has difficulties flushing tasks from the queue or is causing a traffic jam. Even if you increase the ulimits this is not the expected behavior, and if you still don't see any load on the workers my fear is that they will never start processing tasks. We already had similar bugs with specific versions of RabbitMQ in the past (~3 years ago).

Something you can try that should not cost you too much effort would be a more recent version of RabbitMQ (https://dl.bintray.com/rabbitmq/rabbitmq-server-rpm/rabbitmq-server-3.6.12-1.el7.noarch.rpm).
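On CentOS 7 that should be something along these lines, assuming a recent enough Erlang and no conflicting packages from other repos:

$ sudo yum install https://dl.bintray.com/rabbitmq/rabbitmq-server-rpm/rabbitmq-server-3.6.12-1.el7.noarch.rpm
$ sudo systemctl restart rabbitmq-server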

I'm starting to run the job now; I'll let you know if I'm able to reproduce the error and (hopefully) where it's coming from.

We also fixed the issue with the configuration file (https://github.com/gem/oq-engine/pull/3040); it will be part of Engine 2.6.0, which should land next week.

Cheers,
Daniele

Daniele Viganò

Sep 22, 2017, 5:18:11 PM
to openqua...@googlegroups.com

Dear Franck,

I've been able to reproduce your issue. With a more recent version of RabbitMQ the 'too many open files' error is not triggered, but the computation is stuck and the behavior is more or less the same.
The issue is in the communication between the OpenQuake Engine (or rather AMQP) and RabbitMQ: AMQP continuously opens new connections against RabbitMQ, but no data is then sent to the workers.

We'll need to make some further investigation to track down the bug and also to understand why it worked on our configuration. It will require some days due to the weight of the computation.

Cheers,
Daniele

Daniele Viganò

Sep 23, 2017, 3:35:21 AM
to openqua...@googlegroups.com

Hi Franck,

Some quick updates: the reason why Michele has been able to run the computation on our cluster is the 'shared dir' feature. Having this enabled on the CentOS 7 VM, I was able to start the disaggregation phase. Unfortunately I ran out of memory (16 cores, 16 GB; I will try soon with 32 GB), but I did not experience the issue with RabbitMQ.

We need to figure out what happens without the shared_dir, but I'm not sure it's a bug. With the shared dir unset we have to transfer all the data via RabbitMQ. It may just need a relaxed open files limit (see Controlling System Limits on Linux under https://www.rabbitmq.com/install-rpm.html), or it may simply be incapable of managing such an amount of data, in which case a shared dir is required (that's why we introduced it). I will check with Michele.

Adding a shared dir is not documented yet, but here you can find some guidelines (/home/shared is just an example; it can be any path):

All nodes

Update /etc/openquake/openquake.cfg, setting /home/shared as the shared_dir value, for example:
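In recent default configs the option lives under the [directory] section (double-check the section name in your own openquake.cfg):

[directory]
shared_dir = /home/shared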

Master

$ yum install nfs-utils
$ mkdir /home/shared
$ chgrp openquake /home/shared
# Users need to create their oqdata inside
$ chmod 777 /home/shared
# Keep openquake group as owner of its content
$ chmod g+s /home/shared

to /etc/exports add
/home/shared            <ip subnet>/<mask>(ro,sync,no_all_squash)

$ systemctl enable rpcbind
$ systemctl enable nfs-server
$ systemctl enable nfs-lock
$ systemctl enable nfs-idmap
$ systemctl start rpcbind
$ systemctl start nfs-server
$ systemctl start nfs-lock
$ systemctl start nfs-idmap
$ systemctl restart openquake-dbserver

Workers

$ yum install nfs-utils
$ mkdir /home/shared
$ mount <master ip>:/home/shared /home/shared -o ro,bg,soft,intr,noauto
$ systemctl restart openquake-celery


Best regards,
Daniele

fgdg...@gmail.com

Sep 24, 2017, 6:49:55 PM
to OpenQuake Users
Good to see that you were able to replicate the issue.
I was wondering, when I did the setup, how long we could live without the shared_dir feature... well, I guess I had my answer.
I switched the configuration to a shared_dir-based one and I confirm that disagg tasks are spread across and processed by the compute nodes, so it seems OK.
From Michele's numbers, I think we should complete it in 4 days.

I can switch back to pure rabbitmq configuration quite quickly if you need me to do some tests.

One question: it seems that every job leaves data in the shared_dir directory. Is it intended that when a job completes, it leaves data in this directory?
I'm thinking about a community of scientists submitting jobs and the impact on disk space over time.

Best regards.
Franck

fgdg...@gmail.com

Sep 24, 2017, 10:41:56 PM
to OpenQuake Users
As we progressed quite quickly to 58% of the disagg tasks, the memory consumption exploded.
Find below a snapshot of a compute node after a few hours of computation.
As you can see, it won't progress anytime soon, as all the resources are spent managing swap (I/O wait).

Best regards.
Franck

| CPU Utilisation ----------------------------------------------------------------------|
|---------------------------+-------------------------------------------------+         |
|CPU  User%  Sys% Wait% Idle|0          |25         |50          |75       100|         |
|  1   0.0   8.3  87.9   3.8|ssssWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWW W>         |
|  2   0.0  51.4  28.9  19.7|sssssssssssssssssssssssssWWWWWWWWWWWWWW          >         |
|  3   0.0  20.8  65.6  13.6|ssssssssssWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWW       >         |
|  4   0.8  21.0  66.4  11.8|ssssssssssWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWW      >         |
|---------------------------+-------------------------------------------------+         |
|Avg   0.5  24.9  62.6  12.0|ssssssssssssWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWW      >         |
|---------------------------+-------------------------------------------------+         |
| Memory Stats -------------------------------------------------------------------------|
|                RAM     High      Low     Swap    Page Size=4 KB                       |
| Total MB      7726.8     -0.0     -0.0   7629.0                                       |
| Free  MB       127.7     -0.0     -0.0   2329.9                                       |
| Free Percent     1.7%   100.0%   100.0%    30.5%                                      |
|             MB                  MB                  MB                                |
|                      Cached=     8.1     Active=  6209.5                              |
| Buffers=     0.3 Swapcached=   688.9  Inactive =  1234.6                              |
| Dirty  =     0.0 Writeback =     0.0  Mapped   =     5.2                              |
| Slab   =    43.1 Commit_AS = 12935.3 PageTables=    30.2                              |
| Kernel Stats -------------------------------------------------------------------------|
| RunQueue              1   Load Average    CPU use since boot time                     |
| ContextSwitch    1187.9    1 mins  4.26    Uptime Days=  5 Hours= 4 Mins=17           |
| Forks               0.0    5 mins  4.33    Idle   Days= 19 Hours= 8 Mins=43           |
| Interrupts       7143.1   15 mins  4.38    Average CPU use=-273.88%                   |
| Disk I/O --/proc/diskstats----mostly in KB/s-----Warning:contains duplicates----------|
|DiskName Busy  Read WriteMB|0          |25         |50          |75       100|         |
|sda      100%   11.7   15.0|RRRRRRRRRRRRRRRRRRRRRRWWWWWWWWWWWWWWWWWWWWWWWWWWW>         |
|sda1       0%    0.0    0.0|>                                                |         |
|sda2       0%    0.0    0.0|        >                                        |         |
|sda3     100%   11.7   15.0|RRRRRRRRRRRRRRRRRRRRRRWWWWWWWWWWWWWWWWWWWWWWWWWWW>         |
|sda4       0%    0.0    0.0|>disk busy not available                         |         |
|sda5       0%    0.0    0.0|                        >                        |         |
|Totals Read-MB/s=23.3     Writes-MB/s=29.9     Transfers/sec=7102.8                    |
| Top Processes Procs=101 mode=4 (1=Basic, 3=Perf 4=Size 5=I/O)-------------------------|
|  PID    %CPU  Size   Res   Res   Res   Res Shared   Faults Command  Faults  Command   |
|          Used    KB   Set  Text  Data   Lib    KB  Min  Maj KB     Min   Maj          |
|   19838  19.4 3777368 1992028     4 3284648     0   988  126   87 python              |
|   19835  18.4 3607488 1740884     4 3114768     0   864  163   73 python              |
|   19836  18.4 3692936 1675376     4 3200216     0   988  267  138 python              |
|   19837  16.5 3683488 1536736     4 3190768     0   860   52   84 python              |
|   19820   0.5 718996 29184     4 226276     0  1240    0    0 python                  |
|   20488   0.5 17772  4492   112  7200     0   784    50    0 nmon_x86_64_rhe           |
|     369   0.0 36828  2264   252   356     0  2136    0    0 systemd-journal           |
|   20424   0.0 145700  1720   800   912     0   416    0    0 sshd                     |
|     529   0.0 231352  1472    44  1568     0  1152    0    0 vmtoolsd                 |
|       1   0.0 43360  1204  1320  1240     0   580    0    0 systemd                   |
|     775   0.0 562388  1040     4 304476     0   704    0    0 tuned                   |
|   20426   0.0 115392   792   884   500     0   372    0    0 bash                     |
|     528   0.0 212120   788   596 148740     0   348    0    0 rsyslogd                |
|     523   0.0 534260   568   108 450208     0   256    0    0 polkitd                 |
|     525   0.0 21624   484    44   492     0   328    0    0 irqbalance                |
|     542   0.0 24204   452   524   372     0   308    0    0 systemd-logind            |


Michele Simionato

Sep 25, 2017, 12:10:24 AM
to OpenQuake Users
On Monday, September 25, 2017 at 00:49:55 UTC+2, fgdg...@gmail.com wrote:
Good to see that you were able to replicate the issue.
I was wondering, when I did the setup, how long we could live without the shared_dir feature... well, I guess I had my answer.
I switched the configuration to a shared_dir-based one and I confirm that disagg tasks are spread across and processed by the compute nodes, so it seems OK.
From Michele's numbers, I think we should complete it in 4 days.

Just to clarify, a disaggregation calculation should not require a shared_dir. The shared_dir is meant for situations with hundreds of thousands of sites, not for a disaggregation with a single site. I have marked this as a bug and it will be fixed after the forthcoming release of engine 2.6. It is not clear to me yet what is happening, but it is certainly non-intended behavior. Unfortunately, since our cluster uses a shared_dir, we never realized that there was a bug in the absence of the shared_dir.


One question: it seems that every job leaves data in the shared_dir directory. Is it intended that when a job completes, it leaves data in this directory?
I'm thinking about a community of scientists submitting jobs and the impact on disk space over time.

What kind of data do you mean? The shared_dir should be read-only for the workers, so nothing should be written there. What kind of files do you find?
I am surprised,

                   Michele

Daniele Viganò

Sep 25, 2017, 3:12:44 AM
to openqua...@googlegroups.com

Dear Franck,


On 25/09/17 00:49, fgdg...@gmail.com wrote:
Good to see that you were able to replicate the issue.
I was wondering, when I did the setup, how long we could live without the shared_dir feature... well, I guess I had my answer.
I switched the configuration to a shared_dir-based one and I confirm that disagg tasks are spread across and processed by the compute nodes, so it seems OK.
From Michele's numbers, I think we should complete it in 4 days.

I can switch back to pure rabbitmq configuration quite quickly if you need me to do some tests.

After some discussion with Michele over the weekend we concluded that your calculation isn't big enough and should not require a shared_dir: so there is a bug somewhere regarding disaggregation done on a multi-node setup without a shared dir set. We are releasing 2.6 with this as a known issue and we'll address the bug for 2.6.1 or 2.7.


One question: it seems that every job leaves data in the shared_dir directory. Is it intended that when a job completes, it leaves data in this directory?
I'm thinking about a community of scientists submitting jobs and the impact on disk space over time.

The shared dir is written only by the master and it contains the following structure:

- shared_dir
  - <username>
    - oqdata

What happens when shared_dir is set is that the oqdata folder (which contains the datastore of each calculation, in the form of a 'calc_XYZ.hdf5' HDF5 file) is stored in <shared_dir>/<user>/oqdata instead of /home/<user>/oqdata. This makes it possible to export just the OQ data and not the whole /home. You can still keep oqdata in /home/<user>/oqdata by setting shared_dir = /home and exporting the whole /home.

Each oqdata is owned by the user who ran the job, and oqdata directories can be removed as soon as the results (in csv, xml, hdf5...) have been exported and the raw data generated by the calculation isn't required anymore.

To clean up data you can make a script starting from this one: https://github.com/gem/oq-engine/blob/master/utils/reset-db
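For example, something as simple as the following would reclaim space from old runs (a sketch only, assuming datastores older than 30 days can go; note that it removes only the HDF5 files, not the corresponding records in the engine database):

$ find /home/shared/*/oqdata -name 'calc_*.hdf5' -mtime +30 -delete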

Best regards,
Daniele

Daniele Viganò

Sep 25, 2017, 3:25:17 AM
to openqua...@googlegroups.com
On 25/09/17 04:41, fgdg...@gmail.com wrote:
As we progressed quite quickly to 58% of the disagg tasks, the memory consumption exploded.
Find below a snapshot of a compute node after a few hours of computation.
As you can see, it won't progress anytime soon, as all the resources are spent managing swap (I/O wait).

I can confirm that 32 GB on 16 cores is not enough for this computation with the default number of tasks generated. Maybe Michele can give you some hints on this (like generating more, smaller tasks).

About the swap: I should add it to the FAQ/documentation, but having swap active is strongly discouraged on resources fully dedicated to OQ, because of the performance hit when it's used and because of how Python allocates memory. Sometimes, if the memory I/O is stable, it can save a computation from a failure due to an OOM and bring it to the end, but in most cases (when memory throughput is relevant) it is totally useless and will just increase the time to complete the job by several orders of magnitude (making the job effectively stuck).

Michele Simionato

Sep 25, 2017, 5:28:30 AM
to OpenQuake Users
On Monday, September 25, 2017 at 09:25:17 UTC+2, Daniele Viganò wrote:
On 25/09/17 04:41, fgdg...@gmail.com wrote:
As we progressed quite quickly to 58% of the disagg tasks, the memory consumption exploded.
Find below a snapshot of a compute node after a few hours of computation.
As you can see, it won't progress anytime soon, as all the resources are spent managing swap (I/O wait).

I can confirm that 32 GB on 16 cores is not enough for this computation with the default number of tasks generated. Maybe Michele can give you some hints on this (like generating more, smaller tasks).


On the cluster I was not using much memory because more than one thousand tasks were generated. If you have a small number of cores, more memory will be needed. On my workstation (10 cores) I am using around 3 GB per core.
Since you have 4 cores per machine, I suggest increasing the amount of memory to at least 12 GB per machine, better 16 GB per machine.
If the files you see in the shared_dir are the calc_XYZ.hdf5 files, they are okay. They are always generated, even if you do not have a shared_dir. It is your responsibility to clean up old calculations; the engine does not do that automatically.

fgdg...@gmail.com

Sep 25, 2017, 5:59:39 PM
to OpenQuake Users
This is actually what happened: the jobs eventually exhausted the memory + swap on the compute nodes.
So we're at around 4 GB/core rather than 2 or 3.
I would target at least 16 GB per 4-core node for this job as it is.

But if I understand correctly, there is a way to make the tasks smaller by increasing their number. I'm interested in this "parameter"... Is it a simple parameter to change, or a redefinition of the grid/tree/step size (choose your term depending on the science context) in the disagg job?

Best regards.
Franck

Michele Simionato

Sep 26, 2017, 3:28:27 AM
to OpenQuake Users
Running a large calculation in the engine is an art. There are several parameters that may completely change the required time and memory. In particular, in your case:

1. The rupture_mesh_spacing of 2 km is too small. Normally we use 5-10 km. A small rupture_mesh_spacing makes the computation bigger than it needs to be.
2. The maximum_distance of 400 km is too big. Normally we use 200 km. If you still want to use 400 km for the highest magnitudes you can, but then you should use a smaller maximum_distance for the lowest magnitudes. In other words you should use magnitude-distance filtering, a relatively new feature in the engine. For instance you can write maximum_distance = [(9, 400), (8, 300), (4, 20)], meaning: filter out ruptures farther than 400 km for magnitude 9, farther than 300 km for magnitude 8, and farther than 20 km for magnitude 4. A hazard scientist could tell you the optimal magnitude-distance filter depending on the tectonic region type and the GMPE, but even a heuristic magnitude-distance filter can make your calculation several times faster.
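In the job.ini the two suggestions above would look something like this (a sketch only; adapt the values to your model):

rupture_mesh_spacing = 5.0
maximum_distance = [(9, 400), (8, 300), (4, 20)]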

Also, you should take into account that the disaggregation calculator is somewhat less tested than the other calculators, which means that there are bugs and that its performance is not optimal. If you are running large disaggregation calculations, the best way is to stay on the cutting edge with the engine (i.e. use the nightly builds) so that you get the fixes as soon as we implement them. For instance, I discovered today that the filtering was not applied in the disaggregation phase, so it was slower and more memory-consuming than needed, and I fixed it here: https://github.com/gem/oq-engine/pull/3053

Michele Simionato

Sep 26, 2017, 7:02:32 AM
to OpenQuake Users
I forgot: the other way to save memory is to set the parameter concurrent_tasks in the job.ini. This parameter gives a hint to the engine about the number of tasks to generate. If it is not set, the engine will decide by itself how many tasks to generate. If it is set, the engine will generate a number of tasks close to the hint. For instance, in your case you could set

concurrent_tasks = 1000

The more tasks, the less memory, at least for the disaggregation calculator.

fgdg...@gmail.com

Sep 26, 2017, 5:40:05 PM
to OpenQuake Users
Thank you Michele for those tips. I restarted the computation with a high value for the concurrent_tasks parameter and passed the other tips on to the scientists.
I'll keep you up to date about the outcome of the job.

Best regards.
Franck

fgdg...@gmail.com

Sep 28, 2017, 6:13:03 PM
to OpenQuake Users
Gentlemen,
Just to let you know that the calculation finally completed in less than 36h.
The concurrent_tasks parameter did the trick.

Thank you for all your time and efforts.

Best regards.
Franck