Dear Franck,
to better understand your issue we need some further information.
I would also suggest that you first try upgrading to v2.5.0, to see whether the issue is still reproducible.
If the situation does not change, you can send us your input files at engine....@openquake.org so we can try to reproduce your issue.
I suspect anyway that the root cause is
IOError: [Errno 24] Too many open files: '/proc/3829/stat'
Everything else is a consequence of it (the failures in rabbitmq and
dbserver). Can you check the default limits by running 'ulimit -a'
from a user shell?
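If the limit turns out to be low, it can usually be raised for the user running the engine via /etc/security/limits.conf; the values below are only indicative, adjust them to your environment:

# /etc/security/limits.conf (example values)
openquake  soft  nofile  65536
openquake  hard  nofile  65536

A new login shell is needed for the change to take effect; 'ulimit -n' should then report the new value.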
Cheers,
Daniele
Anyway, the calculation is big but not huge, so it can be done on a single server if you wait long enough (probably less than 1 day).
Dear Franck,
I haven't had time yet to read the full thread; I hope to have more
time next week. Anyway, I'm setting up a testbed made of CentOS 7
VMs (unfortunately our cluster runs Ubuntu 14.04 with a tweaked
version of Python 3.5, and CentOS testing is done using demo
calculations inside a Docker container). I'm using a 4-core master
and a 16-core worker. I'm using a single worker, but this should
not change the picture, since the number of connections depends on
the number of workers, not on the number of nodes.
See some considerations below:
It looks like the computation fails when the master tries to submit tasks to the workers: it seems to be an issue on the RabbitMQ side, which has difficulties flushing tasks from the queue, or is causing a traffic jam. Even if you increase the ulimits this is not the expected behavior, and if you still don't see any load on the workers my fear is that they will never start processing tasks. We have had similar bugs with specific versions of RabbitMQ in the past (~3 years ago).
Something you can try, which should not cost you too much effort, is a more recent version of RabbitMQ (https://dl.bintray.com/rabbitmq/rabbitmq-server-rpm/rabbitmq-server-3.6.12-1.el7.noarch.rpm).
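On a CentOS 7 host the upgrade should look roughly like this (a sketch only; please double-check against the RabbitMQ installation docs for your setup):

$ wget https://dl.bintray.com/rabbitmq/rabbitmq-server-rpm/rabbitmq-server-3.6.12-1.el7.noarch.rpm
$ yum install ./rabbitmq-server-3.6.12-1.el7.noarch.rpm
$ systemctl restart rabbitmq-server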
I'm starting the job now; I'll let you know if I'm able
to reproduce the error and (hopefully) where it's coming from.
Dear Franck,
I've been able to reproduce your issue. Using a more recent
version of RabbitMQ the 'too many open files' error is not
triggered, but the computation is stuck and the behavior is more
or less the same.
The issue is in the communication between the OpenQuake Engine (or
better, AMQP) and RabbitMQ: AMQP continuously opens new connections
against RabbitMQ, but no data is then sent to the workers.
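For the record, the piling up of connections can be observed on the RabbitMQ node with the standard rabbitmqctl tool:

$ rabbitmqctl list_connections
# or simply count them:
$ rabbitmqctl list_connections | wc -l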
We'll need to do some further investigation to track down the bug, and also to understand why it worked on our configuration. It will take some days due to the weight of the computation.
Cheers,
Daniele
Hi Franck,
some quick updates. The reason why Michele has been able to run the computation on our cluster is due to the 'shared dir' feature. With it enabled on the CentOS 7 VM, I was able to start the disaggregation phase. Unfortunately I ran out of memory (16 cores, 16GB; I will try again soon with 32GB), but I did not experience the issue with RabbitMQ.
We need to figure out what happens without the shared_dir, but
I'm not sure it's a bug. With the shared dir unset, we have to
transfer all the data via RabbitMQ. It may just need a relaxed
open files limit (see 'Controlling System Limits on Linux' under
https://www.rabbitmq.com/install-rpm.html), or RabbitMQ may simply
be incapable of managing such an amount of data, in which case a
shared dir is required (that's why we introduced it). I will check
with Michele.
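For reference, on a systemd-based distribution such as CentOS 7 the RabbitMQ open files limit is usually relaxed with a drop-in unit file; the value below is only an example:

$ systemctl edit rabbitmq-server
# add in the editor that opens:
[Service]
LimitNOFILE=64000
$ systemctl restart rabbitmq-server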
Adding a shared dir is not documented yet, but here are some guidelines (/home/shared is just an example; it can be any path):
All nodes
Update /etc/openquake/openquake.cfg, setting /home/shared as the shared_dir value
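The stanza should look more or less like this (check the section name against the stock openquake.cfg shipped with your version):

[directory]
shared_dir = /home/shared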
Master
$ yum install nfs-utils
$ mkdir /home/shared
$ chgrp openquake /home/shared
# Users need to create their oqdata inside
$ chmod 777 /home/shared
# Keep openquake group as owner of its content
$ chmod g+s /home/shared
To /etc/exports add:
/home/shared <ip subnet>/<mask>(ro,sync,no_all_squash)
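For example, with the workers on a 192.168.1.0/24 network (illustrative values only):

/home/shared 192.168.1.0/24(ro,sync,no_all_squash)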
$ systemctl enable rpcbind
$ systemctl enable nfs-server
$ systemctl enable nfs-lock
$ systemctl enable nfs-idmap
$ systemctl start rpcbind
$ systemctl start nfs-server
$ systemctl start nfs-lock
$ systemctl start nfs-idmap
$ systemctl restart openquake-dbserver
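After editing /etc/exports you can re-export and verify the result with the standard NFS tools, before moving on to the workers:

$ exportfs -ra
$ showmount -e localhost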
Workers
$ yum install nfs-utils
$ mkdir /home/shared
$ mount <master ip>:/home/shared /home/shared -o ro,bg,soft,intr,noauto
$ systemctl restart openquake-celery
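If you prefer the mount to survive reboots, an equivalent /etc/fstab entry would be (same options as the manual mount above):

<master ip>:/home/shared  /home/shared  nfs  ro,bg,soft,intr,noauto  0 0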
Best regards,
Daniele
Good to see that you were able to replicate the issue.
I was wondering, when I did the setup, how long we could live without the shared_dir feature... well, I guess I have my answer.
I switched the configuration to the shared_dir one, and I confirm that the disagg tasks are spread and processed by the compute nodes, so it seems OK.
From Michele's numbers, I think we should complete it in 4 days.
I can switch back to a pure RabbitMQ configuration quite quickly if you need me to do some tests.
One question: it seems that every job leaves data in the shared_dir directory. Is it intended that a job leaves data in this directory when it completes?
I'm thinking about a community of scientists submitting jobs and the impact on disk space over time.
As we progressed quite quickly to 58% of the disagg task, the memory consumption exploded.
Find below a snapshot of a compute node after a few hours of computation.
As you can see, it won't progress anytime soon, since all the resources are spent managing the swap (io wait).
I can confirm that 32GB for 16 cores is not enough for this computation with the default number of tasks generated. Maybe Michele can give you some hints on this (like generating more, smaller tasks).
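For what it's worth, the engine has a concurrent_tasks parameter that controls how the work is split: raising it produces more, smaller tasks, which should lower the peak memory used by each one. The value below is purely illustrative; Michele can suggest a sensible one for this model:

# in the job.ini of the calculation (illustrative value)
concurrent_tasks = 1000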