celery 4.1 issue in OpenQuake 3.1: 'struct.error: pack_into requires a buffer of at least *** bytes'


Rui Yang

Jul 9, 2018, 1:32:45 AM
to OpenQuake Users
Dear developers,

I recently installed OpenQuake 3.0 and 3.1 from source on our HPC cluster. OpenQuake 3.0 (and all previous versions) works fine, but OpenQuake 3.1 has an issue arising from the celery and amqp modules.

For example, I have no problem running the demo example 'LogicTreeCase2ClassicalPSHA' on 2 nodes (32 cores each) with OpenQuake 3.0, but OpenQuake 3.1 gives the error message below. OpenQuake 3.1 uses celery 4.1.0 while OpenQuake 3.0 stays with celery 3.1.20, so I suppose this is caused by the new version of celery; as far as I tested, OpenQuake 3.1 no longer seems to work with celery 3.1.20. Similar issues happen with some other test cases.

Many thanks for any help in solving this issue.

Regards,
Rui

Error message from OpenQuake 3.1:
-bash-4.1$ oq engine --run job.ini 
Using celery@r861, celery@r862, 64 cores
INFO:root:zipping the input files
[2018-07-09 03:47:05,960 #4 INFO] Running ./LogicTreeCase2ClassicalPSHA/job.ini
[2018-07-09 03:47:06,054 #4 INFO] Using engine version 3.1.0-gite54577f
[2018-07-09 03:47:06,116 #4 INFO] Reading the risk model if present
[2018-07-09 03:47:06,190 #4 INFO] Read 1 hazard sites
[2018-07-09 03:47:06,287 #4 INFO] Reading /short/zs6/rxy900/openquake_job/3.1/LogicTreeCase2ClassicalPSHA/source_model.xml
[2018-07-09 03:47:06,412 #4 INFO] Processed source model 1 with 4 potential gsim path(s) 
...
[2018-07-09 03:47:22,057 #4 INFO] /short/zs6/rxy900/openquake_job/3.1/LogicTreeCase2ClassicalPSHA/source_model.xml has been considered 81 times
[2018-07-09 03:47:22,156 #4 INFO] Splitting sources
[2018-07-09 03:47:23,894 #4 INFO] Using receiver tcp://10.9.12.69:1919
[2018-07-09 03:47:24,281 #4 INFO] Submitting 96 "RtreeFilter" tasks
[2018-07-09 03:47:24,452 #4 INFO] Sent 1.18 MB of data in 96 task(s)
[2018-07-09 03:47:24,544 #4 INFO] RtreeFilter   1%
...
[2018-07-09 03:47:34,674 #4 INFO] RtreeFilter  98%
[2018-07-09 03:47:34,736 #4 INFO] RtreeFilter 100%
[2018-07-09 03:47:34,811 #4 INFO] Received 1.29 MB of data, maximum per task 15.01 KB
[2018-07-09 03:47:34,923 #4 INFO] Using maxweight=441
[2018-07-09 03:47:35,291 #4 INFO] Submitting  "classical" tasks
[2018-07-09 03:47:41,278 #4 INFO] Sent 5238 sources in 208 tasks
[2018-07-09 03:47:41,333 #4 INFO] Sent 1.82 MB of data in 208 task(s)
[2018-07-09 03:47:41,537 #4 INFO] classical   1%
...
[2018-07-09 03:47:51,271 #4 INFO] classical  99%
[2018-07-09 03:47:51,365 #4 INFO] classical 100%
[2018-07-09 03:47:51,408 #4 INFO] Received 244.3 KB of data, maximum per task 2.03 KB
[2018-07-09 03:47:51,475 #4 INFO] Effective sites per task: 1
[2018-07-09 03:47:52,095 #4 INFO] Using receiver tcp://10.9.12.69:1912
[2018-07-09 03:47:52,192 #4 INFO] Reading PoEs on 1 sites
[2018-07-09 03:47:52,410 #4 INFO] Submitting  "build_hcurves_and_stats" tasks
[2018-07-09 03:47:52,493 #4 CRITICAL] 
Traceback (most recent call last):
  File "/apps/hpc-opt/openquake/3.1/src/oq-engine/openquake/calculators/base.py", line 189, in run
    self.result = self.execute()
  File "/apps/hpc-opt/openquake/3.1/src/oq-engine/openquake/calculators/classical.py", line 353, in execute
    self.core_task.__func__, self.gen_args()
  File "/apps/hpc-opt/openquake/3.1/src/oq-engine/openquake/baselib/parallel.py", line 615, in submit_all
    num_tasks = next(it)
  File "/apps/hpc-opt/openquake/3.1/src/oq-engine/openquake/baselib/parallel.py", line 652, in _iter_celery
    res = safetask.delay(self.task_func, piks)
  File "/apps/hpc-opt/openquake/3.1/oqenv/lib/python3.5/site-packages/celery/app/task.py", line 413, in delay
    return self.apply_async(args, kwargs)
  File "/apps/hpc-opt/openquake/3.1/oqenv/lib/python3.5/site-packages/celery/app/task.py", line 536, in apply_async
    **options
  File "/apps/hpc-opt/openquake/3.1/oqenv/lib/python3.5/site-packages/celery/app/base.py", line 737, in send_task
    amqp.send_task_message(P, name, message, **options)
  File "/apps/hpc-opt/openquake/3.1/oqenv/lib/python3.5/site-packages/celery/app/amqp.py", line 554, in send_task_message
    **properties
  File "/apps/hpc-opt/openquake/3.1/oqenv/lib/python3.5/site-packages/kombu/messaging.py", line 181, in publish
    exchange_name, declare,
  File "/apps/hpc-opt/openquake/3.1/oqenv/lib/python3.5/site-packages/kombu/connection.py", line 494, in _ensured
    return fun(*args, **kwargs)
  File "/apps/hpc-opt/openquake/3.1/oqenv/lib/python3.5/site-packages/kombu/messaging.py", line 203, in _publish
    mandatory=mandatory, immediate=immediate,
  File "/apps/hpc-opt/openquake/3.1/oqenv/lib/python3.5/site-packages/amqp/channel.py", line 1734, in _basic_publish
    (0, exchange, routing_key, mandatory, immediate), msg
  File "/apps/hpc-opt/openquake/3.1/oqenv/lib/python3.5/site-packages/amqp/abstract_channel.py", line 50, in send_method
    conn.frame_writer(1, self.channel_id, sig, args, content)
  File "/apps/hpc-opt/openquake/3.1/oqenv/lib/python3.5/site-packages/amqp/method_framing.py", line 163, in write_frame
    3, channel, framelen, str_to_bytes(body), 0xce)
struct.error: pack_into requires a buffer of at least 296911 bytes
[2018-07-09 03:47:52,570 #4 CRITICAL] Traceback (most recent call last):
  File "/apps/hpc-opt/openquake/3.1/src/oq-engine/openquake/engine/engine.py", line 330, in run_calc
    close=False, **kw)  # don't close the datastore too soon
  File "/apps/hpc-opt/openquake/3.1/src/oq-engine/openquake/calculators/base.py", line 189, in run
    self.result = self.execute()
  File "/apps/hpc-opt/openquake/3.1/src/oq-engine/openquake/calculators/classical.py", line 353, in execute
    self.core_task.__func__, self.gen_args()
  File "/apps/hpc-opt/openquake/3.1/src/oq-engine/openquake/baselib/parallel.py", line 615, in submit_all
    num_tasks = next(it)
  File "/apps/hpc-opt/openquake/3.1/src/oq-engine/openquake/baselib/parallel.py", line 652, in _iter_celery
    res = safetask.delay(self.task_func, piks)
  File "/apps/hpc-opt/openquake/3.1/oqenv/lib/python3.5/site-packages/celery/app/task.py", line 413, in delay
    return self.apply_async(args, kwargs)
  File "/apps/hpc-opt/openquake/3.1/oqenv/lib/python3.5/site-packages/celery/app/task.py", line 536, in apply_async
    **options
  File "/apps/hpc-opt/openquake/3.1/oqenv/lib/python3.5/site-packages/celery/app/base.py", line 737, in send_task
    amqp.send_task_message(P, name, message, **options)
  File "/apps/hpc-opt/openquake/3.1/oqenv/lib/python3.5/site-packages/celery/app/amqp.py", line 554, in send_task_message
    **properties
  File "/apps/hpc-opt/openquake/3.1/oqenv/lib/python3.5/site-packages/kombu/messaging.py", line 181, in publish
    exchange_name, declare,
  File "/apps/hpc-opt/openquake/3.1/oqenv/lib/python3.5/site-packages/kombu/connection.py", line 494, in _ensured
    return fun(*args, **kwargs)
  File "/apps/hpc-opt/openquake/3.1/oqenv/lib/python3.5/site-packages/kombu/messaging.py", line 203, in _publish
    mandatory=mandatory, immediate=immediate,
  File "/apps/hpc-opt/openquake/3.1/oqenv/lib/python3.5/site-packages/amqp/channel.py", line 1734, in _basic_publish
    (0, exchange, routing_key, mandatory, immediate), msg
  File "/apps/hpc-opt/openquake/3.1/oqenv/lib/python3.5/site-packages/amqp/abstract_channel.py", line 50, in send_method
    conn.frame_writer(1, self.channel_id, sig, args, content)
  File "/apps/hpc-opt/openquake/3.1/oqenv/lib/python3.5/site-packages/amqp/method_framing.py", line 163, in write_frame
    3, channel, framelen, str_to_bytes(body), 0xce)
struct.error: pack_into requires a buffer of at least 296911 bytes

Traceback (most recent call last):
  File "/apps/hpc-opt/openquake/3.1/oqenv/bin/oq_real", line 6, in <module>
    exec(compile(open(__file__).read(), __file__, 'exec'))
  File "/apps/hpc-opt/openquake/3.1/src/oq-engine/bin/oq", line 23, in <module>
    main.oq()
  File "/apps/hpc-opt/openquake/3.1/src/oq-engine/openquake/commands/__main__.py", line 46, in oq
    parser.callfunc()
  File "/apps/hpc-opt/openquake/3.1/src/oq-engine/openquake/baselib/sap.py", line 186, in callfunc
    return self.func(**vars(namespace))
  File "/apps/hpc-opt/openquake/3.1/src/oq-engine/openquake/baselib/sap.py", line 245, in main
    return func(**kw)
  File "/apps/hpc-opt/openquake/3.1/src/oq-engine/openquake/commands/engine.py", line 170, in engine
    exports, hazard_calculation_id=hc_id)
  File "/apps/hpc-opt/openquake/3.1/src/oq-engine/openquake/commands/engine.py", line 66, in run_job
    hazard_calculation_id=hazard_calculation_id, **kw)
  File "/apps/hpc-opt/openquake/3.1/src/oq-engine/openquake/engine/engine.py", line 330, in run_calc
    close=False, **kw)  # don't close the datastore too soon
  File "/apps/hpc-opt/openquake/3.1/src/oq-engine/openquake/calculators/base.py", line 189, in run
    self.result = self.execute()
  File "/apps/hpc-opt/openquake/3.1/src/oq-engine/openquake/calculators/classical.py", line 353, in execute
    self.core_task.__func__, self.gen_args()
  File "/apps/hpc-opt/openquake/3.1/src/oq-engine/openquake/baselib/parallel.py", line 615, in submit_all
    num_tasks = next(it)
  File "/apps/hpc-opt/openquake/3.1/src/oq-engine/openquake/baselib/parallel.py", line 652, in _iter_celery
    res = safetask.delay(self.task_func, piks)
  File "/apps/hpc-opt/openquake/3.1/oqenv/lib/python3.5/site-packages/celery/app/task.py", line 413, in delay
    return self.apply_async(args, kwargs)
  File "/apps/hpc-opt/openquake/3.1/oqenv/lib/python3.5/site-packages/celery/app/task.py", line 536, in apply_async
    **options
  File "/apps/hpc-opt/openquake/3.1/oqenv/lib/python3.5/site-packages/celery/app/base.py", line 737, in send_task
    amqp.send_task_message(P, name, message, **options)
  File "/apps/hpc-opt/openquake/3.1/oqenv/lib/python3.5/site-packages/celery/app/amqp.py", line 554, in send_task_message
    **properties
  File "/apps/hpc-opt/openquake/3.1/oqenv/lib/python3.5/site-packages/kombu/messaging.py", line 181, in publish
    exchange_name, declare,
  File "/apps/hpc-opt/openquake/3.1/oqenv/lib/python3.5/site-packages/kombu/connection.py", line 494, in _ensured
    return fun(*args, **kwargs)
  File "/apps/hpc-opt/openquake/3.1/oqenv/lib/python3.5/site-packages/kombu/messaging.py", line 203, in _publish
    mandatory=mandatory, immediate=immediate,
  File "/apps/hpc-opt/openquake/3.1/oqenv/lib/python3.5/site-packages/amqp/channel.py", line 1734, in _basic_publish
    (0, exchange, routing_key, mandatory, immediate), msg
  File "/apps/hpc-opt/openquake/3.1/oqenv/lib/python3.5/site-packages/amqp/abstract_channel.py", line 50, in send_method
    conn.frame_writer(1, self.channel_id, sig, args, content)
  File "/apps/hpc-opt/openquake/3.1/oqenv/lib/python3.5/site-packages/amqp/method_framing.py", line 163, in write_frame
    3, channel, framelen, str_to_bytes(body), 0xce)
struct.error: pack_into requires a buffer of at least 296911 bytes



Normal output from OpenQuake 3.0:
-bash-4.1$ oq engine --run job.ini 
INFO:root:Using celery@r3010, celery@r3022, 64 cores
[2018-07-09 04:13:06,406 #2 INFO] Running ./LogicTreeCase2ClassicalPSHA/job.ini
[2018-07-09 04:13:06,484 #2 INFO] Using engine version 3.0.1-git4a00a9d
[2018-07-09 04:13:06,568 #2 INFO] There are 1 hazard site(s)
[2018-07-09 04:13:06,634 #2 INFO] Reading the risk model if present
[2018-07-09 04:13:06,708 #2 INFO] Reading /short/zs6/rxy900/openquake_job/3.0/LogicTreeCase2ClassicalPSHA/source_model.xml
[2018-07-09 04:13:06,815 #2 INFO] Processed source model 1 with 4 potential gsim path(s) and 2 sources
...
[2018-07-09 04:13:38,462 #2 INFO] classical  99%
[2018-07-09 04:13:38,560 #2 INFO] classical 100%
[2018-07-09 04:13:38,658 #2 INFO] Received 244.3 KB of data, maximum per task 2.03 KB
[2018-07-09 04:13:38,717 #2 INFO] Effective sites per task: 1
[2018-07-09 04:13:39,260 #2 INFO] Using receiver tcp://10.9.52.58:1919
[2018-07-09 04:13:39,363 #2 INFO] Reading PoEs on 1 sites
[2018-07-09 04:13:39,549 #2 INFO] Submitting  "build_hcurves_and_stats" tasks
[2018-07-09 04:13:39,625 #2 INFO] Sent 210.1 KB of data in 1 task(s)
[2018-07-09 04:13:39,715 #2 INFO] build_hcurves_and_stats 100%
[2018-07-09 04:13:39,794 #2 INFO] Received 1.21 KB of data, maximum per task 1.21 KB
[2018-07-09 04:13:42,221 #2 INFO] Calculation 2 finished correctly in 33 seconds
  id | name
   7 | Full Report
   8 | Hazard Curves
   9 | Hazard Maps
  10 | Realizations


Daniele Viganò

Jul 9, 2018, 3:54:41 AM
to openqua...@googlegroups.com

Dear Rui,

could you please:

  1. provide a copy of the output of 'pip freeze' from the environment where the Engine is installed
  2. make sure that all the nodes are aligned to the same versions of the libraries if the code isn't exported via NFS (you can also run 'pip freeze' there, again only if NFS or another filesystem-sharing technique isn't in place)
  3. tell us which OS (including the version) and which release of Python you are using on your cluster

If you would like to send this information privately, you can send it to engine....@openquake.org.

Cheers,
Daniele
--
You received this message because you are subscribed to the Google Groups "OpenQuake Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openquake-use...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

--
DANIELE VIGANÒ | System Administrator | Skype dennyv85 | +39 0382 5169882
GLOBAL EARTHQUAKE MODEL | working together to assess risk

Daniele Viganò

Jul 9, 2018, 4:01:09 AM
to openqua...@googlegroups.com

Dear Rui,

While waiting for your feedback I had a look at the source code of amqp (a third-party dependency used by Celery) and it looks like a memory issue: are you running the job with any memory constraint? Or does the user running the job have any limitation on memory allocation on the cluster?

It looks like amqp is trying to allocate memory that isn't released by the OS (or by a scheduler). A higher memory usage in 3.1 is something you may expect. Unfortunately, this looks like an issue very specific to your deployment, and thus not easy for us to debug.

Cheers,
Daniele


On 07/09/2018 07:32 AM, Rui Yang wrote:

Rui Yang

Jul 9, 2018, 6:02:10 AM
to OpenQuake Users
Dear Daniele,

Many thanks for your prompt response. I did the following steps to install the dependencies and OpenQuake:

$ module load python3/3.5.2   # on-site Python environment
$ python3.5 -m venv oqenv
$ source oqenv/bin/activate
$ pip install -U pip setuptools
$ mkdir src && cd src
$ git clone -b engine-3.1 https://github.com/gem/oq-engine.git
$ pip install -r oq-engine/requirements-py35-linux64.txt
$ pip install -e oq-engine/

OS: Linux 3.10.0 el6.x86_64 GNU/Linux
Python: Python 3.5.2

The 'pip freeze' command gives the following output:
(oqenv) bash-4.1$ pip freeze
amqp==2.2.2
basemap==1.1.0
billiard==3.5.0.4
celery==4.1.0
certifi==2018.1.18
chardet==3.0.4
cycler==0.10.0
decorator==4.2.1
Django==2.0.4
docutils==0.14
h5py==2.8.0rc1
idna==2.6
kombu==4.1.0
matplotlib==2.1.2
mock==2.0.0
nose==1.3.7
numpy==1.14.2
pbr==4.0.0
psutil==5.4.3
pyparsing==2.2.0
pyproj==1.9.5.1
pyshp==1.2.3
python-dateutil==2.7.2
python-prctl==1.6.1
pytz==2018.3
PyYAML==3.12
pyzmq==17.0.0
requests==2.18.4
Rtree==0.8.3
scipy==1.0.1
setproctitle==1.1.10
Shapely==1.6.4.post1
six==1.11.0
urllib3==1.22
vine==1.1.4

The OQ installation directory is NFS-mounted and visible to all worker nodes. There is a memory constraint for each process, but I believe it is large enough to run the demo examples: the total memory used while running the job is less than 500 MB. I requested 2 GB and then 4 GB of memory per process, but both still fail.
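
As a generic way to check such limits (not OpenQuake-specific), ulimit shows what the shell that launches the workers is constrained to:

```shell
# Inspect the per-process resource limits in effect for this shell
ulimit -v   # max virtual memory, in KB ('unlimited' means no cap)
ulimit -a   # the full list of limits
```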

I am a bit confused that the major calculations complete without any problem, yet the job fails on the trivial "build_hcurves_and_stats" task. From OQ 3.0 it seems the payload to be sent is about 210.1 KB:

[2018-07-09 04:13:39,549 #2 INFO] Submitting  "build_hcurves_and_stats" tasks
[2018-07-09 04:13:39,625 #2 INFO] Sent 210.1 KB of data in 1 task(s)
[2018-07-09 04:13:39,715 #2 INFO] build_hcurves_and_stats 100%

However, OQ 3.1 couldn't allocate a buffer of that size in amqp:
struct.error: pack_into requires a buffer of at least 296911 bytes
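
For illustration, the same struct.error appears whenever the payload handed to struct.pack_into is larger than the buffer it must be written into, which is what happens inside amqp when the preallocated frame buffer is smaller than the outgoing task message (the sizes below are invented):

```python
import struct

frame_buffer = bytearray(4096)  # pretend the preallocated frame buffer is 4 KB
body = b"x" * 8192              # a task payload larger than the buffer

try:
    # '>BHI%dsB' mirrors an AMQP body frame: type, channel, size, payload, 0xCE
    struct.pack_into('>BHI%dsB' % len(body), frame_buffer, 0,
                     3, 1, len(body), body, 0xCE)
except struct.error:
    print("pack_into refused: buffer smaller than the packed frame")
```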

I googled the same issue but I am not sure whether the following one is relevant:

Is there a way for OQ 3.1 to invoke the APIs of the old celery 3.1 to work around this issue temporarily?

Regards,
Rui

Daniele Viganò

Jul 9, 2018, 6:59:10 AM
to openqua...@googlegroups.com

Dear Rui,

thanks for the feedback. See below for my replies


On 07/09/2018 12:02 PM, Rui Yang wrote:

> I googled the same issue but I am not sure whether the following one is relevant:

I saw the same bug report, but the fix is tagged for v2.1.1 and we are using v2.2.2, so it should already be included.

> Is there a way for OQ 3.1 to invoke the APIs of the old celery 3.1 to work around this issue temporarily?

First of all, I would suggest you try the latest available amqp (should be 2.3.2) and Celery (4.2.0):

$ pip install -U celery amqp

The Celery 4.2.0 API is backward compatible with Celery 4.1.0.

If you want to try reverting to Celery 3.1, most of the changes are here https://github.com/gem/oq-engine/commit/6980408f0e64b6bfa64d91f968c8b2a714b943e8#diff-7b9dcacb392385214b68dc758b7984a9 and here https://github.com/gem/oq-engine/commit/3a275532603a542106673bd5f0febd10046eb358#diff-7b9dcacb392385214b68dc758b7984a9, but many other changes happened (not directly related to Celery), so we cannot guarantee that everything will still work.

Just to have a complete picture of the environment: which task scheduler are you using (PBS, Torque, Slurm, SGE, or whatever)? And how do you start a job?

Cheers,
Daniele

Rui Yang

Jul 9, 2018, 9:14:03 AM
to OpenQuake Users

Hi Daniele,

Many thanks for your suggestions; I will follow them to address the issue.

The job scheduler is PBSPro, but it is only used for requesting worker nodes. After that, the RabbitMQ server is started on one node and celery workers are started on all worker nodes. Thus the job scheduler doesn't actually control the running job except for the time and resource limits.

Regards,
Rui

Daniele Viganò

Jul 9, 2018, 10:06:07 AM
to openqua...@googlegroups.com

Hi Rui,

thanks. I did a quick trial on a test bench similar to your configuration (apart from the scheduler): CentOS 6, kernel 3.10.0, Python 3.5.2, Engine 3.1.0 (plus the reference libraries), one master and one node connected via Celery, and I was unable to reproduce the issue. There are still many architectural changes or customisations that I'm not aware of (kernel, networking, ...) which are specific to each HPC setup and quite hard to debug without access to it, even as a user.

This is the output I got:


INFO:root:zipping the input files
[2018-07-09 15:43:28,341 #7 INFO] Running /home/daniele/oq-engine/demos/hazard/LogicTreeCase2ClassicalPSHA/job.ini
[2018-07-09 15:43:28,350 #7 INFO] Using engine version 3.1.0
[2018-07-09 15:43:28,366 #7 INFO] Reading the risk model if present
[2018-07-09 15:43:28,381 #7 INFO] Read 1 hazard sites
[2018-07-09 15:43:28,396 #7 INFO] Reading /home/daniele/oq-engine/demos/hazard/LogicTreeCase2ClassicalPSHA/source_model.xml
[cut]
[2018-07-09 15:43:32,127 #7 INFO] /home/daniele/oq-engine/demos/hazard/LogicTreeCase2ClassicalPSHA/source_model.xml has been considered 81 times
[2018-07-09 15:43:32,144 #7 INFO] Splitting sources
[2018-07-09 15:43:33,632 #7 INFO] Using receiver tcp://192.168.1.1:1914
Using cel...@worker.el6, 8 cores
[2018-07-09 15:43:33,707 #7 INFO] Submitting 12 "RtreeFilter" tasks
[2018-07-09 15:43:33,795 #7 INFO] Sent 1.01 MB of data in 12 task(s)
[2018-07-09 15:43:34,137 #7 INFO] RtreeFilter   8%
[2018-07-09 15:43:34,208 #7 INFO] RtreeFilter  16%
[2018-07-09 15:43:34,264 #7 INFO] RtreeFilter  25%
[2018-07-09 15:43:34,589 #7 INFO] RtreeFilter  33%
[2018-07-09 15:43:34,712 #7 INFO] RtreeFilter  41%
[2018-07-09 15:43:34,775 #7 INFO] RtreeFilter  50%
[2018-07-09 15:43:34,850 #7 INFO] RtreeFilter  58%
[2018-07-09 15:43:34,932 #7 INFO] RtreeFilter  66%
[2018-07-09 15:43:35,055 #7 INFO] RtreeFilter  75%
[2018-07-09 15:43:35,080 #7 INFO] RtreeFilter  83%
[2018-07-09 15:43:35,139 #7 INFO] RtreeFilter  91%
[2018-07-09 15:43:35,168 #7 INFO] RtreeFilter 100%
[2018-07-09 15:43:35,235 #7 INFO] Received 1.21 MB of data, maximum per task 104.48 KB
[2018-07-09 15:43:35,336 #7 INFO] Using maxweight=3526
[2018-07-09 15:43:35,347 #7 INFO] Submitting  "classical" tasks
[2018-07-09 15:43:38,968 #7 INFO] Sent 5238 sources in 25 tasks
[2018-07-09 15:43:39,042 #7 INFO] Sent 1.27 MB of data in 25 task(s)
[2018-07-09 15:44:09,500 #7 INFO] classical   4%
[2018-07-09 15:44:11,111 #7 INFO] classical   8%
[2018-07-09 15:44:11,177 #7 INFO] classical  12%
[2018-07-09 15:44:11,792 #7 INFO] classical  16%
[2018-07-09 15:44:12,024 #7 INFO] classical  20%
[2018-07-09 15:44:12,206 #7 INFO] classical  24%
[2018-07-09 15:44:12,298 #7 INFO] classical  28%
[2018-07-09 15:44:12,449 #7 INFO] classical  32%
[2018-07-09 15:44:44,266 #7 INFO] classical  36%
[2018-07-09 15:44:45,431 #7 INFO] classical  40%
[2018-07-09 15:44:45,800 #7 INFO] classical  44%
[2018-07-09 15:44:46,450 #7 INFO] classical  48%
[2018-07-09 15:44:46,848 #7 INFO] classical  52%
[2018-07-09 15:44:46,983 #7 INFO] classical  56%
[2018-07-09 15:44:47,150 #7 INFO] classical  60%
[2018-07-09 15:44:47,514 #7 INFO] classical  64%
[2018-07-09 15:45:15,709 #7 INFO] classical  68%
[2018-07-09 15:45:27,567 #7 INFO] classical  72%
[2018-07-09 15:45:28,511 #7 INFO] classical  76%
[2018-07-09 15:46:11,365 #7 INFO] classical  80%
[2018-07-09 15:46:31,533 #7 INFO] classical  84%
[2018-07-09 15:46:31,616 #7 INFO] classical  88%
[2018-07-09 15:46:32,160 #7 INFO] classical  92%
[2018-07-09 15:46:32,599 #7 INFO] classical  96%
[2018-07-09 15:46:32,683 #7 INFO] classical 100%
[2018-07-09 15:46:32,699 #7 INFO] Received 83.14 KB of data, maximum per task 6.61 KB
[2018-07-09 15:46:32,707 #7 INFO] Effective sites per task: 1
[2018-07-09 15:46:33,275 #7 INFO] Using receiver tcp://192.168.1.1:1919
[2018-07-09 15:46:33,285 #7 INFO] Reading PoEs on 1 sites
[2018-07-09 15:46:33,446 #7 INFO] Submitting  "build_hcurves_and_stats" tasks
[2018-07-09 15:46:33,484 #7 INFO] Sent 212.83 KB of data in 1 task(s)
[2018-07-09 15:46:33,902 #7 INFO] build_hcurves_and_stats 100%
[2018-07-09 15:46:33,953 #7 INFO] Received 1.21 KB of data, maximum per task 1.21 KB
[2018-07-09 15:46:34,242 #7 INFO] Calculation 7 finished correctly in 185 seconds
  id | name
  13 | Full Report
  14 | Hazard Curves
  15 | Hazard Maps
  16 | Input Files
  17 | Realizations
  18 | Seismic Source Groups

$ uname -a
Linux master.el6 3.10.102-1.el6.x86_64 #1 SMP Tue Jun 14 11:40:50 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux

$ python3 --version
Python 3.5.2


Cheers,
Daniele

Rui Yang

Jul 10, 2018, 10:10:13 AM
to OpenQuake Users
Hi Daniele,

Many thanks for your efforts and time testing this. Your conclusion reminded me to check my customised RabbitMQ settings, which are the same across all versions of OpenQuake. It turns out the problem was caused by setting a value for 'frame_max'. After removing it, OpenQuake 3.1 works on this and all the other demos.

Then I met another issue when running a user job with high celery concurrency. Basically it is a "socket closed" issue during the calculation, probably caused by a disconnection between the RabbitMQ server and the client, with the RabbitMQ error message "missed heartbeats from client, timeout: 120s". Again, OpenQuake 3.0 works fine on this job although it has exactly the same RabbitMQ configuration as OpenQuake 3.1. The RabbitMQ server version in use is 3.6.5. Do you have any suggestions for fixing this issue? Thanks again for your help.

Regards,
Rui
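
For reference, a 'frame_max' override of the kind described above would look like this in a classic-format rabbitmq.config (RabbitMQ 3.6.x); the value shown is hypothetical, and deleting the entry restores the server default of 131072 bytes:

```erlang
%% /etc/rabbitmq/rabbitmq.config -- hypothetical excerpt
[
  {rabbit, [
    %% Caps AMQP frames below the size of the task messages the engine
    %% publishes; removing this line restores the 131072-byte default.
    {frame_max, 8192}
  ]}
].
```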

Daniele Viganò

Jul 10, 2018, 10:28:46 AM
to openqua...@googlegroups.com

Hi Rui,


On 07/10/2018 04:10 PM, Rui Yang wrote:

> It turns out the problem was caused by setting a value for 'frame_max'. After removing it, OpenQuake 3.1 works on this and all the other demos.

That's good news.

> Then I met another issue when running a user job with high celery concurrency. Basically it is a "socket closed" issue during the calculation, probably caused by a disconnection between the RabbitMQ server and the client, with the RabbitMQ error message "missed heartbeats from client, timeout: 120s". Again, OpenQuake 3.0 works fine on this job although it has exactly the same RabbitMQ configuration as OpenQuake 3.1. The RabbitMQ server version in use is 3.6.5.

Could you please:
  1. Explain "high celery concurrency": how high is 'high' for you?
  2. Could you share (even privately) your RabbitMQ config?
  3. At which point of the calculation does it happen? Could you share the log?

Cheers,
Daniele
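
A general pointer on the heartbeat question above (an assumption-laden sketch, not an OpenQuake-endorsed fix): the AMQP heartbeat interval requested by the client can be tuned, or disabled, from the Celery configuration. The settings shown exist in Celery 4.x; the URL and values below are hypothetical.

```python
# Hypothetical Celery 4.x configuration fragment (lowercase setting names).
# broker_heartbeat is the heartbeat interval (seconds) the client asks the
# broker for; 0 disables heartbeats, trading slower dead-connection detection
# for immunity to "missed heartbeats from client" disconnects on busy workers.
broker_url = 'amqp://user:password@rabbitmq-host:5672//'  # placeholder
broker_heartbeat = 0              # or a value well above the observed 120 s
broker_heartbeat_checkrate = 2.0  # check twice per heartbeat interval (default)
```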

Rui Yang

Jul 11, 2018, 9:02:26 AM
to OpenQuake Users
Hi Daniele,

All issues are gone after fixing an inconsistent module problem, and OpenQuake 3.1 now works fine.
Thank you very much for your help in making it happen.

Regards,
Rui

Daniele Viganò

Jul 11, 2018, 9:19:53 AM
to openqua...@googlegroups.com

Hi Rui,

thanks, good news! What was the "inconsistent module"? I mean, was it something related to our code/libraries or to your custom deployment?

In the first case this information may be useful for other users in our community in the future.

Thanks,
Daniele

Rui Yang

Jul 11, 2018, 11:43:23 PM
to OpenQuake Users
Hi Daniele,

Sorry for the confusion. There is nothing wrong with OQ itself, but I need to use our own Python modules. They are built upon optimised libraries to get the best performance on our cluster, so this is not applicable to other users. Intel Python might be an alternative solution, although I didn't test it.

Regards,
Rui

