Thank you Michele, I was testing 3.19 over the weekend. One calculation failed after about a day with a clear memory error, so the reason there is obvious:
[2024-06-09 12:45:01 #5 INFO] gen_event_based 39% [4351 submitted, 0 queued]
[2024-06-09 15:57:59 #5 INFO] Received 7916 * 7.34 MB {'gmfdata': '56.73 GB', 'sig_eps': '7.02 MB', 'times': '2.73 MB'} in 94732 seconds from gen_event_based
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "C:\OpenQuake\python3\Scripts\oq.exe\__main__.py", line 7, in <module>
File "C:\OpenQuake\python3\Lib\site-packages\openquake\commands\__main__.py", line 48, in oq
sap.run(commands, prog='oq')
File "C:\OpenQuake\python3\Lib\site-packages\openquake\baselib\sap.py", line 212, in run
return _run(parser(funcdict, **parserkw), argv)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\OpenQuake\python3\Lib\site-packages\openquake\baselib\sap.py", line 203, in _run
return func(**dic)
^^^^^^^^^^^
File "C:\OpenQuake\python3\Lib\site-packages\openquake\commands\engine.py", line 181, in main
run_jobs(jobs)
File "C:\OpenQuake\python3\Lib\site-packages\openquake\engine\engine.py", line 418, in run_jobs
run_calc(jobctx)
File "C:\OpenQuake\python3\Lib\site-packages\openquake\engine\engine.py", line 281, in run_calc
calc.run(shutdown=True)
File "C:\OpenQuake\python3\Lib\site-packages\openquake\calculators\base.py", line 255, in run
raise exc from None
File "C:\OpenQuake\python3\Lib\site-packages\openquake\calculators\base.py", line 244, in run
self.result = self.execute()
^^^^^^^^^^^^^^
File "C:\OpenQuake\python3\Lib\site-packages\openquake\calculators\event_based.py", line 681, in execute
acc = smap.reduce(self.agg_dicts)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\OpenQuake\python3\Lib\site-packages\openquake\baselib\parallel.py", line 896, in reduce
return self.submit_all().reduce(agg, acc)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\OpenQuake\python3\Lib\site-packages\openquake\baselib\parallel.py", line 634, in reduce
for result in self:
File "C:\OpenQuake\python3\Lib\site-packages\openquake\baselib\parallel.py", line 620, in __iter__
yield from self._iter()
File "C:\OpenQuake\python3\Lib\site-packages\openquake\baselib\parallel.py", line 605, in _iter
msg = check_mem_usage()
^^^^^^^^^^^^^^^^^
File "C:\OpenQuake\python3\Lib\site-packages\openquake\baselib\parallel.py", line 482, in check_mem_usage
raise MemoryError('Using more memory than allowed by configuration '
MemoryError: Using more memory than allowed by configuration (Used: 99% / Allowed: 99%)! Shutting down.
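As a rough sanity check on the "Received" log line above, here is a minimal Python sketch (not part of the engine) that multiplies the number of results by the mean size and converts the reported seconds to hours. The regular expression and the `summarize` helper are hypothetical names written for this post, and the conversion assumes 1024 MB per GB.

```python
import re

# Hypothetical pattern for the engine's "Received N * X MB {...} in S seconds" line.
LOG_RE = re.compile(r"Received (\d+) \* ([\d.]+) MB .* in (\d+) seconds")


def summarize(line):
    """Return the approximate total size in GB and the reported time in hours."""
    m = LOG_RE.search(line)
    if m is None:
        raise ValueError("not a 'Received' log line")
    n_results = int(m.group(1))   # number of results received
    mean_mb = float(m.group(2))   # mean size per result, in MB (rounded in the log)
    seconds = int(m.group(3))     # seconds reported by the engine
    return {
        "total_gb": n_results * mean_mb / 1024,  # assumption: 1 GB = 1024 MB
        "hours": seconds / 3600,
    }


line = ("Received 7916 * 7.34 MB {'gmfdata': '56.73 GB', 'sig_eps': '7.02 MB', "
        "'times': '2.73 MB'} in 94732 seconds from gen_event_based")
print(summarize(line))
```

For this run the product comes out close to the 56.73 GB that the log reports for 'gmfdata', and the reported time is a little over 26 hours.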
Another calculation again ended with the 'invalid load key' error:
[2024-06-10 06:44:33 #6 INFO] gen_event_based 30% [2499 submitted, 0 queued]
[2024-06-10 07:16:30 #6 INFO] Received 3225 * 6.36 MB {'gmfdata': '20.02 GB', 'sig_eps': '4.51 MB', 'times': '1.48 MB'} in 39561 seconds from gen_event_based
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "C:\OpenQuake\python3\Scripts\oq.exe\__main__.py", line 7, in <module>
File "C:\OpenQuake\python3\Lib\site-packages\openquake\commands\__main__.py", line 48, in oq
sap.run(commands, prog='oq')
File "C:\OpenQuake\python3\Lib\site-packages\openquake\baselib\sap.py", line 212, in run
return _run(parser(funcdict, **parserkw), argv)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\OpenQuake\python3\Lib\site-packages\openquake\baselib\sap.py", line 203, in _run
return func(**dic)
^^^^^^^^^^^
File "C:\OpenQuake\python3\Lib\site-packages\openquake\commands\engine.py", line 181, in main
run_jobs(jobs)
File "C:\OpenQuake\python3\Lib\site-packages\openquake\engine\engine.py", line 418, in run_jobs
run_calc(jobctx)
File "C:\OpenQuake\python3\Lib\site-packages\openquake\engine\engine.py", line 281, in run_calc
calc.run(shutdown=True)
File "C:\OpenQuake\python3\Lib\site-packages\openquake\calculators\base.py", line 255, in run
raise exc from None
File "C:\OpenQuake\python3\Lib\site-packages\openquake\calculators\base.py", line 244, in run
self.result = self.execute()
^^^^^^^^^^^^^^
File "C:\OpenQuake\python3\Lib\site-packages\openquake\calculators\event_based.py", line 681, in execute
acc = smap.reduce(self.agg_dicts)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\OpenQuake\python3\Lib\site-packages\openquake\baselib\parallel.py", line 896, in reduce
return self.submit_all().reduce(agg, acc)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\OpenQuake\python3\Lib\site-packages\openquake\baselib\parallel.py", line 634, in reduce
for result in self:
File "C:\OpenQuake\python3\Lib\site-packages\openquake\baselib\parallel.py", line 620, in __iter__
yield from self._iter()
File "C:\OpenQuake\python3\Lib\site-packages\openquake\baselib\parallel.py", line 604, in _iter
for result in self.iresults:
File "C:\OpenQuake\python3\Lib\site-packages\openquake\baselib\parallel.py", line 933, in _loop
res = next(isocket)
^^^^^^^^^^^^^
File "C:\OpenQuake\python3\Lib\site-packages\openquake\baselib\zeromq.py", line 156, in __iter__
yield self.zsocket.recv_pyobj()
^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\OpenQuake\python3\Lib\site-packages\zmq\sugar\socket.py", line 977, in recv_pyobj
return self._deserialize(msg, pickle.loads)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\OpenQuake\python3\Lib\site-packages\zmq\sugar\socket.py", line 835, in _deserialize
return load(recvd)
^^^^^^^^^^^
_pickle.UnpicklingError: invalid load key, '\x00'.
The machine has 44 physical cores (88 logical). By default the OQ Engine uses 44 workers, but I reduced that to 30 by setting
num_cores = 30
in openquake.cfg, to have more memory available per worker. The server has 350 GB of memory,
of which around 300 GB is available when the calculation starts (so roughly 10 GB per worker?). The OS is Windows Server 2019 Standard.
We also have other Windows machines, and on some of them the failures are less frequent (I have just noticed that those have 512 GB of RAM, so I will try on them too).
The biggest challenge is that it is very unpredictable what will run and what will not :(
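The per-worker figure above is just the available memory divided by the number of workers. A minimal sketch of that arithmetic follows; `memory_per_worker_gb` is a hypothetical helper, and it assumes the memory is shared evenly, ignoring whatever the main process itself needs to collect the results.

```python
def memory_per_worker_gb(available_gb, num_workers):
    """Naive even split of the available memory across workers."""
    if num_workers <= 0:
        raise ValueError("num_workers must be positive")
    return available_gb / num_workers


# About 300 GB free at start, with the reduced and the default worker counts.
for workers in (30, 44):
    share = memory_per_worker_gb(300, workers)
    print(f"{workers} workers: {share:.1f} GB each")
```

With 30 workers this gives the 10 GB per worker mentioned above; with the default 44 workers it would be under 7 GB each.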
Thank you
Peter