I am encountering an issue with the RELION 5.0 tomography pipeline while refining bacterial ribosomes. The run proceeds normally at first, but it consistently crashes during the maximization (M) step of the final, converged iteration.
Dataset & Environment:
- Particles: 14,771
- Box Size: 403 pixels (at 10 Å/px)
- Upstream Tool: WarpTools
- RELION Version: 5.0.0-17-gadfec821
- Hardware: NVIDIA A10 GPUs (24 GB VRAM per card)
The Issue:
I suspect a potential memory bottleneck, though my own estimate puts the raw particle data at roughly 9.6 GB (403² pixels × 4 bytes × 14,771 particles), which should be manageable on these nodes. I have tested the run on both 4-GPU and 8-GPU configurations, and the error occurs at the same stage in both.
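For reference, this is how I arrived at my estimate, plus a rough guess at the M-step reconstruction footprint. The padding factor of 2 and double-precision complex Fourier voxels are assumptions on my part about RELION's reconstruction internals, not measured allocations:

```python
# Back-of-the-envelope memory estimates for the failing run.
box = 403            # box size in pixels
n_particles = 14_771

# Raw 2D particle images, single-precision float (4 bytes/pixel):
stack_gb = box * box * 4 * n_particles / 1e9
print(f"particle stack: {stack_gb:.1f} GB")        # ~9.6 GB

# Guess at one padded reconstruction volume in the M-step, assuming
# a padding factor of 2 and 16-byte complex-double voxels (both are
# assumptions about the internals, not measured values):
pad = 2
volume_gb = (pad * box) ** 3 * 16 / 1e9
print(f"one padded Fourier volume: {volume_gb:.1f} GB")   # ~8.4 GB
```

If the padded-volume guess is in the right ballpark, the M-step could transiently need several times my "particle data" estimate per reconstructing rank.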
Below are the MPI and thread configurations for my recent runs:
4-GPU Setup
=== RELION MPI setup ===
+ Number of MPI processes = 3
+ Number of threads per MPI process = 4
+ Total number of threads therefore = 12
+ Leader (0) runs on host = node13.bbsrc
+ Follower 1 runs on host = node14.bbsrc
+ Follower 2 runs on host = node15.bbsrc
==========================
uniqueHost node14.bbsrc has 1 ranks.
uniqueHost node15.bbsrc has 1 ranks.
Using explicit indexing on follower 1 to assign devices 0 1 2 3
Thread 0 on follower 1 mapped to device 0
Thread 1 on follower 1 mapped to device 1
Thread 2 on follower 1 mapped to device 2
Thread 3 on follower 1 mapped to device 3
Using explicit indexing on follower 2 to assign devices 0 1 2 3
Thread 0 on follower 2 mapped to device 0
Thread 1 on follower 2 mapped to device 1
Thread 2 on follower 2 mapped to device 2
Thread 3 on follower 2 mapped to device 3
Running CPU instructions in double precision.
8-GPU Setup
=== RELION MPI setup ===
+ Number of MPI processes = 3
+ Number of threads per MPI process = 4
+ Total number of threads therefore = 12
+ Leader (0) runs on host = node11.bbsrc
+ Follower 1 runs on host = node12.bbsrc
+ Follower 2 runs on host = node13.bbsrc
==========================
uniqueHost node12.bbsrc has 1 ranks.
uniqueHost node13.bbsrc has 1 ranks.
GPU-ids not specified for this rank, threads will automatically be mapped to available devices.
Thread 0 on follower 1 mapped to device 0
Thread 1 on follower 1 mapped to device 1
Thread 2 on follower 1 mapped to device 2
Thread 3 on follower 1 mapped to device 3
GPU-ids not specified for this rank, threads will automatically be mapped to available devices.
Thread 0 on follower 2 mapped to device 0
Thread 1 on follower 2 mapped to device 1
Thread 2 on follower 2 mapped to device 2
Thread 3 on follower 2 mapped to device 3
Running CPU instructions in double precision.
I would appreciate any insights into whether this could be a memory-allocation spike specific to the M-step, or whether there is an MPI/thread configuration I should adjust for this version of RELION 5.0. The error output from the failing rank is below:
corrupted size vs. prev_size
[node18:05369] *** Process received signal ***
[node18:05369] Signal: Aborted (6)
[node18:05369] Signal code: (-6)
[node18:05369] [ 0] /lib64/libc.so.6(+0x54db0)[0x7f35dce54db0]
[node18:05369] [ 1] /lib64/libc.so.6(+0xa154c)[0x7f35dcea154c]
[node18:05369] [ 2] /lib64/libc.so.6(raise+0x16)[0x7f35dce54d06]
[node18:05369] [ 3] /lib64/libc.so.6(abort+0xd3)[0x7f35dce287f3]
[node18:05369] [ 4] /lib64/libc.so.6(+0x29130)[0x7f35dce29130]
[node18:05369] [ 5] /lib64/libc.so.6(+0xab617)[0x7f35dceab617]
[node18:05369] [ 6] /lib64/libc.so.6(+0xac186)[0x7f35dceac186]
[node18:05369] [ 7] /lib64/libc.so.6(+0xac310)[0x7f35dceac310]
[node18:05369] [ 8] /lib64/libc.so.6(+0xadf18)[0x7f35dceadf18]
[node18:05369] [ 9] /lib64/libc.so.6(+0xaf04f)[0x7f35dceaf04f]
[node18:05369] [10] /lib64/libc.so.6(+0xaf6fa)[0x7f35dceaf6fa]
[node18:05369] [11] /bbsrc/soft/relion/5.0.0-17-gadfec821/external/fftw/lib/libfftw3.so.3(fftw_malloc_plain+0x15)[0x7f35e3d55155]
[node18:05369] [12] /bbsrc/soft/relion/5.0.0-17-gadfec821/external/fftw/lib/libfftw3.so.3(+0x1d88f)[0x7f35e3d5688f]
[node18:05369] [13] /bbsrc/soft/relion/5.0.0-17-gadfec821/external/fftw/lib/libfftw3.so.3(fftw_kdft_register+0x24)[0x7f35e3d62554]
[node18:05369] [14] /bbsrc/soft/relion/5.0.0-17-gadfec821/external/fftw/lib/libfftw3.so.3(fftw_solvtab_exec+0x28)[0x7f35e3d59508]
[node18:05369] [15] /bbsrc/soft/relion/5.0.0-17-gadfec821/external/fftw/lib/libfftw3.so.3(fftw_dft_conf_standard+0x22)[0x7f35e3d5e422]
[node18:05369] [16] /bbsrc/soft/relion/5.0.0-17-gadfec821/external/fftw/lib/libfftw3.so.3(fftw_configure_planner+0x9)[0x7f35e3e4b559]
[node18:05369] [17] /bbsrc/soft/relion/5.0.0-17-gadfec821/external/fftw/lib/libfftw3.so.3(fftw_the_planner+0x28)[0x7f35e3e55eb8]
[node18:05369] [18] /bbsrc/soft/relion/5.0.0-17-gadfec821/external/fftw/lib/libfftw3.so.3(fftw_mkapiplan+0x2b)[0x7f35e3e4b0fb]
[node18:05369] [19] /bbsrc/soft/relion/5.0.0-17-gadfec821/external/fftw/lib/libfftw3.so.3(fftw_plan_many_dft_r2c+0x142)[0x7f35e3e55952]
[node18:05369] [20] /bbsrc/soft/relion/5.0.0-17-gadfec821/external/fftw/lib/libfftw3.so.3(fftw_plan_dft_r2c+0x25)[0x7f35e3e54ee5]
[node18:05369] [21] /bbsrc/soft/relion/5.0.0-17-gadfec821/bin/relion_refine_mpi(_ZN18FourierTransformer7setRealER13MultidimArrayIdEb+0xc4)[0x606624]
[node18:05369] [22] /bbsrc/soft/relion/5.0.0-17-gadfec821/bin/relion_refine_mpi(_ZN13BackProjector11reconstructER13MultidimArrayIdEibRKS1_ddibP5ImageIdE+0x13b)[0x5cf98b]
[node18:05369] [23] /bbsrc/soft/relion/5.0.0-17-gadfec821/bin/relion_refine_mpi(_ZN14MlOptimiserMpi46readTemporaryDataAndWeightArraysAndReconstructEii+0xf86)[0x52cc06]
[node18:05369] [24] /bbsrc/soft/relion/5.0.0-17-gadfec821/bin/relion_refine_mpi(_ZN14MlOptimiserMpi12maximizationEv+0x646)[0x52d936]
[node18:05369] [25] /bbsrc/soft/relion/5.0.0-17-gadfec821/bin/relion_refine_mpi(_ZN14MlOptimiserMpi7iterateEv+0x4c8)[0x52fdb8]
[node18:05369] [26] /bbsrc/soft/relion/5.0.0-17-gadfec821/bin/relion_refine_mpi(main+0x55)[0x4e05e5]
[node18:05369] [27] /lib64/libc.so.6(+0x3feb0)[0x7f35dce3feb0]
[node18:05369] [28] /lib64/libc.so.6(__libc_start_main+0x80)[0x7f35dce3ff60]
[node18:05369] [29] /bbsrc/soft/relion/5.0.0-17-gadfec821/bin/relion_refine_mpi(_start+0x25)[0x4e4015]
[node18:05369] *** End of error message ***
--------------------------------------------------------------------------
prterun noticed that process rank 2 with PID 5369 on node node18 exited on
signal 6 (Aborted).
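In case it is relevant, this is the reduced-memory rerun I am planning to try next. The paths and the continuation STAR file are placeholders for my actual job; `--pad`, `--gpu`, and `--j` are standard relion_refine options, and my understanding is that `--pad 1` halves the linear size of the padded reconstruction volumes in the M-step at some cost in interpolation accuracy:

```shell
# Placeholder paths -- substitute the real job directories and the
# optimiser STAR file of the last completed iteration.
mpirun -n 3 relion_refine_mpi \
    --continue Refine3D/job_retry/run_itNNN_optimiser.star \
    --o Refine3D/job_retry/run \
    --j 4 --gpu "0:1:2:3" \
    --pad 1
```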
Thanks!