Problem in relion refinement steps pre M

63 views

Skip to first unread message

Dylan Noone

unread,

Mar 30, 2026, 3:48:01 AMMar 30

to Warp

Hey guys,

I also posted this in CCPEM, but just in case anyone here had a similar issue:

I am encountering an issue with the RELION 5.0-tomo pipeline during the refinement of bacterial ribosomes. While the process runs smoothly initially, it consistently fails at the M-step of the final converged iteration.

Dataset & Environment:

Particles: 14,771
Box Size: 403 pixels (at 10 Å/px)
Upstream Tool: WarpTools
RELION Version: 5.0.0-17-gadfec821
Hardware: NVIDIA A10 GPUs (24 GB VRAM per card)

The Issue:

I suspect a potential memory bottleneck, though my initial estimates suggest the particle data should only require approximately 3.8 GB (403 x 4 x bytes x14,771 partciles). I have tested the run on both 4-GPU and 8-GPU configurations, but the error persists at the same stage.

Below are the MPI and thread configurations for my recent runs:

4-GPU Setup

=== RELION MPI setup ===

+ Number of MPI processes = 3

+ Number of threads per MPI process = 4

+ Total number of threads therefore = 12

+ Leader (0) runs on host = node13.bbsrc

+ Follower 1 runs on host = node14.bbsrc

+ Follower 2 runs on host = node15.bbsrc

==========================

uniqueHost node14.bbsrc has 1 ranks.

uniqueHost node15.bbsrc has 1 ranks.

Using explicit indexing on follower 1 to assign devices 0 1 2 3

Thread 0 on follower 1 mapped to device 0

Thread 1 on follower 1 mapped to device 1

Thread 2 on follower 1 mapped to device 2

Thread 3 on follower 1 mapped to device 3

Using explicit indexing on follower 2 to assign devices 0 1 2 3

Thread 0 on follower 2 mapped to device 0

Thread 1 on follower 2 mapped to device 1

Thread 2 on follower 2 mapped to device 2

Thread 3 on follower 2 mapped to device 3

Running CPU instructions in double precision.

8-GPU Setup

=== RELION MPI setup ===

+ Number of MPI processes = 3

+ Number of threads per MPI process = 4

+ Total number of threads therefore = 12

+ Leader (0) runs on host = node11.bbsrc

+ Follower 1 runs on host = node12.bbsrc

+ Follower 2 runs on host = node13.bbsrc

==========================

uniqueHost node12.bbsrc has 1 ranks.

uniqueHost node13.bbsrc has 1 ranks.

GPU-ids not specified for this rank, threads will automatically be mapped to available devices.

Thread 0 on follower 1 mapped to device 0

Thread 1 on follower 1 mapped to device 1

Thread 2 on follower 1 mapped to device 2

Thread 3 on follower 1 mapped to device 3

GPU-ids not specified for this rank, threads will automatically be mapped to available devices.

Thread 0 on follower 2 mapped to device 0

Thread 1 on follower 2 mapped to device 1

Thread 2 on follower 2 mapped to device 2

Thread 3 on follower 2 mapped to device 3

Running CPU instructions in double precision.

I would appreciate any insights into whether this might be a specific memory allocation spike in the M-step, or if there is an optimization in the MPI/thread setup I should consider for this specific version of RELION-5.0.

corrupted size vs. prev_size

[node18:05369] *** Process received signal ***

[node18:05369] Signal: Aborted (6)

[node18:05369] Signal code: (-6)

[node18:05369] [ 0] /lib64/libc.so.6(+0x54db0)[0x7f35dce54db0]

[node18:05369] [ 1] /lib64/libc.so.6(+0xa154c)[0x7f35dcea154c]

[node18:05369] [ 2] /lib64/libc.so.6(raise+0x16)[0x7f35dce54d06]

[node18:05369] [ 3] /lib64/libc.so.6(abort+0xd3)[0x7f35dce287f3]

[node18:05369] [ 4] /lib64/libc.so.6(+0x29130)[0x7f35dce29130]

[node18:05369] [ 5] /lib64/libc.so.6(+0xab617)[0x7f35dceab617]

[node18:05369] [ 6] /lib64/libc.so.6(+0xac186)[0x7f35dceac186]

[node18:05369] [ 7] /lib64/libc.so.6(+0xac310)[0x7f35dceac310]

[node18:05369] [ 8] /lib64/libc.so.6(+0xadf18)[0x7f35dceadf18]

[node18:05369] [ 9] /lib64/libc.so.6(+0xaf04f)[0x7f35dceaf04f]

[node18:05369] [10] /lib64/libc.so.6(+0xaf6fa)[0x7f35dceaf6fa]

[node18:05369] [11] /bbsrc/soft/relion/5.0.0-17-gadfec821/external/fftw/lib/libfftw3.so.3(fftw_malloc_plain+0x15)[0x7f35e3d55155]

[node18:05369] [12] /bbsrc/soft/relion/5.0.0-17-gadfec821/external/fftw/lib/libfftw3.so.3(+0x1d88f)[0x7f35e3d5688f]

[node18:05369] [13] /bbsrc/soft/relion/5.0.0-17-gadfec821/external/fftw/lib/libfftw3.so.3(fftw_kdft_register+0x24)[0x7f35e3d62554]

[node18:05369] [14] /bbsrc/soft/relion/5.0.0-17-gadfec821/external/fftw/lib/libfftw3.so.3(fftw_solvtab_exec+0x28)[0x7f35e3d59508]

[node18:05369] [15] /bbsrc/soft/relion/5.0.0-17-gadfec821/external/fftw/lib/libfftw3.so.3(fftw_dft_conf_standard+0x22)[0x7f35e3d5e422]

[node18:05369] [16] /bbsrc/soft/relion/5.0.0-17-gadfec821/external/fftw/lib/libfftw3.so.3(fftw_configure_planner+0x9)[0x7f35e3e4b559]

[node18:05369] [17] /bbsrc/soft/relion/5.0.0-17-gadfec821/external/fftw/lib/libfftw3.so.3(fftw_the_planner+0x28)[0x7f35e3e55eb8]

[node18:05369] [18] /bbsrc/soft/relion/5.0.0-17-gadfec821/external/fftw/lib/libfftw3.so.3(fftw_mkapiplan+0x2b)[0x7f35e3e4b0fb]

[node18:05369] [19] /bbsrc/soft/relion/5.0.0-17-gadfec821/external/fftw/lib/libfftw3.so.3(fftw_plan_many_dft_r2c+0x142)[0x7f35e3e55952]

[node18:05369] [20] /bbsrc/soft/relion/5.0.0-17-gadfec821/external/fftw/lib/libfftw3.so.3(fftw_plan_dft_r2c+0x25)[0x7f35e3e54ee5]

[node18:05369] [21] /bbsrc/soft/relion/5.0.0-17-gadfec821/bin/relion_refine_mpi(_ZN18FourierTransformer7setRealER13MultidimArrayIdEb+0xc4)[0x606624]

[node18:05369] [22] /bbsrc/soft/relion/5.0.0-17-gadfec821/bin/relion_refine_mpi(_ZN13BackProjector11reconstructER13MultidimArrayIdEibRKS1_ddibP5ImageIdE+0x13b)[0x5cf98b]

[node18:05369] [23] /bbsrc/soft/relion/5.0.0-17-gadfec821/bin/relion_refine_mpi(_ZN14MlOptimiserMpi46readTemporaryDataAndWeightArraysAndReconstructEii+0xf86)[0x52cc06]

[node18:05369] [24] /bbsrc/soft/relion/5.0.0-17-gadfec821/bin/relion_refine_mpi(_ZN14MlOptimiserMpi12maximizationEv+0x646)[0x52d936]

[node18:05369] [25] /bbsrc/soft/relion/5.0.0-17-gadfec821/bin/relion_refine_mpi(_ZN14MlOptimiserMpi7iterateEv+0x4c8)[0x52fdb8]

[node18:05369] [26] /bbsrc/soft/relion/5.0.0-17-gadfec821/bin/relion_refine_mpi(main+0x55)[0x4e05e5]

[node18:05369] [27] /lib64/libc.so.6(+0x3feb0)[0x7f35dce3feb0]

[node18:05369] [28] /lib64/libc.so.6(__libc_start_main+0x80)[0x7f35dce3ff60]

[node18:05369] [29] /bbsrc/soft/relion/5.0.0-17-gadfec821/bin/relion_refine_mpi(_start+0x25)[0x4e4015]

[node18:05369] *** End of error message ***

--------------------------------------------------------------------------

prterun noticed that process rank 2 with PID 5369 on node node18 exited on

signal 6 (Aborted).

Thanks!

Warp Bot

unread,

Mar 30, 2026, 12:14:09 PMMar 30

to Warp, Dylan Noone

Hi Dylan,

This is a known issue that has been reported by several users running RELION refinement with large box sizes. The "corrupted size vs. prev_size" error is a heap corruption in glibc that surfaces during FFTW's memory allocation inside RELION's BackProjector::reconstruct at the maximization step. It is a RELION-side problem, not something caused by WarpTools.

One thing that stands out is your box size of 403 pixels. FFTW performs best with dimensions that factor into small primes (2, 3, 5, 7). The number 403 = 13 x 31, which consists entirely of larger primes, and this can cause FFTW's planner to allocate significantly more memory and exercise less-tested code paths. This is a likely contributing factor.

I would suggest:

1. Change your box size to an FFTW-friendly number such as 400 (= 2^4 x 5^2) or 384 (= 2^7 x 3). Re-export your particles from WarpTools with the new box size and re-run refinement.

2. If that does not help, try reducing to a single MPI follower (2 MPI processes total) to see if inter-process memory pressure is involved.

3. You may also want to try a newer RELION commit if one is available -- this crash pattern has been reported across multiple RELION 5 versions and may be addressed in more recent builds.

Since this is ultimately a RELION issue, the CCPEM list (where you already posted) is the best place for RELION-specific debugging. The RELION developers would be able to confirm whether this is a known FFTW/memory bug on their side.

— Warp Bot
THIS IS AN AUTOMATED MESSAGE GENERATED BY AN LLM. IT MAY OR MAY NOT SOLVE YOUR PROBLEM. IF YOU'D LIKE TO SPEAK TO A HUMAN, SAY SO IN YOUR MESSAGE.

--
You received this message because you are subscribed to the Google Groups "Warp" group.
To unsubscribe from this group and stop receiving emails from it, send an email to warp-em+u...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/warp-em/7c33b9d5-586b-4dba-a726-93c343e50f86n%40googlegroups.com.

Dylan Noone

unread,

Mar 30, 2026, 12:17:27 PMMar 30

to Warp Bot, Warp

Thanks for the tips.

The box size was 400, I made a typo!

What seemed to work in the end was changing the setting combine iterations through disk to yes in the compute tab.