Issue with Refinement Job Resource Utilization


Milad Reyhani

May 4, 2026, 7:04:21 AM
to EMAN2

Dear team,

I am experiencing an issue with refinement jobs not fully utilizing the requested computational resources. I have also noticed that some of my colleagues are encountering similar behavior.

When I submit a job to the cluster and request, for example, 128 cores, the initial iterations run quickly. However, after the first few iterations, the process slows down significantly. I have attached a screenshot from a previous job in which I gathered the metadata and ran a refinement with the iteration sequence p2t2ptrd. In that run, the first two positional refinements completed within a day at full CPU usage, but resource utilization then dropped considerably, and subsequent iterations took much longer to complete. I have also attached the parameters used.

I would appreciate any advice on whether there is a way to improve or stabilize resource utilization during these later stages.

I look forward to your thoughts.

Best regards,
Milad

Screenshot from 2026-05-04 14-12-54.png
0_spt_params.json

Steve Ludtke

May 4, 2026, 7:27:14 AM
to em...@googlegroups.com
Hi Milad,
The quick answer is that the biggest factor is probably storage bandwidth. For more details:

Debugging cluster performance issues can be very tricky: it depends not only on the characteristics of the data but also on the configuration of the cluster. Some EMAN2 tasks are limited to threaded (single-node) parallelism, whereas others can make use of multiple nodes. CryoEM/ET image processing is extremely data-intensive, and in many cases it is constrained not by the CPU but by data I/O. Some clusters are configured with high-performance integrated storage systems that can reliably provide >1 GB/s of read bandwidth to each node simultaneously. On many clusters, however, there is a total data bandwidth for the entire cluster that is shared across nodes. If you start a job on a few nodes while the cluster is idle, it may perform very well; then someone starts another big job on other nodes, and suddenly your data bandwidth drops. This is just one possible scenario.
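If you want a quick way to check whether storage is the limit, a raw read test from a compute node can be informative. A minimal sketch, assuming GNU dd on Linux; the paths are placeholders for a large file on your project storage and on node-local scratch:

  # read a large existing file from shared project storage, bypassing
  # the page cache (iflag=direct) so the filesystem itself is measured
  dd if=/shared/project/particles/ptcls_000.hdf of=/dev/null bs=1M iflag=direct

  # repeat from node-local scratch for comparison
  dd if=/scratch/$USER/ptcls_000.hdf of=/dev/null bs=1M iflag=direct

dd reports throughput (e.g. "1.1 GB/s") on its final line; a large gap between the two runs points at shared-storage bandwidth rather than CPU.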

In my lab, we've largely shifted away from clusters to individual workstations with (typically) 64 cores, very high-performance M.2 storage, a 10 Gb/s link to a shared RAID with good performance, and 1-2 good GPUs. When possible, we copy the entire project to the fast local M.2 storage, then run the refinements there using all of the cores on the workstation (and the GPU for appropriate tasks).
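A minimal sketch of that workflow, with hypothetical paths (assuming the local NVMe volume is mounted at /nvme):

  # copy the whole project to fast local storage
  rsync -a /shared/projects/myproject/ /nvme/myproject/

  # run refinements from the local copy
  cd /nvme/myproject

  # sync results back to shared storage afterwards
  rsync -a /nvme/myproject/ /shared/projects/myproject/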

So, we may be able to say more, but we would need a lot more information (the commands after this list can help collect the hardware side):

- cluster configuration
  - CPUs on a single node (specific model and count)
  - RAM on a single node
  - scratch availability on a single node (type of drive and speed)
  - type and speed of the storage where the project folder lives (the folder you run EMAN2 commands from)
- project info
  - box size of the extracted subtilt stacks (often 2x the final box size)
  - number of 3-D particles in the project
  - number of tilts in a tomogram (typical value, if it varies)
  - imposed symmetry (if any)
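For the hardware items, standard Linux tools run on a compute node will answer most of these (the paths below are placeholders; actual storage speed still needs a read test such as the dd example above):

  lscpu | grep -E 'Model name|^CPU\(s\)'   # CPU model and core count
  free -h                                  # RAM on the node
  df -hT /path/to/scratch                  # scratch filesystem type and size
  df -hT /path/to/project                  # project storage filesystem type and size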

I'm not sure where in your plot the final "rd" stages came in, but "r" in particular can be very slow and rarely provides much of an improvement. Personally, I don't usually use it.
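As an assumption about how the job was launched: if this is the new SPT pipeline (the one that writes 0_spt_params.json), the stage sequence is set by the --iters option of e2spt_refine_new.py, and skipping the rotation stage just means leaving 'r' out of that list, e.g.:

  # illustrative sequence with the slow 'r' stage omitted
  e2spt_refine_new.py <your other options> --iters p,p,t,p,t,d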


Muyuan Chen

May 4, 2026, 12:07:38 PM
to em...@googlegroups.com
For cluster usage, also make sure you write scratch files to a location with enough storage. Specify --parallel thread:128:/path/to/scratch/ for this.
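For example, with placeholder paths (and assuming e2spt_refine_new.py from the new SPT pipeline as the driver):

  # create node-local scratch and point EMAN2's temporary files at it
  mkdir -p /scratch/$USER/eman2tmp
  e2spt_refine_new.py <your other options> --parallel thread:128:/scratch/$USER/eman2tmp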

Also, if you are on a new enough version, there is now a 'g' iteration that can replace 't'. It should be much faster and sometimes gives better results, especially when there are many particles per tomogram. Alternating p and g may be better still, but I am still testing.
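Assuming --iters accepts 'g' the same way as the other stage letters, that would look something like the following (an illustrative sequence, not a recommendation):

  # replace subtilt translation ('t') with the faster 'g' stage,
  # alternating with particle alignment ('p')
  e2spt_refine_new.py <your other options> --iters p,g,p,g,d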

Muyuan
