Issue with Refinement Job Resource Utilization

Milad Reyhani

May 4, 2026, 7:04:21 AM
to EMAN2

Dear team,

I am experiencing an issue with refinement jobs not fully utilizing the requested computational resources. I have also noticed that some of my colleagues are encountering similar behavior.

When I submit a job to the cluster and request, for example, 128 cores, the initial iterations run quickly. However, after the first few iterations, the process slows down significantly. I have attached a screenshot from a previous job in which I gathered metadata and ran the p2t2ptrd refinement. In that run, the first two positional refinements were completed within a day with full CPU usage, but resource utilization then dropped considerably, and subsequent iterations took much longer to complete. I have also attached the parameters used.

I would appreciate any advice on whether there is a way to improve or stabilize resource utilization during these later stages.

I look forward to your thoughts.

Best regards,
Milad

Screenshot from 2026-05-04 14-12-54.png
0_spt_params.json

Steve Ludtke

May 4, 2026, 7:27:14 AM
to em...@googlegroups.com
Hi Milad,
The quick answer is that the biggest factor is probably storage bandwidth. For more details:

Debugging cluster performance issues can be very tricky. It depends not only on the characteristics of the data, but also on the configuration of the cluster. Some EMAN2 tasks are limited to threaded (single-node) parallelism, whereas others can make use of multiple nodes. CryoEM/ET image processing is extremely data-intensive, and in many cases it is constrained not by the CPU but by data I/O. Some clusters are configured with high-performance integrated storage systems which can reliably provide >1 GB/s of read bandwidth to every node simultaneously. On many clusters, however, there is a fixed total data bandwidth for the entire cluster, and it is shared across nodes. If you start a job on a few nodes of a large cluster while the cluster is idle, it may perform very well, but then someone starts another big job on other nodes, and suddenly your data bandwidth drops. This is just one potential situation.
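One way to see what bandwidth a node actually gets from a given filesystem is to time a sequential read of a large file that lives there. The sketch below is a generic Python illustration, not an EMAN2 tool; `read_throughput_mb_s` is a hypothetical helper name, and the number it reports will be inflated if the file is already in the operating system's page cache.

```python
import time


def read_throughput_mb_s(path, chunk_bytes=64 * 1024 * 1024):
    """Read `path` sequentially and return the observed rate in MB/s."""
    total = 0
    start = time.perf_counter()
    # buffering=0 disables Python-level buffering; the OS page cache can
    # still serve the data if the file was read recently.
    with open(path, "rb", buffering=0) as handle:
        while True:
            chunk = handle.read(chunk_bytes)
            if not chunk:
                break
            total += len(chunk)
    elapsed = time.perf_counter() - start
    return (total / 1e6) / max(elapsed, 1e-9)
```

Running this on a large particle stack in the project folder, once on an idle cluster and once while a refinement is slow, gives a rough before/after comparison of the bandwidth the job is really seeing.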

In my lab, we've largely shifted away from using clusters to using individual workstations with (typically) 64 cores, very high performance M.2 storage, a 10 Gb link to a shared RAID with good performance, and 1-2 good GPUs. When possible, we shift the entire project to the fast local M.2 storage, then run the refinements there using all of the cores on the workstation (and the GPU for appropriate tasks).

So, we may be able to say more, but would need a lot more information:

- cluster configuration
  - CPUs on a single node (specific model and count)
  - RAM on a single node
  - scratch availability on a single node (type of drive/speed)
  - type of storage and speed where the project folder lives (the folder where you run eman2 commands from)
- project info
  - box size of extracted subtilt stacks (often 2x the final box size)
  - number of 3-D particles in the project
  - number of tilts in a tomogram (typical if it varies)
  - imposed symmetry (if any)

I'm not sure where in your plot the final "rd" stages came in, but "r" in particular can be very slow and rarely provides much of an improvement. Personally, I don't usually use it.

--
--
----------------------------------------------------------------------------------------------
You received this message because you are subscribed to the Google
Groups "EMAN2" group.
To post to this group, send email to em...@googlegroups.com
To unsubscribe from this group, send email to eman2+un...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/eman2

To view this discussion visit https://groups.google.com/d/msgid/eman2/50eec42c-decc-4a9e-ae26-ef833df049f4n%40googlegroups.com.

Muyuan Chen

May 4, 2026, 12:07:38 PM
to em...@googlegroups.com
For cluster usage, also make sure you write scratch files to a location with enough storage. Specify --parallel thread:128:/path/to/scratch/ for this.
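A small pre-flight check can catch a full scratch location before the job starts. The sketch below is a minimal Python illustration: `thread_parallel_spec` and `min_free_bytes` are hypothetical names, and only the `thread:<n>:<scratch path>` form of the option value is taken from the message above.

```python
import shutil


def thread_parallel_spec(n_threads, scratch_dir, min_free_bytes):
    """Return a 'thread:<n>:<scratch>' value after checking free space."""
    free = shutil.disk_usage(scratch_dir).free
    if free < min_free_bytes:
        raise RuntimeError(
            f"{scratch_dir} has {free / 1e9:.1f} GB free, "
            f"need at least {min_free_bytes / 1e9:.1f} GB"
        )
    return f"thread:{n_threads}:{scratch_dir}"
```

The returned string would then be passed as the value of --parallel. How much free space to require is a per-project judgment; the threshold here is only a placeholder.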

Also, if you are on a new enough version, there is now a 'g' iteration that can replace 't'. It should be much faster and sometimes gets better results, especially when there are many particles per tomogram. Maybe alternating p and g is better, but I am still testing.

Muyuan

Milad Reyhani

May 5, 2026, 10:12:03 PM
to EMAN2

Dear Steve and Muyuan,

Thank you very much for the detailed explanations.

I assume the next step would be to request allocation in the scratch directory from the cluster staff, as building a local machine capable of replacing the cluster is not feasible for us at this stage.

To give you a clearer picture of the setup and my workload, I run refinement jobs on CPU-only partitions (no GPUs) with the following specifications:

Nodes: 78 and 26
Cores per node: 128
Memory per node: ~977 GB and ~2 TB
Processor: Intel Xeon Gold 6448H
Max wall time: 30 days

Regarding the project, I am refining different regions of an icosahedral virus. The largest box size I currently use is 768 pixels (pixel size 1.603 Å/px), while for smaller regions it goes down to 96 pixels. I initially applied icosahedral symmetry for the whole virus and C5 symmetry for pentons, but individual pentons and other regions are now refined in C1.

Each tilt series should contain ~35 tilts, although I am not certain whether some are excluded during alignment and reconstruction. I have approximately 1,300 virus particles, which corresponds to ~15,600 pentons (box size 768), and ~78,000 monomers (box size 96). A cylindrical mask is applied around regions of interest.

The graph I shared corresponds to a job that did not complete the “d” iteration due to the time limit (20 days). The final peaks likely represent reconstruction steps from the “r” iteration or the beginning of the “d” iteration.

Thank you for pointing out the “g” iteration, we were not aware of it, and I will definitely test it.

Regarding scratch space, I know it is an NVMe-based filesystem. I will contact the cluster team for more details and to request allocation, but I would first need to provide an estimate of the required storage. I would appreciate your guidance on this.
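As a rough sense of scale for the input data (not an estimate of how much scratch EMAN2 itself will write), the raw size of the extracted subtilt stacks can be bounded from the numbers above. The sketch below assumes uncompressed 32-bit floats, i.e. 4 bytes per pixel, and that every particle has all ~35 tilts at the stated box size; both are assumptions, and compressed or reduced-bit storage would make the real files smaller.

```python
def subtilt_stack_bytes(n_particles, n_tilts, box_px, bytes_per_pixel=4):
    """Upper-bound size of 2-D subtilt stacks, assuming uncompressed pixels."""
    return n_particles * n_tilts * box_px * box_px * bytes_per_pixel


# Figures quoted in this message; 4 bytes/pixel is an assumption.
penton_bytes = subtilt_stack_bytes(15_600, 35, 768)
monomer_bytes = subtilt_stack_bytes(78_000, 35, 96)

print(f"pentons (box 768): {penton_bytes / 1e12:.2f} TB")
print(f"monomers (box 96): {monomer_bytes / 1e12:.2f} TB")
```

Under these assumptions the penton stacks alone come to roughly 1.3 TB, which is the figure to compare against the capacity of any storage the project would be staged to.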

Thank you again for all your help.

Best regards,
Milad

Steve Ludtke

May 5, 2026, 11:10:41 PM
to em...@googlegroups.com
Hi Milad, 
Is the 768 box size the size with the 2x padding from the subtilt extraction, or the unpadded size? And you're using a box size of 768 for the pentons, not the whole capsid?

That processor could be configured as 2 processors/node (64 cores, 128 threads) or 4 processors/node (128 cores, 256 threads). However, "threads" is deceptive, as it is not true additional processing power, though Intel tries to market it that way. When specifying a number of threads and/or MPI cores in EMAN2 jobs, limit yourself to the number of cores, not the number of hyperthreads.
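To find out which configuration a node actually has, the physical core count can be read from /proc/cpuinfo and compared with the logical CPU count. The sketch below is a minimal Python illustration that assumes a Linux x86 node whose /proc/cpuinfo lists `physical id` and `core id` for each logical processor; `count_physical_cores` is a hypothetical helper name.

```python
def count_physical_cores(cpuinfo_text):
    """Count unique (physical id, core id) pairs in /proc/cpuinfo text."""
    cores = set()
    physical_id = None
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        key = key.strip()
        value = value.strip()
        if key == "physical id":
            # Socket number for the logical processor being described.
            physical_id = value
        elif key == "core id":
            # Hyperthreads on the same core share this (socket, core) pair.
            cores.add((physical_id, value))
    return len(cores)
```

On a compute node this would be called as `count_physical_cores(open("/proc/cpuinfo").read())`. If the result is half of what `os.cpu_count()` reports, hyperthreading is enabled and the smaller number is the one to use for thread and MPI counts.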

Normally you shouldn't need to "request" scratch space, as the entire point of scratch space is exactly that: space which is used while a job is running and can then be discarded. Normally, if you have a node allocated to you, you can use the entire local scratch space on that node while it's allocated. The question is whether you have true per-node local scratch available or the scratch is provided via some sort of shared scratch filesystem. If it is a shared scratch filesystem, then performance will depend on a lot of things (total storage bandwidth, shared filesystem used, interconnect speed, how heavily it's being used at a given moment in time). Again, the key issue if you're trying to optimize processing is the actual (not theoretical max) storage bandwidth.

I can't give you much guidance on how much scratch will be used. I don't have enough details about your jobs. You can check yourself, though, if you have a job running. The default folder for scratch files if you didn't specify one should be /tmp. Just check the disk usage for your job in /tmp at different points of time during the job.
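The disk-usage check can be done with `du`, or with a few lines of Python if you want to log it periodically. The sketch below is a generic illustration, not an EMAN2 utility; `dir_usage_bytes` is a hypothetical helper that sums apparent file sizes under a directory, optionally restricted to files owned by one user.

```python
import os


def dir_usage_bytes(root, uid=None):
    """Sum apparent file sizes under `root`, optionally for one owner only."""
    total = 0
    # os.walk silently skips directories it is not allowed to read.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                info = os.lstat(os.path.join(dirpath, name))
            except OSError:
                continue  # file vanished or is unreadable; scratch is volatile
            if uid is None or info.st_uid == uid:
                total += info.st_size
    return total
```

Calling `dir_usage_bytes("/tmp", uid=os.getuid())` a few times during a run and keeping the largest value gives a measured peak for that job.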



Milad Reyhani

May 5, 2026, 11:21:09 PM
to EMAN2
Hi Steve,

768 box size is with the 2x padding.

To use the scratch space, which I know is shared across multiple nodes, they require a request: "If you wish to use /data/scratch, please submit a request for a scratch filesystem directory."
That said, there is another space on the cluster, called the local temp space. Each physical node has a fast NVMe PCI-E card which provides the fastest filesystem for jobs. The normal capacity per node is 1.8 TB, which is shared between all jobs running on that node. I can use it by writing to /tmp on each node. /tmp is local to each job and each worker node, but it is automatically cleaned once the job has finished. Is that fine? For this, I don't need to request access.

Steve Ludtke

May 6, 2026, 7:49:17 AM
to em...@googlegroups.com
Yes, a local NVMe drive is ideal for scratch space, and /tmp is the default location where such files will be stored if you don't specify. So, for the --parallel option you should be fine with no changes.

Performance-wise, the type of storage where your project folder lives when you run your jobs is critical. It sounds like they probably provide a generic long-term storage filesystem to everyone, but likely that filesystem has limited performance. The "scratch" filesystem probably has higher bandwidth, but I'm just guessing based on typical cluster configurations. So, you have 3 "tiers" of storage:
1) main storage, probably a very large RAID array or somesuch, but with limited bandwidth to the cluster nodes
2) high performance shared scratch, a much smaller RAID or distributed storage array with high bandwidth to the compute nodes, for temporary use while jobs are running
3) local scratch (separate on each node). Highest possible bandwidth, but very limited capacity and lifetime

On such a cluster, the optimal way to run an EMAN2 job:

If you are running on multiple nodes with --parallel=mpi and --threads:
- at the beginning of the job, clone your entire project folder to the shared scratch drive (or at minimum info/, particles/, particles3d/, sets/, .eman2log.txt, .eman3log.txt )
- cd to the project folder on the shared scratch drive
- run your e2spt_refine... command, point the scratch drive for --parallel=mpi to the node-local scratch space (/tmp in your case)
- at the end of the job, rsync the project back to its permanent location
- if you aren't immediately going to run another job, remove the project folder from shared scratch

If you are running on a single node:
- at the beginning of the job, if you have enough space, clone the entire project to the node-local scratch space (if there isn't enough space, you can use the shared scratch instead)
- cd to the project folder on the node-local scratch space 
- run your e2spt_refine... command, point the scratch drive for --parallel=thread to the node-local scratch space (/tmp in your case)
- at the end of the job, rsync the project back to its permanent location (critical, since /tmp may get automatically cleaned up at the end of the job)
- to be polite, in case the auto-cleanup features aren't working well, removing the project folder from the local scratch is a good idea
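
The stage-in, run, stage-out pattern in the lists above can be sketched in Python as follows. This is a minimal illustration, not part of EMAN2: `run_staged` and its arguments are hypothetical names, `shutil.copytree` stands in for the rsync calls so the sketch stays self-contained, and the refinement command is passed in by the caller rather than spelled out here. It assumes Python 3.8 or newer for `dirs_exist_ok`.

```python
import shutil
import subprocess
from pathlib import Path


def run_staged(project_dir, scratch_root, command):
    """Copy a project to scratch, run `command` there, then copy it back."""
    project_dir = Path(project_dir).resolve()
    staged = Path(scratch_root) / project_dir.name

    # Stage in: fails if a stale copy already exists, to avoid clobbering it.
    shutil.copytree(project_dir, staged)

    # Run from inside the staged project folder.
    result = subprocess.run(command, cwd=staged)

    # Stage out even if the command failed, so partial results are kept.
    shutil.copytree(staged, project_dir, dirs_exist_ok=True)

    # Clean up only after the copy back has succeeded.
    shutil.rmtree(staged)
    return result.returncode
```

Because the cleanup happens only after a successful copy back, a failure while staging out leaves the scratch copy in place. That does not help if the scheduler wipes /tmp when the job ends, so the copy back has to happen inside the job itself.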

While it _can_ work to copy only the necessary files mentioned above (info/, particles/, particles3d/, sets/, .eman2log.txt, .eman3log.txt), note that your refinement folder will then be named spt_00 rather than a higher number. If you rsync that back to permanent storage, it will overwrite your previous spt_00 results, so you would instead need to rsync the spt_00 folder to the correct (next sequential) name on permanent storage. In general I recommend avoiding this approach and rsyncing the entire project folder in both directions, as the chances of making a critical mistake are greatly reduced.
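If you do take the partial-copy route, the destination name has to be worked out before copying back. The sketch below is a minimal Python illustration of picking the next sequential folder name; `next_spt_name` is a hypothetical helper, and the two-digit `spt_NN` pattern is inferred from the `spt_00` name mentioned above.

```python
import re


def next_spt_name(existing_names):
    """Return the next unused 'spt_NN' name given existing folder names."""
    pattern = re.compile(r"^spt_(\d+)$")
    numbers = []
    for name in existing_names:
        match = pattern.match(name)
        if match:
            numbers.append(int(match.group(1)))
    # With no existing spt_NN folders this yields spt_00.
    return f"spt_{max(numbers, default=-1) + 1:02d}"
```

For example, `next_spt_name(os.listdir(permanent_project))` (with `permanent_project` standing for the project's permanent location) gives the name that the scratch copy of spt_00 should be written to.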

If you're using MPI on a cluster like this, your job would run faster on 4 nodes than on 2 nodes, but it will NOT be faster by a factor of 2. There are various things which can limit the total performance of a job on a large number of processes (storage and inter-process communication being big ones). When you start seeing a job consistently failing to use all of the CPU time available, this usually means that you have saturated some other cluster resource, and you may actually get better total performance using fewer nodes in such situations. Indeed, there are significant speedups you can gain by running on a single node. If your nodes are actually configured with 128 physical cores, you may actually get faster overall results running on a single node with --parallel=thread:128:/tmp and --threads=128 than you would running on, say, 2 nodes with MPI, but it's possible that running on 2 will still get your results faster (but not 2x faster). You can read about Amdahl's law if you want to understand some of the issues, but it isn't the only factor. 
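
For the Amdahl's law part of this, a short calculation shows why doubling the node count gives less than a 2x gain. The sketch below is a generic Python illustration; the parallel fraction of 0.95 used in the usage note is a made-up value, not a measurement of any EMAN2 program, and the formula ignores the storage and communication limits described above.

```python
def amdahl_speedup(parallel_fraction, n_workers):
    """Ideal speedup when only `parallel_fraction` of the work scales."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_workers)
```

With a hypothetical parallel fraction of 0.95, `amdahl_speedup(0.95, 2)` is about 1.90 and `amdahl_speedup(0.95, 4)` is about 3.48, so going from 2 to 4 workers gives roughly a 1.8x gain rather than 2x, before any I/O saturation is taken into account.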


Milad Reyhani

May 11, 2026, 12:24:49 AM
to EMAN2

Hi Steve,

Thank you for the excellent explanation. It really helped me understand the situation much better.

I will try moving the entire project folder to the scratch drive and run the project from there.
