Dear team,
I am experiencing an issue with refinement jobs not fully utilizing the requested computational resources. I have also noticed that some of my colleagues are encountering similar behavior.
When I submit a job to the cluster and request, for example, 128 cores, the initial iterations run quickly. However, after the first few iterations, the process slows down significantly. I have attached a screenshot from a previous job in which I gathered metadata and ran the p2t2ptrd refinement. In that run, the first two positional refinements were completed within a day with full CPU usage, but resource utilization then dropped considerably, and subsequent iterations took much longer to complete. I have also attached the parameters used.
I would appreciate any advice on whether there is a way to improve or stabilize resource utilization during these later stages.
I look forward to your thoughts.
Best regards,
Milad
--
----------------------------------------------------------------------------------------------
You received this message because you are subscribed to the Google
Groups "EMAN2" group.
To post to this group, send email to em...@googlegroups.com
To unsubscribe from this group, send email to eman2+un...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/eman2
Attachments: Screenshot from 2026-05-04 14-12-54.png, 0_spt_params.json
Dear Steve and Muyuan,
Thank you very much for the detailed explanations.
I assume the next step is to request an allocation in the scratch directory from the cluster staff, since building a local machine capable of replacing the cluster is not feasible for us at this stage.
To give you a clearer picture of the setup and my workload, I run refinement jobs on CPU-only partitions (no GPUs) with the following specifications:
Nodes: 78 and 26
Cores per node: 128
Memory per node: ~977 GB and ~2 TB
Processor: Intel Xeon Gold 6448H
Max wall time: 30 days
Regarding the project, I am refining different regions of an icosahedral virus. The largest box size I currently use is 768 pixels (pixel size 1.603 Å/px), while for smaller regions it goes down to 96 pixels. I initially applied icosahedral symmetry for the whole virus and C5 symmetry for pentons, but individual pentons and other regions are now refined in C1.
Each tilt series should contain ~35 tilts, although I am not certain whether some are excluded during alignment and reconstruction. I have approximately 1,300 virus particles, which corresponds to ~15,600 pentons (box size 768), and ~78,000 monomers (box size 96). A cylindrical mask is applied around regions of interest.
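As a quick sanity check on these counts, here is a minimal sketch of the arithmetic. It assumes 12 pentons per virus (one per icosahedral vertex) and 5 monomers per penton, which is consistent with the numbers above. The per-tilt image totals assume all ~35 tilts are kept, which may not hold if some are excluded. All variable names are illustrative only.

```python
# Back-of-the-envelope check of the particle counts quoted above.
# All inputs are approximate values taken from the text.
VIRUS_PARTICLES = 1300       # approximate number of virus particles
PENTONS_PER_VIRUS = 12       # assumption: one penton per icosahedral vertex
MONOMERS_PER_PENTON = 5      # assumption: five monomers per C5 penton
TILTS_PER_SERIES = 35        # assumption: no tilts excluded

pentons = VIRUS_PARTICLES * PENTONS_PER_VIRUS
monomers = pentons * MONOMERS_PER_PENTON

# Number of 2D tilt images if every tilt contributes one image per particle.
penton_tilt_images = pentons * TILTS_PER_SERIES
monomer_tilt_images = monomers * TILTS_PER_SERIES

print(f"pentons:  {pentons:,} ({penton_tilt_images:,} tilt images)")
print(f"monomers: {monomers:,} ({monomer_tilt_images:,} tilt images)")
```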
The graph I shared corresponds to a job that did not complete the “d” iteration due to the time limit (20 days). The final peaks likely represent reconstruction steps from the “r” iteration or the beginning of the “d” iteration.
Thank you for pointing out the “g” iteration. We were not aware of it, and I will definitely test it.
Regarding scratch space, I know it is an NVMe-based filesystem. I will contact the cluster team for more details and to request an allocation, but I first need to give them an estimate of the required storage. I would appreciate your guidance on this.
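As a possible starting point for that estimate, here is a rough sketch, not a statement of how EMAN2 actually lays out its files. It assumes uncompressed 32-bit floats, one box-sized 2D image per particle per tilt, and optionally one box-sized 3D volume per particle. The real footprint may differ considerably, since files may be compressed or stored at reduced bit depth, and intermediate files are not counted. The function name `estimate_storage_bytes` is hypothetical.

```python
def estimate_storage_bytes(n_particles, box, n_tilts, bytes_per_value=4):
    """Rough uncompressed sizes, in bytes, for one particle set.

    Assumptions (not confirmed for any specific pipeline):
    - each particle has one box x box 2D image per tilt
    - each particle may also have one box^3 3D volume
    - every value is stored as a 4-byte float, with no compression
    """
    stack_2d = n_particles * n_tilts * box * box * bytes_per_value
    volumes_3d = n_particles * box ** 3 * bytes_per_value
    return {"stack_2d": stack_2d, "volumes_3d": volumes_3d}


TILTS = 35  # approximate tilts per series, from the description above

# (particle count, box size in pixels) for the two extremes mentioned above
datasets = {
    "pentons (box 768)": (15_600, 768),
    "monomers (box 96)": (78_000, 96),
}

for name, (count, box) in datasets.items():
    sizes = estimate_storage_bytes(count, box, TILTS)
    print(f"{name}: 2D stack ~{sizes['stack_2d'] / 1e12:.2f} TB, "
          f"3D volumes ~{sizes['volumes_3d'] / 1e12:.2f} TB")
```

The 2D and 3D figures are reported separately so that whichever one does not apply to the actual workflow can be ignored.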
Thank you again for all your help.
Best regards,
Milad
Hi Steve,
Thank you for the excellent explanation. It really helped me understand the situation much better.
I will try moving the entire project folder to the scratch drive and run the project from there.