MCore error on Slurm cluster GPU nodes


Bin Tsai

Jun 18, 2025, 5:10:24 PM
to war...@googlegroups.com
Hi all,

I have been trying to run WarpTools (the latest dev33 version, installed using conda) on a Slurm cluster and am getting random failures when using MCore.

The preprocessing with Warp-Relion worked perfectly on the GPU nodes of the cluster (I use the "srun" command to allocate an entire GPU node).

With MCore, I can run up to " MCore --population --iter 0 " to get an initial reconstruction.

The subsequent MCore refinement then reports the error below, mentioning "WorkerDied", after it finishes refining all series in the data source.

The weird part is that sometimes this error is persistent, while at other times the MCore refinement works without error. The behaviour seems random and I cannot draw a conclusion.

This problem was raised in GitHub issue #382 (https://github.com/warpem/warp/issues/382), which suggests using the dev33 version, but it still does not work in my case.

I would really appreciate it if anyone could give suggestions regarding this error.

Best regards,
Cai



MCore --population TZone_pool_20250314_ver1/TZone_20250314_ver1.population --refine_imagewarp 6x4 --refine_particles --devicelist 0 1 2 3 4 5 6 7
Loading population... Done
Creating directories... Done
Spawning workers... Done
Preparing for refinement – this will take a few minutes per species
Preparing refinement requisites...
1/1                                                                                                                   
Performing refinement
Preparing population for data source TZone_20250314_dataset_ver1...Done
Loading gain reference for TZone_20250314_dataset_ver1... Done
Refining all series in data source...
507/507                                                                                                               
Commiting changes in TZone_20250314_dataset_ver1...Unhandled exception. Unhandled exception. System.NotImplementedException: The method or operation is not implemented.
   at MCore.MCore.WorkerDied(Object sender, EventArgs e) in /usr/share/miniconda/envs/package-build/conda-bld/warp_1727282134776/work/MCore/MCore.cs:line 597
   at Warp.WorkerWrapper.ReportDeath() in /usr/share/miniconda/envs/package-build/conda-bld/warp_1727282134776/work/WarpLib/WorkerWrapper.cs:line 255
   at Warp.WorkerWrapper.<StartHeartbeat>b__14_0() in /usr/share/miniconda/envs/package-build/conda-bld/warp_1727282134776/work/WarpLib/WorkerWrapper.cs:line 189
System.NotImplementedException: The method or operation is not implemented.
   at MCore.MCore.WorkerDied(Object sender, EventArgs e) in /usr/share/miniconda/envs/package-build/conda-bld/warp_1727282134776/work/MCore/MCore.cs:line 597
   at Warp.WorkerWrapper.ReportDeath() in /usr/share/miniconda/envs/package-build/conda-bld/warp_1727282134776/work/WarpLib/WorkerWrapper.cs:line 255
   at Warp.WorkerWrapper.<StartHeartbeat>b__14_0() in /usr/share/miniconda/envs/package-build/conda-bld/warp_1727282134776/work/WarpLib/WorkerWrapper.cs:line 189

Aborted

Pranav Shah

Jun 20, 2025, 12:08:18 AM
to Bin Tsai, war...@googlegroups.com
Hi Bin,
Yes, moving to dev33 does not fix the issue... the only way I have
been able to overcome this is by deleting the M project, resetting the
tiltseries metadata files and then re-initialising the M project from
scratch... Now this works, if you are in the early stages of the
refinement process, but can be soul-crushing if you are a few
iterations in... I unfortunately don't have a solution for this problem
as it is nearly impossible to trigger the behaviour reproducibly... I
suspect file-system lag is triggering this error - but this is just
pure speculation...
Best,
Pranav
--
Pranav Shah
Postdoctoral Research Fellow.

Division of Structural Biology,
Wellcome Trust Centre for Human Genetics,
University of Oxford,
Roosevelt Drive, Oxford OX3 7BN,
UK

Pranav Shah

Jun 20, 2025, 12:08:22 AM
to Bin Tsai, war...@googlegroups.com
Before you delete the M project, could you also post the
postprocess.log file in the refinement_temp folder?
Thanks!
Best,
Pranav

Bin Tsai

Jun 20, 2025, 4:24:10 AM
to Warp
Hi Pranav,

Thank you very much for the suggestions.

I re-ran M processing from scratch several times, and the random failure keeps happening. When the above error occurs, I only have preprocess.log and the logs for individual tomograms in my refinement_temp/log folder, but no postprocess.log.

I also suspect that the failure might be caused by the slow file system of the Slurm cluster, with the heartbeat monitoring of Warp/M then killing the MCore refinement job by mistake. Maybe it would help to modify the heartbeat monitoring parameters in WorkerWrapper.cs.

Best regards,
Cai 

Bin Tsai

Jun 29, 2025, 6:11:20 AM
to Warp
Hi again,

I am not sure if this is the correct way to reply, as my previous replies disappeared.

I seem to have found a workaround for this issue.

The error seems to be caused by DataSource.cs running everything in parallel during hash computation, which results in high CPU usage that leads to worker crashes.

Limiting the number of parallel computations solved this issue in my case. See GitHub issue #394; I hope this is correct.

However, GitHub issue #382 by Pranav seems a bit different from my issue, so I am not sure whether this fix works in both cases.
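
To illustrate the general idea of capping parallelism during hashing (this is not the DataSource.cs change from issue #394, which is C# code inside Warp), here is a minimal shell sketch that hashes many small files with a bounded number of concurrent jobs. The function name, the *.xml pattern, and the job limit are assumptions chosen for illustration.

```shell
#!/usr/bin/env bash
# Illustration only: hash many small files with at most N concurrent jobs.
# hash_files_bounded is a hypothetical helper, not part of Warp or MCore.

hash_files_bounded() {
    # $1: directory to scan, $2: maximum number of parallel sha256sum jobs
    find "$1" -type f -name '*.xml' -print0 |
        xargs -0 -r -n 1 -P "$2" sha256sum
}

# Example: hash_files_bounded warp_tiltseries 4
```

With a small limit, only a few hashing processes run at any time, which is the same trade-off described above: lower peak CPU load in exchange for a somewhat longer hashing step.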

Best regards,
Cai


Mart Last

Jul 7, 2025, 11:59:42 PM
to Warp
Hi all,

I'm running into a similar problem, maybe the same one, so I thought I would respond in this thread. I read the above discussion and the related issue on GitHub and updated WarpTools to dev34 yesterday, which, if I understood correctly, contains the parallelization fix (I am not 100% sure about that, though: 'pip list' lists warp 1.0.4, but during the install I did see 2.0.0dev34 zoom by). Some example errors are pasted below. It seems to happen a bit randomly. Sometimes I get lucky and it doesn't occur, but other times, with the same command and on the same GPU node, I do see it.

Example 1

MCore --population m/tric.population --refine_particles

Loading population... Done
Creating directories... Done
Spawning workers... Done
Preparing for refinement – this will take a few minutes per species
Preparing refinement requisites...
1/1
Performing refinement
Preparing population for data source tric...Done
Loading gain reference for tric... Done

Refining all series in data source...
612/612
Commiting changes in tric...Unhandled exception. Unhandled exception. Unhandled exception. Unhandled exception. Unhandled exception. Unhandled exception. Unhandled exception.Unhandled exception.  System.NotImplementedException: The method or operation is not implemented.
   at MCore.MCore.WorkerDied(Object sender, EventArgs e) in /home/runner/micromamba/envs/package-build/conda-bld/warp_1750964440471/work/MCore/MCore.cs:line 599
   at Warp.WorkerWrapper.ReportDeath() in /home/runner/micromamba/envs/package-build/conda-bld/warp_1750964440471/work/WarpLib/WorkerWrapper.cs:line 264
   at Warp.WorkerWrapper.<StartHeartbeat>b__14_0() in /home/runner/micromamba/envs/package-build/conda-bld/warp_1750964440471/work/WarpLib/WorkerWrapper.cs:line 190
System.NotImplementedException: The method or operation is not implemented.
   at MCore.MCore.WorkerDied(Object sender, EventArgs e) in /home/runner/micromamba/envs/package-build/conda-bld/warp_1750964440471/work/MCore/MCore.cs:line 599
   at Warp.WorkerWrapper.ReportDeath() in /home/runner/micromamba/envs/package-build/conda-bld/warp_1750964440471/work/WarpLib/WorkerWrapper.cs:line 264
   at Warp.WorkerWrapper.<StartHeartbeat>b__14_0() in /home/runner/micromamba/envs/package-build/conda-bld/warp_1750964440471/work/WarpLib/WorkerWrapper.cs:line 190
System.NotImplementedException: The method or operation is not implemented.
   at MCore.MCore.WorkerDied(Object sender, EventArgs e) in /home/runner/micromamba/envs/package-build/conda-bld/warp_1750964440471/work/MCore/MCore.cs:line 599
   at Warp.WorkerWrapper.ReportDeath() in /home/runner/micromamba/envs/package-build/conda-bld/warp_1750964440471/work/WarpLib/WorkerWrapper.cs:line 264
   at Warp.WorkerWrapper.<StartHeartbeat>b__14_0() in /home/runner/micromamba/envs/package-build/conda-bld/warp_1750964440471/work/WarpLib/WorkerWrapper.cs:line 190

System.NotImplementedException: The method or operation is not implemented.
   at MCore.MCore.WorkerDied(Object sender, EventArgs e) in /home/runner/micromamba/envs/package-build/conda-bld/warp_1750964440471/work/MCore/MCore.cs:line 599
   at Warp.WorkerWrapper.ReportDeath() in /home/runner/micromamba/envs/package-build/conda-bld/warp_1750964440471/work/WarpLib/WorkerWrapper.cs:line 264
   at Warp.WorkerWrapper.<StartHeartbeat>b__14_0() in /home/runner/micromamba/envs/package-build/conda-bld/warp_1750964440471/work/WarpLib/WorkerWrapper.cs:line 190

System.NotImplementedException: The method or operation is not implemented.
   at MCore.MCore.WorkerDied(Object sender, EventArgs e) in /home/runner/micromamba/envs/package-build/conda-bld/warp_1750964440471/work/MCore/MCore.cs:line 599
   at Warp.WorkerWrapper.ReportDeath() in /home/runner/micromamba/envs/package-build/conda-bld/warp_1750964440471/work/WarpLib/WorkerWrapper.cs:line 264
   at Warp.WorkerWrapper.<StartHeartbeat>b__14_0() in /home/runner/micromamba/envs/package-build/conda-bld/warp_1750964440471/work/WarpLib/WorkerWrapper.cs:line 190

System.NotImplementedException: The method or operation is not implemented.
   at MCore.MCore.WorkerDied(Object sender, EventArgs e) in /home/runner/micromamba/envs/package-build/conda-bld/warp_1750964440471/work/MCore/MCore.cs:line 599
   at Warp.WorkerWrapper.ReportDeath() in /home/runner/micromamba/envs/package-build/conda-bld/warp_1750964440471/work/WarpLib/WorkerWrapper.cs:line 264
   at Warp.WorkerWrapper.<StartHeartbeat>b__14_0() in /home/runner/micromamba/envs/package-build/conda-bld/warp_1750964440471/work/WarpLib/WorkerWrapper.cs:line 190
System.NotImplementedException: The method or operation is not implemented.
   at MCore.MCore.WorkerDied(Object sender, EventArgs e) in /home/runner/micromamba/envs/package-build/conda-bld/warp_1750964440471/work/MCore/MCore.cs:line 599
   at Warp.WorkerWrapper.ReportDeath() in /home/runner/micromamba/envs/package-build/conda-bld/warp_1750964440471/work/WarpLib/WorkerWrapper.cs:line 264
   at Warp.WorkerWrapper.<StartHeartbeat>b__14_0() in /home/runner/micromamba/envs/package-build/conda-bld/warp_1750964440471/work/WarpLib/WorkerWrapper.cs:line 190
System.NotImplementedException: The method or operation is not implemented.
   at MCore.MCore.WorkerDied(Object sender, EventArgs e) in /home/runner/micromamba/envs/package-build/conda-bld/warp_1750964440471/work/MCore/MCore.cs:line 599
   at Warp.WorkerWrapper.ReportDeath() in /home/runner/micromamba/envs/package-build/conda-bld/warp_1750964440471/work/WarpLib/WorkerWrapper.cs:line 264
   at Warp.WorkerWrapper.<StartHeartbeat>b__14_0() in /home/runner/micromamba/envs/package-build/conda-bld/warp_1750964440471/work/WarpLib/WorkerWrapper.cs:line 190




Aborted (core dumped)

Example 2

(warp) mlast@fmg58 slurm:5830122 [HeLa_MPA_merged]: MCore --population m/tric.population --iter 0 --perdevice_refine 8

Loading population... Done
Creating directories... Done
Spawning workers... Done
Preparing for refinement – this will take a few minutes per species
Preparing refinement requisites...
1/1
Performing refinement
Preparing population for data source tric...Done
Loading gain reference for tric... Done

Refining all series in data source...
612/612
Commiting changes in tric...Unhandled exception.Unhandled exception.Unhandled exception.   Unhandled exception. Unhandled exception. Unhandled exception. Unhandled exception. Unhandled exception. Unhandled exception. Unhandled exception. Unhandled exception. Unhandled exception. Unhandled exception. Unhandled exception. Unhandled exception. Unhandled exception. Unhandled exception. Unhandled exception. Unhandled exception. Unhandled exception. Unhandled exception. Unhandled exception. Unhandled exception. Unhandled exception. Unhandled exception. Unhandled exception. Unhandled exception. Unhandled exception. Unhandled exception. Unhandled exception. Unhandled exception. Unhandled exception. Unhandled exception. Unhandled exception. Unhandled exception. Unhandled exception. Unhandled exception. Unhandled exception. Unhandled exception. Unhandled exception. Unhandled exception. Unhandled exception. Unhandled exception. Unhandled exception. Unhandled exception. Unhandled exception. Unhandled exception. Unhandled exception. Unhandled exception. Unhandled exception. Unhandled exception. Unhandled exception. Unhandled exception. Unhandled exception. Unhandled exception. Unhandled exception. Unhandled exception. Unhandled exception. Unhandled exception. Unhandled exception. Unhandled exception. Unhandled exception. Unhandled exception. Unhandled exception. System.NotImplementedException: The method or operation is not implemented.
   at MCore.MCore.WorkerDied(Object sender, EventArgs e) in /home/runner/micromamba/envs/package-build/conda-bld/warp_1750964440471/work/MCore/MCore.cs:line 599
   at Warp.WorkerWrapper.ReportDeath() in /home/runner/micromamba/envs/package-build/conda-bld/warp_1750964440471/work/WarpLib/WorkerWrapper.cs:line 264
   at Warp.WorkerWrapper.<StartHeartbeat>b__14_0() in /home/runner/micromamba/envs/package-build/conda-bld/warp_1750964440471/work/WarpLib/WorkerWrapper.cs:line 190
Real-time signal 0

Example 3 (the command in this example did once complete successfully in a previous run, but most of the time it raises this error)

(warp) mlast@fmg58 slurm:5830122 [HeLa_MPA_merged]: MCore --population m/tric.population --iter 0 --perdevice_refine 1

Loading population... Done
Creating directories... Done
Spawning workers... Done
Preparing for refinement – this will take a few minutes per species
Preparing refinement requisites...
1/1
Performing refinement
Preparing population for data source tric...Done
Loading gain reference for tric... Done

Refining all series in data source...
612/612
Commiting changes in tric...Unhandled exception.Unhandled exception.  Unhandled exception. Unhandled exception. Unhandled exception. Unhandled exception. System.NotImplementedException: The method or operation is not implemented.
   at MCore.MCore.WorkerDied(Object sender, EventArgs e) in /home/runner/micromamba/envs/package-build/conda-bld/warp_1750964440471/work/MCore/MCore.cs:line 599
   at Warp.WorkerWrapper.ReportDeath() in /home/runner/micromamba/envs/package-build/conda-bld/warp_1750964440471/work/WarpLib/WorkerWrapper.cs:line 264
   at Warp.WorkerWrapper.<StartHeartbeat>b__14_0() in /home/runner/micromamba/envs/package-build/conda-bld/warp_1750964440471/work/WarpLib/WorkerWrapper.cs:line 190
System.NotImplementedException: The method or operation is not implemented.
   at MCore.MCore.WorkerDied(Object sender, EventArgs e) in /home/runner/micromamba/envs/package-build/conda-bld/warp_1750964440471/work/MCore/MCore.cs:line 599
   at Warp.WorkerWrapper.ReportDeath() in /home/runner/micromamba/envs/package-build/conda-bld/warp_1750964440471/work/WarpLib/WorkerWrapper.cs:line 264
   at Warp.WorkerWrapper.<StartHeartbeat>b__14_0() in /home/runner/micromamba/envs/package-build/conda-bld/warp_1750964440471/work/WarpLib/WorkerWrapper.cs:line 190

System.NotImplementedException: The method or operation is not implemented.
   at MCore.MCore.WorkerDied(Object sender, EventArgs e) in /home/runner/micromamba/envs/package-build/conda-bld/warp_1750964440471/work/MCore/MCore.cs:line 599
   at Warp.WorkerWrapper.ReportDeath() in /home/runner/micromamba/envs/package-build/conda-bld/warp_1750964440471/work/WarpLib/WorkerWrapper.cs:line 264
   at Warp.WorkerWrapper.<StartHeartbeat>b__14_0() in /home/runner/micromamba/envs/package-build/conda-bld/warp_1750964440471/work/WarpLib/WorkerWrapper.cs:line 190


System.NotImplementedException: The method or operation is not implemented.
   at MCore.MCore.WorkerDied(Object sender, EventArgs e) in /home/runner/micromamba/envs/package-build/conda-bld/warp_1750964440471/work/MCore/MCore.cs:line 599
   at Warp.WorkerWrapper.ReportDeath() in /home/runner/micromamba/envs/package-build/conda-bld/warp_1750964440471/work/WarpLib/WorkerWrapper.cs:line 264
   at Warp.WorkerWrapper.<StartHeartbeat>b__14_0() in /home/runner/micromamba/envs/package-build/conda-bld/warp_1750964440471/work/WarpLib/WorkerWrapper.cs:line 190
System.NotImplementedException: The method or operation is not implemented.
   at MCore.MCore.WorkerDied(Object sender, EventArgs e) in /home/runner/micromamba/envs/package-build/conda-bld/warp_1750964440471/work/MCore/MCore.cs:line 599
   at Warp.WorkerWrapper.ReportDeath() in /home/runner/micromamba/envs/package-build/conda-bld/warp_1750964440471/work/WarpLib/WorkerWrapper.cs:line 264
   at Warp.WorkerWrapper.<StartHeartbeat>b__14_0() in /home/runner/micromamba/envs/package-build/conda-bld/warp_1750964440471/work/WarpLib/WorkerWrapper.cs:line 190
System.NotImplementedException: The method or operation is not implemented.
   at MCore.MCore.WorkerDied(Object sender, EventArgs e) in /home/runner/micromamba/envs/package-build/conda-bld/warp_1750964440471/work/MCore/MCore.cs:line 599
   at Warp.WorkerWrapper.ReportDeath() in /home/runner/micromamba/envs/package-build/conda-bld/warp_1750964440471/work/WarpLib/WorkerWrapper.cs:line 264
   at Warp.WorkerWrapper.<StartHeartbeat>b__14_0() in /home/runner/micromamba/envs/package-build/conda-bld/warp_1750964440471/work/WarpLib/WorkerWrapper.cs:line 190


Aborted (core dumped)

Thanks in advance for the help! 

Best wishes,
Mart

Alister Burt

Jul 8, 2025, 8:58:02 AM
to Mart Last, Warp
Hi Mart,

Run conda list to check your conda deps; there must be some Python package called warp in your env too.

Thanks for the report - it does look like it’s breaking in the same place… can you test with a few tilt series rather than the whole set to see if it still blows up?

Can you keep an eye on memory usage whilst it runs and report on that too?
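
One way to keep an eye on memory and CPU while a job runs is to sample ps periodically. A minimal sketch, assuming the main process shows up in ps under the command name MCore (that name, the helper function, and the log file name are assumptions; worker processes may appear under a different name):

```shell
#!/usr/bin/env bash
# Sample CPU and memory use of processes with a given command name.
# sample_usage is a hypothetical helper, not part of WarpTools.

sample_usage() {
    # $1: command name to match exactly
    # Prints one line per match: epoch_seconds pid %cpu rss_in_kB
    ps -e -o pid= -o pcpu= -o rss= -o comm= |
        awk -v n="$1" -v t="$(date +%s)" '$4 == n { print t, $1, $2, $3 }'
}

# Log a sample every 5 seconds until interrupted (run in a second terminal):
# while true; do sample_usage MCore >> mcore_usage.log; sleep 5; done
```

The resulting log can be compared against the time of the crash to see whether CPU or resident memory spiked just before it.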

Cheers,

Alister

Sent from mobile - apologies for brevity


Mart Last

Jul 9, 2025, 11:30:36 AM
to Warp
Hi Alister,

Thanks for the reply. I deleted the random warp package and real warp popped up with 2.0.0dev34.

Is there a simple way to run a refinement for a subset of the tilt series only? Most of the particles I'm trying to average are picked in just a small subset of 10 - 20 tomograms, so in general it would be nice not to loop over all 612 tomos. But I'm not sure how to.

While I was monitoring memory usage, the first couple of runs actually did not crash. Then one refinement got stuck:

 MCore --population m/R.population --refine_imagewarp 4x4 --refine_particles --ctf_defocus --ctf_defocusexhaustive --perdevice_refine 4

Loading population... Done
Creating directories... Done
Spawning workers... Done
Preparing for refinement – this will take a few minutes per species
Preparing refinement requisites...
1/1
Performing refinement
Preparing population for data source R...Done
Loading gain reference for R... Done

Refining all series in data source...
598/612

I backgrounded that process (Ctrl+Z) and started a new one using --port 14301:

 MCore --population m/R.population --refine_imagewarp 4x4 --refine_particles --ctf_defocus --ctf_defocusexhaustive --perdevice_refine 4 --port 14301

Loading population... Done
Creating directories... Done
Spawning workers... Done
Preparing for refinement – this will take a few minutes per species
Preparing refinement requisites...
1/1
Performing refinement
Preparing population for data source R...Done
Loading gain reference for R... Done

Refining all series in data source...
612/612
Commiting changes in R...Unhandled exception. Unhandled exception. Unhandled exception. Unhandled exception. Unhandled exception. System.NotImplementedException: The method or operation is not implemented.

   at MCore.MCore.WorkerDied(Object sender, EventArgs e) in /home/runner/micromamba/envs/package-build/conda-bld/warp_1750964440471/work/MCore/MCore.cs:line 599
   at Warp.WorkerWrapper.ReportDeath() in /home/runner/micromamba/envs/package-build/conda-bld/warp_1750964440471/work/WarpLib/WorkerWrapper.cs:line 264
   at Warp.WorkerWrapper.<StartHeartbeat>b__14_0() in /home/runner/micromamba/envs/package-build/conda-bld/warp_1750964440471/work/WarpLib/WorkerWrapper.cs:line 190
System.NotImplementedException: The method or operation is not implemented.
   at MCore.MCore.WorkerDied(Object sender, EventArgs e) in /home/runner/micromamba/envs/package-build/conda-bld/warp_1750964440471/work/MCore/MCore.cs:line 599
   at Warp.WorkerWrapper.ReportDeath() in /home/runner/micromamba/envs/package-build/conda-bld/warp_1750964440471/work/WarpLib/WorkerWrapper.cs:line 264
   at Warp.WorkerWrapper.<StartHeartbeat>b__14_0() in /home/runner/micromamba/envs/package-build/conda-bld/warp_1750964440471/work/WarpLib/WorkerWrapper.cs:line 190
System.NotImplementedException: The method or operation is not implemented.
   at MCore.MCore.WorkerDied(Object sender, EventArgs e) in /home/runner/micromamba/envs/package-build/conda-bld/warp_1750964440471/work/MCore/MCore.cs:line 599
   at Warp.WorkerWrapper.ReportDeath() in /home/runner/micromamba/envs/package-build/conda-bld/warp_1750964440471/work/WarpLib/WorkerWrapper.cs:line 264
   at Warp.WorkerWrapper.<StartHeartbeat>b__14_0() in /home/runner/micromamba/envs/package-build/conda-bld/warp_1750964440471/work/WarpLib/WorkerWrapper.cs:line 190
Aborted (core dumped)

I did get the crash again. Memory never seemed to be running out, by the way. Maybe I'm just making a mess by running a new refinement when the original one gets stuck; this would explain at least some of the earlier instances where I ran into this issue, but not all, so I'll keep an eye on it. For now I suppose the main problem is the combination of refinements getting stuck and me being impatient...

Cheers,
Mart

Alister Burt

Jul 9, 2025, 1:11:00 PM
to Mart Last, Warp
Hey Mart,

I don't know about "easy", but a not-very-ergonomic way to work on subsets would be to hide the tilt series XML files in another folder when you set up your M data source.
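
A rough sketch of that approach, assuming the tilt series XML files sit in one folder and that a plain text file lists the base names to keep (the function name, keep.txt, and both folder names are hypothetical). It is safest to try this on a copy of the folder and before the M data source is created:

```shell
#!/usr/bin/env bash
# Move every tilt series XML that is NOT on a keep list into a side folder.
# hide_unlisted_xml is a hypothetical helper, not part of WarpTools.

hide_unlisted_xml() {
    # $1: folder containing the *.xml files
    # $2: text file with one base name per line (without the .xml suffix)
    # $3: folder that receives the hidden files
    local src="$1" keep="$2" hidden="$3" f base
    mkdir -p "$hidden"
    for f in "$src"/*.xml; do
        [ -e "$f" ] || continue          # the glob matched nothing
        base="$(basename "$f" .xml)"
        if ! grep -Fxq "$base" "$keep"; then
            mv "$f" "$hidden"/
        fi
    done
}

# Example: hide_unlisted_xml warp_tiltseries keep.txt warp_tiltseries_hidden
```

Moving the files back into the original folder undoes the selection.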

Interesting if multiple M processes running at the same time is part of the problem - will have to think about that!

Cheers,

Alister

Bin Tsai

Jul 12, 2025, 9:17:25 PM
to Alister Burt, Mart Last, Warp
Hi both,

This issue seems to be similar to mine.

I was able to fix this problem by changing the code of DataSource.cs in dev33 according to the GitHub discussion and recompiling Warp/M. Maybe you could check whether your installation has similar changes.

About the high CPU usage when MCore is committing changes: I found that RAM was quite free, but MCore would take almost 2600% CPU. After limiting the number of parallel operations in this step, the CPU usage dropped to 300%.

Also, I noticed that the file system matters when using an HPC cluster. I now keep the XML files on an SSD scratch disk and the frames on NAS/flash-style storage. Using a Lustre file system might also cause problems.
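
To check which file system a given project folder actually lives on, GNU stat can report the file system type. A minimal sketch, assuming a Linux node with GNU coreutils; the helper name is hypothetical and the folder names are taken from elsewhere in this thread, so they may differ in your project:

```shell
#!/usr/bin/env bash
# Report the file system type behind each project folder (GNU stat only).

fs_type() {
    # $1: any existing path; prints the type name of the file system holding it
    stat -f -c %T -- "$1"
}

for d in warp_tiltseries warp_frameseries m; do
    if [ -e "$d" ]; then
        printf '%s\t%s\n' "$d" "$(fs_type "$d")"
    fi
done
```

Running this from the project root on a compute node shows whether the XML metadata and the frames really sit on the storage you expect.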

Hope this helps.

Best regards,
Cai

Juha Huiskonen

Aug 10, 2025, 2:21:52 AM
to Warp
Hi Cai and colleagues,

I am running MCore (dev34) on an HPC cluster using the SLURM job scheduler and am having the same issue: the MCore job fails at the "Commiting changes" step. I have checked the job with the seff command, and nothing indicates that the job is running out of memory. Still, the same refinement works when the box size is 128 pixels. When I try a box size of 256 pixels, it crashes.

Next, I will try to copy the files using rsync from the Lustre disk to a fast SSD disk (only available during run time). Has anyone checked which folders are needed by MCore during refinement? And which folders contain the results? Copying the entire project over is not practical for me as the SSD space is limited and takes too long. 

Cheers,
Juha


Juha Huiskonen

Aug 14, 2025, 2:57:28 AM
to Warp
I got this working finally with my current dataset and conda installation of Warp dev34.

First, I stage the input files to an SSD disk:

rsync -av --include="*.source" --include='*.xml' --exclude='*' warp_tiltseries/ "$LOCAL_SCRATCH"/warp_tiltseries/
rsync -av tomostar m "$LOCAL_SCRATCH"
ln -s "$WORKDIR"/warp_frameseries "$LOCAL_SCRATCH"/warp_frameseries

Here are the parameters I set in my SLURM job script:

#!/bin/bash
#SBATCH -J warptools
#SBATCH -o batch/m_virion_run1.out
#SBATCH -e batch/m_virion_run1.err
#SBATCH --partition=gpumedium
#SBATCH --time=60:00
#SBATCH --hint=nomultithread
#SBATCH --ntasks=4
#SBATCH --mem 0
#SBATCH --cpus-per-task=32
#SBATCH --gres=gpu:a100:4,nvme:3400

source /projappl/project_2009057/apps/miniconda3/etc/profile.d/conda.sh
conda activate warp

set -euo pipefail
trap 'echo "ERR at: $BASH_COMMAND" >&2' ERR

export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1
export DOTNET_PROCESSOR_COUNT=10
export DOTNET_ThreadPool_MinThreads=256
export MALLOC_ARENA_MAX=4

I am running MCore from the script with the following command:

srun -n 1 -c "$SLURM_CPUS_PER_TASK" -u --cpu-bind=cores --gpu-bind=none \
    MCore \
    --perdevice_preprocess 1 \
    --perdevice_refine 4 \
    --perdevice_postprocess 1 \
    --population m/virion.population \
    --refine_imagewarp 4x4 \
    --refine_particles \
    --ctf_defocus \
    --ctf_defocusexhaustive

(Some parts might be overly complicated / not needed. I haven't thoroughly tested this.)

In particular, the value of DOTNET_PROCESSOR_COUNT is critical. Values that are too small or too large cause the heartbeat to fail and crash the job in the "Preparing refinement requisites" or "Committing changes" steps.
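
DOTNET_PROCESSOR_COUNT is a .NET runtime setting that overrides the processor count the runtime reports to the application. A small sketch of one way to set it without exceeding the Slurm allocation; the helper name is hypothetical, and the starting value of 10 is simply the one from the script above, not a recommendation:

```shell
#!/usr/bin/env bash
# Choose DOTNET_PROCESSOR_COUNT so that it never exceeds the allocated CPUs.
# pick_processor_count is a hypothetical helper, not part of WarpTools.

pick_processor_count() {
    # $1: desired count, $2: CPUs allocated to the task
    local want="$1" avail="$2"
    if [ "$want" -gt "$avail" ]; then
        echo "$avail"
    else
        echo "$want"
    fi
}

# Falls back to 10 when SLURM_CPUS_PER_TASK is not set (outside a Slurm job).
DOTNET_PROCESSOR_COUNT="$(pick_processor_count 10 "${SLURM_CPUS_PER_TASK:-10}")"
export DOTNET_PROCESSOR_COUNT
```

Since the workable range appears to depend on the cluster, the desired count would probably need to be found by trial on each system.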

I hope this helps others!

Cheers,
Juha

Juha Huiskonen

Aug 17, 2025, 3:21:37 AM
to Warp
Update: spoke too soon... The same job failed now at the filtering step.  I have observed this issue with other jobs: they may randomly fail at any step where MCore appears to be reading numerous small files (and calculating hashes). I suspect this is because the LUSTRE file system in an HPC environment may be under more stress than usual. The refinement step itself is OK, and GPU memory is not a limiting factor. So as of today, I haven't found a way to make MCore run stably on an HPC system using LUSTRE. Any suggestions? 

Alister Burt

Aug 17, 2025, 12:20:27 PM
to Juha Huiskonen, Warp
Hi Juha,

This is not an issue we see on our infrastructure (GPFS). If you can work with your HPC admins to come up with a concrete recommendation, we will certainly consider it.

Cheers,

Alister

Sent from mobile - apologies for brevity


Kain van

Sep 19, 2025, 11:19:43 AM
to Warp

Hi all,

I encountered the same issue on a cluster setup where the M refinement would consistently fail at the "committing changes" step, even when running a single iteration with no refinement.

However, switching to warptools/2.0.0dev35+1 resolved the issue completely. I've since been able to run the full range of m-refinements without any problems.

Hope this helps anyone else running into the same problem!

Cheers,

Kain

Juha Huiskonen

Sep 24, 2025, 1:12:08 PM
to Warp
Hi Kain,

Thanks for sharing this! In the meantime, I have modified the source code to make it run on our cluster. Before sharing this, I will test whether warptools/2.0.0dev35+1 solves our issues as well.

Cheers,
Juha 
