Increasing tasks per node causing instability


Madison Mirelez

Apr 8, 2026, 8:45:29 PM
to MBX-users
Hello, 

I am attempting to do scalability testing on an NVE simulation of 7579 water molecules using the MBX potential in LAMMPS. The system is a 61.5 x 61.5 x 61.5 Angstrom box generated with packmol and converted to a LAMMPS data file with a Python script. I have attached these files as generate_water.inp, convert.py, and waterbox.data, along with my LAMMPS input file (in.water) and my job submission script (sb.water).
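
For reference, the packing step in generate_water.inp looks roughly like this (a sketch assuming a single-water coordinate file named water.pdb; the attached file is the authoritative version):

# pack 7579 waters into the 61.5 Angstrom cube described above
tolerance 2.0
filetype pdb
output waterbox.pdb

structure water.pdb
  number 7579
  inside box 0. 0. 0. 61.5 61.5 61.5
end structure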

When I submit the job with 1 node and 1 task per node, the simulation runs smoothly and remains stable. However, when I increase the tasks per node to 2 or higher, I begin to encounter problems: a steady rise in the total energy, pressure, and temperature that sometimes ends in a lost-atoms error. I have noticed that as the tasks per node increase, so do the starting total energy and the likelihood of the simulation hitting the lost-atoms error, while the length of the minimization phase decreases.

I am new to using MBX and to generating my own LAMMPS data files, so I'm having trouble deciphering where the issue is stemming from. I would greatly appreciate any advice or guidance regarding this issue.

Thank You,
Madison

sb.water
generate_water.inp
convert.py
waterbox.data
in.water

Henry Agnew

Apr 10, 2026, 3:15:51 AM
to MBX-users
Dear Madison,
         Welcome to the forum! First, a clarifying question about "--ntasks-per-node": could you please provide more information on what you are attempting to accomplish by modifying it? Depending on what specifically you are trying to do, there may be better options than "--ntasks-per-node".

Also, could you please attach the log.lammps files for both the ntasks=1 and ntasks=2 runs? Those files may contain useful information to help us narrow down what is occurring.

Thanks, and we look forward to hearing from you,
- The MBX Team

Madison Mirelez

Apr 10, 2026, 11:43:35 PM
to MBX-users
Dear MBX team, 

Thank you so much for the response and for taking the time to help me with this issue.

Regarding "--ntasks-per-node": my intention was to perform scalability testing. I varied the number of tasks per node over 1, 2, 4, 8, 16, 32, 64, and 96, recorded the total wall time for each run, and analyzed the results to identify the most efficient configuration and determine where diminishing returns begin.
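
For context, each run was submitted with a script along these lines (a sketch, not the exact sb.water; module loads and account settings omitted):

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2    # varied over 1, 2, 4, 8, 16, 32, 64, 96
#SBATCH --time=01:00:00
# launch one MPI rank per task
srun lmp -in in.water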

I have attached the log.lammps files for both the ntasks=1 and ntasks=2 runs as requested. I also included the log.lammps file for ntasks=96, as this run produced a lost-atoms error that may be helpful in diagnosing the issue. These files use the naming convention my_inp_<ntasks>.

Additionally, I attempted to run the 2048_h2o simulation provided in the GitHub repository using multiple tasks per node. In this case I observed slightly different behavior: as the number of tasks per node increased, the initial total energy decreased across runs; it then grew steadily during each run before sometimes producing the following error:

"gammq: x = -nan, a = 0.75
lmp: potential/electrostatics/gammq.cpp:241: double elec::gammq(double, double): Assertion `x >= 0.0 && a > 0.0' failed."

I have also attached the corresponding log.lammps files from these simulations, for ntasks=1, 2, and 96, using the naming convention 2048_h2o_<ntasks>.

Please let me know if there is any additional information that I can provide. Thank you again for the help, and I look forward to your response.

-Madison Mirelez
my_inp_1.lammps
my_inp_96.lammps
my_inp_2.lammps
2048_h2o_2.lammps
2048_h2o_96.lammps
2048_h2o_1.lammps

Henry Agnew

Apr 13, 2026, 6:12:59 PM
to MBX-users
Dear Madison,
         We are still investigating, but we have a short-term fix that will likely solve your problems. MBX is parallelized primarily with OpenMP (OMP) rather than MPI, so we recommend using OMP instead of MPI for your scalability testing. Here is an example for 8 cores:

export OMP_NUM_THREADS=8
../../../../build/lmp -in in.mbx_h2o
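
If you want to sweep thread counts in a single session, a simple shell loop works (the log-file names here are just a suggestion):

# run the example at several OpenMP thread counts and keep separate logs
for t in 1 2 4 8 16; do
    export OMP_NUM_THREADS=$t
    ../../../../build/lmp -in in.mbx_h2o -log log.omp_${t}.lammps
done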



As for the investigation, I have not yet been able to replicate the issue you are encountering on our development computers. As you can see in my attachments, my 2048_h2o_1 is very similar to what you encountered. However, for 2048_h2o_2 my results look okay while yours diverge significantly. I will next test on a SLURM supercomputer to see whether the issue is caused specifically by SLURM's "ntasks-per-node" or by something else, and I will send a follow-up email once I have those results.


We also just released a new video tutorial today on the MBX+LAMMPS examples, in case you are interested: https://www.youtube.com/watch?v=CaggOelutEc

Best regards,
- The MBX Team
henry_2048_h2o_1.lammps
henry_2048_h2o_2.lammps

Henry Agnew

Apr 13, 2026, 9:04:09 PM
to MBX-users
Dear Madison,
        I ran some additional tests on a SLURM supercomputer and was still unable to replicate the specific issue you are experiencing, so I am not yet sure what is going on.

For your scalability tests, I would still recommend using OMP_NUM_THREADS instead of MPI. This should work better for you, but please let us know if you encounter any issues.


If you are still interested in figuring out what is going wrong with your MPI runs, could you please do a fresh install using the LAMMPS release branch? I see you cloned from the "develop" branch, so it is possible that something is wrong in that specific development version. If you have time, could you please do a new installation and see whether the issue persists?
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
git clone -b release https://github.com/lammps/lammps.git
cd lammps

cmake -S cmake -B build -C ./cmake/presets/basic.cmake -D PKG_MBX=yes -D PKG_EXTRA-PAIR=yes
cmake --build build --parallel 4
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
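
To quickly confirm that the MBX package made it into the new binary, you can grep the help output, which lists the compiled-in packages (an optional sanity check):

./build/lmp -h | grep -i mbx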

Best regards,
- The MBX Team

supercomputer_2048_h2o_2.lammps

Madison Mirelez

Apr 14, 2026, 12:54:01 AM
to MBX-users
Dear MBX Team,

Thank you so much for all the help.

I tested the 2048_h2o simulation using OMP_NUM_THREADS with 1, 2, 4, and 8 cores, and the output looks good; I am no longer observing the issues I was previously encountering. I will also be sure to check out the video tutorial on the MBX + LAMMPS examples. I am currently testing my own input script with OMP_NUM_THREADS and varying numbers of cores, and so far the results appear error-free and more consistent than my previous results.
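
To compare the runs, I am simply collecting the wall time that LAMMPS prints at the end of each log (the log-file names here are just placeholders for my naming scheme):

grep "Total wall time" log.omp_*.lammps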

I plan to recompile LAMMPS using the commands you provided. I will follow up once I have done so to confirm whether it resolves my earlier issues with MPI and "ntasks-per-node".

Thank you again for your time and guidance.

-Madison Mirelez 

Madison Mirelez

Apr 17, 2026, 11:33:41 PM
to MBX-users
Dear MBX Team, 

I wanted to provide an update regarding the numerical errors I have been encountering with MPI and "ntasks-per-node".

I did a fresh install of LAMMPS from the release branch and tested it with the 2048_h2o example while varying "ntasks-per-node", but I am still seeing the same numerical errors. I also ran the tests on a different SLURM cluster to check whether the issue might be system-specific, and I got the same unusual output there as well.

I have been using OMP_NUM_THREADS instead, and I have not run into any unusual behavior or issues. I will continue investigating to see if there is anything else I can do to resolve the problem.

Thank you all for your help and suggestions.

- Madison Mirelez

madison_2048_h2o_2.lammps