Resource Usage


Barthle, Jonathan

May 29, 2025, 12:48:21 PM
to inl-rav...@googlegroups.com
Good afternoon,

I have been working with HPC support to figure out an issue with resource usage when running RAVEN on Bitterroot. I am requesting the resources, but they do not appear to actually be used. As a result, the RELAP5-3D simulations take an extremely long time (reaching the walltime) during the MultiRun. When I run RELAP5-3D outside of RAVEN, the run finishes well within the walltime. We have tried several different things to find a solution.

The HPC support team reached out to the RAVEN team for additional recommendations. One of them was to add <MPIParam>--bind-to none</MPIParam> to the submission. How is this node supposed to be added? I have been getting errors when I try to implement it.

I was also wondering what might be causing this issue. I have attached my RunInfo block for reference.

Thank you for your assistance,
Jonathan
Screen Shot 2025-05-29 at 12.42.04 PM.png

Joshua J. Cogliati

May 29, 2025, 10:37:42 PM
to Barthle, Jonathan, inl-rav...@googlegroups.com

Add:

      <MPIParam>--bind-to none</MPIParam>

inside of <Simulation><RunInfo><mode>
(or right after <runSbatch/>).
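
As a rough sketch of that placement, the relevant part of the input might look like the fragment below. The mode text (mpi) and the elided sibling nodes are assumptions for illustration and are not taken from the attached RunInfo block; only the <MPIParam> line and its position relative to <runSbatch/> come from this message.

```xml
<Simulation>
  <RunInfo>
    <!-- other RunInfo nodes stay as they are in your input -->
    <!-- "mpi" below is an assumed mode; keep whatever your input already uses -->
    <mode>mpi
      <runSbatch/>
      <MPIParam>--bind-to none</MPIParam>
    </mode>
  </RunInfo>
</Simulation>
```

The point of the sketch is only that <MPIParam> is a child of <mode>, not a direct child of <RunInfo>.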

Joshua Cogliati


Joshua J. Cogliati

Aug 14, 2025, 9:58:41 PM
to Barthle, Jonathan, inl-rav...@googlegroups.com

Hm, that is a new error to me.

Joshua Cogliati

On 8/14/25 11:03 AM, Barthle, Jonathan wrote:
Good afternoon,

I have a follow-up question about requesting resources. Now that I have the "bind-to none" option working, I have been requesting additional nodes to increase the number of cases running at a given time. However, when I request more than 3 full nodes, I begin to run into a significant number of errors. Is there a potential solution to this problem, or is there a hard limit on the number of nodes I can request within RAVEN?

Attached is an example of the slurm.out file that I get. After a certain number of cases have run, the cases begin to fail immediately. I am getting numerous errors, including:

"<jemalloc>: background thread creation failed (11)"

"pmix_progress_thread_start failed
  --> Returned value -1 instead of PMIX_SUCCESS"

"[br370:432108] PRTE ERROR: The system limit on number of children a process can have was reached in file plm_slurm_module.c at line 436"

Thank you for your time
Jonathan