Running with large number of threads/processes kills WESTPA job


David LeBard

Feb 13, 2020, 11:13:56 AM2/13/20
to westpa-users
Hi WESTPA folks,

I am trying to run a WESTPA simulation on a single AWS p3.16xlarge instance, which is packed with 64 hyperthreaded cores and 8 V100 GPUs and runs Amazon's flavor of Linux. Unfortunately, if I run the simulation as I normally would, setting CUDA_VISIBLE_DEVICES to WM_PROCESS_ID modulo the number of GPUs, I can run for a few iterations, then I reproducibly hit this strange error and the simulation stops:

+ python /home/ec2-user/membrane/common_files/membrane_prod.py
ERROR; return code from pthread_create() is 11
        Error detail: Resource temporarily unavailable
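
For reference, the GPU assignment in my run script is essentially this (paraphrased, so treat it as a sketch rather than the exact lines):

NUM_GPUS=8
export CUDA_VISIBLE_DEVICES=$(( WM_PROCESS_ID % NUM_GPUS ))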

It seems I can mitigate the problem by setting the WM_N_WORKERS variable to 2x the number of GPUs (i.e., 16 workers), but this seems like it should be unnecessary and might not be the actual fix. I have tried both the threads and processes work managers, and both have this problem.

Has anyone else run into issues like this? Do you think it could be due to the hyperthreading of the cores, and if so, should I turn that "feature" off? Or are there better fixes out there that others know about?

I should also mention that I have run this simulation successfully across 4x K80s on a local GPU node, and on a single GTX 1080 on my local workstation, without any issues.

Thanks in advance,
David 

Aud J. Pratt

Feb 13, 2020, 11:34:50 AM2/13/20
to westpa...@googlegroups.com
Hi David,

That error typically arises when the system doesn't have the resources available to create another thread, or when you're about to hit some sort of system-imposed maximum. I don't recall the default for WM_N_WORKERS; it's possible it might be 64 (or 128) if it's pulling that info from the OS. I wonder if AWS has some limitations on the number of threads that can be created?

Judging from a Google search, it seems that the amount of virtual memory is involved in the calculation of the thread limit for a particular process. The following command on a Linux machine should tell you how many threads you can have: cat /proc/sys/kernel/threads-max

On my home machine, this yields 514054. Given the stats for the p3.16xlarge instance, I'd be surprised if this number is very low, but it's possible we're somehow trying to create way more threads than we anticipated. Do you have any log files from your run that include the output of the env command? In addition, it might not hurt to provide the result of threads-max.
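
If it's not too much trouble, something along these lines would give the full picture (these are standard Linux checks, nothing WESTPA-specific):

cat /proc/sys/kernel/threads-max   # system-wide thread cap
ulimit -u                          # per-user process/thread limit in the shell that launches the run
ulimit -v                          # virtual memory limit, which feeds into the per-process thread math
env | sort                         # so we can see WM_N_WORKERS and friends as the job sees them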

Since it's going for a few iterations and then failing out, perhaps threads are lingering or not being re-used in the way we think they would.  Is membrane_prod.py code you've written?  Are you attempting to fork or use multiprocessing within that script?
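
One quick way to test the lingering-thread idea (just a sketch; adjust the pattern to match however your workers show up in ps, and replace <PID> with a real worker PID):

watch -n 5 'ps -eLf | grep -c [p]ython'   # total python thread count, refreshed every 5 s
grep Threads /proc/<PID>/status           # thread count for one specific process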

Best,
Audrey



--
Audrey Pratt
Graduate Student in Chemistry
Chong Lab, Room 338, Eberly Hall
University of Pittsburgh
Pittsburgh, PA 15260

David LeBard

Feb 13, 2020, 12:00:43 PM2/13/20
to westpa-users
Hi Audrey,

Thanks for the quick reply!

I also saw that AWS has a max thread limit set, so I cranked it up to unlimited, and it still had this issue; that's when I started playing with the WM_N_WORKERS variable. Also, the output of cat /proc/sys/kernel/threads-max is 3934081, so I'm not sure that's the issue.

As for the env command, there's no real output, but here are the contents of my env.sh, where I set up the worker count and other WESTPA variables:

#!/bin/bash

# Set up environment for westpa
export WEST_PYTHON=$(which python3.6)
export WEST_SIM_ROOT="$PWD"
export SIM_NAME=$(basename $WEST_SIM_ROOT)
export WM_N_WORKERS=16   # two workers per GPU on the 8-GPU instance

And regarding the python file that threw that error, it's really a dead simple script: it just deserializes an old system, sets up the simulation state, runs the simulation, then serializes the final state. 

And you said "Since it's going for a few iterations and then failing out, perhaps threads are lingering or not being re-used in the way we think they would." I totally agree this could be the issue. Can you suggest any way of testing or confirming this? 

Thanks again,
David

JD Russo

Feb 13, 2020, 12:19:11 PM2/13/20
to westpa...@googlegroups.com
Could it be a memory issue when it's spawning the new processes? A simple way to check would be to just watch the “top” command as your code runs and keep an eye on the memory usage column.
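
For instance, assuming the standard procps tools are on that AMI, something like:

top -o %MEM      # sort by memory so the heaviest workers float to the top
free -h -s 10    # or log overall memory usage every 10 seconds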



David LeBard

Feb 13, 2020, 1:55:49 PM2/13/20
to westpa-users
Hi John,

Thanks for the suggestion. I had been watching processes through htop and could see all the processes die at more or less the same time. Total memory usage probably is not an issue, since this instance has an absurd 480 GB of system RAM, and the spikes in memory usage max out at about 6 GB (average usage is closer to 2.5 GB).

Also, WM_N_WORKERS seems to keep the system under control when I limit it to 2x the number of GPUs (8x2 = 16). However, when I run with either 4x (8x4 = 32) or 8x (8x8 = 64), I still hit the same error, even though the instance itself (I believe) has 64 virtual cores and 32 physical cores. FWIW, htop shows 64 cores at the top of its output.

David




John Russo

Feb 13, 2020, 2:38:42 PM2/13/20
to westpa-users
Just a thought re: those core counts. I do see that, like you say, AWS's page describing this instance says it has 64 vCPUs, which in this case should be equivalent to logical/virtual cores. However, I also see that it claims this instance has a Xeon E5-2686, which has 18 physical cores and 36 logical cores with hyperthreading. So if it's providing 64 logical cores to you, it must be allocating them from multiple CPUs.

https://www.credera.com/blog/technology-solutions/whats-in-a-vcpu-state-of-amazon-ec2-in-2018/ describes some issues with vCPUs and burst performance. Just for fun, would it be easy to try that instance with hyperthreading disabled? If it's choking at 32 threads now, you could try those same 32 with hyperthreading disabled, so with each thread mapped to a physical core.
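
In case it helps, something like the following should show how those vCPUs map out and (assuming a reasonably recent kernel on that AMI) let you toggle SMT off without relaunching; I believe EC2 also lets you request one thread per core via the CPU options at launch, which might be cleaner:

lscpu | grep -E '^(CPU\(s\)|Thread|Core|Socket)'          # sockets, cores per socket, threads per core
echo off | sudo tee /sys/devices/system/cpu/smt/control   # runtime SMT toggle (needs root, kernel >= 4.19)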

Gabriel Monteiro da Silva

Jul 9, 2021, 4:24:46 PM7/9/21
to westpa-users
Hi David,

We are also trying to run WESTPA on AWS instances and running into some issues; I was wondering if you were able to find a fix for this issue in particular.

Thanks!

David LeBard

Jul 16, 2021, 3:20:56 PM7/16/21
to westpa-users

Hi Gabriel,

I was able to successfully run WESTPA on a raw AWS instance, but I have since moved to using WESTPA in Orion (OpenEye’s platform running on AWS). 

Regarding your problem, did you have any luck adjusting the WM_N_WORKERS environment variable? If I recall, I ended up using an 8-GPU instance, and because of that I had to set WM_N_WORKERS to either 8 or 16. I believe I also disabled hyperthreading within the AWS console, and that helped as well. 
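
Roughly, the working setup boiled down to something like this (reconstructing from memory, so treat the exact lines as a sketch):

export WM_N_WORKERS=8                                  # or 16; one or two workers per GPU
export CUDA_VISIBLE_DEVICES=$(( WM_PROCESS_ID % 8 ))   # pin each worker to one of the 8 V100s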

Please report back if this helps, or share the errors you're seeing if it does not.

All the best,
David