Resource exhaustion despite OK quota, and retry frequency settings

jam...@broadinstitute.org

Dec 3, 2020, 9:40:42 PM

to GCP Life Sciences Discuss

I have questions about resource exhaustion and job retry patterns. I originally asked them in a dsub issue, but was pointed this way because they are about the Life Sciences API rather than the dsub software itself.

I have GPU-based jobs that I am launching with dsub that I permit to run on 3 regions: us-central1, us-east1, or us-west1. (So far, I am not specifying zones, just regions.)

One particular job has a few thousand tasks. I'm testing the job by launching two of the tasks to make sure they are well behaved. Both questions are prompted by watching these two tasks.

The first task completed in about 30 minutes. The second task was blocked by resource limits twice and finally launched on a third attempt.

Via dstat -f, I only see the initial start attempt and two retry attempts (Note: this has formatting and is easier to read at the dsub issue link):

Trying to understand the resource limit, it's not clear to me what quotas we might be hitting, since none of the relevant quotas are remotely full (quota screenshot at the dsub issue).

So, my questions are:

1. Is it possible that I am hitting a regional resource limit that is unrelated to my account? (I.e., has the entirety of us-west1 run out of GPUs, so that I can't get one even though I am using 0 of the 16 permitted by the quota?) That seems unlikely. So a follow-up, as Matt Bookman pointed out in the dsub issue thread: because I am requesting by region, and not every zone in these regions has T4 GPU instances, is it possible that the jobs are being assigned to zones within these regions that cannot satisfy the requirements for the jobs (because they have no T4 GPUs)? I.e., when using GPUs, do we need to specify zones and not rely on regions?

2. For the job that didn't successfully start for an hour, it appears that (a) it tried to launch, then (b) tried again 20 minutes later, and then (c) tried a third time 35 minutes after the second launch attempt. That seems surprisingly infrequent! Or is it the case that there are additional retries within each of those 3 'start-time' events that just don't get logged in a way that I can see?

Thanks!

- James

Paul Grosu

Dec 3, 2020, 10:57:06 PM

to GCP Life Sciences Discuss

Hi James,

It looks like you're hitting a compute (VM) resource availability limit, based on the (resource type:compute) error. Have you tried reserving the resources? The link below describes how:

https://cloud.google.com/compute/docs/instances/reserving-zonal-resources

The catch is that you will need to pay for the resources once you create the reservation. Otherwise you have to compete for availability in the shared pool for the region and its zones, which most definitely has its own retry logic.

Hope it helps,

Paul

Tim Jennison

Dec 4, 2020, 8:48:43 AM

to jam...@broadinstitute.org, GCP Life Sciences Discuss

Hi James,

The errors you reported indicate that the zone had no available T4 GPUs. While the API does take into account your available quota and which zones contain GPUs, it is currently unable to determine when there are no GPUs available. If particular zones consistently have no T4 GPUs available, listing the other zones individually could avoid attempts to run there.

The default backoff time for the API when unable to get resources is 15 minutes. When dealing with quota, this gives time for some existing pipelines to hopefully finish before attempting to allocate again. I agree that in this particular case that's not the best behavior and we could do a better job.

Thanks

Tim

James Pirruccello

Dec 4, 2020, 8:52:26 AM

to Tim Jennison, GCP Life Sciences Discuss

Thank you, Tim. That is super helpful. I’ll start by specifying zones instead of regions.
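
[Editor's sketch, not part of the original thread] To illustrate the approach settled on above, here is a minimal Python sketch of narrowing a set of regions down to the individual zones that offer a given GPU type, then building the matching dsub arguments. The zone-to-accelerator table is made-up illustrative data, not real availability; an actual list could come from something like `gcloud compute accelerator-types list`. The helper names are hypothetical, and the flags `--zones`, `--accelerator-type`, and `--accelerator-count` should be checked against `dsub --help` for your version. Note that this only excludes zones that never offer the GPU type; as Tim explains, the API cannot tell when a zone that does offer T4s is temporarily out of them.

```python
# ILLUSTRATIVE DATA ONLY: not a statement of real GPU availability per zone.
EXAMPLE_ZONE_ACCELERATORS = {
    "us-central1-a": {"nvidia-tesla-t4", "nvidia-tesla-v100"},
    "us-central1-c": {"nvidia-tesla-p100"},
    "us-east1-b": {"nvidia-tesla-p100"},
    "us-east1-c": {"nvidia-tesla-t4"},
    "us-west1-b": {"nvidia-tesla-t4"},
    "europe-west4-a": {"nvidia-tesla-t4"},
}


def zones_with_accelerator(zone_accelerators, regions, accelerator):
    """Return the sorted zones inside `regions` that offer `accelerator`."""
    return sorted(
        zone
        for zone, accelerators in zone_accelerators.items()
        # A zone name is its region plus a one-letter suffix, e.g. us-east1-c.
        if zone.rsplit("-", 1)[0] in regions and accelerator in accelerators
    )


def dsub_zone_args(zones, accelerator, count=1):
    """Build dsub arguments that pin a job to explicit zones (hypothetical helper)."""
    if not zones:
        raise ValueError("no zone offers the requested accelerator")
    return [
        "--zones", *zones,
        "--accelerator-type", accelerator,
        "--accelerator-count", str(count),
    ]


if __name__ == "__main__":
    regions = {"us-central1", "us-east1", "us-west1"}
    zones = zones_with_accelerator(
        EXAMPLE_ZONE_ACCELERATORS, regions, "nvidia-tesla-t4"
    )
    print(" ".join(dsub_zone_args(zones, "nvidia-tesla-t4")))
```

The resulting arguments would replace the `--regions` list in the dsub invocation; zones that turn out to be consistently out of T4s can then be dropped from the table by hand.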