I have GPU-based jobs that I am launching with dsub, and I allow them to run in any of three regions: us-central1, us-east1, or us-west1. (So far, I am not specifying zones, just regions.)
One particular job has a few thousand tasks. I'm testing the job by launching two of the tasks to make sure they are well behaved. Both questions are prompted by watching these two tasks.
The first task completed in about 30 minutes. The second task was blocked by resource limits twice and finally launched on a third attempt.
Via dstat -f, I see only the initial start attempt and two retry attempts (note: the output is formatted and easier to read at the dsub issue link):
I'm trying to understand the resource limit, but it's not clear to me which quotas we might be hitting, since none of the relevant quotas are remotely close to full (quota screenshot at the dsub issue).
So, my questions are: