ZONE_RESOURCE_POOL_EXHAUSTED for GPU-based resources in all zones

Karel Goderis

Apr 22, 2021, 9:53:00 AM
to gce-discussion
Hi all,

Since yesterday I have been unable to provision instances with P100 and T4 GPUs in any of the available zones. They all return the above message. This is still ongoing today, so it does not seem to be the "transitional" problem the documentation describes. I cannot believe that GCP is suddenly that busy...

Is anyone experiencing the same?

Bruno (Google Cloud Support)

Apr 22, 2021, 10:35:52 AM
to gce-discussion
Hello,

With the increase in popularity of AI, it's possible that some GPU types are not available.
To guarantee availability and get a committed use discount at the same time, you can reserve GPUs in a specific zone.
For more details, please refer to this documentation [1].

[1] https://cloud.google.com/compute/docs/gpus#reserving_gpus_with_committed_use_discounts

Zekun Ni

May 7, 2021, 10:06:06 AM
to gce-discussion
I started experiencing the same issue two days ago. Virtually no instance can be started, no matter which zone and GPU type I select. However, when I switch to my corporate account, I can start an instance with as many as EIGHT GPUs without any headache. So I guess there is some kind of priority assignment in GCP's allocation backend, and some users and/or small GPU allocations are simply demoted. I would like to know if there is any way to avoid being demoted. I just want to find ONE GPU.

Md Sadik Masoud

May 10, 2021, 12:26:54 PM
to gce-discussion
Hi, 

I do not believe the issue is related to the type of account, corporate or otherwise, or to small users. This behavior is described clearly in this public document [1], along with possible resolutions. You got that error message simply because the Google Cloud resources you needed were not available when you tried to spin up the instances with GPUs. As mentioned earlier, you can consider reserving the resources [2] and taking advantage of committed use discounts. Hope it helps!

Zekun Ni

May 12, 2021, 10:35:19 AM
to gce-discussion
Hi Sadik,

Glad to see your reply, but I don't think it addresses my issue. From the documentation you provided, I would expect it to be easier to create an instance with only one GPU than an instance with 8 GPUs, but what I experienced is exactly the opposite. I have also tried other zones and regions, different GPU types, and different hours, just as the documentation suggests. None of them worked despite dozens of attempts. Resource reservation is not feasible in my case because I don't plan to use the resources for long enough, and when I successfully started an instance with 8 GPUs on my corporate account, I hadn't reserved any resources either.

On Monday, May 10, 2021 at 9:26:54 AM UTC-7, <sadik...@google.com> wrote:

Anthony Leo

May 12, 2021, 1:49:32 PM
to gce-discussion
As mentioned earlier in the thread, some users are seeing the following error message:
"ZONE_RESOURCE_POOL_EXHAUSTED"

This message concerns resource availability in GCP: resource errors occur when you request new resources in a zone that cannot accommodate the request because a Compute Engine resource, such as GPUs or CPUs, is currently unavailable.

Resource errors only apply to new resource requests in the zone and do not affect existing resources. Resource errors are not related to your Compute Engine quota and only apply to the resource you specified in your request at the time you sent the request, not to all resources in the zone.

When these errors occur, you can try one or more of the following methods to resolve them:
1. Try to create the resources in another zone in the region or in another region.
2. Because this situation is temporary and can change frequently based on fluctuating demand, try your request again later.
3. Try to change the shape of the VM you are requesting. It's easier to get smaller machine types than larger ones. A change to your request, such as reducing the number of GPUs or using a custom VM with less memory or vCPUs, might let your request proceed.
4. Use Compute Engine reservations to reserve resources within a zone to ensure that the resources you need are available when you need them.
5. If you are trying to create a preemptible instance, remember that preemptible VMs are spare capacity and so might not be obtainable at peak demand periods.
6. If you were unable to resolve the error using any of the preceding methods, contact support (see "Getting support").
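
Methods 1 to 3 above can be combined into a small retry loop. The following is a minimal Python sketch, not an official client. The names `provision_with_fallback`, `ZoneExhaustedError`, and the `create(zone, gpu_count)` callable are hypothetical and introduced only for illustration: `create` stands in for whatever API call or command you use to create the VM, and it is assumed to raise `ZoneExhaustedError` when the zone reports ZONE_RESOURCE_POOL_EXHAUSTED.

```python
import time
from typing import Callable, Iterable, Optional, Tuple


class ZoneExhaustedError(Exception):
    """Hypothetical error: the caller's create function raises this when
    the zone reports ZONE_RESOURCE_POOL_EXHAUSTED."""


def provision_with_fallback(
    create: Callable[[str, int], str],
    zones: Iterable[str],
    gpu_counts: Iterable[int],
    rounds: int = 3,
    base_delay_s: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[Tuple[str, int, str]]:
    """Try every (gpu_count, zone) pair, then wait and try again.

    gpu_counts is given in order of preference, e.g. [4, 2, 1], so the
    preferred shape is tried in all zones before a smaller one.
    Returns (zone, gpu_count, instance_name), or None if every round fails.
    """
    zones = list(zones)
    gpu_counts = list(gpu_counts)
    for attempt in range(rounds):
        for gpu_count in gpu_counts:   # method 3: fall back to a smaller shape
            for zone in zones:         # method 1: try another zone or region
                try:
                    name = create(zone, gpu_count)
                    return zone, gpu_count, name
                except ZoneExhaustedError:
                    continue           # stockout here, move on to the next zone
        if attempt < rounds - 1:
            # method 2: capacity fluctuates, so back off and retry later
            sleep(base_delay_s * 2 ** attempt)
    return None
```

The delay doubles after each failed round (60 s, 120 s, ... with the defaults), which is an arbitrary choice for this sketch. Any other error raised by `create`, such as a quota error, is deliberately not caught, since resource errors are unrelated to quota and retrying will not fix those.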

Kirill Katsnelson

May 13, 2021, 3:09:24 AM
to gce-discussion
Fellow GCEers, looks like the resource du jour ¹ is the famous inferencing price/perf champ, the T4. Our training nodes on preemptible P100/V100 churn, but even non-preemptible T4s play possum. (But of course we skipped purchasing reservations and commitments, which has been The Right Thing every time. In hindsight.) We're running in the major US regions (us-*1) with no backup setup across the pond, so I can't speak for other regions. So I'm just sitting here and waiting out the weather. Myself, I bumped my dev machine from T4 to V100, which is a good thing to do once in a while to run into data races you haven't caught. Every cloud has a silver lining.

This is a small price to pay for having this amazing infra, maintained by the crème de la crème of SREs, at your fingertips. They can do anything, except conjure T4 accelerators out of thin air². Maybe. I mean, last time I checked they couldn't, but that was a whole two weeks ago.

And all this comes during an unprecedented, catastrophic global pandemic that is disrupting the global hardware supply. Keep calm, thank these S.überR.E. as warmly as you can for keeping this unimaginably huge and complex rig running for ya, despite inevitable hiccups (did I mention reservations and commitments?), and, of course, don't forget to mark 21-05-25 in your calendars!

DON'T PANIC
 ____
¹ Or, rather, the complete semantic opposite of its non-literal meaning.
² Jeff Dean can, indeed. The rumor is that he has already sent for a supply of compressed air.

 -kkm