You overbook zones??


Ray Foss

Apr 18, 2017, 10:48:45 AM
to gce-discussion
Today I've stumbled upon the unpleasant surprise that I cannot start a VM because the zone it's in is overbooked.

"Starting VM instance Error: The zone does not have enough resources available to fulfill the request"

I guess I have no choice but to migrate it, which raises the question: how do I know if a zone is over capacity so that I can avoid this issue in the future? I'm thinking of moving it to us-west1-b from us-east1-b.

Marilu (Cloud Platform Support)

Apr 18, 2017, 3:31:33 PM
to gce-dis...@googlegroups.com
Hello Ray,

We are continuously expanding capacity, so you can try again later; however, the immediate solution is to create your instance in a different zone.

I'd recommend reviewing this link on distributing your instances across zones to increase availability. You can also review this other link, where you will find some best practices for designing robust systems on GCE.

I hope this helps,

Marilu

Paul Nash

Apr 18, 2017, 4:28:26 PM
to Marilu Rojas, gce-discussion
Hi Ray,

Thanks for your question. Here's some more information to extend Marilu's great response and links.

First, "overbook" is a fairly emotional term at this moment, at least in other industries, but not an accurate picture of how cloud resources work. Specifically, we always have the resources that customers have paid for (vs selling an airline seat multiple times and betting that some people won't show up).

In general, it's important to recognize that the default model of buying cloud resources is on-demand and pay-as-you-go. That could be rephrased as "first come, first served." Our goal is to make sure that there are available resources in all locations almost all the time (where "almost" is a high probability, but never 1.0). Of course, we could make sure that every single customer can get anything they could want, whenever they need it. But that would be massively inefficient, and it is the reason that traditional IT has slow provisioning rates and constrained flexibility. At the end of the day, the flexibility you get means that we (or any other cloud) can't know exactly how much capacity will be needed at any moment, so we can't plan perfectly for it.

So, quotas. Quotas, which may be what you're referring to as "oversold," are merely limits (which are flexible) to help ensure that all customers have a reasonable opportunity to get what they need (and that no single customer can show up by surprise and take everything). They are not promises; they are a system to bring some level of fairness to our customers.

The trade-off of cloud being on-demand is that it requires a little more work on the developer's part. If you need a higher-level guarantee of obtaining resources than what the cloud provides, then you should be using multiple zones, and possibly even multiple regions, so you can sample from different pools. Temporary unavailability of new resources is similar in nature to existing resources going down, and the cloud approach is the same: spread your usage across other locations. It is for these reasons that most major cloud providers, including us, require that you be deployed in a multi-zone, highly available configuration in order to qualify for SLA guarantees of uptime and obtainability.
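
For concreteness, here is a minimal sketch (not an official recipe) of that "sample from different pools" idea using the Python API client: try a preferred zone first, and fall back to alternates when the create operation reports a resource shortage. The project ID, zone list, machine type, and image below are placeholder assumptions.

import time
from googleapiclient import discovery

compute = discovery.build('compute', 'v1')
PROJECT = 'my-project'                                  # placeholder
ZONES = ['us-east1-b', 'us-east1-c', 'us-west1-b']      # preferred zone first

def wait_for(zone, op_name):
    # Poll a zone operation until it finishes and return the final result.
    while True:
        op = compute.zoneOperations().get(
            project=PROJECT, zone=zone, operation=op_name).execute()
        if op.get('status') == 'DONE':
            return op
        time.sleep(2)

def create_with_fallback(name, machine_type='n1-standard-1'):
    for zone in ZONES:
        body = {
            'name': name,
            # The machineType URL must reference the same zone as the request.
            'machineType': 'zones/%s/machineTypes/%s' % (zone, machine_type),
            'disks': [{'boot': True, 'autoDelete': True, 'initializeParams': {
                'sourceImage': 'projects/debian-cloud/global/images/family/debian-9'}}],
            'networkInterfaces': [{'network': 'global/networks/default'}],
        }
        op = compute.instances().insert(project=PROJECT, zone=zone, body=body).execute()
        op = wait_for(zone, op['name'])
        errors = op.get('error', {}).get('errors', [])
        if not errors:
            return zone                                  # success in this zone
        if any(e.get('code') == 'ZONE_RESOURCE_POOL_EXHAUSTED' for e in errors):
            continue                                     # zone is short on capacity; try the next
        raise RuntimeError(errors)                       # some other failure
    raise RuntimeError('all candidate zones were short on resources')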

So, how can you avoid this? For one thing, we often work directly with customers who have a large enough need to do specific forecasting, so we know that their demand is coming. If your usage is large enough, it would be good to contact our sales team to discuss your likely needs. Another way this is avoided in some clouds is by offering "reserved" product models, but those reservations or guarantees typically come with very restrictive caveats on where and when you can get the capacity. We do not currently offer a *guaranteed* reserved capacity product.

Your question about whether we can help you "steer" is a great one. We don't currently offer something like this, for a variety of reasons, but we are considering it in the future.

Finally, please note: getting a "zone resources exhausted" message is specific to that request and that moment in time. It is possible that the size or type of resource you requested is temporarily unavailable, but that doesn't mean that all resources are exhausted for all requesters. As a simple example, it might be that all GPUs are in use while other resources are available. That doesn't necessarily help you if you can't change your request, but it is worth noting that this message does not mean the zone is "down" or "full" for every request.

I hope this helps address some of your questions. Yes, unfortunately it is possible to not be able to get what you need at any given moment. We work hard to make this very rare, but it does happen.


Ray Foss

Apr 18, 2017, 6:17:31 PM
to gce-discussion, mar...@google.com
Paul,

Thank you for your thorough response. The instance I was starting was an extremely typical N1 instance, and it took over 3 hours before the start-instance button worked. This is an internal dev system that does not need high availability.

We could move the hard drive to another zone by doing the usual rigmarole, which can take 15 minutes and requires a not-free intermediate snapshot... but we have no idea which zone will be less busy. We don't work at night, so it makes no sense to keep it on 24/7.
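
(For reference, the rigmarole is roughly: snapshot the disk, create a disk from that snapshot in the target zone, and recreate the VM there. A rough sketch with the Python API client, using placeholder names and skipping operation polling:)

from googleapiclient import discovery

compute = discovery.build('compute', 'v1')
PROJECT, SRC_ZONE, DST_ZONE = 'my-project', 'us-east1-b', 'us-west1-b'  # placeholders

# 1. Snapshot the source disk (this is the not-free intermediate snapshot).
compute.disks().createSnapshot(
    project=PROJECT, zone=SRC_ZONE, disk='dev-disk',
    body={'name': 'dev-disk-move'}).execute()

# 2. Once the snapshot is READY, create a disk from it in the destination zone.
compute.disks().insert(
    project=PROJECT, zone=DST_ZONE,
    body={'name': 'dev-disk', 'sourceSnapshot': 'global/snapshots/dev-disk-move'}).execute()

# 3. Create a VM in the destination zone that boots from the new disk.
compute.instances().insert(
    project=PROJECT, zone=DST_ZONE,
    body={'name': 'dev-vm',
          'machineType': 'zones/%s/machineTypes/n1-standard-1' % DST_ZONE,
          'disks': [{'boot': True, 'source': 'zones/%s/disks/dev-disk' % DST_ZONE}],
          'networkInterfaces': [{'network': 'global/networks/default'}]}).execute()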

This is an area for improvement.



Paul Nash

Apr 19, 2017, 6:16:21 AM
to Ray Foss, gce-discussion, Marilu Rojas
Ray,

You're quite welcome, and thanks for the honest and frank feedback. We take it very seriously and are indeed working on improvements here.

On a tangential note, for intermittently needed dev systems, you might consider using Preemptible VMs, which would be significantly more affordable (even if you left them running) and also have a different availability profile, as long as you can tolerate an occasional shutdown.
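
A minimal sketch of what that looks like through the API, assuming placeholder names; the only real change from a regular instance is the scheduling block:

from googleapiclient import discovery

compute = discovery.build('compute', 'v1')

body = {
    'name': 'dev-box',                                   # placeholder
    'machineType': 'zones/us-east1-b/machineTypes/n1-standard-1',
    'scheduling': {
        'preemptible': True,           # billed at the preemptible rate
        'automaticRestart': False,     # preemptible instances cannot auto-restart
        'onHostMaintenance': 'TERMINATE',
    },
    'disks': [{'boot': True, 'autoDelete': True, 'initializeParams': {
        'sourceImage': 'projects/debian-cloud/global/images/family/debian-9'}}],
    'networkInterfaces': [{'network': 'global/networks/default'}],
}
compute.instances().insert(project='my-project', zone='us-east1-b', body=body).execute()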

A second thought: you mention copying the disk to another zone. If the VM in question doesn't have long-term state, perhaps you could make an Image out of it, which you could then use to start a new VM instance in whatever zone you want.
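
A rough sketch of that flow with the Python API client, assuming placeholder names (stop the VM, or be sure the disk is quiescent, before imaging it):

from googleapiclient import discovery

compute = discovery.build('compute', 'v1')
PROJECT = 'my-project'                                   # placeholder

# 1. Create an image from the existing boot disk. Images are global, so they
#    can be used from any zone or region.
compute.images().insert(
    project=PROJECT,
    body={'name': 'dev-image',
          'sourceDisk': 'zones/us-east1-b/disks/dev-disk'}).execute()

# 2. Start a fresh VM from that image in whichever zone has capacity right now.
target_zone = 'us-west1-b'
compute.instances().insert(
    project=PROJECT, zone=target_zone,
    body={'name': 'dev-vm',
          'machineType': 'zones/%s/machineTypes/n1-standard-1' % target_zone,
          'disks': [{'boot': True, 'autoDelete': True, 'initializeParams': {
              'sourceImage': 'projects/%s/global/images/dev-image' % PROJECT}}],
          'networkInterfaces': [{'network': 'global/networks/default'}]}).execute()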

I understand you'd rather that the VM just always start, but maybe these suggestions will be helpful too.


Hamish Ogilvy

Aug 29, 2017, 9:46:43 PM
to gce-discussion
Hi Paul,

This is really interesting, thanks for the detailed comments. I'm wondering what the best way to plan for this with Kubernetes is. This issue prevented us from creating a node pool we needed. A multi-AZ cluster duplicates all pools, which is not what we want. It would be great if pools could be scheduled across zones, but they currently can't, so it seems this would always leave a cluster in an unhealthy state for the AZ in question. The overbooking we saw lasted over a day, so it's a long time to plan for.

Thanks,
Hamish

Kevin Minehart

Aug 31, 2017, 6:21:32 PM
to gce-discussion
I'm having this issue now on us-central1-a.  Is it not possible to display on the status pages when these kinds of issues are happening, and when they are expected to be resolved?

Kamran (Google Cloud Support)

Sep 1, 2017, 12:02:26 AM
to gce-discussion

Hello Kevin,

We'll update this thread on the public issue tracker once the stock-out issue with us-central1-a is resolved. Please feel free to open your feature request on the public issue tracker, and we'll look into it further.

Thanks

Paul Nash

Sep 7, 2017, 4:51:51 AM
to Hamish Ogilvy, gce-discussion
Hi Hamish, I'm forwarding your question to one of my colleagues on the k8s team who may be able to answer it more readily than I can.

--

Paul R. Nash | Group Product Manager, Compute Engine | paul...@google.com | 206-876-1620

Preston Marshall

Sep 7, 2017, 3:30:11 PM
to gce-discussion
I am also having issues doing anything with Kubernetes. To make things even weirder, I can spin up preemptible VMs in the affected zone. This makes me think that there is something else going on. Why would I be unable to upgrade my cluster due to a zone being out of resources, when I can very clearly prove that the zone has resources to spin up instances, and preemptible ones at that? I can understand running out of resources and not being able to add more in a zone, but it's a very poor design to have users unable to make changes to their cluster because of provisioning issues. A common-sense solution would be to reserve some capacity for making changes, as mentioned previously in this thread.



Raj Rajen

Nov 11, 2018, 4:19:50 PM
to gce-discussion
Currently I am facing the same issue for n1-highcpu-8. I can create a VM with this machine type, but Kubernetes is not spinning up the resources in us-central or us-west:

Creation failed: The zone 'projects/myproject/zones/us-central1-b' does not have enough resources available to fulfill the request. Try a different zone, or try again later.

Regards,
Raj

Ramiro Berrelleza

Nov 11, 2018, 6:56:03 PM
to gce-discussion
I have been facing the same issue across 5 different regions, continuously, for the past 3 days. I haven't been able to successfully deploy a GKE cluster or a GKE node pool in days because of this. I've never experienced something like this with a cloud provider. Are all zones across all these regions out of VMs? Or is there a bug somewhere that's resulting in this situation?

Maxime Bélanger

Nov 13, 2018, 10:40:48 AM
to gce-discussion
Exact same issue with northamerica-northeast1 for k8s node pools. It is a bit annoying that when the node pool needs to grow, there is not enough space in the zone. I do have multi-zone node pools, but all 3 zones seem to be full, even for preemptible or small n1-standard-1 instances...

Vittorio Calcagno

Nov 13, 2018, 10:41:36 AM
to gce-discussion
I am having this issue as well for the past couple of days, and it is persisting as of now.

This is happening for zone us-central1-a, but it happens for many other zones as well, including other regions.



Howard Zeemer

Nov 14, 2018, 7:06:13 AM
to gce-discussion
Experienced the same issue for several days; the responses I've gotten have basically amounted to "just wait it out." My cluster is back up and running now and hopefully will stay that way. I did get told about a feature request to expose regional/zone resource availability (https://issuetracker.google.com/issues/72811715); please post to it so that we can push this up as a priority for the development teams.

Larbi (Google Cloud Support)

Nov 14, 2018, 12:39:14 PM
to gce-discussion
We apologize for the inconvenience. We are continuously expanding capacity, so you can try again later; however, the immediate solution is to create your instance in a different zone.

I'd recommend reviewing this link on distributing your instances across zones to increase availability. You can also review this other link, where you will find some best practices for designing robust systems on GCE.

As Paul said, getting a "zone resources exhausted" message is specific to that request and that moment in time. It is possible that the size or type of resource you requested is temporarily unavailable, so try another available zone (if a specific zone is not crucial) or try again at a different time.

I know this is frustrating. We work hard to make this very rare, but it does happen.

v

Jul 1, 2019, 7:24:54 PM
to gce-discussion
Hi Everyone,

I've been hitting this exact issue since last Friday, for days now, with no resolution. I run Singularity Hub, which brings up instances as container builders, and every time I get a message that resources are exhausted:

Exception: {'errors': [{'code': 'ZONE_RESOURCE_POOL_EXHAUSTED', 'message': "The zone 'projects/xxxxxxxxx/zones/us-west1-a' does not have enough resources available to fulfill the request. Try a different zone, or try again later."}]}


I'm not able to change the zone, us-west1-a, because it must be launched from the same zone as the instance.

<HttpError 400 when requesting https://www.googleapis.com/compute/v1/projects/xxxxxxxxx/zones/us-west1-a/instances?alt=json returned "Invalid value for field 'resource.machineType': 'zones/us-central1-a/ machineTypes/n1-standard-2'. Machine type specified 'n1-standard-2' is in a different scope than the instance.">

I'm aware of "best practices" to deploy across zones, but this is an academic application so I have to minimize costs to the best of my ability. I can reproduce the same error in the cloud or in the console, and I've tried upping the instance size to be larger (and same issue).

Browsing the internet, there seems to be nothing to do but wait. My users haven't been able to build containers since last week. Is this really the only option? There is no guarantee that this actually will resolve anytime soon, although it's the first time that I've seen this particular error in the lifecycle of the application.

Any help, tricks, etc., would be greatly appreciated.

Best,

Vanessa

Ahmad P - Cloud Platform Support

Jul 3, 2019, 11:58:51 PM
to gce-discussion

Stock-out issues are very rare but can unfortunately still happen. Our goal is to make sure that there are available resources in all zones. When a situation like this occurs or is about to occur, our Product team is notified immediately and the issue is investigated.


Furthermore, based on the circumstances, the best approach would be either to move the workload to a different zone [1] or to try again later.


[1] https://cloud.google.com/compute/docs/regions-zones/

Travis Lamming

Jul 13, 2019, 9:08:21 AM
to gce-discussion
We have run into this very rare issue two times in the past week, with very small VMs, in the middle of the night for that zone. Should it be up to your end users to have to move to other zones in this case? Couldn't Google mitigate by doing this for its customers? It's frustrating, especially for resources that were under a committed use agreement with Google.

Do the external IP addresses change for VMs moved to other zones? What if there are multiple VMs that need to talk to one another and are now in another geographic region and have to contend with latency?

Fady (Google Cloud Platform)

Jul 14, 2019, 12:32:32 AM
to gce-discussion

Thank you for the feedback. As mentioned, we try to anticipate such events, and Paul has elaborated on the subject in a previous post (the third in this thread). As such, I will humbly try to answer the rest of your questions. It would be nice to ask each of the questions as a separate topic (thread) so as to benefit community members with similar interests. I am starting with the last question:


What if there are multiple VMs that need to talk to one another and are now in another geographic region and have to contend with latency?


While VM instances in the same VPC can talk to each other internally (via internal IP addresses), they encounter latency between regions due to geographic distance. However, this may not be evident between zones in the same region. To explain: zones are considered to be in the same geographic location and "tend to have round-trip network latencies of under 1 ms on the 95th percentile" [1]. You may think of it as one data center (which may not literally be the case) divided into sections (zones) that are "designed to be independent from each other: power, cooling, networking, and control planes are isolated from other zones" [2]. The goal is to provide isolation in the event of a failure.


Do the external IP addresses change for VMs moved to other zones?


If you mean static IP addresses, they are considered regional resources [3]. As such, you can assign the same IP address to an instance that has migrated to a different zone within the same region. The same applies to static internal IP addresses (in case your instances are also communicating internally). However, ephemeral IP addresses may not persist, as explained in this document.
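
As a hedged sketch of that reassignment with the Python API client (placeholder names; if the instance already has an ephemeral external IP, delete that access config first):

from googleapiclient import discovery

compute = discovery.build('compute', 'v1')
PROJECT, REGION, NEW_ZONE = 'my-project', 'us-east4', 'us-east4-b'   # placeholders

# Look up the reserved static address; it is a regional resource, so it is
# valid for any zone within that region.
addr = compute.addresses().get(
    project=PROJECT, region=REGION, address='my-static-ip').execute()

# Attach it to the recreated instance as its external (one-to-one NAT) IP.
compute.instances().addAccessConfig(
    project=PROJECT, zone=NEW_ZONE, instance='my-vm', networkInterface='nic0',
    body={'type': 'ONE_TO_ONE_NAT', 'name': 'External NAT',
          'natIP': addr['address']}).execute()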


It's frustrating, especially for resources that were under a committed use agreement with Google.


In any case, if you are concerned about charges due to underutilized commitments, please contact billing to investigate the time periods (logs would be helpful).


Should it be up to your end users to have to move to other zones in this case? Couldn't Google mitigate by doing this for its customers?


While each customer has a unique use case, it is advisable to take precautionary measures by designing a resilient system across regions (in case of an unforeseen disaster). In such cases, moving instances out of an unavailable region or zone during the incident may not be possible. Here are best-practice documents that may help with designing robust systems [4] [5] [6] [7].


However, your suggestion seems like a great feature request for cases where a zone is (about to be) fully utilized. As such, I suggest creating a report on the issue tracker with a detailed explanation so that it can be forwarded to the Compute Engine product team.


I hope the above helps.


Travis Lamming

Jul 14, 2019, 1:59:58 PM
to gce-discussion
Thank you, Fady. 

The resources that we are requesting are not huge. In the case of us-east4-a, we have 4 servers with a couple of CPUs each; wouldn't moving each server to a different zone in the same region expose us to more chances of getting this error? Right now, we do not know the capacity of each zone, nor what the peak times are. We need all four servers up, so if we spread them across four zones, are we not just running the same risk, since all of them are required?

Vaqar

Jul 15, 2019, 6:52:12 PM
to gce-discussion
Hi,

In regard to your concern, it is recommended to spread workloads across different regions to avoid the scenario you mentioned. Knowing the capacity of a zone is not possible, but major outages are reported on status.cloud.google.com, as these incidents are rare. For further inquiries, it is best to reach out to the Sales team.

FYI, there are currently no issues in us-east4, so you should be able to deploy all your resources.