Hello Ray,

We are continuously expanding capacity, so you can try again later; however, the immediate solution is to create your instance in a different zone. I recommend reviewing this link, which recommends distributing your instances across zones to increase availability. You can also review this other link, where you will find best practices for designing robust systems on GCE.
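A minimal sketch of the "try a different zone" advice, with a hypothetical `create_instance` callable standing in for whatever Compute Engine client you use (any real client surfaces a ZONE_RESOURCE_POOL_EXHAUSTED error in this situation; the wrapper and zone names here are illustrative):

```python
class ZoneExhausted(Exception):
    """Raised when a zone cannot fulfill the request right now."""

def create_with_fallback(create_instance, zones):
    """Try each zone in order; return the zone that succeeded."""
    last_error = None
    for zone in zones:
        try:
            create_instance(zone)
            return zone
        except ZoneExhausted as e:
            last_error = e  # this zone is out of capacity at the moment
    raise last_error  # every candidate zone failed

# Stub that pretends us-east1-b is currently exhausted:
def fake_create(zone):
    if zone == "us-east1-b":
        raise ZoneExhausted("ZONE_RESOURCE_POOL_EXHAUSTED")

print(create_with_fallback(fake_create, ["us-east1-b", "us-east1-c", "us-west1-b"]))
# -> us-east1-c
```

The key design point is that the fallback list is yours to choose, so you can order it by proximity or cost before falling back further away.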
On Tuesday, April 18, 2017 at 10:48:45 AM UTC-4, Ray Foss wrote:

Today I've stumbled upon the unpleasant surprise that I cannot start a VM because the zone it's in is overbooked.

"Starting VM instance Error: The zone does not have enough resources available to fulfill the request"

I guess I have no choice but to migrate it, which raises the question: how do I know if a zone is over capacity, so that I can avoid this issue in the future? I'm thinking of moving it to us-west1-b, from us-east1-b.
--
© 2017 Google Inc. 1600 Amphitheatre Parkway, Mountain View, CA 94043
Email preferences: You received this email because you signed up for the Google Compute Engine Discussion Google Group (gce-discussion@googlegroups.com) to participate in discussions with other members of the Google Compute Engine community and the Google Compute Engine Team.
---
You received this message because you are subscribed to the Google Groups "gce-discussion" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gce-discussion+unsubscribe@googlegroups.com.
To post to this group, send email to gce-discussion@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/gce-discussion/1125700d-3a5e-45a3-8d66-bf8ee43f67d8%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Hi Ray,

Thanks for your question. Here's some more information to extend Marilu's great response and links.

First, "overbooked" is a fairly emotional term at the moment, at least in other industries, and not an accurate picture of how cloud resources work. Specifically, we always have the resources that customers have paid for (versus selling an airline seat multiple times and betting that some people won't show up).

In general, it's important to recognize that the default model of buying cloud resources is on-demand and pay-as-you-go. That could be rephrased as "first come, first served." Our goal is to make sure that there are available resources in all locations almost all the time (where "almost" is a high probability, but never 1.0). Of course, we could make sure that every single customer can get anything they could want, whenever they need it, but that would be massively inefficient, and it is the reason that traditional IT has slow provisioning and constrained flexibility. At the end of the day, the flexibility you get means that we (or any other cloud) can't know exactly how much capacity will be needed at any moment, so we can't plan perfectly for it.

So, quotas. Quotas, which may be what you're referring to as "oversold," are merely limits (which are flexible) that help ensure all customers have a reasonable opportunity to get what they need (and that no single customer can show up by surprise and take everything). They are not promises; they are a system that brings some level of fairness to our customers.

The trade-off of cloud being on-demand is that it requires a little more work on the developer's part. If you need a higher-level guarantee of obtaining resources than what the cloud provides, then you should be using multiple zones, and possibly even regions, so you can sample from different pools.
Temporary unavailability of new resources is similar in nature to existing resources going down, and the cloud approach is the same: spread your usage across other locations. It is for these reasons that most major cloud providers, including us, require that you be deployed in a multi-zone, highly available configuration in order to qualify for SLA guarantees of uptime and obtainability.

So, how can you avoid this? For one thing, we often work directly with customers who have a large enough need to do specific forecasting, so we know their demand is coming. If your usage is large enough, it would be good to contact our sales team to discuss your likely needs. Another way this is avoided in some clouds is by offering "reserved" product models, but those reservations or guarantees typically come with very restrictive caveats on where and when you can get the capacity. We do not currently offer a *guaranteed* reserved-capacity product.

Your question about whether we can help you "steer" is a great one. We don't currently offer something like this, for a variety of reasons, but we are considering it for the future.

Finally, please note: getting a "zone resources exhausted" message is specific to that request and that moment in time. It is possible that the size or type of resource you requested is temporarily in use, but that doesn't mean all resources are exhausted for all requesters. As a simple example, it might be that all GPUs are in use while other resources are available. That doesn't necessarily help you if you can't change your request, but it is worth noting that this message does not mean the zone is "down" or "full" for every request.

I hope this helps address some of your questions. Yes, unfortunately, it is possible to not get what you need at any given moment. We work hard to make this very rare, but it does happen.
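The "sample from different pools" advice above can be sketched as a zone preference list that spans more than one region. The region and zone names below are illustrative, not a statement of which zones exist:

```python
# Illustrative map of regions to their zones; in practice you would
# build this from the zones your project actually uses.
PREFERRED = {
    "us-east1": ["us-east1-b", "us-east1-c", "us-east1-d"],
    "us-west1": ["us-west1-a", "us-west1-b"],
}

def candidate_zones(home_region, regions=PREFERRED):
    """List the home region's zones first, then zones from other regions."""
    zones = list(regions.get(home_region, []))
    for region, zs in regions.items():
        if region != home_region:
            zones.extend(zs)  # fall back to other regions last
    return zones

print(candidate_zones("us-east1"))
# -> ['us-east1-b', 'us-east1-c', 'us-east1-d', 'us-west1-a', 'us-west1-b']
```

Ordering zones this way keeps traffic close to home when capacity allows, while still leaving cross-region fallback available when it does not.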
On Apr 18, 2017 12:31 PM, "'Marilu (Cloud Platform Support)' via gce-discussion" <gce-dis...@googlegroups.com> wrote:
This is really interesting, thanks for the detailed comments. I'm wondering the best way to plan for this with Kubernetes. This issue prevented us from creating a node pool we needed. A multi-AZ cluster duplicates all pools, which is not what we want. It would be great if pools could be scheduled across zones, but they can't currently, so it seems this would always leave a cluster in an unhealthy state for the AZ in question. The capacity shortage we saw lasted over a day, which is a long time to plan for.
Thanks,
Hamish
Hi Hamish, I'm forwarding your question to one of my colleagues on the k8s team who may be able to answer it more readily than I can.
On Tue, Aug 29, 2017 at 6:46 PM, Hamish Ogilvy <hamish...@gmail.com> wrote:
Exception: {'errors': [{'code': 'ZONE_RESOURCE_POOL_EXHAUSTED', 'message': "The zone 'projects/xxxxxxxxx/zones/us-west1-a' does not have enough resources available to fulfill the request. Try a different zone, or try again later."}]}
<HttpError 400 when requesting https://www.googleapis.com/compute/v1/projects/xxxxxxxxx/zones/us-west1-a/instances?alt=json returned "Invalid value for field 'resource.machineType': 'zones/us-central1-a/machineTypes/n1-standard-2'. Machine type specified 'n1-standard-2' is in a different scope than the instance.">
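The second error above appears when the retried request body still references a zonal resource URL from the old zone (here, a us-central1-a machine type sent to us-west1-a). A minimal sketch of rewriting those URLs before retrying in a new zone follows; the field names match the public `instances.insert` body, but treat the helper as illustrative rather than exhaustive, since a real body may embed the zone in other fields too:

```python
import copy
import re

def retarget_zone(body, new_zone):
    """Return a copy of an instances.insert body retargeted to new_zone."""
    out = copy.deepcopy(body)

    def fix(url):
        # Replace the zone segment in a zonal resource URL.
        return re.sub(r"zones/[^/]+/", f"zones/{new_zone}/", url)

    if "machineType" in out:
        out["machineType"] = fix(out["machineType"])
    for disk in out.get("disks", []):
        params = disk.get("initializeParams", {})
        if "diskType" in params:
            params["diskType"] = fix(params["diskType"])
    return out

body = {"machineType": "zones/us-central1-a/machineTypes/n1-standard-2"}
print(retarget_zone(body, "us-west1-a")["machineType"])
# -> zones/us-west1-a/machineTypes/n1-standard-2
```

The deep copy keeps the original request intact, so the same template body can be retargeted to each candidate zone in turn.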
Stock-out issues are very rare, but unfortunately they can still happen. Our goal is to make sure that resources are available in all zones. When a situation like this occurs or is about to occur, our Product team is notified immediately and the issue is investigated.
Furthermore, depending on the circumstances, the best approach would be either to move the workload to a different zone [1] or to try again later.
Do the external IP addresses change for VMs moved to other zones? What if there are multiple VMs that need to talk to one another and are now in another geographic region, having to contend with latency?
Thank you for the feedback. As mentioned, we try to anticipate such events, and Paul has elaborated on the subject in a previous post (the third one). As such, I will humbly try to answer the rest of your questions. It would be nice to ask each question as a separate topic (thread) to benefit community members with similar interests. I am starting with the last question:
What if there are multiple VMs that need to talk to one another and are now in another geographic region, having to contend with latency?
While VM instances in the same VPC can talk to each other internally (via internal IP addresses), they encounter latency between regions due to geographical distance. However, this is usually not noticeable between zones in the same region. To explain: zones are considered to be in the same geographic location and "tend to have round-trip network latencies of under 1ms on the 95th percentile" [1]. You may think of a region as one data center (which may not literally be the case) divided into sections (zones) that are "designed to be independent from each other: power, cooling, networking, and control planes are isolated from other zones." [2] The goal is to provide isolation in the event of a failure.
Do the external IP addresses change for VMs moved to other zones?
If you mean static IP addresses, they are regional resources [3]. As such, you can assign the same IP address to an instance that migrated to a different zone within the same region. The same applies to static internal IP addresses (in case your instances also communicate internally). Ephemeral IP addresses, however, may not persist, as explained in this document.
It's frustrating, especially for resources that were under a committed use agreement with Google.
In any case, if you are concerned about charges due to underutilized commitments, please contact billing to investigate the time periods (logs would be helpful).
Should it be up to your end users to move to other zones in this case? Couldn't Google mitigate this by doing it for its customers?
While each customer has a unique use case, we advise taking precautionary measures by designing a resilient system across regions (in case of an unforeseen disaster). During such an incident, moving instances out of an unavailable region or zone may not be possible. Here are best-practice documents that may help with designing robust systems [4] [5] [6] [7].
However, your suggestion seems like a great feature request for cases where a zone is (about to be) exhausted. As such, I suggest creating a report on the issue tracker with a detailed explanation so it can be forwarded to the Compute Engine product team.
I hope the above helps.