Transparent maintenance is not so transparent

Julius Žaromskis

Nov 3, 2016, 4:54:35 AM
to gce-discussion
I never had these issues in the past, but this is now the second time I've noticed it: after the maintenance event, the connection is lost for several minutes. I thought maintenance events were not supposed to be noticeable. Is this considered "normal"?

systemevent-1478116633503-54056d8e068c3-ec9a2e71-ea35e917

[7:57]  
GCE instance maintenance is starting in 60s
[7:58]  
GCE has finished instance maintenance
Application errors start coming in
The metric Pgpool queue time for GCE VM Instance ign-mdb02.europe-west1-c has not been seen for over  minutes
[8:04]
Pgpool queue time for GCE VM Instance ign-mdb02.europe-west1-c has started to come in again. https://app.google.stackdriver.com/account/login/ignitenet-

Looks like connections dropped during the outage. I can PM with more details, if you're interested in investigating this.

Julius Žaromskis

Nov 3, 2016, 11:42:45 AM
to gce-discussion
I'm digging through the logs:

[305517-2] 2016-11-02 19:58:35: pid 12173: DETAIL: gethostbyname for "ign-mdb01" failed with error: "Unknown host"
WARNING: failed to connect to PostgreSQL server, getaddrinfo() failed with error "Name or service not known"
Nov 2 19:58:35 ign-mdb02 pgpool: failover done. shutdown host ign-mdb01(5432)
2016-11-02 19:58:35: pid 25933: LOG: failover done. shutdown host ign-mdb01(5432)
19:58:38.306 compute.instances.migrateOnHostMaintenance {"event_timestamp_us":"1478116718306899","event_type":"GCE_OPERATION_DONE"
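
To line up the timestamps in the excerpt above, the `event_timestamp_us` value can be converted from microseconds since the Unix epoch to a UTC time and compared with the pgpool failure time. This is only a sketch: it assumes the pgpool log is written in UTC, which the thread does not state, and the helper name `from_epoch_us` is made up for illustration.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def from_epoch_us(value: str) -> datetime:
    """Convert a microseconds-since-epoch string to an aware UTC datetime.

    Integer arithmetic via timedelta avoids float rounding on 16-digit values.
    """
    return EPOCH + timedelta(microseconds=int(value))


# Value copied from the compute.instances.migrateOnHostMaintenance log line.
migration_done = from_epoch_us("1478116718306899")

# Time of the pgpool "Unknown host" error; assumed (not confirmed) to be UTC.
dns_failure = datetime(2016, 11, 2, 19, 58, 35, tzinfo=timezone.utc)

gap = migration_done - dns_failure
print(migration_done.isoformat())
print(gap.total_seconds())
```

Under that assumption, the name lookup failed about 3.3 seconds before the migration was reported as done, which puts it inside the maintenance window rather than after it.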

It looks like DNS resolution is affected during the instance maintenance. I have seen similar behavior with our redis instance. I'll rephrase my question then: why is DNS resolution sometimes affected during the maintenance? Is this a known issue?

Meanwhile, I will try putting my hosts into the hosts file.

George (Google Cloud Support)

Nov 3, 2016, 4:45:38 PM
to gce-discussion
Hello Julius,

As mentioned in this Help Center article, when there are system events that might cause your instance to be disrupted, Google Compute Engine automatically manages the scheduling decisions for your instances. For example, if your instance is terminated due to system or hardware failure, Compute Engine automatically restarts that instance. You can modify this automatic behavior by changing the availability policies for this instance. 

In addition, standard instances are set to live migrate by default: Google Compute Engine automatically migrates your instances away from an infrastructure maintenance event. During the migration your instance might experience a short period of decreased performance. That is what affected your instance in this event, and it explains the DNS resolution failures you are referring to.

To mitigate such failures in the future, I would recommend distributing your instances across regions and zones and implementing load balancing. You should also back up your data or replicate your persistent disk data to multiple locations.

I hope this helps.

Sincerely,
George

Julius Žaromskis

Nov 4, 2016, 3:43:37 AM
to gce-discussion
> and your instance might experience a short period of decreased performance

Thanks for the reply, but I don't think that's it. I do notice a microscopic hiccup (2ms or so) when the instance is migrated. What I am talking about is DNS failing for several seconds, so I think this has something to do with the metadata service. Anyhow, the quick workaround is not to use DNS for internal host resolution and instead to put all the hosts in /etc/hosts. That's what I'm testing right now.
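
For applications where editing /etc/hosts is not practical, a similar effect could be approximated in the client: retry the lookup a few times, then fall back to a static table. This is a different, application-level variant of the workaround above, not what was tested in this thread. The `resolve` helper, the `STATIC_HOSTS` table, and the IP addresses in it are all hypothetical placeholders.

```python
import socket
import time

# Hypothetical static table; these addresses are placeholders, not real values.
STATIC_HOSTS = {"ign-mdb01": "10.0.0.11", "ign-mdb02": "10.0.0.12"}


def resolve(host, retries=3, delay=0.5, static_hosts=STATIC_HOSTS,
            lookup=socket.gethostbyname):
    """Resolve host via DNS, retrying on failure, then fall back to a static map."""
    for attempt in range(retries):
        try:
            return lookup(host)
        except OSError:  # socket.gaierror and socket.herror are OSError subclasses
            if attempt < retries - 1:
                time.sleep(delay)  # short pause before the next attempt
    if host in static_hosts:
        return static_hosts[host]
    raise socket.gaierror(f"could not resolve {host!r}")
```

The `lookup` parameter exists only so the fallback path can be exercised without a real DNS outage. Note that a static table goes stale if an instance's internal IP changes, which is the same trade-off as /etc/hosts.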

Paul Nash

Nov 7, 2016, 4:00:36 AM
to Julius Žaromskis, gce-discussion
FYI for the mailing list: this issue is being investigated off-thread with our engineering team.

--
© 2016 Google Inc. 1600 Amphitheatre Parkway, Mountain View, CA 94043
 
Email preferences: You received this email because you signed up for the Google Compute Engine Discussion Google Group (gce-discussion@googlegroups.com) to participate in discussions with other members of the Google Compute Engine community and the Google Compute Engine Team.
---
You received this message because you are subscribed to the Google Groups "gce-discussion" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gce-discussion+unsubscribe@googlegroups.com.
To post to this group, send email to gce-discussion@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/gce-discussion/bf67a180-10f5-4ceb-8939-2e2811a056da%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.



--

Paul R. Nash | Product Manager, Compute Engine | paul...@google.com | 206-876-1620

Lucian Lazar

Jun 15, 2020, 12:52:34 PM
to gce-discussion
This still happens now. I can't even SSH into the machine via PuTTY or via the GCP console. It returns error 255 and I can't diagnose it.

anarayanaswamy

Jun 15, 2020, 5:51:17 PM
to gce-discussion
Error code 255 from gcloud compute ssh is a generic error that can have a number of causes. Follow the link [1] for more information and to troubleshoot the issue.
