Network issue after today's Compute Engine outage

Moritz Honig

Aug 5, 2016, 1:34:02 PM
to gce-discussion
Hello,

Since today's Compute Engine outage (https://status.cloud.google.com/incident/compute/16015) we have been experiencing major issues with our GCE instances that appear to be related to network traffic being blocked somehow. The problem may be limited to VPN traffic.

For example, we have an instance acting as the slave server in a MySQL replication setup. It connects via VPN to a master server outside the Google Cloud. Since this morning's incident it no longer receives the relay log, while other slaves still read the log without problems. Restarting the instance did not help. We see the incoming connection on the master, but it fails to send the relay log data to the slave.
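
As a concrete check for anyone seeing the same symptom (standard MySQL tooling, nothing specific to our setup), the stall is visible on the slave with:

mysql -e "SHOW SLAVE STATUS\G"

Read_Master_Log_Pos stops advancing while the master's binary log keeps growing; note that Seconds_Behind_Master can misleadingly show 0 in this state, because the SQL thread has already executed everything the stalled I/O thread managed to receive.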

Research into this issue revealed users with similar problems whose network interfaces were misconfigured on the slave servers (specifically, the MTU was set incorrectly). This makes it even more likely that our problem is connected to today's GCE incident, since that was also network related.
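
As a side note, checking the current MTU on an instance is plain Linux tooling (nothing GCE-specific); on a stock instance it should report the GCE default of 1460:

/sbin/ifconfig eth0 | grep -i mtu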

We are experiencing similar issues with servers in other roles, which also seem related to VPN traffic not reaching those instances.

Any help would be highly appreciated.

Regards,
Moritz

Moritz Honig

Aug 5, 2016, 4:11:56 PM
to gce-discussion
Update: I found that, for example, any scp transfer to one of the GCE instances via our VPN tunnel stalls after exactly 2112 KB are transferred. Searching for this revealed further posts describing that changing the MTU fixed the problem (e.g. http://davidjb.com/blog/2014/03/scprsync-transfers-stall-at-exactly-2112-kb/). That did not help in our case, though, and I assume this is because IPsec affects the effective MTU (https://cloud.google.com/compute/docs/vpn/advanced#maximum_transfer_unit_mtu_considerations).
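
To make the IPsec influence concrete (a back-of-envelope estimate on my part; the exact numbers depend on the cipher suite and whether NAT traversal is in use): tunnel-mode ESP adds an outer IP header (20 bytes), an ESP header (8 bytes), an IV (16 bytes for AES-CBC), padding plus trailer (2-17 bytes), an integrity check value (12-16 bytes) and, with NAT-T, a further UDP header (8 bytes). That is roughly 66-85 bytes taken off the usable MTU inside the tunnel, which is why packets sized for the interface MTU no longer fit.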

Again, since this issue only started occurring today, after the GCE network problems, it must be related. Can somebody at Google please look into this!

Regards,
Moritz

Moritz Honig

Aug 6, 2016, 5:39:46 AM
to gce-discussion
I found a workaround: following the post at https://networkcanuck.com/2013/06/10/troubleshooting-mtu-size-over-ipsec-vpn/ I determined that an MTU of 1378 allows packets to be sent through the tunnel unfragmented. I have therefore now set the MTU on our GCE instances manually to that value:

/sbin/ifconfig eth0 mtu 1378   # runtime-only; the change does not survive a reboot
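
For reference, the probing approach from that post amounts to pinging a host on the far side of the tunnel with the don't-fragment bit set and lowering the payload until replies come back. The address below is a placeholder; the 1350-byte payload corresponds to an MTU of 1378 minus 28 bytes of IP and ICMP headers:

ping -M do -s 1350 -c 3 192.0.2.10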


Now everything is working fine again, but I assume this is an issue you guys at Google should look into!

Kamran (Google Cloud Support)

Aug 6, 2016, 4:32:10 PM
to gce-dis...@googlegroups.com
Hello Moritz,

Thank you for your messages. I understand that you are experiencing a data transfer issue to your GCE server via the VPN tunnel, which you were able to resolve by changing the MTU of the network interface. You mentioned that your master server is outside of the Google Cloud Platform and the affected slave server is inside GCP. Could you please let me know:

- Where are the other slave servers, the ones still working without problems, located? Are they also using the same Cloud VPN tunnel?
- Did you try restarting the VPN tunnel or the master server to see if that resolves the issue?

I'm glad that you've found a workaround for this issue. However, if you're still looking to troubleshoot it, please feel free to email me your project ID and packet captures (tcpdump or Wireshark) of the network activity from both ends around the time the VPN tunnel stalls, and I'll be glad to investigate this problem.
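
A minimal capture invocation would look something like the following; the interface name, peer address, and file name are placeholders to adapt to your environment:

tcpdump -i eth0 -w vpn-stall.pcap host 192.0.2.10

Running this on both ends while reproducing the scp stall should show where the large packets disappear.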

Sincerely,

Moritz Honig

Aug 9, 2016, 11:17:56 AM
to gce-discussion
Hi Kamran,

The issue has disappeared in the meantime. I have just reset the MTU on all instances back to the default value of 1460, and we no longer see the problem. I am absolutely sure that it was connected to the issue at your end mentioned before.
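
For completeness, the reset was simply the earlier command with the default value (assuming the same interface):

/sbin/ifconfig eth0 mtu 1460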

Regards,
Moritz

Nicholas (Google Cloud Support)

Aug 15, 2016, 2:18:19 PM
to gce-discussion
Thanks for confirming the current state of your project. Glad to hear that this is working for you again. If this issue returns, please file a new defect report on the Compute Engine public issue tracker, linking back to this thread for context. When creating the new issue, please be sure to include the information requested by Kamran here to accelerate troubleshooting.