2nd Generation Failover documentation?

Brian Wawok

Dec 23, 2015, 12:51:36 PM

to Google Cloud SQL discuss

Hi!

From reading the documentation and poking around, I am not entirely sure how Cloud SQL failover is set up. Can anyone point me to more complete documentation, or answer a few questions for me?

1) In 1st gen, when there was a failure a new instance was automagically spun up and activated. In 2nd gen, is this no longer the case? Do I NEED to create a Failover Replica for the same behavior?

2) In the event of a failover, will my failover replica automatically become primary? Or do I need to trigger a failover by hand when something is down?

3) How long of a bad event is there before the automatic failover process is started?

4) My failover replica has an IP.. do I need to change my clients to use this IP, or will the old primary IP now start pointing at the failover replica? i.e. is this really a floating IP that gets moved?

5) Can I use my failover replica as a read slave, or must it just sit idle until an event?

6) What happens to the old primary in a failover after it comes back. Does it become a failover replica for the new primary, or do I need to do something by hand?

7) How do I reset my original primary to be the real master after a failover event is complete?

Thanks!

Brian

grached

Dec 24, 2015, 11:02:58 AM

to Google Cloud SQL discuss

Hello Brian,

All Cloud SQL data is replicated in multiple zones. In the unlikely event of a zone outage, instances fail over to another available zone automatically. Failover is designed to be transparent to your applications, so that after failover an instance has the same instance name, IP address, and firewall rules. During the failover there will typically be a few seconds of downtime as the instance starts up in a new zone. However, in some cases the InnoDB crash-recovery process may take longer, delaying the time before the instance is up.

In a failover event, existing connections to instances are broken. You can test how your application responds to a failover by restarting your instance. For recommendations on managing connections that can help in failover events, see the FAQ entry "How should I manage connections?"
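
Since existing connections are broken during a failover, the client has to reconnect. A minimal sketch of reconnecting with exponential backoff follows; `connect_with_retry`, its parameters, and the `connect` callable are hypothetical names for illustration, not part of any Cloud SQL or MySQL client library.

```python
import time


def connect_with_retry(connect, attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call connect() until it succeeds, backing off exponentially.

    `connect` is any zero-argument callable that returns a connection or
    raises ConnectionError (an assumption; real drivers raise their own
    exception types, which you would catch here instead).
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return connect()
        except ConnectionError as exc:
            last_error = exc
            # Wait 0.5s, 1s, 2s, ... but do not sleep after the final try.
            if attempt < attempts - 1:
                sleep(base_delay * (2 ** attempt))
    raise last_error
```

Passing `sleep` as a parameter keeps the helper easy to exercise without real delays, for example when simulating a failover by restarting the instance.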

You can configure:

Cloud SQL instances that replicate from a Cloud SQL master instance.
Cloud SQL instances that replicate from an external master instance.
External MySQL instances that replicate from a Cloud SQL master instance.

Note that external read replica instances must:

Be able to connect to the Cloud SQL master instance with the MySQL wire protocol.
Support row-based replication.
Be the same (or a later) version as the Cloud SQL instance being replicated.

For more information about read replicas, including use cases for each type, see Configuring Replication with Google Cloud SQL.

You can use the Google Cloud Platform Console to see all of your Cloud SQL instances, and whether an instance is a master or read replica instance. You can use the Cloud SDK to check whether an instance is a master or read replica. For more information, see Checking replication status.

I hope this helps.

Sincerely,

George

Brian Wawok

Dec 24, 2015, 11:05:37 AM

to Google Cloud SQL discuss

George:

So if gen2 has automatic failover baked in, what is the purpose of a failover replica? Does it make failover faster, or what does it buy me?

Thanks,

Brian

George

Dec 24, 2015, 5:03:02 PM

to Google Cloud SQL discuss

Hello Brian,

In order to deploy fault-tolerant applications that have high availability, Google recommends deploying applications across multiple zones in a region. This helps protect against unexpected failures of components, up to and including a single zone. You can configure a Cloud SQL Second Generation instance to be highly available by configuring replication to a failover replica instance in a different zone than the master instance.

You can leverage the high-bandwidth, low-latency network connections between zones in the same region to set up a failover replica instance in a different zone than the master instance. In the event of failure of the master instance's zone, Google Cloud SQL automatically switches over to the failover replica.

I hope this helps.

Sincerely,

George

Brian Wawok

Dec 24, 2015, 5:11:17 PM

to google-cloud...@googlegroups.com

I thought you just said Cloud SQL gen 2 will automatically fail over to another zone without a failover instance? So I ask again: what do I gain by paying for a failover instance? A 30-second vs. a 20-second failover? Or what?

--
You received this message because you are subscribed to a topic in the Google Groups "Google Cloud SQL discuss" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/google-cloud-sql-discuss/WwfY_CwFbVU/unsubscribe.
To unsubscribe from this group and all its topics, send an email to google-cloud-sql-d...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/google-cloud-sql-discuss/df2058c4-ff71-4173-b1e7-7d16d2d97605%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Jay Zhu

Dec 28, 2015, 2:30:14 AM

to Google Cloud SQL discuss

Hi Brian,

This is Jay from the Cloud SQL team. Thanks a lot for these great questions. Let me try to clarify below. Please let me know if you have any further questions.

The failover feature is designed to provide higher than zonal availability for Cloud SQL 2nd Gen instances. Without a failover replica instance, a Cloud SQL 2nd Gen instance will be out of service in the unlikely event of a zone outage. A failover replica instance is required in order to be able to fail over to a different zone. When the zone failure is detected, the master instance will be recreated in the zone where the failover replica resides, with data from the failover replica, and the failover replica will be 'pushed out' to another healthy zone. (The actual implementation under the hood is more complicated, but this is what the external user observes.) In this way, the metadata of the master instance stays unchanged before and after the failover, and no action should be needed on the application side for events like a zone failure.
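
The externally visible behavior described above can be pictured with a toy model. This is only a sketch of what a user observes, not the real implementation; the `Instance` class, the `fail_over` function, and the idea that the replica also keeps its name and IP are assumptions made for illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Instance:
    name: str
    ip: str
    zone: str


def fail_over(primary, replica, healthy_zones):
    """Return (primary, replica) as seen from outside after the primary's zone fails."""
    # The replica needs a healthy zone that neither instance occupied.
    spare = [z for z in healthy_zones if z not in (primary.zone, replica.zone)]
    if not spare:
        raise ValueError("no healthy zone left for the failover replica")
    # The primary keeps its name and IP; only its zone changes.
    new_primary = replace(primary, zone=replica.zone)
    # Assumption: the replica is modeled as keeping its metadata as well.
    new_replica = replace(replica, zone=spare[0])
    return new_primary, new_replica
```

With a primary in zone A and a replica in zone B, the model returns a primary with the same name and IP now in zone B, and a replica in some third healthy zone.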

You can try out the failover behavior by calling the API directly (API document: https://cloud.google.com/sql/docs/admin-api/v1beta4/instances/failover) to trigger a manual failover.
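
As a rough sketch, a manual failover request against that API might be assembled as below. The base URL, the path layout, and the `failoverContext` / `settingsVersion` body fields are my reading of the v1beta4 Admin API and should be checked against the linked document; `build_failover_request` and the project and instance names are hypothetical. The sketch only builds the request and does not send it or handle authentication.

```python
import json

# Assumed base URL for the Cloud SQL Admin API v1beta4.
API_ROOT = "https://www.googleapis.com/sql/v1beta4"


def build_failover_request(project, instance, settings_version):
    """Return (method, url, body) for a manual failover call.

    `settings_version` is assumed to be the instance's current settings
    version, as reported when describing the instance.
    """
    url = f"{API_ROOT}/projects/{project}/instances/{instance}/failover"
    body = {
        "failoverContext": {
            "kind": "sql#failoverContext",
            "settingsVersion": settings_version,
        }
    }
    return "POST", url, json.dumps(body)
```

The returned tuple can be handed to whatever authenticated HTTP client you already use.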

In regard to your original questions:

1) In 1st gen, when there was a failure a new instance was automagically spun up and activated. In 2nd gen, is this no longer the case? Do I NEED to create a Failover Replica for the same behavior?

The failover is designed to provide higher than zonal availability for Cloud SQL 2nd Gen instances. I don't think there is similar behavior implemented in 1st Gen instances.

2) In the event of a failover, will my failover replica automatically become primary? Or do I need to trigger a failover by hand when something is down?

In the event of a failover, what you'll observe is that your primary database instance will be moved to a healthy zone (the zone where the failover replica resides), and the failover replica will be moved to another healthy zone. There is no change required at all in how your application connects to the database, assuming that your application handles database reconnection well.

Currently we trigger failover automatically when there is a zone-level failure. You can also call the failover API directly to try out failover behavior on a specific instance (https://cloud.google.com/sql/docs/admin-api/v1beta4/instances/failover).

3) How long of a bad event is there before the automatic failover process is started?

As I explained in the previous question, currently we only trigger automatic failover in case of a zone-level failure. The failover is triggered as soon as the zone failure is detected.

4) My failover replica has an IP.. do I need to change my clients to use this IP, or will the old primary IP now start pointing at the failover replica? i.e. is this really a floating IP that gets moved?

No. There should be zero change required in your clients. After the failover, your client still connects to the old primary IP, which now points to the primary instance that has been moved to a healthy zone.

5) Can I use my failover replica as a read slave, or must it just sit idle until an event?

Yes. A failover replica is perfectly capable of serving as a read replica.
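
If the failover replica is used for reads, the application needs some way to choose between the two addresses. A minimal sketch, assuming a naive statement check and a caller-supplied health flag; `pick_endpoint` and its parameters are hypothetical, and the fallback to the primary is one possible design choice rather than documented Cloud SQL behavior.

```python
def pick_endpoint(sql, primary_ip, replica_ip, replica_up=True):
    """Send reads to the replica when it is reachable, everything else to the primary."""
    # Naive classification: only statements starting with SELECT or SHOW count as reads.
    is_read = sql.lstrip().lower().startswith(("select", "show"))
    if is_read and replica_up:
        return replica_ip
    # Writes, and reads while the replica is unavailable, go to the primary.
    return primary_ip
```

Falling back to the primary keeps reads working if the replica is briefly unreachable, at the cost of extra load on the primary.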

6) What happens to the old primary in a failover after it comes back. Does it become a failover replica for the new primary, or do I need to do something by hand?

The primary stays primary before and after the failover process. It is just moved to a healthy zone. Therefore there is no such thing as "old primary comes back", as it is always there, and nothing needs to be done by hand.

7) How do I reset my original primary to be the real master after a failover event is complete?

Same as question 6.

Regards,

Jay

Brian Wawok

Dec 28, 2015, 8:17:06 AM

to Google Cloud SQL discuss

This is great, thanks for the replies.

So it seems like letting me pick a failover replica in the same zone is pointless (and maybe should be removed in the GUI)? Otherwise this all makes sense and just needs some documentation love :)

Brian

Brian Wawok

Dec 28, 2015, 8:58:53 AM

to Google Cloud SQL discuss

Never mind, I see the same zone is grayed out.

Kurt Josep

Apr 7, 2016, 9:36:38 AM

to Google Cloud SQL discuss

Great answers - that cleared up *almost* all of the confusion that I had. I still have a few questions about Cloud SQL 2nd Gen and how to make my application fully highly available in the face of maintenance by either Google or myself.

1) During the maintenance windows, obviously that's when the daily backups are performed. Can individual instances also go down for maintenance during this window, without a zone-wide failure? Wouldn't that result in an outage for any of my apps trying to connect to that MySQL instance's IP, since it isn't a zone-wide failure and so wouldn't activate my fail-over replica? What is a rough estimate of the frequency and length of such downtimes? (I know gen 2 isn't covered by an SLA, but understanding the intended behavior and the intended number of maintenance-related downtimes would really help. This is a huge unknown for me, switching from managing my own servers in my own datacenter, where I largely had control over maintenance outages, so it is a nagging worry point for me.)

2) Is there a way to change the size of my instance's machine type without my application seeing a 3-10 minute outage? Or is that best handled by creating a read replica of the desired machine size and then migrating over to that read replica after promoting it to an independent master? Obviously 10 minutes isn't the end of the world for something that shouldn't happen frequently, but it's always nice to keep any hiccups under 60 seconds.

3) Was going to ask a follow-up question about the fail-over, but it seems like the fail-over replica is essentially a way for you guys to have a completely up-to-the-second complete backup in another zone so that in the case of a zone failure, you have something local to copy from to get the now failed instance up and running again in the new zone with the same IP etc - correct? If we were using the fail-over-replica as a read-only db to optimize performance of the master instance (writes only), in the event of a failure to the master's zone the master gets moved to the fail-over's zone and the fail-over gets moved to a separate healthy zone - does the moving of the fail-over to a new healthy zone result in downtime for the fail-over replica (will my application lose the instance it's connecting to for reads while the instance is moved between zones)?

This really is a cool product suite... great job - keep it up!

Thanks,

-Kurt

Raul Peixoto

Feb 16, 2017, 9:07:26 AM

to Google Cloud SQL discuss

I have the same exact questions:

paynen

Feb 16, 2017, 7:49:30 PM

to Google Cloud SQL discuss

Hey Raul,

If you have the same questions as OP, but this thread didn't help answer them, perhaps you can tell us what remains to be explained or solved. We'll be happy to help.

Cheers,

Nick
Cloud Platform Community Support

Ray Walker

Aug 21, 2017, 9:16:31 AM

to Google Cloud SQL discuss

Heya Nick,

I may have misinterpreted, but I believe Raul had the same questions as Kurt Josep at https://groups.google.com/d/msg/google-cloud-sql-discuss/WwfY_CwFbVU/JbZoV2rWBAAJ

They are interesting questions and it would be very informative if you had answers to those questions also.