TL;DR: Early this morning we had network issues that resulted in a loss of
database replication for about 4 hours. During this time half of DB writes
(password changes, new account signups) were delayed in their replication
to the other datacenter. As a result, users who signed up for accounts or
changed passwords at our PHX1 datacenter may not have been able to log in
at our SCL2 datacenter.
All times are in Pacific Daylight Time.
At 12:19 AM our global DNS load balancer automatically stopped sending
traffic to our SCL2 datacenter due to detected network issues at the
datacenter.
The investigation into the cause of these problems can be found here:
https://bugzilla.mozilla.org/show_bug.cgi?id=870686
The eventual workaround for the network issue was to sequester the entire
physical staging environment for Persona to protect the network from its
bad behavior. NetOps is going to dig into this further on Monday. Until
then the physical Persona staging network is unavailable.
Starting at 12:49 AM our production monitoring server in the SCL2
datacenter lost the ability to communicate with AWS, specifically the
bigtent system. Currently it is still unable to reach bigtent; however, our
production SCL2 Persona installation is able to. Because the monitoring
server lost its connection, I'm unsure whether SCL2 production Persona has
been able to reach bigtent the whole time since 12:49 AM this morning.
At 9:18 AM dynect re-enabled traffic at SCL2 and began flapping. Traffic
was enabled at SCL2 during these windows:
* 9:18 AM - 9:19 AM for 1 minute
* 9:35 AM - 9:36 AM for 1 minute
* 9:39 AM - 9:42 AM for 3 minutes
* 9:43 AM - 9:58 AM for 15 minutes
* 9:59 AM - 10:25 AM for 26 minutes
* 10:30 AM - 10:36 AM for 6 minutes
At 10:37 AM dynect re-enabled traffic to SCL2, and it stayed enabled until
I came online and manually stopped traffic at SCL2 at 1:34 PM. Traffic was
being sent to SCL2 for 2 hours 57 minutes.
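
As a cross-check on the durations above, here is a small Python sketch that
sums the traffic windows. The window times are copied from this report; the
names WINDOWS, parse and total_exposure are only illustrative and are not
part of any tooling we run.

```python
from datetime import datetime, timedelta

# Windows (PDT) during which dynect sent traffic to SCL2, copied from the
# report: six short flapping windows, then the long window that ended when
# traffic was stopped manually.
WINDOWS = [
    ("9:18 AM", "9:19 AM"),
    ("9:35 AM", "9:36 AM"),
    ("9:39 AM", "9:42 AM"),
    ("9:43 AM", "9:58 AM"),
    ("9:59 AM", "10:25 AM"),
    ("10:30 AM", "10:36 AM"),
    ("10:37 AM", "1:34 PM"),
]


def parse(clock):
    """Parse a 12-hour clock string; every window falls on the same day."""
    return datetime.strptime(clock, "%I:%M %p")


def total_exposure(windows):
    """Sum the length of each (start, end) window."""
    total = timedelta()
    for start, end in windows:
        total += parse(end) - parse(start)
    return total


total = total_exposure(WINDOWS)
hours, remainder = divmod(int(total.total_seconds()), 3600)
print(f"{hours} hours {remainder // 60} minutes")  # 3 hours 49 minutes
```

The six flapping windows add up to 52 minutes; together with the final
window of 2 hours 57 minutes that gives 3 hours 49 minutes.
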
At 1:57 PM I fixed the broken DB replication from db1 in SCL2 to the
master DB in PHX1. The fix was "SLAVE STOP; SLAVE START;". The symptom was
the error message "The slave I/O thread stops because a fatal error is
encountered when it try to get the value of SERVER_ID variable from
master". This means that while traffic was being sent to SCL2 during the
issue (in total 3 hours 49 minutes) data replication was not working
between PHX1 and SCL2. So if during that time someone changed their
password or created an account hitting the PHX1 datacenter, they would not
have been able to log in at the SCL2 datacenter (if DNS sent them there).
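
A broken I/O thread like this one shows up in the output of SHOW SLAVE
STATUS. Below is a minimal Python sketch of a check that could flag it,
assuming a monitor has already fetched one status row as a dict. The column
names (Slave_IO_Running, Slave_SQL_Running, Last_IO_Error,
Seconds_Behind_Master) are real MySQL columns; the function name, the lag
threshold and the sample row are assumptions for illustration, not values
captured from db1.

```python
def replication_problems(status, max_lag_seconds=60):
    """Return a list of problems found in one SHOW SLAVE STATUS row.

    `status` is a dict keyed by column name. The 60-second lag threshold
    is an arbitrary illustrative default.
    """
    problems = []
    if status.get("Slave_IO_Running") != "Yes":
        detail = status.get("Last_IO_Error") or "no error recorded"
        problems.append("I/O thread not running: " + detail)
    if status.get("Slave_SQL_Running") != "Yes":
        problems.append("SQL thread not running")
    lag = status.get("Seconds_Behind_Master")
    if lag is None:
        # MySQL reports NULL here when it cannot compute the lag.
        problems.append("replication lag unknown")
    elif lag > max_lag_seconds:
        problems.append("replica is %d seconds behind" % lag)
    return problems


# Illustrative row only: the error text is the one quoted in this report,
# the other values are assumed.
sample = {
    "Slave_IO_Running": "No",
    "Slave_SQL_Running": "Yes",
    "Last_IO_Error": (
        "The slave I/O thread stops because a fatal error is encountered "
        "when it try to get the value of SERVER_ID variable from master"
    ),
    "Seconds_Behind_Master": None,
}

for problem in replication_problems(sample):
    print(problem)
```

A healthy row (both threads "Yes" and a small lag) returns an empty list,
so a monitor could alert whenever the list is non-empty.
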
At 2:56 PM ckolos solved the monitoring problem (Firewall core translation
setting) and we're again able to monitor bigtent from SCL2.
At 3:37 PM I re-enabled traffic at SCL2, having fixed monitoring and
re-synced the databases.
Areas for improvement:
* When a datacenter-wide network issue is occurring, manually move traffic
away from the DC instead of relying on dynect to detect and keep the
datacenter down
* When a problem occurs that puts a monitor into alarm, make sure there
aren't other, unrelated monitors alerting that are being hidden by the
known monitor noise
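
One way to act on the second point is to separate alerts that the known
incident explains from those it does not. The Python sketch below is only an
illustration: the alert structure and the check names are hypothetical and
do not reflect our actual monitoring configuration.

```python
def unexplained_alerts(active_alerts, known_incident_checks):
    """Return the active alerts that the known incident does not explain."""
    return [
        alert for alert in active_alerts
        if alert["check"] not in known_incident_checks
    ]


# Hypothetical check names, for illustration only.
active = [
    {"check": "scl2-network", "state": "CRITICAL"},
    {"check": "bigtent-reachability", "state": "CRITICAL"},
    {"check": "db1-replication", "state": "CRITICAL"},
]
known = {"scl2-network", "bigtent-reachability"}

for alert in unexplained_alerts(active, known):
    print(alert["check"], alert["state"])  # db1-replication CRITICAL
```

Anything this returns deserves a separate look, even while the known
incident is still noisy.
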
Actions:
* I'm adding myself as primary in PagerDuty for escalations from Persona
monitors
-Gene