post-mortem: 12/6/2012 outage on accountchooser.com

Eric Sachs

Dec 6, 2012, 1:29:53 PM
to oidf-account-chooser-list
The accountchooser.com service was unavailable for a period of time this morning, but is now coming back online.

The accountchooser.com domain name was transferred from inventures.com to symantec.com this morning, and the DNS was not configured properly.  The operations teams of Symantec, Inventures, & Google were able to quickly diagnose the problem and get it fixed.  Some DNS servers around the Internet may have cached the wrong values, but will update themselves.

There is already some work underway (see below) to define a public "playbook" for operational coordination between the different groups involved in maintaining the service.

--
Eric Sachs | Group Product Manager for Identity | esa...@google.com 




---------- Forwarded message ----------
From: Eric Sachs <esa...@google.com>
Date: Tue, Dec 4, 2012 at 1:14 PM
Subject: More formal "process" for accountchooser.com operations
To: Brian Berliner <Brian_B...@symantec.com>
Cc: Warren Quach <Warren...@symantec.com>, Naveen Agarwal <n...@google.com>, Don Thibeau <d...@oidf.org>


Brian, in parallel with our work on changing the accountchooser.com SSL cert and domain configuration, I figured we could also make progress on a related request by the workgroup to document the "process" that would be used to operate the site.  Below is an initial pass at such a process.

Process to be posted as a message to the AC WG.
A quick link on ac.openid.net will be added that points to that mailing list post
Future updates to the process will similarly be decided on the AC WG and the "quick link" will be updated to point to the latest process

PROCESS:
PROCESS TO CHANGE THE PROCESS
Any change to the process must be discussed and approved by the AC working group.
Except in the case of an urgent security threat, any proposed changes should be posted for review for 2 weeks before the WG makes a final decision to accept the new process.
The process (and any approved updates to it) will be posted as a message to the AC WG.
The main website of the Account Chooser working group will have a reasonably easy-to-find link to that mailing list post for viewers who want to find the current process.
The OIDF board has the ability to veto any suggested change.  However, it is the responsibility of the board to monitor the AC working group for any suggested process changes and to indicate their desire to be involved in the final decision.  The working group is NOT required to request specific approval from the OIDF board for changes.

VENDORS
There may be one or more vendors involved in managing the Account Chooser services for things such as domain mgmt, SSL cert, DNS, hosting, etc.
In general the vendors should be active OIDF sustaining corporate members, but that is not required.
If the vendor is a corporate member, then their default contact is their board representative
The additional vendor contacts are listed below, and the vendor agrees to keep that list reasonably up to date.  The contacts can include the email/phone#s of teams at those vendors.
It is understood that the primary vendor may use other vendors to provide the service

PROCESS FOR CHANGING VENDORS
The Account Chooser working group can change the vendors, or the responsibilities of the vendors, by following the "PROCESS TO CHANGE THE PROCESS" defined above.
The vendor should only change/transfer responsibilities AFTER (1) the updated approved process has been posted by the workgroup & (2) the contact at the vendor has been notified by a member of the workgroup or the OIDF board.
If the vendor has major security/operational concerns about the change/transfer, they should notify the OIDF board for verification of the change.  However, the vendor should make a good-faith effort to raise those concerns as part of the workgroup "process to change the process."

VENDOR COORDINATION
When multiple vendors are involved, they may need to coordinate interactions, including change requests for things such as DNS server names, IP addresses, etc.
The listed contacts at one vendor should only accept change requests that came from the listed contacts at the other vendor
If the vendor has major security/operational concerns about the change/transfer, they should notify the AC WG for verification of the change request.

CURRENT VENDORS/CONTACTS
Symantec
Layers of responsibility: domain reg, pointer to DNS servers, SSL cert provisioning
Paul
Brian
mailing list?

Google
Layers of responsibility: serving the domain across multiple-data-centers, DNS load balancing across the data-centers, pushing new code
Eric
Naveen
mailing list?
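
The VENDOR COORDINATION rule above (a listed contact only accepts change requests from the listed contacts at the other vendor) can be sketched in a few lines of Python. This is only an illustration, not part of the proposed process: `LISTED_CONTACTS` and `should_accept_change_request` are hypothetical names, and the contact entries are the placeholder first names from the draft list above.

```python
# Hypothetical contact registry, copied from the draft list above.
LISTED_CONTACTS = {
    "Symantec": {"Paul", "Brian"},
    "Google": {"Eric", "Naveen"},
}


def should_accept_change_request(receiving_vendor: str,
                                 requesting_vendor: str,
                                 requester: str) -> bool:
    """Return True only if the requester is a listed contact at the other vendor."""
    if receiving_vendor == requesting_vendor:
        # This sketch only models requests between two different vendors.
        return False
    # Unknown vendors have no listed contacts, so their requests are rejected.
    return requester in LISTED_CONTACTS.get(requesting_vendor, set())
```

Under these assumptions, a request to Google from "Brian" at Symantec would be accepted, while a request from a name that is not on Symantec's list would be referred back for verification.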






-- 
Eric Sachs | Group Product Manager for Identity | esa...@google.com 

John Bradley

Dec 6, 2012, 2:48:48 PM
to oidf-account...@googlegroups.com
Currently it looks like there is still a problem with the JS.  The static pages seem to load.

Naveen Agarwal

Dec 6, 2012, 3:46:16 PM
to oidf-account...@googlegroups.com

Are you trying both over https? What IPs are you resolving to?


John Bradley

Dec 6, 2012, 5:01:23 PM
to oidf-account...@googlegroups.com
I have one computer where it is not working that is resolving to 173.194.37.32, 2607:f8b0:4002:801::1002

I have another computer where it seems to be working that is resolving to 173.194.37.33

Looking at it, the .32 address is coming from the computer that is using the Google DNS resolver 8.8.8.8. However, if I query that resolver directly, it gives me:

; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 33236
;; flags: qr rd ra; QUERY: 1, ANSWER: 6, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:

;; ANSWER SECTION:
accountchooser.com. 242 IN A 74.125.139.102
accountchooser.com. 242 IN A 74.125.139.139
accountchooser.com. 242 IN A 74.125.139.113
accountchooser.com. 242 IN A 74.125.139.138
accountchooser.com. 242 IN A 74.125.139.101
accountchooser.com. 242 IN A 74.125.139.100

still pending.
;; Query time: 531 msec
;; SERVER: 8.8.8.8#53(8.8.8.8)
;; WHEN: Thu Dec  6 18:56:38 2012
;; MSG SIZE  rcvd: 132

Quite strange.  Did you change the CDN hosts as well?

I will keep looking at it to see what the DNS strangeness is.

John
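
To compare what a machine resolved against what the resolver returned, the ANSWER SECTION lines of the output above can be parsed mechanically. The following is a minimal Python sketch; `ARecord`, `parse_a_records`, and `unexpected_addresses` are hypothetical names introduced for illustration, and it assumes answer lines have the five-field form shown above (name, TTL, class, type, address).

```python
from typing import NamedTuple


class ARecord(NamedTuple):
    name: str
    ttl: int
    address: str


def parse_a_records(dig_output: str) -> list[ARecord]:
    """Extract A records from dig-style output, skipping comments and other lines."""
    records = []
    for line in dig_output.splitlines():
        fields = line.split()
        # Answer lines look like: <name> <ttl> IN A <address>
        if len(fields) == 5 and fields[2] == "IN" and fields[3] == "A":
            records.append(ARecord(fields[0], int(fields[1]), fields[4]))
    return records


def unexpected_addresses(observed: set[str], answered: set[str]) -> set[str]:
    """Addresses a machine resolved to that are absent from the resolver's answer."""
    return observed - answered
```

Applied to the output quoted above, this yields six A records in 74.125.139.x, none of which matches the 173.194.37.32 address seen on the machine where the site was not working.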

Leif Johansson

Dec 6, 2012, 5:24:02 PM
to oidf-account...@googlegroups.com
On 12/06/2012 11:01 PM, John Bradley wrote:
I have one computer where it is not working that is resolving to 173.194.37.32, 2607:f8b0:4002:801::1002

I have another computer where it seems to be working that is resolving to 173.194.37.33

Looking at it, the .32 address is coming from the computer that is using the Google DNS resolver 8.8.8.8. However, if I query that resolver directly, it gives me:

; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 33236
;; flags: qr rd ra; QUERY: 1, ANSWER: 6, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:

;; ANSWER SECTION:
accountchooser.com. 242 IN A 74.125.139.102
accountchooser.com. 242 IN A 74.125.139.139
accountchooser.com. 242 IN A 74.125.139.113
accountchooser.com. 242 IN A 74.125.139.138
accountchooser.com. 242 IN A 74.125.139.101
accountchooser.com. 242 IN A 74.125.139.100

still pending.
;; Query time: 531 msec
;; SERVER: 8.8.8.8#53(8.8.8.8)
;; WHEN: Thu Dec  6 18:56:38 2012
;; MSG SIZE  rcvd: 132

Quite strange.  Did you change the CDN hosts as well?

I will keep looking at it to see what the DNS strangeness is.
 
 
According to dnscheck.se a couple of the servers have inconsistent
serials.




Naveen Agarwal

Dec 6, 2012, 6:21:21 PM
to oidf-account...@googlegroups.com

There were no changes on CDN/serving side. Only DNS changes.



Leif Johansson

Dec 6, 2012, 6:27:12 PM
to oidf-account...@googlegroups.com
On 12/07/2012 12:21 AM, Naveen Agarwal wrote:

There were no changes on CDN/serving side. Only DNS changes.



Right. All dns check tools I've run indicate a problem with inconsistent
serials for a couple of the servers. That is probably something you want
to take a look at.

John Bradley

Dec 6, 2012, 6:29:10 PM
to oidf-account...@googlegroups.com
There seem to be multiple SOA records saying that ns1, ns2, ns3, and ns4.google.com are all the start of authority.  I don't know what Google is running DNS on, but that seems strange.  You would expect one host to be the master and the others the slaves.

However the A records for accountchooser.com look OK.

www.accountchooser.com has a CNAME pointing to www3.l.google.com and that seems OK.

Part of my problem seems to have been that the new JS wiped out my stored accounts so that it looked broken.

I need to test more.

Mengcheng Duan

Dec 6, 2012, 6:34:13 PM
to oidf-account...@googlegroups.com
>> Part of my problem seems to have been that the new JS wiped out my stored accounts so that it looked broken.
This is strange. The accountchooser JS should never do this.

- Mengcheng

John Bradley

Dec 6, 2012, 6:53:48 PM
to oidf-account...@googlegroups.com
I looked at that. The serials are the same; it is the host name in the SOA that is different each time the record is retrieved.  There is some load balancing strangeness happening.  The names shouldn't be different.
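
The distinction drawn here, between differing serials and a differing primary host name (MNAME) in the SOA, can be checked side by side. Below is a minimal Python sketch that assumes the SOA RDATA from each name server has already been collected as text in the standard seven-field presentation form (MNAME RNAME SERIAL REFRESH RETRY EXPIRE MINIMUM). `Soa`, `parse_soa_rdata`, and `compare_soa` are hypothetical names, and the sample values are made up, not the actual accountchooser.com records.

```python
from typing import NamedTuple


class Soa(NamedTuple):
    mname: str   # primary name server named in the SOA record
    serial: int  # zone serial number


def parse_soa_rdata(rdata: str) -> Soa:
    """Parse SOA RDATA: MNAME RNAME SERIAL REFRESH RETRY EXPIRE MINIMUM."""
    fields = rdata.split()
    if len(fields) != 7:
        raise ValueError(f"expected 7 SOA fields, got {len(fields)}")
    return Soa(mname=fields[0], serial=int(fields[2]))


def compare_soa(responses: dict[str, str]) -> dict[str, bool]:
    """Report whether serials and MNAMEs agree across all collected responses."""
    soas = [parse_soa_rdata(rdata) for rdata in responses.values()]
    return {
        "serials_consistent": len({s.serial for s in soas}) <= 1,
        "mnames_consistent": len({s.mname for s in soas}) <= 1,
    }


# Made-up example values, keyed by the server that was queried.
responses = {
    "ns1.example.net": "ns1.example.net. hostmaster.example.net. 2012120601 7200 1800 1209600 300",
    "ns2.example.net": "ns2.example.net. hostmaster.example.net. 2012120601 7200 1800 1209600 300",
}
print(compare_soa(responses))
```

In this made-up case the serials agree while the MNAME differs between responses, which is the pattern described in the message above.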

John Bradley

Dec 6, 2012, 6:55:11 PM
to oidf-account...@googlegroups.com
It seems to be OK now.  I may have accidentally wiped them out myself testing with multiple browsers.

Brian Berliner

Dec 6, 2012, 7:14:45 PM
to oidf-account...@googlegroups.com
I agree that it is odd that the Google Name Servers are returning inconsistent serial numbers -- but I don't know what that means.

Earlier today, the name servers for the accountchooser.com domain were incorrectly set, temporarily pointing at UltraDNS Name Servers, which could have caused problems with the delivery of ac.js over SSL. That configuration change has been corrected and the domain now points back to the Google Name Servers. In my experience, these things typically resolve themselves over the course of a day.

I am not seeing any problems with the delivery of ac.js or the resolution of the domain name from where I sit (SSL and non-SSL appear to be working fine). If anyone knows of anything specific that we need to do to kick the Google Name Servers, please let us know.

Thanks,

-Brian
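
The expectation that stale answers clear up on their own follows from TTL-based caching: a resolver may keep serving a cached record until its TTL runs out. Here is a minimal Python sketch of that arithmetic; `cache_expiry` and `may_still_be_cached` are hypothetical helper names, and the timestamp and one-day TTL in the usage lines are illustrative assumptions, not values taken from the accountchooser.com zone.

```python
from datetime import datetime, timedelta


def cache_expiry(cached_at: datetime, ttl_seconds: int) -> datetime:
    """Latest time a resolver may keep serving a record it cached at `cached_at`."""
    return cached_at + timedelta(seconds=ttl_seconds)


def may_still_be_cached(cached_at: datetime, ttl_seconds: int, now: datetime) -> bool:
    """True while a record cached at `cached_at` has not yet reached its TTL."""
    return now < cache_expiry(cached_at, ttl_seconds)


# Illustrative values only: a record cached in the morning with a one-day TTL.
cached_at = datetime(2012, 12, 6, 9, 0, 0)
print(cache_expiry(cached_at, 86400))
```

With a one-day TTL, a wrong value cached in the morning could be served until the same time the next day, which is consistent with problems fading over the course of a day.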

On Dec 6, 2012, at 3:34 PM, Mengcheng Duan <meng...@google.com> wrote:

>> Part of my problem seems to have been that the new JS wiped out my stored accounts so that it looked broken.
This is strange. The accountchooser JS should never do this.

- Mengcheng


On Thu, Dec 6, 2012 at 3:29 PM, John Bradley <ve7...@ve7jtb.com> wrote:
Part of my problem seems to have been that the new JS wiped out my stored accounts so that it looked broken.



Leif Johansson

Dec 7, 2012, 1:15:13 AM
to oidf-account...@googlegroups.com
On 12/07/2012 01:14 AM, Brian Berliner wrote:
I agree that it is odd that the Google Name Servers are returning inconsistent serial numbers -- but I don't know what that means.

Normally it would mean unpredictable behaviour wrt slaving (right?) but I
don't know enough about the google setup to know if this is a real issue or not.

I've asked some folks I know in the "dns biz" if they've seen anything like this.

        Cheers Leif

Brian Berliner

Dec 7, 2012, 1:31:27 AM
to oidf-account...@googlegroups.com, oidf-account...@googlegroups.com
As I head off to bed tonight, it looks like the Google Name Servers are all back to serving consistent serial numbers. Hopefully things have settled down. Let me know if strangeness continues tomorrow.

Cheers,

        -Brian

Sent from my iPad

Leif Johansson

Dec 7, 2012, 1:41:20 AM
to oidf-account...@googlegroups.com
On 12/07/2012 07:31 AM, Brian Berliner wrote:
> As I head off to bed tonight, it looks like the Google Name Servers
> are all back to serving consistent serial numbers. Hopefully things
> have settled down. Let me know if strangeness continues tomorrow.
>
According to my sources, this sort of thing may be the result of using
a DNS implementation that keeps zones in sync without using serials, so
it may have been "normal".

Cheers Leif
>
>
