Can we take it to a list, to make sure that this information isn't lost?

Can you clarify what your question is? pitchup.com is serving an EV cert. An EV cert triggers revocation checks. Those revocation checks go through the OS's network stack on non-Linux platforms. If those revocation checks fail, EV status is denied.

The choice of the last TLS connection is based on the socket pool's logic to pick the warmest socket - the most recently active socket.

The choice to do multiple TLS handshakes is the same as our choice to open multiple sockets. In 2014, my EP interns looked at delaying the subsequent connections to see about using TLS resumption handshakes (which can be 1-RTT), but the answer was that for desktop it caused a perf degradation, and for mobile it was only a perf improvement at the extreme edge. I get NBU, but the complexity tradeoff was extremely high and extremely brittle due to our socket pools.

The choice of multiple connections when we believe a server supports H2 is an opportunity for improvement; right now, AIUI, we race connections. I have no thoughts/knowledge on that.

On Mon, Apr 4, 2016 at 12:53 PM, Ilya Grigorik <igri...@google.com> wrote:

+Ryan.. perhaps you can shed some light on the cert revocation checking mess that's happening here?

Pat, thanks for digging into this. I was hoping I was missing something obvious, but it looks like we potentially have a couple of issues here..

On Mon, Apr 4, 2016 at 8:59 AM, Patrick Meenan <pme...@google.com> wrote:

The netlog looks a lot like the tcpdump (at least for the base page). As best as I can tell, a handful of connections are opened, 2 of which go through TLS negotiation but fail because the revocation can't be checked. After those 2 fail, TLS negotiation is started on the 3rd connection, and that one appears to include a stapled response so it sails through and gets used.
What I can't tell is why the revocation checks failed (nothing in the netlog), and the overall behavior is bizarre.

One other thing that is clear is that on bandwidth-limited connections and H2, we are burning a BOATLOAD of bytes doing parallel TLS negotiations for connections that will be unused (and doing them in parallel with the connection we care about).

Base page - Socket 279 was used:

7195 - Start connect job (x2 - 261, 262)
7196 - Start connect job (x2 - 265, 266)
7195-7546 - DNS lookup for www.pitchup.com (resolved to 213.138.116.230)
7546 - Socket 276 (for connect 262)
7547 - Socket 277 (for connect 261)
7696 - * Start connect job (278) - "BACKUP_CONNECT_JOB_CREATED"?
7696 - * Socket 279 (for connect 278)
7934 - Socket 277 connected, TLS negotiation started
7935 - Socket 276 connected, TLS negotiation started
8085 - * Socket 279 connected
8350 - Socket 277 TLS certificate in, verification request 286
8383 - Socket 276 TLS certificate in, verification request 286
9049 - Socket 277 ERR_CERT_UNABLE_TO_CHECK_REVOCATION, socket closed
9050 - Socket 276 ERR_CERT_UNABLE_TO_CHECK_REVOCATION, socket closed
9051 - * Socket 279 starts SSL negotiation
9474 - * Socket 279 cert in (looks like it may contain a stapled response)
9476 - * Start using socket 279

Unfortunately the netlog doesn't have details on why the verification failed. All I can see is "cert_status = 131104 (REV_CHECKING_ENABLED | UNABLE_TO_CHECK_REVOCATION)".

On Mon, Apr 4, 2016 at 11:27 AM, Patrick Meenan <pme...@google.com> wrote:

Sorry, digging in now (I figured it would be involved, since anything that makes you head-scratch is not going to be a quick answer).

First thing that jumped out at me was that I didn't report the DNS lookup and socket connect times for the connections, so I clearly have a bug to fix around that.
It also looks like the TLS negotiation time wasn't reported in all of the waterfalls. That would account for SOME of the gap.

Also looks like tcpdump is being a bit buggy and didn't capture in 2/3 of the tests when I ran it, but I did manage to capture one with a tcpdump. The absolute times in the netlog, trace and waterfall won't line up, but relative to the first request they should.

The tcpdump is a mess. For the static domain, we open and negotiate TLS for a lot of separate connections (5 + several more later) and, in a bizarre twist of logic, pick the last of the 5 to complete as the one to keep while closing the other 4. There is also a bunch of TLS handshake and certificate traffic in flight for all of those connections competing with each other.

The base page also makes no sense whatsoever, and the craziness may not even be represented in the WPT waterfall. It looks like we opened and negotiated 2 TLS connections, opened a 3rd (because, why not?), looked up the domains for the CRL/OCSP checks (but didn't check anything), and then closed the first 2 connections and went through the TLS negotiation for the 3rd.

I'll go through the netlog to see if I can make any sense of it. I guess it could be WPT doing something to piss off Chrome, but I can't imagine what.

Base Page:

7.238 - DNS www.pitchup.com
7.538 - DNS response 213.138.116.230
7.540 - SYN (x2) to 213.138.116.230
7.927 - SYN to 213.138.116.230 (hmm, why are we trying a 3rd connection?)
7.927 - SYN ACK x2
7.928 - TLS Client Hello (x2)
8.077 - SYN ACK (3rd)
8.324 - Server Hello & Certificates (x2)
8.342 - Client key, change cipher spec, handshake finished (x2)
8.432 - DNS ocsp.digicert.com (revocation check started?)
8.732 - DNS response 72.21.91.29 (never see a connection attempted to this address)
8.734 - Server handshake complete + some application data (keylog didn't seem to work)
8.738 - DNS crl4.digicert.com
8.942 - DNS response 66.225.197.197 (don't see a connection attempt here either)
9.042 - FIN ACK to both original connections (WTF?)
9.044 - Client hello for 3rd connection
9.446 - Server hello + certificate
9.463 - Client key exchange, handshake finished
9.463 - Client application data (presumably the http request)
9.857 - Server handshake finished
9.858 to 9.934 - Server application data (response?)

media.pitchup.co.uk content requests

Looks like client port 50724 ends up being the connection used for the bulk of the data (tcp.stream eq 8):

10.152 - DNS media.pitchup.co.uk
10.452 - DNS 213.138.116.230 (same as base page)
10.453 - * SYN to 213.138.116.230 (x5 - client ports 50720 - 50724)
10.840 - SYN ACKs start coming in
10.841 - Client Hellos start going out
10.960 - * SYN ACK for 50724
11.239 - Server Hellos start coming in
11.266 - FIN ACKs (for most of the connections). Looks like all but 50724 (probably discovered it is H2 and picked a connection to keep)
11.267 - * Client hello for 50724
11.280 - SYN to 213.138.116.230 (client port 50726) - no idea why more connections to the same server are being opened
11.301 - SYN to 213.138.116.230 (client port 50727)
11.665 - * Server hello for 50724
11.679 - * Certificate for 50724
11.681 - SYN ACK (50726)
11.682 - * 50724 Client handshake finished, requests go out (presumably)
11.690 - SYN ACK (50727)
12.074 - * 50724 Server handshake finished, responses start coming in

On Fri, Apr 1, 2016 at 5:20 PM, Ilya Grigorik <igri...@google.com> wrote:

Hey Pat.

I'm trying to make sense of this, wondering if you have any ideas. Specifically, trying to understand the reported gap between the HTML and the subresources.

What's confusing me is the fact that the waterfall doesn't really line up with the traces. The timing appears to be off, and I'm seeing some weird behavior with the cert verifier... e.g. netlog shows a bunch of errors:

net_error = -205 (ERR_CERT_UNABLE_TO_CHECK_REVOCATION)

Can't wrap my head around what's actually going on here.. Any ideas? My hunch is that the EV certs are doing something funny here, but can't prove it yet. =/

ig
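The "warmest socket" preference mentioned earlier in the thread (reuse the most recently active idle socket, since its kernel state is the freshest) can be sketched roughly like this. This is an illustrative simplification, not Chrome's actual ClientSocketPool code; the data layout is hypothetical:

```python
def pick_warmest(idle_sockets):
    """Pick the 'warmest' idle socket: the one most recently active,
    on the theory that its congestion window and kernel state are the
    freshest. Hypothetical sketch, not Chrome's socket pool code.

    idle_sockets: iterable of (socket_id, last_active_ms) pairs.
    Returns the chosen socket_id, or None if the pool is empty.
    """
    return max(idle_sockets, key=lambda s: s[1], default=(None, None))[0]
```

With the socket IDs and timestamps from the netlog above, this preference is exactly why socket 279 (the last to complete) would be the one kept.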
On Apr 5, 2016 6:36 AM, "Patrick Meenan" <pme...@google.com> wrote:
>
> Looking at the tcpdump:
>
> - All of the TLS negotiations have stapled responses (including the 1st 2 that get rejected)
> - The certificates include alt names for www.pitchup.com, m.pitchup.com and media.pitchup.co.uk
>
> Which raises a bunch of questions. I'll see if I can instrument a dev build and see the code path better but:
>
> - Why are the certificates being passed to Windows for verification when they include stapled responses at the TLS layer?
I am not sure I understand your question.
> - Is it a timing issue, since the staple comes in a TLS frame that follows the certificates (though in the same packet in this case)? Do we start the OS validation and cancel it once we see the stapled response?
No
> - Windows did the DNS lookups for the crl and ocsp domains and the validation failed after the DNS lookup completed. If we cancel an in-flight validation but it doesn't return until later, is there a race condition where the failure from canceling overwrites the stapled validation?
No
>
> - Why did the subresource requests not get coalesced onto the existing H2 connection since the certificate included the domain and it resolved to the same IP address?
Because we don't coalesce until we have negotiated the TLS connection
Following the certificate are DNS lookups for the crl and ocsp responders for the certificate which seems to indicate that the Windows cert verifier at least started to check crl/ocsp for the EV cert (ignoring the stapled response). There don't appear to be intermediate certificates so I assume the DNS lookup is for the leaf certificate verification.
Shortly after the DNS lookups complete for the crl/ocsp domains, the cert verification fails. Both TLS connections were bound to the same cert verifier job and both get closed. The cert verifier job shows no indication that it saw the stapled response.
After that, a 3rd connection that was already opened but idle starts TLS negotiation. The cert response from the server is the same as on the earlier connections, but this time the netlog shows that the cert verifier job sees the stapled response and the verification passes.
Sorry, I guess that adds at least one more RTT to the coalescing than I thought was in place. I was under the impression that we recognized that the IP is the same as one that we already have an open connection for, and that the open connection already has a certificate that includes the new host in the list of alt names.

Even then, in this case the actual traffic was over the new connection, so we didn't coalesce even after TLS negotiation.
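The coalescing eligibility test being discussed (same resolved IP, the existing connection's certificate covers the new host in its alt names, and, per the earlier answer, a completed TLS handshake so that certificate is actually known) can be sketched roughly as follows. This is an illustrative simplification, not Chrome's actual session-pool logic:

```python
def host_matches_san(host, san):
    """Match a hostname against a certificate subjectAltName entry,
    allowing a single left-most wildcard label (e.g. *.pitchup.com)."""
    if san.startswith("*."):
        return "." in host and host.split(".", 1)[1] == san[2:]
    return host == san


def can_coalesce(conn_ip, conn_sans, handshake_done, new_host, new_ip):
    """Rough sketch of H2 connection-coalescing eligibility: reuse an
    existing connection for a new host only if its TLS handshake has
    finished (so the cert and its alt names are known), the new host
    resolved to the same IP, and the cert covers the new host."""
    if not handshake_done:
        return False  # per the thread: no coalescing until TLS is negotiated
    if new_ip != conn_ip:
        return False
    return any(host_matches_san(new_host, san) for san in conn_sans)
```

With the alt names from this certificate, media.pitchup.co.uk would have been eligible to ride the www.pitchup.com connection once its handshake completed; the extra RTT comes from having to wait for that handshake before the check can pass.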
The main unexpected behavior that I'd love to understand is why cert verification failed for the first verifier job but succeeded for the second one (for the base domain). The crl/ocsp responders were looked up but never connected to (not even before the verification succeeded on the second attempt).

I guess it could be something screwy inside of Windows' verification logic itself.
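The expected ordering being described is: a good stapled OCSP response delivered at the TLS layer should satisfy the revocation check with no network fetch at all, and only in the absence of a usable staple should the verifier fall back to the (fallible) CRL/OCSP fetch. A minimal sketch of that ordering, where `network_revocation_check` is a hypothetical stand-in for the OS verifier's network path:

```python
def check_revocation(stapled_ocsp_status, network_revocation_check):
    """Sketch of the intended revocation-check ordering: a stapled
    OCSP response answers the question immediately; only without one
    should we hit the network. The surprise in the thread is the first
    verifier job apparently skipping straight to the network path.

    stapled_ocsp_status: "good", "revoked", or None (no staple).
    network_revocation_check: callable returning a status string;
    hypothetical stand-in for the Windows verifier's CRL/OCSP fetch.
    """
    if stapled_ocsp_status is not None:
        return stapled_ocsp_status  # staple present: no network needed
    return network_revocation_check()  # no staple: fall back to network
```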
CAPI logs were a huge help, thanks!

tl;dr: WPT bug, all fixed, and OCSP/CRL checks now show up in WPT waterfalls (for both IE and Chrome).

WPT uses CNAME mappings to identify CDNs, and to help collect the CNAMEs while a test is running it adds a flag to the DNS lookups so that the CNAME is also returned with the lookup (saves a significant amount of time over actively performing lookups after the test). Unfortunately it universally added the AI_CANONNAME flag to the lookups, and that conflicts with AI_FQDN, which WinHttp was using for the OCSP/CRL lookups to populate the FQDN in the cname field.

That pretty much means every OCSP/CRL validation check on the WPT agents for Chrome and IE has been effectively blocked for years (like 5 years).

The fix was brutally simple and I landed it this afternoon. With the fix, the validation checks show up in the waterfalls now just like they have for Firefox for the longest time: http://www.webpagetest.org/result/160412_W2_e4acee08d8b863a6dfe1c509c23b1c9d/
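For reference, AI_CANONNAME is the standard getaddrinfo flag that asks the resolver to return the canonical (CNAME-chased) name alongside the addresses, roughly what the WPT agent was doing to collect CDN CNAMEs in one pass. A small Python illustration (AI_FQDN is Windows-only, has no Python constant, and populates the same ai_canonname result field, which is where the conflict came from):

```python
import socket

def resolve_with_canonical_name(host, port=443):
    """Resolve host and also request its canonical name in the same
    getaddrinfo call via AI_CANONNAME (the flag WPT was adding to every
    lookup). Returns (canonical_name, first_ip). Illustrative sketch,
    not the WPT agent's actual code."""
    results = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP,
                                 flags=socket.AI_CANONNAME)
    # AI_CANONNAME fills the canonname field of the first result only.
    family, socktype, proto, canonname, sockaddr = results[0]
    return canonname, sockaddr[0]
```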


I'm fairly certain the huge gap in the WPT waterfall is actually accurate and the timings are actually coming from dev tools for this specific site (they are SPDY, not H2 and WPT doesn't have a SPDY decoder). The gap is the DNS lookup + whatever else happens before we decide to coalesce the requests for the media* domain to the base domain.
The requests do originate a lot sooner in the browser but they don't actually go out on to the wire for a while which is what WPT reports.
Showing the "blocked" time is something I've considered but it hasn't been important enough yet to push everything else out of the way. In the case of H2 I'm 100% positive on the timings because they come from the socket filter. For SPDY it depends on the remote debug net.* events ("request will be sent" I believe is the one that triggers the start of the outbound bytes measurement). The times won't line up exactly because the traces will have an earlier start point (i.e. the DNS lookup for the base domain will not be at 0).
Eyeballing the waterfall screenshot, it looks like ~500ms from the HTML to the outbound request, and those are the 2 you'd want to compare in the trace. If you have a link to the test result I can pull the raw dev tools debug events to compare.