DigiCert get-entries Outage

Andrew Ayer

Feb 15, 2021, 11:08:23 AM
to ct-p...@chromium.org
The DigiCert log (Address ct1.digicert-ct.com/log; ID
VhQGmi/XwuzT9eG9RLI+x0Z2ubyZEVzA75SYVdaJ0N0=) experienced a disruption
starting around 2021-02-12 06:00:37+00:00 and lasting until 2021-02-15
14:22:30+00:00. During this time, SSLMate's monitor was unable to
download any log entries after position 23910770 - the log server closed
the connection prematurely in response to the get-entries request.
A user of the open source Cert Spotter reported the same problem
at <https://github.com/SSLMate/certspotter/issues/45> and Graham
Edgecombe's monitor currently gives this log an uptime of 77.45%
<https://ct.grahamedgecombe.com/>.

Calls to get-sth were not impacted, and the above GitHub user reported
that calls to get-entries with end=23910630 succeeded.
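
For anyone trying to reproduce this, the failure shows up in a simple batched download loop like the sketch below (illustrative Go, not SSLMate's monitor code; the starting position and batch size are arbitrary). It walks get-entries in fixed-size batches and reports the position where downloading first fails:

package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

const logURL = "https://ct1.digicert-ct.com/log"

// getEntries fetches one batch and returns how many entries came back.
func getEntries(start, end int64) (int, error) {
	url := fmt.Sprintf("%s/ct/v1/get-entries?start=%d&end=%d", logURL, start, end)
	resp, err := http.Get(url)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		// A premature close by the server surfaces here.
		return 0, fmt.Errorf("connection closed after %d bytes: %w", len(body), err)
	}
	var res struct {
		Entries []json.RawMessage `json:"entries"`
	}
	if err := json.Unmarshal(body, &res); err != nil {
		return 0, fmt.Errorf("invalid JSON: %w", err)
	}
	return len(res.Entries), nil
}

func main() {
	pos := int64(23910000) // arbitrary position shortly before the ones reported here
	const batch = 256
	for {
		n, err := getEntries(pos, pos+batch-1)
		if err != nil || n == 0 {
			fmt.Printf("stuck at position %d: %v\n", pos, err)
			return
		}
		pos += int64(n) // logs may return fewer entries than requested
	}
}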

Can DigiCert provide an incident report for this disruption?

Regards,
Andrew

Al Cutter

Feb 15, 2021, 11:43:42 AM
to Andrew Ayer, Certificate Transparency Policy
We're seeing the same: a similar start time and entry 23910770, and what looks like truncated responses from get-entries.
Calls to get-sth and get-sth-consistency seem fine from here too.

Kurt Roeckx

Feb 15, 2021, 11:44:24 AM
to Andrew Ayer, ct-p...@chromium.org
I can confirm that I've also had problems with get-entries but not
with get-sth. I contacted DigiCert about it today at 09:44
UTC+01, but have not received a reply so far.

I was stuck at entry 23910612.


Kurt

Jeremy Rowley

Feb 17, 2021, 11:46:56 AM
to Certificate Transparency Policy, Kurt Roeckx, ct-p...@chromium.org, Andrew Ayer
I'm working on an RCA and will post an update here today.

Jeremy Rowley

Feb 17, 2021, 12:14:08 PM
to Certificate Transparency Policy, Jeremy Rowley, Kurt Roeckx, ct-p...@chromium.org, Andrew Ayer
We experienced an issue with our CT log over the weekend (as mentioned on this thread). What happened is that if a query asked for more than 15 entries, the CT log server only returned a partial response. The missing entries were caused by a filesystem that switched to read-only, which in turn caused nginx to fail to buffer the larger responses. As a result, any response larger than the nginx buffer failed to return all entries properly.

Looking at the logs, the issue started at approximately 05:14 UTC Feb 11, 2021 and continued until 14:12 UTC Feb 15, 2021. The change to read-only happened when IBM performed maintenance on the SAN device. We restarted the server, which fixed the issue. We failed to notice the outage because a single-entry query was still successfully returning information, so the heartbeat checks passed. We are adding additional monitoring on the server for large batches and for changes in server configuration.
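
To make that concrete, the large-batch check is along these lines (a rough Go sketch only, not our production monitoring; the batch size and log URL are placeholders). It asks for a window well above the size where truncation appeared and alerts if the response doesn't come back as complete, well-formed JSON:

package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"os"
)

const logURL = "https://ct1.digicert-ct.com/log" // placeholder
const batchSize = int64(100)                     // much larger than a single-entry heartbeat

// getJSON fetches a URL and decodes the body; a truncated body fails here.
func getJSON(url string, out interface{}) error {
	resp, err := http.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return fmt.Errorf("reading body: %w", err)
	}
	return json.Unmarshal(body, out)
}

func main() {
	// Anchor the window at the current tree head so the probe always asks
	// for entries that exist.
	var sth struct {
		TreeSize int64 `json:"tree_size"`
	}
	if err := getJSON(logURL+"/ct/v1/get-sth", &sth); err != nil {
		fmt.Fprintln(os.Stderr, "get-sth failed:", err)
		os.Exit(1)
	}

	start := sth.TreeSize - batchSize
	if start < 0 {
		start = 0
	}
	end := sth.TreeSize - 1

	var res struct {
		Entries []json.RawMessage `json:"entries"`
	}
	url := fmt.Sprintf("%s/ct/v1/get-entries?start=%d&end=%d", logURL, start, end)
	if err := getJSON(url, &res); err != nil {
		fmt.Fprintln(os.Stderr, "large get-entries failed:", err)
		os.Exit(1)
	}
	if len(res.Entries) == 0 {
		fmt.Fprintln(os.Stderr, "large get-entries returned no entries")
		os.Exit(1)
	}
	fmt.Printf("ok: %d entries for [%d, %d]\n", len(res.Entries), start, end)
}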

Let me know what questions you have.

Jeremy

Al Cutter

Feb 17, 2021, 12:37:11 PM
to Jeremy Rowley, Certificate Transparency Policy, Kurt Roeckx, Andrew Ayer
Hi Jeremy, 

Is this related to the get-entries outage in September last year?
I seem to remember there may have been some similar circumstances there too, with the filesystem going read-only?

Cheers,
Al.

Jeremy Rowley

Feb 17, 2021, 1:13:07 PM
to Certificate Transparency Policy, Al Cutter
Could you send me a link to the September issue? I don't recall it.

Devon O'Brien

Feb 17, 2021, 1:20:35 PM
to Certificate Transparency Policy, Jeremy Rowley, Kurt Roeckx, ct-p...@chromium.org, Andrew Ayer
Hi Jeremy,

Thanks for providing a little more detail about what happened. Just to confirm: during this period, were get-entries queries of 15 or fewer entries completely unaffected by this issue? Also, for queries asking for more than 15 entries, were all responses truncated, or did ct1 start returning fewer entries during this period? (I suspect the latter answer is no, since the Log wasn't aware of the failures caused by truncating the responses downstream.)

-Devon

Jeremy Rowley

Feb 17, 2021, 1:29:41 PM
to Devon O'Brien, Certificate Transparency Policy, Kurt Roeckx, Andrew Ayer
Correct - any request for fewer than 15 entries executed as expected. All queries for more than 15 were truncated to 15.

Pierre Phaneuf

Feb 17, 2021, 1:34:54 PM
to Jeremy Rowley, Devon O'Brien, Certificate Transparency Policy, Kurt Roeckx, Andrew Ayer
Truncated in a way that still parsed as JSON with only 15 entries in
it, or at some arbitrary byte length?

Jeremy Rowley

Feb 17, 2021, 1:57:42 PM
to Pierre Phaneuf, Devon O'Brien, Certificate Transparency Policy, Kurt Roeckx, Andrew Ayer
Yeah - it's the same issue as in September. 

Note this only impacts CT1. None of the other logs experienced issues (and are on better platforms). The real solution is to retire CT1. We'd like to do that.

For the size issue: we send X bytes to nginx to return to the client. Nginx sends the HTTP header with a Content-Length of X bytes and starts sending the data, but drops the connection for the data that failed to write to disk. If the response is too big, nginx flushes the data to disk and then serves the rest from disk; with the filesystem read-only, the first part goes out but the rest does not. From the client you get only a portion of the expected HTTP response body, which means it is not valid JSON since the response was cut off in the middle.
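
From the client side that looks something like the sketch below (illustrative Go, made-up range): the response header promises more bytes than ever arrive, so the read fails partway through and the partial body never parses as JSON.

package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Made-up range past the point where monitors got stuck.
	url := "https://ct1.digicert-ct.com/log/ct/v1/get-entries?start=23910771&end=23910870"
	resp, err := http.Get(url)
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()

	body, readErr := io.ReadAll(resp.Body)
	fmt.Printf("Content-Length header: %d, bytes actually received: %d\n",
		resp.ContentLength, len(body))
	if readErr != nil {
		// A connection dropped mid-body typically surfaces here,
		// e.g. as an unexpected EOF.
		fmt.Println("short read:", readErr)
	}

	var out struct {
		Entries []json.RawMessage `json:"entries"`
	}
	if err := json.Unmarshal(body, &out); err != nil {
		fmt.Println("partial body is not valid JSON:", err)
		return
	}
	fmt.Println("entries received:", len(out.Entries))
}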

Kurt Roeckx

Feb 17, 2021, 2:09:12 PM
to Jeremy Rowley, Pierre Phaneuf, Devon O'Brien, Certificate Transparency Policy, Andrew Ayer
On Wed, Feb 17, 2021 at 11:57:30AM -0700, Jeremy Rowley wrote:
> From the client you get only a portion of the expected HTTP response
> body, which means it is not valid JSON since the response was cut off
> in the middle.

And at least libcurl then reports that the header length doesn't
match the actual received data.


Kurt
