
2018.03.12 Let's Encrypt Wildcard Certificate Encoding Issue


jo...@letsencrypt.org

Mar 12, 2018, 10:35:39 PM
to mozilla-dev-s...@lists.mozilla.org
During final tests for the general availability of wildcard certificate support, the Let's Encrypt operations team issued six test wildcard certificates under our publicly trusted root:

https://crt.sh/?id=353759994
https://crt.sh/?id=353758875
https://crt.sh/?id=353757861
https://crt.sh/?id=353756805
https://crt.sh/?id=353755984
https://crt.sh/?id=353754255

These certificates contain a subject common name that includes a “*.” label encoded as an ASN.1 PrintableString, which does not allow the asterisk character, violating RFC 5280.
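
For reference, PrintableString's character set is limited to letters, digits, space, and the punctuation ' ( ) + , - . / : = ? - the asterisk is not in it. A minimal illustrative check (a sketch for this post, not Boulder's actual code) makes the failure mode concrete:

    package main

    import "fmt"

    // isPrintableString reports whether s contains only the characters
    // allowed in an ASN.1 PrintableString.
    func isPrintableString(s string) bool {
        for _, r := range s {
            switch {
            case 'a' <= r && r <= 'z', 'A' <= r && r <= 'Z', '0' <= r && r <= '9':
                // letters and digits are allowed
            case r == ' ', r == '\'', r == '(', r == ')', r == '+', r == ',',
                r == '-', r == '.', r == '/', r == ':', r == '=', r == '?':
                // the small set of permitted punctuation
            default:
                return false
            }
        }
        return true
    }

    func main() {
        fmt.Println(isPrintableString("example.com"))   // true
        fmt.Println(isPrintableString("*.example.com")) // false: '*' is not permitted
    }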

We became aware of the problem on 2018-03-13 at 00:43 UTC via the linter flagging in crt.sh [1]. All six certificates have been revoked.

The root cause of the problem is a Go language bug [2] that has been fixed in Go v1.10 [3], which we were already planning to deploy soon. We will resolve the issue by upgrading to Go v1.10 before proceeding with our wildcard certificate launch plans.

We employ a robust testing infrastructure, but there is always room for improvement, and sometimes bugs slip through our pre-production tests. We’re fortunate that the PKI community has produced some great testing tools that sometimes catch things we don’t. In response to this incident, we are planning to integrate additional tools into our testing infrastructure and improve our test coverage across multiple Go versions.

[1] https://crt.sh/

[2] https://github.com/golang/go/commit/3b186db7b4a5cc510e71f90682732eba3df72fd3

[3] https://golang.org/doc/go1.10#encoding/asn1

Ryan Sleevi

Mar 12, 2018, 11:22:46 PM
to jo...@letsencrypt.org, mozilla-dev-security-policy
Given that Let's Encrypt has been operating a Staging Endpoint (
https://letsencrypt.org/docs/staging-environment/ ) for issuing wildcards,
what controls, if any, existed to examine the certificate profiles prior to
being deployed in production? Normally, that would flush these out -
through both manual and automated testing, preferably.

Given that Let's Encrypt is running on an open-source CA (Boulder), this
offers a unique opportunity to highlight where the controls and checks are
in place, particularly for commonNames. RFC 5280 has other restrictions
that have tripped CAs up, such as the exclusive use of
PrintableString/UTF8String for DirectoryString types (except for backwards
compatibility, which would not apply here) and length restrictions (such as
64 characters, per the ASN.1 schema); it would be useful to comprehensively
review these controls.

Golang's ASN.1 library is somewhat lax, in large part due to both public and
enterprise CAs' storied history of misencodings. What examinations, if any,
will Let's Encrypt be doing for other classes of potential encoding issues?
Has this caused any changes in how Let's Encrypt will construct
TBSCertificates, or review of that code, beyond the introduction of
additional linting?

Ryan Sleevi

Mar 12, 2018, 11:27:06 PM
to Ryan Sleevi, mozilla-dev-security-policy, jo...@letsencrypt.org
On Mon, Mar 12, 2018 at 11:22 PM, Ryan Sleevi <ry...@sleevi.com> wrote:

>
>
> On Mon, Mar 12, 2018 at 10:35 PM, josh--- via dev-security-policy <
> dev-secur...@lists.mozilla.org> wrote:
>
>> During final tests for the general availability of wildcard certificate
>> support, the Let's Encrypt operations team issued six test wildcard
>> certificates under our publicly trusted root:
>>
>> https://crt.sh/?id=353759994
>> https://crt.sh/?id=353758875
>> https://crt.sh/?id=353757861
>> https://crt.sh/?id=353756805
>> https://crt.sh/?id=353755984
>> https://crt.sh/?id=353754255
>>
>> These certificates contain a subject common name that includes a “*.”
>> label encoded as an ASN.1 PrintableString, which does not allow the
>> asterisk character, violating RFC 5280.
>>
>> We became aware of the problem on 2018-03-13 at 00:43 UTC via the linter
>> flagging in crt.sh [1].
>
>
Also, is this the correct timestamp? For example, examining
https://crt.sh/?id=353754255&opt=ocsp

Shows an issuance time of Not Before: Mar 12 22:18:30 2018 GMT and a
revocation time of 2018-03-12 23:58:10 UTC , but you stated your alerting
time was 2018-03-13 00:43:00 UTC. I'm curious if that's a bug in the
display of crt.sh, a typo in your timezone computation (considering the
recent daylight saving adjustments in the US), a deliberate choice to put
revocation somewhere between those dates (which is semantically valid, but
curious), or perhaps something else.

jacob.hoff...@gmail.com

Mar 12, 2018, 11:39:06 PM
to mozilla-dev-s...@lists.mozilla.org
On Monday, March 12, 2018 at 8:22:46 PM UTC-7, Ryan Sleevi wrote:
> Given that Let's Encrypt has been operating a Staging Endpoint (
> https://letsencrypt.org/docs/staging-environment/ ) for issuing wildcards,
> what controls, if any, existed to examine the certificate profiles prior to
> being deployed in production? Normally, that would flush these out -
> through both manual and automated testing, preferably.

We continuously run our cert-checker tool (https://github.com/letsencrypt/boulder/blob/master/cmd/cert-checker/main.go#L196-L261) in both staging and production. Unfortunately, it tests mainly the higher level semantic aspects of certificates rather than the lower level encoding aspects. Clearly we need better coverage on encoding issues. We expect to get that from integrating more and better linters into both our CI testing framework and our staging and production environments. We will also review the existing controls in our cert-checker tool.

> Golang's ASN.1 library is somewhat lax, in large part due to both public and
> enterprise CAs' storied history of misencodings.

Agreed that Go's asn1 package is lax on parsing, but I don't think that it aims to be lax on encoding; for instance, the mis-encoding of asterisks in PrintableStrings was considered a bug worth fixing.

> What examinations, if any,
> will Let's Encrypt be doing for other classes of potential encoding issues?
> Has this caused any changes in how Let's Encrypt will construct
> TBSCertificates, or review of that code, beyond the introduction of
> additional linting?

We will re-review the code we use to generate TBSCertificates with an eye towards encoding issues, thanks for suggesting it. If there are any broad classes of encoding issues you think are particularly worth watching out for, that could help guide our analysis.

> Also, is this the correct timestamp? For example, examining
> https://crt.sh/?id=353754255&opt=ocsp
> Shows an issuance time of Not Before: Mar 12 22:18:30 2018 GMT and a
> revocation time of 2018-03-12 23:58:10 UTC , but you stated your alerting
> time was 2018-03-13 00:43:00 UTC. I'm curious if that's a bug in the
> display of crt.sh, a typo in your timezone computation (considering the
> recent daylight saving adjustments in the US), a deliberate choice to put
> revocation somewhere between those dates (which is semantically valid, but
> curious), or perhaps something else.

I believe this was a timezone computation error. By my reading of the logs, our alerting time was 2018-03-12 23:43:00 UTC, which agrees with your hypothesis about the recent timezone change (DST) leading to a mistake in calculating UTC times.

js...@letsencrypt.org

Mar 12, 2018, 11:45:48 PM
to mozilla-dev-s...@lists.mozilla.org
On Monday, March 12, 2018 at 8:27:06 PM UTC-7, Ryan Sleevi wrote:
> Also, is this the correct timestamp? For example, examining
> https://crt.sh/?id=353754255&opt=ocsp
>
> Shows an issuance time of Not Before: Mar 12 22:18:30 2018 GMT and a
> revocation time of 2018-03-12 23:58:10 UTC , but you stated your alerting
> time was 2018-03-13 00:43:00 UTC. I'm curious if that's a bug in the
> display of crt.sh, a typo in your timezone computation (considering the
> recent daylight saving adjustments in the US), a deliberate choice to put
> revocation somewhere between those dates (which is semantically valid, but
> curious), or perhaps something else.

Adding a little more detail and precision here: Let's Encrypt backdates certificates by one hour, so "Not Before: Mar 12 22:18:30 2018 GMT" indicates an issuance time of 23:18:30.
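
As a quick illustration of that arithmetic (a sketch, not production code):

    package main

    import (
        "fmt"
        "time"
    )

    func main() {
        // notBefore as displayed by crt.sh / OpenSSL for https://crt.sh/?id=353754255
        notBefore, err := time.Parse("Jan 2 15:04:05 2006 MST", "Mar 12 22:18:30 2018 GMT")
        if err != nil {
            panic(err)
        }
        // Let's Encrypt backdates notBefore by one hour, so add the hour back
        // to recover the approximate wall-clock issuance time.
        fmt.Println(notBefore.Add(time.Hour).UTC()) // 2018-03-12 23:18:30 +0000 UTC
    }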

Also, you may notice that one of the certificates was actually revoked at 23:30:33, before we became aware of the problem. This was done as part of our regular deployment testing, to ensure that revocation was working properly.

Ryan Sleevi

Mar 13, 2018, 12:08:37 AM
to jacob.hoff...@gmail.com, mozilla-dev-s...@lists.mozilla.org
On Mon, Mar 12, 2018 at 11:38 PM jacob.hoffmanandrews--- via
dev-security-policy <dev-secur...@lists.mozilla.org> wrote:

> On Monday, March 12, 2018 at 8:22:46 PM UTC-7, Ryan Sleevi wrote:
> > Given that Let's Encrypt has been operating a Staging Endpoint (
> > https://letsencrypt.org/docs/staging-environment/ ) for issuing
> wildcards,
> > what controls, if any, existed to examine the certificate profiles prior
> to
> > being deployed in production? Normally, that would flush these out -
> > through both manual and automated testing, preferably.
>
> We continuously run our cert-checker tool (
> https://github.com/letsencrypt/boulder/blob/master/cmd/cert-checker/main.go#L196-L261)
> in both staging and production. Unfortunately, it tests mainly the higher
> level semantic aspects of certificates rather than the lower level encoding
> aspects. Clearly we need better coverage on encoding issues. We expect to
> get that from integrating more and better linters into both our CI testing
> framework and our staging and production environments. We will also review
> the existing controls in our cert-checker tool.
>
> > Golang's ASN.1 library is somewhat lax, in large part due to both public
> > and enterprise CAs' storied history of misencodings.
>
> Agreed that Go's asn1 package is lax on parsing, but I don't think that it
> aims to be lax on encoding; for instance, the mis-encoding of asterisks in
> PrintableStrings was considered a bug worth fixing.
>
> > What examinations, if any,
> > will Let's Encrypt be doing for other classes of potential encoding
> issues?
> > Has this caused any changes in how Let's Encrypt will construct
> > TBSCertificates, or review of that code, beyond the introduction of
> > additional linting?
>
> We will re-review the code we use to generate TBSCertificates with an eye
> towards encoding issues, thanks for suggesting it. If there are any broad
> classes of encoding issues you think are particularly worth watching out
> for, that could help guide our analysis.


Well, you’ve already run into one of the common ones I’d seen in the past -
more commonly with older OpenSSL-based bespoke/enterprise CAs (due to
long-since-fixed defaults, but nobody upgrading).

Encoding of INTEGERS is another frequent source of pain - minimum length
encoding, ensuring positive numbers - but given the Go ASN.1 package
author’s hatred of such misencodings, I would be surprised if that were an
issue here.

Reordering of SEQUENCES has been an issue for at least two wholly
independent CA software stacks when integrating CT support; at least one I
suspect is due to using a HashMap with non-guaranteed ordering semantics /
iteration order changing between runs and releases. This seems relevant to
Go, given its map design.

SET encodings not being sorted according to their values when encoding.
This would manifest in DNs, although I don’t believe Boulder supports
equivalent RDNs/AVAs.

Explicit encoding of DEFAULT values, most commonly basicConstraints. This
issue most commonly crops up when folks convert ASN.1 schemas to internal
templates by hand, rather than using compilers - which is something
applicable to Go.

Not enforcing size constraints - on strings or sequences. Similar to the
above, many folks forget to convert the restrictions when converting by
hand.

Improper encoding of parameter attributes for signature and SPKI algorithms
- especially RSA. This is due to the “ANY DEFINED BY” approach and folks
hand rolling, or not closely reading the specs. This is more high-level,
but derived from the schema flexibility.

Variable encoding of string types between Subject/Issuer or
Issuer/NameConstraints. This is more of a gray area - there are defined
semantics for this, but few get it right. This is more high-level, but
derived from the schema flexibility.

Not realizing DNSName, URI, and rfc822Name nameConstraints have different
semantic rules - this is more high-level than encoding, but within that set.

certlint/cablint catches many of these, in large part by using an
ASN.1 schema compiler (asn1c) rather than hand-rolling. Yet even it has had
some encoding issues in the past.
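
As an aside, the INTEGER rules above are mechanical to check in isolation.
A rough sketch (illustrative only, not tied to Boulder or any other
codebase) of the DER minimality and positivity checks on the contents
octets:

    package main

    import "fmt"

    // checkDERIntegerContents validates the contents octets of a DER INTEGER
    // (the bytes after the tag and length) for minimal encoding, and optionally
    // for positivity, as the BRs require of serial numbers.
    func checkDERIntegerContents(b []byte, mustBePositive bool) error {
        if len(b) == 0 {
            return fmt.Errorf("empty INTEGER")
        }
        if len(b) >= 2 {
            // DER forbids redundant leading octets.
            if b[0] == 0x00 && b[1]&0x80 == 0 {
                return fmt.Errorf("non-minimal encoding: unnecessary leading 0x00")
            }
            if b[0] == 0xFF && b[1]&0x80 != 0 {
                return fmt.Errorf("non-minimal encoding: unnecessary leading 0xFF")
            }
        }
        if mustBePositive && b[0]&0x80 != 0 {
            return fmt.Errorf("negative INTEGER")
        }
        return nil
    }

    func main() {
        fmt.Println(checkDERIntegerContents([]byte{0x00, 0x7F}, true)) // non-minimal
        fmt.Println(checkDERIntegerContents([]byte{0x00, 0x80}, true)) // <nil>: leading 0x00 is required here
        fmt.Println(checkDERIntegerContents([]byte{0x80}, true))       // negative
    }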

> Also, is this the correct timestamp? For example, examining
> > https://crt.sh/?id=353754255&opt=ocsp
> > Shows an issuance time of Not Before: Mar 12 22:18:30 2018 GMT and a
> > revocation time of 2018-03-12 23:58:10 UTC , but you stated your
> alerting
> > time was 2018-03-13 00:43:00 UTC. I'm curious if that's a bug in the
> > display of crt.sh, a typo in your timezone computation (considering the
> > recent daylight saving adjustments in the US), a deliberate choice to put
> > revocation somewhere between those dates (which is semantically valid,
> but
> > curious), or perhaps something else.
>
> I believe this was a timezone computation error. By my reading of the
> logs, our alerting time was 2018-03-12 23:43:00 UTC, which agrees with your
> hypothesis about the recent timezone change (DST) leading to a mistake in
> calculating UTC times.
> _______________________________________________
> dev-security-policy mailing list
> dev-secur...@lists.mozilla.org
> https://lists.mozilla.org/listinfo/dev-security-policy
>

Tom

Mar 13, 2018, 4:33:50 AM
to mozilla-dev-s...@lists.mozilla.org
> During final tests for the general availability of wildcard
> certificate support, the Let's Encrypt operations team issued six test
> wildcard certificates under our publicly trusted root:
>
> https://crt.sh/?id=353759994
> https://crt.sh/?id=353758875
> https://crt.sh/?id=353757861
> https://crt.sh/?id=353756805
> https://crt.sh/?id=353755984
> https://crt.sh/?id=353754255
>
Somebody noticed here
https://community.letsencrypt.org/t/acmev2-and-wildcard-launch-delay/53654/62
that the certificate for *.api.letsencrypt.org (apparently currently in
use), issued by "TrustID Server CA A52" (IdenTrust), seems to have the
same problem:
https://crt.sh/?id=8373036&opt=cablint,x509lint

jo...@letsencrypt.org

Mar 13, 2018, 9:47:43 AM
to mozilla-dev-s...@lists.mozilla.org
I think it's just a coincidence that we got a wildcard cert from IdenTrust a long time ago and it happens to have the same encoding issue that we ran into. I notified IdenTrust in case they haven't fixed the problem since then.

Matthew Hardeman

Mar 13, 2018, 4:13:16 PM
to jo...@letsencrypt.org, mozilla-dev-security-policy
The fact that this mis-issuance occurred does raise a question for the
community.

For quite some time, it has been repeatedly emphasized that maintaining a
non-trusted but otherwise identical staging environment and practicing all
permutations of tests and issuances -- especially involving new
functionality -- on that parallel staging infrastructure is the mechanism
by which mis-issuances such as those mentioned in this thread may be
avoided within production environments.

Let's Encrypt has been a shining example of best practices up to this point
and has enjoyed the attendant minimization of production issues (presumably
as a result of exercising said best practices).

Despite that, however, either the test cases which resulted in these
mis-issuances were not first executed on the staging platform or did not
result in the mis-issuance there. A reference was made to a Go lang
library error / non-conformance being implicated. Were the builds for
staging and production compiled on different releases of Go lang?

Certainly, I think these particular mis-issuances do not significantly
affect the level of trust which should be accorded to ISRG / Let's Encrypt.

Having said that, however, it is worth noting that in a fully new and novel
PKI infrastructure, it seems likely -- based on recent inclusion / renewal
requests -- that such a mis-issuance would recently have resulted in a
disqualification of a given root / key with guidance to cut a new root PKI
and start the process over.

I am not at all suggesting consequences for Let's Encrypt, but rather
raising a question as to whether that position on new inclusions / renewals
is appropriate. If these things can happen in a celebrated best-practices
environment, can they really in isolation be cause to reject a new
application or a new root from an existing CA?

Another question this incident raised in my mind pertains to the parallel
staging and production environment paradigm: If one truly has the 'courage
of conviction' of the equivalence of the two environments, why would one
not perform all tests in ONLY the staging environment, with no tests and
nothing other than production transactions on the production environment?
That tests continue to be executed in the production environment while
holding to the notion that a fully parallel staging environment is the
place for tests seems to signal that confidence in the staging environment
is -- in some measure, however small -- limited.


On Tue, Mar 13, 2018 at 8:46 AM, josh--- via dev-security-policy <
dev-secur...@lists.mozilla.org> wrote:

> On Tuesday, March 13, 2018 at 3:33:50 AM UTC-5, Tom wrote:
> I think it's just a coincidence that we got a wildcard cert from IdenTrust
> a long time ago and it happens to have the same encoding issue that we ran
> into. I notified IdenTrust in case they haven't fixed the problem since
> then.

Ryan Sleevi

Mar 13, 2018, 5:02:45 PM
to Matthew Hardeman, mozilla-dev-security-policy, jo...@letsencrypt.org
On Tue, Mar 13, 2018 at 4:13 PM, Matthew Hardeman via dev-security-policy <
dev-secur...@lists.mozilla.org> wrote:

> I am not at all suggesting consequences for Let's Encrypt, but rather
> raising a question as to whether that position on new inclusions / renewals
> is appropriate. If these things can happen in a celebrated best-practices
> environment, can they really in isolation be cause to reject a new
> application or a new root from an existing CA?
>

While I certainly appreciate the comparison, I think it's apples and
oranges when we consider both the nature and degree, nor do I think it's
fair to suggest "in isolation" is a comparison.

I'm sure you can agree that incident response is defined by both the nature
and severity of the incident itself, the surrounding ecosystem factors
(i.e. was this a well-understood problem), and the detection, response, and
disclosure practices that follow. A system that does not implement any
checks whatsoever is, I hope, something we can agree is worse than a system
that relies on human checks (and virtually indistinguishable from no
checks), and that both are worse than a system with incomplete technical
checks.

I do agree with you that I find the testing of the staging environment
troubling - a failure to have robust profile tests in staging, for
example, is what ultimately resulted in Turktrust's notable misissuance
of unconstrained CA certificates. Similarly, given the wide
availability of certificate linting tools - such as ZLint, x509Lint,
(AWS's) certlint, and (GlobalSign's) certlint - there's no dearth of
availability of open tools and checks. Given the industry push towards
integration of these automated tools, it's not entirely clear why LE would
invent yet another, but it's also not reasonable to require that LE use
something 'off the shelf'.

I'm hoping that LE can provide more details about the change management
process and how, in light of this incident, it may change - both in terms
of automated testing and in certificate policy review.


> Another question this incident raised in my mind pertains to the parallel
> staging and production environment paradigm: If one truly has the 'courage
> of conviction' of the equivalence of the two environments, why would one
> not perform all tests in ONLY the staging environment, with no tests and
> nothing other than production transactions on the production environment?
> That tests continue to be executed in the production environment while
> holding to the notion that a fully parallel staging environment is the
> place for tests seems to signal that confidence in the staging environment
> is -- in some measure, however small -- limited.


That's ... just a bad conclusion, especially for a publicly-trusted CA :)

js...@letsencrypt.org

Mar 13, 2018, 6:19:28 PM
to mozilla-dev-s...@lists.mozilla.org
On Tuesday, March 13, 2018 at 2:02:45 PM UTC-7, Ryan Sleevi wrote:
> availability of certificate linting tools - such as ZLint, x509Lint,
> (AWS's) certlint, and (GlobalSign's) certlint - there's no dearth of
> availability of open tools and checks. Given the industry push towards
> integration of these automated tools, it's not entirely clear why LE would
> invent yet another, but it's also not reasonable to require that LE use
> something 'off the shelf'.

We are indeed planning to integrate GlobalSign's certlint and/or zlint into our existing cert-checker pipeline rather than build something new. We've already started submitting issues and PRs, in order to give back to the ecosystem:

https://github.com/zmap/zlint/issues/212
https://github.com/zmap/zlint/issues/211
https://github.com/zmap/zlint/issues/210
https://github.com/globalsign/certlint/pull/5

If your question is why we wrote cert-checker rather than use something off-the-shelf: cablint / x509lint weren't available at the time we wrote cert-checker. When they became available we evaluated them for production and/or CI use, but concluded that the complex dependencies and difficulty of productionizing them in our environment outweighed the extra confidence we expected to gain, especially given that our certificate profile at the time was very static. A system improvement we could have made here would have been to set "deploy cablint or its equivalent" as a blocker for future certificate profile changes. I'll add that to our list of items for remediation.
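
For anyone curious what that integration looks like, here is a rough sketch against zlint's Go API (the exact package layout and result types have changed across zlint versions, so treat this as illustrative rather than drop-in code):

    package main

    import (
        "encoding/json"
        "fmt"
        "io/ioutil"
        "log"

        "github.com/zmap/zcrypto/x509"
        "github.com/zmap/zlint"
    )

    func main() {
        // Load a DER-encoded certificate produced by the issuance pipeline
        // (cert.der is a placeholder path).
        der, err := ioutil.ReadFile("cert.der")
        if err != nil {
            log.Fatal(err)
        }
        cert, err := x509.ParseCertificate(der)
        if err != nil {
            log.Fatal(err)
        }

        // Run all registered lints and dump the results; a CI job or a
        // cert-checker-style pipeline would fail on any error- or warn-level result.
        results := zlint.LintCertificate(cert)
        out, err := json.MarshalIndent(results, "", "  ")
        if err != nil {
            log.Fatal(err)
        }
        fmt.Println(string(out))
    }

The same call can run in CI against certificates issued in staging, failing the build on any error- or warning-level result.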

Matthew Hardeman

Mar 13, 2018, 6:27:23 PM
to Ryan Sleevi, mozilla-dev-security-policy, jo...@letsencrypt.org
On Tue, Mar 13, 2018 at 4:02 PM, Ryan Sleevi <ry...@sleevi.com> wrote:

>
>
> On Tue, Mar 13, 2018 at 4:13 PM, Matthew Hardeman via dev-security-policy
> <dev-secur...@lists.mozilla.org> wrote:
>
>> I am not at all suggesting consequences for Let's Encrypt, but rather
>> raising a question as to whether that position on new inclusions /
>> renewals
>> is appropriate. If these things can happen in a celebrated best-practices
>> environment, can they really in isolation be cause to reject a new
>> application or a new root from an existing CA?
>>
>
> While I certainly appreciate the comparison, I think it's apples and
> oranges when we consider both the nature and degree, nor do I think it's
> fair to suggest "in isolation" is a comparison.
>

I thought I recalled a recent case in which a new root/key was declined
with the sole unresolved (and unresolvable, save for new key generation,
etc.) matter precluding the inclusion being a prior mis-issuance of test
certificates, already revoked and disclosed. Perhaps I am mistaken.



>
> I'm sure you can agree that incident response is defined by both the
> nature and severity of the incident itself, the surrounding ecosystem
> factors (i.e. was this a well-understood problem), and the detection,
> response, and disclosure practices that follow. A system that does not
> implement any checks whatsoever is, I hope, something we can agree is worse
> than a system that relies on human checks (and virtually indistinguishable
> from no checks), and that both are worse than a system with incomplete
> technical checks.
>
>
I certainly concur with all of that, which is part of the basis on which
I form my own opinion that Let's Encrypt should not suffer any
consequence of significance beyond advice along the lines of "make your
testing environment and procedures better".


> I do agree with you that I find the testing of the staging environment
> troubling - a failure to have robust profile tests in staging, for
> example, is what ultimately resulted in Turktrust's notable misissuance
> of unconstrained CA certificates. Similarly, given the wide
> availability of certificate linting tools - such as ZLint, x509Lint,
> (AWS's) certlint, and (GlobalSign's) certlint - there's no dearth of
> availability of open tools and checks. Given the industry push towards
> integration of these automated tools, it's not entirely clear why LE would
> invent yet another, but it's also not reasonable to require that LE use
> something 'off the shelf'.
>

I'm very interested in how the testing occurs in terms of procedures. I
would assume, for example, that no test transaction of any kind would ever
be "played" against a production environment unless that same exact test
transaction had already been "played" against the staging environment.
With respect to this case, were these wildcard certificates requested and
issued against the staging system with materially the same test transaction
data, and if so was the encoding incorrect? If these were not performed
against staging, what was the rational basis for executing a new and novel
test transaction against the production system first? If they were
performed AND if they did not encode incorrectly, then what was the
disparity between the environments which led to this? (The implication
being that some sort of change management process needs to be revised to
keep the operating environments of staging and production better
synchronized.) If they were performed and were improperly encoded on the
staging environment, then one would presume that the erroneous result was
missed by the various automated and manual examinations of the results of
the tests.

As you note, it's unreasonable to require use of any particular
implementation of any particular tool. But insofar as the other tools
catch certain issues that the LE-developed tools clearly did not catch
here, it would appear that LE needs to better test their testing
mechanisms. While it may not be necessary to incorporate the competing
tools into the live issuance pipeline, it would seem advisable for Let's
Encrypt to pass the output of tests run in their staging environment (the
certificates) through these various other testing tools as part of a
post-staging-deployment testing phase. It would seem logical to take the
best-of-breed tools, stack them up (whether automatically or manually),
and waterfall the final output of a full suite of test scenarios against
the post-deployment state of the staging environment, with a view to
identifying discrepancies between the LE tools' opinions and the external
tools' opinions and reconciling those, rejecting invalid determinations
as appropriate.


>
> I'm hoping that LE can provide more details about the change management
> process and how, in light of this incident, it may change - both in terms
> of automated testing and in certificate policy review.
>
>
>> Another question this incident raised in my mind pertains to the parallel
>> staging and production environment paradigm: If one truly has the
>> 'courage
>> of conviction' of the equivalence of the two environments, why would one
>> not perform all tests in ONLY the staging environment, with no tests and
>> nothing other than production transactions on the production environment?
>> That tests continue to be executed in the production environment while
>> holding to the notion that a fully parallel staging environment is the
>> place for tests seems to signal that confidence in the staging environment
>> is -- in some measure, however small -- limited.
>
>
> That's ... just a bad conclusion, especially for a publicly-trusted CA :)
>
>
I certainly agree it's possible that I've reached a bad conclusion there,
but I would like to better understand how specifically? Assuming the same
input data set and software manipulating said data, two systems should in
general execute identically. To the extent that they do not, my initial
position would be that a significant failing of change management of
operating environment or data set or system level matters has occurred. I
would think all of those would be issues of great concern to a CA, if for
no other reason than that they should be very very rare.

js...@letsencrypt.org

Mar 13, 2018, 6:51:01 PM
to mozilla-dev-s...@lists.mozilla.org
On Tuesday, March 13, 2018 at 2:02:45 PM UTC-7, Ryan Sleevi wrote:
> I'm hoping that LE can provide more details about the change management
> process and how, in light of this incident, it may change - both in terms
> of automated testing and in certificate policy review.

Forgot to reply to this specific part. Our change management process starts with our SDLC, which mandates code review (typically dual code review), unit tests, and, where appropriate, integration tests. All unit tests and integration tests are run automatically with every change, and before every deploy. Our operations team checks the automated test status and will not deploy if the tests are broken. Any configuration changes that we plan to apply in staging and production are first added to our automated tests.

Each deploy then spends a period of time in our staging environment, where it is subject to further automated tests: periodic issuance testing, plus performance, availability, and correctness monitoring equivalent to our production environment. This includes running the cert-checker software I mentioned earlier. Typically our deploys spend two days in our staging environment before going live, though that depends on our risk evaluation, and hotfix deploys may spend less time in staging if we have high confidence in their safety. Similarly, any configuration changes are applied to the staging environment before going to production. For significant changes we do additional manual testing in the staging environment. Generally this testing means checking that the new change was applied as expected, and that no errors were produced. We don't rely on manual testing as a primary way of catching bugs; we automate everything we can.

If the staging deployment or configuration change doesn't show any problems, we continue to production. Production has the same suite of automated live tests as staging. And similar to staging, for significant changes we do additional manual testing. It was this step that caught the encoding issue, when one of our staff used crt.sh's lint tool to double check the test certificate they issued.

Clearly we should have caught this earlier in the process. The changes we have in the pipeline (integrating certlint and/or zlint) would have automatically caught the encoding issue at each stage of the pipeline: in development, in staging, and in production.

josef.s...@gmail.com

Mar 14, 2018, 6:18:25 AM
to mozilla-dev-s...@lists.mozilla.org
On Tuesday, March 13, 2018 at 23:51:01 UTC+1 js...@letsencrypt.org wrote:

> Clearly we should have caught this earlier in the process. The changes we have in the pipeline (integrating certlint and/or zlint) would have automatically caught the encoding issue at each stage of the pipeline: in development, in staging, and in production.

So, to make sure I understand: the same problem was present in the staging environment, and certificates with illegal encoding were also issued in staging, but you didn't notice them because no one manually validated them with the crt.sh lint?

Or are there differences between staging and production?

js...@letsencrypt.org

Mar 14, 2018, 2:17:42 PM
to mozilla-dev-s...@lists.mozilla.org

> So, to make sure I understand: the same problem was present in the staging environment, and certificates with illegal encoding were also issued in staging, but you didn't notice them because no one manually validated them with the crt.sh lint?

That's correct.

> Or are there differences between staging and production?

Yep, there are differences, though of course we try to keep them to a minimum. The most notable is that we don't use trusted keys in staging. That means staging can only submit to test CT logs, and is therefore not picked up by crt.sh.

Ryan Sleevi

Mar 14, 2018, 2:58:58 PM
to Matthew Hardeman, Ryan Sleevi, mozilla-dev-security-policy, jo...@letsencrypt.org
On Tue, Mar 13, 2018 at 6:27 PM Matthew Hardeman <mhar...@gmail.com>
wrote:

> Another question this incident raised in my mind pertains to the parallel
>>> staging and production environment paradigm: If one truly has the
>>> 'courage
>>> of conviction' of the equivalence of the two environments, why would one
>>> not perform all tests in ONLY the staging environment, with no tests and
>>> nothing other than production transactions on the production environment?
>>> That tests continue to be executed in the production environment while
>>> holding to the notion that a fully parallel staging environment is the
>>> place for tests seems to signal that confidence in the staging
>>> environment
>>> is -- in some measure, however small -- limited.
>>
>>
>> That's ... just a bad conclusion, especially for a publicly-trusted CA :)
>>
>>
> I certainly agree it's possible that I've reached a bad conclusion there,
> but I would like to better understand how specifically? Assuming the same
> input data set and software manipulating said data, two systems should in
> general execute identically. To the extent that they do not, my initial
> position would be that a significant failing of change management of
> operating environment or data set or system level matters has occurred. I
> would think all of those would be issues of great concern to a CA, if for
> no other reason than that they should be very very rare.
>

I get the impression you may not have run complex production systems,
especially distributed systems, or spent much time with testing
methodology, given statements such as “courage of your conviction.”

No testing system is going to be perfect, and there’s a difference between
designed redundancy and unnecessary testing.

For example, even if you had 100% code coverage through tests, there are
still things that are possible to get wrong - you could test every line of
your codebase and still fail to properly handle IDNs or, as other CAs have
shown, ampersands.

It’s foolish to think that a staging environment will cover every possible
permutation - even if you solved the halting problem, you will still have
issues with, say, solar radiation induced bitflips, or RAM heat, or any
number of other issues. And yes, these are issues still affecting real
systems today, not scary stories we tell our SREs to keep them up at night.

Look at any complex system - avionics, military command-and-control,
certificate authorities, modern scalable websites - and you will find
systems designed with redundancy throughout, to ensure proper functioning.
It is the madness of inexperience to suggest that somehow this redundancy
is unnecessary or somehow a black mark - the Sean Hannity approach of “F’
it, we’ll do it live” is the antithesis of modern and secure design. The
suggestion that this is somehow a sign of insufficient testing or design
is, at best, naive and, at worst, detrimental to discussions of how to
improve the ecosystem.

>

Tom Prince

Mar 14, 2018, 11:30:22 PM
to mozilla-dev-s...@lists.mozilla.org
On Tuesday, March 13, 2018 at 4:27:23 PM UTC-6, Matthew Hardeman wrote:
> I thought I recalled a recent case in which a new root/key was declined
> with the sole unresolved (and unresolvable, save for new key generation,
> etc.) matter precluding the inclusion being a prior mis-issuance of test
> certificates, already revoked and disclosed. Perhaps I am mistaken.

I haven't seen this directly addressed. I'm not sure what incident you are referring to, but I'm fairly sure that the mis-issuance that needed new keys was for certificates that were issued for domains that weren't properly validated.

In the case under discussion in this thread, all the mis-issued certificates are only mis-issued due to encoding issues. The certificates are for sub-domains of randomly generated subdomains of aws.radiantlock.org (which, according to whois, is controlled by Let's Encrypt). I presume these domains are created specifically for testing certificate issuance in the production environment in a way that complies with the BRs.

To put it succinctly, the issue you are referring to is about issuing certificates for domains that aren't authorized (whether for testing or not), rather than creating test certificates.

-- Tom Prince

Wayne Thayer

Mar 15, 2018, 3:05:25 PM
to mozilla-dev-security-policy
This incident, and the resulting action to "integrate GlobalSign's certlint
and/or zlint into our existing cert-checker pipeline", has been documented
in bug 1446080 [1].

This is further proof that pre-issuance TBS certificate linting (either by
incorporating existing tools or using a comprehensive set of rules) is a
best practice that prevents misissuance. I don't understand why all CAs
aren't doing this.
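
For concreteness, one common way to implement pre-issuance linting is to
sign the candidate profile with a throwaway key and feed the resulting DER
to the linters before the real CA key signs anything. A rough, stdlib-only
sketch of that idea (illustrative; the profile and helper below are
hypothetical, not any CA's actual code):

    package main

    import (
        "crypto/ecdsa"
        "crypto/elliptic"
        "crypto/rand"
        "crypto/x509"
        "crypto/x509/pkix"
        "log"
        "math/big"
        "time"
    )

    // signWithThrowawayKey self-signs the candidate profile with an ephemeral key.
    // The resulting DER can be run through linters before the production CA key
    // ever signs anything; the signature itself is irrelevant to the lints.
    func signWithThrowawayKey(template *x509.Certificate) ([]byte, error) {
        throwaway, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
        if err != nil {
            return nil, err
        }
        return x509.CreateCertificate(rand.Reader, template, template, &throwaway.PublicKey, throwaway)
    }

    func main() {
        // A toy wildcard profile; a real CA would populate this from the
        // actual issuance profile and the subscriber's request.
        template := &x509.Certificate{
            SerialNumber: big.NewInt(1),
            Subject:      pkix.Name{CommonName: "*.example.com"},
            DNSNames:     []string{"*.example.com"},
            NotBefore:    time.Now().Add(-time.Hour),
            NotAfter:     time.Now().Add(90 * 24 * time.Hour),
        }
        der, err := signWithThrowawayKey(template)
        if err != nil {
            log.Fatal(err)
        }
        log.Printf("generated %d bytes of DER to hand to zlint/certlint/cablint", len(der))
    }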

- Wayne

[1] https://bugzilla.mozilla.org/show_bug.cgi?id=1446080

Tom

Mar 15, 2018, 3:23:20 PM
to mozilla-dev-s...@lists.mozilla.org
Should another bug be opened for the certificate issued by IdenTrust
with apparently the same encoding problem?

https://crt.sh/?id=8373036&opt=cablint,x509lint

Does Mozilla expect the revocation of such certificates?

https://groups.google.com/d/msg/mozilla.dev.security.policy/wqySoetqUFM/l46gmX0hAwAJ

Wayne Thayer

Mar 15, 2018, 3:59:20 PM
to Tom, mozilla-dev-security-policy
On Thu, Mar 15, 2018 at 12:22 PM, Tom via dev-security-policy <
dev-secur...@lists.mozilla.org> wrote:

> Should another bug be opened for the certificate issued by IdenTrust with
> apparently the same encoding problem?
>
> https://crt.sh/?id=8373036&opt=cablint,x509lint

Yes - this is bug 1446121 (
https://bugzilla.mozilla.org/show_bug.cgi?id=1446121).

> Does Mozilla expect the revocation of such certificates?
>
> https://groups.google.com/d/msg/mozilla.dev.security.policy/wqySoetqUFM/l46gmX0hAwAJ

Yes, within 24 hours per BR 4.9.1.1 (9): "The CA is made aware that the
Certificate was not issued in accordance with these Requirements or the
CA’s Certificate Policy or Certification Practice Statement;"

Mozilla requires adherence to the BRs, and the BRs require CAs to comply
with RFC 5280.

- Wayne

József Szilágyi

Mar 16, 2018, 12:07:19 PM
to mozilla-dev-s...@lists.mozilla.org
Please also put this certificate on that list:
https://crt.sh/?id=181538497&opt=cablint,x509lint

Best Regards,
Jozsef