Request for Input: CA Incident Reporting


Clint Wilson

Jul 20, 2023, 11:19:20 AM
to public

All,


During the CA/Browser Forum Face-to-Face 59 meeting, several Root Store Programs expressed an interest in improving Web PKI incident reporting.


The CCADB Steering Committee is interested in this community’s recommendations on improving the standards applicable to and the overall quality of incident reports submitted by Certification Authority (CA) Owners. We aim to facilitate effective collaboration, foster transparency, and promote the sharing of best practices and lessons learned among CAs and the broader community.


Currently, some Root Store Programs require incident reports from CA Owners to address a list of items in a format detailed on ccadb.org [1]. While the CCADB format provides a framework for reporting, we would like to discuss ideas on how to improve the quality and usefulness of these reports.


We would like to make incident reports more useful and effective, such that they:


  • Are consistent in quality, transparency, and format.

  • Demonstrate thoroughness and depth of investigation and incident analysis, including for variants.

  • Clearly identify the true root cause(s) while avoiding restating the issue.

  • Provide sufficient detail that enables other CA Owners or members of the public to comprehend and, where relevant, implement an equivalent solution.

  • Present a complete timeline of the incident, including the introduction of the root cause(s).

  • Include specific, actionable, and timebound steps for resolving the issue(s) that contributed to the root cause(s).

  • Are frequently updated when new information is found and steps for resolution are completed, delayed, or changed. 

  • Allow a reader to quickly understand what happened, the scope of the impact, and how the remediation will sufficiently prevent the root cause of the incident from recurring.


We appreciate, to put it mildly, the members of this community and the general public who generate and review reports, offer their understanding of the situation and impact, and ask clarifying questions.


Call to action: In the spirit of continuous improvement, we are requesting (and very much appreciate) this community’s suggestions for how CA incident reporting can be improved.


Not every suggestion will be implemented, but we will commit to reviewing all suggestions and collectively working towards an improved standard.


Thank you

-Clint, on behalf of the CCADB Steering Committee


[1] https://www.ccadb.org/cas/incident-report 

Clint Wilson

Aug 1, 2023, 10:23:08 AM
to public
Hi all,

If you have feedback on this topic, we would love to hear your thoughts.

Thank you!
-Clint


Roman Fischer

Aug 3, 2023, 4:40:15 AM
to public

Dear All,

 

Maybe we should consider investigating how other security-relevant / regulated verticals handle incident reporting and how they see their way forward? As example verticals, we might look at health care, critical infrastructure, air traffic control… all of these also try to improve security/safety in their area through regulations around incident reporting.

 

Kind regards
Roman


Aaron Gable

Aug 3, 2023, 7:49:27 PM
to pub...@ccadb.org
Hi Clint,

I'm speaking here both as a member of the Let's Encrypt team (and I think we write pretty good incident reports), and as someone with a decade of experience in incident-response roles, including learning from the people who developed and refined Site Reliability Engineering at Google.

Fundamentally, the "Incident Reports" that CAs file in Bugzilla are the same as what might be called "Incident Postmortems" elsewhere. Quoting from The SRE Book, postmortems are "a written record of an incident, its impact, the actions taken to mitigate or resolve it, the root cause(s), and the follow-up actions to prevent the incident from recurring". And honestly, I think that the current set of questions and requirements gets pretty close to that mark: CAs are explicitly required to address the root cause, provide a timeline, and commit to follow-up actions.

There are many resources available regarding how to write good incident reports, and how to promote good "blameless postmortem" culture within an organization. For example, the Google SRE Workbook has an example of a well-written postmortem; PagerDuty and Google provide mostly-empty postmortem templates; the Building Secure and Reliable Systems book has a chapter on postmortems; and much more. I particularly love this checklist of questions to make sure you address in a postmortem.

Refreshing my memory of all of these, I find two big differences between the incident reports that we have here in the Web PKI, and those promoted by all of these other resources.

1. Our incident reports do not have a "lessons learned" section. We focus on what actions were taken, and what actions will be taken, but not on the whys behind those actions. Sure, maybe a follow-up item is "add an alert for circumstance X" but why is that the appropriate action to take in this circumstance? I believe that this is an easy deficiency to remedy, and have suggestions for doing so below.

2. We do not have a culture of blameless postmortems. This is much harder to resolve. Even though a given report may avoid laying the blame at the feet of any individual CA employee, it is difficult to remove the feeling that the CA is to blame for the incident as a whole, and removal from a trust store may be meted out as punishment for too many incidents. I cannot speak for other CAs, but here at Let's Encrypt we have carefully cultivated a culture of blamelessness when it comes to our incidents and postmortems... and even here, the act of writing a report to be publicly posted on Bugzilla is nerve-wracking due to fear of criticism and censure. I honestly don't know if there's anything we can do about this. The nature of the Web PKI ecosystem and the asymmetric roles of root programs and CAs are facts we just have to deal with. But at the very least I think it is important to keep this dynamic in mind.

So with all that said, I do have a few concrete suggestions for how to improve the incident report requirements.

1. Provide a template. Not just a list of (currently very verbose) questions, but a verbatim template, with markdown formatting characters (e.g. for headings) already included. This will make both writing and reading incident reports significantly easier, and remove much ambiguity. It will also have nice minor effects like establishing a standard format for the timeline. I'm more than happy to contribute the template that Let's Encrypt uses internally, and make changes / improvements to it based on my other feedback here.

2. Require an executive summary. Many of the best-written incident reports already include a summary at the top, because it provides just enough context for the rest of the report to make sense to a new reader.

3. Remove the "how you first became aware" question. This should be built into the timeline, not a question of its own. In my experience, this question leads to the most repetition of content in the report.

4. Require that the timeline specifically call out the following events:
- Any policy, process, or software changes that contributed to the root cause or the trigger
- The time at which the incident began
- The time at which the CA became aware of the incident
- The time at which the incident ended
- The time(s) at which issuance ceased and resumed, if relevant

5. Questions 4 (summary of affected certs) and 5 (full details) should be revamped. Having these as separate questions back-to-back places undue emphasis on the external impact of the incident, when what we care about much more is the internal impact on the CA going forward (i.e. what changes they're making to learn from and prevent similar incidents). The summary should be moved to directly below the Executive Summary, and turned into a more general "Impact" section -- how many certs, how many ocsp responses, how many days, whatever statistic is relevant can be provided here. The full details should be moved to the very bottom: the list of all affected certificates is usually an attachment, so this section should be an appendix.

6. Change question 6 to explicitly call for a root cause analysis. The current phrasing ("how and why the mistakes were made") lends itself to a blameful-postmortem culture. Instead, we should ask CAs to interrogate what set of circumstances combined to allow the incident to arise, and then what final trigger caused it to actually occur. This root cause / trigger approach is espoused by most of the postmortem guides I linked above.

7. There should be one additional question for "lessons learned". The three most common sub-headings here are "What went well", "What didn't go well", and "Where we got lucky". The first is very valuable in a blameless postmortem culture, because it allows the team to toot its own horn: be proud of the fact that the impact was smaller than it would have been if this other mitigation hadn't been in place, celebrate the fact that an early warning detection system caught it, etc. The second and third strongly inform the set of follow-up action items: everything that went wrong should have an action item designed to make it go right next time, and every lucky break should have an action item designed to turn that luck into a guarantee.

8. The action items question should also ask for what kind of action each is: does it detect a future occurrence, does it prevent a future occurrence, or does it mitigate the effects of a future occurrence? CAs should be encouraged (but not required) to include action items of all three types, with an emphasis on prevention and mitigation.

Okay, that ended up being more than a few. I also put together a rough-draft of my suggested template for people to look at and critique and improve.
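
To make the overall shape concrete, here is a minimal sketch of how those suggestions could fit together. The section names and ordering below are only illustrative, assembled from the points above; the rough-draft gist may differ in its details:

    # Incident Report: <short title>

    ## Executive Summary
    A paragraph of context: what happened, the rough scope of impact, and the current status.

    ## Impact
    Whatever statistics are relevant: number of certificates, OCSP responses, days affected, etc.

    ## Timeline (all times UTC)
    * YYYY-MM-DD HH:MM - policy/process/software change that contributed to the root cause or trigger
    * YYYY-MM-DD HH:MM - incident begins
    * YYYY-MM-DD HH:MM - CA becomes aware of the incident
    * YYYY-MM-DD HH:MM - issuance ceases (if relevant)
    * YYYY-MM-DD HH:MM - incident ends
    * YYYY-MM-DD HH:MM - issuance resumes (if relevant)

    ## Root Cause Analysis
    The set of circumstances that combined to allow the incident to arise, and the final trigger that caused it to occur.

    ## Lessons Learned
    ### What went well
    ### What didn't go well
    ### Where we got lucky

    ## Action Items
    Each item labeled as detect, prevent, or mitigate, with an owner and a due date.

    ## Appendix: Affected Certificates
    Full details, typically provided as an attachment.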

Finally, I have one last suggestion for how the incident reporting process could be improved outside of the contents of the report itself.

1. It would be great to automate the process of setting "Next-Update" dates on tickets. I feel like I've had several instances where I requested a Next-Update date four or five weeks in the future, but then didn't get confirmation that this would be okay until just hours before I would have needed to post a weekly update. If this process could be flipped -- the Next-Update date gets set automatically based on the Action Items, and weekly updates are only necessary if a root program manager explicitly unsets it and requests more frequent updates -- that would certainly streamline the process a bit.
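
Purely as an illustration of the kind of automation I mean (this is a sketch on my part: it assumes the standard Bugzilla REST API, and the "[next-update ...]" whiteboard marker is an invented convention, not anything that exists today), a small tool could derive the Next-Update date from the earliest outstanding action item and stamp it onto the bug:

    # Hypothetical sketch only: derive a "Next update" date from action-item due
    # dates and record it in a Bugzilla bug's Whiteboard field. Assumes the
    # standard Bugzilla REST API (GET/PUT /rest/bug/<id>) and an API key with
    # edit rights; the "[next-update ...]" marker is an illustrative convention.
    import re
    from datetime import date

    import requests

    BUGZILLA = "https://bugzilla.mozilla.org/rest"
    API_KEY = "..."  # placeholder

    def set_next_update(bug_id: int, action_item_due_dates: list[date]) -> None:
        # The next update is due when the earliest outstanding action item is due.
        next_update = min(action_item_due_dates)

        # Read the current whiteboard so any existing tags are preserved.
        resp = requests.get(
            f"{BUGZILLA}/bug/{bug_id}",
            params={"include_fields": "whiteboard", "api_key": API_KEY},
        )
        resp.raise_for_status()
        whiteboard = resp.json()["bugs"][0].get("whiteboard", "")

        # Replace any previous marker with the newly computed date.
        whiteboard = re.sub(r"\[next-update [^\]]*\]", "", whiteboard).strip()
        new_whiteboard = f"{whiteboard} [next-update {next_update.isoformat()}]".strip()

        resp = requests.put(
            f"{BUGZILLA}/bug/{bug_id}",
            params={"api_key": API_KEY},
            json={"whiteboard": new_whiteboard},
        )
        resp.raise_for_status()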

Apologies for the length of this email. I hope that this is helpful, and gives people a good jumping-off point for further discussion of how these incident reports should be formatted and what information they should contain to be maximally useful to the community.

Thanks,
Aaron

Aaron Gable

Aug 4, 2023, 2:33:21 PM
to pub...@ccadb.org
Apologies for double-posting, but I just wanted to let folks know that I've updated my gist to be a full rewrite of the incident reporting requirements page. It includes most of the existing verbiage about the purpose, filing timeline, and update requirements of reports, and preserves the Audit Incident Report section as-is. It then overhauls the Incident Report section to include a template and explicit instructions for filling out that template. I don't know if this is actually useful to the CCADB Steering Committee, but it seemed like the most succinct way to get all my thoughts in one place.

Thanks again,
Aaron

Paul van Brouwershaven

Aug 7, 2023, 3:36:47 AM
to pub...@ccadb.org, Aaron Gable
Thanks for your contributions to this, Aaron; this is very valuable input!

Should we consider adding a topic in the action items regarding the effectiveness of requirements (such as those of the root programs, the CA/Browser Forum, and ETSI) in averting incidents like this? Are there oversights, potential areas for improved clarity in language, or additional requirements that warrant consideration?

While CAs are required to monitor Bugzilla incidents, ensuring that the requirements are clear and conclusive could help prevent similar incidents within the ecosystem, especially given that past incidents may not be readily apparent to new CAs or staff, and it is hard to look back through all historic incidents.

Paul




Antonios Chariton

Aug 7, 2023, 10:00:47 AM
to pub...@ccadb.org, Aaron Gable
Thanks for the great content, Aaron! I agree on every point, and thank you for making such detailed suggestions.

I’d like to expand a little bit on the reporting decision itself, as I think this is potentially the greatest blind spot: the question of when an incident should be filed and when it shouldn’t.

There is legislation in the EU as well as in other countries that mandates reporting of security incidents. However, in reality most companies choose not to do that. If, for example, the Data Protection Authorities received an e-mail every time a data breach happened, it would be a massive undertaking. From talking to these authorities, as well as to various companies that are subject to these reporting requirements, it’s clear that what you say, Aaron, is true.

If you are the “good person” and you report everything, while your competition keeps everything hidden, you are rarely rewarded, and usually punished. And we’re talking about law here, which comes with stricter punishments than removal from a Root Program.

With that in mind, I think a culture of fear won’t help. CAs shouldn’t be afraid to file incidents, but at the same time they must own up to their mistakes. It’s a difficult balancing act, and every member of this community contributes to it constantly. At the end of the day, though, the Root Programs are responsible for protecting their users, so we will inevitably arrive at a point where a CA must be removed. That should be fine, as long as it is fair and the reasons are well understood.

Personally, I believe that a CA that keeps incidents hidden, downplays their severity, or in general isn’t being honest is far more dangerous to users than one that keeps making mistakes but shows clear signs of owning them, learning from them, truly fixing them, and investing in a culture of improvement and engineering excellence. As a Root Program Manager, I would take a CA I know any day over one I know nothing about and have no real visibility into, other than a couple of JPEGs once a year. Root Programs have only a tiny amount of signal to make decisions with, and CAs have the power to deprive them of a lot of it if they want to.

I wouldn’t ever rely on the metric “X incidents reported in QY” to make an inclusion or removal decision. I would focus on qualitative metrics across the entire operation: how many incidents are self-reported vs. externally reported, how open was the CA, was the issue really fixed, were they aware of their role and requirements, was the report factual and accurate, did they respond quickly, was it a common mistake anyone could make, did they take ownership and clean up and mitigate the fallout in a timely manner, etc. Eventually it comes down to the question of whether I trust CA X to have the best interests of my users in mind and to keep them secure, and whether I think they’re competent enough to do it. Obviously, you also need to factor in the value each one adds, without ever allowing any CA to become Too Big To Fail. After all, there have been zero-strike removals in the past.

If the Root Programs agree on that, and this is clear, and their actions are governed by such “principles” where transparency is clearly valued more than X, Y, or Z, then I think we can make more progress. Otherwise it will be a whack-a-mole and word / meaning twisting game where the responsible CAs will keep filing incidents, even if they could have gotten away with them, and the others will not.

Regarding your point about careful wording, this is true, but it’s again part of the balancing act. If a CA changes the story three times in a report, and it’s a new story every time they get backed into a corner with their claims, then no matter how much goodwill you want to show, you can’t help but entertain the thought of something else going on there. And there’s no real way of figuring out which story is true without upsetting this balance. I understand that Root Programs have to apply more pressure in certain situations, but I’d like to think it’s done to extract more signal for their decision makers, and not to make anyone’s life miserable or day worse. In my mind, CAs should focus on producing high-quality, factual, timely, and transparent reports. If that’s the case, things will move forward productively, and we won’t fall into a cat-and-mouse game. It is difficult, I know, and a few mistakes are okay; we’re all human after all.

As a side note, I also understand that, especially in the US for example, companies will need to vet every communication, possibly having it reviewed by legal counsel, which only adds delays and more “censorship” layers. And I guess very few lawyers would advise their clients to publicly admit guilt and fault during a commercial activity that could impact their customers. For this, I think most of the time it’s just not explained properly to them:

I view Root Programs as companies and CAs as vendors. Every company has to do vendor assessments during onboarding (inclusion) as well as periodically. They also need to set a number of requirements in their RFP (Root Program Requirements), and every vendor that wants to sell their product to this company has to comply with them. And of course, it’s up to the company to ensure that all its vendors still check all the boxes, including any new ones that have to be added: sometimes by sending a questionnaire, other times by looking at an audit result / certification, and other times by observing the system directly. Otherwise, if they can no longer trust a vendor with their business, they have to shop for alternatives. If a lawyer is made aware of this relationship, I think they can just go to the right playbook and figure it out. It’s always important to explain what you want to do, why, and its importance; otherwise the answers will be wrong, or just “no”. If I ask someone whether X adds risk, they’ll say “yes”, regardless of X. If I ask someone whether we can take the risk of X in order to unlock Y, then it’s a completely different answer.[1]

To summarize some of my points, I think it’s important to ensure that there is a relationship based on trust from both sides, which I understand takes time to build, and until then we’ll need to come up with a better-defined list of what is and what isn’t report-worthy. Let’s work on this in an evolutionary, not a revolutionary, manner, with small iterations until we fine-tune it. I’d personally err on the side of more reports rather than fewer for now, and we can then analyze the data and figure out what the right next step is.

We should of course always be mindful of the load on the CAs, as most of them aren’t entities with unlimited money and resources. There’s even a non-profit one! ;) We need to set requirements that provide the Root Programs with as much signal as possible, without making the trust store an exclusive pool of companies with deep pockets and dedicated “sales engineers”. There is, however, definitely a minimum cost if you want to “sell to {Mozilla, Apple, Google, Microsoft}”.

Thanks,
Antonios

- - - - -
Footnote
- - - - -

1: There is a corner case here: Mozilla. My take on this is the following: Apple, Google, Microsoft, etc. are for-profit companies that can afford to pay FTEs to work on vendor assessments. Mozilla is an open source / community-driven organization which can afford *some* FTEs but also accepts contributions from anyone who follows their guidelines and rules and adds value. It’s the same with code: I can’t contribute patches to iOS or Gmail or Windows, but I can contribute code to Firefox. However, that means that Firefox must be open source, people must have access to the bug tracker, and so on. It’s still a vendor assessment, however, just one done collaboratively by Mozilla staff and external contributors, and due to its nature, it’s done publicly.

Chris Clements

Sep 8, 2023, 10:22:40 AM
to Antonios Chariton, pub...@ccadb.org, Aaron Gable
TL;DR: The CCADB Steering Committee will update the incident reporting format based on several suggestions from this thread. Root Stores that are members of the CCADB may update individual Root Store policies to require adherence to this format.

All,

Thank you for the detailed and actionable feedback! The CCADB Steering Committee has discussed the suggestions in this thread and plans to implement several proposed changes as they will help achieve the original goal of making incident reports more useful and effective. In the future, you can expect to see the following:
  • A revised format and template for incident reports (leaning heavily into the suggestions by Aaron Gable in his gist [1])
  • A requirement that initial incident reports be created as soon as possible, but no later than 72 hours after becoming aware of the incident
  • Clarification for when incident reports should be updated (signaled by the use of “Next update” in the ‘Whiteboard’ field by Root Stores)
  • Guidance for responding to questions
  • A revised format and template for audit incident reports and specification that this report does not need to be created if the audit finding(s) were related to a previously created incident report
  • Root Store use of “[external]” in the ‘Whiteboard’ field to signal that the incident report was created by a third-party (rather than self-reported by the CA Owner)
With the exception of the proposed use of the “[external]” tag in the Whiteboard field, these updates are captured in this PR [2]. The Incident Reports page on ccadb.org is planned to be updated on October 17, 2023. A separate message will be posted in this group and via CCADB when the updated page is live.

Some Root Store policies may be updated to require adherence to this common format, or to reduce redundant requirements language across separate policy documents (i.e., a Root Store’s policy and the CCADB policy). Otherwise, this updated format will be highly encouraged and may be referenced in future incident reports. Again, we appreciate the thoughtful and detailed suggestions to make incident reports better.

Thank you
-Chris, on Behalf of the CCADB Steering Committee

[1] https://gist.github.com/aarongable/78167fc1464b6a8a0a7065112ac195e9
[2] https://github.com/mozilla/www.ccadb.org/pull/131/files

Chris Clements

Oct 17, 2023, 12:23:57 PM
to Antonios Chariton, pub...@ccadb.org, Aaron Gable

All,


Thanks again for the thoughtful and detailed suggestions. 


The Incident Reports page on ccadb.org [1] has been updated with the intention of making incident reports more useful and effective. Root Stores that rely on the CCADB may update their individual Root Store policies to require adherence to this updated format. Otherwise, use of this updated format should be considered highly encouraged. 


This same message will be sent via CCADB mass email to CA Owners included in the CCADB.


Thank you

-Chris, on behalf of the CCADB Steering Committee


[1] https://www.ccadb.org/cas/incident-report

