(Apologies if this triple or quadruple posts. There appear to be some
hiccups somewhere along the line between my mail server and the m.d.s.p.
mail server and the Google Groups reflector)
I've recently shared some choice words with several CAs over their Incident
Reporting process, highlighting to them how their approach is seriously
undermining trust in their CA and its operations.
There is Guidance on the minimum expectations for Incident Reports, and
while it includes several examples of reports that are considered great
responses, it seems there's still some confusion about the underlying
principles of what makes a good incident report.
These principles are touched on in "Follow-up Actions", which was
excellently drafted by Wayne and Kathleen, but I thought it might help to
capture some of the defining characteristics of a good incident report.

1) A good incident report will acknowledge that there's an issue

While I originally thought about calling this "blameless", I think that
might still trip some folks up. If an Incident happens, it means
something's gone wrong. The Incident Report is not about trying to figure
out who to blame for the incident.
For example, when the Incident is expressed as, say, "The Validation
Specialist didn't validate the fields correctly", that's a form of blame.
It makes it seem that the CA is trying to deflect the issue, pretending it
was just a one-off human error rather than trying to understand why the
system placed such huge dependency on humans.
However, this can also manifest in other ways. "The BRs weren't clear about
this" is, in some ways, a form of blame and trying to deflect. I am
certainly not trying to suggest the BRs are perfect, but when something
like that comes up, it's worth asking what steps the CA has in place to
review the BRs, to check interpretations, to solicit feedback. It's also an
area where, for example, better or more comprehensive documentation about
what the CA does (e.g. in its CP/CPS) could have caused, during the CP/CPS
review or other engagement, the community to recognize the BRs weren't
clear, and that the implemented result wasn't the intended result.
In essence, every Incident Report should help us learn how to make the Web
PKI better. Dismissing things as one offs, such as human error or
confusion, gets in the way of understanding the systemic issues at play.

2) A good incident report will demonstrate that the CA understands the
issue, while also providing sufficient information so that anyone else can
understand the issue

A good incident report is going to be like a story. It's going to start
with an opening, introducing the characters and their circumstances. There
will be some conflict they encounter along the way, and hopefully, by the
end of the story, the conflict will have been resolved and everyone lives
happily ever after. But there's a big difference between reading a book
jacket or review and reading the actual book - and a good incident report
is going to read like a book, investing in the characters and their story.
Which is to say, a good incident report is going to have a lot more detail
than just finding out who the actors are. This ties in closely with the
previous principle here; a CA that blames it on human error is not one that
seems like they're acknowledging or understanding the incident, while a CA
that shares the story of what a day in the life of a validation agent looks
like, and all the places for things to go wrong or which could have been
automated or improved really shows they "get" it: that being a validation
agent is hard, and we should all do everything we can to make it easier for
them to do their jobs.
This is the principle behind the template's questions about timelines and
details: trying to express that the CA needs to share the story about what
happened, when it happened, where things went wrong, and why, at many
layers. A timeline that only captures when the failure happened is a bit
like saying that the only thing that happens in "Lord of the Rings" is
"Frodo gets rid of some old jewelry".

3) A good incident report will identify solutions that generalize for CAs

The point of incident reports is not to drag CAs or to call them out. It's
to identify opportunities, as an industry, that we can and should be
improving. A good incident report is going to look for and identify
solutions that can and do generalize. While it's absolutely expected that
the CA should fix it for themselves, asking what can or should systemically
be done is just as important.
This is the difference between saying "We'll be evaluating whether we use
[data source X]" and saying "We'll be publishing our allowlist of data
sources that we use". It's implementing linters, if that's something that
can be done. It's about sharing the full details of what you're doing, so
if other CAs wanted to (or were required to!) implement something similar,
they could learn from the CA and the incident report about what works and
what doesn't work.
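
As a rough illustration of what "implementing linters" can look like in
practice, here is a minimal sketch of a pre-issuance check. Everything in it
is hypothetical: the CertProfile fields, the two example rules, and the
max_days value are placeholders chosen for illustration, not a statement of
any CA's system or of any specific policy limit. The point is only that each
rule is a small function that can be shared and reused by other CAs.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class CertProfile:
    """Hypothetical, simplified view of a to-be-signed certificate."""
    common_name: str
    san_dns_names: List[str] = field(default_factory=list)
    validity_days: int = 0


# A lint returns an error message, or None when the check passes.
Lint = Callable[[CertProfile], Optional[str]]


def lint_cn_in_san(cert: CertProfile) -> Optional[str]:
    """Example rule: a commonName, if present, should also be a SAN entry."""
    if cert.common_name and cert.common_name not in cert.san_dns_names:
        return "commonName is not present in subjectAltName"
    return None


def make_validity_lint(max_days: int) -> Lint:
    """Build a rule that rejects validity periods longer than max_days.

    max_days is a placeholder supplied by the caller.
    """
    def lint(cert: CertProfile) -> Optional[str]:
        if cert.validity_days > max_days:
            return f"validity of {cert.validity_days} days exceeds {max_days}"
        return None
    return lint


def run_lints(cert: CertProfile, lints: List[Lint]) -> List[str]:
    """Run every lint and collect the error messages of those that fail."""
    errors = []
    for lint in lints:
        message = lint(cert)
        if message is not None:
            errors.append(message)
    return errors
```

Running such checks before signing, and refusing to issue when the returned
list is non-empty, is one way to reduce the dependency on a human catching
the mistake.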

4) A *great* incident report will actually take the steps to generalize it
for all CAs.

This might mean starting discussions on m.d.s.p. about how to solve it via
policy. It might mean proposing actual changes to the BRs or EVGs - as in,
writing ballots, not just suggesting "someone" should do it. It's about
investing the time and energy to make the ecosystem better, more
transparent, more accountable, and more secure. It might even mean looking
through CT for other CAs that have the issue, and reporting that as well!
The primary goal of Incident Reports is not about score keeping. It's not
about saying who has the most incidents. It's about understanding the
challenges and actually working to improve them. All CAs are accountable
for their actions, and so yes, it does mean that there may be multiple
simultaneous incidents, from separate CAs, for the same issue. That's why
understanding these principles is so important: we should be collaborating
to build and systematize this knowledge.
If a CA keeps having issues, that's going to be a huge red flag. The best
thing that CA can do, when finding they're repeatedly having issues, is to
try to push the boundaries forward on Incident Reporting and the ecosystem.
If their failures help make the Web better, that's a huge benefit to the
ecosystem, and can significantly factor into how the incidents are evaluated.
Yet if their incident reports are just meeting the bare minimum -
delayed, lacking information, argumentative, dismissive of issues, not
building solutions but instead layering on workarounds for JUST the issue
noticed - then they're dragging the whole ecosystem down, and creating a
Wiki to track those sorts of systemic failures may be the right thing to do.
That's not meant as a threat; it's meant to make it very clear
that the most important thing about the Incident Report is not who had it,
but how the Web PKI ecosystem improved as a result of it. If a CA learns
from its mistakes and helps us all improve - with concrete changes,
reusable technology, clearer requirements - then I'd much rather have that
Incident Report than have none at all. As strange as it sounds, a
good incident report _should_ be a competitive advantage, because it's a
chance to show the CA can learn from, improve, and lead the Web PKI.
If you read the example Incident Reports, you'll see they are great examples of just
that. Sometimes they didn't hit everything right out the door, but by the
end of the report, you'll find these principles are all at play. More great
examples like that make a huge difference.