Dear All,
I would like to do a quick postmortem for yesterday’s Jenkins weekly release outage, which lasted about 6h30. The weekly release Jenkins 2.263 was listed on www.jenkins.io as available but could not be downloaded.
Since April 2020, the weekly release has been fully automated and is triggered every Tuesday by this job.
It runs two Jenkins jobs from a specific Jenkins instance:
1) Release the new version to our maven repository
2) Build distribution packages using the jenkins.war from the maven repository, then update our mirror infrastructure
Yesterday, the second stage failed on the Windows packaging step, which resulted in no distribution packages being published at all.
But because the first job had already published the new version to our maven repository, every Jenkins instance was notified that a new weekly version was available. And because we didn't update our mirror infrastructure, nobody was able to fetch the update. It took us 6h30 to fix it. Fortunately, the second stage is pretty quick, about 15 minutes versus the 2h needed for the first stage, so we reran the job without Windows packaging.
Remark: At the moment, the Windows package is still not published due to a Windows issue in the infrastructure.
This outage reminded us that we still have work to do and help is definitely more than welcome :)
Issues
* [INFRA-2538] -> To fix the Windows packaging issue
* We wrote a Python script to detect the latest version from maven-metadata.xml. For some reason, the metadata file we rely on still references the previous weekly release 2.262, while all the other maven-metadata.xml files are correct. :/
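For readers unfamiliar with maven-metadata.xml, here is a minimal sketch of how the latest version can be read from such a file. It is not the script mentioned above. The sample XML is made up for illustration, and the function names are hypothetical. It picks the highest entry under `<versions>` rather than trusting a single field.

```python
import xml.etree.ElementTree as ET

# Made-up sample for illustration only; real metadata files are much longer.
SAMPLE_METADATA = """<metadata>
  <groupId>org.jenkins-ci.main</groupId>
  <artifactId>jenkins-war</artifactId>
  <versioning>
    <latest>2.263</latest>
    <release>2.263</release>
    <versions>
      <version>2.261</version>
      <version>2.262</version>
      <version>2.263</version>
    </versions>
  </versioning>
</metadata>
"""


def version_key(version):
    # Compare numerically so that 2.100 sorts after 2.99.
    # Non-numeric parts (for example a qualifier) are ignored in this sketch.
    return tuple(int(part) for part in version.split(".") if part.isdigit())


def latest_version(metadata_xml):
    """Return the highest version listed under <versioning>/<versions>."""
    root = ET.fromstring(metadata_xml)
    versions = [
        node.text.strip()
        for node in root.findall("./versioning/versions/version")
        if node.text
    ]
    if not versions:
        raise ValueError("no <version> entries found in metadata")
    return max(versions, key=version_key)


if __name__ == "__main__":
    print(latest_version(SAMPLE_METADATA))
```

Running the same function against two metadata files and comparing the results would be one way to spot a file that is lagging behind the others.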
Monitoring
6h30 is way too long to detect such an issue. Fatigue and habit are a thing, and we must detect when something goes wrong as fast as possible.
[INFRA-2027] -> I started working on a Python script that we could use with Datadog, but I haven't had the time to finish it yet.
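To make the idea concrete, here is a rough sketch of the kind of check such a script could run: given the version announced in maven, verify that every distribution package for that version can actually be downloaded. This is not the INFRA-2027 script. The URL templates and function names are hypothetical placeholders, and the Datadog reporting part is left out entirely.

```python
import urllib.request

# Hypothetical URL templates; the real mirror layout would need to be filled in.
PACKAGE_URLS = {
    "war": "https://mirror.example.org/war/{version}/jenkins.war",
    "windows": "https://mirror.example.org/windows/{version}/jenkins.msi",
}


def url_is_available(url, timeout=10):
    """Return True if a HEAD request on the URL answers with a 2xx status."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:
        # URLError, HTTPError and timeouts are all subclasses of OSError.
        return False


def missing_packages(version, package_urls, is_available=url_is_available):
    """Return the sorted names of packages that cannot be fetched for this version."""
    return sorted(
        name
        for name, template in package_urls.items()
        if not is_available(template.format(version=version))
    )


if __name__ == "__main__":
    missing = missing_packages("2.263", PACKAGE_URLS)
    if missing:
        print("ALERT: packages not downloadable:", ", ".join(missing))
    else:
        print("OK: all packages downloadable")
```

The availability check is passed in as a parameter so that the logic can be exercised without network access, and swapped for whatever probe the monitoring system prefers.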
Artifact Promotion
While it would not have solved the current problem, we could have published the maven release to a temporary maven repository and only promoted the artifacts once every distribution package was available.
That way, people would not have been notified. On the other hand, considering that we mainly rely on people monitoring, this would probably have delayed the release even more. We already have that logic in place, as it's needed for the security release anyway; we just have to agree on a staging repository.
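As an illustration of the promotion gate described above, here is a small sketch, assuming a hypothetical list of required packages. It only expresses the decision "promote or not"; the actual copy from a staging repository to the release repository is out of scope.

```python
# Hypothetical package list; the real set of distribution packages may differ.
REQUIRED_PACKAGES = frozenset({"war", "debian", "redhat", "windows"})


def can_promote(built_packages, required=REQUIRED_PACKAGES):
    """Return (ok, missing): ok is True only if every required package was built."""
    missing = sorted(set(required) - set(built_packages))
    return (not missing, missing)


if __name__ == "__main__":
    ok, missing = can_promote({"war", "debian", "redhat"})
    if ok:
        print("All packages available, artifacts can be promoted")
    else:
        print("Promotion blocked, missing packages:", ", ".join(missing))
```

With a gate like this, a failure in one packaging step would keep the maven release in staging instead of announcing a version that nobody can download.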
We’ll be working on those improvements and will share our progress as they become available.
Cheers,
Olblak