Today's outage


Olivier Vernin

Jun 2, 2020, 5:05:50 PM
to Jenkins Infrastructure
Hi,

Some context on today's outage.

For several weeks, we have been experiencing issues with the Kubernetes cluster that we use.
Today, while I was migrating resources from Helm v2 to Helm v3, the cluster became really unstable, so I decided to upgrade it to the latest stable version, hoping that the upgrade would restart every node and that the instabilities would be fixed in one of the later versions.
In fact, it made things worse: initially I only had trouble manipulating resources on the cluster, but then every service stopped working, even though all resources were still running there.

I upgraded from version 1.15.9 to 1.15.10, then to 1.16.9, but everything was still broken.
After multiple attempts to restore services without losing the cluster, I decided to recreate it.

Reinstalling everything from scratch highlighted multiple issues:

* The LDAP database backup stopped in February 2020, which means that we lost three months of LDAP changes.
* We couldn't reuse the Azure public IP, as it was no longer compatible with the default load balancer SKU, which meant recreating the IP and then updating DNS.
* Let's Encrypt certificates couldn't be generated for www.jenkins.io and plugins.jenkins.io, as they are now handled on Fastly.
* As I was using RBAC and Helm v3, I had several authorization issues with various Azure resources, such as public IP or load balancer creation.
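
On the LDAP backup point in particular, here is a minimal sketch of a freshness check that could flag a backup job that has silently stopped. It is only an illustration: the function names and the one-day threshold are assumptions, and it does not describe how the Jenkins infrastructure backups are actually produced or monitored. It takes a list of backup timestamps (however they are collected) and reports whether the newest one is too old.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional


def newest_backup_age(backup_times: Iterable[datetime], now: datetime) -> Optional[timedelta]:
    """Return the age of the most recent backup, or None if there is no backup at all."""
    latest = max(backup_times, default=None)
    if latest is None:
        return None
    return now - latest


def backup_is_stale(
    backup_times: Iterable[datetime],
    now: datetime,
    max_age: timedelta = timedelta(days=1),  # assumed threshold, adjust to the backup schedule
) -> bool:
    """Return True when there is no backup, or the newest one is older than max_age."""
    age = newest_backup_age(backup_times, now)
    return age is None or age > max_age
```

Such a check could run on a schedule and raise an alert when it returns True, so that a stalled backup is noticed within a day rather than months later.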

Most of our websites were affected, directly or indirectly, because of their LDAP dependencies.

Most services are now back to normal. I'll take more time later on to investigate how to better communicate and how to avoid this kind of outage in the future.

Cheers

Slide

Jun 2, 2020, 5:25:39 PM
to jenkin...@googlegroups.com
Olivier,

Big shout out to you for all your work on the infrastructure. I know often with infrastructure, you only hear about the negative things that people say, but I want to thank you for your tireless efforts in keeping things up and running. 

Thanks for all you do!

Alex

--
You received this message because you are subscribed to the Google Groups "Jenkins Infrastructure" group.
To unsubscribe from this group and stop receiving emails from it, send an email to jenkins-infr...@googlegroups.com.
To view this discussion on the web, visit https://groups.google.com/d/msgid/jenkins-infra/7ed0ecff-b803-4da4-a9ac-5dd5337fbbd6%40googlegroups.com.


Marky Jackson

Jun 2, 2020, 5:26:36 PM
to jenkin...@googlegroups.com
Strong +1
What you do is nothing short of amazing and I super appreciate you.


Tracy Miranda

Jun 2, 2020, 5:26:49 PM
to jenkin...@googlegroups.com
#hugops
Thanks for all the efforts to get things back online and the detailed post on what happened.

Tracy

Oleg Nenashev

Jun 3, 2020, 7:09:32 AM
to Jenkins Infrastructure
Thanks a lot to Olivier, Tim, Daniel and all the other contributors who worked on it yesterday!
It is great to see that most of the infra is back to normal.

* Ldap database backup stopped in February 2020 which means that we lost three months of ldap changes.
Any chance we have an audit log for the last 3 months?
It would be great to have a list of users who may have been affected, and to communicate with them directly.


Oleg Nenashev

Jun 9, 2020, 11:27:14 AM
to Jenkins Infrastructure