--
You received this message because you are subscribed to the Google Groups "Jenkins Infrastructure" group.
To unsubscribe from this group and stop receiving emails from it, send an email to jenkins-infr...@googlegroups.com.
To view this discussion on the web, visit https://groups.google.com/d/msgid/jenkins-infra/7ed0ecff-b803-4da4-a9ac-5dd5337fbbd6%40googlegroups.com.
On Jun 2, 2020, at 2:25 PM, Slide <slide...@gmail.com> wrote:
To view this discussion on the web, visit https://groups.google.com/d/msgid/jenkins-infra/CAPiUgVe1mYcn3nSRxrxa%2BswFOqnzSWCPVnwaVL%2BeEL%3DMft%2B9eA%40mail.gmail.com.
#hugops

Thanks for all the efforts to get things back online and the detailed post on what happened.

Tracy
On Tue, Jun 2, 2020 at 5:25 PM Slide <slid...@gmail.com> wrote:
Olivier,

Big shout out to you for all your work on the infrastructure. I know often with infrastructure, you only hear about the negative things that people say, but I want to thank you for your tireless efforts in keeping things up and running.

Thanks for all you do!

Alex
On Tue, Jun 2, 2020 at 2:05 PM Olivier Vernin <vern...@gmail.com> wrote:
Hi,

A bit of context for today's outage:
For several weeks, we have experienced issues with the Kubernetes cluster that we are using.
Today, while I was migrating resources from Helm v2 to Helm v3, the cluster became really unstable, so I decided to upgrade it to the latest stable version, hoping that the upgrade would restart every node and that the instabilities would be solved in one of the later versions.
In fact, it made things even worse: initially I had issues manipulating resources on that cluster, but then every service stopped working even though all resources were still running there.
I upgraded from version 1.15.9 to 1.15.10, then to 1.16.9, but everything was still broken.
After multiple attempts to restore services without losing the cluster, I decided to recreate it.

Reinstalling everything from scratch highlighted multiple issues:

* LDAP database backups stopped in February 2020, which means that we lost three months of LDAP changes.
* We couldn't reuse the Azure public IP as it was no longer compatible with the default load balancer SKU, which implied recreating the IP and then updating DNS.
* Let's Encrypt certificates couldn't be generated for www.jenkins.io and plugins.jenkins.io as they are now handled on Fastly.
* As I was using RBAC and Helm v3, I had several authorization issues with various Azure resources, such as public IP or load balancer creation.

Most of our websites were affected, directly or indirectly, because of LDAP dependencies.

Now most services are back as usual. I'll take more time later on to investigate how to better communicate and avoid this outage in the future.
Cheers
--
Website: http://earl-of-code.com