Postmortem [ci.jenkins.io, plugins.jenkins.io, wiki.jenkins.io] - 2019-12-17

Olblak

Dec 18, 2019, 4:58:42 AM
to jenkin...@lists.jenkins-ci.org, Jenkins Developers ML
Hi,

Before going into what went wrong, here is some context.

Yesterday I was working on the Jenkins-infra/azure repository, where multiple things needed to be done,
mainly updating DNS records related to INFRA-1797, including jenkins.io, among other things.

What I missed while reviewing the changes is that two months ago I manually updated the ci.jenkins.io VM size to have 32GB of RAM without changing the Terraform code.
So yesterday ci.jenkins.io was downsized to 16GB by accident, which led to:
* The VM was restarted
* The Jenkins process couldn't start because it didn't have enough memory available

To fix this issue, I updated the terraform code and then re-applied it
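
To illustrate the kind of drift described above, here is a minimal sketch, not the actual tooling we use. It compares what the code declares with what is really deployed and reports any mismatch before a change is applied. The attribute name `memory_gb` and the dictionaries are hypothetical; only the 16GB and 32GB values come from the incident.

```python
def detect_drift(declared, actual):
    """Return {attribute: (declared_value, actual_value)} for every mismatch."""
    keys = declared.keys() | actual.keys()
    return {
        key: (declared.get(key), actual.get(key))
        for key in sorted(keys)
        if declared.get(key) != actual.get(key)
    }


# Hypothetical view of the VM: what the code says vs. what was changed by hand.
declared = {"name": "ci.jenkins.io", "memory_gb": 16}
actual = {"name": "ci.jenkins.io", "memory_gb": 32}

drift = detect_drift(declared, actual)
for attribute, (want, have) in drift.items():
    print(f"{attribute}: code says {want!r}, live resource has {have!r}")
```

Any reported drift means that applying the code as-is would revert the manual change, which is what happened here.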

The second issue, which happened at the same time, is due to the way we define our DNS records.
We use a 'hack' in Terraform to get loops, and Terraform doesn't correctly keep track of the different resources. So when we add or delete a DNS record, it also deletes and recreates other DNS records. If for some reason something goes wrong before a record is re-created, we just lose that DNS record, and this is what happened to wiki.jenkins.io.
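
A small simulation of why this happens, assuming the 'hack' addresses each record by its position in a list (as a `count`-style loop does). It is a sketch of the addressing problem only, not of Terraform itself; the record names and the `dns_record` address prefix are made up for illustration.

```python
def plan(old, new):
    """Compare two {address: value} states and list what would change."""
    return {
        "create": sorted(k for k in new if k not in old),
        "replace": sorted(k for k in new if k in old and old[k] != new[k]),
        "delete": sorted(k for k in old if k not in new),
    }


def by_index(records):
    """Address each record by its position in the list."""
    return {f"dns_record[{i}]": name for i, name in enumerate(records)}


def by_name(records):
    """Address each record by its own name."""
    return {f'dns_record["{name}"]': name for name in records}


# Hypothetical record list; "beta" is removed from the middle.
before = ["alpha", "beta", "gamma", "delta"]
after = ["alpha", "gamma", "delta"]

index_plan = plan(by_index(before), by_index(after))
name_plan = plan(by_name(before), by_name(after))

print("keyed by index:", index_plan)
print("keyed by name: ", name_plan)
```

With index-based addresses, removing one record shifts every later record to a new address, so unrelated records get replaced. With name-based addresses, only the removed record is touched.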

So what could we do better?

* plugins.jenkins.io should generate its data on its own and not have a strong dependency on ci.jenkins.io.
  I would be happy to discuss it with someone willing to contribute to that service.
* DNS records: we have to test whether the loop mechanism introduced in Terraform 0.12 correctly handles the different resources generated from an array.
* wiki.jenkins.io: we should get rid of that service.

Cheers

Oleg Nenashev

Dec 18, 2019, 5:47:45 AM
to Olblak, Jenkin...@lists.jenkins-ci.org, Jenkins Developers ML
+1 for making the plugin site self-sufficient (read as: depends only on the update center). The wiki is being slowly migrated, including plugin docs and other foundation documentation; contributions will be much appreciated.

Regarding the outage, it looks like ci.jenkins.io still does not build all components (see changelog PRs for jenkins.io)


Olblak

Dec 18, 2019, 7:41:29 AM
to Oleg Nenashev, Jenkin...@lists.jenkins-ci.org, Jenkins Developers ML
> Regarding the outage, it looks like ci.jenkins.io still does not build all components (see changelog PRs for jenkins.io)

To me, it seems to be working fine, so if you see an error, feel free to share it.


---
gpg --keyserver keys.gnupg.net --recv-key 52210D3D
---

Marky Jackson

Dec 18, 2019, 8:55:37 AM
to JenkinsCI Developers, jenkin...@lists.jenkins-ci.org
I would be willing to contribute, but the correct access will need to be granted for testing.
Previously I wanted to onboard and help out, but access to much of the infrastructure was limited, so I ended up paying for my own infra to test, and that became a financial burden.
So if we can figure that out, I can more than help, given my knowledge in this area.
Thanks kindly.

Olblak

Dec 18, 2019, 9:47:52 AM
to Jenkins Developers ML
> I would be willing to contribute but the correct access will need to be granted for testing.
I don't have permission to invite any additional users to the current Azure account at the moment (I am working on it), but that's not needed for working on this; Minikube is more than enough.

Timja created a Helm chart which can be deployed on Minikube, with instructions located here.
One of the application settings defines where to fetch data, as defined here.
That data is generated on ci.jenkins.io, as defined here.
Now that data should be generated from the plugin site instead of ci.j.io, but the 'how' depends on which approach we pick:

* Run the cronjob from a cron resource
* Run the cronjob from the plugin-site-api


Oleg Nenashev

Dec 18, 2019, 11:47:42 AM
to Marky Jackson, JenkinsCI Developers, Jenkin...@lists.jenkins-ci.org

Slide

Dec 18, 2019, 12:42:04 PM
to Jenkins Developer List, Marky Jackson, Jenkin...@lists.jenkins-ci.org
I had a similar issue that I flagged with the repository-permission-updater. The permissions for jenkinsadmin were removed from the whole jenkins-infra org, so the repo scan was not working correctly. 

Olblak

Dec 19, 2019, 4:07:51 AM
to Jenkins Developers ML
Alex raises another important change that I made yesterday (not related to the outage).

The IRC bot (jenkinsadmin) was using a GitHub account with owner permission on the jenkins-infra organization, which means admin permission on every repository, so I reduced that permission to only jenkins-infra/jenkins-infra.

Apparently that GitHub account was also used by our Jenkins instances to synchronize with GitHub repositories and needed at least write permission, so I changed that for every repository that needs to be synced from ci.jenkins.io.

That GitHub account may be used somewhere else that I am not aware of, so keep that in mind.


Mark Waite

Dec 19, 2019, 4:16:07 AM
to jenkinsci-dev
Have the CI jobs for jenkins.io started running again? They were offline all day yesterday with a permission error in the scan log.

Olblak

Dec 19, 2019, 4:44:00 AM
to Jenkins Developers ML
They should be. I updated the jenkins-admin permission this morning, then triggered a repository scan, and it was working.

Mark Waite

Dec 19, 2019, 5:45:23 AM
to jenkinsci-dev
Thanks!  They are working.



--
Thanks!
Mark Waite