Hi,
Before going into what went wrong, here is some context.
The changes were mainly DNS record updates related to INFRA-1797, including
jenkins.io, along with a few other things.
What I missed while reviewing the changes is that two months ago I manually updated
the ci.jenkins.io VM size to 32GB of RAM without changing the Terraform code.
So yesterday,
ci.jenkins.io was downsized to 16GB by accident, which led to:
* The VM was restarted
* The Jenkins process couldn't start because it didn't have enough memory available
To fix this issue, I updated the Terraform code and then re-applied it.
The second issue, which happened at the same time, is due to the way we define our DNS records.
We use a 'hack' in Terraform to get loops. Terraform doesn't correctly keep track of the different resources, so when we add or delete a DNS record, it also deletes and recreates other DNS records. If for some reason something goes wrong before a record is re-created, we simply lose that DNS record, and this is what happened to
wiki.jenkins.io.
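
To make this failure mode concrete, here is a small Python sketch. It is not our Terraform code: it assumes the 'hack' tracks each record by its position in a list, and old-service.jenkins.io is a made-up record name used only for illustration.

```python
def plan_by_index(before, after):
    """Plan changes when each record is tracked by its position in the list."""
    plan = []
    for i in range(max(len(before), len(after))):
        old = before[i] if i < len(before) else None
        new = after[i] if i < len(after) else None
        if old == new:
            continue  # same record at the same position: nothing to do
        if old is not None:
            plan.append(("delete", i, old))
        if new is not None:
            plan.append(("create", i, new))
    return plan


# Hypothetical record list; only the first entry is being removed.
before = ["old-service.jenkins.io", "ci.jenkins.io", "wiki.jenkins.io"]
after = ["ci.jenkins.io", "wiki.jenkins.io"]

index_plan = plan_by_index(before, after)
for step in index_plan:
    print(step)
```

Removing the first entry shifts every later record by one position, so the plan deletes and recreates ci.jenkins.io and wiki.jenkins.io even though neither was changed. If a run stops between a delete and the matching create, that record is gone.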
So what could we do better?
I would be happy to discuss it with someone willing to contribute to that service.
* DNS records: we have to test whether the loop mechanism introduced in Terraform 0.12 correctly handles the different resources generated from an array
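
The behaviour we would want to see in that test can be sketched as follows. This is a Python stand-in, not Terraform, and it assumes each record is tracked by a stable key (its name) instead of its position; whether Terraform 0.12 actually behaves this way for our records is exactly what needs to be verified. old-service.jenkins.io is again a made-up name.

```python
def plan_by_key(before, after):
    """Plan changes when each record is tracked by its own name."""
    removed = sorted(set(before) - set(after))
    added = sorted(set(after) - set(before))
    plan = [("delete", name) for name in removed]
    plan += [("create", name) for name in added]
    return plan


# Same hypothetical change: only the first entry is being removed.
before = ["old-service.jenkins.io", "ci.jenkins.io", "wiki.jenkins.io"]
after = ["ci.jenkins.io", "wiki.jenkins.io"]

key_plan = plan_by_key(before, after)
for step in key_plan:
    print(step)
```

With name-based tracking, the only planned step is the deletion of the record that was actually removed; the other records are left alone.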
Cheers