Ongoing partial outage on ci.jenkins.io


R. Tyler Croy

Aug 15, 2019, 11:33:16 AM
to jenkin...@googlegroups.com, in...@lists.jenkins-ci.org
I'm writing this to let you all know that we're experiencing an ongoing
partial outage on ci.jenkins.io which is affecting the ability of the system to
process pull requests and merges to plugins, core, etc.

Starting sometime yesterday (2019-08-14), all attempts to provision Azure VM
agents began failing with Azure API errors. As such **no** Azure VM agents are
provisioning. Fortunately Azure Container Instances are provisioning, so now
might be a good time to try them out for pull requests you wish to validate:
buildPlugin(useAci: true) will utilize those agent types for the Linux branches
of the pipeline.
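
For plugin maintainers, a minimal Jenkinsfile sketch of the call mentioned above might look like the following. It assumes the repository's Jenkinsfile already uses the shared buildPlugin() step and passes no other options; any options you already pass would stay alongside useAci.

```groovy
// Jenkinsfile (sketch): ask the shared buildPlugin() step to use
// Azure Container Instances for the Linux branches of the pipeline.
// Assumption: no other buildPlugin() options are needed for this plugin.
buildPlugin(useAci: true)
```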


I have some backchannel communication going on with some teams at Microsoft
about the issue, but at this time I do not have an estimated time for
resolution.

The underlying cause appears to be that the Azure REST APIs that ci.jenkins.io
relies on are rather old and no longer supported, and something may have changed
to cause them to start failing to validate the "Deployments" (Azure
terminology) generated by the plugin which integrates Jenkins with Azure. If my
suspicions are correct about the fix, it will require some new plugin builds
and releases, which would mean we're likely a day or two away from the issue
fully being resolved.


I will update this thread as more information becomes available.



Cheers
--
GitHub: https://github.com/rtyler

GPG Key ID: 0F2298A980EE31ACCA0A7825E5C92681BEF6CEA2

Oleg Nenashev

Aug 15, 2019, 12:24:37 PM
to Jenkins Developers
Hi all,

Just a heads-up: we have applied a hotfix to the Jenkins Pipeline library in order to stabilize **SOME** of the plugin Pipelines (https://github.com/jenkins-infra/pipeline-library/releases/tag/1.2.0). What does the patch do?
  • buildPlugin() now enforces ACI for Linux flows
  • buildPlugin() now skips Windows configurations
But:
  • all other methods are not fixed. runPCT(), runATH(), essentialsTest() and so on will not work
  • Jenkins core builds will keep failing
  • ACI agents do not offer Docker on Linux nodes. Any flow using the Docker CLI will fail (e.g. Docker fixtures from Jenkins Test Harness)
Anyway, this change should help some plugin maintainers. 

BR, Oleg

P.S.: I have enabled the changelog for the Jenkins Pipeline Library this week. If you are interested in getting notifications for changes happening there, please feel free to subscribe to https://github.com/jenkins-infra/pipeline-library

Oleg Nenashev

Aug 16, 2019, 3:52:14 AM
to Jenkins Developers
Some extra updates:
  • We restored publishing of Incrementals in plugins by moving the code to ACI agents (thanks Jesse!)
  • We restored Jenkins Core CI runs. Windows builds and ATH are temporarily disabled there
Best regards,
Oleg

Olblak

Aug 16, 2019, 8:38:31 AM
to 'Gavin Mogan' via Jenkins Developers
Hi Everybody,

Some updates from my side.

ci.jenkins.io seems to be back to normal, and my theory is that the outage was related to two independent issues, one of which is still not resolved.

On Wednesday afternoon (UTC+2), I noticed that ci.jenkins.io was in a bad state. The build queue was huge (roughly 700 items), 49 Linux machines were deployed to build Jenkins core PRs, and only one Windows agent was deployed.
The reason for this unbalanced distribution of Linux and Windows agents was that we had configured ci.j.io to deploy no more than 50 Azure machines, Windows and Linux combined.

All Linux agents had been waiting since around 17hour for the Windows part to finish, and that was the case for each PR.
While all the Linux machines were in a good state, they were all considered offline because of the same issue, a broken SSH connection, which I reported here: JENKINS-58937.

The first thing I did was to increase the limit of Azure machines to 100, which immediately provisioned around 50 Windows machines, but the Linux agents were still considered offline. So, before deleting the Linux machines to force a re-provisioning, I made one last configuration change: I increased the disk size for every VM from 30GB (the default setting) to 100GB, in order to solve the disk space issue we regularly have with ci.j.io.

This last change led to JENKINS-58961, which is probably related to an Azure API change.
Since I reverted this last change to the default disk size value, Windows and Linux agents are provisioning correctly again, as you can see here

Cheers
--
You received this message because you are subscribed to the Google Groups "Jenkins Developers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to jenkinsci-de...@googlegroups.com.

Oleg Nenashev

Aug 16, 2019, 9:10:14 AM
to Jenkins Developers