Constant ci.jenkins.io failures - am I just unlucky?

Mez Pahlan

Oct 26, 2019, 12:20:08 PM
to Jenkins Developers
Am I doing something wrong here? I constantly get failures on ci.jenkins.io, typically with agents not working when I expect them to. These are not build failures in my plugin but infra issues before the plugin even gets built, typically out-of-memory errors before the build even starts. At the time of writing there are ~320 jobs waiting to be built and most of the agents are marked as offline. I've seen that number a lot higher in the past. I honestly don't know whether this is a common issue or not. Typically I end up waiting days for a Windows agent, or, as in this most recent failure, I can't even connect to a Windows agent, so the Linux part of the build doesn't even start and the whole build is marked as a failure.

My Jenkinsfile looks like this:

buildPlugin(useAci: true)

I only added useAci after finding out in a similar manner that my builds were failing with infra issues without it.

I thought the recommendation was to add a buildPlugin() statement in a Jenkinsfile in your plugin's root directory. I therefore assumed that whatever actually builds the plugin is stable most of the time. Have I got the wrong end of the stick? Should I be using something else in my Jenkinsfile? I can't be the only plugin dev to face these issues, which leads me to believe I must be doing something wrong or thinking about this in the wrong way. Do I just happen to be building my plugin at the wrong time of the week?

I know I can always proceed with my PRs and builds by ignoring the result of the CI build, but then... what's the point of the recommendation to add buildPlugin()?

Yours massively frustrated

Mez

Oleg Nenashev

Oct 27, 2019, 4:31:16 AM
to Jenkins Developers
Hi Mez,

I totally understand your frustration. The current situation is that ci.jenkins.io is very unstable due to numerous factors. Most CI runs fail, and on Friday we had an ACI outage which has not been resolved yet. On my side I had to postpone Jenkins core PR merges into the weekly release due to CI, and it is not the first time. I would say that ci.jenkins.io is basically unusable at the moment.

Why? As usual, we need contributors there. Olivier and other infra team members are doing a great job keeping things afloat, but right now it is non-stop firefighting. JIRA, Confluence, ci.jenkins.io, and many other stories. We need more contributors in the infra team to share the maintenance load and to finally improve the setup. Ideally we should make ci.jenkins.io a reference setup which is reusable by other Jenkins users.

Starting from November 15th, I personally commit to focusing on infrastructure and, specifically, on ci.jenkins.io as a top priority. I will have less time for other areas like Jenkins Core, but I consider the current state a grave danger to the project.

I invite everyone else to consider contributing to the infrastructure: https://jenkins.io/projects/infrastructure/#contributing

Best regards,
Oleg

Richard Bywater

Oct 27, 2019, 4:50:46 AM
to jenkin...@googlegroups.com
I've looked into contributing a few times, but unfortunately my timezone (NZT) and hours of availability make it hard to communicate in real time with the rest of the team, which feels a bit isolating and makes it harder to contribute.

However, if there are things to look into that are easy to pick up, then, given the need for extra people, perhaps it's time to give a few small issues a go and see how I get on...

Richard.

--
You received this message because you are subscribed to the Google Groups "Jenkins Developers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to jenkinsci-de...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/jenkinsci-dev/e310142d-4770-4c0d-a18d-89de3673ad96%40googlegroups.com.

Mez Pahlan

Oct 27, 2019, 4:59:54 AM
to Jenkins Developers
Thanks for the reply Oleg


I am sure it is a massive task maintaining the infra side of things, and more contributors will hopefully help. We all do appreciate the hard work, and my comments were mostly out of frustration last night. I tend to do my plugin development on the weekend, so maybe I have simply been caught out in an unlucky time period. As a suggestion, is there a way to report a different build status to GitHub instead of "build failed", with a message that makes it clear developers should rerun the build locally?

Good luck with the efforts!

Arnaud Héritier

Oct 27, 2019, 6:30:01 AM
to jenkin...@googlegroups.com
Oleg, I can offer my help too.
For sure I won't be able to dedicate a major part of my time, but I suppose that any help is welcome.

I am already an infra team member, but I don't have admin access on ci.jenkins.io.

Do we have the support-core plugin installed on this instance? If not, can we add it?

If I can have access to a support bundle, I will do a health check review like we do for our customers at CloudBees.

Can it help?

--
-----
Arnaud Héritier
Mail/GTalk: aheritier AT gmail DOT com
Twitter/Skype : aheritier

Oleg Nenashev

Oct 27, 2019, 7:34:20 AM
to JenkinsCI Developers
Yes. Setting up the Support Core plugin is something we could do as part of the instance update. Ideally we need integration with an online service, like Sentry in Jenkins Evergreen, so that we can automate log processing and notifications.

Daniel Beck

Oct 27, 2019, 7:58:39 AM
to Jenkins Dev, Arnaud Héritier


> On 27. Oct 2019, at 11:29, Arnaud Héritier <aher...@gmail.com> wrote:
>
> Do we have the support-core plugin installed on this instance? If not, can
> we add it?

We had it installed for years and nobody cared. After it showed up recently in a thread dump as the likely cause of excessive load while generating background bundles, I disabled it.

Arnaud Héritier

Oct 27, 2019, 8:14:47 AM
to Daniel Beck, Allan Burdajewicz, Jenkins Dev, Jenkin...@lists.jenkins-ci.org


>> Do we have the support-core plugin installed on this instance? If not, can
>> we add it?
>
> We had it installed for years and nobody cared. After it showed up recently in a thread dump as the likely cause of excessive load while generating background bundles, I disabled it.

Not surprising :-/
I can take care of it.
If it's the cause of excessive load, we have to fix it.
We can help on that with @Allan Burdajewicz.
I know it was the case in the past, especially with anonymization activated, but it's fixed in recent versions.

I don't see any INFRA ticket for these recurrent stability issues on ci.jenkins.io; am I missing one?
Otherwise let's create one, and I can take ownership of a subtask to do a health check if I can have access to a support bundle. (We can potentially activate the plugin just long enough to grab a bundle. It will give me a place to start and help me better understand the setup.)

Daniel Beck

Oct 29, 2019, 7:47:26 PM
to jenkin...@googlegroups.com
And since support-core is disabled, the next restart will disable the Metrics plugin too.

We're affected by JENKINS-59793, with currently (AFAICT ~30 minutes after restart) more than 700 of these threads.
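
For anyone who wants to check for a similar pile-up of threads on their own instance, a thread dump can be summarized by thread name. The following is only a minimal sketch in Python. It assumes a dump in the usual `jstack` text layout, where each thread entry starts with the thread name in double quotes; the function names and the threshold are made up for illustration, and nothing here uses the actual thread names from JENKINS-59793.

```python
import re
from collections import Counter


def count_threads_by_name(dump_text):
    """Count threads in a jstack-style dump, grouping names that differ only by numbers."""
    counts = Counter()
    for line in dump_text.splitlines():
        # Thread header lines start with the thread name in double quotes.
        if not line.startswith('"'):
            continue
        end = line.rfind('"')
        if end <= 0:
            continue  # no closing quote, so this is not a thread header
        name = line[1:end]
        # Collapse digit runs so "worker-1" and "worker-2" land in one group.
        counts[re.sub(r"\d+", "#", name)] += 1
    return counts


def suspicious_groups(counts, threshold=100):
    """Return (group, count) pairs at or above the threshold, largest first."""
    return [(name, n) for name, n in counts.most_common() if n >= threshold]
```

Calling `suspicious_groups(count_threads_by_name(text))` on the text of a dump lists the name groups with at least 100 threads. Comparing two dumps taken some time apart shows which groups keep growing.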

Arnaud Héritier

Oct 30, 2019, 9:56:12 AM
to Jenkins Developers
I still don't have admin access on the instance, but I have system-level access.
I was able to grab a few bundles, and we will see if that helps us start diagnosing the issue.

Baptiste Mathus

Oct 31, 2019, 6:28:42 AM
to Jenkins Developers
I filed https://issues.jenkins-ci.org/browse/INFRA-2308 after Arnaud rightly recommended having something public and central for us to fix this.

Pretty please, anyone who's been suffering from the recent ci.jenkins.io instabilities, do not hesitate to provide a status check here.
We're especially interested in common stack traces or errors you're seeing.

We have started analyzing support bundles. The first step is https://github.com/jenkins-infra/jenkins-infra/pull/1375

More to come.

Thanks!

Mez Pahlan

Oct 31, 2019, 3:17:52 PM
to Jenkins Developers
Many thanks, Baptiste, for picking this up. When you say "provide a status check here", do you mean adding stack traces to this forum thread, to the INFRA ticket you linked to, or something else?

Cheers.


Baptiste Mathus

Oct 31, 2019, 4:07:57 PM
to Jenkins Developers
I guess a comment in the JIRA epic would be good, especially if you have still seen failures in the last ~2 days.

Earlier today we applied the change to switch back from ZGC to G1 and bumped Xmx to a bigger value. But AFAIK the situation in the two days since the last restart was quite OKish.
Hence we're more or less waiting for the situation to degrade again, if it does, so that we can get a new support bundle to analyze, fix things or adjust the config, then rinse and repeat.

Given things have looked OK since the restart, we suspect something like a thread or memory leak that takes some time to surface.

BTW, we had a bit of an impromptu meeting earlier today on this.
We do plan to schedule recurring public sync-ups (probably during the weekly infra meeting), so anyone willing to contribute can join (contrary to today, facepalm).

Stay tuned, and feel free to ask any questions or share any insight on the misbehaviors you're still seeing.

Thanks

Oleg Nenashev

Oct 31, 2019, 4:45:09 PM
to JenkinsCI Developers
One thing to add here is that the Metrics plugin was disabled by Daniel before the restart.

Right now the instance is indeed stable, but the load has been pretty low. A regular massive Jenkins core rebuild is running now, though, so it should put some pressure on CI again.
If it goes down overnight, we can just restart it tomorrow.

Best regards,
Oleg


Allan Burdajewicz

Oct 31, 2019, 8:13:29 PM
to Jenkins Developers
About support core: based on what I see in the current bundle, the instance is running a very recent version of the plugin, which should include most of the improvements made.

Another thing I can see, though, is that the support bundles contain all GC logs since March 2019. That's 244 files, each of them 20 MB (the file size limit), which means that we have 4.75 GB of GC logs in the support bundle. Maybe this could account for the load when generating the bundle (which happens every hour by default but could be made less frequent or disabled: https://github.com/jenkinsci/support-core-plugin/blob/support-core-2.61/src/main/java/com/cloudbees/jenkins/support/SupportPlugin.java#L130-L135).

* Maybe we should set up log rotation on the server?

Another reason the GC logs are included in the bundle is that they are written directly under $JENKINS_HOME/, a location that the JenkinsLogs component scans: https://github.com/jenkinsci/support-core-plugin/blob/support-core-2.61/src/main/java/com/cloudbees/jenkins/support/impl/JenkinsLogs.java#L101-L130.

* So a workaround (I would say a better practice, actually) is to change the location of the GC logs to something like `$JENKINS_HOME/gc/`, or even to put them outside $JENKINS_HOME.

Maybe this could be improved to not include files older than a certain number of days or have some limits (https://issues.jenkins-ci.org/browse/JENKINS-59030).
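
The kind of limit suggested in JENKINS-59030 could look roughly like the sketch below. It is written in Python rather than the plugin's Java, and it is not how support-core selects files. The function name, the `gc*` file pattern, and the default limits are all hypothetical and only illustrate the idea: skip files older than a cutoff, then take the newest ones until a size cap is reached.

```python
import time
from pathlib import Path


def select_recent_logs(log_dir, pattern="gc*", max_age_days=7,
                       max_total_bytes=200 * 1024 * 1024, now=None):
    """Pick log files to bundle: newest first, skipping old files and stopping at a size cap."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 24 * 60 * 60
    candidates = []
    for path in Path(log_dir).glob(pattern):
        if not path.is_file():
            continue
        stat = path.stat()
        # Age limit: ignore anything last modified before the cutoff.
        if stat.st_mtime >= cutoff:
            candidates.append((stat.st_mtime, stat.st_size, path))
    # Newest files are the most useful for diagnosis, so take them first.
    candidates.sort(key=lambda item: item[0], reverse=True)
    selected, total = [], 0
    for _mtime, size, path in candidates:
        # Size limit: stop as soon as the next file would exceed the cap.
        if total + size > max_total_bytes:
            break
        selected.append(path)
        total += size
    return selected
```

With 20 MB files like the ones described above, the default cap in this sketch would keep at most ten of them instead of all 244.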

Arnaud Héritier

Nov 1, 2019, 5:13:37 AM
to Jenkins Developers
On Fri, Nov 1, 2019 at 1:13 AM Allan Burdajewicz <aburda...@cloudbees.com> wrote:
> About support core: based on what I see in the current bundle, the instance is running a very recent version of the plugin, which should include most of the improvements made.

Yes.

> Another thing I can see, though, is that the support bundles contain all GC logs since March 2019. That's 244 files, each of them 20 MB (the file size limit), which means that we have 4.75 GB of GC logs in the support bundle. Maybe this could account for the load when generating the bundle (which happens every hour by default but could be made less frequent or disabled: https://github.com/jenkinsci/support-core-plugin/blob/support-core-2.61/src/main/java/com/cloudbees/jenkins/support/SupportPlugin.java#L130-L135).
>
> * Maybe we should set up log rotation on the server?
>
> Another reason the GC logs are included in the bundle is that they are written directly under $JENKINS_HOME/, a location that the JenkinsLogs component scans: https://github.com/jenkinsci/support-core-plugin/blob/support-core-2.61/src/main/java/com/cloudbees/jenkins/support/impl/JenkinsLogs.java#L101-L130.
>
> * So a workaround (I would say a better practice, actually) is to change the location of the GC logs to something like `$JENKINS_HOME/gc/`, or even to put them outside $JENKINS_HOME.
>
> Maybe this could be improved to not include files older than a certain number of days or have some limits (https://issues.jenkins-ci.org/browse/JENKINS-59030).

That's what I wanted to fix yesterday, but I failed because I didn't notice that the instance is running on J11 :-/
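
The message does not say what exactly failed. One plausible reading, and it is only an assumption, is that Java 8 style GC logging options were used: Java 9 and later replaced them with unified logging (`-Xlog`), and the old rotation flags are no longer accepted there. The Python sketch below builds the options for either version. The function name, the `gc.log` file name, and the default file count are illustrative; the 20 MB size mirrors the file size limit mentioned above, and the directory follows the `$JENKINS_HOME/gc/` suggestion.

```python
def gc_logging_options(java_major, log_dir, file_count=10, file_size_mb=20):
    """Return JVM options that write rotated GC logs into log_dir."""
    log_file = f"{log_dir.rstrip('/')}/gc.log"
    if java_major >= 9:
        # Unified logging: a single -Xlog option carries the output file,
        # the decorators, and the rotation settings.
        return [
            f"-Xlog:gc*:file={log_file}:time,uptime,level,tags:"
            f"filecount={file_count},filesize={file_size_mb}M"
        ]
    # Java 8 and earlier: separate flags for the log file and for rotation.
    return [
        f"-Xloggc:{log_file}",
        "-XX:+UseGCLogFileRotation",
        f"-XX:NumberOfGCLogFiles={file_count}",
        f"-XX:GCLogFileSize={file_size_mb}M",
    ]
```

For example, `" ".join(gc_logging_options(11, "/var/jenkins_home/gc"))` gives a string that could be appended to the JVM arguments. The `/var/jenkins_home` path is a placeholder for wherever $JENKINS_HOME actually lives.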

 