Core build instability

Tim Jacomb

Sep 20, 2021, 3:14:12 AM
to Jenkins Developers
Hello

We're seeing quite unstable core builds recently.
This manifests in a couple of ways: 'OutOfMemoryError: Java heap space', agents failing to connect, and very slow tests (they hit the 3-hour timeout).

This appears to have begun in 2.309.

We have had a very stable master branch: Dependabot PRs always passed on the first attempt, but since then they have been failing with:
'OutOfMemoryError: Java heap space'

Diff

Any help appreciated

Thanks
Tim

Jesse Glick

Sep 20, 2021, 8:39:56 AM
to Jenkins Dev
So we have the JNA upgrade, XStream upgrade, and parallel class loading. I will try to bisect the cause.

Tim Jacomb

Sep 20, 2021, 1:44:33 PM
to Jenkins Developers
Thanks for tracking down where the memory issue appears to be coming from in https://github.com/jenkins-infra/pipeline-steps-doc-generator/pull/94#issuecomment-923094344

I think the other issue is that the CPU count appears to have been accidentally reverted to 2 cores =/

I haven't tracked down exactly where it was changed yet.

Tim


On Mon, 20 Sept 2021 at 13:39, Jesse Glick <jgl...@cloudbees.com> wrote:
So we have the JNA upgrade, XStream upgrade, and parallel class loading. I will try to bisect the cause.

Basil Crow

Sep 20, 2021, 3:37:34 PM
to jenkin...@googlegroups.com
I see no evidence that jenkinsci/jenkins#5687 has introduced a leak,
so I do not think it should be reverted. I _do_ see evidence that
registering AntClassLoader (specifically) as parallel-capable has
increased the heap size requirement for pipeline-steps-doc-generator:
1280 MiB seems to be sufficient, while what JVM ergonomics picked for
e.g. https://ci.jenkins.io/job/Infra/job/pipeline-steps-doc-generator/job/PR-92/1/consoleFull
(945 MiB) is insufficient. My recommendation to operators is to adjust
the hardware and/or -Xmx settings to ensure that a sufficiently large
heap is provided.
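
For reference, a minimal Java sketch to check what heap the JVM
ergonomics actually picked on a given agent (just a diagnostic sketch,
not part of any build script):

    public class MaxHeap {
        public static void main(String[] args) {
            // Maximum heap the JVM will attempt to use, as chosen by
            // ergonomics or an explicit -Xmx flag.
            long maxBytes = Runtime.getRuntime().maxMemory();
            System.out.printf("Max heap: %d MiB%n", maxBytes / (1024 * 1024));
        }
    }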

Jesse Glick

Sep 20, 2021, 3:57:29 PM
to Jenkins Dev
On Mon, Sep 20, 2021 at 3:37 PM Basil Crow <m...@basilcrow.com> wrote:
I do see evidence that registering AntClassLoader (specifically) as parallel-capable has increased the heap size requirement

Any notion yet of why that would be? It should be loading the same set of classes, just at slightly different times, unless I am missing something. 

Tim Jacomb

Sep 20, 2021, 4:04:28 PM
to Jenkins Developers
Pending approval of https://github.com/jenkins-infra/jenkins-infra/pull/1872, I've manually changed the settings, and we finally have a passing build for the stable branch.



Basil Crow

Sep 20, 2021, 4:24:18 PM
to jenkin...@googlegroups.com
On Mon, Sep 20, 2021 at 12:57 PM Jesse Glick <jgl...@cloudbees.com> wrote:
>
> Any notion yet of why that would be?

Why do you ask? The maximum heap size seems to have been 1516 MiB in
e.g. https://ci.jenkins.io/job/Infra/job/pipeline-steps-doc-generator/job/master/299/consoleFull
but had dropped to 954 MiB by e.g.
https://ci.jenkins.io/job/Infra/job/pipeline-steps-doc-generator/job/master/322/consoleFull
so the problem with pipeline-steps-doc-generator seems clear to me:
the operators mistakenly reduced the memory size of the test system,
and the job happened to continue to work for a while until organic
growth exposed the original operational issue. With the operational
issue resolved, PRs like jenkins-infra/pipeline-steps-doc-generator#92
are now passing against recent core releases. As far as I can tell,
this was a false alarm. I should not have been pinged about this.

I do not think it is appropriate to imply that a developer caused a
regression (for example, by describing jenkinsci/jenkins#5687 as "the
culprit") simply because an operational failure occurred. The cause of
the operational failure should be understood, and if that cause points
to a regression caused by a developer (such as a memory leak), then
the developer should be notified.

Anyway, one theory is that the organic increase in heap usage may be
coming from ClassLoader#getClassLoadingLock(String). If the
ClassLoader object is registered as parallel-capable, this method
returns a dedicated object associated with the specified class name;
otherwise, it returns the ClassLoader object. Perhaps there are enough
of these dedicated objects to cause a modest increase in heap usage on
some installations (~300 MiB in the case of
pipeline-steps-doc-generator).
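
For illustration, a minimal sketch of that locking behavior (the loader
and class names below are made up, not taken from Jenkins core):

    // Hypothetical loader, only to illustrate the behavior described above.
    class DemoLoader extends ClassLoader {
        static {
            // Opt in to parallel class loading; without this call,
            // getClassLoadingLock(name) simply returns `this`.
            registerAsParallelCapable();
        }

        // Expose the protected method so the demo below can call it.
        Object lockFor(String name) {
            return getClassLoadingLock(name);
        }
    }

    public class LockDemo {
        public static void main(String[] args) {
            DemoLoader loader = new DemoLoader();
            Object a = loader.lockFor("com.example.Foo");
            Object b = loader.lockFor("com.example.Bar");
            // Parallel-capable: one dedicated lock object per class name,
            // retained in an internal map for the lifetime of the loader.
            System.out.println(a == b);                                  // false
            System.out.println(a == loader);                             // false
            System.out.println(a == loader.lockFor("com.example.Foo"));  // true: same lock reused
        }
    }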

Tim Jacomb

Sep 20, 2021, 4:30:45 PM
to jenkin...@googlegroups.com
Given this has been going on for weeks, with people looking at flaky tests and a lot of re-running of builds, it was not clear at the time that this was because the resources had actually changed.

We were looking for the root cause and thought you might have insight into it. I would definitely expect to be pinged if a bisect had shown that my commit caused a job to start failing (even if only to learn why the resource requirements had increased).

And it's interesting to know about the class loading lock; I was not aware of that :)

Thanks
Tim

Jesse Glick

Sep 21, 2021, 10:04:13 AM
to Jenkins Dev
On Mon, Sep 20, 2021 at 4:24 PM Basil Crow <m...@basilcrow.com> wrote:
I do not think it is appropriate to imply that a developer caused a
regression […] simply because an operational failure occurred.

That was my best guess based on running `git bisect`: with the parallel class loading, the docs generator failed; without it, the generator worked. As mentioned, we have only speculated on what the real cause of the OOME was—something triggered by parallel class loading, which does not imply a root cause. For example, Jenkins might simply start faster in multiple threads, enabled by parallel class loading, and then do something unrelated to class loading which allocates lots of heap too quickly for GC to keep up.

Sounds like the instability in core builds themselves was unrelated, a coincidence?

Tim Jacomb

Sep 21, 2021, 10:15:17 AM
to Jenkins Developers
> Sounds like the instability in core builds themselves was unrelated, a coincidence?

I think it was another symptom, compounded by us not having enough build history to see clearly when it started.
There was a clear, reproducible change in failures in 2.309, which turned out to be caused by a small increase in the resources required.

But because the resources on the CI were (accidentally) so low, it manifested as a problem there.

Basil Crow

Sep 21, 2021, 11:23:54 AM
to jenkin...@googlegroups.com
On Tue, Sep 21, 2021 at 7:04 AM Jesse Glick <jgl...@cloudbees.com> wrote:
> That was my best guess based on running `git bisect`: with the parallel class loading, the docs generator failed; without it, the generator worked.

But this is just _data_; it doesn't mean anything unless we extract
the _insights_ out of it. To do that, we needed to understand _why_
the docs generator started failing.

> Sounds like the instability in core builds themselves was unrelated, a coincidence?

Looks that way to me, despite the claim that jenkinsci/jenkins#5687
"seems to be the cause of recent OOMEs, […] intermittently here [in
core] (acc. to @timja)".

I understand that operational issues that cause builds/tests to fail
can be tough to track down. I am a professional operator, so I know!
But I get enough of that at the day job and am unwilling to volunteer
that type of work for this project. I am happy to fix regressions that
I have caused as a developer; all I ask is that a little more thought
be given to the root cause analysis before dragging me in.

Jesse Glick

Sep 21, 2021, 2:33:00 PM
to Jenkins Dev
Sorry to have implied that any action was required of you; I should have phrased this as more of a “heads-up, possible regression under investigation here”.

Tobias Gruetzmacher

Sep 22, 2021, 2:01:41 AM
to jenkin...@googlegroups.com
Hi,

On Tue, Sep 21, 2021 at 03:14:57 PM +0100, Tim Jacomb wrote:
> There was a clear, reproducible change in failures in 2.309, which
> turned out to be caused by a small increase in the resources required.
>
> But because the resources on the CI were (accidentally) so low, it
> manifested as a problem there.

Just as a data point (I haven't been able to collect enough evidence
yet): it seems that shortly after the release of 2.309, Checkstyle was
upgraded to version 9.0, which switched from ANTLR 2 to ANTLR 4.

We have seen random OOM exceptions at my company due to
maven-checkstyle-plugin apparently not releasing memory after it has
run (not thoroughly debugged yet), which leads to mvn processes being
~300-500 MB larger in some builds...

Regards, Tobias