RCA of memory conditions on Ubuntu EC2 agents on ci.jenkins.io causing test instability

Basil Crow

Jun 5, 2020, 3:51:53 PM
to jenkin...@googlegroups.com
I recently stabilized my plugin's test suite on ci.jenkins.io. The
following is my root cause analysis.

At present there are eight online Ubuntu EC2 agents on ci.jenkins.io.
Three of these are high memory and five of these are not:

• EC2 (aws) - High memory ubuntu 18.04 (i-067cdb5c4dd6bbc66)
• EC2 (aws) - High memory ubuntu 18.04 (i-09868363dd8e0e302)
• EC2 (aws) - High memory ubuntu 18.04 (i-0d3e670dcf9448827)
• EC2 (aws) - Ubuntu 18.04 LTS (i-0147db496a4c3205b)
• EC2 (aws) - Ubuntu 18.04 LTS (i-066509d2e6e564444)
• EC2 (aws) - Ubuntu 18.04 LTS (i-06b6dd7739f0fcad8)
• EC2 (aws) - Ubuntu 18.04 LTS (i-0c6752517c9e4dd86)
• EC2 (aws) - Ubuntu 18.04 LTS (i-0d7ea29c5c4d607c6)

Both the high memory and the regular memory agents have the "linux"
label, so the Linux branches of my plugin's tests may run on either
the high memory or the regular memory agents. I noticed that the
branches of my tests that happen to run on the high memory agents
usually pass, but the branches of my tests that happen to run on the
regular memory agents frequently time out.

I added additional logging and saw that the agent JVM being launched
by my tests was sometimes running out of memory and crashing. This in
turn was causing my test to time out waiting for the agent to connect.
Why was the agent JVM running out of memory?

I added additional logging to print memory usage by process during
each test. I discovered that the regular memory agents have 2 GB of
RAM. They run several JVMs in the course of a typical integration
test:

• Remoting (with no -Xmx or -Xms)
• Maven (with no -Xmx or -Xms)
• surefire (with -Xms768M -Xmx768M)
• The agent JVM launched by my tests (with no -Xmx or -Xms)
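
The per-process logging described above could be approximated with a small helper like the following. This is only a sketch, not the logging actually used in the plugin's tests: it assumes a Linux host with procps `ps` available, and the function names `parse_ps_rss` and `log_java_memory` are hypothetical.

```python
import subprocess


def parse_ps_rss(ps_output):
    """Parse `ps -eo pid,rss,args` output into (pid, rss_mib, command) tuples."""
    rows = []
    for line in ps_output.strip().splitlines()[1:]:  # skip the header row
        parts = line.split(None, 2)
        if len(parts) < 3:
            continue
        pid, rss_kib, command = parts
        # ps reports RSS in KiB; convert to MiB for readability
        rows.append((int(pid), int(rss_kib) / 1024, command))
    # Largest resident set first
    return sorted(rows, key=lambda row: row[1], reverse=True)


def log_java_memory():
    """Print the resident memory of every process whose command mentions java."""
    out = subprocess.run(
        ["ps", "-eo", "pid,rss,args"],
        capture_output=True, text=True, check=True,
    ).stdout
    for pid, rss_mib, command in parse_ps_rss(out):
        if "java" in command:
            print(f"{pid:>7} {rss_mib:8.1f} MiB  {command[:80]}")
```

Calling `log_java_memory()` at the start of each test would show which JVMs are resident at that point and roughly how much memory each one holds.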

I added additional logging and determined that at the time my test
started (at which point the only JVMs running were Remoting, Maven,
and surefire), only about 400 MB of RAM remained free on the system.
Thus it was no surprise that my agent JVMs were frequently running out
of memory.

I worked around the problem by setting

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-surefire-plugin</artifactId>
  <configuration>
    <argLine>-Xmx256m -Xms256m</argLine>
  </configuration>
</plugin>

in pom.xml and setting "-Xmx64m -Xms64m" for my agent JVMs (in my
tests). With these settings my tests consistently pass, even on the
regular memory EC2 agents.
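
A rough back-of-the-envelope check of why this helps, as a sketch only. It assumes that a HotSpot JVM started without -Xmx defaults its maximum heap to about one quarter of physical RAM, and it counts heap only (no metaspace, thread stacks, or non-JVM processes), so real usage will differ. The dictionary names are illustrative.

```python
RAM_MIB = 2048  # regular memory agent, per the measurements above


def default_max_heap_mib(ram_mib):
    # Assumption: HotSpot ergonomics pick roughly 1/4 of physical RAM as max heap
    return ram_mib // 4


def worst_case_heap_mib(jvms, ram_mib=RAM_MIB):
    """Sum the maximum heaps; a value of None means no -Xmx was given."""
    return sum(
        default_max_heap_mib(ram_mib) if xmx is None else xmx
        for xmx in jvms.values()
    )


# Maximum heap in MiB for each JVM, before and after the workaround
before = {"remoting": None, "maven": None, "surefire": 768, "test agent": None}
after = {"remoting": None, "maven": None, "surefire": 256, "test agent": 64}

print("before:", worst_case_heap_mib(before), "MiB of", RAM_MIB)
print("after: ", worst_case_heap_mib(after), "MiB of", RAM_MIB)
```

Under these assumptions the original settings allow the heaps alone to grow past the 2 GB of physical RAM, while the workaround keeps the worst case well below it.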

I suggest the Jenkins infrastructure team consider adding -Xmx and
-Xms options to the Remoting JVM and/or using EC2 instance types with
more memory.

Jesse Glick

Jun 5, 2020, 4:24:11 PM
to Jenkins Dev
Thank you for digging into this problem which has been plaguing us.
(INFRA-2548?) Your analysis sounds right. The next step would be PRs
to infrastructure repositories.

256 MB seems low for a Surefire JVM: this needs to run Jenkins and all
plugins plus whatever your test code is doing. Of course 2 GB is also a
bit tight for `JenkinsRule` tests. I agree that `agent.jar` should be
able to run in quite a bit less than whatever HotSpot ergonomics would
pick by default, and probably `mvn` could as well, leaving more room
for the Surefire JVM and any extra processes such as mock agents, Git,
Docker fixtures, etc.

I wonder if there is any way to have all the JVMs in this VM coöperate
to jointly use, say, 75% of available RAM in whatever proportion.
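
One static approximation of that idea, as a sketch only: split a fixed fraction of RAM across the JVMs by weight and pass the result to each one as -Xmx. This is not dynamic cooperation between JVMs; the weights and the function name `split_heap_budget` are hypothetical and would need tuning against real test runs.

```python
def split_heap_budget(ram_mib, weights, fraction=0.75):
    """Divide `fraction` of RAM among JVMs in proportion to their weights."""
    budget = int(ram_mib * fraction)
    total = sum(weights.values())
    # Integer division rounds down, so the shares never exceed the budget
    return {name: budget * weight // total for name, weight in weights.items()}


# Hypothetical weights: Surefire runs Jenkins and plugins, so it gets the most
weights = {"remoting": 1, "maven": 2, "surefire": 8, "test agent": 1}

for name, mib in split_heap_budget(2048, weights).items():
    print(f"{name}: -Xmx{mib}m")
```

On a 2 GB agent this leaves a quarter of RAM for the operating system, non-heap JVM memory, and extra processes such as Git or Docker fixtures.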

Matt Sicker

Jun 5, 2020, 4:40:33 PM
to jenkin...@googlegroups.com
Might be worth looking at OpenJ9. It has some nifty cloud native
features for helping reduce JVM load. For example:

* https://www.eclipse.org/openj9/docs/jitserver/
* https://www.eclipse.org/openj9/docs/shrc/

Disclaimer: I've only seen a talk about this; I've never tried
configuring this in a real cloud environment. Looks nifty, though.
> --
> You received this message because you are subscribed to the Google Groups "Jenkins Developers" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to jenkinsci-de...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/jenkinsci-dev/CANfRfr1t-OO_xf4JC7ABZApepXSQ44P2SphF_8j0cf8FsZNU1A%40mail.gmail.com.



--
Matt Sicker
Senior Software Engineer, CloudBees

Basil Crow

Jun 5, 2020, 4:40:42 PM
to jenkin...@googlegroups.com
On Fri, Jun 5, 2020 at 1:24 PM Jesse Glick <jgl...@cloudbees.com> wrote:
> The next step would be PRs to infrastructure repositories.

I agree. Unfortunately I have spent too much time on this issue already and
cannot volunteer to become an infrastructure developer at present.

> 256 MB seems low for a Surefire JVM: this needs to run Jenkins and all
> plugins plus whatever your test code is doing.

I agree, which is why I said "I worked around the problem" rather than "I
solved the problem." Changing the JVM settings for Surefire and the agent
launched by my tests was easy because both those JVMs were completely within
my control. I agree that a long-term solution would involve setting -Xmx and
-Xms on all agent and Maven JVMs as well as possibly increasing the EC2
instance size for these nodes.

Slide

Jun 5, 2020, 5:02:42 PM
to jenkin...@googlegroups.com
We are currently using t2.small instances on EC2 for the non-high-memory instances.

image.png

Going from t2.small to t2.medium would double the CPU credits per hour, though it also doubles the vCPU count and memory.

The high mem instances are using m5a.xlarge:

image.png

I don't know what the cost difference is between the t2 and m5a instances.

Vlad Silverman

Jun 5, 2020, 5:19:09 PM
to jenkin...@googlegroups.com
On Jun 5, 2020, at 2:02 PM, Slide <slide...@gmail.com> wrote:
> I don't know what the cost difference is between the t2 and m5a instances.

I guess it depends on the region.

Matt Sicker

Jun 5, 2020, 5:32:21 PM
to jenkin...@googlegroups.com
Looks like m5a are AMD and t2 are Intel (and burstable). If they cost
about the same, m5a sounds better.

Slide

Jun 5, 2020, 6:40:06 PM
to jenkin...@googlegroups.com
Just for reference...

image.png
image.png
image.png

t2.medium may be the way to go.

Gavin Mogan

Jun 5, 2020, 6:43:10 PM
to Jenkins Developers
Remember that with more resources the tests can often run faster, which reduces how long the instance is needed.

It's never simple math.

Slide

Jun 5, 2020, 7:28:33 PM
to jenkin...@googlegroups.com
True, but I am not sure that doubling the memory (e.g., m5a.large over t2.medium) would make a big enough difference to justify almost double the cost. I could be wrong though; I am definitely not an expert in Java optimization.

Tim Jacomb

Jun 9, 2020, 3:59:55 AM
to Jenkins Developers
Hi all,

I've done the following:

* linux docker - was t3.small, now t3a.large (2 cores, 8 GB)
* arm64 - was a1.medium, now t3a.large (2 cores, 8 GB)


Let's monitor and see how we go (pricing and performance wise)

High mem could possibly do with a change, the AWS ones are much lower spec than the Azure ones, thoughts?

Thanks
Tim

Jesse Glick

Jun 12, 2020, 10:20:32 AM
to Jenkins Dev
On Tue, Jun 9, 2020 at 3:59 AM Tim Jacomb <timja...@gmail.com> wrote:
> High mem could possibly do with a change, the AWS ones are much lower spec than the Azure ones, thoughts?

Not sure but I just got an unexplained

EC2 (aws) - High memory ubuntu 18.04 (i-0e7f3896526c7922e) was
marked offline: Connection was broken: java.io.EOFException

Tim Jacomb

Jun 12, 2020, 12:09:32 PM
to Jenkins Developers, jenkin...@googlegroups.com
Azure high mem: Standard_D16_v3 (16 vCPU, 64 GiB memory)
AWS high mem: m5a.xlarge (4 vCPU, 16 GiB memory)

I've changed the AWS one to m5a.4xlarge (16 vCPU, 64 GiB memory).
It's 4 times the cost, so we'll need to keep an eye on it.

Thanks
Tim

