Instability on ci.jenkins.io Ubuntu EC2 agents


Chris Kilding

unread,
May 7, 2020, 10:27:59 AM5/7/20
to jenkin...@googlegroups.com
Hi,

I'm seeing a range of instability on the Ubuntu EC2 build agents on ci.jenkins.io over the past week, including tests randomly taking forever and timing out, and instances randomly being terminated mid-build.

This didn't happen previously, back when (I believe) the builds were running in Azure.

Any ideas?

Chris

Mark Waite

unread,
May 7, 2020, 11:12:29 AM5/7/20
to jenkinsci-dev
On Thu, May 7, 2020 at 8:27 AM Chris Kilding <chris+...@chriskilding.com> wrote:
Hi,

I'm seeing a range of instability on the Ubuntu EC2 build agents on ci.jenkins.io over the past week, including tests randomly taking forever and timing out, and instances randomly being terminated mid-build.


Yes, plenty of ideas, but no solutions yet.  Sorry for the unreliability of those build agents.

We were hosting all our build agents in Azure while Microsoft was sponsoring the Jenkins project infrastructure.  That sponsorship expired late in 2019, leaving the Jenkins project paying the full price of its Azure-hosted infrastructure.

AWS became a sponsor in early 2020.  In order to immediately reduce costs, we've added AWS EC2 agents using the EC2 plugin.  The Jenkins master and the Jenkins containerized agents ("ACI") continue to run on Azure for now, while the Ubuntu agents are now provisioned by the EC2 plugin.

Unfortunately, the agents provisioned by the EC2 plugin randomly lose their connection to the Jenkins master on Azure.  That loss of connection disrupts the job that is running on the agent and causes the types of annoying behaviors that you have seen.

We're currently putting first focus on completing the core release automation project so that we can deliver Jenkins weekly, LTS, and security releases without requiring Kohsuke to perform the release.  Jenkins weekly releases 2.232, 2.233, and 2.234 were delivered from core release automation without requiring any action from Kohsuke.  We're working on the automation for long-term support releases and for security releases.

Intense focus on the EC2 agent connection failures will have to wait until we either have more people to assist with infrastructure or we have completed the core release automation project.  Those who would like to assist with Jenkins infrastructure are welcome to join the weekly infrastructure meetings and to chat on the IRC channel #jenkins-infra. 

Thanks,
Mark Waite

 

--
You received this message because you are subscribed to the Google Groups "Jenkins Developers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to jenkinsci-de...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/jenkinsci-dev/cd9d4a3d-fdfb-460a-805b-7bfb12d47d2d%40www.fastmail.com.

Chris Kilding

unread,
May 7, 2020, 2:18:28 PM5/7/20
to jenkin...@googlegroups.com
Hi Mark,

Thanks, that explains it. Off-hand thought: could AWS PrivateLink, or the Azure equivalent, do anything to improve connection reliability in our multi-cloud setup? (I know these services are aimed more at on-prem-to-cloud links, but maybe they can help here.)

In the meantime my workaround is to push trivial commits when I need the build to try again. That suffices during PR development, but I can’t do it on master just for the fun of it.
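For the record, the trivial commit doesn't need to touch any files: an empty commit is enough to trigger most push-based CI setups. A minimal sketch (the throwaway repo below is purely for demonstration; in a real PR branch only the last two steps are needed):

```shell
# Demonstrate in a throwaway repo so nothing real is modified.
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "initial"

# Retrigger CI without changing any files: record an empty commit.
# (In a real branch, follow this with: git push)
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "Retrigger CI"
git rev-list --count HEAD   # two commits, zero file changes
```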

Chris

Slide

unread,
May 7, 2020, 2:30:21 PM5/7/20
to jenkin...@googlegroups.com
Hi Chris,

You can also close the PR and reopen it (after about 30 seconds), and it will build again.

Regards,

Alex



Mark Waite

unread,
May 7, 2020, 2:40:00 PM5/7/20
to jenkinsci-dev
On Thu, May 7, 2020 at 12:18 PM Chris Kilding <chris+...@chriskilding.com> wrote:
Hi Mark,

Thanks, that explains it. Off-hand thought - can AWS PrivateLink - or the Azure equivalent - do anything to improve connection reliability in our multi cloud setup? (I know these services are more intended for on prem to cloud links, but maybe it can help here.)


I'm hesitant to speculate on solutions until we better understand the problem.  I run 11 agents on AWS spot instances that are connected to a master at my house.  There are times when those 11 AWS computers are also connected to a Jenkins master on a different computer.  I've run that configuration for a year or more and have not had more than 3 or 4 connectivity issues.

I am reasonably confident that the network between the AWS data center and the Azure data center is no worse than the network between the master inside my house and the 11 spot instances running on AWS.  However, that's just speculation on my part.

Manuel Ramón León Jiménez

unread,
May 8, 2020, 9:55:17 AM5/8/20
to Jenkins Developers
Hi Chris, we've recently released a security fix for the EC2 plugin. It involves reading the AWS EC2 instance console to get the SSH host key used by the instance, so the master can communicate with it securely. Existing instances are configured with the accept-new strategy, which guarantees they keep working even though the server key is not yet trusted.

However, the first time the Jenkins master connects to an EC2 instance, it now waits until the instance console is ready with the key printed. If the template was configured with a timeout, that timeout may expire, because this operation can take several minutes. By default the timeout is empty, so there shouldn't be any problem, although you may see instances take longer to provision the first time. All the information is here: https://github.com/jenkinsci/ec2-plugin/#security You can move to the off strategy to work as before, but it's insecure and allows a MitM attack.

Another issue arises when you configure the plugin to connect to your EC2 instances with the ssh client (instead of the pure-Java client) and the installed ssh command is quite old. Such a command does not support the -o StrictHostKeyChecking= option, so the connection does not take place.
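For anyone curious what these strategies mean on the OpenSSH side, here is an illustrative client config fragment; the host name is hypothetical, and the plugin handles the equivalent settings per connection rather than through this file:

```
# ~/.ssh/config (illustrative only)
Host my-ec2-agent
    # accept-new: trust the host key on first contact, refuse the
    # connection if the key later changes. Supported only in OpenSSH 7.6
    # and newer, which is why very old ssh commands reject
    # -o StrictHostKeyChecking=accept-new.
    StrictHostKeyChecking accept-new
    # The insecure "off" strategy corresponds to:
    #   StrictHostKeyChecking no
    # which skips host key verification entirely and permits MitM attacks.
```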

Let me know whether this explanation fits your case.

Thank you.


Chris Kilding

unread,
May 15, 2020, 6:50:09 AM5/15/20
to jenkin...@googlegroups.com
I can't see into the instance logs as a ci.jenkins.io user, so I can't say if this is the cause or not. Maybe the Jenkins infra people can chime in?

I can say that most of the time, only a random handful of tests fail with timeouts while the rest pass. Though occasionally the whole suite can fail at the integration-test phase due to a timeout.

Chris

Mark Waite

unread,
May 15, 2020, 7:31:23 AM5/15/20
to jenkinsci-dev
On Fri, May 15, 2020 at 4:50 AM Chris Kilding <chris+...@chriskilding.com> wrote:
I can't see into the instance logs as a ci.jenkins.io user, so I can't say if this is the cause or not. Maybe the Jenkins infra people can chime in?


I don't think the failures on ci.jenkins.io agents are related to the initial agent connection to the master.  The failure modes I've observed for tests on ci.jenkins.io are failures while a job is running on the agent: the agent is running a job and is disconnected for unknown reasons.  The job log will often include an entry that a FilePath could not be created or used, or that the EC2 agent could not be contacted.  The agent log may list as the offline reason that there was no response on the PingThread, or that there was a Java end-of-file exception.
 
I can say that most of the time, only a random handful of tests fail with timeouts while the rest pass. Though occasionally the whole suite can fail at the integration-test phase due to a timeout.


That matches what I've observed as well.  I haven't tried to correlate test failures to see if certain tests are more likely to fail than other tests, but it is a common failure that some of the tests (in the git plugin, for example) will fail with a timeout exception if they were running on an EC2 agent that disconnected during the test.

Mark Waite
 