[JIRA] (JENKINS-61051) Jobs are started on master instead of EC2 slaves randomly

gabezzz@gmail.com (JIRA)

Feb 11, 2020, 7:55:02 AM
to jenkinsc...@googlegroups.com
Gabor V created an issue
 
Jenkins / Bug JENKINS-61051
Jobs are started on master instead of EC2 slaves randomly
Issue Type: Bug
Assignee: FABRIZIO MANFREDI
Components: ec2-plugin
Created: 2020-02-11 12:54
Environment: Jenkins v2.220
EC2 plugin v1.49+
Labels: plugin ec2 slave agents
Priority: Critical
Reporter: Gabor V

The Jenkins master runs on Amazon Linux 2. Jenkins uses the EC2 plugin to create slaves on demand, and many jobs are assigned to slaves by label.

Since upgrading to EC2 plugin 1.49, some jobs (seemingly at random) are started on the master node instead of on the launched slaves. The AWS slave is started, but the workspace is created on the master (in the user's home directory, at the path that should have been used on the slave). The job's console log says the job is running on the slave, but that is not true.

This may not be related to the EC2 plugin, as I don't see any change related to this problem in the 1.49 release history.

This message was sent by Atlassian Jira (v7.13.6#713006-sha1:cc4451f)

gabezzz@gmail.com (JIRA)

Feb 11, 2020, 8:03:05 AM
to jenkinsc...@googlegroups.com
Gabor V updated an issue
Change By: Gabor V
Attachment: Screenshot 2020-02-11 at 13.59.23.png

Attachment: I took a screenshot of a node's script console page while, according to the Jenkins logs, the node was being used for a build. I asked for the hostname, and although the node's name suggests it is a slave, the hostname belongs to the master. And of course the workspace was created on the master.

gabezzz@gmail.com (JIRA)

Feb 11, 2020, 8:47:04 AM
to jenkinsc...@googlegroups.com
Gabor V assigned an issue to Jeff Thompson
Change By: Gabor V
Assignee: FABRIZIO MANFREDI -> Jeff Thompson

gabezzz@gmail.com (JIRA)

Feb 11, 2020, 8:48:02 AM
to jenkinsc...@googlegroups.com
Gabor V updated an issue
Change By: Gabor V
Component/s: remoting
Labels: agents ec2 plugin remoting slave

gabezzz@gmail.com (JIRA)

Feb 11, 2020, 8:49:02 AM
to jenkinsc...@googlegroups.com
Gabor V updated an issue
Since upgrading to EC2 plugin 1.49 (and to Jenkins 2.217, which contains remoting 4.0), some jobs (seemingly at random) are started on the master node instead of on the launched slaves. The AWS slave is started, but the workspace is created on the master (in the user's home directory, at the path that should have been used on the slave). The job's console log says the job is running on the slave, but that is not true.

jthompson@cloudbees.com (JIRA)

Feb 11, 2020, 11:57:04 AM
to jenkinsc...@googlegroups.com
Jeff Thompson commented on Bug JENKINS-61051
 
Re: Jobs are started on master instead of EC2 slaves randomly

This probably doesn't have anything to do with Remoting. It's more likely that the ec2-plugin is not launching the job in the right place or not using the desired agent configuration. My guess is that tracking this down will require additional diagnostics. Any better troubleshooting data or a reproducible scenario you can collect will likely be necessary to resolve this.

jthompson@cloudbees.com (JIRA)

Feb 11, 2020, 11:58:03 AM
to jenkinsc...@googlegroups.com
Jeff Thompson assigned an issue to Unassigned
 
Change By: Jeff Thompson
Assignee: Jeff Thompson -> Unassigned

gabezzz@gmail.com (JIRA)

Feb 12, 2020, 1:22:02 AM
to jenkinsc...@googlegroups.com
Gabor V commented on Bug JENKINS-61051
 
Re: Jobs are started on master instead of EC2 slaves randomly

"something about the ec2-plugin not launching the job in the right place" - I thought Jenkins launches the job and the ec2-plugin just creates the slaves.

jthompson@cloudbees.com (JIRA)

Feb 12, 2020, 1:48:02 PM
to jenkinsc...@googlegroups.com
Jeff Thompson commented on Bug JENKINS-61051
 
Re: Jobs are started on master instead of EC2 slaves randomly

I'm not familiar with the details of the ec2-plugin, but I know it does some complicated things, including in how it manages agents. When I've looked into the code there before, there were some complicated pieces. If you can reproduce the problem without the ec2-plugin, then it is probably due to something in the Jenkins server. (Since I'm not aware of any reports of that, it seems unlikely.) If it only occurs with the ec2-plugin, then it probably has something to do with the custom capabilities the plugin provides.

laszlo.gaal@gmail.com (JIRA)

Mar 10, 2020, 5:31:02 PM
to jenkinsc...@googlegroups.com

I have run into a similar problem with Jenkins v2.204.2 and EC2 plugin v1.49.1. In our case the master was actually overloaded by the misdirected job, and the Jenkins process was killed by the OOM-killer.

One symptom I found was that the Jenkins log lines that normally record the EC2 plugin's connection attempt to the newly created worker were missing the IP address, printing "null" instead:

Regular log entry:

2020-03-10 04:47:57.202+0000 [id=797295]        INFO    hudson.plugins.ec2.EC2Cloud#log: Connecting to 172.31.26.224 on port 22, with timeout 10000. 

Bad log line (only 2 instances in several weeks, immediately before the failure):

2020-03-10 04:47:57.113+0000 [id=797326]        INFO    hudson.plugins.ec2.EC2Cloud#log: Connecting to null on port 22, with timeout 10000. 

Observe the "null" instead of a valid IP address.
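
For anyone who wants to check their own logs for the same symptom, here is a minimal Python sketch. It assumes the log format shown in the two lines above and nothing more; the names `CONNECT_RE`, `find_null_connections` and `SAMPLE` are illustrative and are not part of Jenkins or the EC2 plugin.

```python
import re

# Matches the EC2 plugin connection line quoted above. The host is captured
# so that a literal "null" can be told apart from a real address.
CONNECT_RE = re.compile(
    r"hudson\.plugins\.ec2\.EC2Cloud#log: "
    r"Connecting to (?P<host>\S+) on port (?P<port>\d+)"
)


def find_null_connections(lines):
    """Return (line_number, line) pairs where the logged host is 'null'."""
    hits = []
    for number, line in enumerate(lines, start=1):
        match = CONNECT_RE.search(line)
        if match and match.group("host") == "null":
            hits.append((number, line.rstrip()))
    return hits


# The two log lines from this comment, used as sample input.
SAMPLE = [
    "2020-03-10 04:47:57.202+0000 [id=797295]        INFO    "
    "hudson.plugins.ec2.EC2Cloud#log: Connecting to 172.31.26.224 on port 22, with timeout 10000. ",
    "2020-03-10 04:47:57.113+0000 [id=797326]        INFO    "
    "hudson.plugins.ec2.EC2Cloud#log: Connecting to null on port 22, with timeout 10000. ",
]

if __name__ == "__main__":
    for number, line in find_null_connections(SAMPLE):
        print(f"line {number}: {line}")
```

To scan a real log, pass an open file object instead of `SAMPLE`; the location of the Jenkins log file depends on the installation.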


jthompson@cloudbees.com (JIRA)

Mar 10, 2020, 5:53:02 PM
to jenkinsc...@googlegroups.com

That sounds like it is an issue in the EC2 plugin. Possibly a timing problem. Presumably if the IP address isn't specified it runs the job on the master.

laszlo.gaal@gmail.com (JIRA)

Mar 26, 2020, 9:56:02 AM
to jenkinsc...@googlegroups.com

Just ran into this again. Jeff Thompson: yeah, it looks like either a timing problem or a race.

As a workaround I installed roadblocks on the master that should fail such an errant job very early in the startup/config phase, before it has a chance to consume all memory and trigger an OOM-kill. We'll see if it's enough; I'd really hate to downgrade the plugin again.
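
The comment does not say what the roadblocks look like. One hypothetical form, sketched in Python, is a check that a job runs as its very first step and that fails if the build host is the master. The hostname in `MASTER_HOSTNAMES` and the function names are assumptions made for illustration only.

```python
import socket
import sys

# Hypothetical value: replace with the hostname(s) of your own Jenkins master.
MASTER_HOSTNAMES = {"jenkins-master.example.internal"}


def running_on_master(current_host, master_hosts):
    """True if current_host matches a master hostname (case-insensitive)."""
    return current_host.lower() in {host.lower() for host in master_hosts}


def guard(current_host=None, master_hosts=MASTER_HOSTNAMES):
    """Return a non-zero exit code if the build appears to run on the master."""
    host = current_host if current_host is not None else socket.gethostname()
    if running_on_master(host, master_hosts):
        print(f"Refusing to build: running on master host {host!r}", file=sys.stderr)
        return 1
    return 0
```

A build step could call `sys.exit(guard())` before doing any real work, so that a misdirected job fails within seconds instead of consuming the master's memory. This only limits the damage; it does not address why the job was sent to the master.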

gabezzz@gmail.com (JIRA)

Mar 26, 2020, 10:01:03 AM
to jenkinsc...@googlegroups.com
Gabor V commented on Bug JENKINS-61051

 Any idea who can work on this bug from the ec2 plugin team? To whom should we assign it?

raihaan.shouhell@autodesk.com (JIRA)

Mar 29, 2020, 9:28:02 PM
to jenkinsc...@googlegroups.com

The EC2 plugin just launches and manages agents; it doesn't actually do anything with regard to assigning agents.
That null does look suspicious.

Does your master use the same PEM key as your agents? I'm assuming that your agents run Linux and use SSH as well.

laszlo.gaal@gmail.com (JIRA)

May 2, 2020, 2:49:03 PM
to jenkinsc...@googlegroups.com

Raihaan Shouhell, yes, they do use the same keys, and I've realized that assigning different keys to them would be a useful workaround.

However, I never had this problem before upgrading to 1.49.1, so having the same keys does not cause the problem, although it makes the failing case that much more severe.

laszlo.gaal@gmail.com (JIRA)

May 2, 2020, 2:51:03 PM
to jenkinsc...@googlegroups.com

Just saw https://github.com/jenkinsci/ec2-plugin/pull/440, which seems likely to fix this issue; one of the comments actually refers to the

Connecting to null on port 22 

pattern I described in an earlier comment.

laszlo.gaal@gmail.com (JIRA)

May 2, 2020, 3:04:04 PM
to jenkinsc...@googlegroups.com
Laszlo Gaal edited a comment on Bug JENKINS-61051
Just saw https://github.com/jenkinsci/ec2-plugin/pull/447, which seems likely to fix this issue; one of the comments actually refers to the

Connecting to null on port 22

pattern I described in an earlier comment.