I'm experiencing the same problems with EC2 slaves.
We're using a custom AWS Linux AMI, with slaves that terminate after 30 minutes of inactivity; the instance type is c3.large.
At seemingly random moments, slaves lose connectivity.
Sometimes the slaves run fine for a while, sometimes a few lose connectivity in a row.
Symptoms:
- no further build output appears in the build console log
- the slave goes offline
- the slave is accessible through SSH, but the slave.jar Java process is no longer running
We experimented with ClientAliveInterval 15 in the sshd config on the slave; it didn't help.
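For reference, that experiment was just this line in /etc/ssh/sshd_config on the slave, followed by an sshd restart (something like "sudo service sshd restart" on AWS Linux):

# tried on the AWS Linux slave, no effect on the disconnects
ClientAliveInterval 15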
I added process list logging to see what happens.
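(A minimal sketch of that logging, assuming a simple cron job; our actual script differs and the log path here is made up:)

# /etc/cron.d/ps-watch - sample the process list once a minute, with a timestamp
* * * * * root { date; ps -ef | grep '[s]lave.jar'; } >> /var/log/ps-watch.log 2>&1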
The slave process disappears without anything strange noticeable (except for a disconnect on the master).
This could mean that either the slave Java process terminates unexpectedly, or the SSH connection is terminated by a timeout.
Looking at the logging, the latter seems to be happening. Around the second that the slave process disappears from the process list, the following logging appears in /var/log/secure:
Feb 3 11:24:43 ip-10-4-33-150 sshd[2243]: Timeout, client not responding.
Feb 3 11:24:43 ip-10-4-33-150 sshd[2241]: pam_unix(sshd:session): session closed for user ec2-user
That means that sshd itself is terminating the connection. With ClientAliveInterval 15 and ClientAliveCountMax left at its OpenSSH default of 3, sshd drops the session after roughly 45 seconds without a keepalive response from the client.
On another build environment with practically the same setup (Ubuntu AMI), we don't see the disconnects.
I compared the two sshd config files on the slaves.
Noticeable difference:
- the Ubuntu slave (no disconnects) has "TCPKeepAlive yes" in its sshd_config, and no ClientAliveInterval/ClientAliveCountMax set
- the AWS Linux slave (disconnect issues) has TCPKeepAlive commented out, ClientAliveInterval set to 15, and ClientAliveCountMax not set
The next thing we're going to try is to remove ClientAliveInterval and enable "TCPKeepAlive yes" on the AWS Linux slave.
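Concretely, the planned change in /etc/ssh/sshd_config on the AWS Linux slave would look roughly like this (sketch, not yet verified), followed by an sshd restart:

# keep the connection alive at the TCP level (handled by the kernel)
TCPKeepAlive yes
# drop the protocol-level keepalive that was making sshd kill unresponsive sessions
#ClientAliveInterval 15
# ClientAliveCountMax left unset (OpenSSH default is 3)

The difference between the two mechanisms: ClientAliveInterval sends keepalive requests through the encrypted SSH channel and expects the client to answer them, so an unresponsive client gets disconnected after ClientAliveInterval * ClientAliveCountMax seconds. TCPKeepAlive only keeps the underlying TCP connection from being dropped and doesn't require any response from the SSH client application itself.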