| FYI we got to the bottom of why our Windows slaves were disconnecting: it appears that the Windows DHCP Client is incompatible with the Windows Time Service. Our slaves were VMs created within OpenStack, and what we were seeing was a failure to renew the DHCP lease correctly. When OpenStack detects that the guest OS has failed to renew its DHCP lease on time, it (briefly) drops the network link in order to prompt a lease renewal. However, this causes Windows to panic and kill all TCP connections (due to the way Windows mishandles network layers). It seems that the DHCP client does not calculate the renewal time independently of the system's idea of "real time", so it all goes wrong when the date/time gets changed by the Windows Time Service: the renewal is missed, OpenStack bounces the physical layer, and Windows cascades that into an application-layer network outage, killing the TCP connection that the slave relies on. We "fixed" this by forcing our slaves to:
- run "w32tm /resync" until they'd got the time synchronized,
- turn off the Windows Time Service entirely,
- run "ipconfig /release" followed by "ipconfig /renew" (they are two separate commands) to update the DHCP lease time,
- start the Jenkins JNLP slave process
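For anyone wanting to script the sequence above, here is a rough sketch in Python rather than our exact startup script. Assumptions that are not from our setup: the function names, the retry count and delay, that "w32tm /resync" exits non-zero when the sync fails, that "turn off entirely" means both "net stop w32time" and "sc config w32time start= disabled", and that the slave launch command is passed in by the caller (agent_cmd is a placeholder).

```python
import subprocess
import time


def run(cmd):
    """Run a command and return its exit code (assumed to be 0 on success)."""
    return subprocess.call(cmd)


def prepare_and_launch(agent_cmd, runner=run, max_attempts=10,
                       delay_s=5.0, sleep=time.sleep):
    """Sync the clock, disable the time service, renew DHCP, start the slave.

    agent_cmd is a hypothetical placeholder: the argument list that starts
    your Jenkins JNLP slave process. runner and sleep are injectable so the
    sequence can be dry-run without touching the machine.
    """
    # 1. Keep asking for a resync until it reports success.
    for _ in range(max_attempts):
        if runner(["w32tm", "/resync"]) == 0:
            break
        sleep(delay_s)
    else:
        raise RuntimeError("clock did not synchronize; not starting the slave")

    # 2. Stop the Windows Time Service and prevent it from being restarted
    #    (assumption: this is what "turn off entirely" needs).
    runner(["net", "stop", "w32time"])
    runner(["sc", "config", "w32time", "start=", "disabled"])

    # 3. Release and renew as two separate commands, so the lease timer
    #    starts from the now-correct clock.
    runner(["ipconfig", "/release"])
    runner(["ipconfig", "/renew"])

    # 4. Only now start the slave; the clock should not change under it.
    return runner(agent_cmd)
```

Passing a runner that just prints each command is a harmless way to check the ordering before running it for real. Note that the release/renew step will drop the network briefly, so this has to run locally on the slave (e.g. at boot), not over a remote session.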
This ensured that Windows would not update its clock while the slave's TCP connection was live, so we were no longer affected by the DHCP client's inability to keep the network alive after clock changes. Since doing that we've not had any further problems of this nature (and we're quite pleased with that!). Note: I've also seen Windows 10 report unpredictable (and incorrect) DHCP lease renewal times on other (non-OpenStack) machines - it lies. |