Hi all,
I just filed JENKINS-50504 to describe a bug that I hit a few times a month. In short, when the master's connection to an SSH slave times out and a new connection is opened, jobs still keep running under the old Remoting channel, but their workspaces get handed out to new jobs (because the logic that checks for a workspace being in use doesn't take this case into account), and then both jobs clobber each other and fail.
I have written a detailed evaluation of the issue in the bug. The cause of the problem is that WorkspaceList#inUse is a Map<FilePath, Entry> and FilePath#equals checks that the channels are the same to consider a given FilePath equal to another one. In my case, the channel reference of the proposed workspace is a new channel (because the node reconnected), while the entry in inUse references the old channel (because the job is still running under the old channel). As a result, the workspace is not considered to be in use and is handed out to a new job.
Since this bug impacts me at least twice a month and takes down a large percentage of my Jenkins jobs, I would like to try and contribute a fix. However, I need help designing a solution. I can think of two ugly solutions:
1. When a node reconnects due to an I/O error, update the entries in WorkspaceList#inUse to reference the new channel in the key to the map. This would fix the bug. However, it seems ugly to use the new channel in the "in use" map, because the job is still technically running under the old channel.
2. Maintain a list of all channels that a given node has ever had open (including channels that got closed due to timeout). Then, when checking for a workspace being in use, construct a proposed FilePath for each one of those channels, and fail if any of them has an entry in the "in use" map. This design concerns me because of the potential for this list of old channels to grow in size without bound.
Could someone with more familiarity with Jenkins core weigh in with a better way to solve this problem? If so, I could try to submit a pull request.
Thanks in advance,
Basil