|
@steve carter First, let me thank you for summarizing the earlier comment threads. That does help bring everything into focus.
4) Jenkins, in order to catch rogue processes at job end (i.e. those that have broken ties with their parent process) scans the whole process space for those with the particular BUILD_ID in their environment, and kills them. This is correct and good behavior by Jenkins.
Agreed. This is a perfectly valid and useful enhancement for the majority of cases. However, given the debilitating effect it has on this specific use case combined with the fact that the change was included on an LTS release which is expected to be kept as stable as possible is where I take issue. I see this problem as a bug, albeit a difficult to detect bug and admittedly a bug that is really caused by some questionable behavior provided by the Microsoft build tools, but a bug none the less. In that case critical, production halt kind of bugs like this should be fixed immediately or reverted until an appropriate fix can be made. Doing otherwise reduces users' confidence in the stability of the tool. There is a reason shops like ours choose to use LTS editions for production work - to avoid problems like this that may be found on the latest, cutting edge versions.
8) Solution 1: start pdbsrv with a BUILD_ID that doesn't match the build jobs. Then pdbsrv will not be killed at the end of the job.
This should be called a workaround or hack rather than a solution. That point aside, this workaround again won't work for our particular build environment. We use the BUILD_ID throughout our build processes to embed metadata in the binary files we generate. If we reset that environment variable as part of our build this metadata will essentially get corrupted. Changing our tooling to use an alternative environment variable would require significant effort as well, having to be propagated out to dozens of products across several release branches each.
9) Solution 2: use Daniel's whitelist feature to not kill pdbsrv at the end of the job.
Based on my review of his pull request, Daniel's feature has not yet been completed nor has it been included in any actual LTS release. I do believe this would be a reasonable and appropriate solution to this defect though, so hopefully this work can be completed sooner rather than later.
10) The problem with Solutions 1 and 2 are this: pdbsrv still has a timeout, so you will get sporadic failures when the server goes away.
I know some earlier posters did indicate that this was an issue for them I have not been able to reproduce the problem as described. When a compile begins and this process is running it makes use of the existing process, and if the process is not already running it starts it. I have never had a compile running and seen the mspdbsrv process terminate mid-compile without any other background process or system event occurring. Also, I work with many development teams including many dozens of developers and have never once had a report of this bug outside of the reproducible use cases I've stated before.
Conversely, I have shown the problem is reproducible outside of Jenkins in very hard to detect ways which I suspect may appear to some to be an intermittent timeout. For example, if you are logged in to a system which is performing a compile in a background process which is also running under the same user profile as your local session, by simply logging out of the system the service terminates. The reason for this is the pdbsrv process is shared by the background process and your local user session and when you log out from the local session all processes in that memory space are terminated, including pdbsrv. This was a very difficult use case to isolate and not very obvious to users of the target systems and even went undiagnosed at my place of work for months under the assumption that the failure was unpredictable and intermittent.
I know that my argument doesn't prove that this particular problem couldn't ever happen but I am extremely skeptical to say the least. If someone does believe that this problem does in fact exist I would greatly appreciate a detailed description on how to reproduce the problem. Maybe we're using a slightly older or slightly newer version of the compiler that doesn't exhibit the problem or something. Either way, if these individuals were willing to compare notes maybe we can help further isolate the root of this discrepancy.
12) pdbsrv's timeout doesn't get a new lease every time you use pdbsrv. I regard this as a bug in pdbsrv.
As I've stated in earlier posts, my team manages a build farm with close to a dozen agents now, running over 1000 build jobs and never once have I ever had this error occur on any of those systems, nor have any of the development teams we support report this problem on any of their local development machines. I would have to say that if this were in fact a core issue with the Microsoft toolset we would have discovered it by now. Again, if anyone can give me a reproducible use case that proves otherwise I would be happy to hear from them. Maybe we are doing something they aren't, or vice versa.
13) You can't leave pdbsrv running forever because it (allegedly) has memory leaks. I regard this as a bug in pdbsrv.
Again, this is something we have not been able to reproduce. For example, I have watches some of our agents that are under the most considerable load wrt build operations - machines which essentially run 24/7 compiling one or more projects in parallel nearly all the time and these systems continue to run stably day after day, week after week without requiring any outside intervention from me or my team. The pdbsrv process is nearly always active, the memory consumption increases and decreases with the load on the machines, and never causes any fatal errors in our build processes.
If anyone can provide specific, reproducible criteria for this problem I would be interested to hear it. If there is something we have overlooked that may be causing us grief elsewhere that we have not yet considered I would definitely want to know about it.
I really think to roll back Jenkins' ProcessTreeKiller is NOT a solution.
Agreed. I don't think 'just' rolling back this change is the best solution. I think fixing this bug is the best solution. However in the absence of an appropriate fix for this bug, combined with the severity of it's impact, I think that rolling back the change until an appropriate fix was put in place would have been a better solution rather than stranding users of your tool on an old, out of date release as we have been. 
Just my 2 cents.
The use of BUILD_ID brings the Jenkins machine under better control against rogue processes...
Totally agree that the improvement is well worth the effort. My concern is that the change includes a relatively significant bug.
...and the workaround (for well-behaved servers) is easy, set BUILD_ID before starting the server, or use Daniel's whitelist.
Again, 'easy' workaround is a relative term. As just mentioned we would need to rework our build tools and roll that change out to many teams for many products, and backport those changes to many branches for this to work, after which we'd need to going through all 1000+ jobs on our farm and update them with the hack to the environment variable. Obviously significant effort in our case. Also the whitelist solution has yet to be completed from what I can tell, so that is not a usable solution yet.
14) Solution 3: start pdbsrv periodically, e.g. every day with a day-long timeout. That will mitigate against the memory leaks. If you use some concurrency control, e.g. Job Weight plugin, you can make sure this "kill and restart pdbsrv" job does not fire during a build.
Again, just to be clear this is clearly a workaround and not a solution.
This hack may work for us in the interim until an appropriate fix can be made. I will test it out as soon as I can and report back. In our case we'll likely just setup a scheduled task that runs on boot and forces the service to start, and stay running indefinitely as there is no need for it to shut down ever that we have seen.
However, for those individuals who claim that the service does need periodic resetting a solution like this would likely be more complex. Assuming they to need to ensure the utmost stability of their build farm as we do, they would need to ensure the pdbsrv service gets started before any compilation operation runs, including after reboots, power outages, crashes and the like. I don't believe there is any way to achieve this using a Jenkins operation. This means an external process would be needed like the Scheduled Task idea I just mentioned. But then the external process would be running independently from the Jenkins agent making it even more difficult to coordinate the two. For example, I suspect it would be difficult at best to make sure the scheduled task restarts the service at an opportune moment when no compilation operations are happening on the agent. Just something else for those users to keep in mind.
|