Windows Server 2003 Service Pack 1 causes System.Threading.Timer to stop firing, sometimes immediately and sometimes after a while. Once a timer dies, it never fires again.
Jamus's repro app causes this problem almost immediately by using a large number of timers. We encountered this problem in a larger production app that only had 7 timers. Eventually some timers would stop firing, never to fire again, while other timers continued to fire.
A variation on Jamus's repro app is to change the Timer period from Timeout.Infinite to something like 15000 (15 seconds). This allows you to see that sometimes all timers will fire the first time around, but on subsequent firings some timers will start dying off. Sometimes all timers will fire repeatedly for quite a while, and then system load appears to cause them to drop off. We did some other tests to verify that this isn't a problem with ThreadPool.QueueUserWorkItem or ThreadPool.UnsafeQueueUserWorkItem. It happens with both Debug and Release builds.
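For anyone who wants to try the periodic variation, here is a rough sketch of what such a test could look like. This is not Jamus's actual repro app. The class name, the timer count of 1000, and the 1-second initial delay are assumptions chosen only for illustration; the 15000 ms period is the one mentioned above.

```csharp
using System;
using System.Threading;

// Hypothetical repro sketch: many periodic timers, each counting its own firings.
class PeriodicTimerRepro
{
    const int TimerCount = 1000;   // assumed; "a large number of timers"
    const int DueTimeMs = 1000;    // assumed initial delay
    const int PeriodMs = 15000;    // 15-second period, as described above

    // Static fields keep the timers rooted so the GC cannot collect them.
    static Timer[] timers = new Timer[TimerCount];
    static int[] fireCounts = new int[TimerCount];

    static void OnTick(object state)
    {
        // state carries the (boxed) index of the timer that fired.
        Interlocked.Increment(ref fireCounts[(int)state]);
    }

    static void Main()
    {
        // A single delegate shared by all timers.
        TimerCallback callback = new TimerCallback(OnTick);
        for (int i = 0; i < TimerCount; i++)
            timers[i] = new Timer(callback, i, DueTimeMs, PeriodMs);

        // After each full period every healthy timer should have fired
        // at least 'round' times; anything below that has fallen behind.
        for (int round = 1; ; round++)
        {
            Thread.Sleep(PeriodMs);
            int behind = 0;
            for (int i = 0; i < TimerCount; i++)
            {
                // CompareExchange(ref x, 0, 0) is used here as an atomic read.
                if (Interlocked.CompareExchange(ref fireCounts[i], 0, 0) < round)
                    behind++;
            }
            Console.WriteLine("Round {0}: {1} of {2} timers are behind",
                round, behind, TimerCount);
        }
    }
}
```

On an unaffected machine the "behind" count should stay at 0 unless the thread pool is badly starved. On an affected machine we would expect it to become non-zero and never recover.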
This problem only occurs on Windows Server 2003 with Service Pack 1. We tested several OS and Service Pack variants including the .NET Framework 1.1 with and without the .NET Framework 1.1 SP1. The culprit is very clearly Windows Server 2003 SP1. No other OS exhibits this behavior.
Our partial workaround is to implement our own timers using a dedicated thread. However, this is insufficient, since we also use classes in the .NET Framework that themselves use System.Threading.Timer. Classes that use System.Threading.Timer include:
As you can see from the list, this has a fairly serious impact on important pieces of the .NET Framework. We have also reproduced this problem using the System.Timers.Timer, and I assume you could find ways of reproducing it with other classes listed above.
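The dedicated-thread workaround mentioned above could be sketched roughly as follows. This is a minimal illustration, not our production code: the class name DedicatedThreadTimer and its constructor parameters are hypothetical and simply mirror the shape of the System.Threading.Timer constructor.

```csharp
using System;
using System.Threading;

// Hypothetical replacement timer that runs its callback on its own thread
// instead of relying on System.Threading.Timer.
public class DedicatedThreadTimer : IDisposable
{
    private readonly TimerCallback callback;
    private readonly object state;
    private readonly int dueTimeMs;
    private readonly int periodMs;
    private readonly ManualResetEvent stopSignal = new ManualResetEvent(false);
    private readonly Thread worker;

    public DedicatedThreadTimer(TimerCallback callback, object state,
                                int dueTimeMs, int periodMs)
    {
        this.callback = callback;
        this.state = state;
        this.dueTimeMs = dueTimeMs;
        this.periodMs = periodMs;

        worker = new Thread(new ThreadStart(Run));
        worker.IsBackground = true;   // do not keep the process alive
        worker.Start();
    }

    private void Run()
    {
        int wait = dueTimeMs;
        // WaitOne returns true if stopSignal was set, false if the wait timed out.
        while (!stopSignal.WaitOne(wait, false))
        {
            try
            {
                callback(state);
            }
            catch (Exception ex)
            {
                // Keep the timer thread alive if the callback throws.
                Console.Error.WriteLine(ex);
            }

            if (periodMs == Timeout.Infinite)
                break;                // one-shot timer
            wait = periodMs;
        }
    }

    public void Dispose()
    {
        stopSignal.Set();
        // Avoid joining ourselves if Dispose is called from the callback.
        if (Thread.CurrentThread != worker)
            worker.Join();
    }
}
```

Two caveats with this sketch: the period is measured from the end of one callback to the start of the next, so it drifts by the callback's running time, and one thread per timer does not scale to a large number of timers. It also does nothing for Framework classes that create their own System.Threading.Timer internally, which is why it is only a partial workaround.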
One difference we saw from what Jamus reported is that this problem reproduced for us quite easily on single-processor Windows Server 2003 machines.
The problem appears to be worse under load. I took a look at the .NET performance counters while the repro app was running, and the only difference I noticed between runs that lost timers and runs that didn't was an extra GC on the successful runs. I can't see how that would be significant, but the following comment in the Rotor source for AddTimerCallbackEx in comthreadpool.cpp makes me nervous:
// NOTE: there is a potential race between the time we retrieve the app domain pointer,
// and the time which this thread enters the domain.
//
// To solve the race, we rely on the fact that there is a thread sync (via GC)
// between releasing an app domain's handle, and destroying the app domain. Thus
// it is important that we not go into preemptive gc mode in that window.
Another bit of weirdness we saw while debugging our production app with 7 timers involves TimerCallback delegates pointing to different instances of the exact same type of object. The TimerCallback's _methodPtr field was sometimes the same as the MethodDesc table's Entry value, which points to the beginning of the method's instructions, while at other times the _methodPtr field pointed to an instruction that does a jmp to the beginning of the method referenced by the MethodDesc table. I was able to see this with WinDbg and SOS using !dumpmt -MD and !u. This seemed pretty weird, since I was under the impression that delegate signatures were "equal" if _target and _methodPtr matched. Perhaps delegate pointers aren't always being fixed up during a GC? However, this doesn't match what we see in Jamus's repro app, which uses a single TimerCallback delegate for all Timers and only some die, so this may be yet another issue, or more likely a misunderstanding on my part.
Oran wrote:
> One difference we saw from what Jamus reported is that this problem
> reproduced for us quite easily on single-processor Windows Server 2003
> machines.
Apologies; this was due to poor testing on our part. We had tested on three machines, and only the multi-processor machine exhibited the problem. This was before we noticed that it was also the only machine that had been updated to SP1. Later tests showed that all machines with SP1 seemed to exhibit this behavior.
> The problem appears to be worse under load. I took a look at the .NET
> performance counters while the repro app was running, and the only
> difference I noticed between runs that lost timers and runs that didn't
Running several copies of the app I made will cause more of the timers to fail, particularly on the later executions. For instance, running 10 copies simultaneously will often give 100% success on the first one or two copies, then progressively lower rates for each subsequently launched process. Similar behavior could be generated, however, by running a process on a higher-priority thread that did garbage calculations, creating artificially high CPU load. I suspect that the downward trend when running 10 copies of the app is due to the small delay between each process actually starting, which allowed earlier copies to create their timers with less CPU load (or fewer timers?) on the system than later instances, which had to compete with the earlier instances.
I hope that made sense..
As for the GC: we had a wrapper class around our System.Threading.Timer instances so that we could have some special error reporting and timer tracking in our application. Our application was EXTREMELY heavy on threading timers, with cases of many timers being generated on the same instance of an object, and also of many timers being generated on unique instances. There does not seem to be any difference in performance between the two cases. Also, a finalizer notified us if the object (and therefore its System.Threading.Timer reference) was released before the timer fired; this was to ensure that the timers were not somehow either being GCed before firing, or failing and then being GCed. What we found is that those timers that don't fire are NEVER collected. If timers are being continually generated, this is particularly evident by watching the memory charge on the process, which continues to climb as timers are allocated and never destroyed.
We made similar tests on the threadpool functions in an effort to narrow down the problem, but the threadpool does not seem to be the culprit.
We're watching with interest, hoping something comes of this soon; it seems that the only thing which corrects this behavior is to uninstall SP1, and sometimes even that does not work.
This problem can appear for a number of reasons. It often appears because of deadlocks in an application.
If the code has been reviewed and it is clear that no deadlocks exist, the issue may be caused by a Timer issue identified in the following KB article. http://support.microsoft.com/?kbid=900822
The KB article referenced by Todd is the fix for this, but it has a misleading title. It claims this is a fix for "a Windows Forms-based application" using the System.Threading.Timer class. Our tests and the repro app mentioned above have reproduced this problem in a Windows Service and a Console app. It will also happen in an ASP.NET app, anything that uses Timers directly, or anything that uses the other classes listed above.
In addition, the KB article claims that this problem is with the .NET Framework in general. In fact, the problem described above only happens when Windows Server 2003 Service Pack 1 is applied, and the hotfix that fixes this is the one for Windows Server 2003 titled 240661_ENU_i386_zip.exe, not the one titled 240388_ENU_i386_zip.exe which is for the rest of the .NET Framework OSs. I don't doubt there were other problems with the Timer class that are fixed by the second hotfix, but the problem described above only occurs on Windows Server 2003 SP1 and therefore is fixed by the 2003 hotfix.
Anyway, Microsoft won't charge you for the call if you call them to get this hotfix, and their responsiveness to this bug has been great.
In my correspondence with Microsoft on this issue, they referenced another KB article number that doesn't exist as of this writing but may appear at some point: 903091. This one is in relation to the Windows Server 2003 SP1 hotfix. Hopefully it will contain more accurate information on this bug and its fix.
Can you see if this KB applies to your problem by any chance?
FIX: When a Windows Forms-based application uses the System.Threading.Timer class, the timer event may not be signaled in the .NET Framework 1.1 SP1 http://support.microsoft.com/?id=900822
I talked with someone at Microsoft about changing the title of the KB900822 article to reflect the fact that it has nothing to do with Windows Forms, and they said they had made a request to have the KB article edited to more clearly indicate that the problem isn't application-specific, but is a problem with the Timer class in Windows Server 2003 SP1.
I have been troubled by this issue too, and have tried everything suggested in the forum articles I found on the net, with no success. Finally, just now, it seems that I have found a workaround.
I have a dedicated Elapsed event handler for every Timer object. After adding the following line at the top of every handler, all of the Timer objects now seem to keep working without stopping.
System.Threading.Thread.Sleep(0);
As we already know, calling Sleep with an argument of 0 lets other waiting threads run, and I think it simply lets any "stopped" Timer object get back in business.
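To make the placement concrete, here is a minimal sketch of a handler with that line at the top. The class name, the 1-second interval, and the console output are assumptions for illustration only. I have not verified that this cures the problem in general, so treat it as something to try, not as a replacement for the hotfix discussed earlier in the thread.

```csharp
using System;
using System.Threading;
using System.Timers;

// Hypothetical example: a System.Timers.Timer whose Elapsed handler
// yields with Thread.Sleep(0) before doing any work.
class SleepZeroWorkaround
{
    // Static field keeps the timer rooted for the life of the process.
    static System.Timers.Timer timer;

    static void OnElapsed(object sender, ElapsedEventArgs e)
    {
        // The workaround line: give up the rest of this time slice
        // so that other ready threads can run.
        Thread.Sleep(0);

        Console.WriteLine("Tick at {0}", e.SignalTime);
    }

    static void Main()
    {
        timer = new System.Timers.Timer(1000);   // assumed 1-second interval
        timer.Elapsed += new ElapsedEventHandler(OnElapsed);
        timer.AutoReset = true;
        timer.Start();

        Console.WriteLine("Press Enter to exit.");
        Console.ReadLine();
        timer.Dispose();
    }
}
```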