Jamus Sprinson posted this first with a simple repro app at
http://groups-beta.google.com/group/microsoft.public.dotnet.framework.performance/browse_thread/thread/8f4430b6f6efe829/92db52b56f6ebf28
Jamus's repro app causes this problem almost immediately by using a
large number of timers. We encountered this problem in a larger
production app that only had 7 timers. Eventually some timers would
stop firing, never to fire again, while other timers continued to fire.
A variation on Jamus's repro app is to change the Timer period from
Timeout.Infinite to something like 15000 (15 seconds). This allows you
to see that sometimes all timers will fire the first time around, but
on subsequent firings some timers will start dying off. Sometimes all
timers will fire repeatedly for quite a while, and then system load
appears to cause them to drop off. We did some other tests to verify
that this isn't a problem with ThreadPool.QueueUserWorkItem or
ThreadPool.UnsafeQueueUserWorkItem. It happens with both Debug and
Release builds.
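For anyone who wants to try the periodic variation, here is a rough
sketch of a harness along those lines. It is not Jamus's actual repro
app; the names FireCounter, TimerDropoutCheck and Run are made up for
illustration. It starts a batch of periodic timers and counts how many
of them fired during each observation window.

```csharp
using System;
using System.Threading;

// Hypothetical harness, not Jamus's actual repro app.
public class FireCounter
{
    public int Count;   // incremented once per timer callback
}

public static class TimerDropoutCheck
{
    private static void OnTick(object state)
    {
        FireCounter counter = (FireCounter)state;
        Interlocked.Increment(ref counter.Count);
    }

    // Starts timerCount periodic timers and returns, for each observation
    // window, how many of them fired at least once during that window.
    public static int[] Run(int timerCount, int periodMs, int windows)
    {
        FireCounter[] counters = new FireCounter[timerCount];
        Timer[] timers = new Timer[timerCount];
        for (int i = 0; i < timerCount; i++)
        {
            counters[i] = new FireCounter();
            // The period is periodMs rather than Timeout.Infinite,
            // so each timer should keep firing.
            timers[i] = new Timer(new TimerCallback(OnTick), counters[i],
                                  periodMs, periodMs);
        }

        int[] seen = new int[timerCount];
        int[] firedPerWindow = new int[windows];
        for (int w = 0; w < windows; w++)
        {
            // The first check lands midway between the 1st and 2nd expected
            // firing; later checks are one period apart.
            Thread.Sleep(w == 0 ? periodMs + periodMs / 2 : periodMs);
            for (int i = 0; i < timerCount; i++)
            {
                int now = Interlocked.CompareExchange(ref counters[i].Count, 0, 0);
                if (now > seen[i]) firedPerWindow[w]++;
                seen[i] = now;
            }
        }

        // The array keeps the timers referenced until they are disposed here.
        foreach (Timer t in timers) t.Dispose();
        return firedPerWindow;
    }
}
```

Calling TimerDropoutCheck.Run(200, 15000, 4) and printing the returned
array gives one count per window. On a machine without the problem I
would expect every entry to equal the number of timers; a count that
shrinks from one window to the next is the dying-off behavior described
above.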
This problem only occurs on Windows Server 2003 with Service Pack 1.
We tested several OS and Service Pack variants including the .NET
Framework 1.1 with and without the .NET Framework 1.1 SP1. The culprit
is very clearly Windows Server 2003 SP1. No other OS exhibits this
behavior.
Our partial workaround is to implement our own timers using a dedicated
thread. However, this is insufficient since we also use classes in the
.NET Framework that use System.Threading.Timer. Classes that use
System.Threading.Timer include:
System.Data.SqlClient.ConnectionPool
System.Data.SqlClient.TdsParser
System.Data.SqlClient.Lifetime.LeaseManager
System.Timers.Timer
System.Web.Caching.CacheExpires
System.Web.Caching.CacheInternal.StartCacheMemoryTimers
System.Web.HttpRuntime
System.Web.RequestQueue
System.Web.RequestTimeoutManager
System.Web.SessionState
System.Web.Util.ResourcePool
As you can see from the list, this has a fairly serious impact on
important pieces of the .NET Framework. We have also reproduced this
problem using the System.Timers.Timer, and I assume you could find ways
of reproducing it with other classes listed above.
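To give an idea of the dedicated-thread workaround mentioned above, here
is a minimal sketch. It is not our production code; the class name
DedicatedThreadTimer and its members are invented for this example. It
runs the callback on its own thread instead of going through
System.Threading.Timer, and it assumes that measuring the period from
the end of one callback to the start of the next is acceptable.

```csharp
using System;
using System.Threading;

// Hypothetical sketch of a timer driven by its own thread.
public sealed class DedicatedThreadTimer : IDisposable
{
    private readonly TimerCallback callback;
    private readonly object state;
    private readonly int periodMs;
    private readonly ManualResetEvent stop = new ManualResetEvent(false);
    private readonly Thread worker;

    public DedicatedThreadTimer(TimerCallback callback, object state, int periodMs)
    {
        this.callback = callback;
        this.state = state;
        this.periodMs = periodMs;
        worker = new Thread(new ThreadStart(Loop));
        worker.IsBackground = true;   // do not keep the process alive
        worker.Start();
    }

    private void Loop()
    {
        // WaitOne returns false when the wait times out (time to tick)
        // and true once Dispose has signaled the stop event.
        while (!stop.WaitOne(periodMs, false))
        {
            callback(state);
        }
    }

    public void Dispose()
    {
        stop.Set();
        // Avoid joining ourselves if Dispose is called from the callback.
        if (Thread.CurrentThread != worker)
        {
            worker.Join();
        }
    }
}
```

One thread per timer is wasteful if you have many timers, so a real
implementation would probably service a sorted list of due times from a
single thread. And of course none of this helps with the Framework
classes listed above, which create their own System.Threading.Timer
instances internally.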
One difference we saw from what Jamus reported is that this problem
reproduced for us quite easily on single-processor Windows Server 2003
machines.
The problem appears to be worse under load. I took a look at the .NET
performance counters while the repro app was running, and the only
difference I noticed between runs that lost timers and runs that didn't
was an extra GC on the successful runs. I can't see how that would be
significant, but the following comment in the Rotor source for
AddTimerCallbackEx in comthreadpool.cpp makes me nervous:
// NOTE: there is a potential race between the time we retrieve the app domain pointer,
// and the time which this thread enters the domain.
//
// To solve the race, we rely on the fact that there is a thread sync (via GC)
// between releasing an app domain's handle, and destroying the app domain. Thus
// it is important that we not go into preemptive gc mode in that window.
Another bit of weirdness we saw while debugging our production app with
7 timers is that for TimerCallback delegates pointing to different
instances of the same exact type of object, the TimerCallback's
_methodPtr field was sometimes the same as the MethodDesc table's Entry
value which points to the beginning of the method's instructions, while
at other times the _methodPtr field pointed to an instruction that does
a jmp to the beginning of the method referenced by the MethodDesc
table. I was able to see this with WinDbg and SOS using !dumpmt -MD
and !u. This seemed pretty weird since I was under the impression that
delegate signatures were "equal" if _target and _methodPtr matched.
Perhaps delegate pointers aren't always being fixed up during a GC?
However this doesn't match with what we see in Jamus's repro app that
only uses a single TimerCallback delegate for all Timers and only some
die, so this may be yet another issue, or more likely a
misunderstanding on my part.
There is also another unresolved report of slightly different
System.Threading.Timer flakiness on Windows Server 2003 here
http://groups-beta.google.com/group/microsoft.public.dotnet.languages.csharp/browse_thread/thread/ad9c74e361a88a06/b7ff4ba0b5e1e533
Does anyone know of a hotfix or better workaround for this issue?
Oran
Oran wrote:
> One difference we saw from what Jamus reported is that this problem
> reproduced for us quite easily on single-processor Windows Server 2003
> machines.
Apologies; this was due to poor testing on our part; we had tested on
three machines, and only the multi-processor machine exhibited the
problem. This was before we noticed that only that machine had been updated
to SP1 as well. Later tests showed that all machines with SP1 seemed
to exhibit this behavior.
> The problem appears to be worse under load. I took a look at the .NET
> performance counters while the repro app was running, and the only
> difference I noticed between runs that lost timers and runs that didn't
Running several copies of the app I made will cause more of the timers
to fail, particularly on the later executions. For instance, running
10 copies simultaneously will often give 100% success on the first one
or two copies, then progressively lower rates for each subsequently
launched process. Similar behavior could be generated, however, by
running a process on a higher priority thread which did garbage
calculations, creating artificially high CPU load. I suspect that
downward trend in running 10 copies of the app is due to the small
delay between each process actually starting, which allowed earlier
copies to create their timers with less CPU load (or fewer timers?) on
the system than later instances, which had to compete with the earlier
instances.
I hope that made sense..
As for the GC: we had a wrapper class around our
System.Threading.Timer objects so that we could have some special error
reporting and timer tracking in our application. Our application was
EXTREMELY heavy on threading timers, with cases of many timers being
created on the same instance of an object, and also of many timers
being created on unique instances. There does not seem to be any
difference in performance between the two cases. Also, a finalizer
notified us if the object (and therefore its System.Threading.Timer
reference) was released before the timer fired; this was to ensure that
the timers were not somehow either being GCed before firing, or failing
and then being GCed. What we found is that those timers that don't fire
are NEVER
collected. If timers are being continually generated, this is
particularly evident by watching the memory charge on the process,
which continues to climb as timers are allocated and never destroyed.
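For illustration, a tracking wrapper of this general kind might look
something like the sketch below. This is not our actual class; the name
TrackedTimer and its counters are made up for the example. It wraps a
one-shot System.Threading.Timer, counts creations and firings, and uses
a finalizer to count wrappers that were collected without ever firing.

```csharp
using System;
using System.Threading;

// Hypothetical wrapper, not the class from our application.
public sealed class TrackedTimer : IDisposable
{
    private static int created;
    private static int fired;
    private static int finalizedUnfired;

    public static int Created { get { return created; } }
    public static int Fired { get { return fired; } }
    public static int FinalizedUnfired { get { return finalizedUnfired; } }

    private readonly TimerCallback userCallback;
    private readonly Timer timer;
    private int hasFired;   // 0 until the callback has run once

    public TrackedTimer(TimerCallback callback, object state, int dueMs)
    {
        userCallback = callback;
        Interlocked.Increment(ref created);
        // One-shot timer: the period is Timeout.Infinite.
        timer = new Timer(new TimerCallback(OnFire), state, dueMs, Timeout.Infinite);
    }

    private void OnFire(object state)
    {
        if (Interlocked.Exchange(ref hasFired, 1) == 0)
        {
            Interlocked.Increment(ref fired);
        }
        userCallback(state);
    }

    public void Dispose()
    {
        // A wrapper disposed before firing is still counted by the
        // finalizer as "collected without firing".
        timer.Dispose();
    }

    ~TrackedTimer()
    {
        if (hasFired == 0)
        {
            Interlocked.Increment(ref finalizedUnfired);
        }
    }
}
```

With something like this, a gap between Created and
Fired + FinalizedUnfired that never closes would correspond to timers
that neither fired nor were collected, which is the pattern described
above.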
We made similar tests on the threadpool functions in an effort to
narrow down the problem, but the threadpool does not seem to be the
culprit.
We're watching with interest, hoping something comes of this soon; it
seems that the only thing which corrects this behavior is to uninstall
SP1, and sometimes even that does not work.
Jamus
If the code has been reviewed and it is clear that no deadlocks exist, the problem may be caused by the Timer issue described in the following KB article.
http://support.microsoft.com/?kbid=900822
Thanks!
Todd Reifsteck
In addition, the KB article claims that this problem is with the .NET
Framework in general. In fact, the problem described above only
happens when Windows Server 2003 Service Pack 1 is applied, and the
hotfix that fixes this is the one for Windows Server 2003 titled
240661_ENU_i386_zip.exe, not the one titled 240388_ENU_i386_zip.exe
which is for the rest of the .NET Framework OSs. I don't doubt there
were other problems with the Timer class that are fixed by the second
hotfix, but the problem described above only occurs on Windows Server
2003 SP1 and therefore is fixed by the 2003 hotfix.
Anyway, Microsoft won't charge you for the call if you call them to get
this hotfix, and their responsiveness to this bug has been great.
In my correspondence with Microsoft on this issue, they referenced
another KB article number that doesn't exist as of this writing but may
appear at some point: 903091. This one is in relation to the Windows
Server 2003 SP1 hotfix. Hopefully it will contain more accurate
information on this bug and its fix.
Can you see if this KB applies to your problem by any chance?
FIX: When a Windows Forms-based application uses the System.Threading.Timer
class, the timer event may not be signaled in the .NET Framework 1.1 SP1
http://support.microsoft.com/?id=900822
"Jamus" <jamu...@earthlink.net> wrote in message
news:1120063129.3...@g43g2000cwa.googlegroups.com...
I have a dedicated Elapsed event handler for every Timer object. After
adding the following line at the top of every handler, it seems that
all Timer objects now work without stopping again.
System.Threading.Thread.Sleep(0);
As we already know, calling Sleep with an argument of 0 gives up the
rest of the current time slice so that other waiting threads can run,
and I think it simply lets any 'stopped' Timer object get "back in
business" again.
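In context, the handler arrangement looks roughly like the sketch
below. The class name SleepZeroWorkaround and the tick counter are only
there for illustration; the one line that matters is the Sleep(0) call
at the top of the Elapsed handler. I can't say why it helps, only that
it seemed to.

```csharp
using System;
using System.Threading;
using System.Timers;

// Hypothetical example of an Elapsed handler that starts with Sleep(0).
public static class SleepZeroWorkaround
{
    private static int ticks;
    public static int Ticks { get { return ticks; } }

    private static void OnElapsed(object sender, ElapsedEventArgs e)
    {
        // Give up the rest of this time slice before doing any work.
        Thread.Sleep(0);
        Interlocked.Increment(ref ticks);
    }

    public static System.Timers.Timer Start(double intervalMs)
    {
        // Fully qualified because System.Threading also defines a Timer.
        System.Timers.Timer timer = new System.Timers.Timer(intervalMs);
        timer.Elapsed += new ElapsedEventHandler(OnElapsed);
        timer.AutoReset = true;
        timer.Start();
        return timer;
    }
}
```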
--
dixy