My app gets hit by a kiosk about every 3 minutes; each hit chews up about 1 second of CPU and kicks off a task that eats another 250ms. Users come and go now and then. Today I tried 3 different settings:
< 9am: Max idle = 1, max latency = 1s
9am - 1pm: Max idle = auto, max latency = 1s
1pm - now: Max idle = 1, max latency = 5s
For low traffic, it appears that max idle = 1 behaves the same as "auto". That makes sense. I'm going to leave it at auto, since I'd like to be able to handle spikes well.
It looks like when I set the max latency way up, it sometimes lets the second instance die, but never for very long.
When I've looked at the instances, they never seem to have been alive very long. So I think the scheduler is spinning up one, adding another, letting the first die, adding another, and so on.
Right now I'm in one of those 1-instance troughs. The site is quite responsive while I poke around, and it isn't starting another instance. So here in the trough, I'm getting the same behavior that was reported by the person who tried a "hello world" test. Yet the above graph is what it is.
As I'm writing this, the number of instances just jumped up to 2! So now I can see what caused it:
The kiosk hits the URL with an XMLHttpRequest to make sure the app is alive, and if that works OK, then it refreshes itself. This bit of nastiness is there because it's impossible to get a browser to handle failed page loads consistently.
When the kiosk hits the URL, a task is launched to do some background processing that records the heartbeat. The two hits therefore lead to two tasks. I'm using countdown = 1.
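To make that concrete, here's a minimal sketch of what the handler does, assuming the classic Python runtime with webapp2 and the taskqueue API; the handler name and URL paths are mine, not necessarily what the real app uses:

```python
from google.appengine.api import taskqueue
import webapp2

class KioskHeartbeatHandler(webapp2.RequestHandler):
    """Answers the kiosk's keep-alive hits."""
    def get(self):
        # Every hit enqueues a task to record the heartbeat. With
        # countdown=1, back-to-back hits produce tasks that become
        # eligible to run at almost the same moment.
        taskqueue.add(url='/tasks/record_heartbeat', countdown=1)
        self.response.write('ok')
```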
These two tasks bunch up in the queue and get processed at essentially the same time. That requires, you guessed it, two instances!
I had been thinking of task queues as a sort of background process, and I certainly wouldn't expect the system to spin up an instance just to handle a queued task. But that thinking isn't really right, since the docs explicitly suggest that spawning a lot of tasks is a good way to get a bunch of instances to munch on your hard problem in parallel.
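That documented fan-out pattern looks something like this sketch (the URL and parameter names are illustrative, not from my app):

```python
from google.appengine.api import taskqueue

def fan_out(work_items):
    # One task per item; the scheduler is free to add instances to
    # drain the queue in parallel -- the same behavior that surprised
    # me with just two heartbeat tasks.
    for item in work_items:
        taskqueue.add(url='/tasks/crunch', params={'item': str(item)})
```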
So my fix is actually not so hard. I just need to pass a param with the "just checking" initial kiosk request and use it to avoid spawning the task. That way I'll get hit-hit-task, not hit-hit-task-task, and presumably the system won't feel compelled to crank up a new instance.
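In code, the fix is just a guard in the handler sketched above; the probe parameter name is hypothetical, my choice:

```python
from google.appengine.api import taskqueue
import webapp2

class KioskHeartbeatHandler(webapp2.RequestHandler):
    def get(self):
        # The kiosk's "just checking" XMLHttpRequest now passes
        # ?probe=1; only the real refresh hit enqueues the heartbeat
        # task, so we get hit-hit-task instead of hit-hit-task-task.
        if not self.request.get('probe'):
            taskqueue.add(url='/tasks/record_heartbeat', countdown=1)
        self.response.write('ok')
```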
If we get two kiosks going and they happen to get synchronized (as such things tend to do), then I'll be screwed. But for now, I think I've got my fix...
If you're still reading, I'll give you a reward: if you are trying to diagnose why you have 2 instances when you have the sliders set to 1/15, go to your log, view it at the Info level, and find the requests that are spinning up a new instance. Now look at all requests and find the one that spun up the evil second instance. Was it right on the heels of another request? I bet it was. Is it your fault? (In my case, it certainly was.) Regardless, if you want to avoid that second instance, you need to find a way to get those requests farther apart.
-Joshua