Looks like a fun problem!
Zones are a good fit for this use case.
A zone can encapsulate a number of parallel async activities; the
first to complete can return a result for the zone, which then tears
down all the other ongoing activities. See
https://github.com/strongloop/zone. You could even try it: if it
works for you, who cares if it has bugs that you don't hit, right? :-)
And we could use the feedback.

But, bleeding edge concurrency primitives aside, a few suggestions:

- It is very unlikely that cluster is your problem. It has no effect
  (you are on 0.10, right? You didn't say, and the node version is
  important) beyond making sure listening sockets are shared across
  the cluster; it shouldn't do anything with outgoing client
  connections.
- Easy way to verify: run without cluster. Always a good idea anyhow;
  if it doesn't work standalone, it won't work with cluster.

You don't mention any way to cancel your parallel actions, so when the
first of a set of parallel actions completes, all the others keep
going. With fast incoming connections, every completed request will
"leak" (for a little while) some set of incomplete outgoing
connections. That is a brutal multiplier effect, made worse the more
"irregular" the response time is: incoming requests/sec will be as
fast as the fastest outgoing request, but the accumulation of unneeded
outgoing connections will be based on the slowest of the outgoing
response times. Ouch.
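
To put rough numbers on that multiplier, here is a back-of-envelope
sketch using Little's law (in flight = arrival rate x time in flight).
The function name and every number in it are made up for illustration;
plug in your own measurements.

```javascript
// Outgoing requests in flight at steady state, by Little's law.
// ratePerSec: incoming requests per second (assumed)
// fanOut:     outgoing requests started per incoming request (assumed)
// durationMs: how long each outgoing request stays open
function inFlight(ratePerSec, fanOut, durationMs) {
  return ratePerSec * fanOut * durationMs / 1000;
}

// If the losers were cancelled as soon as the fastest one answered
// (say 50 ms), each outgoing request would live about 50 ms:
console.log(inFlight(200, 4, 50));   // 40

// Without cancellation each one lives for its own full response time
// (say 1500 ms on average because of a slow tail):
console.log(inFlight(200, 4, 1500)); // 1200
```

Same incoming rate, thirty times as many open outgoing connections,
and the gap grows with the slow tail.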

If you have fd limits (remove them), you can hit those.
If you have connection pool max size limits (outgoing mysql
connections, for example, or the HTTP agent's limit on concurrent
outgoing requests, which defaults to 5 sockets per host in 0.10,
http://nodejs.org/api/http.html#http_agent_maxsockets), your
dead-but-ongoing requests can slam into those barriers fast with the
multiplier effect you have. This might be capping your incoming
request completion rate.
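
If the HTTP agent limit turns out to be the barrier, something like
the following might help you confirm it. It raises maxSockets on the
default agent and reports how many outgoing requests are active versus
queued behind the limit. The value 100 and the helper names
(countPerHost, agentStats) are my assumptions, not anything from your
code; agent.sockets and agent.requests are the per-host lists the
agent keeps.

```javascript
var http = require('http');

// Assumed value; pick one that fits your backends and your fd limits.
var MAX_OUTGOING_SOCKETS = 100;

// Raise the per-host limit on the default agent.
http.globalAgent.maxSockets = MAX_OUTGOING_SOCKETS;

// Sum the lengths of an agent's per-host lists.
function countPerHost(lists) {
  return Object.keys(lists).reduce(function (sum, host) {
    return sum + lists[host].length;
  }, 0);
}

// agent.sockets:  sockets currently in use, keyed by host
// agent.requests: requests waiting because the per-host limit was hit
function agentStats(agent) {
  return {
    maxSockets: agent.maxSockets,
    active: countPerHost(agent.sockets),
    queued: countPerHost(agent.requests)
  };
}

console.log(agentStats(http.globalAgent));
```

A steadily growing "queued" number under load would suggest the
outgoing requests really are piling up behind the agent limit.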

You don't mention logging. You should count the current number of
incomplete incoming and outgoing requests, and log those counts.
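
A minimal sketch of that kind of counting, assuming nothing about your
code: makeInflightCounter is a hypothetical helper, and where you call
start() and done() depends on how your handlers are structured.

```javascript
// Hypothetical helper: tracks how many operations are still incomplete.
function makeInflightCounter(name) {
  var current = 0;
  var peak = 0;
  return {
    // Call when an operation begins; call the returned function when
    // it ends. Calling the returned function twice is harmless.
    start: function () {
      current += 1;
      if (current > peak) peak = current;
      var finished = false;
      return function done() {
        if (finished) return;
        finished = true;
        current -= 1;
      };
    },
    snapshot: function () {
      return { name: name, current: current, peak: peak };
    }
  };
}

var incoming = makeInflightCounter('incoming');
var outgoing = makeInflightCounter('outgoing');

// Log both counts once a second. unref() keeps this timer from
// holding the process open on its own.
setInterval(function () {
  console.log(JSON.stringify([incoming.snapshot(), outgoing.snapshot()]));
}, 1000).unref();
```

In a request handler you would do something like
`var done = incoming.start();` on entry and call `done()` when the
response finishes, and likewise around each outgoing request. If
"outgoing" keeps climbing while "incoming" stays flat, that is the
leak described above.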
Consider using vasync: it is more inspectable than async, which might
make it much easier to dump current state. Also, it appears you may
benefit from its barrier().

For "interrupting", I'd suggest finding a way to cancel every outgoing
request: tear down the connection and terminate it early, if possible,
returning an "interrupted" error result to async ASAP after being
cancelled.
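
Here is one way the cancellation could look, as a sketch rather than a
drop-in for async. firstOf and fetchTask are hypothetical names, and
the convention that each task returns a cancel() function is my
assumption. For HTTP the cancel uses req.abort() (newer node versions
prefer req.destroy()).

```javascript
var http = require('http');

// Hypothetical helper: run tasks in parallel, report the first
// success, and cancel everything else. Each task is function (callback)
// and must return a cancel() function that is safe to call at any
// time, including after the task has already completed.
function firstOf(tasks, callback) {
  var settled = false;
  var failures = 0;
  var cancels = [];

  if (tasks.length === 0) return callback(new Error('no tasks'));

  function settle(err, result) {
    if (settled) return;
    settled = true;
    // Tear down everything that was started.
    cancels.forEach(function (cancel) { cancel(); });
    callback(err, result);
  }

  tasks.forEach(function (task) {
    if (settled) return; // an earlier task already won synchronously
    var cancel = task(function (err, result) {
      if (settled) return; // late results and abort errors are dropped
      if (!err) return settle(null, result);
      failures += 1;
      if (failures === tasks.length) settle(err); // everything failed
    });
    if (settled) cancel(); else cancels.push(cancel);
  });
}

// Example task: an HTTP GET that can be torn down early.
function fetchTask(url) {
  return function (callback) {
    var req = http.get(url, function (res) {
      var chunks = [];
      res.on('data', function (chunk) { chunks.push(chunk); });
      res.on('end', function () { callback(null, Buffer.concat(chunks)); });
    });
    req.on('error', callback); // also fires after abort(); firstOf ignores it
    return function cancel() { req.abort(); };
  };
}

// Usage (the URLs are placeholders):
// firstOf([fetchTask('http://a.example/'), fetchTask('http://b.example/')],
//   function (err, body) { /* first success, or the last error */ });
```

This version silently drops whatever the losers report after being
cancelled. If you want the "interrupted" error to flow back into your
own control flow instead, have cancel() invoke the task's callback
with that error, guarded so it fires only once.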

It would be interesting to hear what you did once you figure this out!
Cheers,
Sam