This may or may not be related, but this seems like a good thread to share an experience. I'm a huge advocate of build-process tools, and we use Jenkins plus Istanbul (and Mocha) to do code coverage. We use a command in our Jenkins build like the following (included here because it took us a while to sort out - maybe it will help somebody else):
JUNIT_REPORT_STACK=1 JUNIT_REPORT_PATH=report.xml NODE_ENV=test node ./node_modules/.bin/istanbul cover -- ./node_modules/.bin/_mocha -R mocha-jenkins-reporter
node ./node_modules/.bin/istanbul report cobertura
The ugly command lines are because Windows can't reliably execute the NodeJS script shims in node_modules/.bin directly - putting "node" before each command makes them work on those machines.
This produces a nice report that Jenkins can import as a build artifact and chart over time. We've been using this to drive our API test coverage as high as possible, and also as a defensive shield against new requests from Product when we feel overall quality is dropping - a chart that proves coverage is slipping really helps make the case that some refactoring and "love" is necessary!
Anyway, there was a branch in our code that we had never tested, where we trapped exceptions from SequelizeJS. We hadn't tested it because nothing had ever failed - so we wrote a hack that uses the beforeCreate/beforeFind/etc. hooks in SequelizeJS to pretend the database had failed. Suddenly everything started hanging in almost exactly the way you described - no trouble at low request rates, lots of trouble during load testing.
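If anyone wants to try the same trick, here's roughly the shape of it - a minimal sketch, not our exact code, assuming a Sequelize version where throwing inside a hook rejects the whole operation (the hook list, connection string, and error message are just illustrations):

```js
// Failure-injection sketch: make every matching operation fail, simulating a
// total database outage. Assumes Sequelize v4+, where a hook that throws
// rejects the operation it guards.
const Sequelize = require('sequelize');
const sequelize = new Sequelize('sqlite::memory:');

function simulateDatabaseOutage() {
  const fail = () => {
    throw new Error('simulated total database failure');
  };
  // Register the same failing hook globally for each operation we care about.
  ['beforeCreate', 'beforeFind', 'beforeUpdate', 'beforeDestroy'].forEach((hook) => {
    sequelize.addHook(hook, fail);
  });
}
```

Call simulateDatabaseOutage() in a test's setup, and every create/find/update/destroy in that test will reject - which forces the error-handling branch you never otherwise exercise to actually run.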
It turns out that when we processed the exception, we weren't reporting the error back to ActionHero properly. ActionHero, like Express, is callback-driven, and you must make ABSOLUTELY SURE that:
- The callback is ALWAYS called no matter how the code works out, and
- The callback is ONLY called a single time.
It's literally the first thing we check for now. Calling next() more than once, not calling it at all, or calling it with something unexpected (the second parameter = false is NOT for reporting errors - grin) is a big problem.
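We now wrap framework callbacks in a tiny guard to enforce both rules. Here's a minimal sketch - the once() helper, handler(), and doTheWork() are our own illustration, not part of the ActionHero API:

```js
// Wrap a callback so a second invocation fails loudly instead of silently
// corrupting state.
function once(name, callback) {
  let called = false;
  return (...args) => {
    if (called) {
      throw new Error('callback "' + name + '" was called more than once');
    }
    called = true;
    return callback(...args);
  };
}

// Every exit path - success, rejected promise, synchronous throw - must reach
// done() exactly once, with the error as the first argument when there is one.
function handler(data, next) {
  const done = once('myAction', next);
  try {
    doTheWork(data) // doTheWork is a hypothetical async step returning a promise
      .then(() => done())
      .catch((err) => done(err));
  } catch (err) {
    done(err); // report synchronous throws too - never swallow them
  }
}
```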
Code coverage tests really saved our bacon here, because this is one of those things developers rarely test for and QA never really "sees". It's hard to simulate a total database failure (not a single error, a complete loss), so we often write it off and hope it doesn't happen much - then six months later when it does, we shrug our shoulders and call it a freak issue, even though we could have caught it. In this case, by leaving connections hanging, the bug could have killed our production environment by swamping it with zombie sockets that never got closed. A classic cascade failure: as nodes died under load, traffic would shift to the others... which had the same bug!
I highly recommend you check the items above, but I also highly recommend a code coverage tool. It's simple enough to run and can be a huge debugging aid for resolving problems like this.