We are trying to debug a poorly performing Node application and would appreciate any help or advice from this community. We have a Node application that serves as the user-facing frontend for a payment platform (code here: https://github.com/alphagov/pay-frontend). We are in the process of assessing and expanding our capacity to meet increasing need. We have a target of being able to serve X payment journeys per second.
A payment journey comprises 4 pages, two of which require a form submission.
Each page in the journey entails some communication between the Node application in question (which we helpfully call frontend) and other microservices, for example to establish the current status of the payment. On average this is around 2 HTTP calls per page.
By carrying out performance tests (using Gatling) we have found that in order to meet our target of X tx/s, we have to provision around X/2 frontend nodes, i.e. each frontend node appears capable of processing around 2 payment journeys per second on average.
This seems wrong - by my reckoning it is wrong by orders of magnitude.
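To put numbers on that, here is a rough sketch of the per-node load implied by the figures above. It counts page loads only (the two form submissions would add to it), and the function name is just for illustration.

```javascript
// Figures taken from the description above; nothing here is measured.
const journeysPerSecondPerNode = 2; // observed per-node throughput
const pagesPerJourney = 4;
const downstreamCallsPerPage = 2;   // "on average around 2"

// Hypothetical helper: converts journeys/s into requests/s for one node.
function perNodeLoad(journeys, pages, callsPerPage) {
  const pageRequests = journeys * pages;
  const downstreamCalls = pageRequests * callsPerPage;
  return { pageRequests, downstreamCalls };
}

const load = perNodeLoad(
  journeysPerSecondPerNode,
  pagesPerJourney,
  downstreamCallsPerPage
);
console.log(load); // { pageRequests: 8, downstreamCalls: 16 }
```

So each node is saturating at roughly 8 incoming page requests and 16 downstream calls per second, which is why the figure looks so low to me.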
Details about our tech stack
We are on AWS, and the frontends run in Docker containers on c5.large EC2 instances.
We use HTTPS internally
We are running Node 8 in production
The application is an Express app
We use http.request to make downstream requests, but have also experimented with using request, with no appreciable difference.
There are no major CPU-heavy processes in our frontend app, and event loop latency under normal load is fine
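For anyone wanting to reproduce the event loop latency check, a minimal sketch of one common approach is below: schedule a repeating timer and record how late it fires. This is not necessarily how we measured it, the function names are made up for illustration, and it only uses APIs available in Node 8.

```javascript
// Convert a process.hrtime() tuple [seconds, nanoseconds] to milliseconds.
function hrtimeToMs(diff) {
  return diff[0] * 1e3 + diff[1] / 1e6;
}

// Hypothetical sampler: calls onSample(lagMs) every intervalMs, where lagMs
// is how much later than scheduled the timer callback actually ran.
function measureEventLoopLag(intervalMs, onSample) {
  let last = process.hrtime();
  const timer = setInterval(() => {
    const elapsedMs = hrtimeToMs(process.hrtime(last));
    onSample(Math.max(0, elapsedMs - intervalMs));
    last = process.hrtime();
  }, intervalMs);
  timer.unref(); // do not keep the process alive just for sampling
  return timer;
}
```

Calling something like measureEventLoopLag(100, lag => console.log(lag)) inside each worker should show lag staying near zero while idle and growing when the loop is blocked by synchronous work.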
What we have found so far
The frontend nodes are CPU bound
Under strain/near breaking point, profiling reveals the frontends seem to be spending a large amount of time doing things related to making downstream http requests, but nothing obviously ludicrous.
Whilst there is no obvious memory leak, the heap dump deltas show a proportionately large number of Sockets hanging around - I think this is just due to keepalives though
Even not under heavy load, the network latency for a request seems high for an internal request - we are seeing average latency of ~20-40ms, vs around 2-5ms for a Java app that is more or less identical in the calls it's making.
A breakdown of the phases of a request (gained from the request library's timing facility) reveals that under low load, socket wait, DNS lookup and TCP connection take practically no time on average - the bulk of the time is spent waiting for the server response
Under load it appears to be the time to establish a TCP connection and the time to 'firstByte' that contribute to the overall increase in HTTP request time
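The same kind of phase breakdown can be approximated with the core https module by listening to socket events. The sketch below is an assumption about how one might do it, not the request library's code; timedGet and phaseDurations are hypothetical names, and the phase names simply mirror the ones used above.

```javascript
const https = require('https');

// Turn cumulative timestamps (ms since request start) into per-phase
// durations. On a reused keep-alive socket there is no lookup or connect
// event, so those phases count as zero.
function phaseDurations(t) {
  const lookup = t.lookup === undefined ? t.socket : t.lookup;
  const connect = t.connect === undefined ? lookup : t.connect;
  return {
    wait: t.socket,                   // waiting for a socket from the agent
    dns: lookup - t.socket,
    tcp: connect - lookup,
    firstByte: t.firstByte - connect, // includes the TLS handshake on new sockets
    download: t.end - t.firstByte,
  };
}

// Issue a GET and report phase durations via callback(err, phases).
function timedGet(url, callback) {
  const start = process.hrtime();
  const sinceStartMs = () => {
    const d = process.hrtime(start);
    return d[0] * 1e3 + d[1] / 1e6;
  };
  const timings = {};
  const req = https.get(url, (res) => {
    timings.firstByte = sinceStartMs(); // response headers received
    res.on('data', () => {});           // drain the body
    res.on('end', () => {
      timings.end = sinceStartMs();
      callback(null, phaseDurations(timings));
    });
  });
  req.on('socket', (socket) => {
    timings.socket = sinceStartMs();
    // Only attach to sockets that are still connecting, so listeners do not
    // pile up on reused keep-alive sockets.
    if (socket.connecting) {
      socket.once('lookup', () => { timings.lookup = sinceStartMs(); });
      socket.once('connect', () => { timings.connect = sinceStartMs(); });
    }
  });
  req.on('error', callback);
  return req;
}
```

Logging these phases per downstream call, at both low and high load, is one way to see whether the extra time is going into connection setup or into waiting for the downstream service.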
Things we have tried
We have tried configuring the standard agent with different values of maxSockets, maxFreeSockets...
We have tried using different agents
We have tried disabling socket pooling entirely
We have tried two different client libs - the core http module, and request.
We have matched the number of workers in our cluster to the number of CPUs
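For reference, the agent configuration we experimented with looks roughly like the sketch below. The numbers are placeholders rather than our real settings, and downstreamOptions is a hypothetical helper. One relevant detail is that in Node 8 the default agent does not enable keepAlive, so without a custom agent each downstream call may pay for a new TCP connection and TLS handshake.

```javascript
const https = require('https');

// Placeholder values; the right numbers depend on downstream capacity.
const keepAliveAgent = new https.Agent({
  keepAlive: true,    // reuse TCP/TLS connections between requests
  maxSockets: 50,     // cap on concurrent sockets per host
  maxFreeSockets: 10, // idle sockets kept open per host
});

// Hypothetical helper: builds options for https.request that share the agent.
function downstreamOptions(hostname, path) {
  return { hostname, path, method: 'GET', agent: keepAliveAgent };
}
```

The options object is then passed to https.request as usual. Sharing a single agent across all downstream calls in a worker is what allows connections to be reused.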
Some of these things have yielded gains of ~10%, but I am still convinced there is something fundamentally wrong with the architecture and configuration of the application - the throughput just seems too low.
I realise I haven't given enough detail to solve anything here, but if anyone has any guidance on approaches that have worked for them, other knobs to twiddle, guidance on better interpretation of profiling and heap dumps, or any other useful pointers I would be very grateful.
Dom