Thanks for replying to my question.
That's precisely what I'm doing: I'm creating a verticle with an HTTP server + HTTP client, and running as many instances of that verticle as there are cores.
That works great in theory and in benchmarking, where I send the same request to the application, which fans out into a fixed number of outbound requests.
In that case, all threads are equally busy, and there's no asymmetry. The load is even.
However, in practice/production the number of outbound requests (to different hosts and services) depends on the incoming request and the logic. One request, on one thread, may fan out into 30 outbound requests, while another verticle instance/thread gets a request that fans out into only 2. In that case the load is uneven: some threads will be very busy at times while others sit almost idle.
So that's why I think that if I could at least receive the responses on different threads, I could spread the load more evenly.
Say I have 2 verticle instances: one sends out 30 requests and the other sends out 2. If all 32 responses arrived at the same time, it would be much better for each thread to handle 16 of them than for the first thread to handle its 30 sequentially while the other handles just its 2.
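To illustrate what I mean, here's a plain-Java sketch (not Vert.x itself; the class name and pool size are just for illustration): if response handling is dispatched onto a shared pool instead of being pinned to the verticle that sent the request, the 32 pieces of work get divided across the threads automatically. In Vert.x terms I imagine this is roughly what handing the responses off to a worker pool (e.g. via executeBlocking) would give me.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class SpreadDemo {
    public static void main(String[] args) throws InterruptedException {
        // Shared pool standing in for "2 threads that can both receive responses".
        ExecutorService pool = Executors.newFixedThreadPool(2);
        // Count how many "responses" each thread ends up handling.
        Map<String, Integer> perThread = new ConcurrentHashMap<>();
        CountDownLatch done = new CountDownLatch(32);

        // 32 responses total: 30 from one fanned-out request, 2 from another.
        // Because they all go to the shared pool, neither thread is stuck
        // with all 30 while the other handles only 2.
        for (int i = 0; i < 32; i++) {
            pool.execute(() -> {
                perThread.merge(Thread.currentThread().getName(), 1, Integer::sum);
                done.countDown();
            });
        }

        done.await();
        pool.shutdown();
        int total = perThread.values().stream().mapToInt(Integer::intValue).sum();
        System.out.println("total=" + total);
    }
}
```

The exact split between the two threads depends on scheduling, but the point is that no single thread is forced to process all 30 responses sequentially.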
I hope that makes more sense now.
Thanks