I am designing a neural network inference server, and I have built my server and client using the synchronous gRPC model with a unary RPC design. For reference, the protobuf formats are based on the NVIDIA Triton Inference Server formats (https://github.com/NVIDIA/triton-inference-server).

My design expects a large batch of inputs (16,384 entries, for a total size of ~1 MB) to be received by the server, inference to be run, and the result to be returned to the client. I send these inputs in a repeated bytes field in my protobuf. However, even if I make my server-side function simply return an OK status (no actual processing), the server can only handle ~1,500-2,000 batches per second. Server and client run on the same machine, so network limitations should not be a factor. Meanwhile, I know that my inference processing itself can sustain throughputs closer to 10,000 batches/second.
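For concreteness, here is a minimal sketch of the no-op benchmark handler, written against the C++ synchronous gRPC API. The service, message, and field names (InferenceService, InferRequest, raw_inputs) and the port are simplified stand-ins for my Triton-derived definitions, not the actual types:

```cpp
#include <memory>
#include <grpcpp/grpcpp.h>
#include "inference.grpc.pb.h"  // generated from the simplified proto below

// Assumed (simplified) proto, standing in for my Triton-derived definitions:
//
//   service InferenceService {
//     rpc Infer(InferRequest) returns (InferResponse);
//   }
//   message InferRequest {
//     string model_name = 1;
//     repeated bytes raw_inputs = 2;  // 16,384 entries, ~1 MB total
//   }
//   message InferResponse {
//     repeated bytes raw_outputs = 1;
//   }

// Synchronous (thread-per-call) service. The handler does no work and
// immediately returns OK, mirroring the no-op benchmark described above.
class InferenceServiceImpl final : public inference::InferenceService::Service {
  grpc::Status Infer(grpc::ServerContext* /*ctx*/,
                     const inference::InferRequest* /*request*/,
                     inference::InferResponse* /*response*/) override {
    return grpc::Status::OK;  // no inference, just acknowledge the batch
  }
};

int main() {
  InferenceServiceImpl service;
  grpc::ServerBuilder builder;
  builder.AddListeningPort("0.0.0.0:50051", grpc::InsecureServerCredentials());
  builder.RegisterService(&service);
  std::unique_ptr<grpc::Server> server = builder.BuildAndStart();
  server->Wait();
  return 0;
}
```

The throughput numbers above come from a client on the same machine calling Infer in a closed loop with a pre-built ~1 MB request and counting completed calls per second.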
Is there an inherent limit on the number of requests per second that a gRPC server can handle? Is there a server setting or design change I can make to increase this maximum throughput?
I am happy to provide more information if it would help in understanding my issue.
Thanks for your help,
-Dylan