Configuration to improve small payloads that have worse latency than larger payloads?


eduar...@gmail.com

Feb 5, 2018, 6:10:03 PM
to grpc.io
Hi, I'm working on a custom latency test. I'm using payloads of 1 byte, 200 bytes, 1 KB, and 10 KB. The 1-byte tests show much worse (longer) latency than the rest of the payloads.

I'm working with gRPC for C++ on Windows. I'm guessing this has to do with some HTTP/2 packing or optimization logic, meaning packets aren't sent until a buffer is filled.

Which configuration options should I look at modifying to see if I can improve this behavior?

I've tried looking around in 


and in


with no luck. What do you suggest?

Thanks

Eduardo

Carl Mastrangelo

Feb 5, 2018, 6:43:34 PM
to grpc.io
Are you doing a closed-loop latency test like gRPC benchmarking does? Also, can you show your code?

eduar...@gmail.com

Feb 5, 2018, 7:16:29 PM
to grpc.io
With closed loop, do you mean

a) using loopback?
b) measuring from when the request is made and finish measuring when the response gets back?

In the test we have, we are not using loopback (two VMs over the network). We start measuring right before creating the ClientAsyncResponseReader and calling Finish, and we stop measuring when the response comes back and our callback is called.

If closed loop means something else please explain further.

I may be able to share the code but before I go through that process do you have any general suggestions that I can try or consider?

Thanks

Eduardo

Carl Mastrangelo

Feb 5, 2018, 7:24:15 PM
to grpc.io
By closed loop I mean starting a new RPC upon completion of one. I think that is the same as your option b). These should always be faster with small payloads than with larger payloads, which it seems is not what you are seeing?


We have closed-loop latency tests that use a 1-byte payload and measure the 50th and 99th percentiles. We see about 100us per RPC at the 50th.

eduar...@gmail.com

Feb 5, 2018, 7:32:55 PM
to grpc.io
We actually have 8 threads sending bursts of requests simultaneously and measuring each request individually. We send bursts and then wait for some time to avoid hammering the server with a huge number of requests. You seem to be describing a single client that sends one request and waits for the response before sending another. We are not doing that; we are simulating a kind of QPS approximation and measuring the latency.

The behavior I'm seeing is that smaller payloads are slower than the bigger payloads. I was thinking it might have to do with a buffer taking longer to fill before being sent over the wire.

Are the results you mention from the Windows stack?

Thanks

Eduardo

Carl Mastrangelo

Feb 5, 2018, 8:01:15 PM
to grpc.io
Ah, I thought you were trying to measure the latency of a single RPC. We have 2 QPS benchmarks: an open-loop and a closed-loop benchmark. The closed-loop one runs the single-RPC latency benchmark with 200 parallel copies, so there are only ever 200 active RPCs at a time. The latency is recorded, but not published anywhere.

From your description, the open-loop benchmark sounds more like what you are doing. We have a client with a target QPS that uses an exponentially distributed delay between starting RPCs. This simulates real traffic better and produces occasional bursts of RPCs. We use it to measure CPU while holding the QPS constant.


Larger payloads making the system faster is odd, and may be explained by your benchmark machine. For example, if there is no work for gRPC to do, it goes to sleep. When the amount of work is too low, it spends a lot of time waking up and going back to sleep, lowering overall performance. Strangely, adding more work (bigger payloads) keeps the system from ever sleeping, so it accomplishes more real work. We work around this by keeping the machine as close to 100% CPU as possible without going over. Additionally, we disable CPU frequency scaling to ensure stable results (the CPU down-clocks while waiting for network traffic and doesn't speed back up fast enough when there is data).


We benchmark almost exclusively on Linux.

eduar...@gmail.com

Feb 5, 2018, 10:18:41 PM
to grpc.io
I'll do some further experimentation based on what you mentioned.

Thanks