Ah, I thought you were trying to measure the latency of a single RPC. We have two QPS benchmarks: an open-loop and a closed-loop benchmark. The closed-loop benchmark runs 200 copies of the single-RPC latency benchmark in parallel, so there are only ever 200 active RPCs at a time. The latency is recorded, but not published anywhere.
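A closed-loop driver like that can be sketched as follows. This is only a minimal sketch, not our actual harness; `do_rpc` is a hypothetical stand-in for the single-RPC benchmark body, and the 200-way concurrency mirrors the setup described above:

```python
import threading
import time

def closed_loop(do_rpc, rpcs_per_worker, concurrency=200):
    """Closed-loop QPS driver: `concurrency` workers each run the
    single-RPC benchmark in a loop, so a new RPC starts only when the
    previous one on that worker finishes. At most `concurrency` RPCs
    are ever in flight."""
    latencies = []
    lock = threading.Lock()

    def worker():
        for _ in range(rpcs_per_worker):
            start = time.monotonic()
            do_rpc()  # hypothetical single-RPC body
            elapsed = time.monotonic() - start
            with lock:  # latency is recorded, but not published
                latencies.append(elapsed)

    threads = [threading.Thread(target=worker) for _ in range(concurrency)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return latencies
```

The key property is that throughput self-regulates: if the server slows down, the workers issue RPCs more slowly, so the offered load drops with it.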
From your description, the open-loop benchmark sounds more like what you are doing. We have a client with a target QPS that uses an exponentially distributed delay between starting RPCs. This simulates real traffic better and produces occasional bursts of RPCs. We use it to measure CPU while holding the QPS constant.
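The open-loop pacing can be sketched like this. Again a sketch rather than the real client; `start_rpc` is a hypothetical fire-and-forget RPC launcher. Exponentially distributed gaps with mean 1/QPS give a Poisson arrival process, which is where the occasional bursts come from:

```python
import random
import time

def open_loop_delays(target_qps, n, seed=0):
    """Inter-arrival gaps (seconds) for an open-loop load generator.

    Exponentially distributed gaps at rate `target_qps` produce a
    Poisson arrival process: the long-run rate matches the target,
    but short bursts occur naturally, much like real traffic."""
    rng = random.Random(seed)
    return [rng.expovariate(target_qps) for _ in range(n)]

def run_open_loop(start_rpc, target_qps, total_rpcs):
    """Start RPCs at the paced times, regardless of completions."""
    for gap in open_loop_delays(target_qps, total_rpcs):
        time.sleep(gap)
        start_rpc()  # fire-and-forget; do not wait for the response
```

Unlike the closed loop, the offered load here does not back off when the server slows down, which is what makes it suitable for measuring CPU at a fixed QPS.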
Larger payloads making the system faster is odd, and may be explained by your benchmark machine. For example, if there is no work for gRPC to do, it goes to sleep. When the amount of work is too low, it spends a lot of time waking up and going back to sleep, lowering overall performance. Strangely, adding more work (with bigger payloads) keeps the system from ever going to sleep, so it accomplishes more real work. We work around this by keeping the machine as close to 100% CPU as possible without going over. Additionally, we disable CPU frequency scaling to ensure stable results. (The CPU down-clocks while waiting for network traffic, and doesn't speed back up fast enough when data arrives.)
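On Linux you can at least check whether frequency scaling is in play by reading the cpufreq governors from sysfs. A small sketch of that check (the `performance` governor is typically what you want for stable benchmarks; actually changing governors requires root, usually via a tool like `cpupower`):

```python
import glob

def read_scaling_governors():
    """Map each CPU's cpufreq sysfs path to its current governor.

    Returns an empty dict on machines without cpufreq support
    (e.g. some VMs), since the sysfs files won't exist there."""
    governors = {}
    for path in glob.glob(
            "/sys/devices/system/cpu/cpu*/cpufreq/scaling_governor"):
        with open(path) as f:
            governors[path] = f.read().strip()
    return governors
```

If any CPU reports a governor other than `performance`, down-clocking during idle gaps is a plausible explanation for the effect you're seeing.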
We benchmark almost exclusively on Linux.