[grpc-java] Server crashes after 1 minute at 1000 messages per second.

thas....@gmail.com

unread,
Nov 2, 2017, 2:26:08 PM11/2/17
to grpc.io
I am load testing my grpc server to see how it performs under high throughput and what the average request latency is. 

It is configured as:
server = NettyServerBuilder.forPort(port)
        .addService(testService)
        .executor(Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors()))
        .build()
        .start();

The machine the server is running on has 16 CPUs. (2 threads per core, 8 cores)

On my client, I am using Guava's RateLimiter to send messages in a bi-di stream at 1000 per second. (All using a shared channel and stub).
I am sending each message in a Runnable just to parallelize the work. The same behavior happens if I call `onNext` directly without the task submission step.
Code roughly looks like:
final long startTime = System.currentTimeMillis();
final long oneMinute = TimeUnit.MINUTES.toMillis(1);
final RateLimiter rateLimiter = RateLimiter.create(1000);
final StreamObserver<TestMessageRequest> requestObserver = client.asyncStub.testMessageRpc(client.replyObserver);
while (System.currentTimeMillis() - startTime < oneMinute) {
    rateLimiter.acquire(1);
    threadPool.submit(() -> {
       TestMessageRequest request = TestMessageRequest.getDefaultInstance();
       requestObserver.onNext(request);
    });
}

Anywhere from the 20- to 60-second mark, my server throws:
 SEVERE: Exception while executing runnable io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1@48544a03
     [java] java.lang.OutOfMemoryError: GC overhead limit exceeded

Am I doing something wrong? Is there any way to have the server support this high load?

Thanks in advance!

Eric Anderson

unread,
Nov 2, 2017, 5:33:50 PM11/2/17
to Thas Himalayaratnam, grpc.io
On Thu, Nov 2, 2017 at 11:26 AM, <thas....@gmail.com> wrote:
On my client, I am using Guava's RateLimiter to send messages in a bi-di stream at 1000 per second. (All using a shared channel and stub).
Each message I am sending in a Runnable() just to parallelize the work.

That doesn't sound right, unless it is parallel across streams. onNext is not thread-safe; you'd need to hold a lock when calling it from multiple threads simultaneously. This isn't the cause of your problem, but it is a problem.
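For what it's worth, a minimal sketch of serializing concurrent senders through a single lock. (This uses a stand-in `Observer` interface rather than the real `io.grpc.stub.StreamObserver`, just to keep the demo self-contained; the locking pattern is the same.)

```java
import java.util.ArrayList;
import java.util.List;

public class SynchronizedObserverDemo {
    // Stand-in for io.grpc.stub.StreamObserver<T>, which is not thread-safe.
    interface Observer<T> { void onNext(T value); }

    // Wrap the delegate so concurrent callers serialize on a single lock.
    static <T> Observer<T> synchronizedObserver(Observer<T> delegate) {
        final Object lock = new Object();
        return value -> { synchronized (lock) { delegate.onNext(value); } };
    }

    public static void main(String[] args) throws InterruptedException {
        List<Integer> received = new ArrayList<>(); // ArrayList is not thread-safe either
        Observer<Integer> safe = synchronizedObserver(received::add);
        Thread[] threads = new Thread[4];
        for (int t = 0; t < threads.length; t++) {
            threads[t] = new Thread(() -> {
                for (int i = 0; i < 1000; i++) safe.onNext(i);
            });
            threads[t].start();
        }
        for (Thread th : threads) th.join();
        System.out.println(received.size()); // all 4000 adds survive: no lost updates
    }
}
```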

Same behavior happens if I just call `onNext` directly without the task submission step.
Code roughly looks like:
final long startTime = System.currentTimeMillis();
final long oneMinute = TimeUnit.MINUTES.toMillis(1);
final RateLimiter rateLimiter = RateLimiter.create(1000);
final StreamObserver<TestMessageRequest> requestObserver = client.asyncStub.testMessageRpc(client.replyObserver);
while (System.currentTimeMillis() - startTime < oneMinute) {
    rateLimiter.acquire(1);
    threadPool.submit(() -> {
       TestMessageRequest request = TestMessageRequest.getDefaultInstance();
       requestObserver.onNext(request);
    });
}
It's not observing outbound flow control and you're sending messages faster than they are consumed. This causes the buffered messages to consume too much space. You need to use ClientCallStreamObserver.isReady() and setOnReadyHandler(). Your replyObserver will need to implement ClientResponseObserver in order to call setOnReadyHandler() during beforeStart(). If isReady() is false, then try to pause sending to avoid excessive buffering. When isReady() transitions from false back to true, setOnReadyHandler() will be called.
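Roughly, a sketch of what that looks like on the sender side (assuming grpc-java's `io.grpc.stub` API; `TestMessageReply` and the two private helpers are placeholders for the original poster's types):

```java
import io.grpc.stub.ClientCallStreamObserver;
import io.grpc.stub.ClientResponseObserver;

// Implementing ClientResponseObserver (instead of plain StreamObserver)
// gives us beforeStart(), where the onReady handler must be installed.
class FlowControlledSender
        implements ClientResponseObserver<TestMessageRequest, TestMessageReply> {
    private ClientCallStreamObserver<TestMessageRequest> requestStream;

    @Override
    public void beforeStart(ClientCallStreamObserver<TestMessageRequest> stream) {
        this.requestStream = stream;
        // Invoked each time isReady() transitions from false back to true.
        stream.setOnReadyHandler(() -> {
            // Send while the transport has buffer space; stop when it fills,
            // and resume when the handler fires again.
            while (requestStream.isReady() && haveMoreToSend()) {
                requestStream.onNext(nextMessage());
            }
        });
    }

    @Override public void onNext(TestMessageReply reply) { /* handle response */ }
    @Override public void onError(Throwable t) { /* handle failure */ }
    @Override public void onCompleted() { /* stream closed by server */ }

    // Hypothetical helpers standing in for the application's message source.
    private boolean haveMoreToSend() { return true; }
    private TestMessageRequest nextMessage() {
        return TestMessageRequest.getDefaultInstance();
    }
}
```

Pass an instance of this as the response observer when starting the call, and drive all sends from the onReady handler rather than from an outer loop.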

See the ManualFlowControlClient, although you don't need to mess with disableAutoInboundFlowControl() and request().

So anywhere from the 20-60 second mark my server throws a:
 SEVERE: Exception while executing runnable io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1@48544a03
     [java] java.lang.OutOfMemoryError: GC overhead limit exceeded

Am I doing something wrong? Is there any way to have the server support this high load.

Is that 1000 QPS total, and all over one stream? Using a single stream can't use much more than a single core of processing (excluding protobuf and the application), so you may want to use more streams. But 1k QPS is really low. We see 750k QPS ("Streaming secure throughput QPS (8 core client to 8 core server)") between a client and server with 8 cores each. Even with non-streaming RPCs we see 250k QPS.

thas....@gmail.com

unread,
Nov 2, 2017, 6:29:17 PM11/2/17
to grpc.io

Hey Eric!

That doesn't sound right, unless it is parallel across streams. onNext is not thread-safe; you'd need to hold a lock when calling it from multiple threads simultaneously. This isn't the cause of your problem, but it is a problem.

Thanks for pointing that out!

Is that 1000 QPS total, and all over one stream? Using a single stream can't use much more than a single core of processing (excluding protobuf and the application), so you may use some more streams. But 1k QPS is really low. We see 750k QPS ("Streaming secure throughput QPS (8 core client to 8 core server)") between a client and server with 8 cores each. Even with non-streaming RPCs we see 250k QPS.

I'm trying 1000 QPS continuously for a minute, all over one stream. FWIW, 500 QPS works fine, and I tried that for up to 5 minutes. For those charts you showed, are you able to share the server and client configs/code that produced those metrics? Also, I am still on gRPC 1.5; I know there have been performance improvements since then, but 1000 QPS seems very low even for that version.

Anything else you could recommend? Thanks again.

Carl Mastrangelo

unread,
Nov 2, 2017, 7:02:43 PM11/2/17
to grpc.io
You should release a token when the RPC completes, rather than solely wait for the rate limiter. Java's Semaphore class lets you do this. It's usually not a good idea to think in terms of QPS for this kind of use; it's better to think in terms of in-flight RPCs.
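A rough sketch of that idea with plain `java.util.concurrent` (the `Thread.sleep` stands in for the RPC round trip):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class InflightLimitDemo {
    public static void main(String[] args) throws InterruptedException {
        final int maxInflight = 10;            // cap on concurrent "RPCs"
        final Semaphore inflight = new Semaphore(maxInflight);
        final AtomicInteger active = new AtomicInteger();
        final AtomicInteger peak = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(32);

        for (int i = 0; i < 1000; i++) {
            inflight.acquire();                // blocks once maxInflight are outstanding
            pool.submit(() -> {
                int now = active.incrementAndGet();
                peak.accumulateAndGet(now, Math::max);
                try {
                    Thread.sleep(1);           // stand-in for the RPC round trip
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    active.decrementAndGet();
                    inflight.release();        // completion frees a slot
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        // The semaphore, not the thread pool, bounds concurrency.
        System.out.println(peak.get() <= maxInflight);
    }
}
```

With gRPC, the `release()` would go in the response observer's `onNext`/`onError` so a slot frees only when the server has actually answered.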

Eric Anderson

unread,
Nov 3, 2017, 6:47:58 PM11/3/17
to Thas Himalayaratnam, grpc.io
On Thu, Nov 2, 2017 at 3:29 PM, <thas....@gmail.com> wrote:
Is that 1000 QPS total, and all over one stream? Using a single stream can't use much more than a single core of processing (excluding protobuf and the application), so you may use some more streams. But 1k QPS is really low. We see 750k QPS ("Streaming secure throughput QPS (8 core client to 8 core server)") between a client and server with 8 cores each. Even with non-streaming RPCs we see 250k QPS.

I'm trying 1000 QPS continuously for a minute, all over one stream. FWIW, 500 QPS works fine and I tried that up to 5 minutes. For those charts you showed, are you able to show the server and client configs/code that produced those metrics? Also, I am still using GRPC 1.5, and I know there were performance improvements from then but I still think 1000 QPS is still very low for this version. 

Anything else you could recommend? Thanks again.

If you're running on a desktop- or server-class machine with at least a couple of cores available and ethernet between the client and server, then I don't know what would cause it to be that low.

The performance test can be run with run_performance_test.py... but... it's complicated. The Java code it uses is in the benchmarks directory.

I'd instead suggest going into grpc-java's benchmark directory, running `./gradlew -PskipCodegen=true installDist` and then running the binaries in build/install/grpc-benchmarks/bin: qps_server, qps_client, and maybe openloop_client. qps_client does ping-pong, so if you use only one RPC you're just measuring latency. But I'm seeing 10k QPS in that setup on localhost.

In one terminal:
$ build/install/grpc-benchmarks/bin/qps_server --address=localhost:1234

Then in another:
$ build/install/grpc-benchmarks/bin/qps_client --address=localhost:1234 --duration=300 --channels=1 --outstanding_rpcs=1 --streaming_rpcs

The qps_client, for each RPC, sends a message, waits for the response, then sends another message. openloop_client is closer to what you want, but it is easier to "hold wrong." It does unary RPCs at a requested rate, but it spin-waits and issues all RPCs from the same thread, so its maximum QPS is artificially limited.

thas....@gmail.com

unread,
Nov 7, 2017, 3:00:59 PM11/7/17
to grpc.io
Eric,

Exactly what I was looking for! I found a CPU-intensive function on my server that was causing the crash under heavy load due to frequent context switching. After removing it and using the server config in AsyncServer.java, throughput increased substantially! I'll look at the different server and client configs in qps_server/qps_client and see what works best for me.

Thanks so much again!