Scala gRPC High CPU Utilization


Kos

Aug 28, 2018, 2:54:19 AM
to grpc.io
Hi,

I'm using gRPC in a new Scala service and I'm seeing unexpectedly high CPU utilization. I see this high utilization in our production workload, and I'm also able to reproduce it via performance tests, which I'll describe below.

My setup is using grpc-netty-shaded 1.10 (but I've also repro'd with 1.14). My performance test uses mTLS to talk to the service. The service is deployed in a container with 6 cores and 2 GB RAM. I've reduced my service to immediately return a response without doing any other work, to identify whether the cost is in the application or in my gRPC configuration.

My performance test issues about 250 requests a second using one ManagedChannel to one instance of my service. The data in each request is about 10 bytes. With this workload, my service runs at about 35% CPU, which feels far too high for such a low request rate.

Here is how I've constructed my server:
val serverInterceptor = MonitoringServerInterceptor.create(Configuration.allMetrics())

val realtimeServiceWithMonitoring = ServerInterceptors.intercept(
  RealtimePublishGrpc.bindService(realtimeService, ExecutionContext.global),
  serverInterceptor)
val rppServiceWithMonitoring = ServerInterceptors.intercept(
  RealtimeProxyGrpc.bindService(realtimePublishProxyService, ExecutionContext.global),
  serverInterceptor
)

val keyManagerFactory = GrpcSSLHelper.getKeyManagerFactory(sslConfig)
val trustManagerFactory = GrpcSSLHelper.getTrustManagerFactory(sslConfig)
val serverGrpcSslContexts = GrpcSSLHelper.getServerSslContext(keyManagerFactory, trustManagerFactory)

NettyServerBuilder
  .forPort(8086)
  .sslContext(serverGrpcSslContexts)
  .addService(realtimeServiceWithMonitoring)
  .addService(rppServiceWithMonitoring)
  .build()

The server interceptor is modeled after: https://github.com/grpc-ecosystem/java-grpc-prometheus

The managed channel is constructed as such:
private val interceptor = MonitoringClientInterceptor.create(Configuration.allMetrics())

val trustManagerFactory = GrpcSSLHelper.getTrustManagerFactory(sslConfig)

NettyChannelBuilder
  .forAddress(address, 8086)
  .intercept(interceptor)
  .negotiationType(NegotiationType.TLS)
  .sslContext(GrpcSSLHelper.getClientSslContext(keyManagerFactory, trustManagerFactory))
  .build()

Finally, I use non-blocking stubs to issue the gRPC request.
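For reference, the non-blocking call looks roughly like this (a sketch assuming ScalaPB-generated stubs; the `channel` and `logger` values are from the surrounding code, and the message types follow the service above):

```scala
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

// Sketch: ScalaPB generates an async stub whose methods return Futures,
// so the caller never blocks a thread waiting on the RPC.
val stub = RealtimePublishGrpc.stub(channel)

val response: Future[PublishResponse] = stub.publish(PublishRequest())
response.foreach(r => logger.debug("Got response: " + r))
```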

Any help would be greatly appreciated. Thanks!
-K

Carl Mastrangelo

Aug 28, 2018, 3:49:38 PM
to grpc.io
Can you try setting the executor on both the channel and the server builder? I would recommend ForkJoinPool.commonPool().
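In builder terms, that suggestion would look something like this (a sketch; names like `serverGrpcSslContexts` and `address` follow the snippets above):

```scala
import java.util.concurrent.ForkJoinPool
import io.grpc.netty.{NettyChannelBuilder, NettyServerBuilder}

// Sketch: pass an app-controlled executor explicitly; otherwise gRPC
// falls back to a default cached thread pool.
val appExecutor = ForkJoinPool.commonPool()

val server = NettyServerBuilder
  .forPort(8086)
  .sslContext(serverGrpcSslContexts)
  .executor(appExecutor)
  .build()

val channel = NettyChannelBuilder
  .forAddress(address, 8086)
  .executor(appExecutor)
  .build()
```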

kaus...@box.com

Aug 28, 2018, 5:14:31 PM
to grpc.io
Hi Carl,

Thanks for responding! I've tried a couple of different executors and they don't seem to change the behavior. I've tried a FixedThreadPool with the number of threads = # of cores * 2, ForkJoinPool.commonPool as you recommended, and the Scala global ExecutionContext, which is ultimately a ForkJoinPool as well. I've set this in the NettyServerBuilder as well as in the call to bind my service.

For some more information, here are the results from a 10-minute Gatling test run using the commonPool. The server implementation now looks like this:
val realtimeServiceWithMonitoring = ServerInterceptors.intercept(
  RealtimePublishGrpc.bindService(realtimeService, ExecutionContext.global),
  serverInterceptor)
val rppServiceWithMonitoring = ServerInterceptors.intercept(
  RealtimeProxyGrpc.bindService(realtimePublishProxyService, ExecutionContext.global),
  serverInterceptor
)

NettyServerBuilder
  .forPort(8086)
  .sslContext(serverGrpcSslContexts)
  .addService(realtimeServiceWithMonitoring)
  .addService(batchPublishWithMonitoring)
  .addService(rppServiceWithMonitoring)
  .executor(ForkJoinPool.commonPool())
  .build()

My service implementation immediately returns Future.successful:
override def publish(request: PublishRequest): Future[PublishResponse] = {
  logger.debug("Received Publish request: " + request)
  Future.successful(PublishResponse())
}

Test Results:
================================================================================
---- Global Information --------------------------------------------------------
> request count                                     208686 (OK=208686 KO=0     )
> min response time                                    165 (OK=165    KO=-     )
> max response time                                   2997 (OK=2997   KO=-     )
> mean response time                                   287 (OK=287    KO=-     )
> std deviation                                        145 (OK=145    KO=-     )
> response time 50th percentile                        232 (OK=232    KO=-     )
> response time 75th percentile                        324 (OK=324    KO=-     )
> response time 95th percentile                        501 (OK=501    KO=-     )
> response time 99th percentile                        894 (OK=893    KO=-     )
> mean requests/sec                                347.231 (OK=347.231 KO=-     )
---- Response Time Distribution ------------------------------------------------
> t < 800 ms                                        206014 ( 99%)
> 800 ms < t < 1200 ms                                1511 (  1%)
> t > 1200 ms                                         1161 (  1%)
> failed                                                 0 (  0%)
================================================================================

347 requests/sec, with CPU utilization hovering between 29% and 35%.


Thanks for your help!

Carl Mastrangelo

Aug 29, 2018, 5:28:00 PM
to grpc.io
More info is needed to figure out why this is slow. Have you used JProfiler or YourKit before? There are a couple of Java profilers (even perf) that can tell you where the CPU time is going. Also, you should consider turning on GC logging to see if memory is being consumed too fast.

Our tuned benchmarks get about 5000 qps per core, but that took profiling before we could get that fast. The general approach is to figure out what's slow, and then fix that. Without knowing what's slow for your test, it's hard to recommend a fix.

Kos

Aug 31, 2018, 5:02:18 PM
to grpc.io
Hi Carl,

I ran YourKit against my service and what I see is many threads being created for the event loop group - they're all named something like 'grpc-default-worker-ELG-...'. I did some reading on your other posts and saw you recommended using an ELG bounded to 1-2 threads. I tried this and our CPU utilization dropped by about 10-15% with no loss in throughput!

This got me thinking, and I'm wondering if my problem is actually having too many managed channels. For some background, this service creates a managed channel for each node that it talks to (about 40 or so). For the traffic it receives, it does some filtering and sends the data down N of those channels. The outbound throughput is quite low - on the order of 10k/sec across all the channels. All of these managed channels now share the same NIO ELG, and they use the same ForkJoinPool.commonPool as you recommend. I'm wondering, though, if one managed channel per host is the correct approach here? Or should I make my own 'ManagedChannel' and create a subchannel per host?
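For anyone following along, bounding and sharing the event loop group looks roughly like this (a sketch; the thread count and names are illustrative, and the port follows the earlier snippets):

```scala
import io.grpc.netty.NettyChannelBuilder
import io.netty.channel.nio.NioEventLoopGroup

// Sketch: one small NIO event loop group shared by every channel,
// instead of each ManagedChannel spinning up its own ELG threads.
val sharedElg = new NioEventLoopGroup(2) // 1-2 loops was enough here

def channelFor(address: String) =
  NettyChannelBuilder
    .forAddress(address, 8086)
    .eventLoopGroup(sharedElg)
    .build()
```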

Thanks!

Carl Mastrangelo

Aug 31, 2018, 7:06:18 PM
to grpc.io
It should be one ManagedChannel per host*.  There are a lot of batching opportunities possible when using just one channel.  For example, TLS encryption can work on larger block sizes at a time.  Another example is that netty can poll on fewer threads, meaning fewer wakeups across all your threads.  Just to make sure we are on the same page:

- Only use one ManagedChannel per "target"  (a.k.a. hostname)
- Use a limited executor (like ForkJoinPool) for each channel.  You can share this executor across channels if your RPCs don't block very long
- Use a single Netty EventLoopGroup, and limit the number of loops.  You can share the group across all your channels.
- Use Netty tcnative for SSL / TLS if you aren't already.  This is enormously faster than the SSL that comes with the JDK.

These apply to the server as well.


* there are rare cases to break this rule, but they don't sound like they apply to your usage.
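A minimal sketch of the checklist above, assuming netty-tcnative-boringssl-static is on the classpath (in that case `GrpcSslContexts.forClient()` picks the OpenSSL provider automatically; `trustManagerFactory` is from the earlier snippets):

```scala
import java.util.concurrent.ForkJoinPool
import io.grpc.netty.{GrpcSslContexts, NettyChannelBuilder}
import io.netty.channel.nio.NioEventLoopGroup
import io.netty.handler.ssl.OpenSsl

// Shared across every channel: one bounded ELG and one executor.
val sharedElg = new NioEventLoopGroup(2)
val sharedExecutor = ForkJoinPool.commonPool()

// True when netty-tcnative is on the classpath, in which case
// GrpcSslContexts uses OpenSSL instead of the JDK SSL provider.
println("tcnative available: " + OpenSsl.isAvailable)

// One ManagedChannel per target hostname.
def channelFor(host: String) =
  NettyChannelBuilder
    .forAddress(host, 8086)
    .eventLoopGroup(sharedElg)
    .executor(sharedExecutor)
    .sslContext(GrpcSslContexts.forClient().trustManager(trustManagerFactory).build())
    .build()
```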

Kos

Sep 11, 2018, 1:04:40 AM
to grpc.io
Hey Carl,

Yup, the way you've described it is exactly how I have it set up.
private def channelBuilder(address: String): ManagedChannel = {
  val interceptor = MonitoringClientInterceptor.create(Configuration.allMetrics())
  val builder = NettyChannelBuilder
    .forAddress(address, realtimePublishProxyConfig.realtimeInstanceConfig.realtimeServicePort)
    .intercept(interceptor)
    .executor(ForkJoinPool.commonPool())
    .eventLoopGroup(nioEventLoopGroup)

  clientSslContext match {
    case Some(sslContext) => builder.sslContext(sslContext).useTransportSecurity()
    case None => builder.usePlaintext(true)
  }

  builder.build()
}


I just got back from PTO so I'm going to start looking into this again. I had a quick question that I couldn't find much information about after doing some searching. Currently we send one logical request per RPC. Would there be advantages to sending multiple requests in one RPC (i.e. batching)? Our server would likely convert the batch RPC into individual requests, so we wouldn't be able to take advantage of batching throughout the system. So my question is targeted more at gRPC and how it handles this.

Thanks!    

Carl Mastrangelo

Sep 11, 2018, 8:51:54 PM
to grpc.io
An explicit goal of gRPC is to make batching unnecessary. You should be able to use unary, streaming, or any RPC kind without worrying about the performance.

Can you list the JVM flags you use?

kos

Sep 11, 2018, 9:13:55 PM
to grpc.io
That's what I understood it to be as well - we actually did some testing with batching and saw no improvement. I just wanted to double check! 

Here are the JVM flags I'm using. This service runs in a Docker container, which can result in Java not detecting the correct CPU core count. Because of this, we explicitly set certain parameters (e.g. GC threads) to roughly the number of cores being used (14).

      -J-server
      -J-verbosegc
      -J-Xloggc:/var/log/java.gc.log
      -J-XX:+PrintGCApplicationStoppedTime
      -J-XX:+PrintGCDetails
      -J-XX:+PrintGCTimeStamps
      -J-XX:+PrintGCDateStamps
      -J-XX:+PrintTenuringDistribution
      -J-XX:ErrorFile=/var/log/java.error%p.log
      -J-Xms7g
      -J-Xmx7g
      -J-XX:+UseG1GC
      -J-XX:+UnlockExperimentalVMOptions
      -J-XX:+UseCGroupMemoryLimitForHeap
      -J-XX:MaxRAM=7g
      -J-XX:ParallelGCThreads=12
      -J-XX:ConcGCThreads=12
      -J-Djava.util.concurrent.ForkJoinPool.common.parallelism=12
      -J-XX:CICompilerCount=12  
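One thing worth double-checking with that last flag: the `ForkJoinPool.common.parallelism` property is only read the first time the common pool is touched, so it silently has no effect if anything initializes the pool earlier. A small stdlib-only sanity check (the `12` matches the flag above):

```scala
object CommonPoolCheck {
  def main(args: Array[String]): Unit = {
    // Must run before anything touches the common pool; the JVM flag
    // -Djava.util.concurrent.ForkJoinPool.common.parallelism=12 does
    // the same thing from the command line.
    System.setProperty("java.util.concurrent.ForkJoinPool.common.parallelism", "12")
    println(java.util.concurrent.ForkJoinPool.commonPool().getParallelism)
  }
}
```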