A single synchronous call takes ~230ms on localhost!? What am I doing wrong?


aaksh...@cs.unc.edu

Sep 3, 2018, 1:39:45 PM
to grpc.io
Hail gRPC experts (;D),

I'm trying to build an image/video object detection server (as one of the reusable pieces in a benchmark suite) with low RTT requirements (near-realtime, say ~60-90ms RTT)...
I've used gRPC and protobuf (built from git master; hashes below in case that is relevant) for the serialization and transport.
_________________________________
grpc: 
commit dbc1e27e2e1a81b61eb064eb036ec6a267f88cb6
Merge: 9bc6cd1 5d24ab9
Author: Jiangtao Li <email redacted by me>
Date:   Fri Jul 20 17:00:18 2018 -0700 

protobuf:
commit b5fbb742af122b565925987e65c08957739976a7
Author: Bo Yang <email redacted by me>
Date:   Mon Mar 5 19:54:18 2018 -0800
_________________________________

gRPC seems to add an insane amount of overhead -- ~160ms (~2x the server's processing time)!
For now I'm running on a single machine (a pretty beefy machine, so contention isn't an issue...) operating over localhost (loopback).
The amount of data being transferred is considerable, but not unheard of (~4MiB per request).

Server-side timing measurements:
doDetection: new requeust 0x7ffc77f16920
0x7ffc77f16920: GPU processing took 24.045 milliseconds
0x7ffc77f16920: Server took 72.206 millisecond

Client-side measurements:
10 objects detected.
This request took 234.825 milliseconds 

Client RTT - Server processing time = 234.825 - 72.206 = 162.619ms (!??!)
I've pinned the server and client to separate cores using taskset.
There isn't anything else running on the server and it's a beefy 48 core (Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz) machine with ample RAM (128GiB), etc....

As a start, I instrumented the implementation of the synchronous call in include/grpcpp/impl/codegen/client_unary_call.h:
BlockingUnaryCallImpl(ChannelInterface* channel, const RpcMethod& method,
                         ClientContext* context, const InputMessage& request,
                         OutputMessage* result)

and found that the vast majority of the time is spent spinning on a completion queue:
line 107:   if (cq.Pluck(&ops)) {

I wonder if I need to configure gRPC differently (perhaps the default configurations are more geared towards latency-insensitive batching?)...

Any help understanding these numbers would be appreciated.

Thanks in advance,
Amogh Akshintala

Carl Mastrangelo

Sep 5, 2018, 1:59:49 PM
to grpc.io
Our own benchmarks get about 1000x better latency than that, so something is definitely up. Can you describe how you arrived at that line number, or the tools you used to profile? (We use perf and pprof.)

Amogh Akshintala

Sep 5, 2018, 2:35:10 PM
to 'Carl Mastrangelo' via grpc.io, not...@google.com
Hey Carl,

Thanks for your time!

I see the same performance numbers as reported in the protobuf performance dashboard (~0.2ms per rpc) if I set up the HelloWorld client to call SayHello 1000 times in a tight loop and average it over the 1000 calls.

How I arrived at the numbers I reported (everything is measured on localhost):
My client and server code are on GitHub.
Client:
  perform any necessary data manipulation
  pack Request message
  start = getTimeOfDay()
  stub->invokeRPC()
  checkStatus()
  end = getTimeOfDay()
  Client RTT = end - start

Server:
  start = getTimeOfDay()
  serviceRequest() <- GPU time is calculated inside this function, but ignore that for now.
  end = getTimeOfDay()
  return Status
  Server time = end - start

gRPC + protobuf overhead = Client RTT - Server time

I replaced protobuf with flatbuffers yesterday, after noticing (using perf)
that a significant chunk of processing time was spent in protobuf serialization
and deserialization code.
Latency really improved with flatbuffers (no parsing, so...),
but man, is that library hard to use/debug compared to protobuf...

New numbers with flatbuffers:
client RTT = ~70ms
Server Time = ~40ms
gRPC + flatbuffers = ~30ms (for ~4MiB of data over localhost)

Thanks for the link to pprof. Will check it out.

Cheers,
Amogh Akshintala


Carl Mastrangelo

Sep 5, 2018, 2:52:15 PM
to aaksh...@cs.unc.edu, grp...@googlegroups.com
Correct me if I am mistaken, but gettimeofday will use the realtime clock rather than the monotonic clock. The latter, I believe, is the correct way to measure time differences.

Second, if you only issue one RPC, it won't be as fast. Code won't be in cache, threads won't be active, and TCP windows won't be large enough.

Last: flatbuffers, I believe, are fixed size, which means that as you update your proto with new fields, you'll still have the old fields stick around. This puts you in a situation where you have to either reuse fields (which is suspect from an API compatibility POV) or accept the larger size (which hurts performance). You might be willing to make this trade if your API will seldom change, but generally they do change. Plain protobuf is slower now, for increased flexibility later. You're certainly allowed to swap out the serialization based on need, but I don't think the gRPC team can recommend it.

Amogh Akshintala

Sep 5, 2018, 3:14:07 PM
to Carl Mastrangelo, grp...@googlegroups.com
You’re correct, but AFAIK the choice of clock is irrelevant in this case, as the time intervals are relatively large (ms).
I’d have used RDTSC but the machine I’m on has DVFS turned on and I was too lazy to change that…
Didn’t know there was a wrapper in the std library (I’m more of a C programmer)…
I’ll look into using std::chrono::steady_clock next time…

The numbers I’m reporting aren’t from one RPC call, however. They’re an average over several invocations of the same RPC:

Sample:
$ taskset -c 1 ./client -f ../data/video/__909tIOxbc_temp.mp4
video file: ../data/video/__909tIOxbc_temp.mp4
Press control-c to quit at any point
 6 objects detected.
This request took 247.746 milliseconds
 6 objects detected.
This request took 240.866 milliseconds
 7 objects detected.
This request took 230.565 milliseconds
 8 objects detected.
This request took 227.763 milliseconds
 10 objects detected.
This request took 229.529 milliseconds
 10 objects detected.
This request took 225.853 milliseconds
 10 objects detected.
This request took 220.809 milliseconds
 9 objects detected.
This request took 229.487 milliseconds
 8 objects detected.
This request took 237.131 milliseconds
 7 objects detected.
This request took 228.043 milliseconds
 8 objects detected.
This request took 233.57 milliseconds
 9 objects detected.
This request took 230.96 milliseconds
 9 objects detected.
This request took 226.805 milliseconds
 10 objects detected.
This request took 222.559 milliseconds
 9 objects detected.
This request took 233.108 milliseconds
 9 objects detected.
This request took 225.9 milliseconds
 8 objects detected.
This request took 227.267 milliseconds
 9 objects detected.
This request took 227.419 milliseconds
 9 objects detected.
This request took 232.574 milliseconds
 9 objects detected.
This request took 231.147 milliseconds
 10 objects detected.
This request took 230.447 milliseconds

Corresponding server measurements:
$ taskset -c 5 ./server ../cfg/coco.data ../cfg/yolov3.cfg ../weights/yolov3.weights
Server listening on localhost:50051
doDetection: new requeust 0x7fff5d1dcc90
0x7fff5d1dcc90 GPU processing took 23.904 milliseconds
0x7fff5d1dcc90 doDetection: took 66.514 milliseconds
0x7fff5d1dcc90Server took 66.55 milliseconds
doDetection: new requeust 0x7fff5d1dcc90
0x7fff5d1dcc90 GPU processing took 23.804 milliseconds
0x7fff5d1dcc90 doDetection: took 66.263 milliseconds
0x7fff5d1dcc90Server took 66.288 milliseconds
doDetection: new requeust 0x7fff5d1dcc90
0x7fff5d1dcc90 GPU processing took 23.762 milliseconds
0x7fff5d1dcc90 doDetection: took 66.797 milliseconds
0x7fff5d1dcc90Server took 66.825 milliseconds
doDetection: new requeust 0x7fff5d1dcc90
0x7fff5d1dcc90 GPU processing took 23.768 milliseconds
0x7fff5d1dcc90 doDetection: took 66.182 milliseconds
0x7fff5d1dcc90Server took 66.209 milliseconds
doDetection: new requeust 0x7fff5d1dcc90
0x7fff5d1dcc90 GPU processing took 23.78 milliseconds
0x7fff5d1dcc90 doDetection: took 66.244 milliseconds
0x7fff5d1dcc90Server took 66.271 milliseconds
doDetection: new requeust 0x7fff5d1dcc90
0x7fff5d1dcc90 GPU processing took 23.772 milliseconds
0x7fff5d1dcc90 doDetection: took 66.102 milliseconds
0x7fff5d1dcc90Server took 66.13 milliseconds
doDetection: new requeust 0x7fff5d1dcc90
0x7fff5d1dcc90 GPU processing took 23.768 milliseconds
0x7fff5d1dcc90 doDetection: took 66.107 milliseconds
0x7fff5d1dcc90Server took 66.135 milliseconds
doDetection: new requeust 0x7fff5d1dcc90
0x7fff5d1dcc90 GPU processing took 23.788 milliseconds
0x7fff5d1dcc90 doDetection: took 71.868 milliseconds
0x7fff5d1dcc90Server took 72.365 milliseconds
doDetection: new requeust 0x7fff5d1dcc90
0x7fff5d1dcc90 GPU processing took 23.795 milliseconds
0x7fff5d1dcc90 doDetection: took 66.157 milliseconds
0x7fff5d1dcc90Server took 66.186 milliseconds
doDetection: new requeust 0x7fff5d1dcc90
0x7fff5d1dcc90 GPU processing took 23.815 milliseconds
0x7fff5d1dcc90 doDetection: took 66.255 milliseconds
0x7fff5d1dcc90Server took 66.284 milliseconds
doDetection: new requeust 0x7fff5d1dcc90
0x7fff5d1dcc90 GPU processing took 23.746 milliseconds
0x7fff5d1dcc90 doDetection: took 66.489 milliseconds
0x7fff5d1dcc90Server took 66.516 milliseconds
doDetection: new requeust 0x7fff5d1dcc90
0x7fff5d1dcc90 GPU processing took 23.902 milliseconds
0x7fff5d1dcc90 doDetection: took 68.414 milliseconds
0x7fff5d1dcc90Server took 68.452 milliseconds
doDetection: new requeust 0x7fff5d1dcc90
0x7fff5d1dcc90 GPU processing took 23.758 milliseconds
0x7fff5d1dcc90 doDetection: took 66.615 milliseconds
0x7fff5d1dcc90Server took 66.643 milliseconds
doDetection: new requeust 0x7fff5d1dcc90
0x7fff5d1dcc90 GPU processing took 23.811 milliseconds
0x7fff5d1dcc90 doDetection: took 66.339 milliseconds
0x7fff5d1dcc90Server took 66.368 milliseconds
doDetection: new requeust 0x7fff5d1dcc90
0x7fff5d1dcc90 GPU processing took 24.714 milliseconds
0x7fff5d1dcc90 doDetection: took 70.671 milliseconds
0x7fff5d1dcc90Server took 70.704 milliseconds
doDetection: new requeust 0x7fff5d1dcc90
0x7fff5d1dcc90 GPU processing took 23.805 milliseconds
0x7fff5d1dcc90 doDetection: took 66.269 milliseconds
0x7fff5d1dcc90Server took 66.299 milliseconds
doDetection: new requeust 0x7fff5d1dcc90
0x7fff5d1dcc90 GPU processing took 23.836 milliseconds
0x7fff5d1dcc90 doDetection: took 66.73 milliseconds
0x7fff5d1dcc90Server took 66.755 milliseconds
doDetection: new requeust 0x7fff5d1dcc90
0x7fff5d1dcc90 GPU processing took 23.791 milliseconds
0x7fff5d1dcc90 doDetection: took 65.679 milliseconds
0x7fff5d1dcc90Server took 65.707 milliseconds
doDetection: new requeust 0x7fff5d1dcc90
0x7fff5d1dcc90 GPU processing took 23.826 milliseconds
0x7fff5d1dcc90 doDetection: took 66.242 milliseconds
0x7fff5d1dcc90Server took 66.271 milliseconds
doDetection: new requeust 0x7fff5d1dcc90
0x7fff5d1dcc90 GPU processing took 23.804 milliseconds
0x7fff5d1dcc90 doDetection: took 67.864 milliseconds
0x7fff5d1dcc90Server took 67.893 milliseconds
doDetection: new requeust 0x7fff5d1dcc90
0x7fff5d1dcc90 GPU processing took 23.807 milliseconds
0x7fff5d1dcc90 doDetection: took 67.705 milliseconds
0x7fff5d1dcc90Server took 67.736 milliseconds

Re flatbuffers:
AFAIK, flatbuffers doesn’t transfer anything that is unused (fields can be marked deprecated).
If a field isn’t set to a value other than the default, it isn’t transferred over the wire;
such values are repopulated with the default (specified in the schema) on the client/server side as needed.
Of course, this only works for scalars. If you need to deprecate a vector, then yes, you’re kinda hosed...
Hmmm, I think the gRPC documentation should point to flatbuffers (which supports gRPC out of the box) as an alternative suited to low-latency cases (while making the drawbacks you mentioned clear).

Cheers,
Amogh Akshintala

Carl Mastrangelo

Sep 5, 2018, 5:24:14 PM
to aaksh...@cs.unc.edu, grp...@googlegroups.com, Vijay Pai
One thing I noticed is that you are sharing the same timespec in your service. I think you need to make one per RPC, in the event that RPCs arrive at the same time. From reading the client, it does seem serial (again, correct me if I'm mistaken).

As for serialization: it's a fine line to walk. Some of the serializers are better supported than others, and it might give the wrong impression that gRPC recommends them. For example, I recently made a blog post about how to plug in JSON as the encoding. The code is more boilerplate, but works better with existing codebases. If you are happy with the flatbuffers support, I'd be willing to take that as a signal that it's ready for use (real user feedback is more meaningful than evaluating it ourselves). Consider sending a PR to https://github.com/grpc/grpc.github.io where the documentation lives.

+Vijay for the C++ performance stuff.
  

Amogh Akshintala

Sep 5, 2018, 5:42:43 PM
to Carl Mastrangelo, grp...@googlegroups.com, Vijay Pai
Good catch, thanks. At the moment, the client only ever has one outstanding request. It is serial. 
I’m going to rewrite the client and server to see how well it scales in terms of throughput while 
trying to keep the latency constant…

Ah, makes sense. :D

I think the flatbuffer+grpc integration is ready for use… 
I’m quite happy with the performance, but the documentation needs some serious work
to make it more beginner-friendly
(the protobuf documentation did a really good job of holding my hand).
Thanks, I’ll probably try to help fix that documentation before sending a PR to the gRPC docs. 

Once again, thanks a lot for looking over this request.
I just wanted to make sure I wasn’t doing something clearly dumb with gRPC, 
but in the end it seems protobuf just wasn’t suitable for my needs.

Cheers,
Amogh Akshintala