A single synchronous call takes ~230ms on localhost!? What am I doing wrong?


aaksh...@cs.unc.edu

Sep 3, 2018, 1:39:45 PM
to grpc.io
Hail gRPC experts (;D),

I'm trying to build an image/video object detection server (as one of the reusable pieces in a benchmark suite) with low RTT requirements (near-realtime, say ~60-90ms RTT)...
I've used gRPC and protobuf (built from git master; hashes below in case that is relevant) for the serialization and transport.
_________________________________
grpc: 
commit dbc1e27e2e1a81b61eb064eb036ec6a267f88cb6
Merge: 9bc6cd1 5d24ab9
Author: Jiangtao Li <email redacted by me>
Date:   Fri Jul 20 17:00:18 2018 -0700 

protobuf:
commit b5fbb742af122b565925987e65c08957739976a7
Author: Bo Yang <email redacted by me>
Date:   Mon Mar 5 19:54:18 2018 -0800
_________________________________

gRPC seems to add an insane amount of overhead -- ~160ms (~2x the server's processing time)!
For now I'm running on a single machine (a pretty beefy machine, so contention isn't an issue...) operating over localhost (loopback).
The amount of data being transferred is considerable, but not unheard of (~4MiB per request).

Server-side timing measurements:
doDetection: new requeust 0x7ffc77f16920
0x7ffc77f16920: GPU processing took 24.045 milliseconds
0x7ffc77f16920: Server took 72.206 millisecond

Client-side measurements:
10 objects detected.
This request took 234.825 milliseconds 

Client RTT - Server processing time = 234.825 - 72.206 = 162.619ms (!??!)
I've pinned the server and client to separate cores using taskset.
There isn't anything else running on the server and it's a beefy 48 core (Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz) machine with ample RAM (128GiB), etc....

As a start, I instrumented the implementation of the synchronous call in include/grpcpp/impl/codegen/client_unary_call.h:
BlockingUnaryCallImpl(ChannelInterface* channel, const RpcMethod& method,
                         ClientContext* context, const InputMessage& request,
                         OutputMessage* result)

and found that the vast majority of the time is spent spinning on a completion queue:
line 107:   if (cq.Pluck(&ops)) {

I wonder if I need to configure gRPC differently (perhaps the default configurations are more geared towards latency-insensitive batching?)...

Any help understanding these numbers would be appreciated.

Thanks in advance,
Amogh Akshintala

Carl Mastrangelo

Sep 5, 2018, 1:59:49 PM
to grpc.io
Our own benchmarks get about 1000x better latency than that, so something is definitely up. Can you describe how you arrived at that line number, or the tools you used to profile? (We use perf and pprof.)

Amogh Akshintala

Sep 5, 2018, 2:35:10 PM
to 'Carl Mastrangelo' via grpc.io, not...@google.com
Hey Carl,

Thanks for your time!

I see the same performance numbers as reported in the protobuf performance dashboard (~0.2ms per rpc) if I set up the HelloWorld client to call SayHello 1000 times in a tight loop and average it over the 1000 calls.

How I arrived at the numbers I reported (everything is measured on localhost):
My client and server code are on GitHub.
Client:
  perform any necessary data manipulation
  pack Request message
  start = getTimeOfDay()
  stub->invokeRPC()
  checkStatus()
  end = getTimeOfDay()
  Client RTT = end - start

Server:
  start = getTimeOfDay()
  serviceRequest() <- GPU time is calculated inside this function, but ignore that for now.
  end = getTimeOfDay()
  return Status
  Server time = end - start

gRPC + protobuf overhead = Client RTT - Server time

I replaced protobuf with flatbuffers yesterday, after noticing (using perf)
that a significant chunk of processing time was spent in protobuf serialization
and deserialization code.
Latency really improved with flatbuffers (no parsing, so...),
but man, is that library hard to use/debug compared to protobuf...

New numbers with flatbuffers:
client RTT = ~70ms
Server Time = ~40ms
gRPC + flatbuffers = ~30ms (for ~4MiB of data over localhost)

Thanks for the link to pprof. Will check it out.

Cheers,
Amogh Akshintala


Carl Mastrangelo

Sep 5, 2018, 2:52:15 PM
to aaksh...@cs.unc.edu, grp...@googlegroups.com
Correct me if I am mistaken, but gettimeofday will use the realtime clock rather than the monotonic clock. The latter, I believe, is the correct way to measure time differences.

Second, if you only issue one RPC, it won't be as fast. Code won't be in cache, threads won't be active, and TCP windows won't be large enough.

Last: flatbuffers, I believe, are fixed size, which means that as you update your proto with new fields, you'll still have the old fields stick around. This puts you in a situation where you have to either reuse fields (which is suspect from an API compatibility POV) or accept the larger size (which hurts performance). You might be willing to make this trade if your API will seldom change, but generally they do change. Plain protobuf is slower now, for increased flexibility later. You're certainly allowed to swap out the serialization based on need, but I don't think the gRPC team can recommend it.

Amogh Akshintala

Sep 5, 2018, 3:14:07 PM
to Carl Mastrangelo, grp...@googlegroups.com
You’re correct, but AFAIK the choice of clock is irrelevant in this case, as the time intervals are relatively large (ms).
I’d have used RDTSC but the machine I’m on has DVFS turned on and I was too lazy to change that…
Didn’t know there was a wrapper in the std library (I’m more of a C programmer)…
I’ll look into using std::chrono::steady_clock next time…

The numbers I’m reporting aren’t from one RPC call, however. They’re an average over several invocations of the same RPC:

Sample:
$ taskset -c 1 ./client -f ../data/video/__909tIOxbc_temp.mp4
video file: ../data/video/__909tIOxbc_temp.mp4
Press control-c to quit at any point
 6 objects detected.
This request took 247.746 milliseconds
 6 objects detected.
This request took 240.866 milliseconds
 7 objects detected.
This request took 230.565 milliseconds
 8 objects detected.
This request took 227.763 milliseconds
 10 objects detected.
This request took 229.529 milliseconds
 10 objects detected.
This request took 225.853 milliseconds
 10 objects detected.
This request took 220.809 milliseconds
 9 objects detected.
This request took 229.487 milliseconds
 8 objects detected.
This request took 237.131 milliseconds
 7 objects detected.
This request took 228.043 milliseconds
 8 objects detected.
This request took 233.57 milliseconds
 9 objects detected.
This request took 230.96 milliseconds
 9 objects detected.
This request took 226.805 milliseconds
 10 objects detected.
This request took 222.559 milliseconds
 9 objects detected.
This request took 233.108 milliseconds
 9 objects detected.
This request took 225.9 milliseconds
 8 objects detected.
This request took 227.267 milliseconds
 9 objects detected.
This request took 227.419 milliseconds
 9 objects detected.
This request took 232.574 milliseconds
 9 objects detected.
This request took 231.147 milliseconds
 10 objects detected.
This request took 230.447 milliseconds

Corresponding server measurements:
$ taskset -c 5 ./server ../cfg/coco.data ../cfg/yolov3.cfg ../weights/yolov3.weights
Server listening on localhost:50051
doDetection: new requeust 0x7fff5d1dcc90
0x7fff5d1dcc90 GPU processing took 23.904 milliseconds
0x7fff5d1dcc90 doDetection: took 66.514 milliseconds
0x7fff5d1dcc90Server took 66.55 milliseconds
doDetection: new requeust 0x7fff5d1dcc90
0x7fff5d1dcc90 GPU processing took 23.804 milliseconds
0x7fff5d1dcc90 doDetection: took 66.263 milliseconds
0x7fff5d1dcc90Server took 66.288 milliseconds
doDetection: new requeust 0x7fff5d1dcc90
0x7fff5d1dcc90 GPU processing took 23.762 milliseconds
0x7fff5d1dcc90 doDetection: took 66.797 milliseconds
0x7fff5d1dcc90Server took 66.825 milliseconds
doDetection: new requeust 0x7fff5d1dcc90
0x7fff5d1dcc90 GPU processing took 23.768 milliseconds
0x7fff5d1dcc90 doDetection: took 66.182 milliseconds
0x7fff5d1dcc90Server took 66.209 milliseconds
doDetection: new requeust 0x7fff5d1dcc90
0x7fff5d1dcc90 GPU processing took 23.78 milliseconds
0x7fff5d1dcc90 doDetection: took 66.244 milliseconds
0x7fff5d1dcc90Server took 66.271 milliseconds
doDetection: new requeust 0x7fff5d1dcc90
0x7fff5d1dcc90 GPU processing took 23.772 milliseconds
0x7fff5d1dcc90 doDetection: took 66.102 milliseconds
0x7fff5d1dcc90Server took 66.13 milliseconds
doDetection: new requeust 0x7fff5d1dcc90
0x7fff5d1dcc90 GPU processing took 23.768 milliseconds
0x7fff5d1dcc90 doDetection: took 66.107 milliseconds
0x7fff5d1dcc90Server took 66.135 milliseconds
doDetection: new requeust 0x7fff5d1dcc90
0x7fff5d1dcc90 GPU processing took 23.788 milliseconds
0x7fff5d1dcc90 doDetection: took 71.868 milliseconds
0x7fff5d1dcc90Server took 72.365 milliseconds
doDetection: new requeust 0x7fff5d1dcc90
0x7fff5d1dcc90 GPU processing took 23.795 milliseconds
0x7fff5d1dcc90 doDetection: took 66.157 milliseconds
0x7fff5d1dcc90Server took 66.186 milliseconds
doDetection: new requeust 0x7fff5d1dcc90
0x7fff5d1dcc90 GPU processing took 23.815 milliseconds
0x7fff5d1dcc90 doDetection: took 66.255 milliseconds
0x7fff5d1dcc90Server took 66.284 milliseconds
doDetection: new requeust 0x7fff5d1dcc90
0x7fff5d1dcc90 GPU processing took 23.746 milliseconds
0x7fff5d1dcc90 doDetection: took 66.489 milliseconds
0x7fff5d1dcc90Server took 66.516 milliseconds
doDetection: new requeust 0x7fff5d1dcc90
0x7fff5d1dcc90 GPU processing took 23.902 milliseconds
0x7fff5d1dcc90 doDetection: took 68.414 milliseconds
0x7fff5d1dcc90Server took 68.452 milliseconds
doDetection: new requeust 0x7fff5d1dcc90
0x7fff5d1dcc90 GPU processing took 23.758 milliseconds
0x7fff5d1dcc90 doDetection: took 66.615 milliseconds
0x7fff5d1dcc90Server took 66.643 milliseconds
doDetection: new requeust 0x7fff5d1dcc90
0x7fff5d1dcc90 GPU processing took 23.811 milliseconds
0x7fff5d1dcc90 doDetection: took 66.339 milliseconds
0x7fff5d1dcc90Server took 66.368 milliseconds
doDetection: new requeust 0x7fff5d1dcc90
0x7fff5d1dcc90 GPU processing took 24.714 milliseconds
0x7fff5d1dcc90 doDetection: took 70.671 milliseconds
0x7fff5d1dcc90Server took 70.704 milliseconds
doDetection: new requeust 0x7fff5d1dcc90
0x7fff5d1dcc90 GPU processing took 23.805 milliseconds
0x7fff5d1dcc90 doDetection: took 66.269 milliseconds
0x7fff5d1dcc90Server took 66.299 milliseconds
doDetection: new requeust 0x7fff5d1dcc90
0x7fff5d1dcc90 GPU processing took 23.836 milliseconds
0x7fff5d1dcc90 doDetection: took 66.73 milliseconds
0x7fff5d1dcc90Server took 66.755 milliseconds
doDetection: new requeust 0x7fff5d1dcc90
0x7fff5d1dcc90 GPU processing took 23.791 milliseconds
0x7fff5d1dcc90 doDetection: took 65.679 milliseconds
0x7fff5d1dcc90Server took 65.707 milliseconds
doDetection: new requeust 0x7fff5d1dcc90
0x7fff5d1dcc90 GPU processing took 23.826 milliseconds
0x7fff5d1dcc90 doDetection: took 66.242 milliseconds
0x7fff5d1dcc90Server took 66.271 milliseconds
doDetection: new requeust 0x7fff5d1dcc90
0x7fff5d1dcc90 GPU processing took 23.804 milliseconds
0x7fff5d1dcc90 doDetection: took 67.864 milliseconds
0x7fff5d1dcc90Server took 67.893 milliseconds
doDetection: new requeust 0x7fff5d1dcc90
0x7fff5d1dcc90 GPU processing took 23.807 milliseconds
0x7fff5d1dcc90 doDetection: took 67.705 milliseconds
0x7fff5d1dcc90Server took 67.736 milliseconds

Re flatbuffers:
AFAIK, flatbuffers doesn’t transfer anything that is unused (fields can be marked deprecated).
If a field isn’t set to a value other than the default, it isn’t transferred over the wire;
such values are repopulated with the default (specified in the schema) on the client/server side as needed.
Of course, this only works for scalars. If you need to deprecate a vector, then yes, you’re kinda hosed...
Hmmm, I think the gRPC documentation should point to flatbuffers (which supports gRPC out of the box) as an alternative suited to low-latency cases (while making the drawbacks you mentioned clear).

Cheers,
Amogh Akshintala

Carl Mastrangelo

Sep 5, 2018, 5:24:14 PM
to aaksh...@cs.unc.edu, grp...@googlegroups.com, Vijay Pai
One thing I noticed is that you are sharing the same timespec in your service. I think you need to make one per RPC, in the event that RPCs arrive at the same time. From reading the client, it does seem serial (again, correct me if I'm mistaken).

As for serialization: it's a fine line to walk. Some of the serializers are better supported than others, and it might give the wrong impression that gRPC recommends them. For example, I recently made a blog post about how to plug in JSON as the encoding. The code is more boilerplate, but works better with existing codebases. If you are happy with the flatbuffers support, I'd be willing to take that as a signal that it's ready for use (real user feedback is more meaningful than evaluating it ourselves). Consider sending a PR to https://github.com/grpc/grpc.github.io where the documentation lives.

+Vijay for the C++ performance stuff.
  

Amogh Akshintala

Sep 5, 2018, 5:42:43 PM
to Carl Mastrangelo, grp...@googlegroups.com, Vijay Pai
Good catch, thanks. At the moment, the client only ever has one outstanding request. It is serial. 
I’m going to rewrite the client and server to see how well it scales in terms of throughput while 
trying to keep the latency constant…

Ah, makes sense. :D

I think the flatbuffer+grpc integration is ready for use… 
I’m quite happy with the performance, but the documentation needs some serious work
to make it more beginner-friendly
(the protobuf documentation did a really good job of holding my hand).
Thanks, I’ll probably try to help fix that documentation before sending a PR to the gRPC docs. 

Once again, thanks a lot for looking over this request.
I just wanted to make sure I wasn’t doing something clearly dumb with gRPC, 
but in the end it seems protobuf just wasn’t suitable for my needs.

Cheers,
Amogh Akshintala