debugging tail latencies

Alexander Gallego

<gallego.alexx@gmail.com>

Aug 16, 2016, 1:10:17 AM

to seastar-dev

Hi Guys,

I'm trying to debug some of the Seastar tail latencies:

Latency doubles around the 94th percentile, and it also doubles around the 98th percentile:

https://gist.github.com/991352d719311cd1dd65d315bd32cb57

I'm using HDR histogram to track request latencies and I'm not sure I can explain the latencies around the 94th and 98th percentiles.
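For readers who want to reproduce this kind of measurement without a histogram library, here is a minimal sketch of a nearest-rank percentile over raw latency samples. It is not the HdrHistogram API; `percentile_us` is a hypothetical helper, and it assumes the latencies were recorded in microseconds.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of HdrHistogram or Seastar).
// Returns the nearest-rank percentile of `samples_us`, with `p` in (0, 100].
// Takes the vector by value so the caller's sample order is left untouched.
inline std::uint64_t percentile_us(std::vector<std::uint64_t> samples_us, double p) {
    assert(!samples_us.empty());
    assert(p > 0.0 && p <= 100.0);
    std::sort(samples_us.begin(), samples_us.end());
    // Nearest rank: the smallest sample with at least p% of all samples at or below it.
    const auto rank = static_cast<std::size_t>(
        std::ceil(p * static_cast<double>(samples_us.size()) / 100.0));
    return samples_us[rank - 1];
}
```

Comparing `percentile_us(samples, 50)`, `percentile_us(samples, 94)` and `percentile_us(samples, 98)` shows where the steps sit without plotting anything.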

In case you are wondering, here is my driver (like seawreck.cc) program: https://github.com/senior7515/smurf/blob/feature/concurrent_requests/src/rpc/templates/client.cc

I suspect the extra latency comes from some scheduling delay, but I'm not sure how to debug it.

I saw that the reactor code has a bunch of stats, e.g. seastar::memory::stats(),

but I don't think that's helpful.

I've noticed this behavior with both the native (DPDK) and the POSIX (epoll) drivers.

Pointers on where to start debugging would be super helpful. Thanks!!

- Alex

Attachment: client.png

Avi Kivity

<avi@scylladb.com>

Aug 16, 2016, 2:25:24 AM

to Alexander Gallego, seastar-dev

What are the units on the graph?

How loaded are the client and server when this is measured?


Alexander Gallego

<gallego.alexx@gmail.com>

Aug 16, 2016, 7:54:45 AM

to seastar-dev, gallego.alexx@gmail.com

On Tuesday, August 16, 2016 at 2:25:24 AM UTC-4, Avi Kivity wrote:

> What are the units on the graph?

Microseconds on the y-axis.

The x-axis is the request percentile.

> How loaded are the client and server when this is measured?

This graph is only for the client, full round trip. Effectively I measure from the time I call "send", which does:

output_stream<char>::write(...).then(flush).then(input_stream<char>::read_exactly(40))

(https://github.com/senior7515/smurf/blob/feature/concurrent_requests/src/rpc/rpc_client.h#L56)

The server and client ran on my old laptop, each with 1 core and 1GB of memory, using the POSIX runtime, i.e.:

./a.out -c 1 -m 1G

I realize this might not be the best scenario since I had Chrome open, but I see the same tail latency step function if I run just the 2 programs (client & server), and I see it on both runtimes.
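To make the measurement boundary concrete, here is a rough, synchronous sketch of timing one full round trip with a monotonic clock. It uses only the standard library, not Seastar; `time_round_trip` and `fake_round_trip` are hypothetical names. In a Seastar client the end timestamp would instead be taken in the continuation that runs after read_exactly resolves.

```cpp
#include <chrono>
#include <thread>
#include <utility>

// Hypothetical helper: runs `round_trip` and returns how long it took.
// steady_clock is monotonic, so wall-clock adjustments cannot skew the result.
template <typename Fn>
std::chrono::microseconds time_round_trip(Fn&& round_trip) {
    const auto start = std::chrono::steady_clock::now();
    std::forward<Fn>(round_trip)();  // stands in for write + flush + read of the reply
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(end - start);
}

// Hypothetical stand-in for a real request: just sleeps for 2 ms.
inline void fake_round_trip() {
    std::this_thread::sleep_for(std::chrono::milliseconds(2));
}
```

Each value returned by `time_round_trip(fake_round_trip)` is one sample to feed into the latency histogram.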

Avi Kivity

<avi@scylladb.com>

Aug 16, 2016, 8:07:32 AM

to Alexander Gallego, seastar-dev

On 08/16/2016 02:54 PM, Alexander Gallego wrote:

> On Tuesday, August 16, 2016 at 2:25:24 AM UTC-4, Avi Kivity wrote:
>
>> What are the units on the graph?
>
> Microseconds on the y-axis.
>
> The x-axis is the request percentile.

Always specify units to avoid annoying the old-timers.

>> How loaded are the client and server when this is measured?
>
> This graph is only for the client, full round trip. Effectively I measure from the time I call "send", which does:
>
> output_stream<char>::write(...).then(flush).then(input_stream<char>::read_exactly(40))
>
> (https://github.com/senior7515/smurf/blob/feature/concurrent_requests/src/rpc/rpc_client.h#L56)
>
> The server and client ran on my old laptop, each with 1 core and 1GB of memory, using the POSIX runtime, i.e.:
>
> ./a.out -c 1 -m 1G
>
> I realize this might not be the best scenario since I had Chrome open, but I see the same tail latency step function if I run just the 2 programs (client & server), and I see it on both runtimes.

That's not what I meant, but it's a valid concern. In addition, the OS can interfere. You can see how we tune the OS in ScyllaDB to keep random processes from interfering:

https://github.com/scylladb/scylla/blob/master/dist/common/sysctl.d/99-scylla-sched.conf

Note that to get the full effect of disabling autogroup you need to reboot.
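As a quick way to see what the kernel currently reports, here is a small sketch that reads the autogroup flag from procfs. The helper names are hypothetical. The path is the procfs view of the kernel.sched_autogroup_enabled sysctl, and it only exists on Linux kernels built with autogroup support, so a missing file is treated as "unknown" rather than "disabled".

```cpp
#include <fstream>
#include <string>

// Hypothetical helper: reads one integer from a sysctl-style file.
// Returns -1 if the file is missing or does not start with an integer.
inline int read_sysctl_int(const std::string& path) {
    std::ifstream in(path);
    int value = 0;
    return (in >> value) ? value : -1;
}

// True only if the file exists and holds 0, i.e. autogroup is reported as off.
inline bool autogroup_is_disabled(
    const std::string& path = "/proc/sys/kernel/sched_autogroup_enabled") {
    return read_sysctl_int(path) == 0;
}
```

Running this check before a benchmark makes it easier to tell whether a latency step was measured with or without autogroup in play.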

What I meant is that if either the client or the server is going full tilt, then latency is going to be bad. To get good latency you need both of them running at a low load (in the reactor.load sense).
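One way to see why load matters so much is the textbook M/M/1 queue, where the mean time in the system is S / (1 - rho) for service time S and utilization rho. This is only an illustration under that assumption: a Seastar reactor is not literally an M/M/1 queue, and treating reactor.load as rho is a rough analogy. `mm1_mean_latency_us` is a hypothetical helper.

```cpp
#include <cassert>

// Hypothetical helper: mean time in system for an M/M/1 queue, W = S / (1 - rho).
// `service_time_us` is the mean service time; `utilization` is rho in [0, 1).
inline double mm1_mean_latency_us(double service_time_us, double utilization) {
    assert(service_time_us > 0.0);
    assert(utilization >= 0.0 && utilization < 1.0);
    return service_time_us / (1.0 - utilization);
}
```

Under this model, a 100 us request averages 200 us at 50% load and about 1000 us at 90% load, so the curve gets steep well before the server is saturated.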

Alexander Gallego

<gallego.alexx@gmail.com>

Aug 16, 2016, 9:45:15 AM

to Avi Kivity, seastar-dev

Thanks! Going to dig a bit deeper now.

--

Sent from my mobile, please excuse my handwriting.
