End-to-end p90 latency too long (> 40 ms) as observed from the client side


Cong Guo

unread,
Nov 3, 2021, 11:25:47 PM11/3/21
to ScyllaDB users
Hi experts, I've hit a weird issue. I ran 10,000 DynamoDB queries against Scylla from a pod on the same node, and the end-to-end p99 latency is above 40 ms, while the p99 latency observed from monitoring is less than 2 ms. How can I debug this?

Context: Scylla is deployed on an EKS cluster with 4x i3.4xlarge nodes; the client runs on the same pod as the coordinator.

Monitoring from the Scylla side:
https://files.slack.com/files-pri/T2NLYA215-F02L4SNSCH1/image.png

Monitoring from the client side:
https://files.slack.com/files-pri/T2NLYA215-F02LHETCWC9/image.png

Testing code (I scan some items from DynamoDB and query them back on Scylla):

public static void readRandom(String sourceTableName, AmazonDynamoDBAsync sourceCli,
                              String destTableName, AmazonDynamoDBAsync destCli) {
  try {
    DynamoDB ddb = new DynamoDB(sourceCli);
    Table ddbTable = ddb.getTable(sourceTableName);
    ItemCollection<ScanOutcome> scan = ddbTable.scan();
    IteratorSupport<Item, ScanOutcome> ddbIter = scan.iterator();
    long i = 0;
    BufferedWriter file = new BufferedWriter(new FileWriter("/tmp/perf.csv", false));
    while (ddbIter.hasNext()) {
      i++;
      Item item = ddbIter.next();
      String id = item.getString("id");
      // Read the same key back from the destination (Alternator) table and
      // record the per-request wall-clock latency in milliseconds.
      GetItemRequest request = new GetItemRequest()
          .withTableName(destTableName)
          .withKey(buildKey(id, "BATCH"));
      long start = System.currentTimeMillis();
      GetItemResult item1 = destCli.getItem(request);
      long duration = System.currentTimeMillis() - start;
      file.write(String.format("%d\n", duration));
      if (i == 10000) {
        break;
      }
    }
    file.close();
  } catch (Exception e) {
    e.printStackTrace();
  }
}
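To turn that CSV into the p90/p99 numbers discussed below, a minimal sketch is enough. This is not part of the original test code; it just assumes `/tmp/perf.csv` holds one millisecond value per line, as written by the loop above, and uses the nearest-rank percentile definition:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;

public class Percentiles {
    // Nearest-rank percentile over an ascending-sorted list of latencies:
    // the value below which p percent of observations fall.
    static long percentile(List<Long> sorted, double p) {
        int rank = (int) Math.ceil(p / 100.0 * sorted.size());
        return sorted.get(Math.max(0, rank - 1));
    }

    public static void main(String[] args) throws IOException {
        Path csv = Paths.get("/tmp/perf.csv");
        if (!Files.exists(csv)) {
            System.out.println("no " + csv + " - run the readRandom loop first");
            return;
        }
        List<Long> latencies = Files.readAllLines(csv).stream()
                .filter(s -> !s.isEmpty())
                .map(Long::parseLong)
                .sorted()
                .collect(Collectors.toList());
        System.out.println("p90: " + percentile(latencies, 90) + " ms");
        System.out.println("p99: " + percentile(latencies, 99) + " ms");
    }
}
```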



Cong Guo

unread,
Nov 3, 2021, 11:27:28 PM11/3/21
to ScyllaDB users
To sum up: e2e p90 latency is > 40 ms with the client running on the same pod/node as the coordinator, while Scylla monitoring shows a p90 latency < 2 ms. Inconsistent.

Cong Guo

unread,
Nov 4, 2021, 4:38:20 AM11/4/21
to ScyllaDB users
Can anyone help with this?

Dor Laor

unread,
Nov 4, 2021, 4:44:26 AM11/4/21
to ScyllaDB users
Try running the client on a pod on another server; Scylla doesn't like
to share resources in general (unless you pin everything perfectly).

Also, your test isn't parallel, so you send one request at a time.
That's fine for a test, but it doesn't represent a real workload and will leave
Scylla underutilized.
You can use scylla-bench, a test tool written in Go, to run many test patterns.

--
You received this message because you are subscribed to the Google Groups "ScyllaDB users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to scylladb-user...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/scylladb-users/0d6792f6-31ca-4da0-83db-c3cd8980a462n%40googlegroups.com.

Cong Guo

unread,
Nov 4, 2021, 5:31:10 AM11/4/21
to ScyllaDB users
Actually I did run the client on another pod and hit the same problem. I know Scylla is not under heavy load in my case, but I didn't expect p99 latency > 40 ms while the server-side monitoring (Prometheus + Grafana) shows a p99 latency of only 2 ms, so I want to find where the problem is.

To rule out any impact from k8s, I installed Scylla directly on 4x i3.4xlarge EC2 instances and reproduced the > 40 ms latency there as well. Do you need any extra info?

Cong Guo

unread,
Nov 4, 2021, 5:35:43 AM11/4/21
to ScyllaDB users
What configuration info do you need?

Piotr Sarna

unread,
Nov 4, 2021, 5:40:16 AM11/4/21
to scyllad...@googlegroups.com

Hi Cong,

I see that you're using Alternator, so the thing that would really help us is query tracing. For Alternator, the only way is to turn tracing on node-wide, which you can do via Scylla's REST API:

curl -X POST localhost:10000/storage_service/trace_probability?probability=1

Either run it locally on each node, or replace `localhost` with each node's address.

Then, run your code again.

After it's done, you can browse the traces with a helper browser script I published at https://alternator-traces.sarna.dev/ - it lets you configure the endpoint and will extract tracing information from a node. Traces contain the elapsed time for each step, so they will greatly help in seeing what happens inside Scylla: after looking at the traces we'll know whether Scylla spent ~1 ms or ~40 ms serving the query. If it spent ~1 ms, then the problem is on the driver side. Here's example tracing output - source_elapsed is the number of microseconds that have passed since the query started:

session_id event_id activity scylla_parent_id scylla_span_id source source_elapsed thread
----
905cdba0-3d52-11ec-bd5d-331fd6e9788e 905cffda-3d52-11ec-bd5d-331fd6e9788e Scan 0 253230276965299 127.0.0.1 10 shard 0
905cdba0-3d52-11ec-bd5d-331fd6e9788e 905d056f-3d52-11ec-bd5d-331fd6e9788e Performing a database query 0 253230276965299 127.0.0.1 153 shard 0
905cdba0-3d52-11ec-bd5d-331fd6e9788e 905d06f5-3d52-11ec-bd5d-331fd6e9788e read_data: querying locally 0 253230276965299 127.0.0.1 192 shard 0
905cdba0-3d52-11ec-bd5d-331fd6e9788e 905d0744-3d52-11ec-bd5d-331fd6e9788e Start querying token range (-inf, {minimum token, end}] 0 253230276965299 127.0.0.1 200 shard 0
905cdba0-3d52-11ec-bd5d-331fd6e9788e 905d096a-3d52-11ec-bd5d-331fd6e9788e Creating shard reader on shard: 0 0 253230276965299 127.0.0.1 255 shard 0
905cdba0-3d52-11ec-bd5d-331fd6e9788e 905d09e0-3d52-11ec-bd5d-331fd6e9788e Scanning cache for range (-inf, {minimum token, end}] and slice {(-inf, +inf)} 0 253230276965299 127.0.0.1 267 shard 0
905cdba0-3d52-11ec-bd5d-331fd6e9788e 905d0af7-3d52-11ec-bd5d-331fd6e9788e Page stats: 0 partition(s), 0 static row(s) (0 live, 0 dead), 0 clustering row(s) (0 live, 0 dead) and 0 range tombstone(s) 0 253230276965299 127.0.0.1 295 shard 0
905cdba0-3d52-11ec-bd5d-331fd6e9788e 905d18c5-3d52-11ec-bd5d-331fd6e9788e Querying is done 0 253230276965299 127.0.0.1 648 shard 0
905cdba0-3d52-11ec-bd5d-331fd6e9788e 905d19d9-3d52-11ec-bd5d-331fd6e9788e read_data: querying locally 0 253230276965299 127.0.0.1 676 shard 0
905cdba0-3d52-11ec-bd5d-331fd6e9788e 905d19f3-3d52-11ec-bd5d-331fd6e9788e Start querying token range ({minimum token, end}, +inf) 0 253230276965299 127.0.0.1 678 shard 0
905cdba0-3d52-11ec-bd5d-331fd6e9788e 905d1a22-3d52-11ec-bd5d-331fd6e9788e Creating shard reader on shard: 0 0 253230276965299 127.0.0.1 683 shard 0
905cdba0-3d52-11ec-bd5d-331fd6e9788e 905d1a48-3d52-11ec-bd5d-331fd6e9788e Scanning cache for range ({minimum token, end}, +inf) and slice {(-inf, +inf)} 0 253230276965299 127.0.0.1 687 shard 0
905cdba0-3d52-11ec-bd5d-331fd6e9788e 905d1ad6-3d52-11ec-bd5d-331fd6e9788e Page stats: 0 partition(s), 0 static row(s) (0 live, 0 dead), 0 clustering row(s) (0 live, 0 dead) and 0 range tombstone(s) 0 253230276965299 127.0.0.1 701 shard 0
905cdba0-3d52-11ec-bd5d-331fd6e9788e 905d20a5-3d52-11ec-bd5d-331fd6e9788e Querying is done 0 253230276965299 127.0.0.1 850 shard 0
----

Cong Guo

unread,
Nov 4, 2021, 5:44:22 AM11/4/21
to ScyllaDB users
Great, will do that, thanks.

Cong Guo

unread,
Nov 4, 2021, 5:56:04 AM11/4/21
to ScyllaDB users
Is there any other way to check the Alternator trace log? I cannot connect to the cluster via public IP through https://alternator-traces.sarna.dev/ - it shows "Failed to fetch data: NetworkingError: Network Failure."


Cong Guo

unread,
Nov 4, 2021, 5:58:04 AM11/4/21
to ScyllaDB users
Logs from journalctl:

Nov 04 09:47:16 ip-172-16-89-251 scylla[15656]:  [shard 3] tracing - Setting tracing probability to 1.0 (normalized 281474976710656)
Nov 04 09:47:16 ip-172-16-89-251 scylla[15656]:  [shard 7] tracing - Setting tracing probability to 1.0 (normalized 281474976710656)
Nov 04 09:47:16 ip-172-16-89-251 scylla[15656]:  [shard 9] tracing - Setting tracing probability to 1.0 (normalized 281474976710656)
Nov 04 09:47:16 ip-172-16-89-251 scylla[15656]:  [shard 10] tracing - Setting tracing probability to 1.0 (normalized 281474976710656)
Nov 04 09:47:16 ip-172-16-89-251 scylla[15656]:  [shard 5] tracing - Setting tracing probability to 1.0 (normalized 281474976710656)
Nov 04 09:47:16 ip-172-16-89-251 scylla[15656]:  [shard 13] tracing - Setting tracing probability to 1.0 (normalized 281474976710656)
Nov 04 09:47:16 ip-172-16-89-251 scylla[15656]:  [shard 11] tracing - Setting tracing probability to 1.0 (normalized 281474976710656)
Nov 04 09:47:16 ip-172-16-89-251 scylla[15656]:  [shard 12] tracing - Setting tracing probability to 1.0 (normalized 281474976710656)
Nov 04 09:47:16 ip-172-16-89-251 scylla[15656]:  [shard 8] tracing - Setting tracing probability to 1.0 (normalized 281474976710656)
Nov 04 09:47:16 ip-172-16-89-251 scylla[15656]:  [shard 6] tracing - Setting tracing probability to 1.0 (normalized 281474976710656)

Piotr Sarna

unread,
Nov 4, 2021, 6:03:50 AM11/4/21
to ScyllaDB users
There is - the output is less convenient, but it works. Just query the special system table exposed through Alternator. Example with the AWS CLI:

aws --endpoint-url 'http://127.0.0.1:8000' dynamodb scan --table-name .scylla.alternator.system_traces.events
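The same scan can also be issued from Java with the SDK already used in the test code. A sketch, assuming an Alternator endpoint at `http://127.0.0.1:8000` (the endpoint URL and region are placeholders; the attribute names match the trace output shown elsewhere in this thread):

```java
import com.amazonaws.client.builder.AwsClientBuilder;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.ScanRequest;
import com.amazonaws.services.dynamodbv2.model.ScanResult;

public class DumpTraces {
    public static void main(String[] args) {
        // Point the client at the Alternator endpoint instead of AWS.
        AmazonDynamoDB client = AmazonDynamoDBClientBuilder.standard()
                .withEndpointConfiguration(new AwsClientBuilder.EndpointConfiguration(
                        "http://127.0.0.1:8000", "us-east-1"))
                .build();
        // Alternator exposes system_traces.events under this special table name.
        ScanRequest scan = new ScanRequest()
                .withTableName(".scylla.alternator.system_traces.events");
        ScanResult result = client.scan(scan);
        result.getItems().forEach(item ->
                System.out.println(item.get("source_elapsed") + "us: " + item.get("activity")));
    }
}
```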

Cong Guo

unread,
Nov 4, 2021, 6:07:27 AM11/4/21
to ScyllaDB users
Seems it comes from system_traces.events; I will extract the info.

Cong Guo

unread,
Nov 4, 2021, 6:09:11 AM11/4/21
to ScyllaDB users

session_id | event_id | activity | scylla_parent_id | scylla_span_id | source | source_elapsed | thread
----
acaf0af0-3d55-11ec-b12d-bc6d814a812d | acaf2467-3d55-11ec-b12d-bc6d814a812d | GetItem | 0 | 310160318929845 | 172.16.89.251 | 0 | shard 0
acaf0af0-3d55-11ec-b12d-bc6d814a812d | acaf255b-3d55-11ec-b12d-bc6d814a812d | Creating read executor for token -5751088081278505224 with all: {172.16.89.45, 172.16.89.191, 172.16.89.194} targets: {172.16.89.45} repair decision: NONE | 0 | 310160318929845 | 172.16.89.251 | 24 | shard 0
acaf0af0-3d55-11ec-b12d-bc6d814a812d | acaf257b-3d55-11ec-b12d-bc6d814a812d | read_data: sending a message to /172.16.89.45 | 0 | 310160318929845 | 172.16.89.251 | 27 | shard 0
acaf0af0-3d55-11ec-b12d-bc6d814a812d | acaf3424-3d55-11ec-b12d-bc6d814a812d | read_data: got response from /172.16.89.45 | 0 | 310160318929845 | 172.16.89.251 | 403 | shard 0
----
Piotr Dulikowski

unread,
Nov 4, 2021, 6:09:22 AM11/4/21
to scyllad...@googlegroups.com
This sounds like "the 40 millisecond bug": https://vorner.github.io/2020/11/06/40-ms-bug.html . Your issue might be caused by a bad interaction between Nagle's algorithm and TCP's delayed-ACK mechanism, which increases latency on the client side (more details in the linked article).

I'm not familiar with the DynamoDB library, but check whether it is possible to disable either Nagle's algorithm or delayed ACKs, and try that.
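At the plain-socket level, disabling Nagle's algorithm is the `TCP_NODELAY` option. A minimal sketch in Java (the loopback listener is just scaffolding so the example is self-contained; whether and where the AWS SDK exposes this setting is not confirmed here):

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

public class NoDelayDemo {
    public static void main(String[] args) throws IOException {
        // Local listener purely for demonstration; in practice the peer
        // would be the Scylla/Alternator endpoint.
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(server.getInetAddress(), server.getLocalPort())) {
            // TCP_NODELAY=true disables Nagle's algorithm: small writes go out
            // immediately instead of waiting for the ACK of previously sent
            // data, which is what interacts badly with delayed ACKs and
            // produces the ~40 ms stalls.
            client.setTcpNoDelay(true);
            System.out.println("TCP_NODELAY enabled: " + client.getTcpNoDelay());
        }
    }
}
```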

Piotr Sarna

unread,
Nov 4, 2021, 6:12:31 AM11/4/21
to scyllad...@googlegroups.com

The traces indicate that Scylla processed your query in ~400 microseconds, which translates to sub-1 ms latency, as reported by monitoring. This strongly suggests the problem lies on the driver side - in particular, take a look at Piotr Dulikowski's suggestion about the "40 ms bug", which comes from using Nagle's algorithm and delayed ACKs at the same time. If that's indeed the case, the fix is to disable one of the two mechanisms, either system-wide or at least in the library you're using.

Cong Guo

unread,
Nov 4, 2021, 6:22:05 AM11/4/21
to ScyllaDB users
I will look into the article. Thanks a lot to you all!

Dor Laor

unread,
Nov 4, 2021, 12:43:02 PM11/4/21
to ScyllaDB users

Cong Guo

unread,
Nov 4, 2021, 11:46:47 PM11/4/21
to ScyllaDB users
Hi dor,

Thank you. Actually I want to simulate the real scenario in our production: my clients will access Scylla via the DynamoDB API. I found that about 10% of the ACKs are delayed by 40 ms even when the PSH flag is set, which is weird. Note that I do access the pod in EKS directly. Even stranger: if I access Scylla via a classic load balancer, the latency is pretty low, < 10 ms.

Capture of the 40 ms latency:
https://files.slack.com/files-pri/T02B9QAPR-F02L33C0T46/image.png

Cong Guo

unread,
Nov 5, 2021, 3:17:06 AM11/5/21
to ScyllaDB users
It may be related to the DynamoDB SDK.