Scylla-monitoring network latency?

hor...@gmail.com

<horschi@gmail.com>

Mar 25, 2022, 11:47:13 AM

to ScyllaDB users

Hi,

I am currently investigating a Scylla cluster (in Microsoft Azure) that sporadically seems very slow.

I have the feeling that it's related to the internal network being slow. Is there any good metric in Scylla that reports network latency?

best regards,

Christian

hor...@gmail.com

<horschi@gmail.com>

Mar 25, 2022, 2:54:32 PM

to ScyllaDB users

After some more investigation, it has also become clearer that I am generally not able to fully utilize the servers. They have 8 vCPUs and the very expensive Azure SSD storage.

Still, CPU usage rarely goes above 200% and disk I/O never goes above ~2k ops per server (the screenshot shows the total across 9 servers).

Queries on that cluster are quite slow at 7 ms (measured from the application).

The query I/O queue delay is reported as 500 µs, so I assume I/O is not the problem? Could network latency be an issue?

q.png

Avi Kivity

<avi@scylladb.com>

Mar 27, 2022, 7:17:23 AM

to scylladb-users@googlegroups.com, hor...@gmail.com

Switch monitoring to shard view to see if a shard reaches 100% load.

Also, check if a vcpu that is assigned to a shard has high hardirq/softirq load. It could be that perftune is not distributing NIC interrupts correctly.
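A minimal sketch of the hardirq/softirq check, assuming a Linux node where /proc/stat is readable. The helper names (parse_proc_stat, irq_share, sample) are made up for illustration and are not part of Scylla or scylla-monitoring. It compares two snapshots and reports, per vCPU, the fraction of time spent in hardirq and softirq:

```python
import time


def parse_proc_stat(text):
    """Map CPU id -> list of cumulative tick counters from /proc/stat text."""
    cpus = {}
    for line in text.splitlines():
        fields = line.split()
        # Keep only per-CPU rows ("cpu0", "cpu1", ...), not the aggregate "cpu" row.
        if fields and fields[0].startswith("cpu") and fields[0][3:].isdigit():
            cpus[int(fields[0][3:])] = [int(v) for v in fields[1:]]
    return cpus


def irq_share(before, after):
    """Return {cpu: (hardirq_fraction, softirq_fraction)} between two snapshots."""
    shares = {}
    for cpu, new in after.items():
        old = before.get(cpu)
        if old is None:
            continue
        # Columns: user nice system idle iowait irq softirq steal.
        # Guest time is already counted in user, so only the first 8 are summed.
        delta = [n - o for n, o in zip(new[:8], old[:8])]
        total = sum(delta)
        if len(delta) >= 7 and total > 0:
            shares[cpu] = (delta[5] / total, delta[6] / total)
    return shares


def sample(interval_s=5.0, path="/proc/stat"):
    """Read /proc/stat twice, interval_s apart, and return the per-CPU IRQ shares."""
    with open(path) as f:
        before = parse_proc_stat(f.read())
    time.sleep(interval_s)
    with open(path) as f:
        after = parse_proc_stat(f.read())
    return irq_share(before, after)
```

Calling sample() on a node and comparing the vCPUs that run shards against the remaining ones should show whether a single vCPU carries most of the interrupt time.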

--
You received this message because you are subscribed to the Google Groups "ScyllaDB users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to scylladb-user...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/scylladb-users/12468f93-3b1d-4c70-8980-410f50837f0dn%40googlegroups.com.

hor...@gmail.com

<horschi@gmail.com>

Mar 28, 2022, 7:31:46 AM

to ScyllaDB users

I will test it ....

horschi

<horschi@gmail.com>

Mar 28, 2022, 3:41:10 PM

to ScyllaDB users

Hi Avi,

First, some more background:

I use an overprovisioned setup with CPUSET=" --smp 5". This Azure cluster is the first one to show such bad behaviour. We otherwise mostly run on bare metal.

I looked at the data from my Friday test: I picked one node and looked at the load per shard (see screenshots). Indeed there is some imbalance, e.g. shard 1 seems to run at 100% for longer.

I will try to correlate those with IRQs...
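One possible way to do that correlation, as a rough sketch: sum the interrupt counters in /proc/interrupts per CPU column for the rows that belong to the NIC. The function name irq_counts_per_cpu and the name_filter argument are hypothetical; the substring that identifies the NIC rows depends on the driver and has to be looked up on the machine.

```python
def irq_counts_per_cpu(interrupts_text, name_filter):
    """Sum interrupt counts per CPU column for rows containing name_filter.

    Returns a list indexed by column position in the /proc/interrupts header
    (CPU0, CPU1, ...), not necessarily by CPU id if CPUs are offline.
    """
    lines = interrupts_text.splitlines()
    if not lines:
        return []
    n_cpus = len(lines[0].split())  # header row: "CPU0 CPU1 ..."
    totals = [0] * n_cpus
    for line in lines[1:]:
        if name_filter not in line:
            continue
        _, _, rest = line.partition(":")
        counts = rest.split()[:n_cpus]
        # Skip short rows such as "ERR:" that have no per-CPU counters.
        if len(counts) < n_cpus or not all(c.isdigit() for c in counts):
            continue
        for column, count in enumerate(counts):
            totals[column] += int(count)
    return totals
```

The counters are cumulative since boot, so taking two readings and subtracting them gives the distribution during a test run. If most NIC interrupts land on a CPU that also runs a busy shard, that would match the imbalance seen in the per-shard load.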

shard0.png

shard1.png

shard4.png

shard2.png

shard3.png

horschi

<horschi@gmail.com>

May 20, 2022, 4:07:13 AM

to ScyllaDB users

Hi,

To give feedback on the topic: I added --cpuset 0-4 to the CPU settings and the behaviour seems better now. Strangely, it only seems to have this impact on Azure VMs. We did not see any difference on bare-metal servers.
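For reference, a small sketch that expands a CPU list such as 0-4 and compares its size with the shard count from --smp 5. It assumes the usual Linux CPU-list syntax (comma-separated ids and inclusive ranges); parse_cpu_list is a made-up helper, not a Scylla tool.

```python
def parse_cpu_list(spec):
    """Expand a CPU list like "0-4" or "0,2,4-6" into a sorted list of CPU ids."""
    cpus = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            # Ranges are inclusive on both ends.
            low, high = part.split("-", 1)
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


cpus = parse_cpu_list("0-4")
smp = 5  # shard count from the CPUSET setting mentioned earlier in the thread
print(cpus, len(cpus) == smp)
```

With a cpuset of 0-4 and 5 shards, each shard has one CPU of its own, and CPUs 5-7 of the 8-vCPU VM are left outside the set.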