High amount of Semaphore timeout during high disk i/o


christian@adsquare.com
Dec 12, 2016, 12:09:57 PM
to ScyllaDB users
We're having problems with a very high number of semaphore timeouts when our Scylla servers do a lot of disk I/O (200 MB+/s of reads). This amount of I/O seems to occur mostly during compactions.

The logs are getting spammed with messages like these:
Dec 12 16:57:35 scy02 scylla[3029]:  [shard 33] storage_proxy - Exception when communicating with 10.10.101.202: Semaphore timedout
Dec 12 16:57:35 scy02 scylla[3029]:  [shard 19] storage_proxy - Exception when communicating with 10.10.101.202: Semaphore timedout
Dec 12 16:57:35 scy02 scylla[3029]:  [shard 16] storage_proxy - Exception when communicating with 10.10.101.202: Semaphore timedout
Dec 12 16:57:36 scy02 scylla[3029]:  [shard 33] rpc - client 10.10.101.203: ignoring error response: Semaphore timedout
Dec 12 16:57:36 scy02 scylla[3029]:  [shard 35] rpc - client 10.10.101.203: ignoring error response: Semaphore timedout
Dec 12 16:57:38 scy02 scylla[3029]:  [shard 32] rpc - client 10.10.101.203: ignoring error response: Semaphore timedout
Dec 12 16:57:38 scy02 scylla[3029]:  [shard 33] rpc - client 10.10.101.203: ignoring error response: Semaphore timedout
Dec 12 16:57:38 scy02 scylla[3029]:  [shard 35] rpc - client 10.10.101.203: ignoring error response: Semaphore timedout

When this is going on we're seeing a lot of read timeouts on the client side. I observe this problem on Scylla 1.3.x, 1.4.x and 1.5 rc2.
Our servers are specced like this:
256 GB RAM
2x E5-2660 v3 (20 cores / 40 threads with HT)
1x 3.6G SSD
Ubuntu 16.04 with the 4.4 lowlatency kernel

Any tips? Could we have misconfigured something somewhere? Any help would be appreciated.


Eyal Gutkind <eyal@scylladb.com>
Dec 12, 2016, 1:20:14 PM
to scylladb-users@googlegroups.com
Hi Christian,
It would help to better understand the node status during the timeouts.
Do you have any monitoring system in place? e.g. our monitoring stack :)
We can then see whether the timeouts are correlated with the compaction processes you mention above, and with cache utilization.

However, since you report this issue with 1.5, would it be possible to get more information following the procedure mentioned here?

We can look at the I/O, CPU and memory settings in your .conf files and check whether a specific node is configured differently, just to make sure nothing strange is happening.

Regards,



--
You received this message because you are subscribed to the Google Groups "ScyllaDB users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to scylladb-users+unsubscribe@googlegroups.com.
To post to this group, send email to scylladb-users@googlegroups.com.
Visit this group at https://groups.google.com/group/scylladb-users.
To view this discussion on the web visit https://groups.google.com/d/msgid/scylladb-users/cee42dae-b0d2-40f7-850a-c2a6c27d85d9%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--
Eyal Gutkind
ScyllaDB

Avi Kivity <avi@scylladb.com>
Dec 12, 2016, 1:35:31 PM
to scylladb-users@googlegroups.com
Hi,

What does your /etc/scylla.d/io.conf show?


What is the client concurrency? (sum of number of threads across all clients, assuming synchronous clients).

Note: with Scylla 1.5, the -lowlatency kernel can reduce throughput, without a corresponding improvement in latency.

christian@adsquare.com
Dec 13, 2016, 7:25:51 AM
to ScyllaDB users
Hi, thank you for your quick replies.
Here is some additional info
io.conf: SEASTAR_IO="--max-io-requests=56 --num-io-queues=14"
cpuset.conf: CPUSET="--cpuset 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39 "

It is the same across all nodes.

Normal client request concurrency is 192, but it can be much higher for shorter periods of time. At what level of concurrent requests do things start to go bad/slow?

Thanks for the heads up on the low latency kernel. I'll get that sorted.

I've been waiting for this problem to show up today so that I can capture some metrics for you. I'll upload some perf, iostat, etc. data as soon as I've got something.

Avi Kivity <avi@scylladb.com>
Dec 13, 2016, 7:48:13 AM
to scylladb-users@googlegroups.com
Your configuration seems sane, and you can push your normal request concurrency even higher.
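As a back-of-the-envelope check on how much concurrency a pool of synchronous clients can usefully generate, Little's law ties concurrency to throughput and latency. The latency figure below is an assumed example for illustration, not a number measured in this thread:

```python
# Little's law: concurrency = throughput * mean latency, so a pool of N
# synchronous client threads can sustain at most N / latency requests/s.
def max_throughput(concurrency, mean_latency_s):
    return concurrency / mean_latency_s

# 192 client threads (as reported above); 2 ms mean latency is an assumption.
print(max_throughput(192, 0.002))  # -> 96000.0 requests/s
```

If the server can complete requests faster than this ceiling, adding client threads raises throughput; timeouts start when queued requests wait longer than the client timeout.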

The only reason you should be seeing timeouts is if you push the disk farther than it can go, but 200 MB/s isn't particularly high.  How many IOPS are you pushing?  `iostat -x` output would show if the disk is the bottleneck.

If you have monitoring configured (highly recommended) then you can look at reactor.aio-reads and aio-writes to see what Scylla thinks it is doing, and at the io_queue metrics to see how it partitions I/O bandwidth between queries and compaction.

Are you doing single-partition or range queries?

christian@adsquare.com
Dec 13, 2016, 8:13:41 AM
to ScyllaDB users
Hi. Some perf data should be available on your s3 now, with uuid 79c0add4-5cbc-4ba8-a3e7-013c3cf77e7a
All queries are single partition. 

One of the servers is having a freakout session right now, and iostat -x says
Linux 4.4.0-36-lowlatency (scy02) 12/13/2016 _x86_64_ (40 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          54.08    0.00   11.03    0.07    0.00   34.82

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.01 19755.24 1455.57 102546.73 37968.54    13.25    19.59    0.92    0.73    3.53   0.04  86.24

Reactor.aio-reads / writes
scy02/reactor-0/derive-aio-read-bytes 1968981.0
scy02/reactor-0/derive-aio-write-bytes 0.0

I've attached a full scyllatop dump (10 iterations)
scyllatop.txt

christian@adsquare.com
Dec 13, 2016, 9:50:01 AM
to ScyllaDB users
I just made a strange observation, which may be related. On these nodes, when Scylla is starting, I see it reading at around 500 MB/s; after the node has come up, however, I never see Scylla read faster than 200 MB/s. In fact, it almost seems like reads are capped at exactly 200 MB/s. Is there such a limit in Scylla, configurable somewhere? This behaviour just struck me as a bit odd.

Avi Kivity <avi@scylladb.com>
Dec 13, 2016, 10:30:52 AM
to scylladb-users@googlegroups.com

There is no such limit, and I've seen Scylla read and write at multiple GB per second.


In your case, it's probably limited on IOPS, not bandwidth.  During startup it probably issues large sequential reads (commitlog replay), but here, it's doing small random reads.  Note that avgrq-sz=13 (so 7 kB/request), and the queue size is around 20.


(in fact, dividing read bandwidth by r/s, we see about 5 kB/read).
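The arithmetic can be checked directly against the iostat fields quoted earlier in the thread:

```python
# rkB/s and r/s copied from the iostat -x output above.
rkB_s, r_s = 102546.73, 19755.24
kb_per_read = rkB_s / r_s
print(round(kb_per_read, 2))  # -> 5.19 kB per read

# avgrq-sz is reported in 512-byte sectors and averages reads and writes
# together, which is why it comes out a bit larger than the read-only figure.
avgrq_kb = 13.25 * 512 / 1024
print(avgrq_kb)  # -> 6.625 kB per request
```

Either way, requests are far smaller than the 128 kB sequential reads a large streaming scan would issue, so the disk's random-IOPS limit binds long before its bandwidth limit.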


You can run diskplorer [1] when the system is idle to see what the disk is capable of.  It's probably not far from being maxed out, and at the peak, over that.


From the counters I note:


 - hit rate about 50%, does that make sense for your read load?

 - io-queue-*/delay-* reaching 300 ms, indicating disk overload

 - not doing any writes; if that's the steady state, and not already using Leveled compaction strategy, consider it

 - seeing some data reads from external nodes (instead of digest reads); are you not using prepared statements for some reads?



[1] https://github.com/avikivity/diskplorer

christian@adsquare.com
Dec 19, 2016, 8:05:27 AM
to ScyllaDB users
Hi! Sorry for the late reply.
It does indeed seem like this is caused by a high number of IOPS. The disk seems to max out at around 60k IOPS. Why the disk read rate is so high when the dataset should almost fit in memory is a bit unclear to me, but I will do some investigation on this.
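As a rough sanity check, the numbers are consistent with an IOPS limit rather than a bandwidth cap. The ~5.2 kB average read size is carried over from the iostat analysis above, so treat this as a sketch:

```python
# At ~60k IOPS and ~5.2 kB per read, the read-bandwidth ceiling is:
iops_limit = 60_000
kb_per_read = 5.2            # assumed average read size, from iostat above
mb_per_s = iops_limit * kb_per_read / 1024
print(round(mb_per_s))  # -> 305 MB/s for pure small reads
```

Once compaction I/O competes for the same IOPS budget, the achievable query-read bandwidth plausibly drops toward the observed ~200 MB/s ceiling.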
Also, for a future version, it would be nice to get some sort of summarizing warning instead of this flood of semaphore timeout messages.

Thank you for your input. Much appreciated.

Avi Kivity <avi@scylladb.com>
Dec 20, 2016, 10:34:20 AM
to scylladb-users@googlegroups.com

Hi,


The data might not be in cache because it is warming up, or because the extra memory overhead required to keep it in cache is pushing some partitions out.


You can examine the various cache metrics such as number of partitions, hit rate, miss rate, and evictions.


You are right about the exception, we should swallow it and convert it to a metric (with an alert attached) instead.
