Using Scylla as a key/object store


jonathan.guberman@gmail.com

<jonathan.guberman@gmail.com>
May 5, 2017, 3:01:13 PM
to ScyllaDB users

Hello,


We’re currently testing Scylla for use as a pure key-object store for data blobs around 10kB - 60kB each. Our use case is storing on the order of 10 billion objects with about 5-20 million new writes per day. A written object will never be updated or deleted. Objects will be read at least once, some time within 10 days of being written. This will generally happen as a batch; that is, all of the images written on a particular day will be read together at the same time. This batch read will only happen one time; future reads will happen on individual objects, with no grouping, and they will follow a long-tail distribution, with popular objects read thousands of times per year but most read never or virtually never.
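Rough arithmetic on the scale described above (a back-of-envelope sketch; the ~35 kB average is an assumed midpoint of the 10-60 kB range, not a measured figure):

```python
# Back-of-envelope sizing for the workload described above.
# Assumption: ~35 kB average object (midpoint of 10-60 kB).
AVG_OBJECT_BYTES = 35 * 1024
TOTAL_OBJECTS = 10_000_000_000
WRITES_PER_DAY = 20_000_000  # upper end of the 5-20 million/day range

total_tb = TOTAL_OBJECTS * AVG_OBJECT_BYTES / 1024**4
daily_gb = WRITES_PER_DAY * AVG_OBJECT_BYTES / 1024**3

print(f"total data: ~{total_tb:.0f} TB before replication")
print(f"daily ingest: ~{daily_gb:.0f} GB at peak")
```

On these assumptions that is on the order of 330 TB of raw data and a bit under 700 GB of new writes per day, before replication.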


I’ve set up a small four node test cluster and have written test scripts to benchmark writing and reading our data. The table I’ve set up is very simple: an ascii primary key column with the object ID and a blob column for the data. All other settings were left at their defaults.

 

I’ve found write speeds to be very fast to begin with. When testing with Cassandra we found that periodically, writes would slow to a crawl for anywhere from half an hour to two hours, after which speeds recover to their previous levels. Scylla does not experience these periodic slowdowns, but it does seem to slow down over time, and, unlike Cassandra, does not seem to recover to the previous speeds. Over the course of two days of writing, the write speeds slowed by a factor of four. If this trend continues, then Scylla won't work for us in production.


Read speeds have been more disappointing. Cached reads are very fast, but random read speed averages about 4 MB/sec, which is too slow when we need to read out a batch of several million objects. I don’t think it’s reasonable to assume that these rows will all still be cached by the time we need to read them for that first large batch read.


My general question is whether anyone has any suggestions for how to improve performance for our use case. More specifically:


- Is there a way to mitigate or eliminate the write speed slowing down over time that I observe?

- Are there settings I should be using in order to maximize read speeds for random reads?

- Is there a way to design our tables to improve the read speeds for the initial large batched reads? I was thinking of using a batch ID column that could be used to retrieve the data for the initial block. However, future reads would need to be done by the object ID, not the batch ID, so it seems to me I’d need to duplicate the data, one in a “objects by batch” table, and the other in a simple “objects” table. Is there a better approach than this?
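For what it's worth, the dual-table idea in the last question could look roughly like this (a sketch only; the table and column names are hypothetical, not an established schema):

```python
# Hypothetical dual-table design: one copy keyed by batch for the
# one-time bulk read, one copy keyed by object ID for later lookups.
#
#   CREATE TABLE objects_by_batch (
#       batch_id  text, object_id ascii, data blob,
#       PRIMARY KEY (batch_id, object_id));
#   CREATE TABLE objects (
#       object_id ascii PRIMARY KEY, data blob);

def select_for(batch_id=None, object_id=None):
    """Return the CQL statement for whichever access path is in use."""
    if batch_id is not None:
        return ("SELECT object_id, data FROM objects_by_batch "
                "WHERE batch_id = %s", (batch_id,))
    return ("SELECT data FROM objects WHERE object_id = %s", (object_id,))

stmt, params = select_for(batch_id="2017-05-05")
print(stmt)
```

Writes would go to both tables; the batch copy serves the one-time bulk read and could even be dropped afterwards, while the plain copy serves the long-tail point lookups.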


Thank you!


Jonathan


Dor Laor

<dor@scylladb.com>
May 5, 2017, 3:42:11 PM
to ScyllaDB users
On Fri, May 5, 2017 at 12:01 PM, <jonathan...@gmail.com> wrote:

Hello,



Hi Jonathan, 

We’re currently testing Scylla for use as a pure key-object store for data blobs around 10kB - 60kB each. Our use case is storing on the order of 10 billion objects with about 5-20 million new writes per day. A written object will never be updated or deleted. Objects will be read at least once, some time within 10 days of being written. This will generally happen as a batch; that is, all of the images written on a particular day will be read together at the same time. This batch read will only happen one time; future reads will happen on individual objects, with no grouping, and they will follow a long-tail distribution, with popular objects read thousands of times per year but most read never or virtually never.

When you say batch, do you mean like 'plenty at once' or a real CQL batch where it's all or nothing?
 


I’ve set up a small four node test cluster and have written test scripts to benchmark writing and reading our data. The table I’ve set up is very simple: an ascii primary key column with the object ID and a blob column for the data. All other settings were left at their defaults.

 

I’ve found write speeds to be very fast to begin with. When testing with Cassandra we found that periodically, writes would slow to a crawl for anywhere from half an hour to two hours, after which speeds recover to their previous levels. Scylla does not experience these periodic slowdowns, but it does seem to slow down over time, and, unlike Cassandra, does not seem to recover to the previous speeds. Over the course of two days of writing, the write speeds slowed by a factor of four. If this trend continues, then Scylla won't work for us in production.

It is probably the compaction cost. We behave better than Cassandra's spikiness, but over time the database has
more data and needs to merge a growing number of files.
Since there is no delete/update, the LCS strategy may be better for your case.

Most important is to provide statistics, first with the nodetool compactionhistory command and later by deploying our
monitoring stack (based on Prometheus), which will allow us to know what's going on.

Which AWS instances are you using?


Read speeds have been more disappointing. Cached reads are very fast, but random read speed averages about 4 MB/sec, which is too slow when we need to read out a batch of several million objects. I don’t think it’s reasonable to assume that these rows will all still be cached by the time we need to read them for that first large batch read.

It can be a function of slow disks. Very large i3 instances have very fast disks.
The key to everything is to start with our monitoring.

Cheers,
Dor


My general question is whether anyone has any suggestions for how to improve performance for our use case. More specifically:


- Is there a way to mitigate or eliminate the write speed slowing down over time that I observe?

- Are there settings I should be using in order to maximize read speeds for random reads?

- Is there a way to design our tables to improve the read speeds for the initial large batched reads? I was thinking of using a batch ID column that could be used to retrieve the data for the initial block. However, future reads would need to be done by the object ID, not the batch ID, so it seems to me I’d need to duplicate the data, one in a “objects by batch” table, and the other in a simple “objects” table. Is there a better approach than this?


Thank you!


Jonathan


--
You received this message because you are subscribed to the Google Groups "ScyllaDB users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to scylladb-users+unsubscribe@googlegroups.com.
To post to this group, send email to scylladb-users@googlegroups.com.
Visit this group at https://groups.google.com/group/scylladb-users.
To view this discussion on the web visit https://groups.google.com/d/msgid/scylladb-users/15b0145b-eb2a-4534-9dd2-cafe3e8caeb2%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Glauber Costa

<glauber@scylladb.com>
May 5, 2017, 3:52:59 PM
to ScyllaDB users
Hello
It would help to know how fast are your disks, and how they are
configured (io.conf)

The "slowdown" you are seeing is likely the result of Workload Conditioning
(more at: http://www.scylladb.com/2016/12/15/sswc-part1/).

If that is the case, that's not really a "slowdown", but more like the
system being brought to disk speed, and at some point it will
stabilize.

We always recommend testing the behavior of complex systems over time
- as you did - because only then will you see the real speed of your
system.

As for your reads: 4 MB/s sounds excruciatingly slow. There is
definitely something wrong there - unless your disks can only do
around 20 MB/s.

If you are deploying our monitoring system as well, I would advise
sharing some metrics with us, so we can take a closer look.


>
> - Are there settings I should be using in order to maximize read speeds for
> random reads?
>
> - Is there a way to design our tables to improve the read speeds for the
> initial large batched reads? I was thinking of using a batch ID column that
> could be used to retrieve the data for the initial block. However, future
> reads would need to be done by the object ID, not the batch ID, so it seems
> to me I’d need to duplicate the data, one in a “objects by batch” table, and
> the other in a simple “objects” table. Is there a better approach than this?
>
>
> Thank you!
>
>
> Jonathan
>
>

Jonathan M. Guberman

<jonathan.guberman@gmail.com>
May 5, 2017, 3:59:07 PM
to scylladb-users@googlegroups.com
On 5 May 2017 at 15:41, Dor Laor <d...@scylladb.com> wrote:
When you say batch, do you mean like 'plenty at once' or a real CQL batch where it's all or nothing?

I mean 'plenty at once,' sorry for the ambiguity.
I'll look into that, thanks.
 
It is probably the compaction cost. We behave better than Cassandra's spikiness but overtime the database has
more data and need to merge growing amount of files. 
Since there is no delete/update, LCS strategy may be better for your case.

That too.
 

Most important is to provide statistics, first with the nodetool compaction history commands and later by deploying our 
monitoring stack (based on Prometheus) which will allow us to know what's going on.

I'll set that up, rerun the tests, and report again.
 

Which AWS instances are you using?

We're not using AWS, this is all on local hardware.
 
Thank you for the quick and detailed reply! I'll look in to those suggestions and report back to the list.

Jonathan

Glauber Costa

<glauber@scylladb.com>
May 5, 2017, 4:05:57 PM
to ScyllaDB users
On Fri, May 5, 2017 at 3:58 PM, Jonathan M. Guberman
<jonathan...@gmail.com> wrote:
>
>
> On 5 May 2017 at 15:41, Dor Laor <d...@scylladb.com> wrote:
>>
>> When you say batch, do you mean like 'plenty at once' or a real CQL batch
>> where it's all or nothing?
>
>
> I mean 'plenty at once,' sorry for the ambiguity.
>
>>
>> Try the parallel table scan technique:
>> http://www.scylladb.com/2017/03/28/parallel-efficient-full-table-scan-scylla/
>
>
> I'll look into that, thanks.
>
>>
>> It is probably the compaction cost. We behave better than Cassandra's
>> spikiness but overtime the database has
>> more data and need to merge growing amount of files.
>> Since there is no delete/update, LCS strategy may be better for your case.
>
>
> That too.
>

I would also look at write rates for LCS. If write rates are
high/continuous, I wouldn't recommend LCS.
In your case, they arrive in batches, so it can be a win.

The best way to find out is to give it a try.

>>
>>
>> Most important is to provide statistics, first with the nodetool
>> compaction history commands and later by deploying our
>> monitoring stack (based on Prometheus) which will allow us to know what's
>> going on.
>
>
> I'll set that up, rerun the tests, and report again.
>
>>
>>
>> Which AWS instances are you using?
>
>
> We're not using AWS, this is all on local hardware.
Can you get us details about your hardware?

>
> Thank you for the quick and detailed reply! I'll look in to those
> suggestions and report back to the list.
>
> Jonathan
>

Jonathan M. Guberman

<jonathan.guberman@gmail.com>
May 8, 2017, 4:28:44 PM
to scylladb-users@googlegroups.com

If you are deploying our monitoring system as well, I would advise

sharing some metrics with us, so we can take a closer look.

I've set up the Prometheus monitoring on my cluster so I can start gathering stats. What, in particular, would be the most useful statistics to report?

Can you get us details about your hardware?

2 x E5-2660 8-core Xeons
64GB RAM DDR-3 PC1300
10Gb internal network (SFP+) 
LSI 9210-8i controller (IT mode)
2TB HDD for data
200GB SSD for commitlogs

I'm sorry, I'm not sure what you mean about the disk configuration ("io.conf").




Glauber Costa

<glauber@scylladb.com>
May 8, 2017, 4:42:31 PM
to ScyllaDB users
On Mon, May 8, 2017 at 4:28 PM, Jonathan M. Guberman
<jonathan...@gmail.com> wrote:
>>
>> If you are deploying our monitoring system as well, I would advise
>> sharing some metrics with us, so we can take a closer look.
>
>
> I've set up the Prometheus monitoring on my cluster so I can start gathering
> stats. What, in particular, would be the most useful statistics to report?
>

If you have deployed our docker images for prometheus + grafana
according to https://github.com/scylladb/scylla-grafana-monitoring,
you should be able to go to port 3000 (instead of prometheus' 9090),
and there you will find 3 dashboards per version (Cluster, Server,
I/O)

A screenshot of those dashboards would be a great start, as they will
allow us to have an overview and from there we can ask for more
specific metrics.

>> Can you get us details about your hardware?
>
>
> 2 x E5-2660 8-core Xeons
> 64GB RAM DDR-3 PC1300
> 10Gb internal network (SFP+)
> LSI 9210-8i controller (IT mode)
> 2TB HDD for data
> 200GB SSD for commitlogs
>
> I'm sorry, I'm not sure what you mean about the disk configuration
> ("io.conf").

No - it's my bad. I should have been more specific.
To run out of developer mode, the setup procedure (which you have
probably done) needs to create a file, /etc/scylla.d/io.conf.

That is what I am after.

Also, the fact that you have split data / commitlog is interesting:
although we support it, we have known issues extracting optimal
performance out of that setup in some circumstances. Your io.conf
contents will shed some light on this.

Jonathan M. Guberman

<jonathan.guberman@gmail.com>
May 8, 2017, 4:59:58 PM
to scylladb-users@googlegroups.com
On 8 May 2017 at 16:42, Glauber Costa <gla...@scylladb.com> wrote:
If you have deployed our docker images for prometheus + grafana
according to https://github.com/scylladb/scylla-grafana-monitoring,
you should be able to go to port 3000 (instead of prometheus' 9090),
and there you will find 3 dashboards per version (Cluster, Server,
I/O)

I actually didn't use the docker image, I just set up Grafana myself, using the JSON config files from that GitHub repo. I didn't realize there were three separate dashboards, though, so I just grabbed one of the configs. I'll set the others up, gather some stats, and send screenshots. 
 
No - it's my bad. I should have been more specific.
To run out of developer mode, the setup procedure (that you have
probably done) need to create a file, /etc/scylla.d/io.conf

That is what I am after.

This appears to be different on each node. Each one is a single line. Here are all four:

Node 1:
SEASTAR_IO="--max-io-requests=96 --num-io-queues=24"

Node 2:
SEASTAR_IO="--max-io-requests=150"

Node 3:
SEASTAR_IO="--max-io-requests=64 --num-io-queues=16"

Node 4:
SEASTAR_IO="--max-io-requests=40 --num-io-queues=10"
 

Also, that you have a split data / commitlog is interesting: although
we support it, we have known issues extracting optimal performance out
of that setup in some circumnstances. Your io.conf contents will shed
some light on this.

That is interesting; I'd just assumed it would be better to split it across the two, but of course I can change it if it is going to negatively impact performance.

 

Glauber Costa

<glauber@scylladb.com>
May 8, 2017, 8:39:24 PM
to ScyllaDB users
Especially since you care about read performance, having everything on
SSD is the better option from the overall system perspective. I am
assuming you are not doing this because you need the storage capacity
that the HDD can give you at this price point.

In that case, using the commitlog on the separate SSD is the right
decision, and the performance will be better than leaving everything
on HDD. It is just that it could be even better - and will be in the
future - when Scylla's I/O stack starts making smarter use of the
different disks.

Nadav Har'El

<nyh@scylladb.com>
May 9, 2017, 1:56:51 AM
to scylladb-users@googlegroups.com
On Mon, May 8, 2017 at 11:28 PM, Jonathan M. Guberman <jonathan...@gmail.com> wrote:

Can you get us details about your hardware?

2 x E5-2660 8-core Xeons
64GB RAM DDR-3 PC1300
10Gb internal network (SFP+) 
LSI 9210-8i controller (IT mode)
2TB HDD for data
200GB SSD for commitlogs

So you have a spinning disk - not an SSD - for the data, and that has, as we all know, very slow seek rates. Each time you read an uncached 10K object it requires a disk seek, or actually more than one seek (at least one in the index file and one in the data file), so it's not surprising you're seeing the random reads progressing at just 4 MB a second :-(
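A rough sanity check of that explanation (assumed numbers, not measurements from this cluster: ~8 ms per seek, two seeks per uncached read, ~35 kB average object; real drives vary):

```python
# Model of random-read throughput on a spinning disk.
# Assumptions (not measured on this cluster): ~8 ms/seek,
# 2 seeks per read (index file + data file), ~35 kB object.
SEEK_SECONDS = 0.008
SEEKS_PER_READ = 2
OBJECT_BYTES = 35 * 1024

reads_per_second = 1 / (SEEK_SECONDS * SEEKS_PER_READ)
throughput_mb = reads_per_second * OBJECT_BYTES / 1024**2

print(f"~{reads_per_second:.0f} reads/s -> ~{throughput_mb:.1f} MB/s per disk")
```

A couple of MB/s per spindle is the same order of magnitude as the 4 MB/s observed here, which supports the seek-bound explanation.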

I think there are two solutions. The first is, of course, to switch to SSD also for the data. This is the best way in the long run, and should dramatically improve your read performance, but I don't know how it fits your cost analysis today.

There may be a more "hacky" solution for your current needs. If I understand correctly, you mostly (?) care about the read performance during those read "batches", where you want to read a lot of small objects written on the same day. So one solution is to model your data differently: don't write every object as a separate 10K partition, but rather put all the objects of the same hour (or whatever other granularity) into separate clustering rows of the same partition. Similarly to how "time series" data is usually modeled in Cassandra or Scylla. Now, reading a batch will no longer involve reading thousands of 10K objects all over the disk with thousands of seeks - but rather involve fewer, larger reads, which are very fast on HDD (in HDD, seeks are slow but contiguous reads are fast). A real random-access read will be slightly slower than before, but you may not notice this because the seek cost will dominate the cost of the random-access read anyway.
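A sketch of that time-bucketed model (the schema and the hour granularity are illustrative assumptions, not exact recommended names):

```python
from datetime import datetime, timezone

# Hypothetical schema for the clustering-row idea:
#   CREATE TABLE objects_by_hour (
#       hour_bucket text,
#       object_id   ascii,
#       data        blob,
#       PRIMARY KEY (hour_bucket, object_id));
# All objects written in the same hour share one partition, so the
# batch read becomes a mostly sequential partition scan instead of
# millions of individual seeks.

def hour_bucket(ts: datetime) -> str:
    """Partition key shared by every object written in the same hour."""
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%dT%H")

print(hour_bucket(datetime(2017, 5, 8, 14, 30, tzinfo=timezone.utc)))
# 2017-05-08T14
```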

By the way, since you are comparing Scylla's performance to Cassandra, I wonder if you have the same slow read problem also in Cassandra. I assume you do, because Cassandra would also need to seek in the disk on every read. But if you don't, we need to figure out why.

Nadav.

Glauber Costa

<glauber@scylladb.com>
May 9, 2017, 9:07:36 AM
to ScyllaDB users
There is a trade-off here: if that is done, every row in that hour
will have the same partition key. That leads to bad sharding, with
very real consequences: every request in that hour will be sent to the
same node, and in Scylla's case, the same CPU.


> Similarly to how "time series" data is usually modeled
> in Cassandra or Scylla. Now, reading a batch will no longer involve reading
> thousands of 10K objects all over the disk with thousands of seeks - but
> rather involve fewer larger reads, which are very fast on HDD (in HDD, seeks
> are slow but contiguous reads are fast). A real random-access read will be
> slightly slower than before, but you may not notice this because the seek
> cost will dominate the cost of the random-access read anyway.
>
> By the way, since you are comparing Scylla's performance to Cassandra, I
> wonder if you have the same slow read problem also in Cassandra. I assume
> you do, because Cassandra would also need to seek in the disk on every read.
> But if you don't, we need to figure out why.
>
> Nadav.
>

Jonathan M. Guberman

<jonathan.guberman@gmail.com>
May 9, 2017, 9:19:10 AM
to scylladb-users@googlegroups.com
On 9 May 2017 at 01:56, Nadav Har'El <n...@scylladb.com> wrote:

There may be a more "hacky" solution for your current needs. If I understand correctly, you mostly (?) care about the read performance during those read "batches", where you want to read a lot of small objects written in the same day. So one solution is to model your data differently: Don't write every object as a separate 10K paratition, but rather put all the objects of the same hour (or whatever other granularity) into separate clustering rows of the same partition. 

I was actually considering doing exactly this, although with an even hackier wrinkle to it: I was thinking about writing the data twice, once using time as the partition and once using the object ID (to a separate table). Once the time-series copy has been read, it can be deleted, and then future reads, which don't care about the batch, can go to the "long term" storage in the ID-based table. However, this doesn't mitigate the partitioning problem that Glauber mentions. I wonder, though, if we might be able to find a level of granularity that balances those trade-offs.
 
By the way, since you are comparing Scylla's performance to Cassandra, I wonder if you have the same slow read problem also in Cassandra. I assume you do, because Cassandra would also need to seek in the disk on every read. But if you don't, we need to figure out why.

Scylla read performance has been significantly better than Cassandra. Cassandra write performance is better overall, because it doesn't slow down over time. I'm hoping that changing the compaction strategy might help with that.


Glauber Costa

<glauber@scylladb.com>
May 9, 2017, 9:24:21 AM
to ScyllaDB users
We should look into your prometheus/grafana graphs, and see if you
have requests blocked (there is a graph for that in the per-server
dash).


Nadav Har'El

<nyh@scylladb.com>
May 9, 2017, 9:54:13 AM
to scylladb-users@googlegroups.com
On Tue, May 9, 2017 at 4:07 PM, Glauber Costa <gla...@scylladb.com> wrote:

>
> There may be a more "hacky" solution for your current needs. If I understand
> correctly, you mostly (?) care about the read performance during those read
> "batches", where you want to read a lot of small objects written in the same
> day. So one solution is to model your data differently: Don't write every
> object as a separate 10K paratition, but rather put all the objects of the
> same hour (or whatever other granularity) into separate clustering rows of
> the same partition.

There is a trade-off here: if that is done, every row in that hour
will have the same partition key. That leads to bad sharding, with
very real consequences: every request in that hour will be sent to the
same node, and in Scylla's case, the same CPU.

You're right (I assume you mean "write requests" when you mention requests above).

One way to solve this is perhaps not to write to one partition every hour but 100 different
partitions every hour - each request goes to one of those 100 based on some hash function
or something. When we need to read the entire hour, we need to do 100 reads and not one,
but at least it's 100 and not 100,000 even if there were 100,000 items written in that hour.

In any case you're right that it's a tradeoff. The only way to make everything better would be to use SSD.
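The 100-way spreading described above can be sketched like this (the names, the md5-based hash, and the bucket count are illustrative choices, not a prescribed scheme):

```python
import hashlib

N_BUCKETS = 100  # sub-partitions per hour, per the example above

def spread_key(hour: str, object_id: str, n: int = N_BUCKETS) -> str:
    """Compose a partition key such as '2017-05-08T14#37' so writes
    within one hour spread over n partitions (and thus many shards)."""
    h = int.from_bytes(hashlib.md5(object_id.encode()).digest()[:4], "big")
    return f"{hour}#{h % n}"

# Reading the whole hour back is then n partition reads, not 100,000:
keys = {spread_key("2017-05-08T14", f"obj-{i}") for i in range(10_000)}
print(len(keys))  # at most 100 distinct partitions
```

Point reads by object ID still work, since the bucket is recomputable from the ID alone once the hour is known.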

Jonathan M. Guberman

<jonathan.guberman@gmail.com>
May 9, 2017, 4:00:21 PM
to scylladb-users@googlegroups.com

On 9 May 2017 at 09:24, Glauber Costa <gla...@scylladb.com> wrote:
We should look into your prometheus/grafana graphs, and see if you
have requests blocked (there is a graph for that in the per-server
dash).

I had my Grafana set up incorrectly, so I wasn't getting all of the data. Now that I've fixed it I can see that during a write test I am getting requests blocked. I've attached a screenshot of that section of the dashboard. Is there a good way to get a screenshot of the entire dashboard at once?

Glauber Costa

<glauber@scylladb.com>
May 9, 2017, 4:07:20 PM
to ScyllaDB users
There is a Firefox plugin we sometimes use - I am unaware of native Grafana support for that.

Requests getting blocked is usually a sign that there is some bottleneck somewhere. They block until the bottleneck is gone, and that slows down the system.

So that is definitely part of the problem. I was betting that this would happen, but my bets were on dirty (that is the buffer in front of your HDD). Your blocked requests are in the commitlog, which is on SSD - so that is surprising.

Things to check now are:

1) Load in the system (there is a graph for that, usually at the top). It is also interesting to check the load across the CPUs in the node. For that, it is usually better to use prometheus directly (port 9090). If you tell us the name of one of your instances (hovering the mouse over the lines will tell you), I can get you a query for that.

2) whether or not the SSD is at its max throughput (there are prometheus plugins to export those metrics, or you can use any other linux tool) 


 


Jonathan M. Guberman

<jonathan.guberman@gmail.com>
May 9, 2017, 4:25:30 PM
to scylladb-users@googlegroups.com
On 9 May 2017 at 16:07, Glauber Costa <gla...@scylladb.com> wrote:

1) Load in the system (there is a graph for that, usually at the top). It is also interesting to check the load across the CPUs in the node. For that, it is usually better to use prometheus directly (port 9090). If you tell us the name of one of your instances (hovering the mouse over the lines will tell you), I can get you a query for that.


See attached screenshot. My instances are all named scylla01, scylla02, etc.
 
2) whether or not the SSD is at its max throughput (there are prometheus plugins to export those metrics, or you can use any other linux tool) 


According to iostat the SSDs are each averaging around 10MB/s for writes and around 5MB/s for reads, which should be nowhere near their maximums.


Glauber Costa

<glauber@scylladb.com>
May 9, 2017, 4:33:04 PM
to ScyllaDB users
On Tue, May 9, 2017 at 4:25 PM, Jonathan M. Guberman <jonathan...@gmail.com> wrote:


On 9 May 2017 at 16:07, Glauber Costa <gla...@scylladb.com> wrote:

1) Load in the system (there is a graph for that, usually at the top). It is also interesting to check the load across the CPUs in the node. For that, it is usually better to use prometheus directly (port 9090). If you tell us the name of one of your instances (hovering the mouse over the lines will tell you), I can get you a query for that.


See attached screenshot. My instances are all named scylla01, scylla02, etc.

The graph below shows a different name in the tooltip. (The prometheus name and hostname can be different)
It seems to be 'cephL01' - does that make sense?

Please go to port 9090, and try this query:

scylla_reactor_gauge_load{instance=~".*cephL01.*"} (or the actual name of the instance - ~ is just the operator for regex matching)

That will generate a graph, and it would be nice to look at it.


 
2) whether or not the SSD is at its max throughput (there are prometheus plugins to export those metrics, or you can use any other linux tool) 


According to iostat the SSDs are each averaging around 10MB/s for writes and around 5MB/s for reads, which should be nowhere near their maximums.

Neither is the CPU. If I was looking at just those two graphs, I would say that your system is nowhere near any point of saturation. If that happens at the same time as the blocked requests, then something is very wrong.

In the following page:


there are instructions (in the bottom) how to upload your prometheus data to our s3 bucket. With that, we can look at all the metrics at once (including the ones that are not in the standard dashes, for more non-obvious things)

If you can somehow add linux metrics to prometheus (with node_exporter, or something else), that helps as well.





Jonathan M. Guberman

<jonathan.guberman@gmail.com>
May 9, 2017, 4:40:58 PM
to scylladb-users@googlegroups.com
On 9 May 2017 at 16:33, Glauber Costa <gla...@scylladb.com> wrote:
Please go to port 9090, and try this query:

scylla_reactor_gauge_load{instance=~".*cephL01.*"} (or the actual name of the instance - ~ is just the operator for regex matching)

That will generate a graph, and it would be nice to look at it.



Graphs attached below.
 

In the following page:


there are instructions (in the bottom) how to upload your prometheus data to our s3 bucket. With that, we can look at all the metrics at once (including the ones that are not in the standard dashes, for more non-obvious things)

If you can somehow add linux metrics to prometheus (with node_exporter, or something else), that helps as well.

I have Node Exporter running in Prometheus, so that won't be a problem. I'll follow the instructions and send the data to you.

Thank you!

Glauber Costa

<glauber@scylladb.com>
May 9, 2017, 4:50:29 PM
to ScyllaDB users
On Tue, May 9, 2017 at 4:40 PM, Jonathan M. Guberman <jonathan...@gmail.com> wrote:


On 9 May 2017 at 16:33, Glauber Costa <gla...@scylladb.com> wrote:
Please go to port 9090, and try this query:

scylla_reactor_gauge_load{instance=~".*cephL01.*"} (or the actual name of the instance - ~ is just the operator for regex matching)

That will generate a graph, and it would be nice to look at it.



Graphs attached below.

Nothing suspicious in there.

With that we exhaust the obvious, usual suspects. 
 

In the following page:


there are instructions (in the bottom) how to upload your prometheus data to our s3 bucket. With that, we can look at all the metrics at once (including the ones that are not in the standard dashes, for more non-obvious things)

If you can somehow add linux metrics to prometheus (with node_exporter, or something else), that helps as well.

I have Node Exporter running in Prometheus, so that won't be a problem. I'll follow the instructions and send the data to you.

Thank you!

Cool. Do try to include particular times at which the behavior is bad and tell us the timestamps. It helps narrow things down.


 