SELECT COUNT queries on Scylla take very long


Sachin janani

<sachin.janani203@gmail.com>
Mar 23, 2018, 12:21:58 AM3/23/18
to ScyllaDB users
We are running some benchmarks on ScyllaDB by executing various queries, and we found that SELECT COUNT(*) queries take a very long time to complete.

Following are the details of the Scylla cluster:
Number of nodes: 3
RAM: 64 GB on each node
Number of CPU cores on each node: 8
Number of rows in the table: 311 million
Number of columns: 23
Size of the table as shown by nodetool stats: approx. 300 GB across the 3 nodes
Time taken to execute SELECT COUNT(*) from cqlsh: 1.1 hours
Time taken to execute SELECT COUNT(*) with Apache Spark using the spark-cassandra connector: 28 minutes (i.e. around 185K rows per second)
CPU consumption on the Scylla nodes was almost 100%.
All memory was consumed by Scylla during ingestion of the rows.

Note: we set up the XFS partition for Scylla manually, i.e. without using the Scylla setup scripts.

Is there any performance tuning that we are missing?
Compared to Cassandra, what performance difference should we expect for table scans and point queries? Also, can anyone point me to **READ** benchmarks for large Scylla tables?

Tomer Sandler

<tomer@scylladb.com>
Mar 23, 2018, 3:29:18 AM3/23/18
to scylladb-users@googlegroups.com
You are probably not taking advantage of Scylla's parallelism.
I recommend reading these articles about token-range queries [1] + [2] and checking the Go code example that demonstrates them [3]; you are welcome to use it.


--
Tomer Sandler
ScyllaDB

(Sent from my Android)

--
You received this message because you are subscribed to the Google Groups "ScyllaDB users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to scylladb-user...@googlegroups.com.
To post to this group, send email to scyllad...@googlegroups.com.
Visit this group at https://groups.google.com/group/scylladb-users.
To view this discussion on the web visit https://groups.google.com/d/msgid/scylladb-users/e4dfae1b-8c15-44ff-a7cd-6720b0f51917%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Hemant Bhanawat

<hemant9379@gmail.com>
Mar 23, 2018, 8:22:01 AM3/23/18
to ScyllaDB users
(Sachin is not able to post for some reason. Posting on his behalf.)

We have already gone through these articles. Is 185K rows per second a decent speed on such a setup? Do you have any benchmark numbers for scans that we can refer to, especially against Cassandra?

Dor Laor

<dor@scylladb.com>
Mar 23, 2018, 7:16:31 PM3/23/18
to ScyllaDB users
On Fri, Mar 23, 2018 at 5:22 AM, Hemant Bhanawat <heman...@gmail.com> wrote:
(Sachin is not able to post for some reason. Posting on his behalf.)

We have already gone through these articles. Is 185K rows per second a decent speed on such a setup? Do you have any benchmark numbers for scans that we can refer to, especially against Cassandra?

Not that we know of. If you hook up the Scylla monitoring stack you'll be able to judge whether
all the cores are at maximum utilization (the desired situation). A cqlsh query isn't parallel; Spark's
parallelism is better, but make sure you configure it for the core count Scylla has.
We have a good improvement coming soon in 2.2 with issue #1865.
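For example, a sketch of matching Spark's parallelism to the cluster (the host name, job file, and values below are illustrative, not from this thread; verify option names against your spark-cassandra-connector version): with 3 nodes x 8 cores = 24 cores, you want at least 24 concurrent tasks, and the input split size should be small enough that a 300 GB table yields far more than 24 splits.

```shell
# Illustrative spark-submit invocation; names and values are placeholders.
# Goal: enough parallel read tasks to keep all 3 x 8 = 24 Scylla cores busy.
spark-submit \
  --conf spark.cassandra.connection.host=scylla-node1 \
  --conf spark.cassandra.input.split.size_in_mb=64 \
  --total-executor-cores 24 \
  count_job.py
```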
 



Avi Kivity

<avi@scylladb.com>
Mar 24, 2018, 2:54:46 PM3/24/18
to scylladb-users@googlegroups.com, Sachin janani



On 03/23/2018 07:21 AM, Sachin janani wrote:
We are running some benchmarks on ScyllaDB by executing various queries, and we found that SELECT COUNT(*) queries take a very long time to complete.

Following are the details of the Scylla cluster:
Number of nodes: 3
RAM: 64 GB on each node
Number of CPU cores on each node: 8
Number of rows in the table: 311 million
Number of columns: 23
Size of the table as shown by nodetool stats: approx. 300 GB across the 3 nodes
Time taken to execute SELECT COUNT(*) from cqlsh: 1.1 hours

This is expected to be slow due to lack of concurrency. See also https://github.com/scylladb/scylla/issues/1385.


Time taken to execute SELECT COUNT(*) with Apache Spark using the spark-cassandra connector: 28 minutes (i.e. around 185K rows per second)
CPU consumption on the Scylla nodes was almost 100%.
All memory was consumed by Scylla during ingestion of the rows.


Can you provide the schema you were using?

In particular, I'm interested in whether you have large partitions or not, and the average size of each row. From the statistics you provided, it looks like around 320 bytes per row if you're using RF=3, but please confirm.
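For reference, the back-of-the-envelope arithmetic behind that estimate (assuming the 300 GB figure is the total on-disk size across all replicas) can be worked out as:

```go
package main

import "fmt"

// bytesPerRow estimates the average row size from the total on-disk size
// across all replicas, the row count, and the replication factor.
func bytesPerRow(totalBytes, rows, rf float64) float64 {
	return totalBytes / rf / rows
}

func main() {
	const total = 300e9 // ~300 GB reported across the 3 nodes
	const rows = 311e6  // 311 million rows
	// With RF=3 each node holds a full copy, so unique data is ~100 GB.
	fmt.Printf("RF=3: ~%.0f bytes/row\n", bytesPerRow(total, rows, 3))
	// With RF=1 all 300 GB is unique data.
	fmt.Printf("RF=1: ~%.0f bytes/row\n", bytesPerRow(total, rows, 1))
}
```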



Note: we set up the XFS partition for Scylla manually, i.e. without using the Scylla setup scripts.

Is there any performance tuning that we are missing?

https://github.com/scylladb/scylla/issues/1865 should provide around 3X improvement when completed.

Meanwhile, please capture a performance profile so we can see if something is out of line. You can do that by installing the scylla-debuginfo package (scylla-dbg on Ubuntu) and running the following commands while the Spark query is running:

   perf record -a --call-graph dwarf sleep 30
   perf report --no-children > report.txt

This will let us see if something is wrong. You should also set up scylla-monitoring and ensure CPU load is 100% (this is different from the OS CPU load as reported by top).



Compared to Cassandra, what performance difference should we expect for table scans and point queries? Also, can anyone point me to **READ** benchmarks for large Scylla tables?



In https://www.scylladb.com/2017/03/28/parallel-efficient-full-table-scan-scylla/, Tomer achieved 500K requests per second using similar nodes (4 cores / 8 vcpus). Of course, the schema may be different, or something else may account for the gap.

Sachin janani

<sachin.janani203@gmail.com>
Mar 25, 2018, 5:38:24 AM3/25/18
to scylladb-users@googlegroups.com
We have already gone through these articles. Is 185K rows per second a
decent speed on such a setup? Do you have any benchmark numbers for
scans that we can refer to, especially against Cassandra?




--
Sachin Janani

Avi Kivity

<avi@scylladb.com>
Mar 25, 2018, 5:39:24 AM3/25/18
to scylladb-users@googlegroups.com, Hemant Bhanawat


On 03/23/2018 03:22 PM, Hemant Bhanawat wrote:
> (Sachin is not able to post for some reason. Posting on his behalf.)

Google's spam detector marked his posts as spam. If this happens again,
please contact me privately and I'll let them through.