Scylla partition key distribution across nodes


jagadish@helpshift.com

<jagadish@helpshift.com>
Mar 15, 2017, 8:49:51 AM
to ScyllaDB users

Context
We have a 3-node ScyllaDB cluster and 2 column families in a given keyspace, with ~100 million rows each.
The partition key is a random value such as "9c87a2bc-69f2-482d-812d-5af2be12c4f6".

Problem
We observed some skew in the data distribution, so I tried to find out, for a sample of partition keys, which nodes each key lands on.
I ran "select pid from table_x limit 5000", where pid is the partition key.
Then, for each of those partition keys, I used "nodetool getendpoints" to find the nodes on which the key resides.
Surprisingly, all 5000 partition keys returned the same 2 nodes, whereas we expected the keys to be distributed roughly uniformly across all 3 nodes.
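
In script form, the check looks roughly like this (a sketch; keyspace/table names are illustrative, and it assumes cqlsh and nodetool are on the PATH):

# Sketch: sample partition keys and tally which endpoints hold them.
# Keyspace/table names are illustrative; requires cqlsh and nodetool on PATH.
import subprocess
from collections import Counter

KEYSPACE = "my_keyspace"   # illustrative
TABLE = "table_x"

# Pull a sample of partition keys via cqlsh.
out = subprocess.check_output(
    ["cqlsh", "-e", "SELECT pid FROM %s.%s LIMIT 5000;" % (KEYSPACE, TABLE)]
).decode()
keys = [line.strip() for line in out.splitlines()]
keys = [k for k in keys if k and not k.startswith(("pid", "---", "("))]

# Ask nodetool which replicas hold each key and count the endpoints.
counts = Counter()
for k in keys:
    endpoints = subprocess.check_output(
        ["nodetool", "getendpoints", KEYSPACE, TABLE, k]
    ).decode().split()
    counts.update(endpoints)

for endpoint, n in counts.most_common():
    print(endpoint, n)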

Can anyone please help us in finding the reasoning for why this could have happened?




Eyal Gutkind

<eyal@scylladb.com>
Mar 15, 2017, 12:37:42 PM
to scylladb-users@googlegroups.com
Hi,
Can you tell us which version of Scylla you are using and the system configuration?
Did the keys map to only 2 of the 3 nodes? In other words, does the third node hold no data at all?
Can you provide the table and keyspace definitions? (describe table/keyspace)


Regards,





--
Eyal Gutkind
ScyllaDB

Shlomi Livne

<shlomi@scylladb.com>
Mar 15, 2017, 4:03:51 PM
to ScyllaDB users
The probable reason is that, given how much data you have, all of those partitions fall within a single token range.

Please run: select token(<partition-key>), <partition-key> from <table> limit 5;

For example, for the following table:

CREATE KEYSPACE keyspace1 WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'}  AND durable_writes = true;

CREATE TABLE keyspace1.standard1 (
    key blob PRIMARY KEY,
    "C0" blob,
    "C1" blob,
    "C2" blob,
    "C3" blob,
    "C4" blob
) WITH COMPACT STORAGE
    AND bloom_filter_fp_chance = 0.01
    AND caching = '{"keys":"ALL","rows_per_partition":"ALL"}'
    AND comment = ''
    AND compaction = {'class': 'SizeTieredCompactionStrategy'}
    AND compression = {}
    AND dclocal_read_repair_chance = 0.1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99.0PERCENTILE';


I have only entered 1000 rows

cqlsh -e "select token(key),key from keyspace1.standard1 limit 5;"

system.token(key)    | key
----------------------+------------------------
 -9206895928680792762 | 0x304e37394b3531333530
 -9160549343292174675 | 0x4e4c4c3938374b383130
 -9090087391989406011 | 0x36384b30374f33314c30
 -9087663299618271833 | 0x344d334b4e35304b3331
 -9086967334608494579 | 0x4b4f344d373130303030

nodetool getendpoints keyspace1 standard1 0x304e37394b3531333530

127.0.0.3

nodetool getendpoints keyspace1 standard1 0x4e4c4c3938374b383130

127.0.0.3

nodetool getendpoints keyspace1 standard1 0x36384b30374f33314c30

127.0.0.1

nodetool getendpoints keyspace1 standard1 0x344d334b4e35304b3331


127.0.0.2

nodetool getendpoints keyspace1 standard1 0x4b4f344d373130303030

127.0.0.3


Let's correlate this with the token ownership in the cluster.

nodetool ring shows the tokens selected by each node:


Datacenter: datacenter1
==========
Address    Rack        Status State   Load            Owns                Token                                       
                                                                          9194719961254193351                         
127.0.0.3  rack1       Up     Normal  133.85 KB       ?                   -9216714678074898560                        
127.0.0.3  rack1       Up     Normal  133.85 KB       ?                   -9196942068020296577                        
127.0.0.1  rack1       Up     Normal  149.99 KB       ?                   -9193556696765974833                        
127.0.0.1  rack1       Up     Normal  149.99 KB       ?                   -9163012479991023323                        
127.0.0.3  rack1       Up     Normal  133.85 KB       ?                   -9133267367462170802                        
127.0.0.3  rack1       Up     Normal  133.85 KB       ?                   -9122303054788654514                        
127.0.0.2  rack1       Up     Normal  132.46 KB       ?                   -9111062212776789516                        
127.0.0.2  rack1       Up     Normal  132.46 KB       ?                   -9089666287802159920                        
127.0.0.3  rack1       Up     Normal  133.85 KB       ?                   -9015579794224601574                        
127.0.0.3  rack1       Up     Normal  133.85 KB       ?                   -9011296699060860916                        
127.0.0.2  rack1       Up     Normal  132.46 KB       ?                   -8976643905893965440                        
127.0.0.2  rack1       Up     Normal  132.46 KB       ?                   -8969426451078557545                        
127.0.0.1  rack1       Up     Normal  149.99 KB       ?                   -8953553517161799888                        
127.0.0.3  rack1       Up     Normal  133.85 KB       ?                   -8947606167129378819     
.
.
.

This output lists the ring in token order, so consecutive lines define the ranges. The first few lines tell us that
127.0.0.3 owns the data in the token range -9216714678074898560 to -9193556696765974833
127.0.0.1 owns the data in the token range -9193556696765974833 to -9133267367462170802
.
.

Let's align the getendpoints output with the ring information.

For example, for the partition key 0x36384b30374f33314c30 the token is -9090087391989406011, and (with RF=1) it will be owned by 127.0.0.2:
 
127.0.0.2  rack1       Up     Normal  132.46 KB       ?                   -9111062212776789516                        
                                                                                                   -9090087391989406011
127.0.0.2  rack1       Up     Normal  132.46 KB       ?                   -9089666287802159920                        
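
The same lookup can be scripted: the primary replica for a key is the node whose ring token is the first one greater than or equal to the key's token (wrapping past the end of the ring). A minimal sketch, hard-coding a few (token, endpoint) pairs from the ring output above:

# Sketch of the ring lookup: the primary replica for a key is the node whose
# ring token is the first one >= the key's token (wrapping around at the end).
# The (token, endpoint) pairs are an excerpt of the `nodetool ring` output
# above; in practice they would be parsed from that output.
import bisect

ring = sorted([
    (-9216714678074898560, "127.0.0.3"),
    (-9196942068020296577, "127.0.0.3"),
    (-9193556696765974833, "127.0.0.1"),
    (-9163012479991023323, "127.0.0.1"),
    (-9133267367462170802, "127.0.0.3"),
    (-9122303054788654514, "127.0.0.3"),
    (-9111062212776789516, "127.0.0.2"),
    (-9089666287802159920, "127.0.0.2"),
])
tokens = [t for t, _ in ring]

def primary_replica(key_token):
    i = bisect.bisect_left(tokens, key_token)
    if i == len(tokens):   # past the last ring token: wrap to the first node
        i = 0
    return ring[i][1]

# Token of partition key 0x36384b30374f33314c30 from the cqlsh output above.
print(primary_replica(-9090087391989406011))   # -> 127.0.0.2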


You can do the same for your data. Because a SELECT without a WHERE clause returns partitions in token order, and you have a lot of data, I believe your 5000 sampled keys all come from the same token range at the start of the ring, which is why they share the same endpoints.
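
If you want a sample that is spread across the whole ring rather than taken from its start, one option (a rough sketch using the Python driver; keyspace/table/column names are illustrative) is to pick random tokens and fetch the first partition key at or after each one:

# Sketch: sample partition keys spread across the token ring instead of taking
# the first N rows (which come back in token order).
# Assumes the Python driver (pip install cassandra-driver) and illustrative
# keyspace/table/column names.
import random
from cassandra.cluster import Cluster

MIN_TOKEN, MAX_TOKEN = -2**63, 2**63 - 1   # Murmur3 partitioner token range

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("my_keyspace")

sample = []
for _ in range(100):
    t = random.randint(MIN_TOKEN, MAX_TOKEN)
    # Fetch the first partition at or after a random token; a token near the
    # end of the ring may return no row, which is fine for a rough sample.
    rows = session.execute(
        "SELECT pid FROM table_x WHERE token(pid) >= %s LIMIT 1", (t,))
    for row in rows:
        sample.append(row.pid)

print(sample[:10])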

I hope this helps.

Shlomi


Shlomi Livne

<shlomi@scylladb.com>
Mar 15, 2017, 4:11:54 PM
to ScyllaDB users
Regarding the skew: I assume you are referring to the amount of data each node owns.

Is the skew very large? Some difference between nodes is expected, but it usually should not be very large.

If you have changed the default number of vnodes from 256 to a much lower value, there is a higher probability of a larger skew.

Can you run "du -sh /var/lib/scylla/data" on each server?
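
If you also want to double-check the vnode count, a rough sketch that counts tokens per endpoint (it assumes the nodetool ring column layout shown earlier in the thread):

# Sketch: count how many tokens each endpoint owns, to verify the vnode count
# (the default num_tokens is 256 per node). Assumes the `nodetool ring` column
# layout shown earlier: endpoint first, token last on each data row.
import subprocess
from collections import Counter

out = subprocess.check_output(["nodetool", "ring"]).decode()

counts = Counter()
for line in out.splitlines():
    fields = line.split()
    # Data rows start with an IP address and end with a numeric token.
    if len(fields) >= 2 and fields[0].count(".") == 3 \
            and fields[-1].lstrip("-").isdigit():
        counts[fields[0]] += 1

for endpoint, n in sorted(counts.items()):
    print(endpoint, n)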
