Scylla partition key distribution across nodes


jagadish@helpshift.com

<jagadish@helpshift.com>
Mar 15, 2017, 8:49:51 AM
to ScyllaDB users

Context
We have a 3-node ScyllaDB cluster and 2 column families in a given keyspace, with ~100 million rows each.
The partition key is a random value such as "9c87a2bc-69f2-482d-812d-5af2be12c4f6".

Problem
We observed some skew in the data distribution, so I tried to find out, for a sample of partition keys, which nodes each key lands on.
I ran "select pid from table_x limit 5000", where pid is the partition key.
Then, for each of those partition keys, I used "nodetool getendpoints" to find the nodes on which the key resides.
Surprisingly, all 5000 partition keys returned the same 2 nodes, whereas we expected the keys to be distributed roughly uniformly across all 3 nodes.
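
In script form, the check looks roughly like this (a sketch; keyspace/table names are illustrative, and it assumes cqlsh and nodetool are on the PATH):

# Sketch: sample partition keys and tally which endpoints hold them.
# Keyspace/table names are illustrative; requires cqlsh and nodetool on PATH.
import subprocess
from collections import Counter

KEYSPACE = "my_keyspace"   # illustrative
TABLE = "table_x"

# Pull a sample of partition keys via cqlsh.
out = subprocess.check_output(
    ["cqlsh", "-e", "SELECT pid FROM %s.%s LIMIT 5000;" % (KEYSPACE, TABLE)]
).decode()
keys = [line.strip() for line in out.splitlines()]
keys = [k for k in keys if k and not k.startswith(("pid", "---", "("))]

# Ask nodetool which replicas hold each key and count the endpoints.
counts = Counter()
for k in keys:
    endpoints = subprocess.check_output(
        ["nodetool", "getendpoints", KEYSPACE, TABLE, k]
    ).decode().split()
    counts.update(endpoints)

for endpoint, n in counts.most_common():
    print(endpoint, n)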

Can anyone please help us in finding the reasoning for why this could have happened?




Eyal Gutkind

<eyal@scylladb.com>
Mar 15, 2017, 12:37:42 PM
to scylladb-users@googlegroups.com
Hi,
Can you tell us which version of Scylla you are using and the system configuration?
Did the keys map to only 2 of the 3 nodes? In other words, does the third node hold no data at all?
Can you provide the table and keyspace definitions? (describe table/keyspace)


Regards,





--
Eyal Gutkind
ScyllaDB

Shlomi Livne

<shlomi@scylladb.com>
Mar 15, 2017, 4:03:51 PM
to ScyllaDB users
The probable reason is that, given how much data you have, all of those partitions fall within a single token range.

Please run: select token(<partition-key>), <partition-key> from <table> limit 5;

For example, for the following table:

CREATE KEYSPACE keyspace1 WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'}  AND durable_writes = true;

CREATE TABLE keyspace1.standard1 (
    key blob PRIMARY KEY,
    "C0" blob,
    "C1" blob,
    "C2" blob,
    "C3" blob,
    "C4" blob
) WITH COMPACT STORAGE
    AND bloom_filter_fp_chance = 0.01
    AND caching = '{"keys":"ALL","rows_per_partition":"ALL"}'
    AND comment = ''
    AND compaction = {'class': 'SizeTieredCompactionStrategy'}
    AND compression = {}
    AND dclocal_read_repair_chance = 0.1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99.0PERCENTILE';


I have only entered 1000 rows

cqlsh -e "select token(key),key from keyspace1.standard1 limit 5;"

system.token(key)    | key
----------------------+------------------------
 -9206895928680792762 | 0x304e37394b3531333530
 -9160549343292174675 | 0x4e4c4c3938374b383130
 -9090087391989406011 | 0x36384b30374f33314c30
 -9087663299618271833 | 0x344d334b4e35304b3331
 -9086967334608494579 | 0x4b4f344d373130303030

nodetool getendpoints keyspace1 standard1 0x304e37394b3531333530

127.0.0.3

nodetool getendpoints keyspace1 standard1 0x4e4c4c3938374b383130

127.0.0.3

nodetool getendpoints keyspace1 standard1 0x36384b30374f33314c30

127.0.0.1

nodetool getendpoints keyspace1 standard1 0x344d334b4e35304b3331


127.0.0.2

nodetool getendpoints keyspace1 standard1 0x4b4f344d373130303030

127.0.0.3


Let's correlate this with the token ownership in the cluster.

nodetool ring shows the tokens selected by each node:


Datacenter: datacenter1
==========
Address    Rack        Status State   Load            Owns                Token                                       
                                                                          9194719961254193351                         
127.0.0.3  rack1       Up     Normal  133.85 KB       ?                   -9216714678074898560                        
127.0.0.3  rack1       Up     Normal  133.85 KB       ?                   -9196942068020296577                        
127.0.0.1  rack1       Up     Normal  149.99 KB       ?                   -9193556696765974833                        
127.0.0.1  rack1       Up     Normal  149.99 KB       ?                   -9163012479991023323                        
127.0.0.3  rack1       Up     Normal  133.85 KB       ?                   -9133267367462170802                        
127.0.0.3  rack1       Up     Normal  133.85 KB       ?                   -9122303054788654514                        
127.0.0.2  rack1       Up     Normal  132.46 KB       ?                   -9111062212776789516                        
127.0.0.2  rack1       Up     Normal  132.46 KB       ?                   -9089666287802159920                        
127.0.0.3  rack1       Up     Normal  133.85 KB       ?                   -9015579794224601574                        
127.0.0.3  rack1       Up     Normal  133.85 KB       ?                   -9011296699060860916                        
127.0.0.2  rack1       Up     Normal  132.46 KB       ?                   -8976643905893965440                        
127.0.0.2  rack1       Up     Normal  132.46 KB       ?                   -8969426451078557545                        
127.0.0.1  rack1       Up     Normal  149.99 KB       ?                   -8953553517161799888                        
127.0.0.3  rack1       Up     Normal  133.85 KB       ?                   -8947606167129378819     
.
.
.

This output lists the ring in token order, so consecutive lines define the ranges. The first few lines tell us that
127.0.0.3 owns the data in the token range -9216714678074898560 to -9193556696765974833
127.0.0.1 owns the data in the token range -9193556696765974833 to -9133267367462170802
.
.

Let's align the getendpoints output with the ring information.

For example, for the partition key 0x36384b30374f33314c30 the token is -9090087391989406011, and (with RF=1) it will be owned by 127.0.0.2:
 
127.0.0.2  rack1       Up     Normal  132.46 KB       ?                   -9111062212776789516                        
                                                                                                   -9090087391989406011
127.0.0.2  rack1       Up     Normal  132.46 KB       ?                   -9089666287802159920                        
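
The same lookup can be scripted: the primary replica for a key is the node whose ring token is the first one greater than or equal to the key's token (wrapping past the end of the ring). A minimal sketch, hard-coding a few (token, endpoint) pairs from the ring output above:

# Sketch of the ring lookup: the primary replica for a key is the node whose
# ring token is the first one >= the key's token (wrapping around at the end).
# The (token, endpoint) pairs are an excerpt of the `nodetool ring` output
# above; in practice they would be parsed from that output.
import bisect

ring = sorted([
    (-9216714678074898560, "127.0.0.3"),
    (-9196942068020296577, "127.0.0.3"),
    (-9193556696765974833, "127.0.0.1"),
    (-9163012479991023323, "127.0.0.1"),
    (-9133267367462170802, "127.0.0.3"),
    (-9122303054788654514, "127.0.0.3"),
    (-9111062212776789516, "127.0.0.2"),
    (-9089666287802159920, "127.0.0.2"),
])
tokens = [t for t, _ in ring]

def primary_replica(key_token):
    i = bisect.bisect_left(tokens, key_token)
    if i == len(tokens):   # past the last ring token: wrap to the first node
        i = 0
    return ring[i][1]

# Token of partition key 0x36384b30374f33314c30 from the cqlsh output above.
print(primary_replica(-9090087391989406011))   # -> 127.0.0.2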


You can do the same for your data. Because a SELECT without a WHERE clause returns partitions in token order, and you have a lot of data, I believe your 5000 sampled keys all come from the same token range at the start of the ring, which is why they share the same endpoints.
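
If you want a sample that is spread across the whole ring rather than taken from its start, one option (a rough sketch using the Python driver; keyspace/table/column names are illustrative) is to pick random tokens and fetch the first partition key at or after each one:

# Sketch: sample partition keys spread across the token ring instead of taking
# the first N rows (which come back in token order).
# Assumes the Python driver (pip install cassandra-driver) and illustrative
# keyspace/table/column names.
import random
from cassandra.cluster import Cluster

MIN_TOKEN, MAX_TOKEN = -2**63, 2**63 - 1   # Murmur3 partitioner token range

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("my_keyspace")

sample = []
for _ in range(100):
    t = random.randint(MIN_TOKEN, MAX_TOKEN)
    # Fetch the first partition at or after a random token; a token near the
    # end of the ring may return no row, which is fine for a rough sample.
    rows = session.execute(
        "SELECT pid FROM table_x WHERE token(pid) >= %s LIMIT 1", (t,))
    for row in rows:
        sample.append(row.pid)

print(sample[:10])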

I hope this helps.

Shlomi


Shlomi Livne

<shlomi@scylladb.com>
Mar 15, 2017, 4:11:54 PM
to ScyllaDB users
Regarding the skew: I assume you are referring to the amount of data each node owns.

Is the skew very large? Some difference between nodes is expected, but it usually should not be very large.

If you have changed the default number of vnodes from 256 to a much lower value, there is a higher probability of a larger skew.

Can you run "du -sh /var/lib/scylla/data" on each server?
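
If you also want to double-check the vnode count, a rough sketch that counts tokens per endpoint (it assumes the nodetool ring column layout shown earlier in the thread):

# Sketch: count how many tokens each endpoint owns, to verify the vnode count
# (the default num_tokens is 256 per node). Assumes the `nodetool ring` column
# layout shown earlier: endpoint first, token last on each data row.
import subprocess
from collections import Counter

out = subprocess.check_output(["nodetool", "ring"]).decode()

counts = Counter()
for line in out.splitlines():
    fields = line.split()
    # Data rows start with an IP address and end with a numeric token.
    if len(fields) >= 2 and fields[0].count(".") == 3 \
            and fields[-1].lstrip("-").isdigit():
        counts[fields[0]] += 1

for endpoint, n in sorted(counts.items()):
    print(endpoint, n)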
