Some questions regarding how data is distributed among servers during real-time insertion

44 views
Skip to first unread message

Yan Cheng Cheok

unread,
Mar 18, 2016, 6:15:06 AM3/18/16
to pg_shard users
Currently, we wish to

  1. Perform real-time data insertion
  2. Data should be distributed among servers, based on day. For instance, on 16 May 16, all data should goes to server A; on 17 May 16, all data should goes to server B, ...

I tend to test with the following table, with additional column yymmdd (year year month month day day)


CREATE TABLE cheok_ddmmyy
(
    id bigint
,
    data text
,
    yymmdd text
);
SELECT master_create_distributed_table
('cheok_ddmmyy', 'yymmdd');
SELECT master_create_worker_shards
('cheok_ddmmyy', 16, 1);
insert
into cheok_ddmmyy values (1, 'hello 1', '160318');
insert
into cheok_ddmmyy values (2, 'hello 2', '160318');
insert
into cheok_ddmmyy values (3, 'hello 3', '160318');
insert
into cheok_ddmmyy values (4, 'hello 4', '160318');
insert
into cheok_ddmmyy values (5, 'hello 5', '160319');
insert
into cheok_ddmmyy values (6, 'hello 6', '160319');


I can confirm data is distributed based on day. It seems that it meets my 2 requirements.


My question is, is it necessary we create an INDEX on yymmdd column, or it is not necessary?



Also, I would like to understand DISTRIBUTE BY APPEND


I perform

CREATE TABLE cheok_timestamp
(
    rockman_id bigint
,
    data text
,
    time timestamp
) DISTRIBUTE BY APPEND (time);



But when I do


SELECT master_create_worker_shards('cheok_timestamp', 16, 1);
ERROR
:  unsupported table partition type: a


Is it because table created using DISTRIBUTE BY APPEND is *not* suitable to insert data in real-time manner? As I saw the example given in https://www.citusdata.com/documentation/citusdb-documentation/user_guide/append_data_loading.html , it seems that it is meant for insertion by bulk, by loading CSV data from disk? Am I right?

Sumedh Pathak

unread,
Mar 21, 2016, 7:55:41 PM3/21/16
to pg_shard users
Hi Yan Cheng,

See my responses in-line:


On Friday, March 18, 2016 at 3:15:06 AM UTC-7, Yan Cheng Cheok wrote:
Currently, we wish to

  1. Perform real-time data insertion
  2. Data should be distributed among servers, based on day. For instance, on 16 May 16, all data should goes to server A; on 17 May 16, all data should goes to server B, ...

I tend to test with the following table, with additional column yymmdd (year year month month day day)


CREATE TABLE cheok_ddmmyy
(
    id bigint
,
    data text
,
    yymmdd text
);
SELECT master_create_distributed_table
('cheok_ddmmyy', 'yymmdd');
SELECT master_create_worker_shards
('cheok_ddmmyy', 16, 1);
insert
into cheok_ddmmyy values (1, 'hello 1', '160318');
insert
into cheok_ddmmyy values (2, 'hello 2', '160318');
insert
into cheok_ddmmyy values (3, 'hello 3', '160318');
insert
into cheok_ddmmyy values (4, 'hello 4', '160318');
insert
into cheok_ddmmyy values (5, 'hello 5', '160319');
insert
into cheok_ddmmyy values (6, 'hello 6', '160319');


I can confirm data is distributed based on day. It seems that it meets my 2 requirements.


My question is, is it necessary we create an INDEX on yymmdd column, or it is not necessary?


That depends on the size of the data, and the kind of queries which hit each shard. When we perform hash-partitioning, each shard will contain a range of the hash values, and so will contain rows with several date values. Each date value maps to a hash value, and each hash value is owned by a single shard. Please see our documentation on the different partitioning methods for more information.
If the query contains a where-clause on the 'yymmdd' column, Citus will perform partition pruning and only hit the shard which contains rows with that date. However since there will be rows with other dates in that shard, it may possibly be useful to add an index. You can use the PostgreSQL EXPLAIN command on the worker node to determine if an index will be useful. You can go through our performance-tuning documentation section for more information.



Also, I would like to understand DISTRIBUTE BY APPEND


I perform

CREATE TABLE cheok_timestamp
(
    rockman_id bigint
,
    data text
,
    time timestamp
) DISTRIBUTE BY APPEND (time);



But when I do


SELECT master_create_worker_shards('cheok_timestamp', 16, 1);
ERROR
:  unsupported table partition type: a


Is it because table created using DISTRIBUTE BY APPEND is *not* suitable to insert data in real-time manner? As I saw the example given in https://www.citusdata.com/documentation/citusdb-documentation/user_guide/append_data_loading.html , it seems that it is meant for insertion by bulk, by loading CSV data from disk? Am I right?


You are right that the APPEND distribution is primarily meant for bulk-loads. This is primarily useful for append-only immutable data (e.g. logs). See our documentation here:https://www.citusdata.com/documentation/citusdb-documentation/user_guide/distribution_method.html for more information on the different distribution methods and use-cases for each.
 
Reply all
Reply to author
Forward
0 new messages