Is table partitioning supported in CitusDB?


Yan Cheng Cheok

unread,
Feb 26, 2016, 1:44:34 AM2/26/16
to pg_shard users
Hi,

I'm referring to the table partitioning technique described here: http://www.postgresql.org/docs/9.1/static/ddl-partitioning.html

I set up table partitioning with the following commands:

DROP table measurement cascade;

CREATE TABLE measurement
(
    city_id         int not null,
    logdate         date not null,
    peaktemp        int,
    unitsales       int
);

CREATE TABLE measurement_y2006m02
(
    CHECK ( logdate >= DATE '2006-02-01' AND logdate < DATE '2006-03-01' )
) INHERITS (measurement);
CREATE TABLE measurement_y2006m03
(
    CHECK ( logdate >= DATE '2006-03-01' AND logdate < DATE '2006-04-01' )
) INHERITS (measurement);

CREATE INDEX measurement_y2006m02_logdate ON measurement_y2006m02 (logdate);
CREATE INDEX measurement_y2006m03_logdate ON measurement_y2006m03 (logdate);

CREATE OR REPLACE FUNCTION measurement_insert_trigger()
RETURNS TRIGGER AS $$
BEGIN
    IF ( NEW.logdate >= DATE '2006-02-01' AND
         NEW.logdate < DATE '2006-03-01' ) THEN
        INSERT INTO measurement_y2006m02 VALUES (NEW.*);
    ELSIF ( NEW.logdate >= DATE '2006-03-01' AND
            NEW.logdate < DATE '2006-04-01' ) THEN
        INSERT INTO measurement_y2006m03 VALUES (NEW.*);
    ELSE
        RAISE EXCEPTION 'Date out of range.  Fix the measurement_insert_trigger() function!';
    END IF;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER insert_measurement_trigger
    BEFORE INSERT ON measurement
    FOR EACH ROW EXECUTE PROCEDURE measurement_insert_trigger();


Then, I create the distributed table using:

SELECT master_create_distributed_table('measurement', 'city_id');
SELECT master_create_worker_shards('measurement', 16, 2);

All seems OK.

However, inserts don't work as expected:

postgres=# insert into measurement values(0, '2006-03-02', 22, 33);
INSERT 0 1
Time: 3.851 ms
postgres=# select * from measurement;
 city_id | logdate | peaktemp | unitsales
---------+---------+----------+-----------
(0 rows)

Time: 2.965 ms
postgres=# select * from measurement_y2006m02;
 city_id | logdate | peaktemp | unitsales
---------+---------+----------+-----------
(0 rows)

Time: 1.690 ms
postgres=# select * from measurement_y2006m03;
 city_id | logdate | peaktemp | unitsales
---------+---------+----------+-----------
(0 rows)

Insertion works fine if I don't run master_create_distributed_table and master_create_worker_shards.

I experimented further with:

SELECT master_create_distributed_table('measurement', 'city_id');
SELECT master_create_distributed_table('measurement_y2006m02', 'city_id');
SELECT master_create_distributed_table('measurement_y2006m03', 'city_id');

SELECT master_create_worker_shards('measurement', 16, 2);
SELECT master_create_worker_shards('measurement_y2006m02', 16, 2);
SELECT master_create_worker_shards('measurement_y2006m03', 16, 2);

and I got the following errors:

postgres=# SELECT master_create_distributed_table('measurement', 'city_id');
 master_create_distributed_table
---------------------------------
 
(1 row)

Time: 1.098 ms
postgres=# SELECT master_create_distributed_table('measurement_y2006m02', 'city_id');
 master_create_distributed_table
---------------------------------
 
(1 row)

Time: 0.756 ms
postgres=# SELECT master_create_distributed_table('measurement_y2006m03', 'city_id');
 master_create_distributed_table
---------------------------------
 
(1 row)

Time: 1.253 ms
postgres=# SELECT master_create_worker_shards('measurement', 16, 2);
 master_create_worker_shards
-----------------------------
 
(1 row)

Time: 62.597 ms
postgres=# SELECT master_create_worker_shards('measurement_y2006m02', 16, 2);
WARNING:  Bad result from 172.30.1.249:5432
DETAIL:  Remote message: column "logdate" does not exist
WARNING:  could not create shard on "172.30.1.249:5432"
WARNING:  Bad result from 172.30.1.250:5432
DETAIL:  Remote message: column "logdate" does not exist
WARNING:  could not create shard on "172.30.1.250:5432"
WARNING:  Bad result from 172.30.1.252:5432
DETAIL:  Remote message: column "logdate" does not exist
WARNING:  could not create shard on "172.30.1.252:5432"
ERROR:  could not satisfy specified replication factor
DETAIL:  Created 0 shard replicas, less than the requested replication factor of 2.
Time: 3.662 ms
postgres=# SELECT master_create_worker_shards('measurement_y2006m03', 16, 2);
WARNING:  Bad result from 172.30.1.249:5432
DETAIL:  Remote message: column "logdate" does not exist
WARNING:  could not create shard on "172.30.1.249:5432"
WARNING:  Bad result from 172.30.1.250:5432
DETAIL:  Remote message: column "logdate" does not exist
WARNING:  could not create shard on "172.30.1.250:5432"
WARNING:  Bad result from 172.30.1.252:5432
DETAIL:  Remote message: column "logdate" does not exist
WARNING:  could not create shard on "172.30.1.252:5432"
ERROR:  could not satisfy specified replication factor
DETAIL:  Created 0 shard replicas, less than the requested replication factor of 2.
Time: 4.733 ms


Does this mean that table partitioning is not supported in CitusDB?

Thanks.
Cheok

Onder Kalaci

unread,
Feb 26, 2016, 7:29:59 PM2/26/16
to pg_shard users
Hi Yan,

I haven't tried partitioning with CitusDB (or pg_shard), but both partitioning and sharding (which is what CitusDB does) seem to achieve similar goals.

Could you please explain the aim of using them together? I'd like to understand more so that I can find a way to help.

Thanks,
Onder
...

Yan Cheng Cheok

unread,
Feb 27, 2016, 5:20:33 AM2/27/16
to pg_shard users
Hi Onder,

The idea behind partitioning is to have some sort of "backup plan" for when queries become slow.

Imagine that one day the query

SELECT * from measurement ...

becomes very slow even though we have tried various approaches, such as proper indexing.

Having table partitions gives us another option (as a backup plan) for tackling the slow-query problem.

We can instead query

SELECT * from measurement_y2006m02 ...

so we have a better chance of resolving the slow-query problem, just in case it hits us.

Onder Kalaci

unread,
Mar 1, 2016, 2:05:59 AM3/1/16
to pg_shard users
Hi Yan,

In CitusDB (or pg_shard), each shard is represented on the worker node as a regular PostgreSQL table named ‘tablename_shardid’, where tablename is the name of the distributed table and shardid is the unique id assigned to that shard. So, you can connect to the worker postgres instances to view or run commands on individual shards.
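For example (a hypothetical sketch — the worker hostname and the shard id 102008 are made up for illustration; you'd look up the real ids in the metadata on your master node), you could connect to a worker and query one shard directly:

$ psql -h 172.30.1.249 -p 5432 postgres
postgres=# \dt measurement_*
postgres=# SELECT count(*) FROM measurement_102008;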

Is that what you're looking for?

Thanks,
Onder

ad...@mux.co

unread,
Mar 11, 2016, 12:51:59 PM3/11/16
to pg_shard users
Hi Onder,
I'm having the same issue with a similar use case. I am attempting to set up row + columnar storage on the worker shards, as described here: https://www.citusdata.com/blog/14-marco/273-combined-row-based-and-columnar-storage-postgresql-95. Inserts through the master node work as expected, as do queries against the "parent" shard tables on the workers; however, queries through the master node always return 0 rows. Is it possible that inheritance is somehow disabled for queries through the pg_shard master node?

Thanks,
Adam

Yan Cheng Cheok

unread,
Mar 18, 2016, 6:18:33 AM3/18/16
to pg_shard users
Hi Onder,

tablename_shardid is close to, but not exactly, what I want.

What I want is tablename_yymmdd (two-digit year, month, and day).

But I think I have found a proper way to do it:

CREATE TABLE cheok_ddmmyy
(
    id bigint,
    data text,
    yymmdd text
);
SELECT master_create_distributed_table('cheok_ddmmyy', 'yymmdd');
SELECT master_create_worker_shards('cheok_ddmmyy', 16, 1);
insert into cheok_ddmmyy values (1, 'hello 1', '160318');
insert into cheok_ddmmyy values (2, 'hello 2', '160318');

So, when a user queries by a particular date, Citus knows exactly which shard to look at, and only needs to contact one server.
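For instance (a sketch using only the table defined above), an equality filter on the distribution column yymmdd should let the planner prune the query down to a single shard, which you can verify with EXPLAIN:

SELECT * FROM cheok_ddmmyy WHERE yymmdd = '160318';
EXPLAIN SELECT * FROM cheok_ddmmyy WHERE yymmdd = '160318';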

Thanks.