Access specific partitions in SELECT or the way how CH process data from partitions

32 views
Skip to first unread message

Filip Podstavec

unread,
May 6, 2021, 6:29:20 AMMay 6
to ClickHouse
Hi there,
 I have a table with partitioning by month and data in table ordered by another column.

Let's say for example this:
CREATE TABLE table
(
    `scheme` LowCardinality(String),
    `domain` String,
    `path` String,
    `date` Date,
    `estimate` UInt8  
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(date)
ORDER BY (domain, path);

What I need is to keep partitions by month but access some specific dates in table. That means I want from my Select something like "Select this but get data only from these partitions".

I understand that Clickhouse grabs data only from specific partitions if partition key is filtered in SELECT but I'm not sure if in query like this:
SELECT *
FROM table
WHERE domain = 'google.com' AND date > '2021-01-03'

If Clickhouse only reads data from that partition OR if Clickhouse at first check if all data in partition belongs to specified date (because it's specific day and not only month).

That means there are two ways how Clickhouse could process it:
1) Filter domain from all coresponding partitions and then check if all date ranges correspond
2) Filter partitions and check if date ranges fits and after that filter domain

If it's 2) then it's very bad for my case because I cannot order my data by date (in most cases I need to select data for specific domain). If i't 1) then it's ok.

Can you please tell me which way Clickhouse use? Or if it's possible something like:
SELECT *
FROM table
WHERE domain = 'google.com' AND PARTITION IN ('2021-01', '2021-02')

(I know it should be better to make partitions by day in that case but there would be bad compression level and also I have many years of data)

Thank you!

Denis Zhuravlev

unread,
May 6, 2021, 9:01:25 AMMay 6
to ClickHouse
CH version? It's important for answering your question.

Partition pruner is quite sophisticated and can eliminate excessive partitions in various expressions on partitioned columns:

CREATE TABLE t

(
    `scheme` LowCardinality(String),
    `domain` String,
    `path` String,
    `date` Date,
    `estimate` UInt8
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(date)
ORDER BY (domain, path)

insert  into t(date) select toDate('2020-01-01') + number/10000 from numbers(1000000);

set send_logs_level='debug';

select count() from t where domain = '' and toYYYYMM(date) in (202001, 202002);
Processed 600.00 thousand rows
...
Selected 2/4 parts by partition key


select count() from t where domain = '' and toStartOfMonth(date) in ('2020-01-01', '2020-02-01');
Processed 600.00 thousand rows
...
Selected 2/4 parts by partition key


select count() from t where domain = '' and _partition_id in ('202001', '202002');
Processed 600.00 thousand rows
...
Selected 2/4 parts by partition key

Check HOW smart pruner is!!!!

It can prune through monotonic functions toYYYYMM(toDateTime(intDiv(d,1000),'UTC'))+1 or toDateTime(intDiv(d,1000),'UTC') , before a query execution:

CREATE TABLE sDD(d UInt64,a Int) ENGINE = MergeTree
PARTITION BY toYYYYMM(toDate(intDiv(d,1000)))
ORDER BY tuple() SETTINGS index_granularity = 8192; 

select uniqExact(_part), count() from sDD
where toYYYYMM(toDateTime(intDiv(d,1000),'UTC'))+1 > 202009 and toDateTime(intDiv(d,1000),'UTC') < toDateTime('2020-10-01 00:00:00','UTC');

Filip Podstavec

unread,
May 6, 2021, 9:19:16 AMMay 6
to ClickHouse
That's awesome! I'm using last stable version v21.4.6.55 so it probably will work.

And thanks for the tests I didn't even know I can select from partition using _partition_id. Thank you! :-)

Dne čtvrtek 6. května 2021 v 15:01:25 UTC+2 uživatel denis.z...@gmail.com napsal:

Denis Zhuravlev

unread,
May 6, 2021, 10:12:28 AMMay 6
to ClickHouse
>And thanks for the tests I didn't even know I can select from partition using _partition_id. Thank you! :-)
I suggest to never use _partition_id in real queries, only for debug purposes.
Use and toYYYYMM(date) in (202001, 202002);
Message has been deleted

Filip Podstavec

unread,
May 6, 2021, 1:49:29 PMMay 6
to ClickHouse
https://giphy.com/gifs/ufc-mma-ufc-205-3o6Zt6KHxJTbXCnSvu

Dne čtvrtek 6. května 2021 v 16:12:28 UTC+2 uživatel denis.z...@gmail.com napsal:

Amos Bird

unread,
May 7, 2021, 12:59:11 AMMay 7
to Denis Zhuravlev, click...@googlegroups.com

I've added another possibility recently: _partition_value. This
can be useful in real queries as partition keys are deterministic
and immutable.

https://github.com/ClickHouse/ClickHouse/pull/23673

Denis Zhuravlev <denis.z...@gmail.com> writes:

>>And thanks for the tests I didn't even know I can select from
>>partition
> using *_partition_id*. Thank you! :-)
> I suggest to never use *_partition_id *in real queries, only for
> debug
> purposes.
> Use *and toYYYYMM(date) in (202001, 202002);*
>
>
> On Thursday, May 6, 2021 at 10:19:16 a.m. UTC-3
> filip.p...@gmail.com wrote:
>
>> That's awesome! I'm using last stable version v21.4.6.55 so it
>> probably
>> will work.
>>
>> And thanks for the tests I didn't even know I can select from
>> partition
>> using *_partition_id*. Thank you! :-)
>>
>> Dne čtvrtek 6. května 2021 v 15:01:25 UTC+2 uživatel
>> denis.z...@gmail.com
>> napsal:
>>
>>> CH version? It's important for answering your question.
>>>
>>> Partition pruner is quite sophisticated and can eliminate
>>> excessive
>>> partitions in various expressions on partitioned columns:
>>>
>>> CREATE TABLE t
>>>
>>> (
>>> `scheme` LowCardinality(String),
>>> `domain` String,
>>> `path` String,
>>> `date` Date,
>>> `estimate` UInt8
>>> )
>>> ENGINE = MergeTree
>>> PARTITION BY toYYYYMM(date)
>>> ORDER BY (domain, path)
>>>
>>> insert into t(date) select toDate('2020-01-01') +
>>> number/10000 from
>>> numbers(1000000);
>>>
>>> set send_logs_level='debug';
>>>
>>> select count() from t where domain = '' *and toYYYYMM(date) in
>>> (202001,
>>> 202002);*
>>>
>>> *Processed 600.00 thousand rows*
>>> ...
>>> *Selected 2/4 parts by partition key*
>>>
>>>
>>> select count() from t where domain = '' *and
>>> toStartOfMonth(date) in
>>> ('2020-01-01', '2020-02-01');*
>>> *Processed 600.00 thousand rows*
>>> ...
>>> *Selected 2/4 parts by partition key*
>>>
>>>
>>> select count() from t where domain = ''* and _partition_id in
>>> ('202001',
>>> '202002');*
>>> *Processed 600.00 thousand rows*
>>> *...*
>>> *Selected 2/4 parts by partition key*
>>>
>>> Check HOW smart pruner is!!!!
>>>
>>> https://github.com/ClickHouse/ClickHouse/blob/0dd244126d1213fd5cd913319f4a779b2b34d4e0/tests/queries/0_stateless/01508_partition_pruning.queries
>>>
>>> It can prune through monotonic functions
>>> *toYYYYMM(toDateTime(intDiv(d,1000),'UTC'))+1* or *
>>> toDateTime(intDiv(d,1000),'UTC') *, before a query execution:
>>>
>>> CREATE TABLE sDD(d UInt64,a Int) ENGINE = MergeTree
>>> PARTITION BY *toYYYYMM(toDate(intDiv(d,1000)))*
>>> ORDER BY tuple() SETTINGS index_granularity = 8192;
>>>
>>> select uniqExact(_part), count() from sDD
>>> where *toYYYYMM(toDateTime(intDiv(d,1000),'UTC'))+1 > 202009
>>> and
>>> toDateTime(intDiv(d,1000),'UTC') < toDateTime('2020-10-01
>>> 00:00:00','UTC');*
Reply all
Reply to author
Forward
0 new messages