difference in partition by toHour() vs toStartOfHour()

755 views
Skip to first unread message

Todd

unread,
Oct 3, 2019, 7:54:40 PM10/3/19
to ClickHouse
This might be a dumb question, but am wondering what the difference is when used in a PARTITION BY

I see a lot of examples using toHour() or toDay() or the such, but it occurred to me that what I want is toStartOfHour() or toStartOfDay().

For example, if I have 3 days worth of data and partition by toHour(), won't ClickHouse have 24 partitions for the 3 days.

And if instead I partition by  toStartOfHour(), then ClickHouse will have 72 partitions for the 3 days.

I think the second partition is what I want, toStartOfHour(), so when I do a query for a specific hour within a certain day, I only get a partition with data for that hour.

Thanks

Todd

Denis Zhuravlev

unread,
Oct 3, 2019, 8:27:39 PM10/3/19
to ClickHouse
>I see a lot of examples using toHour() or toDay()
? Where do you see such examples?
toHour() -- this an hour number -- so it you'll get data from different days in one partition that is kinda strange and bad.
toDay() --  no such funcition.


toStartOfHour()  -- is OK
toStartOfDay() -- is OK (toYYYYMMDD() -- also OK)

Be aware that large number of partitions the cause of bad performance of inserts and selects and able to cause mutation failures and increases a size of ZK db and number of ZK transactions.
CH is designed for monthly partitioning and recommended number of partitions up to 500 (per table).


Each CH part has a special files minmax_{parititon_column}.idx
So if you partition a table by toStartOfHour(event_date). This file (minmax_event_date.idx) will contain  minmax values of event_date over this part.
And partition pruning (elimination) algorithm analyses these minmax values of event_date using where event_date >=< ... but not toStartOfHour. toStartOfHour value is not stored but only used during insert.

todd

unread,
Oct 3, 2019, 10:04:54 PM10/3/19
to Denis Zhuravlev, ClickHouse
Sorry, I had been discussing with some others and had used toDay() to
keep it simpler.

Yes, I've been using toYYYYMMDD(), not toDay().

And had recently switched from toHour() instead of toYYYYMMDD()

Yeah, I'm aware of the ClickHouse preferences for number of
partitions, thanks to the recent excellent ClickHouse Query
Performance, which prompted me to re-assess my partitioning strategy.

My impression was ClickHouse was design for monthly partitioning, but
now is more general purpose.

The key is to pick a partition that matches the typical query such
that you get back hundreds of partitions.

For my current use case, I have data that may cover a weeks worth of
time, and queries that cover a time interval of 3-8 hours, so
partitioning by toYYYYMMDD() is not very helpful at all, so I switched
to toHour().

And then had a second use case of data that only covers 1 day, so
realized for that use case I really need toMinute().

But then realized I need toStartOfHour() and toStartOfMinute(), but
thought I'd ask here to make sure my understanding is correct.

So I think I just goofed when I went from toYYYYMMDD() to toHour()
instead of toStartOfHour().

Thanks for the quick reply and hashing it out with me, I think I'm own
track now!
> --
> You received this message because you are subscribed to the Google Groups "ClickHouse" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to clickhouse+...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/clickhouse/fa165950-8a4c-476b-96c2-6d1333bdc644%40googlegroups.com.
Reply all
Reply to author
Forward
0 new messages