ReplacingMergeTree for UPDATE-like use: FINAL too slow

1,524 views

subern

Aug 4, 2016, 1:08:11 AM8/4/16
to ClickHouse
I use like https://github.com/yandex/ClickHouse/blob/master/dbms/tests/queries/0_stateless/00325_replacing_merge_tree.sql 

but when the data becomes large (I inserted about 100 million rows), the FINAL select becomes slow: 10-100 times slower than a plain select. If I run OPTIMIZE it takes 10 minutes, and the plain select result does not match the FINAL select result. I have already increased max_bytes_to_merge_parts.

How can I make a FINAL select as fast as a plain select, with correct results?
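For reference, this is roughly what I'm doing (a minimal sketch in current ClickHouse syntax; the table and column names here are illustrative, not copied from the linked test file):

```sql
-- ReplacingMergeTree keeps, per sorting key, the row with the highest `version`,
-- but deduplication only happens when parts are merged.
CREATE TABLE replacing_test
(
    key UInt64,
    value String,
    version UInt32
)
ENGINE = ReplacingMergeTree(version)
ORDER BY key;

INSERT INTO replacing_test VALUES (1, 'old', 1);
INSERT INTO replacing_test VALUES (1, 'new', 2);

-- A plain select may return both rows until background merges catch up:
SELECT * FROM replacing_test;

-- FINAL merges at query time and returns only the latest row, but is slower:
SELECT * FROM replacing_test FINAL;
```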

S M

Aug 4, 2016, 10:38:27 AM8/4/16
to ClickHouse
Have you waited until the merging process is done? FINAL queries will be slow right after you insert a huge batch of data, but as the data gets sorted, merged, and compacted it should get faster. What do the logs say?
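To check, you can query the system tables (the table name `replacing_test` is a placeholder here):

```sql
-- Are background merges still in progress for this table?
SELECT * FROM system.merges WHERE table = 'replacing_test';

-- How many active parts does the table currently have?
-- Many small parts mean FINAL has a lot of merging to do at query time.
SELECT count() FROM system.parts WHERE table = 'replacing_test' AND active;

-- Force merging down as far as possible (can take a long time on 100M rows):
OPTIMIZE TABLE replacing_test FINAL;
```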

man...@gmail.com

Aug 4, 2016, 8:08:19 PM8/4/16
to ClickHouse
Yes, FINAL is slow (under-optimized).
It is implemented in the following way:

1. The primary key is read from the table (to do the merge), even if it isn't used in the query.
2. Streams of data from different data parts are merged.
3. Data is processed in a single thread (because that is the simplest way to do the merge).
4. Even if different data parts don't overlap by primary key, they still go through the merge.
5. The merge is not fully optimized: it contains row-by-row processing, virtual function calls, small copies, etc.
(although performance is not very bad, it could be done much better).

Possible optimizations:
1. Don't read the primary key and don't merge data if, for some range, it is discovered that no other data exists within the same range.
2. Maybe introduce a separate unique key in addition to the primary key, because it could be more compact. Example: the primary key is (ClientID, Date, UserID, LogRowID) and the unique key might be just LogRowID.
3. Dynamically partition data parts into multiple ranges, to merge them in parallel.
4. Further optimize the merge itself.

We have not done these optimizations because we don't use FINAL for production queries under serious load, only for rare queries or small load.
Even ReplacingMergeTree was developed at the request of another department, who convinced me that they would use it only under a small workload.
And we don't recommend FINAL.

Keep in mind that using FINAL leads to a selection that includes the primary key columns in addition to the columns specified in the SELECT. Additionally, the query will be executed in a single stream, and data will be merged during query execution. This means that queries with FINAL are processed more slowly. In most cases you should avoid using FINAL.
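A common alternative is to deduplicate at query time with GROUP BY and argMax, which parallelizes like a normal query (a sketch; it assumes the table has a `version` column like the one used by ReplacingMergeTree, and all names are placeholders):

```sql
-- Query-time deduplication without FINAL:
-- for each key, take the value from the row with the highest version.
SELECT
    key,
    argMax(value, version) AS value
FROM replacing_test
GROUP BY key;
```

This trades the single-threaded merge of FINAL for an ordinary aggregation, at the cost of listing each non-key column explicitly inside argMax.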


__

Apr 9, 2018, 1:45:49 PM4/9/18
to ClickHouse
How do you handle updates, i.e. data that is added at a later time? I'm currently using two tables with a join on an indexed binary UUID, but it is incredibly slow when both tables are in the hundreds of millions of rows, around 3M rows per second.