Emulating UPDATE in CollapsingMergeTree

458 views
Skip to first unread message

Ivan Ladelschikov

unread,
Oct 13, 2016, 5:17:08 AM10/13/16
to ClickHouse
Hi, I'm trying to do bulk update in CollapsingMergeTree (millions of rows and more) using the following method:
a) extract the needed rows from original table (A) into temporary MergeTree table (B)
b) drop the 'Sign' column in B and add it again with DEFAULT -1
c) insert all the B rows back to A table ('negative' values)
d) drop the 'Sign' column in B and add it again with DEFAULT 1
e) do the same with all the fields I want to update, e.g. drop the 'value' column with the old value 0 and add it back with DEFAULT 5
f) insert all the B rows back to A table ('positive' values)
g) drop the B table
See the full example here.

The method works pretty well for relatively small datasets (thousands of rows). However for large datasets like million of rows the following two troubles happen:
1) after the bulk insert (both negative and positive values) using 
INSERT INTO A SELECT * FROM B;

some lags are observed:
-- after negative insertions (comes first)
SELECT count
(), Sign FROM A WHERE answer_id=42 GROUP BY Sign
┌─count()─┬─Sign─┐
│  938258 │    1 │ <-- original rows
│  825956 │   -1 │ <-- negative insertions
└─────────┴──────┘

-- after positive insertions
SELECT count(), Sign, value FROM A WHERE answer_id=42 GROUP BY Sign, value
┌─count()─┬─Sign─┬─value─┐
│  883651 │    1 │     5 │ <-- positive insertions
│  938258 │    1 │     0 │ <-- original rows
│  938258 │   -1 │     0 │ <-- negative insertions
└─────────┴──────┴───────┘

 However after some time (like 5-10 minutes or so) the data becomes eventually consistent:
SELECT count(), Sign, value FROM A WHERE answer_id=42 GROUP BY Sign, value

┌─count()─┬─Sign─┬─value─┐
 938258    1     5 <-- positive insertions
 938258    1     0 <-- original rows
 938258   -1     0 <-- negative insertions
└─────────┴──────┴───────┘


2) Despite of the "consistency is reached" as previous paragraph claims, the FINAL query does not work as expected:
SELECT count(), Sign, value FROM A FINAL WHERE answer_id=42 GROUP BY Sign, value

┌─count()─┬─Sign─┬─value─┐
 938258    1     5 <-- it's all right! But it should be the only one row that left with FINAL query.
 234150    1     0 <-- WTF?
└─────────┴──────┴───────┘

So, the questions themselves: 
- is it normal to observe some lags after insertions are done with INSERT INTO A SELECT * FROM B;
- why the FINAL query does not work as expected?
- and the main one: don't I break the CollapsingMergeTree logic with the bulk negative insertions first and then bulk positive ones, not the pairwise?

man...@gmail.com

unread,
Oct 14, 2016, 7:45:21 PM10/14/16
to ClickHouse
First of all, this method is suitable only for rare updates. Not as possible replacement of unimplemented UPDATE statement.

Sing-collapsing applies in background while merging. It is not possible to determine, when data will be fully merged.
CollapsingMergeTree is usually used when you could write queries, that are independent of when data will be collapsed, and will it ever be collapsed.
For example, if you write your aggregation query like this:
SELECT key, sum(Sign) AS cnt FROM table GROUP BY key HAVING sum(Sign) > 0
, its result is invariant of sign-collapsing.
So, lags are normal and expected.

Sign-collapsing (both in background and using FINAL) is dependent on order of rows and placement of rows in blocks.
To replace one row with another, you must insert data where negative row is placed before positive row.
In one inserted block there must be no more than one "change" for each key.
If you do INSERT SELECT, SELECT will be run multithreaded by default, and order of rows may be arbitary.
To keep order of rows, you may SET max_threads = 1 before running INSERT SELECT.

See this thread for details: https://groups.google.com/forum/#!msg/clickhouse/VixyOUD-K68/js6rrmJxAQAJ

Reply all
Reply to author
Forward
0 new messages