Find duplicates between tables

431 views
Skip to first unread message

Dale Dude

unread,
Apr 12, 2017, 11:30:46 AM4/12/17
to ClickHouse
Since Clickhouse doesn't seem to use "normal" joins and doesn't have Primary key constraints I'm looking for advise on how I can insert data into an existing table and deal with duplicates (by id field). The Collapse and Replacing tables work as advertised but not as hoped. I was thinking to insert new data into a temp table and then do a join between it and the original (large 3tb) table to only return the non-duplicates from the temp table but I'm at a loss on how best to do that with Clickhouse.

Any advise would be most appreciated.

Alex Zatelepin

unread,
Apr 17, 2017, 7:04:31 AM4/17/17
to ClickHouse
I guess that if the main table is large, then always doing a GROUP BY primary key is out of question.

To find non-duplicates in the temp table you would need anti-join with the main table, but the problem is that the main table is on the right side of the join so the set of primary keys of the main table must fit in memory which we are trying to avoid. Here is a slightly clunky way to do it anyway:

SELECT * from temp_table WHERE primary_key NOT IN (
    SELECT primary_key FROM main_table WHERE primary_key IN
(
        SELECT primary_key FROM temp_table
))

Sébastien Estienney

unread,
Sep 3, 2020, 1:17:04 PM9/3/20
to ClickHouse
Probably a bit too late but it could be usefull for someone else :

Duplicates can be found by using the neighbor function instead of GROUP BY, it requires much less RAM.

An example to find the first duplicate on a pseudo-PK composed by 2 fields

SELECT pkField1,pkField2
FROM (SELECT pkField1,pkField2 FROM myTable ORDER BY pkField1,pkField2)
WHERE (pkField1,pkField2)=neighbor((pkField1,pkField2), -1)
LIMIT 1
Reply all
Reply to author
Forward
0 new messages