Blocks in MergeTree tables


Burak Emre Kabakcı

Jun 30, 2016, 1:10:54 AM
to ClickHouse
The system.parts table has max_block_number, min_block_number and level columns, which are not documented. How do these columns work?

From what I understand, when we insert data into a table, ClickHouse sorts the data, creates partitions from it, calculates the min and max values of the partition column, and creates directories with this naming scheme:
 
minpartitioncolumnvalue_maxpartitioncolumnvalue_minblocknumber_maxblocknumber_level
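
For example (if my understanding is correct, for a table partitioned by month on a Date column), a part directory might be named something like:

20160101_20160131_1_1_0

i.e. the min and max dates in the part, then the min and max block numbers, then the level.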

It also doesn't use any metadata server; instead, it just lists the directories in /data/default/tablename at startup to discover the parts, and keeps the part metadata in memory.

Let's say I have a table called ontime on two different ClickHouse nodes. When I move all the directories under /data/default/ontime from node 1 to node 2 and restart node 2, the newly added partitions show up in the system.parts table on node 2, but only some of them are actually visible in the data when I execute a "SELECT count() FROM ontime" query. I think the reason is that the block numbers don't match, since node 2 already has some data in the ontime table. Is it possible to move data between two nodes by simply moving directories, or do I need to perform INSERT INTO ... SELECT queries in order to make it work?

I also have another question, about inactive parts. When I INSERT data into a table, some of the parts become inactive. What's the reason for this? I assumed that when I INSERT data, ClickHouse just creates parts, eventually merges parts of the same partition if needed, and deletes the previous parts, and that's why some parts become inactive and are eventually deleted.

What I want to do is develop an alternative replication method. The reasonable replication factor for us is 3, and duplicating data to three nodes while also maintaining ZooKeeper nodes (we would need 3 additional nodes for ZooKeeper for availability) is expensive for us. What we would like to do is keep the hot data on the nodes using MergeTree tables instead of ReplicatedMergeTree tables, and just asynchronously copy the data in the /data directory to Amazon S3, which is cheap and durable. What we found is that we need to use SSDs on the ClickHouse nodes, otherwise IO becomes the bottleneck, so using distributed filesystems is not effective.

Storing the data in a different layer also has advantages for us: we can move data between nodes when we add or remove nodes from the cluster, scale easily, and recover the data in case of failure without actually performing any operation in ClickHouse; we just need to fetch the data from S3 onto the nodes and start ClickHouse.

man...@gmail.com

Jun 30, 2016, 11:00:28 AM
to ClickHouse
The block number is an auto-incrementing number assigned to every INSERTed block.
It is needed to maintain the order of merges: data is merged only over contiguous intervals of block numbers, to preserve certain properties of the merge operations.
A part with freshly INSERTed data contains one block, so min_block_number = max_block_number.
Parts produced by a merge contain data for multiple block numbers.
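
You can see these values directly in system.parts; for example (just a sketch, using your ontime table as the name):

SELECT name, min_block_number, max_block_number, level, active
FROM system.parts
WHERE table = 'ontime'
ORDER BY min_block_number

A part created by a single INSERT shows equal min and max block numbers and level 0; merged parts span a range of block numbers and have a higher level.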


> It also doesn't use any metadata server; instead, it just lists the directories in /data/default/tablename at startup to discover the parts, and keeps the part metadata in memory.

Yes.
For Replicated tables, it also stores the list of parts in ZooKeeper and compares the local set of data with the reference in ZK.


> Let's say I have a table called ontime on two different ClickHouse nodes. When I move all the directories under /data/default/ontime from node 1 to node 2 and restart node 2, the newly added partitions show up in the system.parts table on node 2, but only some of them are actually visible in the data when I execute a "SELECT count() FROM ontime" query. I think the reason is that the block numbers don't match, since node 2 already has some data in the ontime table. Is it possible to move data between two nodes by simply moving directories, or do I need to perform INSERT INTO ... SELECT queries in order to make it work?

That's because the block numbers intersect. The server will complain about it and won't use some of the parts.

You need to place the new parts into the 'detached' directory and then run ALTER TABLE ... ATTACH PARTITION queries:
https://clickhouse.yandex/reference_en.html#Manipulations%20with%20partitions%20and%20parts

When moving parts, be careful with file ownership and access rights.

(It is also possible to rename parts by hand. Block numbers and level are not used in the data itself, only for merging.)
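
A rough sketch of the whole procedure, assuming your /data/default/ontime path and an old-style monthly partition id such as 201607 (adjust names to your setup):

-- 1. Copy the part directories from node 1 into the detached directory on node 2,
--    i.e. into /data/default/ontime/detached/ (keep ownership and access rights correct).
-- 2. Then attach the partition on node 2:
ALTER TABLE ontime ATTACH PARTITION 201607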


> I also have another question, about inactive parts. When I INSERT data into a table, some of the parts become inactive. What's the reason for this? I assumed that when I INSERT data, ClickHouse just creates parts, eventually merges parts of the same partition if needed, and deletes the previous parts, and that's why some parts become inactive and are eventually deleted.

Right. Old parts are not deleted instantly after a merge, but after a timeout (the default is 8 minutes). These are the inactive parts.
This is to simplify data repair after a rough restart: if the data of the merged part was not written to non-volatile storage, then after the restart the merged part is broken, but the old parts are safe and will be restored in place of the merged part.
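
For example, right after a merge you can see both generations of parts at once (again using ontime as the table name):

SELECT name, level, active
FROM system.parts
WHERE table = 'ontime'
ORDER BY name

The merged part appears with active = 1 and a higher level, while the source parts stay visible with active = 0 until the timeout expires and they are removed.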


> What I want to do is develop an alternative replication method. The reasonable replication factor for us is 3, and duplicating data to three nodes while also maintaining ZooKeeper nodes (we would need 3 additional nodes for ZooKeeper for availability) is expensive for us.

It becomes less expensive for large clusters. For example, we use 3 ZK nodes for multiple clusters with more than 500 data nodes in total.
For small clusters, you could allocate cheaper VMs for ZK.
Also, for small clusters, it's acceptable to run ZooKeeper and ClickHouse on the same nodes.


> What we would like to do is keep the hot data on the nodes using MergeTree tables instead of ReplicatedMergeTree tables, and just asynchronously copy the data in the /data directory to Amazon S3, which is cheap and durable.

Maybe it's OK, though I don't have much experience with Amazon S3.
Manually restoring the data directory of a MergeTree table works fine.


> What we found is that we need to use SSDs on the ClickHouse nodes, otherwise IO becomes the bottleneck, so using distributed filesystems is not effective.

It depends on the IO subsystem. On physical servers, RAID 10, 5, 6, or 50 with 4 to 8 SATA 7200 RPM HDDs works fine. But SSDs are better :)
If you use VMs, they often have poor HDD performance (for example, having a single virtual HDD with less than 100 MB/sec of sequential throughput is terrible).

Use about 300-1000 MB/sec of sequential IO throughput as a reference point.


> Storing the data in a different layer also has advantages for us: we can move data between nodes when we add or remove nodes from the cluster, scale easily, and recover the data in case of failure without actually performing any operation in ClickHouse; we just need to fetch the data from S3 onto the nodes and start ClickHouse.

Ok.

Burak Emre Kabakcı

Jul 7, 2016, 6:08:16 AM
to ClickHouse
Is there any plan to add a "DETACH PART" feature? Currently, it's possible to ATTACH parts manually, but DETACH only works for partitions, which makes manually moving data between nodes impossible. Currently, the only way to do it is to delete the directory and restart the node.
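
For reference, the partition-level operation that exists today looks like this (assuming old-style monthly partitions):

ALTER TABLE ontime DETACH PARTITION 201607

but I need the same thing at the granularity of a single part.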

man...@gmail.com

Jul 7, 2016, 2:34:11 PM
to ClickHouse
There are some difficulties with DETACH PART. The part being detached must not participate in merges; otherwise, after the merge, the part's data would still be present.
Implementing DETACH PART is possible, but more complicated than DETACH PARTITION. (BTW, it's easier to implement for non-replicated MergeTree.)

Burak Emre Kabakcı

Jul 7, 2016, 2:40:56 PM
to ClickHouse
Our use case is non-replicated MergeTree tables, and unfortunately this is a real problem for us. Is there any workaround that we can use? Currently, we remove the directory of the part that we want to remove and restart the service, but startup takes some time and we're not sure this is actually reliable.

man...@gmail.com

Jul 7, 2016, 3:35:36 PM
to ClickHouse
If you delete a part's directory before the server shuts down, then some queries will throw exceptions about missing files, because at that point the server still thinks the part exists, but it's gone.
Also, just after you delete a part's directory, the server could finish a merge of two parts, one of them being the part you wanted to delete, and the merged part will contain all the data of the deleted part.
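
As a partial precaution only (it doesn't close the race window), you could check system.merges for merges currently running on the table before deleting anything:

SELECT table, elapsed, progress, result_part_name
FROM system.merges
WHERE table = 'ontime'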

I will look around, maybe it is rather simple to implement DETACH PART.

Burak Emre Kabakcı

Jul 7, 2016, 3:39:15 PM
to ClickHouse
OK, thanks. I can also open an issue on GitHub if you wish.

man...@gmail.com

Jul 7, 2016, 3:53:11 PM
to ClickHouse
You could open an issue with a link to this topic.
