Steps needed to move a large database table between machines Standalone -> Cluster

ste...@activitystream.com

unread,

Jun 14, 2017, 1:51:13 PM6/14/17

to ClickHouse

Hi,

Can someone please outline the steps I need to take to get a local database table (not distributed) to a remote cluster (3 shards).

I have configured the zk+cluster but I'm unsure what is the best way to get my data to it and how to re-shard it there.

All the best,

- Stefán

Alex Zatelepin

unread,

Jun 15, 2017, 2:12:17 PM6/15/17

to ClickHouse

Hi!

To use automatic resharding you must first convert the table to an engine from the Replicated family. Here is how to do it: https://clickhouse.yandex/docs/en/table_engines/replication.html#converting-from-mergetree-to-replicatedmergetree

Then you can use ALTER TABLE RESHARD command: https://clickhouse.yandex/docs/en/table_engines/resharding.html

Note that it is currently in beta, so please back up your data.

Or you can just create empty local tables on each server, create a Distributed table over them and do an INSERT SELECT into the Distributed table.

ste...@activitystream.com

unread,

Jun 16, 2017, 2:19:34 PM6/16/17

to ClickHouse

Thank you Alex.

I have the data available on a local machine and would like to get it to a remote instance.

Do you recommend other steps for that or just to define it a s a remote table? (How efficient is that transfer?)

Regards,

-Stefán

Alex Zatelepin

unread,

Jun 16, 2017, 3:43:20 PM6/16/17

to ClickHouse

I am not sure I have fully grasped your requirements.

If your data is in a MergeTree table on one instance and you want to setup permanent replication with another instance, then first convert the table to a Replicated engine using the instructions from the previous post and then create a replica table on the second instance. Replication will do its thing.

If your data is in a MergeTree table on one instance and you have set up a cluster on separate machines and need to perform a one-off copy of data from the instance to the cluster, then inserting via a Distributed table is convenient and pretty efficient (data is transferred in the native compressed format). Directly manipulating data parts (detaching, then manually transferring them to the cluster and attaching) is perhaps slightly more efficient but much more cumbersome.

If your data is on one ClickHouse instance and you need to perform a one-off copy to a MergeTree table on another instance, then the most efficient way is to simply stop both servers and rsync data directories (but that won't work with Replicated tables).

Hope that helps :)

ste...@activitystream.com

unread,

Jun 17, 2017, 8:27:55 AM6/17/17

to ClickHouse

It does Alex, thank you.

It was the middle approach I was looking for but the rest will come in handy as well :)