Steps needed to move a large database table between machines Standalone -> Cluster

1,969 views
Skip to first unread message

ste...@activitystream.com

unread,
Jun 14, 2017, 1:51:13 PM6/14/17
to ClickHouse
Hi,

Can someone please outline the steps I need to take to get a local database table (not distributed) to a remote cluster (3 shards).

I have configured the zk+cluster but I'm unsure what is the best way to get my data to it and how to re-shard it there.

All the best,
 - Stefán

Alex Zatelepin

unread,
Jun 15, 2017, 2:12:17 PM6/15/17
to ClickHouse
Hi!

To use automatic resharding you must first convert the table to an engine from the Replicated family. Here is how to do it: https://clickhouse.yandex/docs/en/table_engines/replication.html#converting-from-mergetree-to-replicatedmergetree
Then you can use ALTER TABLE RESHARD command: https://clickhouse.yandex/docs/en/table_engines/resharding.html
Note that it is currently in beta, so please back up your data.

Or you can just create empty local tables on each server, create a Distributed table over them and do an INSERT SELECT into the Distributed table.

ste...@activitystream.com

unread,
Jun 16, 2017, 2:19:34 PM6/16/17
to ClickHouse
Thank you Alex.

I have the data available on a local machine and would like to get it to a remote instance.

Do you recommend other steps for that or just to define it a s a remote table? (How efficient is that transfer?)

Regards,
 -Stefán

Alex Zatelepin

unread,
Jun 16, 2017, 3:43:20 PM6/16/17
to ClickHouse
I am not sure I have fully grasped your requirements.

If your data is in a MergeTree table on one instance and you want to setup permanent replication with another instance, then first convert the table to a Replicated engine using the instructions from the previous post and then create a replica table on the second instance. Replication will do its thing.

If your data is in a MergeTree table on one instance and you have set up a cluster on separate machines and need to perform a one-off copy of data from the instance to the cluster, then inserting via a Distributed table is convenient and pretty efficient (data is transferred in the native compressed format). Directly manipulating data parts (detaching, then manually transferring them to the cluster and attaching) is perhaps slightly more efficient but much more cumbersome.

If your data is on one ClickHouse instance and you need to perform a one-off copy to a MergeTree table on another instance, then the most efficient way is to simply stop both servers and rsync data directories (but that won't work with Replicated tables).

Hope that helps :)

ste...@activitystream.com

unread,
Jun 17, 2017, 8:27:55 AM6/17/17
to ClickHouse
It does Alex, thank you.

It was the middle approach I was looking for but the rest will come in handy as well :)

regards,
  -Stefán
Reply all
Reply to author
Forward
0 new messages