Supposed that the original table is Table 1, and we want to have the new table to be Table 2, located in the same cluster. Table 2 has 2x number of the shards, compared to Table 1.
We can have ClickHouse Copier,
https://clickhouse.yandex/docs/en/operations/utils/clickhouse-copier/,
to copy data from Table 1 to Table 2, which takes care of re-sharding the data. It is easy if Table 1 is not changed at all. But since we have continuous incoming data, each copier will fetch the data based on the “select” queries on the partition of Table 1 at some time. Supposed at t1, the copier is started, and at t2, the copier finishes the copy of Table 1 to Table 2 . There is really no way to tell how much new data arrives from t1 to t2 has been copied over, and how much new data does not get copied.
At the same time, the user needs to continue to query the table over the data that is up to date. Therefore, follow the above migration strategy, at t2, if we switch the incoming query to Table 2, queries from now on will not get the correct result as some of the data arrives in (t1, t2) does not get copied.
What would be the good strategy to handle such data migration from a source table to a destination table, while new data continues to arrive at the source table during data migration, and the user continues to issue queries over the data up to most recent data arrival time?
Jun