I'm trying to establish a process for backing up and restoring ClickHouse data, but there seems to be a discrepancy in the latest partition when I restore the data and run queries against it.
I'm currently working on a single instance of ClickHouse server version 1.1.54245, not a cluster. I'm not yet able to upgrade to a later version because of incompatibilities.
Here's my process for a ReplacingMergeTree table:
1. Delete the contents of the shadow directory, so that we don't get shadow files from earlier runs of the backup process.
2. Execute a SELECT statement that will be used later to verify that the restore was successful:
- SELECT occurred_at_date, count(*) FROM mytable GROUP BY occurred_at_date ORDER BY occurred_at_date
3. For each partition:
- ALTER TABLE mytable FREEZE PARTITION ...
4. Collect the files that are in shadow/[0-9]* directories and back them up.
5. Collect the metadata/mydb.sql and metadata/mydb/* files and back them up.
6. On a different machine, start a new ClickHouse server using the data and metadata files from the backup.
7. Execute the same select statement as in step 2 to check if there are any differences.
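To make step 4 concrete, here is a minimal sketch of how the frozen files could be collected. The function name and both directory arguments are hypothetical, and it assumes that every FREEZE run leaves its files under a numbered subdirectory of shadow/ below the ClickHouse data root. It is not the exact script I run.

```python
import glob
import os
import shutil


def collect_shadow_dirs(clickhouse_root, backup_root):
    """Copy every shadow/<N> directory (N starting with a digit) into backup_root.

    Returns the sorted list of shadow subdirectory names that were copied.
    clickhouse_root and backup_root are hypothetical paths chosen by the caller.
    """
    copied = []
    pattern = os.path.join(clickhouse_root, "shadow", "[0-9]*")
    for src in sorted(glob.glob(pattern)):
        if not os.path.isdir(src):
            continue  # skip anything that is not a directory
        name = os.path.basename(src)
        # copytree creates the destination, including missing parent directories
        shutil.copytree(src, os.path.join(backup_root, "shadow", name))
        copied.append(name)
    return copied
```

Calling collect_shadow_dirs("/var/lib/clickhouse", "/some/backup/dir") would copy the numbered shadow directories and leave everything else in shadow/ alone. The destination directories must not exist beforehand, which fits step 1 where the previous run is cleaned up first.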
The problem is that for one of our tables, the query in step 7 returns counts that are too low for the last 2 dates in the result set.
Example 1: I did a backup and restore on 2017-10-02, and here are the results of the check:
All dates from 2010-01-01 to 2017-09-30 are fine.
2017-10-01: before backup had 99641, after restore has 68432.
2017-10-02: before backup had 37790, after restore has 1330.
Example 2: I did a backup and restore on 2017-10-03. No discrepancies.
Example 3: I did a second backup and restore on 2017-10-03.
All dates from 2010-01-01 to 2017-09-30 are fine.
2017-10-01 is completely missing in the query results after restore.
2017-10-02 is completely missing in the query results after restore.
2017-10-03 is partially missing: before backup had 30716, after restore has 20526.
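For clarity, the check behind these examples boils down to comparing the per-date counts from step 2 with those from step 7. A minimal sketch, assuming both result sets have been loaded into dicts keyed by date; the function name is hypothetical, and the 2017-09-30 count is a made-up placeholder for a date that matches:

```python
def diff_counts(before, after):
    """Return {date: (before_count, after_count)} for every date whose counts differ.

    A date that is missing from one side is reported with a count of 0 on that side.
    """
    diffs = {}
    for date in sorted(set(before) | set(after)):
        b = before.get(date, 0)
        a = after.get(date, 0)
        if a != b:
            diffs[date] = (b, a)
    return diffs


# Counts for 2017-10-01 and 2017-10-02 are taken from Example 1.
# The 2017-09-30 value is a placeholder standing in for a date that matched.
before = {"2017-09-30": 100, "2017-10-01": 99641, "2017-10-02": 37790}
after = {"2017-09-30": 100, "2017-10-01": 68432, "2017-10-02": 1330}

for date, (b, a) in diff_counts(before, after).items():
    print(date, "before:", b, "after:", a, "missing:", b - a)
```

Dates that disappear entirely after the restore, as in Example 3, show up with an after-count of 0.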
There is one error message that is shown when I do the FREEZE operation:
Unknown error field: Poco::Exception. Code: 1000
Unknown error field: e.code() = 2
Unknown error field: e.displayText() = File not found: /var/lib/clickhouse/data/mydb/events/20171003_20171003_568801_568801_0
{ Error: Poco::Exception. Code: 1000, e.code() = 2, e.displayText() = File not found: /var/lib/clickhouse/data/mydb/events/20171003_20171003_568801_568801_0, e.what() = File not found
at parseError (/code/node_modules/@apla/clickhouse/src/parse-error.js:2:15)
at errorHandler (/code/node_modules/@apla/clickhouse/src/clickhouse.js:26:13)
at IncomingMessage.<anonymous> (/code/node_modules/@apla/clickhouse/src/clickhouse.js:94:11)
at emitNone (events.js:110:20)
at IncomingMessage.emit (events.js:207:7)
at endReadableNT (_stream_readable.js:1059:12)
at _combinedTickCallback (internal/process/next_tick.js:138:11)
at process._tickCallback (internal/process/next_tick.js:180:9) type: 'File not found' }
This error concerns the partition that shows the discrepancies.
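In case it helps with the diagnosis, here is how I read the part name in that error. This is a sketch that assumes the directory name is laid out as min date, max date, min block number, max block number and merge level, which is my understanding of how parts are named for month-partitioned tables in this version. The helper and field names are my own.

```python
from collections import namedtuple

# Field names are my own labels for the assumed layout of a part directory name.
PartName = namedtuple("PartName", "min_date max_date min_block max_block level")


def parse_part_name(name):
    """Split a part directory name into its five underscore-separated fields."""
    fields = name.split("_")
    if len(fields) != 5:
        raise ValueError("unexpected part name: %r" % name)
    min_date, max_date = fields[0], fields[1]
    min_block, max_block, level = (int(f) for f in fields[2:])
    return PartName(min_date, max_date, min_block, max_block, level)


part = parse_part_name("20171003_20171003_568801_568801_0")
print(part)
```

If that reading is right, the missing part covers a single block on 2017-10-03 with a level of 0, which I take to mean a freshly inserted part that had not been merged yet.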
Am I doing something wrong? Can you offer any advice?
--
best regards,
Gudmundur Orn Johannsson