Inconsistent backup/restore for latest partition when using ALTER TABLE FREEZE PARTITION


gudmun...@activitystream.com

Oct 3, 2017, 10:40:16 AM
to ClickHouse
I'm trying to establish a process for backup and restore of ClickHouse data, but there seems to be a discrepancy in the latest partition when I restore the data and run queries against it.

I'm currently running a single ClickHouse server instance, version 1.1.54245, not a cluster. I can't upgrade to a later version yet because of incompatibilities.

Here's my process for a ReplacingMergeTree table:

1. Delete the contents of the shadow directory, so that we don't get shadow files from earlier runs of the backup process.

2. Execute a SELECT statement that will be used later to verify that the restore was successful:
 - SELECT occurred_at_date, count(*) FROM mytable GROUP BY occurred_at_date ORDER BY occurred_at_date

3. For each partition:
 - ALTER TABLE mytable FREEZE PARTITION ...

4. Collect the files that are in shadow/[0-9]* directories and back them up.

5. Collect the metadata/mydb.sql and metadata/mydb/* files and back them up.

6. On a different machine, start a new Clickhouse server using the data and metadata files from the backup.

7. Execute the same select statement as in step 2 to check if there are any differences.
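
Step 4 above can be sketched in Python as follows. This is only an illustration under stated assumptions, not part of the original process: the function names `shadow_increment_dirs` and `archive_shadow` are hypothetical, the archive format (a gzip-compressed tar) is an arbitrary choice, and only all-digit directory names are accepted, which is slightly stricter than the `shadow/[0-9]*` pattern.

```python
import re
import tarfile
from pathlib import Path


def shadow_increment_dirs(clickhouse_root):
    """Return the all-digit directories under <root>/shadow, sorted numerically."""
    shadow = Path(clickhouse_root) / "shadow"
    numbered = [
        p for p in shadow.iterdir()
        if p.is_dir() and re.fullmatch(r"[0-9]+", p.name)
    ]
    # Sort by numeric value so that "10" comes after "2".
    return sorted(numbered, key=lambda p: int(p.name))


def archive_shadow(clickhouse_root, archive_path):
    """Pack every numbered shadow directory into one gzip-compressed tar file.

    Returns the names of the directories that were archived.
    """
    dirs = shadow_increment_dirs(clickhouse_root)
    with tarfile.open(str(archive_path), "w:gz") as tar:
        for d in dirs:
            # Store paths relative to the ClickHouse root, e.g. "shadow/3/...".
            tar.add(str(d), arcname=f"shadow/{d.name}")
    return [d.name for d in dirs]
```

With the data directory that appears in the error message further down, a call might look like `archive_shadow("/var/lib/clickhouse", "shadow-backup.tar.gz")`; the archive file name is made up for the example.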

The problem is that for one of our tables, the query in step 7 returns counts that are too low for the last 2 dates in the result set.

Example 1: I did a backup and restore on 2017-10-02, and here are the results of the check:
All dates from 2010-01-01 to 2017-09-30 are fine.
2017-10-01: 99641 before backup, 68432 after restore.
2017-10-02: 37790 before backup, 1330 after restore.

Example 2: I did a backup and restore on 2017-10-03. No discrepancies.

Example 3: I did a second backup and restore on 2017-10-03.
All dates from 2010-01-01 to 2017-09-30 are fine.
2017-10-01 is completely missing from the query results after restore.
2017-10-02 is completely missing from the query results after restore.
2017-10-03 is partially missing: 30716 before backup, 20526 after restore.
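
The comparison between the step 2 and step 7 results can be automated. The following is a minimal Python sketch, assuming each query result has already been loaded into a dict that maps the date to its count; the helper name `diff_counts` is hypothetical. The sample values are the ones reported in Example 1.

```python
def diff_counts(before, after):
    """Return {date: (before_count, after_count)} for every date that differs.

    A date that is missing on one side is reported with None on that side.
    """
    mismatches = {}
    for date in sorted(set(before) | set(after)):
        b, a = before.get(date), after.get(date)
        if b != a:
            mismatches[date] = (b, a)
    return mismatches


# Counts from Example 1 (only the two dates whose values are quoted above).
before = {"2017-10-01": 99641, "2017-10-02": 37790}
after = {"2017-10-01": 68432, "2017-10-02": 1330}

for date, (b, a) in diff_counts(before, after).items():
    print(date, "before:", b, "after:", a)
```

A date that disappears entirely after the restore, as in Example 3, would show up with `None` as its second value.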

There is one error message that is shown when I do the FREEZE operation:

$$$$$$$$$$$$$$$$$$$$$$$
Unknown error field: Poco::Exception. Code: 1000
Unknown error field: e.code() = 2
Unknown error field: e.displayText() = File not found: /var/lib/clickhouse/data/mydb/events/20171003_20171003_568801_568801_0
{ Error: Poco::Exception. Code: 1000, e.code() = 2, e.displayText() = File not found: /var/lib/clickhouse/data/mydb/events/20171003_20171003_568801_568801_0, e.what() = File not found

    at parseError (/code/node_modules/@apla/clickhouse/src/parse-error.js:2:15)
    at errorHandler (/code/node_modules/@apla/clickhouse/src/clickhouse.js:26:13)
    at IncomingMessage.<anonymous> (/code/node_modules/@apla/clickhouse/src/clickhouse.js:94:11)
    at emitNone (events.js:110:20)
    at IncomingMessage.emit (events.js:207:7)
    at endReadableNT (_stream_readable.js:1059:12)
    at _combinedTickCallback (internal/process/next_tick.js:138:11)
    at process._tickCallback (internal/process/next_tick.js:180:9) type: 'File not found' }
$$$$$$$$$$$$$$$$$$$$$$$

This error is for the partition that shows the discrepancies.
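
The directory name in the error can be broken into its fields to see which part went missing. This is a sketch only: it assumes the name follows the pattern minDate_maxDate_minBlock_maxBlock_level, which is how MergeTree part directories appear to be named in this server generation, and the helper `parse_part_name` is hypothetical.

```python
from collections import namedtuple
from datetime import datetime

PartName = namedtuple("PartName", "min_date max_date min_block max_block level")


def parse_part_name(name):
    """Split a part directory name into dates, block numbers and merge level.

    Assumes the layout minDate_maxDate_minBlock_maxBlock_level.
    """
    min_d, max_d, min_b, max_b, level = name.split("_")
    return PartName(
        datetime.strptime(min_d, "%Y%m%d").date(),
        datetime.strptime(max_d, "%Y%m%d").date(),
        int(min_b),
        int(max_b),
        int(level),
    )


# The part named in the "File not found" error above.
part = parse_part_name("20171003_20171003_568801_568801_0")
print(part.min_date, part.max_date, part.min_block, part.max_block, part.level)
```

Under that assumption, the part covers only 2017-10-03, has identical minimum and maximum block numbers, and is at level 0, which would point to a freshly inserted part that had not yet been merged. Whether a background merge removed it while the FREEZE was running is not confirmed by anything in this post.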


Am I doing something wrong? Can you offer any advice?
--
best regards,

Gudmundur Orn Johannsson

gudmun...@activitystream.com

Oct 13, 2017, 6:51:26 AM
to ClickHouse