Corruption of SST file and recovery


Tomas Kolda

Jun 5, 2020, 11:35:45 AM
to rocksdb
Hi,

We ran into data corruption in an SST file. One of its blocks had a checksum that did not match:

2020/05/19-01:40:35.244932 7f1a89ffe700 [WARN] [_impl/db_impl_compaction_flush.cc:2810] Compaction error: Corruption: block checksum mismatch: expected 4068460331, got 2581931954  in ./data/tasdb/4230207.sst offset 103735058 size 1300
2020/05/19-01:40:35.245197 7f1a89ffe700 (Original Log Time 2020/05/19-01:40:35.244711) [mpaction/compaction_job.cc:765] [merged_attributes] compacted to: files[5 3 27 0 0 0 0] max score 0.25, MB/sec: 42.3 rd, 23.9 wr, level 1, files in(4, 3) out(3) MB in(8.5, 251.0) out(146.6), read-write-amplify(48.0) write-amplify(17.3) Corruption: block checksum mismatch: expected 4068460331, got 2581931954  in ./data/tasdb/4230207.sst offset 103735058 size 1300, records in: 1183244, records dropped: 24261 output_compression: ZSTD
2020/05/19-01:40:35.245203 7f1a89ffe700 (Original Log Time 2020/05/19-01:40:35.244754) EVENT_LOG_v1 {"time_micros": 1589852435244735, "job": 218713, "event": "compaction_finished", "compaction_time_micros": 6430821, "compaction_time_cpu_micros": 13306603, "output_level": 1, "num_output_files": 3, "total_output_size": 257432851, "num_input_records": 1176108, "num_output_records": 1151847, "num_subcompactions": 3, "output_compression": "ZSTD", "num_single_delete_mismatches": 0, "num_single_delete_fallthrough": 0, "lsm_state": [5, 3, 27, 0, 0, 0, 0]}
2020/05/19-01:40:35.245207 7f1a89ffe700 [ERROR] [_impl/db_impl_compaction_flush.cc:2346] Waiting after background compaction error: Corruption: block checksum mismatch: expected 4068460331, got 2581931954  in ./data/tasdb/4230207.sst offset 103735058 size 1300, Accumulated background error counts: 1
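To locate the damaged block from log lines like the ones above, the message can be parsed mechanically. This is only a minimal sketch: the regular expression is modelled on the exact message quoted here, and the names `CORRUPTION_RE` and `parse_corruption` are made up for illustration. Other RocksDB versions may word the message differently.

```python
import re

# Pattern modelled on the "block checksum mismatch" message quoted above.
CORRUPTION_RE = re.compile(
    r"block checksum mismatch: expected (?P<expected>\d+), got (?P<got>\d+)\s+"
    r"in (?P<path>\S+) offset (?P<offset>\d+) size (?P<size>\d+)"
)

def parse_corruption(line):
    """Return the damaged block's location as a dict, or None if no match."""
    m = CORRUPTION_RE.search(line)
    if m is None:
        return None
    info = m.groupdict()
    # Everything except the path is a decimal number in the log.
    for key in ("expected", "got", "offset", "size"):
        info[key] = int(info[key])
    return info

line = ("2020/05/19-01:40:35.244932 7f1a89ffe700 [WARN] "
        "[_impl/db_impl_compaction_flush.cc:2810] Compaction error: Corruption: "
        "block checksum mismatch: expected 4068460331, got 2581931954  "
        "in ./data/tasdb/4230207.sst offset 103735058 size 1300")
info = parse_corruption(line)
print(info["path"], info["offset"], info["size"])
```

The extracted file name, offset and size identify which block of which SST file failed verification.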

I know that the recommended way is to scan the DB and write it out again, but when the database is huge (several TB) and this happens in production, it would be nice to have a recommended recipe for fixing it. I have tried the following:

1. ldb compact - Strangely, ldb compact is able to compact the file whose checksum does not match. I still do not see how it passes this "ignore" behavior into the compaction job; I need to investigate. In any case, what is the algorithm?
 - Does it ignore errors and write the corrupted records? -> DB has broken records.
 - Does it ignore errors and break the loop at the corrupted record? -> DB loses all subsequent records.
 - Does it skip broken blocks? -> DB is missing only the records from the broken blocks.
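The three possibilities listed above differ only in which records survive. The toy model below makes that concrete; it does not claim to show what ldb compact actually does. The function `compact` and the policy names "keep", "stop" and "skip" are hypothetical, and blocks are plain Python lists rather than real SST blocks.

```python
def compact(blocks, policy):
    """blocks: list of (records, is_corrupt). Returns the records written out."""
    out = []
    for records, is_corrupt in blocks:
        if not is_corrupt:
            out.extend(records)
        elif policy == "keep":    # ignore the error, copy damaged records anyway
            out.extend(records)
        elif policy == "stop":    # ignore the error, but end the loop here
            break
        elif policy == "skip":    # drop only the damaged block and carry on
            continue
        else:
            raise ValueError("unknown policy: " + policy)
    return out

# Three blocks; the middle one is marked as failing its checksum.
blocks = [(["a", "b"], False), (["c?", "d?"], True), (["e", "f"], False)]
for policy in ("keep", "stop", "skip"):
    print(policy, compact(blocks, policy))
```

With "keep" the output contains the suspect records, with "stop" everything after the damaged block is lost, and with "skip" only the damaged block's records are missing.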

1.a ldb compact with an SST file instead of a range - It could be handy to pass a file name and use CompactFiles to reconstruct the broken file. But again, I am not sure which records are kept (see 1).

2. Manual scan - When I use an iterator, there are these cases:
 - verify_checksums = false
    - When a value is broken, the iterator continues and returns the broken value
    - When a key is broken, the iterator stops and the scan does not continue
 - verify_checksums = true
    - The iterator stops on the checksum failure

3. Reconstruction of the SST -
Inspired by sst_dump --command=raw, I wrote a rebuild tool that goes over the SST file and skips only the broken blocks. That means the DB does not contain wrong records; it is just missing them. It also recovers all remaining blocks and does not stop at a broken block. The problem was the manifest, so I had to update it as well: I loaded all version sets and wrote a modified set (with the changed SST file size). The DB then loaded successfully and worked. If you are interested, I can make a pull request.
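The idea behind the rebuild tool described above can be sketched on a toy file layout. This is not the real SST format and not the actual tool: here each block is assumed to be a 4-byte length, a payload, and a 4-byte CRC32 of the payload, and the sketch assumes the damage is in a payload rather than in a length field. `pack_block` and `rebuild` are hypothetical names.

```python
import struct
import zlib

def pack_block(payload):
    """Toy block: 4-byte length, payload, 4-byte CRC32 of the payload."""
    return (struct.pack("<I", len(payload)) + payload
            + struct.pack("<I", zlib.crc32(payload)))

def rebuild(data):
    """Copy every block whose checksum matches; return (new_bytes, dropped_offsets)."""
    out = bytearray()
    dropped = []
    pos = 0
    while pos < len(data):
        (length,) = struct.unpack_from("<I", data, pos)
        payload = data[pos + 4 : pos + 4 + length]
        (stored,) = struct.unpack_from("<I", data, pos + 4 + length)
        if zlib.crc32(payload) == stored:
            out += data[pos : pos + 8 + length]   # good block: copy it verbatim
        else:
            dropped.append(pos)                   # bad block: note its offset, skip it
        pos += 8 + length                         # keep going either way
    return bytes(out), dropped

original = b"".join(pack_block(p) for p in (b"k1=v1", b"k2=v2", b"k3=v3"))
damaged = bytearray(original)
damaged[len(pack_block(b"k1=v1")) + 4] ^= 0xFF    # flip one payload byte in block 2
rebuilt, dropped = rebuild(bytes(damaged))
print(len(original), len(rebuilt), dropped)
```

The rebuilt file is shorter than the original, which is consistent with having to record a new file size for the SST in the manifest.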

Although (3) is the fastest and I have it under full control, I cannot say it is the best option. I know a backup is the best option, but I would still like to know whether there is a disaster recovery procedure with recommended steps that loses the least data possible.

Any hints?

Thanks,
Tomas




Tomas Kolda

Jun 5, 2020, 12:04:04 PM
to rocksdb
Regarding (1): I just tried to damage another DB manually, and there even ldb compact does not accept it. Strange that it behaves differently for the other damaged DB.

Tomas Kolda

Jun 5, 2020, 12:26:10 PM
to rocksdb
OK, so I damaged another compressed SST and it got even worse:

(1) - does not work; for this DB it reports that the checksum failed - not an option
(2) - does not work; reading causes a crash (most likely in decompression) - not an option
(3) - at least my tool works, but it is still not official


