"RDB ERROR DETECTED" loading dumps, including backups


Peter Taoussanis

Apr 10, 2023, 9:56:36 AM
to Redis DB
Hi all, I'm encountering an RDB issue that I don't understand. I'd greatly appreciate assistance if anyone has any ideas.

My Redis instance started reporting an error last night: `MISCONF Redis is configured to save RDB snapshots, but it is currently not able to persist on disk.`

Restarting the server and running `redis-check-rdb dump.rdb` both produce the same error:

```
[offset 0] Checking RDB file dump.rdb
[offset 26] AUX FIELD redis-ver = '5.0.8'
[offset 40] AUX FIELD redis-bits = '64'
[offset 52] AUX FIELD ctime = '1681088401'
[offset 67] AUX FIELD used-mem = '1359262528'
[offset 83] AUX FIELD aof-preamble = '0'
[offset 85] Selecting DB ID 0
--- RDB ERROR DETECTED ---
[offset 51432] Internal error in RDB reading offset 0, function at rdb.c:2080 -> Ziplist integrity check failed.
[additional info] While doing: read-object-value
[additional info] Reading key '<redacted>'
[additional info] Reading type 14 (quicklist)
[info] 87 keys read
[info] 1 expires
[info] 0 already expired
46161:C 10 Apr 2023 11:42:54.008 # Terminating server after rdb file reading failure.
```

I have automated daily backups, so I figured I'd just restore one - but the same issue seems to be present in all backups going back at least 3 months. I'm retrieving older backups from storage, but that will take some time.

Some quick info:

- Running Redis 7.0.5 (`v=7.0.5 sha=00000000:0 malloc=jemalloc-5.2.1 bits=64 build=d76e64d63dff22a5`)
- Running single Redis instance on one dedicated server. No Cluster, nor Sentinel.
- `uname -a`: `Linux ensso1 4.15.0-208-generic #220-Ubuntu SMP Mon Mar 20 14:27:01 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux`
- Dumps created using `BGSAVE`, and waiting till `LASTSAVE` has updated.
- Some relevant `redis.conf` options:
  - `stop-writes-on-bgsave-error yes`
  - `rdbcompression yes`
  - `rdbchecksum yes`
  - `appendfsync everysec`
  - `rdb-save-incremental-fsync yes`

- Dump files are ~740 MB each.
- System has plenty of free memory (52G free of 62G total)
- System has plenty of free disk space (123G free of 197G total for /var/lib/redis)
- System has had no recent hardware or software changes
- Hard disks in RAID1, `mdadm` reporting all disks healthy.
- All Redis `make test` tests pass
- `redis-server --test-memory 62000` passes (I let it run for several hours).
- https://github.com/xueqiu/rdr parses the dumps without any complaints.
- https://github.com/HDT3213/rdb seems to confirm an issue with the RDB file.

Backups were created as follows:
- `BGSAVE` is run
- `LASTSAVE` is checked periodically until it shows an updated value
- `dump.rdb` is then compressed, encrypted, and sent to S3.
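A minimal sketch of that wait-for-`LASTSAVE` step in Python (function and parameter names here are illustrative, not taken from the actual script; `get_lastsave` stands in for however the script fetches `LASTSAVE`, e.g. via `redis-cli LASTSAVE`):

```python
import time

def wait_for_new_lastsave(get_lastsave, baseline, timeout=300.0, interval=1.0):
    """Poll until LASTSAVE advances past `baseline`.

    get_lastsave -- zero-arg callable returning the current LASTSAVE Unix
                    timestamp (e.g. by shelling out to `redis-cli LASTSAVE`).
    baseline     -- the LASTSAVE value recorded just before issuing BGSAVE.
    Returns the new timestamp, or raises TimeoutError if no new dump appears.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        current = get_lastsave()
        if current > baseline:
            return current
        time.sleep(interval)
    raise TimeoutError("BGSAVE did not complete in time")
```

The compress/encrypt/upload steps only run after this returns.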

Since the S3 dumps decrypt successfully, I believe it's safe to conclude that their integrity is good.

I've also tried copying the dump files to a different system, but loading also fails there with the same Ziplist integrity error.

My main objectives in order are:
1. Successfully restore a dump, even if some data is lost.
2. Understand what went wrong, and adjust my backup scheme accordingly.

I've tried searching online for info on RDB errors/corruption, but couldn't find much relevant info. My impression from the "Redis persistence" docs, et al. is that the backup procedure above should be pretty solid. Am I missing something obvious here?

Any other pointers? Has anyone experienced something like this before?
I'm currently pursuing the possibility of using https://github.com/HDT3213/rdb to process the RDB file, skip any apparently broken entries, and maybe produce an AOF output for as much of the dump data as possible.

Will update if I reach any conclusions on that.

Thanks again for your time!
Cheers

Peter Taoussanis

Apr 10, 2023, 11:43:19 AM
to Redis DB
Update: have thankfully managed to recover all the data, minus a single ziplist that was apparently somehow corrupted.

This was possible using a trivial modification (https://github.com/HDT3213/rdb/issues/15#issuecomment-1501949718) to https://github.com/HDT3213/rdb. A big thanks to the tool's author for his very quick and helpful assistance.

Shifting attention now to understanding what actually went wrong:
Was my impression incorrect that a successful `BGSAVE` implies the resulting RDB file is valid?

Would it be sufficient to modify my backup process to confirm that `redis-check-rdb` passes before sending a dump off to S3?
Any other thoughts on what might have actually caused the (apparently silent) corruption in this case?

Sotto Voce

Apr 10, 2023, 2:14:59 PM
to Redis DB
The most likely cause is a shortfall of disk space while your backup script runs. The script was developed and tested and worked fine, else you wouldn't be relying on it. Something has changed, and that's usually the amount of data being saved/encrypted/copied-to-S3/copied-back/decrypted/restored. But server disk space often isn't grown at the same pace as your Redis data.

It's not a good idea to assume that successful decryption means the decrypted file has good integrity.  If the file had a format flaw before it was encrypted, then successful decryption will faithfully reproduce the flaw.  The real test is to skip the encryption and S3 steps and try to restore the plain RDB file.  If that's successful, you can run the file through encryption/decryption and test the restore, and finally test the full procedure - encrypt, save to S3 and delete original file, retrieve encrypted file from S3, decrypt it and restore it.

Does your monitoring track disk space and show you in graphs that the available space on the disk never falls below 5-10%?  Does it sample the space often enough to reveal short-lived periods of low space - i.e., every 5 minutes or more frequently?  If not, I highly, Highly, HIGHLY recommend that you implement that level of monitoring in your production systems. For troubleshooting this issue, you can log into the relevant machine and, as the backup script runs, watch the CPU, memory, and disk consumption yourself.

Peter Taoussanis

Apr 10, 2023, 4:16:24 PM
to Redis DB
Hi Sotto, thanks a lot for the quick feedback.

My responses are inline:

> It's not a good idea to assume that successful decryption means the decrypted file has good integrity.  If the file had a format flaw before it was encrypted, then successful decryption will faithfully reproduce the flaw.

By good integrity in this context, I mean that an HMAC is used to cryptographically assure that the encrypted file content hasn't changed since the data was encrypted. I agree with your second assertion: if the dump itself is invalid at encryption time, it will of course remain invalid after decryption. Successful decryption in this case does exclude the possibility that the dump was initially valid and made invalid through subsequent corruption (e.g. on the way to/from S3).
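To make the distinction concrete, here's a stdlib sketch of an encrypt-then-MAC check (key handling simplified; note that a valid tag only proves the ciphertext is byte-identical to what was uploaded - it says nothing about whether the plaintext RDB was valid in the first place):

```python
import hmac
import hashlib

def tag(key: bytes, ciphertext: bytes) -> bytes:
    # HMAC over the ciphertext, computed at encryption time
    # and stored alongside the uploaded file.
    return hmac.new(key, ciphertext, hashlib.sha256).digest()

def verify(key: bytes, ciphertext: bytes, expected_tag: bytes) -> bool:
    # Constant-time comparison. True means the encrypted bytes were not
    # altered after encryption - a dump that was already corrupt when
    # encrypted still verifies successfully.
    return hmac.compare_digest(tag(key, ciphertext), expected_tag)
```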

> The real test is to skip the encryption and S3 steps and try to restore the plain RDB file.  If that's successful, you can run the file through encryption/decryption and test the restore, and finally test the full procedure - encrypt, save to S3 and delete original file, retrieve encrypted file from S3, decrypt it and restore it.

Do you know whether it would be sufficient to verify the RDB file using `redis-check-rdb`, or is a full restore necessary to be confident that a dump is viable?

> Does your monitoring track disk space and show you in graphs that the available space on the disk never falls below 5-10%?  Does it sample the space often enough to reveal short-lived periods of low space - i.e., every 5 minutes or more frequently?

Yes, disk space and memory usage are captured every few minutes. Disk space has never dropped below 50%. Free memory never dropped below ~30%, but that's harder to measure and more variable - I can't firmly exclude the possibility there were momentary spikes. The system doesn't have much disk activity.
 
> For troubleshooting this issue, you can log into the relevant machine and as the backup script runs, watch the cpu, memory, and disk consumption yourself.

This is good advice. All metrics are measured during backup, and all have always been well within the above windows. CPU can spike to ~90% load - but I wouldn't expect that to lead to dump corruption (please correct me if I'm wrong).

More importantly, >=3 months' worth of daily dumps seem to have been affected. So whatever the problem, it was 100% consistent rather than intermittent: every dump was apparently invalid in the same way, which suggests that something about the procedure/software/hardware failed identically each time.

Something peculiar occurs to me now: the server has been rebooted successfully at least a few times over the last 3 months. A reboot means redis-server restarts, and so presumably reloads the RDB file. How could that have succeeded if the scheduled dumps were somehow all invalid in the same way?

Is something about the dump process initiated by a manual call to `BGSAVE` different from the process initiated in the normal course of data modifications, etc.? Could the dump files present immediately after `BGSAVE`+`LASTSAVE` have been somehow uniquely but consistently invalid?

Peter Taoussanis

Apr 13, 2023, 6:23:02 AM
to Redis DB
Update:

- I've been able to create a reproducible example (using a private dataset, unfortunately) that consistently shows both Redis 6.x and 7.x `SAVE` and `BGSAVE` commands producing an RDB dump that fails `redis-check-rdb` and cannot be loaded.

- I'm not sure what precisely about the dataset is causing the issue. But given the lack of mentions online, whatever the cause, it's presumably rare.

- In any case, I've opened https://github.com/redis/redis/issues/12037 to suggest a recommendation be added to the relevant docs that RDB dumps be tested for viability to help ensure that others don't make the same mistaken assumption that I did (successful SAVE => viable dump).

I'll note that the original affected system used non-ECC memory, so it can't be ruled out that some fluke bit flip caused the unexpected behaviour.
Whatever the cause in my particular case, it seems it'd be useful to suggest some best practice re: dump testing.
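As a sketch, the kind of verification gate I have in mind looks like this (the `checker` parameter is a stand-in I've added so the check command can be swapped; `redis-check-rdb` exits non-zero when it detects a problem):

```python
import subprocess

def rdb_passes_check(dump_path, checker=("redis-check-rdb",)):
    """Return True iff the checker command exits 0 on the dump.

    checker -- command prefix to run; the dump path is appended
               as the final argument.
    """
    result = subprocess.run([*checker, dump_path],
                            capture_output=True, text=True)
    return result.returncode == 0

# In the backup script: only compress/encrypt/upload
# if rdb_passes_check("dump.rdb") is True.
```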

Kind regards,