Database Checksum Error

Xomex

Aug 30, 2023, 12:53:40 AM
to s3ql
Through flaky internet and other problems (now solved), I seem to have caused data corruption of some kind, resulting in checksum errors on fsck or when mounting the remote filesystem (see the attached log). I've tried with an empty cache and with --force-remote on fsck, but nothing changes. Is there any way to continue, or even salvage partial data? Any help appreciated.


========================
Starting fsck of gs://xxxxxx
Scanning metadata objects...
Downloading metadata...
Downloaded 2500/2500 metadata blocks (100%)
Calculating metadata checksum...
ERROR: Uncaught top-level exception:
Traceback (most recent call last):
  File "/root/S3QL5.venv//s3ql-5.0.0/bin//fsck.s3ql", line 21, in <module>
    s3ql.fsck.main(sys.argv[1:])
  File "/root/S3QL5.venv/s3ql-5.0.0/src/s3ql/fsck.py", line 1316, in main
    db = download_metadata(backend, cachepath + '.db', param, failsafe=param.is_mounted)
  File "/root/S3QL5.venv/s3ql-5.0.0/src/s3ql/database.py", line 506, in download_metadata
    raise DatabaseChecksumError(db_file, params.db_md5, digest)
s3ql.database.DatabaseChecksumError: File /Dump/512G/S3QLCaches.new/gs:=2F=2Fxomex-s3ql=2F.db has checksum 09b18dfa48dcc015be6ada8311a9dc11c1622b518cbec41b2e891985571611de, expected 307d7eca68c20420906de0f9bcb1fa498980505f63e21a32fac6cc02eff7c55a

Henry Wertz

Aug 30, 2023, 1:37:53 AM
to s3ql
Yes! I ran into this on 5.0.0 due to my odd use of symlinks (bug #321, now fixed). The workaround is:

Open up /usr/lib/s3ql/s3ql/database.py.  Go to around line 540, and comment out these two lines (shown here commented out):
#            if params.db_md5 != digest:
#                raise DatabaseChecksumError(db_file, params.db_md5, digest)

This simply suppresses the error and lets it continue; essentially your only good copy of the database at that point is the one in your cache, so be careful! But in my case it definitely let me go ahead with the local metadata, both for running fsck and for mounting the filesystem.
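For context, what that check does: fsck hashes the downloaded database file and compares the result against the digest stored in the filesystem parameters. Here is a minimal standalone sketch of the same kind of comparison (illustrative only, not s3ql's actual code; despite the db_md5 name, the 64-hex-digit values in the error above look SHA-256-sized, and the names here are placeholders):

import hashlib

def file_digest(path, chunk_size=64 * 1024):
    # Hash the file in chunks so a large database doesn't have to fit in RAM.
    h = hashlib.sha256()  # assumption: SHA-256, judging by the digest length
    with open(path, 'rb') as fh:
        while chunk := fh.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

# Illustrative comparison (not s3ql's API):
expected = '307d7eca68c20420906de0f9bcb1fa498980505f63e21a32fac6cc02eff7c55a'
actual = file_digest('/path/to/your-cache.db')
if actual != expected:
    raise RuntimeError('checksum mismatch: got %s, expected %s' % (actual, expected))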

Then on to restoring your filesystem to full functionality! (Assuming your internet provider and storage provider problems have cleared up)...

s3ql now sends the database incrementally: it uploads the 64 KB chunks that have changed rather than the entire database. By default this upload happens when you unmount the filesystem, or every 6 hours while it's mounted. It's slick; instead of uploading a potentially 100 MB-1 GB+ database, it uploads the couple hundred KB that have actually changed. If fsck finds any errors, it will do a full database upload instead, which does not rely on any previous snapshots. And I'll note fsck checks the last 5 available snapshots, so once your "bad" snapshots are 6 or more back from the current one, you can undo that patch. (It does delete old snapshots once they're no longer needed.)
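To make the incremental idea concrete, here's a toy sketch of my own (not s3ql's actual implementation): keep a digest per 64 KB block and upload only the blocks whose digest changed since the last snapshot.

import hashlib

BLOCK_SIZE = 64 * 1024  # the 64 KB chunk size mentioned above

def changed_blocks(db_path, old_digests):
    # old_digests maps block index -> hex digest from the previous snapshot.
    # Returns only the blocks whose content differs, i.e. what gets uploaded.
    changed = {}
    with open(db_path, 'rb') as fh:
        idx = 0
        while block := fh.read(BLOCK_SIZE):
            digest = hashlib.sha256(block).hexdigest()
            if old_digests.get(idx) != digest:
                changed[idx] = block
                old_digests[idx] = digest
            idx += 1
    return changed

# A 500 MB database with a few hundred KB of scattered changes yields only a
# handful of blocks here, which is why the periodic upload is so quick.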

So, do that patch, run fsck, and let it correct some error; it'll do a full database upload. Then you can either mount the filesystem and use it for at least 30 hours (since it makes an automatic metadata backup every 6 hours, it'll have made 5 good snapshots), or do 5 cycles of mounting the filesystem, making some change to it (like adding or removing 1 file), and unmounting it, so the last 5 snapshots are good. Then remove that patch. There is code in s3ql to remove those 64 KB blocks when they're no longer needed for any recent snapshots.

Henry Wertz

Aug 30, 2023, 1:44:39 AM
to s3ql
On Wednesday, August 30, 2023 at 12:37:53 AM UTC-5 Henry Wertz wrote:

So, do that patch, run fsck, and let it correct some error; it'll do a full database upload. Then you can either mount the filesystem and use it for at least 30 hours (since it makes an automatic metadata backup every 6 hours, it'll have made 5 good snapshots), or do 5 cycles of mounting the filesystem, making some change to it (like adding or removing 1 file), and unmounting it, so the last 5 snapshots are good. Then remove that patch. There is code in s3ql to remove those 64 KB blocks when they're no longer needed for any recent snapshots.


Oh yeah, and once you remove the patch at the end, run fsck.s3ql --force to verify your remote metadata is actually good again.

Xomex

Aug 30, 2023, 2:34:34 AM
to s3ql
Brilliant, Henry, thanks so much. I'll confirm success to the group once I've tried out the instructions.

Daniel Jagszent

Aug 30, 2023, 7:38:32 AM
to s3ql


Henry Wertz wrote on 30.08.23 at 07:37:
[...] Then you can either mount it and use it for at least 30 hours (since it does an automatic metadata backup every 6 hours, it'll have made 5 good snapshots).  Or do 5 cycles of mount the file system, make some change to it (like add or remove 1 file,), unmount it, so the last 5 snapshots are good. [...]
Just for completeness: another option is to use s3qlctrl backup-metadata to forcefully trigger the metadata backup of a mounted filesystem. I currently do not know whether it actually creates a new snapshot when no block of the SQLite database has changed, though (you might need to touch some files or otherwise trigger metadata changes in between the s3qlctrl backup-metadata calls). See https://www.rath.org/s3ql-docs/man/ctrl.html
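Something along these lines (an untested sketch of my own; the mountpoint is a placeholder, and the marker file just forces a metadata change between calls) would push five fresh snapshots without waiting for the 6-hour timer:

import pathlib
import subprocess
import time

MOUNTPOINT = '/mnt/s3ql'  # placeholder: your mounted s3ql filesystem

for i in range(5):
    # Touch a throwaway file so each backup sees changed metadata, in case
    # backup-metadata skips a snapshot when nothing has changed.
    marker = pathlib.Path(MOUNTPOINT) / ('.snapshot-bump-%d' % i)
    marker.write_text(str(time.time()))
    subprocess.run(['s3qlctrl', 'backup-metadata', MOUNTPOINT], check=True)
    marker.unlink()  # the unlink is itself a metadata change for the next pass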

Xomex

Aug 31, 2023, 10:25:49 PM
to s3ql
Updates: One backend filesystem (on Google storage) seems to have been recovered completely by the suggested method. Thank you :)

The other (on Amazon S3) has failed to fsck fully, giving the following DB integrity errors. I have tried to repair the DB on the command line as suggested, with:
sqlite3 corrupt.db .recover >data.sql
sqlite3 recovered.db <data.sql
This completes without error but doesn't seem to change the DB, so I suspect these are filesystem metadata errors rather than SQLite corruption.
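For reference, the recovered copy can also be checked directly with plain sqlite3, independently of fsck (a minimal sketch; 'recovered.db' is the file produced by .recover above):

import sqlite3

conn = sqlite3.connect('recovered.db')
# PRAGMA integrity_check returns a single row ('ok',) for a clean database,
# otherwise one row per problem (the same kind of messages shown in the log below).
rows = conn.execute('PRAGMA integrity_check').fetchall()
conn.close()
print('clean' if rows == [('ok',)] else rows)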

Lessons: At this stage, I'm wishing I had turned on versioning on the S3 bucket so I could roll back to a known-good backend version, or kept known-good backups of the local metadata DB.

Can anyone suggest any other way forward? Note there are actually only a handful of errors out of nearly 1 TB of data. Many thanks!


===========
Checking DB integrity...
ERROR: *** in database main ***
Tree 4028 page 3795 cell 372: Rowid 402442 out of order
Tree 7253 page 7253 cell 21: 2nd reference to page 7343
Tree 7253 page 7253 cell 89: 2nd reference to page 7342
Tree 7253 page 7253 cell 88: 2nd reference to page 7341
Tree 7253 page 7253 cell 87: 2nd reference to page 7340
Tree 7253 page 7253 cell 86: 2nd reference to page 7339
Tree 7253 page 7253 cell 85: 2nd reference to page 7338
Tree 7253 page 7253 cell 84: 2nd reference to page 7337
Tree 7253 page 7253 cell 83: 2nd reference to page 7336
Tree 7253 page 7253 cell 82: 2nd reference to page 7335
Tree 7253 page 7253 cell 81: 2nd reference to page 7334
Tree 7253 page 7253 cell 80: 2nd reference to page 7333
Tree 7253 page 7253 cell 79: 2nd reference to page 7332
Tree 7253 page 7253 cell 78: 2nd reference to page 7331
Tree 7253 page 7253 cell 77: 2nd reference to page 7330
Tree 7253 page 7253 cell 76: 2nd reference to page 7329
Tree 7253 page 7253 cell 75: 2nd reference to page 7328
Tree 7253 page 7253 cell 74: 2nd reference to page 7327
Tree 7253 page 7253 cell 73: 2nd reference to page 7326
Tree 7253 page 7253 cell 72: 2nd reference to page 7325
ERROR: Database file (f{cachepath}.db) is corrupted. Restore from a backup or try
to repair with the SQLite CLI (cf. https://www.sqlite.org/recovery.html),
then re-run fsck.s3ql

Henry Wertz

Aug 31, 2023, 11:00:49 PM
to s3ql
After making recovered.db, there is an additional step to make it "live": whatever your current db is called, move it out of the way, then put recovered.db in its place.
Like (in the s3ql cache directory):
mv yourcache.db yourcache.db.bak
mv recovered.db yourcache.db
Most likely your filesystem will be back up and running then, but you have the .db.bak file just in case. I had this happen when I was using a USB HDD for the fs and cache, and the USB cable got all flaky, causing dropouts at inconvenient times. As you'd expect from any robust filesystem, it just meant the loss of the last few seconds of stuff copied in; fsck put a file or two into the lost+found directory and away I went.