possible corruption in the MANIFEST file?

1,616 views
Skip to first unread message

Dhruba Borthakur

unread,
Nov 28, 2012, 12:44:54 AM11/28/12
to lev...@googlegroups.com
I have a possible corruption of the leveldb transaction log. This
happened when the disk was almost full, so some writes were succeeding
while some were failing. Please let me know if the situation I explain
in the following paragraph is a feasible cause of this corruption.

One thread issued a CompactMemtable that was successfully able to
flush the current memtable to a new file F1 via a call to
WriteLevel0Table. Then it invoked VersionSet.LogAndApply() to record
the newly created file in the MANIFEST file.

The LogAndApply method first invoked descriptor_log_->AddRecord()
which successfully recorded this transaction in the MANIFEST file.
Then it invoked descriptor_file_->Fsync() which failed. This caused
LogAndApply to return error without updating the current in-memory
version. This means that the in-memory version of the database does
not have file F1 whereas F1 is recorded inside a valid transaction in
the MANIFEST file. A succeeding invocation of DeleteObsoleteFiles
found a redundant F1 file on disk and deleted it because it does not
belong to the set of live files in the current version.

Now, when the database is restarted, it reads in the contents from the
MANIFEST file and looks for file F1 on disk which it cannot find. Bad!
dhruba

comments/feedback are appreciated
dhruba

--
Subscribe to my posts at http://www.facebook.com/dhruba

Sanjay Ghemawat

unread,
Dec 27, 2012, 6:20:32 PM12/27/12
to lev...@googlegroups.com
Thanks for the detailed report. Your analysis is correct. The next
release will have a fix.

dhruba

unread,
Jan 10, 2013, 2:52:35 AM1/10/13
to lev...@googlegroups.com
Hi sanjay,

thanks for the fix in 1.9.

I still have a doubt about the fix. If the fsync to the manifest file fails, then the code reopens the manifest file to verify whether the entry made it to the file contents or not. If the new log entry now exists inside the manifest file, then it sets up the status=OK so that things run ok.

What will happen if the *reopen* of the manifest-file (inside VersionSet::ManifestContains) fails? The fix assumes that if the reopen of the manifest file fails, then everything is fine and life continues. But this is incorrect, isn't it? Instead, it is better to either mark the entire db as 'unuseable' or invoke DB::Recover() to restore all state from persistent storage.

thanks
dhruba

Joshua Bell

unread,
Jan 10, 2013, 12:43:56 PM1/10/13
to lev...@googlegroups.com
A similar scenario is that this occurred on a leveldb instance prior to the fix, and is now being opened.

It would be nice if e.g. opening w/ the paranoid_checks option would detect a file referenecd from MANIFEST but missing on disk, allowing an application to decide whether to abort or attempt to DB::Recover().

Sanjay Ghemawat

unread,
Jan 10, 2013, 1:57:03 PM1/10/13
to lev...@googlegroups.com
On Wed, Jan 9, 2013 at 11:52 PM, dhruba <dhr...@gmail.com> wrote:
Hi sanjay,

thanks for the fix in 1.9.

I still have a doubt about the fix. If the fsync to the manifest file fails, then the code reopens the manifest file to verify whether the entry made it to the file contents or not. If the new log entry now exists inside the manifest file, then it sets up the status=OK so that things run ok.

What will happen if the *reopen* of the manifest-file (inside VersionSet::ManifestContains) fails? The fix assumes that if the reopen of the manifest file fails, then everything is fine and life continues. But this is incorrect, isn't it? Instead, it is better to  

My assumption was that if opening the manifest file in read-only mode fails, it is a permanent problem, not a transient error. So in this case we make ManifestContains() return false and we ignore the compaction outputs under the assumption that since the manifest file is not readable, the outputs of the compaction are not useful.
 
either mark the entire db as 'unuseable' or invoke DB::Recover() to restore all state from persistent storage.

Let's consider the options:
(1) The current code (ManifestContains)
(2) Calling DB::Recover
(3) Mark the DB as unusable

Option 2 (calling DB::Recover) does not suffice by itself. DB::Recover could still see the same effects on the MANIFEST file as would be seen by option 1.  So if DB::Recover returns an error, what do we do?

Option 3 is interesting, but I worry that some clients will be unhappy with less availability, especially if the problem is transient.

BTW, Option (3) is already the intended behavior if paranoid_checks==true. If compaction fails and if paranoid_checks are on, the error is saved in bg_error_ (see db_impl.cc) and that causes all future writes to the DB fail.  But perhaps there are some other things we could do to solidify this: prevent any more compactions or garbage collections once bg_error_ has been set?  And perhaps we should make failed compactions turn on bg_error_ regardless of the setting of paranoid_checks.

Can people think of other useful options?

Dhruba Borthakur

unread,
Jan 10, 2013, 3:46:21 PM1/10/13
to lev...@googlegroups.com
Hi sanjay,

I somehow feel safe with option 3 but can live with 1 if everybody else likes that option.

thanks
dhruba

On Thu, Jan 10, 2013 at 10:57 AM, Sanjay Ghemawat <san...@google.com> wrote:
Let's consider the options:
(1) The current code (ManifestContains)
(2) Calling DB::Recover
(3) Mark the DB as unusable
Option 3 is interesting, but I worry that some clients will be unhappy with less availability, especially if the problem is transient.
 

Joshua Bell

unread,
Jan 14, 2013, 2:43:04 PM1/14/13
to lev...@googlegroups.com
On Thu, Jan 10, 2013 at 10:57 AM, Sanjay Ghemawat <san...@google.com> wrote:
If paranoid_checks==true and the database is already in this state (missing SST file), is there a way with the existing API to detect the looming problem at open time before a background compaction fails?

We appeared to be seeing that in some reports from the field, and our recovery code doesn't kick in because the DB::Open() succeeds, we just enter the error state at some later point.

Joshua Bell

unread,
Jan 14, 2013, 2:46:13 PM1/14/13
to lev...@googlegroups.com
Let me try that last paragraph again, more coherently this time:

We appeared to be seeing the "missing SST file recorded in MANIFEST" issue reported by Dhruba in some reports from the field; in those cases, our Open-time recovery code doesn't kick in because the DB::Open() succeeds; instead, background compaction fails at some later point and reads start failing.

David Grogan

unread,
Jan 16, 2013, 4:00:57 PM1/16/13
to lev...@googlegroups.com
On Mon, Jan 14, 2013 at 11:43 AM, Joshua Bell <jsb...@chromium.org> wrote:
Chrome always uses paranoid_checks == true and we are ok with the existing resultant "option (3)" behavior. But we'd also like DB::Open to fail if any sst files are missing. Our "recovery" code (delete the leveldb directory and start fresh) would then kick in. So, how easy would it be to make DB::Open check for missing sst files? If it's not too hairy, one of us could make an attempt and send a patch.

Sanjay Ghemawat

unread,
Jan 17, 2013, 1:11:21 PM1/17/13
to lev...@googlegroups.com
> Chrome always uses paranoid_checks == true and we are ok with the existing
> resultant "option (3)" behavior. But we'd also like DB::Open to fail if any
> sst files are missing. Our "recovery" code (delete the leveldb directory and
> start fresh) would then kick in. So, how easy would it be to make DB::Open
> check for missing sst files? If it's not too hairy, one of us could make an
> attempt and send a patch.

Should be fairly easy. DBImpl::Recover() get a list of all the live
files (see call to env_->GetChildren()). Compare that set to the set
of files mentioned in the MANIFEST (call versions_->AddLiveFiles).
Raise an error if AddLiveFiles returns a file F that is not in
GetChildren..

std::set<uint64_t> expected;
versions_->AddLiveFiles(&expected);
for each i in filenames {
if (ParseFileName(...)) {
expected.erase(number);
if (type == kLogFile && ... existing condition ...) { ...
existing code ... }
}
}
if (!expected.empty()) { raise error }
Reply all
Reply to author
Forward
0 new messages