+ ## delete the row of tempfile that has the fid in use by file
table
+ $dbh->do("DELETE from tempfile WHERE fid = '$fid'");
+
+ $dbh->do("DELETE from tempfile WHERE fid = '$fid'");
+ }
On Tue, 22 Jul 2008, Brad Fitzpatrick wrote:
> I haven't had any chance to look into it. Alan?
>
> 2008/7/18 afool41 <afo...@gmail.com>:
>
>>
>> Thanks a lot for your correction...
>>
>> Beside the programming skill, is there any logical problem ?
>>
>> afool
>>
...
while($fid_in_use->($fid)) {
throw("dup") if $explicit_fid_used;
$fid=$dbh->selectrow_array("SELECT MAX(fid) FROM file");For a fresh MogileFS deployment using MySQL as meta storage, assume there is only 1 file "a.jpg" uploading, then the contents of tempfile table could be:
$ins_tempfile->();
$fid=undef;
$ins_tempfile->() or die "register_tempfile failed after seeding";
}
...
<<tempfile>>and when file "a.jpg" is saved completely, the corresponding tempfile row will be removed:
| fid | ... | dmid | dkey | devids |
| 1 | ... | 99 | _server_startup_test | |
| 2 | ... | 1 | a.jpg | 1,2 |
( the fid auto-inc counter = 2 )
<<tempfile>>Assume the system is idle for a long time (the Delete worker purging interval, default to 3600s), then the Delete worker starts to purge anything old enough in tempfile table, left the tempfile table empty:
| fid | ... | dmid | dkey | devids |
| 1 | ... | 99 | _server_startup_test | |
( the fid auto-inc counter = 2 )
<<file>>
| fid | dmid | dkey | ... |
| 2 | 1 | a.jpg | ... |
<<file_on>>
| fid | devid |
| 2 | 1 |
| 2 | 2 |
<<tempfile>>
| fid | ... | dmid | dkey | devids |
( the fid auto-inc counter = 2 )
<<file>>
| fid | dmid | dkey | ... |
| 2 | 1 | a.jpg | ... |
<<file_on>>
| fid | devid |
| 2 | 1 |
| 2 | 2 |
After that, assume MySQL is restarted for some reason, now the fid auto-inc counter is reset to 1 (that's how InnoDB handles auto-inc fields), then another file "b.jpg" uploading happens:
<<tempfile>>when b.jpg uploaded, the table contents will be:
| fid | ... | dmid | dkey | devids |
| 1 | ... | 1 | b.jpg | 1,2 |
( the fid auto-inc counter = 2 )
<<file>>
| fid | dmid | dkey | ... |
| 2 | 1 | a.jpg | ... |
<<file_on>>
| fid | devid |
| 2 | 1 |
| 2 | 2 |
<<tempfile>>Now assume file "c.jpg" uploading begins, duplicating fid will be generated in tempfile which activates the above code snippet. As a result the table contents will be:
| fid | ... | dmid | dkey | devids |
( the fid auto-inc counter = 2 )
<<file>>
| fid | dmid | dkey | ... |
| 2 | 1 | a.jpg | ... |
| 1 | 1 | b.jpg | ... |
<file_on>>
| fid | devid |
| 2 | 1 |
| 2 | 2 |
| 1 | 1 |
| 1 | 2 |
<<tempfile>>When c.jpg is uploaded completely, the table contents will be:
| fid | ... | dmid | dkey | devids |
| 2 | ... | 1 | c.jpg | 1,2 | ( the original tempfile inserted this)
| 3 | ... | 1 | c.jpg | 1,2 | ( the first $ins_tempfile call in the above code failed, the second $ins_tempfile inserted this)
( the fid auto-inc counter = 4 )
<<file>>
| fid | dmid | dkey | ... |
| 2 | 1 | a.jpg | ... |
| 1 | 1 | b.jpg | ... |
<file_on>>
| fid | devid |
| 2 | 1 |
| 2 | 2 |
| 1 | 1 |
| 1 | 2 |
<<tempfile>>Then, when the next purging time comes, the Delete worker will remove the dangling c.jpg record in tempfile, along with the data of a.jpg in table file_on!
| fid | ... | dmid | dkey | devids |
| 2 | ... | 1 | c.jpg | 1,2 | <-- dangling record here!
( the fid auto-inc counter = 4 )
<<file>>
| fid | dmid | dkey | ... |
| 2 | 1 | a.jpg | ... |
| 1 | 1 | b.jpg | ... |
| 3 | 1 | c.jpg | ... |
<file_on>>
| fid | devid |
| 2 | 1 |
| 2 | 2 |
| 1 | 1 |
| 1 | 2 |
| 3 | 1 |
| 3 | 2 |
<<tempfile>>Then the contents of file a.jpg is lost...
| fid | ... | dmid | dkey | devids |
( the fid auto-inc counter = 4 )
<<file>>
| fid | dmid | dkey | ... |
| 2 | 1 | a.jpg | ... |
| 1 | 1 | b.jpg | ... |
| 3 | 1 | c.jpg | ... |
<file_on>>
| fid | devid |
| 1 | 1 |
| 1 | 2 |
| 3 | 1 |
| 3 | 2 |
diff -up ./Store.pm.old ./Store.pm
--- ./Store.pm.old 2008-07-23 10:27:19.000000000 +0800
+++ ./Store.pm 2008-07-23 10:29:06.000000000 +0800
@@ -851,9 +851,13 @@ sub register_tempfile {
# re-seed the table with the highest known fid from the file
# table.
+ $dbh->do("DELETE FROM tempfile WHERE fid=?",undef,$fid);
+
# get the highest fid from the filetable and insert a dummy row
$fid = $dbh->selectrow_array("SELECT MAX(fid) FROM file");
- $ins_tempfile->(); # don't care about its result
+ if(defined $ins_tempfile->()) {
+ $dbh->do("DELETE FROM tempfile WHERE fid=?",undef,$fid);
+ }
# then do a normal auto-increment
$fid = undef;
Reviewing this... I think the patch is probably good, but it doesn't feel
like a strong enough fix for me. It might be possible for multiple
tempfile rows to end up in this gummed state, and this patch will only fix
the latest one.
The *real* problem is Worker/Delete.pm:process_tempfiles() doesn't check
for 'file' rows before nuking entries.
sub process_tempfiles does:
- Store.pm:old_tempfiles
- Mass-insert the devids into file_on (Which already exist in this case!)
- Mass-insert the rows into file_to_delete so they are deleted as through
any other method.
... then the deleter goes by and deletes the files + file_on rows.
Note:
- Delete worker process doesn't actually remove the 'file' rows!
- Store.pm:delete_fidid (called from FID.pm:delete) removes the 'file'
row, any tempfile row, and inserts into file_to_delete for the removal.
So the actual fix is looking like, within Worker/Delete.pm:
- Call Store.pm:old_tempfiles
- Try to instantiate a FID object for each tempfile row.
- If FID is instantiated, delete tempfile row and loop.
- If FID object is undefined, go ahead and delete.
I've verified that the 'file' entry isn't created until cmd_create_close
is executed successfully, so this should be fine?
Any thoughts? I'll whip up a patch when I get a chance.
-Dormando
... and try it out? It should independently fix the issue. I also have no
complaints about the submitted patch and will commit that too once I've
given people a day to read this :) Tested this patch. WFM.
-Dormando
Are you sure you successfully upgraded to mogilefs 2.20? That release
includes the patch you said you applied :)
And yes, I made that second fix since I was sure there were other
conditions where tempfile can have danglers, but didn't want to trust on
my ability to find all of them. It's still rare, if it happens at all.
-Dormando
I upgraded to 2.20 specifically for the following patch:
http://code.sixapart.com/trac/mogilefs/changeset/1203
The patch I installed additionally is this one:
--- ./Store.pm.old 2008-07-23 10:27:19.000000000 +0800
+++ ./Store.pm 2008-07-23 10:29:06.000000000 +0800
@@ -851,9 +851,13 @@ sub register_tempfile {
# re-seed the table with the highest known fid from the file
# table.
+ $dbh->do("DELETE FROM tempfile WHERE fid=?",undef,$fid);
+
# get the highest fid from the filetable and insert a dummy row
$fid = $dbh->selectrow_array("SELECT MAX(fid) FROM file");
- $ins_tempfile->(); # don't care about its result
+ if(defined $ins_tempfile->()) {
+ $dbh->do("DELETE FROM tempfile WHERE fid=?",undef,$fid);
+ }
# then do a normal auto-increment
$fid = undef;
..which is not in the 2.20 tag. File loss happens when we're doing
lots of concurrent inserts into our cluster. I hope the above patch
helps. Any more details how/why this is triggered?
Thanks,
Andy
That patch isn't necessary given what my change was.
Can you confirm a few things, please:
- That my patch to the delete worker exists on your server correctly.
- That your fids that are disappearing have a *file* row but no *file_on*
rows?
If your `file` and all file_on rows disappear, it's either your
application removing the file or a different bug. All of the bugs with
tempfile should leave an orphaned `file` row in the database, since none
of the delete worker code actually touches the file table.
Thanks,
-Dormando