|
I've had small handfuls of files disappear, but as far as I can tell those
were actually deleted, and then there was an error attempting to remove
the associated file row.
For these fids, do the file_on *and* file rows exist, or just file?
-Dormando
> Twelve
> Horses
> Mobile
> Social
> Web
|
|
|
|
|
|
dmesg isn't complaining of filesystem corruption? Are you still losing
files?
The specific bug we're fixing leaves the 'file' row but removes the
'file_on' row. There's no other code in mogilefs that I can find which
deletes files but leaves the 'file_on' rows intact... drain/etc all
supposedly dork up the rows too.
Are you running drain, dead, etc?
-Dormando
On Mon, 28 Jul 2008, Clinton Goudie-Nice wrote:
> I created a similar file on mogile, and then went to the filesystem and
> physically deleted it. It's created different effects:
>
> The fsck returned:
>
> unixtime event fid devid
> 1217259385 MISS 55 1
> 1217259385 MISS 55 2
> 1217259385 SRCH 55 -
> 1217259385 GONE 55 -
>
> (with the devid column giving values)
>
> Additionally, this fid still has associated rows in file_on, so it looks
> like something else went down here.
>
> Clint
>
> Twelve
> Horses
> Mobile
> Social
> Web
>> Twelve
>> Horses
>> Mobile
>> Social
>> Web
We lost two more files from both storage servers. One I'm sure was lost
yesterday.
Due to to some unrelated issues we restarted mogstored on both our
storage servers yesterday. It's possible I restarted the second server
before the first one was fully up. Whether or not that's related, it's
hard to know. A little more than an hour later we lost a jpeg file from
the store on both machines. When it disappeared, it did so in this same
way where the row in file still exists but the rows in file_on have been
removed and the files on disk have been removed. I verified this before
I ran a fsck which came back with the same results as before.
Since our last bit of weird with this issue, I added some verbose
logging around the deletion of anything from mogile from our app. The
app wasn't deleting this file, or anything else anywhere near the time
this happened.
I know when this file disappeared because this jpeg happens to be used
by our monitoring system which pulls the image from the application and
directly from mogile every 15 minutes.
Now for the next bit of weird. The other file had never been requested
in any form. It had been uploaded on the 7th and the tester who uploaded
it never retrieved it. The upload appears to have been successful from
everything I can see. Similarly for this file, we have a row in file but
none in file_on and the files are gone from both storage servers. Since
this file has never been retrieved, I can't say for sure when it entered
this state.
The only tracker on the system was started on Aug 05, so there was no
unusual tracker restarting or weird replication during this window.
Here's the configs for the tracker and store. Anyone see any weird in
here?
--mogilefsd.conf--
db_dsn DBI:mysql:mogilefs:<somelocaldbhostname>;mysql_connect_timeout=5
db_user mogile
db_pass <whatever>
conf_port 6001
listener_jobs 200
replicate_jobs 10
delete_jobs 1
reaper_jobs 5
monitor_jobs 3
-----
--mogstored.conf--
httplisten=0.0.0.0:7500
mgmtlisten=0.0.0.0:7501
docroot=/var/mogdata
-----
This is scary that files are just being eaten like this...
Clint
> > On Mon, 2008-07-28 at 09:19 -0600, Clinton Goudie-Nice wrote:
> >
> >> Just File...
> >>
> >> select * from file where fid = 7;
> >> +-----+------+------+--------+---------+----------+
> >> | fid | dmid | dkey | length | classid | devcount |
> >> +-----+------+------+--------+---------+----------+
> >> | 7 | 1 | 72 | 66272 | 1 | 2 |
> >> +-----+------+------+--------+---------+----------+
> >>
> >> I looked through file_to_delete, file_to_delete_later,
> >> file_to_replicate, file_on, file_on_corrupt and this is the only place
> >> my information can be found.
> >>
> >> Also from my application's logs, there is no indication that the file
> >> was deleted from mogile.
> >>
> >>
http://mirrors.kernel.org/fedora-epel/5/SRPMS/perl-mogilefs-server-2.17-5.el5.src.rpm
http://mirrors.kernel.org/fedora-epel/5/SRPMS/perl-MogileFS-Client-1.07-2.el5.src.rpm
http://mirrors.kernel.org/fedora-epel/5/SRPMS/perl-MogileFS-Utils-2.12-1.el5.src.rpm
Any chance you could apply the delete worker patch I posted to the list?
INSERT IGNORE INTO file_to_delete (fid) VALUES (48),(73)/*!*/;
DELETE FROM tempfile WHERE fid IN (48,73)/*!*/;
DELETE FROM file_on WHERE fid='48' AND devid='2'/*!*/;
DELETE FROM file_on WHERE fid='48' AND devid='1'/*!*/;
DELETE FROM file_to_delete WHERE fid='48'/*!*/;
I have the whole binlog dumped somewhere if it would be helpful for
troubleshooting. Send me an email directly and I'll email or link you to
the giant dump.
Based on the bit about tempfile is this related to the other tempfile
bug being discussed on the list? I can't think of a reason these files
would get rows inserted into tempfile. They were written a long time ago
and appear to have been written successfully. Today I did a select *
from tempfile and there's no rows in there now.
Thanks!
Clint
In this case, mysqld was started on the 31st of last month, and mogilefsd was started on the 5th of this month, so I'd say it's unlikely
>
> Any chance you could apply the delete worker patch I posted to the list?
>
I'm going to look into that next if the binlog info I posted plays into
that theory.
Thanks for the help Dormando!
Clint
It does actually. That's the delete worker doing something that patch
specifically fixes.
> Thanks for the help Dormando!
Sure thang.
-Dormando
After re-reading through the other bugs:
http://groups.google.com/group/mogile/browse_thread/thread/c737e84feea148e2/3ebdf130e57f12bb
http://lists.danga.com/pipermail/mogilefs/2008-June/001728.html
as well as the patch:
http://groups.google.com/group/mogile/browse_thread/thread/2eee02f6c64bf16f/8f6dfe269c21779b
I can see how this could happen. Additionally, I tried manually
inserting a row into tempfile on our test system here and the result
with the file being deleted and the only remaining row being in file is
an exact match for this problem.
Finally, after digging through the source, I couldn't see anywhere else
that the files could be deleted and the rows removed from file_on but
not be removed from file. I'd say all around my issue is this same
issue.
I've also submitted Dormando's patch to the EPEL maintainers here:
https://bugzilla.redhat.com/show_bug.cgi?id=458890
I'm not as familiar with the Ubuntu bug system, but someone might want
to go file the bug there. Also I'm unfamiliar with how CPAN gets it's
updates but this seems a wise update to roll there also.
Thanks for your help everyone!
Clint
Please apply it and let us know if it fixes your specific problem.
We'll be rolling out mogilefs 2.18 in a few days, and that should be good
to go upstream. Lots and lots of fixes since 2.17...
-Dormando
On Tue, 12 Aug 2008, Clinton Goudie-Nice wrote: