Files mysteriously dissapearing from the storage nodes?!!?

8 views
Skip to first unread message

Clinton Goudie-Nice

unread,
Jul 25, 2008, 11:44:47 AM7/25/08
to mog...@googlegroups.com
Greetings all,

I have a new set of pair of CentOS 5 storage machines installed. On them, I have about 10 files, but over the course of the last week, two of the files have vanished.

When I initially noticed that these files couldn't be retrieved from any storage node, I ran a fsck on the tracker. It returned:

unixtime             event           fid      devid
1216940978            NOPA             7          -
1216940978            SRCH             7          -
1216940978            GONE             7          -
1216940978            NOPA            43          -
1216940978            SRCH            43          -
1216940978            GONE            43          -


Then I began hunting inside the dev directories (stored in /var/mogdata/devX) and on *both machines* the files were just gone.

I've checked the logs from my application, and also my front end squid servers. At no time was any request made to delete these files. Additionally, I suspect if there had been, that Mogile wouldn't be complaining in the fsck about them.

The first file that clued me in was FID 7. It was written on the 14th and last read successfully from Mogile on the 21. The next time I tried to retrieve this file was yesterday (24th), and that's when I noticed the problem.

Can anyone think of how these files might have been deleted from BOTH machines without Mogile knowing about it? Does Mogile log anywhere that I could find something interesting in said logs?

>From the information I have available to me, it feels like something else came through the dev directory and deleted these files without mogile's knowledge, but I can't think of what... tmpwatch and logrotate don't appear to crawl all of /var, just specific portions. Are there any thoughts on that front?


Host version: Linux 2.6.18-92.1.6.el5 #1 SMP Wed Jun 25 13:45:47 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux

Here are the rpm/epel versions of mogile currently on the box:

mogstored-backend-lighttpd-2.17-5.el5
mogstored-2.17-5.el5
mogstored-backend-apache-2.17-5.el5
mogstored-backend-perlbal-2.17-5.el5
mogilefsd-2.17-5.el5


Thanks for the help!

Clint



Twelve Horses Mobile Social Web Email

Clinton Goudie-Nice
Architect / Senior Software Engineer
cgo...@twelvehorses.com
Phone: +1.801.571.2665 ext 3264
Mobile: +1.801.915.0629
Fax: +1.801.571.2669
LinkedIn: http://www.linkedin.com/in/cgoudie

Twelve Horses
13961 Minuteman Drive
Suite 125
Draper, UT 84020
www.twelvehorses.com




dormando

unread,
Jul 25, 2008, 9:01:54 PM7/25/08
to mog...@googlegroups.com
Honestly, I have no idea what would do that...

I've had small handfuls of files disappear, but as far as I can tell those
were actually deleted, and then there was an error attempting to remove
the associated file row.

For these fids, do the file_on *and* file rows exist, or just file?

-Dormando

> Twelve
> Horses
> Mobile
> Social
> Web

Clinton Goudie-Nice

unread,
Jul 28, 2008, 11:19:02 AM7/28/08
to mog...@googlegroups.com
Just File...

select * from file where fid = 7;
+-----+------+------+--------+---------+----------+
| fid | dmid | dkey | length | classid | devcount |
+-----+------+------+--------+---------+----------+
|   7 |    1 | 72   |  66272 |       1 |        2 |
+-----+------+------+--------+---------+----------+

I looked through file_to_delete, file_to_delete_later, file_to_replicate, file_on, file_on_corrupt and this is the only place my information can be found.

Also from my application's logs, there is no indication that the file was deleted from mogile.

Clint


Twelve Horses Mobile Social Web
Email

Clinton Goudie-Nice
Architect / Senior Software Engineer
cgo...@twelvehorses.com

Twelve Horses
13961 Minuteman Drive
Suite 125
Draper, UT 84020
www.twelvehorses.com

Clinton Goudie-Nice

unread,
Jul 28, 2008, 11:43:49 AM7/28/08
to mog...@googlegroups.com
I created a similar file on mogile, and then went to the filesystem and physically deleted it. It's created different effects:

The fsck returned:

unixtime             event           fid      devid
1217259385            MISS            55          1
1217259385            MISS            55          2
1217259385            SRCH            55          -
1217259385            GONE            55          -

(with the devid column giving values)

Additionally, this fid still has associated rows in file_on, so it looks like something else went down here.


Clint

Twelve Horses Mobile Social Web Email

Clinton Goudie-Nice
Architect / Senior Software Engineer
cgo...@twelvehorses.com
Phone: +1.801.571.2665 ext 3264


Fax: +1.801.571.2669
LinkedIn: http://www.linkedin.com/in/cgoudie

Twelve Horses
13961 Minuteman Drive
Suite 125
Draper, UT 84020
www.twelvehorses.com


dormando

unread,
Jul 28, 2008, 12:48:22 PM7/28/08
to mog...@googlegroups.com
Yeah. You see the analysis of the other bug we've been doing on the list?
This smells related to that. I'll whip up a patch you can try.

On Mon, 28 Jul 2008, Clinton Goudie-Nice wrote:

> I created a similar file on mogile, and then went to the filesystem and physically deleted it. It's created different effects:
>
> The fsck returned:
>
> unixtime             event           fid      devid
> 1217259385            MISS            55          1
> 1217259385            MISS            55          2
> 1217259385            SRCH            55          -
> 1217259385            GONE            55          -
>
> (with the devid column giving values)
>
> Additionally, this fid still has associated rows in file_on, so it looks like something else went down here.
>
> Clint
>
> Twelve Horses Mobile Social Web Email
>
> Clinton Goudie-Nice
> Architect / Senior Software Engineer
> cgo...@twelvehorses.com
> Phone: +1.801.571.2665 ext 3264
>
>
> Fax: +1.801.571.2669
> LinkedIn: http://www.linkedin.com/in/cgoudie
> Twelve Horses
> 13961 Minuteman Drive
> Suite 125
> Draper, UT 84020
> www.twelvehorses.com
>
> On Mon, 2008-07-28 at 09:19 -0600, Clinton Goudie-Nice wrote:
> Just File...
>
> select * from file where fid = 7;
> +-----+------+------+--------+---------+----------+
> | fid | dmid | dkey | length | classid | devcount |
> +-----+------+------+--------+---------+----------+
> |   7 |    1 | 72   |  66272 |       1 |        2 |
> +-----+------+------+--------+---------+----------+
>
> I looked through file_to_delete, file_to_delete_later, file_to_replicate, file_on, file_on_corrupt and this is the only place my information can be found.
>
> Also from my application's logs, there is no indication that the file was deleted from mogile.
>
> Clint
>
>
> Twelve Horses Mobile Social Web Email

dormando

unread,
Aug 11, 2008, 1:54:43 AM8/11/08
to mog...@googlegroups.com
I wonder if this is related to the patch I just sent out? although you
claim there're still file_on rows associated with it, so that might be
unlikely.

dmesg isn't complaining of filesystem corruption? Are you still losing
files?

The specific bug we're fixing leaves the 'file' row but removes the
'file_on' row. There's no other code in mogilefs that I can find which
deletes files but leaves the 'file_on' rows intact... drain/etc all
supposedly dork up the rows too.

Are you running drain, dead, etc?
-Dormando

On Mon, 28 Jul 2008, Clinton Goudie-Nice wrote:

> I created a similar file on mogile, and then went to the filesystem and
> physically deleted it. It's created different effects:
>
> The fsck returned:
>
> unixtime event fid devid
> 1217259385 MISS 55 1
> 1217259385 MISS 55 2
> 1217259385 SRCH 55 -
> 1217259385 GONE 55 -
>
> (with the devid column giving values)
>
> Additionally, this fid still has associated rows in file_on, so it looks
> like something else went down here.
>
> Clint
>

> Twelve
> Horses
> Mobile
> Social
> Web

>> Twelve
>> Horses
>> Mobile
>> Social
>> Web

Clinton Goudie-Nice

unread,
Aug 12, 2008, 11:58:40 AM8/12/08
to mog...@googlegroups.com
It's actually gotten worse.

We lost two more files from both storage servers. One I'm sure was lost
yesterday.

Due to to some unrelated issues we restarted mogstored on both our
storage servers yesterday. It's possible I restarted the second server
before the first one was fully up. Whether or not that's related, it's
hard to know. A little more than an hour later we lost a jpeg file from
the store on both machines. When it disappeared, it did so in this same
way where the row in file still exists but the rows in file_on have been
removed and the files on disk have been removed. I verified this before
I ran a fsck which came back with the same results as before.

Since our last bit of weird with this issue, I added some verbose
logging around the deletion of anything from mogile from our app. The
app wasn't deleting this file, or anything else anywhere near the time
this happened.

I know when this file disappeared because this jpeg happens to be used
by our monitoring system which pulls the image from the application and
directly from mogile every 15 minutes.


Now for the next bit of weird. The other file had never been requested
in any form. It had been uploaded on the 7th and the tester who uploaded
it never retrieved it. The upload appears to have been successful from
everything I can see. Similarly for this file, we have a row in file but
none in file_on and the files are gone from both storage servers. Since
this file has never been retrieved, I can't say for sure when it entered
this state.


The only tracker on the system was started on Aug 05, so there was no
unusual tracker restarting or weird replication during this window.


Here's the configs for the tracker and store. Anyone see any weird in
here?

--mogilefsd.conf--
db_dsn DBI:mysql:mogilefs:<somelocaldbhostname>;mysql_connect_timeout=5
db_user mogile
db_pass <whatever>
conf_port 6001
listener_jobs 200
replicate_jobs 10
delete_jobs 1
reaper_jobs 5
monitor_jobs 3
-----

--mogstored.conf--
httplisten=0.0.0.0:7500
mgmtlisten=0.0.0.0:7501
docroot=/var/mogdata
-----

This is scary that files are just being eaten like this...

Clint

> > On Mon, 2008-07-28 at 09:19 -0600, Clinton Goudie-Nice wrote:
> >
> >> Just File...
> >>
> >> select * from file where fid = 7;
> >> +-----+------+------+--------+---------+----------+
> >> | fid | dmid | dkey | length | classid | devcount |
> >> +-----+------+------+--------+---------+----------+
> >> | 7 | 1 | 72 | 66272 | 1 | 2 |
> >> +-----+------+------+--------+---------+----------+
> >>
> >> I looked through file_to_delete, file_to_delete_later,
> >> file_to_replicate, file_on, file_on_corrupt and this is the only place
> >> my information can be found.
> >>
> >> Also from my application's logs, there is no indication that the file
> >> was deleted from mogile.
> >>
> >>

Clinton Goudie-Nice

unread,
Aug 12, 2008, 2:45:38 PM8/12/08
to mog...@googlegroups.com

dormando

unread,
Aug 12, 2008, 4:25:13 PM8/12/08
to mog...@googlegroups.com
Are you restarting your mysql server?

Any chance you could apply the delete worker patch I posted to the list?

Clinton Goudie-Nice

unread,
Aug 12, 2008, 4:26:06 PM8/12/08
to mog...@googlegroups.com, dormando
It occurred to me that all the sql for the updates would have to have
been written in the binlogs for replication to our slave db. I went
digging through the binlogs to see what was going on, and low and
behold...

INSERT IGNORE INTO file_to_delete (fid) VALUES (48),(73)/*!*/;
DELETE FROM tempfile WHERE fid IN (48,73)/*!*/;
DELETE FROM file_on WHERE fid='48' AND devid='2'/*!*/;
DELETE FROM file_on WHERE fid='48' AND devid='1'/*!*/;
DELETE FROM file_to_delete WHERE fid='48'/*!*/;

I have the whole binlog dumped somewhere if it would be helpful for
troubleshooting. Send me an email directly and I'll email or link you to
the giant dump.

Based on the bit about tempfile is this related to the other tempfile
bug being discussed on the list? I can't think of a reason these files
would get rows inserted into tempfile. They were written a long time ago
and appear to have been written successfully. Today I did a select *
from tempfile and there's no rows in there now.

Thanks!

Clint

Clinton Goudie-Nice

unread,
Aug 12, 2008, 5:17:16 PM8/12/08
to mog...@googlegroups.com
On Tue, 2008-08-12 at 13:25 -0700, dormando wrote:
> Are you restarting your mysql server?

In this case, mysqld was started on the 31st of last month, and mogilefsd was started on the 5th of this month, so I'd say it's unlikely

>
> Any chance you could apply the delete worker patch I posted to the list?
>

I'm going to look into that next if the binlog info I posted plays into
that theory.

Thanks for the help Dormando!

Clint

dormando

unread,
Aug 12, 2008, 5:19:33 PM8/12/08
to mog...@googlegroups.com
> I'm going to look into that next if the binlog info I posted plays into
> that theory.

It does actually. That's the delete worker doing something that patch
specifically fixes.

> Thanks for the help Dormando!

Sure thang.

-Dormando

Clinton Goudie-Nice

unread,
Aug 12, 2008, 6:52:49 PM8/12/08
to mog...@googlegroups.com
[response to my post for reference]

After re-reading through the other bugs:

http://groups.google.com/group/mogile/browse_thread/thread/c737e84feea148e2/3ebdf130e57f12bb
http://lists.danga.com/pipermail/mogilefs/2008-June/001728.html

as well as the patch:

http://groups.google.com/group/mogile/browse_thread/thread/2eee02f6c64bf16f/8f6dfe269c21779b

I can see how this could happen. Additionally, I tried manually
inserting a row into tempfile on our test system here and the result
with the file being deleted and the only remaining row being in file is
an exact match for this problem.

Finally, after digging through the source, I couldn't see anywhere else
that the files could be deleted and the rows removed from file_on but
not be removed from file. I'd say all around my issue is this same
issue.

I've also submitted Dormando's patch to the EPEL maintainers here:
https://bugzilla.redhat.com/show_bug.cgi?id=458890

I'm not as familiar with the Ubuntu bug system, but someone might want
to go file the bug there. Also I'm unfamiliar with how CPAN gets it's
updates but this seems a wise update to roll there also.

Thanks for your help everyone!

Clint


dormando

unread,
Aug 12, 2008, 7:05:08 PM8/12/08
to mog...@googlegroups.com
Yes, I verified all of that before writing the patch ;)

Please apply it and let us know if it fixes your specific problem.

We'll be rolling out mogilefs 2.18 in a few days, and that should be good
to go upstream. Lots and lots of fixes since 2.17...

-Dormando

On Tue, 12 Aug 2008, Clinton Goudie-Nice wrote:

Reply all
Reply to author
Forward
0 new messages