how to find corrupt files in mogile


tariq wali

Jan 4, 2012, 8:05:10 AM
to mogile
Hi All,

Is there a way to list all corrupt files in a MogileFS cluster (by fid), and also the missing files?

Each time I run a rebalance and telnet to the tracker, I see a lot of errors telling me that files are either missing or corrupt.

Or is there a wrapper script that lists all the corrupt data?

--
Tariq Wali.

dormando

Jan 4, 2012, 11:39:59 AM
to mogile

If you run an fsck, the "status" output will show you how many files are
missing, etc. The fsck log will then have the full fid information for each
type of error:
http://code.google.com/p/mogilefs/wiki/FSCK#Interpreting_Results

Eric Wong

Jan 4, 2012, 4:52:35 PM
to mog...@googlegroups.com
tariq wali <gana...@gmail.com> wrote:
> Is there a way to list all corrupt files in a MogileFS cluster (by fid), and
> also the missing files?

There's also a thread titled "end-to-end checksums" from Nov 2011 where I
posted my "checksums" branch: git://bogomips.org/MogileFS-Server.git

Still hoping more eyes will look at it and maybe try it out.

I unfortunately haven't had much time to test further, but it seems
alright in my limited testing.

tariq wali

Jan 17, 2012, 9:21:44 AM
to mog...@googlegroups.com, dormando
Eric,


I read the post you mentioned (https://groups.google.com/forum/#!msg/mogile/Ic5O4J816xY/FBLShkYfJ68J), but I'm not quite sure how it helps to get a list of all missing/corrupt files unless we run the MogileFS fsck and print a long list using printlog/taillog. I was looking for something like moglist, but please correct me if I misunderstood.

This is what my fsck status looks like now:
mogadm fsck status

    Running: Yes (on lfvsfcp58.dn.net)
     Status: 5740115 / 287432015 (2.00%)
       Time: 265m (360 fids/s; 13025m remain)
 Check Type: Normal (check policy + files)

 [num_GONE]: 2
 [num_MISS]: 567716 (that's a LOT of missing copies that are supposed to exist?)
 [num_NOPA]: 2
 [num_POVI]: 709490 (replication policy violation)
 [num_REPL]: 709473 (FID has been scheduled for replication to fix a policy violation)
 [num_SRCH]: 2

Fixing all these files manually is a HUGE no. Is fsck really supposed to run for 13025m?


 
--
Tariq Wali.

Eric Wong

Jan 18, 2012, 3:52:15 PM
to mog...@googlegroups.com, dormando
tariq wali <gana...@gmail.com> wrote:
> I read the post you mentioned
> (https://groups.google.com/forum/#!msg/mogile/Ic5O4J816xY/FBLShkYfJ68J), but
> I'm not quite sure how it helps to get a list of all missing/corrupt files
> unless we run the MogileFS fsck and print a long list using printlog/taillog.

Yes, my checksums branch allows running fsck to get the list using
printlog/taillog. No special tweaks to whatever HTTP server you're
using are required, but running mogstored (from the checksums branch) is
/highly/ recommended.

> I was looking for something like moglist, but please correct me if I
> misunderstood.

You could probably write a quick script to transform fsck printlog
FID results to namespace/key names if it makes recovery easier...
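
Something along these lines, as a rough Python sketch rather than a tested tool. It assumes each printlog line looks like "<unixtime> <EVNT> <fid> [<devid>]" and that the tracker database has the stock `file` (fid, dmid, dkey) and `domain` (dmid, namespace) tables; check both against your install. The function names are made up for illustration.

```python
import re
from collections import defaultdict

# ASSUMPTION: one event per line, "<unixtime> <EVNT> <fid> [<devid>]".
# Adjust the pattern if your mogadm prints the columns differently.
LOG_LINE = re.compile(r"^\s*(\d+)\s+([A-Z]{4})\s+(\d+)(?:\s+(\S+))?\s*$")


def fids_by_event(lines):
    """Group fids from fsck log lines by their four-letter event code."""
    grouped = defaultdict(set)
    for line in lines:
        match = LOG_LINE.match(line)
        if match:  # headers and unparseable lines are skipped
            grouped[match.group(2)].add(int(match.group(3)))
    return grouped


def key_lookup_sql(fids):
    """Build a query that maps fids to namespace/key names.

    ASSUMPTION: tables `file` and `domain` as described above.
    """
    if not fids:
        raise ValueError("no fids to look up")
    id_list = ", ".join(str(fid) for fid in sorted(fids))
    return (
        "SELECT f.fid, d.namespace, f.dkey "
        "FROM file f JOIN domain d ON d.dmid = f.dmid "
        f"WHERE f.fid IN ({id_list});"
    )
```

Feed `fids_by_event` the saved output of the fsck log, pick the event code you care about (for example `grouped["MISS"]`), and run the generated query against the tracker DB. With hundreds of thousands of fids you would want to split the list into batches.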

Not sure about the rest of your fsck questions below. These numbers
are /without/ my checksums changes, right?

> This is what my fsck status looks like now:
> mogadm fsck status
>
> Running: Yes (on lfvsfcp58.dn.net)
> Status: 5740115 / 287432015 (2.00%)
> Time: 265m (360 fids/s; 13025m remain)

360 fids/s seems really low. What's the network latency between your
trackers <-> storage nodes and trackers <-> DB?
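
For scale, here is a back-of-the-envelope check of that estimate, using only the numbers from the status output quoted above and assuming the rate stays constant (the helper name is just for illustration):

```python
def fsck_eta_minutes(checked, total, fids_per_sec):
    """Minutes left if fsck keeps scanning at a constant rate."""
    remaining = total - checked
    return remaining / fids_per_sec / 60


# Numbers taken from the "Status" and "Time" lines of the fsck status.
eta = fsck_eta_minutes(5740115, 287432015, 360)
print(f"{eta:.0f} minutes, about {eta / 1440:.1f} days")
```

That works out to roughly 13,000 minutes, or about nine days, at the current rate. The small gap to the reported 13025m is presumably down to rounding in the displayed rate.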

Adding checksums will make fsck much slower esp with large files, but
the current size-only checks should be much faster...

Maybe somebody else can pipe up, I've never had performance issues
with plain fsck...

> Check Type: Normal (check policy + files)
>
> [num_GONE]: 2
> [num_MISS]: 567716 (that's a LOT of missing copies that are supposed to
> exist?)
> [num_NOPA]: 2
> [num_POVI]: 709490 (replication policy violation)
> [num_REPL]: 709473 (FID has been scheduled for replication to fix a
> policy violation)
> [num_SRCH]: 2
>
> Fixing all these files manually is a HUGE no. Is fsck
> really supposed to run for 13025m?
