Re: the question of mogilefs version 2.55

dormando

Jan 11, 2012, 1:40:06 AM
to 施明, mogile
> We are now using MogileFS to store images.
>
> We upgraded MogileFS from 2.46 to 2.55 some days ago, and now we have a question (maybe a bug) about replication in the newest version:
>
> We changed the replication policy to "HostsPerNetwork(near=2,far=1)", and then replication was working.
>
> But the next day we found that replication had stopped, even though the replicate process still existed.
>
> We ran the following command to check the replication queue:
>
> mogstats --db_dsn="DBI:mysql:mfs:host=192.168.10.89" --db_use="root" --db_pass="sl11011" --verbose --stats="replication-queue"
>
> We found that the newfile count was always increasing, but the replicate jobs were not working.
>
> So we restarted mogilefsd, and the replicate jobs started to work. At this time, the disk utilization of the DB host increased to 98%; after about 30
> minutes, it decreased again.
>
> Then we changed the replication policy to "MultipleHosts()"; however, the problem continued.
>
> Now we want to know whether there is anything to be aware of when using version 2.55, or whether there are any known problems with MogileFS 2.55.
>
> Thanks in advance.

There might be a bug that causes the JobMaster worker to hang. If you see
replication stopping again, kill all the JobMaster workers once and see if
that unsticks it.

I should have another version out within a week with more fixes. Sorry
for the trouble this has caused :(

-Dormando

dormando

Jan 11, 2012, 1:55:57 AM
to 施明, mogile

On Wed, 11 Jan 2012, 施明 wrote:

> Thanks very much!
>
> If I kill all the JobMaster workers once, upload and display will not work.
>
> So is there any way to change the version from 2.55 back to 2.46?

You can edit the schema_version in the database and try downgrading.

Not sure what you did to cause upload to break though. The JobMaster
workers should automatically restart; I think you may have killed the
entire mogilefsd by accident.
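
For anyone who wants to restart only the worker children and leave the parent mogilefsd alone, a rough shell sketch follows. It assumes the workers show their job name in square brackets in the process list (something like `mogilefsd [job_master]`); that naming and the `worker_pids` helper are assumptions for illustration, so check your own `ps` output before killing anything.

```shell
# Hypothetical helper: read "PID COMMAND" lines on stdin and print the PIDs
# of mogilefsd processes whose command line contains the given job name in
# square brackets. Requiring "mogilefsd" as the command keeps the helper
# from matching its own awk process.
worker_pids() {
    awk -v job="[$1]" 'index($2, "mogilefsd") && index($0, job) { print $1 }'
}

# Possible usage (the job name is an assumption; verify it against ps first):
#   ps -eo pid,args | grep mogilefsd
#   for pid in $(ps -eo pid,args | worker_pids job_master); do kill "$pid"; done
```

The parent process has no bracketed job name in this scheme, so it is never selected and can respawn the workers.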

AleiPhoenix (A.K.A Areverie)

Jan 11, 2012, 2:18:27 AM
to mog...@googlegroups.com, 施明
and another question about JobMaster and workers' queue.

the queue looks like ...

Statistics for replication queue...
  status                      count
  -------------------- ------------
  deferred                      299
  newfile                   1087030
  overdue                         6
  -------------------- ------------

what does each status exactly mean?

What about general-queues? According to the documentation, the rebalance queue is supposed to show up there.

and one more thing.

I used to telnet the tracker's port and use `!watch` to monitor the rebalancing progress. Is there some way to monitor the replicating jobs?

thanks.
--
Silence is gold.

twitter: @areverie
wikipedia: AleiPhoenix
blog: weblog.areverie.org
wiki: wiki.areverie.org


dormando

Jan 11, 2012, 2:49:29 AM
to mog...@googlegroups.com
> and another question about JobMaster and workers' queue.
> the queue looks like ...
>
> Statistics for replication queue...
>   status                      count
>   -------------------- ------------
>   deferred                      299
>   newfile                   1087030
>   overdue                         6
>   -------------------- ------------
>
> what does each status exactly mean?

does that number ever go down? it looks like your replication is jammed
totally... "newfile" means they're due to replicate immediately. deferred
means not now, and overdue means they were rescheduled but waiting to be
run.

> What about general-queues? According to the documentation, the rebalance queue is supposed to show up there.

same.

> and one more thing.
>
> I used to telnet the tracker's port and use `!watch` to monitor the rebalancing progress. Is there some way to monitor the replicating jobs?

`mogadm rebalance status` to watch a rebalance progress.

mogstats showing the replication queue shows that progress... but if it's
not moving that's an issue. Does restarting all your mogilefsd's get it to
run at all, if it is hung?
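
As a rough way to tell a moving queue from a hung one, you could sample the `newfile` count twice and compare. The sketch below only parses the table format shown earlier in this thread; `queue_count`, `watch_newfile` and the `MOGSTATS_ARGS` variable are hypothetical names, not part of the MogileFS tools.

```shell
# Hypothetical helper: read "mogstats --stats=replication-queue" output on
# stdin and print the count for one status (newfile, deferred, overdue).
# Prints 0 if the status does not appear in the table.
queue_count() {
    awk -v status="$1" '$1 == status { print $2; found = 1 }
                        END { if (!found) print 0 }'
}

# Hypothetical usage: sample the newfile backlog twice, N seconds apart.
# MOGSTATS_ARGS stands in for your own database options (the --db_dsn style
# flags used earlier in the thread); it is left unquoted so it splits into words.
watch_newfile() {
    before=$(mogstats $MOGSTATS_ARGS --stats="replication-queue" | queue_count newfile)
    sleep "${1:-60}"
    after=$(mogstats $MOGSTATS_ARGS --stats="replication-queue" | queue_count newfile)
    echo "newfile: $before -> $after"
}
```

A `newfile` count that only ever grows between samples, while `deferred` and `overdue` stay flat, matches the hang described in this thread.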

AleiPhoenix (A.K.A Areverie)

Jan 11, 2012, 6:32:18 PM
to mog...@googlegroups.com
Yes, replication looks hung. The `newfile` count keeps growing, while the `overdue` and `deferred` counts decrease slowly or have stopped moving.

We're using 2 trackers: tracker 1 handles all app requests (with 1 jobmaster and 10 query workers) and tracker 2 handles all other jobs (including 1 replicate). And yes, restarting all `mogilefsd` processes on tracker 2 makes the queue move again, but after some time it gets stuck again.

dormando

Jan 11, 2012, 6:50:25 PM
to mog...@googlegroups.com
> Yes, replication looks hung. The `newfile` count keeps growing, while the `overdue` and `deferred` counts decrease slowly or have
> stopped moving.
> We're using 2 trackers: tracker 1 handles all app requests (with 1 jobmaster and 10 query workers) and tracker 2 handles all other jobs (including 1
> replicate). And yes, restarting all `mogilefsd` processes on tracker 2 makes the queue move again, but after some time it gets stuck again.

How long is "some time"? And if possible, can you hop into #mogilefs on
IRC? I've not been able to reproduce this issue, but have guesses on how
to fix it. Confirmation would help a lot.

Martijn Lina

Jan 13, 2012, 8:58:37 AM
to mog...@googlegroups.com
Hi,

We have had the same problem since version 2.55; only a restart of the mogile
tracker fixes it. Today i tried to investigate the problem but debugging
was switched off so i didn't get far. I tried to kill all child processes on
each tracker but that didn't help...

This happens every one or two weeks, i didn't have the time to get into it. Now
debugging is switched on on all trackers, so i might get back with more details
in a few weeks.

The same problem might occur with FSCK, which doesn't start FSCK'ing until the
trackers are being restarted. I mailed about this in November when i thought it
had to do with fids with length=0 ;-) It looks more like the replication
problem. I was wondering if maybe a stale mysql connection in the main tracker
daemon process could be the cause of this?

Martijn.


AleiPhoenix (A.K.A Areverie)

Jan 13, 2012, 11:05:15 AM
to mog...@googlegroups.com
Sorry for the late reply.

I think we should hop on IRC to deal with the problem if this will help :)

min...@anjuke.com, who is looking into the problem, is working with me.

Given the timezone difference, what time would work for you?

Thanks!

dormando

Jan 13, 2012, 11:33:31 AM
to mog...@googlegroups.com
I'm in US pacific time... anytime during the day or evening I guess.

dormando

Jan 16, 2012, 2:22:50 PM
to mog...@googlegroups.com
For you (and anyone else having this issue!)

Can you try applying the attached patch to 2.55?

patch -p1 < 2.55-queuefix.patch

... then please let me know soon if your queues stop hanging! If not,
there's another patch to try, but I figured I'd *fix* the logic first :/

thanks,
-Dormando

2.55-queuefix.patch

Martijn Lina

Jan 17, 2012, 4:16:57 AM
to mog...@googlegroups.com
Thanks for the patch. I'm currently in a deadlock situation but i don't know
how to investigate this problem. We really have to get replicating, but i can
wait a bit more for your reply (two hours). If i can help by sending you some
innodb status info, please tell me what you need. I've stored "SHOW ENGINE
INNODB STATUS" output, but i'm not sure if that alone is usable to you.

bye,
Martijn.

Once upon a 16 Jan 2012, dormando hit keys in the following order:

> diff --git a/lib/MogileFS/Store.pm b/lib/MogileFS/Store.pm
> index a70f49f..bbd52d3 100644
> --- a/lib/MogileFS/Store.pm
> +++ b/lib/MogileFS/Store.pm
> @@ -1663,12 +1663,13 @@ sub grab_queue_chunk {
>          $dbh->do("UPDATE $queue SET nexttry = $ut + 1000 WHERE fid IN ($fidlist)");
>          $dbh->commit;
>      };
> -    $self->unlock_queue($queue);
>      if ($self->was_deadlock_error) {
>          eval { $dbh->rollback };
> -        return ();
> +        $work = undef;
> +    } else {
> +        $self->condthrow;
>      }
> -    $self->condthrow;
> +    $self->unlock_queue($queue);
>
>      return defined $work ? values %$work : ();
>  }

dormando

Jan 17, 2012, 12:55:54 PM
to mog...@googlegroups.com
SHOW ENGINE INNODB STATUS is actually good enough. That plus an strace
would be ideal.

Did you try the patch? Does it work?

Martijn Lina

Jan 18, 2012, 6:24:00 AM
to mog...@googlegroups.com
The patch works perfectly, we have it in production. We'll have to wait for one
or two weeks to see if the deadlocks won't happen again.

I don't have an strace i'm afraid. I would have to start up each of our six
mogilefsd's with strace and capture the output, right? That's going to give a lot
of data in 10 days i'd think, so we can't do that.

bye
Martijn.


tariq wali

Jan 18, 2012, 7:22:03 AM
to mog...@googlegroups.com
What exactly does this patch address? We have issues with rebalance/fsck too.
--
Tariq Wali.

dormando

Jan 18, 2012, 12:54:34 PM
to mog...@googlegroups.com
> The patch works perfectly, we have it in production. We'll have to wait for one
> or two weeks to see if the deadlocks won't happen again.

How long do they usually take to show up? I was hoping to cut the fix
sooner.

> I don't have an strace i'm afraid. I would have to start up each of our six
> mogilefsd's with strace and capture the output, right? That's going to give a lot
> of data in 10 days i'd think, so we can't do that.

You only need to take an strace for 10-30 seconds after it's been hung.
ie; replication stops working, then you strace one of the jobmasters for a
few seconds, then you give me that trace. Make sure you add some options
though, like -s 999 or one of the timing ones.
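
A short capture like that could be scripted roughly as follows. This is only a sketch: `trace_worker` is a hypothetical helper, the default output path is an arbitrary choice, and the PID is whatever `ps` shows for the stuck jobmaster on your tracker. It relies on the standard `strace` options `-p`, `-f`, `-tt`, `-T`, `-s` and `-o`, plus coreutils `timeout` to stop the trace after a few seconds.

```shell
# Hypothetical helper: attach strace to an already-hung process for a short
# window, then detach and print the path of the trace file.
#   $1 = PID to trace, $2 = seconds to trace (default 30), $3 = output file
trace_worker() {
    pid=$1
    seconds=${2:-30}
    out=${3:-/tmp/mogilefsd-$pid.strace}
    # -f follows child processes, -tt adds wall-clock timestamps,
    # -T shows time spent in each syscall, and -s 999 keeps long strings
    # (such as SQL statements) from being truncated.
    timeout "$seconds" strace -f -tt -T -s 999 -p "$pid" -o "$out"
    echo "$out"
}

# Possible usage, run as root or as the user that owns mogilefsd:
#   trace_worker 12345 20
```

Because the trace only runs for the chosen window after the hang has already happened, the output stays small enough to attach to a mail.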

thanks!

dormando

Jan 18, 2012, 12:54:54 PM
to mog...@googlegroups.com
It addresses replication/fsck/etc not moving until you restart trackers.

dormando

Jan 20, 2012, 9:19:49 PM
to mog...@googlegroups.com
Hey,

I know a few of you were seeing tracker hangs; has anyone tested this
patch yet and seen any relief from the crash?

Please? :)

Thanks!

tariq wali

Jan 23, 2012, 9:04:23 AM
to mog...@googlegroups.com
Hi, I think I may have accidentally deleted the mail that contained the patch attachment. Can you please share it again?
--
Tariq Wali.

dormando

Jan 26, 2012, 7:44:34 PM
to 施明, mogile, 陈磊, lzh...@anjuke.com
Hey folks,

It looks like the fix I have in the tree is indeed the correct fix for
this.

I need to do a little more work and cut the tree. Sorry for the delay.

On Fri, 13 Jan 2012, 施明 wrote:

> Hi,
>
> I'm so sorry to disturb you.
>
> We have now found another problem:
>
> When the replication queue is finished, the replicate job stops working, although the replicate process and the fsck process still exist. At
> this point, I can't run commands such as "mogadm fsck status".
>
> Sometimes we have to restart mogilefsd to make replication work again.
>  
>
> ____________________________________________________________________________________________________________________________________________________
> 施明


>  

Martijn Lina

Jan 27, 2012, 4:58:53 AM
to mog...@googlegroups.com
hi Dormando,

We've been running for eleven days with this patch so i think it's safe to say
that your fix works perfectly!


many thanks,

Martijn Lina.


dormando

Jan 27, 2012, 3:01:52 PM
to mog...@googlegroups.com
Thank you! Someone else verified the fix as well. I just need to fix
something else sitting in the tree, and it can go out.

dormando

Jan 29, 2012, 2:03:26 AM
to 施明, mogile, 陈磊, lzhang
Hey,

Mogilefs-Server 2.56 is out on CPAN, which includes the patch. You may
just install it as normal.

On Sun, 29 Jan 2012, 施明 wrote:

> So great!
>
> 1. Where can I download the patch, or which one of the updates at 'http://code.google.com/p/mogilefs/updates/list' fixes this?
>
> 2. How do I install the patch, and does it have any impact on the production environment?
>
> Thanks.
>  
> __/__/__/__/__/__/__/__/__/__/__/__/__/__/__/__/
>  
> I must work hard so that I can afford a 20-yuan boxed lunch
>  
> 施明
> SA of Sysdev Team
> Mail:min...@anjuke.com
> Tel:021-61821159-8329
> __/__/__/__/__/__/__/__/__/__/__/__/__/__/__/__/
