There might be a bug that causes the JobMaster worker to hang. If you see
replication stopping again, kill all the JobMaster workers once and see if
that unsticks it.
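
A minimal sketch of "kill all the JobMaster workers once", assuming the workers appear in the process list with a title containing `[job_master]` (an assumption about how mogilefsd names its children; check the `ps` output on your tracker first). The helper name `jobmaster_pids` is made up for illustration. It reads "PID COMMAND" lines on stdin and selects only the workers, so the parent mogilefsd is never selected:

```shell
#!/bin/sh
# Print the PIDs of JobMaster workers from "PID COMMAND" lines on stdin.
# Assumption: worker titles contain "[job_master]"; the parent daemon's does not.
jobmaster_pids() {
    awk '/\[job_master\]/ { print $1 }'
}

# Usage on a tracker host (commented out so nothing is killed by accident):
#   for pid in $(ps -e -o pid= -o args= | jobmaster_pids); do kill "$pid"; done
```

Filtering on the worker title instead of the daemon name is meant to avoid taking down the whole mogilefsd by mistake.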
I should have another version out within a week with more fixes. Sorry
for the trouble this has caused :(
-Dormando
On Wed, 11 Jan 2012, 施明 wrote:
> thanks very much!
>
> if i kill all the JobMaster workers once, the upload and display will not work.
>
> so is there any method to change the version from 2.55 to 2.46
You can edit the schema_version in the database and try downgrading.
Not sure what you did to cause upload to break though. The JobMaster
workers should automatically restart; I think you may have killed the
entire mogilefsd by accident.
does that number ever go down? it looks like your replication is jammed
totally... "newfile" means they're due to replicate immediately. deferred
means not now, and overdue means they were rescheduled but waiting to be
run.
> What about general-queues? According to the documentation, the rebalance queue is supposed to show up there.
same.
> and one more thing.
>
> I used to telnet the tracker's port and use `!watch` to monitor the rebalancing progress. Is there some way to monitor the replicating jobs?
`mogadm rebalance status` to watch a rebalance progress.
mogstats showing the replication queue shows that progress... but if it's
not moving that's an issue. Does restarting all your mogilefsd's get it to
run at all, if it is hung?
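
A rough sketch of the "is it moving?" check, assuming you sample the pending count from the replication queue twice, a few minutes apart. How you extract that number from the mogstats output is left out here, and the function name `queue_looks_stuck` and the sample numbers are made up for illustration:

```shell
#!/bin/sh
# Succeed (exit 0) when the queue looks jammed: there is still work pending
# and the count did not drop between two samples.
queue_looks_stuck() {
    prev=$1
    curr=$2
    [ "$curr" -gt 0 ] && [ "$curr" -ge "$prev" ]
}

# Example: two samples taken a few minutes apart (numbers are made up).
if queue_looks_stuck 5200 5350; then
    echo "replication queue is not draining"
fi
```

This is only a coarse signal: on a busy cluster, new uploads can push the count up even while replication is running, so compare several samples before concluding that it is hung.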
How long is "some time"? And if possible, can you hop into #mogilefs on
IRC? I've not been able to reproduce this issue, but have guesses on how
to fix it. Confirmation would help a lot.
We have had the same problem since version 2.55; only a restart of the mogile
tracker fixes it. Today I tried to investigate the problem, but debugging
was switched off so I didn't get far. I tried to kill all child processes on
each tracker but that didn't help...
This happens every one or two weeks, i didn't have the time to get into it. Now
debugging is switched on on all trackers, so i might get back with more details
in a few weeks.
The same problem might occur with FSCK, which doesn't start FSCK'ing until the
trackers are restarted. I mailed about this in November when I thought it
had to do with fids with length=0 ;-) It looks more like the replication
problem. I was wondering if maybe a stale mysql connection in the main tracker
daemon process could be the cause of this?
Martijn.
Once upon a 12 Jan 2012, dormando hit keys in the following order:
Can you try applying the attached patch to the 2.55?
patch -p1 < 2.55-queuefix.patch
... then please let me know soon if your queues stop hanging! If not,
there's another patch to try, but I figured I'd *fix* the logic first :/
thanks,
-Dormando
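
For anyone applying the patch by hand, here is a cautious sketch that does a dry run first, assuming GNU patch (for `--dry-run`) and that the command is run from the root of an unpacked 2.55 source tree. The function name `apply_checked` and the directory name in the usage line are hypothetical:

```shell
#!/bin/sh
# Apply a patch only if a dry run succeeds first.
# $1 = source tree root, $2 = patch file (absolute path).
apply_checked() {
    tree=$1
    patchfile=$2
    (
        cd "$tree" || exit 1
        # --dry-run (GNU patch) reports problems without touching any file.
        patch -p1 --dry-run < "$patchfile" >/dev/null || exit 1
        patch -p1 < "$patchfile"
    )
}

# Usage (directory name is hypothetical):
#   apply_checked ./MogileFS-Server-2.55 "$PWD/2.55-queuefix.patch"
```

The trackers would still need a restart afterwards to load the patched code.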
bye,
Martijn.
Once upon a 16 Jan 2012, dormando hit keys in the following order:
> diff --git a/lib/MogileFS/Store.pm b/lib/MogileFS/Store.pm
> index a70f49f..bbd52d3 100644
> --- a/lib/MogileFS/Store.pm
> +++ b/lib/MogileFS/Store.pm
> @@ -1663,12 +1663,13 @@ sub grab_queue_chunk {
>          $dbh->do("UPDATE $queue SET nexttry = $ut + 1000 WHERE fid IN ($fidlist)");
>          $dbh->commit;
>      };
> -    $self->unlock_queue($queue);
>      if ($self->was_deadlock_error) {
>          eval { $dbh->rollback };
> -        return ();
> +        $work = undef;
> +    } else {
> +        $self->condthrow;
>      }
> -    $self->condthrow;
> +    $self->unlock_queue($queue);
>
>      return defined $work ? values %$work : ();
>  }
Did you try the patch? Does it work?
I don't have an strace, I'm afraid. I would have to start up each of our six
mogilefsd's with strace and capture the output, right? That's going to give a lot
of data in 10 days I'd think, so we can't do that.
bye
Martijn.
Once upon a 17 Jan 2012, dormando hit keys in the following order:
How long do they usually take to show up? I was hoping to cut the fix
sooner.
> I don't have an strace, I'm afraid. I would have to start up each of our six
> mogilefsd's with strace and capture the output, right? That's going to give a lot
> of data in 10 days I'd think, so we can't do that.
You only need to take an strace for 10-30 seconds after it's been hung.
ie; replication stops working, then you strace one of the jobmasters for a
few seconds, then you give me that trace. Make sure you add some options
though, like -s 999 or one of the timing ones.
thanks!
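
A sketch of the short capture described above, assuming `strace` and coreutils `timeout` are installed on the tracker. The choice of `-tt` and `-T` as the "timing ones" is an assumption, as are the function name `trace_worker`, the output file name, and the PID in the example:

```shell
#!/bin/sh
# Trace one hung worker for a bounded time and save the result to a file.
# Set DRY_RUN=1 to print the command instead of running it.
trace_worker() {
    pid=$1
    secs=${2:-30}
    ${DRY_RUN:+echo} timeout "$secs" strace -tt -T -s 999 \
        -p "$pid" -o "jobmaster-$pid.strace"
}

# Preview the command for a (made-up) PID without attaching to anything.
DRY_RUN=1 trace_worker 12345 20
```

Without DRY_RUN set, the function attaches to the given PID, stops after the timeout, and leaves a small file that can be mailed to the list.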
I know a few of you were seeing tracker hangs; has anyone tested this
patch yet and seen any relief from the crash?
Please? :)
Thanks!
It looks like the fix I have in the tree is indeed the correct fix for
this.
I need to do a little more work and cut the tree. Sorry for the delay.
On Fri, 13 Jan 2012, 施明 wrote:
> hi
>
> I'm so sorry to disturb you.
>
> Now we have found another problem:
>
> When the replication queue has finished, the replicate job will not work, although the replicate process and the fsck process still exist. At
> this time, I can't execute commands such as "mogadm fsck status".
>
> Sometimes we must restart mogilefsd to make replication work again.
>
>
> ____________________________________________________________________________________________________________________________________________________
> 施明
>
> From: dormando
> Date: 2012-01-11 14:40
> To: 施明
> CC: mogile
> Subject: Re: the question of mogilefs version 2.55
> > We are now using MogileFS to store images.
> >
> > We upgraded MogileFS from 2.46 to 2.55 some days ago, and now we have a question (maybe a bug) about replication in the newest version:
> >
> > We changed the replication policy to "HostsPerNetwork(near=2,far=1)", and then replication was working.
> >
> > But the next day we found that replication had stopped, although the replicate process still existed.
> >
> > We ran the following command to check the replication queue:
> >
> > mogstats -db_dsn="DBI:mysql:mfs:host=192.168.10.89" -db_use="root" -db_pass="sl11011" -verbose -stats="replication-queue"
> >
> > We found that newfile kept increasing, but the replication jobs were not running.
> >
> > So we restarted mogilefsd, and the replication jobs started to work. At this time, the disk utilization of the DB host increased to 98%; after about 30
> > minutes, it decreased again.
> >
> > Then we changed the replication policy to "MultipleHosts()"; however, the problem continued.
> >
> > Now we want to know whether there are any caveats when using version 2.55, or whether there are any known problems with MogileFS 2.55.
We've been running for eleven days with this patch so i think its safe to say
that your fix works perfectly!
many thanks,
Martijn Lina.
Once upon a 27 Jan 2012, dormando hit keys in the following order:
Mogilefs-Server 2.56 is out on CPAN, which includes the patch. You may
just install it as normal.
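
A small sketch for checking that a tracker really ended up on 2.56 or later after the upgrade. It assumes GNU `sort -V` is available and that the installed module reports its version through `MogileFS::Server->VERSION` (an assumption worth verifying on your install). The helper name `version_at_least` is made up:

```shell
#!/bin/sh
# Succeed when version $1 is at least version $2 (relies on GNU "sort -V").
version_at_least() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Usage on a tracker host (commented out; needs the module installed):
#   cpan MogileFS::Server
#   installed=$(perl -MMogileFS::Server -e 'print MogileFS::Server->VERSION')
#   version_at_least "$installed" 2.56 && echo "queue fix is included"
```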
On Sun, 29 Jan 2012, 施明 wrote:
> so great!
>
> 1. Where can I download the patch, or which update at 'http://code.google.com/p/mogilefs/updates/list' fixes this?
>
> 2. How do I install the patch, and will it have any impact on the production environment?
>
> thanks.
>
> __/__/__/__/__/__/__/__/__/__/__/__/__/__/__/__/
>
> I want to work hard so that I can afford a 20-yuan boxed lunch
>
> 施明
> SA of Sysdev Team
> Mail:min...@anjuke.com
> Tel:021-61821159-8329
> __/__/__/__/__/__/__/__/__/__/__/__/__/__/__/__/