replication issues after maintenance.

Patsie

Sep 30, 2009, 5:35:35 AM
to mogile
Setup sketch: 2 trackers, 2 master-master MySQL DBs, 6 storage nodes
each with 1 big device and nginx as WebDAV backend.

I just performed maintenance on a mogstored host. I first made the
device readonly, then (after making sure no data lived only on this
device) marked the host offline, did my thing, and brought everything
back up (host alive, device alive).
Everything looks okay and most of it works, but when this host/device
is the first device a new file is put on, the file won't get replicated
to another host/device. The host does still receive replicas from other
hosts. Every file first put on this host gets a 'policy_no_suggestion'
when the tracker tries to replicate it elsewhere.
So now I have a swag of files in my file_to_replicate table, all from
this one host/device, which won't get replicated.

Any suggestions on how to find out why there is no policy suggestion
for files specifically from this host/device?

Patsie

Oct 1, 2009, 2:46:54 AM
to mogile

On Sep 30, 11:35 am, Patsie <patrick.l...@spilgames.com> wrote:
> Any suggestions on how to find out why there is no policy suggestion
> for files specifically from this host/device?

After restarting mogstored and nginx on this host, nothing changed and
the 'policy_no_suggestion's kept coming for files from this host
specifically. But when I restarted both trackers, replication went
fine again. I'll see if I can reproduce the error. It's not something
I'd like to hit on every controlled maintenance downtime.

dormando

Oct 4, 2009, 10:21:52 PM
to mogile
Yo,

I think there's an obscure bug in the replicator workers where the device
status gets wedged as "read only" or "down" and the workers stop
refreshing it. If you see the issue again, try "!want 0 replicate" and
then "!want n replicate", where 'n' is your normal process count. Run
!stats in between to see when the replicate workers have all died off,
before restarting them.
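
For anyone who wants to script that sequence, here is a rough Python sketch of it. It is only an illustration under a few assumptions: that the "!" commands are plain text lines sent to the tracker's TCP port, that replies end with a lone "." line, and that you supply your own check for "no replicate workers left", since the !stats output format isn't shown in this thread. The names tracker_command, restart_replicators and workers_gone are made up for the example.

```python
import socket
import time


def tracker_command(host, port, command, timeout=5.0):
    """Send one management command to a tracker and return the reply lines.

    Assumption: the reply is line-based and ends with a lone "." line.
    If no terminator arrives, reading stops when the socket times out.
    """
    lines = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall((command + "\r\n").encode("ascii"))
        with sock.makefile("r", encoding="ascii", errors="replace") as reader:
            try:
                for raw in reader:
                    line = raw.strip()
                    if line == ".":
                        break
                    lines.append(line)
            except socket.timeout:
                pass  # no terminator seen; return what was read so far
    return lines


def restart_replicators(send, workers_gone, count,
                        poll_interval=1.0, max_polls=60, sleep=time.sleep):
    """Drain the replicate workers, wait until they are gone, start them again.

    send:         callable taking a command string, returning reply lines
    workers_gone: caller-supplied predicate over the "!stats" reply lines
                  (hypothetical; adapt it to what your tracker prints)
    count:        the normal number of replicate workers
    """
    send("!want 0 replicate")
    for _ in range(max_polls):
        if workers_gone(send("!stats")):
            break
        sleep(poll_interval)
    else:
        raise TimeoutError("replicate workers did not exit in time")
    send("!want %d replicate" % count)
```

You would call it once per tracker, for example with send=lambda cmd: tracker_command("tracker1.example", 7001, cmd), where the hostname is a placeholder and 7001 is assumed to be the port your tracker listens on.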

If that fixes the issue, it again shows that I need to find that bug :P
It hits us in very rare situations but it's obnoxious.

-Dormando

Patsie

Oct 26, 2009, 10:37:54 AM
to mogile
Replication failed again 3 days ago (about 3 weeks after the last
time). No maintenance or downtime was scheduled; it just stopped with
the same 'policy_no_suggestion' errors. Disk space aplenty, all
trackers were up, as were all storage nodes. They were all reachable,
readable and writeable (got a check script for that).
A '!want 0 replicate' and '!want 5 replicate' worked like a charm, but
it's obviously a workaround for a bigger problem.
Just an F.Y.I. :)

Regards,

Patrick

dormando

Nov 22, 2009, 2:51:30 AM
to mogile
I just pushed a change to trunk (r1360) which may fix this issue. Please
try out trunk if you can... or wait until the next release is out.