Missing files, replication not progressing, policy violations never getting corrected - how to clean up the mess?


Jon

Nov 2, 2009, 11:36:56 AM
to mogile
How can I fix my file system??

I have been using MogileFS in production and in development for the past 6
months and I'm still struggling with stability. As I have been adding
storage nodes I have enabled rebalance, and for disk space usage it
appears to have worked as expected. I have also taken 6 different
hard drives out of the setup, in each case by first issuing a drain, then
removing the disk 2 weeks later. I have also changed the replication
policy on several occasions.

What I have ended up with is files completely gone, tons and tons of
policy violations never getting corrected, and severe load issues from
trackers running wild without prioritization, requesting the same 404 over
and over again (example). I have had trackers try and try on devices
clearly marked as DOWN, and even DEAD, halting all other constructive
tasks. A tracker has tried to read a file, gotten a 404 in return, just to
try again the next second. Tons and tons of errors related to MySQL
slaves getting overloaded.

What I have done to try and fix this:
- Disabled rebalance (helped A LOT on MySQL load)
- Disabled drain (also helped a lot on MySQL load)
- Increased the replication count for classes
- Reset the fsck log and started it again
- Manually removed all entries in the MySQL database table
file_to_replicate where the fromdevid is no longer alive
- All read and load-balancing operations are now done in my client
application directly against MySQL, to avoid any and all unneeded
traffic to the trackers
- Added retries in my application if the tracker returns an error due to
MySQL load

My file_to_queue table is completely empty, and my fsck_log table has
570000 entries. My file_to_replicate table is just growing, and no
files seem to actually replicate. This is 2 weeks after first
initiating the fsck reset/start.

My setup is 2 master-master replicated MySQL servers, 2 trackers on
version 2.30 pointing to the same MySQL server, 1 tracker running 2.31
pointing to the other server, and 10 mogstored nodes.

fsck status:

[num_BLEN]: 13
[num_GONE]: 795
[num_NOPA]: 787
[num_POVI]: 6971
[num_REPL]: 6964
[num_SRCH]: 793

stats:

domain.com class1 1 1
domain.com class1 2 137
domain.com class1 3 406
domain.com class1 4 32
domain.com class2 0 113
domain.com class2 1 52
domain.com class2 2 3466
domain.com class2 3 1926
domain.com class2 4 23963
domain.com class2 5 790
domain.com class2 6 4
domain.com thumbnail 0 406
domain.com thumbnail 1 60
domain.com thumbnail 3 196779
domain.com thumbnail 4 424
domain.com thumbnail 5 24
domain.com class3 0 13
domain.com class3 1 2
domain.com class3 4 12701
domain.com class3 5 80
domain.com class3 6 4

classes and replication policy:

domain.com class1 3
domain.com class2 4
domain.com thumbnail 3
domain.com class3 4


My limitations:

- I have all the different storage nodes in different data centers, so
traffic goes over the internet (IP-restricted firewall)
- I have two servers behind NAT, which makes the tracker time out
on requests to localhost. The same error results in a timeout for
localhost if I use mogadm --trackers=outside.server.com:7001 check

My goal:
- Create a distributed file system for use as content delivery network

Jon

Nov 8, 2009, 5:41:22 AM
to mogile
I also tried upgrading to 2.32, but I keep getting this error at the
client side when I'm trying to inject a file:

ERR size_verify_error Expected: -1; actual: 0 (missing); path:
http://server1:7500/dev10/0/000/404/0000404529.fid; error: Job
queryworker has only 0, wants 10, making 10.

Jon

Nov 8, 2009, 10:30:44 AM
to mogile
I did 'mogadm fsck reset', and then 'mogadm fsck start'. This resulted
in all files I attempt to upload getting this error:

MogileFS::doRequest ERR no_temp_file No tempfile or file already
closed

dormando

Nov 13, 2009, 4:47:42 AM
to mogile
> I have been using MogileFS in production and in development for the past 6
> months and I'm still struggling with stability. As I have been adding
> storage nodes I have enabled rebalance, and for disk space usage it
> appears to have worked as expected. I have also taken 6 different
> hard drives out of the setup, in each case by first issuing a drain, then
> removing the disk 2 weeks later. I have also changed the replication
> policy on several occasions.

Are you marking the devices as 'dead' before removing them?

> What I have ended up with is files completely gone, tons and tons of
> policy violations never getting corrected, and severe load issues from
> trackers running wild without prioritization, requesting the same 404 over
> and over again (example). I have had trackers try and try on devices
> clearly marked as DOWN, and even DEAD, halting all other constructive
> tasks. A tracker has tried to read a file, gotten a 404 in return, just to
> try again the next second. Tons and tons of errors related to MySQL
> slaves getting overloaded.

This doesn't make much sense... can you post your mogilefsd.conf? Smells
like you might need to set 'rebalance_ignore_missing = 1'.

> What I have done to try and fix this:
> - Disabled rebalance (helped A LOT on mysql load)

Rebalance is broken, you shouldn't use it.

> - Disabled drain (also helped a lot on MySQL load)

rebalance_ignore_missing = 1 should stop this from spinning.

> - Increased replication count for classes

This doesn't automatically change anything; you have to run an fsck first.

> - Reset fsck log, and start it again
> - Manually removed all entries in mysql database table
> file_to_replicate where fromdevid no longer is alive
> - Any read and load balancing operation now done in my client
> application directly against mysql, to avoid any and all unneeded
> traffic to tracker

Oi. Bad bad bad.

> - Added retries in my application if tracker returns an error due to
> mysql load

Not as bad.

> My file_to_queue table is completely empty, and my fsck_log table has
> 570000 entries. My file_to_replicate table is just growing, and no
> files seem to actually replicate. This is 2 weeks after first
> initiating the fsck reset/start.
>
> My setup is 2 master-master replicated MySQL servers, 2 trackers on
> version 2.30 pointing to the same MySQL server, 1 tracker running 2.31
> pointing to the other server, and 10 mogstored nodes.

To confirm, do you have:

Databases 'A' and 'B', in master<->master replicated pairs.

Tracker 1 (2.30) -> talks to database A
Tracker 2 (2.30) -> talks to database A
Tracker 3 (2.31) -> talks to database B

? If so, please stop *immediately*. *ALL* trackers *MUST* point to the
same database, at all times. It will absolutely not work correctly with
them split.

If I misunderstood what you just described, correct me and submit what
your tracker configs are. That should give us an idea of what's going on.

> - I have all the different storage nodes in different data centers, so
> traffic goes over the internet (IP-restricted firewall)

Are you using a custom replication policy? How are you ensuring files go
everywhere?

> - I have two servers behind NAT, which makes the tracker time out
> on requests to localhost. The same error results in a timeout for
> localhost if I use mogadm --trackers=outside.server.com:7001 check

I'm not sure I understand what's going on here.

> My goal:
> - Create a distributed file system for use as content delivery network

MFS can definitely help with something like this, but it seems like you've
stretched it the wrong way. Let's start with the above and move from
there...

-Dormando

Jon

Nov 13, 2009, 10:13:21 AM
to mogile
Thank you so much for taking the time to go through my questions!

> Are you marking the devices as 'dead' before removing them?

Yes I did, after I put drain on them for a week or so.

mogilefsd.conf:

#Tracker 1,2
db_dsn =
DBI:mysql:mogilefs:host=sql2.domain.com;port=3306;mysql_connect_timeout=10;mysql_ssl=1;mysql_compression=1
#Tracker 3:
db_dsn = DBI:mysql:mogilefs:host=127.0.0.1

db_user = xxx
db_pass = xxx
listen = 0.0.0.0:7001
conf_port = 7001
listener_jobs = 10
delete_jobs = 1
replicate_jobs = 5
mog_root = /mnt/mogilefs
reaper_jobs = 1

Not knowing the default setting for rebalance_ignore_missing, I'm
assuming it is 0. (I have added the line 'rebalance_ignore_missing = 1'
now.)

"Rebalance is broken, you shouldn't use it." Noted. But what exactly is
broken? It seemed to fill newly inserted boxes/devices when I enabled it
before.

I can see how the ideal case is to use the tracker for all file-related
tasks, but I was having problems with persistent sockets and load, not
to mention the extra latency in my case with a MySQL server located in
another datacenter. Why is extracting the paths for a specific key
directly from the database so horrible? (Or for a full set of keys, to
minimize latency issues.)

You were correct about the setup; since posting I have changed the
version to 2.32 from svn.

So, for reference:

Databases 'A' and 'B', in master<->master replicated pairs.

Tracker 1 (2.32) -> talks to database A
Tracker 2 (2.32) -> talks to database A
Tracker 3 (2.32) -> talks to database B

(I changed all trackers to point to database B now)

Do I need to write failover logic in the tracker itself and switch all
trackers to database B in case A goes down? Or what's the optimal way
to do this?

I am not using any custom replication policy at the moment, just
devcount with classes.

Behind NAT:
If I bind mogstored to 0.0.0.0:7001 and issue "mogadm
--trackers=my.tracker.com:7001 check" (from localhost), then I get "[ 2]
server1 ... REQUEST FAILURE" under "checking hosts...", regardless of
which tracker I use.
If I issue the same mogadm command from any other machine, it connects
fine.

Thanks again for helping me :)

Kiall Mac Innes

Nov 13, 2009, 11:43:43 AM
to mog...@googlegroups.com
Re the database failover,

We use heartbeat and DRBD for MySQL HA. I can't imagine this working very well over high-latency links though!

Re the NAT,

I'm thinking your DNS may be at fault. I'm guessing that using split DNS like so will solve the issue.

[Server 1] --- [ Gateway 1 ] ----------- [ Gateway 2 ] --- [Server 2]

Assuming:
Gateway 1 has an external IP of 10.1.1.1
Gateway 2 has an external IP of 10.2.2.2

Server 1 has an internal IP of 10.3.3.3
Server 2 has an internal IP of 10.4.4.4

From Server 1's shell:
ping server1.domain.com -- This should resolve to 10.3.3.3
ping server2.domain.com -- This should resolve to 10.2.2.2

From Server 2's shell:
ping server1.domain.com -- This should resolve to 10.1.1.1
ping server2.domain.com -- This should resolve to 10.4.4.4
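For concreteness, this split-horizon scheme could be approximated with per-server /etc/hosts entries rather than a real split DNS setup (addresses taken from the example above; adjust to your actual networks):

```
# /etc/hosts on Server 1 -- itself via the internal IP, the peer via its gateway
10.3.3.3   server1.domain.com
10.2.2.2   server2.domain.com

# /etc/hosts on Server 2
10.1.1.1   server1.domain.com
10.4.4.4   server2.domain.com
```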

NAT should not alter the port, e.g. NAT port 7500 from the outside to 7500 on the inside.

Don't forget to NAT (and firewall) all ports - so tracker and storage ports...

If I'm way off or if this is already done ... ignore me !

Thanks,
Kiall



Jon

Nov 13, 2009, 1:38:07 PM
to mogile
Thanks Kiall - you were right about the DNS! I solved it by editing
my /etc/hosts file and having the name resolve to 127.0.0.1 - not sure
how that affects replication though, if the tracker tries to use
127.0.0.1 as the URI for replication.

For the mogstored and tracker nodes that I access across the internet I
use an IP-restricted rule in iptables. Optimal would be a reverse
proxy or similar in front to demand login credentials, but the IP rule
is a quick fix.

dormando

Nov 13, 2009, 9:15:15 PM
to mogile
You should have hosts configured in the tracker as IP addresses.. so it
doesn't matter what's in /etc/hosts anywhere.

dormando

Nov 13, 2009, 9:23:36 PM
to mogile
> #Tracker 1,2
> db_dsn =
> DBI:mysql:mogilefs:host=sql2.domain.com;port=3306;mysql_connect_timeout=10;mysql_ssl=1;mysql_compression=1
> #Tracker 3:
> db_dsn = DBI:mysql:mogilefs:host=127.0.0.1
>
> db_user = xxx
> db_pass = xxx
> listen = 0.0.0.0:7001
> conf_port = 7001
> listener_jobs = 10
> delete_jobs = 1
> replicate_jobs = 5
> mog_root = /mnt/mogilefs
> reaper_jobs = 1

Is this different on any of your hosts, besides the above dsn line?

How have you configured your classes? what is your mindevcount everywhere?
How many hosts/devices do you have?

> Not knowing the default setting for rebalance_ignore_missing, I'm
> assuming it is 0. (I have added the line 'rebalance_ignore_missing = 1'
> now.)

Yeah.

> "Rebalance is broken, you shouldn't use it." Noted. But what exactly is
> broken? It seemed to fill newly inserted boxes/devices when I enabled it
> before.

There're a lot of cases where it'll just spin and waste CPU. Drain has
similar issues, but less of them. MogileFS will prefer hosts that are more
empty and more idle for all operations. So usually it's best to just pick
the hosts you think could benefit from having fewer files and drain them
for a while.

> I can see how the ideal case is to use the tracker for all file-related
> tasks, but I was having problems with persistent sockets and load, not
> to mention the extra latency in my case with a MySQL server located in
> another datacenter. Why is extracting the paths for a specific key
> directly from the database so horrible? (Or for a full set of keys, to
> minimize latency issues.)

Because the tracker has intelligence and options in the way it returns
paths. Doing that also tells me you probably don't understand the problem
so well. Either your tracker is mildly broken or you're not querying it
correctly.

You should be fetching all paths by passing a pathcount argument, and also
passing 'noverify=1', when calling get_paths. Then MogileFS uses its io
monitoring to determine which paths are safe to return, and what order is
ideal for the IO load.

noverify=1 tells MogileFS not to run HEAD requests against each path
before returning them. That's probably most of your slowdown, since those
will go over the internuts.
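For reference, the tracker speaks a simple line-based query-string protocol, so the effect of pathcount/noverify is easy to see on the wire. A rough Python sketch of building such a request and parsing the reply (the helper names are mine, not part of any client library; in practice you'd use a real MogileFS client):

```python
from urllib.parse import urlencode, parse_qs

def build_get_paths_request(domain, key, pathcount=4, noverify=True):
    """Build a raw tracker request line for get_paths.

    The argument names (domain, key, pathcount, noverify) follow the
    tracker's query-string protocol; the line would be sent over a TCP
    socket to the tracker port (7001 by default).
    """
    args = {"domain": domain, "key": key, "pathcount": pathcount}
    if noverify:
        args["noverify"] = 1  # skip the per-path HEAD verification
    return "get_paths " + urlencode(args) + "\r\n"

def parse_paths_response(line):
    """Parse an 'OK paths=N&path1=...&path2=...' response into an
    ordered list of URLs (path1 is the tracker's preferred copy)."""
    if not line.startswith("OK "):
        raise RuntimeError("tracker error: " + line)
    qs = parse_qs(line[3:].strip())
    count = int(qs["paths"][0])
    return [qs["path%d" % i][0] for i in range(1, count + 1)]
```

The ordering of the returned paths is the point: the tracker ranks copies by device health and load, which is exactly what a raw SELECT against the database cannot do.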

> You were correct about the setup; since posting I have changed the
> version to 2.32 from svn.

Please use the release... SVN trunk has some broken commits in it right
now.

> So, for reference:
>
> Databases 'A' and 'B', in master<->master replicated pairs.
>
> Tracker 1 (2.32) -> talks to database A
> Tracker 2 (2.32) -> talks to database A
> Tracker 3 (2.32) -> talks to database B
>
> (I changed all trackers to point to database B now)
>
> Do I need to write failover logic in the tracker itself and switch all
> trackers to database B in case A goes down? Or what's the optimal way
> to do this?

There's an endless amount of documentation on the internet for the various
ways of doing HA MySQL setups. That's a little beyond the scope of this
discussion :) The easiest would be to use a hostname and switch that out.
Then it gets harder from there.

Once you revert back to the 2.32 tarball... you should stop all trackers,
then start all trackers clean. Ensure rebalance is off, ensure nothing is
draining or readonly. Ensure deletes are actually working and replication
is actually working. Then clear the fsck log (mogadm fsck clear? I
forget), and re-run a fsck. Once the file_to_queue table is empty, the
fsck has completed.

Then you can look over the log to see what's going on. Replication and etc
should be working.

It might also be a good step to, *before you run fsck*, make sure that
you can upload a few files and they are actually replicating correctly.

-Dormando

Jon

unread,
Nov 14, 2009, 1:06:36 PM11/14/09
to mogile
Are there considerations other than the 'weight' column in SQL happening
in the tracker? I always assumed the weight was the only reference to IO
load. The case where I use direct database access is for accessing
300-400 files in one query. If I were to request that many files one by
one over the internet, that would give a fairly bad result.

Regarding HA MySQL setups, my question is rather: is there a way for me
to have the trackers connect to MySQL server2 when server1 is down,
without having to move the IP (not possible in my environment) or update
the DNS record (really slow) of the MySQL server?

I reverted my trackers to the tarball, and started the tracker again in
debug mode:

[fsck(16252)] Fixing FID 405834
[reaper(16250)] Reaper running; looking for dead devices
[monitor(16249)] Monitor running; scanning usage files
[fsck(16252)] node server1.domain.com seems to be down in get_file_size
[fsck(16252)] Fsck stalled: dev unreachable at /usr/local/share/perl/5.8.8/MogileFS/Worker/Fsck.pm line 290, <GEN3> line 597.
[monitor(16249)] Monitor running; scanning usage files
[monitor(16249)] dev7: used = 12125768, total = 296362216, writeable = 1
[monitor(16249)] dev9: used = 84287852, total = 296362216, writeable = 1
[monitor(16249)] dev5: used = 11099084, total = 98787332, writeable = 1
[reaper(16250)] Reaper running; looking for dead devices
[replicate(16235)] source_down: Requested replication source device 7
not available
[monitor(16249)] Monitor running; scanning usage files

dev5 is the only device on server1.domain.com and dev7 is the only
device on localhost. The DNS record for server2, which contains dev7, is
added in the /etc/hosts file pointing to 127.0.1.1.

Both servers are clearly up, and 'mogadm check' returns 'OK' on dev5
and dev7


mogstored.conf:

maxconns = 10000
httplisten = 0.0.0.0:7500
mgmtlisten = 0.0.0.0:7501
docroot = /var/mogdata

What could be the reason for fsck and replicate to fail?

dormando

Nov 15, 2009, 5:46:56 AM
to mogile
> Are there considerations other than the 'weight' column in SQL happening
> in the tracker? I always assumed the weight was the only reference to IO
> load. The case where I use direct database access is for accessing
> 300-400 files in one query. If I were to request that many files one by
> one over the internet, that would give a fairly bad result.

The tracker keeps note of when devices are dead, marked down, etc. I guess
you'll be fine for now, but maybe we can come up with a batch paths
request or something.

> Regarding HA MySQL setups, my question is rather: is there a way for me
> to have the trackers connect to MySQL server2 when server1 is down,
> without having to move the IP (not possible in my environment) or update
> the DNS record (really slow) of the MySQL server?

You should really do some reading and fiddle. Doing it in the
client/tracker won't be the best idea since you have things everywhere
doing requests. DNS isn't slow if you have something smart dumping changes
into hosts files, or swapping configs and restarting, or something.
You'll probably end up with an iptables redirect or a tcp proxy, I'd
bet...
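Whatever does the swap (a proxy health check, a cron job rewriting hosts files, an iptables rule updater), the core of it is the same small piece of selection logic. A sketch, with the reachability probe injected so it can be exercised without a live server; `pick_db_host` is a hypothetical helper, not part of MogileFS:

```python
def pick_db_host(hosts, is_reachable):
    """Return the first reachable MySQL host from an ordered preference list.

    `is_reachable` would typically be a TCP connect to port 3306 with a
    short timeout; it is passed in as a callable here so the policy can
    be tested in isolation. (Hypothetical helper, not MogileFS code.)
    """
    for host in hosts:
        if is_reachable(host):
            return host
    raise RuntimeError("no database host reachable")
```

A watchdog could run this periodically and rewrite the hosts file (or redirect rule) only when the answer changes, which keeps the trackers themselves oblivious to the failover.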

> [fsck(16252)] Fixing FID 405834
> [reaper(16250)] Reaper running; looking for dead devices
> [monitor(16249)] Monitor running; scanning usage files
> [fsck(16252)] node server1.domain.com seems to be down in get_file_size
> [fsck(16252)] Fsck stalled: dev unreachable at /usr/local/share/perl/5.8.8/MogileFS/Worker/Fsck.pm line 290, <GEN3> line 597.
> [monitor(16249)] Monitor running; scanning usage files
> [monitor(16249)] dev7: used = 12125768, total = 296362216, writeable = 1
> [monitor(16249)] dev9: used = 84287852, total = 296362216, writeable = 1
> [monitor(16249)] dev5: used = 11099084, total = 98787332, writeable = 1
> [reaper(16250)] Reaper running; looking for dead devices
> [replicate(16235)] source_down: Requested replication source device 7
> not available
> [monitor(16249)] Monitor running; scanning usage files
>
> dev5 is the only device on server1.domain.com and dev7 is the only
> device on localhost. The DNS record for server2, which contains dev7, is
> added in the /etc/hosts file pointing to 127.0.1.1.
>
> Both servers are clearly up, and 'mogadm check' returns 'OK' on dev5
> and dev7

Mogile's not too tolerant of a high-latency setup... as you've noticed, I
guess. It tries pretty hard not to get hung up on servers that are
limping, which both helps the server recover by not giving it traffic,
and keeps MogileFS from tripping over them.

In short, there's a really low timeout for the get_file_size command
there. If you crack open MogileFS/HTTPFile.pm and look at line 243, you
can increase that timeout. If that clears it up I'll go make it
configurable...
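For illustration, the check that's timing out is just an HTTP HEAD against the storage node with a tight connect timeout. A Python equivalent of what get_file_size does (the function name and the 2-second default are mine, not MogileFS's):

```python
import http.client

def get_file_size(host, port, path, timeout=2.0):
    """Ask a storage node for a file's size via an HTTP HEAD request.

    MogileFS's own get_file_size uses a sub-second timeout, which is
    fine on a LAN but too tight when storage nodes sit behind WAN
    latency; raising it to a second or more is what fixed dev5 here.
    """
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("HEAD", path)
        resp = conn.getresponse()
        length = resp.getheader("Content-Length")
        return int(length) if length is not None else None
    finally:
        conn.close()
```

The tradeoff is the one dormando describes: a bigger timeout tolerates slow links, but it also means fsck and replication wait longer before giving up on a node that really is dead.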

-Dormando

Jon

Nov 15, 2009, 7:34:05 AM
to mogile
[replicate(16235)] source_down: Requested replication source device 7
not available <-- this device is on the same host as the tracker, and
the problem still exists after increasing the timeout.

For my dev5 it seems to work well to change the value to 1 instead of
0.2! :-D

Jon

Nov 15, 2009, 9:23:38 AM
to mogile
I fired up a tracker on another host in debug mode as well, and it gave
the same errors for dev7. The same fix as for dev5 - changing the
get_file_size timeout to 1 instead of 0.2 - corrected all problems
against that device/host at this tracker too.

If I shut down the mogstored daemon I get:
[monitor(17217)] Port 7500 not listening on otherwise-alive machine serverx.domain.com? Error was: 500 Can't connect to serverx.domain.com:7500 (connect: Connection refused)

Running mogstored in console mode didn't give me any clues either.

I managed to upload to dev7 just fine through the tracker.

So I'm rather curious why I still get [replicate(18974)] source_down:
Requested replication source device 7 not available.


dormando

Nov 15, 2009, 4:32:53 PM
to mogile
That tracker error means that the replicate worker doesn't think the
device is up.

Which either means the monitor process hasn't insisted to the replicate
workers that the device is actually up. Do you have any errors before
that? Errors in accessing dev7 that end up marking it as down?

Or, if you reset the replicate workers via '!want 0 replicate', sleep 60,
then '!want N replicate' (N is your old worker count), does it work for a
while before giving up again?

Guess I should try to make that more explicit sooner rather than later...

Jon

Nov 16, 2009, 4:22:35 AM
to mogile
It doesn't appear to work at all. The device never gets marked as down.
(In fact I've never seen any device get marked as down by the tracker,
even if the device has actually been down for days; it just tries, and
tries, and tries... In addition I've seen the tracker try devices
manually marked as 'down'. It was not until I set the device to status
'dead' that the tracker honored the status - on the other hand, the
mogadm util honors this flag just fine and skips the device.)

When I increased the timeout for the error I had with dev5, 90% of the
items in the file_to_queue table were removed in a matter of minutes.
Not a single one of the remaining items (which I'm assuming are all
related to dev7) has been processed and removed from that table.

I have yet to try your telnet approach, but shutting down the tracker and
starting it again has no effect. The replicate error is the only error
I can see when running in debug mode. I've attempted debug mode at two
different trackers, one running on localhost, and another offsite
tracker.

Jon

Nov 16, 2009, 4:53:02 PM
to mogile
Somehow the file_to_queue table got emptied over the night, so I
attempted to bring some old devices back to life from status 'dead'.
These were 8GB devices from virtual machines I used to test my initial
setup. I hit fsck reset, and fsck start. This appears to work well,
which is really strange for one particular case - a device located on
the exact same server and physical disk as dev7, which keeps throwing
out errors. I put the device to status 'drain', but I haven't seen a
single error message related to it! The host is a virtual machine with
two virtual hard drives attached, where both of them are on the same
physical disk. I was expecting this to fail in the same manner as dev7,
but there's no evidence of that...

I got some new error messages related to the server with latency
issues earlier (dev5):

[fsck(5866)] get_file_size() connect timeout for HTTP HEAD for size of
http://server1.domain.com:7500/dev3/0/000/329/0000329128.fid
[fsck(5866)] node server1.domain.com seems to be down in get_file_size
[fsck(5866)] Fsck stalled: dev unreachable at /usr/local/share/perl/5.8.8/MogileFS/Worker/Fsck.pm line 290, <GEN26> line 72556.

[fsck(5866)] no time remaining to write HEAD request to
http://server1.domain.com:7500/dev5/0/000/358/0000358199.fid
[fsck(5866)] Fsck stalled: dev unreachable at /usr/local/share/perl/5.8.8/MogileFS/Worker/Fsck.pm line 290, <GEN2> line 51240.

[fsck(5866)] get_file_size() connect timeout for HTTP HEAD for size of
http://server1.domain.com:7500/dev5/0/000/347/0000347343.fid
[fsck(5866)] Fsck stalled: dev unreachable at /usr/local/share/perl/5.8.8/MogileFS/Worker/Fsck.pm line 290, <GEN5> line 22724.

The host is alive and well. Monitor reports the device as readable

dormando

Nov 16, 2009, 5:19:21 PM
to mogile
You're not supposed to ever bring a device back from 'dead' to 'alive';
if you do that you should first rm everything on the device and give it a
new deviceid.

In fact the system tries at least a little bit to prevent you from doing
that :)

When fsck "stalls", it will retry that file after 10 minutes. If your
queue is empty then it has eventually gotten around to fixing all of them.

I'm not sure what you're doing with dev7... are you really sure this thing
works? There must be some error you're missing, since that other error you
note should only happen after that device has been noted as being down.

Can you attach output of 'mogadm check', and also "select * from
devices;"?

Jon

Nov 17, 2009, 4:24:57 AM
to mogile
Checking trackers...
127.0.0.1:7001 ... OK

Checking hosts...
[ 2] server1 ... OK
[ 3] server2 ... OK
[ 4] server3 ... OK
[ 5] server4 ... OK
[ 6] server5 ... OK
[ 7] server6 ... OK
[ 8] server7 ... OK
[ 9] server8 ... OK

Checking devices...
  host device         size(G)    used(G)    free(G)   use% ob state   I/O%
  ---- ------------ ---------- ---------- ---------- ------ ---------- -----
  [ 2] dev1              7.547      1.229      6.317 16.29%  writeable   0.0
  [ 2] dev6            396.844    106.044    290.800 26.72%  writeable   0.0
  [ 3] dev4              7.166      1.836      5.330 25.62%  writeable   0.0
  [ 3] dev7            282.633     11.564    271.069  4.09%  writeable   0.0
  [ 4] dev3              7.166      2.071      5.095 28.90%  writeable   0.0
  [ 4] dev5             94.211     10.585     83.626 11.24%  writeable   0.0
  [ 5] dev8            248.027     98.747    149.281 39.81%  writeable   0.0
  [ 6] dev9            282.633     80.383    202.250 28.44%  writeable   0.0
  [ 7] dev10           396.844     72.759    324.084 18.33%  writeable   0.0
  [ 8] dev11           396.844     94.855    301.988 23.90%  writeable   0.0
  [ 9] dev12           329.738     84.702    245.036 25.69%  writeable   0.0
  ---- ------------ ---------- ---------- ---------- ------
  total:             2449.653    564.777   1884.876 23.06%


1, 2, 'drain', 1, 7727, 1258, 1258438553
2, 3, 'dead', 1, 406367, 198, 1243444292
3, 4, 'drain', 1, 7338, 2120, 1258438542
4, 3, 'drain', 1, 7338, 1880, 1258438553
5, 4, 'alive', 100, 96472, 10839, 1258438553
6, 2, 'alive', 100, 406367, 108588, 1258438542
7, 3, 'alive', 100, 289416, 11841, 1258438541
8, 5, 'alive', 100, 253980, 101116, 1258438553
9, 6, 'alive', 100, 289416, 82312, 1258438552
10, 7, 'alive', 100, 406367, 74505, 1258438553
11, 8, 'alive', 100, 406367, 97131, 1258438542
12, 9, 'alive', 100, 337652, 86735, 1258438552



I am very sure the device works... I can both upload to and download from
it using the tracker. In addition, another device is running on the
exact same physical disk.
I took it from dead to down directly in the database, then I used the
mogadm tool to bring it to 'alive' / 'drain'.

As a side note - all the servers are running at different physical
locations, with internet connectivity in between. For the most part
it seems to work just fine :)

Jon

Nov 19, 2009, 3:16:05 PM
to mogile
After my fsck had completed its cycle, I still have some weird stats
when I hit 'mogadm fsck status'. The file_to_queue table is now
completely empty.

[num_BLEN]: 12
[num_GONE]: 1037
[num_NOPA]: 1222
[num_POVI]: 4346
[num_REPL]: 4338
[num_SRCH]: 1228


Isn't it supposed to resolve issues like policy violations?

What has worked is status 'drain' - the devices I put to 'drain' have
now been completely emptied, as opposed to my previous attempts :)

My file_to_replicate table has about 1000 entries in it, but it
doesn't appear to shrink in any way. In the fromdevid column there's a
total of 6 different devices, so I don't see why the files never get
replicated. Some of the fids have a failcount well above 100, but for
the most part the failcount is 1. Running mogilefs in debug mode
doesn't print out any errors regarding replication at this time
either.

When doing 'mogadm stats' I get a rather alarming result. Example:


domain.com class1 0 154
domain.com class1 1 18
domain.com class1 2 1668
domain.com class1 3 898
domain.com class1 4 27310
domain.com class1 5 859
domain.com class1 6 21


class1 policy is set to devcount 4


How exactly can I resolve this so I don't lose files?

dormando

Nov 20, 2009, 9:12:00 PM
to mogile
Hey,

There's a certain kind of policy violation it doesn't presently autofix...
which is "over-replicated" files. POVI errors are usually logged with
another error: NOPA, REPL, GONE, etc. You can look at the actual fsck_log
to see what the deal is with those paths.

Since your setup has had so many misconfigurations, at times you might've
had faulty file uploads, or broken replications, and fsck is banging its
head against those.

You should poke through the fsck_log and pull some fids that were in the
'GONE' state or similar, and take a look at their status. You might just
want to delete them, or reupload them if they still work.

Once your setup is more stable, you should see few, if any of these...

I really can't help you if you keep doing things it isn't designed to do,
though. We don't support moving devices from dead -> alive, don't
support having trackers talk to separate databases, etc. There're a few
things that need to be strict so MogileFS can keep itself sane. Given this
it seems to have done a fairly good job of not completely falling to
pieces...

-Dormando

dormando

Nov 20, 2009, 9:16:35 PM
to mogile
I wish you could start this all out from scratch :) Trying to repair a
system that's been through so much is hard.

At this point I'd need access to your system to figure out what's up with
dev7. If you can cope at all, I'd recommend marking it as dead, formatting
it, and bringing it back under a new device number.

If not, I'd have to look directly, and it'd probably be a good thing if
you tried the above first... Unless you think there're files on dev7 that
don't exist on any other device.

-Dormando

On Mon, 16 Nov 2009, Jon wrote:


Jon

Nov 21, 2009, 6:14:22 AM
to mogile
I've taken down the server running dev7. However, marking it as down
with 'mogadm mark server down' doesn't appear to have any effect on
the monitor thread in the tracker.

When issuing 'SELECT * FROM fsck_log f WHERE devid > 0;' I get almost
only BLEN as the evcode, with only 3 as MISS. The three MISS are all on
the device with latency issues from before (dev5). As for the BLEN, they
are all replicated to a correct devcount as far as I can tell, on devices
that have never been marked as 'dead'. But these numbers don't
correspond to 'mogadm stats' or 'mogadm fsck status' at all as far as
I can tell.

I brought the devices from 'dead' to 'drain', hoping to recover some
of my files. It appears it was somewhat successful at least, as it did
replicate something :)

I'm fairly scared by my stats at the moment due to tons of
inconsistencies. Is there a consistent way to remove all fids and entries
in logs and stats that are completely lost, in order to get a more
accurate overview? It should be a fairly easy task to repair processed
files like thumbnails where possible, if I get an overview of what's
actually broken/missing, instead of moving a million files back and
forth to create a new setup.

Btw. my fsck_log has a little over 2/3 as many entries as the file_on
table; is that correct? This is after a 'mogadm fsck reset' and
'mogadm fsck start', run after the file_to_queue table was empty.

dormando

Nov 22, 2009, 2:58:39 AM
to mogile


On Sat, 21 Nov 2009, Jon wrote:

> I've taken down the server running dev7. However, marking it as down
> with 'mogadm mark server down' doesn't appear to have any effect on
> the monitor thread in tracker..

'down' means the monitor process will still check it, but the tracker
shouldn't be storing new files to it, or issuing responses to get_paths
with paths from that host in there. I think monitor only skips it on
'dead'.
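A toy model of those state semantics as described (illustrative Python, not MogileFS source; the real logic lives in the tracker's Perl workers):

```python
# Toy model of the device states as described above: 'down' devices are
# still polled by the monitor but excluded from writes and get_paths
# responses; only 'dead' is skipped outright.
STATES = {
    "alive": {"monitored": True,  "writable": True,  "readable": True},
    "down":  {"monitored": True,  "writable": False, "readable": False},
    "dead":  {"monitored": False, "writable": False, "readable": False},
}

def monitor_polls(state):
    """Would the monitor worker still contact a device in this state?"""
    return STATES[state]["monitored"]

def returned_by_get_paths(state):
    """Would get_paths hand out paths on a device in this state?"""
    return STATES[state]["readable"]

print(monitor_polls("down"), returned_by_get_paths("down"))  # True False
print(monitor_polls("dead"))                                 # False
```

This is why Jon keeps seeing monitor traffic to a 'down' host: only 'dead' stops the polling.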

> When issuing 'SELECT * FROM fsck_log f where devid > 0;' I get almost
> only BLEN as evcode, with only 3 as MISS. The three MISS are all on
> the device with latency issues before (dev5). As for the BLEN, they're
> all replicated to a correct devcount as far as I can tell, on devices
> that has never been marked as 'dead'. But theese numbers doesn't
> correspond to 'mogadm stats' or 'mogadm fsck status' at all as far as
> I can tell.
>
> I brought the devices from 'dead' to 'drain', hoping to recover some
> of my files. It appears it was somewhat successful atleast, as it did
> replicate something :)

the "dead" state is supposed to remove everything that was in file_on for
that device... I'm not sure what doing that would change.

> I'm fairly scared by my stats atm. due to tons of inconsistencies. Is
> there a consistent way to remove all fids and entries in logs and
> stats that are completely lost, in order to get a more accurate
> overview? Should be a fairly easy task to repair processed files like
> thumbnails where possible, if I get an overview over what's actually
> broken/missing. Instead of moving a million files back and forth to
> create a new setup..

mogadm fsck clearlog
mogadm fsck reset
... the fsck stats were broken pretty badly in 2.30. If you try out trunk
with fsck (but *only* run fsck workers on the host that you upgraded to
trunk), the stats will be better, but still a little off.

> Btw. my fsck_log has little over 2/3 of the entries inside the file_on
> table, is that correct? This is after a 'mogadm fsck reset', and
> 'mogadm fsck start', after file_to_queue table is empty.

You're supposed to run clearlog sometimes too ;) 'reset' resets the fsck
position and the summary, but only clearlog clears entries out of the log
table. Maybe that's all you'll need to do to figure this out? If you run
reset it might be pulling summary stats from old log entries.

-Dormando

Jon

Nov 22, 2009, 3:51:06 PM
to mogile
New day, new error messages. I did what you said and cleared the log,
then did a reset and started fsck.

[replicate(4723)] Unable to create dest socket to server1.domain.com:
7500 for /dev9/0/000/017/0000017802.fid
[replicate(4723)] Failed copying fid 17802 from devid 8 to devid 9
(error type: dest_error)
server1 reports 'writeable' both before and after this error message
(seconds apart) when issuing 'mogadm check', so I'm not sure what's
happening here.

[monitor(4739)] Timeout contacting machine server2.domain.com for dev
6: took 1.99 seconds out of 2 allowed
[fsck(5618)] get_file_size() read timeout (0.489640951156616) for HTTP
HEAD for size of http://server2.domain.com:7500/dev6/0/000/024/0000024299.fid
[fsck(5618)] Connectivity problem reaching device 6 on host
server2.domain.com
[monitor(4739)] dev6: used = 111332928, total = 416120716, writeable =
1
This error message is also unclear to me; the machine and device are
alive and well (also according to the monitor, as you can see), and I
increased the timeout in HTTPFile.pm without any noticeable effect.

[fsck(5618)] get_file_size() connect timeout for HTTP HEAD for size of
http://server3.domain.com:7500/dev7/0/000/022/0000022961.fid
[fsck(5618)] Fsck stalled: dev unreachable at /usr/local/share/perl/
5.8.8/MogileFS/Worker/Fsck.pm line 290, <GEN6> line 170.
This is the server I've marked as 'down', so I don't understand why
fsck keeps trying it. Again, 'mogadm check' skips this server just
fine.

[replicate(6915)] Unable to create source socket to server4.domain.com:
7500 for /dev8/0/000/034/0000034585.fid
[replicate(6915)] Failed copying fid 34585 from devid 8 to devid 9
(error type: src_error)
[monitor(5615)] dev8: used = 103825396, total = 260075584, writeable =
1
A little later this error also showed up, but as far as I can tell
this server is working just fine? :S


The server with the bad device has now been marked as 'dead' (not the
device, only the server) and fsck is still going strong with it:

[fsck(5618)] get_file_size() connect timeout for HTTP HEAD for size of
http://server3.domain.com:7500/dev7/0/000/144/0000144678.fid
[fsck(5618)] Fsck stalled: dev unreachable at /usr/local/share/perl/
5.8.8/MogileFS/Worker/Fsck.pm line 290, <GEN212> line 28020.

file_to_queue still has about 1500 items in it; all the other items
got processed in only a couple of hours.
file_to_replicate has about 800 items in it.

How do I figure out why these issues don't get resolved?

What has changed, though, is 'mogadm fsck status':

[num_NOPA]: 24
[num_SRCH]: 24

Far fewer items in there now, but this doesn't really match the
file_to_replicate or file_to_queue numbers in any way, which are still
considerable.

I also tried marking dev7 as dead, on the already-dead host;
same result for fsck: it still pounds on it. However, when I restarted
the tracker it stopped polling the dead server, and my file_to_queue
items got processed rather quickly, except for 53 entries (?). My
file_to_replicate table is still the same as before, with 800 records.
My latest 'mogadm fsck status' gives:

[num_GONE]: 1

Right after the restart of the tracker I got some new error messages:

[replicate(22518)] Got HTTP status code 400 PUTing to
http://server5.domain.com:7500/dev6/0/000/399/0000399061.fid
[replicate(22518)] Failed copying fid 399061 from devid 9 to devid 6
(error type: dest_error)
[replicate(22518)] Got HTTP status code 400 PUTing to
http://server6.domain.com:7500/dev8/0/000/399/0000399061.fid
[replicate(22518)] Failed copying fid 399061 from devid 9 to devid 8
(error type: dest_error)
[replicate(22518)] Got HTTP status code 400 PUTing to
http://server7.domain.com:7500/dev10/0/000/399/0000399061.fid
[replicate(22518)] Failed copying fid 399061 from devid 5 to devid 10
(error type: dest_error)
[replicate(22518)] Got HTTP status code 400 PUTing to
http://server8.domain.com:7500/dev11/0/000/399/0000399061.fid
[replicate(22518)] Failed copying fid 399061 from devid 12 to devid 11
(error type: dest_error)
[replicate(22518)] policy_no_suggestions: replication policy ran out
of suggestions for us replicating fid 399061

[fsck(22533)] HEAD response to get_file_size looks bogus
[fsck(22533)] Fsck stalled: dev unreachable at /usr/local/share/perl/
5.8.8/MogileFS/Worker/Fsck.pm line 290, <> line 12.

Also the occasional:
[monitor(22530)] Timeout contacting machine server5.domain.com for dev
6: took 2.00 seconds out of 2 allowed
for a fully operational server, which appears to respond in
milliseconds when I issue 'mogadm check'.

I'm assuming some input on what these error messages actually mean
would be very helpful! :)

My plan is to remove the bad apples from the basket and work around the
missing files in my application. For the most part it's generated
files, which can be regenerated (like thumbnails).
I'm assuming my latency/locking issues with trackers pointed at
different databases have caused some broken files in the system, which
I intend to remove. (Sure beats putting up a brand new setup and moving
the 'good' ones over =)

Jon

Nov 22, 2009, 4:13:28 PM
to mogile
Update to my previous post:

file_to_queue got emptied after a while.
file_to_replicate has gotten smaller, but is stuck at the same number
as before the fsck reset. Would it be safe to clean out this table and
do a completely fresh fsck run once again, with both of these tables
cleaned out?


dormando

Nov 22, 2009, 9:09:49 PM
to mogile
These are all timeout errors... Even though you're going over the
internet, why are they so slow?

Mogile's design is that it has early timeouts for a lot of conditions so
it can avoid those devices and not get jammed up. The bulk of these are
temporary and will be retried after 10-15 minutes (in the case of "fsck
stalled", replication errors, etc).

On Sun, 22 Nov 2009, Jon wrote:

> New day, new error messages. I did what you said and cleared the log.
> Then did a reset, and start of fsck.
>
> [replicate(4723)] Unable to create dest socket to server1.domain.com:
> 7500 for /dev9/0/000/017/0000017802.fid
> [replicate(4723)] Failed copying fid 17802 from devid 8 to devid 9
> (error type: dest_error)
> server1 reports 'writeable' both before and after this error message
> came (seconds apart), when issuing 'mogadm check', so not sure what's
> happening here
>
> [monitor(4739)] Timeout contacting machine server2.domain.com for dev
> 6: took 1.99 seconds out of 2 allowed
> [fsck(5618)] get_file_size() read timeout (0.489640951156616) for HTTP
> HEAD for size of http://server2.domain.com:7500/dev6/0/000/024/0000024299.fid
> [fsck(5618)] Connectivity problem reaching device 6 on host
> server2.domain.com
> [monitor(4739)] dev6: used = 111332928, total = 416120716, writeable =
> 1
> This error message is also unclear to me, the machine and device is
> alive and well (also according to monitor as you can see), and I
> increased the timeout in HTTPFile.pm without any noticeable effect

That's the monitor process talking... it's a different timeout of the same
length.

> [fsck(5618)] get_file_size() connect timeout for HTTP HEAD for size of
> http://server3.domain.com:7500/dev7/0/000/022/0000022961.fid
> [fsck(5618)] Fsck stalled: dev unreachable at /usr/local/share/perl/
> 5.8.8/MogileFS/Worker/Fsck.pm line 290, <GEN6> line 170.
> This is the server I've marked as 'down', so I don't understand why
> fsck keeps trying it? Again - 'mogadm check' skips this server just
> fine

Can you mark all of the devices on the host as down, instead of just the
host? There's a logic bug somewhere, I think.

> [replicate(6915)] Unable to create source socket to server4.domain.com:
> 7500 for /dev8/0/000/034/0000034585.fid
> [replicate(6915)] Failed copying fid 34585 from devid 8 to devid 9
> (error type: src_error)
> [monitor(5615)] dev8: used = 103825396, total = 260075584, writeable =
> 1
> A little later this error also showed up, but to me it looks like this
> server is working just fine? :S

Likely a timeout.

> The server with the bad device has now been marked as 'dead' (not the
> device, only the server) and fsck is still going strong with it:

Host status doesn't seem to work as well as it should. Marking a host
'dead' will not cause the devices on it to be re-replicated; the
devices themselves have to be marked 'dead'.

> [fsck(5618)] get_file_size() connect timeout for HTTP HEAD for size of
> http://server3.domain.com:7500/dev7/0/000/144/0000144678.fid
> [fsck(5618)] Fsck stalled: dev unreachable at /usr/local/share/perl/
> 5.8.8/MogileFS/Worker/Fsck.pm line 290, <GEN212> line 28020.
>
> file_to_queue still have about 1500 items in it, all the other items
> got processed in only a couple of hours.
> file_to_replicate have about 800 items in it.
>
> How to figure out why theese issues doesn't get resolved?

Examine what's left in file_to_replicate: is it being retried, or is
nexttry set to ENDOFTIME (2^32)? Examine the fids, or let them retry a
few times until they go through.
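That inspection can be sketched as follows, against a stand-in SQLite copy of the table (the real one lives in MySQL); the fid/nexttry/failcount columns are from the schema discussed above, but the sample rows are invented:

```python
import sqlite3

ENDOFTIME = 2 ** 32  # the "given up" sentinel mentioned above

# Stand-in for the real MySQL file_to_replicate table
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE file_to_replicate "
           "(fid INTEGER, nexttry INTEGER, failcount INTEGER)")
db.executemany("INSERT INTO file_to_replicate VALUES (?, ?, ?)", [
    (17802,  1258900000, 3),   # still scheduled for retry
    (399061, ENDOFTIME,  10),  # given up on
])

# Split the backlog into "given up" vs "will be retried"
given_up = [r[0] for r in db.execute(
    "SELECT fid FROM file_to_replicate WHERE nexttry >= ?", (ENDOFTIME,))]
retrying = [r[0] for r in db.execute(
    "SELECT fid FROM file_to_replicate WHERE nexttry < ?", (ENDOFTIME,))]
print(given_up, retrying)  # [399061] [17802]
```

The same two SELECTs run unchanged against the real MySQL table.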

> What has changed tho is 'mogadm fsck status':
>
> [num_NOPA]: 24
> [num_SRCH]: 24
>
> Much less items in there now, but this doesn't really match the
> file_to_replicate or file_to_queue numbers in any way which are still
> at considerably numbers

As I said earlier, the fsck status reporting is pretty badly broken in
2.32; it's cleaned up in trunk if you want to try that, but you have
to guarantee that all fsck workers are on trackers running trunk code.
An older worker will screw it up.

> I also tried marking the dev7 as dead, on the already marked dead host
> - same result for fsck - still pounds on it. However if I restarted
> the tracker it stopped polling the dead server, and my file_to_queue
> items got processed rather quick, except 53 entries (?). My
> file_to_replicate table is still the same as before, with 800 records.
> My latest 'mogadm fsck status' gives:

I bet there's a bug where a down host blocks status updates elsewhere.
I've only ever marked a host as 'dead' *after* all of the devices on
it were marked as 'dead'. Try that?

> [replicate(22518)] Got HTTP status code 400 PUTing to
> http://server5.domain.com:7500/dev6/0/000/399/0000399061.fid
> [replicate(22518)] Failed copying fid 399061 from devid 9 to devid 6
> (error type: dest_error)
> [replicate(22518)] Got HTTP status code 400 PUTing to
> http://server6.domain.com:7500/dev8/0/000/399/0000399061.fid
> [replicate(22518)] Failed copying fid 399061 from devid 9 to devid 8
> (error type: dest_error)
> [replicate(22518)] Got HTTP status code 400 PUTing to
> http://server7.domain.com:7500/dev10/0/000/399/0000399061.fid
> [replicate(22518)] Failed copying fid 399061 from devid 5 to devid 10
> (error type: dest_error)
> [replicate(22518)] Got HTTP status code 400 PUTing to
> http://server8.domain.com:7500/dev11/0/000/399/0000399061.fid
> [replicate(22518)] Failed copying fid 399061 from devid 12 to devid 11
> (error type: dest_error)
> [replicate(22518)] policy_no_suggestions: replication policy ran out
> of suggestions for us replicating fid 399061

That's all you. Go check permissions, etc.?

> [fsck(22533)] HEAD response to get_file_size looks bogus
> [fsck(22533)] Fsck stalled: dev unreachable at /usr/local/share/perl/
> 5.8.8/MogileFS/Worker/Fsck.pm line 290, <> line 12.
>
> Also the occational:
> [monitor(22530)] Timeout contacting machine server5.domain.com for dev
> 6: took 2.00 seconds out of 2 allowed
> for a fully operational server, which appears to respond in
> milliseconds when I issue 'mogadm check'

Why are your services so blip-y? Are you dropping packets? A SYN packet
will be retransmitted after 3 seconds, but the mogile timeouts of 2
seconds don't allow a connection to retry before dying. Normally that's
a good thing, so the worker can move on to other things to test/work on
and try that device again later.

> I'm assuming some input on what theese error message actually means
> would be very helpful! :)

I did my best above.

> My plan is to remove bad apples from the basket, and work around the
> missing files in my application. For the most part it's generated
> files, which can be generated again (like thumbnails)
> I'm assuming my latency issues/locking issues with trackers on
> different databases has caused some broken files in the system, which
> I intend to remove. (Sure beats putting up a brand new setup, and move
> 'good' ones =)

Yeah I guess. Just make sure everything's pointed at the same DB from now
on...

dormando

Nov 22, 2009, 9:10:51 PM
to mogile
You can clear out file_to_replicate if the stuff in there has been given
up on (ENDOFTIME) and you've looked through a few to confirm that they're
gone.
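A sketch of that guarded cleanup, again against a stand-in SQLite table (on the real system the SELECT and DELETE would run in MySQL, and only after spot-checking the fids):

```python
import sqlite3

ENDOFTIME = 2 ** 32  # "given up" sentinel, per the note above

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE file_to_replicate (fid INTEGER, nexttry INTEGER)")
db.executemany("INSERT INTO file_to_replicate VALUES (?, ?)",
               [(1, 1258900000), (2, ENDOFTIME), (3, ENDOFTIME)])

# Spot-check first: these are the fids that would be dropped
doomed = [r[0] for r in db.execute(
    "SELECT fid FROM file_to_replicate WHERE nexttry >= ?", (ENDOFTIME,))]

# Then delete only the given-up rows; pending retries are left alone
db.execute("DELETE FROM file_to_replicate WHERE nexttry >= ?", (ENDOFTIME,))
remaining = db.execute("SELECT COUNT(*) FROM file_to_replicate").fetchone()[0]
print(doomed, remaining)  # [2, 3] 1
```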


Jon

Nov 23, 2009, 2:55:59 PM
to mogile
>> [replicate(22518)] Got HTTP status code 400 PUTing to
>> http://server8.domain.com:7500/dev11/0/000/399/0000399061.fid
>> [replicate(22518)] Failed copying fid 399061 from devid 12 to devid 11
>> (error type: dest_error)
>> [replicate(22518)] policy_no_suggestions: replication policy ran out
>> of suggestions for us replicating fid 399061

> That's all you. go check permissions/etc?

How can all these devices get status 'writeable' if permissions are
wrong? And how can uploading work if they're all rejecting files?

Another thing I'm having problems with is that when I run fsck from
the start, I have major problems uploading files. The files get
through eventually, after x retries in my client application. For
example, it took two full minutes of constant timeout/retry by the
client application before the upload of 4 files of 50kb each was
complete! CPU load on the database is nothing to speak of, and
response times towards the storage nodes appear fine to me.

What's a good way to debug what's actually taking time?

dormando

Nov 23, 2009, 4:00:26 PM
to mogile
> How can all these devices get status 'writeable' if permissions are
> wrong? And how can uploading work if all are rejecting files?

I really don't know. Something's getting your uploads to respond with
400 errors, and you'll find out what by temporarily enabling error
logs on nginx or whatever you're running as the storage server.

> Another thing I'm having problems with, is that when I run the fsck
> from start, I'm having major problems uploading files. The files get
> through eventually, after x retries in my client application. Example;
> it took two full minutes of constand timeout/retry by the client
> application before upload of 4 files of 50kb each was complete! Cpu
> load on database is nothing to speak of, and reponse times appear fine
> to me towards storage nodes
>
> What's a good way to debug what's actually taking time?

In your client? Strace. It's starting to sound more and more like your
entire setup's severely underpowered. Try lowering the number of fsck
workers.

Jon

Nov 23, 2009, 4:36:02 PM
to mogile
I am running mogstored with the default config. How do I debug that
one? In any case, the error appears to have disappeared, because I am
unable to reproduce it now. All items sent to file_to_replicate get
resolved right away, and my slow-tracker issue seems to have resolved
itself by using another tracker. I now have my client actively cycling
through trackers on any sign of trouble and retrying, which seems to
find a 'good' tracker fairly fast (at least from the client's point of
view). I believe the problem is related to the tracker, mysql, and web
frontend all being at different locations; once I hit a tracker on the
same host as the mysql master it stopped failing.
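The tracker-cycling retry described above can be sketched like this (illustrative Python; `with_tracker_failover`, `fake_request`, and the tracker addresses are all made up for the example):

```python
import itertools

def with_tracker_failover(trackers, request, tries=6):
    """Cycle through trackers, retrying a request on any sign of trouble."""
    cycle = itertools.cycle(trackers)
    last_err = None
    for _ in range(tries):
        tracker = next(cycle)
        try:
            return request(tracker)
        except Exception as err:  # timeout, refused connection, tracker error
            last_err = err
    raise last_err

# Toy usage: the first tracker is overloaded, the second answers fine
def fake_request(tracker):
    if tracker == "tracker1:7001":
        raise TimeoutError("tracker busy")
    return "paths from " + tracker

result = with_tracker_failover(["tracker1:7001", "tracker2:7001"], fake_request)
print(result)  # paths from tracker2:7001
```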

My file_to_queue table is almost empty now, with failcount = 0 on all
remaining items. New uploads seem to replicate as expected.

All in all, I'm happy with the setup now, considering my starting
point! :D

Thank you so much for all your support and help, dormando!