queryworker crashing leaving leaving orphaned entries in file_on table

7 views
Skip to first unread message

shalabalubaosigbus

unread,
Nov 2, 2009, 4:04:56 PM11/2/09
to mogile
I noticed there where some entries showing up in the file_on table
that had no entry in the file table.
I started digging and noticed that every once and a while (under high
load) the tracker fails when trying to write to convert a temp file
into a real one.
It performs the DELETE FROM tempfile.
It then executes the INSERT IGNORE INTO file_on
after this (or during this) it crashes and does not clean up after
itself.

This is happening in lib/MogileFS/Store.pm line 286:
my $rv = eval { $self->dbh->do($sql, @do_params) };

I think the point of this eval is to gracefully handle errors, but
this is not happening.
I put some debugging code in there to see where it was crashing and I
get the message before this line, but not after.

Here is the debug output from one instance of this happening:
[queryworker(31634)] about to call add_to_db on key: /new/
1257191780.11-6650-t1.bofh.lan fidfid: 34922
[queryworker(31634)] calling add_fidid_fo_devid from add_to_db fidid:
34922
[queryworker(31634)] About to insert into file_on in
add_fidid_to_dev_id: fidid: 34922 devid: 2
[queryworker(31634)] dowell: start INSERT IGNORE INTO file_on (fid,
devid) VALUES (?,?) 34922 2
[queryworker(31634)] dowell: inside eval before ->do: INSERT IGNORE
INTO file_on (fid, devid) VALUES (?,?) 34922 2

091102 12:56:23 mogilefs 0 654 1 20678536 0 INSERT INTO tempfile
(dmid,dkey,classid,devids,createtime) VALUES ('5','/new/
1257191780.11-6650-t1.bofh.lan','1','2,1',UNIX_TIMESTAMP());
091102 12:56:28 mogilefs 0 659 1 20713229 0 DELETE FROM tempfile WHERE
fid='34922';
091102 12:56:30 mogilefs 4 659 1 20721067 0 INSERT IGNORE INTO
file_on (fid, devid) VALUES ('34922','2');


There is nothing else logged from this pid after this.


I think this needs to:
1. be able to handle errors happening on the database. i.e. realize
there was an error and report it and clean up after itself. eval is
not cutting it in this case.
and
2. AutoCommit => 0, this would force the tracker to send a commit and
if it didn't then the transaction would be rolled back.

Is there any particular reason AutoCommit is on?

dormando

unread,
Nov 15, 2009, 6:44:48 PM11/15/09
to mogile
Hey,

Are you seeing anything about the queryworker being killed due to watchdog
timeout/etc?

That eval isn't related to whether or not it should work... If it causes
an error you should see a large error message that you can paste. Your
report below just shows that the queryworker disappears abruptly.

-Dormando

shalabalubaosigbus

unread,
Nov 15, 2009, 6:58:28 PM11/15/09
to mogile
There are no error messages. The process just crashes without writing
any more to the logs and without allowing the program to gracefully
handle the error.
I'm not saying the eval is causing the problem, I'm just saying it is
not allowing the program to fail gracefully like it should.

I didn't notice any watchdog timeout messages, I'm assuming that would
be in stdout/stderr with debug turned on.

dormando

unread,
Nov 15, 2009, 7:13:31 PM11/15/09
to mogile
The eval won't make it crash without reporting an error... You'll notice
that the eval is outside of a wrapper that properly throws an error if
there is one.

I don't see a crash log in your list though. Is the parent process
respawning a dead queryworker, or what?

shalabalubaosigbus

unread,
Nov 16, 2009, 11:34:45 AM11/16/09
to mogile
I'm not saying the eval makes it crash without throwing an error.

I'm not sure where the crash log would be. It doesn't appear to be in
stdout/stderr.

All I know is that query worker abruptly stops. That is the last thing
that it processes and that pid no longer shows up in the logs.
I made some adjustments to turn autocommit off just for that function
and the same thing happens except it crashes on commit instead of the -
>do.

This only happens on the INSERT INTO file_on and only randomly.

dormando

unread,
Nov 16, 2009, 4:07:10 PM11/16/09
to mogile
So... You should definitely be seeing a note from the parent process that
says something similar to:

"Child 23523 (queryworker) died: 1 (UNEXPECTED)"
... and then a note about it spawning a new child.

If you're not seeing that, you might not be looking in the right place to
see what the queryworker error is, since it might also be the parent
watchdog killing it off.

You should be doing:

telnet trackerhost 7001
!watch
... to see the errors it's displaying... Though you should see all of
these on stderr/stdout if you're watching the tracker in the foreground,
too.

Try both? You can confirm if you're seeing the error or not by starting a
test tracker somewhere and kill'ing one of the queryworker pids. That
should display the above error and respawn it.

-Dormando

shalabalubaosigbus

unread,
Nov 16, 2009, 5:05:33 PM11/16/09
to mogile
Thank you for showing me what to look for. It looks like the watchdog
is killing it.
Here is what I get:
This is everything that has that pid.

[queryworker(27569)] dowell: before eval: params: 'INSERT IGNORE
INTO file_on (fid, devid) VALUES (?,?)' '107139' '2'
Watchdog killing worker 27569 (queryworker)
Child 27569 (queryworker) died: 9 (expected)

What can I do about this?

dormando

unread,
Nov 20, 2009, 9:25:39 PM11/20/09
to mogile
Can you get your database under a little less load?

It'll require a bit more work until we can get that all into a
transaction.

A workaround, can be to:
sub watchdog_timeout { 10 }
.. in MogileFS/Worker/Query.pm
... change that to 30-60. Which will make it more tolerant of spikes in DB
usage.

In fact I might just go raise that to 30 as a default...

shalabalubaosigbus

unread,
Nov 21, 2009, 4:35:22 PM11/21/09
to mogile
Thanks for those tips.
30 might be good as a default but it would be great to move that to
the config file.
Would benefit things greatly to get things into transactions as well.

Thanks Again.

dormando

unread,
Nov 26, 2009, 5:22:55 AM11/26/09
to mogile
I bumped the default to 30... Will need to make it configurable in future
releases (patches welcome).
Reply all
Reply to author
Forward
0 new messages