We have already seen this happen, basically after the filesystem was
blocked for a while, or after an OSS had crashed.
If a thread is stuck for too long (the default timeout is 1 hour),
robinhood tries to cancel its operation on the current directory and
continues with the next one.
Perhaps it did not recover successfully from this cancellation, and you
are receiving those messages because of that.
To avoid this problem, you can increase the timeout to a very high
value, to make sure it is never reached (e.g. xxx days).
In that case, robinhood will remain stuck as long as its current
operation in Lustre is blocked, and it will resume the operation as
soon as Lustre is back.
You can change this timeout by setting the "scan_op_timeout" parameter
in the "FS_Scan" section of the configuration file.
Alternatively, you can keep a reasonable timeout and make robinhood
exit when the filesystem is not responding, by setting
"exit_on_timeout = TRUE" in the same section.
You can then respawn the robinhood daemon once everything is fixed.
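Putting both options together, the relevant part of the configuration could look like the following sketch (the parameter names "scan_op_timeout" and "exit_on_timeout" are as described above; the value syntax is illustrative, so check it against the manual for your robinhood version):

```
FS_Scan
{
    # Cancel the operation on the current directory if it hangs
    # longer than this (the default is 1 hour):
    scan_op_timeout = 1h;

    # Alternative strategy: keep a reasonable timeout, but make the
    # daemon exit instead of cancelling, so it can be respawned once
    # the filesystem is healthy again:
    exit_on_timeout = TRUE;
}
```

With exit_on_timeout set, a simple supervisor script (or the admin) can restart the daemon after Lustre recovers.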
Best regards,
Thomas LEIBOVICI
CEA/DAM
> A support request from lustre-discuss.
>
> ------------------------------------------------------------------------
>
> Subject:
> [Lustre-discuss] robinhood error messages
> From:
> Thomas Roth <t.r...@gsi.de>
> Date:
> Tue, 23 Nov 2010 20:20:33 +0100
> To:
> lustre-...@lists.lustre.org
>
>
> Hi all,
>
> we are running robinhood (v2.2.1) on our 1.8.4 cluster (basically to
> find out where and who the big space consumers are - no purging).
>
> Robinhood sends me lots and lots of messages (~100/day) of the type
>
> > ===== FS scan is blocked (/lustre) =====
> > Date: 2010/11/23 20:05:22
> > Program: robinhood (pid 4826)
> > Host: lxb310
> > Filesystem: /lustre
> > A thread has been inactive for 3660 sec
> > while scanning directory /lustre/....
>
> This seems to indicate some trouble accessing certain directories on the
> node where robinhood is running. However, this is independent of the
> node, and at the same time we neither see any issues / slowness/
> connectivity problems nor get any user complaints of the like.
>
> So I wonder whether anybody else is using robinhood and has seen similar
> messages.
>
> Regards,
> Thomas
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-...@lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>
>
> ------------------------------------------------------------------------
>
>
> _______________________________________________
> robinhood-support mailing list
> robinhoo...@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/robinhood-support
>
Btw, whenever I restart the scan, e.g. after a reconfiguration such as
for the timeout, I get the logfile full of
> ListMgr | DB query failed in ListMgr_Insert line 340...
and assorted messages, which seem to indicate that the new robinhood
scan tries to insert something into the DB that is already there and
stumbles over it. Or maybe it happens when several robins are running
simultaneously. I'm not sure whether it is a problem for the scan; it
is, however, a problem for the free space on /var, or wherever I point
the log to ;-)
Regards,
Thomas
--
--------------------------------------------------------------------
Thomas Roth IT-HPC-Linux
Location: SB3 1.262 Phone: +49-6159-71 1453
> > > ListMgr | DB query failed in ListMgr_Insert line 340...
> > and assorted messages, which seem to indicate that the new robinhood
> > scan tries to put something into the DB that is already there, and
> > stumbles on this. Or maybe that happens when several robins are
> > running simultaneously.
> Are you running several instances for scanning the same filesystem??
Well, yes, I tried that also. Actually I was under the impression that
this is a feature of Robinhood - of course, now that I am looking for
this in the documentation I can't find it.
But these errors from the DB definitely first arose when I restarted
robinhood after some changes (location of log file, debug level, ...)
in the config file. Since there was no change in the robinhood version,
I did not empty the database. After this restart, I immediately got a
lot of
> 2010/11/04 11:27:45 robinhood[1489/4]: EntryProc | Error 3 performing database operation.
> 2010/11/04 11:27:45 robinhood[1489/8]: ListMgr | DB query failed in ListMgr_Insert line 340: pk='54051386:6D286C', code=3: Duplicate entry '54051386:6D286C' for key 1
I suppose this is something that should not happen when one is feeding a
database?
Cheers,
Sorry, I just saw the email you sent to the robinhood-support mailing
list, which was blocked waiting for admin validation.
About multiple robinhood instances: the documentation says that you can
split the features across different nodes.
Basically, the database server can run on one machine, the FS scan on
another, disk resource monitoring and purging on yet another, etc.
But you must only run a single instance of each feature at a given time.
Thomas Roth wrote:
>> Is there a way to "partition" a file system for Robinhood? Tell an
>> instance to only scan certain directories? Because I think the issue is
>> not a really broken data base, but simply a later coming Robin scanning
>> files that were already done?
What is your need exactly? Do you want to speed-up the scan by running
several robinhood instances,
or do you only want to scan certain directories?
- About speed: robinhood already performs scans in parallel with
multiple threads, each one scanning different directories.
So if you want more parallelism, increase the number of scan threads.
- If your need is to scan only some parts of the namespace, you can
ignore directories by specifying "ignore" rules in the configuration
file (FS_Scan section)
E.g., if you know the path you want to ignore:
ignore { path == "/lustre/xyz*" }
or, as a negation, if you know the paths you want to scan:
ignore { not ( path == "/lustre/dir1" or path == "/lustre/dir2/subdir*" ) }
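As a sketch, the two points above could be combined in the FS_Scan section like this (the ignore syntax follows the examples just given; the thread-count parameter name "nb_threads_scan" is my assumption and should be verified against your robinhood manual):

```
FS_Scan
{
    # More parallelism within a single scanning instance
    # (parameter name assumed; check the manual):
    nb_threads_scan = 8;

    # Either skip a subtree you do not care about...
    ignore { path == "/lustre/xyz*" }

    # ...or restrict the scan to known subtrees
    # (use one form or the other, not both):
    ignore { not ( path == "/lustre/dir1" or path == "/lustre/dir2/subdir*" ) }
}
```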
>> > > ListMgr | DB query failed in ListMgr_Insert line 340...
>> > and assorted messages, which seem to indicate that the new robinhood
>> > scan tries to put something into the DB that is already there, and
>> > stumbles on this. Or maybe that happens when several robins are
>> > running simultaneously.
>> Are you running several instances for scanning the same filesystem??
>
> Well, yes, tried that also. Actually I was under the impression that
> this is a feature of Robinhood - of course, now that I am looking for
> this in the documentation I can't find it.
>
> But these errors from the DB definitely did arise first when I
> restarted robinhood anew after some changes (location of log file,
> debug level, ...) in the config file. But since there was no change in
> the robinhood version, I did not empty the database. After this
> restart, I immediately got a lot of
> > 2010/11/04 11:27:45 robinhood[1489/4]: EntryProc | Error 3 performing database operation.
> > 2010/11/04 11:27:45 robinhood[1489/8]: ListMgr | DB query failed in ListMgr_Insert line 340: pk='54051386:6D286C', code=3: Duplicate entry '54051386:6D286C' for key 1
>
> I suppose this is something that should not happen when one is feeding
> a database?
Yes, these errors seem to be caused by concurrency between several
feeders. This is not safe, and the DB content may now be inconsistent.
So I recommend that you stop all your running instances, clear the DB
content (command "rbh-config empty_db"), and then start only a single
scanning instance.
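The recovery sequence could look like this on the command line (a sketch only: the init-script name and the --scan/--detach options are assumptions about your setup, while "rbh-config empty_db" is the command named above):

```
# 1. Stop every running robinhood instance, on all nodes:
service robinhood stop        # or: pkill robinhood

# 2. Clear the (possibly inconsistent) database content:
rbh-config empty_db

# 3. Restart a single instance in scan mode:
robinhood --scan --detach     # option names assumed; see robinhood --help
```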
Best regards,
Thomas.