[Lustre-discuss] [robinhood-support] robinhood error messages

LEIBOVICI Thomas

Nov 24, 2010, 7:20:32 AM

to t.r...@gsi.de, robinhoo...@lists.sourceforge.net, lustre-...@lists.lustre.org

Hi Thomas,

We have already observed this, typically after the filesystem was blocked for a
while, or after an OSS had crashed.
If an operation is stuck for too long (the default timeout is 1 hour), robinhood tries
to cancel its operation on the current directory and continues with the next
one.
Maybe it did not recover successfully from this cancellation, and you
have been receiving those messages ever since.

To avoid this problem, you can increase the timeout to a very high
value, to make sure it is never reached (e.g. xxx days).
In that case, robinhood will remain stuck as long as its current
operation in Lustre is blocked,
and it will resume the current operation as soon as Lustre is back.

You can change this timeout by setting the "scan_op_timeout" parameter
in the "FS_Scan" section of the config file.

Alternatively, you can also keep a reasonable timeout and make robinhood
exit when the filesystem is not responding
by setting "exit_on_timeout = TRUE" in the same section of the config.
You can then respawn the robinhood daemon once everything is fixed.
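
As a rough sketch of the two options above, the FS_Scan section might look
like the fragment below. The names "scan_op_timeout" and "exit_on_timeout"
are the ones given in this message; the block layout, the duration values
and their syntax ("30d", "1h") are assumptions for illustration, so check
them against the configuration template shipped with your robinhood version.
Use one variant or the other, not both.

```
# Variant 1: very long timeout so that it is never reached
# (30d is an arbitrary illustrative value, not a recommendation)
FS_Scan {
    scan_op_timeout = 30d;
}

# Variant 2: keep a reasonable timeout and make robinhood exit when the
# filesystem stops responding, so the daemon can be respawned afterwards
FS_Scan {
    scan_op_timeout = 1h;
    exit_on_timeout = TRUE;
}
```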

Best regards,
Thomas LEIBOVICI
CEA/DAM

> A support request from lustre-discuss.
>
> ------------------------------------------------------------------------
>
> Subject:
> [Lustre-discuss] robinhood error messages
> From:
> Thomas Roth <t.r...@gsi.de>
> Date:
> Tue, 23 Nov 2010 20:20:33 +0100
> To:
> lustre-...@lists.lustre.org
>
>
> Hi all,
>
> we are running robinhood (v2.2.1) on our 1.8.4 cluster (basically to
> find out where and who the big space consumers are - no purging).
>
> Robinhood sends me lots and lots of messages (~100/day) of the type
>
> > ===== FS scan is blocked (/lustre) =====
> > Date: 2010/11/23 20:05:22
> > Program: robinhood (pid 4826)
> > Host: lxb310
> > Filesystem: /lustre
> > A thread has been inactive for 3660 sec
> > while scanning directory /lustre/....
>
> This seems to indicate some trouble accessing certain directories on the
> node where robinhood is running. However, this is independent of the
> node, and at the same time we neither see any issues / slowness/
> connectivity problems nor get any user complaints of the like.
>
> So I wonder whether anybody else is using robinhood and has seen similar
> messages.
>
> Regards,
> Thomas
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-...@lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> robinhood-support mailing list
> robinhoo...@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/robinhood-support
>

Thomas Roth

Nov 24, 2010, 8:00:30 AM

to LEIBOVICI Thomas, robinhoo...@lists.sourceforge.net, lustre-...@lists.lustre.org

Thank you Thomas.
If these messages mean that robinhood just continues after the timeout,
it would be nothing to worry about, but I will try to adapt the timeout
anyhow.
Right now, however, it seems the scan is really stuck: for days,
rbh-report -i tells me about 612 TB in the filesystem, but lfs df says
we have 787 TB ;-)

Btw, whenever I restart the scan, e.g. after a reconfiguration such as
for the timeout, I get the logfile full of
> ListMgr | DB query failed in ListMgr_Insert line 340...
and assorted messages, which seem to indicate that the new robinhood
scan tries to put something into the DB that is already there, and
stumbles on this. Or maybe that happens when several robins are running
simultaneously. I'm not sure if it is a problem for the scan, it is,
however, a problem for the free space on /var, or wherever I point the
log to ;-)

Regards,
Thomas

--
--------------------------------------------------------------------
Thomas Roth IT-HPC-Linux
Location: SB3 1.262 Phone: +49-6159-71 1453

http://twitter.com/gsi_it

LEIBOVICI Thomas

Nov 24, 2010, 9:17:06 AM

to Thomas Roth, robinhoo...@lists.sourceforge.net, lustre-...@lists.lustre.org

Thomas Roth wrote:
> Thank you Thomas.
> If these messages mean that robinhood just continues after the
> timeout, it would be nothing to worry about, but I will try to adapt
> the timeout anyhow.
> Right now, however, it seems the scan is really stuck: for days,
> rbh-report -i tells me about 612 TB in the filesystem, but lfs df says
> we have 787 TB ;-)

A couple of such messages would not be a big deal, but 100s/day over
several days is not normal... I suspect a problem with timeout handling in
robinhood that leads to this blocking. That's why I suggest avoiding
timeouts by increasing the timeout value.

> Btw, whenever I restart the scan, e.g. after a reconfiguration such as
> for the timeout, I get the logfile full of

Tip: to change a scalar parameter like this one, you do not need to fully
restart the daemon. "service robinhood reload" or "kill -HUP" on the
process is enough.

> > ListMgr | DB query failed in ListMgr_Insert line 340...
> and assorted messages, which seem to indicate that the new robinhood
> scan tries to put something into the DB that is already there, and
> stumbles on this. Or maybe that happens when several robins are
> running simultaneously.

Are you running several instances for scanning the same filesystem??

Thomas Roth

Nov 24, 2010, 11:58:04 AM

to LEIBOVICI Thomas, robinhoo...@lists.sourceforge.net, lustre-...@lists.lustre.org

On 24.11.2010 15:17, LEIBOVICI Thomas wrote:
> Thomas Roth wrote:

> > > ListMgr | DB query failed in ListMgr_Insert line 340...
> > and assorted messages, which seem to indicate that the new robinhood
> > scan tries to put something into the DB that is already there, and
> > stumbles on this. Or maybe that happens when several robins are
> > running simultaneously.
> Are you running several instances for scanning the same filesystem??

Well, yes, tried that also. Actually I was under the impression that
this is a feature of Robinhood - of course, now that I am looking for
this in the documentation I can't find it.

But these errors from the DB definitely did arise first when I restarted
robinhood anew after some changes (location of log file, debug level,
...) in the config file. But since there was no change in the robinhood
version, I did not empty the database. After this restart, I immediately
got a lot of
> 2010/11/04 11:27:45 robinhood[1489/4]: EntryProc | Error 3 performing database operation.
> 2010/11/04 11:27:45 robinhood[1489/8]: ListMgr | DB query failed in ListMgr_Insert line 340: pk='54051386:6D286C', code=3: Duplicate entry '54051386:6D286C' for key 1

I suppose this is something that should not happen when one is feeding a
database?

Cheers,

LEIBOVICI Thomas

Nov 25, 2010, 4:07:58 AM

to Thomas Roth, robinhoo...@lists.sourceforge.net, lustre-...@lists.lustre.org

Hello Thomas,

Sorry, I only just saw the email you sent to the robinhood-support mailing
list; it was held waiting for admin validation.
About multiple robinhood instances, the documentation says that you can
split the features across different nodes:
basically, the database server can run on one machine, the FS scan on another
machine, disk resource monitoring and purging on another machine, etc.
But you must run only a single instance of each feature at a given time.

Thomas Roth wrote:
>> Is there a way to "partition" a file system for Robinhood? Tell an
>> instance to only scan certain directories? Because I think the issue is
>> not a really broken data base, but simply a later coming Robin scanning
>> files that were already done?
What exactly do you need? Do you want to speed up the scan by running
several robinhood instances,
or do you only want to scan certain directories?
- About speed: robinhood already performs scans in parallel with
multiple threads, each one scanning different directories.
So if you want more parallelism, increase the number of scan threads.
- If you need to scan only some parts of the namespace, you can
ignore directories by specifying "ignore" rules in the configuration
file (FS_Scan section).
E.g. ignore { path == "/lustre/xyz*" } if you know the path you want to
ignore, or a negation:
ignore { not ( path == "/lustre/dir1" or path == "/lustre/dir2/subdir*"
) } if you know the paths you want to scan.
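
Laid out as they might appear in the config file, the two "ignore" rules
above could be sketched as follows. The rule bodies are copied from this
message; the surrounding block layout and the "#" comment syntax are
assumptions, and the paths are only the placeholders used above. Use one
variant or the other.

```
# Variant 1: skip a subtree you know you do not want to scan
FS_Scan {
    ignore {
        path == "/lustre/xyz*"
    }
}

# Variant 2: scan only the listed paths by ignoring everything else
FS_Scan {
    ignore {
        not ( path == "/lustre/dir1" or path == "/lustre/dir2/subdir*" )
    }
}
```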

>> > > ListMgr | DB query failed in ListMgr_Insert line 340...
>> > and assorted messages, which seem to indicate that the new robinhood
>> > scan tries to put something into the DB that is already there, and
>> > stumbles on this. Or maybe that happens when several robins are
>> > running simultaneously.
>> Are you running several instances for scanning the same filesystem??
>
> Well, yes, tried that also. Actually I was under the impression that
> this is a feature of Robinhood - of course, now that I am looking for
> this in the documentation I can't find it.
>
> But these errors from the DB definitely did arise first when I
> restarted robinhood anew after some changes (location of log file,
> debug level, ...) in the config file. But since there was no change in
> the robinhood version, I did not empty the database. After this
> restart, I immediately got a lot of
> > 2010/11/04 11:27:45 robinhood[1489/4]: EntryProc | Error 3 performing database operation.
> > 2010/11/04 11:27:45 robinhood[1489/8]: ListMgr | DB query failed in ListMgr_Insert line 340: pk='54051386:6D286C', code=3: Duplicate entry '54051386:6D286C' for key 1
>
> I suppose this is something that should not happen when one is feeding
> a database?

Yes, these errors seem to be caused by several feeders running
concurrently. This is not a sane state, and the DB content may now be
inconsistent.
So I recommend that you stop all your running instances, clear the DB
content (command "rbh-config empty_db"),
and then start only a single instance for scanning.

Best regards,
Thomas.
