ltop with multiple Lustre filesystems


dosman

Apr 21, 2011, 10:43:29 AM
to lmt-d...@googlegroups.com
Hi,
I have ltop working on my test cluster but I can't seem to get a
second or third filesystem to work with it.

# lmtinit -l
dc
dcwan
rack5

# ltop -f dc
ltop: no data found for file system `dc'

I get the same for dcwan. I'm running lmt 3.1.2. My test cluster is on
Lustre 1.8.5, but my other two filesystems are 1.8.1.1. I restarted
cerebro on my headnode after doing the lmtinit for the new systems.
Suggestions?

Thanks,
-Nathan

dosman

Apr 21, 2011, 3:11:39 PM
to lmt-d...@googlegroups.com
Still having problems, but found some more info. Running "ltop -f
lustre" works and appears to find one oss host from the filesystem dc.
Interestingly, the only OST it shows is one which we recently lost
from hardware problems and had to recreate and reformat in February
this year. It appears ltop is picking up on the lustre internal
filesystem name rather than what I used with lmtinit.

It may help to know that filesystem dc started out as Lustre 1.4 so it
has some out-dated naming methods used for UUID and other components.
Dcwan started out as Lustre 1.6, but I still can't get ltop to do
anything with it.

Thanks,
-Nathan


Christopher J. Morrone

Apr 21, 2011, 4:18:18 PM
to lmt-d...@googlegroups.com
Are your different filesystems on the multicast network?

Note that (to the best of my knowledge) lmtinit and ltop are unrelated
in lmt 3. ltop now uses cerebro directly to report information; it does
not use the MySQL database at all. lmtinit, on the other hand, is only
used to initialize the MySQL database.

Here's something you could try to diagnose the problem:

On the node where you want to run ltop, do:

/usr/sbin/cerebro-stat -m lmt_mdt

What do you see?

On a test cluster where we have two filesystems, I see this:

tycho-mds1:
1;tycho-mds1;0.000000;1.281737;lc1-MDT0000;414108881;463677625;1656435524;1688473892;18;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;2;0;0;0;0;0;5;0;0;1;0;0;1163432;0;0;0;0;0;0;0;0;0;0;0;1;0;0;96;0;0;493;0;0;0;0;0
tycho-mds2:
1;tycho-mds2;0.000000;20.907624;lc2-MDT0000;309666155;309666834;1238664620;1238815968;113125622;0;0;56507168;0;0;599585;0;0;558998;0;0;48266055;0;0;34438276;0;0;34365413;0;0;231593;0;0;0;0;0;2;0;0;277;0;0;134;0;0;273;0;0;4276276;0;0;1502;0;0;1462;0;0;968285;0;0;93182986;0;0;48;0;0;217;0;0;0;0;0

Note the "lc1-MDT0000" and "lc2-MDT0000" strings that label the mdts.
Note that in lmt the "lc1" and "lc2" parts are assumed to be the names
of the filesystem.
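
As a rough illustration of reading those labels, here is a minimal Python sketch. It assumes that the fifth semicolon-separated field of an lmt_mdt record is the MDT label (as it is in the sample above) and that the filesystem name is whatever precedes the first hyphen. Both function names are made up for this example and the split rule is an assumption, not something taken from the lmt source.

```python
def mdt_label(record):
    """Return the fifth ';'-separated field, assumed to be the MDT label."""
    return record.split(";")[4]


def assumed_fsname(label):
    """Assumption: the filesystem name is the text before the first hyphen."""
    return label.split("-", 1)[0]


# Record shortened from the tycho-mds1 sample above.
record = "1;tycho-mds1;0.000000;1.281737;lc1-MDT0000;414108881;463677625"
label = mdt_label(record)
print(label, assumed_fsname(label))
```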

You appear to be treating the filesystem as named "lustre" in the command
"ltop -f lustre", but then saying that the filesystem is named "dc". What is
the actual name of your filesystem? Note that the name of the mount
point on the clients is completely irrelevant here. The mount point
could be named anything.

After you figure that out, you can also use cerebro-stat to see what
nodes you are getting information from for the metric "lmt_ost".
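
If the output is long, a small script can list just the reporting nodes. The sketch below is only an illustration: it assumes each record in the cerebro-stat output starts with "hostname:" (as in the sample above) and that wrapped continuation lines contain no colon. The function name and the shortened sample text are made up for this example.

```python
def reporting_nodes(stat_output):
    """Return hostnames from lines of the form 'hostname: payload'."""
    nodes = []
    for line in stat_output.splitlines():
        host, sep, _payload = line.partition(":")
        host = host.strip()
        # Skip continuation lines: they have no colon, or start with field data.
        if sep and host and ";" not in host:
            nodes.append(host)
    return nodes


# Shortened stand-in for the output of "cerebro-stat -m lmt_mdt".
sample = """tycho-mds1:
1;tycho-mds1;0.000000;1.281737;lc1-MDT0000;414108881
tycho-mds2:
1;tycho-mds2;0.000000;20.907624;lc2-MDT0000;309666155"""
print(reporting_nodes(sample))
```

The same function could be fed the saved output of "cerebro-stat -m lmt_ost" to check which OSS nodes are reporting.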

Chris

dosman

Apr 21, 2011, 5:27:56 PM
to lmt-d...@googlegroups.com
Hi Chris,
Here is my output:

/usr/sbin/cerebro-stat -m lmt_mdt
mds01: 1;mds01;2.816180;99.443103;mds-dc;27887837;217526400;742639604;761246852;1116524878;0;0;452067805;0;0;865784;0;0;4402;0;0;26225127;0;0;1170466;0;0;1105360;0;0;1660957;0;0;7832;0;0;2;0;0;8824;0;0;7981;0;0;7419;0;0;6196157;0;0;492;0;0;509;0;0;550165989;0;0;394191904;0;0;149;0;0;442;0;0;0;0;0
mds03: 1;mds03;0.349825;95.817841;mds-wan;57396060;142082048;486786200;497184424;183939442;0;0;86699346;0;0;471;0;0;30;0;0;564650;0;0;109585;0;0;30075;0;0;659897;0;0;232939;0;0;0;0;0;10557;0;0;16344;0;0;7015;0;0;5493522;0;0;48;0;0;56;0;0;1895203;0;0;192098572;0;0;132;0;0;269;0;0;0;0;0
oss19: 1;oss19;0.000000;35.634165;rack5-MDT0000;214167;214238;856668;874872;412;0;0;206;0;0;0;0;0;0;0;0;201;0;0;0;0;0;0;0;0;0;0;0;0;0;0;1;0;0;1;0;0;1;0;0;0;0;0;1537532;0;0;4;0;0;0;0;0;1;0;0;220;0;0;4;0;0;25;0;0;0;0;0

So by your methodology dc is really named mds-dc and dcwan is mds-wan;
both of those agree with the filesystem name used when mounting the
mdt. However, the output from a command like tunefs.lustre does not
agree: it says the filesystem I refer to as dc is named "lustre" and
dcwan is "client" when examining an OST. Of course my test system
(rack5/oss19) was freshly built on Lustre 1.8.5 and conforms to what
you are saying.

I removed the lmtinit definitions for dc and dcwan and recreated one
for mds-dc; however, that isn't working either:

# ltop -f mds-dc
ltop: no data found for file system `mds-dc'
# ltop -f mds-wan
ltop: no data found for file system `mds-wan'

Lastly, using "cerebro-stat -m lmt_ost" shows output from all of my
OSSes from all three filesystems. Here is one OSS for the dc filesystem:

cerebro-stat -m lmt_ost
oss01: 2;oss01;0.715746;94.795803;dc-ost1-t01-sdb;261375068;263103748;1045500272;3784651784;3756176770763;28863622242661;41758735;1415;0;0;0;4820;2915;COMPLETE 4/4 0s remaining;dc-ost1-t02-sdt;260874673;262708867;1043498692;3784651784;4331518473912;32251512881465;45210196;1415;0;0;0;4818;2913;COMPLETE 4/4 0s remaining;dc-ost1-t03-sdc;240891607;242671012;963566428;3784651784;2741219238556;24900907294852;36673914;1415;0;0;0;4818;2920;COMPLETE 4/4 0s remaining;dc-ost1-t04-sdu;254211260;255954932;1016845040;3784651784;3100483272556;26655848666906;39400430;1415;0;0;0;4818;2914;COMPLETE 4/4 0s remaining;dc-ost1-t05-sdd;266762617;268605543;1067050468;3784651784;2784307827359;26934858953840;39328771;1415;0;0;0;4818;2911;COMPLETE 4/4 0s remaining;dc-ost1-t06-sdv;270301030;272136024;1081204120;3784651784;3449114698240;31016276644389;44388107;1415;0;0;0;4820;2982;COMPLETE 4/4 0s remaining;


Thanks,
-Nathan

Christopher J. Morrone

Apr 22, 2011, 2:27:38 PM
to lmt-d...@googlegroups.com
On 04/21/2011 02:27 PM, dosman wrote:
> Hi Chris,
> Here is my output:
>
> /usr/sbin/cerebro-stat -m lmt_mdt
> mds01: 1;mds01;2.816180;99.443103;mds-dc;27887837;217526400;742639604;761246852;1116524878;0;0;452067805;0;0;865784;0;0;4402;0;0;26225127;0;0;1170466;0;0;1105360;0;0;1660957;0;0;7832;0;0;2;0;0;8824;0;0;7981;0;0;7419;0;0;6196157;0;0;492;0;0;509;0;0;550165989;0;0;394191904;0;0;149;0;0;442;0;0;0;0;0
> mds03: 1;mds03;0.349825;95.817841;mds-wan;57396060;142082048;486786200;497184424;183939442;0;0;86699346;0;0;471;0;0;30;0;0;564650;0;0;109585;0;0;30075;0;0;659897;0;0;232939;0;0;0;0;0;10557;0;0;16344;0;0;7015;0;0;5493522;0;0;48;0;0;56;0;0;1895203;0;0;192098572;0;0;132;0;0;269;0;0;0;0;0
> oss19: 1;oss19;0.000000;35.634165;rack5-MDT0000;214167;214238;856668;874872;412;0;0;206;0;0;0;0;0;0;0;0;201;0;0;0;0;0;0;0;0;0;0;0;0;0;0;1;0;0;1;0;0;1;0;0;0;0;0;1537532;0;0;4;0;0;0;0;0;1;0;0;220;0;0;4;0;0;25;0;0;0;0;0

Ah, I see. I think lmt is making a bad assumption about device naming.
It assumes that the mdt or ost device name always begins with the
filesystem name. That may be the default naming scheme of the newer
lustre filesystem creation tools, but I am guessing that it is by no
means a requirement.
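
To make the suspected mismatch concrete, here is a small Python sketch. It assumes, without having checked the lmt source, that the name given to "ltop -f" is compared against the text before the first hyphen of each device label. The helper names are hypothetical; the labels are the ones from your lmt_mdt output.

```python
def label_prefix(label):
    """Assumption: lmt treats the text before the first hyphen as the fs name."""
    return label.split("-", 1)[0]


def labels_for(fs, labels):
    """Return the labels whose assumed prefix equals the requested fs name."""
    return [label for label in labels if label_prefix(label) == fs]


# MDT labels reported in this thread.
labels = ["mds-dc", "mds-wan", "rack5-MDT0000"]
for fs in ("rack5", "mds-dc", "dc"):
    print(fs, labels_for(fs, labels))
```

Under that assumption only "rack5" matches anything, which would be consistent with "no data found" for both "dc" and "mds-dc".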

So this sounds like an LMT bug.

>
> So by your methodology dc is really named mds-dc and dcwan is mds-wan;
> both of those agree with the filesystem name used when mounting up the
> mdt. However the output from a command like tunefs.lustre does not
> agree, that says the filesystem I refer to as dc is named "lustre" and
> dc-wan is "client" when examining an OST. Of course my test system
> (rack5/oss19) was freshly built on Lustre 1.8.5 and confirms to what
> you are saying.
>
> I removed the lmtinit definitions for dc and dc-wan and recreated one
> for mds-dc, however that isn't working either:

Don't waste your time with lmtinit for ltop. Like I said, lmtinit is
only used to configure the MySQL database which is not used at all by ltop.
