check_hw_ib minimum rate (not exact)

Ryan Novosielski

Aug 22, 2016, 9:06:48 PM
to n...@lbl.gov
It would be great if there were a way to specify a minimum rate for IB (we have a mixture of 40 and 56). To my knowledge, there isn’t one currently. I guess I could list the machines that have different data rates.

---
check_hw_ib

check_hw_ib rate [device]

check_hw_ib determines whether or not an active IB link is present with the specified data rate (in Gb/sec). Version 1.3 and later support the device parameter for specifying the name of the IB device. Version 1.4.1 and later also verify that the kernel drivers and userspace libraries are the same OFED version.

Example (QDR Infiniband): check_hw_ib 40
---

Thanks for this great tool!

--
____
|| \\UTGERS, |---------------------------*O*---------------------------
||_// the State | Ryan Novosielski - novo...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
|| \\ of NJ | Office of Advanced Research Computing - MSB C630, Newark
`'

Michael Jennings

Aug 23, 2016, 12:24:40 AM
to LBNL Node Health Check
That's eminently doable, though not something that's supported in the
current code.

To support this properly, I'll need to rewrite the check to use the
more modern getopt-based scheme, which uses command-line options rather
than fixed-position parameters to define which options are supported
and what each one means. So it's a somewhat weightier change than a
simple "let's add a new flag" patch. Having said that, though, it's
still pretty straightforward stuff. Shouldn't be too difficult. :-)

Right now the state, device, and rate parameters are all checked as
exact string matches; for the rewrite, I'll most likely make them all
match strings. I could also support, in the case of the rate
parameter, a prefix of a comparison operator (e.g., '>' or '<=') so
that it's not simply a match test but potentially an arithmetic
comparison. Hopefully that would adequately address your use case....
:-)
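
The operator-prefix idea above can be sketched in plain bash. This is only an illustration under stated assumptions, not the NHC implementation: the function name `rate_matches` is hypothetical, rates are assumed to be whole numbers in Gb/sec, and a bare number is treated as an exact match.

```bash
#!/usr/bin/env bash
# Hypothetical helper, NOT part of NHC: test an actual IB rate against
# a spec such as "40", ">=40", or "<56".
# Returns 0 if the rate satisfies the spec, 1 if it does not,
# and 2 if the spec or the rate cannot be parsed.
rate_matches() {
    local spec="$1" actual="$2" op want
    # Optional comparison operator followed by an integer rate.
    local re='^(>=|<=|>|<|=)?([0-9]+)$'

    if [[ "$spec" =~ $re ]]; then
        op="${BASH_REMATCH[1]:-=}"    # no operator means exact match
        want=$(( 10#${BASH_REMATCH[2]} ))
    else
        return 2
    fi
    [[ "$actual" =~ ^[0-9]+$ ]] || return 2
    actual=$(( 10#$actual ))          # force base 10 (no octal surprises)

    # The exit status of the arithmetic test is the function's result.
    case "$op" in
        '=')  (( actual == want )) ;;
        '>')  (( actual >  want )) ;;
        '>=') (( actual >= want )) ;;
        '<')  (( actual <  want )) ;;
        '<=') (( actual <= want )) ;;
    esac
}
```

With a helper along these lines, `rate_matches '>=40' 56` would succeed on a mixed 40/56 fabric, while `rate_matches 40 56` would still fail as an exact match does today.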

Would you mind creating an Issue on GitHub for this? That way I can
track it against the milestone of the upcoming 1.4.3 release and make
sure everything comes together correctly and in a timely manner. I'm
targeting SC16 for the release, or before if possible, but there are a
number of "irons in the fire" to pull together, so I want to make sure
everything is tracked cleanly and nothing gets dropped or overlooked.

Thanks!
Michael

--
Michael Jennings <m...@lbl.gov>
Senior HPC Systems Engineer
High-Performance Computing Services
Lawrence Berkeley National Laboratory
Bldg 50B-3209E W: 510-495-2687
MS 050B-3209 F: 510-486-8615

Bidwell, Matt

Aug 23, 2016, 9:26:52 AM
to n...@lbl.gov
I would agree with your second thought and use a node range to specify the speed for each set of nodes. I would avoid a single minimum for the whole system, because you wouldn't know when an FDR link had negotiated down to QDR speeds.
For example:
{n0[001-100]} || check_hw_ib 56 mlx4_0:1
{n0[101-200]} || check_hw_ib 40 mlx4_0:1
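
The difference between the two approaches can be shown with a small bash sketch. The function names and the sample rate are hypothetical and only illustrate the comparison; this is not how NHC reads the link rate.

```bash
#!/usr/bin/env bash
# Hypothetical illustration, not NHC code.
exact_ok()   { (( $2 == $1 )); }   # args: expected actual
minimum_ok() { (( $2 >= $1 )); }   # args: minimum  actual

actual=40   # an FDR (56) port whose link negotiated down to QDR (40)

# A fabric-wide minimum of 40 still passes, so the problem goes unnoticed.
if minimum_ok 40 "$actual"; then
    echo "minimum 40: pass (degradation missed)"
fi

# A per-node exact rate of 56 fails, so the problem is reported.
if ! exact_ok 56 "$actual"; then
    echo "exact 56: fail (degradation caught)"
fi
```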

Ryan Novosielski

Aug 31, 2016, 4:30:42 PM
to Michael Jennings, LBNL Node Health Check
Created Issue #21 for this. Let me know if I can be of some help somehow.