[Rocks-Discuss] Infiniband issue w/ Rocks

Lundrigan, Adam

unread,

May 7, 2009, 3:58:53 PM5/7/09

to npaci-rocks...@sdsc.edu

Hi All,

We are in the process of converting a 108-core Sun HPC grid to use Rocks
5.1. I have lopped off three nodes from the Solaris grid to build a
test grid, and have run into a problem getting Infiniband to work. We
are evaluating the Mellanox OFED roll (mlnx-ofed) from Clustercorp on
our hardware; the roll installs completely, and brings up the
interfaces, however the following error shows up in the logs
perodically:

ib0: multicast join failed for
ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -22

And I've noticed a mismatch in the speeds reported by ibstat and
ibstatus. The full description of our IB setup and outputs of various
command can be found in this document:
ftp://ocgftp.nfl.dfo-mpo.gc.ca/outgoing/adam/Rocks-IB/Infiniband_Issue_R
ocks_5.1_mlnx-ofed_roll.doc. Quick note about our configuration: We
have a 9 port switch (Sun) which has the IB subnet manager, and a
24-port switch made by Flextronics chained off it, as that one has no IB
subnet manager of it's own. Most of the nodes (including the three
IB-connected Rocks machines) hang off the 24-port switch.

Has anyone out there had success in using Sun IB HCAs (we have this one:
http://www.sun.com/products/networking/infiniband/ibhcaPCI-E/specs.xml)
with the OFED stack to build an IB-enabled Rocks cluster? I'm very new
to Rocks....any advice you could give on how to go about adding
Infiniband support to a stock Rocks 5.1 cluster?

Thanks for your time,
--
Adam Lundrigan
Computer Systems Programmer
Biological & Physical Oceanography Section
Science, Oceans & Environment Branch
Department of Fisheries and Oceans Canada
Northwest Atlantic Fisheries Centre
St. John's, Newfoundland & Labrador
CANADA A1C 5X1

Tel: (709) 772-8136
Fax: (709) 772-8138
Cell: (709) 277-4575
Office: G10-117J
Email: Adam.Lu...@dfo-mpo.gc.ca

-------------- next part --------------
An HTML attachment was scrubbed...
URL: https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/attachments/20090507/c05965de/attachment.html

Michael Duncan

unread,

May 7, 2009, 4:28:40 PM5/7/09

to Discussion of Rocks Clusters

I haven't used the Sun HCAs, but my most recent experience with the
mellanox roll from Clustercorp was not that good. While it installed
fine (like you said), the IB fabric just wasn't robust.

Once we reverted to the Cisco OFED roll (1.3) the fabric worked
perfectly.

My guess is there is something wrong in the roll, or the way Clustercorp
is creating the roll, or the mellanox drivers themselves.

I haven't had a chance to look into yet.

------------------
Michael Duncan
Systems Analyst
mdu...@x-iss.com

Hi All,

NOTICE:
This message may contain privileged or otherwise confidential information.
If you are not the intended recipient, please immediately advise the sender
by reply email and delete the message and any attachments without using,
copying or disclosing the contents.

Tim Carlson

unread,

May 7, 2009, 4:50:53 PM5/7/09

to Discussion of Rocks Clusters

On Thu, 7 May 2009, Lundrigan, Adam wrote:

Very nicely phrased question which will most likely get you some well
thought out answers and/or followup questions.

What is the output of "ibhosts". I ask becuase to me it looks like your
IB network is up and you might just have a routing problem on the IP side.
I'd rather see you choose an address space other than 10.2 for your IPoIB
network give that you are using 10.0 for your regular ethernet network.

To me, the output of ibstat and ibstatus looks fine. You've got a SDR IB
network from the looks of things and it is running at 10Gb/s.

Tim

Lundrigan, Adam

unread,

May 7, 2009, 5:55:57 PM5/7/09

to lan...@scalableinformatics.com, Discussion of Rocks Clusters, mdu...@x-iss.com, Power, Debbie

Joe, Michael,

Thank you for your input.

I've downloaded the Mellanox ISO, however it doesn't support CentOS. I
tried to force mlnx_add_kernel_support.sh to work around it, but first
attempts did not succeed...I am working on that as an aside. The
Mellanox ISO included the source code for their OFED (OFED_1.4-mlnx8),
so I will move on to trying that instead, and come back to the firmware
flashing parts later.

If that falls through, I will go back to the stock OFED and try it.

-Adam

-----Original Message-----
From: Joe Landman [mailto:lan...@scalableinformatics.com]
Sent: Thursday, May 07, 2009 6:21 PM
To: Discussion of Rocks Clusters; mdu...@x-iss.com; Lundrigan, Adam
Subject: Re: [Rocks-Discuss] Infiniband issue w/ Rocks

Michael Duncan wrote:
> I haven't used the Sun HCAs, but my most recent experience with the
> mellanox roll from Clustercorp was not that good. While it installed
> fine (like you said), the IB fabric just wasn't robust.
>
> Once we reverted to the Cisco OFED roll (1.3) the fabric worked
> perfectly.
>
> My guess is there is something wrong in the roll, or the way
> Clustercorp is creating the roll, or the mellanox drivers themselves.
>

A few points:

1) OFED isn't that hard to install by itself.

wget \
http://www.openfabrics.org/downloads/OFED/ofed-1.4/OFED-1.4.tgz
tar -zxf OFED-1.4.tgz
cd OFED-1.4
./install --all -vvv

and it happily generates *all* of the RPMs for you (though you often
need other packages to make it work).

2) Mellanox adapters seem to be quite sensitive to firmware revisions.
I'd suggest going to their site, and have a look at what they have for
download

http://www.mellanox.com/content/pages.php?pg=firmware_download

They have an iso OFED image you can pull, which will even flash your
firmware for you. It is pretty nice.

This said, there appears to be firmware issues in terms of the late
model ConnectX, fast memory registration, and NFS over RDMA
functionality. We just ran into this for a JackRabbit for a customer.
Waiting for updated firmware.

Without the firmware upgrade, the cards were "stuck" on 40 Gbps
(QDR), even when plugged into our 10 Gb (SDR) switch. After the update,
all was well ... at least on the IB side. Apparently, the OFED stack
doesn't have all the Mellanox goodies the Mellanox stack has.

If you are running the stock kernels, use the Mellanox stack. If
you are running your own kernels, use the OFED stack. I think Tim and
his team built with 1.3.1 OFED. We are deploying 1.4.x for our storage
and clusters.

The multicast message isn't so important. The speed mismatch is.

Joe

>
> -----Original Message-----
> From: npaci-rocks-dis...@sdsc.edu
> [mailto:npaci-rocks-dis...@sdsc.edu] On Behalf Of
> Lundrigan, Adam
> Sent: Thursday, May 07, 2009 2:59 PM
> To: npaci-rocks...@sdsc.edu
> Subject: [Rocks-Discuss] Infiniband issue w/ Rocks
>
> Hi All,
>
> We are in the process of converting a 108-core Sun HPC grid to use
> Rocks 5.1. I have lopped off three nodes from the Solaris grid to
> build a test grid, and have run into a problem getting Infiniband to
> work. We are evaluating the Mellanox OFED roll (mlnx-ofed) from
> Clustercorp on our hardware; the roll installs completely, and brings
> up the interfaces, however the following error shows up in the logs
> perodically:
>
> ib0: multicast join failed for
> ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -22
>
> And I've noticed a mismatch in the speeds reported by ibstat and
> ibstatus. The full description of our IB setup and outputs of various

> command can be found in this document:
> ftp://ocgftp.nfl.dfo-mpo.gc.ca/outgoing/adam/Rocks-IB/Infiniband_Issue

> _R ocks_5.1_mlnx-ofed_roll.doc. Quick note about our configuration:

> We have a 9 port switch (Sun) which has the IB subnet manager, and a
> 24-port switch made by Flextronics chained off it, as that one has no
> IB subnet manager of it's own. Most of the nodes (including the three

> IB-connected Rocks machines) hang off the 24-port switch.
>
> Has anyone out there had success in using Sun IB HCAs (we have this
one:
> http://www.sun.com/products/networking/infiniband/ibhcaPCI-E/specs.xml

> ) with the OFED stack to build an IB-enabled Rocks cluster? I'm very

> new to Rocks....any advice you could give on how to go about adding
> Infiniband support to a stock Rocks 5.1 cluster?
>
> Thanks for your time,
> --
> Adam Lundrigan
> Computer Systems Programmer
> Biological & Physical Oceanography Section Science, Oceans &
> Environment Branch Department of Fisheries and Oceans Canada Northwest

> Atlantic Fisheries Centre St. John's, Newfoundland & Labrador
> CANADA A1C 5X1
>
> Tel: (709) 772-8136
> Fax: (709) 772-8138
> Cell: (709) 277-4575
> Office: G10-117J
> Email: Adam.Lu...@dfo-mpo.gc.ca
>
>

--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: lan...@scalableinformatics.com
web : http://www.scalableinformatics.com
http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423 x121
fax : +1 866 888 3112
cell : +1 734 612 4615

Lundrigan, Adam

unread,

May 7, 2009, 6:02:52 PM5/7/09

to Discussion of Rocks Clusters

Tim,

Thanks for your reply.

I agree that the problem appears to be in the IPoIB, rather than the fabric. The compute nodes with the Mellanox OFED roll are no longer available, as I'm trying to install the Mellanox OFED distribution directly from source at the moment to see if the outcome will be different, so I can't run the ibhosts command at the moment. However, I have run it in the past, and while there were a lot of errors (don't remember most of them, but I do remember one in particular that complained about a lack of subnet manager), it did detect that there were two switches and 19 HCAs on the fabric....so it appears that the fabric is working. However, there still appears to be a speed mismatch on the IB fabric. The ibstat command shows a rate of '10', ibstatus shows a rate of '10Gbps (4x)', but my main switch (the IBSM) shows that Port 8 (the port which connects the 9P and 24P switches) has a link width of 1X [switch console output attached]. Running ibportstate on the active port on compute-0-0 shows that the active link speed is 2.5Gbps (1X).

These Sun IB HCAs (Mellanox Technologies MT25208 InfiniHost III Ex running in Tavor compatibility mode) should be DDR, as should the switches (not 100% certain about the Flextronics 24p switch, though....it was very cheap for a 24p switch....and being a cheap switch, it has no management interface where I can log in and view the status of the ports :( )

I have checked that both adapters have the proper IP addresses, netmasks, broadcasts, etc...and that the machines have the proper entries in their respective routing tables. The only thing that I noticed was that the routing tables had this entry:

255.255.255.255 0.0.0.0 255.255.255.255 UH 0 0 0 eth0

...that shouldn't have any impact on unicast traffic, however. Everything looked A-OK for allowing IP traffic to flow over the links, but it just would not flow.

As a side note, I also noticed that the syslog records a message such as this when 'ifup ib0' is run:

ADDRCONF(NETDEV_UP): ib0: link is not ready

but we never see one that says the link is ready, so could it be possible that the link is actually fine like you say, but somehow the networking subsystem in Linux doesn't "get the message" and keeps the virtual interface (ib0) down? The traffic counters displayed by ifconfig are always zero, so nothing is moving over that link despite the IB subsystem saying everything is OK.

-Adam

-----Original Message-----
From: npaci-rocks-dis...@sdsc.edu [mailto:npaci-rocks-dis...@sdsc.edu] On Behalf Of Tim Carlson
Sent: Thursday, May 07, 2009 6:21 PM
To: Discussion of Rocks Clusters
Subject: Re: [Rocks-Discuss] Infiniband issue w/ Rocks

Tim

> _R ocks_5.1_mlnx-ofed_roll.doc. Quick note about our configuration:

> We have a 9 port switch (Sun) which has the IB subnet manager, and a
> 24-port switch made by Flextronics chained off it, as that one has no
> IB subnet manager of it's own. Most of the nodes (including the three
> IB-connected Rocks machines) hang off the 24-port switch.
>
> Has anyone out there had success in using Sun IB HCAs (we have this one:
> http://www.sun.com/products/networking/infiniband/ibhcaPCI-E/specs.xml

> ) with the OFED stack to build an IB-enabled Rocks cluster? I'm very

Joe Landman

unread,

May 7, 2009, 4:51:12 PM5/7/09

to Discussion of Rocks Clusters, mdu...@x-iss.com, Adam.Lu...@dfo-mpo.gc.ca

Michael Duncan wrote:
> I haven't used the Sun HCAs, but my most recent experience with the
> mellanox roll from Clustercorp was not that good. While it installed
> fine (like you said), the IB fabric just wasn't robust.
>
> Once we reverted to the Cisco OFED roll (1.3) the fabric worked
> perfectly.
>
> My guess is there is something wrong in the roll, or the way Clustercorp
> is creating the roll, or the mellanox drivers themselves.
>

A few points:

http://www.mellanox.com/content/pages.php?pg=firmware_download

Joe

>

--

Joe Landman

unread,

May 7, 2009, 6:37:30 PM5/7/09

to Lundrigan, Adam, Power, Debbie, Discussion of Rocks Clusters

Lundrigan, Adam wrote:
> Joe, Michael,
>
> Thank you for your input.
>
> I've downloaded the Mellanox ISO, however it doesn't support CentOS. I

Hmmm... I just installed it on a Centos 5.3 machine (and another 5.2
machine) just to be sure. It only works with the stock baseline kernel,
which means that an update can be problematic for it. If you want to
try it, try rebooting with the stock kernel, install it, and then reboot
with your normal kernel. I have done this on two machines so far.

> tried to force mlnx_add_kernel_support.sh to work around it, but first
> attempts did not succeed...I am working on that as an aside. The
> Mellanox ISO included the source code for their OFED (OFED_1.4-mlnx8),
> so I will move on to trying that instead, and come back to the firmware
> flashing parts later.

Ok

>
> If that falls through, I will go back to the stock OFED and try it.

OFED usually just works (though you need several additional things on
your build system, such as tcl-devel, bison, yacc, flex, pciutils, ...).

Joe

Lundrigan, Adam

unread,

May 7, 2009, 6:54:35 PM5/7/09

to lan...@scalableinformatics.com, Power, Debbie, Discussion of Rocks Clusters

Joe,

I am using the stock kernel that comes out of Rocks 5.1

2.6.18-92.1.13.el5

I haven't attempted to do a 'yum update' yet...last time I did that, it
broke kickstart, and I had to reinstall the frontend...but that's
another story.

When I try to run the Mellanox installer I get this:
[root@cnoofs-dev MLNX_OFED]# ./mlnxofedinstall
The 2.6.18-92.1.13.el5 kernel is installed, but do not have
drivers available.
Cannot continue.

According to the docs, I can build an ISO for my specific kernel this
way:
[root@cnoofs-dev MLNX_OFED]# docs/mlnx_add_kernel_support.sh -i
../MLNX_OFED_LINUX-1.4-rhel5.2.iso
ERROR: Linux Distribution (centos-release-5-2.el5.centos) is not
supported

...but it also fails. So I pulled the source out of the Mellanox ISO
and installed it. The IB adapters are up, and the IPoIB ib* devices are
created, but I am still back to the same point...no IPoIB joy.

-Adam

-----Original Message-----
From: Joe Landman [mailto:lan...@scalableinformatics.com]
Sent: Thursday, May 07, 2009 8:08 PM
To: Lundrigan, Adam
Cc: Discussion of Rocks Clusters; mdu...@x-iss.com; Power, Debbie
Subject: Re: [Rocks-Discuss] Infiniband issue w/ Rocks

Lundrigan, Adam

unread,

May 7, 2009, 7:17:04 PM5/7/09

to lan...@scalableinformatics.com, Discussion of Rocks Clusters, mdu...@x-iss.com, Power, Debbie

Joe, Michael,

I installed the Mellanox OFED from source on the Rocks frontend
(<Mellanox ISO>/src/OFED-1.4-mlnx8.tgz)....it completed and displayed
this at the end:

Device (15b3:6278):
84:00.0 InfiniBand: Mellanox Technologies MT25208 InfiniHost III
Ex (Tavor compatibility mode) (rev a0)
Link Width: 8x
Link Speed: 2.5Gb/s

I've attached the output of some IB diag commands.

The IPoIB devices (ib*) were not created until I ran 'modprobe ib_ipoib'

-Adam

Joe, Michael,

-Adam

A few points:

http://www.mellanox.com/content/pages.php?pg=firmware_download

Joe

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: ib_diag_frontend_mellanox_ofed_src.txt
Url: https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/attachments/20090507/dc6477e2/ib_diag_frontend_mellanox_ofed_src.txt

Lundrigan, Adam

unread,

May 7, 2009, 7:16:01 PM5/7/09

to lan...@scalableinformatics.com, Power, Debbie, Discussion of Rocks Clusters

I installed Mellanox OFED on a compute node (compute-0-0), rebooted it,
and the adapters failed to come up:

[root@compute-0-0 ~]# ifconfig ib0
ib0: error fetching interface information: Device not found

Until I ran 'modprobe ib_ipoib', then it works as expected:

[root@compute-0-0 ~]# ifconfig ib0
ib0 Link encap:InfiniBand HWaddr
80:00:04:04:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00
inet6 addr: fe80::203:ba00:100:582d/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:2044 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:1 errors:0 dropped:2 overruns:0 carrier:0
collisions:0 txqueuelen:256
RX bytes:0 (0.0 b) TX bytes:68 (68.0 b)

I don't think I should have had to load the module manually.....this
could be part of the problem?

Do a 'service restart network', and voila, the adapter is up:

[root@compute-0-0 ~]# ifconfig ib0
ib0 Link encap:InfiniBand HWaddr
80:00:04:04:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00
inet addr:10.2.255.254 Bcast:10.2.255.255
Mask:255.255.0.0
inet6 addr: fe80::203:ba00:100:582d/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:2044 Metric:1
RX packets:11 errors:0 dropped:0 overruns:0 frame:0
TX packets:8 errors:0 dropped:10 overruns:0 carrier:0
collisions:0 txqueuelen:256
RX bytes:872 (872.0 b) TX bytes:496 (496.0 b)

And we even get this in the syslog:

ADDRCONF(NETDEV_CHANGE): ib0: link becomes ready
ib0: no IPv6 routers present

Which I didn't get with the OFED roll from Clustercorp. Same story on
the head node:

[root@cnoofs-dev OFED]# ifconfig ib1
ib1 Link encap:InfiniBand HWaddr
80:00:04:05:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00
inet addr:10.2.0.1 Bcast:10.2.255.255 Mask:255.255.0.0
inet6 addr: fe80::203:ba00:100:57c6/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:2044 Metric:1
RX packets:199 errors:0 dropped:0 overruns:0 frame:0
TX packets:2 errors:0 dropped:16 overruns:0 carrier:0
collisions:0 txqueuelen:256
RX bytes:15532 (15.1 KiB) TX bytes:136 (136.0 b)

[root@cnoofs-dev OFED]# dmesg
ADDRCONF(NETDEV_CHANGE): ib1: link becomes ready
ib1: no IPv6 routers presen

Host tables are a little messy, but seem to be in order:

[root@compute-0-0 ~]# netstat -rn
Kernel IP routing table
Destination Gateway Genmask Flags MSS
Window irtt Iface

255.255.255.255 0.0.0.0 255.255.255.255 UH 0 0
0 eth0

142.130.249.19 10.0.0.1 255.255.255.255 UGH 0 0
0 eth0
10.2.0.0 0.0.0.0 255.255.0.0 U 0 0
0 ib0
10.0.0.0 0.0.0.0 255.255.0.0 U 0 0
0 eth0
169.254.0.0 0.0.0.0 255.255.0.0 U 0 0
0 ib0
224.0.0.0 0.0.0.0 240.0.0.0 U 0 0
0 eth0
0.0.0.0 10.0.0.1 0.0.0.0 UG 0 0
0 eth0

[root@cnoofs-dev OFED]# netstat -rn
Kernel IP routing table
Destination Gateway Genmask Flags MSS
Window irtt Iface

255.255.255.255 0.0.0.0 255.255.255.255 UH 0 0
0 eth0

142.130.249.19 10.0.0.1 255.255.255.255 UGH 0 0
0 eth0
142.130.249.0 0.0.0.0 255.255.255.0 U 0 0
0 eth1
10.2.0.0 0.0.0.0 255.255.0.0 U 0 0
0 ib1
10.0.0.0 0.0.0.0 255.255.0.0 U 0 0
0 eth0
10.1.0.0 0.0.0.0 255.255.0.0 U 0 0
0 eth2
169.254.0.0 0.0.0.0 255.255.0.0 U 0 0
0 ib1
224.0.0.0 0.0.0.0 240.0.0.0 U 0 0
0 eth0
0.0.0.0 142.130.249.254 0.0.0.0 UG 0 0
0 eth1

However still no IPoIB joy:

[root@cnoofs-dev OFED]# ping 10.2.255.254
PING 10.2.255.254 (10.2.255.254) 56(84) bytes of data.
From 10.2.0.1 icmp_seq=2 Destination Host Unreachable
From 10.2.0.1 icmp_seq=3 Destination Host Unreachable
From 10.2.0.1 icmp_seq=4 Destination Host Unreachable

--- 10.2.255.254 ping statistics ---
6 packets transmitted, 0 received, +3 errors, 100% packet loss,
time 5000ms
, pipe 3

-Adam

-----Original Message-----
From: Joe Landman [mailto:lan...@scalableinformatics.com]
Sent: Thursday, May 07, 2009 8:30 PM
To: Lundrigan, Adam
Cc: Discussion of Rocks Clusters; mdu...@x-iss.com; Power, Debbie
Subject: Re: [Rocks-Discuss] Infiniband issue w/ Rocks

Lundrigan, Adam wrote:
> Joe,

[...]

> ...but it also fails. So I pulled the source out of the Mellanox ISO
> and installed it. The IB adapters are up, and the IPoIB ib* devices
> are created, but I am still back to the same point...no IPoIB joy.

So if you type

ifconfig ib0

what do you get?

Joe Landman

unread,

May 7, 2009, 6:59:59 PM5/7/09

to Lundrigan, Adam, Power, Debbie, Discussion of Rocks Clusters

Lundrigan, Adam wrote:
> Joe,

[...]

> ...but it also fails. So I pulled the source out of the Mellanox ISO

> and installed it. The IB adapters are up, and the IPoIB ib* devices are
> created, but I am still back to the same point...no IPoIB joy.

So if you type

ifconfig ib0

what do you get?

Joe Landman

unread,

May 7, 2009, 7:34:45 PM5/7/09

to Lundrigan, Adam, Power, Debbie, Discussion of Rocks Clusters

Lundrigan, Adam wrote:
> I installed Mellanox OFED on a compute node (compute-0-0), rebooted it,
> and the adapters failed to come up:
>
> [root@compute-0-0 ~]# ifconfig ib0
> ib0: error fetching interface information: Device not found
>
> Until I ran 'modprobe ib_ipoib', then it works as expected:
>
> [root@compute-0-0 ~]# ifconfig ib0
> ib0 Link encap:InfiniBand HWaddr
> 80:00:04:04:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00
> inet6 addr: fe80::203:ba00:100:582d/64 Scope:Link
> UP BROADCAST RUNNING MULTICAST MTU:2044 Metric:1
> RX packets:0 errors:0 dropped:0 overruns:0 frame:0
> TX packets:1 errors:0 dropped:2 overruns:0 carrier:0
> collisions:0 txqueuelen:256
> RX bytes:0 (0.0 b) TX bytes:68 (68.0 b)
>
> I don't think I should have had to load the module manually.....this
> could be part of the problem?

Yes, this is the problem. If you create an
/etc/sysconfig/network-scripts/ifcfg-ib0 file, you can add into
/etc/modprobe.d/infiniband an alias that reads

alias ib0 ib_ipoib
alias ib1 ib_ipoib

and then

depmod -a

You may need to put this into your node customization script, we usually
put this into /etc/rc.local or similar. We have found some problems,
sometimes, with the way Redhat brings up its services (other distros
have similar problems). So more often than not, we find ourselves
forcing the issue on customers clusters.

Joe

Jason Bishop

unread,

May 7, 2009, 7:59:06 PM5/7/09

to Discussion of Rocks Clusters

Adam,

Wanted to give a bit of background/insight into what is going on
behind the scenes of the Clustercorp mlnx-ofed roll:

The Clustercorp mlnx-ofed roll packages together two Mellanox OFED 1.4
iso's - one which supports RHEL 5.2 and the other which supports
RHEL5.3. This enables the mlnx-ofed roll to support RHEL/Centos 5.2
and RHEL/Centos 5.3 using binary drivers from Mellanox. If you are
running a kernel other than one of these (such as an updated/errata
kernel) the drivers will be built from source for your specific kernel
(at headnode/compute install time).

The roll also checks the firmware on each hca against the latest GA
firmware released by Mellanox. Any downrev hca's are updated (as long
as the PSID associated with the hca is included in the firmware
release by Mellanox).

as an example:

[root@compute-0-0 ~]# tail mlnx-ofed-debug.out

140 ini files registered

probing devices

discovered dev: mt25204_pci_cr0
standard firmware for PSID[MT_0260000001] is 1.2.0
probed fw version 1.2.0
firmware up-to-date

We welcome feedback/bug reports etc. Either here on the list or you
can send to in...@clustercorp.com
I am running a test now with the 2.6.18-92.1.13.el5 kernel to see if
i can reproduce any errors with ipoib.

Jason
Clustercorp

Jason Bishop

unread,

May 7, 2009, 8:24:17 PM5/7/09

to Discussion of Rocks Clusters

Adam,

Here is an update from my testing:

Installed Rocks 5.1 using "os" roll and mlnx-ofed IE:

[root@blade ~]# rocks list roll
NAME VERSION ARCH ENABLED
base: 5.1 x86_64 yes
ganglia: 5.1 x86_64 yes
hpc: 5.1 x86_64 yes
java: 5.1 x86_64 yes
kernel: 5.1 x86_64 yes
os: 5.1 x86_64 yes
web-server: 5.1 x86_64 yes
mlnx-ofed: 5.1 x86_64 yes

ibstat reports:

[root@compute-0-0 ~]# ibstat
CA 'mthca0'
CA type: MT25204
Number of ports: 1
Firmware version: 1.2.0
Hardware version: a0
Node GUID: 0x003048798a560000
System image GUID: 0x0005ad000100d050
Port 1:
State: Active
Physical state: LinkUp
Rate: 20
Base lid: 1
LMC: 0
SM lid: 1
Capability mask: 0x02510a6a
Port GUID: 0x003048798a560001

I then setup ib0 on computes 0-0 and -0-1 and ran a ping -f between
them. i also include a snippit of the dmesg log:

[root@compute-0-1 ~]# ifconfig ib0

ib0 Link encap:InfiniBand HWaddr

80:00:04:04:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
inet addr:172.30.10.11 Bcast:172.30.255.255 Mask:255.255.0.0
inet6 addr: fe80::203:489:8a54:1/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:65520 Metric:1
RX packets:1687181 errors:0 dropped:0 overruns:0 frame:0
TX packets:1687176 errors:0 dropped:5 overruns:0 carrier:0
collisions:0 txqueuelen:256
RX bytes:141723064 (135.1 MiB) TX bytes:148471488 (141.5 MiB)

[root@compute-0-1 ~]# ssh compute-0-0 !!
ssh compute-0-0 ifconfig ib0
root@compute-0-0's password:

ib0 Link encap:InfiniBand HWaddr

80:00:04:04:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
inet addr:172.30.10.10 Bcast:172.30.255.255 Mask:255.255.0.0
inet6 addr: fe80::230:4879:8a56:1/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:65520 Metric:1
RX packets:1687181 errors:0 dropped:0 overruns:0 frame:0
TX packets:1687176 errors:0 dropped:5 overruns:0 carrier:0
collisions:0 txqueuelen:256
RX bytes:141723064 (135.1 MiB) TX bytes:148471488 (141.5 MiB)

[root@compute-0-1 ~]# ping -f 172.30.10.10
PING 172.30.10.10 (172.30.10.10) 56(84) bytes of data.

--- 172.30.10.10 ping statistics ---
4812945 packets transmitted, 4812945 received, 0% packet loss, time 214327ms
rtt min/avg/max/mdev = 0.028/0.031/0.838/0.005 ms, ipg/ewma 0.044/0.032 ms

[root@compute-0-1 ~]# uname -r
2.6.18-92.1.13.el5

[root@compute-0-1 ~]# dmesg | tail
PCI: Enabling device 0000:06:00.0 (0140 -> 0142)
ACPI: PCI Interrupt 0000:06:00.0[A] -> GSI 16 (level, low) -> IRQ 169
PCI: Setting latency timer of device 0000:06:00.0 to 64
ib0: enabling connected mode will cause multicast packet drops
ib0: mtu > 2044 will cause multicast packet drops.
ib0: mtu > 2044 will cause multicast packet drops.
NET: Registered protocol family 27

ADDRCONF(NETDEV_UP): ib0: link is not ready

ADDRCONF(NETDEV_CHANGE): ib0: link becomes ready
ib0: no IPv6 routers present

It looks like basic functionality checks out. If you are using real
mellanox hca's you might try updating the firmware and see if that
resolves.

Jason
Clustercorp

Lundrigan, Adam

unread,

May 7, 2009, 8:59:04 PM5/7/09

to Discussion of Rocks Clusters

Jason,

Thanks for your detailed information. There are only a few slight differences between how you set up your test and how my current cluster is set up:

*
I didn't install the HPC roll
*
ibstat reports "CA type: MT25208 (MT23108 compat mode)"
*
I used 10.2.0.0/16 as the subnet for IPoIB (10.2.0.1 for frontend, 10.2.255.254 and 10.2.255.253 for compute nodes) (I also tried placing two adapters on 172.16.16.0/24 and pinging....no luck)

I even see the same in the dmesg log.... complaint about MTU size -> link not ready -> link ready -> no ipv6 routers

If possible, could you post the routing tables for each machine (netstat -rn)?

I will give the roll another try and see what happens.....If I were a betting man, I would bet that the problem is something with the adapter. It's running in a compat mode...maybe it needs new firmware?

Thanks again for your help,

--
Adam Lundrigan
Computer Systems Programmer
Biological & Physical Oceanography Section
Science, Oceans & Environment Branch
Department of Fisheries and Oceans Canada
Northwest Atlantic Fisheries Centre
St. John's, Newfoundland & Labrador
CANADA A1C 5X1

Tel: (709) 772-8136
Fax: (709) 772-8138
Cell: (709) 277-4575
Office: G10-117J
Email: Adam.Lu...@dfo-mpo.gc.ca

________________________________

From: npaci-rocks-dis...@sdsc.edu on behalf of Jason Bishop
Sent: Thu 5/7/2009 9:54 PM
To: Discussion of Rocks Clusters

Subject: Re: [Rocks-Discuss] Infiniband issue w/ Rocks

Adam,

ibstat reports:

Jason
Clustercorp

-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/ms-tnef
Size: 11242 bytes
Desc: not available
Url : https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/attachments/20090507/e4c9f324/attachment.bin

Jason Bishop

unread,

May 7, 2009, 10:27:04 PM5/7/09

to Discussion of Rocks Clusters

Adam, The mlnx-ofed roll does depend on the hpc roll, so when you do
the reinstall please select hpc and mlnx-ofed. (in your case, i think
the dependency only effects your headnode config, not your compute
config).

I got my $0.02 bet on the firmware as well. If you can determine what
the PSID is for the hca you can easily update to the latest ga
firmware. what does the end of the /root/mlnx-ofed-debug.out contain?
(IE: what does flint -d /dev/mst/*cr0 q and flint -d /dev/mst/*cr0 dc
list)

One thing you can do to reduce variables is to connect two compute
nodes together with a single IB cable. This removes the switch from
the equation and may give some useful data. All you need to do is
login to one of the compute nodes and issue a "service opensmd start"
to bring up a subnet manager to manage the pair. the port should go
active on both computes at this point and you can then bring up the
ib0 interface.

compute netowrk information:

[root@blade ~]# tentakel "ibstat; netstat -rn"
### compute-0-0(stat: 0, dur(s): 0.13):

CA 'mthca0'
CA type: MT25204
Number of ports: 1
Firmware version: 1.2.0
Hardware version: a0
Node GUID: 0x003048798a560000
System image GUID: 0x0005ad000100d050
Port 1:
State: Active
Physical state: LinkUp
Rate: 20
Base lid: 1
LMC: 0
SM lid: 1
Capability mask: 0x02510a6a
Port GUID: 0x003048798a560001

Kernel IP routing table
Destination Gateway Genmask Flags MSS Window irtt Iface
255.255.255.255 0.0.0.0 255.255.255.255 UH 0 0 0 eth0

192.168.10.21 10.1.1.1 255.255.255.255 UGH 0 0 0 eth0
10.1.0.0 0.0.0.0 255.255.0.0 U 0 0 0 eth0
172.30.0.0 0.0.0.0 255.255.0.0 U 0 0 0 ib0
169.254.0.0 0.0.0.0 255.255.0.0 U 0 0 0 eth0

224.0.0.0 0.0.0.0 240.0.0.0 U 0 0 0 eth0

0.0.0.0 10.1.1.1 0.0.0.0 UG 0 0 0 eth0
### compute-0-1(stat: 0, dur(s): 0.14):

CA 'mthca0'
CA type: MT25204
Number of ports: 1
Firmware version: 1.2.0
Hardware version: a0

Node GUID: 0x000304898a540000

System image GUID: 0x0005ad000100d050
Port 1:
State: Active
Physical state: LinkUp
Rate: 20

Base lid: 2

LMC: 0
SM lid: 1

Capability mask: 0x02510a68
Port GUID: 0x000304898a540001

Kernel IP routing table
Destination Gateway Genmask Flags MSS Window irtt Iface
255.255.255.255 0.0.0.0 255.255.255.255 UH 0 0 0 eth0

192.168.10.21 10.1.1.1 255.255.255.255 UGH 0 0 0 eth0
10.1.0.0 0.0.0.0 255.255.0.0 U 0 0 0 eth0
172.30.0.0 0.0.0.0 255.255.0.0 U 0 0 0 ib0
169.254.0.0 0.0.0.0 255.255.0.0 U 0 0 0 eth0

224.0.0.0 0.0.0.0 240.0.0.0 U 0 0 0 eth0

0.0.0.0 10.1.1.1 0.0.0.0 UG 0 0 0 eth0

Jason
Clustercorp

Lundrigan, Adam

unread,

May 8, 2009, 12:16:04 AM5/8/09

to Discussion of Rocks Clusters, Power, Debbie

Jason,

I have added the HPC and mlnx-ofed rolls to the head node, updated the distro, and deployed two compute nodes. Currently the same story as before....once OFED is fully installed, ib0 and ib1 are created and given IP addresses, but the link never goes up (from the point of view of ib*):

ADDRCONF(NETDEV_UP): ib0: link is not ready

< ... broadcast domain join failure messages, ad finitum ... >

The flint commands you gave me did not work....the /dev/mst folder on both compute nodes is empty. However, I was able to use mstflint to get the PSID:

[root@compute-0-0 ~]# cat /root/mstflint.out
mstflint -device mthca0 q
Image type: Failsafe
FW Version: 4.7.400
I.S. Version: 1
Device ID: 25208
Chip Revision: A0
Description: Node Port1 Port2 Sys image
GUIDs: 0003ba000100582c 0003ba000100582d 0003ba000100582e 0003ba000100582f
Board ID: (SUN0030000001)
VSD:
PSID: SUN0030000001

Which means it is an SDR card. And then flash the adapter with the latest firmware:

[root@compute-0-0 ~]# mstflint -device mthca0 q
Image type: Failsafe
FW Version: 4.8.200
I.S. Version: 1
Device ID: 25208
Chip Revision: A0
Description: Node Port1 Port2 Sys image
GUIDs: 0003ba000100582c 0003ba000100582d 0003ba000100582e 0003ba000100582f
Board ID: (SUN0030000001)
VSD:
PSID: SUN0030000001

Compute-0-0 seems better.....the broadcast domain errors are gone:

ADDRCONF(NETDEV_UP): ib0: link is not ready
ADDRCONF(NETDEV_CHANGE): ib0: link becomes ready

ib0: enabling connected mode will cause multicast packet drops
ib0: mtu > 2044 will cause multicast packet drops.
ib0: mtu > 2044 will cause multicast packet drops.

However, no such luck for compute-0-1....it still throws out broadcast failures periodically, and the link nevery becomes ready like ib0 on 0-0 did.

I am thoroughly perplexed,

________________________________

compute netowrk information:

Jason
Clustercorp

Size: 23326 bytes
Desc: not available
Url : https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/attachments/20090508/063f5d33/attachment.bin

Jason Bishop

unread,

May 8, 2009, 1:22:24 AM5/8/09

to Discussion of Rocks Clusters, Power, Debbie

Adam,

First just wanted to mention if you don't see /dev/mst directory
populated you may have to issue an "mst start".
afterwards, you should see output similar to following. If you do,
then the /dev/mst directory is populated and the flint/mlxburn
commands will work fine.

[root@compute-0-0 ~]# mst status
MST modules:
------------
MST PCI module loaded
MST PCI configuration module loaded
MST Calibre (I2C) module is not loaded

MST devices:
------------
/dev/mst/mt25204_pciconf0 - PCI configuration cycles access.
bus:dev.fn=06:00.0 addr.reg=88 data.reg=92
Chip revision is: A0
/dev/mst/mt25204_pci_cr0 - PCI direct access.
bus:dev.fn=06:00.0 bar=0xd8800000
size=0x100000
Chip revision is: A0
/dev/mst/mt25204_pci_uar0 - PCI direct access.
bus:dev.fn=06:00.0 bar=0xd8000000
size=0x800000

That is good news on the firmware. Your PSID of SUN0030000001 is
registered/included in the roll (see below). Are both compute-0-0 and
0-1 updated to 4.8.200 now? You will need to reboot the computes once
for the firmware update to take effect.

[root@compute-0-0 ~]# grep SUN0030000001 /opt/mlnx-ofed/firmware/*/*
/opt/mlnx-ofed/firmware/fw-25208-rel-4_8_200/375-3382-01.ini:PSID =
SUN0030000001

Jason
Clustercorp

On Thu, May 7, 2009 at 9:16 PM, Lundrigan, Adam

Lundrigan, Adam

unread,

May 8, 2009, 7:14:18 AM5/8/09

to Discussion of Rocks Clusters, Power, Debbie

Jason,

It looks as though we've finally isolated the problem:

[root@compute-0-0 ~]# mst status
MST modules:
------------

MST PCI module is not loaded
MST PCI configuration module is not loaded

MST Calibre (I2C) module is not loaded

MST devices:
------------
No MST devices found

[root@compute-0-0 ~]# mst start
Starting MST (Mellanox Software Tools) driver set:
Loading MST PCI module insmod: can't read '/usr/mst/lib/2.6.18-92.1.13.el5/mst_pci.ko': No such file or directory
[FAILED]
Loading MST PCI configuration module insmod: can't read '/usr/mst/lib/2.6.18-92.1.13.el5/mst_pciconf.ko': No such file or directory
[FAILED]
Saving configuration for PCI device 05:00.0 [ OK ]
Create devices

mst_pci driver not found

If I look at /usr/mst/lib:

[root@compute-0-0 lib]# ls -l /usr/mst/lib
total 160
drwxr-xr-x 2 root root 4096 May 8 01:11 2.6.18-92.el5
-rw-r--r-- 1 root root 58104 Mar 15 14:02 libmtcr.a
lrwxrwxrwx 1 root root 16 May 8 01:11 libusb -> libusb-0.1.4.4.0
lrwxrwxrwx 1 root root 16 May 8 01:11 libusb-0.1.4 -> libusb-0.1.4.4.0
-rwxr-xr-x 1 root root 37646 Mar 15 14:01 libusb-0.1.4.4.0
-rw-r--r-- 1 root root 46912 Mar 15 14:02 libusb.a
-rwxr-xr-x 1 root root 724 Mar 15 14:01 libusb.la
[root@compute-0-0 lib]#

There is no directory '/usr/mst/lib/2.6.18-92.1.13.el5'. So I symlinked it to 2.6.18-92.el5, and the modules loaded:

[root@compute-0-0 lib]# mst start
Starting MST (Mellanox Software Tools) driver set:
Loading MST PCI module [ OK ]
Loading MST PCI configuration module [ OK ]
Saving configuration for PCI device 05:00.0 [ OK ]
Create devices

[root@compute-0-0 lib]# mst status

MST modules:
------------
MST PCI module loaded
MST PCI configuration module loaded
MST Calibre (I2C) module is not loaded

MST devices:
------------
/dev/mst/mt25208_pciconf0 - PCI configuration cycles access.
bus:dev.fn=05:00.0 addr.reg=88 data.reg=92
Chip revision is: A0
/dev/mst/mt25208_pci_cr0 - PCI direct access.
bus:dev.fn=05:00.0 bar=0xbe600000 size=0x100000
Chip revision is: A0
/dev/mst/mt25208_pci_ddr0 - PCI direct access.
bus:dev.fn=05:00.0 bar=0xc0000000 size=0x10000000
/dev/mst/mt25208_pci_uar0 - PCI direct access.
bus:dev.fn=05:00.0 bar=0xdf000000 size=0x800000

And now IPoIB works as expected:

[root@compute-0-1 lib]# ifconfig ib0
ib0 Link encap:InfiniBand HWaddr 80:00:04:04:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00
inet addr:10.2.255.253 Bcast:10.2.255.255 Mask:255.255.0.0
inet6 addr: fe80::203:ba00:100:5899/64 Scope:Link

UP BROADCAST RUNNING MULTICAST MTU:65520 Metric:1

RX packets:3220 errors:0 dropped:0 overruns:0 frame:0
TX packets:2 errors:0 dropped:5 overruns:0 carrier:0
collisions:0 txqueuelen:256
RX bytes:253956 (248.0 KiB) TX bytes:176 (176.0 b)

[root@compute-0-1 lib]# ping 10.2.255.254

PING 10.2.255.254 (10.2.255.254) 56(84) bytes of data.

64 bytes from 10.2.255.254: icmp_seq=1 ttl=64 time=0.098 ms
64 bytes from 10.2.255.254: icmp_seq=2 ttl=64 time=0.074 ms
64 bytes from 10.2.255.254: icmp_seq=3 ttl=64 time=0.104 ms
64 bytes from 10.2.255.254: icmp_seq=4 ttl=64 time=0.085 ms

--- 10.2.255.254 ping statistics ---

4 packets transmitted, 4 received, 0% packet loss, time 2999ms
rtt min/avg/max/mdev = 0.074/0.090/0.104/0.013 ms

Now I just have to rig these fixes we've found along the way (card firmware flash, kernel version symlink, etc..) into the node configuration, redeploy the nodes, then cross my fingers and hope it fixes the problem permanently. I will report back later to let you know how it goes...

Thank you very much for all your help,

--
Adam Lundrigan
Computer Systems Programmer
Biological & Physical Oceanography Section
Science, Oceans & Environment Branch
Department of Fisheries and Oceans Canada
Northwest Atlantic Fisheries Centre
St. John's, Newfoundland & Labrador
CANADA A1C 5X1

Tel: (709) 772-8136
Fax: (709) 772-8138
Cell: (709) 277-4575
Office: G10-117J
Email: Adam.Lu...@dfo-mpo.gc.ca

Lundrigan, Adam

unread,

May 8, 2009, 10:10:40 AM5/8/09

to Lundrigan, Adam, Discussion of Rocks Clusters, Power, Debbie

Jason,

That victory was short lived. Infiniband is working on the Rocks cluster, but there is still a serious problem with the fabric. I haven't been able to test any MPI jobs on the new Rocks cluster, but the Solaris cluster on the same IB fabric is now broken.

The majority of the machines on the IB fabric are members of a Solaris 10 compute grid, and these machines are seeing a tremendous reduction in bandwidth:

# OSU MPI Bandwidth Test (Version 2.3)
# Size Bandwidth (MB/s)
1 0.38
2 0.75
4 1.53
8 3.06
16 6.08
32 12.13
64 24.06
128 47.33
256 85.45
512 141.24
1024 185.67
2048 211.33
4096 224.08
8192 220.40
16384 233.04
32768 235.62
65536 236.79
131072 237.35
262144 237.54
524288 237.61
1048576 237.80
2097152 206.17
4194304 237.76

Which, coupled with the traffic I see when running snoop on the GbE interface, leads me to believe that these numbers show that IB isn't being used.
A quick scan of the IB interfaces on each machine, and it appears that the subnet mask has been modified on all of them. I double-checked to ensure that /etc/netmasks had the proper value (255.255.255.0) for that subnet, but the adapters seem to keep changing their subnet mask to 255.255.0.0 (which is the subnet mask of the Rocks machines' IB adapters). I manually changed all of them back to 255.255.255.0, however I see peculiar behavior with SSH:

-bash-3.00# id
uid=0(root) gid=0(root)
-bash-3.00# hostname
CNOOFS01
-bash-3.00# ssh cnoofs08-ib
^C
-bash-3.00# ping cnoofs08-ib
no answer from cnoofs08-ib
-bash-3.00# ssh cnoofs08
Password:
Last login: Fri May 8 11:29:14 2009 from cnoofs01
Sun Microsystems Inc. SunOS 5.10 Generic January 2005
-bash-3.00# ping cnoofs01-ib
cnoofs01-ib is alive
-bash-3.00# exit
logout
Connection to cnoofs08 closed.
-bash-3.00# ping cnoofs08-ib
cnoofs08-ib is alive
-bash-3.00# ssh cnoofs08-ib
Warning: Permanently added 'cnoofs08-ib,172.16.22.8' (RSA) to the list of known hosts.
Password:
Password:
Last login: Fri May 8 11:39:14 2009 from cnoofs01
Sun Microsystems Inc. SunOS 5.10 Generic January 2005
-bash-3.00#

SSHing in over the IPoIB interface failed until I SSHed into the node over the GbE network and pinged the head node. When I logged out and tried SSHing again via the IPoIB network, it worked. I know this is probably outside of the scope of this mailing list, but have you come across anything like this before? Should this kind of problem be expected when running two distinct IP networks with different subnet masks (172.16.22.0/24 and 10.2.0.0/16) on the same IB fabric?

In the meantime I am going to disconnect the Rocks IB nodes from the IB fabric, and reset the IB switches to see if my Solaris nodes go back to working as normal.

Thanks again for your help,

-Adam

-----Original Message-----
From: Lundrigan, Adam

Jason Bishop

unread,

May 12, 2009, 5:25:00 PM5/12/09

to Discussion of Rocks Clusters, Power, Debbie, Adam.Lu...@dfo-mpo.gc.ca

Adam,

I would recommend downloading more recent version of the mlnx-ofed
roll from clustercorp web site. The error you list below looks like
you have an earlier version which was not rebuilding the mft (mellanox
firmware tools) package for updated/errata kernels.