[Rocks-Discuss] request for help with QLogic InfiniBand network setup


Erik Bryer

Oct 15, 2010, 5:22:16 PM
to npaci-rocks...@sdsc.edu
Hello,

I have Rocks 5.3 on a cluster with both gE and InfiniBand. gE tests have
so far run OK. I want to get the InfiniBand working. I installed the OFED
roll from QLogic, but I am concerned by some failed tests. Perhaps I
should have installed it using the binary, not the roll, or I've omitted
something. (Or maybe things are better than they look?)

The QLogic documentation says to start the driver first, then test in
various ways, beginning with ping. Here is a log of that attempt. I had
already logged into a compute node and started its IB adapter (10.1.2.2)
the same way I start the adapter on the frontend here:

# /etc/init.d/openibd restart
Unloading HCA driver: [ OK ]
grep: /sys/class/infiniband/qib*/hca_type: No such file or directory
grep: /sys/class/infiniband/qib*/hca_type: No such file or directory
Loading HCA driver and Access Layer: [ OK ]
Setting up InfiniBand network interfaces:
No configuration found for ib0
No configuration found for ib1
Setting up service network . . . [ done ]

# ifconfig ib0 10.1.2.1 netmask 255.255.255.0
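
I'm guessing the "No configuration found" warnings above just mean openibd
is looking for ifcfg scripts I never wrote. If so, something like the
following in /etc/sysconfig/network-scripts/ifcfg-ib0 (a sketch, using the
address I just assigned by hand) is presumably what it wants:

DEVICE=ib0
BOOTPROTO=static
IPADDR=10.1.2.1
NETMASK=255.255.255.0
ONBOOT=yes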

# ping -c 2 -b 10.1.2.255
WARNING: pinging broadcast address
PING 10.1.2.255 (10.1.2.255) 56(84) bytes of data.
--- 10.1.2.255 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 999ms

Shutting off iptables on the frontend and the compute node didn't make the
broadcast ping work. However, I can (regardless of iptables) flood ping
10.1.2.2 from the frontend (10.1.2.1). That yields an average round-trip
time roughly half of what I see when using the gE address. It seems to be
working. (The IB adapters are 4X cards in PCI-X slots, btw; the switch is a
SilverStorm 9000.)
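
One thought, unverified: the broadcast loss may just be the kernel ignoring
broadcast echo requests by default (icmp_echo_ignore_broadcasts defaults to
1, i.e. ignore, if I remember right) rather than anything IB-specific. I
could check on each node, and temporarily set it to 0 to retry the
broadcast ping:

# sysctl net.ipv4.icmp_echo_ignore_broadcasts
# sysctl -w net.ipv4.icmp_echo_ignore_broadcasts=0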

I am concerned about the other tests that fail, such as "ipath_checkout
<nodefile>":

# ipath_checkout nodefile
Test 1. Pinging each node... OK
Test 2. Attempting to ssh to each node... OK
Test 3. Retrieving system configuration information... OK
Analyzing results...
!!!ERROR!!! 10.1.2.2 does not have an InfiniPath device
[...]

InfiniPath errors also show up when I type:
# ipath_control -iv
ipath_control: No InfiniPath module loaded?

Things look better when I type:
# ibhosts
Ca : 0x00066a0098004a0e ports 2 "compute-0-3 HCA-1"
Ca : 0x00066a0098004a13 ports 2 "compute-0-2 HCA-1"
Ca : 0x00066a0098004a05 ports 2 "compute-0-0 HCA-1"
Ca : 0x00066a00980049c5 ports 2 "compute-0-1 HCA-1"
Ca : 0x00066a00980049f2 ports 2 "typhoon HCA-1" [the frontend]

I wonder if I am missing some files (or what). Here is some more data:
# ibstatus
Infiniband device 'mthca0' port 1 status:
default gid: fe80:0000:0000:0000:0006:6a00:a000:49f2
base lid: 0x6
sm lid: 0x1
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 10 Gb/sec (4X)
[...]

# /etc/init.d/openibd status
HCA driver loaded
Configured IPoIB devices:
ib0 ib1
Currently active IPoIB devices:
The following OFED modules are loaded:
rdma_ucm
rdma_cm
ib_addr
ib_ipoib
mlx4_core
mlx4_ib
ib_mthca
ib_uverbs
ib_umad
ib_sa
ib_cm
ib_mad
ib_core
ib_qib
ib_usa

Some of the software referred to in the documentation seems not to exist,
or a library is missing at times. Should I even be concerned about these
errors, or does my link look good enough to support "mpirun <a.out>" later?
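
For what it's worth, my plan before trying MPI is a raw verbs test between
the frontend and a compute node, assuming the OFED perftest utilities
(ib_write_bw et al.) made it onto the nodes. On the compute node (server
side):

# ib_write_bw

Then on the frontend (client side), pointing at the compute node's IPoIB
address:

# ib_write_bw 10.1.2.2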

Regards,

Erik
ebr...@fsu.edu

Tim Carlson

Oct 15, 2010, 5:48:44 PM
to Discussion of Rocks Clusters
On Fri, 15 Oct 2010, Erik Bryer wrote:

> Hello,
>
> I have Rocks 5.3 on a cluster with both gE and InfiniBand. gE tests have
> so far run OK. I want to get the InfiniBand working. I installed the
> OFED roll from QLogic, but I am concerned by some failed tests. Perhaps
> I should have installed it using the binary, not the roll, or I've
> omitted something. (Or maybe things are better than they look?)

Are you sure you have QLogic gear? Looks to me like you have Mellanox
gear.

The output of ibstatus on my QLogic nodes is

# ibstatus
Infiniband device 'qib0' port 1 status:
default gid: fe80:0000:0000:0000:0011:7500:00ff:da34
base lid: 0xf
sm lid: 0xf
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 10 Gb/sec (4X)

Note the "device" is qib0. Your "device" output indicates you have a
Mellanox card, "mthca0". That would be a Mellanox InfiniHost type card.
Looks like an SDR Mellanox InfiniHost card.
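
You can double-check what hardware you actually have with something
generic like:

# ls /sys/class/infiniband/
# lspci | grep -i -e mellanox -e qlogic -e infiniband

The first lists the HCAs the kernel sees; the second asks the PCI bus
directly.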

Tim
--
-------------------------------------------
Tim Carlson, PhD
Senior Research Scientist
Environmental Molecular Sciences Laboratory
