[Rocks-Discuss] InfiniBand - is IPoIB necessary?


Gowtham

Dec 13, 2011, 4:40:44 PM
to NPACI Rocks Discussion List

In one of our clusters running Rocks 5.4.2 that has
InfiniBand backend (MLNX_OFED_LINUX-1.5.3-1.0.0-rhel5.5-x86_64),
we have Intel MPI (v2012.0.032) installed and programs
are compiled with it.

Also, every user has these lines in their .bashrc

export I_MPI_FABRICS=ofa
export I_MPI_DEVICE=rdma
export I_MPI_DEBUG=2
export I_MPI_FALLBACK_DEVICE=enable
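If the goal is to require IB rather than silently fall back to GigE, one option is to disable the fallback so failures surface at startup. A sketch, assuming the Intel MPI variables shown above accept the opposite setting (verify against your Intel MPI version's docs):

```shell
# Sketch (unverified against this Intel MPI version): with fallback
# disabled, MPI startup should abort when the ofa fabric is unusable,
# instead of quietly switching to tcp over GigE.
export I_MPI_FABRICS=ofa
export I_MPI_DEBUG=2
export I_MPI_FALLBACK_DEVICE=disable
```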

While monitoring the calculations on this cluster, I
notice that most calculations have lines such as

[0] MPI startup(): ofa data transfer mode
[1] MPI startup(): ofa data transfer mode
....

indicating that InfiniBand is being used. However, for
some calculations, I do see

[55] MPI startup(): fabric ofa failed: will try use tcp fabric
[55] MPI startup(): tcp data transfer mode

indicating that it's using GigE network.

Checking the mailing list archives, I see from

https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/2010-November/049739.html

that setting up IPoIB would force all data
transfer to happen through IB rather than GigE.

Do I need to set up IPoIB on this cluster? Are
there other ways (e.g. setting up variables in
.bashrc, SGE scripts, mpirun options, etc.) to
accomplish the same?

I'd greatly appreciate any insight on this.

Best,
g

--
Gowtham
Information Technology Services
Michigan Technological University

(906) 487/3593
http://www.it.mtu.edu/

Lloyd Brown

Dec 13, 2011, 5:08:18 PM
to npaci-rocks...@sdsc.edu
Gowtham,

I can't speak to how Intel MPI does things, which is going to be
significant for this discussion. However, if you do enable IPoIB, you
need to understand the implications.

IPoIB is basically a software-only TCP/IP stack, on top of IB. Using it
will increase your CPU utilization somewhat, since it's all software. I
don't know how significant this will be for you. Also, IPoIB will
probably be faster than 1GbE, but it will not be as fast as a native IB
Verbs implementation.

I do generally recommend setting up IPoIB, even if it's just for
convenience, and internal to the cluster only. But, if you're after all
the raw speed, you'd be better off figuring out why Intel MPI isn't
using its native IB Verbs all the time.

Lloyd Brown
Systems Administrator
Fulton Supercomputing Lab
Brigham Young University
http://marylou.byu.edu

Gowtham

Dec 13, 2011, 5:17:17 PM
to Discussion of Rocks Clusters

Hello Lloyd,

Thank you for your response. I do now see the point in
setting up IPoIB on this cluster - especially since it
has a 13 TB Promise array that's mounted on all
nodes as '/research/'. Hopefully it'll speed up I/O
to some extent.

Do you have any more notes on setting up IPoIB, or
do the ones in the link below suffice?

Thanks,
g

--
Gowtham
Information Technology Services
Michigan Technological University

(906) 487/3593
http://www.it.mtu.edu/

Lloyd Brown

Dec 13, 2011, 5:41:11 PM
to npaci-rocks...@sdsc.edu
As far as Rocks goes, I generally use the rocks command line for this,
roughly:

> rocks add network m5ipoib subnet=192.168.170.0 netmask=255.255.254.0 mtu=1500
> rocks add host interface m5-6-14 ib0
> rocks set host interface ip m5-6-14 ib0 192.168.170.143
> rocks set host interface name m5-6-14 ib0 m5-6-14ib
> rocks set host interface subnet m5-6-14 ib0 m5ipoib
> ... (repeat similar for all nodes)
> rocks sync config
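The "repeat similar for all nodes" step can be scripted. A minimal dry-run sketch with hypothetical node names and addresses; it only prints the rocks commands, so you can review them before piping the output to sh on a real frontend:

```shell
# Emit the per-node rocks commands for IPoIB setup (dry run).
# The node list and the 192.168.170.x addressing are illustrative.
ipoib_cmds() {
  local octet=10 node ip
  for node in compute-0-0 compute-0-1 compute-0-2; do
    ip="192.168.170.${octet}"
    echo "rocks add host interface ${node} ib0"
    echo "rocks set host interface ip ${node} ib0 ${ip}"
    echo "rocks set host interface name ${node} ib0 ${node}ib"
    echo "rocks set host interface subnet ${node} ib0 m5ipoib"
    octet=$((octet + 1))
  done
}
ipoib_cmds
```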


As far as the IB setup goes, I know there are other tuning parameters to
make it run more efficiently, but unfortunately, I'm not the one to talk
to about that. If you find a good source of IB technical training info,
though, I'd love to hear about it. I've been looking for that for a few
years now.


Lloyd Brown
Systems Administrator
Fulton Supercomputing Lab
Brigham Young University
http://marylou.byu.edu

Gustavo Correa

Dec 13, 2011, 6:09:27 PM
to Discussion of Rocks Clusters
Hi Gowtham

Have you checked if the InfiniBand interface/HCA is working on those nodes
that report using GigE?
For instance, you could run ibstat on those nodes to check, or maybe ibchecknet.
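For reference, a sketch of what to look for in the ibstat output. The sample below is illustrative, not from a real node; on the cluster you would feed the actual `ibstat` output through the same filter:

```shell
# Parse ibstat-style output and extract the port state.
# "Active" indicates a usable port; "Down" would point at a bad HCA,
# a bad cable, or a port that is switched off.
sample="CA 'mlx4_0'
Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 40"
state=$(printf '%s\n' "$sample" | awk -F': ' '/^[[:space:]]*State:/ {print $2; exit}')
echo "$state"
```

On a healthy port this prints "Active".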

I guess Intel MPI may do the same as OpenMPI, namely, try to use the fastest
mechanism [e.g. InfiniBand], and if that channel is not working, fall back to
something else [e.g. TCP/IP over GigE].
I don't have Intel MPI, but OpenMPI won't shy away from using a hybrid
communication pattern: a mix of IB and GigE, or several Ethernet NICs if a
node has many, etc.
It will do whatever it takes to run the program, unless you explicitly tell
it not to use tcp/Ethernet or another of the available transport layers.

Hence, it may be that some nodes have a bad HCA or for some reason it is turned off.

I hope this helps,
Gus correa

Gustavo Correa

Dec 13, 2011, 6:49:09 PM
to Discussion of Rocks Clusters
Hi Lloyd, Gowtham

These two IB references came up in a recent discussion in the Beowulf mailing list.
Some presentations in the second link are tutorials, although in the terse PowerPoint
'bullet' style.

http://members.infinibandta.org/kwspub/Intro_to_IB_for_End_Users.pdf
http://www.hpcadvisorycouncil.com/events/switzerland_workshop/agenda.php

This is Guy Coates' 'InfiniBand HOWTO':
http://pkg-ofed.alioth.debian.org/howto/infiniband-howto.html
It is somewhat old and focuses on Debian Linux, though.

There is a bunch of links here, some broken:
http://www.ibswitches.com/index.php?link=resource

I agree with Lloyd that there is no decent practical reference for InfiniBand.
Too bad, since it is perhaps the dominant interconnect for HPC, storage, etc.

I hope this helps,
Gus Correa

Lloyd Brown

Dec 14, 2011, 10:32:14 AM
to npaci-rocks...@sdsc.edu
Yep. I've been searching and bugging vendors for a while now. There is
a programmer's workshop/training sponsored by the OFA
(https://www.openfabrics.org/resources/training/training-offerings.html), but
nothing for sysadmins yet.

As far as books and other references go, someone I know once said that
there were exactly 2 books on the subject, and that one was bad, and the
other worse. I decided to buy the "bad" one, just to have a reference,
but honestly it hasn't been much use to me so far.

Lloyd Brown
Systems Administrator
Fulton Supercomputing Lab
Brigham Young University
http://marylou.byu.edu

Gustavo Correa

Dec 14, 2011, 1:07:13 PM
to Discussion of Rocks Clusters
I second your comments.
Not even the terse man pages of the various 'ib' commands are of much help.

By contrast with the scarce InfiniBand documentation and learning materials,
GPUs [and GPU programming], a technology that appeared at about the same time
as IB or perhaps even later, have a number of books, tutorials, and libraries,
many free to download, plus user forums, mailing lists, etc.
InfiniBand is way behind in this regard, and the InfiniBand vendor[s] [one is
largely dominant] don't seem to be as proactive or as interested as NVidia and
the other GPU vendors in assisting their users' community.

Gus Correa

Ian Kaufman

Dec 14, 2011, 1:24:29 PM
to Discussion of Rocks Clusters
I have found the Mellanox and Qlogic OFED user manuals pretty decent.

--
Ian Kaufman
Research Systems Administrator
UC San Diego, Jacobs School of Engineering ikaufman AT ucsd DOT edu

Gustavo Correa

Dec 14, 2011, 1:50:24 PM
to Discussion of Rocks Clusters
Are they free to download?
If so, could you point to a URL, please?

Thank you,
Gus Correa

Ian Kaufman

Dec 14, 2011, 2:04:35 PM
to Discussion of Rocks Clusters

Gowtham

Dec 15, 2011, 8:52:56 AM
to Discussion of Rocks Clusters

Thank you for more information and links to the
documentation. The setup I have so far (sans IPoIB)
seems to be OK [I had followed the ReadMe/Install
notes from Mellanox that were packaged with their ISO].

But this PDF seems pretty exhaustive at first
look - I'll use it for setting up IPoIB and post my
success/failure stories soon.

Best,
g

--
Gowtham
Information Technology Services
Michigan Technological University

(906) 487/3593
http://www.it.mtu.edu/

Gowtham

Dec 19, 2011, 10:09:18 AM
to Discussion of Rocks Clusters

I followed instructions in the PDF (for Mellanox OFED) and I
believe I got the IPoIB set up properly on the front end as
well as one compute node.

***************************************************************

PART #1: Front End

rocks add network ibnet subnet=10.2.0.0 \
netmask=255.255.0.0

cat /etc/sysconfig/network-scripts/ifcfg-eth0 | \
grep -v HWADDR | \
grep -v MTU | \
sed -e 's/10\.1/10.2/' | \
sed -e 's/eth0/ib0/' > \
/etc/sysconfig/network-scripts/ifcfg-ib0

/etc/init.d/network restart

When I run 'ifconfig -a', I do see a relevant entry
for ib0 such as

inet addr:10.2.1.1
Bcast:10.2.255.255
Mask:255.255.0.0

I then updated the /etc/exports with the following line

/research 10.2.1.1(rw,async,no_root_squash) 10.2.0.0/255.255.0.0(rw,async)

***************************************************************

PART #2: compute-0-0

cat /etc/sysconfig/network-scripts/ifcfg-eth0 | \
grep -v HWADDR | \
grep -v MTU | \
sed -e 's/10\.1/10.2/' | \
sed -e 's/eth0/ib0/' > \
/etc/sysconfig/network-scripts/ifcfg-ib0

/etc/init.d/network restart

When I run 'ifconfig -a', I do see a relevant entry
for ib0 such as

inet addr:10.2.255.254
Bcast:10.2.255.255
Mask:255.255.0.0

I added the following line to /etc/fstab

10.2.1.1:/research /research nfs defaults 1 2

When I try

mount -a

I get

mount: mount to NFS server '10.2.1.1' failed:
System Error: Connection refused.

***************************************************************

I can ping 10.2.255.254 from front end successfully, but when
I attempt

ssh 10.2.255.254

I get

ssh: connect to host 10.2.255.254 port 22: Connection refused


I understand I'm missing something with regard to the firewall. And
I know editing /etc/sysconfig/iptables directly wouldn't do much
good. I'm expecting the 'rocks' command to be something like

rocks add host firewall localhost chain=INPUT \
flags="-m state --state NEW --source 10.2.0.0/255.255.0.0" \
protocol=tcp service=ssh action=ACCEPT network=ib0
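A small sanity check for rules like this, sketched against an illustrative ruleset string so it runs anywhere; on a real node you would grep /etc/sysconfig/iptables or the output of iptables-save instead:

```shell
# Check whether a ruleset accepts traffic arriving on ib0.
# The rules below are illustrative, not from a live system.
rules='-A INPUT -i eth0 -j ACCEPT
-A INPUT -m state --state NEW --source 10.2.0.0/255.255.0.0 -i ib0 -p tcp --dport 22 -j ACCEPT'
if printf '%s\n' "$rules" | grep -q -- '-i ib0'; then
  echo "ib0 rule present"
else
  echo "ib0 rule missing"
fi
```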

Can someone please let me know if this is right? Also, any help
on other steps I might have missed would be greatly appreciated
as well.

Best,
g

--
Gowtham
Information Technology Services
Michigan Technological University

(906) 487/3593
http://www.it.mtu.edu/

Luca Clementi

Dec 19, 2011, 12:06:38 PM
to Discussion of Rocks Clusters
Hey Gowtham,
did you:
- restart sshd (so it can bind to the new IPoIB address)?
- push the network config to the nodes (after the rocks add network, you
should run rocks sync network)?

Luca Clementi

Dec 19, 2011, 12:11:04 PM
to Discussion of Rocks Clusters
Sorry, I hit the send button too early...


Hey Gowtham,
did you:
 - restart sshd (so it can bind to the new IPoIB address)?
 - push the network config to the nodes (after the rocks add network, you
should run rocks sync host network)?
 - restart the firewall with /etc/init.d/iptables restart (and the network)?


If it still doesn't work, just to see whether or not it is an iptables
issue, try bringing the firewall down temporarily with
/etc/init.d/iptables stop.


Sincerely,
Luca

Gowtham

Dec 19, 2011, 3:06:35 PM
to Discussion of Rocks Clusters

I did some more tests and it turned out to be an iptables-related
error.

For example, I get no error when I type

rocks add host firewall compute-0-0 chain=INPUT \
protocol=all service=all action=ACCEPT network=ibnet \
iface=ib0

rocks sync host firewall compute-0-0


However, when I log into compute-0-0 and check
/etc/sysconfig/iptables, I don't see an entry like

-A INPUT -i ib0 -j ACCEPT

A similar entry is missing in the front end as well.

How do I make sure these changes are permanent?

Thanks,
g

--
Gowtham
Information Technology Services
Michigan Technological University

(906) 487/3593
http://www.it.mtu.edu/

Gowtham

Dec 19, 2011, 4:09:15 PM
to Discussion of Rocks Clusters

It's working as expected. I'll post the details pretty soon.

Best,
g

--
Gowtham
Information Technology Services
Michigan Technological University

(906) 487/3593
http://www.it.mtu.edu/
