Also, every user has these lines in their .bashrc
export I_MPI_FABRICS=ofa
export I_MPI_DEVICE=rdma
export I_MPI_DEBUG=2
export I_MPI_FALLBACK_DEVICE=enable
While monitoring the calculations on this cluster, I
notice that most calculations have lines such as
[0] MPI startup(): ofa data transfer mode
[1] MPI startup(): ofa data transfer mode
....
indicating that InfiniBand is being used. However, for
some calculations, I do see
[55] MPI startup(): fabric ofa failed: will try use tcp fabric
[55] MPI startup(): tcp data transfer mode
indicating that it's using the GigE network.
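A quick way to spot which jobs fell back (assuming SGE's usual .o<jobid>
stdout files; the path below is only a placeholder) is something like

  # list job output files that contain the Intel MPI fallback message
  grep -l "fabric ofa failed" /path/to/job/output/*.o*
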
Checking the mailing list archives, I see from
https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/2010-November/049739.html
that setting up IPoIB would force all data transfer to
happen over IB rather than GigE.
Do I need to set up IPoIB on this cluster? Are
there other ways (e.g. setting up variables in
.bashrc, SGE scripts, mpirun options, etc.) to
accomplish the same?
I'd greatly appreciate any insight on this.
Best,
g
--
Gowtham
Information Technology Services
Michigan Technological University
I can't speak to how Intel MPI does things, which is going to be
significant for this discussion. However, if you do enable IPoIB, you
need to understand the implications.
IPoIB is basically a software-only TCP/IP stack, on top of IB. Using it
will increase your CPU utilization somewhat, since it's all software. I
don't know how significant this will be for you. Also, IPoIB will
probably be faster than 1GbE, but it will not be as fast as a native IB
Verbs implementation.
I do generally recommend setting up IPoIB, even if it's just for
convenience, and internal to the cluster only. But if you're after raw
speed, you'd be better off figuring out why Intel MPI isn't using its
native IB Verbs all the time.
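As an aside (with the same caveat that I can't speak for Intel MPI), one
way to make that failure loud instead of silent is to turn off the TCP
fallback. A minimal sketch, assuming the documented fabric-selection
variables (older versions use I_MPI_FALLBACK_DEVICE, as in the .bashrc
above):

  # shared memory within a node, OFA verbs between nodes
  export I_MPI_FABRICS=shm:ofa
  # abort at startup instead of quietly dropping to tcp
  export I_MPI_FALLBACK=disable
  # keep the "data transfer mode" startup lines in the job output
  export I_MPI_DEBUG=2
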
Lloyd Brown
Systems Administrator
Fulton Supercomputing Lab
Brigham Young University
http://marylou.byu.edu
Thank you for your response. I do now see the point in
setting up IPoIB in this cluster - especially since it
has a 13 TB promise array that's mounted across all
nodes as '/research/'. Hopefully it'll speed up I/O
to some extent.
Do you have any more notes on setting up IPoIB, or
do the ones in the link below suffice?
Thanks,
g
--
Gowtham
Information Technology Services
Michigan Technological University
(906) 487/3593
http://www.it.mtu.edu/
> rocks add network m5ipoib subnet=192.168.170.0 netmask=255.255.254.0 mtu=1500
> rocks add host interface m5-6-14 ib0
> rocks set host interface ip m5-6-14 ib0 192.168.170.143
> rocks set host interface name m5-6-14 ib0 m5-6-14ib
> rocks set host interface subnet m5-6-14 ib0 m5ipoib
> ... (repeat similar for all nodes)
> rocks sync config
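Assuming the stock Rocks commands, the new entries can be double-checked
and pushed out to a node afterwards with something like

  # show what the Rocks database recorded for the node
  rocks list host interface m5-6-14
  # push the new interface configuration to the node
  rocks sync host network m5-6-14
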
As far as the IB setup goes, I know there are other tuning parameters to
make it run more efficiently, but unfortunately, I'm not the one to talk
to about that. If you find a good source of IB technical training info,
though, I'd love to hear about it. I've been looking for that for a few
years now.
Lloyd Brown
Systems Administrator
Fulton Supercomputing Lab
Brigham Young University
http://marylou.byu.edu
Have you checked if the Infiniband interface/HCA is working on those nodes
that report using GigE?
For instance, you could run ibstat on those nodes to check, or maybe ibchecknet.
I guess Intel MPI may do the same as OpenMPI, namely, try to use the fastest
mechanism [e.g. Infiniband], and if that channel is not working then try something else
[e.g. TCP/IP over GigE].
I don't have Intel MPI, but
OpenMPI won't shy away from using a hybrid communication pattern: a mix of IB and GigE, or various Ethernet NICs if you have many on a node, etc.
It will try to do what it takes to run the program,
unless you explicitly tell it not to use tcp/Ethernet or another
of the existing transport layers.
Hence, it may be that some nodes have a bad HCA or for some reason it is turned off.
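A quick way to check that, assuming the OFED diagnostic tools are
installed, is something like

  # on a suspect node: the port State should be Active and the link up
  ibstat
  # more detail on the HCA, ports, and firmware
  ibv_devinfo
  # sweep the whole fabric for bad links and errors
  ibchecknet
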
I hope this helps,
Gus correa
These two IB references came up in a recent discussion in the Beowulf mailing list.
Some presentations in the second link are tutorials, although in the terse PowerPoint
'bullet' style.
http://members.infinibandta.org/kwspub/Intro_to_IB_for_End_Users.pdf
http://www.hpcadvisorycouncil.com/events/switzerland_workshop/agenda.php
This is Guy Coates' 'Infiniband HOWTO':
http://pkg-ofed.alioth.debian.org/howto/infiniband-howto.html
It is somewhat old and focuses on Debian Linux, though.
There is a bunch of links here, some broken:
http://www.ibswitches.com/index.php?link=resource
I agree with Lloyd that there is not a decent practical reference on Infiniband.
Too bad, since it is perhaps the dominant interconnect for HPC, storage, etc.
I hope this helps,
Gus Correa
As far as books and other references go, someone I know once said that
there were exactly 2 books on the subject, and that one was bad, and the
other worse. I decided to buy the "bad" one, just to have a reference,
but honestly it hasn't been much use to me so far.
Lloyd Brown
Systems Administrator
Fulton Supercomputing Lab
Brigham Young University
http://marylou.byu.edu
In contrast to the scarce Infiniband documentation and learning materials,
GPUs [and GPU programming], a technology that appeared at about the same time as IB or perhaps even later,
have a number of books, tutorials, and libraries, many of which are free to download, plus user forums, mailing lists, etc.
Infiniband is way behind in this regard, and the Infiniband vendor[s] [one is largely dominant]
don't seem to be as proactive or as interested as NVidia and the other GPU vendors
in assisting their users' community.
Gus Correa
This is a link to the Mellanox OFED 1.5.3 User Guide.
The PDF seems pretty exhaustive at first look - I will
use it for setting up IPoIB and post my
success/failure stories soon.
Best,
g
--
Gowtham
Information Technology Services
Michigan Technological University
(906) 487/3593
http://www.it.mtu.edu/
***************************************************************
PART #1: Front End
rocks add network ibnet subnet=10.2.0.0 \
netmask=255.255.0.0
cat /etc/sysconfig/network-scripts/ifcfg-eth0 | \
grep -v HWADDR | \
grep -v MTU | \
sed -e s/10.1/10.2/ | \
sed -e s/eth0/ib0/ > \
/etc/sysconfig/network-scripts/ifcfg-ib0
/etc/init.d/network restart
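For reference, the resulting ifcfg-ib0 should look roughly like this
(an assumption on my part - the values simply mirror the eth0 file with
10.1 rewritten to 10.2 and eth0 to ib0):

  DEVICE=ib0
  IPADDR=10.2.1.1
  NETMASK=255.255.0.0
  BOOTPROTO=static
  ONBOOT=yes
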
When I run 'ifconfig -a', I do see a relevant entry
for ib0 such as
inet addr:10.2.1.1
Bcast:10.2.255.255
Mask:255.255.0.0
I then updated the /etc/exports with the following line
/research 10.2.1.1(rw,async,no_root_squash) 10.2.0.0/255.255.0.0(rw,async)
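After editing /etc/exports, the export table also has to be re-read for
the new line to take effect; a minimal way, assuming the stock init
scripts, is

  # re-export everything listed in /etc/exports
  exportfs -ra
  # or restart the NFS server outright
  /etc/init.d/nfs restart
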
***************************************************************
PART #2: compute-0-0
cat /etc/sysconfig/network-scripts/ifcfg-eth0 | \
grep -v HWADDR | \
grep -v MTU | \
sed -e s/10.1/10.2/ | \
sed -e s/eth0/ib0/ > \
/etc/sysconfig/network-scripts/ifcfg-ib0
/etc/init.d/network restart
When I run 'ifconfig -a', I do see a relevant entry
for ib0 such as
inet addr:10.2.255.254
Bcast:10.2.255.255
Mask:255.255.0.0
I added the following line to /etc/fstab
10.2.1.1:/research /research nfs defaults 1 2
When I try
mount -a
I get
mount: mount to NFS server '10.2.1.1' failed:
System Error: Connection refused.
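To narrow down where the 'Connection refused' comes from, a couple of
generic checks from compute-0-0 (nothing Rocks-specific) would be

  # is the frontend answering RPC requests on its IPoIB address?
  rpcinfo -p 10.2.1.1
  # does it export /research to this client?
  showmount -e 10.2.1.1
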
***************************************************************
I can ping 10.2.255.254 from front end successfully, but when
I attempt
ssh 10.2.255.254
I get
ssh: connect to host 10.2.255.254 port 22: Connection refused
I understand I'm missing something with regard to the firewall. And
I know that editing /etc/sysconfig/iptables directly wouldn't do much
good. I'm expecting the 'rocks' command to be something like
rocks add host firewall localhost chain=INPUT \
flags="-m state --state NEW --source 10.2.0.0/255.255.0.0" \
protocol=tcp service=ssh action=ACCEPT network=ib0
Can someone please let me know if this is right? Also, any help
on other steps I might have missed would be greatly appreciated
as well.
Best,
g
--
Gowtham
Information Technology Services
Michigan Technological University
(906) 487/3593
http://www.it.mtu.edu/
Hey Gowtham,
did you:
- restart sshd (so it can bind to the new IPoIB address)
- push the network config to the nodes (after the rocks add network, you
should run rocks sync host network)
- restart the firewall with /etc/init.d/iptables restart (and the network)
If it still doesn't work, then just to see whether or not it's an iptables
issue, try bringing down the firewall temporarily with
/etc/init.d/iptables stop - roughly the sequence sketched below.
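A sketch of that sequence, assuming stock Rocks/CentOS init scripts:

  # on the frontend: push the new network definitions to the nodes
  rocks sync host network
  # on each affected node (and the frontend, for its own ib0):
  /etc/init.d/network restart
  /etc/init.d/iptables restart
  /etc/init.d/sshd restart
  # temporary test only, to rule iptables in or out:
  /etc/init.d/iptables stop
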
Sincerely,
Luca
For example, I get no error when I type
rocks add host firewall compute-0-0 chain=INPUT \
protocol=all service=all action=ACCEPT network=ibnet \
iface=ib0
rocks sync host firewall compute-0-0
However, when I log into compute-0-0 and check
/etc/sysconfig/iptables, I don't see an entry like
-A INPUT -i ib0 -j ACCEPT
A similar entry is missing in the front end as well.
How do I make sure these changes are permanent?
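If your Rocks release provides them (an assumption on my part - the
firewall commands vary by version), these would show what the database
thinks the rules should be before syncing again:

  # firewall rules recorded for the node
  rocks list host firewall compute-0-0
  # the iptables file Rocks would generate from them
  rocks report host firewall compute-0-0
  # push the generated rules back out
  rocks sync host firewall compute-0-0
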
Thanks,
g
--
Gowtham
Information Technology Services
Michigan Technological University
(906) 487/3593
http://www.it.mtu.edu/