I'd suggest:
1) Make sure ko2iblnd has been brought up (please check whether there is any
error message when ko2iblnd starts up).
2) Run echo +neterror > /proc/sys/lnet/printk, then try lctl ping again; if
it still doesn't work, please post the error messages.
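
The two checks above can be sketched as small shell helpers. This is only a sketch: the function names lnd_loaded and enable_neterror are made up for illustration, and it assumes a node where /proc/modules and /proc/sys/lnet/printk exist.

```shell
# Sketch only: helper names are hypothetical, not part of Lustre.

# Return 0 if the named module appears in a /proc/modules-style listing.
# $1 = module name, $2 = listing file (defaults to /proc/modules).
lnd_loaded() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }' "${2:-/proc/modules}"
}

# Add neterror to the LNet console mask so network errors reach syslog.
# $1 = mask file (defaults to /proc/sys/lnet/printk, as in step 2).
enable_neterror() {
    echo +neterror > "${1:-/proc/sys/lnet/printk}"
}
```

Typical use on the client would be: lnd_loaded ko2iblnd || echo "ko2iblnd is not loaded", then enable_neterror, then lctl ping 172.24.198.112@o2ib and a look at the kernel log.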
Regards
Liang
subbu kl:
> 64 bytes from 172.24.198.112: icmp_seq=1 ttl=64 time=0.052 ms
> 64 bytes from 172.24.198.112: icmp_seq=2 ttl=64 time=0.024 ms
>
> --- 172.24.198.112 ping statistics ---
> 2 packets transmitted, 2 received, 0% packet loss, time 1000ms
> rtt min/avg/max/mdev = 0.024/0.038/0.052/0.014 ms
> [root@p186 ~]# ping 172.24.198.111
> PING 172.24.198.111 (172.24.198.111) 56(84) bytes of data.
> 64 bytes from 172.24.198.111: icmp_seq=1 ttl=64 time=2.16 ms
> 64 bytes from 172.24.198.111: icmp_seq=2 ttl=64 time=0.296 ms
>
> --- 172.24.198.111 ping statistics ---
> 2 packets transmitted, 2 received, 0% packet loss, time 1000ms
> rtt min/avg/max/mdev = 0.296/1.231/2.166/0.935 ms
>
> but I can't ping the NIDs:
> [root@p186 ~]# lctl ping 172.24.198.112@o2ib
> failed to ping 172.24.198.112@o2ib: Input/output error
> [root@p186 ~]# lctl ping 172.24.198.111@o2ib
> failed to ping 172.24.198.111@o2ib: Input/output error
>
> Any idea why LNet can't ping the NIDs?
>
> some more configurations:
> [root@p186 ~]# ibstat
> CA 'mthca0'
> CA type: MT23108
> Number of ports: 2
> Firmware version: 3.5.0
> Hardware version: a1
> Node GUID: 0x0002c9020021550c
>
> Machines are connected via IB switch.
>
> Looking forward to your help.
>
> ~subbu
> ------------------------------------------------------------------------
>
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-...@lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>
We don't have any tips for setting up IPoIB. It looks like Linux can't find the
ifaddr of ib0 on the MDS (-99 is EADDRNOTAVAIL), so I think it's because you
didn't assign an address to ib0 (or the assignment failed) before loading
o2iblnd on the first try.
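
For reading the negative return codes in LNet console messages, a tiny lookup can help. A minimal sketch, assuming Linux errno numbering; lnet_errno_name is a made-up helper and covers only a handful of codes.

```shell
# Sketch only: lnet_errno_name is a hypothetical helper, not a Lustre tool.
# Maps a (possibly negative) Linux errno to its symbolic name.
lnet_errno_name() {
    case "${1#-}" in
        5)   echo EIO ;;
        22)  echo EINVAL ;;
        99)  echo EADDRNOTAVAIL ;;
        110) echo ETIMEDOUT ;;
        111) echo ECONNREFUSED ;;
        113) echo EHOSTUNREACH ;;
        *)   echo "unknown errno $1"; return 1 ;;
    esac
}
```

For example, lnet_errno_name -99 prints EADDRNOTAVAIL, and the "Input/output error" reported by lctl ping corresponds to EIO.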
I can reproduce exactly the same error with:
1. modprobe ib_ipoib
2. ifconfig ib0 up // without assigning any address
3. modprobe ko2iblnd
4. lctl network up
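
To avoid the sequence above, one could check that ib0 already has an IPv4 address before loading ko2iblnd. A rough sketch, assuming ip from iproute2 is available; has_ipv4 and load_o2iblnd_if_ready are hypothetical helpers, and has_ipv4 reads "ip -4 addr show" style output on stdin.

```shell
# Sketch only: these helpers are hypothetical, not Lustre or OFED tools.

# Reads "ip -4 addr show dev <if>" style output on stdin and returns 0
# if at least one "inet <address>" line is present.
has_ipv4() {
    grep -Eq '^[[:space:]]*inet[[:space:]]+[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+'
}

# Load the IB LND only when the interface is configured.
# $1 = interface name (defaults to ib0).
load_o2iblnd_if_ready() {
    ifname="${1:-ib0}"
    if ip -4 addr show dev "$ifname" 2>/dev/null | has_ipv4; then
        modprobe ko2iblnd && lctl network up
    else
        echo "$ifname has no IPv4 address; assign one first" >&2
        return 1
    fi
}
```

On the MDS this would be preceded by an address assignment such as ifconfig ib0 172.24.198.111, followed by load_o2iblnd_if_ready ib0.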
Regards
Liang
subbu kl:
> Liang,
> after executing the following echo:
> echo +neterror > /proc/sys/lnet/printk
>
> lctl ping now shows the following error:
>
> # lctl ping 172.24.198.112@o2ib
> failed to ping 172.24.198.112@o2ib: Input/output error
>
> Jan 16 10:24:14 p128 kernel: Lustre:
> 2750:0:(o2iblnd_cb.c:2687:kiblnd_cm_callback()) 172.24.198.112@o2ib:
> ROUTE ERROR -22
> Jan 16 10:24:14 p128 kernel: Lustre:
> 2750:0:(o2iblnd_cb.c:2101:kiblnd_peer_connect_failed()) Deleting
> messages for 172.24.198.112@o2ib: connection failed
>
> Looks like a problem with the "IB connection manager"!
>
> 1. Do we have any help docs for setting up IPoIB and Lustre? The Lustre
> Operations Manual has very minimal info about this. I think I am
> missing some part of the IPoIB setup here.
> 2. Or is the manual assignment of IP addresses to "ib0" creating
> some problem?
>
>
> Some more supporting info:
> a subnet manager of the following version is also running: OpenSM 3.1.8
# ifconfig ib0 172.24.198.111
Regards
Liang
subbu kl:
> and *"Added LNI" lines *)
> *Jan 16 09:47:09 p128 kernel: Lustre: Added LNI
> 172.24.198.111@o2ib [8/64]
> Jan 16 09:47:09 p128 kernel: Lustre: MGS MGS started*
> Jan 16 09:47:09 p128 kernel: Lustre: Setting parameter
> lustre-MDT0000.mdt.group_upcall in log lustre-MDT0000
> Jan 16 09:47:09 p128 kernel: Lustre: Enabling user_xattr
> Jan 16 09:47:09 p128 kernel: Lustre: lustre-MDT0000: new
> disk, initializing
> Jan 16 09:47:09 p128 kernel: Lustre: MDT lustre-MDT0000
> now serving dev
> (lustre-MDT0000/64db1fc7-03ba-9803-4d20-ab0d2aa66116) with
> recovery enabled
> Jan 16 09:47:09 p128 kernel: Lustre:
> 5274:0:(lproc_mds.c:262:lprocfs_wr_group_upcall())
> lustre-MDT0000: group upcall set to /usr/sbin/l_getgroups
> Jan 16 09:47:09 p128 kernel: Lustre: lustre-MDT0000.mdt:
> set parameter group_upcall=/usr/sbin/l_getgroups
> Jan 16 09:47:09 p128 kernel: Lustre: Server lustre-MDT0000
> on device /dev/loop0 has started
> .
> .
> .
>
>
> ~subbu
>
>
> On Thu, Jan 15, 2009 at 8:37 PM, Liang Zhen <Zhen....@sun.com>
Liang,
please find the info you have asked below.
There are two nodes, MDS and OSS1, connected through a SilverStorm InfiniBand switch, and the MDS is running the IB subnet manager.
~subbu