[Lustre-discuss] Lustre module not loading on client mount

Michael Robbert
Apr 12, 2010, 7:33:05 PM
to lustre-discuss@lists.lustre.org discuss
I am trying to configure a Lustre 1.8.2 client on a CentOS 5.4 machine. I have compiled it from source into RPMs, and all 4 RPMs are installed (lustre, -modules, -tests, and -source). The lustre module loads fine manually with "modprobe lustre", but I cannot get the filesystem to mount automatically at boot. I have added the following to /etc/modprobe.conf:

options lnet networks=o2ib0(ib0)

and these are the entries in my /etc/fstab:

172.16.34.1@o2ib:/home /lustre/home lustre auto,_netdev 1 2
172.16.34.1@o2ib:/scratch /lustre/scratch lustre auto,_netdev 1 2

I have a similar setup with a Lustre 1.6.7.2 client running on RHEL 4.5, and it loads fine there.

What am I missing?

Thanks,
Mike Robbert

_______________________________________________
Lustre-discuss mailing list
Lustre-...@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss

Kit Westneat
Apr 13, 2010, 12:07:43 AM
to Michael Robbert, lustre-discuss@lists.lustre.org discuss
Hey Mike,

Are there any messages in dmesg at boot? I've occasionally seen cases where
the IB takes a second to actually come up. If that's the case, you might
need to add the mounts to rc.local, or try to get openibd to start earlier.
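Starting openibd earlier comes down to init-script order. On CentOS 5 the netfs script is what mounts the _netdev entries from /etc/fstab, so the IB stack needs to start before netfs does. A minimal sketch for comparing the two start priorities in a runlevel directory; the helper names are hypothetical, and it assumes both scripts are installed as SNN links (for example under /etc/rc3.d). An earlier openibd start does not by itself guarantee that a port has already reached ACTIVE.

```shell
#!/bin/bash
# Print the start priority (the NN in SNNname) of an init script in a
# runlevel directory, e.g. /etc/rc3.d/S05openibd -> 05.
# Returns 1 if no matching start link exists.
start_priority() {
    local rcdir="$1" name="$2" link
    for link in "$rcdir"/S[0-9][0-9]"$name"; do
        # An unmatched glob stays literal, so skip it if nothing is there
        [ -e "$link" ] || [ -L "$link" ] || continue
        link=${link##*/}        # basename, e.g. S05openibd
        echo "${link:1:2}"
        return 0
    done
    return 1
}

# Succeed if openibd starts before netfs in the given runlevel directory.
# Returns 2 if either start link is missing.
ib_starts_before_netfs() {
    local rcdir="${1:-/etc/rc3.d}" ib nfs
    ib=$(start_priority "$rcdir" openibd) || return 2
    nfs=$(start_priority "$rcdir" netfs) || return 2
    # 10# forces base 10 so priorities like "08" are not read as octal
    [ "$((10#$ib))" -lt "$((10#$nfs))" ]
}
```

If the check fails, the start priority lives in the "# chkconfig:" header of the init script, and chkconfig re-creates the links from it.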

- Kit


--
---
Kit Westneat
kwes...@datadirectnet.com
812-484-8485

Michael Robbert
Apr 14, 2010, 1:42:29 PM
to Kit Westneat, lustre-discuss@lists.lustre.org discuss
Kit,
I thought it might be a timing issue, but I added mount commands to rc.local and it didn't help. The odd thing is that it does seem to work on subsequent reboots, although I haven't tested enough to know whether that holds every time. The other odd thing is that if the filesystems don't mount at boot, a manual mount command doesn't work unless I run "modprobe lustre" first. This is what I see in that case:

[root@compute-2-1 ~]# mount -a
mount.lustre: mount 172.16.34.1@o2ib:/home at /lustre/home failed: No such device
Are the lustre modules loaded?
Check /etc/modprobe.conf and /proc/filesystems
Note 'alias lustre llite' should be removed from modprobe.conf
mount.lustre: mount 172.16.34.1@o2ib:/scratch at /lustre/scratch failed: No such device
Are the lustre modules loaded?
Check /etc/modprobe.conf and /proc/filesystems
Note 'alias lustre llite' should be removed from modprobe.conf
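The mount.lustre hint asks whether the Lustre modules are loaded. /proc/modules, the file lsmod reads, lists one loaded module per line with its name in the first field, so that question can be answered directly. A minimal sketch; the function names are hypothetical, and the module list (lnet, the o2ib LND ko2iblnd, and the client module lustre) is an assumption about what an o2ib client needs.

```shell
#!/bin/bash
# Succeed if the named module appears in the given module list
# (defaults to /proc/modules). Compare the first field exactly so that
# e.g. "lustre" does not match some other module whose name contains it.
module_loaded() {
    local name="$1" table="${2:-/proc/modules}"
    awk -v m="$name" '$1 == m { found = 1 } END { exit !found }' "$table"
}

# Report each module a Lustre-over-o2ib client mount depends on;
# return 1 if any of them is missing.
check_lustre_client_modules() {
    local table="${1:-/proc/modules}" m rc=0
    for m in lnet ko2iblnd lustre; do
        if module_loaded "$m" "$table"; then
            echo "$m: loaded"
        else
            echo "$m: NOT loaded"
            rc=1
        fi
    done
    return "$rc"
}
```

Running `check_lustre_client_modules || modprobe lustre` before `mount -a -t lustre` mirrors the manual workaround described above.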

Here are some dmesg entries from a boot that does not mount the FSs:

ADDRCONF(NETDEV_UP): eth0: link is not ready
bnx2: eth0 NIC Copper Link is Up, 1000 Mbps full duplex
ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
ADDRCONF(NETDEV_UP): ib0: link is not ready
ADDRCONF(NETDEV_CHANGE): ib0: link becomes ready
Lustre: OBD class driver, http://www.lustre.org/
Lustre: Lustre Version: 1.8.2
Lustre: Build Version: 1.8.2-20100122190848-PRISTINE-2.6.18-164.15.1.el5
ko2iblnd: disagrees about version of symbol ib_fmr_pool_unmap
ko2iblnd: Unknown symbol ib_fmr_pool_unmap
... (lots more ko2iblnd errors here; is this part of the problem or a red herring?) ...
ko2iblnd: disagrees about version of symbol ib_fmr_pool_map_phys
ko2iblnd: Unknown symbol ib_fmr_pool_map_phys
LustreError: 3288:0:(api-ni.c:1043:lnet_startup_lndnis()) Can't load LND o2ib, module ko2iblnd, rc=256
LustreError: 3288:0:(events.c:729:ptlrpc_init_portals()) network initialisation failed
LustreError: 165-2: Nothing registered for client mount! Is the 'lustre' module loaded?
LustreError: 3381:0:(obd_mount.c:2042:lustre_fill_super()) Unable to mount (-19)


Thanks,
Mike

Nathan Dauchy
Apr 14, 2010, 2:00:58 PM
to Michael Robbert, lustre-...@lists.lustre.org
Michael Robbert wrote:
> Kit,
> I thought that it may be a timing issue, but I added mount commands to rc.local and it didn't help.

Michael,

I'm not sure of the root cause of your mount problems, but we were also
hitting a timing problem when mounting filesystems over InfiniBand at
boot time. Because IB may still not be initialized even when rc.local
runs, my solution was to add the following to the "start)" section of
/etc/rc.d/init.d/netfs. You could put something similar in rc.local if
you prefer.

# Spin until we find an "Active" IB device
if [ -d /sys/class/infiniband ]; then
    tries=1
    maxtries=10
    delay=5
    while [ $tries -le $maxtries ]; do
        # Each port's state file reads like "4: ACTIVE"
        grep -q ACTIVE /sys/class/infiniband/*/ports/*/state 2>/dev/null &&
            break
        logger -s -t netfs "WARNING: No ACTIVE Infiniband ports found: try $tries/$maxtries, sleep $delay"
        sleep $delay
        (( tries++ ))
        [ $tries -gt $maxtries ] &&
            logger -s -t netfs "ERROR: No ACTIVE Infiniband ports found."
    done
fi
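Putting "something similar in rc.local" could look like the following minimal sketch: the same polling wrapped in a function (the name is hypothetical), with the mount attempted only once a port is ACTIVE. The sysfs root, retry count and delay are parameters so the function can be tried against a scratch directory. Unlike the netfs version, it returns failure when /sys/class/infiniband is missing, on the assumption that an o2ib client cannot mount without IB anyway.

```shell
#!/bin/bash
# Poll until any InfiniBand port reports ACTIVE, giving up after maxtries
# attempts. Port state files read like "4: ACTIVE".
wait_for_ib_active() {
    local root="${1:-/sys/class/infiniband}"
    local maxtries="${2:-10}" delay="${3:-5}" tries=1
    [ -d "$root" ] || return 1
    while [ "$tries" -le "$maxtries" ]; do
        # -s hides "No such file" errors if no port directories exist yet
        if grep -qs ACTIVE "$root"/*/ports/*/state; then
            return 0
        fi
        logger -s -t rc.local "WARNING: no ACTIVE InfiniBand port: try $tries/$maxtries, sleep $delay"
        sleep "$delay"
        tries=$((tries + 1))
    done
    logger -s -t rc.local "ERROR: no ACTIVE InfiniBand port found"
    return 1
}

# Usage in /etc/rc.d/rc.local (sketch): mount the Lustre entries from
# /etc/fstab only once IB is up.
#   wait_for_ib_active && modprobe lustre && mount -a -t lustre
```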


Hope this helps!

-Nathan

Kit Westneat
Apr 15, 2010, 12:21:42 AM
to Michael Robbert, lustre-discuss@lists.lustre.org discuss
Hey Mike,

That's pretty odd; it looks like the o2ib module has a symbol mismatch
with the OFED driver. I'm surprised it works at all. Can you send the
dmesg output after running modprobe lustre and mounting, as well as the
output of lctl list_nids?
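The requested output can be gathered in one pass. A minimal sketch; the function names and the /tmp output path are hypothetical, and it must run as root on the client. filter_lnet_messages keeps only the Lustre, LNet and ko2iblnd kernel messages so the o2ib errors are not buried among unrelated boot messages.

```shell
#!/bin/bash
# Keep only kernel-log lines from Lustre (including LustreError), LNet
# and the o2ib LND; reads stdin, writes stdout.
filter_lnet_messages() {
    grep -E 'Lustre|LNet|ko2iblnd'
}

# Load the client, attempt the fstab mounts, then record the filtered
# dmesg and the NIDs LNET has configured into one file.
collect_lustre_diag() {
    local out="${1:-/tmp/lustre-diag.txt}"
    {
        echo "== modprobe lustre =="
        modprobe lustre 2>&1
        echo "== mount -a -t lustre =="
        mount -a -t lustre 2>&1
        echo "== dmesg (Lustre/LNet/o2ib lines) =="
        dmesg | filter_lnet_messages
        echo "== lctl list_nids =="
        lctl list_nids 2>&1
    } > "$out"
    echo "wrote $out"
}
```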

Thanks,
Kit

Michael Robbert
Apr 15, 2010, 3:56:12 PM
to Kit Westneat, lustre-discuss@lists.lustre.org discuss
I think I've found the problem: the OFED Roll that I'm using. When a node is first built, the Roll recompiles the OFED modules for the current kernel. I'm still deciphering the exact sequence of events, but I think I need to add a reboot at the end of the process.
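One way to test the mixed-build theory is to check where modprobe would load each InfiniBand module from. A minimal sketch, assuming the rebuilt OFED modules are installed under /lib/modules/<kernel>/updates/ (a common OFED layout, but an assumption for this Roll) and the stock kernel copies live under /lib/modules/<kernel>/kernel/. The helper names and the module list are hypothetical. `modinfo -F filename` reports the file modprobe would pick, not the copy already loaded, which is consistent with a reboot after the rebuild being needed.

```shell
#!/bin/bash
# Classify a module file path by where it was installed.
# "updates" is checked first because OFED paths may also contain "/kernel/".
module_origin() {
    case "$1" in
        */updates/*) echo "ofed" ;;
        */kernel/*)  echo "stock" ;;
        *)           echo "unknown" ;;
    esac
}

# Print which file modprobe would load for each IB module ko2iblnd uses.
report_ib_modules() {
    local m path
    for m in ib_core rdma_cm ib_ipoib; do
        if ! path=$(modinfo -F filename "$m" 2>/dev/null); then
            echo "$m: not found"
            continue
        fi
        echo "$m: $(module_origin "$path") ($path)"
    done
}
```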

Mike
