setting lnet and mounting lustre at boot


Matt

Mar 10, 2021, 7:41:01 PM
to Warewulf
Having a heck of a time setting up LNet and mounting Lustre at boot. Of course, everything works when done manually.

For some reason /etc/modprobe.d/lustre.conf is not being read, so I am trying to run these commands after boot:

/usr/sbin/lnetctl net add --net o2ib4 --if ib0
mount.lustre 172.40.2.60@o2ib4:172.40.2.61@o2ib4:/lustre4 /lustre4/

Tried putting them in a Warewulf script, but still cannot get this to work:

 wwsh file list
dynamic_hosts           :  rw-r--r-- 0   root root             7708 /etc/hosts
fstab                   :  r-------- 1   root root             1155 /etc/fstab
ifcfg-ib0               :  r-------- 1   root root              210 /etc/sysconfig/network-scripts/ifcfg-ib0
...
lustre.sh               :  rwxr-xr-x 1   root root              125 /etc/rc3.d/S99lustre
munge.key               :  r-------- 1   munge root            1024 /etc/munge/munge.key
...
[root@cedar-deploy chroots]# wwsh file show lustre.sh
#!/bin/sh
/usr/sbin/lnetctl net add --net o2ib4 --if ib0
mount.lustre 172.40.2.60@o2ib4:172.40.2.61@o2ib4:/lustre4 /lustre4/

Even tried putting this into /etc/rc.local but no glory.

Any suggestions on how to handle this one?

Thanks!

MB




Nathan Crawford

Mar 10, 2021, 8:24:08 PM
to ware...@lbl.gov
Hi Matt,

  Which versions (OS, Lustre) are you using? For us (CentOS 7.8, Lustre 2.12.6), all the lnet configuration is done in /etc/lnet.conf, with the systemd service script basically doing a "modprobe lnet; lnetctl lnet configure; lnetctl import /etc/lnet.conf". We still do the mount in rc.local, but that's an artifact of history.
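
The sequence quoted above can be sketched as a small script. This is only an illustration of the three commands named in the message, not the actual service script shipped with Lustre. The `run` helper and the `DRY_RUN` switch are assumptions added so the sketch can be tried safely on a machine without Lustre installed.

```shell
#!/bin/sh
# Sketch of the LNet bring-up sequence described in the message above.
# DRY_RUN (an assumption of this sketch) defaults to 1, so the commands are
# only printed; set DRY_RUN=0 on a real client to execute them as root.
DRY_RUN="${DRY_RUN:-1}"

# run CMD [ARGS...]: print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

lnet_up() {
    # Load the LNet kernel module.
    run modprobe lnet || return 1
    # Bring up LNet itself.
    run lnetctl lnet configure || return 1
    # Apply the per-node YAML configuration.
    run lnetctl import /etc/lnet.conf
}

lnet_up
```

With the default dry-run setting the script prints the three commands in order, which makes it easy to check the sequence before running it for real.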

-Nate

--
Dr. Nathan Crawford              nathan....@uci.edu
Director of Scientific Computing
School of Physical Sciences
164 Rowland Hall                 Office: 2101 Natural Sciences II
University of California, Irvine  Phone: 949-824-4508
Irvine, CA 92697-2025, USA

MB

Mar 10, 2021, 8:31:18 PM
to ware...@lbl.gov
Ok, I noticed that. Lustre is 2.1.5 x86_64, with arm64 clients, which was fun to get working. We finally had to go to 2.14.0 due to some issues with 2.12.6; I can't recall the details anymore, something to do with max_frags settings on the client and the servers. Anyway, pushing along: are you putting Warewulf variables in lnet.conf, like in ifcfg-ib0? Specifically the IP of the NIDs. Thanks.

Nathan Crawford

Mar 10, 2021, 9:21:12 PM
to ware...@lbl.gov
Yep, lnet.conf gets generated for each node. Here's ours for a mixed QIB/MLX IB fabric (L_ETHDEV is just a custom Warewulf variable, as the Ethernet device changes on some of our nodes):
net:
    - net type: o2ib1
      local NI(s):
        - nid: %{NETDEVS::IB0::IPADDR}@o2ib1
          interfaces:
              0: ib0
          tunables:
              peer_credits: 42
              peer_buffer_credits: 0
              credits: 1024
          lnd tunables:
              peercredits_hiw: 64
              map_on_demand: 256
              concurrent_sends: 256
              fmr_pool_size: 2048
              fmr_flush_trigger: 512
              fmr_cache: 1
              ntx: 2048
    - net type: tcp
      local NI(s):
        - nid: %{NETDEVS::ETH0::IPADDR}@tcp
          interfaces:
              0: %{L_ETHDEV}
          tunables:
              peer_credits: 16
              peer_buffer_credits: 0
              credits: 256

  Also, remember to enable the lnet service or script in the VNFS, as it is disabled by default.
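
One way to enable the service inside the VNFS image might look like the sketch below. It assumes a Warewulf 3 setup where the VNFS is built from a chroot directory with `wwvnfs`; the chroot path `/var/chroots/centos7` is hypothetical, and the `run` helper and `DRY_RUN` switch are additions of this sketch so it only prints the commands by default.

```shell
#!/bin/sh
# CHROOT is a hypothetical path; substitute the chroot your VNFS is built from.
CHROOT="${CHROOT:-/var/chroots/centos7}"
# DRY_RUN=1 (the default here) prints the commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"

# run CMD [ARGS...]: print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

enable_lnet_in_vnfs() {
    # Create the enablement symlinks inside the chroot, not on the host.
    run systemctl --root="$CHROOT" enable lnet.service || return 1
    # Rebuild the VNFS image so nodes pick up the change at their next boot.
    run wwvnfs --chroot "$CHROOT"
}

enable_lnet_in_vnfs
```

Using `systemctl --root=` avoids entering the chroot at all, since enabling a unit only creates symlinks under the given root.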

-Nate

MB

Mar 10, 2021, 9:55:44 PM
to ware...@lbl.gov
Ok thanks, that was helpful. Looks like we have some service ordering issues going on now; not a WW issue, I think.

Mar 10 20:44:26 cedar-cn01 lnetctl[2952]: add:
Mar 10 20:44:26 cedar-cn01 lnetctl[2952]:    - net:
Mar 10 20:44:26 cedar-cn01 lnetctl[2952]:          errno: -1
Mar 10 20:44:26 cedar-cn01 lnetctl[2952]:          descr: "couldn't query intf ib0"
Mar 10 20:44:26 cedar-cn01 systemd[1]: lnet.service: Main process exited, code=exited, status=255/n/a
Mar 10 20:44:26 cedar-cn01 systemd[1]: lnet.service: Failed with result 'exit-code'.
Mar 10 20:44:26 cedar-cn01 systemd[1]: Failed to start lnet management.
Mar 10 20:45:51 cedar-cn01 kernel: LNetError: 3224:0:(lib-move.c:2076:lnet_handle_find_routed_path()) peer 172.40.1.13@o2ib4 has no available nets

(manually ran systemctl start lnet after logging in)
Mar 10 20:46:25 cedar-cn01 systemd[1]: Starting lnet management...
Mar 10 20:46:26 cedar-cn01 systemd[1]: Started lnet management.
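
The log above is consistent with lnet starting before ib0 was usable, since the same service starts cleanly a couple of minutes later. Assuming that is the cause, a bounded wait for the interface is one crude workaround. The function below is a minimal sketch: `wait_for_iface` and the `SYSFS_NET` override are hypothetical names introduced here, and it only checks that the kernel reports the link as "up", not that an address has been assigned.

```shell
#!/bin/sh
# SYSFS_NET is overridable only so the function can be exercised without
# real hardware; on a node it should stay at /sys/class/net.
SYSFS_NET="${SYSFS_NET:-/sys/class/net}"

# wait_for_iface IFACE [TIMEOUT_SECONDS]
# Returns 0 once IFACE reports operstate "up", 1 if the timeout expires first.
wait_for_iface() {
    iface="$1"
    remaining="${2:-30}"
    while :; do
        # An interface that does not exist yet simply yields an empty state.
        state="$(cat "$SYSFS_NET/$iface/operstate" 2>/dev/null || true)"
        if [ "$state" = "up" ]; then
            return 0
        fi
        if [ "$remaining" -le 0 ]; then
            return 1
        fi
        sleep 1
        remaining=$((remaining - 1))
    done
}

# Example use on a node (not executed here):
#   wait_for_iface ib0 60 && systemctl start lnet
```

A bounded loop like this is preferable to a fixed sleep because it returns as soon as the link is up and fails visibly if it never comes up.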

Ryan Novosielski

Mar 10, 2021, 11:08:02 PM
to ware...@lbl.gov
You can do something like After=ib0.device (I forget the specific syntax for the device name) in a systemd override file, or worst case add a sleep (though that's not the place to start).
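
A minimal sketch of such an override follows. systemd names network interface device units `sys-subsystem-net-devices-<name>.device`, so for ib0 the unit would be `sys-subsystem-net-devices-ib0.device`. The helper name `write_lnet_dropin` and the file name `ib0.conf` are hypothetical; the thread does not say what the poster ended up using.

```shell
#!/bin/sh
# write_lnet_dropin ROOT
# Writes a drop-in for lnet.service under ROOT. Pass "" for the live system
# or the path of a VNFS chroot. The file name ib0.conf is arbitrary.
write_lnet_dropin() {
    dir="${1%/}/etc/systemd/system/lnet.service.d"
    mkdir -p "$dir" || return 1
    cat > "$dir/ib0.conf" <<'EOF'
[Unit]
# Pull in the ib0 device unit and order lnet.service after it.
Wants=sys-subsystem-net-devices-ib0.device
After=sys-subsystem-net-devices-ib0.device
EOF
}

# Example use (not executed here):
#   write_lnet_dropin /path/to/chroot     # then rebuild the VNFS
#   write_lnet_dropin "" && systemctl daemon-reload
```

Note that the device unit becomes active when the kernel registers the interface, which can be earlier than the point where it has an address. If lnet still fails with this ordering alone, ordering after `network-online.target` as well may be needed.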

--
#BlackLivesMatter
____
|| \\UTGERS,       |---------------------------*O*---------------------------
||_// the State     |         Ryan Novosielski - novo...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\    of NJ     | Office of Advanced Research Computing - MSB C630, Newark
    `'

Matt

Mar 11, 2021, 9:57:55 AM
to Warewulf, Ryan Novosielski
Thanks, we got it after diving into systemd. Always changing things!