Lustre: 2637:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked()) Closing conn to 10.168.22.106@o2ib: error 0(waiting)
LustreError: 3898:0:(events.c:66:request_out_callback()) @@@ type 4, status -103 req@ffff81021933d400 x1312117388289524/t0 o4->sparta-OS...@10.168.22.106@o2ib:6/4 lens 448/608 e 0 to 1 dl 1251335107 ref 3 fl Rpc:/0/0 rc 0/0
Lustre: 3926:0:(client.c:1383:ptlrpc_expire_one_request()) @@@ Request x1312117388289524 sent from sparta-OST0003-osc-ffff8102198f1000 to NID 10.168.22.106@o2ib 0s ago has failed due to network error (limit 7s).
req@ffff81021933d400 x1312117388289524/t0 o4->sparta-OS...@10.168.22.106@o2ib:6/4 lens 448/608 e 0 to 1 dl 1251335107 ref 2 fl Rpc:/0/0 rc 0/0
Lustre: sparta-OST0003-osc-ffff8102198f1000: Connection to service sparta-OST0003 via nid 10.168.22.106@o2ib was lost; in progress operations using this service will wait for recovery to complete.
LustreError: 11-0: an error occurred while communicating with 10.168.22.106@o2ib. The ost_connect operation failed with -16
LustreError: Skipped 2 previous similar messages
Lustre: 3926:0:(client.c:1383:ptlrpc_expire_one_request()) @@@ Request x1312117388289487 sent from sparta-OST0002-osc-ffff8102198f1000 to NID 10.168.22.106@o2ib 7s ago has timed out (limit 7s).
req@ffff8102192b6400 x1312117388289487/t0 o4->sparta-OS...@10.168.22.106@o2ib:6/4 lens 448/608 e 0 to 1 dl 1251335106 ref 2 fl Rpc:/0/0 rc 0/0
Lustre: sparta-OST0002-osc-ffff8102198f1000: Connection to service sparta-OST0002 via nid 10.168.22.106@o2ib was lost; in progress operations using this service will wait for recovery to complete.
Lustre: sparta-OST0002-osc-ffff8102198f1000: Connection restored to service sparta-OST0002 using nid 10.168.22.106@o2ib.
Lustre: 3928:0:(import.c:508:import_select_connection()) sparta-OST0003-osc-ffff8102198f1000: tried all connections, increasing latency to 6s
Lustre: 3928:0:(import.c:508:import_select_connection()) Skipped 2 previous similar messages
Lustre: sparta-OST0003-osc-ffff8102198f1000: Connection restored to service sparta-OST0003 using nid 10.168.22.106@o2ib.
12 == IB_WC_RETRY_EXC_ERR, which usually indicates faulty links in the
network, or some other application (such as an MPI application) hogging
network resources at Lustre's expense. We once observed such
errors at times when there was no I/O at all: a bad MPI implementation was
resending aggressively upon RNR (receiver not ready), so that even the tiny
amount of keepalive traffic from Lustre would end up with IB_WC_RETRY_EXC_ERR.
Diagnostics from OFED and the fabric should point you to faulty
hardware, and setting up IB QoS should prevent Lustre from being hurt
badly by other traffic.
Meanwhile, there's a potential workaround mentioned here:
https://bugzilla.lustre.org/show_bug.cgi?id=14223#c36
But it's certainly not a good solution in the long run.
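
For reference, the numeric codes discussed above can be decoded with a small helper. This is only a sketch: the status list reproduces the first 14 values of the verbs ib_wc_status enum in order, the errno table covers just the two Linux values seen (negated) in the log, and the function names wc_status_name and errno_names are made up for illustration, not part of Lustre or OFED.

```python
import re

# First 14 values of the verbs ib_wc_status enum, in declaration order,
# so the list index equals the numeric status code.
IB_WC_STATUS = [
    "IB_WC_SUCCESS",
    "IB_WC_LOC_LEN_ERR",
    "IB_WC_LOC_QP_OP_ERR",
    "IB_WC_LOC_EEC_OP_ERR",
    "IB_WC_LOC_PROT_ERR",
    "IB_WC_WR_FLUSH_ERR",
    "IB_WC_MW_BIND_ERR",
    "IB_WC_BAD_RESP_ERR",
    "IB_WC_LOC_ACCESS_ERR",
    "IB_WC_REM_INV_REQ_ERR",
    "IB_WC_REM_ACCESS_ERR",
    "IB_WC_REM_OP_ERR",
    "IB_WC_RETRY_EXC_ERR",
    "IB_WC_RNR_RETRY_EXC_ERR",
]

# Linux errno values that appear (negated) in the log above.
# Deliberately hardcoded: errno numbering differs between platforms.
LINUX_ERRNO = {16: "EBUSY", 103: "ECONNABORTED"}


def wc_status_name(code):
    """Hypothetical helper: map a work-completion status code to its name."""
    if 0 <= code < len(IB_WC_STATUS):
        return IB_WC_STATUS[code]
    return f"unknown ({code})"


def errno_names(log_text):
    """Hypothetical helper: name the 'status -N' / 'failed with -N' codes."""
    codes = re.findall(r"(?:status|failed with) -(\d+)", log_text)
    return [LINUX_ERRNO.get(int(c), f"errno {c}") for c in codes]


if __name__ == "__main__":
    print(wc_status_name(12))
    print(errno_names("type 4, status -103 ... ost_connect operation failed with -16"))
```

Read this way, the -103 in request_out_callback() is ECONNABORTED and the -16 from ost_connect is EBUSY, both consistent with a connection being torn down and re-established rather than with a problem on the OST itself.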
Thanks,
Isaac