[Lustre-discuss] Slow access or hang state of Lustre filesystem on client machines


faheem patel

Sep 19, 2011, 5:10:27 AM
to lustre-...@lists.lustre.org
Hi All,

Thanks in advance.
We are new to the Lustre filesystem.

We have installed a Lustre filesystem of about 30TB, which is mounted on the client systems.

We have 2 MDC servers (MDS), which are in HA mode with IB interfaces, and 1 /mdc filesystem of 500GB mounted.

We have 2 OSS servers with HA configured between them.
The 1st OSS server has 8 OSTs.
The 2nd OSS server has 9 OSTs.
That is a total of 17 OSTs distributed between the 2 OSS servers, which are configured in HA with a bonded IB interface on both servers.

All Lustre clients and the OSS and MDS servers are on the IB (InfiniBand) network.

We have been getting the following error messages on our OSS servers and also on our client machine for the past week.



----------------------------------------------------------------------------------------------------------------------------------------

Lustre Server OSS error logs

Sep 19 11:08:57 oss1 kernel: Lustre: 15007:0:(ldlm_lib.c:872:target_handle_connect()) lustre-OST0006: refuse reconnection from 65820f02-c4f0-e79a...@10.148.0.2@o2ib to 0xffff8806320e1800; still busy with 1 active RPCs
Sep 19 11:08:57 oss1 kernel: Lustre: 15007:0:(ldlm_lib.c:872:target_handle_connect()) Skipped 1 previous similar message
Sep 19 11:09:18 oss1 kernel: Lustre: 13143:0:(ldlm_lib.c:572:target_handle_reconnect()) lustre-OST0006: 65820f02-c4f0-e79a-4778-15a9b4653a88 reconnecting
Sep 19 11:09:18 oss1 kernel: Lustre: 13143:0:(ldlm_lib.c:572:target_handle_reconnect()) Skipped 18 previous similar messages
Sep 19 11:09:26 oss1 kernel: Lustre: 13255:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request x1380348987462916 sent from lustre-OST0008 to NID 10.148.0.2@o2ib 7s ago has timed out (7s prior to deadline).
Sep 19 11:09:26 oss1 kernel:   req@ffff880631af7400 x1380348987462916/t0 o104->@NET_0x500000a940002_UUID:15/16 lens 296/384 e 0 to 1 dl 1316410765 ref 2 fl Rpc:N/0/0 rc 0/0
Sep 19 11:09:26 oss1 kernel: LustreError: 138-a: lustre-OST0008: A client on nid 10.148.0.2@o2ib was evicted due to a lock blocking callback to 10.148.0.2@o2ib timed out: rc -107
Sep 19 11:09:26 oss1 kernel: LustreError: 13122:0:(client.c:841:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff880312465c00 x1380348987462920/t0 o105->@NET_0x500000a940002_UUID:15/16 lens 344/384 e 0 to 1 dl 0 ref 1 fl Rpc:N/0/0 rc 0/0
Sep 19 11:09:26 oss1 kernel: LustreError: 13122:0:(ldlm_lockd.c:595:ldlm_handle_ast_error()) ### client (nid 10.148.0.2@o2ib) returned 0 from completion AST ns: filter-lustre-OST0008_UUID lock: ffff880629826c00/0x7f1137a31caded22 lrc: 3/0,0 mode: PW/PW res: 10165838/0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x0 remote: 0xb4433ce500b30ffb expref: 440 pid: 13255 timeout 0
Sep 19 11:09:38 oss1 kernel: LustreError: 13117:0:(ldlm_lockd.c:1824:ldlm_cancel_handler()) operation 103 from 12345-10.148.0.2@o2ib with bad export cookie 9156160691120629456
Sep 19 11:26:58 oss1 gdm-session-worker[15599]: PAM pam_putenv: NULL pam handle passed
------------------------------------------------------------------------------------------------------
Lustre client error log messages

Sep 19 10:38:43 service0 kernel: [ 6094.583298]   req@ffff880b0f7f0800 x1380348451643117/t0 o8->lustre-OS...@10.148.0.107@o2ib:28/4 lens 368/584 e 0 to 1 dl 1316408923 ref 2 fl Rpc:N/0/0 rc 0/0
Sep 19 10:38:43 service0 kernel: [ 6094.583305] Lustre: 8565:0:(client.c:1476:ptlrpc_expire_one_request()) Skipped 6 previous similar messages
Sep 19 10:38:44 service0 kernel: [ 6095.582711] Lustre: 8566:0:(import.c:517:import_select_connection()) lustre-OST0001-osc-ffff8806234ee000: tried all connections, increasing latency to 8s
Sep 19 10:38:46 service0 kernel: [ 6097.355378] LustreError: 11-0: an error occurred while communicating with 10.148.0.107@o2ib. The ost_connect operation failed with -16
Sep 19 10:38:46 service0 kernel: [ 6097.355381] LustreError: Skipped 20 previous similar messages
Sep 19 10:38:58 service0 kernel: [ 6109.582174] Lustre: lustre-OST0001-osc-ffff8806234ee000: Connection restored to service lustre-OST0001 using nid 10.148.0.107@o2ib.
Sep 19 10:39:55 service0 kernel: [ 6166.617376] Lustre: 14902:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request x1380348451645667 sent from lustre-OST0000-osc-ffff8806234ee000 to NID 10.148.0.106@o2ib 14s ago has timed out (14s prior to deadline).
Sep 19 10:39:55 service0 kernel: [ 6166.617381]   req@ffff8805f69bc800 x1380348451645667/t0 o101->lustre-OS...@10.148.0.106@o2ib:28/4 lens 296/544 e 0 to 1 dl 1316408995 ref 2 fl Rpc:/0/0 rc 0/0
Sep 19 10:39:55 service0 kernel: [ 6166.617390] Lustre: 14902:0:(client.c:1476:ptlrpc_expire_one_request()) Skipped 8 previous similar messages
Sep 19 10:39:55 service0 kernel: [ 6166.617402] Lustre: lustre-OST0000-osc-ffff8806234ee000: Connection to service lustre-OST0000 via nid 10.148.0.106@o2ib was lost; in progress operations using this service will wait for recovery to complete.
Sep 19 10:39:55 service0 kernel: [ 6166.617406] Lustre: Skipped 8 previous similar messages
Sep 19 10:39:56 service0 sshd[14904]: Accepted keyboard-interactive/pam for root from 192.9.70.32 port 33623 ssh2
Sep 19 10:40:08 service0 kernel: [ 6179.616393] Lustre: 8565:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request x1380348451645687 sent from lustre-OST0000-osc-ffff8806234ee000 to NID 10.148.0.106@o2ib 13s ago has timed out (13s prior to deadline).
Sep 19 10:40:08 service0 kernel: [ 6179.616396]   req@ffff880af48e5000 x1380348451645687/t0 o8->lustre-OS...@10.148.0.106@o2ib:28/4 lens 368/584 e 0 to 1 dl 1316409008 ref 2 fl Rpc:N/0/0 rc 0/0
Sep 19 10:40:09 service0 kernel: [ 6180.616338] Lustre: 8566:0:(import.c:517:import_select_connection()) lustre-OST0000-osc-ffff8806234ee000: tried all connections, increasing latency to 9s
Sep 19 10:40:09 service0 kernel: [ 6180.616344] Lustre: 8566:0:(import.c:517:import_select_connection()) Skipped 8 previous similar messages
Sep 19 10:40:16 service0 kernel: [ 6187.219814] Lustre: 8564:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request x1380348451645670 sent from lustre-OST0000-osc-ffff8806234ee000 to NID 10.148.0.106@o2ib 31s ago has timed out (31s prior to deadline).
Sep 19 10:40:16 service0 kernel: [ 6187.219818]   req@ffff880b27740800 x1380348451645670/t0 o400->lustre-OS...@10.148.0.106@o2ib:28/4 lens 192/384 e 0 to 1 dl 1316409016 ref 1 fl Rpc:N/0/0 rc 0/0
Sep 19 10:40:16 service0 kernel: [ 6187.219843] Lustre: lustre-OST0003-osc-ffff8806234ee000: Connection to service lustre-OST0003 via nid 10.148.0.106@o2ib was lost; in progress operations using this service will wait for recovery to complete.
Sep 19 10:40:21 service0 kernel: [ 6192.219448] Lustre: lustre-OST0010-osc-ffff8806234ee000: Connection to service lustre-OST0010 via nid 10.148.0.106@o2ib was lost; in progress operations using this service will wait for recovery to complete.
Sep 19 10:40:21 service0 kernel: [ 6192.219453] Lustre: Skipped 2 previous similar messages
Sep 19 10:40:24 service0 kernel: [ 6195.615180] Lustre: 8566:0:(import.c:517:import_select_connection()) lustre-OST0000-osc-ffff8806234ee000: tried all connections, increasing latency to 10s
Sep 19 10:40:25 service0 kernel: [ 6196.029170] LustreError: 11-0: an error occurred while communicating with 10.148.0.106@o2ib. The ost_connect operation failed with -16
Sep 19 10:40:25 service0 kernel: [ 6196.029174] LustreError: Skipped 8 previous similar messages
Sep 19 10:40:25 service0 kernel: [ 6196.029345] Lustre: lustre-OST0000-osc-ffff8806234ee000: Connection restored to service lustre-OST0000 using nid 10.148.0.106@o2ib.
Sep 19 10:40:25 service0 kernel: [ 6196.029349] Lustre: Skipped 8 previous similar messages
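
To make excerpts like the two above easier to compare, here is a minimal sketch that tallies evictions, RPC timeouts and failed operations per peer NID. The regular expressions are assumptions written only against the message formats pasted in this post, so they may need adjusting for other Lustre versions. The function name `summarize` and the `SAMPLE` list are illustrative and not part of any Lustre tool. The return codes -16 and -107 are mapped to their Linux errno names, EBUSY and ENOTCONN.

```python
import re
from collections import Counter

# Linux errno names for the two return codes that appear in the logs above.
ERRNO_NAMES = {16: "EBUSY", 107: "ENOTCONN"}

# Patterns written against the message formats shown in this post only
# (an assumption; other Lustre versions may word these messages differently).
EVICT_RE = re.compile(r"A client on nid (\S+@o2ib) was evicted.*\brc -(\d+)")
TIMEOUT_RE = re.compile(r"to NID (\S+@o2ib) \d+s ago has timed out")
FAILED_RE = re.compile(
    r"communicating with (\S+@o2ib)\. The (\w+) operation failed with -(\d+)"
)

# Five lines copied verbatim from the OSS and client logs above.
SAMPLE = [
    "Sep 19 11:09:26 oss1 kernel: Lustre: 13255:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request x1380348987462916 sent from lustre-OST0008 to NID 10.148.0.2@o2ib 7s ago has timed out (7s prior to deadline).",
    "Sep 19 11:09:26 oss1 kernel: LustreError: 138-a: lustre-OST0008: A client on nid 10.148.0.2@o2ib was evicted due to a lock blocking callback to 10.148.0.2@o2ib timed out: rc -107",
    "Sep 19 10:38:46 service0 kernel: [ 6097.355378] LustreError: 11-0: an error occurred while communicating with 10.148.0.107@o2ib. The ost_connect operation failed with -16",
    "Sep 19 10:39:55 service0 kernel: [ 6166.617376] Lustre: 14902:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request x1380348451645667 sent from lustre-OST0000-osc-ffff8806234ee000 to NID 10.148.0.106@o2ib 14s ago has timed out (14s prior to deadline).",
    "Sep 19 10:40:25 service0 kernel: [ 6196.029170] LustreError: 11-0: an error occurred while communicating with 10.148.0.106@o2ib. The ost_connect operation failed with -16",
]


def summarize(lines):
    """Count evictions, RPC timeouts and failed operations per peer NID."""
    counts = Counter()
    for line in lines:
        m = EVICT_RE.search(line)
        if m:
            name = ERRNO_NAMES.get(int(m.group(2)), m.group(2))
            counts[(f"evicted:{name}", m.group(1))] += 1
            continue
        m = TIMEOUT_RE.search(line)
        if m:
            counts[("timeout", m.group(1))] += 1
            continue
        m = FAILED_RE.search(line)
        if m:
            name = ERRNO_NAMES.get(int(m.group(3)), m.group(3))
            counts[(f"{m.group(2)}:{name}", m.group(1))] += 1
    return counts


if __name__ == "__main__":
    # Print one row per (event, NID) pair with its count.
    for (event, nid), n in sorted(summarize(SAMPLE).items()):
        print(f"{n:3d}  {event:22s} {nid}")
```

To run it over a full log instead of the sample, pass an open file object to the function, for example `summarize(open("/var/log/messages"))`, where the path is an assumption about where syslog writes on the machine in question.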




Thanks and Regards,

Faheem Patel
 