[Lustre-discuss] I/O error on clients


Peter Kitchener

Jul 6, 2010, 2:19:10 AM
to lustre-...@lists.lustre.org
Hi all, 

I have been troubleshooting a strange problem with our Lustre setup. Under high load, our developers are complaining that various processes they run error out with an I/O error.

Our setup is small: 1 MDS, 2 OSSes (10 OSTs, 5 per OSS) and 13 clients (152 cores). The storage is all local, 60 TB usable (30 TB per OSS) in a software RAID6 setup. All of the machines are connected via 10 Gigabit Ethernet. The clients run Rocks 5.3 (CentOS 5.4) and the servers run CentOS 5.4 with kernel 2.6.18-164.11.1.el5_lustre.1.8.2. The clients run an un-patched vanilla kernel from CentOS and Lustre 1.8.3.

So far I've not been able to pinpoint where I should begin to look. I have been trawling through log files that, quite frankly, don't make much sense to me.

Here is the messages output from the OSS

##############################

Jul  6 14:57:11 helium kernel: Lustre: AC3-OST0005: haven't heard from client ce1a3eb7-8514-d16e-4050-0507e82f1116 (at 172.16.16.125@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Jul  6 15:08:26 helium kernel: Lustre: 6539:0:(ldlm_lib.c:540:target_handle_reconnect()) AC3-OST0006: 593eb160-edce-8ceb-3f93-6f743cacd1a1 reconnecting
Jul  6 15:08:26 helium kernel: Lustre: 6539:0:(ldlm_lib.c:837:target_handle_connect()) AC3-OST0006: refuse reconnection from 593eb160-edce-8ceb...@10.0.0.54@tcp to 0xffff81026241b800; still busy with 5 active RPCs
Jul  6 15:08:26 helium kernel: LustreError: 6539:0:(ldlm_lib.c:1848:target_send_reply_msg()) @@@ processing error (-16)  req@ffff810282fe1000 x1340041377953748/t0 o8->593eb160-edce-8ceb-3f93-6f743cacd1a1@NET_0x200000a000036_UUID:0/0 lens 368/264 e 0 to 0 dl 1278393006 ref 1 fl Interpret:/0/0 rc -16/0
Jul  6 15:08:26 helium kernel: LustreError: 6660:0:(ost_handler.c:1061:ost_brw_write()) @@@ Reconnect on bulk GET  req@ffff810357883c00 x1340041377934618/t0 o4->593eb160-edce-8ceb-3f93-6f743cacd1a1@NET_0x200000a000036_UUID:0/0 lens 448/416 e 0 to 0 dl 1278392947 ref 1 fl Interpret:/0/0 rc 0/0
Jul  6 15:08:26 helium kernel: LustreError: 6660:0:(ost_handler.c:1061:ost_brw_write()) Skipped 1 previous similar message
Jul  6 15:08:26 helium kernel: LustreError: 6704:0:(ost_handler.c:1061:ost_brw_write()) @@@ Reconnect on bulk GET  req@ffff81082af44050 x1340041377934964/t0 o4->593eb160-edce-8ceb-3f93-6f743cacd1a1@NET_0x200000a000036_UUID:0/0 lens 448/416 e 0 to 0 dl 1278392947 ref 1 fl Interpret:/0/0 rc 0/0
Jul  6 15:08:27 helium kernel: Lustre: 7062:0:(ldlm_lib.c:540:target_handle_reconnect()) AC3-OST0008: 593eb160-edce-8ceb-3f93-6f743cacd1a1 reconnecting
Jul  6 15:08:27 helium kernel: Lustre: 7062:0:(ldlm_lib.c:837:target_handle_connect()) AC3-OST0008: refuse reconnection from 593eb160-edce-8ceb...@10.0.0.54@tcp to 0xffff81025a535e00; still busy with 3 active RPCs
Jul  6 15:08:27 helium kernel: LustreError: 7062:0:(ldlm_lib.c:1848:target_send_reply_msg()) @@@ processing error (-16)  req@ffff810698233850 x1340041377955630/t0 o8->593eb160-edce-8ceb-3f93-6f743cacd1a1@NET_0x200000a000036_UUID:0/0 lens 368/264 e 0 to 0 dl 1278393007 ref 1 fl Interpret:/0/0 rc -16/0
Jul  6 15:08:27 helium kernel: Lustre: 6692:0:(ost_handler.c:1219:ost_brw_write()) AC3-OST0006: ignoring bulk IO comm error with 593eb160-edce-8ceb-3f93-6f743cacd1a1@NET_0x200000a000036_UUID id 12345-10.0.0.54@tcp - client will retry
Jul  6 15:08:27 helium kernel: Lustre: 6692:0:(ost_handler.c:1219:ost_brw_write()) Skipped 6 previous similar messages
Jul  6 15:08:27 helium kernel: LustreError: 6720:0:(ost_handler.c:829:ost_brw_read()) @@@ Reconnect on bulk PUT  req@ffff8103a72aec00 x1340041377933615/t0 o3->593eb160-edce-8ceb-3f93-6f743cacd1a1@NET_0x200000a000036_UUID:0/0 lens 448/400 e 0 to 0 dl 1278392946 ref 1 fl Interpret:/0/0 rc 0/0
Jul  6 15:08:27 helium kernel: LustreError: 6720:0:(ost_handler.c:829:ost_brw_read()) Skipped 1 previous similar message
Jul  6 15:08:29 helium kernel: Lustre: 6720:0:(ost_handler.c:886:ost_brw_read()) AC3-OST0008: ignoring bulk IO comm error with 593eb160-edce-8ceb-3f93-6f743cacd1a1@NET_0x200000a000036_UUID id 12345-10.0.0.54@tcp - client will retry
Jul  6 15:08:29 helium kernel: Lustre: 6720:0:(ost_handler.c:886:ost_brw_read()) Skipped 1 previous similar message
Jul  6 15:08:37 helium kernel: Lustre: 7058:0:(ldlm_lib.c:540:target_handle_reconnect()) AC3-OST0006: 593eb160-edce-8ceb-3f93-6f743cacd1a1 reconnecting
Jul  6 15:08:39 helium kernel: Lustre: 6522:0:(ldlm_lib.c:540:target_handle_reconnect()) AC3-OST0008: 593eb160-edce-8ceb-3f93-6f743cacd1a1 reconnecting
Jul  6 15:10:09 helium kernel: Lustre: 7064:0:(ldlm_lib.c:540:target_handle_reconnect()) AC3-OST0008: e6ce7565-0b86-5925-82fc-015770c5143c reconnecting
Jul  6 15:10:09 helium kernel: Lustre: 7064:0:(ldlm_lib.c:837:target_handle_connect()) AC3-OST0008: refuse reconnection from e6ce7565-0b86-5925...@10.0.0.53@tcp to 0xffff810267bbc400; still busy with 1 active RPCs
Jul  6 15:10:09 helium kernel: LustreError: 7064:0:(ldlm_lib.c:1848:target_send_reply_msg()) @@@ processing error (-16)  req@ffff81069588b800 x1340036504045855/t0 o8->e6ce7565-0b86-5925-82fc-015770c5143c@NET_0x200000a000035_UUID:0/0 lens 368/264 e 0 to 0 dl 1278393109 ref 1 fl Interpret:/0/0 rc -16/0
Jul  6 15:10:09 helium kernel: LustreError: 6649:0:(ost_handler.c:829:ost_brw_read()) @@@ Reconnect on bulk PUT  req@ffff81053a96f400 x1340036504013451/t0 o3->e6ce7565-0b86-5925-82fc-015770c5143c@NET_0x200000a000035_UUID:0/0 lens 448/400 e 0 to 0 dl 1278393053 ref 1 fl Interpret:/0/0 rc 0/0
Jul  6 15:10:09 helium kernel: LustreError: 6649:0:(ost_handler.c:829:ost_brw_read()) Skipped 2 previous similar messages
Jul  6 15:10:13 helium kernel: Lustre: 6649:0:(ost_handler.c:886:ost_brw_read()) AC3-OST0008: ignoring bulk IO comm error with e6ce7565-0b86-5925-82fc-015770c5143c@NET_0x200000a000035_UUID id 12345-10.0.0.53@tcp - client will retry
Jul  6 15:10:13 helium kernel: Lustre: 6649:0:(ost_handler.c:886:ost_brw_read()) Skipped 2 previous similar messages
Jul  6 15:10:17 helium kernel: Lustre: 7016:0:(ldlm_lib.c:540:target_handle_reconnect()) AC3-OST0008: e6ce7565-0b86-5925-82fc-015770c5143c reconnecting
Jul  6 15:10:17 helium kernel: LustreError: 6708:0:(ldlm_lockd.c:305:waiting_locks_callback()) ### lock callback timer expired after 100s: evicting client at 10.0.0.54@tcp  ns: filter-AC3-OST0006_UUID lock: ffff8103a644e200/0xded0540147d4c8c7 lrc: 3/0,0 mode: PR/PR res: 8432287/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x20 remote: 0x6c318a80ee850f9f expref: 878 pid: 7043 timeout 4896983243
Jul  6 15:10:17 helium kernel: LustreError: 6708:0:(ldlm_lockd.c:305:waiting_locks_callback()) Skipped 1 previous similar message
Jul  6 15:10:17 helium kernel: LustreError: 6715:0:(ldlm_lib.c:1848:target_send_reply_msg()) @@@ processing error (-107)  req@ffff81027fcc5400 x1340041378188643/t0 o4-><?>@<?>:0/0 lens 448/0 e 0 to 0 dl 1278393069 ref 1 fl Interpret:/0/0 rc -107/0
Jul  6 15:10:19 helium kernel: LustreError: 0:0:(ldlm_lockd.c:305:waiting_locks_callback()) ### lock callback timer expired after 100s: evicting client at 10.0.0.54@tcp  ns: filter-AC3-OST0008_UUID lock: ffff810489298400/0xded0540147d4c8b9 lrc: 3/0,0 mode: PR/PR res: 8127131/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x20 remote: 0x6c318a80ee850fbb expref: 34 pid: 6592 timeout 4896985868
Jul  6 15:10:19 helium kernel: LustreError: 6730:0:(ost_handler.c:825:ost_brw_read()) @@@ Eviction on bulk PUT  req@ffff810355aeec00 x1340041378188706/t0 o3->593eb160-edce-8ceb-3f93-6f743cacd1a1@NET_0x200000a000036_UUID:0/0 lens 448/400 e 0 to 0 dl 1278393069 ref 1 fl Interpret:/0/0 rc 0/0
Jul  6 15:10:19 helium kernel: LustreError: 6665:0:(ost_handler.c:825:ost_brw_read()) @@@ Eviction on bulk PUT  req@ffff81069e442400 x1340041378188736/t0 o3->593eb160-edce-8ceb-3f93-6f743cacd1a1@NET_0x200000a000036_UUID:0/0 lens 448/400 e 0 to 0 dl 1278393069 ref 1 fl Interpret:/0/0 rc 0/0
Jul  6 15:10:19 helium kernel: LustreError: 6665:0:(ost_handler.c:825:ost_brw_read()) Skipped 5 previous similar messages
Jul  6 15:10:20 helium kernel: Lustre: 6730:0:(ost_handler.c:886:ost_brw_read()) AC3-OST0008: ignoring bulk IO comm error with 593eb160-edce-8ceb-3f93-6f743cacd1a1@NET_0x200000a000036_UUID id 12345-10.0.0.54@tcp - client will retry
Jul  6 15:10:20 helium kernel: LustreError: 6714:0:(ldlm_lib.c:1848:target_send_reply_msg()) @@@ processing error (-107)  req@ffff8106965e0800 x1340041378196268/t0 o3-><?>@<?>:0/0 lens 448/0 e 0 to 0 dl 1278393072 ref 1 fl Interpret:/0/0 rc -107/0
Jul  6 15:11:10 helium kernel: Lustre: 7119:0:(ldlm_lib.c:540:target_handle_reconnect()) AC3-OST0009: d9a4606b-2d46-7e5b-c67d-c05610d8af95 reconnecting
Jul  6 15:18:00 helium kernel: Lustre: 6979:0:(ldlm_lib.c:540:target_handle_reconnect()) AC3-OST0009: e6ce7565-0b86-5925-82fc-015770c5143c reconnecting
Jul  6 15:30:40 helium kernel: Lustre: 7119:0:(client.c:1434:ptlrpc_expire_one_request()) @@@ Request x1339860850745218 sent from AC3-OST0006 to NID 10.0.0.54@tcp 7s ago has timed out (7s prior to deadline).
Jul  6 15:30:40 helium kernel:   req@ffff8104fd018000 x1339860850745218/t0 o104->@NET_0x200000a000036_UUID:15/16 lens 296/384 e 0 to 1 dl 1278394240 ref 2 fl Rpc:N/0/0 rc 0/0
Jul  6 15:30:40 helium kernel: LustreError: 138-a: AC3-OST0006: A client on nid 10.0.0.54@tcp was evicted due to a lock blocking callback to 10.0.0.54@tcp timed out: rc -107
Jul  6 15:30:40 helium kernel: LustreError: 7119:0:(ldlm_lockd.c:1167:ldlm_handle_enqueue()) ### lock on destroyed export ffff8102688f7400 ns: filter-AC3-OST0006_UUID lock: ffff81012bdcb200/0xded0540147d6bb48 lrc: 3/0,0 mode: --/PW res: 8432457/0 rrc: 2 type: EXT [0->376831] (req 0->376831) flags: 0x0 remote: 0x6c318a80f18346f1 expref: 357 pid: 7119 timeout 0
Jul  6 15:30:40 helium kernel: LustreError: 7119:0:(ldlm_lib.c:1848:target_send_reply_msg()) @@@ processing error (-107)  req@ffff8103870e8400 x1340041383514587/t0 o101->593eb160-edce-8ceb-3f93-6f743cacd1a1@NET_0x200000a000036_UUID:0/0 lens 296/352 e 0 to 0 dl 1278394284 ref 1 fl Interpret:/0/0 rc -107/0
Jul  6 15:30:41 helium kernel: LustreError: 6679:0:(ost_handler.c:825:ost_brw_read()) @@@ Eviction on bulk PUT  req@ffff81082a335c50 x1340041383502349/t0 o3->593eb160-edce-8ceb-3f93-6f743cacd1a1@NET_0x200000a000036_UUID:0/0 lens 448/400 e 0 to 0 dl 1278394307 ref 1 fl Interpret:/0/0 rc 0/0
Jul  6 15:30:41 helium kernel: Lustre: 6679:0:(ost_handler.c:886:ost_brw_read()) AC3-OST0006: ignoring bulk IO comm error with 593eb160-edce-8ceb-3f93-6f743cacd1a1@NET_0x200000a000036_UUID id 12345-10.0.0.54@tcp - client will retry
Jul  6 15:30:41 helium kernel: Lustre: 6679:0:(ost_handler.c:886:ost_brw_read()) Skipped 6 previous similar messages
Jul  6 15:30:41 helium kernel: LustreError: 8166:0:(ldlm_lockd.c:1821:ldlm_cancel_handler()) operation 103 from 12345-10.0.0.54@tcp with bad export cookie 16055425036052657553
Jul  6 15:32:08 helium kernel: Lustre: 7134:0:(client.c:1434:ptlrpc_expire_one_request()) @@@ Request x1339860850745435 sent from AC3-OST0005 to NID 10.0.0.56@tcp 7s ago has timed out (7s prior to deadline).
Jul  6 15:32:08 helium kernel:   req@ffff810695874000 x1339860850745435/t0 o104->@:15/16 lens 296/384 e 0 to 1 dl 1278394328 ref 2 fl Rpc:N/0/0 rc 0/0
Jul  6 15:32:08 helium kernel: LustreError: 138-a: AC3-OST0005: A client on nid 10.0.0.56@tcp was evicted due to a lock blocking callback to 10.0.0.56@tcp timed out: rc -107
Jul  6 15:32:08 helium kernel: LustreError: 7134:0:(ldlm_lockd.c:1167:ldlm_handle_enqueue()) ### lock on destroyed export ffff8102552f5400 ns: filter-AC3-OST0005_UUID lock: ffff810485fade00/0xded0540147d6ead5 lrc: 3/0,0 mode: --/PW res: 8108669/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x0 remote: 0xd1f8ae5995b5c83b expref: 620 pid: 7134 timeout 0
Jul  6 15:32:08 helium kernel: LustreError: 7134:0:(ldlm_lib.c:1848:target_send_reply_msg()) @@@ processing error (-107)  req@ffff81009ddd6800 x1340491441560517/t0 o101->8a1862dd-8de6-9414-f71c-0c85925e1e20@:0/0 lens 296/352 e 0 to 0 dl 1278394372 ref 1 fl Interpret:/0/0 rc -107/0
Jul  6 15:32:08 helium kernel: LustreError: 6610:0:(ost_handler.c:825:ost_brw_read()) @@@ Eviction on bulk PUT  req@ffff810357887c00 x1340491441559786/t0 o3->8a1862dd-8de6-9414-f71c-0c85925e1e20@:0/0 lens 448/400 e 0 to 0 dl 1278394396 ref 1 fl Interpret:/0/0 rc 0/0
Jul  6 15:32:08 helium kernel: LustreError: 6610:0:(ost_handler.c:825:ost_brw_read()) Skipped 3 previous similar messages
Jul  6 15:32:08 helium kernel: Lustre: 6610:0:(ost_handler.c:886:ost_brw_read()) AC3-OST0005: ignoring bulk IO comm error with 8a1862dd-8de6-9414-f71c-0c85925e1e20@ id 12345-10.0.0.56@tcp - client will retry
Jul  6 15:32:08 helium kernel: Lustre: 6610:0:(ost_handler.c:886:ost_brw_read()) Skipped 3 previous similar messages
Jul  6 15:32:08 helium kernel: LustreError: 6720:0:(ost_handler.c:825:ost_brw_read()) @@@ Eviction on bulk PUT  req@ffff81040de33000 x1340491441560035/t0 o3->8a1862dd-8de6-9414-f71c-0c85925e1e20@:0/0 lens 448/400 e 0 to 0 dl 1278394396 ref 1 fl Interpret:/0/0 rc 0/0
Jul  6 15:32:16 helium kernel: LustreError: 12001:0:(ldlm_lockd.c:1821:ldlm_cancel_handler()) operation 103 from 12345-10.0.0.56@tcp with bad export cookie 16055425036052176730
Jul  6 15:32:19 helium kernel: Lustre: 6720:0:(ost_handler.c:886:ost_brw_read()) AC3-OST0005: ignoring bulk IO comm error with 8a1862dd-8de6-9414-f71c-0c85925e1e20@ id 12345-10.0.0.56@tcp - client will retry
Jul  6 15:32:21 helium kernel: Lustre: 6712:0:(ost_handler.c:886:ost_brw_read()) AC3-OST0005: ignoring bulk IO comm error with 8a1862dd-8de6-9414-f71c-0c85925e1e20@ id 12345-10.0.0.56@tcp - client will retry
Jul  6 15:32:21 helium kernel: Lustre: 6712:0:(ost_handler.c:886:ost_brw_read()) Skipped 2 previous similar messages
Jul  6 15:36:56 helium kernel: Lustre: 6970:0:(client.c:1434:ptlrpc_expire_one_request()) @@@ Request x1339860850746067 sent from AC3-OST0006 to NID 10.0.0.57@tcp 11s ago has timed out (11s prior to deadline).
Jul  6 15:36:56 helium kernel:   req@ffff810086c83000 x1339860850746067/t0 o106->@:15/16 lens 296/424 e 0 to 1 dl 1278394616 ref 2 fl Rpc:/0/0 rc 0/0
Jul  6 15:37:01 helium diskmond: 168:Polling all 48 slots for drive fault 
Jul  6 15:37:10 helium diskmond: sata4/5 device(/dev/sdal) is running bad 
Jul  6 15:37:10 helium diskmond: please back up and replace the disk soon. 
Jul  6 15:38:05 helium kernel: Lustre: 6537:0:(client.c:1434:ptlrpc_expire_one_request()) @@@ Request x1339860850746176 sent from AC3-OST0005 to NID 10.0.0.57@tcp 8s ago has timed out (8s prior to deadline).
Jul  6 15:38:05 helium kernel:   req@ffff810341fdc800 x1339860850746176/t0 o106->@:15/16 lens 296/424 e 0 to 1 dl 1278394685 ref 2 fl Rpc:/0/0 rc 0/0
Jul  6 15:38:18 helium kernel: Lustre: 6524:0:(client.c:1434:ptlrpc_expire_one_request()) @@@ Request x1339860850746195 sent from AC3-OST0008 to NID 10.0.0.57@tcp 7s ago has timed out (7s prior to deadline).
Jul  6 15:38:18 helium kernel:   req@ffff810282961400 x1339860850746195/t0 o106->@:15/16 lens 296/424 e 0 to 1 dl 1278394698 ref 2 fl Rpc:/0/0 rc 0/0
Jul  6 15:39:07 helium kernel: Lustre: 6793:0:(client.c:1434:ptlrpc_expire_one_request()) @@@ Request x1339860850746269 sent from AC3-OST0006 to NID 10.0.0.57@tcp 11s ago has timed out (11s prior to deadline).
Jul  6 15:39:07 helium kernel:   req@ffff81027ce5b400 x1339860850746269/t0 o106->@:15/16 lens 296/424 e 0 to 1 dl 1278394747 ref 2 fl Rpc:/0/0 rc 0/0
Jul  6 15:40:19 helium kernel: Lustre: 6851:0:(client.c:1434:ptlrpc_expire_one_request()) @@@ Request x1339860850746437 sent from AC3-OST0005 to NID 10.0.0.57@tcp 8s ago has timed out (8s prior to deadline).
Jul  6 15:40:19 helium kernel:   req@ffff8104135bb800 x1339860850746437/t0 o106->@:15/16 lens 296/424 e 0 to 1 dl 1278394819 ref 2 fl Rpc:/0/0 rc 0/0
Jul  6 15:42:10 helium kernel: Lustre: 6935:0:(client.c:1434:ptlrpc_expire_one_request()) @@@ Request x1339860850746615 sent from AC3-OST0005 to NID 10.0.0.53@tcp 7s ago has timed out (7s prior to deadline).
Jul  6 15:42:10 helium kernel:   req@ffff81025b227c00 x1339860850746615/t0 o106->@NET_0x200000a000035_UUID:15/16 lens 296/424 e 0 to 1 dl 1278394930 ref 2 fl Rpc:/0/0 rc 0/0
Jul  6 15:46:01 helium kernel: Lustre: 6957:0:(client.c:1434:ptlrpc_expire_one_request()) @@@ Request x1339860850746990 sent from AC3-OST0007 to NID 172.16.16.124@tcp 7s ago has timed out (7s prior to deadline).
Jul  6 15:46:01 helium kernel:   req@ffff8103fabd9800 x1339860850746990/t0 o106->@NET_0x20000ac10107c_UUID:15/16 lens 296/424 e 0 to 1 dl 1278395161 ref 2 fl Rpc:/0/0 rc 0/0


########################

Here is the output from the client at the same time

Jul  6 15:10:17 compute-0-3 kernel: LustreError: 11-0: an error occurred while communicating with 172.16.16.2@tcp. The ost_write operation failed with -107
Jul  6 15:10:17 compute-0-3 kernel: LustreError: Skipped 1 previous similar message
Jul  6 15:10:17 compute-0-3 kernel: LustreError: 167-0: This client was evicted by AC3-OST0006; in progress operations using this service will fail.
Jul  6 15:10:17 compute-0-3 kernel: LustreError: Skipped 4 previous similar messages
Jul  6 15:10:17 compute-0-3 kernel: LustreError: 3095:0:(namei.c:1176:ll_objects_destroy()) obd destroy objid 0x18542c4@0x0 error -5
Jul  6 15:10:17 compute-0-3 kernel: LustreError: 6419:0:(file.c:995:ll_glimpse_size()) obd_enqueue returned rc -5, returning -EIO
Jul  6 15:10:18 compute-0-3 kernel: LustreError: 6779:0:(ldlm_resource.c:518:ldlm_namespace_cleanup()) Namespace AC3-OST0006-osc-ffff81043e754c00 resource refcount nonzero (1) after lock cleanup; forcing cleanup.
Jul  6 15:10:18 compute-0-3 kernel: LustreError: 6779:0:(ldlm_resource.c:523:ldlm_namespace_cleanup()) Resource: ffff810117997500 (8432287/0/0/0) (rc: 1)
Jul  6 15:10:18 compute-0-3 kernel: LustreError: 6687:0:(llite_mmap.c:210:ll_tree_unlock()) couldn't unlock -5
Jul  6 15:10:18 compute-0-3 kernel: LustreError: 6782:0:(file.c:995:ll_glimpse_size()) obd_enqueue returned rc -5, returning -EIO
Jul  6 15:10:20 compute-0-3 kernel: LustreError: 11-0: an error occurred while communicating with 172.16.16.2@tcp. The ost_read operation failed with -107
Jul  6 15:10:20 compute-0-3 kernel: LustreError: 167-0: This client was evicted by AC3-OST0008; in progress operations using this service will fail.
Jul  6 15:10:20 compute-0-3 kernel: LustreError: 20660:0:(rw.c:122:ll_brw()) error from obd_brw: rc = -4
Jul  6 15:10:20 compute-0-3 kernel: LustreError: 6784:0:(ldlm_resource.c:518:ldlm_namespace_cleanup()) Namespace AC3-OST0008-osc-ffff81043e754c00 resource refcount nonzero (1) after lock cleanup; forcing cleanup.
Jul  6 15:10:20 compute-0-3 kernel: LustreError: 6784:0:(ldlm_resource.c:523:ldlm_namespace_cleanup()) Resource: ffff81041604b9c0 (5521743/0/0/0) (rc: 1)
Jul  6 15:10:20 compute-0-3 kernel: LustreError: 6686:0:(llite_mmap.c:210:ll_tree_unlock()) couldn't unlock -5
Jul  6 15:10:20 compute-0-3 kernel: LustreError: 3571:0:(client.c:858:ptlrpc_import_delay_req()) @@@ IMP_INVALID  req@ffff81042061a800 x1340041378196315/t0 o4->AC3-OST0...@172.16.16.2@tcp:6/4 lens 512/624 e 0 to 1 dl 0 ref 2 fl Rpc:/0/0 rc 0/0
Jul  6 15:10:20 compute-0-3 kernel: LustreError: 3571:0:(client.c:858:ptlrpc_import_delay_req()) Skipped 78 previous similar messages
Jul  6 15:10:20 compute-0-3 kernel: LustreError: 6785:0:(file.c:995:ll_glimpse_size()) obd_enqueue returned rc -5, returning -EIO

Kind Regards,

Peter Kitchener
Systems Administrator
Capital Markets CRC Limited (CMCRC)
Telephone: +61 2 8088 4223
Fax: +61 2 8088 4201
www.cmcrc.com 

 



Capital Markets CRC Ltd - Confidential Communication
The information contained in this e-mail is confidential.  It is intended for the addressee only.  If you receive this e-mail by mistake please promptly inform us by reply e-mail and then delete the e-mail and destroy any printed copy. You must not disclose or use in any way the information in the e-mail. There is no warranty that this e-mail is error or virus free. It may be a private communication, and if so, does not represent the views of the CMCRC and its associates.  If it is a private communication, care should be taken in opening it to ensure that undue offence is not given.



Gabriele Paciucci

Jul 6, 2010, 5:23:30 AM
to lustre-...@lists.lustre.org
Hi Peter,
which 10GbE card do you have? I solved a similar problem with a NetXen
card (an HP blade mezzanine card) by using the proprietary nx_nic driver
instead of the "open source" driver. In any case, the underlying problem is
that your users are saturating the network between the clients and the OSTs!



--
_Gabriele Paciucci_ http://www.linkedin.com/in/paciucci

Pursuant to legislative Decree n. 196/03 you are hereby informed that this email contains confidential information intended only for use of addressee. If you are not the addressee and have received this email by mistake, please send this email to the sender. You may not copy or disseminate this message to anyone. Thank You.

_______________________________________________
Lustre-discuss mailing list
Lustre-...@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss

Wojciech Turek

Jul 6, 2010, 6:11:45 AM
to paci...@gmail.com, lustre-...@lists.lustre.org
The source of the I/O error is the eviction of the client by the OSS server, which was not able to reclaim a lock from that client within the specified timeout window (100s).

OSS:
======

Jul  6 15:10:17 helium kernel: LustreError: 6708:0:(ldlm_lockd.c:305:waiting_locks_callback()) ### lock callback timer expired after 100s: evicting client at 10.0.0.54@tcp  ns: filter-AC3-OST0006_UUID lock: ffff8103a644e200/0xded0540147d4c8c7 lrc: 3/0,0 mode: PR/PR res: 8432287/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x20 remote: 0x6c318a80ee850f9f expref: 878 pid: 7043 timeout 4896983243

Client:
=====

Jul  6 15:10:17 compute-0-3 kernel: LustreError: 11-0: an error occurred while communicating with 172.16.16.2@tcp. The ost_write operation failed with -107
Jul  6 15:10:17 compute-0-3 kernel: LustreError: Skipped 1 previous similar message
Jul  6 15:10:17 compute-0-3 kernel: LustreError: 167-0: This client was evicted by AC3-OST0006; in progress operations using this service will fail.

The question is why the lock callback expired. It is possible that your network was unhealthy, which caused the eviction. There is another possibility: I have been seeing similar evictions on our large cluster (~600 nodes) running RHEL 5.4 and Lustre 1.8.3, where I know the network is healthy. I have searched the Lustre bugzilla for similar problems and found several similar cases reported, for example bugs 22964, 23190, 22175 and others.
It looks like there might be a Lustre bug which causes a lock deadlock, which then causes the OSS to evict the client.
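
If you want to rule out simple timeout tuning before chasing a bug, a rough sketch of what I would look at on the OSS, using the usual Lustre 1.8 /proc locations (adjust the paths to your install):

    # Current timeout settings on the OSS
    cat /proc/sys/lustre/timeout        # obd_timeout (100s by default)
    cat /proc/sys/lustre/ldlm_timeout   # lock callback (ldlm) timeout

    # Count the evictions seen so far and watch for new ones as they happen
    grep -c 'evicting client' /var/log/messages
    tail -f /var/log/messages | grep --line-buffered -E 'evict|callback timer expired'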

Best regards,

Wojciech
--
Wojciech Turek

Assistant System Manager

High Performance Computing Service
University of Cambridge
Email: wj...@cam.ac.uk
Tel: (+)44 1223 763517

Peter Kitchener

Jul 6, 2010, 7:20:01 PM
to lustre-...@lists.lustre.org
Hi All, 

The NICs we're using are Broadcom NetXtreme II BCM57711 10-Gigabit PCIe dual-port SFP+ cards, connected to a Dell PowerConnect 8024F switch.

Currently we're using the open-source bnx2x driver as provided in the kernel; there don't appear to be any dropped packets on the clients or servers.

Could it be that the lock callback timer threshold is too low? I'm uncertain whether there is a way of altering this threshold.

Could there be some sort of kernel buffer or limit set too low on the clients/servers?
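
For anyone wanting to double-check the same things, something like the following covers the NIC counters and the socket buffer limits ("eth2" is just a placeholder for the 10GbE interface):

    # Per-NIC drop/error counters (bnx2x exposes detailed stats via ethtool)
    ethtool -S eth2 | grep -iE 'drop|discard|error|pause'

    # Kernel socket buffer limits that the LNET TCP connections run under
    sysctl net.core.rmem_max net.core.wmem_max
    sysctl net.ipv4.tcp_rmem net.ipv4.tcp_wmem

    # TCP-level symptoms of an overloaded path
    netstat -s | grep -iE 'retrans|overflow|prune'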

Regards,

Peter Kitchener
Systems Administrator
Capital Markets CRC Limited (CMCRC)
Telephone: +61 2 8088 4223
Fax: +61 2 8088 4201
www.cmcrc.com 

 

Christopher J. Morrone

Jul 6, 2010, 8:12:44 PM
to lustre-...@lists.lustre.org
On 07/05/2010 11:19 PM, Peter Kitchener wrote:
> Hi all,
>
> I have been troubleshooting a strange problem that is occurring with our Lustre setup. Under high loads our developers are complaining that various processes they run will error out with I/O error.
>
> Our setup is small 1 MDS and 2 OSS(10OSTs 5/OSS), and 13 Clients (152 Cores) the storage is all local 60TB (30TB/OSS) usable in a RAID6 Software raid setup. All of the machines are connected via 10Gig Ethernet. The clients run Rocks 5.3 (CentOS 5.4) and the Servers run CentOS 5.4 with kernel 2.6.18-164.11.1.el5_lustre.1.8.2. The Clients run an un-patched vanilla kernel from CentOS and Lustre 1.8.3
>
> So far I've not been able to pin point where i should begin to look. I have been trawling through log files that quite frankly don't make much sense to me.
>
> Here is the messages output from the OSS
>
> ##############################
>
> Jul 6 14:57:11 helium kernel: Lustre: AC3-OST0005: haven't heard from client ce1a3eb7-8514-d16e-4050-0507e82f1116 (at 172.16.16.125@tcp) in 227 seconds. I think it's dead, and I am evicting it.

There is a bug in Lustre 1.8.2 and 1.8.3 that makes ptlrpcd get
stuck for long periods of time (around 10 minutes was the longest that I
saw) on Lustre clients under certain workloads. If the ptlrpcd is
dead, the client may stop sending all RPCs to the servers, and the
servers evict the client because they haven't heard from it in a while.

See bug 22897 for a description of the bug. But the fix is a simple
one-liner in bug 22786, attachment 29866. The fix will first appear in
lustre 1.8.4. I would highly recommend to anyone using 1.8.2 or 1.8.3
that they add that patch.

I don't know if that is the cause of your particular evictions, because
there can be many causes of evictions. But the "haven't heard from
client ... in 227 seconds" message was one of the symptoms, and the client
failing with -107 (ENOTCONN) against multiple OSTs (and/or MDS, MGS...) at
the same time was another symptom.
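
If you want to check whether a client is hitting this, one rough way (generic debugging, not something from the bug report) is to look at the ptlrpcd thread while the client is wedged:

    # Is ptlrpcd sitting in uninterruptible sleep, and where?
    ps -eo pid,stat,wchan:30,comm | grep ptlrpcd

    # Dump all kernel stacks to the log for a closer look
    # (assumes kernel.sysrq is enabled)
    echo t > /proc/sysrq-trigger
    dmesg | grep -A 20 ptlrpcd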

Chris

Peter Kitchener

Jul 6, 2010, 10:02:11 PM
to lustre-...@lists.lustre.org
Hi Chris, 

> See bug 22897 for a description of the bug.  But the fix is a simple
> one-liner in bug 22786, attachment 29866.  The fix will first appear in
> lustre 1.8.4.  I would highly recommend to anyone using 1.8.2 or 1.8.3
> that they add that patch.

How would I safely do that when I've installed Lustre from the RPMs?

Do I obtain the source tarball or the patched kernel source RPM and then apply the patch?

Not sure.

I'm being pressured by my dev team to fix this ASAP.
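
(For anyone in the same boat, a rough sketch of the rebuild-from-source route for the client side; the tarball and patch file names below are placeholders, and it assumes the matching kernel-devel package is installed:)

    # Unpack the 1.8.3 source, apply the one-liner, rebuild the RPMs
    tar xzf lustre-1.8.3.tar.gz && cd lustre-1.8.3
    patch -p1 < ../bz22786-attachment-29866.patch   # strip level depends on the patch

    # Client-only build against the running (unpatched) CentOS kernel headers
    ./configure --disable-server \
        --with-linux=/usr/src/kernels/$(uname -r)-$(uname -m)
    make rpms

    # Install the freshly built client packages (output directory and
    # package names vary; /usr/src/redhat/RPMS/x86_64 is typical on CentOS 5)
    rpm -Uvh /usr/src/redhat/RPMS/x86_64/lustre-*.rpm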

Gabriele Paciucci

Jul 7, 2010, 4:04:04 AM
to lustre-...@lists.lustre.org
Hi,
the ptlrpcd bug is a problem, but I don't find anything in Peter's logs that
points to an eviction caused by ptlrpcd; rather, the evictions look caused by a
timeout during the communication between an OST and the client. Alternatively,
Peter could downgrade to 1.8.1.1, which does not suffer from that problem.


My action plan could be:

1. First of all, run the same Lustre version (1.8.3) on both the clients and the servers.

2. Check /proc/sys/lustre/ldlm_timeout: 6 seconds for the MDS, 20 seconds for an OSS!

3. Do you have enough memory on the servers for all of the clients' locks? Please refer to:
http://wiki.lustre.org/manual/LustreManual18_HTML/LustreProc.html#50417791_pgfId-1290875

Normally a server starts to suffer above roughly 500k locks; a quick way to check is sketched below.
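
A rough way to check, using the standard 1.8 /proc entries (exact names can vary a little between versions):

    # Current timeouts on the server
    cat /proc/sys/lustre/timeout /proc/sys/lustre/ldlm_timeout

    # Granted locks per namespace, plus a grand total across all OSTs
    cat /proc/fs/lustre/ldlm/namespaces/*/lock_count
    cat /proc/fs/lustre/ldlm/namespaces/*/lock_count | awk '{s+=$1} END {print "total:", s}'

    # Per-namespace LRU size (0 usually means it is managed dynamically)
    cat /proc/fs/lustre/ldlm/namespaces/*/lru_size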

bye


Peter Kitchener

Jul 7, 2010, 8:48:21 PM
to lustre-...@lists.lustre.org
Here is a dump of a log of some kind from the client at the time of the eviction.

Does this shed any further light on the problem?
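
(This looks like the Lustre kernel debug log; for anyone who wants to capture the same kind of trace, the usual 1.8 workflow is roughly the following; exact flag names are in the manual:)

    # Turn on RPC/lock/network tracing and enlarge the debug buffer
    echo +rpctrace +dlmtrace +net > /proc/sys/lnet/debug
    echo 256 > /proc/sys/lnet/debug_mb

    # ... reproduce the eviction, then dump the buffer to a file ...
    lctl dk /tmp/lustre-debug.$(hostname).log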

00000100:00000400:7:1278392906.021560:0:6677:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1340041377934503 sent from AC3-OST0006-osc-ffff81043e754c00 to NID 172.16.16.2@tcp 11s ago has timed o
ut (11s prior to deadline).
  req@ffff81043f31dc00 x1340041377934503/t0 o101->AC3-OST0...@172.16.16.2@tcp:28/4 lens 296/544 e 0 to 1 dl 1278392906 ref 1 fl Rpc:/0/0 rc 0/0
00000100:02000400:7:1278392906.021598:0:6677:0:(import.c:159:ptlrpc_set_import_discon()) AC3-OST0006-osc-ffff81043e754c00: Connection to service AC3-OST0006 via nid 172.16.16.2@tcp was lost; in progress operations using this service will wait for recovery to complete.
00000100:00080000:7:1278392906.021611:0:6677:0:(import.c:167:ptlrpc_set_import_discon()) ffff81042018c000 AC3-OST0006_UUID: changing import state from FULL to DISCONN
00000100:00080000:7:1278392906.021615:0:6677:0:(import.c:403:ptlrpc_fail_import()) AC3-OST0006_UUID: waking up pinger00000100:00080000:4:1278392906.021634:0:3573:0:(recover.c:67:ptlrpc_initiate_recovery()) AC3-OST0006_UUID: starting recovery
00000100:00080000:4:1278392906.021639:0:3573:0:(import.c:608:ptlrpc_connect_import()) ffff81042018c000 AC3-OST0006_UUID: changing import state from DISCONN to CONNECTING
00000100:00080000:4:1278392906.021645:0:3573:0:(import.c:470:import_select_connection()) AC3-OST0006-osc-ffff81043e754c00: connect to NID 172.16.16.2@tcp last attempt 4379111012
00000100:00080000:4:1278392906.021645:0:3573:0:(import.c:470:import_select_connection()) AC3-OST0006-osc-ffff81043e754c00: connect to NID 172.16.16.2@tcp last attempt 4379111012
00000100:00080000:4:1278392906.021651:0:3573:0:(import.c:544:import_select_connection()) AC3-OST0006-osc-ffff81043e754c00: import ffff81042018c000 using connection 172.16.16.2@tcp/172.16.16.2@tcp
00000100:00000400:7:1278392906.030554:0:6687:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1340041377934535 sent from AC3-OST0006-osc-ffff81043e754c00 to NID 172.16.16.2@tcp 11s ago has timed o
ut (11s prior to deadline).
  req@ffff81031686b000 x1340041377934535/t0 o101->AC3-OST0...@172.16.16.2@tcp:28/4 lens 296/544 e 0 to 1 dl 1278392906 ref 1 fl Rpc:/0/0 rc 0/0
00000100:00080000:7:1278392906.030567:0:6687:0:(import.c:182:ptlrpc_set_import_discon()) osc: import ffff81042018c000 already not connected (conn 177, was 176): CONNECTING
00000100:00000400:5:1278392907.030414:0:6686:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1340041377934536 sent from AC3-OST0008-osc-ffff81043e754c00 to NID 172.16.16.2@tcp 12s ago has timed o
ut (12s prior to deadline).
  req@ffff8102b439bc00 x1340041377934536/t0 o101->AC3-OST0...@172.16.16.2@tcp:28/4 lens 296/544 e 0 to 1 dl 1278392907 ref 1 fl Rpc:/0/0 rc 0/0
00000100:02000400:5:1278392907.030434:0:6686:0:(import.c:159:ptlrpc_set_import_discon()) AC3-OST0008-osc-ffff81043e754c00: Connection to service AC3-OST0008 via nid 172.16.16.2@tcp was lost; in progress ope
rations using this service will wait for recovery to complete.
00000100:00080000:5:1278392907.030449:0:6686:0:(import.c:167:ptlrpc_set_import_discon()) ffff810420d1a000 AC3-OST0008_UUID: changing import state from FULL to DISCONN
00000100:00080000:5:1278392907.030452:0:6686:0:(import.c:403:ptlrpc_fail_import()) AC3-OST0008_UUID: waking up pinger
00000100:00080000:4:1278392907.030465:0:3573:0:(recover.c:67:ptlrpc_initiate_recovery()) AC3-OST0008_UUID: starting recovery00000100:00080000:4:1278392907.030470:0:3573:0:(import.c:608:ptlrpc_connect_import()) ffff810420d1a000 AC3-OST0008_UUID: changing import state from DISCONN to CONNECTING
00000100:00080000:4:1278392907.030475:0:3573:0:(import.c:470:import_select_connection()) AC3-OST0008-osc-ffff81043e754c00: connect to NID 172.16.16.2@tcp last attempt 437911101200000100:00080000:4:1278392907.030480:0:3573:0:(import.c:544:import_select_connection()) AC3-OST0008-osc-ffff81043e754c00: import ffff810420d1a000 using connection 172.16.16.2@tcp/172.16.16.2@tcp
00000400:00000100:3:1278392907.213541:0:3560:0:(lib-move.c:1818:lnet_parse_put()) Dropping PUT from 12345-172.16.16.2@tcp portal 4 match 1340041377934536 offset 192 length 352: 200000400:00000100:3:1278392907.213550:0:3560:0:(lib-move.c:1818:lnet_parse_put()) Dropping PUT from 12345-172.16.16.2@tcp portal 4 match 1340041377934503 offset 192 length 352: 2
00000400:00000100:3:1278392907.213554:0:3560:0:(lib-move.c:1818:lnet_parse_put()) Dropping PUT from 12345-172.16.16.2@tcp portal 4 match 1340041377934535 offset 192 length 352: 200000100:02020000:0:1278392907.213815:0:3572:0:(client.c:975:ptlrpc_check_status()) 11-0: an error occurred while communicating with 172.16.16.2@tcp. The ost_connect operation failed with -16
00000100:00080000:0:1278392907.213828:0:3572:0:(import.c:1091:ptlrpc_connect_interpret()) ffff81042018c000 AC3-OST0006_UUID: changing import state from CONNECTING to DISCONN00000100:00080000:0:1278392907.213831:0:3572:0:(import.c:1137:ptlrpc_connect_interpret()) recovery of AC3-OST0006_UUID on 172.16.16.2@tcp failed (-16)
00000100:02020000:0:1278392907.213839:0:3572:0:(client.c:975:ptlrpc_check_status()) 11-0: an error occurred while communicating with 172.16.16.2@tcp. The ost_connect operation failed with -1600000100:00080000:0:1278392907.213841:0:3572:0:(import.c:1091:ptlrpc_connect_interpret()) ffff810420d1a000 AC3-OST0008_UUID: changing import state from CONNECTING to DISCONN
00000100:00080000:0:1278392907.213843:0:3572:0:(import.c:1137:ptlrpc_connect_interpret()) recovery of AC3-OST0008_UUID on 172.16.16.2@tcp failed (-16)
00000100:00080000:6:1278392907.226136:0:3571:0:(client.c:1145:ptlrpc_send_new_req()) @@@ req from PID 0 waiting for recovery: (FULL != DISCONN)  req@ffff8101b4588c00 x1340041377955935/t0 o3->AC3-OST0008_UUI
D...@172.16.16.2@tcp:6/4 lens 448/592 e 0 to 1 dl 0 ref 2 fl Rpc:W/0/0 rc 0/0
00000100:00080000:6:1278392907.226192:0:3571:0:(client.c:1145:ptlrpc_send_new_req()) @@@ req from PID 0 waiting for recovery: (FULL != DISCONN)  req@ffff81042061b800 x1340041377955949/t0 o4->AC3-OST0008_UUI
D...@172.16.16.2@tcp:6/4 lens 448/608 e 0 to 1 dl 0 ref 2 fl Rpc:W/0/0 rc 0/0
00000100:00080000:6:1278392907.226202:0:3571:0:(client.c:1145:ptlrpc_send_new_req()) @@@ req from PID 0 waiting for recovery: (FULL != DISCONN)  req@ffff81042061a400 x1340041377955951/t0 o4->AC3-OST0008_UUI
D...@172.16.16.2@tcp:6/4 lens 448/608 e 0 to 1 dl 0 ref 2 fl Rpc:W/0/0 rc 0/0
00000100:00080000:6:1278392907.226207:0:3571:0:(client.c:1145:ptlrpc_send_new_req()) @@@ req from PID 0 waiting for recovery: (FULL != DISCONN)  req@ffff8104201e5000 x1340041377955955/t0 o4->AC3-OST0...@172.16.16.2@tcp:6/4 lens 448/608 e 0 to 1 dl 0 ref 2 fl Rpc:W/0/0 rc 0/0
00000100:00080000:6:1278392907.226211:0:3571:0:(client.c:1145:ptlrpc_send_new_req()) @@@ req from PID 0 waiting for recovery: (FULL != DISCONN)  req@ffff81042030f800 x1340041377955956/t0 o3->AC3-OST0008_UUI
D...@172.16.16.2@tcp:6/4 lens 448/592 e 0 to 1 dl 0 ref 2 fl Rpc:W/0/0 rc 0/000000100:00080000:6:1278392907.226219:0:3571:0:(client.c:1145:ptlrpc_send_new_req()) @@@ req from PID 0 waiting for recovery: (FULL != DISCONN)  req@ffff81042061a000 x1340041377955959/t0 o4->AC3-OST0008_UUI
D...@172.16.16.2@tcp:6/4 lens 448/608 e 0 to 1 dl 0 ref 2 fl Rpc:W/0/0 rc 0/0
00000100:00080000:4:1278392917.022748:0:3573:0:(recover.c:67:ptlrpc_initiate_recovery()) AC3-OST0006_UUID: starting recovery00000100:00080000:4:1278392917.022753:0:3573:0:(import.c:608:ptlrpc_connect_import()) ffff81042018c000 AC3-OST0006_UUID: changing import state from DISCONN to CONNECTING
00000100:00080000:4:1278392917.022759:0:3573:0:(import.c:470:import_select_connection()) AC3-OST0006-osc-ffff81043e754c00: connect to NID 172.16.16.2@tcp last attempt 4724612378
00000100:00080000:4:1278392917.022766:0:3573:0:(import.c:544:import_select_connection()) AC3-OST0006-osc-ffff81043e754c00: import ffff81042018c000 using connection 172.16.16.2@tcp/172.16.16.2@tcp00000100:00080000:0:1278392917.023419:0:3572:0:(import.c:868:ptlrpc_connect_interpret()) reconnected to AC3-OST0...@172.16.16.2@tcp after partition
00000100:00080000:0:1278392917.023425:0:3572:0:(import.c:891:ptlrpc_connect_interpret()) ffff81042018c000 AC3-OST0006_UUID: changing import state from CONNECTING to RECOVER
1 dl 1278392951 ref 2 fl Rpc:/0/0 rc 0/0
00000100:00080000:0:1278392917.023440:0:3572:0:(client.c:1969:ptlrpc_resend_req()) resend bulk old x1340041377933846 new x1340041377973369
 1 dl 1278392951 ref 2 fl Rpc:/0/0 rc 0/0
00000100:00080000:0:1278392917.023452:0:3572:0:(client.c:1969:ptlrpc_resend_req()) resend bulk old x1340041377934227 new x1340041377973370
 1 dl 1278392952 ref 2 fl Rpc:/0/0 rc 0/0
00000100:00080000:0:1278392917.023459:0:3572:0:(client.c:1969:ptlrpc_resend_req()) resend bulk old x1340041377934618 new x1340041377973371
 1 dl 1278392952 ref 2 fl Rpc:/0/0 rc 0/0
00000100:00080000:0:1278392917.023466:0:3572:0:(client.c:1969:ptlrpc_resend_req()) resend bulk old x1340041377934964 new x1340041377973372
 1 dl 1278392955 ref 2 fl Rpc:/0/0 rc 0/0
00000100:00080000:0:1278392917.023472:0:3572:0:(client.c:1969:ptlrpc_resend_req()) resend bulk old x1340041377940508 new x1340041377973373
00000100:00080000:0:1278392917.023475:0:3572:0:(import.c:1319:ptlrpc_import_recovery_state_machine()) ffff81042018c000 AC3-OST0006_UUID: changing import state from RECOVER to FULL
00000100:00080000:6:1278392917.023480:0:3571:0:(client.c:1332:ptlrpc_check_set()) resend bulk old x1340041377973369 new x1340041377973374
00000002:00080000:0:1278392917.023492:0:3572:0:(mdc_request.c:1465:mdc_init_ea_size()) updating max_mdsize/max_cookiesize: 288/320
00000100:02000000:0:1278392917.023497:0:3572:0:(import.c:1327:ptlrpc_import_recovery_state_machine()) AC3-OST0006-osc-ffff81043e754c00: Connection restored to service AC3-OST0006 using nid 172.16.16.2@tcp.
00000100:00080000:6:1278392917.023513:0:3571:0:(client.c:1332:ptlrpc_check_set()) resend bulk old x1340041377973370 new x1340041377973375
ns 448/608 e 0 to 1 dl 0 ref 2 fl Rpc:W/0/0 rc 0/0
 lens 296/544 e 0 to 1 dl 1278392906 ref 1 fl Rpc:X/0/0 rc 0/0
 lens 296/544 e 0 to 1 dl 1278392906 ref 1 fl Rpc:X/0/0 rc 0/0
00000100:00080000:6:1278392917.023543:0:3571:0:(client.c:1332:ptlrpc_check_set()) resend bulk old x1340041377973371 new x1340041377973376
00000002:00080000:0:1278392917.023544:0:3572:0:(mdc_request.c:1465:mdc_init_ea_size()) updating max_mdsize/max_cookiesize: 288/320
00000100:00080000:6:1278392917.023572:0:3571:0:(client.c:1332:ptlrpc_check_set()) resend bulk old x1340041377973372 new x1340041377973377
00000100:00080000:6:1278392917.023596:0:3571:0:(client.c:1332:ptlrpc_check_set()) resend bulk old x1340041377973373 new x1340041377973378
00000100:00080000:4:1278392919.644883:0:3573:0:(recover.c:67:ptlrpc_initiate_recovery()) AC3-OST0008_UUID: starting recovery
00000100:00080000:4:1278392919.644889:0:3573:0:(import.c:608:ptlrpc_connect_import()) ffff810420d1a000 AC3-OST0008_UUID: changing import state from DISCONN to CONNECTING
00000100:00080000:4:1278392919.644894:0:3573:0:(import.c:470:import_select_connection()) AC3-OST0008-osc-ffff81043e754c00: connect to NID 172.16.16.2@tcp last attempt 4724613386
00000100:00080000:4:1278392919.644900:0:3573:0:(import.c:544:import_select_connection()) AC3-OST0008-osc-ffff81043e754c00: import ffff810420d1a000 using connection 172.16.16.2@tcp/172.16.16.2@tcp
00000100:00080000:0:1278392919.645477:0:3572:0:(import.c:868:ptlrpc_connect_interpret()) reconnected to AC3-OST0...@172.16.16.2@tcp after partition
00000100:00080000:0:1278392919.645482:0:3572:0:(import.c:891:ptlrpc_connect_interpret()) ffff810420d1a000 AC3-OST0008_UUID: changing import state from CONNECTING to RECOVER
00000100:00080000:0:1278392919.645486:0:3572:0:(import.c:1314:ptlrpc_import_recovery_state_machine()) reconnected to AC3-OST0...@172.16.16.2@tcp
 1 dl 1278392952 ref 2 fl Rpc:/0/0 rc 0/0
00000100:00080000:0:1278392919.645495:0:3572:0:(client.c:1969:ptlrpc_resend_req()) resend bulk old x1340041377933615 new x1340041377977349
 1 dl 1278392952 ref 2 fl Rpc:/0/0 rc 0/0
00000100:00080000:0:1278392919.645505:0:3572:0:(client.c:1969:ptlrpc_resend_req()) resend bulk old x1340041377933659 new x1340041377977350
 1 dl 1278392952 ref 2 fl Rpc:/0/0 rc 0/0
00000100:00080000:0:1278392919.645511:0:3572:0:(client.c:1969:ptlrpc_resend_req()) resend bulk old x1340041377933703 new x1340041377977351
00000100:00080000:0:1278392919.645514:0:3572:0:(import.c:1319:ptlrpc_import_recovery_state_machine()) ffff810420d1a000 AC3-OST0008_UUID: changing import state from RECOVER to FULL
00000100:00080000:6:1278392919.645518:0:3571:0:(client.c:1332:ptlrpc_check_set()) resend bulk old x1340041377977350 new x1340041377977353
00000002:00080000:0:1278392919.645531:0:3572:0:(mdc_request.c:1465:mdc_init_ea_size()) updating max_mdsize/max_cookiesize: 288/320
00000100:02000000:0:1278392919.645536:0:3572:0:(import.c:1327:ptlrpc_import_recovery_state_machine()) AC3-OST0008-osc-ffff81043e754c00: Connection restored to service AC3-OST0008 using nid 172.16.16.2@tcp.
 lens 296/544 e 0 to 1 dl 1278392907 ref 1 fl Rpc:X/0/0 rc 0/0
ns 448/592 e 0 to 1 dl 0 ref 2 fl Rpc:W/0/0 rc 0/0
ns 448/608 e 0 to 1 dl 0 ref 2 fl Rpc:W/0/0 rc 0/0
ns 448/608 e 0 to 1 dl 0 ref 2 fl Rpc:W/0/0 rc 0/0
ns 448/592 e 0 to 1 dl 0 ref 2 fl Rpc:W/0/0 rc 0/0
ns 448/608 e 0 to 1 dl 0 ref 2 fl Rpc:W/0/0 rc 0/0
ns 448/592 e 0 to 1 dl 1278392952 ref 2 fl Rpc:S/0/0 rc -11/0
00000100:00080000:6:1278392919.645591:0:3571:0:(client.c:1332:ptlrpc_check_set()) resend bulk old x1340041377977351 new x1340041377977354
00000002:00080000:0:1278392919.645593:0:3572:0:(mdc_request.c:1465:mdc_init_ea_size()) updating max_mdsize/max_cookiesize: 288/320
00000100:00080000:6:1278392919.645729:0:3571:0:(client.c:1332:ptlrpc_check_set()) resend bulk old x1340041377977349 new x1340041377977355
00000100:02020000:2:1278393017.800317:0:3571:0:(client.c:975:ptlrpc_check_status()) 11-0: an error occurred while communicating with 172.16.16.2@tcp. The ost_write operation failed with -107
00000100:00080000:2:1278393017.800332:0:3571:0:(recover.c:196:ptlrpc_request_handle_notconn()) import AC3-OST0006-osc-ffff81043e754c00 of AC3-OST0...@172.16.16.2@tcp abruptly disconnected: reconnecting
rations using this service will wait for recovery to complete.
00000100:00080000:2:1278393017.800343:0:3571:0:(import.c:167:ptlrpc_set_import_discon()) ffff81042018c000 AC3-OST0006_UUID: changing import state from FULL to DISCONN
00000100:00080000:2:1278393017.800350:0:3571:0:(import.c:608:ptlrpc_connect_import()) ffff81042018c000 AC3-OST0006_UUID: changing import state from DISCONN to CONNECTING
00000100:00080000:2:1278393017.800357:0:3571:0:(import.c:470:import_select_connection()) AC3-OST0006-osc-ffff81043e754c00: connect to NID 172.16.16.2@tcp last attempt 4724623378
00000100:00080000:2:1278393017.800366:0:3571:0:(import.c:544:import_select_connection()) AC3-OST0006-osc-ffff81043e754c00: import ffff81042018c000 using connection 172.16.16.2@tcp/172.16.16.2@tcpt 0 o8->AC3-OST0...@172.16.16.2@tcp:28/4 lens 368/584 e 0 to 1 dl 1278393031 ref 1 fl Interpret:RN/0/0 rc 0/0
00000100:00080000:0:1278393017.801096:0:3572:0:(import.c:911:ptlrpc_connect_interpret()) ffff81042018c000 AC3-OST0006_UUID: changing import state from CONNECTING to EVICTED
00000100:02020000:0:1278393017.801099:0:3572:0:(import.c:1249:ptlrpc_import_recovery_state_machine()) 167-0: This client was evicted by AC3-OST0006; in progress operations using this service will fail.
00000100:00080000:0:1278393017.801106:0:3572:0:(import.c:1253:ptlrpc_import_recovery_state_machine()) evicted from AC3-OST0...@172.16.16.2@tcp; invalidating
00000002:00080000:0:1278393017.801150:0:3572:0:(mdc_request.c:1465:mdc_init_ea_size()) updating max_mdsize/max_cookiesize: 288/320
00000100:00080000:0:1278393017.801188:0:6779:0:(import.c:1215:ptlrpc_invalidate_import_thread()) thread invalidate import AC3-OST0006-osc-ffff81043e754c00 to AC3-OST0...@172.16.16.2@tcp
00000100:00080000:0:1278393017.801190:0:6779:0:(import.c:194:ptlrpc_deactivate_and_unlock_import()) setting import AC3-OST0006_UUID INVALID
00020000:01000000:0:1278393017.801201:0:6779:0:(lov_obd.c:567:lov_set_osc_active()) Marking OSC AC3-OST0006_UUID inactive
00000002:00080000:0:1278393017.801206:0:6779:0:(mdc_request.c:1465:mdc_init_ea_size()) updating max_mdsize/max_cookiesize: 288/320
00020000:00080000:7:1278393017.810055:0:6430:0:(lov_request.c:325:lov_prep_enqueue_set()) lov idx 6 inactive
00020000:00080000:5:1278393017.827252:0:3095:0:(lov_request.c:325:lov_prep_enqueue_set()) lov idx 6 inactive
00020000:00080000:5:1278393017.827298:0:3095:0:(lov_request.c:325:lov_prep_enqueue_set()) lov idx 6 inactive
00020000:00080000:5:1278393017.827325:0:3095:0:(lov_request.c:325:lov_prep_enqueue_set()) lov idx 6 inactive
00020000:00080000:7:1278393017.833456:0:3095:0:(lov_request.c:1099:lov_prep_destroy_set()) lov idx 6 inactive
00000080:00020000:7:1278393017.833461:0:3095:0:(namei.c:1176:ll_objects_destroy()) obd destroy objid 0x18542c4@0x0 error -5
00020000:00080000:4:1278393017.912169:0:6430:0:(lov_obd.c:2372:lov_join_lru()) lov idx 6 inactive
00020000:00080000:4:1278393017.913654:0:6419:0:(lov_request.c:325:lov_prep_enqueue_set()) lov idx 6 inactive
00000080:00020000:4:1278393017.913660:0:6419:0:(file.c:995:ll_glimpse_size()) obd_enqueue returned rc -5, returning -EIO
00020000:00080000:4:1278393017.922360:0:6419:0:(lov_request.c:325:lov_prep_enqueue_set()) lov idx 6 inactive
00010000:00020000:1:1278393018.325707:0:6779:0:(ldlm_resource.c:518:ldlm_namespace_cleanup()) Namespace AC3-OST0006-osc-ffff81043e754c00 resource refcount nonzero (1) after lock cleanup; forcing cleanup.
00010000:00020000:1:1278393018.325727:0:6779:0:(ldlm_resource.c:523:ldlm_namespace_cleanup()) Resource: ffff810117997500 (8432287/0/0/0) (rc: 1)
00020000:00080000:7:1278393018.325761:0:6687:0:(lov_obd.c:1727:lov_brw_check()) lov idx 6 inactive
00000080:00020000:7:1278393018.325781:0:6687:0:(llite_mmap.c:210:ll_tree_unlock()) couldn't unlock -5
00020000:00080000:7:1278393018.327514:0:6687:0:(lov_request.c:325:lov_prep_enqueue_set()) lov idx 6 inactive
00020000:00080000:2:1278393018.332887:0:6782:0:(lov_request.c:325:lov_prep_enqueue_set()) lov idx 6 inactive
00000080:00020000:2:1278393018.332895:0:6782:0:(file.c:995:ll_glimpse_size()) obd_enqueue returned rc -5, returning -EIO
00000100:00080000:1:1278393018.333028:0:6779:0:(import.c:1224:ptlrpc_invalidate_import_thread()) ffff81042018c000 AC3-OST0006_UUID: changing import state from EVICTED to RECOVER
00000100:00080000:1:1278393018.333032:0:6779:0:(import.c:1314:ptlrpc_import_recovery_state_machine()) reconnected to AC3-OST0...@172.16.16.2@tcp
00000100:00080000:1:1278393018.333035:0:6779:0:(import.c:1319:ptlrpc_import_recovery_state_machine()) ffff81042018c000 AC3-OST0006_UUID: changing import state from RECOVER to FULL
00020000:01000000:1:1278393018.333042:0:6779:0:(lov_obd.c:567:lov_set_osc_active()) Marking OSC AC3-OST0006_UUID active
00000002:00080000:1:1278393018.333051:0:6779:0:(mdc_request.c:1465:mdc_init_ea_size()) updating max_mdsize/max_cookiesize: 288/320
00000100:02000000:1:1278393018.333056:0:6779:0:(import.c:1327:ptlrpc_import_recovery_state_machine()) AC3-OST0006-osc-ffff81043e754c00: Connection restored to service AC3-OST0006 using nid 172.16.16.2@tcp.
00000100:02020000:4:1278393020.758895:0:20660:0:(client.c:975:ptlrpc_check_status()) 11-0: an error occurred while communicating with 172.16.16.2@tcp. The ost_read operation failed with -107
g
erations using this service will wait for recovery to complete.
00000100:00080000:4:1278393020.758919:0:20660:0:(import.c:167:ptlrpc_set_import_discon()) ffff810420d1a000 AC3-OST0008_UUID: changing import state from FULL to DISCONN
00000100:00080000:4:1278393020.758925:0:20660:0:(import.c:608:ptlrpc_connect_import()) ffff810420d1a000 AC3-OST0008_UUID: changing import state from DISCONN to CONNECTING
00000100:00080000:4:1278393020.758928:0:20660:0:(import.c:470:import_select_connection()) AC3-OST0008-osc-ffff81043e754c00: connect to NID 172.16.16.2@tcp last attempt 4724626000
00000100:00080000:4:1278393020.758935:0:20660:0:(import.c:544:import_select_connection()) AC3-OST0008-osc-ffff81043e754c00: import ffff810420d1a000 using connection 172.16.16.2@tcp/172.16.16.2@tcp
16.16.2@tcp:6/4 lens 448/592 e 0 to 1 dl 1278393085 ref 1 fl Rpc:RS/0/0 rc 0/-107
UU...@172.16.16.2@tcp:6/4 lens 448/592 e 0 to 1 dl 0 ref 2 fl Rpc:W/0/0 rc 0/0
t0 o8->AC3-OST0...@172.16.16.2@tcp:28/4 lens 368/584 e 0 to 1 dl 1278393038 ref 1 fl Interpret:RN/0/0 rc 0/0
00000100:00080000:0:1278393020.770920:0:3572:0:(import.c:911:ptlrpc_connect_interpret()) ffff810420d1a000 AC3-OST0008_UUID: changing import state from CONNECTING to EVICTED
00000100:02020000:0:1278393020.770923:0:3572:0:(import.c:1249:ptlrpc_import_recovery_state_machine()) 167-0: This client was evicted by AC3-OST0008; in progress operations using this service will fail.
00000100:00080000:0:1278393020.770930:0:3572:0:(import.c:1253:ptlrpc_import_recovery_state_machine()) evicted from AC3-OST0...@172.16.16.2@tcp; invalidating
00000002:00080000:0:1278393020.771032:0:3572:0:(mdc_request.c:1465:mdc_init_ea_size()) updating max_mdsize/max_cookiesize: 288/320
00000100:00080000:0:1278393020.771066:0:6784:0:(import.c:1215:ptlrpc_invalidate_import_thread()) thread invalidate import AC3-OST0008-osc-ffff81043e754c00 to AC3-OST0...@172.16.16.2@tcp
00000100:00080000:0:1278393020.771068:0:6784:0:(import.c:194:ptlrpc_deactivate_and_unlock_import()) setting import AC3-OST0008_UUID INVALID
2@tcp:6/4 lens 448/592 e 0 to 1 dl 1278393085 ref 1 fl Rpc:RES/0/0 rc -4/-107
00020000:01000000:0:1278393020.771084:0:6784:0:(lov_obd.c:567:lov_set_osc_active()) Marking OSC AC3-OST0008_UUID inactive
00000002:00080000:0:1278393020.771087:0:6784:0:(mdc_request.c:1465:mdc_init_ea_size()) updating max_mdsize/max_cookiesize: 288/320
00000080:00020000:4:1278393020.771103:0:20660:0:(rw.c:122:ll_brw()) error from obd_brw: rc = -4
00010000:00020000:0:1278393020.777506:0:6784:0:(ldlm_resource.c:518:ldlm_namespace_cleanup()) Namespace AC3-OST0008-osc-ffff81043e754c00 resource refcount nonzero (1) after lock cleanup; forcing cleanup.
00010000:00020000:0:1278393020.777518:0:6784:0:(ldlm_resource.c:523:ldlm_namespace_cleanup()) Resource: ffff81041604b9c0 (5521743/0/0/0) (rc: 1)
00010000:00020000:0:1278393020.777531:0:6784:0:(ldlm_resource.c:518:ldlm_namespace_cleanup()) Namespace AC3-OST0008-osc-ffff81043e754c00 resource refcount nonzero (1) after lock cleanup; forcing cleanup.
00010000:00020000:0:1278393020.777533:0:6784:0:(ldlm_resource.c:523:ldlm_namespace_cleanup()) Resource: ffff810266786e00 (8127131/0/0/0) (rc: 1)
00020000:00080000:5:1278393020.777583:0:6686:0:(lov_obd.c:1727:lov_brw_check()) lov idx 8 inactive
00000080:00020000:5:1278393020.777624:0:6686:0:(llite_mmap.c:210:ll_tree_unlock()) couldn't unlock -5
00000100:00020000:2:1278393020.777678:0:3571:0:(client.c:858:ptlrpc_import_delay_req()) @@@ IMP_INVALID  req@ffff81042061a800 x1340041378196315/t0 o4->AC3-OST0...@172.16.16.2@tcp:6/4 lens 512/624 e 0 o 1 dl 0 ref 2 fl Rpc:/0/0 rc 0/01068:0:6784:0:(import.c:194:ptlrpc_deactivate_and_unlock_import()) setting import AC3-OST0008_UUID INVALID
00000100:00020000:2:1278393020.778942:0:3571:0:(client.c:858:ptlrpc_import_delay_req()) @@@ IMP_INVALID  req@ffff81042061a000 x1340041378196323/t0 o4->AC3-OST0...@172.16.16.2@tcp:6/4 lens 448/608 e 0 o 1 dl 0 ref 2 fl Rpc:/0/0 rc 0/01084:0:6784:0:(lov_obd.c:567:lov_set_osc_active()) Marking OSC AC3-OST0008_UUID inactive
00020000:00080000:5:1278393020.779394:0:6686:0:(lov_request.c:325:lov_prep_enqueue_set()) lov idx 8 inactive
00000100:00020000:2:1278393020.780168:0:3571:0:(client.c:858:ptlrpc_import_delay_req()) @@@ IMP_INVALID  req@ffff8101f525e000 x1340041378196326/t0 o3->AC3-OST0...@172.16.16.2@tcp:6/4 lens 448/592 e 0 o 1 dl 0 ref 2 fl Rpc:/0/0 rc 0/07506:0:6784:0:(ldlm_resource.c:518:ldlm_namespace_cleanup()) Namespace AC3-OST0008-osc-ffff81043e754c00 resource refcount nonzero (1) after lock cleanup; forcing cleanup.
00000100:00020000:2:1278393020.781246:0:3571:0:(client.c:858:ptlrpc_import_delay_req()) @@@ IMP_INVALID  req@ffff8101f189fc00 x1340041378196328/t0 o3->AC3-OST0...@172.16.16.2@tcp:6/4 lens 448/592 e 0 o 1 dl 0 ref 2 fl Rpc:/0/0 rc 0/07531:0:6784:0:(ldlm_resource.c:518:ldlm_namespace_cleanup()) Namespace AC3-OST0008-osc-ffff81043e754c00 resource refcount nonzero (1) after lock cleanup; forcing cleanup.
00000100:00020000:2:1278393020.782563:0:3571:0:(client.c:858:ptlrpc_import_delay_req()) @@@ IMP_INVALID  req@ffff81042061b400 x1340041378196329/t0 o4->AC3-OST0...@172.16.16.2@tcp:6/4 lens 496/616 e 0 o 1 dl 0 ref 2 fl Rpc:/0/0 rc 0/07583:0:6686:0:(lov_obd.c:1727:lov_brw_check()) lov idx 8 inactive
00000100:00020000:2:1278393020.783344:0:3571:0:(client.c:858:ptlrpc_import_delay_req()) @@@ IMP_INVALID  req@ffff8104212fc400 x1340041378196334/t0 o3->AC3-OST0...@172.16.16.2@tcp:6/4 lens 448/592 e 0 o 1 dl 0 ref 2 fl Rpc:/0/0 rc 0/0
00000100:00020000:2:1278393020.783559:0:3571:0:(client.c:858:ptlrpc_import_delay_req()) @@@ IMP_INVALID  req@ffff81042061ac00 x1340041378196337/t0 o4->AC3-OST0...@172.16.16.2@tcp:6/4 lens 448/608 e 0 o 1 dl 0 ref 2 fl Rpc:/0/0 rc 0/09394:0:6686:0:(lov_request.c:325:lov_prep_enqueue_set()) lov idx 8 inactive
00000100:00020000:2:1278393020.783727:0:3571:0:(client.c:858:ptlrpc_import_delay_req()) @@@ IMP_INVALID  req@ffff81042061c800 x1340041378196341/t0 o4->AC3-OST0...@172.16.16.2@tcp:6/4 lens 496/616 e 0 o 1 dl 0 ref 2 fl Rpc:/0/0 rc 0/0
00020000:00080000:7:1278393020.783778:0:6785:0:(lov_request.c:325:lov_prep_enqueue_set()) lov idx 8 inactive
00000080:00020000:7:1278393020.783782:0:6785:0:(file.c:995:ll_glimpse_size()) obd_enqueue returned rc -5, returning -EIO
00000100:00020000:2:1278393020.784052:0:3571:0:(client.c:858:ptlrpc_import_delay_req()) @@@ IMP_INVALID  req@ffff81042061bc00 x1340041378196347/t0 o4->AC3-OST0...@172.16.16.2@tcp:6/4 lens 448/608 e 0 o 1 dl 0 ref 2 fl Rpc:/0/0 rc 0/0
00000100:00020000:2:1278393020.784197:0:3571:0:(client.c:858:ptlrpc_import_delay_req()) @@@ IMP_INVALID  req@ffff81042061cc00 x1340041378196354/t0 o4->AC3-OST0...@172.16.16.2@tcp:6/4 lens 480/616 e 0 o 1 dl 0 ref 2 fl Rpc:/0/0 rc 0/03782:0:6785:0:(file.c:995:ll_glimpse_size()) obd_enqueue returned rc -5, returning -EIO
00000100:00020000:2:1278393020.784325:0:3571:0:(client.c:858:ptlrpc_import_delay_req()) @@@ IMP_INVALID  req@ffff81042061b000 x1340041378196360/t0 o4->AC3-OST0...@172.16.16.2@tcp:6/4 lens 496/616 e 0 o 1 dl 0 ref 2 fl Rpc:/0/0 rc 0/0
00000100:00020000:2:1278393020.784436:0:3571:0:(client.c:858:ptlrpc_import_delay_req()) @@@ IMP_INVALID  req@ffff81042061b800 x1340041378196369/t0 o4->AC3-OST0...@172.16.16.2@tcp:6/4 lens 464/608 e 0 o 1 dl 0 ref 2 fl Rpc:/0/0 rc 0/0
o 1 dl 0 ref 2 fl Rpc:/0/0 rc 0/0
00000100:00080000:0:1278393020.792064:0:6784:0:(import.c:1224:ptlrpc_invalidate_import_thread()) ffff810420d1a000 AC3-OST0008_UUID: changing import state from EVICTED to RECOVER
00000100:00080000:0:1278393020.792068:0:6784:0:(import.c:1314:ptlrpc_import_recovery_state_machine()) reconnected to AC3-OST0...@172.16.16.2@tcp
00000100:00080000:0:1278393020.792071:0:6784:0:(import.c:1319:ptlrpc_import_recovery_state_machine()) ffff810420d1a000 AC3-OST0008_UUID: changing import state from RECOVER to FULL
00020000:01000000:0:1278393020.792078:0:6784:0:(lov_obd.c:567:lov_set_osc_active()) Marking OSC AC3-OST0008_UUID active
00000002:00080000:0:1278393020.792087:0:6784:0:(mdc_request.c:1465:mdc_init_ea_size()) updating max_mdsize/max_cookiesize: 288/320
00000100:02000000:0:1278393020.792092:0:6784:0:(import.c:1327:ptlrpc_import_recovery_state_machine()) AC3-OST0008-osc-ffff81043e754c00: Connection restored to service AC3-OST0008 using nid 172.16.16.2@tcp.
00000400:00000100:4:1278393020.806533:0:3561:0:(lib-move.c:1818:lnet_parse_put()) Dropping PUT from 12345-172.16.16.2@tcp portal 8 match 1340041378188715 offset 0 length 1048576: 2
00000400:00000100:4:1278393020.814069:0:3561:0:(lib-move.c:1818:lnet_parse_put()) Dropping PUT from 12345-172.16.16.2@tcp portal 8 match 1340041378188716 offset 0 length 1048576: 2
00000400:00000100:4:1278393020.828140:0:3561:0:(lib-move.c:1818:lnet_parse_put()) Dropping PUT from 12345-172.16.16.2@tcp portal 8 match 1340041378188729 offset 0 length 1048576: 2
00000400:00000100:4:1278393020.834691:0:3561:0:(lib-move.c:1818:lnet_parse_put()) Dropping PUT from 12345-172.16.16.2@tcp portal 8 match 1340041378188730 offset 0 length 1048576: 2
00000400:00000100:4:1278393020.841916:0:3561:0:(lib-move.c:1818:lnet_parse_put()) Dropping PUT from 12345-172.16.16.2@tcp portal 8 match 1340041378188736 offset 0 length 1048576: 2
00000100:02020000:2:1278394241.028730:0:7780:0:(client.c:975:ptlrpc_check_status()) 11-0: an error occurred while communicating with 172.16.16.2@tcp. The ldlm_enqueue operation failed with -107
00000100:00080000:2:1278394241.028748:0:7780:0:(recover.c:196:ptlrpc_request_handle_notconn()) import AC3-OST0006-osc-ffff81043e754c00 of AC3-OST0...@172.16.16.2@tcp abruptly disconnected: reconnecting
[...]rations using this service will wait for recovery to complete.
00000100:00080000:2:1278394241.028763:0:7780:0:(import.c:167:ptlrpc_set_import_discon()) ffff81042018c000 AC3-OST0006_UUID: changing import state from FULL to DISCONN
00000100:00080000:2:1278394241.028769:0:7780:0:(import.c:608:ptlrpc_connect_import()) ffff81042018c000 AC3-OST0006_UUID: changing import state from DISCONN to CONNECTING
00000100:00080000:2:1278394241.028772:0:7780:0:(import.c:470:import_select_connection()) AC3-OST0006-osc-ffff81043e754c00: connect to NID 172.16.16.2@tcp last attempt 4724724153
00000100:00080000:2:1278394241.028783:0:7780:0:(import.c:544:import_select_connection()) AC3-OST0006-osc-ffff81043e754c00: import ffff81042018c000 using connection 172.16.16.2@tcp/172.16.16.2@tcp
[...]6.16.2@tcp:28/4 lens 296/544 e 0 to 1 dl 1278394306 ref 1 fl Rpc:RS/0/0 rc 0/-107
00000100:00080000:2:1278394241.029211:0:22741:0:(import.c:182:ptlrpc_set_import_discon()) osc: import ffff81042018c000 already not connected (conn 180, was 179): CONNECTING
00010000:00020000:2:1278394241.029215:0:22741:0:(ldlm_request.c:1025:ldlm_cli_cancel_req()) Got rc -107 from cancel RPC: canceling anyway
00010000:00020000:2:1278394241.029225:0:22741:0:(ldlm_request.c:1587:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -107
00000100:00080000:2:1278394241.029446:0:3572:0:(import.c:908:ptlrpc_connect_interpret()) @@@ evicting (not initial connect and flags reconnect/recovering not set: 4)  req@ffff8102e5404400 x1340041383549552/t0 o8->AC3-OST0...@172.16.16.2@tcp:28/4 lens 368/584 e 0 to 1 dl 1278394268 ref 1 fl Interpret:RN/0/0 rc 0/0
00000100:00080000:2:1278394241.029451:0:3572:0:(import.c:911:ptlrpc_connect_interpret()) ffff81042018c000 AC3-OST0006_UUID: changing import state from CONNECTING to EVICTED
00000100:02020000:2:1278394241.029454:0:3572:0:(import.c:1249:ptlrpc_import_recovery_state_machine()) 167-0: This client was evicted by AC3-OST0006; in progress operations using this service will fail.
00000100:00080000:2:1278394241.029460:0:3572:0:(import.c:1253:ptlrpc_import_recovery_state_machine()) evicted from AC3-OST0...@172.16.16.2@tcp; invalidating
00000002:00080000:2:1278394241.029546:0:3572:0:(mdc_request.c:1465:mdc_init_ea_size()) updating max_mdsize/max_cookiesize: 288/320
00000100:00080000:2:1278394241.029564:0:7781:0:(import.c:1215:ptlrpc_invalidate_import_thread()) thread invalidate import AC3-OST0006-osc-ffff81043e754c00 to AC3-OST0...@172.16.16.2@tcp
00000100:00080000:2:1278394241.029567:0:7781:0:(import.c:194:ptlrpc_deactivate_and_unlock_import()) setting import AC3-OST0006_UUID INVALID
00000100:00080000:2:1278394241.029594:0:7780:0:(client.c:2098:ptlrpc_queue_wait()) @@@ "mv" awake: (EVICTED == FULL or 1/0 == 1)  req@ffff81012b338800 x1340041383514587/t0 o101->AC3-OST0...@172.16.16.2@tcp:28/4 lens 296/544 e 0 to 1 dl 1278394306 ref 1 fl Rpc:RES/0/0 rc -4/-107
00000100:00020000:2:1278394241.029668:0:7780:0:(client.c:858:ptlrpc_import_delay_req()) @@@ IMP_INVALID  req@ffff81012b338800 x1340041383549557/t0 o101->AC3-OST0...@172.16.16.2@tcp:28/4 lens 296/544 e 0 to 1 dl 0 ref 1 fl Rpc:/0/0 rc 0/0
00020000:00020000:2:1278394241.029685:0:7780:0:(lov_request.c:209:lov_update_enqueue_set()) enqueue objid 0x18ea266 subobj 0x80ab49 on OST idx 6: rc -108
00020000:01000000:2:1278394241.030074:0:7781:0:(lov_obd.c:567:lov_set_osc_active()) Marking OSC AC3-OST0006_UUID inactive
00000002:00080000:2:1278394241.030079:0:7781:0:(mdc_request.c:1465:mdc_init_ea_size()) updating max_mdsize/max_cookiesize: 288/320
00000100:00020000:7:1278394241.052192:0:3571:0:(client.c:858:ptlrpc_import_delay_req()) @@@ IMP_INVALID  req@ffff81032a347800 x1340041383549562/t0 o3->AC3-OST0...@172.16.16.2@tcp:6/4 lens 448/592 e 0 to 1 dl 0 ref 2 fl Rpc:/0/0 rc 0/0
00000100:00020000:7:1278394241.052453:0:3571:0:(client.c:858:ptlrpc_import_delay_req()) @@@ IMP_INVALID  req@ffff810436b9ac00 x1340041383549566/t0 o3->AC3-OST0...@172.16.16.2@tcp:6/4 lens 448/592 e 0 to 1 dl 0 ref 2 fl Rpc:/0/0 rc 0/0
00000100:00020000:7:1278394241.052749:0:3571:0:(client.c:858:ptlrpc_import_delay_req()) @@@ IMP_INVALID  req@ffff8101767fe000 x1340041383549569/t0 o3->AC3-OST0...@172.16.16.2@tcp:6/4 lens 448/592 e 0 to 1 dl 0 ref 2 fl Rpc:/0/0 rc 0/0
00000100:00020000:7:1278394241.053042:0:3571:0:(client.c:858:ptlrpc_import_delay_req()) @@@ IMP_INVALID  req@ffff81012a81ac00 x1340041383549573/t0 o3->AC3-OST0...@172.16.16.2@tcp:6/4 lens 448/592 e 0 to 1 dl 0 ref 2 fl Rpc:/0/0 rc 0/0
[nine further request lines truncated to the trailing flags: o 1 dl 0 ref 2 fl Rpc:/0/0 rc 0/0]
00020000:00080000:7:1278394241.221422:0:7301:0:(lov_request.c:325:lov_prep_enqueue_set()) lov idx 6 inactive
00000080:00020000:7:1278394241.221429:0:7301:0:(file.c:995:ll_glimpse_size()) obd_enqueue returned rc -5, returning -EIO
00020000:00080000:6:1278394241.232801:0:7640:0:(lov_request.c:325:lov_prep_enqueue_set()) lov idx 6 inactive
00000100:00080000:2:1278394241.435957:0:7781:0:(import.c:1224:ptlrpc_invalidate_import_thread()) ffff81042018c000 AC3-OST0006_UUID: changing import state from EVICTED to RECOVER
00000100:00080000:2:1278394241.435965:0:7781:0:(import.c:1314:ptlrpc_import_recovery_state_machine()) reconnected to AC3-OST0...@172.16.16.2@tcp
00000100:00080000:2:1278394241.435968:0:7781:0:(import.c:1319:ptlrpc_import_recovery_state_machine()) ffff81042018c000 AC3-OST0006_UUID: changing import state from RECOVER to FULL
00020000:01000000:2:1278394241.435975:0:7781:0:(lov_obd.c:567:lov_set_osc_active()) Marking OSC AC3-OST0006_UUID active
00000002:00080000:2:1278394241.435983:0:7781:0:(mdc_request.c:1465:mdc_init_ea_size()) updating max_mdsize/max_cookiesize: 288/320
00000100:02000000:2:1278394241.435989:0:7781:0:(import.c:1327:ptlrpc_import_recovery_state_machine()) AC3-OST0006-osc-ffff81043e754c00: Connection restored to service AC3-OST0006 using nid 172.16.16.2@tcp.
[...]ut (7s prior to deadline).
  req@ffff810346c34c00 x1340041411931387/t0 o101->AC3-OST0...@172.16.16.4@tcp:28/4 lens 296/544 e 0 to 1 dl 1278419851 ref 1 fl Rpc:/0/0 rc 0/0
[...]erations using this service will wait for recovery to complete.
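
In case it helps anyone reading along: the lines that seem to matter in the
dump above are the import state transitions, the eviction notices, and the
IMP_INVALID requests that get failed back to applications as -EIO. Pulling
just those out of a saved copy of the debug log makes the sequence easier to
follow; a rough sketch (the saved file name is only an example):

  grep -Ei 'changing import state|evict|IMP_INVALID|returning -EIO' \
      /tmp/lustre-client-debug.log

Reading the error numbers in the dump: -5 is EIO, -107 is ENOTCONN and -108
is ESHUTDOWN, which matches the pattern above: the client gets evicted, the
import goes INVALID, and I/O queued against that OST fails with EIO until
the import returns to FULL.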

Christopher J. Morrone

unread,
Jul 20, 2010, 3:06:10 PM7/20/10
to lustre-...@lists.lustre.org
On 07/07/2010 01:04 AM, Gabriele Paciucci wrote:
> Hi,
> the ptlrpc bug is a problem, but I don't find in Peter's logs any
> reference to an eviction caused by ptlrpc; the evictions look instead
> like timeouts in the communication between an OST and the client.
> Peter could, however, downgrade to 1.8.1.1, which does not suffer
> from that problem.

The bug that I describe does not have any messages about the ptlrpcd
performing evictions. The server's "I think it's dead, and I am
evicting it", and other messages about the server timing out on the
client are the only messages that you will see with the bug that I
described.

But like I said, there are many possible reasons for timeouts, so it
could easily be something else.
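
If it does turn out to be ordinary timeouts, the obd timeout and adaptive
timeout settings are worth checking on both the clients and the servers
while the load is running. A minimal sketch, assuming a 1.8.x system where
these are exposed through lctl get_param (adjust the names to whatever your
nodes actually show):

  # request timeout and adaptive-timeout tunables
  lctl get_param timeout at_min at_max at_history

  # connection state of each OSC import, as seen from a client
  lctl get_param osc.*.ost_server_uuid

If the servers hit long disk or network stalls under load, requests can
still time out before the adaptive timeout catches up, which would look
very much like the evictions in Peter's logs.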

Bernd Schubert

unread,
Jul 20, 2010, 3:59:18 PM7/20/10
to lustre-...@lists.lustre.org
On Tuesday, July 20, 2010, Christopher J. Morrone wrote:
> On 07/07/2010 01:04 AM, Gabriele Paciucci wrote:
> > Hi,
> > the ptlrpc bug is a problem, but I don't find in Peter's logs any
> > reference to an eviction caused by ptlrpc; the evictions look instead
> > like timeouts in the communication between an OST and the client.
> > Peter could, however, downgrade to 1.8.1.1, which does not suffer
> > from that problem.
>
> The bug that I describe does not have any messages about the ptlrpcd
> performing evictions. The server's "I think it's dead, and I am
> evicting it", and other messages about the server timing out on the
> client are the only messages that you will see with the bug that I
> described.
>
> But like I said, there are many possible reasons for timeouts, so it
> could easily be something else.


For the record, while stress testing Lustre I can easily reproduce evictions
with any Lustre version. However, they are terribly difficult to debug without
additional tools. I have opened a bugzilla entry for that, but I don't think I
will have time to work on those tools any time soon.

https://bugzilla.lustre.org/show_bug.cgi?id=23190
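
Until something better exists, the stock debug buffer already gets part of
the way there. A rough sketch of the kind of thing that can be run on the
affected client and OSS before a stress run (the mask and the dump path are
just examples):

  # add RPC and lock tracing to the default debug mask
  lctl set_param debug=+rpctrace
  lctl set_param debug=+dlmtrace

  # enlarge the in-kernel debug buffer so the interesting window survives
  lctl set_param debug_mb=256

  # as soon as an eviction shows up in syslog, dump the buffer to a file
  lctl dk /tmp/lustre-debug.$(hostname).$(date +%s)

If lctl set_param does not cover the lnet variables on a given version, the
same values can be echoed into /proc/sys/lnet/debug and
/proc/sys/lnet/debug_mb. Matching the timestamps in the client dump against
the OSS dump is tedious, but it at least shows which side stopped answering
first.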


--
Bernd Schubert
DataDirect Networks
