[Lustre-community] Problems with spontaneous evictions of clients on lustre 2.1


Mike

Dec 8, 2011, 12:35:50 PM
to lustre-c...@lists.lustre.org
Hi everybody,

We have been struggling with a strange problem for quite a while: since
version 1.8.4, through the migration to 1.8.5 and now on 2.1, we are
still experiencing sporadic evictions of clients by the central server.

Why it is happening is unknown. We have 15 clients connected to the
Lustre server, and several times a day the server logs show something
like this:
------
Lustre: 6452:0:(filter_io.c:869:filter_preprw_write()) storage-OST0000:
slow start_page_write 31s
Lustre: 6434:0:(filter.c:3455:filter_setattr()) storage-OST0000: slow
i_alloc_sem 31s
Lustre: 6456:0:(filter_io.c:869:filter_preprw_write()) storage-OST0000:
slow start_page_write 36s
Lustre: 6437:0:(filter.c:3455:filter_setattr()) storage-OST0000: slow
i_alloc_sem 36s
Lustre: 6454:0:(filter_io.c:869:filter_preprw_write()) storage-OST0001:
slow start_page_write 45s
Lustre: 6465:0:(filter_io.c:869:filter_preprw_write()) storage-OST0001:
slow start_page_write 47s
Lustre: 6470:0:(filter_io.c:869:filter_preprw_write()) storage-OST0001:
slow start_page_write 31s
Lustre: 6433:0:(filter.c:3455:filter_setattr()) storage-OST0001: slow
i_alloc_sem 31s
Lustre: 5243:0:(ldlm_lib.c:877:target_handle_connect()) MGS: connection
from 16eb8188-0ece-5499...@192.168.0.39@tcp t0 exp
0000000000000000 cur 1323314332 last 0
Lustre: 5243:0:(sec.c:1474:sptlrpc_import_sec_adapt()) import
MGS->NET_0x20000c0a80027_UUID netid 20000: select flavor null
Lustre: 5082:0:(ldlm_lib.c:877:target_handle_connect()) MGS: connection
from f40ee2b0-3fa4-4b90...@192.168.0.37@tcp t0 exp
0000000000000000 cur 1323316215 last 0
Lustre: 5082:0:(ldlm_lib.c:877:target_handle_connect()) Skipped 3
previous similar messages
Lustre: 5082:0:(sec.c:1474:sptlrpc_import_sec_adapt()) import
MGS->NET_0x20000c0a80025_UUID netid 20000: select flavor null
Lustre: 5082:0:(sec.c:1474:sptlrpc_import_sec_adapt()) Skipped 3
previous similar messages
LustreError: 0:0:(ldlm_lockd.c:356:waiting_locks_callback()) ### lock
callback timer expired after 100s: evicting client at 192.168.0.22@tcp
ns: mdt-ffff810636526000 lock: ffff81053233f900/0x246ade46d6c85c1c lrc:
3/0,0 mode: PR/PR res: 8589938338/130238 bits 0x3 rrc: 2 type: IBT
flags: 0x4000020 remote: 0x6578889ffc95160a expref: 12728 pid: 24880
timeout: 5707546384
LustreError: 24882:0:(client.c:1045:ptlrpc_import_delay_req()) @@@
IMP_CLOSED req@ffff810190b43c00 x1386133921685632/t0(0) o-1->
NET_0x20000c0a80016_UUID:15/16 lens 344/192 e 0 to 0 dl 0 ref 1 fl
Rpc:N/ffffffff/ffffffff rc 0/-1
LustreError: 24882:0:(ldlm_lockd.c:638:ldlm_handle_ast_error()) ###
client (nid 192.168.0.22@tcp) returned 0 from completion AST ns:
mdt-ffff810636526000 lock: ffff810587e6b6c0/0x246ade46d6c85ca1 lrc:
1/0,0 mode: --/PR res: 12312741/3540065265 bits 0x2 rrc: 24 type: IBT
flags: 0x200000a0 remote: 0x6578889ffc951618 expref: 6939 pid: 24880
timeout: 5707647019
Lustre: 29331:0:(ldlm_lib.c:877:target_handle_connect())
storage-MDT0000: connection from
10db495f-d837-a550...@192.168.0.22@tcp t322820289329 exp
0000000000000000 cur 1323332617 last 0
Lustre: 29331:0:(ldlm_lib.c:877:target_handle_connect()) Skipped 3
previous similar messages
Lustre: 29331:0:(sec.c:1474:sptlrpc_import_sec_adapt()) import
storage-MDT0000->NET_0x20000c0a80016_UUID netid 20000: select flavor null
Lustre: 29331:0:(sec.c:1474:sptlrpc_import_sec_adapt()) Skipped 3
previous similar messages
LustreError: 5650:0:(ldlm_lib.c:2129:target_send_reply_msg()) @@@
processing error (-95) req@ffff8104d429d000 x1387597624433738/t0(0)
o-1->229d810c-2ff7-17dc-9e24-4fe1eeede983@NET_0x20000c0a80025_UUID:0/0
lens 448/328 e 0 to 0 dl 1323355499 ref 1 fl
Interpret:/ffffffff/ffffffff rc -95/-1
LustreError: 21858:0:(ldlm_lib.c:2129:target_send_reply_msg()) @@@
processing error (-95) req@ffff810627f44800 x1387597624456482/t0(0)
o-1->229d810c-2ff7-17dc-9e24-4fe1eeede983@NET_0x20000c0a80025_UUID:0/0
lens 448/328 e 0 to 0 dl 1323355657 ref 1 fl
Interpret:/ffffffff/ffffffff rc -95/-1
LustreError: 21858:0:(ldlm_lib.c:2129:target_send_reply_msg()) Skipped 5
previous similar messages
LustreError: 21858:0:(ldlm_lib.c:2129:target_send_reply_msg()) @@@
processing error (-95) req@ffff810050b8ec00 x1387597624467844/t0(0)
o-1->229d810c-2ff7-17dc-9e24-4fe1eeede983@NET_0x20000c0a80025_UUID:0/0
lens 448/328 e 0 to 0 dl 1323355868 ref 1 fl
Interpret:/ffffffff/ffffffff rc -95/-1
LustreError: 21858:0:(ldlm_lib.c:2129:target_send_reply_msg()) Skipped 1
previous similar message
LustreError: 21858:0:(ldlm_lib.c:2129:target_send_reply_msg()) @@@
processing error (-95) req@ffff810111ff5800 x1387597624486695/t0(0)
o-1->229d810c-2ff7-17dc-9e24-4fe1eeede983@NET_0x20000c0a80025_UUID:0/0
lens 448/328 e 0 to 0 dl 1323356108 ref 1 fl
Interpret:/ffffffff/ffffffff rc -95/-1
LustreError: 21858:0:(ldlm_lib.c:2129:target_send_reply_msg()) Skipped 5
previous similar messages
LustreError: 21870:0:(ldlm_lib.c:2129:target_send_reply_msg()) @@@
processing error (-95) req@ffff8103262e1850 x1387597624493722/t0(0)
o-1->229d810c-2ff7-17dc-9e24-4fe1eeede983@NET_0x20000c0a80025_UUID:0/0
lens 448/328 e 0 to 0 dl 1323356204 ref 1 fl
Interpret:/ffffffff/ffffffff rc -95/-1
LustreError: 21870:0:(ldlm_lib.c:2129:target_send_reply_msg()) Skipped 2
previous similar messages
LustreError: 21849:0:(ldlm_lib.c:2129:target_send_reply_msg()) @@@
processing error (-95) req@ffff810325c04050 x1387597624503062/t0(0)
o-1->229d810c-2ff7-17dc-9e24-4fe1eeede983@NET_0x20000c0a80025_UUID:0/0
lens 448/328 e 0 to 0 dl 1323356388 ref 1 fl
Interpret:/ffffffff/ffffffff rc -95/-1
LustreError: 21849:0:(ldlm_lib.c:2129:target_send_reply_msg()) Skipped 6
previous similar messages
LustreError: 0:0:(ldlm_lockd.c:356:waiting_locks_callback()) ### lock
callback timer expired after 100s: evicting client at 192.168.0.26@tcp
ns: mdt-ffff810636526000 lock: ffff8105a51ba240/0x246ade46f6d980a8 lrc:
3/0,0 mode: PR/PR res: 8589938377/22137 bits 0x3 rrc: 2 type: IBT flags:
0x4000020 remote: 0x221c008cb5aab1 expref: 11285 pid: 21871 timeout:
5740267399
LustreError: 21865:0:(mdt_handler.c:2785:mdt_recovery()) operation 101
on unconnected MDS from 12345-192.168.0.26@tcp
LustreError: 21865:0:(mdt_handler.c:2785:mdt_recovery()) Skipped 4
previous similar messages
LustreError: 21865:0:(ldlm_lib.c:2129:target_send_reply_msg()) @@@
processing error (-107) req@ffff8100439cac00 x1387008508758491/t0(0)
o-1-><?>@<?>:0/0 lens 544/0 e 0 to 0 dl 1323365244 ref 1 fl
Interpret:/ffffffff/ffffffff rc -107/-1
LustreError: 21865:0:(ldlm_lib.c:2129:target_send_reply_msg()) Skipped 4
previous similar messages
Lustre: 21872:0:(ldlm_lib.c:877:target_handle_connect())
storage-MDT0000: connection from
e22abfde-d76c-9a91...@192.168.0.26@tcp t322669877067 exp
0000000000000000 cur 1323365138 last 0
Lustre: 21872:0:(sec.c:1474:sptlrpc_import_sec_adapt()) import
storage-MDT0000->NET_0x20000c0a8001a_UUID netid 20000: select flavor null

------------------------------------------

At the same time, on the clients we see something like this:
---
Lustre: storage-MDT0000-mdc-ffff880afb2a5c00: Connection to service
storage-MDT0000 via nid 192.168.0.42@tcp was lost; in progress
operations using this service will wait for recovery to complete.
LustreError: 25826:0:(ldlm_request.c:110:ldlm_expired_completion_wait())
### lock timed out (enqueued at 1323332321, 296s ago), entering recovery
for storage-MD...@192.168.0.42@tcp ns:
storage-MDT0000-mdc-ffff880afb2a5c00 lock:
ffff880211a50240/0x6578889ffc951618 lrc: 4/1,0 mode: --/PR res:
12312741/3540065265 bits 0x2 rrc: 2 type: IBT flags: 0x0 remote:
0x246ade46d6c85ca1 expref: -99 pid: 25826 timeout: 0
LustreError: 167-0: This client was evicted by storage-MDT0000; in
progress operations using this service will fail.
LustreError: 28938:0:(client.c:1060:ptlrpc_import_delay_req()) @@@
IMP_INVALID req@ffff88059cf56400 x1386136653334645/t0(0)
o-1->storage-MD...@192.168.0.42@tcp:23/10 lens 360/824 e 0 to 0
dl 0 ref 2 fl Rpc:/ffffffff/ffffffff rc 0/-1
LustreError: 28938:0:(client.c:1060:ptlrpc_import_delay_req()) Skipped
847 previous similar messages
LustreError: 28938:0:(file.c:158:ll_close_inode_openhandle()) inode
144115249614728204 mdc close failed: rc = -108
LustreError: 28938:0:(file.c:158:ll_close_inode_openhandle()) Skipped 14
previous similar messages
LustreError: 28938:0:(client.c:1060:ptlrpc_import_delay_req()) @@@
IMP_INVALID req@ffff88064358c400 x1386136653334718/t0(0)
o-1->storage-MD...@192.168.0.42@tcp:23/10 lens 360/824 e 0 to 0
dl 0 ref 2 fl Rpc:/ffffffff/ffffffff rc 0/-1
LustreError: 28938:0:(client.c:1060:ptlrpc_import_delay_req()) Skipped
72 previous similar messages
LustreError: 28938:0:(file.c:158:ll_close_inode_openhandle()) inode
11950252 mdc close failed: rc = -108
LustreError: 28938:0:(file.c:158:ll_close_inode_openhandle()) Skipped 72
previous similar messages
LustreError: 25826:0:(mdc_locks.c:722:mdc_enqueue()) ldlm_cli_enqueue: -5
LustreError: 25826:0:(mdc_locks.c:722:mdc_enqueue()) Skipped 460
previous similar messages
LustreError: 25826:0:(dir.c:424:ll_get_dir_page()) lock enqueue:
[0xbbe0a5:0xd3011bf1:0x0] at 7331450859914670877: rc -5
LustreError: 25826:0:(dir.c:424:ll_get_dir_page()) Skipped 1 previous
similar message
LustreError: 15606:0:(lmv_obd.c:1036:lmv_fid_alloc()) Can't alloc new
fid, rc -19
LustreError: 26517:0:(file.c:2186:ll_inode_revalidate_fini()) failure
-108 inode 9685121
LustreError: 26517:0:(file.c:2186:ll_inode_revalidate_fini()) Skipped
354 previous similar messages
LustreError: 26517:0:(dir.c:424:ll_get_dir_page()) lock enqueue:
[0xbbe0a5:0xd3011bf1:0x0] at 8925214612138510438: rc -108
LustreError: 26517:0:(dir.c:649:ll_readdir()) error reading dir
[0xbbe0a5:0xd3011bf1:0x0] at 8925214612138510438: rc -108
LustreError: 19962:0:(mdc_locks.c:722:mdc_enqueue()) ldlm_cli_enqueue: -108
LustreError: 19962:0:(mdc_locks.c:722:mdc_enqueue()) Skipped 254
previous similar messages
LustreError: 27542:0:(file.c:2186:ll_inode_revalidate_fini()) failure
-108 inode 9685121
LustreError: 27542:0:(file.c:2186:ll_inode_revalidate_fini()) Skipped
174 previous similar messages
LustreError: 28938:0:(ldlm_resource.c:748:ldlm_resource_complain())
Namespace storage-MDT0000-mdc-ffff880afb2a5c00 resource refcount nonzero
(1) after lock cleanup; forcing cleanup.
LustreError: 28938:0:(ldlm_resource.c:754:ldlm_resource_complain())
Resource: ffff8807bffa9800 (11888416/2271786101/0/0) (rc: 0)
Lustre: storage-MDT0000-mdc-ffff880afb2a5c00: Connection restored to
service storage-MDT0000 using nid 192.168.0.42@tcp.
---------------------------------------------------

Incidents like this usually affect anywhere from 1 to 7-8 clients
simultaneously, for no obvious reason.
We have tried raising the timeout with lctl conf_param
storage.sys.timeout=1000, as recommended in several threads we found on
Google, but it did not help at all.
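
For completeness, this is roughly what we ran on the MGS; the
verification step on a client is only a sketch, assuming lctl is
available there:

# on the MGS node, raise the filesystem-wide obd timeout for "storage"
lctl conf_param storage.sys.timeout=1000

# on a client, check the timeout value actually in effect
lctl get_param timeout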

Any ideas what might be causing this, and how we should proceed to
resolve it?
From what we can tell, this was supposed to be fixed by bug 23352 in a
late 1.8.x release, so we assumed 2.1 already contains that fix?
Can someone confirm this?
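
In case the exact build matters, here is a quick way to double-check
the version actually running on the servers and clients (a minimal
check; on our nodes /proc/fs/lustre/version is present, so either form
should work):

lctl get_param version
# or, equivalently
cat /proc/fs/lustre/version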

Thanks in advance,
Mike.

_______________________________________________
Lustre-community mailing list
Lustre-c...@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-community
