[Lustre-discuss] Timeouts and Dumps


Denise Hummel

Dec 19, 2008, 10:42:16 AM
to lustre-...@lists.lustre.org
Hi;

I have started getting numerous dump logs, timeouts and client
evictions. Our environment:
200 compute nodes
The MDT/MGS is on an x86_64 server with 8 GB memory and 2 dual-core AMD procs
The OSS is on an x86_64 server with 8 GB memory and 2 dual-core AMD procs
One OST, RAID 6, ~8 TB (recently shrunk from 9 TB)
Lustre 1.6.4.2

The timeouts are causing our login node to hang long enough to be
noticeable to the users before recovery. I have attached a couple of
the logs and the syslog from this morning. All of the logs and messages
are pretty similar.

If anyone has any suggestions, I would appreciate it.

Thanks,
Denise

lustre-log.1229661071.27314
lustre-log.1229685498.27300.gz
syslog

Isaac Huang

Dec 19, 2008, 3:44:34 PM
to Denise Hummel, lustre-...@lists.lustre.org
On Fri, Dec 19, 2008 at 08:42:16AM -0700, Denise Hummel wrote:
> Hi;
>
> I have started getting numerous dump logs, timeouts and client
> evictions. Our environment:
> ......
> Dec 19 04:17:28 oss1 kernel: Lustre: 27065:0:(router.c:167:lnet_notify()) Ignoring prediction from 172.16.100.41@tcp of 172.16.2.47@tcp down 11396654252 seconds in the future

Please look in bug 14300 for a fix of this.

Isaac
_______________________________________________
Lustre-discuss mailing list
Lustre-...@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss

Denise Hummel

Dec 22, 2008, 10:26:35 AM
to He.H...@sun.com, lustre-...@lists.lustre.org
Hi Isaac;

" > Dec 19 04:17:28 oss1 kernel: Lustre:
27065:0:(router.c:167:lnet_notify()) Ignoring prediction from
172.16.100.41@tcp of 172.16.2.47@tcp down 11396654252 seconds in the
future

Please look in bug 14300 for a fix of this.

Isaac "

I did not see where this bug would fix all the dumps I am getting -
about 50 a day now:


Dec 19 04:18:18 oss1 kernel: Lustre: 0:0:(watchdog.c:130:lcw_cb())
Watchdog triggered for pid 27277: it was inactive for 100s
Dec 19 04:18:18 oss1 kernel: Lustre: 0:0:(watchdog.c:130:lcw_cb())
Skipped 1 previous similar message
Dec 19 04:18:18 oss1 kernel: Lustre:
0:0:(linux-debug.c:168:libcfs_debug_dumpstack()) showing stack for
process 27277
Dec 19 04:18:18 oss1 kernel: ll_ost_io_67 S 0000010218f55af8 0
27300 1 27301 27299 (L-TLB)
Dec 19 04:18:18 oss1 kernel: 0000010218f55948 0000000000000046
0000010140002030 ffffffff00000073
Dec 19 04:18:18 oss1 kernel: 0000000000000001 00000000a02ba9f2
0000010140067a80 0000000300000004
Dec 19 04:18:18 oss1 kernel: 000001021c220800 00000000000000a2
Dec 19 04:18:18 oss1 kernel: Call Trace:<4>Lustre:
0:0:(linux-debug.c:168:libcfs_debug_dumpstack()) Skipped 2 previous
similar messages
Dec 19 04:18:18 oss1 kernel: ll_ost_io_44 S 0000010198895af8 0
27277 1 27278 27276 (L-TLB)
Dec 19 04:18:18 oss1 kernel: 0000010198895948 0000000000000046
0000010037ef1030 ffffffff00000073
Dec 19 04:18:18 oss1 kernel: 0000000000000001 00000000a02ba9f2
000001014005fa80 0000000100000004
Dec 19 04:18:18 oss1 kernel: 00000102380b8030 000000000000010b
Dec 19 04:18:18 oss1 kernel: Call Trace:<ffffffff8013f448>{__mod_timer
+293} <ffffffff8013f448>{__mod_timer+293}
<ffffffff80321053>{schedule_timeout+367}
Dec 19 04:18:18 oss1 kernel: <ffffffff80321053>{schedule_timeout
+367}
Dec 19 04:18:18 oss1 kernel: <ffffffff8013fe78>{process_timeout
+0} <ffffffffa0576998>{:ost:ost_brw_write+7448}
Dec 19 04:18:18 oss1 kernel: <ffffffffa0574ac0>{:ost:ost_brw_read
+8528} <ffffffff8013fe78>{process_timeout+0}
<ffffffffa0576998>{:ost:ost_brw_write+7448}
Dec 19 04:18:18 oss1 kernel: <ffffffffa0574ac0>{:ost:ost_brw_read
+8528}<ffffffff80133566>{default_wake_function+0}
Dec 19 04:18:18 oss1 kernel:
<ffffffffa039e125>{:ptlrpc:lustre_msg_check_version+69}
Dec 19 04:18:18 oss1 kernel:
<ffffffffa0571590>{:ost:ost_bulk_timeout+0}
<ffffffffa057c4ab>{:ost:ost_handle+12187}
Dec 19 04:18:18 oss1 kernel: <ffffffff80131923>{recalc_task_prio
+337} <ffffffffa02beb58>{:lnet:lnet_match_blocked_msg+920}
Dec 19 04:18:18 oss1 kernel:
<ffffffffa03a699e>{:ptlrpc:ptlrpc_server_handle_request+2830}
Dec 19 04:18:18 oss1 kernel:
<ffffffffa028c02e>{:libcfs:lcw_update_time+30}
<ffffffff8013f448>{__mod_timer+293}
Dec 19 04:18:18 oss1 kernel:
<ffffffffa03a8fa8>{:ptlrpc:ptlrpc_main+2456}
<ffffffff80133566>{default_wake_function+0}
Dec 19 04:18:18 oss1 kernel:
<ffffffffa03a7660>{:ptlrpc:ptlrpc_retry_rqbds+0}
<ffffffffa03a7660>{:ptlrpc:ptlrpc_retry_rqbds+0}
Dec 19 04:18:18 oss1 kernel: <ffffffff80110de3>{child_rip+8}
<ffffffffa03a8610>{:ptlrpc:ptlrpc_main+0}
Dec 19 04:18:18 oss1 kernel: <ffffffff80110ddb>{child_rip+0}
Dec 19 04:18:18 oss1 kernel: LustreError: dumping log
to /tmp/lustre-log.1229685498.27300
Dec 19 04:18:19 oss1 kernel: <ffffffff80133566>{default_wake_function
+0}
Dec 19 04:18:19 oss1 kernel:
<ffffffffa039e125>{:ptlrpc:lustre_msg_check_version+69}
Dec 19 04:18:19 oss1 kernel:
<ffffffffa0571590>{:ost:ost_bulk_timeout+0}
<ffffffffa057c4ab>{:ost:ost_handle+12187}
Dec 19 04:18:19 oss1 kernel: <ffffffff80131923>{recalc_task_prio
+337} <ffffffffa02beb58>{:lnet:lnet_match_blocked_msg+920}
Dec 19 04:18:19 oss1 kernel:
<ffffffffa03a699e>{:ptlrpc:ptlrpc_server_handle_request+2830}
Dec 19 04:18:19 oss1 kernel:
<ffffffffa028c02e>{:libcfs:lcw_update_time+30}
<ffffffff8013f448>{__mod_timer+293}
Dec 19 04:18:19 oss1 kernel:
<ffffffffa03a8fa8>{:ptlrpc:ptlrpc_main+2456}
<ffffffff80133566>{default_wake_function+0}
Dec 19 04:18:19 oss1 kernel:
<ffffffffa03a7660>{:ptlrpc:ptlrpc_retry_rqbds+0}
<ffffffffa03a7660>{:ptlrpc:ptlrpc_retry_rqbds+0}
Dec 19 04:18:19 oss1 kernel: <ffffffff80110de3>{child_rip+8}
<ffffffffa03a8610>{:ptlrpc:ptlrpc_main+0}
Dec 19 04:18:19 oss1 kernel: <ffffffff80110ddb>{child_rip+0}
Dec 19 04:18:19 oss1 kernel: ll_ost_io_93 S 000001016dafbaf8 0
27326 1 27327 27325 (L-TLB)
Dec 19 04:18:19 oss1 kernel: 000001016dafb948 0000000000000046
00000102380b8030 ffffffff00000073
Dec 19 04:18:19 oss1 kernel: 0000000000000001 00000000a02ba9f2
000001014005fa80 0000000100000004
Dec 19 04:18:19 oss1 kernel: 00000101d1dd4800 00000000000000c8
Dec 19 04:18:19 oss1 kernel: Call Trace:<ffffffff8013f448>{__mod_timer
+293} <ffffffff80321053>{schedule_timeout+367}
Dec 19 04:18:19 oss1 kernel: <ffffffff8013fe78>{process_timeout
+0} <ffffffffa0576998>{:ost:ost_brw_write+7448}
Dec 19 04:18:19 oss1 kernel: <ffffffffa0574ac0>{:ost:ost_brw_read
+8528} <ffffffff80133566>{default_wake_function+0}
Dec 19 04:18:19 oss1 kernel:
<ffffffffa039e125>{:ptlrpc:lustre_msg_check_version+69}
Dec 19 04:18:19 oss1 kernel:
<ffffffffa0571590>{:ost:ost_bulk_timeout+0}
<ffffffffa057c4ab>{:ost:ost_handle+12187}
Dec 19 04:18:19 oss1 kernel: <ffffffff80131923>{recalc_task_prio
+337} <ffffffffa02beb58>{:lnet:lnet_match_blocked_msg+920}
Dec 19 04:18:19 oss1 kernel:
<ffffffffa03a699e>{:ptlrpc:ptlrpc_server_handle_request+2830}
Dec 19 04:18:19 oss1 kernel:
<ffffffffa028c02e>{:libcfs:lcw_update_time+30}
<ffffffff8013f448>{__mod_timer+293}
Dec 19 04:18:19 oss1 kernel:
<ffffffffa03a8fa8>{:ptlrpc:ptlrpc_main+2456}
<ffffffff80133566>{default_wake_function+0}
Dec 19 04:18:19 oss1 kernel:
<ffffffffa03a7660>{:ptlrpc:ptlrpc_retry_rqbds+0}
<ffffffffa03a7660>{:ptlrpc:ptlrpc_retry_rqbds+0}
Dec 19 04:18:19 oss1 kernel: <ffffffff80110de3>{child_rip+8}
<ffffffffa03a8610>{:ptlrpc:ptlrpc_main+0}
Dec 19 04:18:19 oss1 kernel: <ffffffff80110ddb>{child_rip+0}
Dec 19 04:18:19 oss1 kernel: ll_ost_io_89 S 00000101bbe01af8 0
27322 1 27323 27321 (L-TLB)
Dec 19 04:18:19 oss1 kernel: 00000101bbe01948 0000000000000046
0000010037ef1030 ffffffff00000073
Dec 19 04:18:19 oss1 kernel: 0000000000000001 00000000a02ba9f2
0000010140067a80 0000000300000004
Dec 19 04:18:19 oss1 kernel: 0000010239280800 0000000000000092
Dec 19 04:18:19 oss1 kernel: Call Trace:0000000000000001
00000000a02ba9f2 000001014005fa80 0000000100000004
Dec 19 04:18:19 oss1 kernel: 0000010227575800 00000000000000d5
Dec 19 04:18:19 oss1 kernel: Call Trace:<ffffffff8013f448>{__mod_timer
+293} <ffffffff8013f448>{__mod_timer+293}
<ffffffff80321053>{schedule_timeout+367}
Dec 19 04:18:19 oss1 kernel: <ffffffff80321053>{schedule_timeout
+367}
Dec 19 04:18:19 oss1 kernel: <ffffffff8013fe78>{process_timeout
+0} <ffffffffa0576998>{:ost:ost_brw_write+7448}
Dec 19 04:18:19 oss1 kernel: <ffffffffa0574ac0>{:ost:ost_brw_read
+8528} <ffffffff8013fe78>{process_timeout+0}
<ffffffffa0576998>{:ost:ost_brw_write+7448}
Dec 19 04:18:19 oss1 kernel: <ffffffffa0574ac0>{:ost:ost_brw_read
+8528} <ffffffff80133566>{default_wake_function+0}
Dec 19 04:18:19 oss1 kernel:
<ffffffffa039e125>{:ptlrpc:lustre_msg_check_version+69}
Dec 19 04:18:19 oss1 kernel:
<ffffffffa0571590>{:ost:ost_bulk_timeout+0}
<ffffffffa057c4ab>{:ost:ost_handle
+12187}<ffffffff80133566>{default_wake_function+0}

... and much more of the same - it is in the log I sent with the
original email. With these dumps I end up with timeouts and evictions
and noticeable hangs until recovery.
Does the fix for bug 14300 fix this issue also?

Thanks,
Denise

Denise Hummel

Dec 22, 2008, 3:22:31 PM
to lustre-...@lists.lustre.org
Hi;

A continuation of the timeouts and dumps. The login node has hung three
times today - no one can log in, and anyone already logged in cannot
work. The hangs are lasting for 5 or more minutes.
The messages on the OSS are:

Dec 22 13:00:44 oss1 kernel: LustreError: 138-a: lustre-OST0000: A
client on nid 172.16.100.1@tcp was evicted due to a lock blocking
callback to 172.16.100.1@tcp timed out: rc -107
Dec 22 13:00:44 oss1 kernel: LustreError:
27250:0:(ost_handler.c:1065:ost_brw_write()) @@@ Eviction on bulk GET
req@00000100bff5c800 x91545/t0
o4->ab915dd6-4f50-ba8a-5cd4-05edb28fcc82@NET_0x20000ac106401_UUID:-1
lens 384/352 ref 0 fl Interpret:/0/0 rc 0/0
Dec 22 13:00:44 oss1 kernel: LustreError:
27166:0:(ldlm_lockd.c:962:ldlm_handle_enqueue()) ### lock on destroyed
export 0000010087dc1000 ns: filter-lustre-OST0000_UUID lock:
0000010191bd54c0/0x0 lrc: 1/0,0 mode: --/PW res: 14313567/0 rrc: 1 type:
EXT [0->8191] (req 0->8191) flags: 20000080 remote: 0x4c7a2f2234322fa9
expref: 29 pid: 27166
Dec 22 13:00:44 oss1 kernel: LustreError:
27166:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error
(-107) req@00000101397be600 x91686/t0
o101->ab915dd6-4f50-ba8a-5cd4-05edb28fcc82@NET_0x20000ac106401_UUID:-1
lens 232/288 ref 0 fl Interpret:/0/0 rc -107/0
Dec 22 13:00:44 oss1 kernel: Lustre:
27250:0:(ost_handler.c:1205:ost_brw_write()) lustre-OST0000: ignoring
bulk IO comm error with
ab915dd6-4f50-ba8a-5cd4-05edb28fcc82@NET_0x20000ac106401_UUID id
12345-172.16.100.1@tcp - client will retry
Dec 22 13:00:44 oss1 kernel: LustreError:
27314:0:(ost_handler.c:1065:ost_brw_write()) @@@ Eviction on bulk GET
req@0000010065c0ca00 x91546/t0
o4->ab915dd6-4f50-ba8a-5cd4-05edb28fcc82@NET_0x20000ac106401_UUID:-1
lens 384/352 ref 0 fl Interpret:/0/0 rc 0/0
Dec 22 13:00:45 oss1 kernel: LustreError:
27256:0:(ost_handler.c:1065:ost_brw_write()) @@@ Eviction on bulk GET
req@0000010008df4a00 x91586/t0
o4->ab915dd6-4f50-ba8a-5cd4-05edb28fcc82@NET_0x20000ac106401_UUID:-1
lens 384/352 ref 0 fl Interpret:/0/0 rc 0/0
Dec 22 13:00:45 oss1 kernel: LustreError:
27256:0:(ost_handler.c:1065:ost_brw_write()) Skipped 1 previous similar
message
Dec 22 13:00:45 oss1 kernel: Lustre:
27285:0:(ost_handler.c:1205:ost_brw_write()) lustre-OST0000: ignoring
bulk IO comm error with
ab915dd6-4f50-ba8a-5cd4-05edb28fcc82@NET_0x20000ac106401_UUID id
12345-172.16.100.1@tcp - client will retry
Dec 22 13:00:45 oss1 kernel: Lustre:
27285:0:(ost_handler.c:1205:ost_brw_write()) Skipped 1 previous similar
message
Dec 22 13:00:48 oss1 kernel: Lustre:
27245:0:(ost_handler.c:1205:ost_brw_write()) lustre-OST0000: ignoring
bulk IO comm error with
ab915dd6-4f50-ba8a-5cd4-05edb28fcc82@NET_0x20000ac106401_UUID id
12345-172.16.100.1@tcp - client will retry
Dec 22 13:00:48 oss1 kernel: Lustre:
27245:0:(ost_handler.c:1205:ost_brw_write()) Skipped 3 previous similar
messages
Dec 22 13:00:48 oss1 kernel: LustreError:
29210:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error
(-107) req@00000100bff84600 x91689/t0 o13-><?>@<?>:-1 lens 128/0 ref 0
fl Interpret:/0/0 rc -107/0
Dec 22 13:00:48 oss1 kernel: LustreError:
27116:0:(ldlm_lockd.c:1556:ldlm_cancel_handler()) operation 103 from
12345-172.16.100.1@tcp with bad export cookie 13571729019316464878
Dec 22 13:00:48 oss1 kernel: LustreError:
27114:0:(ldlm_lockd.c:1556:ldlm_cancel_handler()) operation 103 from
12345-172.16.100.1@tcp with bad export cookie 13571729019316464878

The messages in the syslog on the login node are:

Dec 22 13:00:45 head kernel: LustreError: 11-0: an error occurred while
communicating with 172.16.100.41@tcp. The ldlm_enqueue operation failed
with -107
Dec 22 13:00:45 head kernel: Lustre:
lustre-OST0000-osc-000001018197f800: Connection to service
lustre-OST0000 via nid 172.16.100.41@tcp was lost; in progress
operations using this service will wait for recovery to complete.
Dec 22 13:00:58 head kernel: LustreError: 11-0: an error occurred while
communicating with 172.16.100.41@tcp. The ost_statfs operation failed
with -107
Dec 22 13:00:58 head kernel: LustreError:
3270:0:(ldlm_request.c:909:ldlm_cli_cancel_req()) Got rc -107 from
cancel RPC: canceling anyway
Dec 22 13:00:58 head kernel: LustreError:
3270:0:(file.c:803:ll_extent_lock_callback()) ldlm_cli_cancel failed:
-107
Dec 22 13:00:58 head kernel: LustreError:
3271:0:(ldlm_request.c:909:ldlm_cli_cancel_req()) Got rc -107 from
cancel RPC: canceling anyway
Dec 22 13:00:58 head kernel: LustreError:
3271:0:(file.c:803:ll_extent_lock_callback()) ldlm_cli_cancel failed:
-107
Dec 22 13:00:58 head kernel: LustreError: 167-0: This client was evicted
by lustre-OST0000; in progress operations using this service will fail.
Dec 22 13:00:58 head kernel: LustreError:
3248:0:(osc_request.c:1301:osc_brw_redo_request()) @@@ redo for
recoverable error req@00000101819c4a00 x91544/t0
o4->lustre-OS...@172.17.103.18@o2ib:28 lens 384/352 ref 2 fl
Complete:E/0/0 rc -5/-22
Dec 22 13:00:58 head kernel: LustreError:
3248:0:(osc_request.c:1301:osc_brw_redo_request()) Skipped 8 previous
similar messages
Dec 22 13:00:58 head kernel: LustreError:
19930:0:(llite_mmap.c:206:ll_tree_unlock()) couldn't unlock -5
Dec 22 13:00:58 head kernel: LustreError:
19930:0:(file.c:1052:ll_glimpse_size()) obd_enqueue returned rc -5,
returning -EIO
Dec 22 13:00:58 head kernel: LustreError:
19930:0:(file.c:1052:ll_glimpse_size()) Skipped 543 previous similar
messages
Dec 22 13:00:58 head kernel: LustreError:
19998:0:(file.c:1052:ll_glimpse_size()) obd_enqueue returned rc -5,
returning -EIO
Dec 22 13:00:58 head kernel: LustreError:
19998:0:(file.c:1052:ll_glimpse_size()) Skipped 195 previous similar
messages
Dec 22 13:01:00 head kernel: LustreError:
3248:0:(osc_request.c:1301:osc_brw_redo_request()) @@@ redo for
recoverable error req@000001017aa67c00 x91546/t0
o4->lustre-OS...@172.17.103.18@o2ib:28 lens 384/352 ref 2 fl
Complete:E/0/0 rc -5/-22
Dec 22 13:01:00 head kernel: LustreError:
3248:0:(osc_request.c:1301:osc_brw_redo_request()) Skipped 1 previous
similar message
Dec 22 13:01:00 head kernel: LustreError:
3248:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID
req@00000100080d9400 x92325/t0
o4->lustre-OS...@172.17.103.18@o2ib:28 lens 384/352 ref 2 fl
Rpc:/0/0 rc 0/0
Dec 22 13:01:00 head kernel: LustreError:
3248:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID
req@00000100f9fda200 x93266/t0
o4->lustre-OS...@172.17.103.18@o2ib:28 lens 384/352 ref 2 fl
Rpc:/0/0 rc 0/0
Dec 22 13:01:00 head kernel: LustreError:
3248:0:(client.c:519:ptlrpc_import_delay_req()) Skipped 35 previous
similar messages
Dec 22 13:01:00 head kernel: LustreError: Skipped 5 previous similar
messages
Dec 22 13:01:00 head kernel: LustreError:
19947:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID
req@00000102faba0200 x91689/t0
o13->lustre-OS...@172.17.103.18@o2ib:28 lens 128/272 ref 1 fl
Rpc:RS/0/0 rc -107/-107
Dec 22 13:01:00 head kernel: LustreError:
19947:0:(client.c:519:ptlrpc_import_delay_req()) Skipped 5 previous
similar messages
Dec 22 13:01:00 head kernel: LustreError:
19947:0:(llite_lib.c:1520:ll_statfs_internal()) obd_statfs fails: rc =
-5
Dec 22 13:01:00 head kernel: LustreError:
20029:0:(llite_lib.c:1520:ll_statfs_internal()) obd_statfs fails: rc =
-5
Dec 22 13:01:00 head kernel: Lustre:
lustre-OST0000-osc-000001018197f800: Connection restored to service
lustre-OST0000 using nid 172.16.100.41@tcp.
Dec 22 13:09:10 head kernel: LustreError: 11-0: an error occurred while
communicating with 172.16.100.41@tcp. The ldlm_enqueue operation failed
with -107
Dec 22 13:09:10 head kernel: Lustre:
lustre-OST0000-osc-000001018197f800: Connection to service
lustre-OST0000 via nid 172.16.100.41@tcp was lost; in progress
operations using this service will wait for recovery to complete.
Dec 22 13:09:24 head kernel: LustreError:
3268:0:(ldlm_request.c:909:ldlm_cli_cancel_req()) Got rc -107 from
cancel RPC: canceling anyway
Dec 22 13:09:24 head kernel: LustreError: 167-0: This client was evicted
by lustre-OST0000; in progress operations using this service will fail.
Dec 22 13:09:24 head kernel: LustreError:
23487:0:(llite_mmap.c:206:ll_tree_unlock()) couldn't unlock -5
Dec 22 13:09:24 head kernel: LustreError:
23487:0:(file.c:1052:ll_glimpse_size()) obd_enqueue returned rc -5,
returning -EIO
Dec 22 13:09:24 head kernel: LustreError:
23487:0:(file.c:1052:ll_glimpse_size()) Skipped 131 previous similar
messages
Dec 22 13:09:24 head kernel: LustreError:
23487:0:(namei.c:1132:ll_objects_destroy()) obd destroy objid 0xa485b6
error -5
Dec 22 13:09:24 head kernel: Lustre:
lustre-OST0000-osc-000001018197f800: Connection restored to service
lustre-OST0000 using nid 172.16.100.41@tcp.

I also have 56 Lustre dump logs from a span of about 2 minutes. I have
attached lustre.logs.tar.gz with all of the dumps.

If anyone has any suggestions, I would appreciate it.

Denise

lustre.logs.tar.gz

Andreas Dilger

Dec 22, 2008, 10:49:53 PM
to Denise Hummel, lustre-...@lists.lustre.org
On Dec 22, 2008 13:22 -0700, Denise Hummel wrote:
> Dec 22 13:00:44 oss1 kernel: LustreError: 138-a: lustre-OST0000: A
> client on nid 172.16.100.1@tcp was evicted due to a lock blocking
> callback to 172.16.100.1@tcp timed out: rc -107
> Dec 22 13:00:44 oss1 kernel: LustreError:
> 27250:0:(ost_handler.c:1065:ost_brw_write()) @@@ Eviction on bulk GET
> req@00000100bff5c800 x91545/t0
> 27250:0:(ost_handler.c:1205:ost_brw_write()) lustre-OST0000: ignoring
> bulk IO comm error with

These messages could relate to network problems on the oss1 node. That
said, it is most telling if only oss1 is showing these messages. In
particular, "eviction on bulk GET" indicates the network stopped working
in the middle of a data transfer.
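
A quick sanity check of basic LNET connectivity between the OSS and the
affected client might also help (just a sketch - the NID below is the
evicted client from your logs, run from oss1):

  lctl ping 172.16.100.1@tcp

If even that is slow or fails intermittently, the problem is below
Lustre.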


> The messages in the syslog on the login node are:

> lustre-OST0000-osc-000001018197f800: Connection to service
> lustre-OST0000 via nid 172.16.100.41@tcp was lost; in progress
> operations using this service will wait for recovery to complete.

This is just the client's version of the same issue.


Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

Denise Hummel

Dec 23, 2008, 8:45:09 AM
to Andreas Dilger, lustre-...@lists.lustre.org
Hi;

Thanks. I have suspected the network but have not been able to
pinpoint the problem. I have looked at the Ethernet and InfiniBand
switches - found a few with IGMP turned on and some multicast issues.
Those have been fixed. I am checking the network stats on the OSS, MDT
and nodes and find a few dropped packets; however, the system stats do
not indicate a heavy load during that time. If anyone has any
suggestions on anything else I can look at, please let me know.
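
For reference, the per-interface checks I am running look roughly like
this (eth0 is just an example name - substitute the actual interface):

  ip -s link show eth0                             # RX/TX packet, error and drop counters
  ethtool -S eth0 | grep -iE 'drop|err|discard'    # NIC/driver-level counters, if the driver exposes them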

Thanks for all of your help,
Denise

Isaac Huang

Dec 23, 2008, 8:04:17 PM
to Denise Hummel, Andreas Dilger, lustre-...@lists.lustre.org
On Tue, Dec 23, 2008 at 06:45:09AM -0700, Denise Hummel wrote:
> Hi;
>
> Thanks. I have suspected the network, however have not been able to
> pinpoint the problem. I have looked at the ethernet and infiniband
> switches - found a few with IGMP turned on and some multicast issues.
> Those have been fixed. I am checking the network stats on the oss, mdt
> and nodes and find a few dropped packets, however the system stats do
> not indicate a heavy load during that time. If anyone has any
> suggestions on anything else I can look at please let me know.

Turning on console logging of low-level network errors may reveal
something useful:

echo +neterror > /proc/sys/lnet/printk
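
The current mask can be read back from the same file, and the flag can
be removed again with a leading '-' (assuming the usual +/- semantics
of the LNET printk mask):

  cat /proc/sys/lnet/printk                 # show message classes currently sent to the console
  echo -neterror > /proc/sys/lnet/printk    # turn the extra logging back off later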

Isaac

Denise Hummel

Dec 29, 2008, 10:32:50 AM
to Isaac Huang, Andreas Dilger, lustre-...@lists.lustre.org
Hi;

Thanks for all of the help and suggestions. I was able to narrow the
problem down to two racks of Intel nodes. Checking the HP ProCurve
switches in each rack showed that LACP was enabled on all of the ports,
causing ports to be randomly blocked by LACP with off-line and on-line
messages. Lustre was being accurate: the nodes were timing out
(repeatedly), and I think that caused a cascade effect on the rest of
the nodes. I turned off LACP on Dec. 24 and have not had a Lustre
timeout since.
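
For anyone who runs into the same problem, the ProCurve commands were
roughly as follows (quoting from memory - verify against your switch's
documentation before use):

  show lacp            # list ports and their LACP (active/passive) status
  configure
  no int all lacp      # disable LACP on every port
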
Again, thanks for all of your help,
Denise
