Random Node Reboots

Schuler, Laurence (GSFC-606.4)[ADNET SYSTEMS INC]

Jun 15, 2021, 11:36:00 AM
to isilon-u...@googlegroups.com
Hello all,
We have an old Isilon cluster of 14 X400 nodes (one is an X410), and recently we have been experiencing random node reboots; I'd say about 10 nodes reboot on a given day. I ran an IntegrityScan (while holding all other jobs), which took about 8 days but eventually completed OK. It seemed to help for a while, but we are seeing reboots again. The cluster is running OneFS 8.1.2.0.
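
For reference, the job-engine commands I used were along these lines (from memory; exact syntax may differ slightly on 8.1.2):

  # Pause whatever is running so IntegrityScan has the cluster to itself
  isi job jobs list
  isi job jobs pause <job_id>     # repeat for each active job
  # Start the scan and keep an eye on it
  isi job jobs start IntegrityScan
  isi job status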

The most recent node reboot left this message in /var/log/messages right before the restart. Is this the culprit?
> /boot/kernel.amd64/kernel: sonewconn: pcb 0xfffff80287a869a0: Listen queue overflow: 193 already in queue awaiting acceptance (1 occurrences)

Has anyone seen this behavior? Is it a sign of OneFS corruption, or something else? All nodes are OK except for a few bad drives (to be replaced) and one boot SSD (also to be replaced; we have purchasing issues at the moment).

Suggestions?

Thanks,
--
Laurence Schuler (Larry) Laurence...@nasa.gov
Systems Support ADNET Systems, Inc
Scientific Visualization Studio https://svs.gsfc.nasa.gov
NASA/Goddard Space Flight Center, Code 606.4 phone: 1-301-286-3557
Greenbelt, MD 20771 cell: 1-410-739-0893



Erik Weiman

Jun 15, 2021, 11:46:13 AM
to isilon-u...@googlegroups.com
Listen queue overflows won’t cause a reboot.
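
That message just means some daemon's accept backlog filled up. If you're curious which listener, OneFS sits on FreeBSD, so something like this shows the listen queues (run as root on the node):

  # Current vs. maximum backlog for every listening socket
  netstat -Lan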

--
Erik Weiman
Sent from my iPhone 7


Anurag Chandra

Jun 15, 2021, 11:46:40 AM
to isilon-u...@googlegroups.com
Hi,

There should be a "Stack" section in /var/log/messages which will give you additional information.

grep for that and you will have some info. 
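
Something along these lines (the context counts are arbitrary):

  # Pull each panic stack with some surrounding context
  grep -B 5 -A 20 'Stack:' /var/log/messages
  # Rotated copies are compressed; bzgrep reads those directly
  bzgrep -B 5 -A 20 'Stack:' /var/log/messages.*.bz2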

Besides this, if the cluster is under contract, support can help analyse the memory dump generated on a reboot and figure out what happened.

Thanks
Anurag 


Schuler, Laurence (GSFC-606.4)[ADNET SYSTEMS INC]

Jun 15, 2021, 12:44:53 PM
to isilon-u...@googlegroups.com

This is what I see on two of the nodes:


Panic occurred in module kernel loaded at 0xffffffff80200000:

Stack: --------------------------------------------------
kernel:rbm_buf_timelock_panic_all_cb+0x10c
kernel:isi_buf_timelock_panic+0x2d2
kernel:getblk_locked+0x429
kernel:bam_get_buf+0x81
kernel:bam_read_mirrored_block+0x435
kernel:bam_read_block+0x35d
kernel:bam_read_range+0x117
kernel:bam_read+0x78f
kernel:bam_read_mbuf+0x97
kernel:bam_coal_read_wantlock+0x2e0
kernel:ifs_vnop_wrapunlocked_read_mbuf+0x264
kernel:VOP_UNLOCKED_READ_MBUF_APV+0xaa
isi_lwext.ko:lwextsvc_read+0x286
kernel:amd64_syscall+0x396
--------------------------------------------------

mandar kolhe

Jun 15, 2021, 1:30:17 PM
to isilon-u...@googlegroups.com
This might have rebooted due to a buf timeout. You might have to reach out to support to confirm, and they may tweak the buf timeout for you.
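
I don't recall the exact tunable name, so treat this as a sketch; you can hunt for candidates like this, but don't change anything without support on the line:

  # Look for buf/timelock-related sysctls (names vary by OneFS release)
  sysctl -a | grep -i -e timelock -e buf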

Coolgoose

Jul 22, 2021, 2:40:31 PM
to Isilon Technical User Group
How did this go? We recently had a similar issue and worked with EMC support.

Is your issue resolved? Let me know if there is anything I can help you with.

Thank you

kenn...@gmail.com

Nov 21, 2022, 11:05:04 PM
to Isilon Technical User Group
Did anyone find a solution?
I'm having exactly the same error.


panic @ time 1640130093.395, thread 0xfffff803c6913000: BUF_TIMELOCK: Waited more than 300 seconds for lock on 0xfffffe0bed99f0b8 (lock access type: 0x208900; wmesg: getblk) -- lockinfo: lock state: EXCL (recursed 0), held by: 0xfffffffffffffff0; buf_track: getblk_locked extrainfos: (0: td: 0x0; flags: 209b00; time: 4180604523; 1: td: 0x0; flags: 209b00; time: 4156139190; 2: td: 0x0; flags: 209900; time: 1414165694); ext_fields: (b_ext = 0xfffffe0bed99f468; b_trans_item: 0x0; b_shadow_item: 0x0; b_ifs_type: 1; b_source_baddr: 08ee4c690007000f);
cpuid = 7
Panic occurred in module kernel loaded at 0xffffffff80200000:

Stack: --------------------------------------------------
kernel:rbm_buf_timelock_panic_all_cb+0x10c
kernel:isi_buf_timelock_panic+0x2d2
kernel:getblk_locked+0x429
kernel:bam_get_buf+0x81
kernel:bam_read_mirrored_block+0x435
kernel:bam_read_block+0x35d
kernel:bam_read_range+0x117
kernel:bam_read+0x78f
kernel:bam_read_mbuf+0x97
kernel:bam_coal_read_wantlock+0x2e0
kernel:ifs_vnop_wrapunlocked_read_mbuf+0x264
kernel:VOP_UNLOCKED_READ_MBUF_APV+0xaa
isi_lwext.ko:lwextsvc_read+0x286
kernel:amd64_syscall+0x396
--------------------------------------------------
Disabling swatchdog
Dumping stacks (40960 bytes)

Jon Lasser

Nov 21, 2022, 11:14:43 PM
to isilon-u...@googlegroups.com
My Isilon knowledge is a few years out of date but that stack is familiar.

This is a panic to avoid a potential deadlock. Either there’s a deadlock bug in your version of OneFS, or the system performance is being crushed on that device (or a drive within that device). Without performance data or the write plan, it’s hard to know which. (The write plan is a dotty diagram—if there’s a loop, there’s a deadlock; no loop, generally a performance issue ranging from a dying hard drive to data imbalance to overall load.)
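
If you want a quick first pass before (or while) engaging support, the per-drive counters will usually expose a dying disk. On 8.x the CLI is roughly this (flags vary by release):

  # Drive health/state per node
  isi devices drive list
  # Per-drive performance; look for one drive with outsized queue depth or latency
  isi statistics drive --nodes all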

Jon
-- 
Web: twoideas.org / Twitter: @disappearinjon / Mailing list: bit.ly/difficultsf
