Random Node Reboots

Schuler, Laurence (GSFC-606.4)[ADNET SYSTEMS INC]

Jun 15, 2021, 11:36:00 AM
to isilon-u...@googlegroups.com
Hello all,
We have an old Isilon cluster of 14 X400 nodes (one is an X410), and recently we have been experiencing random node reboots; I'd say about 10 nodes reboot on a given day. I ran an IntegrityScan (while holding all other jobs), which took about 8 days but eventually completed OK. It seemed to help for a while, but we are seeing reboots again. The cluster is running OneFS 8.1.2.0.
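
For reference, the job-engine commands I used were along these lines (from memory; exact syntax may differ slightly on 8.1.2):

  # Pause whatever is running so IntegrityScan has the cluster to itself
  isi job jobs list
  isi job jobs pause <job_id>     # repeat for each active job
  # Start the scan and keep an eye on it
  isi job jobs start IntegrityScan
  isi job status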

The most recent node reboot left this message in /var/log/messages right before the restart. Is this the culprit?
> /boot/kernel.amd64/kernel: sonewconn: pcb 0xfffff80287a869a0: Listen queue overflow: 193 already in queue awaiting acceptance (1 occurrences)

Has anyone seen this behavior? Is it a sign of OneFS corruption, or something else? All nodes are OK except for a few bad drives (to be replaced) and one boot SSD (also to be replaced; we have purchasing issues at the moment).

Suggestions?

Thanks,
--
Laurence Schuler (Larry) Laurence...@nasa.gov
Systems Support ADNET Systems, Inc
Scientific Visualization Studio https://svs.gsfc.nasa.gov
NASA/Goddard Space Flight Center, Code 606.4 phone: 1-301-286-3557
Greenbelt, MD 20771 cell: 1-410-739-0893



Erik Weiman

Jun 15, 2021, 11:46:13 AM
to isilon-u...@googlegroups.com
Listen queue overflows won’t cause a reboot.
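
That message just means some daemon's accept backlog filled up. If you're curious which listener, OneFS sits on FreeBSD, so something like this shows the listen queues (run as root on the node):

  # Current vs. maximum backlog for every listening socket
  netstat -Lan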

--
Erik Weiman
Sent from my iPhone 7


Anurag Chandra

Jun 15, 2021, 11:46:40 AM
to isilon-u...@googlegroups.com
Hi,

There should be a "Stack" section in /var/log/messages which will give you additional information.

grep for that and you will have some info. 
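
Something along these lines (the context counts are arbitrary):

  # Pull each panic stack with some surrounding context
  grep -B 5 -A 20 'Stack:' /var/log/messages
  # Rotated copies are compressed; bzgrep reads those directly
  bzgrep -B 5 -A 20 'Stack:' /var/log/messages.*.bz2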

Besides this, if the cluster is under contract, support can help analyse the memory dump generated on a reboot and figure out what happened.

Thanks
Anurag 


Schuler, Laurence (GSFC-606.4)[ADNET SYSTEMS INC]

Jun 15, 2021, 12:44:53 PM
to isilon-u...@googlegroups.com

This is what I see on two of the nodes:


Panic occurred in module kernel loaded at 0xffffffff80200000:

Stack: --------------------------------------------------
kernel:rbm_buf_timelock_panic_all_cb+0x10c
kernel:isi_buf_timelock_panic+0x2d2
kernel:getblk_locked+0x429
kernel:bam_get_buf+0x81
kernel:bam_read_mirrored_block+0x435
kernel:bam_read_block+0x35d
kernel:bam_read_range+0x117
kernel:bam_read+0x78f
kernel:bam_read_mbuf+0x97
kernel:bam_coal_read_wantlock+0x2e0
kernel:ifs_vnop_wrapunlocked_read_mbuf+0x264
kernel:VOP_UNLOCKED_READ_MBUF_APV+0xaa
isi_lwext.ko:lwextsvc_read+0x286
kernel:amd64_syscall+0x396
--------------------------------------------------

mandar kolhe

Jun 15, 2021, 1:30:17 PM
to isilon-u...@googlegroups.com
This might have rebooted due to a buf timeout. You might have to reach out to support to confirm, and they may tweak the buf timeout for you.
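
I don't recall the exact tunable name, so treat this as a sketch; you can hunt for candidates like this, but don't change anything without support on the line:

  # Look for buf/timelock-related sysctls (names vary by OneFS release)
  sysctl -a | grep -i -e timelock -e buf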

Coolgoose

Jul 22, 2021, 2:40:31 PM
to Isilon Technical User Group
How did this go? We recently had a similar issue and worked with EMC support.

Is your issue resolved? Let me know if there is anything I can help you with.

Thank you

kenn...@gmail.com

Nov 21, 2022, 11:05:04 PM
to Isilon Technical User Group
Did anyone find a solution?
I'm having exactly the same error.


panic @ time 1640130093.395, thread 0xfffff803c6913000: BUF_TIMELOCK: Waited more than 300 seconds for lock on 0xfffffe0bed99f0b8 (lock access type: 0x208900; wmesg: getblk) -- lockinfo: lock state: EXCL (recursed 0), held by: 0xfffffffffffffff0; buf_track: getblk_locked extrainfos: (0: td: 0x0; flags: 209b00; time: 4180604523; 1: td: 0x0; flags: 209b00; time: 4156139190; 2: td: 0x0; flags: 209900; time: 1414165694); ext_fields: (b_ext = 0xfffffe0bed99f468; b_trans_item: 0x0; b_shadow_item: 0x0; b_ifs_type: 1; b_source_baddr: 08ee4c690007000f);
cpuid = 7
Panic occurred in module kernel loaded at 0xffffffff80200000:

Stack: --------------------------------------------------
kernel:rbm_buf_timelock_panic_all_cb+0x10c
kernel:isi_buf_timelock_panic+0x2d2
kernel:getblk_locked+0x429
kernel:bam_get_buf+0x81
kernel:bam_read_mirrored_block+0x435
kernel:bam_read_block+0x35d
kernel:bam_read_range+0x117
kernel:bam_read+0x78f
kernel:bam_read_mbuf+0x97
kernel:bam_coal_read_wantlock+0x2e0
kernel:ifs_vnop_wrapunlocked_read_mbuf+0x264
kernel:VOP_UNLOCKED_READ_MBUF_APV+0xaa
isi_lwext.ko:lwextsvc_read+0x286
kernel:amd64_syscall+0x396
--------------------------------------------------
Disabling swatchdog
Dumping stacks (40960 bytes)

Jon Lasser

Nov 21, 2022, 11:14:43 PM
to isilon-u...@googlegroups.com
My Isilon knowledge is a few years out of date but that stack is familiar.

This is a panic to avoid a potential deadlock. Either there’s a deadlock bug in your version of OneFS, or the system performance is being crushed on that device (or a drive within that device). Without performance data or the write plan, it’s hard to know which. (The write plan is a dotty diagram—if there’s a loop, there’s a deadlock; no loop, generally a performance issue ranging from a dying hard drive to data imbalance to overall load.)
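
If you want a quick first pass before (or while) engaging support, the per-drive counters will usually expose a dying disk. On 8.x the CLI is roughly this (flags vary by release):

  # Drive health/state per node
  isi devices drive list
  # Per-drive performance; look for one drive with outsized queue depth or latency
  isi statistics drive --nodes all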

Jon
-- 
Web: twoideas.org / Twitter: @disappearinjon / Mailing list: bit.ly/difficultsf
