OneFS 7.1.1.2 / "Journal backup validation failed" after reboot


jamie

Aug 14, 2017, 10:40:55 PM
to Isilon Technical User Group
Hi guys,

I had an issue today where the cluster became unresponsive and SMB was down (I also couldn't SSH into many of the nodes, and the web interface was barely responsive). I saw broken pipe errors on some nodes when I ran cluster-wide commands to retrieve health status, so I issued an 'isi config' followed by 'reboot all' to clear the issue. Once the nodes came back online, the majority came back with attention status and "Journal backup validation failed" errors. I looked at the jobs list and there's a FlexProtect job running.

After the FlexProtect job finishes, will the cluster return to normal? Obviously the journal error is a critical integrity error.

Managed to squeeze out a 'kldload panic_me' on one of the responsive nodes but the logs are inconclusive.

Nodes were up for 280 days.


Jerry Uanino

Aug 15, 2017, 6:57:06 AM
to isilon-u...@googlegroups.com
Sounds like you lost quorum.
Not sure how you made out, but I suspect you will need support on that one. I believe they can extract data from the journal in NVRAM, but I've not had a scenario where more than one machine had this problem concurrently.

Is your cluster read-only?
Post an isi stat if you can.
Also, if you log in to each node, is /ifs mounted?

Jerry Uanino
--
You received this message because you are subscribed to the Google Groups "Isilon Technical User Group" group.
To unsubscribe from this group and stop receiving emails from it, send an email to isilon-user-gr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

jamie

Aug 15, 2017, 10:48:00 AM
to isilon-u...@googlegroups.com
Hi Jerry,

Thanks for the response. Cluster seems perfectly fine otherwise - read / write capable and if I clear the failed notification the nodes report healthy. Anything else I can do to check? Suggest another restart?

Cluster Name: ISILON
Cluster Health:     [  OK ]
Cluster Storage:  HDD                 SSD Storage    
Size:             225T (227T Raw)     0 (0 Raw)      
VHS Size:         2.1T                
Used:             167T (74%)          0 (n/a)        
Avail:            58T (26%)           0 (n/a)        

                   Health  Throughput (bps)  HDD Storage      SSD Storage
ID |IP Address     |DASR |  In   Out  Total| Used / Size     |Used / Size
-------------------+-----+-----+-----+-----+-----------------+-----------------
  1|192.168.3.150  | OK  |    0|  528|  528|  23T/  32T( 73%)|(No Storage SSDs)
  2|192.168.3.151  | OK  |    0|   24|   24|  24T/  32T( 74%)|(No Storage SSDs)
  3|192.168.3.152  | OK  |    0|   32|   32|  24T/  32T( 74%)|(No Storage SSDs)
  4|192.168.3.153  | OK  | 459K|   48| 459K|  24T/  32T( 74%)|(No Storage SSDs)
  5|192.168.3.154  | OK  |  47M|   16|  47M|  24T/  32T( 74%)|(No Storage SSDs)
  6|192.168.3.155  | OK  |    0|   22|   22|  24T/  32T( 74%)|(No Storage SSDs)
  7|192.168.3.156  | OK  |    0|   12|   12|  24T/  32T( 74%)|(No Storage SSDs)
-------------------+-----+-----+-----+-----+-----------------+-----------------
Cluster Totals:          |  47M|  682|  47M| 167T/ 225T( 74%)|(No Storage SSDs)

     Health Fields: D = Down, A = Attention, S = Smartfailed, R = Read-Only     

Critical Events:


Cluster Job Status:

No running jobs.

Paused and waiting jobs:                                                        
Job                        Impact Pri Policy     Phase Run Time   State        
-------------------------- ------ --- ---------- ----- ---------- -------------
ShadowStoreProtect[2728]   Low    6   LOW        1/1   0:00:00    System Paused 
WormQueue[2726]            Low    6   LOW        1/1   0:00:00    System Paused 
MultiScan[2724]            Low    4   LOW        1/4   0:06:01    System Paused 
        (Actions: Collect, AutoBalance)

No failed jobs.

Recent job results:                                                             
Time            Job                        Event                          
--------------- -------------------------- ------------------------------ 
08/15 03:44:06  MultiScan[2727]            User Cancelled 
08/15 03:17:02  FlexProtect[2725]          System Cancelled 
08/14 20:00:06  ShadowStoreProtect[2723]   Succeeded (LOW) 
08/14 04:00:13  ShadowStoreProtect[2722]   Succeeded (LOW) 
08/14 02:00:24  WormQueue[2721]            Succeeded (LOW) 
08/13 20:00:30  ShadowStoreProtect[2720]   Succeeded (LOW) 
08/13 04:00:10  ShadowStoreProtect[2719]   Succeeded (LOW) 
08/13 02:00:21  WormQueue[2718]            Succeeded (LOW) 



jamie

Aug 16, 2017, 11:32:11 AM
to isilon-u...@googlegroups.com
What do you think Jerry, a non-issue? Or am I in trouble?

Peter Serocka

Aug 16, 2017, 11:49:11 AM
to isilon-u...@googlegroups.com
double-check:
isi readonly list -v
isi_for_array -s isi_checkjournal

Also, no jobs should rest in “System Paused” state,
and no jobs should repeatedly become “System Cancelled”.
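
Those two rules can be checked mechanically against a saved copy of the status output. A minimal Python sketch, assuming the text has the same layout as the isi stat paste earlier in this thread (a `Name[ID]` entry followed by the state or event); `job_warnings` is a made-up helper name, not an Isilon tool:

```python
import re
from collections import Counter

# Matches job entries such as "MultiScan[2724]" in the pasted status text.
JOB_RE = re.compile(r"([A-Za-z]+)\[(\d+)\]")

def job_warnings(status_text):
    """Return (paused job names, job types the system cancelled more than once)."""
    paused = []
    cancelled = Counter()
    for line in status_text.splitlines():
        m = JOB_RE.search(line)
        if not m:
            continue
        name = m.group(1)
        if "System Paused" in line:
            paused.append(name)
        elif "System Cancelled" in line:
            cancelled[name] += 1
    repeated = [name for name, count in cancelled.items() if count > 1]
    return paused, repeated

# Lines copied from the isi stat output posted above.
sample = """\
ShadowStoreProtect[2728]   Low    6   LOW        1/1   0:00:00    System Paused
WormQueue[2726]            Low    6   LOW        1/1   0:00:00    System Paused
MultiScan[2724]            Low    4   LOW        1/4   0:06:01    System Paused
08/15 03:44:06  MultiScan[2727]            User Cancelled
08/15 03:17:02  FlexProtect[2725]          System Cancelled
"""
paused, repeated = job_warnings(sample)
print(paused)    # ['ShadowStoreProtect', 'WormQueue', 'MultiScan']
print(repeated)  # []
```

On the posted output this flags three paused jobs and no repeated system cancellations (FlexProtect was cancelled by the system once).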

What do you find?

— Peter

jamie

Aug 16, 2017, 12:40:46 PM
to Isilon Technical User Group
Hi Peter,

Many thanks for the response. Here's what I found...

isi readonly show
node  mode          status
----  ------------  ----------------------------------------------
   1  read/write
   2  read/write
   3  read/write
   4  read/write
   5  read/write
   6  read/write
   7  read/write

Then all the nodes reported back the following verbatim:

isi_for_array -s isi_checkjournal

ISILON-1: Battery 1: Good (10) 

ISILON-1: Battery 2: Good (10) 

ISILON-1: Batteries appear good, cleared system-status-not-good flag in read-only mode stateBatteries appear good, cleared system-battery-sled flag in read-only mode stateisi_checkjournal finished successfully



Checked the job status; the jobs are paused by the system (not good), and "Cluster Is Degraded" is True (also not good).


isi job status
The job engine may not be fully running.
            Coordinator: 4
              Connected: True
     Disconnected Nodes: -
Down or Read-Only Nodes: True
       Statistics Ready: True
    Cluster Is Degraded: True
 Run Jobs When Degraded: False

Running and queued jobs:
ID   Type               State            Impact  Pri  Phase  Running Time 
--------------------------------------------------------------------------
2724 MultiScan          Paused by system Low     4    1/4    6m 1s        
2726 WormQueue          Paused by system Low     6    1/1    -            
2728 ShadowStoreProtect Paused by system Low     6    1/1    -            
--------------------------------------------------------------------------
Total: 3

Any ideas how I can do a cluster check to get it back on track?



Jerry Uanino

Aug 16, 2017, 3:00:42 PM
to isilon-u...@googlegroups.com
Looks healthy to me.
If you have support I'd file a support ticket for a log collection (isi_gather or whatever they call it) and see if they can analyze why it dumped.
You might have an NVRAM card on its way out or something weird.

It is weird that you have isi job status reporting cluster degraded and nodes read-only, but isi stat is good.
I'd open a support ticket to be sure.


Chris Pepper

Aug 16, 2017, 3:03:58 PM
to isilon-u...@googlegroups.com
During SmartFail, when either FlexProtect or FlexProtectLin is running, the job engine goes into 'degraded' mode and blocks most jobs; but "isi events" doesn't show anything until SmartFail *completes*. This wastes a day or two and bothers me on every SmartFail, but it reduces the number of times people remove a drive while data is being copied off by SmartFail.

Chris

Peter Serocka

Aug 16, 2017, 3:54:11 PM
to isilon-u...@googlegroups.com
Jamie, ideally one would let Isilon support handle a degraded cluster…
but I assume your cluster is not under a valid contract now.

Your bad “isi job status” result is inconsistent with the good “isi status” output,
so maybe we have a false alarm with the job engine here.
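
That inconsistency can be spelled out by comparing the two pastes. A rough Python sketch, assuming the layouts posted earlier in this thread (pipe-separated node rows from isi status, `Key: Value` lines at the top of isi job status); the function names are made up for illustration, and any health value other than "OK" is simply treated as flagged:

```python
def parse_node_health(status_text):
    """Map node ID -> health column from the pipe-separated node rows."""
    health = {}
    for line in status_text.splitlines():
        cols = [c.strip() for c in line.split("|")]
        if len(cols) >= 3 and cols[0].isdigit():
            health[int(cols[0])] = cols[2]
    return health

def parse_job_engine_header(text):
    """Collect the 'Key: Value' lines printed at the top of isi job status."""
    fields = {}
    for line in text.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
    return fields

# Two node rows and the header, copied from the outputs posted above.
node_rows = """\
  1|192.168.3.150  | OK  |    0|  528|  528|  23T/  32T( 73%)|(No Storage SSDs)
  2|192.168.3.151  | OK  |    0|   24|   24|  24T/  32T( 74%)|(No Storage SSDs)
"""
header = """\
            Coordinator: 4
              Connected: True
Down or Read-Only Nodes: True
    Cluster Is Degraded: True
 Run Jobs When Degraded: False
"""

nodes = parse_node_health(node_rows)
job_fields = parse_job_engine_header(header)
job_engine_unhappy = (job_fields.get("Cluster Is Degraded") == "True"
                      or job_fields.get("Down or Read-Only Nodes") == "True")
all_nodes_ok = bool(nodes) and all(h == "OK" for h in nodes.values())
mismatch = job_engine_unhappy and all_nodes_ok
print(mismatch)  # True: job engine reports degraded while every node row is OK
```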

Can you access KB 482617?
If not, here is briefly what it suggests as next step:

Check if all nodes have the master control processes (mcp) running:
# isi_for_array -s ps auxw | grep mcp | grep -v grep

For each NODENUMBER which doesn't have mcp running:
# isi_for_array -n NODENUMBER isi_mcp
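
If the filtered output is long, the nodes that printed nothing can be picked out with a few lines of Python. This is only a sketch: it assumes every line carries the `NAME-N:` prefix seen in the isi_checkjournal output earlier in the thread, that N is the node number, and that nodes are numbered 1 through the node count. The sample lines are placeholders, not real ps output.

```python
import re

# "ISILON-4: ..." -> 4 ; the prefix format is taken from the thread.
PREFIX_RE = re.compile(r"^\S+-(\d+):")

def nodes_missing_mcp(filtered_output, node_count):
    """Return node numbers that produced no line mentioning mcp."""
    seen = set()
    for line in filtered_output.splitlines():
        m = PREFIX_RE.match(line)
        if m and "mcp" in line:
            seen.add(int(m.group(1)))
    return sorted(set(range(1, node_count + 1)) - seen)

# Placeholder lines for a 7-node cluster where node 4 printed nothing.
sample = "\n".join(
    f"ISILON-{n}: (ps line containing isi_mcp)" for n in (1, 2, 3, 5, 6, 7)
)
missing = nodes_missing_mcp(sample, node_count=7)
print(missing)  # [4]
```

Each number in the result is a NODENUMBER for the second command above.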


— Peter

jamie

Aug 17, 2017, 9:31:49 AM
to Isilon Technical User Group
Thanks Jerry, I agree with you but no service contract in place.

Thanks Chris for the background. The FlexProtect job that started after the cluster failure ended up "System Cancelled".

Peter, thanks for the KB number. I was able to access it and determined this isn't the issue. All nodes are master/child.

Looks like there's a smartfailed drive in one node that didn't show up earlier (I swear). It says:

Lnum 21   [SMARTFAIL]     Last Known Bay N/A


It doesn't give me a bay number, so I can't tell which drive to replace. In the same node, after the cluster failure, I replaced a drive in Bay 33 / Lnum 36 that seems to be fine, reporting "Healthy."

I think this might be causing the cluster to report a degraded state.


Chris Pepper

Aug 17, 2017, 9:39:05 AM
to isilon-u...@googlegroups.com
Jamie,

That [SMARTFAIL] state looks like SmartFail is (or should be) underway, in which case the cluster is degraded but the alert won't show up until it finishes. I think you should have a FlexProtect or FlexProtectLin job.

If you aren't seeing the bay from "isi devices" on the node, try "isi status -n99", assuming the drive is in node 99.

Chris

jamie

Aug 17, 2017, 12:50:23 PM
to Isilon Technical User Group
Thanks Chris,

What was strange was that the SmartFail looked like it was hanging for a while, but after I toggled the drive to suspended and back to SmartFail, it completed. Yay. The node is now reporting healthy and the cluster is no longer reporting degraded in isi job status --verbose.

Funny, the drive bays on that node appear fine. I even went through isi_radish -a | more and not a single drive reported a SMART status that exceeded the threshold. The notification I received once SmartFail concluded on the drive was:

Disk Repair Complete: Bay 0, Type HDD, LNUM 21. Replace the drive according to the instructions in the OneFS Help system.

Funny, there isn't a Bay 0 on the system. Only Bay 1 - 36. Something isn't seeing eye to eye.
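
A notification like this can be sanity-checked against the chassis layout before anyone goes looking for the drive. A small Python sketch, assuming the message wording quoted above and the 1 - 36 bay range of this node; `parse_repair_event` is a made-up helper, not part of OneFS:

```python
import re

# Pulls the bay, drive type and LNUM out of the notification text.
EVENT_RE = re.compile(r"Bay (\d+), Type (\w+), LNUM (\d+)")

def parse_repair_event(message, bays=range(1, 37)):
    """Return the parsed fields plus whether the bay number physically exists."""
    m = EVENT_RE.search(message)
    if m is None:
        return None
    bay = int(m.group(1))
    return {
        "bay": bay,
        "type": m.group(2),
        "lnum": int(m.group(3)),
        "bay_exists": bay in bays,
    }

event = parse_repair_event(
    "Disk Repair Complete: Bay 0, Type HDD, LNUM 21. Replace the drive "
    "according to the instructions in the OneFS Help system."
)
print(event)
# {'bay': 0, 'type': 'HDD', 'lnum': 21, 'bay_exists': False}
```

When `bay_exists` is False, the LNUM is the only usable handle on the drive.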


Chris Pepper

Aug 17, 2017, 1:43:59 PM
to isilon-u...@googlegroups.com
We may have seen Bay 0 when the system didn't know where it *really* was. In the GUI it shows up as an extra bay below the 36 real bays in a 4U node.

Chris

John Beranek - PA

Sep 7, 2017, 5:12:31 PM
to Isilon Technical User Group
Sorry for the late reply, but isn't this ETA 209918?

  ETA 209918: Isilon OneFS: Nodes that have run for more than 248.5 consecutive days may restart without warning which may lead to potential data unavailability
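
The ETA title doesn't say where 248.5 comes from. One observation, offered as an assumption rather than the documented root cause: that is roughly how long a signed 32-bit counter ticking at 100 Hz takes to overflow. A quick Python check of the arithmetic, using the 280-day uptime from the first post (the 100 Hz rate and the 31 usable bits are assumptions):

```python
SECONDS_PER_DAY = 86_400
TICK_HZ = 100  # assumption: the counter increments 100 times per second

def days_until_wrap(bits=31, hz=TICK_HZ):
    """Days until a counter with `bits` usable bits overflows at `hz` ticks/s."""
    return (2 ** bits) / hz / SECONDS_PER_DAY

limit = days_until_wrap()
print(round(limit, 2))  # 248.55
print(280 > limit)      # True: 280 days of uptime is past that mark
```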

John

John Beranek - PA

Sep 7, 2017, 5:16:45 PM
to Isilon Technical User Group
The issue was fixed in OneFS 7.1.1.6.

John