No alert at SmartFail start


Chris Pepper

Feb 10, 2015, 11:23:29 AM
to isilon-u...@googlegroups.com
We have converged our clusters on OneFS 7.1.0.5 and found a problem. In earlier versions we received an alert (and email) when SmartFail began, and another when it completed. This kept us informed and provided advance notice while SmartFail ran (typically for a day). When SmartFail completed and the drive was ready for replacement, we typically had a new replacement drive from EMC onsite.

Unfortunately EMC had too much trouble with people replacing readable drives during SmartFail, which increases the risk of data loss. So as of 7.1.0.5 they removed the OneFS event when SmartFail begins, leaving the event when SmartFail completes -- the idea being only to notify the user when there is something useful (safe) to do.

There are several problems with the new scheme.

1) We need to know the state of our cluster. By removing the alert that a drive failed, we are ignorant of an important piece of cluster health.

2) The change was not complete. The node still goes into ATTENTION status (even though EMC does not want us to pay attention at this stage) but there is no alert to explain why. We receive a Nagios SNMP alert but the cause is missing from "isi alerts" / "isi events" output. Clearly if the cluster is demanding ATTENTION, "isi alerts" should tell us what the issue is.
The workaround is to find the ATTENTION node with "isi status" and then check its status with "isi status -n NODE". The drive which is not 'OK' is likely in the middle of SmartFail. Alternatively an active FlexProtect/FlexProtectLin job means SmartFail is running.

3) I have asked under multiple cases why we were not notified of SmartFail beginning. We spent a lot of time debugging celog when the issue is apparently that EMC removed an important bit of functionality and didn't notify users (in release notes) or adequately train Support personnel.

4) When SmartFail begins EMC should ship a new drive. Like SmartFail itself, drive delivery takes about a day, so previously the replacement was typically onsite about when the old drive was ready for removal. In the new scheme we and EMC are only notified when SmartFail *completes*, so we have no time to get the new drive in place before it is needed. This wastes time with reduced cluster capacity.

If your cluster starts complaining and the cause is not obvious, look for a stealthy SmartFail in progress.
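
As a rough illustration of the workaround in item 2, here is a minimal Python sketch that scans captured "isi status" (or "isi status -n NODE") output for lines mentioning ATTENTION or FlexProtect. It only filters text by keyword. It assumes nothing about the layout of the output, and whether those words appear in the output of your OneFS release is itself an assumption. The names find_suspect_lines and check are made up for this example and are not part of OneFS.

```python
# Keywords taken from this thread. The exact layout of "isi status" output is
# NOT assumed; we only look for these words anywhere on a line.
KEYWORDS = ("ATTENTION", "FlexProtect")


def find_suspect_lines(text, keywords=KEYWORDS):
    """Return the lines of text that mention any keyword (case-insensitive)."""
    lowered = [k.lower() for k in keywords]
    return [
        line.rstrip()
        for line in text.splitlines()
        if any(k in line.lower() for k in lowered)
    ]


def check(stream):
    """Print suspect lines read from stream; return 1 if any were found, else 0.

    The 0/1 return value is a hypothetical convention for this sketch, meant
    to be passed to sys.exit() by a wrapper script.
    """
    hits = find_suspect_lines(stream.read())
    for line in hits:
        print(line)
    return 1 if hits else 0
```

One way to use it, assuming Python 3 on a monitoring host: pipe the output of "isi status" (run over ssh, for instance) into a small wrapper that calls sys.exit(check(sys.stdin)). A non-zero exit then means "go look for a SmartFail in progress", not that one is confirmed.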

If you want to be notified of SmartFail initiation, I suggest you notify your sales rep (this is outside Support's control).

Regards,

Chris Pepper

Peter Serocka

Feb 11, 2015, 12:47:02 AM
to isilon-u...@googlegroups.com
Hi Chris

I absolutely agree with you here!

With which release did you still get the "SmartFail init notifications"?

We haven't seen any of these since we got our Isilon,
and it was shipped with 6.5 in 2011.

These specific alert notifications might have been,
and still might be, "soft" configurable...

Cheers

-- Peter

Peter Serocka
CAS-MPG Partner Institute for Computational Biology (PICB)
Shanghai Institutes for Biological Sciences (SIBS)
Chinese Academy of Sciences (CAS)
320 Yue Yang Rd, Shanghai 200031, China
pser...@picb.ac.cn





Chris Pepper

Feb 11, 2015, 1:16:25 AM
to isilon-u...@googlegroups.com
On Feb 11, 2015, at 12:46 AM, Peter Serocka <pser...@picb.ac.cn> wrote:
>
> Hi Chris
>
> I absolutely agree with you here!
>
> With which release did you still get the "SmartFail init notifications"?

Peter,

I don't know. I have asked in several cases what was wrong with celog and the event notification, and only got the real answer last week. Now I know that part of our problem was celog bugs and the other part was this removed event, so clearing celog was a red herring that did not address this issue.

> We haven't seen any of these since we got our Isilon,
> and it was shipped with 6.5 in 2011.
>
> These specific alert notifications might have been,
> and still might be, "soft" configurable...

No, I reviewed the list of available events by type (error/warning/informational and category) in the web page where you can enable/disable email and appearance in "isi events", and the expected event for SmartFail initiation is missing. My Support rep confirmed the removal but was unable to learn when it happened. And this *really* should have been in the release notes.

Chris

Peter Serocka

Feb 11, 2015, 3:42:02 AM
to isilon-u...@googlegroups.com
The notification rules can be modified in the CLI
in a much less limited way than in the WebGUI.

But it doesn't help if no event is raised in the first place :(

And that part could still be hidden somewhere in sysctl or isi_gconfig,
so there might be a chance... but I have been out of luck so far.

-- Peter



