Disk replacement and firmware upgrade procedure


Jean-Baptiste Denis

Feb 4, 2014, 9:34:37 AM
to isilon-u...@googlegroups.com
Hi,

I've posted the question on the community forum, but no answer so far.

https://community.emc.com/thread/188092

I also asked support a few weeks ago and just got an answer: if the disk
you're shipped does not match the firmware you want, you have to upgrade
it using the standard procedure via isi_disk_firmware_reboot, hence a
full node reboot, just for a disk replacement.

That sounds crazy to me.

How are you dealing with disk replacement on a day-to-day basis? Are
you just adding the disk without checking its firmware version? Or should
the shipped disk already be running the latest firmware?

I'm asking this because I had to deal with a firmware upgrade in the
middle of a "crisis", and I don't want to have to do that again,
especially as it had nothing to do with the problem but was REQUIRED by
support before any further investigation.

Jean-Baptiste

Cory Snavely

Feb 4, 2014, 9:55:52 AM
to isilon-u...@googlegroups.com
We do that, and yes, it's ridiculous.

Jerry Uanino

Feb 4, 2014, 9:57:29 AM
to isilon-u...@googlegroups.com
We've had to deal with this. It sucks.
One thing I might recommend, particularly to get out of a crisis, is to have drives at today's firmware on hand. Having 4 hot spares or some small number at your current firmware level will prevent this mess. Now, as you replace them you might get ones on newer firmware, but either way, when you are in crisis mode, at least you won't have to worry about one disk killing you. We used to keep 4 per cluster handy for this. I can't remember if we paid for them or if we just insisted, but it's not an unreasonable cost to keep your sanity.
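For anyone who tracks their on-hand spares in a script or spreadsheet, here is a minimal sketch of the idea above: pick a spare whose firmware already matches what the cluster runs, so no update is needed at replacement time. Everything in it is an assumption for illustration (the `Spare` record, the serials, the model and firmware strings); it does not call or parse any Isilon command.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Spare:
    serial: str    # hypothetical identifier for a drive on the shelf
    model: str     # hypothetical drive model string
    firmware: str  # hypothetical firmware revision string


def pick_spare(spares: Iterable[Spare], model: str,
               cluster_firmware: str) -> Optional[Spare]:
    """Return the first spare matching both model and firmware, else None."""
    for spare in spares:
        if spare.model == model and spare.firmware == cluster_firmware:
            return spare
    return None


# Made-up shelf inventory: two spares of the same model at different firmware.
spares = [
    Spare("SN-A", "MODEL-X", "FW-1"),
    Spare("SN-B", "MODEL-X", "FW-2"),
]

match = pick_spare(spares, "MODEL-X", "FW-2")
print(match.serial if match else "no matching spare on hand")
```

If nothing matches, you are back to the situation described in this thread: the replacement would go in at a different firmware level.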

--
You received this message because you are subscribed to the Google Groups "Isilon Technical User Group" group.
To unsubscribe from this group and stop receiving emails from it, send an email to isilon-user-gr...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Jerry Uanino

Feb 4, 2014, 9:58:57 AM
to isilon-u...@googlegroups.com
And Andrew Stack posted earlier about the "dear santa" list for Isilon.  Disk firmware was in there.

Peter Serocka

Feb 4, 2014, 10:10:11 AM
to isilon-u...@googlegroups.com
Fully agree as for the isi_disk_firmware_reboot mechanism,
but out of curiosity, what kind of “crisis” situations
are we talking about here? Full cluster — I’d understand that,
but when else would you say, we NEED to have this
disk replaced first?

Thanks

— Peter

Jean-Baptiste Denis

Feb 4, 2014, 11:14:14 AM
to isilon-u...@googlegroups.com
Jerry Uanino wrote :

> Having 4 hot spares or some small number at your current firmware
> level will prevent this mess.

Thank you for the suggestion !

Peter Serocka wrote:

> but out of curiosity, what kind of "crisis" situations
> are we talking about here? Full cluster -- I'd understand that,
> but when else would you say, we NEED to have this
> disk replaced first?

I'm not sure that I understand you correctly.

I was in the following situation: no disks to replace, but "old"
firmware and multiple, possibly correlated, problems: long jobs were
systematically cancelled by the system, NFS lock problems... EMC support
required a disk firmware upgrade before continuing the investigation. It
didn't help. I'll spare you the details.

My concern here is to never be in this situation again.


Jerry Uanino

Feb 4, 2014, 11:19:26 AM
to isilon-u...@googlegroups.com
The fact that Isilon can't do non-disruptive upgrades like their competitor should be enough to motivate them not to cause other unnecessary cluster-wide reboots.
NetApp still wins on NDUs.

I can't remember why we ran into this or why it was a big deal, but it was at the time. (Maybe it was a OneFS upgrade stalled by a drive we couldn't replace without a cluster-wide reboot.)
In our environment a cluster-wide reboot is painful and developers notice.  Isilon needs to pick up the slack on this.  We'd prefer the entire cluster drop down to serving traffic from one non-redundant node rather than have it stop serving traffic for a few minutes.




Jerry Uanino

Feb 4, 2014, 11:22:08 AM
to isilon-u...@googlegroups.com
Also, in my "very strong" opinion, they should be able to pop a drive into a Supermicro server and flash it to whatever firmware we want, to get out of those scenarios.  We have had HP do this for us when we got into logjams with some of our servers. Unless the hardware rev of the drive prevents it, I'd rather deal with the firmware bug than deal with a cluster reboot while I am working on another problem.

Peter Serocka

Feb 4, 2014, 11:23:26 AM
to isilon-u...@googlegroups.com
Thanks - that perfectly answered my question.

(My thought was, leave the disk bay empty
until “crisis” is over — but might not
be appropriate in all situations.)

Cory Snavely

Feb 4, 2014, 11:24:09 AM
to isilon-u...@googlegroups.com
FWIW, my process is to have achieved a baseline of consistent disk and
node firmware, and then check firmware after any hardware replacements
and upgrade if necessary.

As for the suggestion to keep a few cold spares, we do that too, but to
get them to the right firmware version, they have to be put in a live
node. If you order them from EMC, they'll be stale as soon as a new
version is released, which is only a matter of time.

So, I very much wish the hardware replacement processes would image the
replacement part to the right firmware when it's added; it's my
understanding from the "Christmas List" discussion that it works that
way with other storage systems; moreover, it's the only reasonable way
for it to work.
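
The "check firmware after any hardware replacement" step can be sketched roughly as below. This is only an illustration under stated assumptions: the inventory tuples and the per-model baseline are made-up data that you would have to collect yourself, and nothing here calls or parses a real Isilon command.

```python
from typing import Dict, Iterable, List, Tuple

# (node, bay, model, firmware) -- a hypothetical record layout
Drive = Tuple[int, int, str, str]


def find_stale_drives(drives: Iterable[Drive],
                      baseline: Dict[str, str]) -> List[Tuple[int, int, str, str, str]]:
    """List drives whose firmware differs from the baseline for their model.

    Drives whose model has no baseline entry are skipped rather than flagged.
    """
    stale = []
    for node, bay, model, firmware in drives:
        expected = baseline.get(model)
        if expected is not None and firmware != expected:
            stale.append((node, bay, model, firmware, expected))
    return stale


# Made-up inventory taken after a replacement in node 1, bay 2.
inventory = [
    (1, 1, "MODEL-X", "FW-2"),
    (1, 2, "MODEL-X", "FW-1"),  # replacement drive shipped at older firmware
    (2, 1, "MODEL-Y", "FW-7"),
]
baseline = {"MODEL-X": "FW-2", "MODEL-Y": "FW-7"}

for node, bay, model, firmware, expected in find_stale_drives(inventory, baseline):
    print(f"node {node} bay {bay}: {model} at {firmware}, baseline {expected}")
```

The baseline is passed in explicitly rather than guessed from the most common version, since the point of the process is that you decide what "consistent" means for your cluster.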

Erik Weiman

Feb 4, 2014, 1:55:03 PM
to isilon-u...@googlegroups.com
What stops the end user from grabbing the new firmware file from the
drive firmware package, putting the drive in a desktop PC, and
flashing the new firmware onto the drive before putting it in the cluster?
For any new firmware to really be used the hardware needs to be
reinitialized with the new version so I don't see how you could update
drive firmware in the node without needing to force a reboot.

--
Erik Weiman
Sent from my iPhone 4

Ozen

Feb 5, 2014, 3:28:24 AM
to isilon-u...@googlegroups.com, jbd...@pasteur.fr
Hi,

Today a patch was released for a related issue on v6.5.5.23.

Isilon OneFS Patch-120551
This patch addresses an issue where replacement drives cannot be installed in a node unless that node is shut down and rebooted using the shutdown command.

Hope that helps.

Jean-Baptiste Denis

Feb 5, 2014, 4:08:07 AM
to Ozen, isilon-u...@googlegroups.com
On 02/05/2014 09:15 AM, Ozen wrote:
> Hi,
>
> Today a patch released for this issue.
>
> Isilon OneFS Patch-120551
> <https://download.emc.com/downloads/DL51969_Isilon-OneFS-Patch-120551.tgz>
> This patch addresses an issue where replacement drives cannot be
> installed in a node unless that node is shut down and rebooted using
> the shutdown command.

From the description, I'm not sure that this patch addresses our problem.

Daniel Cornel

Mar 5, 2014, 9:36:29 AM
to isilon-u...@googlegroups.com
Hi Erik.  I believe you are entirely correct.  If you can flash the disk firmware with the specific package needed, then yes, you should be able to update it to what is current in the system, and then no update is needed within the system.
Also, my understanding is that there is no requirement to have all disk firmware at a particular level.  The system is capable of running a mixed-firmware environment (though I have gone to great lengths to update all of ours).

Daniel Cornel

Mar 5, 2014, 9:39:37 AM
to isilon-u...@googlegroups.com, jbd...@pasteur.fr
I was advised by an Isilon specialist that dissimilar disk firmware is 100% acceptable.  The only firmware that needs to be the same across the cluster is the InfiniBand firmware.  If you were told anything that disagrees with that, please post any reasons they gave you.

Cory Snavely

Mar 5, 2014, 9:52:42 AM
to isilon-u...@googlegroups.com
In troubleshooting problems on our clusters with persistent drive
stalling, EMC support recommended consistent firmware across the
cluster. This was verbal, over a webex, so I can't post exact wording.
The general gist of it was when the engineer saw I had differing
versions of node and drive firmware, the reaction was something like
"whoa, we need to update that right away; that can cause all sorts of
problems". After the upgrade, we ran our first successful Collect in
over a year, and that has continued, so there does empirically appear to
be some truth to it. It seems others have gotten this advice explicitly
or implicitly as well, and I understand the water is muddy on this
issue, but personally I believe EMC's own internal knowledge on this is
unclear and inconsistent. That being the case, and given that I'm
inclined to act on experience rather than advice in general, on this
one, I'm convinced that consistently upgraded firmware is the way to go.

Daniel Cornel

Mar 5, 2014, 9:54:55 AM
to isilon-u...@googlegroups.com
My experience agrees with yours. Perhaps you can sway support into sending disks at a particular patch level...

Cory Snavely

Mar 5, 2014, 10:10:01 AM
to isilon-u...@googlegroups.com, Daniel Cornel
Even if that were possible, the inability to upgrade firmware at
replacement time is just problematic, particularly for anyone who keeps
cold spares, like we do, as they will ultimately be stale.

As far as pursuing an enhancement, I doubt I have any traction on my
own, but I think we've established this as a deficiency in the system
that affects us all. I suppose if a group of us were to "petition" EMC
by signing on to an articulation of the problem and a request for
enhancement, it could carry some weight.