Isilon Node Failure


Justin

Apr 6, 2013, 10:11:35 PM
to isilon-u...@googlegroups.com
Hello, 
We have a 9-node Isilon system and recently had a failure, so we decided to completely reformat the entire system and start fresh. I talked to support, who helped me with the commands required to reformat the system and get it set back up. Most of the nodes reformatted properly and allowed me to set up a new cluster, but two nodes seem to be stuck in a boot loop: they boot up, begin to load the OS, then reboot, over and over again.
I captured the output from HyperTerminal during boot-up.
Does any of this make sense to anyone, and does anyone know how I can get these nodes online again? 
I appreciate any help! 

Justin
CAPTURE.TXT

Cory Snavely

Apr 7, 2013, 12:09:05 AM
to isilon-u...@googlegroups.com, Justin
That looks sort of similar to a failure mode we experienced. The fault
was eventually traced back to RAM.

If it is the same problem, you will probably be able to get past it by
swapping the first DIMM to a different position. However, I wouldn't
recommend running that way; replacing the RAM is the indicated solution
for the problem.
> --
> You received this message because you are subscribed to the Google
> Groups "Isilon Technical User Group" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to isilon-user-gr...@googlegroups.com.
> For more options, visit https://groups.google.com/groups/opt_out.
>
>

Justin

Apr 7, 2013, 12:15:16 AM
to isilon-u...@googlegroups.com, Justin
I did try replacing all 4GB of RAM in the machine, but it is still doing the same thing.
I noticed something in the capture about /root/ not being dismounted correctly, so is this something with the hard drive itself?
I also tried replacing all of the hard drives with the ones from another node, but I still got the same error. I have no idea anymore.

Cory Snavely

Apr 7, 2013, 10:10:30 AM
to isilon-u...@googlegroups.com, Justin
If you replaced the RAM with known-good RAM, then you're not
experiencing the same problem we had here.

The message about root not being unmounted cleanly is a consequence of
the node booting to the point where root is mounted and then
rebooting. That in and of itself appears to me to be an effect of the
rebooting behavior, not a causal factor.

It's possible you may have confused the node if you didn't put all the
drives back in the proper order.

In general, if you have nodes that won't boot and are covered under a
support agreement, EMC should just replace them. If you want to keep
trying DIY options, though, you could reimage the nodes that won't boot
using USB, OVT them to discover any hardware failures, address those,
and then join them to the cluster.

Ethan Richard

Apr 7, 2013, 10:29:45 AM
to isilon-u...@googlegroups.com, isilon-u...@googlegroups.com, Justin
What node type is it? Have you tried reimaging from a USB stick?

Typed with my thumbs

Luc Simard

Apr 7, 2013, 2:48:22 PM
to isilon-u...@googlegroups.com, isilon-u...@googlegroups.com, Justin
You should work through support; you potentially have a bad motherboard, or the PCI riser may be at fault.

It's hard to say without knowing your specific node type. We used active risers in the IQ series, which in their old age can cause this type of behavior.

Luc Simard - 415-793-0989
simard.j...@gmail.com
Messages may contain confidential information.
Sent from my iPhone

Luc Simard

Apr 7, 2013, 2:54:30 PM
to isilon-u...@googlegroups.com, isilon-u...@googlegroups.com, Justin
Be advised: if you have IQ Series nodes, check with your Isilon sales team on end-of-life and end-of-support notifications.

The IQ iSeries nodes do support v6.5.x, with some performance considerations, and do not support v7.0.x.

The IQ xSeries nodes have 64-bit CPUs and support v6.5.x and v7 alike.

It's a good idea to weigh upgrading the node firmware to version 8.2 and the disk firmware to v1.5.

EMC

Your EMC Product Updates

Updates are now available for the following product(s):

Notification Frequency: WEEKLY

Product(s) | Content Type | Date | Title
Isilon S-Series, Isilon OneFS, Isilon NL-Series, Isilon X-Series | Product Documentation | 2013-04-04 | Current Isilon software releases
Isilon OneFS, Isilon Backup Accelerator, Isilon NL-Series, Isilon X-Series | Downloads | 2013-04-03 | OneFS 6.5.5.19
Isilon OneFS, Isilon NL-Series, Isilon X-Series | Downloads | 2013-04-03 | OneFS 6.5.5.19 Installation image
Isilon OneFS | Downloads | 2013-04-03 | Isilon OneFS Patch-101988
Isilon OneFS | Downloads | 2013-04-03 | Isilon OneFS Patch-101172
Isilon OneFS | Downloads | 2013-04-03 | Isilon OneFS Patch-101597
Isilon S-Series, Isilon OneFS, Isilon Backup Accelerator, Isilon NL-Series, Isilon X-Series | Product Documentation | 2013-04-03 | OneFS Version 6.5.5.19 Release Notes
Isilon S-Series, Isilon Switch QDR, Isilon OneFS, Isilon Backup Accelerator, Isilon NL-Series,... | Product Documentation | 2013-04-01 | Isilon Supportability and Compatibility Guide (PDF)
Isilon S-Series, Isilon Switch QDR, Isilon OneFS, Isilon Backup Accelerator, Isilon NL-Series,... | Product Documentation | 2013-04-01 | Isilon Product Availability
Isilon S-Series, Isilon OneFS, Isilon Backup Accelerator, Isilon NL-Series, Isilon X-Series | Product Documentation | 2013-04-01 | InsightIQ Release Notes Version 2.5
Isilon S-Series, Isilon OneFS, Isilon Backup Accelerator, Isilon NL-Series, Isilon X-Series | Product Documentation | 2013-04-01 | InsightIQ Installation and Setup Guide Version 2.5
Isilon S-Series, Isilon OneFS, Isilon Backup Accelerator, Isilon NL-Series, Isilon X-Series | Product Documentation | 2013-04-01 | InsightIQ User Guide Version 2.5
Isilon OneFS, Isilon Backup Accelerator, Isilon NL-Series, Isilon X-Series | Downloads | 2013-03-06 | Isilon OneFS 6.5.5.18 installation image
Isilon OneFS, Isilon Backup Accelerator, Isilon NL-Series, Isilon X-Series | Downloads | 2013-03-06 | Isilon OneFS 6.5.5.18



Justin

Apr 10, 2013, 12:19:45 AM
to isilon-u...@googlegroups.com
Thanks everyone. I was able to get one of the two failed nodes back online by re-imaging from a USB stick. The other one is still presenting the same errors, so I believe it is a hardware issue. I will dive into this more later when I have time, as it's not required to be online right at the moment.

Peter Serocka

Apr 10, 2013, 12:36:12 AM
to Justin, isilon-u...@googlegroups.com
Good to hear.

Would you mind disclosing what the initial trouble was
that made you start all over?

Peter



Peter Serocka
CAS-MPG Partner Institute for Computational Biology (PICB)
Shanghai Institutes for Biological Sciences (SIBS)
Chinese Academy of Sciences (CAS)
320 Yue Yang Rd, Shanghai 200031, China





Justin

Apr 10, 2013, 12:54:56 AM
to isilon-u...@googlegroups.com, Justin
Yes, sure. It started when we noticed one node kept popping an ECC memory error. We replaced the RAM in this one, but the error stayed. Then I did some research that said that if an ECC error pops more often than once a year (it was happening every 15 minutes), it could be a problem with the CPU. So I (not understanding our protection level) brought down ANOTHER node to take out its CPU and put it into the first node having the problem. By doing this, I basically crashed our entire system, corrupting lots and lots of data. Fortunately, most of the data was not mission critical, as this cluster was more frequently being used as a dump for large personal files and illegal movie downloads...so no love was lost for having to reformat the whole thing and start over. After bringing it back up, we discovered a few underlying hardware issues, so we are still missing a few nodes, but we have over 21TB of usable space, which is plenty for now.
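
To put the ECC rate in perspective, here is a minimal Python sketch of the arithmetic. The once-a-year threshold is only the rule of thumb quoted in the post, not a vendor specification, and the function and constant names are made up for illustration.

```python
from datetime import timedelta


def events_per_year(interval: timedelta) -> float:
    """Return how many events occur in a 365-day year at a fixed interval."""
    return timedelta(days=365) / interval


# Assumption: the "more often than once a year" rule of thumb from the post.
RULE_OF_THUMB_PER_YEAR = 1.0

# Observed rate from the post: one ECC error every 15 minutes.
observed = events_per_year(timedelta(minutes=15))
ratio = observed / RULE_OF_THUMB_PER_YEAR

print(f"{observed:.0f} ECC errors per year, {ratio:.0f}x the rule-of-thumb rate")
```

At one error every 15 minutes, the node was logging roughly 35,000 errors a year, which is why a simple DIMM swap was unlikely to be the whole story.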

Peter Serocka

Apr 10, 2013, 1:03:45 AM
to Justin, isilon-u...@googlegroups.com
Thanks a lot for sharing this lesson!

Peter



Luc Simard

Apr 10, 2013, 2:07:17 AM
to isilon-u...@googlegroups.com, Justin, isilon-u...@googlegroups.com
Actually, the PCI riser card also logs ECC errors in dmilog.

Depending on your node type, which is not specified here, the riser could be the source of your headaches.

This applies to IQ iSeries and xSeries nodes up to the IQ12000x, where active PCI risers are used.




Justin Lemme

Apr 10, 2013, 2:42:16 AM
to isilon-u...@googlegroups.com
I'm actually wondering if that is the issue, and Isilon Support did mention that...but I have no spare PCI riser cards lying around to swap in. I wouldn't even want to know how much money I would have to spend to replace these...

Luc Simard

Apr 10, 2013, 3:25:22 AM
to isilon-u...@googlegroups.com, isilon-u...@googlegroups.com
If your cluster is under support, then it's a non-issue; this is part of the support license with break/fix coverage.

If not, then you can probably order one through sales for (educated guess) $200.



Cory Snavely

Apr 10, 2013, 9:17:17 AM
to isilon-u...@googlegroups.com
If you're running w/o support, I'd suggest leaving a few nodes out of
the cluster for spare parts.


Justin Lemme

Apr 10, 2013, 11:48:51 AM
to isilon-u...@googlegroups.com
Yes, that's what I am going to do with one of them, but it also had the ECC error, so I can't really fix that.


Sent from my iPhone 5s

Saker Klippsten

Apr 10, 2013, 11:53:16 AM
to isilon-u...@googlegroups.com, isilon-u...@googlegroups.com
Justin, what type of node?






Saker Klippsten | CTO | Zoic Studios