Anyone with experience front-ending an Isilon cluster with an Avere Systems filer?


Jim Long

Aug 6, 2013, 11:42:43 AM
to isilon-u...@googlegroups.com
I posted a message earlier about the problems we have been having with a nine-node X200 cluster.  EMC's recommended solution is to add three S nodes with all SSD in them so that we can place the file system map on SSD.  I'm leery of that solution for several reasons.

  1. Cost is prohibitive.  Buying these three nodes increases my cost per terabyte for the whole solution by $1,200.
  2. No usable disk space; this solution is solely to improve metadata I/O.
  3. It does not address the drive problems we have been having.  See the attached file node8_drive_stats.jpg.  I have an AutoBalance job running after we FlexProtected drive 8 in node 8.  My SMB protocol latency on node 8 increases to an average of 15 ms, and I can see that my queue length is 200+.  This leads me to believe that my front-end requests are waiting behind 200+ AutoBalance I/Os.
I received a very timely email from Avere Systems today.  Apparently they front-end Isilon clusters.  Does anyone have any experience with this?  We are really struggling with sporadic performance problems and trying to investigate all options.

Thanks
  Jim  
Got_Isilon_Position_Paper.pdf
node8_drive_stats.JPG

Peter Serocka

Aug 6, 2013, 12:30:30 PM
to isilon-u...@googlegroups.com

On Tue, 6 Aug 2013, at 23:42, Jim Long <might...@gmail.com> wrote:

> I posted a message earlier about the problems we have been having with a nine node X200 cluster. EMC's recommended solution is to add three S nodes with all SSD in them so that we can place the file system map on SSD. I'm leary of that solution for several reasons.

Are you still on 6.5.5.somesmallnumber?

>
> • Cost is prohibitive. By buying these three nodes; my cost per terabyte for the whole solution increases by $1,200 per terabyte.

Would that be much different with Avere?

> • No usuable disk space; this solution is solely to improve metadata i/o.

Would Avere actually increase usable disk space?

> • Does not address the drive problems we have been having. See attached file node8_drive_stats.jpg. I have an autobalance job running after we flexprotected drive 8 in node 8.

Does it run on HIGH, MEDIUM or LOW policy?

> My smb protocol average increases on node 8 to an average of 15 ms and I can see that my queue length is 200+. This leads me to believe, that my front end request is waiting behind 200+ autobalance i/os.

Even if that hurts, I would first wait until AutoBalance has finished.
Try to balance the pain against the time to completion by choosing the right impact policy for the AutoBalance job, maybe MEDIUM during off-hours and LOW during the daytime.

> I received a very timely email from Avere system today. Apparently they front end Isilon clusters. Does anyone have any experience with this? We are really struggling with sporadic performance problems and trying to investigate all options.

Just wondering, how would SmartQuotas work on an Isilon+Avere combo?
Snapshots? SmartPools? InsightIQ? SyncIQ or NDMP backups? That whole
combined Windows+UNIX permissions/ACL stuff?

-- Peter

>
> Thanks
> Jim
>

Blake Golliher

Aug 6, 2013, 12:41:00 PM
to isilon-u...@googlegroups.com
Avere is just an NFS cache, so if the performance problem is 5 GB/sec of writes for 30 minutes every day at noon, that's not gonna help.  Alternatively, if every day at noon you read 20 GB of data that hasn't been touched in 60 days, it may not help either (until after the first read is done), unless you reread that data several times over.  Using it to offload lots and lots of NFS traffic from your cluster with their caching system is pretty great.

-Blake



Luc Simard

Aug 6, 2013, 1:01:33 PM
to isilon-u...@googlegroups.com, isilon-u...@googlegroups.com
You will get better performance keeping your Isilon cluster and adding one or two A100 accelerator nodes for highly concurrent reads, if that is your workflow.

Avere does not make its own storage platform, so there is a support angle there. They will not lower your costs.

It seems to me your cluster might not have been properly sized; that requires better knowledge of the data profile (file sizes, number of files per directory, file-count ratios per directory, shallow directories).

Highly concurrent access to small files in shallow directories will typically increase client latency on NAS. So if you are coming from a NetApp environment, compare costs apples to apples: with all the bells and whistles turned on, how much usable capacity are you left with? You hit a ceiling with a head-based system (sandboxed performance), whereas Isilon will grow with you: it scales to 20 PB as a single volume, with a single mount point and easy management.

I think there is real value in that proposition.

Luc Simard - 415-793-0989
Messages may contain confidential information.
Sent from my iPhone

Jim Long

Aug 6, 2013, 2:00:42 PM
to isilon-u...@googlegroups.com
Some background:  We have been running this cluster for a year.  We have 50 million directories, 280 million files, and 80 TB of primary content.  We replicate to two other clusters.  We have talked about upgrading from 6.5.5.14 to 6.5.5.22.  Isilon has ruled out version 7 until the performance problems are resolved, because version 7 would make our performance problems worse due to the way it handles drive pools.  We grow at a rate of 4% per month, or about 60% annually.

I have a quote in front of me for 3 Isilon S nodes that is quite pricey.  The original cluster spec was 3 X nodes with 12 GB of RAM.  I currently have 8 X200 nodes with 24 GB of RAM and one accelerator with 48 GB of RAM.  I've exceeded the original spec by over 3x, and our performance is still a problem due to the job engine.  Obviously that makes investing further in the solution very concerning to me.  I believe the S nodes will help jobs like replication finish faster by accelerating the metadata lookups.  Our application does not look at the metadata; we ask the Isilon for a file and nothing else.  If I am pinning a SATA disk during AutoBalance and my front-end traffic is contending with that, I don't see how metadata acceleration solves it.  So the S nodes only address part of the problem.  We have no idea what Avere costs yet.  I'm not suggesting this is a better path; I'm just trying to see if anyone has run into them.  Our current challenge is that we are constantly watching Isilon.  As a SaaS company that serves files, that is a dangerous game to play.  Disk is a primary cost for me; I need to keep our cost per terabyte down.

AutoBalance is running on LOW.  If I push it to MEDIUM or HIGH, performance will suffer.

Thanks for all the input.
  Jim

Rob Peglar

Aug 6, 2013, 2:23:48 PM
to isilon-u...@googlegroups.com, isilon-u...@googlegroups.com
There is a fair amount of misunderstanding going on here, as there is at lots of other sites that run NAS (not necessarily Isilon).

1) adding SSD to your Isilon cluster is a huge win.  Yes, it may be expensive but it is by far the best way to dig out of a performance hole caused by metadata accesses to HDD dominating the drive queue, which this case almost certainly is.  'Just asking for a file' involves one to several metadata reads and perhaps (depends on the access) one to several writes.  If these are all on HDD, your drive queues can get quite long.  A job engine HDD I/O will queue and wait for a metadata I/O to finish, and vice versa.  Moving the metadata I/O to SSD not only accelerates the metadata ops, as much as 100x faster, but also relieves a lot of pressure on the HDD.  So what your EMC folks told you is exactly right.
2) AutoBalance and all other job engine jobs can stress metadata just as much as any other workload, if not more.  Adding SSD will help your job engine immensely.  An 80-microsecond SSD read is way better than an 8-millisecond HDD read (a rough back-of-the-envelope comparison follows this list).
3) the S nodes will address everything you have stated, as it turns out.  A cluster with 50 M directories and 280 M files is a prime candidate for metadata acceleration.  Below 100 M files or so, you might not notice.  But as you continue to throw files and load at a cluster, and don't add SSD for metadata, you will slowly tip it over.  
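To put illustrative numbers on that queueing effect, here is a minimal Python sketch.  It treats the drive queue as a plain FIFO and uses only figures already in this thread (the ~8 ms HDD and ~80 microsecond SSD reads in point 2, and the 200-deep queue Jim reported); real drive scheduling, caching, and striping across spindles are ignored.

# Back-of-the-envelope only: a plain FIFO queue, no elevator scheduling,
# caching, or parallelism across spindles.
HDD_READ_S = 8e-3     # ~8 ms per random HDD read (figure from point 2 above)
SSD_READ_S = 80e-6    # ~80 microseconds per SSD read (figure from point 2 above)
QUEUE_DEPTH = 200     # queue length Jim reported during AutoBalance

def wait_behind_queue(per_io_seconds: float, queued_ios: int) -> float:
    """Time a newly arrived front-end request waits for the queue to drain."""
    return per_io_seconds * queued_ios

print(f"HDD queue: ~{wait_behind_queue(HDD_READ_S, QUEUE_DEPTH):.2f} s")
print(f"SSD queue: ~{wait_behind_queue(SSD_READ_S, QUEUE_DEPTH) * 1000:.0f} ms")
# Roughly 1.6 s versus 16 ms: moving metadata I/O off the HDD queue matters.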

I believe Luc has stated the case about the Avere non-solution well.  Also remember that if you put Avere in front of Isilon or any other NAS, you essentially lose all the higher-level embedded functionality, because things like replication and snapshots can change the namespace 'out from under' the Avere caching appliance, and it won't know what has happened.

Rob 

Peter Serocka

Aug 6, 2013, 3:04:06 PM
to isilon-u...@googlegroups.com

On Wed, 7 Aug 2013, at 02:00, Jim Long <might...@gmail.com> wrote:

> Some background: We have been running this cluster for a year. We have 50 million directories, 280 million files and 80 TB of primary content.

Those are huge numbers, and they make for pretty small files on average (80 TB over 280 million files is roughly 300 KB per file).  Someone should have warned you that it wouldn't work well on purely SATA-based nodes.  Sad.  In contrast, we have experienced EMC sales/presales engineering giving excellent advice on designing a cluster for a given workload.

A practical (though somewhat desperate) tip for your current situation:
Have you already switched off (or reduced) the access time stamping
on the cluster? That should save some disk IOPS.

-- Peter

Luc Simard

Aug 6, 2013, 8:55:43 PM
to isilon-u...@googlegroups.com, isilon-u...@googlegroups.com
To add to Rob and Peter's points, and bear with me as I'm thinking out loud.

I would favor 6.5.5.23 with patch 109309, or, if you are not in a hurry, I am told the new maintenance release is days away. At least you will have options once the release notes are published.

One strategy is to work with your sales team and investigate an eval with S200 SAS+SSD nodes; if you find it addresses your workflow, the proof is in the pudding. You could also investigate working with the PS team; you might benefit from a performance/workflow analysis and be able to optimize your workflow.




Luc Simard - 415-793-0989
simard.j...@gmail.com
Messages may contain confidential information.
Sent from my iPhone

Saker Klippsten

Aug 6, 2013, 9:18:32 PM
to isilon-u...@googlegroups.com
Averes are not cheap either; they are comparable in cost to an Isilon node. Everything is negotiable, of course.
You will want two Avere nodes at a minimum for failover. I have two of them (FX3500), but we use them for a different purpose: remote rendering (our render farm is in LA, our Isilon cluster is in BC, and we use them to cache reads and writes over our WAN connection). You can set them up to do write-through, etc.; there is a slew of features I will not go into. I do know a few people who have Averes with X400 or NL400 high-density Isilon nodes, use them for main production storage, and have a good workflow down. The Isilons continue to function just like they would otherwise, and the Averes sit in front acting as a cache. They do require you to join an AD domain for CIFS access, and there are some permission issues to iron out for sure. You will no longer use SmartConnect, since the Averes sit in front and write to the Isilon via NFS; they recommend you do not use it and do a 1:1 mapping instead.

Also, can you talk more about your application?
What is the application doing, and what is it written in? Is this code you guys wrote, or something off the shelf we all might be familiar with?

I head up a visual effects company; we have hundreds of applications and insane throughput requirements. I can tell you that not all apps are written the same, and most are crap when it comes to accessing data off the network and cause havoc. I have two production clusters: a 30-node SSD5000 with 3 NL108s and a 12-node S200 with 3 NL108s. We are all CIFS and have a 1,000-proc render farm and 400 user workstations.

If you want to find out more about Avere, look up Studio Sysadmins, join that group, and post there. Averes are used a lot in the visual effects world, and there are more people there who use them for the type of workflow you are looking at.



-Saker



Saker Klippsten | CTO | Zoic Studios

sa...@zoicstudios.com

310-838-0770 m

310-202-2063 d

310-350-3854 c

Peter Serocka

Aug 7, 2013, 2:12:37 AM
to isilon-u...@googlegroups.com

On 7 Aug 2013, at 08:55, Luc Simard wrote:

> To add to Rob and Peter's points, and bear with me as I'm thinking out loud.
>
> I would favor 6.5.5.23 with patch 109309, or if are not in a hurry, I am told the new maintenance release is days away. At least you will have options once the release notes are published .
>
> One strategy is to work with sales team, investigate into an eval with S200-SAS-SSD nodes, if you find it addresses your workflow, the proof is in the pudding. You could also investigate working with the PS team, you might benefit from a performance / workflow analysis and optimize your workflow.

Just got a price list in my hands and remembered Jim
was talking about an all-SSD S-series solution.
That's really crazy stuff; it would probably mean buying
right now all the SSD needed for the coming(?) billions of files,
rather than buying as you grow.

But there is another interesting point I found:

820-0004 FIELD UPGRADE KIT; X200 2 X 200GB SSD

for example (1 to 6 x 200GB SSD kits all on the list).

If this really means EMC offers "UPGRADES" to SSD
for X nodes, rather than mere repair/replacement kits,
I think it might be an attractive option!

-- Peter
Peter Serocka
CAS-MPG Partner Institute for Computational Biology (PICB)
Shanghai Institutes for Biological Sciences (SIBS)
Chinese Academy of Sciences (CAS)
320 Yue Yang Rd, Shanghai 200031, China
pser...@picb.ac.cn





Jerry Uanino

Aug 7, 2013, 4:31:52 AM
to isilon-u...@googlegroups.com
This "lots of small files" has plagued many of us.
When we went to 10000x-ssd's this helped a lot.  We were rsyncing millions of small files and we hit criical mass before we had SSDs.  I wish the universe would focus on this more (not just EMC) because it's a big problem.

Have you considered reducing your file count by packing files?  It's often easier to read a whole tar/zip file and pick out your file on the fly than it is to read several small files separately ten times. I don't know your app, but "packing" might work if you have great control over your app and it's not a third-party app.  Of course, you already have a lot of existing files.
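To make the packing idea concrete, here is a minimal Python sketch (the paths and file names are hypothetical, not from this thread): many small files go into one zip container, and a single member is later read on the fly without unpacking the rest, so the filer sees one container instead of many tiny files.

import zipfile
from pathlib import Path

BUNDLE = Path("/ifs/data/packed/batch_0001.zip")   # hypothetical container path

def pack(small_files, bundle=BUNDLE):
    """Write many small files into one container (one object on the cluster)."""
    with zipfile.ZipFile(bundle, "w", compression=zipfile.ZIP_STORED) as zf:
        for f in small_files:
            zf.write(f, arcname=Path(f).name)

def read_member(name, bundle=BUNDLE):
    """Read a single member on the fly without extracting the whole archive."""
    with zipfile.ZipFile(bundle, "r") as zf:
        with zf.open(name) as member:
            return member.read()

# Usage (hypothetical file names):
# pack(["/ifs/data/small/a.xml", "/ifs/data/small/b.xml"])
# data = read_member("a.xml")

ZIP_STORED skips compression, so picking one file out of the container is mostly a seek inside an already-open file rather than a decompress plus extra metadata lookups.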

Also, just to note, some early analysis when we went to SSD nodes showed that larger RAM sizes helped us more than SSDs in many cases, for caching reasons (if I recall correctly).
In other cases, moving an application's metadata to MySQL or some other "metadata store" helps if you control the design.  We do some metadata-heavy things in MySQL, which knows where to find things on disk.  This avoids an "ls" or globbing because you already know the file/path.
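A minimal sketch of that metadata-store idea (not the poster's actual schema; sqlite3 stands in for MySQL so the example is self-contained, and the table, column, and path names are made up): the application resolves an ID to an exact path in the database, so the filer never has to service a directory listing or glob.

import sqlite3

con = sqlite3.connect(":memory:")   # stand-in for the MySQL metadata store
con.execute("""
    CREATE TABLE IF NOT EXISTS assets (
        asset_id TEXT PRIMARY KEY,
        path     TEXT NOT NULL,      -- exact /ifs path, so no 'ls' or glob needed
        size     INTEGER,
        mtime    REAL
    )
""")

def register(asset_id, path, size, mtime):
    """Record where a file lives when it is written."""
    con.execute("INSERT OR REPLACE INTO assets VALUES (?, ?, ?, ?)",
                (asset_id, path, size, mtime))
    con.commit()

def lookup(asset_id):
    """Return the exact path for an asset; the cluster only sees the final open()."""
    row = con.execute("SELECT path FROM assets WHERE asset_id = ?",
                      (asset_id,)).fetchone()
    return row[0] if row else None

register("clip-42", "/ifs/data/video/2013/08/clip-42.mp4", 1048576, 1375900000.0)
print(lookup("clip-42"))    # -> /ifs/data/video/2013/08/clip-42.mp4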

Luc Simard

Aug 7, 2013, 7:15:33 AM
to isilon-u...@googlegroups.com, isilon-u...@googlegroups.com
That's a great suggestion, I will keep it in mind.


Luc Simard - 415-793-0989
Messages may contain confidential information.
Sent from my iPhone

Luc Simard

Aug 7, 2013, 7:14:19 AM
to isilon-u...@googlegroups.com, isilon-u...@googlegroups.com
The former applies to situations where a larger cluster needs GNA SSD acceleration (metadata) and not all nodes can support SSDs or be upgraded.

The general rule to support GNA is that SSDs must make up 1.5% of the cluster's capacity and 20% of the nodes in the cluster must have SSDs. In OneFS 7 or later there is a feature where you can choose metadata read, metadata read/write, and other options for how the SSDs are used. Work with your sales engineer on the sizing piece for GNA; the payoff in performance improvement is great and almost immediate (a job needs to run to upgrade the layout, and voila).
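Turning that rule of thumb into a quick check (illustrative only, based on the rule as stated above with made-up numbers; confirm the actual GNA requirements for your OneFS version with your sales engineer):

def gna_eligible(cluster_capacity_tb, ssd_capacity_tb, nodes_total, nodes_with_ssd):
    """Rough check of the GNA rule of thumb stated above."""
    capacity_ok = ssd_capacity_tb >= 0.015 * cluster_capacity_tb   # SSD >= 1.5% of capacity
    nodes_ok = nodes_with_ssd >= 0.20 * nodes_total                # >= 20% of nodes have SSD
    return capacity_ok and nodes_ok

# Hypothetical example: 9-node cluster, 8 nodes retrofitted with 2 x 200 GB SSDs each.
print(gna_eligible(cluster_capacity_tb=200, ssd_capacity_tb=8 * 2 * 0.2,
                   nodes_total=9, nodes_with_ssd=8))   # True with these numbers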

The S200, X200, and X400 can be field-upgraded, which is fantastic: no need to forklift-upgrade entire nodes, and it can be done while in production with minimal impact. We've done it several times; it works great.

Luc Simard - 415-793-0989
simard.j...@gmail.com
Messages may contain confidential information.
Sent from my iPhone

Peter Serocka

Aug 7, 2013, 9:17:14 AM
to isilon-u...@googlegroups.com

On Wed, 7 Aug 2013, at 19:14, Luc Simard <simard.j...@gmail.com> wrote:

> The S200, X200 and X400 can be field upgraded which is fantastic, no need to forklift upgrade entire nodes and can be done while in production with minimum impact, we've done several times. Works great.

Terrific! Thanks!

-- Peter





> The former applies to situations were larger cluster needs GNA SSD acceleration ( metadata ) , and not all node can support them or be upgraded.
>
> The general rule to support GNA is 1.5% of clusters capacity and 20% of nodes in the cluster must have SSD, in version 7 or better of OneFS, they now have a feature where you can leverage read, read/writes, other options for metadata. Work with your sales Eng on the sizing piece for GNA , great pay off in performance improvements and almost immediate ( need a job to run to upgrade the layout and voila)
>
>

Cory Snavely

Aug 7, 2013, 9:48:58 AM
to isilon-u...@googlegroups.com
FWIW, we did this in order to reduce our file count by two orders of
magnitude when we moved to Isilon storage in 2007 and confirmed through
testing that accessing files from within a zip container was faster on
*subsequent* accesses to the zip. The reason? Presumably metadata, as
discussed here, but also we believe read-ahead is playing a significant
role. We also avoided considerable storage overhead, as most of the
small files were less than 128KB in size.

On 08/07/2013 07:15 AM, Luc Simard wrote:
> That's a great suggestion, I will keep it in mind.
>
> Luc Simard - 415-793-0989
> simard.j...@gmail.com <mailto:simard.j...@gmail.com>
> Messages may contain confidential information.
> Sent from my iPhone
>

Jim Long

Aug 7, 2013, 11:59:29 AM
to isilon-u...@googlegroups.com
So some great thoughts / suggestions.  I really appreciate all the time spent in replying to me.  I'll try and respond to all the points made

  • Access time tracking is off already.
  • As Pete pointed out, buying 3 S nodes with all SSD is very expensive.  The alternative is retrofitting 8 nodes with 2 SSDs each, which would require a smartfail of 2 drives on each node.  A single FlexProtect takes 2-3 days for us, so that would require about 40 days.  Afterwards we would need to reboot the cluster to enable GNA.  Our product relies on Isilon; these are not trivial changes for an environment that exceeds 99.99% availability.
  • I haven't disagreed that SSD will help.  I'm sure it will.  I'm certainly frustrated with the original sizing and with now having a quote in my hand that exceeds $200,000 to fix this.  This solution was purchased about a year ago, and we purchased almost double EMC's original recommendation (the recommendation was three X200s with 12 GB; we purchased 5 X200s with 24 GB).  A monthly three percent growth rate was also conveyed.  I think you can understand why I am so skeptical.
  • I struggle with the design/behavior of the job engine.  Why does the job engine fire-hose a drive?  We know a SATA drive's I/O potential (between 80 and 120 IOPS).  We all know that the point of this product is to serve files, so how can the job engine not have defensive behavior that prevents front-end traffic from being affected by a background job?  To me this is a huge flaw, and a huge flaw I am being asked to pay to fix.  Based on the feedback in this thread, the SSD requirement should have been identified up front.
  • The application is custom code that converts files to video and then reads the video files at a ratio of 1:30.  Front-end load when no jobs are running is excellent, and it is also very predictable.
  • The zip file idea is very creative.  We will investigate that.
Thanks again
   Jim

Cory Snavely

Aug 7, 2013, 12:17:01 PM
to isilon-u...@googlegroups.com, Jim Long
Minor point - you could smartfail all those drives at once, provided you
have sufficient capacity to accommodate their data. We routinely
smartfail beyond redundancy tolerances when we do equipment retirement.

On 08/07/2013 11:59 AM, Jim Long wrote:
> So some great thoughts / suggestions. I really appreciate all the time
> spent in replying to me. I'll try and respond to all the points made
>
> * Access time tracking is off already.
> * As Pete pointed out buying 3 S nodes with all SSD is very expensive.
> The alternative is retrofitting 8 nodes with 2 SSD each which
> would require a smart fail of 2 drives on each node. A single flex
> protect takes between 2-3 days for us. That would require about 40
> days. Afterwards we would need to reboot the cluster to enable GNA.
> Our product relies on Isilon; these are not trivial changes for
> an environment that exceeds 99.99 availability.
> * I haven't disagreed that SSD will help. I'm sure it will. I'm
> certainly frustrated with the original sizing and now having a quote
> that exceeds $200,000 in my hand to fix this. This solution was
> purchased about a year ago and we purchase almost double EMC's
> original recommendation (recommend was three X200 with 12 gb; we
> purchased 5 X200 with 24gb). A monthly three percent growth was
> also conveyed. I think you can understand why I am so skeptical.
> * I struggle with the design / behavior of the job engine. Why does
> the job engine fire hose a drive? We know a SATA drives i/o
> potential (between 80 -120). We all know that the point of this
> product is to serve files. So how can the job engine not have
> defensive behavior that prevents front end traffic from being
> affected by background job. To me this is a huge flaw. A huge flaw
> I am being asked to pay to fix. Based on the feedback in this
> thread; the SSD requirement should have been identified.
> * Application is custom code that converts files to video and then
> reads the video files at a ratio of 1:30. Front end load when no
> jobs are running is excellent. The front end load is also very
> predictable.
> * The zip file idea is very creative. We will investigate that.
>
> Thanks again
> Jim
>
>

Rob Peglar

Aug 7, 2013, 12:27:03 PM
to isilon-u...@googlegroups.com
Several things here.

1) the first smart fail of an HDD for SSD will take some time, yes.  However, as that SSD begins to fill with metadata, the time to run Flexprotect will decrease, perhaps dramatically.  One can hold an awful lot of inodes in a single SSD.
2) you don't need to reboot or enable GNA for metadata on SSD to start on the X nodes.  That is the default SmartPool policy, assuming you haven't changed it.  GNA is only necessary on clusters with mixed nodes, in particular those containing NL, for -all- metadata to be accelerated.  Any SSD in a given tier will accelerate metadata in that tier by default.
3) the capacity growth rate is not nearly as important as the file count growth rate, in terms of sizing.  
4) the job engine, by itself, doesn't 'fire hose' drives any more than other concurrent workload does.  An I/O is an I/O.  Yes the engine will run all the nodes in parallel for maximum efficiency, again like any other well-designed workload would.  You can control the hours in which jobs run to minimize their effect on users.  There is the 'off-hours' canned policy but the best policies I've seen are custom, tailored to the individual business and application/human environment. 
5) The EMC folks are not asking you to 'pay to fix the job engine' - they are asking you to pay for upgrades to your cluster with a third of a billion directories+files which has outgrown the capabilities and performance envelope of a pure SATA environment. This is quite normal, actually.  Workloads and file collections change over time and so should infrastructure.  Bridges that carry more traffic over time need to be reinforced and widened at some point to keep everything running smoothly and safely.   Filesystems are no different.  
The beauty of a scale-out approach is that you don't have to oversize clusters originally, and pay much more up front.  Leverage the time value of money and pay for only what you need to meet the needs of your business.  As business changes, scale it out.  Pretty simple, really.

Rob 

Erik Weiman

Aug 7, 2013, 12:19:59 PM
to isilon-u...@googlegroups.com, Jim Long
As long as you don't physically remove them, you should be OK to smartfail all of those drives at one time.

Protection/redundancy gets involved when you have devices that are listed as "down" or "dead" in the output from isi_group_info or sysctl efs.gmp.group.


--
Erik Weiman
Sent from my iPhone 4

Peter Serocka

Aug 7, 2013, 12:59:38 PM
to isilon-u...@googlegroups.com, Jim Long
This is really cool -- OneFS is not RAID -- but the nice(!) consequences are sometimes counter-intuitive -- better training would be appreciated, I think…

BTW: With that upgrade, can the HDDs be kept as spares, or do they need to be sent back to EMC?


-- Peter

Cory Snavely

Aug 7, 2013, 1:45:29 PM
to isilon-u...@googlegroups.com, Peter Serocka, Jim Long
Good point about not removing them. :) Interestingly, too, you will see
that within redundancy tolerances they will see no IO - the cluster will
handle their smartfails as a parity rebuild - but as soon as you go
beyond redundancy tolerances, they will see read IO. That's the smart in
smartfail.

As far as returning them, I'd make that decision based on whether I was
getting any trade-in allowance for them. Having a handful of cold spares
is very handy, and buying cold spares certainly costs a lot more than
the trade-in value...I'd hang on to a few in a situation like this.

On 08/07/2013 12:59 PM, Peter Serocka wrote:
> This is really cool -- OneFS is not RAID -- but the nice(!) consequences are sometimes counter-intuitive -- better training would be appreciated, I think…

Luc Simard

Aug 7, 2013, 4:57:53 PM
to isilon-u...@googlegroups.com, isilon-u...@googlegroups.com, Peter Serocka, Jim Long
Typically, they are considered used, so I would assume they are yours to keep.

When replacing them, you might have to follow a different procedure, still using "isi devices".
Messages may contain confidential information.
Sent from my iPhone

On Aug 7, 2013, at 10:45, Cory Snavely <csna...@umich.edu> wrote:

> Good point about not removing them. :) Interestingly, too, you will see that within redundancy tolerances they will see no IO - the cluster will handle their smartfails as a parity rebuild - but as soon as you go beyond redundancy tolerances, they will see read IO. That's the smart in smartfail.
>
> As far as returning them, I'd make that decision based on whether I was getting any trade-in allowance for them. Having a handful of cold spares is very handy, and buying cold spares certainly costs a lot more than the trade-in value...I'd hang on to a few in a situation like this.
>
> On 08/07/2013 12:59 PM, Peter Serocka wrote:
>> This is really cool -- OneFS is not RAID -- but the nice(!) consequences are sometimes counter-intuitive -- better training would be appreciated, I think…

Peter Serocka

Aug 9, 2013, 6:51:47 AM
to Luc Simard, isilon-u...@googlegroups.com, Jim Long
Luc,

Any experience with how much space per LIN is used on the SSD?

I have one observation that says roughly 2 KB per LIN
with SSD in the same pool (X200), but there's also
a preliminary indication that with GNA it might be higher,
perhaps 3-4 KB per LIN... Any references?
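A back-of-the-envelope estimate using those per-LIN figures and the counts Jim gave earlier (50M directories + 280M files); this is a rough sketch, not a vendor sizing formula:

# Rough SSD metadata sizing from the observed per-LIN figures above; treat
# every directory and file as roughly one LIN. Decimal TB, no growth factored in.
lins = 50e6 + 280e6    # directories + files from Jim's description

for label, per_lin_kb in [("same-pool SSD (~2 KB/LIN)", 2),
                          ("GNA, low estimate (~3 KB/LIN)", 3),
                          ("GNA, high estimate (~4 KB/LIN)", 4)]:
    tb = lins * per_lin_kb * 1024 / 1e12
    print(f"{label}: ~{tb:.2f} TB of SSD for metadata")
# Roughly 0.7 TB at 2 KB/LIN, up to about 1.4 TB at 4 KB/LIN, before growth.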


-- Peter

Peter Serocka

Sep 6, 2013, 12:48:14 AM
to isilon-u...@googlegroups.com, Jim Long
Jim,

curious, have you implemented some changes?

-- Peter


On 8 Aug 2013, at 04:57, Luc Simard wrote:

> Typically, they are considered used, so I would assume they are yours to keep.
>
> When replacing them , you might have follow a different procedure, still using " isi devices"
>
> Luc Simard - 415-793-0989
> simard.j...@gmail.com
> Messages may contain confidential information.
> Sent from my iPhone
>
> On Aug 7, 2013, at 10:45, Cory Snavely <csna...@umich.edu> wrote:
>
>> Good point about not removing them. :) Interestingly, too, you will see that within redundancy tolerances they will see no IO - the cluster will handle their smartfails as a parity rebuild - but as soon as you go beyond redundancy tolerances, they will see read IO. That's the smart in smartfail.
>>
>> As far as returning them, I'd make that decision based on whether I was getting any trade-in allowance for them. Having a handful of cold spares is very handy, and buying cold spares certainly costs a lot more than the trade-in value...I'd hang on to a few in a situation like this.
>>
>> On 08/07/2013 12:59 PM, Peter Serocka wrote:
>>> This is really cool -- OneFS is not RAID -- but the nice(!) consequences are sometimes counter-intuitive -- better training would be appreciated, I think…
>>>
>>> BTW: With that upgrade, can the HDDs be kept are spares or need to be sent back to EMC?
>>>
>>>
>>> -- Peter
>>>
>>> On Thu, 8 Aug 2013, at 00:19, Erik Weiman <erik.j...@gmail.com> wrote:
>>>
>>>> As long as you don't physically remove them you should be ok to smart
>>>> fail all of those drives at one time.
>>>>
>>>> Protections / redundancy gets involved when you have devices that are
>>>> listed as "down" or "dead" in the output from isi_group_info. Or
>>>> sysctl efs.gmp.group
>>>>
>>>>
>>>> --
>
