Isilon 10Gige connections best practices

Jeff

Oct 7, 2013, 9:46:23 AM
to isilon-u...@googlegroups.com
I'm curious as to what others have done as far as configuring their SmartConnect with 10Gige connections.  We have 6 nodes, each with two 10Gige ports currently active and connected.  I have all 12 ports in the same pool (we haven't seen a need to create more at this point), configured with two IPs per connection, thusly:
  • Aggregation mode: Link Aggregation Control Protocol (LACP)
  • Connection policy: Round Robin
  • IP allocation method: Dynamic
  • Rebalance policy: Automatic Failback
  • IP failover policy: Round Robin
While this config seems to work okay, can we do better using the 10gige-agg?  The 10gige ports are currently split between two different cards in our RX switch, just for some semblance of redundancy; again, could this be done smarter?  What if we decide to add another subnet to make use of the "Zones" in OneFS 7, which requires static IPs?

Looking forward to some insight from the group here.  Thanks!

Chris Pepper

Oct 7, 2013, 10:34:22 AM
to isilon-u...@googlegroups.com
Jeff,

Do your switches support MLAG (Multi-Chassis Link Aggregation), or would you have to connect each node to a single switch, giving up switch diversity, to aggregate with LACP?

Are you serving NFS, SMB, or both? Our SMB pools do *not* use dynamic failover, but our NFS pools do use dynamic failover.

Chris

Jeff

Oct 7, 2013, 10:53:26 AM
to isilon-u...@googlegroups.com
Hi Chris, 

We are serving NFS only at this point, which led to my query to start with.  The automounted systems seem to survive changes in the service much better than the hard mounted systems.  I guess that is to be expected; however, our RX has been introducing some errors and retries that have been causing some heartburn as well.

Andrew Stack

Oct 7, 2013, 1:17:24 PM
to isilon-u...@googlegroups.com
Hi Chris,

If your uplink switches support vPC, then I encourage you to deploy that to get active-active links on both interfaces.  Also, for NFS only, consider client connection count rather than round robin for your connection policy.  However, set the failback to round robin.

Basically, Isilon needs to mature their load balancing, as the current offerings do not consider node load very well.  There are basic options to try and assess this, but they add overhead and are not recommended.  Long term, Isilon needs to borrow from VMware and get more DRS-like in their load distribution.  Until then, I think client connection count with round robin failback (I learned the hard way about failback being set to this) is your best bet.

Regards,

Andrew S.


--
You received this message because you are subscribed to the Google Groups "Isilon Technical User Group" group.
To unsubscribe from this group and stop receiving emails from it, send an email to isilon-user-gr...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.



--
Andrew Stack
Sr. Storage Administrator
Genentech

Cory Snavely

Oct 7, 2013, 2:15:21 PM
to isilon-u...@googlegroups.com, Andrew Stack
Can you say more about the choice of round robin for failback as opposed
to connection count, or one of the other heuristics? (Actually, do you
mean failover or failback? My 6.5.5.24 clusters don't have a failback
setting.) Any use of round robin seems, to me, likely to result in
arbitrarily unbalanced connections overall.

Chris Pepper

Oct 7, 2013, 2:24:10 PM
to isilon-u...@googlegroups.com, Andrew Stack
Cory,

An optimal load balancing pattern would be based on CPU load or network bandwidth per node, but unfortunately OneFS (certainly through v6.5.5) doesn't track this data sufficiently to make informed balancing choices.

Unfortunately for efficient balancing, NFS connections (especially in HPC clusters) are very long-lived, so we only get to rebalance when a compute node reboots, and OneFS doesn't track historical trends, so it cannot detect that a brand-new connection might be about to impose substantial CPU load, while a long-idle connection is unlikely to suddenly ramp up.

We have a range of 1gbps compute nodes through 20gbps 2*10GE LACP large-memory and head nodes. Connection counting treats those as equivalent.

Round robin is simpler and more robust. If OneFS had more data and intelligence a more sophisticated algorithm could do better, but with the current implementation none of the other options seems to work any better. We saw very poor connection grouping when we tried other options, and Isilon Support recommended Round Robin for all general cases.

Chris

Andrew Stack

Oct 7, 2013, 2:25:56 PM
to Cory Snavely, isilon-u...@googlegroups.com, Andrew Stack
So, if you set the failback to client count, what I've personally observed more than once is this: when you take a node down for maintenance, everything is fine.  The IPs shift to their respective partners, with no NFS disruption.  When it comes back up, all IPs save one from each node in the dynamic pool shift to the returning node, creating a huge imbalance.  I opened a case with Isilon about this and they directed me to use round robin on failback.  From the case notes:

In regards to the IP re-balance failing, listed below is the article discussing the configuration needed to avoid the issue.

SmartConnect doesn't re-balance IP addresses unless round-robin fail-over policy is used
Article Number: 000089845

Both of these issues can be resolved by changing The IP Fail Over Policy to Round-Robin. This would provide a redistribution of IP addresses evenly across the cluster when IPs are moved. Any new connections to the cluster would continue to utilize the Connection Count for distribution.


I still think this is kinda lousy but I'm sharing with the group so that folks do not run across the same issue that I did.

Cheers,

Andrew Stack
Sr. Storage Admin
Genentech



Cory Snavely

Oct 7, 2013, 2:42:06 PM
to Andrew Stack, isilon-u...@googlegroups.com
That's weird. We've never seen that, but it sort of rings a bell as a
bug that was addressed. I looked through the 6.5.5 and 6.0 release notes
but couldn't find anything. FWIW, load balancing on connection count
has worked well for us for quite some time.

Chris Pepper

Oct 7, 2013, 2:42:59 PM
to isilon-u...@googlegroups.com, Andrew Stack
In the normal case it works. But there are many broken edge cases.

Chris

Jeff

Oct 7, 2013, 3:56:08 PM
to isilon-u...@googlegroups.com, Andrew Stack
This is a great discussion and I'm finding it very helpful.  What we experienced may support Andrew's statement even more.  Initially we had three nodes, then added three more, but with no connections other than the 1Gb ones at first (that's a whole different story).  A decision was made to connect the three newest nodes' 10gige ports to a different card on our RX.  Problems began almost immediately: randomly dropped NFS mounts and outages, so bad that we had to back the new connections out of the pool.  At first it was thought there was something wrong with the new nodes, and after 6 months of battling with our network infra and Isilon, we discovered that the card the second ports were connected to was dropping packets like mad.

Cory Snavely

Oct 7, 2013, 4:05:30 PM
to isilon-u...@googlegroups.com, Jeff, Andrew Stack
That...is also weird.

:D

Peter Serocka

Oct 8, 2013, 3:59:17 AM
to Andrew Stack, Cory Snavely, isilon-u...@googlegroups.com
Any load-based "balancing" (= at mount time) will
point a large number of mounts to the very single node
with the lowest load at that point in time.

Obviously bad for re-balancing, but also when many fresh mounts
are made within a few seconds. Like from numerous nodes of a
compute cluster, which often use an automounter rather
than mounting at system start.

Round-robin is really the safest way. It is suggested to overprovision
the number of IP addresses (have several IPs per physical interface!),
so you have some (statistical) headroom for manual or semi-automatic
rebalancing of the cluster when the load distribution becomes too uneven.

-- Peter
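
A rough sketch of why several IPs per interface give that headroom. It is a toy model, not OneFS behavior: it assumes every IP carries a similar share of clients and that a failed node's IPs are dealt out one at a time to the survivors. The function names are made up for illustration.

```python
from collections import Counter


def ips_after_failover(num_nodes: int, ips_per_node: int, failed_node: int = 0) -> Counter:
    """Count IPs per surviving node after one node's IPs are dealt out round-robin.

    Toy model (assumption): each IP carries a similar client load, and the
    failed node's IPs are handed to the survivors one at a time.
    """
    survivors = [n for n in range(num_nodes) if n != failed_node]
    counts = Counter({n: ips_per_node for n in survivors})
    for i in range(ips_per_node):
        counts[survivors[i % len(survivors)]] += 1
    return counts


def worst_case_increase(num_nodes: int, ips_per_node: int) -> float:
    """Largest relative growth in IP count on any surviving node."""
    counts = ips_after_failover(num_nodes, ips_per_node)
    return max(counts.values()) / ips_per_node - 1.0


if __name__ == "__main__":
    for ips in (1, 2, 5):
        print(f"{ips} IP(s) per node: worst survivor carries "
              f"{worst_case_increase(6, ips):.0%} more IPs")
```

For a 6-node pool in this model, one IP per node means a single survivor takes on double its share after a failure, while five IPs per node spread the extra load so that no survivor grows by more than a fifth.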



Peter Serocka
CAS-MPG Partner Institute for Computational Biology (PICB)
Shanghai Institutes for Biological Sciences (SIBS)
Chinese Academy of Sciences (CAS)
320 Yue Yang Rd, Shanghai 200031, China





Cory Snavely

Oct 8, 2013, 8:58:11 AM
to Peter Serocka, Andrew Stack, isilon-u...@googlegroups.com
Makes sense, assuming the load calculation is not happening frequently
enough - one would think it would be near-real-time, but apparently not
- and so the response to the SmartConnect query is based on stale
information. Interesting.

Chris Pepper

Oct 8, 2013, 9:59:34 AM
to isilon-u...@googlegroups.com, Peter Serocka, Andrew Stack
Cory,

I believe it is exactly real-time, but imagine you have 3 nodes balancing on CPU load, with a bunch of clients almost evenly distributed. Now 5 new clients reboot and they each need 3 mounts; they might all connect to the least-busy node, because none of them has generated any traffic (except the mount calls) yet.

SmartConnect has no foresight -- it cannot realize that each of the new connections **is going to** bring some amount of load in the very near future.

Chris

Cory Snavely

Oct 8, 2013, 10:06:09 AM
to isilon-u...@googlegroups.com, Chris Pepper, Peter Serocka, Andrew Stack
Right, that explains the behavior for that heuristic, but connection
count acting similarly would seem to result from something like what
I'm suggesting, right? That's why I'd concluded the metering was
apparently not near-real-time, or not near-real-time enough.

Saker Klippsten

Oct 8, 2013, 10:14:55 AM
to isilon-u...@googlegroups.com, Chris Pepper, Peter Serocka, Andrew Stack
We discovered some sort of cache limit in SC. When you have hundreds of nodes mounting at the same time (in our case, render nodes mount the cluster when given a job to work on, not at boot-up), SmartConnect would not evenly distribute the nodes across the cluster in a RR fashion. We have since added a staggered-start feature of 1-3 seconds so round robin works properly. Before that, we would sometimes end up with 100+ nodes on a single Isilon cluster node.

-saker
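
A staggered start like the one described above could look roughly like this on the client side. Only the 1-3 second window comes from the post; the helper names, the export path, and the mount point are placeholders, and this is not the actual render-farm code.

```python
import random
import subprocess
import time
from typing import List, Optional


def stagger_delay(min_s: float = 1.0, max_s: float = 3.0,
                  rng: Optional[random.Random] = None) -> float:
    """Pick a random delay so that clients do not all query SmartConnect at once."""
    rng = rng or random.Random()
    return rng.uniform(min_s, max_s)


def staggered_mount(source: str, mountpoint: str, dry_run: bool = True) -> List[str]:
    """Build the NFS mount command; unless dry_run, sleep a random interval and run it.

    Hypothetical helper: with dry_run=True nothing is executed and the command
    is only returned for inspection.
    """
    cmd = ["mount", "-t", "nfs", source, mountpoint]
    if not dry_run:
        time.sleep(stagger_delay())
        subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    # Placeholder SmartConnect name, export path, and mount point.
    print(staggered_mount("nfs1.isilon.example.com:/ifs", "/mnt/isilon"))
```

Drawing the delay at random per client, rather than using a fixed offset, avoids having to coordinate between render nodes.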

Peter Serocka

Oct 8, 2013, 10:35:45 AM
to Saker Klippsten, isilon-u...@googlegroups.com, Chris Pepper, Andrew Stack
That's… Have you checked the time-to-live?

$ dig nfs1.isilon.example.com




;; ANSWER SECTION:
nfs1.isilon.example.com. 0 IN A ###.###.###.###

The zero here -----------^
is the time-to-live (in seconds) of this SmartConnect DNS record;
and therefore this record should never be cached...

But yeah, there are too many instances in the path that might do
some caching against the rules: the organization's DNS server,
the client OS, the app (= the automounter in this case).
They would all need to be properly configured to behave conformingly.

-- Peter
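
To repeat this check on many clients, one could save the dig output from each and pull the TTL out of the answer section. The sketch below only parses text; the function name and the sample address (a documentation-range placeholder) are assumptions. A non-zero TTL seen on a client would hint that something in the path is rewriting or caching the record.

```python
from typing import List, Tuple


def answer_ttls(dig_output: str) -> List[Tuple[str, int]]:
    """Return (name, ttl) for each A record in the ANSWER SECTION of dig output."""
    records = []
    in_answer = False
    for line in dig_output.splitlines():
        line = line.strip()
        if line.startswith(";; ANSWER SECTION"):
            in_answer = True
            continue
        if in_answer:
            if not line or line.startswith(";;"):
                break  # a blank line or the next section header ends the answers
            fields = line.split()
            # dig prints each record as: name ttl class type rdata
            if len(fields) >= 5 and fields[3] == "A":
                records.append((fields[0], int(fields[1])))
    return records


# Sample input with a placeholder address, shaped like the output quoted above.
sample = """\
;; ANSWER SECTION:
nfs1.isilon.example.com. 0 IN A 192.0.2.10
"""

if __name__ == "__main__":
    for name, ttl in answer_ttls(sample):
        status = "ok" if ttl == 0 else "may be cached for %d s" % ttl
        print(name, ttl, status)
```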

Saker Klippsten

Oct 8, 2013, 10:38:54 AM
to Peter Serocka, isilon-u...@googlegroups.com, Chris Pepper, Andrew Stack
Yeah, we had T3 support on it for 2 months, running tests and checking client and server DNS. Could not get any lower.

Saker Klippsten | CTO | Zoic Studios
310-838-0770 o
310-202-2063 d
310-350-3854 c

Jerry Uanino

Oct 12, 2013, 10:12:47 AM
to isilon-u...@googlegroups.com, Peter Serocka, Chris Pepper, Andrew Stack
I had similar problems to this.  We have 900+ machines, so on a mass patch weekend or mass event, we'd end up horribly distributed.  Round Robin ended up working better given all the possible scenarios where weird stuff could happen.

Luc Simard

Oct 12, 2013, 10:54:26 AM
to isilon-u...@googlegroups.com, Peter Serocka, Chris Pepper, Andrew Stack
Use connection counting, but with round robin fail over

Luc Simard - 415-793-0989

Peter Serocka

Oct 12, 2013, 11:12:08 AM
to Luc Simard, isilon-u...@googlegroups.com, Chris Pepper, Andrew Stack
The problem is the small time gap between the DNS query (which does not increase
the connection count) and the actual mount, when dozens or hundreds of clients try
to connect within a second or a few. Round robin is totally different and
avoids any bias. Worst case can be a "random" distribution rather than
perfect balance, still better than heavy skews.

-- Peter
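
The time-gap effect can be illustrated with a small simulation. This is a toy model under stated assumptions, not SmartConnect's actual algorithm: each DNS answer names the node with the fewest established connections, a client only shows up in that count `gap` seconds after its query, and ties go to the lowest-numbered node. All names and numbers are illustrative.

```python
from typing import List


def connection_count_policy(num_nodes: int, num_clients: int,
                            spacing: float, gap: float) -> List[int]:
    """Clients per node when each DNS answer names the node with the fewest
    established connections; a client is only counted `gap` seconds after
    its query (assumed model)."""
    established = [0] * num_nodes
    pending = []  # (time the mount completes, node)
    assigned = [0] * num_nodes
    for i in range(num_clients):
        now = i * spacing
        # Mounts that have completed by now become visible to the balancer.
        still_pending = []
        for done_at, node in pending:
            if done_at <= now:
                established[node] += 1
            else:
                still_pending.append((done_at, node))
        pending = still_pending
        target = established.index(min(established))  # ties go to the lowest node
        assigned[target] += 1
        pending.append((now + gap, target))
    return assigned


def round_robin_policy(num_nodes: int, num_clients: int) -> List[int]:
    """Clients per node when answers simply cycle through the nodes."""
    assigned = [0] * num_nodes
    for i in range(num_clients):
        assigned[i % num_nodes] += 1
    return assigned


if __name__ == "__main__":
    print("burst, connection count:", connection_count_policy(6, 120, spacing=0.01, gap=2.0))
    print("paced, connection count:", connection_count_policy(6, 120, spacing=3.0, gap=2.0))
    print("burst, round robin:     ", round_robin_policy(6, 120))
```

In this model, 120 clients arriving 10 ms apart all land on one node under connection counting, because no mount has completed by the time the last query is answered. Spacing the queries further apart than the gap, or using round robin, gives 20 clients per node.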

Gumar K

Nov 4, 2013, 4:29:06 PM
to isilon-u...@googlegroups.com
Along the lines of 10gige best practices, is there any issue with using 10gige NICs of different node types in the same SmartConnect zone? Though the 10gige cards may be the same model on all nodes, 10gige performance could vary between node types, as the node cache, CPU, and bus speed may differ between X200 and X400 nodes. I'm not able to find a best practice guide for this situation. Any recommendations from Isilon on this?
