
GBE as Cluster Interconnect


Lars Holmström

Aug 6, 2003, 4:07:02 PM
Can anyone share some experience with using Gigabit Ethernet/FC as the
cluster interconnect?
In particular, which GBE switches to use and which ones to avoid, and what
performance differences between GBE switches may affect the cluster.

I would also like to learn something about recovery time for a cluster with
larger separation (10-50 km) running GBE/FC as the ISL, in particular for
SCS failover to GBE in a recovery situation.

Lastly, I may replace FDDI in a cluster with 3 identical sites with
Gigaswitches and a fourth quorum site (there is a particular reason for this
setup). The quorum node has 3 FDDI adapters, one for each of the remote
sites. Does anyone have experience to share, first on how to migrate to
GBE while keeping the services up, and second on what to consider from a
design perspective?

/Lars


Keith Parris

Aug 6, 2003, 10:47:23 PM
"Lars Holmström" <lars.ho...@flysta.net> wrote in message news:<3f315fe3$1...@news.wineasy.se>...

> Can anyone share some experience with using Gigabit Ethernet/FC as the
> cluster interconnect?

Lots of customers these days are moving from CI- (HSJ controller) or
SCSI-based (HSZ controller) storage to Fibre Channel storage, with a
SAN as the Storage Interconnect and Gigabit Ethernet (or Fast
Ethernet) as the Cluster Interconnect instead of CI (or DSSI), so
you're not alone.

When I helped build E*Trade's 2nd disaster-tolerant cluster a few
years ago, we put in one rail of FDDI (GIGAswitch/FDDI) and one rail
of GbE (Cisco 6509s) in parallel, as GbE was fairly new at the time.
A quick scan of my (woefully incomplete) directory of known
disaster-tolerant VMS cluster sites shows at least 10 VMS DT cluster
sites known to be using GbE as a cluster interconnect.

Note that Fibre Channel doesn't yet carry inter-node cluster (SCS)
traffic; it carries only the SCSI-3 protocol for storage. So at this
point it can only be a Storage Interconnect, not a Cluster Interconnect
(and so, unlike CI and DSSI, it can't do both at the same time).

Also, at this point (7.3-1) all interrupts for LAN adapters go to the
Primary CPU in an SMP system, so if you're using CI now with SMP
systems with Fast_Path enabled, you could risk Primary CPU
interrupt-state saturation if you switched over to using Gigabit
Ethernet or another LAN as the cluster interconnect (or switched to
Memory Channel, for that matter, as it also lacks Fast_Path support).
Fast_Path support for PEDRIVER and LANs is slated for 7.3-2, which is
in Field Test now.
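
If you want a rough idea of how much headroom the Primary CPU has before
you commit to a LAN interconnect, something like the following sketch
shows where interrupt time is going and whether Fast_Path is enabled
(FAST_PATH and FAST_PATH_PORTS are real SYSGEN parameters; the exact
MONITOR display layout varies a bit between versions):

$ ! Watch processor-mode distribution for a while; a consistently high
$ ! "Interrupt State" percentage on the Primary CPU is the warning sign.
$ ! (Per-CPU breakdowns are available in MONITOR on recent versions;
$ ! check HELP MONITOR MODES for the qualifier on yours.)
$ MONITOR MODES/INTERVAL=5
$ !
$ ! Check whether Fast_Path is enabled and which port types it covers:
$ RUN SYS$SYSTEM:SYSGEN
SYSGEN> SHOW FAST_PATH
SYSGEN> SHOW FAST_PATH_PORTS
SYSGEN> EXIT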

Verell Boaen of VMS Engineering has done tests of different cluster
interconnects and their latency, bandwidth, and host CPU overhead,
with the results presented most recently at HP-ETS in St. Louis
(there's an older version from DFW Days on the http://vmsone.com/
website). It would be worthwhile to get a copy of his presentation.
(I can send you one via e-mail; just remove _NOSPAM from my address
above when you request it.) Verell will be presenting this at the VMS
Advanced Technical Bootcamp in November also, if you can arrange to
attend that.

There are two Gigabit Ethernet host adapters out now, the older,
slower (for lock requests), but proven DEGPA, and the newer, faster
(for lock requests), but less-mature DEGXA.

Do you do MSCP-serving across FDDI today, with NISCS_MAX_PKTSZ set to
4474 to take advantage of large packets? If so, look carefully at
Jumbo Packet support on GbE, and be careful: some vendors claim they
support Jumbo Packets if they handle ANYTHING over 1498 bytes, not
necessarily the full size the standard allows.
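
As a sketch of the parameter side (NISCS_MAX_PKTSZ is the real SYSGEN
parameter; the note about a LAN_FLAGS bit for jumbo frames is from memory,
so check your release notes before touching it):

$ ! Check the SCS packet size PEDRIVER uses over the LAN and raise it to
$ ! the FDDI/jumbo size of 4474. Not a dynamic parameter, so it takes
$ ! effect at the next reboot; add it to MODPARAMS.DAT as well so AUTOGEN
$ ! preserves it. (Jumbo frames on the GbE adapter itself are said to be
$ ! enabled via a LAN_FLAGS bit -- verify that against your version.)
$ RUN SYS$SYSTEM:SYSGEN
SYSGEN> SHOW NISCS_MAX_PKTSZ
SYSGEN> SET NISCS_MAX_PKTSZ 4474
SYSGEN> WRITE CURRENT
SYSGEN> EXIT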

And if you are doing MSCP-serving now, consider that the addition of
inter-site Fibre Channel links would allow direct access to storage,
relegating MSCP-serving to a backup path in the event of an FC failure.

But if you can only have a limited number of inter-site links due to
cost (like two between a pair of sites), it may be better to have two
dual-purpose SCS links than to have one SCS link and one FC link and
either be unable to continue use of both of those sites (if the sole
SCS link fails) or to have to fall back to MSCP-serving over the SCS
link (if the sole FC link fails).

> In particular, which GBE switches to use and which ones to avoid, and what
> performance differences between GBE switches may affect the cluster.

My preference is the switches from Digital Network Products Group
(http://dnpg.com/). If you've had good results with your
GIGAswitch/FDDI boxes, consider dnpg's Gigabit Ethernet boxes, built
with the same design philosophy.

I worry about switch OS's that are designed with IP as their primary
purpose in life, rather than bridging -- with IP, dropping small
multicast packets can actually be helpful in resource-starved
conditions (like ARP broadcast storms). PEDRIVER's Hello packets are
small multicast packets, and if a lot of them get dropped, nodes can
leave the cluster with CLUEXIT bugchecks.
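
One way to see whether a switch is eating PEDRIVER traffic is to look at
PEDRIVER's own channel state and counters with SCACP; a sketch (the basic
SHOW commands are standard on 7.3-1, but check HELP in SCACP for the exact
counter qualifiers on your version):

$ ! PEDRIVER's view of each LAN adapter, each channel (adapter-to-adapter
$ ! path), and each virtual circuit (node-to-node). Channels that bounce
$ ! between open and closed, or steadily rising error counts, suggest
$ ! dropped Hello or sequenced packets in the switch.
$ MCR SCACP
SCACP> SHOW LAN_DEVICE
SCACP> SHOW CHANNEL
SCACP> SHOW VC
SCACP> EXIT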

Newer Ciscos seem to work fine, although I would avoid the low-end
(e.g. Catalyst 4000 series) boxes (which have collision domains across
multiple ports) if you can and stick to the ones with a separate
collision domain per port (e.g. 6500 series), if you want to avoid
congestion.

One way to measure a switch's performance as a cluster interconnect is
to time actual lock requests through that switch. You can use the
LOCKTIME tool originally written by Roy G. Davis, author of VAXcluster
Principles -- see http://encompasserve.org/~parris/locktime*.* By
enabling and disabling different LAN adapters using SCACP, you can
measure the performance of each interconnect separately. Test your
existing infrastructure first, and don't accept new technology that
can't do better than that, especially under heavy load. I found some
pretty good results for GbE on ES45s recently -- 200 microseconds
latency through a Cisco switch, or 140 microseconds through a
cross-over cable, which is very close to the 120 microseconds I
measured for Memory Channel 2.
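
The per-interconnect measurement works out to: stop PEDRIVER's use of every
adapter except the one under test, run LOCKTIME against a resource mastered
on a remote node, then rotate. A sketch, with EWA0 (GbE) and FWA0 (FDDI) as
placeholder device names for your own adapters:

$ ! Never stop the last remaining adapter, or the node will lose its
$ ! cluster connection and CLUEXIT.
$ MCR SCACP
SCACP> SHOW LAN_DEVICE            ! adapters PEDRIVER is currently using
SCACP> STOP LAN_DEVICE FWA0       ! leave only the GbE rail carrying SCS
SCACP> EXIT
$ ! ... run the LOCKTIME test here ...
$ MCR SCACP
SCACP> START LAN_DEVICE FWA0      ! restore the FDDI rail
SCACP> STOP LAN_DEVICE EWA0       ! now measure FDDI by itself
SCACP> EXIT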

> I would also like to learn something about recovery time for a cluster with
> larger separation (10-50 km) running GBE/FC as the ISL, in particular for
> SCS failover to GBE in a recovery situation.

As noted above, FC cannot serve as a backup SCS link yet (but that's
in the Roadmap as a future feature).

So I'm confused about the question, and I'm thinking that the only
failover possible today would be if the Fibre Channel link failed, and
the storage traffic failed over to MSCP (which can happen provided you
are running OpenVMS version 7.3-1 and above, which has the ability to
fail over between direct FC and MSCP-served paths).
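
For what it's worth, a sketch of what to check on 7.3-1 before relying on
that failover (MPDEV_ENABLE and MPDEV_REMOTE are the relevant SYSGEN
parameters as I recall them; $1$DGA100 is a placeholder device name):

$ ! MPDEV_ENABLE turns on multipath support; MPDEV_REMOTE lets an
$ ! MSCP-served path join a multipath set as the backup for the
$ ! direct Fibre Channel paths.
$ RUN SYS$SYSTEM:SYSGEN
SYSGEN> SHOW MPDEV_ENABLE
SYSGEN> SHOW MPDEV_REMOTE
SYSGEN> EXIT
$ ! Then confirm the FC disks actually show an MSCP-served path:
$ SHOW DEVICE/FULL $1$DGA100: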

If you have a single extended LAN between sites, Spanning Tree
reconfiguration times can be an issue that requires running with the
RECNXINTERVAL parameter set to a value higher than the default 20
seconds. With multiple, completely separate LANs, you could use the
default value.
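
RECNXINTERVAL is a dynamic parameter, so it can be raised on the running
cluster, node by node; a sketch (60 seconds is only an example -- size it
to your measured Spanning Tree reconfiguration time):

$ ! WRITE ACTIVE applies the change immediately; also put the value in
$ ! MODPARAMS.DAT so AUTOGEN doesn't set it back later.
$ RUN SYS$SYSTEM:SYSGEN
SYSGEN> USE ACTIVE
SYSGEN> SET RECNXINTERVAL 60
SYSGEN> WRITE ACTIVE
SYSGEN> EXIT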

Or by "recovery" did you mean to talk about shadow full-copy or
full-merge times? With Fibre Channel storage, you will get full
merges instead of the mini-merges you may be used to with CI-based
(HSJ) storage (until the successful completion of the host-based
mini-merge project which is underway in VMS Engineering at present).
I did a presentation on shadow copies and merges at HP-EUW in
Amsterdam -- see http://www2.openvms.org/kparris/

> Lastly, I may replace FDDI in a cluster with 3 identical sites with
> Gigaswitches and a fourth quorum site (there is a particular reason for this
> setup). The quorum node has 3 FDDI adapters, one for each of the remote
> sites. Does anyone have experience to share, first on how to migrate to
> GBE while keeping the services up, and second on what to consider from a
> design perspective?

This cluster sounds intriguing, and I'd love to learn more about it.

Since PEDRIVER can handle multiple LAN paths, one common approach in
migrating between interconnects is to leave the existing LAN in place,
and install the new interconnect as a second LAN, with LAN adapters of
both types in the systems during the transition. The most risk-free
approach here would be to initially lower the priority of the new
interconnect so the cluster continues to use the old interconnect
while you evaluate the operation and stability of the new
interconnect. Next, turn the new interconnect on in parallel with the
old, at the same priority. Later, lower the priority of the older
interconnect. Finally, remove the old interconnect. Allow sufficient
testing time at each stage before you progress to the next.
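
In SCACP terms the phased approach might look roughly like this on each
node (EWA0 and FWA0 are placeholder device names; the /PRIORITY qualifier
on LAN devices is, as far as I recall, tied to the 7.3-2 PEdriver work, so
on earlier versions you may only be able to START and STOP devices rather
than weight them):

$ ! Phase 1: new GbE adapter installed but de-preferred; FDDI still
$ !          carries the SCS traffic while you watch the new rail.
$ MCR SCACP
SCACP> SET LAN_DEVICE EWA0/PRIORITY=-1    ! if your version supports it
SCACP> SHOW CHANNEL                       ! confirm channels form over EWA0
SCACP> EXIT
$ ! Phase 2: set both adapters to the same priority and run in parallel.
$ ! Phase 3: de-prefer FWA0, watch for a while, then STOP LAN_DEVICE FWA0
$ !          and pull the old adapter.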

If a particular node doesn't have enough PCI slots to double-up on
adapters (say, perhaps your quorum node here), see if you can add one
adapter and transition one link at a time to the new technology,
following the strategy above.
