When setting up an iSCSI server & client, what are the good/bad of the
various bonding modes? I'm interested in both sides of the equation
(both the target and the initiator) and which modes are preferable.
Here's what I think I know... (please correct if I'm wrong):
mode 0 (balance-rr) a.k.a. round-robin
- requires switch support (where the ports are configured as an
"etherchannel", "trunk group", or some other name)
- can cause issues with TCP resulting in out-of-order packets, which
likely reduces overall throughput
- not sure how well packets would flow back to the destination system
mode 1 (active-backup)
- doesn't really get you any performance, it's mainly for
high-availability (so I'm not really interested in that mode)
- requires switch support (?)
mode 2 (balance-xor)
- similar restrictions to mode 0 (requires switch support)
mode 3 (broadcast)
- packets go out across all interfaces (high-availability, not performance)
mode 4 (802.3ad)
- a standard way of grouping ports and NICs together
- not sure how well it performs compared to the other options
mode 5 (balance-tlb)
- does not require special switch support
- outbound traffic bandwidth scales across all NICs, but inbound traffic
only flows across a single NIC
mode 6 (balance-alb)
- does not require special switch support
- balances both inbound / outbound traffic over all NICs
...
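For concreteness, the kind of configuration I'm talking about looks
roughly like this on a CentOS/RHEL-style box (the mode, interface names
and addresses below are just placeholders, not a recommendation):

  # /etc/modprobe.conf (pick the mode based on this discussion)
  alias bond0 bonding
  options bonding mode=balance-alb miimon=100

  # /etc/sysconfig/network-scripts/ifcfg-bond0
  DEVICE=bond0
  IPADDR=192.168.10.10
  NETMASK=255.255.255.0
  ONBOOT=yes
  BOOTPROTO=none

  # /etc/sysconfig/network-scripts/ifcfg-eth0 (and likewise for eth1-eth3)
  DEVICE=eth0
  MASTER=bond0
  SLAVE=yes
  ONBOOT=yes
  BOOTPROTO=none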
So, for a client & server, both connected to a switch. Both client &
server units have 4 NICs that can be bonded together. Which of these
configurations will result in the best bandwidth on gigabit ethernet?
It seems like the best choice is either going to be 802.3ad or
balance-alb. But I also see a lot of comments that the top performance
for any one task / application / server(?) talking across the wire is
going to be limited to the bandwidth of a single NIC.
When does the "single NIC" limit come into play when dealing with iSCSI?
If we have a mail server that stores its files on an iSCSI target,
will that server only be able to get a single NIC's worth of performance?
Obviously, for redundancy reasons, 2 NICs is probably the minimum that
you'd want to have on a client (bonded together, probably connected to
different switches). But will a client benefit from having 4 NICs? Or
will it only benefit if you're talking to multiple server units?
> Here's what I think I know... (please correct if I'm wrong):
>
> mode 0 (balance-rr) a.k.a. round-robin
> - requires switch support (where the ports are configured as an
> "etherchannel", "trunk group", or some other name)
That's wrong, you don't need special switch support for mode 0.
>
> mode 1 (active-backup)
> - doesn't really get you any performance, it's mainly for
> high-availability (so I'm not really interested in that mode)
> - requires switch support (?)
No switch support required either.
> mode 2 (balance-xor)
> - similar restrictions to mode 0 (requires switch support)
No switch support required either. This is the mode we are using, works
perfectly with open-iscsi.
> mode 4 (802.3ad)
> - a standard way of grouping ports and NICs together
> - not sure how well it performs compared to the other options
Never tried it, but LACP might have its own issues. Needs switch support.
> mode 5 (balance-tlb)
> - does not require special switch support
> - outbound traffic bandwidth scales across all NICs, but inbound traffic
> only flows across a single NIC
No special switch support required, but we have had bad experiences with
it: many, many dropped connections and so on.
> So, for a client & server, both connected to a switch. Both client &
> server units have 4 NICs that can be bonded together. Which of these
> configurations will result in the best bandwidth on gigabit ethernet?
Best bandwidth might be mode 4, but I'd use mode 2 if you are talking to
at least two Ethernet destinations.
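The reason the number of destinations matters: if I remember bonding.txt
correctly, balance-xor picks the outgoing slave from a hash of source and
destination MAC by default, so all traffic to a single peer sticks to one
NIC. On kernels that support it you can switch to a layer3+4 hash to
spread individual TCP connections instead (treat the line below as a
sketch, not a tested config):

  options bonding mode=balance-xor xmit_hash_policy=layer3+4 miimon=100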
> It seems like the best choice is either going to be 802.3ad or
> balance-alb. But I also see a lot of comments that the top performance
> for any one task / application / server(?) talking across the wire is
> going to be limited to the bandwidth of a single NIC.
That's right, only modes 0 and 4 will allow more than 1 gbit/s to the same
destination. But do you really need this?
> When does the "single NIC" limit come into play when dealing with iSCSI?
> If we have a mail server that stores its files on an iSCSI target,
> will that server only be able to get a single NIC's worth of performance?
Well, today's onboard NICs will do 950 mbit/s without a problem. Do you
think your mail server will have a sustained transfer rate of 950 mbit/s
and above? That's nearly 120 mbyte/s, which is not realistic with many
small files like mail.
> Obviously, for redundancy reasons, 2 NICs is probably the minimum that
> you'd want to have on a client (bonded together, probably connected to
> different switches). But will a client benefit from having 4 NICs? Or
> will it only benefit if you're talking to multiple server units?
We have clients and servers running both setups: 2 NICs bonded and VLAN'd
for LAN and SAN, and 2 x 2 NICs bonded separately for LAN and SAN. 4 NICs
are better, but if your SAN bandwidth needs are not very high, you can
use 2 NICs and VLANs. It does work.
Regards,
Bjoern
-----Original Message-----
From: open-...@googlegroups.com [mailto:open-...@googlegroups.com] On
Behalf Of Björn Metzdorf
Sent: Thursday, May 24, 2007 2:52 AM
To: open-...@googlegroups.com
Subject: Re: iSCSI and NIC bonding modes?
Hello Thomas,
> mode 4 (802.3ad)
> - a standard way of grouping ports and NICs together
> - not sure how well it performs compared to the other options
Never tried it, but LACP might have its own issues. Needs switch support.
----------------
Not tried iscsi over it, but we've had issues with the following combo in
general:
Centos 4.[45]
Linksys SRW2048 switches and Cisco 3550-12T
Intel Gig on a chip NICs embedded on the motherboard
They would periodically drop a link or two in the bundle. No amount of
coaxing with replugging the cables would get the full bundle to come back
up. The only way to fix it (for a while) was to reboot the switch, then the
bundle was back. Also, when the bundle went haywire, odd things happened.
Some IPs were reachable, some weren't, even though ARP worked perfectly.
We compiled and installed the latest Intel driver from their website and
it's rock solid now.
Just an FYI if you go the LACP route (no pun intended).
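If you want to check which e1000 driver revision you're actually running
before chasing this kind of problem, the standard tools are enough (the
interface name is just an example):

  ethtool -i eth0          # driver name, version, firmware, bus info
  modinfo e1000 | grep -i version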
Cheers,
EKG
I believe that iSCSI MPIO is a better choice than leaving the load balancing of the
traffic to the network layer, especially if both initiator and target support multi-connection
MPIO (so both sides should be able to distribute incoming and outgoing traffic among all
available paths). I suppose it does a round robin based on how many bits have flowed
through a given NIC. Anyway, for any given traffic flow I think it will only use one NIC,
to avoid out-of-order packets. Maybe someone can confirm this...
For example, in 802.3ad many switches establish a relation between SRC address
and DST address, so if you connect all your iSCSI disks to the same iSCSI target IP (portal),
there's a chance that all the traffic will go through the same wire. Likewise, if you have
an uneven number of iSCSI disks connected to 3 different target IPs to make 802.3ad load
balance the traffic, and one disk is under heavy load, one or more of the others will suffer
reduced performance because they would share the same wire.
Anyway, 1 Gbps can give you real transfer around the 80 MByte/s mark with almost
any current NIC and server. That's enough for many environments. If you're planning to
connect a lot of servers, the bottleneck will be in the disk subsystem or even the system
PCI bus, which has to handle all the traffic related to the iSCSI target and the data located
on its disks. For a mail server, try to optimize I/Os, as you don't need to move a lot of
data but rather do many things to many files at the same time.
Regards,
Victor.
Victor Rodriguez Cortes wrote:
>
> Hi!
>
> I believe that iSCSI MPIO is a better choice than leaving the load
> balancing of the traffic to the network layer, especially if both
> initiator and target support multi-connection MPIO (so both sides
> should be able to distribute incoming and outgoing traffic among all
> available paths).
I sort of understand the concept of multi-path, but I need to sit down
and bone up on it. It sounds like you set up the SAN unit with multiple
IP addresses on each NIC and connect each NIC to a different switch.
You then advertise the iSCSI disk as being on a particular set of IP
addresses? (My nomenclature may be a bit off.)
So from a network design point, I should plan on a few dozen IP
addresses for each SAN unit?
And if I go the multiple IP address route for a particular iSCSI target
disk, there's less need to worry about LACP or even bonding?
> For example, in 802.3ad many switches establish a relation between SRC
> address and DST address, so if you connect all your iSCSI disks to the
> same iSCSI target IP (portal), there's a chance that all the traffic
> will go through the same wire. Likewise, if you have an uneven number
> of iSCSI disks connected to 3 different target IPs to make 802.3ad
> load balance the traffic, and one disk is under heavy load, one or
> more of the others will suffer reduced performance because they would
> share the same wire.
So we have (3) iSCSI disks defined (X, Y and Z) on the same IP address
and being used by the same source machine. In the case of a 2-wire LACP
bundle, you're saying that traffic might flow as?:
X & Y & Z -> wire #1 in the bundle
(nothing) -> wire #2 in the bundle
> Anyway, 1 Gbps can give you real transfer around the 80 MByte/s mark
> with almost any current NIC and server. That's enough for many
> environments.
That's good to know, I was thinking that it would be more around 20-30
MByte/sec with regular frame sizes. Which would seem a little low for
our needs. With ~80MB/sec over a single cable, I'm less worried about
traffic being spread across multiple cables for a single source/target pair.
How much does iSCSI benefit from jumbo frames (~9000 bytes)? What sort
of throughput will you see with regular frame sizes? What is the exact
frame size that you configure?
> If you're planning to connect a lot of servers, the
> bottleneck will be in the disk subsystem or even the system PCI bus,
> which has to handle all the traffic related to the iSCSI target and
> the data located on its disks.
Not really particular to iSCSI, but...
Our startup unit will be a 10-disk RAID10 on top of 500GB SATAs (with
two hot-spares attached to the 12-port controller). Fairly inexpensive
to get started, should provide modest performance, and we'll eventually
upgrade to bigger iron once we saturate this. The underlying hardware
would be a newer PCIe motherboard.
My back-of-envelope estimate is that the RAID array could provide burst
bandwidth of around 200 MB/s, but probably only 60-90 MB/s sustained
under heavy loads. So with those numbers I was trying to determine
whether we would saturate 2 NICs or 4 NICs.
We're planning on using dual-port Intel PRO/1000 PCIe NICs across the
board. They're not very expensive and I'd expect them to be rock-solid
and have good Linux support.
The quad-port PCIe PRO/1000 NICs might be something we use down the road
if I need additional bandwidth.
I was considering supplementing the dual-port Intel PCIe NIC with the
pair of onboard gigabit ports (Marvell 88E1116 single port Gigabit
Ethernet PHY w/ TOE). That would give me (4) ports to connect to the
SAN switch(es) instead of just two.
(I'm doing all of this myself rather than using pre-built units because
it's a learning experience.)
I was looking at the bonding.txt file (section 5) to come up with that.
I guess the word "generally" in that section means "sometimes but not
in all cases"?
>> mode 5 (balance-tlb)
>> - does not require special switch support
>> - outbound traffic bandwidth scales across all NICs, but inbound traffic
>> only flows across a single NIC
>
> No special switch support required, but we have had bad experiences with
> it: many, many dropped connections and so on.
Anyone else have issues with balance-tlb (or balance-alb)? Is this a
common issue with iSCSI connections and those two modes?
>
>> So, for a client & server, both connected to a switch. Both client &
>> server units have 4 NICs that can be bonded together. Which of these
>> configurations will result in the best bandwidth on gigabit ethernet?
>
> Best bandwidth might be mode 4, but I'd use mode 2 if you are talking
> to at least two Ethernet destinations.
>
>> It seems like the best choice is either going to be 802.3ad or
>> balance-alb. But I also see a lot of comments that the top performance
>> for any one task / application / server(?) talking across the wire is
>> going to be limited to the bandwidth of a single NIC.
>
> That's right, only modes 0 and 4 will allow more than 1 gbit/s to the same
> destination. But do you really need this?
That depends on how much data you can actually push across iSCSI over a
single gigabit link.
> Well, today's onboard NICs will do 950 mbit/s without a problem. Do you
> think your mail server will have a sustained transfer rate of 950 mbit/s
> and above? That's nearly 120 mbyte/s, which is not realistic with many
> small files like mail.
That's a lot higher than I expected. I'm assuming that requires jumbo
frame support to get close to the theoretical maximum? Or does iSCSI
not really care about smaller frame sizes?
>> Obviously, for redundancy reasons, 2 NICs is probably the minimum that
>> you'd want to have on a client (bonded together, probably connected to
>> different switches). But will a client benefit from having 4 NICs? Or
>> will it only benefit if you're talking to multiple server units?
>
> We have clients and servers running both setups: 2 NICs bonded and VLAN'd
> for LAN and SAN, and 2 x 2 NICs bonded separately for LAN and SAN. 4 NICs
> are better, but if your SAN bandwidth needs are not very high, you can
> use 2 NICs and VLANs. It does work.
My plan is to completely separate the SAN traffic onto its own pair of
switches. Our startup system won't be very large and a pair of 48-port
switches will be plenty (with lots of room for growth).
So, like you, our servers would have a minimum of (4) NICs in them, 2
for the SAN connection and 2 for the LAN connection. Would more disk-heavy
servers like a database box need 3 or 4 for the SAN?
EKG
-----Original Message-----
From: open-...@googlegroups.com [mailto:open-...@googlegroups.com] On
Behalf Of Thomas Harold
Sent: Thursday, May 24, 2007 9:47 AM
To: open-...@googlegroups.com
Subject: Re: iSCSI and NIC bonding modes?
>>> mode 5 (balance-tlb)
>>> - does not require special switch support
>>> - outbound traffic bandwidth scales across all NICs, but inbound
>>> traffic only flows across a single NIC
>>
>> No special switch support required, but we have had bad experiences
>> with it: many, many dropped connections and so on.
>
> Anyone else have issues with balance-tlb (or balance-alb)? Is this a
> common issue with iSCSI connections and those two modes?
This issue did not come up with iSCSI only; in general, all connections
were dropped frequently.
>> Well, today's onboard NICs will do 950 mbit/s without a problem. Do
>> you think your mail server will have a sustained transfer rate of 950
>> mbit/s and above? That's nearly 120 mbyte/s, which is not realistic
>> with many small files like mail.
>
> That's a lot higher than I expected. I'm assuming that requires jumbo
> frame support to get close to the theoretical maximum? Or does iSCSI
> not really care about smaller frame sizes?
We use EqualLogic arrays which come with jumbo frames enabled by
default. I assume jumbo frames will help you.
> My plan is to completely separate the SAN traffic onto its own pair of
> switches. Our startup system won't be very large and a pair of 48-port
> switches will be plenty (with lots of room for growth).
>
> So, like you, our servers would have a minimum of (4) NICs in them, 2
> for the SAN connection and 2 for the LAN connection. Would more disk-heavy
> servers like a database box need 3 or 4 for the SAN?
No, you will be fine with 2 NICs for the SAN. When your SAN traffic to a
single host goes over 2 gbit/s, you could check out 3-4 NICs or, even
better, a 10 gbit/s NIC.
Regards,
Bjoern
Björn Metzdorf wrote:
>>>> mode 5 (balance-tlb)
>>>> - does not require special switch support
>>>> - outbound traffic bandwidth scales across all NICs, but inbound
>>>> traffic only flows across a single NIC
>>> No special switch support required, but we have had bad experiences
>>> with it: many, many dropped connections and so on.
>> Anyone else have issues with balance-tlb (or balance-alb)? Is this a
>> common issue with iSCSI connections and those two modes?
>
> This issue did not come up with iSCSI only; in general, all connections
> were dropped frequently.
So when we start implementation testing, it should be pretty obvious if
we run into that particular issue. I'm still not sure which mode we'll use.
> We use EqualLogic arrays which come with jumbo frames enabled by
> default. I assume jumbo frames will help you.
Is there a standard jumbo frame size? Or at least a commonly used and
supported frame size? It seems like the magic number is 9000 bytes for
the payload. On my WinXP box the NVIDIA nForce driver shows a list of
"1500, 2500, 4500, 9000" for the "Jumbo Frame Payload Size". I'm seeing
"9000" bandied about quite a bit and I'm assuming that everyone is
specifying the payload size and not the total frame size (including
headers).
>> My plan is to completely separate the SAN traffic onto its own pair of
>> switches. Our startup system won't be very large and a pair of 48-port
>> switches will be plenty (with lots of room for growth).
>>
>> So, like you, our servers would have a minimum of (4) NICs in them, 2
>> for the SAN connection 2 for the LAN connection. Would more disk-heavy
>> servers like a database box needing 3 or 4 for the SAN?
>
> No, you will be fine with 2 NICs for the SAN. When your SAN traffic to a
> single host goes over 2 gbit/s, you could check out 3-4 NICs or, even
> better, a 10 gbit/s NIC.
One advantage of iSCSI SANs (grin): the old equipment can be re-purposed
and used for less critical areas in the network when we upgrade the SAN
to 10 gigabit in a few years. AFAIK, 10 gigabit NICs and switches are
still $500+ per port? So they're not within our price range yet.
We may end up VLAN'ing on an existing switch at the start as well, until
we can find a good pair of 24/48 port switches for the SAN.
...
(Misc information - posted more for future readers - somewhat on-topic
for this discussion but mostly background information)
I found an OLD article from 1999 called "Gigabit Ethernet Jumbo Frames"
http://sd.wareonearth.com/~phil/jumbo.html
For the small IT shop: Building a SAN on the cheap (Sep 2005)
http://articles.techrepublic.com.com/5100-9592_11-5854600-1.html
Alleviate network congestion on iSCSI SANs via switch settings (Mar 2006)
http://searchwincomputing.techtarget.com/tip/0,289483,sid68_gci1171462,00.html
TechFest Ethernet Technical Summary (1999)
http://www.techfest.com/networking/lan/ethernet2.htm
>> We use EqualLogic arrays which come with jumbo frames enabled by
>> default. I assume jumbo frames will help you.
>
> Is there a standard jumbo frame size? Or at least a commonly used and
> supported frame size? It seems like the magic number is 9000 bytes for
> the payload. On my WinXP box the NVIDIA nForce driver shows a list of
> "1500, 2500, 4500, 9000" for the "Jumbo Frame Payload Size". I'm seeing
> "9000" bandied about quite a bit and I'm assuming that everyone is
> specifying the payload size and not the total frame size (including
> headers).
EqualLogic uses MTU 9000, so you have to make sure that your card and
driver support MTU 9000 on Linux. There are onboard Broadcom NICs on
server boards (Tyan) which DON'T support jumbo frames, so better check
before buying.
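Setting it is the easy part once the hardware cooperates; something along
these lines (the interface name is just an example, and the bond, its
slaves and the switch ports all have to be raised too):

  ifconfig eth2 mtu 9000
  # or persistently on RHEL/CentOS, add to ifcfg-eth2:
  MTU=9000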
Regards,
Bjoern
> I sort of understand the concept of multi-path, but I need to sit down
> and bone up on it. It sounds like you set up the SAN unit with
> multiple IP addresses on each NIC and connect each NIC to a different
> switch. You then advertise the iSCSI disk as being on a particular set
> of IP addresses? (My nomenclature may be a bit off.)
MPIO is designed to give you both high availability and increased performance,
allowing the iSCSI traffic to flow over more than one NIC/wire. Keep in mind that your target
and initiator must support multi-connection MPIO. On the target you would create two "portal"
addresses, probably in different ranges, and a volume V that would be published through
both portal addresses, so you would be able to log in to that target using both portals. Then
you set up the initiator to make two connections to the target, each one to a different
portal IP. Traffic would spread across both links and will fall back to one wire if the other fails.
Theoretically it's possible to get MPIO with more than two paths.
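On Linux with open-iscsi I believe the equivalent would be discovering
and logging in through each portal separately, something like this (the
IQN and portal IPs are placeholders, and I haven't verified this myself):

  iscsiadm -m discovery -t sendtargets -p 10.0.1.10
  iscsiadm -m discovery -t sendtargets -p 10.0.2.10
  iscsiadm -m node -T iqn.2001-04.com.example:storage.vol0 -p 10.0.1.10 --login
  iscsiadm -m node -T iqn.2001-04.com.example:storage.vol0 -p 10.0.2.10 --login
  # the two resulting SCSI devices are then grouped by a multipath layer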
> So from a network design point, I should plan on a few dozen IP
> addresses for each SAN unit?
For example, one class C net for each NIC port that you would like to have. Put them
on different VLANs and, for the servers, use IP addresses from those ranges. Try to use these
NICs for iSCSI only.
> And if I go the multiple IP address route for a particular iSCSI
> target disk, there's less need to worry about LACP or even bonding?
Yes, with MPIO you can forget about using LACP or bonding. But I don't know if Linux
software targets/initiators support MPIO.
> So we have (3) iSCSI disks defined (X, Y and Z) on the same IP address
> and being used by the same source machine. In the case of a 2-wire
> LACP bundle, you're saying that traffic might flow as?:
>
> X & Y & Z -> wire #1 in the bundle
> (nothing) -> wire #2 in the bundle
X & Z -> wire #1
Y -> wire #2
But that would depend on the switch you use...
> That's good to know, I was thinking that it would be more around 20-30
> MByte/sec with regular frame sizes. Which would seem a little low for
> our needs. With ~80MB/sec over a single cable, I'm less worried about
> traffic being spread across multiple cables for a single source/target
> pair.
Think of the increased performance as a byproduct of the increased reliability of using
two NICs/wires ;-)
> How much does iSCSI benefit from jumbo frames (~9000 bytes)? What
> sort of throughput will you see with regular frame sizes? What is the
> exact frame size that you configure?
Jumbo frames allow more useful data to be put on the network per interrupt, so
the CPU spends more time doing useful work instead of servicing interrupts. Also, fewer
headers are sent and the usable bandwidth increases a bit. But jumbo frames must be
supported end to end by all devices in the chain, so check carefully!
My future target does not support jumbo frames, but adaptive interrupt moderation
does a good job, at least using Windows servers with good Intel NICs as initiators.
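An easy way to verify that the whole path really passes jumbo frames from
a Linux box (the address is an example): ping with an 8972-byte payload
and the don't-fragment bit set, which only succeeds if every hop forwards
9000-byte frames (8972 + 20 bytes IP + 8 bytes ICMP = 9000).

  ping -M do -s 8972 10.0.1.10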
> Not really particular to iSCSI, but...
>
> Our startup unit will be a 10-disk RAID10 on top of 500GB SATAs (with
> two hot-spares attached to the 12-port controller). Fairly
> inexpensive to get started, should provide modest performance, and
> we'll eventually upgrade to bigger iron once we saturate this. The
> underlying hardware would be a newer PCIe motherboard.
Go with PCIe for both the controller and the target NICs, as both kinds of traffic will
flow over the internal bus of the mobo and you don't want that bus to be shared between the
RAID card and the NICs. The RAID card must be best of breed in order to give you the highest
I/Os, which is critical in shared storage as many clients will be accessing data at the same
time. The bigger the cache, the better. Also, I would test creating 2 RAID5 arrays with 1 or 2
hot spares, and then spread the volumes among them with care: don't put that heavily loaded
mail server on the same spindles used by that big MySQL server! The I/Os for one RAID5 will
be lower than for your 10-disk RAID10, but the combined I/Os of both arrays may be higher,
as not all disk heads have to move whenever any one server needs data.
> My back-of-envelope estimate is that the RAID array could provide
> burst bandwidth of around 200 MB/s, but probably only 60-90 MB/s
> sustained under heavy loads. So with those numbers I was trying to
> determine whether we would saturate 2 NICs or 4 NICs.
If the data is cached you may get 100 MB/s per configured NIC. That will only happen
occasionally... the rest of the time your bottleneck is going to be the disk and bus subsystems
(here is where "commercial" targets usually shine), but I feel that any good hardware should
be enough to get 60 as a bare minimum.
Regards,
Victor.
> Yes, with MPIO you can forget about using LACP or bonding. But I don't know if Linux
> software targets/initiators support MPIO.
There is the multipath-tools project on
http://christophe.varoqui.free.fr/wiki/wakka.php?wiki=Home which
provides generic multipath using the device-mapper on linux.
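A minimal /etc/multipath.conf to spread I/O over all paths would look
roughly like this (just a starting point, not something we have tuned
for iSCSI):

  defaults {
          path_grouping_policy  multibus   # one priority group, round-robin over all paths
          failback              immediate
          no_path_retry         queue      # queue I/O instead of failing while all paths are down
  }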
Regards,
Bjoern
Victor Rodriguez Cortes wrote:
> MPIO is designed to give you both high availability and increased
> performance, allowing the iSCSI traffic to flow over more than one
> NIC/wire. Keep in mind that your target and initiator must support
> multi-connection MPIO.
If the target supports MPIO but the initiator doesn't, I'm assuming that
things still work but it wouldn't be HA?
> Yes, with MPIO you can forget about using LACP or bonding. But I don't
> know if Linux software targets/initiators support MPIO.
Does it make sense to use MPIO with LACP or bonding? Seems like it
wouldn't matter and would help with HA for initiators that don't support
MPIO?
...
We're only moderately worried about performance and i/o load on the
array right now. Once we have the base SAN up and running, we can
organically grow it out to address performance issues. We'll also have
much better questions to ask a SAN vendor when we move up to a
commercial SAN target.
RAID5 vs RAID10 - I'm fairly biased against RAID5. I have yet to see a
good RAID5 setup (in the 3-5 spindle range) where I'm happy with
performance. Now, that could be an ID10T error, not properly tuned
block sizes, the hardware used (various, including Dell PERCs), the
small number of spindles, etc. On the flip side, with RAID10, I'm
finding that performance scales as you add spindle pairs. And
performance on a RAID10 doesn't seem to require a lot of tuning (of
either block sizes or the file systems).
I'd be more likely to consider RAID6 because I feel RAID5 is too fragile
during the array rebuild window. RAID10 has another advantage here
because rebuild time scales with the size of an individual disk in the
array, rather than the overall size of the array. It's a simple rebuild
of the mirror pair that failed. RAID10 also offers the chance that a
2nd drive failure won't take the array down (depends on which 2nd disk
fails).
Still, we'll be able to do some testing. Useable capacity for the
12-disk RAID10 (10 active 2 spare, using 500GB SATA) will be around
2.3TB while RAID6 (pair of 5-disk RAID6 + 2 spares) would be around
2.9TB. So moving from one setup to the other won't be too taxing until
we fill the enclosure past 2TB. We can simply backup the LVMs, tear the
drives down to reconfigure, and re-do the LVMs.
...
Another general question - how do you deal with failure of the SAN
enclosure itself?
Regards,
Don
-----Original Message-----
From: open-...@googlegroups.com [mailto:open-...@googlegroups.com] On
Behalf Of Björn Metzdorf
Sent: Friday, May 25, 2007 2:21 AM
To: Thomas Harold
Cc: open-...@googlegroups.com
Subject: Re: iSCSI and NIC bonding modes?
I'm trying to use this on Linux 2.6.21. I have two pairs of Intel e1000
cards, connected with crossover cables. My problem is the performance.
Using one card, the throughput is around 80-90 mbyte/sec, which is not
bad. But if I try to use any type of dual link (dm-multipath, or two
targets with two mountpoints), the performance can't get above 100
mbyte/sec.
The device is an 8-disk 3ware SATA array with software RAID0, and it can
do more than 300 mbyte/sec.
The configurations:
Target iqn.2001-04.com.example:storage.teszt0
Lun 0 Path=/dev/storage/teszt0,Type=blockio
Alias Test0
Wthreads 6
InitialR2T No
ImmediateData No
MaxRecvDataSegmentLength 16384
MaxXmitDataSegmentLength 16384
MaxOutstandingR2T 4
QueuedCommands 128
Target iqn.2001-04.com.example:storage.teszt1
Lun 0 Path=/dev/storage/teszt1,Type=blockio
Alias Test1
Wthreads 3
InitialR2T Yes
ImmediateData Yes
MaxRecvDataSegmentLength 4096
MaxXmitDataSegmentLength 4096
MaxOutstandingR2T 1
QueuedCommands 8
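For reference, my understanding of what those knobs control (corrections
welcome):

  # Wthreads                       number of worker threads IET runs for the target
  # InitialR2T                     Yes = initiator must wait for an R2T before sending write data
  # ImmediateData                  Yes = write data may be sent together with the SCSI command PDU
  # MaxRecv/XmitDataSegmentLength  largest iSCSI data segment per PDU, in bytes
  # MaxOutstandingR2T              how many R2Ts may be outstanding per task
  # QueuedCommands                 per-session command queue depth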
The first can do ~80 MBps in fileio mode and ~90 with blockio; the
second does 100 with both.
I still have things to test (using hardware RAID, and using two separate
ietd instances), but if anyone has hints or experience, please share them with me.
tg
Glad to help :)
> If the target supports MPIO but the initiator doesn't, I'm assuming
> that things still work but it wouldn't be HA?
I believe that you'll get only one connection, but it should work...
> > Yes, with MPIO you can forget about using LACP or bonding. But I don't
> > know if Linux software targets/initiators support MPIO.
>
> Does it make sense to use MPIO with LACP or bonding? Seems like it
> wouldn't matter and would help with HA for initiators that don't
> support MPIO?
I like to think of MPIO as a special kind of LACP specialized for iSCSI storage, so
it gives you all the benefits of trunking plus the "layer 7" knowledge to keep everything under
control. Using just LACP I see that some packets get lost during the failover, and I don't know
how this might affect a filesystem or the app accessing that filesystem.
Ok, the failover is usually <1 second :) but I haven't tested myself what happens under
those circumstances. So test it in your environment, with your apps, and get your own results.
LACP is better than nothing.
> ...
>
> We're only moderately worried about performance and i/o load on the
> array right now. Once we have the base SAN up and running, we can
> organically grow it out to address performance issues. We'll also
> have much better questions to ask a SAN vendor when we move up to a
> commercial SAN target.
There's a point where commodity hardware just isn't enough and you need specialized
products to get the performance and/or capacity.
> RAID5 vs RAID10 - I'm fairly biased against RAID5. I have yet to see
> a good RAID5 setup (in the 3-5 spindle range) where I'm happy with
> performance. Now, that could be an ID10T error, not properly tuned
> block sizes, the hardware used (various, including Dell PERCs), the
> small number of spindles, etc. On the flip side, with RAID10, I'm
> finding that performance scales as you add spindle pairs. And
> performance on a RAID10 doesn't seem to require a lot of tuning (of
> either block sizes or the file systems).
RAID5 is usually limited by the controller's processor, but RAID10 is usually limited by
disk performance, as there are a lot fewer calculations involved. Also, in a RAID5 a data block
is only on one disk but RAID1 has it on at least 2, so read performance may be better. Given a
good RAID5 controller, the higher the randomness and concurrency over the disks, the
higher the performance gain with RAID5, as disk heads can move freely instead of being tied
to the heads of the other disks. But everything should be tested with your apps and
environment...
Big, battery-backed caches are crucial to get the utmost performance with RAID 5 /
RAID 6.
> I'd be more likely to consider RAID6 because I feel RAID5 is too
> fragile during the array rebuild window. RAID10 has another advantage
> here because rebuild time scales with the size of an individual disk
> in the array, rather than the overall size of the array. It's a
> simple rebuild of the mirror pair that failed. RAID10 also offers the
> chance that a 2nd drive failure won't take the array down (depends on
> which 2nd disk fails).
RAID6 -> RAID5 + a pre-rebuilt hot spare. You'll need an even better controller to get
nice performance. And be sure to check that it's a standard implementation, as some
vendors have proprietary "RAID 6" formats. I like RAID6 over RAID5, but only if there
are at least 6 disks in the raidset.
> Still, we'll be able to do some testing. Useable capacity for the
> 12-disk RAID10 (10 active 2 spare, using 500GB SATA) will be around
> 2.3TB while RAID6 (pair of 5-disk RAID6 + 2 spares) would be around
> 2.9TB. So moving from one setup to the other won't be too taxing
> until we fill the enclosure past 2TB. We can simply backup the LVMs,
> tear the drives down to reconfigure, and re-do the LVMs.
IMHO it would be an error to split the RAID6 array into two 5-disk sets: twice the
workload on the controller's processor and probably less than half the performance. Also,
2 hot spares with RAID6 seems like too much redundancy. I sometimes leave the hot spare
outside the enclosure to use all the available spindles and capacity, but on critical systems
there's one hot spare inside and another one lying on top of the cabinet ;)
>
> Another general question - how do you deal with failure of the SAN
> enclosure itself?
By having two enclosures and replicating everything between them in real time :) For
example, Linux software RAID1 over two enclosures with hardware RAID6 and dual controllers,
with one hot spare inside each cabinet. Or take the hardware path with one controller capable
of RAID1 and two enclosures with hardware RAID6. Either way, the SPOF will then be the
target machine and the controller... Having an iSCSI SAN without a SPOF is expensive and
requires specific hardware... or a lot of experiments with a combination of LVS + 2 target
servers + 2 enclosures + DRBD to replicate between servers + LVM.
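If someone goes the DRBD route, the replication piece itself is just a
resource definition along these lines (hostnames, devices and addresses
are placeholders; the hard part is the failover logic around it):

  resource r0 {
    protocol C;                 # synchronous replication
    on target-a {
      device    /dev/drbd0;
      disk      /dev/sdb1;      # local backing store
      address   10.0.3.1:7788;
      meta-disk internal;
    }
    on target-b {
      device    /dev/drbd0;
      disk      /dev/sdb1;
      address   10.0.3.2:7788;
      meta-disk internal;
    }
  }

You would then export /dev/drbd0 (or an LVM volume on top of it) from
whichever node is currently primary.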
Regards!
How are you benchmarking this? Maybe MPIO is configured in failover mode only?
I don't know this exact implementation (I haven't used it), but I believe that
MPIO won't share the load for just one volume among all paths, in order to avoid the overhead
of TCP packet re-ordering, so if you're testing against only one volume it makes sense
that you max out the bandwidth of one path while the other one stays almost idle.
Regards,
> Bjoern Metzdorf wrote:
dd and iozone. "This" in the story means the multipath module of dm, and I
use multipath -p multibus. If this were failover, then of course only
one data path would be used at a time.
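For anyone wanting to double-check that it isn't silently running in
failover mode, the path grouping is visible with the stock tools (output
format varies between multipath-tools versions; the device name below is
a placeholder):

  multipath -ll
  # both iSCSI paths should appear as active in a single round-robin path group
  dd if=/dev/mapper/<wwid> of=/dev/null bs=1M count=2048 iflag=direct  # needs a dd with iflag support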
> I don't know this exact implementation (I haven't used it), but I believe that
> MPIO won't share the load for just one volume among all paths, in order to avoid the overhead
> of TCP packet re-ordering, so if you're testing against only one volume it makes sense
> that you max out the bandwidth of one path while the other one stays almost idle.
Strange, because this does distribute the load; the two NICs each carry
half of it. And of course I tried with two volumes, and the problem is
still present. I also tried with two volumes on two 3ware cards, accessed
over two different data paths.
I have a growing suspicion: the bottleneck is the target (iSCSI
Enterprise Target), or the initiator (open-iscsi), or possibly the
initiator system, though the latter two are the same machine. On the
initiator I sometimes see 100% CPU load, ~60% sys, ~40% user.
At this moment I really need the "performance setup" from the project
page...
tg