Tuning iSCSI read performance with multipath: Red Hat 5.3 / SLES 10 SP2 / Oracle Linux / EqualLogic


jnantel

Apr 24, 2009, 11:07:24 AM
to open-iscsi
You may recall my earlier thread on tuning write performance. Now I am
attempting to squeeze as much read performance as I can out of my
current setup. I've read a lot of the previous threads, and there has
been mention of "miracle" settings that resolved slow reads vs.
writes. Unfortunately, most posts describe the effects and not the
actual changes. If I were tuning for read performance in the 4k to
128k block range, what would be the best way to go about it?

Observed behavior:
- Read performance seems capped at around 110 MB/s
- Write performance reaches upwards of 190 MB/s

Tuning options I'll be trying:
- block alignment (stride)
- receive buffers
- multipath min io changes
- iscsi cmd depth
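Roughly where each of those tunables lives (a sketch only; the device name and values below are illustrative, not recommendations):

```shell
# Block alignment: set stride at mkfs time to match the array's stripe size
# (example: 64 KiB stripes / 4 KiB blocks = stride of 16)
mkfs.ext3 -b 4096 -E stride=16 /dev/mapper/raid6

# Receive buffers: raise the kernel socket limits (persist in sysctl.conf)
sysctl -w net.core.rmem_max=2097152

# Multipath min io: edit rr_min_io in the device { } stanza of
# /etc/multipath.conf, then reload the maps
multipath -r

# iSCSI cmd depth: node.session.cmds_max / node.session.queue_depth in
# iscsid.conf, then log the sessions out and back in to apply
iscsiadm -m node -U all
iscsiadm -m node -L all
```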


Hardware:
2 x Cisco 3750 with 32-Gbps interconnect
2 x Dell R900 with 128 GB RAM, 1 Broadcom quad-port (5709), and 2
dual-port Intels (Pro/1000 MT)
2 x Dell EqualLogic PS5000XV with 15 x SAS in RAID 10 config


multipath.conf:

device {
        vendor "EQLOGIC"
        product "100E-00"
        path_grouping_policy multibus
        getuid_callout "/sbin/scsi_id -g -u -s /block/%n"
        features "1 queue_if_no_path"
        path_checker readsector0
        failback immediate
        path_selector "round-robin 0"
        rr_min_io 128
        rr_weight priorities
}

iscsi settings:

node.tpgt = 1
node.startup = automatic
iface.hwaddress = default
iface.iscsi_ifacename = ieth10
iface.net_ifacename = eth10
iface.transport_name = tcp
node.discovery_address = 10.1.253.10
node.discovery_port = 3260
node.discovery_type = send_targets
node.session.initial_cmdsn = 0
node.session.initial_login_retry_max = 4
node.session.cmds_max = 1024
node.session.queue_depth = 128
node.session.auth.authmethod = None
node.session.timeo.replacement_timeout = 120
node.session.err_timeo.abort_timeout = 15
node.session.err_timeo.lu_reset_timeout = 30
node.session.err_timeo.host_reset_timeout = 60
node.session.iscsi.FastAbort = Yes
node.session.iscsi.InitialR2T = No
node.session.iscsi.ImmediateData = Yes
node.session.iscsi.FirstBurstLength = 262144
node.session.iscsi.MaxBurstLength = 16776192
node.session.iscsi.DefaultTime2Retain = 0
node.session.iscsi.DefaultTime2Wait = 2
node.session.iscsi.MaxConnections = 1
node.session.iscsi.MaxOutstandingR2T = 1
node.session.iscsi.ERL = 0
node.conn[0].address = 10.1.253.10
node.conn[0].port = 3260
node.conn[0].startup = manual
node.conn[0].tcp.window_size = 524288
node.conn[0].tcp.type_of_service = 0
node.conn[0].timeo.logout_timeout = 15
node.conn[0].timeo.login_timeout = 15
node.conn[0].timeo.auth_timeout = 45
node.conn[0].timeo.noop_out_interval = 10
node.conn[0].timeo.noop_out_timeout = 30
node.conn[0].iscsi.MaxRecvDataSegmentLength = 262144
node.conn[0].iscsi.HeaderDigest = None,CRC32C
node.conn[0].iscsi.DataDigest = None
node.conn[0].iscsi.IFMarker = No
node.conn[0].iscsi.OFMarker = No

/etc/sysctl.conf

net.core.rmem_default = 65536
net.core.rmem_max = 2097152
net.core.wmem_default = 65536
net.core.wmem_max = 262144
net.ipv4.tcp_mem = 98304 131072 196608
net.ipv4.tcp_window_scaling = 1

#
# Additional options for Oracle database server
#ORACLE
kernel.panic = 2
kernel.panic_on_oops = 1
net.ipv4.ip_local_port_range = 1024 65000
net.core.rmem_default=262144
net.core.wmem_default=262144
net.core.rmem_max=524288
net.core.wmem_max=524288
fs.aio-max-nr=524288

jnantel

Apr 24, 2009, 12:23:12 PM
to open-iscsi
As an update:

New observed behavior:
- Raw disk read performance is phenomenal (200 MB/s)
- Ext3 performance is 100 MB/s, and tps in iostat isn't going above
800 (vs. 50k with raw disk)

Some added info:
- This system hosts an Oracle database and is tuned for huge pages,
etc. (see the sysctl posted above)

Donald Williams

Apr 24, 2009, 2:14:43 PM
to open-...@googlegroups.com
Have you tried increasing the disk readahead value?

#blockdev --setra X /dev/<multipath device>

The default is 256. Use --getra to see the current setting.

Setting it too high will probably hurt your database performance,
since database access tends to be random, not sequential.

Don
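For reference, a quick sketch of checking and raising it (the device name is illustrative; units are 512-byte sectors, so the default of 256 = 128 KiB):

```shell
# Show the current readahead for the multipath device, in sectors
blockdev --getra /dev/mapper/raid6

# Try 1024 sectors (512 KiB) for sequential reads; revert if random
# database I/O latency suffers
blockdev --setra 1024 /dev/mapper/raid6
```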

Christopher Chen

Apr 24, 2009, 2:44:59 PM
to open-...@googlegroups.com
I've found that trying to reduce wire latency brings big wins in read
performance. Don't get too hung up on sequential read throughput
though--focus on parallel performance and IOPS, particularly for your
database workload.

You can fiddle with the read-ahead settings if you want, but Don's
right--it'll probably hurt your random performance.

Some things to look at:

- Interrupt coalescing (try to minimize it for better latency)
- Ethernet flow control (especially for gigabit Ethernet; try turning
it off to let TCP flow control do the work)
- Ring buffers for tx and rx on the interface (try to maximize them
to reduce overflow)
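A sketch of the corresponding ethtool commands (assuming eth10 from the iface settings above; supported values vary by driver, so the numbers here are illustrative):

```shell
# Interrupt coalescing: inspect current values, then lower for latency
ethtool -c eth10
ethtool -C eth10 rx-usecs 3

# Flow control: disable pause frames so TCP handles flow control
ethtool -A eth10 rx off tx off

# Ring buffers: inspect the hardware maximums, then raise rx/tx
ethtool -g eth10
ethtool -G eth10 rx 4096 tx 4096
```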

Let us know how it works for you!

Are you running dm-multipath? Try testing it with failover rather than
round-robin if you are.
--
Chris Chen <muff...@gmail.com>
"I want the kind of six pack you can't drink."
-- Micah

Konrad Rzeszutek

Apr 24, 2009, 4:06:44 PM
to open-...@googlegroups.com
On Fri, Apr 24, 2009 at 02:14:43PM -0400, Donald Williams wrote:
> Have you tried increasing the disk readahead value?
> #blockdev --setra X /dev/<multipath device>
>
> The default is 256. Use --getra to see current setting.
>
> Setting it too high will probably hurt your database performance. Since
> databases tend to be random, not sequential.

I would think that the databases would open the disks with O_DIRECT
bypassing the block cache (And hence the disk readahead value isn't used
at all).
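One way to see this from userspace: dd can open its input with O_DIRECT via iflag=direct (coreutils 5.3+; device path as in the earlier posts), so comparing a cached run against a direct run shows whether readahead is in play:

```shell
# Cached read: goes through the page cache, so readahead applies
dd if=/dev/mapper/raid6 of=/dev/null bs=1M count=1000

# Direct read: O_DIRECT bypasses the page cache, so --setra has no effect
dd if=/dev/mapper/raid6 of=/dev/null bs=1M count=1000 iflag=direct
```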

jnantel

Apr 24, 2009, 4:21:09 PM
to open-iscsi
We are running with async I/O right now; it's yielding better results
with multipath.

jnantel

Apr 24, 2009, 5:02:06 PM
to open-iscsi
Donald, thanks for the reply. This issue has me baffled. I can goof
with the readahead all I want, but it has no effect on performance
with a filesystem. I must be missing a key buffer setting that is
starving my filesystem reads.

Here is the output from iostat -k 5 during an artificially generated
read (dd if=/fs/disk0/testfile of=/dev/null bs=32k count=1000000).

This is reading a file residing on the ext3 filesystem on my raid6
volume. Keep in mind I am using multipath:

Device:      tps   kB_read/s  kB_wrtn/s  kB_read  kB_wrtn
sdc       115.17    52090.22       0.00   260972        0
sdd         0.00        0.00       0.00        0        0
sde       109.78    49694.21       0.00   248968        0
dm-0      249.30   101784.43       0.00   509940        0

Same volume, reading from the device itself (dd if=/dev/mapper/raid6
of=/dev/null bs=32k count=1000000):

Device:      tps   kB_read/s  kB_wrtn/s  kB_read  kB_wrtn
sdc       418.80   106905.60       0.00   534528        0
sdd         0.00        0.00       0.00        0        0
sde       901.80   106950.40       0.00   534752        0
dm-0    53452.80   213811.20       0.00  1069056        0

More detail on the ext3 case (iostat -x):

Device:  rrqm/s  wrqm/s     r/s   w/s     rkB/s  wkB/s  avgrq-sz  avgqu-sz  await  svctm   %util
sda        0.00    0.40    0.00  0.40      0.00   4.00     20.00      0.00   0.00   0.00    0.00
sdb        0.00    0.00    0.00  0.00      0.00   0.00      0.00      0.00   0.00   0.00    0.00
sdc       12.40    0.20  110.80  0.40  50215.20   2.40    903.19      1.00   8.97   4.81   53.52
sdd        0.00    0.00    0.00  0.00      0.00   0.00      0.00      0.00   0.00   0.00    0.00
sde       11.20    0.00  104.00  0.00  47205.60   0.00    907.80      0.91   8.77   4.65   48.32
dm-0       0.00    0.00  238.40  0.60  97375.20   2.40    814.88      2.08   8.70   4.18  100.00

ByteEnable

Apr 27, 2009, 11:19:45 PM
to open-iscsi
I'm not sure if you have seen this, but there is a guide from Dell on
this subject:

http://www.support.dell.com/support/edocs/software/appora10/lin_x86_64/multlang/EELinux_storage_4_1.pdf

Also I would suggest the following changes in multipath.conf for RHEL5

device {
        vendor "EQLOGIC"
        product "100E-00"
        getuid_callout "/sbin/scsi_id -g -u -s /block/%n"
        hardware_handler "0"
        path_selector "round-robin 0"
        path_grouping_policy multibus
        failback immediate
        features "1 queue_if_no_path"
        path_checker tur
        rr_min_io 10
        rr_weight uniform
}

Byte

Ulrich Windl

Apr 28, 2009, 8:12:56 AM
to open-...@googlegroups.com

Hi,

first a silly question: shouldn't the read-ahead on the server be at
least as high as the setting on the client to provide any benefit?

And two interesting numbers: on one of our busy databases the
read:write ratio is about 10:1, and the tables are severely fragmented.
On our Linux servers using iSCSI the read:write ratio is about 1:10,
because the machines have several GB of RAM and disk caching is very
efficient. So the machines mostly just have to send out the writes...

Regards,
Ulrich

jnantel

Apr 29, 2009, 10:13:07 AM
to open-iscsi
Just an update to this issue: it's still persisting. 210 MB/s raw
reads, versus 100-110 MB/s throughput on filesystem reads. I'm a bit
perplexed. Setting stride and aligning sectors, to my mind, helps
with write performance, not read performance. So I'm left with a
couple of questions: what does ext3 add to the equation for reads
that might cause this behavior?

- Block device
- File table
- Request merging (I'd like to tune this, but I've not seen any
tunable parameters here)

I'm obviously missing a lot more.
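On the request-merging point, the block layer does expose per-device knobs under sysfs (device names taken from the iostat output above; writable as root, and worth applying to each path member):

```shell
# I/O scheduler (elevator) for one iSCSI path member
cat /sys/block/sdc/queue/scheduler
echo deadline > /sys/block/sdc/queue/scheduler

# Queue depth available for merging (default 128)
echo 512 > /sys/block/sdc/queue/nr_requests

# Largest I/O the queue will issue, and readahead in KiB
# (the latter mirrors blockdev --setra)
cat /sys/block/sdc/queue/max_sectors_kb
cat /sys/block/sdc/queue/read_ahead_kb
```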