On the initiator side, the ping (nop-out) timeout feature will sometimes
misdiagnose the connection as bad when it is really just slow. SUSE did
ship some fixes, but I am not sure in which SLES version. If you're not
using multipath, you can just turn the pings off by setting the noop
timeout values to 0.
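Assuming a reasonably recent open-iscsi, the noop values can be changed per
node record with iscsiadm; the target name and portal below are placeholders,
so substitute the ones shown by `iscsiadm -m node` on your system:

```shell
# Disable iSCSI ping (nop-out) checking for one node record.
# Target IQN and portal are placeholders - use your own.
iscsiadm -m node -T iqn.2001-05.com.equallogic:example -p 192.168.1.10 \
    -o update -n node.conn[0].timeo.noop_out_interval -v 0
iscsiadm -m node -T iqn.2001-05.com.equallogic:example -p 192.168.1.10 \
    -o update -n node.conn[0].timeo.noop_out_timeout -v 0
```

The same two settings can be changed for all future sessions by editing the
defaults in /etc/iscsi/iscsid.conf and then logging the sessions out and
back in.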
On the target side, EqualLogic has target-side connection load balancing:
if the target determines that it is best to use another path, it sends the
initiator an async logout request; we then log out, kill the TCP/IP
connection, and relogin to the new port. You can disable this on the
target. I am not sure of the command. You will have to ask EqualLogic.
Yes, but set them lower than your SCSI command timeout
(/sys/block/sdX/device/timeout).
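To see what you are setting the iSCSI timeouts against, the per-disk SCSI
command timeout (in seconds) can be read straight out of sysfs:

```shell
# Print the SCSI command timeout for each sd disk; iSCSI nop-out
# and replacement timeouts should stay below these values.
for dev in /sys/block/sd*/device/timeout; do
    [ -e "$dev" ] || continue   # skip if no sd disks are present
    printf '%s: %s\n' "$dev" "$(cat "$dev")"
done
```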
Now the question is, is that because we aren't getting disconnected
because we are no longer getting the really high IO rates we were
seeing?
I'm going to let disk tests run for a few hours and see if we get any
errors...
Just so you know, the SUSE developers are good at keeping SLES up to
date and even carry fixes that are not yet upstream, so it is sometimes
best to just get their newest SLES kernels.
> problem, however, throughput numbers have dropped by about half. We
> were getting nearly 5 Gbit /sec when testing across 5 LUNs, now we are
> getting more like 2.6 Gbit/sec, but we've had no disconnect errors.
>
> Now the question is, is that because we aren't getting disconnected
> because we are no longer getting the really high IO rates we were
> seeing?
Normally we would expect that if the disconnect errors go away, you get
higher throughput numbers. However, when the EqualLogic load balancing
comes into play, and the disconnect errors are accompanied by that target
logout request message, it could be that the EQL target knows a better
path and you actually wanted it to load balance.
Is there a difference between SLES's and openSUSE's CPU frequency scaling
handling (I am not sure what the feature is called in SUSE; for iSCSI I
normally turn this off to get the best results), or iptables (turning it
off can improve performance), or the I/O scheduler (switching between
noop and cfq sometimes gives different results)?
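For benchmarking, the three knobs above can be checked and changed from the
shell; the paths are the usual Linux sysfs locations, and sdX is a
placeholder for your iSCSI disk (the iptables flush is for testing only,
since it drops your firewall rules):

```shell
# 1. CPU frequency scaling: pin every core's governor to "performance"
#    so the CPU does not clock down between iSCSI requests.
for gov in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo performance > "$gov"
done

# 2. iptables: flush all filter rules to rule out filtering overhead.
#    WARNING: test machines only - this removes your firewall.
iptables -F

# 3. I/O scheduler: switch the iSCSI disk to noop for comparison
#    (echo cfq to switch back).
echo noop > /sys/block/sdX/queue/scheduler
```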
To us this looks like an issue of worse network performance with later
kernels. The kernel has become very bloated over time, and we see much
worse per-transaction network latencies in more recent kernel versions
than with, say, the kernel version used in SLES 10.2, which is 2.6.16.
SLES 11 got slower, and openSUSE seems even slower than SLES 11.
I will do some iperf testing to compare a SLES 11 and an OpenSUSE 11
server to see how the network layer performs.
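For that comparison, a typical iperf run might look like the following;
the address and the run parameters are placeholders, not values from this
thread:

```shell
# On the SLES 11 box: start an iperf TCP server.
iperf -s

# On the openSUSE 11 box: run a 30-second test with 4 parallel
# streams against the server (192.168.1.20 is a placeholder).
iperf -c 192.168.1.20 -t 30 -P 4
```

This measures raw TCP throughput between the two hosts, independent of
the iSCSI stack, so a gap here would point at the network layer rather
than the initiator.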
So far stability is great, and performance is good, just going to see
if we can get performance to be great as well.
Log into the array as grpadmin; at the CLI> prompt, type this
command:
CLI> grpparams conn-balancing disable
This command can be done w/o a reboot or restart of the arrays and
becomes effective immediately on all members in the group. Should you
decide to re-enable connection load balancing in the future, type this
command at the Group CLI prompt:
CLI> grpparams conn-balancing enable