active/active open-iscsi & heartbeat & drbd

Jarrod

Aug 9, 2007, 7:40:33 PM

to open-iscsi

I have set up a pair of highly available iscsi targets using Linux,
Enterprise iSCSI Target, Heartbeat, and DRBD, with the following
characteristics:

* There are two DRBD devices
* ietd is always running on each server
* There are two heartbeat resources. Each resource has an IP, a
drbd device, and an iscsi target.
* During a clean failover the server going standby will drop all
connections on a target, then delete the target, but will still have
ietd running. I wrote a script to do this which seems to work just
fine.

I ran into a problem where on a clean failover sometimes an initiator
would attempt to reconnect very quickly - before the IP address had
switched to the takeover machine. The connect would initially
succeed, because ietd is running, but no session was ever created.
Fairly quickly login would fail with code 0200 and the initiator would
stop trying to reconnect. The next message would be that reconnect
timed out. When no iscsi device is mounted this just results in the
need to use iscsiadm to tell iscsid to reconnect to the targets. If a
device is mounted and an operation is underway the kernel will
actually hang. I can't even shut down the machine cleanly.

In researching the configuration it looked like
node.session.iscsi.DefaultTime2Wait would specify a wait time before
beginning to reconnect, which would solve my problem. I could just
set this to a time reasonably above the time for the clean failover to
occur, then reconnection would happen correctly with the takeover
server. I tried this and it didn't work.
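
For reference, a minimal sketch of how the setting might be applied to an existing node record, assuming the usual `iscsiadm -m node ... -o update -n ... -v ...` form. The helper name `time2wait_update_cmd` is hypothetical, and the function only prints the command so it can be reviewed before being run as root.

```shell
# Hypothetical helper: print (do not run) the iscsiadm command that would
# set DefaultTime2Wait on one node record.
time2wait_update_cmd() {
    # $1 = target name, $2 = portal (ip:port), $3 = wait time in seconds
    printf 'iscsiadm -m node -T %s -p %s -o update -n node.session.iscsi.DefaultTime2Wait -v %s\n' \
        "$1" "$2" "$3"
}

# Example values taken from the haresources entry in this post; 3260 is the
# default iSCSI port and is an assumption about this setup.
time2wait_update_cmd iqn.2007-08.com.mydomain:iscsitarget.1 192.168.0.1:3260 5
```

The value 5 matches the DefaultTime2Wait used in the failover tests described in this post.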

When I looked into the code it looks like DefaultTime2Wait is not
actually used anywhere. It ends up on session->def_time2wait, but
while def_time2wait is set and printed out in a couple of places in
the code, it isn't ever actually used. I made a small change to
usr/initiator.c so that on the first reconnect attempt, if def_time2wait is
nonzero, reconnection will wait def_time2wait seconds before starting.
Below is the diff from version 2.0-865.4.
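
Before the diff itself, the rule can be restated as a small shell sketch. This is not the initiator code: the function name `effective_time2wait` is made up for illustration, and its three arguments stand in for session->reopen_cnt, session->time2wait, and session->def_time2wait.

```shell
# Illustrative restatement of the patch logic, not the actual initiator code.
# Prints the number of seconds to wait before the next reopen attempt.
effective_time2wait() {
    local reopen_cnt=$1 time2wait=$2 def_time2wait=$3
    # Only the first recovery attempt falls back to the default, and only
    # when no wait time has been set already.
    if [ "$reopen_cnt" -eq 1 ] && [ "$time2wait" -eq 0 ]; then
        time2wait=$def_time2wait
    fi
    echo "$time2wait"
}

effective_time2wait 1 0 5   # first attempt, no wait set: prints 5
effective_time2wait 2 0 5   # later attempt: prints 0
effective_time2wait 1 3 5   # wait already set: prints 3
```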

This solution works great for me. I have been able to fail over many
times in the middle of a disk operation without failure, loss of data,
or kernel hang using a DefaultTime2Wait of 5 seconds.

Is this how DefaultTime2Wait is supposed to work? If so, can we get
an official fix? If not, is there a better suggestion on how to solve
this problem?

Thanks,

Jarrod Ribble

************************
initiator.c diff (for patch):
************************

--- ../../../open-iscsi-2.0-865.4/usr/initiator.c	2007-06-21 10:16:22.000000000 -0600
+++ initiator.c	2007-08-09 15:32:11.000000000 -0600
@@ -851,6 +851,12 @@ __session_conn_reopen(iscsi_conn_t *conn
 	}
 	conn->session->t->template->ep_disconnect(conn);

+	if (session->reopen_cnt == 1 && session->time2wait == 0) {
+		log_debug(4, "waiting default %d secs for first recovery attempt",
+			  session->def_time2wait);
+		session->time2wait = session->def_time2wait;
+	}
+
 	if (session->time2wait)
 		goto queue_reopen;

********************************************
heartbeat iscsi_target script:
********************************************

#!/bin/bash
#
# This script is intended to be used as a resource script by heartbeat
#
# Jul 17 2007 by Jarrod Ribble.
#
###

#set -vx

IETADM="/usr/sbin/ietadm"

usage() {
    echo "usage: $0 <target id> <target lun> <target name> <lun parameters> <user> <password> {start|stop|status}"
    exit 1
}

if [ "$#" -eq 7 ]; then
    TID="$1"
    TLUN="$2"
    TNAME="$3"
    TPARAMS="$4"
    TUSER="$5"
    TPASS="$6"
    CMD="$7"
else
    TID=
    TLUN=
    TNAME=
    TPARAMS=
    TUSER=
    TPASS=
    CMD=
    usage
fi

status() {
    grep -q "tid:$TID name:$TNAME" < /proc/net/iet/volume
    RUNSTS=$?
    return $RUNSTS
}

delete_target() {
    echo "Stopping target tid:$TID name:$TNAME"
    # delete each session in target until no more sessions
    TSID=$(awk "/tid:$TID name:$TNAME/,/sid:/ || (/tid:/ && ! /tid:$TID name:$TNAME/)" /proc/net/iet/session | awk '/sid:/ {tsid = substr($1,5); print tsid; }')
    while [ -n "$TSID" ]; do
        delete_session $TSID
        RET=$?
        if [ "$RET" -eq 0 ]; then
            echo "Session $TSID deleted"
        else
            echo "Error deleting tid:$TID name:$TNAME sid:$TSID"
            return $RET
        fi
        TSID=$(awk "/tid:$TID name:$TNAME/,/sid:/ || (/tid:/ && ! /tid:$TID name:$TNAME/)" /proc/net/iet/session | awk '/sid:/ {tsid = substr($1,5); print tsid; }')
    done
    # then delete the target
    $IETADM --op delete --tid=$TID
    RET=$?
    return $RET
}

delete_session() {
    RET=0
    CURSID=$1
    echo "Stopping session tid:$TID name:$TNAME sid:$CURSID"
    # delete each connection in session until no more connections
    # the session then closes on its own
    SCID=$(awk "/sid:$CURSID/,/cid:/ || (/sid:/ && ! /sid:$CURSID/)" /proc/net/iet/session | awk '/cid:/ {scid = substr($1,5); print scid; }')
    while [ -n "$SCID" ]; do
        delete_connection $CURSID $SCID
        RET=$?
        if [ "$RET" -eq 0 ]; then
            echo "connection $SCID deleted"
        else
            echo "Error deleting tid:$TID name:$TNAME sid:$CURSID cid:$SCID"
            return $RET
        fi
        SCID=$(awk "/sid:$CURSID/,/cid:/ || (/sid:/ && ! /sid:$CURSID/)" /proc/net/iet/session | awk '/cid:/ {scid = substr($1,5); print scid; }')
    done
    return $RET
}

delete_connection() {
    CURSID=$1
    CURCID=$2
    # call ietadm to close connection
    $IETADM --op delete --tid=$TID --sid=$CURSID --cid=$CURCID
    RET=$?
    return $RET
}

case "$CMD" in
    start)
        status
        STS=$?
        if [ "$STS" -eq 0 ]; then
            echo "tid:$TID name:$TNAME is running"
            exit 0
        fi
        $IETADM --op new --tid=$TID --params Name=$TNAME
        RET=$?
        if [ "$RET" -ne 0 ]; then
            echo "Error starting tid:$TID name:$TNAME"
            exit $RET
        fi
        $IETADM --op new --tid=$TID --lun=$TLUN --params $TPARAMS
        RET=$?
        if [ "$RET" -ne 0 ]; then
            echo "Error starting lun $TLUN on tid:$TID name:$TNAME"
            exit $RET
        fi
        $IETADM --op new --tid=$TID --user --params IncomingUser=$TUSER,Password=$TPASS
        RET=$?
        if [ "$RET" -ne 0 ]; then
            echo "Error setting credentials on tid:$TID name:$TNAME"
            exit $RET
        else
            echo "tid:$TID name:$TNAME started"
        fi
        exit $RET
        ;;
    stop)
        status
        STS=$?
        if [ "$STS" -eq 1 ]; then
            echo "tid:$TID name:$TNAME is not running"
            exit 0
        fi
        #$IETADM --op delete --tid=$TID --lun=$TLUN
        #$IETADM --op delete --tid=$TID
        delete_target
        RET=$?
        if [ "$RET" -eq 0 ]; then
            echo "tid:$TID name:$TNAME stopped"
        else
            echo "Error stopping tid:$TID name:$TNAME"
        fi
        exit $RET
        ;;
    status)
        status
        RUNSTS=$?
        if [ "$RUNSTS" -eq 1 ]; then
            echo "tid:$TID name:$TNAME is not running."
        else
            echo "tid:$TID name:$TNAME is running."
        fi
        exit $RUNSTS
        ;;
    *)
        usage
        ;;
esac

exit 0
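
A note on the session lookup in delete_target: it interpolates $TID and $TNAME into an awk regular expression, so the dots in the target name match any character. One possible alternative is sketched below. It assumes, as the script above implies, that /proc/net/iet/session lists each target as a `tid:N name:X` line followed by indented `sid:` and `cid:` lines. The function name `first_sid` and the sample listing are hypothetical.

```shell
# Hypothetical helper: print the first session id listed under one target.
# Reads a listing in the /proc/net/iet/session layout on stdin.
# $1 = target id, $2 = target name
first_sid() {
    awk -v want="tid:$1 name:$2" '
        # Compare the target header as a plain string, not as a regex.
        /^tid:/ { in_target = (($1 " " $2) == want); next }
        in_target && $1 ~ /^sid:/ { print substr($1, 5); exit }
    '
}

# Made-up sample listing, used only to exercise the function.
sample='tid:2 name:iqn.2007-08.com.mydomain:iscsitarget.2
    sid:844424967684608 initiator:iqn.example:client-b
        cid:0 ip:192.168.0.12 state:active hd:none dd:none
tid:1 name:iqn.2007-08.com.mydomain:iscsitarget.1
    sid:562949990973952 initiator:iqn.example:client-a
        cid:0 ip:192.168.0.11 state:active hd:none dd:none'

printf '%s\n' "$sample" | first_sid 1 iqn.2007-08.com.mydomain:iscsitarget.1
```

On a live target the same function would read the real listing, for example `first_sid "$TID" "$TNAME" < /proc/net/iet/session`.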

****************************
haresources
****************************
san1 192.168.0.1 drbddisk::cluster1 iscsi_target::1::0::iqn.2007-08.com.mydomain:iscsitarget.1::Path=/dev/drbd0,Type=blockio,ScsiId=target1::target1user::target1secret
san2 192.168.0.2 drbddisk::cluster2 iscsi_target::2::0::iqn.2007-08.com.mydomain:iscsitarget.2::Path=/dev/drbd1,Type=blockio,ScsiId=target2::target2user::target2secret
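
Heartbeat v1 splits each haresources resource entry on `::`, treating the first field as the resource script name and passing the remaining fields as arguments, followed by the action. The sketch below only illustrates that mapping for the first entry; `resource_argv` is a hypothetical helper, not part of Heartbeat.

```shell
# Hypothetical helper: show the argument list that a haresources entry
# of the form script::arg1::arg2... expands to, one item per line.
resource_argv() {
    # $1 = resource entry, $2 = action (start, stop, or status)
    printf '%s\n' "$1" | awk -F'::' '{ for (i = 1; i <= NF; i++) print $i }'
    printf '%s\n' "$2"
}

resource_argv 'iscsi_target::1::0::iqn.2007-08.com.mydomain:iscsitarget.1::Path=/dev/drbd0,Type=blockio,ScsiId=target1::target1user::target1secret' start
```

Eight lines come out: the script name plus the seven arguments that the iscsi_target script checks for with `$# -eq 7`. The single colon inside the IQN is left alone because only `::` is used as the separator.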

Mike Christie

Aug 8, 2007, 4:59:29 PM

to open-...@googlegroups.com

Jarrod wrote:
> I have set up a pair of highly available iscsi targets using Linux,
> Enterprise iSCSI Target, Heartbeat, and DRBD, with the following
> characteristics:
>
> * There are two DRBD devices
> * ietd is always running on each server
> * There are two heartbeat resources. Each resource has an IP, a
> drbd device, and an iscsi target.
> * During a clean failover the server going standby will drop all
> connections on a target, then delete the target, but will still have
> ietd running. I wrote a script to do this which seems to work just
> fine.
>
> I ran into a problem where on a clean failover sometimes an initiator
> would attempt to reconnect very quickly - before the IP address had
> switched to the takeover machine. The connect would initially
> succeed, because ietd is running, but no session was ever created.
> Fairly quickly login would fail with code 0200 and the initiator would
> stop trying to reconnect. The next message would be that reconnect
> timed out. When no iscsi device is mounted this just results in the
> need to use iscsiadm to tell iscsid to reconnect to the targets. If a
> device is mounted and an operation is underway the kernel will
> actually hang. I can't even shutdown the machine cleanly.

I might have found the reason for that hang.

>
> In researching the configuration it looked like
> node.session.iscsi.DefaultTime2Wait would specify a wait time before
> beginning to reconnect, which would solve my problem. I could just
> set this to a time reasonably above the time for the clean failover to
> occur, then reconnection would happen correctly with the takeover
> server. I tried this and it didn't work.
>
> When I looked into the code it looks like DefaultTime2Wait is not
> actually used anywhere. It ends up on session->def_time2wait, but
> while def_time2wait is set and printed out in a couple of places in
> the code, it isn't ever actually used. I made a small change to
> usr/initiator.c so that on the first reconnect attempt if def_time2wait is
> nonzero reconnection will wait def_time2wait seconds before starting.
> Below is the diff from version 2.0-865.4.
>

Nice catch. We were setting the wrong internal value after negotiating
for the Time2Wait.

> --- ../../../open-iscsi-2.0-865.4/usr/initiator.c	2007-06-21 10:16:22.000000000 -0600
> +++ initiator.c	2007-08-09 15:32:11.000000000 -0600
> @@ -851,6 +851,12 @@ __session_conn_reopen(iscsi_conn_t *conn
>  	}
>  	conn->session->t->template->ep_disconnect(conn);
>
> +	if (session->reopen_cnt == 1 && session->time2wait == 0) {
> +		log_debug(4, "waiting default %d secs for first recovery attempt",
> +			  session->def_time2wait);
> +		session->time2wait = session->def_time2wait;
> +	}
> +
>  	if (session->time2wait)

A couple of lines above this, session->def_time2wait gets reset to whatever
you set in iscsid.conf. So we probably want something like the attached,
where we will use whatever we negotiated for. This patch also might fix
your lockup. There is a potential NULL pointer dereference in the scenario
you described. The patch was made over open-iscsi-2.0-865.9 on
open-iscsi.org. Please try it out and review it since it is 3 in the
morning here and I am about to go to sleep :) I did some quick tests on
it to make sure DefaultTime2Wait is being used now and that nothing
segfaults.

Thanks for the detailed bug report and patch!

fix-DefaultTime2Wait.patch

infernix

Aug 10, 2007, 12:52:58 PM

to open-...@googlegroups.com

Jarrod wrote:
> I have set up a pair of highly available iscsi targets using Linux,
> Enterprise iSCSI Target, Heartbeat, and DRBD, with the following
> characteristics:
>
> * There are two DRBD devices
> * ietd is always running on each server
> * There are two heartbeat resources. Each resource has an IP, a
> drbd device, and an iscsi target.

Could you tell us some more about this setup?

I'm guessing that this is two boxes with a disk/raidset, both running
DRBD to make sure that the data stays in sync; then the DRBD device is
exported over iSCSI, or perhaps there's LVM on top of the DRBD and
you're exporting LVM disks as targets? All the while heartbeat is used
to control hardware failures. Add a bonding network setup and your
open-iscsi initiators have two paths to a redundant iSCSI target cluster.

I'd be interested to hear about how you built this.

Thanks! :)

Jarrod Ribble

Aug 10, 2007, 1:39:31 PM

to open-...@googlegroups.com

It sounds like you have some ideas on how to make my setup better. I'm
not using LVM. I haven't looked very closely into LVM, don't know
enough about it, probably should.

I have two openSuSE 10.2 boxes with 8 disks each. 2 disks on each
machine are RAID 1 and hold the OS. The other 6 disks are put into two
RAID 5 arrays of 3 disks each, resulting in two 1.4 TB devices in the OS
(hardware raid). I have these two devices mirrored using DRBD between
the two servers, so I have two 1.4TB devices made of 6 disks. I can
lose any 3 disks and not lose my data.

Each of these devices is exposed as an iscsi target using a unique IP
address, target name, & scsiid. Either server can serve as the target,
but not both at the same time.

Heartbeat (v1) manages which server is exposing which target. Most of
the time each server exposes one of the targets. I included the
iscsi_target resource script and the resource definition in my previous
email. During failover the initiators block and appear to hang until
reconnect, at which point they keep going happily. So far we haven't
lost any data during failover. The iscsi targets use blockio and drbd
uses protocol C to ensure data is written to disk before a write is
reported.

That's as far as it goes on the target side.

On the initiator side I have four openSuSE 10.2 servers that are set up
with heartbeat (v2), Xen, and open-iscsi. I took my iscsi targets and
partitioned them to the maximum allowable 15 partitions of about 100 GB
each. (LVM might help me be a lot more flexible here - if you have any
ideas please share). I then create Xen virtual machines using one or
more of the iSCSI devices' partitions. I sync the vm configurations
across the 4 servers using rsync and put the vms under heartbeat's
control. During a clean failover the vm live migrates from one of the
servers to another without losing a beat. On a server failure one of
the other cluster servers boots the vm as soon as the cluster decides
the original server is down.

At least, this is how it is all supposed to work. I almost have it all
put together. The problem I was having, with the iscsi targets hanging
after failover while in use, is now resolved. My other
problem is that in the course of getting heartbeat for the vms working
they stopped live migrating. They were live migrating before, but now
they don't. I'm not sure why yet. Otherwise this setup seems to be
working well. We are subjecting each part to stress and failure testing
to see how well it works.

Right now I have one single point of failure, which is my network
switch. I also don't have a real stonith device for heartbeat. It
would also be nice to have more flexibility on the partition side.
Maybe cluster LVM would help me with this.

I was meaning to write a good howto once I finished - I ran into a
number of gotchas - but I haven't been taking the best notes.

BTW the redundant iSCSI SAN (the target servers) cost me about $13,000
($6,500 per server). They are 3U quad core Intel servers that hold up
to 16 SATA drives. I have 1 quad core CPU in each, but they will take
two, 8 drives in each, and a hardware raid controller that allows hot
swap. Overall a pretty good value for a fully redundant 2.8 TB SAN that
is cheaply expandable to several more TBs. Then again, there is the
time it takes to figure this out for the first time.

Jarrod Ribble

Aug 13, 2007, 2:22:31 PM

to open-...@googlegroups.com

I ran this through a few tests and it works well for my particular
issue. Thanks for the quick turnaround.

-----Original Message-----
From: open-...@googlegroups.com [mailto:open-...@googlegroups.com]
On Behalf Of Mike Christie
Sent: Wednesday, August 08, 2007 2:59 PM
To: open-...@googlegroups.com
Subject: Re: active/active open-iscsi & heartbeat & drbd

Mike Christie

Aug 13, 2007, 2:20:24 PM

to open-...@googlegroups.com

Jarrod Ribble wrote:
> I ran this through a few tests and it works well for my particular
> issue. Thanks for the quick turnaround.
>

Thanks for the update. I am going to finish up my testing on the patch
then will put it in open-iscsi-2.0-865.10 and release that on
open-iscsi.org later today or tomorrow.
