
Bug#740701: multipath-tools: mkfs fails "Add. Sense: Incompatible medium installed"


Ritesh Raj Sarraf

Jun 17, 2014, 1:40:03 AM
CCing: netapp-linux-community

On 06/17/2014 04:56 AM, Hans van Kranenburg wrote:
> I ran into this same issue today.
>
> iSCSI target: NetApp FAS2240-2 NetApp Release 8.1.2 7-Mode
> iSCSI initiator: Debian GNU/Linux (kernel 3.2.57-3, amd64)
>
> It seems that as soon as an unmap iscsi command is fired to the NetApp,
> the result is:
>
> [4230017.814546] sd 12:0:0:0: [sdc] Unhandled sense code
> [4230017.814610] sd 12:0:0:0: [sdc] Result: hostbyte=DID_OK
> driverbyte=DRIVER_SENSE
> [4230017.814682] sd 12:0:0:0: [sdc] Sense Key : Medium Error [current]
> [4230017.814748] sd 12:0:0:0: [sdc] Add. Sense: Incompatible medium
> installed
> [4230017.814818] sd 12:0:0:0: [sdc] CDB: Unmap/Read sub-channel: 42 00
> 00 00 00 00 00 00 18 00
> [4230017.814986] end_request: I/O error, dev sdc, sector 4372576896
>
> What exactly does this message mean (besides telling us that there's
> no cd in the cdrom drive, which is clearly not a very practical re-use
> of message numbers here :) ).
>
> I guess multipath tries to fire the same request to the next (and next)
> connection, because of the I/O error, which results in a situation where
> all paths are failing.
>
> The reason why testing with squeeze and 2.6.32 succeeds, and
> wheezy/jessie fails is that in the squeeze case, mkfs does not issue
> discard requests by default, because device mapper did not support it
> back then.
>
> I don't really know why this is happening. What I know is that it takes
> down the entire multipath/iscsi connection, because all paths start
> failing. In my case, the debian machine is a Xen dom0, which runs a
> number of virtual machines. All of them experienced 100% disk iowait
> right away. I think the "You've seen mkfs fail because all the paths
> were faulty." should be "You've seen all paths go faulty because you did
> an mkfs, which tried to discard the new empty space." :-)
>
> Additional interesting info: When doing the same on an iSCSI target
> which is a NetApp FAS2040, either running 8.1RC2 or 8.1, I can use iSCSI
> UNMAP. Well, at least when using Debian kernel 3.2.46-1+deb7u1 and
> 3.2.51-1, which were on the iSCSI initiators last time I used this.
>
> (Well, actually, it seems that NetApp equipment can respond quite badly
> to UNMAP (high load/latency spikes, even without using snapshots), so
> that's why we only use discard/unmap when removing old lvm logical
> volumes by slow-discarding them. Anyway, using it on the newer NetApp
> system has a clearly different result.)
>
> I just started researching this situation and found this bug report. I'd
> appreciate hearing whether the poster of this bug has made progress on
> this topic since Mar 5 2014.
>
> Attached is a slightly munged syslog file showing what happened this
> afternoon when trying to use fstrim on a mounted ext4 filesystem, which
> was on lvm on iscsi on this netapp.
>
> Although I do not have a dedicated test setup containing a spare NetApp
> FAS2240, I know this issue impacts the iSCSI initiator running Linux
> rather than the storage array itself, so I'd be happy to help debug this
> issue to find out what exactly is causing it, and how we could improve
> on it. Is it really a multipath issue or a kernel issue, or is the
> NetApp software to blame? (I cannot find anything related in release
> notes since 8.1.2.) Should iSCSI and/or multipath handle this response
> in another way?


Okay!! Let me check in the lab. I will try to reproduce it. In case you
do not hear back from me, please feel free to ping back.

Meanwhile, can you run sg_inq on the SCSI device ??

--
Ritesh Raj Sarraf | Linux Engineering | NetApp Inc.


Hans van Kranenburg

Jun 17, 2014, 2:00:02 PM
Hi,

On 06/17/2014 07:24 AM, Ritesh Raj Sarraf wrote:
>
> Okay!! Let me check in the lab. I will try to reproduce it. In case you
> do not hear back from me, please feel free to ping back.
>
> Meanwhile, can you run sg_inq on the SCSI device ??

Sure:

# sg_inq /dev/sg6
standard INQUIRY:
  PQual=0  Device_type=0  RMB=0  version=0x05  [SPC-3]
  [AERC=0]  [TrmTsk=0]  NormACA=1  HiSUP=1  Resp_data_format=2
  SCCS=0  ACC=0  TPGS=0  3PC=1  Protect=0  BQue=0
  EncServ=0  MultiP=1 (VS=0)  [MChngr=0]  [ACKREQQ=0]  Addr16=0
  [RelAdr=0]  WBus16=0  Sync=0  Linked=0  [TranDis=0]  CmdQue=1
  [SPI: Clocking=0x0  QAS=0  IUS=0]
    length=117 (0x75)   Peripheral device type: disk
 Vendor identification: NETAPP
 Product identification: LUN
 Product revision level: 811a
 Unit serial number: BWr95?E2aJc9

Besides this, I'm trying to isolate a test case that is as small as
possible to reproduce this behaviour, using an emptied server and only
giving this server access to a small test lun.

Test case 1:

No multipath or whatever, directly operate on the lun (well, on 1 of the
four paths). It's a small 10GB lun.

/mnt 0-# mkfs.ext4 -E nodiscard /dev/sdf
[...]
/mnt 0-# mkdir discard
/mnt 0-# mount /dev/sdf discard/
/mnt 0-# cd discard/
/mnt/discard 0-# fstrim -v -o 0 -l 128MB ./
./: 0 bytes were trimmed

Ok, that went fine, there's no data yet, so this was expected. Also, no
errors in dmesg. Let's create some random data on the lun, remove the
file and fstrim again:

/mnt/discard 0-# dd if=/dev/urandom of=bla bs=1028476 count=128
128+0 records in
128+0 records out
131644928 bytes (132 MB) copied, 14.7496 s, 8.9 MB/s

/mnt/discard 0-# fstrim -v -o 0 -l 256MB ./
./: 117063680 bytes were trimmed

So far so good.

The next test cases will work towards a situation identical to the one
in which the issue occurred yesterday: a striped lvm logical volume on
top of encryption and multipath... I hope that somewhere in between it
will break in a way that gives a clear pointer to where yesterday's
issue originates.

The way we use NetApp with Linux might sound a bit unusual, but it
works great in practice:

           xvda in a domU
                 |
                xen
                 |
        lv (striped, -i 2)
                 |
         lvm volume group
             /        \
            /          \
       lvm pv           lvm pv
      dm-crypt         dm-crypt
       mpath1           mpath2
        ||||             ||||
      a,b  c,d         e,f  g,h
      ||   ||          ||   ||
  --xxxx---||--------------xxxx---||---     switch
    |  |   ||                |  |   ||
  --|--|--xxxx-------------|--|--xxxx-----  switch
    |  |  |  |               |  |  |  |
    |  |  |  |               |  |  |  |
    a  b  c  d               e  f  g  h
  NetApp controller 1    NetApp controller 2

Because we don't care about snapshots and other fancy NetApp
functionality (sorry 'bout that :) ), we create two multipath devices
(each to a lun on a different one of the two disk controllers in a
netapp device), put encryption on them, create a lvm pv out of them, add
them together in a volume group, and take striped logical volumes out of
them. It's even usable on multiple attached servers, as long as you get
your locking on metadata operations done right.

But I have to leave now, will continue later.


--
Hans van Kranenburg - System / Network Engineer
+31 (0)10 2760434 | hans.van....@mendix.com | www.mendix.com



Hans van Kranenburg

Jun 17, 2014, 7:10:02 PM
On 06/17/2014 07:43 PM, Hans van Kranenburg wrote:
>
> But I have to leave now, will continue later.

Btw, netapp-linux-community, I kept the Cc in my last update, which is
now in a moderation queue of the mailing list. I joined the list, I
didn't even know it existed before... Read up at
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=740701 when interested.

This evening I tried to reproduce the problem in a test setup, issuing
unmap (discard) requests, using fstrim and mkfs in the following situations:

1. Just /dev/sdf, single path to single lun
2. Use multipath to single lun, /dev/mapper/mpatha
3. Use encryption on top of multipath, /dev/mapper/mpatha_luks
4. Use lvm on top of the encryption, /dev/vg_discard/lv_discard
5. Start using a second equally sized lun on the other netapp
controller, multipath to it, put encryption on it, pvcreate it, create a
new volume group containing both pvs, create a striped lv.

The sad part of the story is that I could not manage to get my iSCSI
connections toasted in any of the test cases yet.

For reference, this is what step 5 looks like:

# multipath -l
mpathb (360a9800042576c32412b4532614a6750) dm-2 NETAPP,LUN
size=10G features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=0 status=active
  |- 36:0:0:1 sdc 8:32  active undef running
  |- 38:0:0:1 sde 8:64  active undef running
  |- 35:0:0:1 sdb 8:16  active undef running
  `- 37:0:0:1 sdd 8:48  active undef running
mpatha (360a9800042577239353f4532614a6339) dm-1 NETAPP,LUN
size=10G features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=0 status=active
  |- 40:0:0:1 sdf 8:80  active undef running
  |- 42:0:0:1 sdi 8:128 active undef running
  |- 39:0:0:1 sdg 8:96  active undef running
  `- 41:0:0:1 sdh 8:112 active undef running

# cryptsetup --verbose --verify-passphrase luksFormat /dev/mapper/mpatha
# cryptsetup --verbose --verify-passphrase luksFormat /dev/mapper/mpathb

# cryptsetup status /dev/mapper/mpatha_luks
/dev/mapper/mpatha_luks is active.
  type:    LUKS1
  cipher:  aes-cbc-essiv:sha256
  keysize: 256 bits
  device:  /dev/mapper/mpatha
  offset:  4096 sectors
  size:    20967424 sectors
  mode:    read/write

# pvcreate /dev/mapper/mpatha_luks
# pvcreate /dev/mapper/mpathb_luks

# pvs
  PV                       VG         Fmt  Attr PSize  PFree
  /dev/mapper/mpatha_luks  vg_discard lvm2 a--  10.00g 9.50g
  /dev/mapper/mpathb_luks  vg_discard lvm2 a--  10.00g 9.50g

# vgcreate vg_discard /dev/mapper/mpatha_luks /dev/mapper/mpathb_luks
# lvcreate -i 2 -L 10G -n lv_discard --addtag $(hostname) vg_discard
# mkfs.ext4 /dev/vg_discard/lv_discard

or, do something like:

# dd if=/dev/zero of=sparse bs=1048576 seek=1024 count=0
# shred -n 1 -v sparse
# sync
# rm sparse
# sync
# fstrim -v -o 0MB -l 512MB ./
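
The windowed fstrim runs above (and the slow discard of removed logical volumes mentioned earlier in the thread) amount to walking offset/length windows across the device. A minimal sketch of that chunking, with the function name and sizes as hypothetical parameters:

```python
def discard_chunks(total_bytes, chunk_bytes):
    """Yield (offset, length) windows covering [0, total_bytes),
    one window per discard request."""
    offset = 0
    while offset < total_bytes:
        length = min(chunk_bytes, total_bytes - offset)
        yield offset, length
        offset += length

# Example: a 1 GiB region trimmed in 512 MiB steps.
chunks = list(discard_chunks(1 << 30, 512 << 20))
```

Each (offset, length) pair maps onto one `fstrim -o OFFSET -l LENGTH` invocation, with a pause in between to keep the load on the filer down.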

So, conclusions for now:
- This is not very easily reproducible, it's not just like "you need
to have multipath or this and that and then do mkfs or fstrim and then
it fails". But it's there, and in the production setup I've seen it
happen more than once now, yesterday being the case in which we could
connect the dots and pinpoint where the actual problem is. (1st time:
what the .. just happened, collect logs, 2nd time: different
situation, same cause, compare, etc, blam! it's the unmap iscsi)
- I can find very few search hits on this on the web, it does not seem
like a known issue, besides the OP of this bug and me reporting it.
Google for "CDB: Unmap/Read sub-channel: 42 00 00 00 00 00 00 00 18 00".
- There must be something different about the production setup, which
runs separately from this single physical server and its test luns, on
the exact same type of hardware, using identical software and identical
configuration, but fails all I/O after any UNMAP request. The
differences are that the production luns are accessed concurrently from
multiple physical servers, that there's a lot more I/O going on at any
moment, that there are a lot more logical volumes and data written to
the luns, etc etc...

Any ideas?

--
Hans van Kranenburg - System / Network Engineer
T +31 (0)10 2760434 | hans.van....@mendix.com | www.mendix.com

Sarraf, Ritesh

Jun 19, 2014, 12:40:03 PM
Hans,

SCSI UNMAP functionality was only completed very recently in the Linux kernel. From what I see so far, you seem to be running a 3.2 kernel. The Debian kernel team's policy for stable is to only backport important fixes and some device driver refreshes. I highly doubt they'd have backported SCSI enhancements.

If UNMAP is important to you as a feature, you may want to try evaluating a more recent kernel.

I just checked on b.d.o, it currently has Linux 3.12. That should be a good start to verify against.

https://packages.debian.org/search?keywords=linux-image&searchon=names&section=all&suite=wheezy-backports

Veeraraghavan, Kugesh

Jun 19, 2014, 1:10:02 PM
+ Martin who can provide assistance going forward.
_______________________________________________
Netapp-Linux-Community mailing list
Netapp-Linu...@linux.netapp.com
http://linux-mail.netapp.com/cgi-bin/mailman/listinfo/netapp-linux-community

Hans van Kranenburg

Jun 19, 2014, 7:10:01 PM
Hi,

On 06/19/2014 06:49 PM, Veeraraghavan, Kugesh wrote:
> + Martin who can provide assistance going forward.
>
> -----Original Message-----
> From: Netapp-Linux-Community [mailto:netapp-linux-co...@linux.netapp.com] On Behalf Of Sarraf, Ritesh
> Sent: Thursday, June 19, 2014 10:08 PM
> To: Hans van Kranenburg; 740...@bugs.debian.org
> Cc: Sarraf, Ritesh; Bill MacAllister; netapp-linu...@linux.netapp.com
> Subject: Re: [Netapp-Linux-Community] Bug#740701: multipath-tools: mkfs fails "Add. Sense: Incompatible medium installed"
>
> Hans,
>
> SCSI UNMAP functionality was only completed very recently in the Linux kernel. From what I see so far, you seem to be running a 3.2 kernel. The Debian kernel team's policy for stable is to only backport important fixes and some device driver refreshes. I highly doubt they'd have backported SCSI enhancements.
>
> If UNMAP is important to you as a feature, you may want to try evaluating a more recent kernel.
>
> I just checked on b.d.o, it currently has Linux 3.12. That should be a good start to verify against.
>
> https://packages.debian.org/search?keywords=linux-image&searchon=names&section=all&suite=wheezy-backports

As long as I have not succeeded in constructing a proper isolated
reproducible test case which fails consistently, there's not much reason
to start changing or trying anything, I guess.

So, right now I have built a copy of the production environment setup
where this occurred (see prev messages in debian bug report [1]), and I
cannot trigger the same bug there yet, so something important must still
be different. Same linux kernel, same ONTAP version, same layering of
iscsi, multipath, dm_crypt and lvm...

In the meanwhile, I've been browsing around a bit in the kernel git
history, also comparing latest (3.16 now) and stable linux-3.2.y.

When taking the error messages as starting point...

sd 12:0:0:0: [sdc] Unhandled sense code
sd 12:0:0:0: [sdc] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
sd 12:0:0:0: [sdc] Sense Key : Medium Error [current]
sd 12:0:0:0: [sdc] Add. Sense: Incompatible medium installed
sd 12:0:0:0: [sdc] CDB: Unmap/Read sub-channel: 42 00 00 00 00 00 00 00
18 00
end_request: I/O error, dev sdc, sector 4372576896

...I see that the only place in the linux kernel where the message
"Unhandled sense code" occurs is in drivers/scsi/scsi_lib.c, around line
900, where a case statement is executed on the sense key value.

SENSE KEYS are defined in include/scsi/scsi.h. One of them is
MEDIUM_ERROR, 0x03, which seems to be the one we got here. MEDIUM_ERROR
is not handled in the case statement, so the fall-through is executed:

 897        default:
 898                description = "Unhandled sense code";
 899                action = ACTION_FAIL;
 900                break;

This is the case in 3.2 and in 3.16 as well. There are changes related
to UNMAP or WRITE_SAME in this area, but only in case of
ILLEGAL_REQUEST. One of those changes (66a651a) is related to preventing
complete path failures on failed operations. But, that codepath is not
chosen in this case anyway, so not relevant.
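
For reference, the sense data and CDB in the log above can be decoded by hand. A small sketch follows; the lookup tables are deliberately partial (only the SPC-3/SBC-3 entries relevant to this thread), and the function name is ours, not a kernel API:

```python
# Partial sense-code tables (SPC-3); only the entries seen in this thread.
SENSE_KEYS = {0x03: "MEDIUM ERROR"}
ADDITIONAL_SENSE = {0x30: "INCOMPATIBLE MEDIUM INSTALLED"}

def decode_unmap_cdb(cdb):
    """Decode the 10-byte UNMAP CDB (opcode 0x42 per SBC-3; the same
    opcode is READ SUB-CHANNEL for CD devices, which is why the kernel
    prints the combined label "Unmap/Read sub-channel")."""
    if cdb[0] != 0x42:
        raise ValueError("not an UNMAP CDB")
    # Bytes 7-8 hold the parameter list length, big-endian.
    return (cdb[7] << 8) | cdb[8]

# The CDB from the log: 42 00 00 00 00 00 00 00 18 00
plen = decode_unmap_cdb(bytes.fromhex("42000000000000001800"))
# 0x18 = 24 bytes: an 8-byte parameter list header plus one 16-byte
# block descriptor, i.e. a single-range UNMAP request.
```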

I really want to understand what this message means and who generated
it. If I assume it's the NetApp filer that returns this data when
issuing an UNMAP command, I wonder why. Could any of you NetApp folks
shed some light on when and how ontap 8.1.2 7-Mode might ever send this
error back? This could help finding a test case which will trigger this
error, and to find out why it does not occur in a different situation
that seems to be identical.

The whole error still does not make a huge amount of sense to me. Why
would my NetApp system return an error which I only would expect to see
when using a CD-ROM drive or some USB removable device? It's not like
all disks suddenly vanished from my FAS and disk shelves, like ejecting
a CD drive. :-)

[1] http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=740701

--
Hans van Kranenburg - System / Network Engineer
T +31 (0)10 2760434 | hans.van....@mendix.com | www.mendix.com


Sarraf, Ritesh

Jun 21, 2014, 5:20:06 PM
Hello Hans,

Comments below....
[rrs] Thanks. Please keep us posted once you have concluded on a persistent "steps to reproduce".

Ritesh Raj Sarraf

Jun 21, 2014, 6:20:02 PM
On 06/22/2014 02:38 AM, Sarraf, Ritesh wrote:

>> As long as I have not succeeded in constructing a proper isolated reproducible test case which fails consistently, there's not much reason to start changing or trying anything, I guess.
>
> [rrs] Thanks. Please keep us posted once you have concluded on a persistent "steps to reproduce".

Bill,

You reported this issue originally on a non-NetApp box. From what we suspect, this has more to do with the UNMAP implementation in the Linux kernel, for which proper support is very recent.
Would you be in a position to verify this against a newer kernel?
-- 
Ritesh Raj Sarraf | http://people.debian.org/~rrs
Debian - The Universal Operating System

Ritesh Raj Sarraf

Jun 21, 2014, 6:30:01 PM
On 06/22/2014 03:45 AM, Ritesh Raj Sarraf wrote:

> On 06/22/2014 02:38 AM, Sarraf, Ritesh wrote:
>
>>> As long as I have not succeeded in constructing a proper isolated reproducible test case which fails consistently, there's not much reason to start changing or trying anything, I guess.
>>
>> [rrs] Thanks. Please keep us posted once you have concluded on a persistent "steps to reproduce".
>
> Bill,
>
> You reported this issue originally on a non-NetApp box. From what we suspect, this has more to do with the UNMAP implementation in the Linux kernel, for which proper support is very recent.
> Would you be in a position to verify this against a newer kernel?

Nope. Actually you too are on a NetApp box (your initial logs) and that is why you see the UNMAP attribute. Our target implements it.

I guess I should get some sleep now. :-)

Martin George

Jun 22, 2014, 4:40:02 AM
On 6/20/2014 4:36 AM, Hans van Kranenburg wrote:
>
> I really want to understand what this message means and who generated
> it. If I assume it's the NetApp filer that returns this data when
> issuing an UNMAP command, I wonder why. Could any of you NetApp folks
> shed some light on when and how ontap 8.1.2 7-Mode might ever send this
> error back? This could help finding a test case which will trigger this
> error, and to find out why it does not occur in a different situation
> that seems to be identical.
>
> The whole error still does not make a huge amount of sense to me. Why
> would my NetApp system return an error which I only would expect to see
> when using a CD-ROM drive or some USB removable device? It's not like
> all disks suddenly vanished from my FAS and disk shelves, like ejecting
> a CD drive. :-)
>
> [1] http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=740701
>

So firstly, the question arises why your kernel marked all paths as
failed when you hit this error. This actually resembles the old Linux
behavior where, for a device error such as a MEDIUM ERROR, the command
gets retried on all paths available to the LUN, all of which result in
the same error, and hence all paths get marked as failed. This was
addressed with the upstream patch at
http://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?id=63583cca745f440167bf27877182dc13e19d4bcf,
where more fine-grained error handling is now available. With this,
device errors such as MEDIUM ERROR are no longer retried, since the
kernel treats such errors as permanent. That makes me suspect your
kernel is already missing some of the key patches from the upstream
kernel in the context of this error handling. And given that UNMAP has
also been a relatively new feature which underwent several upstream
revisions to get to the current stable state, it would be prudent for
you to check if your kernel is up-to-date with its SCSI & UNMAP handling.

That said, it is indeed strange that you hit a MEDIUM ERROR in the first
place, when using UNMAP. As described above, that's a device error. So
does this fail even for other commands such as a regular write (you
could try this with dd) or even a simple TUR command (like say using
sg_turs -v /dev/mpathX)?

-Martin

Hans van Kranenburg

Jun 22, 2014, 7:40:01 PM
Hi,

On 06/22/2014 10:19 AM, Martin George wrote:
>
> So firstly, the question arises why your kernel marked all paths as
> failed when you hit this error. This actually resembles the old Linux
> behavior where for a device error such as a MEDIUM ERROR, it gets
> retried on all paths available to the LUN, all which result in the same
> error, and hence all paths get marked as failed. This was addressed with
> the upstream patch at
> http://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?id=63583cca745f440167bf27877182dc13e19d4bcf,
> where more fine-grained error handling is now available.

Yes, it retries on all paths. The kernel version (3.2.57) which is used
in my case already includes the changes mentioned above.

> With this,
> device errors such as MEDIUM ERROR are no longer retried since it treats
> such errors as permanent errors. That makes me suspect your kernel is
> already missing some of the key patches from the upstream kernel in
> context with this error handling. And given that UNMAP has also been a
> relatively new feature which underwent several upstream revisions to get
> to the current stable state, it would be prudent for you to check if
> your kernel is up-to-date with its SCSI & UNMAP handling.

Currently I'm not able to reproduce the error (getting this iSCSI
response) that I see in production, even after re-creating a very
similar test setup using the same hardware and software that is failing
on me, which is a bit confusing. :||

So, even worse, I'm not convinced that the actual problem is a linux
kernel problem yet. Why is my NetApp filer sending a MEDIUM ERROR
"Incompatible medium installed" to me anyway in the other case?

The latest kernel code only prevents (afaics) the retry in a small
subset of cases, which does not include an asc of 0x30 INCOMPATIBLE
MEDIUM INSTALLED.

case MEDIUM_ERROR:
        if (sshdr.asc == 0x11 || /* UNRECOVERED READ ERR */
            sshdr.asc == 0x13 || /* AMNF DATA FIELD */
            sshdr.asc == 0x14) { /* RECORD NOT FOUND */
                set_host_byte(scmd, DID_MEDIUM_ERROR);
                return SUCCESS;
        }
        return NEEDS_RETRY;
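
Restated, the retry decision in that fragment is a small predicate on the ASC. A sketch of the same logic (the helper name is ours; the ASC values are copied from the kernel code quoted above):

```python
# ASCs that the kernel fragment above treats as permanent medium errors.
PERMANENT_MEDIUM_ASCS = {
    0x11,  # UNRECOVERED READ ERROR
    0x13,  # AMNF DATA FIELD
    0x14,  # RECORD NOT FOUND
}

def medium_error_action(asc):
    """Mirror the quoted fragment: fail fast on a known-permanent ASC,
    otherwise retry (which, under multipath, means retry on the next
    path)."""
    return "fail" if asc in PERMANENT_MEDIUM_ASCS else "retry"

# 0x30 (INCOMPATIBLE MEDIUM INSTALLED) is not in the set, so it gets
# retried on every path: exactly the all-paths-failing behaviour seen.
action = medium_error_action(0x30)
```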

> That said, it is indeed strange that you hit a MEDIUM ERROR in the first
> place, when using UNMAP. As described above, that's a device error. So
> does this fail even for other commands such as a regular write (you
> could try this with dd) or even a simple TUR command (like say using
> sg_turs -v /dev/mpathX)?

# sg_turs -v /dev/mapper/mpath_scylla0
test unit ready cdb: 00 00 00 00 00 00

The UNMAP is the only command that causes the failure. As long as I do
not cause an UNMAP to be sent (which happens when running mkfs.ext4
without -E nodiscard, running mkfs.btrfs without preventing discard, or
issuing an fstrim command), this multipathed lvm on iscsi handles
millions of iscsi write and read ops every day in production just fine.
If an UNMAP is sent, it makes all iSCSI storage on a physical server
hang, as seen before.

Today I played around a bit in my test environment (where the failure
does not occur yet), also tcpdumping the iSCSI traffic, viewing it
afterwards using wireshark, and reading about the SCSI specs. That's a
very interesting way to learn more about what I'm talking about here. :-)

If there's no obvious way to be found to trigger the same error in the
test environment, I think I'm going to propose to trigger the same again
while having the test physical server attached to the production luns.
From the past occurrence, I know that the only thing that breaks is the
storage connection on the physical server that executes the UNMAP. It's
still not the most reassuring choice, but a kind of a calculated risk.

If that's possible I can do a couple of tcpdumps on the iscsi and
blktrace dumps to capture what's going on and post them here. Doing so
will prove whether the SCSI error was actually being sent by the NetApp
device or not.

--
Hans van Kranenburg - System / Network Engineer
T +31 (0)10 2760434 | hans.van....@mendix.com | www.mendix.com


Hans van Kranenburg

Jun 23, 2014, 6:50:02 PM
On 06/23/2014 06:31 PM, Hans van Kranenburg wrote:
> On 06/23/2014 01:30 AM, Hans van Kranenburg wrote:
>>
>> If there's no obvious way to be found to trigger the same error in the
>> test environment, I think I'm going to propose to trigger the same again
>> while having the test physical server attached to the production luns.
>> From the past occurrence, I know that the only thing that breaks is
>> the storage connection on the physical server that executes the UNMAP.
>> It's still not the most reassuring choice, but a kind of a calculated
>> risk.
>>
>> If that's possible I can do a couple of tcpdumps on the iscsi and
>> blktrace dumps to capture what's going on and post them here. Doing so
>> will prove whether the SCSI error was actually being sent by the NetApp
>> device or not.
>
> And that's what I just did, together with a colleague of mine. On one
> lun, the NetApp box accepts unmap, on another lun it throws up with
> Incompatible Medium Installed. All other iSCSI connections from other
> physical servers to the same production lun are not impacted, only the
> connection to this server.
>
> [...]

For netapp-linux-community folks, previous mail is still in moderation
queue, you can also read it in the debian bug report, including
interesting tcpdump attachments with iscsi traffic while the errors
occur: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=740701#87

Ok, for the sake of completeness, I installed the 3.14 kernel from
wheezy-backports (linux-image-3.14-0.bpo.1-amd64 3.14.7-1~bpo70+1) and
reran the test, which provides the exact same results. Doing unmap on
the test lun succeeds; doing unmap on the other lun results in the same
behaviour and same errors, in slightly different formatting than when
using the 3.2 kernel:

[...]
Jun 23 23:29:51 jolteon kernel: [ 678.219033] sd 9:0:0:0: [sdl]
Unhandled sense code
Jun 23 23:29:51 jolteon kernel: [ 678.219142] sd 9:0:0:0: [sdl]
Jun 23 23:29:51 jolteon kernel: [ 678.219234] Result: hostbyte=DID_OK
driverbyte=DRIVER_SENSE
Jun 23 23:29:51 jolteon kernel: [ 678.219331] sd 9:0:0:0: [sdl]
Jun 23 23:29:51 jolteon kernel: [ 678.219423] Sense Key : Medium Error
[current]
Jun 23 23:29:51 jolteon kernel: [ 678.219653] sd 9:0:0:0: [sdl]
Jun 23 23:29:51 jolteon kernel: [ 678.219753] Add. Sense: Incompatible
medium installed
Jun 23 23:29:51 jolteon kernel: [ 678.219926] sd 9:0:0:0: [sdl] CDB:
Jun 23 23:29:51 jolteon kernel: [ 678.220019] Unmap/Read sub-channel:
42 00 00 00 00 00 00 00 18 00
Jun 23 23:29:51 jolteon kernel: [ 678.220946] device-mapper: multipath:
Failing path 8:176.
[...]

By the way, the first message on this debian bug report, from Bill
MacAllister, already listed the output of a very recent linux kernel
when using the test case 'mkfs on jessie'.

That concludes the discussion about older or newer linux kernels. The
real problem here is the NetApp, returning SCSI errors when UNMAP
commands are issued to it.

Questions left:
- Would it be desirable to have linux kernel multipathing fail an iop
instead of retrying it, on receiving the combination of a medium error
and the additional code incompatible medium installed?
- Now I'm left with my broken NetApp, and I'd like to start using
UNMAP on it... Any comments from netapp people reading this? There must
be some reason why this is happening, and only on this specific lun, and
not on the test lun, or on several of the other NetApp filers we use.

Martin George

Jun 24, 2014, 1:30:02 AM
On 6/24/2014 4:11 AM, Hans van Kranenburg wrote:
> On 06/23/2014 06:31 PM, Hans van Kranenburg wrote:
> Questions left:
> - Would it be desirable to have linux kernel multipathing fail an iop
> instead of retrying it, on receiving the combination of a medium error
> and the additional code incompatible medium installed?

Well, it would have been ideal if the Linux kernel had given up for this
MEDIUM ERROR - INCOMPATIBLE MEDIUM INSTALLED ASC as well, instead of
retrying on all available paths (since retrying would only end up
hitting the same error again, given that this is a device error).

> - Now I'm left with my broken NetApp, and I'd like to start using
> UNMAP on it... Any comments from netapp people reading this? There must
> be some reason why this is happening, and only on this specific lun, and
> not on the test lun, or on several of the other NetApp filers we use.
>

Yes, the NetApp controllers are returning this MEDIUM ERROR check
condition for some reason. I'd suggest you open a NetApp support ticket
for tracking this.

-Martin

Hans van Kranenburg

Jul 31, 2014, 4:20:01 PM
Hello again,

On 06/24/2014 07:24 AM, Martin George wrote:
> On 6/24/2014 4:11 AM, Hans van Kranenburg wrote:
>> On 06/23/2014 06:31 PM, Hans van Kranenburg wrote:
>> Questions left:
>> - Would it be desirable to have linux kernel multipathing fail an iop
>> instead of retrying it, on receiving the combination of a medium error
>> and the additional code incompatible medium installed?
>
> Well, it would have been ideal if the Linux kernel had given up for this
> MEDIUM ERROR - INCOMPATIBLE MEDIUM INSTALLED ASC as well, instead of
> retrying on all available paths (since retrying would only end up
> hitting the same error again, given that this is a device error).

Yeah, that's true. In that case, even if the actual medium error were
caused by a bug, as is the case here, it would fail a single path, then
fail the iop, and then, after the direct-io checker came along, the
path would be enabled again.

I'll leave this as an exercise for myself to create a kernel patch that
would do this. Seems doable, the only problem is that I need a
development environment that simulates exactly this behaviour. I can
obviously not use our production netapp system for this. :)

>> - Now I'm left with my broken NetApp, and I'd like to start using
>> UNMAP on it... Any comments from netapp people reading this? There must
>> be some reason why this is happening, and only on this specific lun, and
>> not on the test lun, or on several of the other NetApp filers we use.
>>
>
> Yes, the NetApp controllers are returning this MEDIUM ERROR check
> condition for some reason. I'd suggest you open a NetApp support ticket
> for tracking this.

And so I did. The answer from NetApp support is:

"Since the UNMAP command is a host OS related command, it is not
something that we implement explicitly in Data ONTAP. The
functionality to support the command is something we do have to test,
however, and any changes or additions to support it are done in new
releases. This is why you are able to get some functionality out of the
command in 8.1.2. However, it's not fully certified in Data ONTAP until
8.1.3."

So, using UNMAP might work in ONTAP versions before 8.1.3, but it does
not have an official approval stamp by NetApp.

Also see:
- https://kb.netapp.com/support/index?page=content&id=3013806
- https://kb.netapp.com/support/index?page=content&id=3013991

So, although this functionality is reported as 'working' in many cases,
it seems that NetApp itself has fixed some bugs in 8.1.3 and only then
officially started supporting it.

Right now, our idea is to do some upgrades from random 8.1 versions we
run to the latest maintenance version of ONTAP 8.1 (8.1.4P1), which is
not a bad idea in any case.

Anyway, I'm still typing this in a debian bug report. I know this is
actually not a debian bug at all anymore, but since I guess that whoever
encounters the same issue and starts to look for a solution on the
internet ends up reading this page, I'd like to use this bug report to
document my experiences.

If Bill (OP) agrees, I'll close it as soon as I've found a solution for
the problem and documented it, whatever that is.

--
Hans van Kranenburg - System / Network Engineer
T +31 (0)10 2760434 | hans.van....@mendix.com | www.mendix.com


Ritesh Raj Sarraf

Aug 1, 2014, 2:20:04 AM
Control: severity -1 normal



On 08/01/2014 01:43 AM, Hans van Kranenburg wrote:

> Anyway, I'm still typing this in a debian bug report. I know this is actually not a debian bug at all anymore, but since I guess that whoever encounters the same issue and starts to look for a solution on the internet ends up reading this page, I'd like to use this bug report to document my experiences.
>
> If Bill (OP) agrees, I'll close it as soon as I've found a solution for the problem and documented it, whatever that is.

Thanks for the update Hans. I am thus downgrading the severity to normal.

Bill MacAllister

Aug 1, 2014, 1:30:02 PM


--On Thursday, July 31, 2014 10:13:15 PM +0200 Hans van Kranenburg <hans.van....@mendix.com> wrote:

> If Bill (OP) agrees, I'll close it as soon as I found a solution for
> the problem, documenting it, whatever that is.

That is fine by me. Thanks for your work on this.

Bill

--

Bill MacAllister
System Programmer, Stanford University

Hans van Kranenburg

Sep 1, 2014, 4:40:02 PM
Well, actually it was...

It turns out that, since the upgrade to ONTAP 8.1.4P1, we have this
issue with UNMAP on all our NetApp filers, instead of only one.

We first did the upgrade to 8.1.4P1 on a system that is hosting backups
and some office infrastructure, before doing the production systems.
Looking back, it would have been better if we had started testing UNMAP
again on that system after the first upgrade, because then we would
have known that upgrading actually introduced the same issue there.

Sadly, I was unwittingly assuming that the upgrade would hopefully fix
the issue, instead of just causing it, so we didn't really extensively
test on the system that was previously working fine. (argh!)

The only upside of this situation is that I can consistently reproduce
the error now in a dedicated test setup with some separate physical
servers and a dedicated test-lun on the backups/office system, where I
can start playing around, also with multipath/iscsi configuration in
linux, without having access to any production data.

And I'm trying to get the support ticket at NetApp escalated to the next
level...

To be continued...