[RFC][PATCH] fix for async scsi scan sysfs problem (resend)

Josef Bacik

unread,

Apr 19, 2007, 9:24:55 AM4/19/07

to linux...@vger.kernel.org

Hello,

Resending this to a wider audience (thanks Andrew). I'm having a problem on
the newest version of linus's git tree with my qla2xxx card. This is on a UP
box, the problem doesn't happen on my similarly configured SMP box. When I
unload and then try to load the qla2xxx driver again I get this message

kobject_add failed for 3:0:0:0 with -EEXIST, don't try to register
things with the same name in the same directory.
[<c0405ea6>] show_trace_log_lvl+0x1a/0x2f
[<c0406456>] show_trace+0x12/0x14
[<c04064d1>] dump_stack+0x16/0x18
[<c04e6e86>] kobject_shadow_add+0xcd/0x1df
[<c04e6fa2>] kobject_add+0xa/0xc
[<c0557ae1>] device_add+0xab/0x62e
[<d0873a0f>] scsi_sysfs_add_sdev+0x2d/0x1eb [scsi_mod]
[<d0871db8>] scsi_probe_and_add_lun+0x974/0xaa5 [scsi_mod]
[<d087240a>] __scsi_scan_target+0xc0/0x5f1 [scsi_mod]
[<d0872ec5>] scsi_scan_target+0x97/0xa6 [scsi_mod]
[<d08b1c34>] fc_scsi_scan_rport+0x5a/0x76 [scsi_transport_fc]
[<c0435f33>] run_workqueue+0x89/0x14e
[<c0436949>] worker_thread+0xf8/0x124
[<c043911b>] kthread+0xb3/0xdc
[<c0405b4f>] kernel_thread_helper+0x7/0x10
=======================

I traced this down to the async scanning doing a kobject_add for that object,
the backtrace below shows the path we took to add it.

[<c0405ea6>] show_trace_log_lvl+0x1a/0x2f
[<c0406456>] show_trace+0x12/0x14
[<c04064d1>] dump_stack+0x16/0x18
[<c04e6e86>] kobject_shadow_add+0xcd/0x1df
[<c04e6fa2>] kobject_add+0xa/0xc
[<c055a45c>] class_device_add+0x9e/0x3ad
[<d0873a3c>] scsi_sysfs_add_sdev+0x5a/0x1eb [scsi_mod]
[<d0872cd4>] do_scan_async+0x62/0xf8 [scsi_mod]
[<c043911b>] kthread+0xb3/0xdc
[<c0405b4f>] kernel_thread_helper+0x7/0x10
=======================

Looking through everything I came to the conclusion that we don't really need
the scsi_sysfs_add_devices in scsi_finish_async_scan, which gets run everytime
we do a do_scan_async. In doing the scanning, if we come upon anything we will
already be registering the device with sysfs so the scsi_sysfs_add_devices step
is kind of useless. I tested this and it worked fine on my UP box (where the
problem was happening) and my SMP box (where the problem wasn't happening). Now
I'm not entirely sure if this is correct, but I'm attaching the patch that I
used to fix it for me, please point out if I've done something wrong or if there
is a different way this needs to be fixed. Thank you,

Josef

PS. I'm not on linux-scsi so please CC me, thanks for your time.

diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
index 0949145..2c8527b 100644
--- a/drivers/scsi/scsi_scan.c
+++ b/drivers/scsi/scsi_scan.c
@@ -1661,15 +1661,6 @@ int scsi_scan_host_selected(struct Scsi_
return 0;
}

-static void scsi_sysfs_add_devices(struct Scsi_Host *shost)
-{
- struct scsi_device *sdev;
- shost_for_each_device(sdev, shost) {
- if (scsi_sysfs_add_sdev(sdev) != 0)
- scsi_destroy_sdev(sdev);
- }
-}
-
/**
* scsi_prep_async_scan - prepare for an async scan
* @shost: the host which will be scanned
@@ -1741,8 +1732,6 @@ static void scsi_finish_async_scan(struc

wait_for_completion(&data->prev_finished);

- scsi_sysfs_add_devices(shost);
-
spin_lock(&async_scan_lock);
shost->async_scan = 0;
list_del(&data->list);
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

James Bottomley

unread,

Apr 19, 2007, 10:03:33 AM4/19/07

to Josef Bacik

On Thu, 2007-04-19 at 09:25 -0400, Josef Bacik wrote:
> Looking through everything I came to the conclusion that we don't really need
> the scsi_sysfs_add_devices in scsi_finish_async_scan, which gets run everytime
> we do a do_scan_async. In doing the scanning, if we come upon anything we will
> already be registering the device with sysfs so the scsi_sysfs_add_devices step
> is kind of useless.

Unfortunately, it isn't. The registration step while scanning is at the
end of scsi_add_lun():

if (!async && scsi_sysfs_add_sdev(sdev) != 0)
return SCSI_SCAN_NO_RESPONSE;

return SCSI_SCAN_LUN_PRESENT;

The !async should mean that the addition *only* occurs for the non async
scan case ... if you remove the post async scan add, we'll lose devices.

> I tested this and it worked fine on my UP box (where the
> problem was happening) and my SMP box (where the problem wasn't happening). Now
> I'm not entirely sure if this is correct, but I'm attaching the patch that I
> used to fix it for me, please point out if I've done something wrong or if there
> is a different way this needs to be fixed. Thank you,

Could you add some debugging first to see if we're actually adding the
device twice (and also, if we are, what the value of the async is).

James

Josef Bacik

unread,

Apr 19, 2007, 11:06:30 AM4/19/07

to James Bottomley

On Thu, Apr 19, 2007 at 10:02:36AM -0400, James Bottomley wrote:
> On Thu, 2007-04-19 at 09:25 -0400, Josef Bacik wrote:
> > Looking through everything I came to the conclusion that we don't really need
> > the scsi_sysfs_add_devices in scsi_finish_async_scan, which gets run everytime
> > we do a do_scan_async. In doing the scanning, if we come upon anything we will
> > already be registering the device with sysfs so the scsi_sysfs_add_devices step
> > is kind of useless.
>
> Unfortunately, it isn't. The registration step while scanning is at the
> end of scsi_add_lun():
>
>
> if (!async && scsi_sysfs_add_sdev(sdev) != 0)
> return SCSI_SCAN_NO_RESPONSE;
>
> return SCSI_SCAN_LUN_PRESENT;
>
> The !async should mean that the addition *only* occurs for the non async
> scan case ... if you remove the post async scan add, we'll lose devices.
>
> > I tested this and it worked fine on my UP box (where the
> > problem was happening) and my SMP box (where the problem wasn't happening). Now
> > I'm not entirely sure if this is correct, but I'm attaching the patch that I
> > used to fix it for me, please point out if I've done something wrong or if there
> > is a different way this needs to be fixed. Thank you,
>
> Could you add some debugging first to see if we're actually adding the
> device twice (and also, if we are, what the value of the async is).
>

Sorry I should have put that in the original post, I added debugging to
kobject_add to check to see if we were adding something twice, thats how I
figured out who was doing it

kobject rport-3:0-0: registering. parent: host3, set: devices
kobject rport-3:0-0: registering. parent: fc_remote_ports, set: class_obj
kobject target3:0:0: registering. parent: rport-3:0-0, set: devices
kobject rport-3:0-1: registering. parent: host3, set: devices
kobject rport-3:0-1: registering. parent: fc_remote_ports, set: class_obj
kobject target3:0:0: registering. parent: fc_transport, set: class_obj
kobject rport-3:0-2: registering. parent: host3, set: devices
kobject rport-3:0-2: registering. parent: fc_remote_ports, set: class_obj
kobject rport-3:0-3: registering. parent: host3, set: devices
kobject rport-3:0-3: registering. parent: fc_remote_ports, set: class_obj
kobject rport-3:0-4: registering. parent: host3, set: devices
kobject rport-3:0-4: registering. parent: fc_remote_ports, set: class_obj
kobject rport-3:0-5: registering. parent: host3, set: devices
kobject rport-3:0-5: registering. parent: fc_remote_ports, set: class_obj
kobject rport-3:0-6: registering. parent: host3, set: devices
kobject rport-3:0-6: registering. parent: fc_remote_ports, set: class_obj
kobject rport-3:0-7: registering. parent: host3, set: devices
kobject rport-3:0-7: registering. parent: fc_remote_ports, set: class_obj
>> kobject 3:0:0:0: registering. parent: target3:0:0, set: devices
kobject 3:0:0:0: registering. parent: scsi_device, set: class_obj
scsi 3:0:0:0: Direct-Access IBM 1742-900 0520 PQ: 0 ANSI: 3
>> kobject 3:0:0:0: registering. parent: target3:0:0, set: devices

kobject_add failed for 3:0:0:0 with -EEXIST, don't try to register things with
the same name in the same directory.

Async in the first case is set and in the second case it isn't set. Thank you,

Josef

Andrew Morton

unread,

Apr 21, 2007, 3:24:30 AM4/21/07

to Josef Bacik

So.... do we now know what is causing this failure?

Josef Bacik

unread,

Apr 21, 2007, 9:59:38 AM4/21/07

to Andrew Morton

Yes, the qla init stuff kicks off the async scanning, and then continues to
initialize, and registers itself with the scsi fc stuff, which makes a workqueue
to register the fc port. The problem with this is the async scanning and the fc
port registration stuff runs back to back on my UP computer, so the scsi async
stuff does the sysfs registration and then right after that the scsi fc stuff
goes through and does its own scsi scan without async set, which does the sysfs
registration as well, and then kobject complains about adding the same thing
twice. Taking out the async check in scsi_add_lun would probably work, but this
really isn't my area of expertise and I have a feeling thats kind of defeating
the purpose of the async scsi scanning.

Josef

Josef Bacik

unread,

Apr 23, 2007, 2:12:41 PM4/23/07

to Josef Bacik

On Sat, Apr 21, 2007 at 09:59:56AM -0400, Josef Bacik wrote:
> On Sat, Apr 21, 2007 at 12:23:45AM -0700, Andrew Morton wrote:
> > On Thu, 19 Apr 2007 11:06:56 -0400 Josef Bacik <jwh...@redhat.com> wrote:
> >
> > > On Thu, Apr 19, 2007 at 10:02:36AM -0400, James Bottomley wrote:
> > > > On Thu, 2007-04-19 at 09:25 -0400, Josef Bacik wrote:
> > > > > Looking through everything I came to the conclusion that we don't really need
> > > > > the scsi_sysfs_add_devices in scsi_finish_async_scan, which gets run everytime
> > > > > we do a do_scan_async. In doing the scanning, if we come upon anything we will
> > > > > already be registering the device with sysfs so the scsi_sysfs_add_devices step
> > > > > is kind of useless.
> > > >
> > > > Unfortunately, it isn't. The registration step while scanning is at the
> > > > end of scsi_add_lun():
> > > >
> > > >
> > > > if (!async && scsi_sysfs_add_sdev(sdev) != 0)
> > > > return SCSI_SCAN_NO_RESPONSE;
> > > >
> > > > return SCSI_SCAN_LUN_PRESENT;
> > > >
> > > > The !async should mean that the addition *only* occurs for the non async
> > > > scan case ... if you remove the post async scan add, we'll lose devices.
> > > >

Ok I have a new patch that I've built and tested on both my UP and SMP machine
and it appears to work fine. I took the async check out of scsi_add_lun, I
don't really see the point in waiting to do the sysfs registration stuff (if
theres a reason I haven't been able to find it in the original submission of
this functionality). Please let me know if this is incorrect. Thank you,

Josef

diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
index 0949145..8af1e16 100644
--- a/drivers/scsi/scsi_scan.c
+++ b/drivers/scsi/scsi_scan.c
@@ -712,7 +712,7 @@ static int scsi_probe_lun(struct scsi_device *sdev, unsigned char *inq_result,
* SCSI_SCAN_LUN_PRESENT: a new scsi_device was allocated and initialized
**/
static int scsi_add_lun(struct scsi_device *sdev, unsigned char *inq_result,
- int *bflags, int async)
+ int *bflags)
{
/*
* XXX do not save the inquiry, since it can change underneath us,
@@ -912,7 +912,7 @@ static int scsi_add_lun(struct scsi_device *sdev, unsigned char *inq_result,
* register it and tell the rest of the kernel
* about it.
*/
- if (!async && scsi_sysfs_add_sdev(sdev) != 0)
+ if (scsi_sysfs_add_sdev(sdev) != 0)
return SCSI_SCAN_NO_RESPONSE;

return SCSI_SCAN_LUN_PRESENT;
@@ -1081,7 +1081,7 @@ static int scsi_probe_and_add_lun(struct scsi_target *starget,
goto out_free_result;
}

- res = scsi_add_lun(sdev, result, &bflags, shost->async_scan);
+ res = scsi_add_lun(sdev, result, &bflags);
if (res == SCSI_SCAN_LUN_PRESENT) {
if (bflags & BLIST_KEY) {
sdev->lockable = 0;
@@ -1661,15 +1661,6 @@ int scsi_scan_host_selected(struct Scsi_Host *shost, unsigned int channel,

return 0;
}

-static void scsi_sysfs_add_devices(struct Scsi_Host *shost)
-{
- struct scsi_device *sdev;
- shost_for_each_device(sdev, shost) {
- if (scsi_sysfs_add_sdev(sdev) != 0)
- scsi_destroy_sdev(sdev);
- }
-}
-
/**
* scsi_prep_async_scan - prepare for an async scan
* @shost: the host which will be scanned

@@ -1741,8 +1732,6 @@ static void scsi_finish_async_scan(struct async_scan_data *data)

wait_for_completion(&data->prev_finished);

- scsi_sysfs_add_devices(shost);
-
spin_lock(&async_scan_lock);
shost->async_scan = 0;
list_del(&data->list);

James Bottomley

unread,

Apr 23, 2007, 2:27:19 PM4/23/07

to Josef Bacik

On Mon, 2007-04-23 at 14:13 -0400, Josef Bacik wrote:
> Ok I have a new patch that I've built and tested on both my UP and SMP machine
> and it appears to work fine. I took the async check out of scsi_add_lun, I
> don't really see the point in waiting to do the sysfs registration stuff (if
> theres a reason I haven't been able to find it in the original submission of
> this functionality). Please let me know if this is incorrect. Thank you,

Yes, it's incorrect ... if you do this, the devices will come up in a
random order for multiple SCSI cards. One of the original design goals
was not to require udev, so the final ordering should be the same as for
the sync case.

I think the root cause of the problem is somewhere in the fc transport
rport addition code.

James

James Smart

unread,

May 3, 2007, 4:03:56 PM5/3/07

to James Bottomley

I doubt it's in the fc transport - it's doing what it always did, which has
nothing to do with coherency of the sdev's.

We're seeing like problems, and it looks like it's related to the scan_mutex
being held when some of the entry points are being called via the recent
async scan code (which also still has a bunch of issues around rmmod).
We should be sending some patches shortly.

-- james s

James Bottomley wrote:
> On Mon, 2007-04-23 at 14:13 -0400, Josef Bacik wrote:
>> Ok I have a new patch that I've built and tested on both my UP and SMP machine
>> and it appears to work fine. I took the async check out of scsi_add_lun, I
>> don't really see the point in waiting to do the sysfs registration stuff (if
>> theres a reason I haven't been able to find it in the original submission of
>> this functionality). Please let me know if this is incorrect. Thank you,
>
> Yes, it's incorrect ... if you do this, the devices will come up in a
> random order for multiple SCSI cards. One of the original design goals
> was not to require udev, so the final ordering should be the same as for
> the sync case.
>
> I think the root cause of the problem is somewhere in the fc transport
> rport addition code.
>
> James
>
>
> -

> To unsubscribe from this list: send the line "unsubscribe linux-scsi" in

Jurij Smakov

unread,

Aug 11, 2007, 11:09:40 AM8/11/07

to James Smart, James Bottomley, Josef Bacik, Andrew Morton, linux...@vger.kernel.org, linux-...@vger.kernel.org, Matthew Wilcox

[Please keep me on CC, as I'm not on LKML.]

On Thu, May 03, 2007 at 04:00:57PM -0400, James Smart wrote:
> I doubt it's in the fc transport - it's doing what it always did, which has
> nothing to do with coherency of the sdev's.
>
> We're seeing like problems, and it looks like it's related to the
> scan_mutex
> being held when some of the entry points are being called via the recent
> async scan code (which also still has a bunch of issues around rmmod).
> We should be sending some patches shortly.

Hi James,

I've recently got a Sun Blade 1000 box with a QLA2200 controller, and
I'm bumping into exact same problem with 2.6.22:

scsi 0:0:0:0: Attached scsi generic sg1 type -1
scsi 0:0:0:0: Direct-Access HITACHI DKR1C-J072FC D7V5 PQ: 0
ANSI: 3
kobject_add failed for 0:0:0:0 with -EEXIST, don't try to register

things with the same name in the same directory.

Call Trace:
[000000001000ac78] scsi_sysfs_add_sdev+0x2c/0x228 [scsi_mod]
[0000000010008a68] scsi_probe_and_add_lun+0x97c/0xab8 [scsi_mod]
[00000000100090d8] __scsi_scan_target+0x90/0x660 [scsi_mod]
[0000000010009ce8] scsi_scan_target+0x94/0xa4 [scsi_mod]
[00000000100668bc] fc_scsi_scan_rport+0x68/0x8c [scsi_transport_fc]
[000000000046de88] run_workqueue+0xac/0x138
[000000000046e414] worker_thread+0xc4/0xd4
[0000000000471f24] kthread+0x4c/0x78
[00000000004277f8] kernel_thread+0x38/0x48
[0000000000471d84] kthreadd+0xbc/0x160
error 1

After that the device fails to initialize. On rare occasions the
error does not trigger, and then machine boots fine. The complete boot
log can be found at http://www.wooyd.org/misc/dmesg-blade1000-2.6.22.log

I'm willing to test any patches you might have, as well as provide any
additional debugging information.

Best regards,
--
Jurij Smakov ju...@wooyd.org
Key: http://www.wooyd.org/pgpkey/ KeyID: C99E03CC

Matthew Wilcox

unread,

Aug 12, 2007, 8:27:43 PM8/12/07

to Jurij Smakov, James Smart, James Bottomley, Josef Bacik, Andrew Morton, linux...@vger.kernel.org, linux-...@vger.kernel.org

On Sat, Aug 11, 2007 at 04:04:54PM +0100, Jurij Smakov wrote:
> [Please keep me on CC, as I'm not on LKML.]

> I've recently got a Sun Blade 1000 box with a QLA2200 controller, and
> I'm bumping into exact same problem with 2.6.22:

Please try
http://marc.info/?l=linux-scsi&m=118289275414202
which fixes a number of problems with the async scanning code.

--
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."