Hi Folks,
I am looking at a kernel panic due to a hung task and could use some help understanding whether this is a known issue. Kernel version is 4.14.63.
Here is a complete stack trace of the hung kworker task.
crash> bt 106700
PID: 106700 TASK: ffff885eb22ebe80 CPU: 8 COMMAND: "kworker/u32:0"
#0 [ffffc900550ebab8] __schedule at ffffffff815f0b78
#1 [ffffc900550ebb50] schedule at ffffffff815f1248
#2 [ffffc900550ebb58] schedule_timeout at ffffffff815f4fe6
#3 [ffffc900550ebbf8] wait_for_completion at ffffffff815f1cf0
#4 [ffffc900550ebc48] flush_workqueue at ffffffff8108ec66
#5 [ffffc900550ebce8] drain_workqueue at ffffffff8108ef84
#6 [ffffc900550ebd10] destroy_workqueue at ffffffff81091ce5
#7 [ffffc900550ebd30] scsi_host_dev_release at ffffffffa0095ced [scsi_mod]
#8 [ffffc900550ebd48] device_release at ffffffff81453c90
#9 [ffffc900550ebd68] kobject_put at ffffffff815d8130
#10 [ffffc900550ebd88] iscsi_session_release at ffffffffa0aebf88 [scsi_transport_iscsi]
#11 [ffffc900550ebda8] device_release at ffffffff81453c90
#12 [ffffc900550ebdc8] kobject_put at ffffffff815d8130
#13 [ffffc900550ebde8] device_release at ffffffff81453c90
#14 [ffffc900550ebe08] kobject_put at ffffffff815d8130
#15 [ffffc900550ebe28] scsi_remove_target at ffffffffa00a3e92 [scsi_mod]
#16 [ffffc900550ebe70] __iscsi_unbind_session at ffffffffa0aecd8d [scsi_transport_iscsi]
#17 [ffffc900550ebe98] process_one_work at ffffffff8108f62a
#18 [ffffc900550ebed8] worker_thread at ffffffff8108f84b
#19 [ffffc900550ebf10] kthread at ffffffff8109536a
#20 [ffffc900550ebf50] ret_from_fork at ffffffff816001ef
After poking around in the kdump, I've discovered that the worker thread that called __iscsi_unbind_session did so for a work item that came from the same workqueue that is being destroyed at the top of the stack. My understanding of workqueues is that this isn't allowed: destroy_workqueue drains the queue, which means waiting for the very work item that is calling it, so the task hangs.
Here we can see where the __iscsi_unbind_session work is queued to a SCSI work queue:
static int
iscsi_if_recv_msg(struct sk_buff *skb, struct nlmsghdr *nlh, uint32_t *group)
{
	.
	.
	.
	case ISCSI_UEVENT_UNBIND_SESSION:
		session = iscsi_session_lookup(ev->u.d_session.sid);
		if (session)
			scsi_queue_work(iscsi_session_to_shost(session),   <--- unbind work queued to scsi work queue
					&session->unbind_work);
		else
			err = -EINVAL;
		break;
Here we can see that this puts the work item onto Scsi_Host->work_q:
int scsi_queue_work(struct Scsi_Host *shost, struct work_struct *work)
{
	if (unlikely(!shost->work_q)) {
		shost_printk(KERN_ERR, shost,
			"ERROR: Scsi host '%s' attempted to queue scsi-work, "
			"when no workqueue created.\n", shost->hostt->name);
		dump_stack();
		return -EINVAL;
	}
	return queue_work(shost->work_q, work);   <--- Work item goes into Scsi_Host->work_q
}
Here we can see the scsi_host_dev_release routine destroying the Scsi_Host->work_q:
static void scsi_host_dev_release(struct device *dev)
{
	struct Scsi_Host *shost = dev_to_shost(dev);
	struct device *parent = dev->parent;

	scsi_proc_hostdir_rm(shost->hostt);

	/* Wait for functions invoked through call_rcu(&shost->rcu, ...) */
	rcu_barrier();

	if (shost->tmf_work_q)
		destroy_workqueue(shost->tmf_work_q);
	if (shost->ehandler)
		kthread_stop(shost->ehandler);
	if (shost->work_q)
		destroy_workqueue(shost->work_q);   <--- Destroying Scsi_Host->work_q
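
To make the suspected sequence concrete, here is a toy userspace model in C. It is not kernel code, and every toy_* name is hypothetical; it only assumes what the trace suggests: the unbind work runs on the host's own work queue, and whoever drops the last host reference runs the release (and therefore the queue teardown) in its own context.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a workqueue: only counts items currently executing. */
struct toy_wq {
	int running;
};

/* Toy stand-in for Scsi_Host: a refcount plus the work queue it owns. */
struct toy_host {
	int refcount;
	struct toy_wq work_q;
	bool released;    /* release ran to completion */
	bool deadlocked;  /* release would wait on its own caller forever */
};

/*
 * Drop one reference. caller_wq is the queue the caller is executing on,
 * or NULL when the caller is not a work item (e.g. plain process context).
 * The last put runs the release inline, which must drain host->work_q.
 */
static void toy_host_put(struct toy_host *host, const struct toy_wq *caller_wq)
{
	assert(host->refcount > 0);
	if (--host->refcount > 0)
		return; /* someone else still holds the host */

	/*
	 * Draining waits for every running item. If the caller is one of
	 * those items, the count can never reach zero: self-deadlock.
	 */
	if (caller_wq == &host->work_q && host->work_q.running > 0) {
		host->deadlocked = true;
		return;
	}
	host->released = true;
}

/* Model the unbind work item executing on the host's own work queue. */
static void toy_run_unbind_work(struct toy_host *host)
{
	host->work_q.running++;
	toy_host_put(host, &host->work_q); /* stands in for the kobject_put chain */
	if (!host->deadlocked)
		host->work_q.running--;
}
```

In this model, toy_run_unbind_work on a host with refcount 2 returns normally and a later toy_host_put(host, NULL) releases it cleanly, while the same call on a host with refcount 1 sets deadlocked. That matches the hypothesis that the hang only appears when the unbind work happens to hold the final reference.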
I did some searching and couldn't locate a similar stack trace. Does anyone know if this is a known issue?
If it is not a known issue, any ideas as to what would normally keep the Scsi_Host device from being removed inline in this call stack? This happened on two hosts within minutes of each other after starting to disconnect from two targets. I believe the unbind session was kicked off from an iscsiadm command to terminate the session, but other than that nothing out of the ordinary was going on.
Thanks in advance,
Adam