[Bug 212841] getting panic during mps reinitialization.

bugzilla...@freebsd.org

Sep 20, 2016, 4:39:46 AM

to freebs...@freebsd.org

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=212841

Mark Linimon <lin...@FreeBSD.org> changed:

What    |Removed                  |Added
----------------------------------------------------------------------------
CC      |freebs...@FreeBSD.org    |freebs...@FreeBSD.org
Keywords|                         |patch

bugzilla...@freebsd.org

Sep 20, 2016, 10:08:32 AM

to freebs...@freebsd.org

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=212841

Sean Bruno <sbr...@FreeBSD.org> changed:

What    |Removed                  |Added
----------------------------------------------------------------------------
CC      |                         |sbr...@FreeBSD.org

--- Comment #3 from Sean Bruno <sbr...@FreeBSD.org> ---
I'm not a big fan of using msleep() to work around hardware, but can you try
and put an msleep in your loop so that it pauses a bit before banging again?
Thank you for your diagnosis and patch!

bugzilla...@freebsd.org

Sep 20, 2016, 12:10:00 PM

to freebs...@freebsd.org

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=212841

Ngie Cooper <ng...@FreeBSD.org> changed:

What    |Removed                  |Added
----------------------------------------------------------------------------
CC      |                         |ng...@FreeBSD.org

--- Comment #4 from Ngie Cooper <ng...@FreeBSD.org> ---
Are you running the latest firmware? My group was running into similar gnarly
problems on ^/stable/11 with the old firmware and a new driver, but after we
upgraded the firmware, everything was groovy.

bugzilla...@freebsd.org

Sep 26, 2016, 4:53:58 AM

to freebs...@freebsd.org

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=212841

--- Comment #5 from prateek sethi <prateek...@gmail.com> ---
Created attachment 175173
--> https://bugs.freebsd.org/bugzilla/attachment.cgi?id=175173&action=edit
Adding some delay in the previous fix.

Added a delay of 1 millisecond before each retry when mps_request_sync fails.

I tested the delay by forcing mps_request_sync to return failure so that the
code goes into the retry path.
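
For readers following the thread, the retry-with-delay idea described in this comment can be sketched in user-space C. This is not the attached patch. The names `fake_ioc`, `fake_delay`, `fake_request_sync`, and `request_with_retry` are hypothetical, the retry budget of 5 is an assumption, and `fake_delay` only records the requested wait where a driver would presumably call the kernel's `DELAY()`.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_RETRIES     5       /* assumed retry budget, not taken from the patch */
#define RETRY_DELAY_US  1000    /* 1 millisecond between attempts */

/* Hypothetical model of a controller that fails a set number of requests. */
struct fake_ioc {
    int  failures_left;     /* requests that will still fail */
    long delayed_us;        /* total delay requested so far */
};

/* Stand-in for the kernel's DELAY(usec); here it only records the wait. */
static void fake_delay(struct fake_ioc *ioc, long usec)
{
    ioc->delayed_us += usec;
}

/* Stand-in for mps_request_sync(): returns 0 on success, nonzero on failure. */
static int fake_request_sync(struct fake_ioc *ioc)
{
    if (ioc->failures_left > 0) {
        ioc->failures_left--;
        return 1;
    }
    return 0;
}

/* Try the request up to MAX_RETRIES times, pausing before each retry. */
static int request_with_retry(struct fake_ioc *ioc, int *attempts)
{
    int error = 1;

    assert(ioc != NULL && attempts != NULL);
    *attempts = 0;
    while (*attempts < MAX_RETRIES) {
        (*attempts)++;
        error = fake_request_sync(ioc);
        if (error == 0)
            break;
        /* No point in waiting after the final attempt. */
        if (*attempts < MAX_RETRIES)
            fake_delay(ioc, RETRY_DELAY_US);
    }
    return error;
}
```

Setting `failures_left` to a nonzero value plays the same role as the test described above, where the return value was forced to failure to exercise the retry path.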

bugzilla...@freebsd.org

Sep 26, 2016, 5:02:30 AM

to freebs...@freebsd.org

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=212841

--- Comment #6 from prateek sethi <prateek...@gmail.com> ---
(In reply to Sean Bruno from comment #3)

Hi Sean,
I have put a DELAY of 1 millisecond in the loop. I am not sure about the
delay timing. Can you tell me whether 1 ms is fine?

(In reply to Ngie Cooper from comment #4)

Hi Ngie,
Yes, it can be an HBA firmware issue, but I think it is good to have a fix
for those situations as well.

bugzilla...@freebsd.org

Sep 26, 2016, 10:33:55 AM

to freebs...@freebsd.org

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=212841

--- Comment #7 from Sean Bruno <sbr...@FreeBSD.org> ---
(In reply to prateek sethi from comment #6)
Yeah, this looks like it will work. I don't have a problem with an error
handler here. I'll wait a day or two for "strong objections" if there are any.

bugzilla...@freebsd.org

Sep 26, 2016, 4:25:47 PM

to freebs...@freebsd.org

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=212841

Stephen McConnell <s...@freebsd.org> changed:

What    |Removed                  |Added
----------------------------------------------------------------------------
CC      |                         |s...@freebsd.org

--- Comment #8 from Stephen McConnell <s...@freebsd.org> ---
Hi Prateek, I'm not sure this is really the right thing to do to fix this
because it just seems that we might be covering up a larger problem where the
driver is doing something wrong. Can you attach the message file that shows all
of the output from the driver prior to the panic? I can take a look and see if
it gives me a clue as to what's going on. There are certain timing restrictions
that need to be followed when resetting the controller - maybe those
restrictions aren't being followed. Some cards are more sensitive to these
restrictions than others and, if I remember correctly, the 2308 is one of them.
In any case, I think we should dig a little deeper first before committing this
change.

bugzilla...@freebsd.org

Sep 26, 2016, 4:52:18 PM

to freebs...@freebsd.org

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=212841

--- Comment #9 from Stephen McConnell <s...@freebsd.org> ---
Also, it might help to have the debug_level set to 0x07 before your test. If
that's too much, and you can't reproduce, then you can try 0x05, and if that's
too much then the default of 0x04 will have to do. Looking at the code again,
the reset timing is probably OK, but I'd still like to see exactly where this
is failing. Thanks.

bugzilla...@freebsd.org

Sep 27, 2016, 5:20:05 AM

to freebs...@freebsd.org

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=212841

--- Comment #10 from prateek sethi <prateek...@gmail.com> ---
(In reply to Stephen McConnell from comment #9)

Hi Stephen,

I could not reproduce this issue now. The system panicked after the mps reinit
was triggered. The following are the only logs I could find (before and after
the crash).

Apr 27 13:34:51 Node1 kernel: mps0: Reinitializing controller
Apr 27 14:52:21 Node1 syslogd: kernel boot file is /boot/kernel/kernel
Apr 27 14:52:21 Node1 savecore: reboot after panic: mps_iocfacts_allocate failed to get IOC Facts with error 6

I got the message "mps0: Doorbell failed to activate" from the core analysis,
which should help to find the exact point of failure. I hope these will be
helpful for further debugging.

Would you mind shedding some light on the timing restrictions you mentioned
that need to be followed when resetting the controller?

bugzilla...@freebsd.org

Sep 27, 2016, 3:18:41 PM

to freebs...@freebsd.org

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=212841

--- Comment #11 from Stephen McConnell <s...@freebsd.org> ---
The reset timing in the driver looks fine to me. There is a requirement that
the host wait a certain amount of time when it first accesses the controller
during a reset, and then a certain time to wait on checking registers, etc.
But, it looks fine.

What doesn't make sense is that you're waiting some arbitrary amount of time
after the initial failure and then it works. The time that you're waiting comes
after the reset completes and after some calls to other functions. After
all of that, some access to the DOORBELL fails. Then, waiting 2 mSecs fixes it.
That's strange.

There are two ways that this will fail in Step 4 of mps_request_sync(). The
first is when reading the Interrupt Status REG. If this Register does not show
an interrupt within 5 seconds, it fails (that's a really long time). The second
is when reading the DOORBELL REG. If the DOORBELL_USED bit is not set, it
fails. I can't tell which one of these fails. But because it fails, your fix
just waits 2 mSecs and retries, and then it succeeds (at least within 10
mSecs, i.e. 5 retries).
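
As a rough illustration of the two Step 4 failure modes described above, the following user-space C sketch classifies which check would fail. It is not driver code: `fake_regs`, `step4_result`, and `check_step4` are hypothetical names, and the register state is modelled as a simple snapshot where the driver would read hardware registers. Only the 5 second interrupt timeout and the DOORBELL_USED condition come from the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define INTERRUPT_TIMEOUT_S 5   /* timeout on the Interrupt Status check */

/* Hypothetical outcome codes, for illustration only. */
enum step4_result {
    STEP4_OK = 0,
    STEP4_NO_INTERRUPT,     /* no interrupt seen within the timeout */
    STEP4_DOORBELL_UNUSED   /* DOORBELL_USED bit was not set */
};

/* Hypothetical snapshot of what the driver would observe. */
struct fake_regs {
    int  interrupt_after_s; /* seconds until the interrupt shows; < 0 means never */
    bool doorbell_used;     /* state of the DOORBELL_USED bit */
};

/* Apply the two checks in order and report which one fails, if any. */
static enum step4_result check_step4(const struct fake_regs *r)
{
    assert(r != NULL);

    /* First check: the interrupt must show up within the timeout. */
    if (r->interrupt_after_s < 0 || r->interrupt_after_s > INTERRUPT_TIMEOUT_S)
        return STEP4_NO_INTERRUPT;

    /* Second check: the doorbell must be marked as used. */
    if (!r->doorbell_used)
        return STEP4_DOORBELL_UNUSED;

    return STEP4_OK;
}
```

Reporting a distinct result for each check, as this sketch does, is one way the log could show which of the two conditions is the one that fails.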

What I'm wondering is, does it really matter that you have a delay between
mps_request_sync() calls? To me, it looks like something is messed up in FW and
just doing a retry fixes it.

Now, with all of that said, I'm not sure there really is a better fix, except
that the delay may not need to be there. Having the delay there would make
someone think that we're just not waiting long enough, which really is not the
case. It also looks a little scary: someone could think the driver timing
for this is very fragile, when it's really not.

Sean, let me know what you think about removing the delay. If you want the
delay, I would at least say to add a comment that explains the delay and retry,
since none of this is really supposed to happen and I think it's some FW or HW
workaround.