Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

[PATCH] usb: hub: Fix error loop seen after hub communication errors

656 views
Skip to first unread message

Guenter Roeck

unread,
Mar 8, 2017, 6:20:05 PM3/8/17
to
While stress testing a usb controller using a bind/unbind looop, the
following error loop was observed.

usb 7-1.2: new low-speed USB device number 3 using xhci-hcd
usb 7-1.2: hub failed to enable device, error -108
usb 7-1-port2: cannot disable (err = -22)
usb 7-1-port2: couldn't allocate usb_device
usb 7-1-port2: cannot disable (err = -22)
hub 7-1:1.0: hub_ext_port_status failed (err = -22)
hub 7-1:1.0: hub_ext_port_status failed (err = -22)
hub 7-1:1.0: activate --> -22
hub 7-1:1.0: hub_ext_port_status failed (err = -22)
hub 7-1:1.0: hub_ext_port_status failed (err = -22)
hub 7-1:1.0: activate --> -22
hub 7-1:1.0: hub_ext_port_status failed (err = -22)
hub 7-1:1.0: hub_ext_port_status failed (err = -22)
hub 7-1:1.0: activate --> -22
hub 7-1:1.0: hub_ext_port_status failed (err = -22)
hub 7-1:1.0: hub_ext_port_status failed (err = -22)
hub 7-1:1.0: activate --> -22
hub 7-1:1.0: hub_ext_port_status failed (err = -22)
hub 7-1:1.0: hub_ext_port_status failed (err = -22)
hub 7-1:1.0: activate --> -22
hub 7-1:1.0: hub_ext_port_status failed (err = -22)
hub 7-1:1.0: hub_ext_port_status failed (err = -22)
hub 7-1:1.0: activate --> -22
hub 7-1:1.0: hub_ext_port_status failed (err = -22)
hub 7-1:1.0: hub_ext_port_status failed (err = -22)
hub 7-1:1.0: activate --> -22
hub 7-1:1.0: hub_ext_port_status failed (err = -22)
hub 7-1:1.0: hub_ext_port_status failed (err = -22)
hub 7-1:1.0: activate --> -22
hub 7-1:1.0: hub_ext_port_status failed (err = -22)
hub 7-1:1.0: hub_ext_port_status failed (err = -22)
** 57 printk messages dropped ** hub 7-1:1.0: activate --> -22
** 82 printk messages dropped ** hub 7-1:1.0: hub_ext_port_status failed (err = -22)

This continues forever. After adding tracebacks into the code,
the call sequence leading to this is found to be as follows.

[<ffffffc0007fc8e0>] hub_activate+0x368/0x7b8
[<ffffffc0007fceb4>] hub_resume+0x2c/0x3c
[<ffffffc00080b3b8>] usb_resume_interface.isra.6+0x128/0x158
[<ffffffc00080b5d0>] usb_suspend_both+0x1e8/0x288
[<ffffffc00080c9c4>] usb_runtime_suspend+0x3c/0x98
[<ffffffc0007820a0>] __rpm_callback+0x48/0x7c
[<ffffffc00078217c>] rpm_callback+0xa8/0xd4
[<ffffffc000786234>] rpm_suspend+0x84/0x758
[<ffffffc000786ca4>] rpm_idle+0x2c8/0x498
[<ffffffc000786ed4>] __pm_runtime_idle+0x60/0xac
[<ffffffc00080eba8>] usb_autopm_put_interface+0x6c/0x7c
[<ffffffc000803798>] hub_event+0x10ac/0x12ac
[<ffffffc000249bb8>] process_one_work+0x390/0x6b8
[<ffffffc00024abcc>] worker_thread+0x480/0x610
[<ffffffc000251a80>] kthread+0x164/0x178
[<ffffffc0002045d0>] ret_from_fork+0x10/0x40

kick_hub_wq() is called from hub_activate() even after failures to
communicate with the hub. This results in an endless sequence of
hub event -> hub activate -> wq trigger -> hub event -> ...
To break the loop, only trigger the hub event queue if communication
with the hub is successful.

Signed-off-by: Guenter Roeck <li...@roeck-us.net>
---
drivers/usb/core/hub.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/usb/core/hub.c b/drivers/usb/core/hub.c
index 5286bf67869a..f22ab428b310 100644
--- a/drivers/usb/core/hub.c
+++ b/drivers/usb/core/hub.c
@@ -1197,7 +1197,8 @@ static void hub_activate(struct usb_hub *hub, enum hub_activation_type type)
&hub->leds, LED_CYCLE_PERIOD);

/* Scan all ports that need attention */
- kick_hub_wq(hub);
+ if (!status)
+ kick_hub_wq(hub);

if (type == HUB_INIT2 || type == HUB_INIT3) {
/* Allow autosuspend if it was suppressed */
--
2.7.4

Alan Stern

unread,
Mar 10, 2017, 4:20:05 PM3/10/17
to
On Wed, 8 Mar 2017, Guenter Roeck wrote:

> While stress testing a usb controller using a bind/unbind looop, the
> following error loop was observed.
>
> usb 7-1.2: new low-speed USB device number 3 using xhci-hcd
> usb 7-1.2: hub failed to enable device, error -108
> usb 7-1-port2: cannot disable (err = -22)
> usb 7-1-port2: couldn't allocate usb_device
> usb 7-1-port2: cannot disable (err = -22)
> hub 7-1:1.0: hub_ext_port_status failed (err = -22)
> hub 7-1:1.0: hub_ext_port_status failed (err = -22)
> hub 7-1:1.0: activate --> -22
> hub 7-1:1.0: hub_ext_port_status failed (err = -22)
> hub 7-1:1.0: hub_ext_port_status failed (err = -22)
> hub 7-1:1.0: activate --> -22
...
I'm not at all sure this is the best solution. status == 0 means that
an URB was successfully _submitted_; it does not mean that future
communication with the hub will be successful.

The problem arises here because the system is unable to runtime-suspend
the hub (you didn't say why, but probably because of the same
communication problem). When a suspend fails, drivers are informed
that the device is back at full power, and of course the hub driver
tries to determine the hub's current state at that time. When this
fails, there's no reason to keep the hub at full power any longer, so
the system tries to do another runtime suspend. This suspend fails
just like the first one, and there's your loop.

This same kind of thing could happen with any driver. In your case,
the problem could be solved by making sure that when the hub fails to
suspend, it is still at least minimally functional. We don't do this
now; when usb_port_suspend fails there is no checking to see whether
the device is still working.

(It would be interesting to know exactly at which point
usb_port_suspend fails during your test. And why didn't your system
mark the hub as disconnected when the host controller was unbound?)

An even worse problem is what to do when the device is working okay but
runtime suspend still fails (for example, if remote wakeup cannot be
enabled). How do we prevent the system from going into a similar loop
then?

Alan Stern

Guenter Roeck

unread,
Mar 10, 2017, 6:40:04 PM3/10/17
to
Hi Alan,
Yes, exactly.

> that the device is back at full power, and of course the hub driver
> tries to determine the hub's current state at that time. When this
> fails, there's no reason to keep the hub at full power any longer, so
> the system tries to do another runtime suspend. This suspend fails
> just like the first one, and there's your loop.
>

If we don't cut the loop in hub_activate(), where else would be
a good place to cut it ?

> This same kind of thing could happen with any driver. In your case,
> the problem could be solved by making sure that when the hub fails to
> suspend, it is still at least minimally functional. We don't do this
> now; when usb_port_suspend fails there is no checking to see whether
> the device is still working.
>
> (It would be interesting to know exactly at which point
> usb_port_suspend fails during your test. And why didn't your system
> mark the hub as disconnected when the host controller was unbound?)
>
It does, but that happens a bit later. If I disable console logging
(for example by using dev_err_ratelimited() instead of dev_err() for the
observed errors), the hub is marked as disabled after a few hundred
or a few thousand loops. Only that code never gets to run runs if console
logging is enabled, and to restrict logging didn't seem to be a good
solution for the problem.

The test script I am running is:

while true
do
for i in /sys/bus/platform/drivers/rockchip-dwc3/usb*; do
basename $i > $(dirname $i)/unbind
basename $i > $(dirname $i)/bind
done
sleep $(awk "BEGIN {printf \"%.2f\", $((RANDOM % 150)) / 100 + 0.5}")
done

where rockchip-dwc3 is a slightly more complex version of dwc3-of-simple.c,
with pretty much the same functionality, only that it ties into extcon
to be able to handle cable states. The hub disconnect is a result of a call
to of_platform_depopulate(), similar to the same call in dwc3-of-simple.c.

The sleep in the above script is essential; it results in all kinds of
race conditions. The one I am trying to fix here is just one of them. The
system under test has two Type-C ports; one is connected to an Ethernet
interface and the other to an Apple Type-C adapter (it probably doesn't
matter much what is connected, as long as there is a hub).

> An even worse problem is what to do when the device is working okay but
> runtime suspend still fails (for example, if remote wakeup cannot be
> enabled). How do we prevent the system from going into a similar loop
> then?
>

Good question. I am open to suggestions.

Thanks,
Guenter

Guenter Roeck

unread,
Mar 15, 2017, 2:30:06 PM3/15/17
to
** 57 printk messages dropped ** hub 7-1:1.0: activate --> -22
** 82 printk messages dropped ** hub 7-1:1.0: hub_ext_port_status failed (err = -22)

This continues forever. After adding tracebacks into the code,
the call sequence leading to this is found to be as follows.

[<ffffffc0007fc8e0>] hub_activate+0x368/0x7b8
[<ffffffc0007fceb4>] hub_resume+0x2c/0x3c
[<ffffffc00080b3b8>] usb_resume_interface.isra.6+0x128/0x158
[<ffffffc00080b5d0>] usb_suspend_both+0x1e8/0x288
[<ffffffc00080c9c4>] usb_runtime_suspend+0x3c/0x98
[<ffffffc0007820a0>] __rpm_callback+0x48/0x7c
[<ffffffc00078217c>] rpm_callback+0xa8/0xd4
[<ffffffc000786234>] rpm_suspend+0x84/0x758
[<ffffffc000786ca4>] rpm_idle+0x2c8/0x498
[<ffffffc000786ed4>] __pm_runtime_idle+0x60/0xac
[<ffffffc00080eba8>] usb_autopm_put_interface+0x6c/0x7c
[<ffffffc000803798>] hub_event+0x10ac/0x12ac
[<ffffffc000249bb8>] process_one_work+0x390/0x6b8
[<ffffffc00024abcc>] worker_thread+0x480/0x610
[<ffffffc000251a80>] kthread+0x164/0x178
[<ffffffc0002045d0>] ret_from_fork+0x10/0x40

kick_hub_wq() is called from hub_activate() even after failures to
communicate with the hub. This results in an endless sequence of
hub event -> hub activate -> wq trigger -> hub event -> ...

Provide two solutions for the problem.

- Only trigger the hub event queue if communication with the hub
is successful.
- After a suspend failure, only resume already suspended interfaces
if the communication with the device is still possible.

Each of the changes fixes the observed problem.

Signed-off-by: Guenter Roeck <li...@roeck-us.net>
---
v2: Instead of not triggering the hub wq after an error to submit an urb,
implement a more complex error detection and handling. Do it in two
places. Marked as RFC to determine if one (or both) of those solutions
are viable.

drivers/usb/core/driver.c | 14 ++++++++++++++
drivers/usb/core/hub.c | 15 ++++++++++++++-
2 files changed, 28 insertions(+), 1 deletion(-)

diff --git a/drivers/usb/core/driver.c b/drivers/usb/core/driver.c
index cdee5130638b..40a0ab395d82 100644
--- a/drivers/usb/core/driver.c
+++ b/drivers/usb/core/driver.c
@@ -1331,6 +1331,20 @@ static int usb_suspend_both(struct usb_device *udev, pm_message_t msg)
*/
if (udev->parent && !PMSG_IS_AUTO(msg))
status = 0;
+
+ /*
+ * If the device is inaccessible, don't try to resume
+ * suspended interfaces and just return the error.
+ */
+ if (status) {
+ int err;
+ u16 devstat;
+
+ err = usb_get_status(udev, USB_RECIP_DEVICE, 0,
+ &devstat);
+ if (err)
+ goto done;
+ }
}

/* If the suspend failed, resume interfaces that did get suspended */
diff --git a/drivers/usb/core/hub.c b/drivers/usb/core/hub.c
index 5a420657f9f7..8a74f483cfef 100644
--- a/drivers/usb/core/hub.c
+++ b/drivers/usb/core/hub.c
@@ -1066,6 +1066,19 @@ static void hub_activate(struct usb_hub *hub, enum hub_activation_type type)

portstatus = portchange = 0;
status = hub_port_status(hub, port1, &portstatus, &portchange);
+ if (status) {
+ u16 hubstatus;
+
+ ret = usb_get_status(hdev, USB_RECIP_DEVICE, 0,
+ &hubstatus);
+ if (ret < 0) {
+ dev_err(&hdev->dev,
+ "Failed to read hub status (%d)\n",
+ ret);
+ goto abort;
+ }
+ }
+
if (udev || (portstatus & USB_PORT_STAT_CONNECTION))
dev_dbg(&port_dev->dev, "status %04x change %04x\n",
portstatus, portchange);
@@ -1198,7 +1211,7 @@ static void hub_activate(struct usb_hub *hub, enum hub_activation_type type)

/* Scan all ports that need attention */
kick_hub_wq(hub);
-
+abort:
if (type == HUB_INIT2 || type == HUB_INIT3) {
/* Allow autosuspend if it was suppressed */
disconnected:
--
2.7.4

Alan Stern

unread,
Mar 15, 2017, 4:00:07 PM3/15/17
to
This seems like a good approach, and adding both checks should provide
a valuable extra layer of security.

> drivers/usb/core/driver.c | 14 ++++++++++++++
> drivers/usb/core/hub.c | 15 ++++++++++++++-
> 2 files changed, 28 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/usb/core/driver.c b/drivers/usb/core/driver.c
> index cdee5130638b..40a0ab395d82 100644
> --- a/drivers/usb/core/driver.c
> +++ b/drivers/usb/core/driver.c
> @@ -1331,6 +1331,20 @@ static int usb_suspend_both(struct usb_device *udev, pm_message_t msg)
> */
> if (udev->parent && !PMSG_IS_AUTO(msg))
> status = 0;
> +
> + /*
> + * If the device is inaccessible, don't try to resume
> + * suspended interfaces and just return the error.
> + */
> + if (status) {
> + int err;
> + u16 devstat;
> +
> + err = usb_get_status(udev, USB_RECIP_DEVICE, 0,
> + &devstat);
> + if (err)

Putting a log message here would be a good idea.

For example, suppose a device spontaneously stops working. The driver,
not receiving any data from the device, might decide to suspend it.
The only clue we would have that something has gone wrong would be
right here.

> + goto done;
> + }
> }
>
> /* If the suspend failed, resume interfaces that did get suspended */
> diff --git a/drivers/usb/core/hub.c b/drivers/usb/core/hub.c
> index 5a420657f9f7..8a74f483cfef 100644
> --- a/drivers/usb/core/hub.c
> +++ b/drivers/usb/core/hub.c
> @@ -1066,6 +1066,19 @@ static void hub_activate(struct usb_hub *hub, enum hub_activation_type type)
>
> portstatus = portchange = 0;
> status = hub_port_status(hub, port1, &portstatus, &portchange);
> + if (status) {
> + u16 hubstatus;
> +
> + ret = usb_get_status(hdev, USB_RECIP_DEVICE, 0,
> + &hubstatus);

In this case, since this code is in the middle of a loop that reads all
the port statuses, I wouldn't bother trying to read the hub status.
Just log the error message and abort.

> + if (ret < 0) {
> + dev_err(&hdev->dev,
> + "Failed to read hub status (%d)\n",
> + ret);
> + goto abort;
> + }
> + }
> +
> if (udev || (portstatus & USB_PORT_STAT_CONNECTION))
> dev_dbg(&port_dev->dev, "status %04x change %04x\n",
> portstatus, portchange);
> @@ -1198,7 +1211,7 @@ static void hub_activate(struct usb_hub *hub, enum hub_activation_type type)
>
> /* Scan all ports that need attention */
> kick_hub_wq(hub);
> -
> +abort:
> if (type == HUB_INIT2 || type == HUB_INIT3) {
> /* Allow autosuspend if it was suppressed */
> disconnected:

On the whole, this looks very good.

Alan Stern

Guenter Roeck

unread,
Mar 16, 2017, 3:30:06 PM3/16/17
to
was successful.
- After a suspend failure, only resume already suspended interfaces
if the communication with the device is still possible.

Each of the changes fixes the observed problem. Use both to improve
robustness.

Signed-off-by: Guenter Roeck <li...@roeck-us.net>
---
v3: In hub.c, abort immediately if hub_port_status() returns an error.
Since hub_port_status() already logs the error, don't log it again.
In device,c, log the error return value from usb_suspend_device()
if usb_get_status() failed as well.
Don't check if the hub is still accessible if the error returned
from hub_port_status() is -EBUSY.
v2: Instead of not triggering the hub wq after an error to submit an urb,
implement a more complex error detection and handling. Do it in two
places. Marked as RFC to determine if one (or both) of those solutions
are viable.

drivers/usb/core/driver.c | 18 ++++++++++++++++++
drivers/usb/core/hub.c | 5 ++++-
2 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/drivers/usb/core/driver.c b/drivers/usb/core/driver.c
index cdee5130638b..7ebdf2a4e8fe 100644
--- a/drivers/usb/core/driver.c
+++ b/drivers/usb/core/driver.c
@@ -1331,6 +1331,24 @@ static int usb_suspend_both(struct usb_device *udev, pm_message_t msg)
*/
if (udev->parent && !PMSG_IS_AUTO(msg))
status = 0;
+
+ /*
+ * If the device is inaccessible, don't try to resume
+ * suspended interfaces and just return the error.
+ */
+ if (status && status != -EBUSY) {
+ int err;
+ u16 devstat;
+
+ err = usb_get_status(udev, USB_RECIP_DEVICE, 0,
+ &devstat);
+ if (err) {
+ dev_err(&udev->dev,
+ "Failed to suspend device, error %d\n",
+ status);
+ goto done;
+ }
+ }
}

/* If the suspend failed, resume interfaces that did get suspended */
diff --git a/drivers/usb/core/hub.c b/drivers/usb/core/hub.c
index 5a420657f9f7..5ab7cd09e1a4 100644
--- a/drivers/usb/core/hub.c
+++ b/drivers/usb/core/hub.c
@@ -1066,6 +1066,9 @@ static void hub_activate(struct usb_hub *hub, enum hub_activation_type type)

portstatus = portchange = 0;
status = hub_port_status(hub, port1, &portstatus, &portchange);
+ if (status)
+ goto abort;
+
if (udev || (portstatus & USB_PORT_STAT_CONNECTION))
dev_dbg(&port_dev->dev, "status %04x change %04x\n",
portstatus, portchange);
@@ -1198,7 +1201,7 @@ static void hub_activate(struct usb_hub *hub, enum hub_activation_type type)

/* Scan all ports that need attention */
kick_hub_wq(hub);
-
+abort:
if (type == HUB_INIT2 || type == HUB_INIT3) {
/* Allow autosuspend if it was suppressed */
disconnected:
--
2.7.4

Alan Stern

unread,
Mar 17, 2017, 12:40:07 PM3/17/17
to
Acked-by: Alan Stern <st...@rowland.harvard.edu>

Guenter Roeck

unread,
Mar 20, 2017, 2:20:04 PM3/20/17
to
is successful.
- After a suspend failure, only resume already suspended interfaces
if the communication with the device is still possible.

Each of the changes fixes the observed problem. Use both to improve
robustness.

Acked-by: Alan Stern <st...@rowland.harvard.edu>
Signed-off-by: Guenter Roeck <li...@roeck-us.net>
---
v4: Other code uses a space before labels in hub_activate(). Do the same
for consistency.
v3: In hub.c, abort immediately if hub_port_status() returns an error.
Since hub_port_status() already logs the error, don't log it again.
In device,c, log the error return value from usb_suspend_device()
if usb_get_status() failed as well.
Don't check if the hub is still accessible if the error returned
from hub_port_status() is -EBUSY.
v2: Instead of not triggering the hub wq after an error to submit an urb,
implement a more complex error detection and handling. Do it in two
places. Marked as RFC to determine if one (or both) of those solutions
are viable.

drivers/usb/core/driver.c | 18 ++++++++++++++++++
drivers/usb/core/hub.c | 5 ++++-
2 files changed, 22 insertions(+), 1 deletion(-)

index 5286bf67869a..2e047f982af3 100644
--- a/drivers/usb/core/hub.c
+++ b/drivers/usb/core/hub.c
@@ -1066,6 +1066,9 @@ static void hub_activate(struct usb_hub *hub, enum hub_activation_type type)

portstatus = portchange = 0;
status = hub_port_status(hub, port1, &portstatus, &portchange);
+ if (status)
+ goto abort;
+
if (udev || (portstatus & USB_PORT_STAT_CONNECTION))
dev_dbg(&port_dev->dev, "status %04x change %04x\n",
portstatus, portchange);
@@ -1198,7 +1201,7 @@ static void hub_activate(struct usb_hub *hub, enum hub_activation_type type)

/* Scan all ports that need attention */
kick_hub_wq(hub);
-
+ abort:
if (type == HUB_INIT2 || type == HUB_INIT3) {
/* Allow autosuspend if it was suppressed */
disconnected:
--
2.7.4

Doug Anderson

unread,
Mar 20, 2017, 2:20:05 PM3/20/17
to
Hi,

On Thu, Mar 16, 2017 at 12:24 PM, Guenter Roeck <li...@roeck-us.net> wrote:
> @@ -1198,7 +1201,7 @@ static void hub_activate(struct usb_hub *hub, enum hub_activation_type type)
>
> /* Scan all ports that need attention */
> kick_hub_wq(hub);
> -
> +abort:

One tiny nit that could be done when applying this patch is to add a
space before "abort". Other goto labels in this function are preceded
by a space and it's sane to try to match the existing coding
convention in the function rather than trying to mix and match.

Other than that this patch seems sane to me, but I am by no means an
expert on this code. ;)


-Doug
0 new messages