[PATCH 0/8] USB fixes: xHCI error handling

This series is the first of a few bundles of USB fixes we have been carrying downstream on the Asahi U-Boot branch for a few months.
Most importantly, this related set of patches makes xHCI error/stall recovery more robust (or work at all in some cases). There are also a couple patches fixing other xHCI bugs and adding better debug logs.
I believe this should fix this Fedora bug too:
https://bugzilla.redhat.com/show_bug.cgi?id=2244305
Signed-off-by: Hector Martin marcan@marcan.st --- Hector Martin (8): usb: xhci: Guard all calls to xhci_wait_for_event usb: xhci: Better error handling in abort_td() usb: xhci: Allow context state errors when halting an endpoint usb: xhci: Recover from halted non-control endpoints usb: xhci: Fail on attempt to queue TRBs to a halted endpoint usb: xhci: Do not panic on event timeouts usb: xhci: Fix DMA address calculation in queue_trb usb: xhci: Add more debugging
drivers/usb/host/xhci-ring.c | 100 +++++++++++++++++++++++++++++++++++-------- drivers/usb/host/xhci.c | 9 ++++ include/usb/xhci.h | 2 + 3 files changed, 92 insertions(+), 19 deletions(-) --- base-commit: fb428b61819444b9337075f49c72f326f5d12085 change-id: 20231027-usb-fixes-1-83bfc7013012
Best regards,

xhci_wait_for_event returns NULL on timeout, so the caller always has to check for that. This addresses the immediate explosions in this part of the code, but not the original cause.
Signed-off-by: Hector Martin marcan@marcan.st --- drivers/usb/host/xhci-ring.c | 15 ++++++++++++++- drivers/usb/host/xhci.c | 9 +++++++++ 2 files changed, 23 insertions(+), 1 deletion(-)
diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c index c8260cbdf94b..aaf128ff9317 100644 --- a/drivers/usb/host/xhci-ring.c +++ b/drivers/usb/host/xhci-ring.c @@ -511,7 +511,8 @@ static void reset_ep(struct usb_device *udev, int ep_index) printf("Resetting EP %d...\n", ep_index); xhci_queue_command(ctrl, 0, udev->slot_id, ep_index, TRB_RESET_EP); event = xhci_wait_for_event(ctrl, TRB_COMPLETION); - field = le32_to_cpu(event->trans_event.flags); + if (!event) + return; BUG_ON(TRB_TO_SLOT_ID(field) != udev->slot_id); xhci_acknowledge_event(ctrl);
@@ -519,6 +520,9 @@ static void reset_ep(struct usb_device *udev, int ep_index) (void *)((uintptr_t)ring->enqueue | ring->cycle_state)); xhci_queue_command(ctrl, addr, udev->slot_id, ep_index, TRB_SET_DEQ); event = xhci_wait_for_event(ctrl, TRB_COMPLETION); + if (!event) + return; + BUG_ON(TRB_TO_SLOT_ID(le32_to_cpu(event->event_cmd.flags)) != udev->slot_id || GET_COMP_CODE(le32_to_cpu( event->event_cmd.status)) != COMP_SUCCESS); @@ -544,6 +548,9 @@ static void abort_td(struct usb_device *udev, int ep_index) xhci_queue_command(ctrl, 0, udev->slot_id, ep_index, TRB_STOP_RING);
event = xhci_wait_for_event(ctrl, TRB_TRANSFER); + if (!event) + return; + field = le32_to_cpu(event->trans_event.flags); BUG_ON(TRB_TO_SLOT_ID(field) != udev->slot_id); BUG_ON(TRB_TO_EP_INDEX(field) != ep_index); @@ -552,6 +559,9 @@ static void abort_td(struct usb_device *udev, int ep_index) xhci_acknowledge_event(ctrl);
event = xhci_wait_for_event(ctrl, TRB_COMPLETION); + if (!event) + return; + BUG_ON(TRB_TO_SLOT_ID(le32_to_cpu(event->event_cmd.flags)) != udev->slot_id || GET_COMP_CODE(le32_to_cpu( event->event_cmd.status)) != COMP_SUCCESS); @@ -561,6 +571,9 @@ static void abort_td(struct usb_device *udev, int ep_index) (void *)((uintptr_t)ring->enqueue | ring->cycle_state)); xhci_queue_command(ctrl, addr, udev->slot_id, ep_index, TRB_SET_DEQ); event = xhci_wait_for_event(ctrl, TRB_COMPLETION); + if (!event) + return; + BUG_ON(TRB_TO_SLOT_ID(le32_to_cpu(event->event_cmd.flags)) != udev->slot_id || GET_COMP_CODE(le32_to_cpu( event->event_cmd.status)) != COMP_SUCCESS); diff --git a/drivers/usb/host/xhci.c b/drivers/usb/host/xhci.c index 5cacf0769ec7..d13cbff9b372 100644 --- a/drivers/usb/host/xhci.c +++ b/drivers/usb/host/xhci.c @@ -451,6 +451,9 @@ static int xhci_configure_endpoints(struct usb_device *udev, bool ctx_change) xhci_queue_command(ctrl, in_ctx->dma, udev->slot_id, 0, ctx_change ? TRB_EVAL_CONTEXT : TRB_CONFIG_EP); event = xhci_wait_for_event(ctrl, TRB_COMPLETION); + if (!event) + return -ETIMEDOUT; + BUG_ON(TRB_TO_SLOT_ID(le32_to_cpu(event->event_cmd.flags)) != udev->slot_id);
@@ -647,6 +650,9 @@ static int xhci_address_device(struct usb_device *udev, int root_portnr) xhci_queue_command(ctrl, virt_dev->in_ctx->dma, slot_id, 0, TRB_ADDR_DEV); event = xhci_wait_for_event(ctrl, TRB_COMPLETION); + if (!event) + return -ETIMEDOUT; + BUG_ON(TRB_TO_SLOT_ID(le32_to_cpu(event->event_cmd.flags)) != slot_id);
switch (GET_COMP_CODE(le32_to_cpu(event->event_cmd.status))) { @@ -722,6 +728,9 @@ static int _xhci_alloc_device(struct usb_device *udev)
xhci_queue_command(ctrl, 0, 0, 0, TRB_ENABLE_SLOT); event = xhci_wait_for_event(ctrl, TRB_COMPLETION); + if (!event) + return -ETIMEDOUT; + BUG_ON(GET_COMP_CODE(le32_to_cpu(event->event_cmd.status)) != COMP_SUCCESS);

Gah, I should've paid more attention to that rebase. Here's a dumb fixup for this patch. I'll squash it into a v2 if needed alongside any other changes, or if it looks good feel free to apply/squash it in directly.
--- drivers/usb/host/xhci-ring.c | 1 + 1 file changed, 1 insertion(+)
diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c index e2bd2e999a8e..7f2507be0cf0 100644 --- a/drivers/usb/host/xhci-ring.c +++ b/drivers/usb/host/xhci-ring.c @@ -532,6 +532,7 @@ static void reset_ep(struct usb_device *udev, int ep_index) event = xhci_wait_for_event(ctrl, TRB_COMPLETION); if (!event) return; + field = le32_to_cpu(event->trans_event.flags); BUG_ON(TRB_TO_SLOT_ID(field) != udev->slot_id); xhci_acknowledge_event(ctrl);

On 10/27/23 01:26, Hector Martin wrote:
Gah, I should've paid more attention to that rebase. Here's a dumb fixup for this patch. I'll squash it into a v2 if needed alongside any other changes, or if it looks good feel free to apply/squash it in directly.
drivers/usb/host/xhci-ring.c | 1 + 1 file changed, 1 insertion(+)
diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c index e2bd2e999a8e..7f2507be0cf0 100644 --- a/drivers/usb/host/xhci-ring.c +++ b/drivers/usb/host/xhci-ring.c @@ -532,6 +532,7 @@ static void reset_ep(struct usb_device *udev, int ep_index) event = xhci_wait_for_event(ctrl, TRB_COMPLETION); if (!event) return;
- field = le32_to_cpu(event->trans_event.flags); BUG_ON(TRB_TO_SLOT_ID(field) != udev->slot_id); xhci_acknowledge_event(ctrl);
Please squash, and add
Reviewed-by: Marek Vasut marex@denx.de
Also, +CC Minda,
there has been a similar fix to this one but with much more information about the failure, see:
[PATCH v1] usb: xhci: Check return value of wait for TRB_TRANSFER event
I think it would be useful to somehow include that information, so it wouldn't be lost.

On 27/10/2023 09.36, Marek Vasut wrote:
On 10/27/23 01:26, Hector Martin wrote:
Gah, I should've paid more attention to that rebase. Here's a dumb fixup for this patch. I'll squash it into a v2 if needed alongside any other changes, or if it looks good feel free to apply/squash it in directly.
drivers/usb/host/xhci-ring.c | 1 + 1 file changed, 1 insertion(+)
diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c index e2bd2e999a8e..7f2507be0cf0 100644 --- a/drivers/usb/host/xhci-ring.c +++ b/drivers/usb/host/xhci-ring.c @@ -532,6 +532,7 @@ static void reset_ep(struct usb_device *udev, int ep_index) event = xhci_wait_for_event(ctrl, TRB_COMPLETION); if (!event) return;
- field = le32_to_cpu(event->trans_event.flags); BUG_ON(TRB_TO_SLOT_ID(field) != udev->slot_id); xhci_acknowledge_event(ctrl);
Please squash, and add
Reviewed-by: Marek Vasut marex@denx.de
Also, +CC Minda,
there has been a similar fix to this one but with much more information about the failure, see:
[PATCH v1] usb: xhci: Check return value of wait for TRB_TRANSFER event
I think it would be useful to somehow include that information, so it wouldn't be lost.
The root cause for *that* failure is what I fix in patch #5. From that thread:
scanning bus xhci_pci for devices... WARN halted endpoint, queueing
URB anyway.
Without #5 and without #1, that situation then leads to fireworks.
A bunch of things are broken, which is why this is a series and not a single patch. I have more fixes to timeout handling, mass storage, etc. coming up after this. I fixed most of this in one long day of trying random USB devices, so it's not so much subtle super specific problems I could document the crashes for as just wide-ranging problems in the u-boot USB stack. None of this is particularly hard to repro if you just try a bunch of normal consumer USB devices, including things like USB HDDs which take time to spin-up leading to timeouts etc. The crash dumps aren't particularly useful, it's the subtleties of the xHCI interactions that are, but for that you need to add and enable a lot more debugging (patch #8).
I think the main reason all this stuff is broken and we're finding out now is that u-boot has traditionally been used in embedded devices with fairly narrow use cases for USB, and now we're running it on workstation-grade laptops and desktops and people are plugging in all kinds of normal USB devices and expecting their bootloader not to freak out with them (even though most of the time it doesn't need to talk to them). It's really easy to run into this stuff when that's your use case.
- Hector

On 10/27/23 08:03, Hector Martin wrote:
On 27/10/2023 09.36, Marek Vasut wrote:
On 10/27/23 01:26, Hector Martin wrote:
Gah, I should've paid more attention to that rebase. Here's a dumb fixup for this patch. I'll squash it into a v2 if needed alongside any other changes, or if it looks good feel free to apply/squash it in directly.
drivers/usb/host/xhci-ring.c | 1 + 1 file changed, 1 insertion(+)
diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c index e2bd2e999a8e..7f2507be0cf0 100644 --- a/drivers/usb/host/xhci-ring.c +++ b/drivers/usb/host/xhci-ring.c @@ -532,6 +532,7 @@ static void reset_ep(struct usb_device *udev, int ep_index) event = xhci_wait_for_event(ctrl, TRB_COMPLETION); if (!event) return;
- field = le32_to_cpu(event->trans_event.flags); BUG_ON(TRB_TO_SLOT_ID(field) != udev->slot_id); xhci_acknowledge_event(ctrl);
Please squash, and add
Reviewed-by: Marek Vasut marex@denx.de
Also, +CC Minda,
there has been a similar fix to this one but with much more information about the failure, see:
[PATCH v1] usb: xhci: Check return value of wait for TRB_TRANSFER event
I think it would be useful to somehow include that information, so it wouldn't be lost.
The root cause for *that* failure is what I fix in patch #5. From that thread:
scanning bus xhci_pci for devices... WARN halted endpoint, queueing
URB anyway.
Without #5 and without #1, that situation then leads to fireworks.
A bunch of things are broken, which is why this is a series and not a single patch. I have more fixes to timeout handling, mass storage, etc. coming up after this. I fixed most of this in one long day of trying random USB devices, so it's not so much subtle super specific problems I could document the crashes for as just wide-ranging problems in the u-boot USB stack. None of this is particularly hard to repro if you just try a bunch of normal consumer USB devices, including things like USB HDDs which take time to spin-up leading to timeouts etc.
I think majority of users I can think of use USB mass storage devices, like USB pen drives, not so much HDDs as far as I can tell.
The crash dumps aren't particularly useful, it's the subtleties of the xHCI interactions that are, but for that you need to add and enable a lot more debugging (patch #8).
The crash dumps are more for posterity, when someone searches for a problem they see. Search engines tend to pick those up and it might point those people in the right direction.
Also, it is good to include information about what triggered the crash, e.g. which USB device on which hardware, in case this is needed in the future.
I think the main reason all this stuff is broken and we're finding out now is that u-boot has traditionally been used in embedded devices with fairly narrow use cases for USB
Yes
, and now we're running it on workstation-grade laptops and desktops and people are plugging in all kinds of normal USB devices and expecting their bootloader not to freak out with them (even though most of the time it doesn't need to talk to them). It's really easy to run into this stuff when that's your use case.
It is really appreciated to see people fix things like this stuff, thanks !

On Sun, Oct 29, 2023 at 9:25 PM Marek Vasut marex@denx.de wrote:
On 10/27/23 08:03, Hector Martin wrote:
On 27/10/2023 09.36, Marek Vasut wrote:
On 10/27/23 01:26, Hector Martin wrote:
Gah, I should've paid more attention to that rebase. Here's a dumb fixup for this patch. I'll squash it into a v2 if needed alongside any other changes, or if it looks good feel free to apply/squash it in directly.
drivers/usb/host/xhci-ring.c | 1 + 1 file changed, 1 insertion(+)
diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c index e2bd2e999a8e..7f2507be0cf0 100644 --- a/drivers/usb/host/xhci-ring.c +++ b/drivers/usb/host/xhci-ring.c @@ -532,6 +532,7 @@ static void reset_ep(struct usb_device *udev, int ep_index) event = xhci_wait_for_event(ctrl, TRB_COMPLETION); if (!event) return;
- field = le32_to_cpu(event->trans_event.flags); BUG_ON(TRB_TO_SLOT_ID(field) != udev->slot_id); xhci_acknowledge_event(ctrl);
Please squash, and add
Reviewed-by: Marek Vasut marex@denx.de
Also, +CC Minda,
there has been a similar fix to this one but with much more information about the failure, see:
[PATCH v1] usb: xhci: Check return value of wait for TRB_TRANSFER event
I think it would be useful to somehow include that information, so it wouldn't be lost.
The root cause for *that* failure is what I fix in patch #5. From that thread:
scanning bus xhci_pci for devices... WARN halted endpoint, queueing
URB anyway.
Without #5 and without #1, that situation then leads to fireworks.
A bunch of things are broken, which is why this is a series and not a single patch. I have more fixes to timeout handling, mass storage, etc. coming up after this. I fixed most of this in one long day of trying random USB devices, so it's not so much subtle super specific problems I could document the crashes for as just wide-ranging problems in the u-boot USB stack. None of this is particularly hard to repro if you just try a bunch of normal consumer USB devices, including things like USB HDDs which take time to spin-up leading to timeouts etc.
I think majority of users I can think of use USB mass storage devices, like USB pen drives, not so much HDDs as far as I can tell.
We see a bunch of spinning rust users in Fedora, these sorts of devices are used by a bunch of people that want to run up cheap home NAS devices so from experience I'd say while not usual to be USB sticks and some form of solid state storage, spinning disk isn't unusual.
The crash dumps aren't particularly useful, it's the subtleties of the xHCI interactions that are, but for that you need to add and enable a lot more debugging (patch #8).
The crash dumps are more for posterity, when someone searches for a problem they see. Search engines tend to pick those up and it might point those people in the right direction.
Also, it is good to include information about what triggered the crash, e.g. which USB device on which hardware, in case this is needed in the future.
I think the main reason all this stuff is broken and we're finding out now is that u-boot has traditionally been used in embedded devices with fairly narrow use cases for USB
Yes
, and now we're running it on workstation-grade laptops and desktops and people are plugging in all kinds of normal USB devices and expecting their bootloader not to freak out with them (even though most of the time it doesn't need to talk to them). It's really easy to run into this stuff when that's your use case.
It is really appreciated to see people fix things like this stuff, thanks !

On Sun, Oct 29, 2023 at 2:33 PM Peter Robinson pbrobinson@gmail.com wrote:
On Sun, Oct 29, 2023 at 9:25 PM Marek Vasut marex@denx.de wrote:
On 10/27/23 08:03, Hector Martin wrote:
On 27/10/2023 09.36, Marek Vasut wrote:
On 10/27/23 01:26, Hector Martin wrote:
Gah, I should've paid more attention to that rebase. Here's a dumb fixup for this patch. I'll squash it into a v2 if needed alongside any other changes, or if it looks good feel free to apply/squash it in directly.
drivers/usb/host/xhci-ring.c | 1 + 1 file changed, 1 insertion(+)
diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c index e2bd2e999a8e..7f2507be0cf0 100644 --- a/drivers/usb/host/xhci-ring.c +++ b/drivers/usb/host/xhci-ring.c @@ -532,6 +532,7 @@ static void reset_ep(struct usb_device *udev, int ep_index) event = xhci_wait_for_event(ctrl, TRB_COMPLETION); if (!event) return;
- field = le32_to_cpu(event->trans_event.flags); BUG_ON(TRB_TO_SLOT_ID(field) != udev->slot_id); xhci_acknowledge_event(ctrl);
Please squash, and add
Reviewed-by: Marek Vasut marex@denx.de
Also, +CC Minda,
there has been a similar fix to this one but with much more information about the failure, see:
[PATCH v1] usb: xhci: Check return value of wait for TRB_TRANSFER event
I think it would be useful to somehow include that information, so it wouldn't be lost.
The root cause for *that* failure is what I fix in patch #5. From that thread:
scanning bus xhci_pci for devices... WARN halted endpoint, queueing
URB anyway.
Without #5 and without #1, that situation then leads to fireworks.
A bunch of things are broken, which is why this is a series and not a single patch. I have more fixes to timeout handling, mass storage, etc. coming up after this. I fixed most of this in one long day of trying random USB devices, so it's not so much subtle super specific problems I could document the crashes for as just wide-ranging problems in the u-boot USB stack. None of this is particularly hard to repro if you just try a bunch of normal consumer USB devices, including things like USB HDDs which take time to spin-up leading to timeouts etc.
I think majority of users I can think of use USB mass storage devices, like USB pen drives, not so much HDDs as far as I can tell.
We see a bunch of spinning rust users in Fedora, these sorts of devices are used by a bunch of people that want to run up cheap home NAS devices so from experience I'd say while not usual to be USB sticks and some form of solid state storage, spinning disk isn't unusual.
What Peter said. A very common use case, even more so than USB flash drives, is for the consumers to plug in a USB HDD to their routers or home NAS, as a cheap and quick solution. I've seen this type of timeout failure happen quite often with large >= 3TB HDDs in USB enclosure.
All the best, Tony
The crash dumps aren't particularly useful, it's the subtleties of the xHCI interactions that are, but for that you need to add and enable a lot more debugging (patch #8).
The crash dumps are more for posterity, when someone searches for a problem they see. Search engines tend to pick those up and it might point those people in the right direction.
Also, it is good to include information about what triggered the crash, e.g. which USB device on which hardware, in case this is needed in the future.
I think the main reason all this stuff is broken and we're finding out now is that u-boot has traditionally been used in embedded devices with fairly narrow use cases for USB
Yes
, and now we're running it on workstation-grade laptops and desktops and people are plugging in all kinds of normal USB devices and expecting their bootloader not to freak out with them (even though most of the time it doesn't need to talk to them). It's really easy to run into this stuff when that's your use case.
It is really appreciated to see people fix things like this stuff, thanks !

If the xHC has a problem with our STOP ENDPOINT command, it is likely to return a completion directly instead of first a transfer event for the in-progress transfer. Handle that more gracefully.
Right now we still BUG() on the error code, but at least we don't end up timing out on the event and ending up with unexpected event errors.
Signed-off-by: Hector Martin marcan@marcan.st --- drivers/usb/host/xhci-ring.c | 34 ++++++++++++++++++++++------------ include/usb/xhci.h | 2 ++ 2 files changed, 24 insertions(+), 12 deletions(-)
diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c index aaf128ff9317..d08bb8e2bfba 100644 --- a/drivers/usb/host/xhci-ring.c +++ b/drivers/usb/host/xhci-ring.c @@ -466,7 +466,8 @@ union xhci_trb *xhci_wait_for_event(struct xhci_ctrl *ctrl, trb_type expected) continue;
type = TRB_FIELD_TO_TYPE(le32_to_cpu(event->event_cmd.flags)); - if (type == expected) + if (type == expected || + (expected == TRB_NONE && type != TRB_PORT_STATUS)) return event;
if (type == TRB_PORT_STATUS) @@ -542,27 +543,36 @@ static void abort_td(struct usb_device *udev, int ep_index) struct xhci_ctrl *ctrl = xhci_get_ctrl(udev); struct xhci_ring *ring = ctrl->devs[udev->slot_id]->eps[ep_index].ring; union xhci_trb *event; + trb_type type; u64 addr; u32 field;
xhci_queue_command(ctrl, 0, udev->slot_id, ep_index, TRB_STOP_RING);
- event = xhci_wait_for_event(ctrl, TRB_TRANSFER); + event = xhci_wait_for_event(ctrl, TRB_NONE); if (!event) return;
- field = le32_to_cpu(event->trans_event.flags); - BUG_ON(TRB_TO_SLOT_ID(field) != udev->slot_id); - BUG_ON(TRB_TO_EP_INDEX(field) != ep_index); - BUG_ON(GET_COMP_CODE(le32_to_cpu(event->trans_event.transfer_len - != COMP_STOP))); - xhci_acknowledge_event(ctrl); + type = TRB_FIELD_TO_TYPE(le32_to_cpu(event->event_cmd.flags)); + if (type == TRB_TRANSFER) { + field = le32_to_cpu(event->trans_event.flags); + BUG_ON(TRB_TO_SLOT_ID(field) != udev->slot_id); + BUG_ON(TRB_TO_EP_INDEX(field) != ep_index); + BUG_ON(GET_COMP_CODE(le32_to_cpu(event->trans_event.transfer_len + != COMP_STOP))); + xhci_acknowledge_event(ctrl);
- event = xhci_wait_for_event(ctrl, TRB_COMPLETION); - if (!event) - return; + event = xhci_wait_for_event(ctrl, TRB_COMPLETION); + if (!event) + return; + type = TRB_FIELD_TO_TYPE(le32_to_cpu(event->event_cmd.flags));
- BUG_ON(TRB_TO_SLOT_ID(le32_to_cpu(event->event_cmd.flags)) + } else { + printf("abort_td: Expected a TRB_TRANSFER TRB first\n"); + } + + BUG_ON(type != TRB_COMPLETION || + TRB_TO_SLOT_ID(le32_to_cpu(event->event_cmd.flags)) != udev->slot_id || GET_COMP_CODE(le32_to_cpu( event->event_cmd.status)) != COMP_SUCCESS); xhci_acknowledge_event(ctrl); diff --git a/include/usb/xhci.h b/include/usb/xhci.h index 4a4ac10229ac..04d16a256bbd 100644 --- a/include/usb/xhci.h +++ b/include/usb/xhci.h @@ -901,6 +901,8 @@ union xhci_trb {
/* TRB type IDs */ typedef enum { + /* reserved, used as a software sentinel */ + TRB_NONE = 0, /* bulk, interrupt, isoc scatter/gather, and control data stage */ TRB_NORMAL = 1, /* setup stage for control transfers */

On 10/27/23 01:16, Hector Martin wrote:
If the xHC has a problem with our STOP ENDPOINT command, it is likely to return a completion directly instead of first a transfer event for the in-progress transfer. Handle that more gracefully.
Right now we still BUG() on the error code, but at least we don't end up timing out on the event and ending up with unexpected event errors.
Signed-off-by: Hector Martin marcan@marcan.st
I'll defer the rest of the review to Bin , he just knows better when it comes to xHCI . Please ping me if things are stuck for too long however.

There is a race where an endpoint may halt by itself while we are trying to halt it, which results in a context state error. See xHCI 4.6.9 which mentions this case.
This also avoids BUGging when we attempt to stop an endpoint which was already stopped to begin with, which is probably a bug elsewhere but not a good reason to crash.
Signed-off-by: Hector Martin marcan@marcan.st --- drivers/usb/host/xhci-ring.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c index d08bb8e2bfba..5c2d16b58589 100644 --- a/drivers/usb/host/xhci-ring.c +++ b/drivers/usb/host/xhci-ring.c @@ -543,6 +543,7 @@ static void abort_td(struct usb_device *udev, int ep_index) struct xhci_ctrl *ctrl = xhci_get_ctrl(udev); struct xhci_ring *ring = ctrl->devs[udev->slot_id]->eps[ep_index].ring; union xhci_trb *event; + xhci_comp_code comp; trb_type type; u64 addr; u32 field; @@ -571,10 +572,11 @@ static void abort_td(struct usb_device *udev, int ep_index) printf("abort_td: Expected a TRB_TRANSFER TRB first\n"); }
+ comp = GET_COMP_CODE(le32_to_cpu(event->event_cmd.status)); BUG_ON(type != TRB_COMPLETION || TRB_TO_SLOT_ID(le32_to_cpu(event->event_cmd.flags)) - != udev->slot_id || GET_COMP_CODE(le32_to_cpu( - event->event_cmd.status)) != COMP_SUCCESS); + != udev->slot_id || (comp != COMP_SUCCESS && comp + != COMP_CTX_STATE)); xhci_acknowledge_event(ctrl);
addr = xhci_trb_virt_to_dma(ring->enq_seg,

There is currently no codepath to recover from this case. In principle we could require that the upper layer do this explicitly, but let's just do it in xHCI when the next bulk transfer is started, since that reasonably implies whatever caused the problem has been dealt with.
Signed-off-by: Hector Martin marcan@marcan.st --- drivers/usb/host/xhci-ring.c | 8 ++++++++ 1 file changed, 8 insertions(+)
diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c index 5c2d16b58589..60f2cf72dffa 100644 --- a/drivers/usb/host/xhci-ring.c +++ b/drivers/usb/host/xhci-ring.c @@ -669,6 +669,14 @@ int xhci_bulk_tx(struct usb_device *udev, unsigned long pipe,
ep_ctx = xhci_get_ep_ctx(ctrl, virt_dev->out_ctx, ep_index);
+ /* + * If the endpoint was halted due to a prior error, resume it before + * the next transfer. It is the responsibility of the upper layer to + * have dealt with whatever caused the error. + */ + if ((le32_to_cpu(ep_ctx->ep_info) & EP_STATE_MASK) == EP_STATE_HALTED) + reset_ep(udev, ep_index); + ring = virt_dev->eps[ep_index].ring; /* * How much data is (potentially) left before the 64KB boundary?

This isn't going to work, don't pretend it will and then end up timing out.
Signed-off-by: Hector Martin marcan@marcan.st --- drivers/usb/host/xhci-ring.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c index 60f2cf72dffa..3ef8c2f14ccc 100644 --- a/drivers/usb/host/xhci-ring.c +++ b/drivers/usb/host/xhci-ring.c @@ -243,7 +243,8 @@ static int prepare_ring(struct xhci_ctrl *ctrl, struct xhci_ring *ep_ring, puts("WARN waiting for error on ep to be cleared\n"); return -EINVAL; case EP_STATE_HALTED: - puts("WARN halted endpoint, queueing URB anyway.\n"); + puts("WARN endpoint is halted\n"); + return -EINVAL; case EP_STATE_STOPPED: case EP_STATE_RUNNING: debug("EP STATE RUNNING.\n");

Now that we always check the return value, just return NULL on timeouts. We can still log the error since this is a problem, but it's not reason to panic.
Signed-off-by: Hector Martin marcan@marcan.st --- drivers/usb/host/xhci-ring.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c index 3ef8c2f14ccc..b95f20642943 100644 --- a/drivers/usb/host/xhci-ring.c +++ b/drivers/usb/host/xhci-ring.c @@ -494,8 +494,9 @@ union xhci_trb *xhci_wait_for_event(struct xhci_ctrl *ctrl, trb_type expected) if (expected == TRB_TRANSFER) return NULL;
- printf("XHCI timeout on event type %d... cannot recover.\n", expected); - BUG(); + printf("XHCI timeout on event type %d...\n", expected); + + return NULL; }
/*

We need to get the DMA address before incrementing the pointer, as that might move us onto another segment.
Signed-off-by: Hector Martin marcan@marcan.st --- drivers/usb/host/xhci-ring.c | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c index b95f20642943..b9518e4a6743 100644 --- a/drivers/usb/host/xhci-ring.c +++ b/drivers/usb/host/xhci-ring.c @@ -202,18 +202,22 @@ static dma_addr_t queue_trb(struct xhci_ctrl *ctrl, struct xhci_ring *ring, bool more_trbs_coming, unsigned int *trb_fields) { struct xhci_generic_trb *trb; + dma_addr_t addr; int i;
trb = &ring->enqueue->generic;
+ for (i = 0; i < 4; i++) trb->field[i] = cpu_to_le32(trb_fields[i]);
xhci_flush_cache((uintptr_t)trb, sizeof(struct xhci_generic_trb));
+ addr = xhci_trb_virt_to_dma(ring->enq_seg, (union xhci_trb *)trb); + inc_enq(ctrl, ring, more_trbs_coming);
- return xhci_trb_virt_to_dma(ring->enq_seg, (union xhci_trb *)trb); + return addr; }
/**

A bunch of miscellaneous debug messages to aid in working out USB issues.
Signed-off-by: Hector Martin marcan@marcan.st --- drivers/usb/host/xhci-ring.c | 29 ++++++++++++++++++++++++++--- 1 file changed, 26 insertions(+), 3 deletions(-)
diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c index b9518e4a6743..e2bd2e999a8e 100644 --- a/drivers/usb/host/xhci-ring.c +++ b/drivers/usb/host/xhci-ring.c @@ -215,6 +215,9 @@ static dma_addr_t queue_trb(struct xhci_ctrl *ctrl, struct xhci_ring *ring,
addr = xhci_trb_virt_to_dma(ring->enq_seg, (union xhci_trb *)trb);
+ debug("trb @ %llx: %08x %08x %08x %08x\n", addr, + trb_fields[0], trb_fields[1], trb_fields[2], trb_fields[3]); + inc_enq(ctrl, ring, more_trbs_coming);
return addr; @@ -297,6 +300,8 @@ void xhci_queue_command(struct xhci_ctrl *ctrl, dma_addr_t addr, u32 slot_id, { u32 fields[4];
+ debug("CMD: %llx 0x%x 0x%x %d\n", addr, slot_id, ep_index, cmd); + BUG_ON(prepare_ring(ctrl, ctrl->cmd_ring, EP_STATE_RUNNING));
fields[0] = lower_32_bits(addr); @@ -472,8 +477,14 @@ union xhci_trb *xhci_wait_for_event(struct xhci_ctrl *ctrl, trb_type expected)
type = TRB_FIELD_TO_TYPE(le32_to_cpu(event->event_cmd.flags)); if (type == expected || - (expected == TRB_NONE && type != TRB_PORT_STATUS)) + (expected == TRB_NONE && type != TRB_PORT_STATUS)) { + debug("Event: %08x %08x %08x %08x\n", + le32_to_cpu(event->generic.field[0]), + le32_to_cpu(event->generic.field[1]), + le32_to_cpu(event->generic.field[2]), + le32_to_cpu(event->generic.field[3])); return event; + }
if (type == TRB_PORT_STATUS) /* TODO: remove this once enumeration has been reworked */ @@ -485,8 +496,9 @@ union xhci_trb *xhci_wait_for_event(struct xhci_ctrl *ctrl, trb_type expected) le32_to_cpu(event->generic.field[2])) != COMP_SUCCESS); else - printf("Unexpected XHCI event TRB, skipping... " + printf("Unexpected XHCI event TRB, expected %d... " "(%08x %08x %08x %08x)\n", + expected, le32_to_cpu(event->generic.field[0]), le32_to_cpu(event->generic.field[1]), le32_to_cpu(event->generic.field[2]), @@ -601,10 +613,13 @@ static void abort_td(struct usb_device *udev, int ep_index) static void record_transfer_result(struct usb_device *udev, union xhci_trb *event, int length) { + xhci_comp_code code = GET_COMP_CODE( + le32_to_cpu(event->trans_event.transfer_len)); + udev->act_len = min(length, length - (int)EVENT_TRB_LEN(le32_to_cpu(event->trans_event.transfer_len)));
- switch (GET_COMP_CODE(le32_to_cpu(event->trans_event.transfer_len))) { + switch (code) { case COMP_SUCCESS: BUG_ON(udev->act_len != length); /* fallthrough */ @@ -612,16 +627,23 @@ static void record_transfer_result(struct usb_device *udev, udev->status = 0; break; case COMP_STALL: + debug("Xfer STALL\n"); udev->status = USB_ST_STALLED; break; case COMP_DB_ERR: + debug("Xfer DB_ERR\n"); + udev->status = USB_ST_BUF_ERR; + break; case COMP_TRB_ERR: + debug("Xfer TRB_ERR\n"); udev->status = USB_ST_BUF_ERR; break; case COMP_BABBLE: + debug("Xfer BABBLE\n"); udev->status = USB_ST_BABBLE_DET; break; default: + debug("Xfer error: %d\n", code); udev->status = 0x80; /* USB_ST_TOO_LAZY_TO_MAKE_A_NEW_MACRO */ } } @@ -1015,6 +1037,7 @@ int xhci_ctrl_tx(struct usb_device *udev, unsigned long pipe, record_transfer_result(udev, event, length); xhci_acknowledge_event(ctrl); if (udev->status == USB_ST_STALLED) { + debug("EP %d stalled\n", ep_index); reset_ep(udev, ep_index); return -EPIPE; }

On Thu, Oct 26, 2023 at 7:26 PM Hector Martin marcan@marcan.st wrote:
This series is the first of a few bundles of USB fixes we have been carrying downstream on the Asahi U-Boot branch for a few months.
Most importantly, this related set of patches makes xHCI error/stall recovery more robust (or work at all in some cases). There are also a couple patches fixing other xHCI bugs and adding better debug logs.
I believe this should fix this Fedora bug too:
https://bugzilla.redhat.com/show_bug.cgi?id=2244305
Signed-off-by: Hector Martin marcan@marcan.st
Hector Martin (8): usb: xhci: Guard all calls to xhci_wait_for_event usb: xhci: Better error handling in abort_td() usb: xhci: Allow context state errors when halting an endpoint usb: xhci: Recover from halted non-control endpoints usb: xhci: Fail on attempt to queue TRBs to a halted endpoint usb: xhci: Do not panic on event timeouts usb: xhci: Fix DMA address calculation in queue_trb usb: xhci: Add more debugging
drivers/usb/host/xhci-ring.c | 100 +++++++++++++++++++++++++++++++++++-------- drivers/usb/host/xhci.c | 9 ++++ include/usb/xhci.h | 2 + 3 files changed, 92 insertions(+), 19 deletions(-)
base-commit: fb428b61819444b9337075f49c72f326f5d12085 change-id: 20231027-usb-fixes-1-83bfc7013012
The series looks reasonable and has worked quite well in Fedora Asahi.
Reviewed-by: Neal Gompa neal@gompa.dev

On Fri, Oct 27, 2023 at 2:58 PM Neal Gompa neal@gompa.dev wrote:
On Thu, Oct 26, 2023 at 7:26 PM Hector Martin marcan@marcan.st wrote:
This series is the first of a few bundles of USB fixes we have been carrying downstream on the Asahi U-Boot branch for a few months.
Most importantly, this related set of patches makes xHCI error/stall recovery more robust (or work at all in some cases). There are also a couple patches fixing other xHCI bugs and adding better debug logs.
I believe this should fix this Fedora bug too:
https://bugzilla.redhat.com/show_bug.cgi?id=2244305
Signed-off-by: Hector Martin marcan@marcan.st
Hector Martin (8): usb: xhci: Guard all calls to xhci_wait_for_event usb: xhci: Better error handling in abort_td() usb: xhci: Allow context state errors when halting an endpoint usb: xhci: Recover from halted non-control endpoints usb: xhci: Fail on attempt to queue TRBs to a halted endpoint usb: xhci: Do not panic on event timeouts usb: xhci: Fix DMA address calculation in queue_trb usb: xhci: Add more debugging
drivers/usb/host/xhci-ring.c | 100 +++++++++++++++++++++++++++++++++++-------- drivers/usb/host/xhci.c | 9 ++++ include/usb/xhci.h | 2 + 3 files changed, 92 insertions(+), 19 deletions(-)
base-commit: fb428b61819444b9337075f49c72f326f5d12085 change-id: 20231027-usb-fixes-1-83bfc7013012
The series looks reasonable and has worked quite well in Fedora Asahi.
Reviewed-by: Neal Gompa neal@gompa.dev
Resending now that I'm subscribed to the U-Boot mailing list...
Reviewed-by: Neal Gompa ngompa13@gmail.com

On 10/27/23 01:16, Hector Martin wrote:
This series is the first of a few bundles of USB fixes we have been carrying downstream on the Asahi U-Boot branch for a few months.
Most importantly, this related set of patches makes xHCI error/stall recovery more robust (or work at all in some cases). There are also a couple patches fixing other xHCI bugs and adding better debug logs.
I believe this should fix this Fedora bug too:
Was there ever a V2 of these patches I might've missed ?

On Sun, Nov 19, 2023 at 8:08 PM Marek Vasut marex@denx.de wrote:
On 10/27/23 01:16, Hector Martin wrote:
This series is the first of a few bundles of USB fixes we have been carrying downstream on the Asahi U-Boot branch for a few months.
Most importantly, this related set of patches makes xHCI error/stall recovery more robust (or work at all in some cases). There are also a couple patches fixing other xHCI bugs and adding better debug logs.
I believe this should fix this Fedora bug too:
Was there ever a V2 of these patches I might've missed ?
Is it this one? https://patchwork.ozlabs.org/project/uboot/list/?series=379807

On 11/20/23 00:17, Shantur Rathore wrote:
On Sun, Nov 19, 2023 at 8:08 PM Marek Vasut marex@denx.de wrote:
On 10/27/23 01:16, Hector Martin wrote:
This series is the first of a few bundles of USB fixes we have been carrying downstream on the Asahi U-Boot branch for a few months.
Most importantly, this related set of patches makes xHCI error/stall recovery more robust (or work at all in some cases). There are also a couple patches fixing other xHCI bugs and adding better debug logs.
I believe this should fix this Fedora bug too:
Was there ever a V2 of these patches I might've missed ?
Is it this one? https://patchwork.ozlabs.org/project/uboot/list/?series=379807
I think so, thanks.
And uh, my question therefore it, is there a V3 which addresses the 3/8 and 8/8 comment ?

On 2023/11/20 11:09, Marek Vasut wrote:
On 11/20/23 00:17, Shantur Rathore wrote:
On Sun, Nov 19, 2023 at 8:08 PM Marek Vasut marex@denx.de wrote:
On 10/27/23 01:16, Hector Martin wrote:
This series is the first of a few bundles of USB fixes we have been carrying downstream on the Asahi U-Boot branch for a few months.
Most importantly, this related set of patches makes xHCI error/stall recovery more robust (or work at all in some cases). There are also a couple patches fixing other xHCI bugs and adding better debug logs.
I believe this should fix this Fedora bug too:
Was there ever a V2 of these patches I might've missed ?
Is it this one? https://patchwork.ozlabs.org/project/uboot/list/?series=379807
I think so, thanks.
And uh, my question therefore it, is there a V3 which addresses the 3/8 and 8/8 comment ?
Not yet, no. Sorry, I probably won't have time to work on this in a while, currently busy with other stuff.
- Hector

On 11/20/23 11:45, Hector Martin wrote:
On 2023/11/20 11:09, Marek Vasut wrote:
On 11/20/23 00:17, Shantur Rathore wrote:
On Sun, Nov 19, 2023 at 8:08 PM Marek Vasut marex@denx.de wrote:
On 10/27/23 01:16, Hector Martin wrote:
This series is the first of a few bundles of USB fixes we have been carrying downstream on the Asahi U-Boot branch for a few months.
Most importantly, this related set of patches makes xHCI error/stall recovery more robust (or work at all in some cases). There are also a couple patches fixing other xHCI bugs and adding better debug logs.
I believe this should fix this Fedora bug too:
Was there ever a V2 of these patches I might've missed ?
Is it this one? https://patchwork.ozlabs.org/project/uboot/list/?series=379807
I think so, thanks.
And uh, my question therefore it, is there a V3 which addresses the 3/8 and 8/8 comment ?
Not yet, no. Sorry, I probably won't have time to work on this in a while, currently busy with other stuff.
I can probably fix the patches up myself if that is fine with you, I'd really like to get these fixes into the release soon. Would that be OK with you ?

On 2023/11/20 21:15, Marek Vasut wrote:
On 11/20/23 11:45, Hector Martin wrote:
On 2023/11/20 11:09, Marek Vasut wrote:
On 11/20/23 00:17, Shantur Rathore wrote:
On Sun, Nov 19, 2023 at 8:08 PM Marek Vasut marex@denx.de wrote:
On 10/27/23 01:16, Hector Martin wrote:
This series is the first of a few bundles of USB fixes we have been carrying downstream on the Asahi U-Boot branch for a few months.
Most importantly, this related set of patches makes xHCI error/stall recovery more robust (or work at all in some cases). There are also a couple patches fixing other xHCI bugs and adding better debug logs.
I believe this should fix this Fedora bug too:
Was there ever a V2 of these patches I might've missed ?
Is it this one? https://patchwork.ozlabs.org/project/uboot/list/?series=379807
I think so, thanks.
And uh, my question therefore it, is there a V3 which addresses the 3/8 and 8/8 comment ?
Not yet, no. Sorry, I probably won't have time to work on this in a while, currently busy with other stuff.
I can probably fix the patches up myself if that is fine with you, I'd really like to get these fixes into the release soon. Would that be OK with you ?
Of course, I would appreciate that :)
- Hector

On 11/21/23 07:46, Hector Martin wrote:
On 2023/11/20 21:15, Marek Vasut wrote:
On 11/20/23 11:45, Hector Martin wrote:
On 2023/11/20 11:09, Marek Vasut wrote:
On 11/20/23 00:17, Shantur Rathore wrote:
On Sun, Nov 19, 2023 at 8:08 PM Marek Vasut marex@denx.de wrote:
On 10/27/23 01:16, Hector Martin wrote: > This series is the first of a few bundles of USB fixes we have been > carrying downstream on the Asahi U-Boot branch for a few months. > > Most importantly, this related set of patches makes xHCI error/stall > recovery more robust (or work at all in some cases). There are also a > couple patches fixing other xHCI bugs and adding better debug logs. > > I believe this should fix this Fedora bug too: > > https://bugzilla.redhat.com/show_bug.cgi?id=2244305
Was there ever a V2 of these patches I might've missed ?
Is it this one? https://patchwork.ozlabs.org/project/uboot/list/?series=379807
I think so, thanks.
And uh, my question therefore it, is there a V3 which addresses the 3/8 and 8/8 comment ?
Not yet, no. Sorry, I probably won't have time to work on this in a while, currently busy with other stuff.
I can probably fix the patches up myself if that is fine with you, I'd really like to get these fixes into the release soon. Would that be OK with you ?
Of course, I would appreciate that :)
I picked a subset to usb/master and fixed a subset, for the rest I really do need your input on those patches.
participants (7)
-
Hector Martin
-
Marek Vasut
-
Neal Gompa
-
Neal Gompa
-
Peter Robinson
-
Shantur Rathore
-
Tony Dinh