[PATCH v2 0/8] USB fixes: xHCI error handling

This series is the first of a few bundles of USB fixes we have been carrying downstream on the Asahi U-Boot branch for a few months.
Patches #1-#6 fix a series of related robustness issues. Certain conditions related to endpoint stalls revealed a chain of bugs throughout the stack that caused U-Boot to completely fail when encountering some errors (and errors are a fact of life with USB).
Example scenario: - The device stalls a bulk endpoint due to an error - The upper layer driver tries to use the endpoint again - There is no endpoint stall clear wired up in the API, so for starters this is doomed to fail (fix: #4) - xHCI knows the endpoint is halted, but tries to queue the TRB anyway, which can't work (fix: #5) - Since the endpoint is halted nothing happens, so the transfer times out. - The timeout handling tries to abort the transfer - abort_td() tries to stop the endpoint, but since it is already halted, this results in a context state error. As the transfer never started, there is no completion event, so xhci_wait_for_event() is waiting for the wrong event type, and logs an error and returns NULL. (fix: #2) - abort_td() crashes due to failing to guard against the NULL event (fix: #1) - Even after fixing all that, abort_td() would not handle the context state error properly and BUG() (fix: #3). This also fixes a race condition documented in the xHCI spec that could occur even in the absence of all the other bugs.
Patch #6 addresses a related robustness issue where xhci_wait_for_event() panics on event timeouts other than for transfers. While this is clearly an unexpected condition and indicates a bug elsewhere, it's no reason to outright crash. USB is notoriously impossible to get 100% right, and we can't afford to be breaking users' systems at any sign of trouble. Error conditions should always be dealt with as gracefully as possible (even if that results in a completely broken USB controller, that is much preferable to aborting the boot process entirely, especially on devices with non-USB storage and keyboards where USB support is effectively optional for most users). Since after patch #1 we now guard all callers to xhci_wait_for_event() with at least trivial NULL checks, it's okay to return and continue in case of timeouts.
Patch #7 addresses an unrelated bug I ran into while debugging all this, and patch #8 adds extra debug logs to make finding future issues less painful.
I believe this should fix this Fedora bug too, which is either an instance of the above sequence of events, or (more likely, given the difficulty reproducing) the race condition documented in xHCI 4.6.9:
https://bugzilla.redhat.com/show_bug.cgi?id=2244305
Signed-off-by: Hector Martin marcan@marcan.st --- Changes in v2: - Squashed in a trivial fix for patch #1 - Removed spurious blank line - Added a longer description to the cover letter - Link to v1: https://lore.kernel.org/r/20231027-usb-fixes-1-v1-0-1c879bbcd928@marcan.st
--- Hector Martin (8): usb: xhci: Guard all calls to xhci_wait_for_event usb: xhci: Better error handling in abort_td() usb: xhci: Allow context state errors when halting an endpoint usb: xhci: Recover from halted bulk endpoints usb: xhci: Fail on attempt to queue TRBs to a halted endpoint usb: xhci: Do not panic on event timeouts usb: xhci: Fix DMA address calculation in queue_trb usb: xhci: Add more debugging
drivers/usb/host/xhci-ring.c | 99 ++++++++++++++++++++++++++++++++++++-------- drivers/usb/host/xhci.c | 9 ++++ include/usb/xhci.h | 2 + 3 files changed, 92 insertions(+), 18 deletions(-) --- base-commit: 7c0d668103abae3ec14cd96d882ba20134bb4529 change-id: 20231027-usb-fixes-1-83bfc7013012
Best regards,

xhci_wait_for_event returns NULL on timeout, so the caller always has to check for that. This addresses immediate explosions in this part of the code when timeouts happen, but not the root cause for the timeout.
Signed-off-by: Hector Martin marcan@marcan.st --- drivers/usb/host/xhci-ring.c | 15 +++++++++++++++ drivers/usb/host/xhci.c | 9 +++++++++ 2 files changed, 24 insertions(+)
diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c index c8260cbdf94b..d0960812a47b 100644 --- a/drivers/usb/host/xhci-ring.c +++ b/drivers/usb/host/xhci-ring.c @@ -511,6 +511,9 @@ static void reset_ep(struct usb_device *udev, int ep_index) printf("Resetting EP %d...\n", ep_index); xhci_queue_command(ctrl, 0, udev->slot_id, ep_index, TRB_RESET_EP); event = xhci_wait_for_event(ctrl, TRB_COMPLETION); + if (!event) + return; + field = le32_to_cpu(event->trans_event.flags); BUG_ON(TRB_TO_SLOT_ID(field) != udev->slot_id); xhci_acknowledge_event(ctrl); @@ -519,6 +522,9 @@ static void reset_ep(struct usb_device *udev, int ep_index) (void *)((uintptr_t)ring->enqueue | ring->cycle_state)); xhci_queue_command(ctrl, addr, udev->slot_id, ep_index, TRB_SET_DEQ); event = xhci_wait_for_event(ctrl, TRB_COMPLETION); + if (!event) + return; + BUG_ON(TRB_TO_SLOT_ID(le32_to_cpu(event->event_cmd.flags)) != udev->slot_id || GET_COMP_CODE(le32_to_cpu( event->event_cmd.status)) != COMP_SUCCESS); @@ -544,6 +550,9 @@ static void abort_td(struct usb_device *udev, int ep_index) xhci_queue_command(ctrl, 0, udev->slot_id, ep_index, TRB_STOP_RING);
event = xhci_wait_for_event(ctrl, TRB_TRANSFER); + if (!event) + return; + field = le32_to_cpu(event->trans_event.flags); BUG_ON(TRB_TO_SLOT_ID(field) != udev->slot_id); BUG_ON(TRB_TO_EP_INDEX(field) != ep_index); @@ -552,6 +561,9 @@ static void abort_td(struct usb_device *udev, int ep_index) xhci_acknowledge_event(ctrl);
event = xhci_wait_for_event(ctrl, TRB_COMPLETION); + if (!event) + return; + BUG_ON(TRB_TO_SLOT_ID(le32_to_cpu(event->event_cmd.flags)) != udev->slot_id || GET_COMP_CODE(le32_to_cpu( event->event_cmd.status)) != COMP_SUCCESS); @@ -561,6 +573,9 @@ static void abort_td(struct usb_device *udev, int ep_index) (void *)((uintptr_t)ring->enqueue | ring->cycle_state)); xhci_queue_command(ctrl, addr, udev->slot_id, ep_index, TRB_SET_DEQ); event = xhci_wait_for_event(ctrl, TRB_COMPLETION); + if (!event) + return; + BUG_ON(TRB_TO_SLOT_ID(le32_to_cpu(event->event_cmd.flags)) != udev->slot_id || GET_COMP_CODE(le32_to_cpu( event->event_cmd.status)) != COMP_SUCCESS); diff --git a/drivers/usb/host/xhci.c b/drivers/usb/host/xhci.c index 5cacf0769ec7..d13cbff9b372 100644 --- a/drivers/usb/host/xhci.c +++ b/drivers/usb/host/xhci.c @@ -451,6 +451,9 @@ static int xhci_configure_endpoints(struct usb_device *udev, bool ctx_change) xhci_queue_command(ctrl, in_ctx->dma, udev->slot_id, 0, ctx_change ? TRB_EVAL_CONTEXT : TRB_CONFIG_EP); event = xhci_wait_for_event(ctrl, TRB_COMPLETION); + if (!event) + return -ETIMEDOUT; + BUG_ON(TRB_TO_SLOT_ID(le32_to_cpu(event->event_cmd.flags)) != udev->slot_id);
@@ -647,6 +650,9 @@ static int xhci_address_device(struct usb_device *udev, int root_portnr) xhci_queue_command(ctrl, virt_dev->in_ctx->dma, slot_id, 0, TRB_ADDR_DEV); event = xhci_wait_for_event(ctrl, TRB_COMPLETION); + if (!event) + return -ETIMEDOUT; + BUG_ON(TRB_TO_SLOT_ID(le32_to_cpu(event->event_cmd.flags)) != slot_id);
switch (GET_COMP_CODE(le32_to_cpu(event->event_cmd.status))) { @@ -722,6 +728,9 @@ static int _xhci_alloc_device(struct usb_device *udev)
xhci_queue_command(ctrl, 0, 0, 0, TRB_ENABLE_SLOT); event = xhci_wait_for_event(ctrl, TRB_COMPLETION); + if (!event) + return -ETIMEDOUT; + BUG_ON(GET_COMP_CODE(le32_to_cpu(event->event_cmd.status)) != COMP_SUCCESS);

On 10/29/23 07:37, Hector Martin wrote:
xhci_wait_for_event returns NULL on timeout, so the caller always has to check for that. This addresses immediate explosions in this part of the code when timeouts happen, but not the root cause for the timeout.
Signed-off-by: Hector Martin marcan@marcan.st
Reviewed-by: Marek Vasut marex@denx.de

If the xHC has a problem with our STOP ENDPOINT command, it is likely to return a completion directly instead of first a transfer event for the in-progress transfer. Handle that more gracefully.
We still BUG() on the error code, but at least we don't end up timing out on the event and ending up with unexpected event errors.
Signed-off-by: Hector Martin marcan@marcan.st --- drivers/usb/host/xhci-ring.c | 34 ++++++++++++++++++++++------------ include/usb/xhci.h | 2 ++ 2 files changed, 24 insertions(+), 12 deletions(-)
diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c index d0960812a47b..d21e76e0bdb6 100644 --- a/drivers/usb/host/xhci-ring.c +++ b/drivers/usb/host/xhci-ring.c @@ -466,7 +466,8 @@ union xhci_trb *xhci_wait_for_event(struct xhci_ctrl *ctrl, trb_type expected) continue;
type = TRB_FIELD_TO_TYPE(le32_to_cpu(event->event_cmd.flags)); - if (type == expected) + if (type == expected || + (expected == TRB_NONE && type != TRB_PORT_STATUS)) return event;
if (type == TRB_PORT_STATUS) @@ -544,27 +545,36 @@ static void abort_td(struct usb_device *udev, int ep_index) struct xhci_ctrl *ctrl = xhci_get_ctrl(udev); struct xhci_ring *ring = ctrl->devs[udev->slot_id]->eps[ep_index].ring; union xhci_trb *event; + trb_type type; u64 addr; u32 field;
xhci_queue_command(ctrl, 0, udev->slot_id, ep_index, TRB_STOP_RING);
- event = xhci_wait_for_event(ctrl, TRB_TRANSFER); + event = xhci_wait_for_event(ctrl, TRB_NONE); if (!event) return;
- field = le32_to_cpu(event->trans_event.flags); - BUG_ON(TRB_TO_SLOT_ID(field) != udev->slot_id); - BUG_ON(TRB_TO_EP_INDEX(field) != ep_index); - BUG_ON(GET_COMP_CODE(le32_to_cpu(event->trans_event.transfer_len - != COMP_STOP))); - xhci_acknowledge_event(ctrl); + type = TRB_FIELD_TO_TYPE(le32_to_cpu(event->event_cmd.flags)); + if (type == TRB_TRANSFER) { + field = le32_to_cpu(event->trans_event.flags); + BUG_ON(TRB_TO_SLOT_ID(field) != udev->slot_id); + BUG_ON(TRB_TO_EP_INDEX(field) != ep_index); + BUG_ON(GET_COMP_CODE(le32_to_cpu(event->trans_event.transfer_len + != COMP_STOP))); + xhci_acknowledge_event(ctrl);
- event = xhci_wait_for_event(ctrl, TRB_COMPLETION); - if (!event) - return; + event = xhci_wait_for_event(ctrl, TRB_COMPLETION); + if (!event) + return; + type = TRB_FIELD_TO_TYPE(le32_to_cpu(event->event_cmd.flags));
- BUG_ON(TRB_TO_SLOT_ID(le32_to_cpu(event->event_cmd.flags)) + } else { + printf("abort_td: Expected a TRB_TRANSFER TRB first\n"); + } + + BUG_ON(type != TRB_COMPLETION || + TRB_TO_SLOT_ID(le32_to_cpu(event->event_cmd.flags)) != udev->slot_id || GET_COMP_CODE(le32_to_cpu( event->event_cmd.status)) != COMP_SUCCESS); xhci_acknowledge_event(ctrl); diff --git a/include/usb/xhci.h b/include/usb/xhci.h index 4a4ac10229ac..04d16a256bbd 100644 --- a/include/usb/xhci.h +++ b/include/usb/xhci.h @@ -901,6 +901,8 @@ union xhci_trb {
/* TRB type IDs */ typedef enum { + /* reserved, used as a software sentinel */ + TRB_NONE = 0, /* bulk, interrupt, isoc scatter/gather, and control data stage */ TRB_NORMAL = 1, /* setup stage for control transfers */

On 10/29/23 07:37, Hector Martin wrote:
If the xHC has a problem with our STOP ENDPOINT command, it is likely to return a completion directly instead of first a transfer event for the in-progress transfer. Handle that more gracefully.
We still BUG() on the error code, but at least we don't end up timing out on the event and ending up with unexpected event errors.
Signed-off-by: Hector Martin marcan@marcan.st
Reviewed-by: Marek Vasut marex@denx.de

There is a race where an endpoint may halt by itself while we are trying to halt it, which results in a context state error. See xHCI 4.6.9 which mentions this case.
This also avoids BUGging when we attempt to stop an endpoint which was already stopped to begin with, which is probably a bug elsewhere but not a good reason to crash.
Signed-off-by: Hector Martin marcan@marcan.st --- drivers/usb/host/xhci-ring.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c index d21e76e0bdb6..e02a6e300c4f 100644 --- a/drivers/usb/host/xhci-ring.c +++ b/drivers/usb/host/xhci-ring.c @@ -545,6 +545,7 @@ static void abort_td(struct usb_device *udev, int ep_index) struct xhci_ctrl *ctrl = xhci_get_ctrl(udev); struct xhci_ring *ring = ctrl->devs[udev->slot_id]->eps[ep_index].ring; union xhci_trb *event; + xhci_comp_code comp; trb_type type; u64 addr; u32 field; @@ -573,10 +574,11 @@ static void abort_td(struct usb_device *udev, int ep_index) printf("abort_td: Expected a TRB_TRANSFER TRB first\n"); }
+ comp = GET_COMP_CODE(le32_to_cpu(event->event_cmd.status)); BUG_ON(type != TRB_COMPLETION || TRB_TO_SLOT_ID(le32_to_cpu(event->event_cmd.flags)) - != udev->slot_id || GET_COMP_CODE(le32_to_cpu( - event->event_cmd.status)) != COMP_SUCCESS); + != udev->slot_id || (comp != COMP_SUCCESS && comp + != COMP_CTX_STATE)); xhci_acknowledge_event(ctrl);
addr = xhci_trb_virt_to_dma(ring->enq_seg,

On 10/29/23 07:37, Hector Martin wrote:
There is a race where an endpoint may halt by itself while we are trying to halt it, which results in a context state error. See xHCI 4.6.9 which mentions this case.
This also avoids BUGging when we attempt to stop an endpoint which was already stopped to begin with, which is probably a bug elsewhere but not a good reason to crash.
Signed-off-by: Hector Martin marcan@marcan.st
drivers/usb/host/xhci-ring.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c index d21e76e0bdb6..e02a6e300c4f 100644 --- a/drivers/usb/host/xhci-ring.c +++ b/drivers/usb/host/xhci-ring.c @@ -545,6 +545,7 @@ static void abort_td(struct usb_device *udev, int ep_index) struct xhci_ctrl *ctrl = xhci_get_ctrl(udev); struct xhci_ring *ring = ctrl->devs[udev->slot_id]->eps[ep_index].ring; union xhci_trb *event;
- xhci_comp_code comp; trb_type type; u64 addr; u32 field;
@@ -573,10 +574,11 @@ static void abort_td(struct usb_device *udev, int ep_index) printf("abort_td: Expected a TRB_TRANSFER TRB first\n"); }
- comp = GET_COMP_CODE(le32_to_cpu(event->event_cmd.status)); BUG_ON(type != TRB_COMPLETION || TRB_TO_SLOT_ID(le32_to_cpu(event->event_cmd.flags))
!= udev->slot_id || GET_COMP_CODE(le32_to_cpu(
event->event_cmd.status)) != COMP_SUCCESS);
!= udev->slot_id || (comp != COMP_SUCCESS && comp
!= COMP_CTX_STATE));
Can you please, in the process, reformat this condition so it is more readable ?
xhci_acknowledge_event(ctrl);
addr = xhci_trb_virt_to_dma(ring->enq_seg,
Reviewed-by: Marek Vasut marex@denx.de

There is currently no codepath to recover from this case. In principle we could require that the upper layer do this explicitly, but let's just do it in xHCI when the next bulk transfer is started, since that reasonably implies whatever caused the problem has been dealt with.
Signed-off-by: Hector Martin marcan@marcan.st --- drivers/usb/host/xhci-ring.c | 8 ++++++++ 1 file changed, 8 insertions(+)
diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c index e02a6e300c4f..db8b8f200250 100644 --- a/drivers/usb/host/xhci-ring.c +++ b/drivers/usb/host/xhci-ring.c @@ -671,6 +671,14 @@ int xhci_bulk_tx(struct usb_device *udev, unsigned long pipe,
ep_ctx = xhci_get_ep_ctx(ctrl, virt_dev->out_ctx, ep_index);
+ /* + * If the endpoint was halted due to a prior error, resume it before + * the next transfer. It is the responsibility of the upper layer to + * have dealt with whatever caused the error. + */ + if ((le32_to_cpu(ep_ctx->ep_info) & EP_STATE_MASK) == EP_STATE_HALTED) + reset_ep(udev, ep_index); + ring = virt_dev->eps[ep_index].ring; /* * How much data is (potentially) left before the 64KB boundary?

On 10/29/23 07:37, Hector Martin wrote:
There is currently no codepath to recover from this case. In principle we could require that the upper layer do this explicitly, but let's just do it in xHCI when the next bulk transfer is started, since that reasonably implies whatever caused the problem has been dealt with.
Signed-off-by: Hector Martin marcan@marcan.st
Reviewed-by: Marek Vasut marex@denx.de

This isn't going to work, don't pretend it will and then end up timing out.
Signed-off-by: Hector Martin marcan@marcan.st --- drivers/usb/host/xhci-ring.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c index db8b8f200250..a969eafdc8ee 100644 --- a/drivers/usb/host/xhci-ring.c +++ b/drivers/usb/host/xhci-ring.c @@ -243,7 +243,8 @@ static int prepare_ring(struct xhci_ctrl *ctrl, struct xhci_ring *ep_ring, puts("WARN waiting for error on ep to be cleared\n"); return -EINVAL; case EP_STATE_HALTED: - puts("WARN halted endpoint, queueing URB anyway.\n"); + puts("WARN endpoint is halted\n"); + return -EINVAL; case EP_STATE_STOPPED: case EP_STATE_RUNNING: debug("EP STATE RUNNING.\n");

On 10/29/23 07:37, Hector Martin wrote:
This isn't going to work, don't pretend it will and then end up timing out.
Signed-off-by: Hector Martin marcan@marcan.st
drivers/usb/host/xhci-ring.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c index db8b8f200250..a969eafdc8ee 100644 --- a/drivers/usb/host/xhci-ring.c +++ b/drivers/usb/host/xhci-ring.c @@ -243,7 +243,8 @@ static int prepare_ring(struct xhci_ctrl *ctrl, struct xhci_ring *ep_ring, puts("WARN waiting for error on ep to be cleared\n"); return -EINVAL; case EP_STATE_HALTED:
puts("WARN halted endpoint, queueing URB anyway.\n");
puts("WARN endpoint is halted\n");
return -EINVAL;
Reviewed-by: Marek Vasut marex@denx.de

Now that we always check the return value, just return NULL on timeouts. We can still log the error since this is a problem, but it's not reason to panic.
Signed-off-by: Hector Martin marcan@marcan.st --- drivers/usb/host/xhci-ring.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c index a969eafdc8ee..ae0ab5744df0 100644 --- a/drivers/usb/host/xhci-ring.c +++ b/drivers/usb/host/xhci-ring.c @@ -494,8 +494,9 @@ union xhci_trb *xhci_wait_for_event(struct xhci_ctrl *ctrl, trb_type expected) if (expected == TRB_TRANSFER) return NULL;
- printf("XHCI timeout on event type %d... cannot recover.\n", expected); - BUG(); + printf("XHCI timeout on event type %d...\n", expected); + + return NULL; }
/*

On 10/29/23 07:37, Hector Martin wrote:
Now that we always check the return value, just return NULL on timeouts. We can still log the error since this is a problem, but it's not reason to panic.
Signed-off-by: Hector Martin marcan@marcan.st
drivers/usb/host/xhci-ring.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c index a969eafdc8ee..ae0ab5744df0 100644 --- a/drivers/usb/host/xhci-ring.c +++ b/drivers/usb/host/xhci-ring.c @@ -494,8 +494,9 @@ union xhci_trb *xhci_wait_for_event(struct xhci_ctrl *ctrl, trb_type expected) if (expected == TRB_TRANSFER) return NULL;
- printf("XHCI timeout on event type %d... cannot recover.\n", expected);
- BUG();
- printf("XHCI timeout on event type %d...\n", expected);
- return NULL;
Reviewed-by: Marek Vasut marex@denx.de

We need to get the DMA address before incrementing the pointer, as that might move us onto another segment.
Signed-off-by: Hector Martin marcan@marcan.st --- drivers/usb/host/xhci-ring.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c index ae0ab5744df0..b60661fe05e7 100644 --- a/drivers/usb/host/xhci-ring.c +++ b/drivers/usb/host/xhci-ring.c @@ -202,6 +202,7 @@ static dma_addr_t queue_trb(struct xhci_ctrl *ctrl, struct xhci_ring *ring, bool more_trbs_coming, unsigned int *trb_fields) { struct xhci_generic_trb *trb; + dma_addr_t addr; int i;
trb = &ring->enqueue->generic; @@ -211,9 +212,11 @@ static dma_addr_t queue_trb(struct xhci_ctrl *ctrl, struct xhci_ring *ring,
xhci_flush_cache((uintptr_t)trb, sizeof(struct xhci_generic_trb));
+ addr = xhci_trb_virt_to_dma(ring->enq_seg, (union xhci_trb *)trb); + inc_enq(ctrl, ring, more_trbs_coming);
- return xhci_trb_virt_to_dma(ring->enq_seg, (union xhci_trb *)trb); + return addr; }
/**

On 10/29/23 07:37, Hector Martin wrote:
We need to get the DMA address before incrementing the pointer, as that might move us onto another segment.
Signed-off-by: Hector Martin marcan@marcan.st
Reviewed-by: Marek Vasut marex@denx.de

A bunch of miscellaneous debug messages to aid in working out USB issues.
Signed-off-by: Hector Martin marcan@marcan.st --- drivers/usb/host/xhci-ring.c | 29 ++++++++++++++++++++++++++--- 1 file changed, 26 insertions(+), 3 deletions(-)
diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c index b60661fe05e7..dabe6cf86af2 100644 --- a/drivers/usb/host/xhci-ring.c +++ b/drivers/usb/host/xhci-ring.c @@ -214,6 +214,9 @@ static dma_addr_t queue_trb(struct xhci_ctrl *ctrl, struct xhci_ring *ring,
addr = xhci_trb_virt_to_dma(ring->enq_seg, (union xhci_trb *)trb);
+ debug("trb @ %llx: %08x %08x %08x %08x\n", addr, + trb_fields[0], trb_fields[1], trb_fields[2], trb_fields[3]); + inc_enq(ctrl, ring, more_trbs_coming);
return addr; @@ -296,6 +299,8 @@ void xhci_queue_command(struct xhci_ctrl *ctrl, dma_addr_t addr, u32 slot_id, { u32 fields[4];
+ debug("CMD: %llx 0x%x 0x%x %d\n", addr, slot_id, ep_index, cmd); + BUG_ON(prepare_ring(ctrl, ctrl->cmd_ring, EP_STATE_RUNNING));
fields[0] = lower_32_bits(addr); @@ -471,8 +476,14 @@ union xhci_trb *xhci_wait_for_event(struct xhci_ctrl *ctrl, trb_type expected)
type = TRB_FIELD_TO_TYPE(le32_to_cpu(event->event_cmd.flags)); if (type == expected || - (expected == TRB_NONE && type != TRB_PORT_STATUS)) + (expected == TRB_NONE && type != TRB_PORT_STATUS)) { + debug("Event: %08x %08x %08x %08x\n", + le32_to_cpu(event->generic.field[0]), + le32_to_cpu(event->generic.field[1]), + le32_to_cpu(event->generic.field[2]), + le32_to_cpu(event->generic.field[3])); return event; + }
if (type == TRB_PORT_STATUS) /* TODO: remove this once enumeration has been reworked */ @@ -484,8 +495,9 @@ union xhci_trb *xhci_wait_for_event(struct xhci_ctrl *ctrl, trb_type expected) le32_to_cpu(event->generic.field[2])) != COMP_SUCCESS); else - printf("Unexpected XHCI event TRB, skipping... " + printf("Unexpected XHCI event TRB, expected %d... " "(%08x %08x %08x %08x)\n", + expected, le32_to_cpu(event->generic.field[0]), le32_to_cpu(event->generic.field[1]), le32_to_cpu(event->generic.field[2]), @@ -602,10 +614,13 @@ static void abort_td(struct usb_device *udev, int ep_index) static void record_transfer_result(struct usb_device *udev, union xhci_trb *event, int length) { + xhci_comp_code code = GET_COMP_CODE( + le32_to_cpu(event->trans_event.transfer_len)); + udev->act_len = min(length, length - (int)EVENT_TRB_LEN(le32_to_cpu(event->trans_event.transfer_len)));
- switch (GET_COMP_CODE(le32_to_cpu(event->trans_event.transfer_len))) { + switch (code) { case COMP_SUCCESS: BUG_ON(udev->act_len != length); /* fallthrough */ @@ -613,16 +628,23 @@ static void record_transfer_result(struct usb_device *udev, udev->status = 0; break; case COMP_STALL: + debug("Xfer STALL\n"); udev->status = USB_ST_STALLED; break; case COMP_DB_ERR: + debug("Xfer DB_ERR\n"); + udev->status = USB_ST_BUF_ERR; + break; case COMP_TRB_ERR: + debug("Xfer TRB_ERR\n"); udev->status = USB_ST_BUF_ERR; break; case COMP_BABBLE: + debug("Xfer BABBLE\n"); udev->status = USB_ST_BABBLE_DET; break; default: + debug("Xfer error: %d\n", code); udev->status = 0x80; /* USB_ST_TOO_LAZY_TO_MAKE_A_NEW_MACRO */ } } @@ -1016,6 +1038,7 @@ int xhci_ctrl_tx(struct usb_device *udev, unsigned long pipe, record_transfer_result(udev, event, length); xhci_acknowledge_event(ctrl); if (udev->status == USB_ST_STALLED) { + debug("EP %d stalled\n", ep_index); reset_ep(udev, ep_index); return -EPIPE; }

On 10/29/23 07:37, Hector Martin wrote:
A bunch of miscellaneous debug messages to aid in working out USB issues.
Signed-off-by: Hector Martin marcan@marcan.st
drivers/usb/host/xhci-ring.c | 29 ++++++++++++++++++++++++++--- 1 file changed, 26 insertions(+), 3 deletions(-)
diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c index b60661fe05e7..dabe6cf86af2 100644 --- a/drivers/usb/host/xhci-ring.c +++ b/drivers/usb/host/xhci-ring.c @@ -214,6 +214,9 @@ static dma_addr_t queue_trb(struct xhci_ctrl *ctrl, struct xhci_ring *ring,
addr = xhci_trb_virt_to_dma(ring->enq_seg, (union xhci_trb *)trb);
- debug("trb @ %llx: %08x %08x %08x %08x\n", addr,
trb_fields[0], trb_fields[1], trb_fields[2], trb_fields[3]);
Could you please use dev_dbg() instead of debug() ? That way you'd also get the device prefix in the output.

On Sun, Oct 29, 2023 at 2:38 AM Hector Martin marcan@marcan.st wrote:
This series is the first of a few bundles of USB fixes we have been carrying downstream on the Asahi U-Boot branch for a few months.
Patches #1-#6 fix a series of related robustness issues. Certain conditions related to endpoint stalls revealed a chain of bugs throughout the stack that caused U-Boot to completely fail when encountering some errors (and errors are a fact of life with USB).
Example scenario:
- The device stalls a bulk endpoint due to an error
- The upper layer driver tries to use the endpoint again
- There is no endpoint stall clear wired up in the API, so for starters this is doomed to fail (fix: #4)
- xHCI knows the endpoint is halted, but tries to queue the TRB anyway, which can't work (fix: #5)
- Since the endpoint is halted nothing happens, so the transfer times out.
- The timeout handling tries to abort the transfer
- abort_td() tries to stop the endpoint, but since it is already halted, this results in a context state error. As the transfer never started, there is no completion event, so xhci_wait_for_event() is waiting for the wrong event type, and logs an error and returns NULL. (fix: #2)
- abort_td() crashes due to failing to guard against the NULL event (fix: #1)
- Even after fixing all that, abort_td() would not handle the context state error properly and BUG() (fix: #3). This also fixes a race condition documented in the xHCI spec that could occur even in the absence of all the other bugs.
Patch #6 addresses a related robustness issue where xhci_wait_for_event() panics on event timeouts other than for transfers. While this is clearly an unexpected condition and indicates a bug elsewhere, it's no reason to outright crash. USB is notoriously impossible to get 100% right, and we can't afford to be breaking users' systems at any sign of trouble. Error conditions should always be dealt with as gracefully as possible (even if that results in a completely broken USB controller, that is much preferable to aborting the boot process entirely, especially on devices with non-USB storage and keyboards where USB support is effectively optional for most users). Since after patch #1 we now guard all callers to xhci_wait_for_event() with at least trivial NULL checks, it's okay to return and continue in case of timeouts.
Patch #7 addresses an unrelated bug I ran into while debugging all this, and patch #8 adds extra debug logs to make finding future issues less painful.
I believe this should fix this Fedora bug too, which is either an instance of the above sequence of events, or (more likely, given the difficulty reproducing) the race condition documented in xHCI 4.6.9:
https://bugzilla.redhat.com/show_bug.cgi?id=2244305
Signed-off-by: Hector Martin marcan@marcan.st
Changes in v2:
- Squashed in a trivial fix for patch #1
- Removed spurious blank line
- Added a longer description to the cover letter
- Link to v1: https://lore.kernel.org/r/20231027-usb-fixes-1-v1-0-1c879bbcd928@marcan.st
Hector Martin (8): usb: xhci: Guard all calls to xhci_wait_for_event usb: xhci: Better error handling in abort_td() usb: xhci: Allow context state errors when halting an endpoint usb: xhci: Recover from halted bulk endpoints usb: xhci: Fail on attempt to queue TRBs to a halted endpoint usb: xhci: Do not panic on event timeouts usb: xhci: Fix DMA address calculation in queue_trb usb: xhci: Add more debugging
drivers/usb/host/xhci-ring.c | 99 ++++++++++++++++++++++++++++++++++++-------- drivers/usb/host/xhci.c | 9 ++++ include/usb/xhci.h | 2 + 3 files changed, 92 insertions(+), 18 deletions(-)
base-commit: 7c0d668103abae3ec14cd96d882ba20134bb4529 change-id: 20231027-usb-fixes-1-83bfc7013012
Series LGTM.
Reviewed-by: Neal Gompa neal@gompa.dev
participants (3)
-
Hector Martin
-
Marek Vasut
-
Neal Gompa