[U-Boot] [PATCH] usb: fix usb start problem with SMSC USB hub and Toshiba USB stick

Checking the status field of the qTD token in the current code do not take into acount cases where endpoint stall (halted) bit is set together with some other status bits. As a result clearing stall on an endpoint won't be done if other status bits were set.
E.g. 'usb start' often fails with Toshiba USB stick 0x930/0x6545 connected to the SMSC USB 2.0 hub 0x424/0x2514. Debugging the issue showed that while bulk IN transfers with length of 13 or 18 the status field of the qTD token sometimes indicates trans- action error (XactErr) and sometimes additionally endpoint halted state. In the latter case resetting the USB device in error recovery code fails as no clear stall request on the endpoint will be done. The patch fixes status field checking code to properly handle endpoint halted state.
However this fix is not enough to solve 'usb start' problem with hub/stick combination mentioned above. Running with lot of debug code in ehci_submit_async() I've never seen the problem with usb stick recognition. After removing this debug code the similar problem sometimes showed up again. Therefore the patch also adds delay in ehci_submit_async() for above-mentioned hub/stick combination. Even without this delay the fix is an improvement since it fixes the problem with board freezy after subsequent failed 'usb start/stop' cycles as it was observed on mpc5121ads board.
Signed-off-by: Anatolij Gustschin agust@denx.de --- common/usb_storage.c | 5 +++-- drivers/usb/host/ehci-hcd.c | 8 ++++++++ 2 files changed, 11 insertions(+), 2 deletions(-)
diff --git a/common/usb_storage.c b/common/usb_storage.c index 76949b8..5ca92c3 100644 --- a/common/usb_storage.c +++ b/common/usb_storage.c @@ -680,7 +680,8 @@ int usb_stor_BBB_transport(ccb *srb, struct us_data *us) result = usb_bulk_msg(us->pusb_dev, pipe, srb->pdata, srb->datalen, &data_actlen, USB_CNTL_TIMEOUT * 5); /* special handling of STALL in DATA phase */ - if ((result < 0) && (us->pusb_dev->status & USB_ST_STALLED)) { + if ((result < 0) && + (us->pusb_dev->status & (USB_ST_STALLED | USB_ST_CRC_ERR))) { USB_STOR_PRINTF("DATA:stall\n"); /* clear the STALL on the endpoint */ result = usb_stor_BBB_clear_endpt_stall(us, @@ -710,7 +711,7 @@ again:
/* special handling of STALL in STATUS phase */ if ((result < 0) && (retry < 1) && - (us->pusb_dev->status & USB_ST_STALLED)) { + (us->pusb_dev->status & (USB_ST_STALLED | USB_ST_CRC_ERR))) { USB_STOR_PRINTF("STATUS:stall\n"); /* clear the STALL on the endpoint */ result = usb_stor_BBB_clear_endpt_stall(us, us->ep_in); diff --git a/drivers/usb/host/ehci-hcd.c b/drivers/usb/host/ehci-hcd.c index 37d056e..7463a75 100644 --- a/drivers/usb/host/ehci-hcd.c +++ b/drivers/usb/host/ehci-hcd.c @@ -430,6 +430,12 @@ ehci_submit_async(struct usb_device *dev, unsigned long pipe, void *buffer, usbsts = ehci_readl(&hcor->or_usbsts); ehci_writel(&hcor->or_usbsts, (usbsts & 0x3f));
+ if (dev->descriptor.idVendor == 0x930 && + dev->descriptor.idProduct == 0x6545 && + dev->parent->descriptor.idVendor == 0x424 && + dev->parent->descriptor.idProduct == 0x2514) + wait_ms(10); + /* Enable async. schedule. */ cmd = ehci_readl(&hcor->or_usbcmd); cmd |= CMD_ASE; @@ -490,6 +496,8 @@ ehci_submit_async(struct usb_device *dev, unsigned long pipe, void *buffer, break; default: dev->status = USB_ST_CRC_ERR; + if ((token & 0x40) == 0x40) + dev->status |= USB_ST_STALLED; break; } dev->act_len = length - ((token >> 16) & 0x7fff);

Hi,
2010/8/31 Anatolij Gustschin agust@denx.de:
Checking the status field of the qTD token in the current code do not take into acount cases where endpoint stall (halted) bit is set together with some other status bits. As a result clearing stall on an endpoint won't be done if other status bits were set.
Reading this description and this code:
/* special handling of STALL in DATA phase */
if ((result < 0) && (us->pusb_dev->status & USB_ST_STALLED)) {
if ((result < 0) &&
(us->pusb_dev->status & (USB_ST_STALLED | USB_ST_CRC_ERR))) { USB_STOR_PRINTF("DATA:stall\n");
Description and code do not seem to match. The old implementation explicitly checked if the STALLED bit was set, and if so the endpoint would be cleared. Now, it seems you are clearing the endpoint _also_ when the CRC_ERR bit is set.
There are a lot more reasons, at least on other host controllers that set the status USB_ST_CRC_ERR, does this change not break the other code? (A simple grep will show you the situations where it is returned.)
E.g. 'usb start' often fails with Toshiba USB stick 0x930/0x6545 connected to the SMSC USB 2.0 hub 0x424/0x2514. Debugging the issue showed that while bulk IN transfers with length of 13 or 18 the status field of the qTD token sometimes indicates trans- action error (XactErr) and sometimes additionally endpoint halted state. In the latter case resetting the USB device in error recovery code fails as no clear stall request on the endpoint will be done.
OK
However this fix is not enough to solve 'usb start' problem with hub/stick combination mentioned above. Running with lot of debug code in ehci_submit_async() I've never seen the problem with usb stick recognition. After removing this debug code the similar problem sometimes showed up again. Therefore the patch also adds delay in ehci_submit_async() for above-mentioned hub/stick combination. Even without this delay the fix is an
Why only for these USB sticks? Are these sticks special among the thousands of different sticks out there? Or could it be more reliable to always do this delay for all USB sticks? Or even better, what are we missing here that the delay is needed at all?
diff --git a/drivers/usb/host/ehci-hcd.c b/drivers/usb/host/ehci-hcd.c index 37d056e..7463a75 100644 --- a/drivers/usb/host/ehci-hcd.c +++ b/drivers/usb/host/ehci-hcd.c
Is this problem only valid for the EHCI code? I can imagine that the other host controllers (like UHCI and OHCI and so on) code suffer the same problem and need similar fixes... (At least I know that I have here a couple of USB sticks that show similar problems with OHCI ;-) )
Kind regards,
Remy

Dear Anatolij,
In message AANLkTikbnmdtJNCjd-pjeHWsw+Ng8sTF1iZT1utG6JS8@mail.gmail.com Remy Bohmer wrote:
Hi,
2010/8/31 Anatolij Gustschin agust@denx.de:
Checking the status field of the qTD token in the current code do not take into acount cases where endpoint stall (halted) bit is set together with some other status bits. As a result clearing stall on an endpoint won't be done if other status bits were set.
Reading this description and this code:
/* special handling of STALL in DATA phase */
if ((result < 0) && (us->pusb_dev->status & USB_ST_STALLED)) {
if ((result < 0) &&
(us->pusb_dev->status & (USB_ST_STALLED | USB_ST_CRC_ERR))) { USB_STOR_PRINTF("DATA:stall\n");
Description and code do not seem to match. The old implementation explicitly checked if the STALLED bit was set, and if so the endpoint would be cleared. Now, it seems you are clearing the endpoint _also_ when the CRC_ERR bit is set.
There are a lot more reasons, at least on other host controllers that set the status USB_ST_CRC_ERR, does this change not break the other code? (A simple grep will show you the situations where it is returned.)
E.g. 'usb start' often fails with Toshiba USB stick 0x930/0x6545 connected to the SMSC USB 2.0 hub 0x424/0x2514. Debugging the issue showed that while bulk IN transfers with length of 13 or 18 the status field of the qTD token sometimes indicates trans- action error (XactErr) and sometimes additionally endpoint halted state. In the latter case resetting the USB device in error recovery code fails as no clear stall request on the endpoint will be done.
OK
However this fix is not enough to solve 'usb start' problem with hub/stick combination mentioned above. Running with lot of debug code in ehci_submit_async() I've never seen the problem with usb stick recognition. After removing this debug code the similar problem sometimes showed up again. Therefore the patch also adds delay in ehci_submit_async() for above-mentioned hub/stick combination. Even without this delay the fix is an
Why only for these USB sticks? Are these sticks special among the thousands of different sticks out there? Or could it be more reliable to always do this delay for all USB sticks? Or even better, what are we missing here that the delay is needed at all?
diff --git a/drivers/usb/host/ehci-hcd.c b/drivers/usb/host/ehci-hcd.c index 37d056e..7463a75 100644 --- a/drivers/usb/host/ehci-hcd.c +++ b/drivers/usb/host/ehci-hcd.c
Is this problem only valid for the EHCI code? I can imagine that the other host controllers (like UHCI and OHCI and so on) code suffer the same problem and need similar fixes... (At least I know that I have here a couple of USB sticks that show similar problems with OHCI ;-) )
As far as I can tell you never replied to thesequestions, with the result that this patch did not make it into mainline yet.
Could you please have a look? Thanks.
Best regards,
Wolfgang Denk

Hi Remy,
On Wed, 1 Sep 2010 10:49:22 +0200 Remy Bohmer linux@bohmer.net wrote: ...
2010/8/31 Anatolij Gustschin agust@denx.de:
Checking the status field of the qTD token in the current code do not take into acount cases where endpoint stall (halted) bit is set together with some other status bits. As a result clearing stall on an endpoint won't be done if other status bits were set.
Reading this description and this code:
/* special handling of STALL in DATA phase */
if ((result < 0) && (us->pusb_dev->status &
USB_ST_STALLED)) {
if ((result < 0) &&
(us->pusb_dev->status & (USB_ST_STALLED |
USB_ST_CRC_ERR))) { USB_STOR_PRINTF("DATA:stall\n");
Description and code do not seem to match. The old implementation explicitly checked if the STALLED bit was set, and if so the endpoint would be cleared. Now, it seems you are clearing the endpoint _also_ when the CRC_ERR bit is set.
Yes. This was intentional but unfortunately not directly mentioned in the patch description. The reason is the following:
in all cases where transaction error (XactErr) was reported with additionally STALLED bit set, the handling of the STALL condition successfully recovered from the error case and the device was recognized.
in cases where only transaction error was reported (STALLED bit not set) the recovery from the error case didn't succeed and "Device NOT ready" (Request Sense returned 06 28 00) status was finally reported. I didn't find the information in the EHCI spec how the recovery from the XactErr should be done properly. Do you know how to handle it?
There are a lot more reasons, at least on other host controllers that set the status USB_ST_CRC_ERR, does this change not break the other code? (A simple grep will show you the situations where it is returned.)
This change will probably break other code. I'll drop it here and additionally set USB_ST_STALLED flag in the ehci driver instead.
E.g. 'usb start' often fails with Toshiba USB stick 0x930/0x6545 connected to the SMSC USB 2.0 hub 0x424/0x2514. Debugging the issue showed that while bulk IN transfers with length of 13 or 18 the status field of the qTD token sometimes indicates trans- action error (XactErr) and sometimes additionally endpoint halted state. In the latter case resetting the USB device in error recovery code fails as no clear stall request on the endpoint will be done.
OK
However this fix is not enough to solve 'usb start' problem with hub/stick combination mentioned above. Running with lot of debug code in ehci_submit_async() I've never seen the problem with usb stick recognition. After removing this debug code the similar problem sometimes showed up again. Therefore the patch also adds delay in ehci_submit_async() for above-mentioned hub/stick combination. Even without this delay the fix is an
Why only for these USB sticks? Are these sticks special among the thousands of different sticks out there?
I do not have the right equipment to analyze this problem properly. This is the special USB stick-hub combination. The same USB stick is always recognized properly if it is connected directly. Other USB sticks also work with this USB hub.
Or could it be more reliable to always do this delay for all USB sticks? Or even better, what are we missing here that the delay is needed at all?
Doing this delay for other USB sticks will negatively affect data transfer speed.
diff --git a/drivers/usb/host/ehci-hcd.c b/drivers/usb/host/ehci-hcd.c index 37d056e..7463a75 100644 --- a/drivers/usb/host/ehci-hcd.c +++ b/drivers/usb/host/ehci-hcd.c
Is this problem only valid for the EHCI code? I can imagine that the other host controllers (like UHCI and OHCI and so on) code suffer the same problem and need similar fixes... (At least I know that I have here a couple of USB sticks that show similar problems with OHCI ;-) )
Maybe, maybe not. The problem shows up on SheevaPlug and mpc5121ads boards, both using USB EHCI. I do not have the possibility to test it on other host controllers currently.
Best regards, Anatolij

Hi,
2010/10/27 Anatolij Gustschin agust@denx.de:
Hi Remy,
On Wed, 1 Sep 2010 10:49:22 +0200 Remy Bohmer linux@bohmer.net wrote: ...
2010/8/31 Anatolij Gustschin agust@denx.de:
Checking the status field of the qTD token in the current code do not take into acount cases where endpoint stall (halted) bit is set together with some other status bits. As a result clearing stall on an endpoint won't be done if other status bits were set.
Reading this description and this code:
/* special handling of STALL in DATA phase */
- if ((result < 0) && (us->pusb_dev->status &
USB_ST_STALLED)) {
- if ((result < 0) &&
- (us->pusb_dev->status & (USB_ST_STALLED |
USB_ST_CRC_ERR))) { USB_STOR_PRINTF("DATA:stall\n");
Description and code do not seem to match. The old implementation explicitly checked if the STALLED bit was set, and if so the endpoint would be cleared. Now, it seems you are clearing the endpoint _also_ when the CRC_ERR bit is set.
Yes. This was intentional but unfortunately not directly mentioned in the patch description. The reason is the following:
in all cases where transaction error (XactErr) was reported with additionally STALLED bit set, the handling of the STALL condition successfully recovered from the error case and the device was recognized.
in cases where only transaction error was reported (STALLED bit not set) the recovery from the error case didn't succeed and "Device NOT ready" (Request Sense returned 06 28 00) status was finally reported. I didn't find the information in the EHCI spec how the recovery from the XactErr should be done properly. Do you know how to handle it?
There are a lot more reasons, at least on other host controllers that set the status USB_ST_CRC_ERR, does this change not break the other code? (A simple grep will show you the situations where it is returned.)
This change will probably break other code. I'll drop it here and additionally set USB_ST_STALLED flag in the ehci driver instead.
OK. I will ignore this version of this patch. I assume you will post a new revision of this patch?
Kind regards,
Remy

Hi,
On Fri, 29 Oct 2010 15:57:54 +0200 Remy Bohmer linux@bohmer.net wrote: ...
in cases where only transaction error was reported (STALLED bit not set) the recovery from the error case didn't succeed and "Device NOT ready" (Request Sense returned 06 28 00) status was finally reported. I didn't find the information in the EHCI spec how the recovery from the XactErr should be done properly. Do you know how to handle it?
There are a lot more reasons, at least on other host controllers that set the status USB_ST_CRC_ERR, does this change not break the other code? (A simple grep will show you the situations where it is returned.)
This change will probably break other code. I'll drop it here and additionally set USB_ST_STALLED flag in the ehci driver instead.
OK. I will ignore this version of this patch. I assume you will post a new revision of this patch?
Yes, my intention is to post v2 patch.
Thanks, Anatolij

Dear Remy,
In message 1283280314-10700-1-git-send-email-agust@denx.de Anatolij Gustschin wrote:
Checking the status field of the qTD token in the current code do not take into acount cases where endpoint stall (halted) bit is set together with some other status bits. As a result clearing stall on an endpoint won't be done if other status bits were set.
E.g. 'usb start' often fails with Toshiba USB stick 0x930/0x6545 connected to the SMSC USB 2.0 hub 0x424/0x2514. Debugging the issue showed that while bulk IN transfers with length of 13 or 18 the status field of the qTD token sometimes indicates trans- action error (XactErr) and sometimes additionally endpoint halted state. In the latter case resetting the USB device in error recovery code fails as no clear stall request on the endpoint will be done. The patch fixes status field checking code to properly handle endpoint halted state.
However this fix is not enough to solve 'usb start' problem with hub/stick combination mentioned above. Running with lot of debug code in ehci_submit_async() I've never seen the problem with usb stick recognition. After removing this debug code the similar problem sometimes showed up again. Therefore the patch also adds delay in ehci_submit_async() for above-mentioned hub/stick combination. Even without this delay the fix is an improvement since it fixes the problem with board freezy after subsequent failed 'usb start/stop' cycles as it was observed on mpc5121ads board.
Signed-off-by: Anatolij Gustschin agust@denx.de
common/usb_storage.c | 5 +++-- drivers/usb/host/ehci-hcd.c | 8 ++++++++ 2 files changed, 11 insertions(+), 2 deletions(-)
Any comments on this?
Best regards,
Wolfgang Denk

Hi,
2010/10/12 Wolfgang Denk wd@denx.de:
Dear Remy,
In message 1283280314-10700-1-git-send-email-agust@denx.de Anatolij Gustschin wrote:
Checking the status field of the qTD token in the current code do not take into acount cases where endpoint stall (halted) bit is set together with some other status bits. As a result clearing stall on an endpoint won't be done if other status bits were set.
<snip>
Signed-off-by: Anatolij Gustschin agust@denx.de
common/usb_storage.c | 5 +++-- drivers/usb/host/ehci-hcd.c | 8 ++++++++ 2 files changed, 11 insertions(+), 2 deletions(-)
Any comments on this?
I had, look here: http://lists.denx.de/pipermail/u-boot/2010-September/076537.html It is silence on this patch since.
Kind regards,
Remy

Checking the status field of the qTD token in the current code do not take into acount cases where endpoint stall (halted) bit is set together with XactErr status bit. As a result clearing stall on an endpoint won't be done if this status bit was also set. Check for halted bit and report USB_ST_STALLED status if the host controller also indicates endpoit stall condition.
Signed-off-by: Anatolij Gustschin agust@denx.de --- drivers/usb/host/ehci-hcd.c | 2 ++ 1 files changed, 2 insertions(+), 0 deletions(-)
diff --git a/drivers/usb/host/ehci-hcd.c b/drivers/usb/host/ehci-hcd.c index 982f96e..c7fba10 100644 --- a/drivers/usb/host/ehci-hcd.c +++ b/drivers/usb/host/ehci-hcd.c @@ -491,6 +491,8 @@ ehci_submit_async(struct usb_device *dev, unsigned long pipe, void *buffer, break; default: dev->status = USB_ST_CRC_ERR; + if ((token & 0x40) == 0x40) + dev->status |= USB_ST_STALLED; break; } dev->act_len = length - ((token >> 16) & 0x7fff);

Hi,
2010/11/2 Anatolij Gustschin agust@denx.de:
Checking the status field of the qTD token in the current code do not take into acount cases where endpoint stall (halted) bit is set together with XactErr status bit. As a result clearing stall on an endpoint won't be done if this status bit was also set. Check for halted bit and report USB_ST_STALLED status if the host controller also indicates endpoit stall condition.
Signed-off-by: Anatolij Gustschin agust@denx.de
drivers/usb/host/ehci-hcd.c | 2 ++ 1 files changed, 2 insertions(+), 0 deletions(-)
diff --git a/drivers/usb/host/ehci-hcd.c b/drivers/usb/host/ehci-hcd.c index 982f96e..c7fba10 100644 --- a/drivers/usb/host/ehci-hcd.c +++ b/drivers/usb/host/ehci-hcd.c @@ -491,6 +491,8 @@ ehci_submit_async(struct usb_device *dev, unsigned long pipe, void *buffer, break; default: dev->status = USB_ST_CRC_ERR;
- if ((token & 0x40) == 0x40)
- dev->status |= USB_ST_STALLED;
break; } dev->act_len = length - ((token >> 16) & 0x7fff);
Applied to u-boot-usb
Thanks.
Remy

'usb start' command often fails with Toshiba USB stick 0x930/0x6545 connected to the SMSC USB 2.0 hub 0x424/0x2514. Debugging the issue showed that while bulk IN transfers with length of 13 or 18 the status field of the qTD token sometimes indicates trans- action error (XactErr) and sometimes additionally endpoint halted state. In the latter case resetting the USB device in error handling code works after clearing stall request on the endpoint. In the case where only XactErr bit was set the resetting doesn't work properly and device not ready status will be finally reported. This patch adds reporting endpoint stall status in case of trans- action errors for this hub/stick combination so that the error handling code can reset the device after clearing stall request to the endpoint.
However this fix is not enough to solve 'usb start' problem with hub/stick combination mentioned above. Running with lot of debug code in ehci_submit_async() I've never seen the problem with usb stick recognition. After removing this debug code the similar problem sometimes showed up again. Therefore the patch also adds delay in ehci_submit_async() for above-mentioned hub/stick combination. Even without this delay the fix is an improvement since it fixes the problem with board freezy after subsequent failed 'usb start/stop' cycles as it was observed on mpc5121ads board.
Signed-off-by: Anatolij Gustschin agust@denx.de --- drivers/usb/host/ehci-hcd.c | 12 ++++++++++++ 1 files changed, 12 insertions(+), 0 deletions(-)
diff --git a/drivers/usb/host/ehci-hcd.c b/drivers/usb/host/ehci-hcd.c index c7fba10..a001143 100644 --- a/drivers/usb/host/ehci-hcd.c +++ b/drivers/usb/host/ehci-hcd.c @@ -431,6 +431,12 @@ ehci_submit_async(struct usb_device *dev, unsigned long pipe, void *buffer, usbsts = ehci_readl(&hcor->or_usbsts); ehci_writel(&hcor->or_usbsts, (usbsts & 0x3f));
+ if (dev->descriptor.idVendor == 0x930 && + dev->descriptor.idProduct == 0x6545 && + dev->parent->descriptor.idVendor == 0x424 && + dev->parent->descriptor.idProduct == 0x2514) + wait_ms(10); + /* Enable async. schedule. */ cmd = ehci_readl(&hcor->or_usbcmd); cmd |= CMD_ASE; @@ -493,6 +499,12 @@ ehci_submit_async(struct usb_device *dev, unsigned long pipe, void *buffer, dev->status = USB_ST_CRC_ERR; if ((token & 0x40) == 0x40) dev->status |= USB_ST_STALLED; + + if (dev->descriptor.idVendor == 0x930 && + dev->descriptor.idProduct == 0x6545 && + dev->parent->descriptor.idVendor == 0x424 && + dev->parent->descriptor.idProduct == 0x2514) + dev->status |= USB_ST_STALLED; break; } dev->act_len = length - ((token >> 16) & 0x7fff);

Hi,
2010/11/2 Anatolij Gustschin agust@denx.de:
'usb start' command often fails with Toshiba USB stick 0x930/0x6545 connected to the SMSC USB 2.0 hub 0x424/0x2514. Debugging the issue showed that while bulk IN transfers with length of 13 or 18 the status field of the qTD token sometimes indicates trans- action error (XactErr) and sometimes additionally endpoint halted state. In the latter case resetting the USB device in error handling code works after clearing stall request on the endpoint. In the case where only XactErr bit was set the resetting doesn't work properly and device not ready status will be finally reported. This patch adds reporting endpoint stall status in case of trans- action errors for this hub/stick combination so that the error handling code can reset the device after clearing stall request to the endpoint.
However this fix is not enough to solve 'usb start' problem with hub/stick combination mentioned above. Running with lot of debug code in ehci_submit_async() I've never seen the problem with usb stick recognition. After removing this debug code the similar problem sometimes showed up again. Therefore the patch also adds delay in ehci_submit_async() for above-mentioned hub/stick combination. Even without this delay the fix is an improvement since it fixes the problem with board freezy after subsequent failed 'usb start/stop' cycles as it was observed on mpc5121ads board.
Signed-off-by: Anatolij Gustschin agust@denx.de
drivers/usb/host/ehci-hcd.c | 12 ++++++++++++ 1 files changed, 12 insertions(+), 0 deletions(-)
diff --git a/drivers/usb/host/ehci-hcd.c b/drivers/usb/host/ehci-hcd.c index c7fba10..a001143 100644 --- a/drivers/usb/host/ehci-hcd.c +++ b/drivers/usb/host/ehci-hcd.c @@ -431,6 +431,12 @@ ehci_submit_async(struct usb_device *dev, unsigned long pipe, void *buffer, usbsts = ehci_readl(&hcor->or_usbsts); ehci_writel(&hcor->or_usbsts, (usbsts & 0x3f));
- if (dev->descriptor.idVendor == 0x930 &&
- dev->descriptor.idProduct == 0x6545 &&
- dev->parent->descriptor.idVendor == 0x424 &&
- dev->parent->descriptor.idProduct == 0x2514)
- wait_ms(10);
I have to say that I have a problem with this construction. This will solve only 1 particular situation where the user has exactly this USB stick and exactly this hub. If he uses another hub, he can have the same problem. If he use another USB-stick he also can run into this problem. What is the real cause of this issue, and can it be solved in a generic way? I feel the problem is still not understood.
So, I NAK this patch.. Sorry...
Remy

Dear Remy Bohmer,
In message AANLkTikqEFtB=XEAV1G6xyvOY3HW3imCPiFiLD0kVoG6@mail.gmail.com you wrote:
if (dev->descriptor.idVendor == 0x930 &&
dev->descriptor.idProduct == 0x6545 &&
dev->parent->descriptor.idVendor == 0x424 &&
dev->parent->descriptor.idProduct == 0x2514)
wait_ms(10);
I have to say that I have a problem with this construction. This will solve only 1 particular situation where the user has exactly this USB stick and exactly this hub. If he uses another hub, he can have the same problem. If he use another USB-stick he also can run into this problem. What is the real cause of this issue, and can it be solved in a generic way? I feel the problem is still not understood.
Indeed the problem is not exactly understood.
Anatolij has tested a wide range of board / hub / stick combinations, and the problem happens only with this very combination.
Yes, there is a chance that another hub / stick combination might show similar issues, but so far we have not been able to find such another combo.
So, I NAK this patch.. Sorry...
So what do you propose to solve the problem? Of course we could add the delay unconditionally, for all systems. But it brings with it a performance degradation which I would not like to see either.
What do you suggest to provide a solution for this specific situation?
Best regards,
Wolfgang Denk

Hi Wolfgang,
2010/11/2 Wolfgang Denk wd@denx.de:
Dear Remy Bohmer,
In message AANLkTikqEFtB=XEAV1G6xyvOY3HW3imCPiFiLD0kVoG6@mail.gmail.com you wrote:
- if (dev->descriptor.idVendor == 0x930 &&
- dev->descriptor.idProduct == 0x6545 &&
- dev->parent->descriptor.idVendor == 0x424 &&
- dev->parent->descriptor.idProduct == 0x2514)
- wait_ms(10);
I have to say that I have a problem with this construction. This will solve only 1 particular situation where the user has exactly this USB stick and exactly this hub. If he uses another hub, he can have the same problem. If he use another USB-stick he also can run into this problem. What is the real cause of this issue, and can it be solved in a generic way? I feel the problem is still not understood.
Indeed the problem is not exactly understood.
Anatolij has tested a wide range of board / hub / stick combinations, and the problem happens only with this very combination.
Yes, there is a chance that another hub / stick combination might show similar issues, but so far we have not been able to find such another combo.
So, I NAK this patch.. Sorry...
So what do you propose to solve the problem? Of course we could add the delay unconditionally, for all systems. But it brings with it a performance degradation which I would not like to see either.
Agreed
What do you suggest to provide a solution for this specific situation?
I have no idea what has been done, and has not been done. I have not been debugging this issue. I have no idea if a USB protocol analyser has shown something weird or something we could work on.
But anyway: You admit that the problem is not understood, so would the delay fix the problem, or just make it less obvious?
If we would allow such workarounds to U-boot we would end up with endless lists of device-ids that are excluded in some situation all over the place. The code would just become fluttered with all kinds of exceptions to mask problems not being understood and not being debugged properly.
And in this case: How big is the chance that any other user will have exact the same hardware combination and run into this problem? I guess that would be close to zero... My guts tells me that the chance that other combinations require the same 'fix' is bigger...
If it is proven to be a bug of a specific device, that would change the situation, in that case we would work-around a certain device bug that really cannot be solved otherwise. In that case we would _know_ what problem we are working around, and that would be a limited of devices and situations. We at least could document it.
So, I need more info to convince me that this is the right solution. Does, for example, Linux have similar issues? If not, why is it working there. Does a protocol analyser show different behaviour, different timing, different data? What is different?
Kind regards,
Remy

Dear Remy,
In message AANLkTimtZosBtOBM_oG304GbRCpcm1WJEt0xqo+mgN9D@mail.gmail.com you wrote:
I have no idea what has been done, and has not been done. I have not been debugging this issue. I have no idea if a USB protocol analyser has shown something weird or something we could work on.
Sorry - we don;t have a USB protocol analyzer that does high-speed. So our debugging is mostly limited to what we can see in the controller, and in the software levels above it.
But anyway: You admit that the problem is not understood, so would the delay fix the problem, or just make it less obvious?
You know that I cannot really answer this question. Without exact understanding what is going on it is just a pretty much dirty work around. It helps in this specific test case, but we have zero knowledge if it helps in opther cases, and what these cases may be.
The "interesting" part of these tests was that the problem really sticks to one specific combination of hub and storage device.
If we would allow such workarounds to U-boot we would end up with endless lists of device-ids that are excluded in some situation all over the place. The code would just become fluttered with all kinds of exceptions to mask problems not being understood and not being debugged properly.
Agreed.
And in this case: How big is the chance that any other user will have exact the same hardware combination and run into this problem? I guess that would be close to zero... My guts tells me that the chance that other combinations require the same 'fix' is bigger...
I am positive sure that other uses with this hardware combination _will_ run into the same problem. This fix was developed for a customer who needed it pretty much urgently because he was using this combo in numbers where he preferred paying for the fix over throwing away the sticks and.or hubs and using other models. The problem was reliablew repeatable on all his devices. And we saw it on PowerPC (MPC5121) and ARM (Kirkwood). That's why I'm pretty sure that it whill hit if you run such a combo.
If it is proven to be a bug of a specific device, that would change the situation, in that case we would work-around a certain device bug that really cannot be solved otherwise. In that case we would _know_ what problem we are working around, and that would be a limited of devices and situations. We at least could document it.
Unfortunately we can only point at the combo of devices - each for itself appears to be working OK.
So, I need more info to convince me that this is the right solution. Does, for example, Linux have similar issues? If not, why is it
No, we did not observe such problems under Linux. We were not able to find out why.
working there. Does a protocol analyser show different behaviour, different timing, different data? What is different?
We don't have any such information, unfortunately.
Best regards,
Wolfgang Denk

Hi,
2010/11/2 Wolfgang Denk wd@denx.de:
Dear Remy,
In message AANLkTimtZosBtOBM_oG304GbRCpcm1WJEt0xqo+mgN9D@mail.gmail.com you wrote:
I have no idea what has been done, and has not been done. I have not been debugging this issue. I have no idea if a USB protocol analyser has shown something weird or something we could work on.
Sorry - we don;t have a USB protocol analyzer that does high-speed. So our debugging is mostly limited to what we can see in the controller, and in the software levels above it.
Hmm, then, unfortunately, you are quite blindfolded for debugging this problem...
If we would allow such workarounds to U-boot we would end up with endless lists of device-ids that are excluded in some situation all over the place. The code would just become fluttered with all kinds of exceptions to mask problems not being understood and not being debugged properly.
Agreed.
As U-boot project-owner you know you have the last word in this. How do you think how to continue from here?
And in this case: How big is the chance that any other user will have exact the same hardware combination and run into this problem? I guess that would be close to zero... My guts tells me that the chance that other combinations require the same 'fix' is bigger...
I am positive sure that other uses with this hardware combination _will_ run into the same problem. This fix was developed for a customer who needed it pretty much urgently because he was using this combo in numbers where he preferred paying for the fix over throwing away the sticks and.or hubs and using other models. The problem was reliablew repeatable on all his devices. And we saw it on PowerPC (MPC5121) and ARM (Kirkwood). That's why I'm pretty sure that it whill hit if you run such a combo.
Understood.
If it is proven to be a bug of a specific device, that would change the situation, in that case we would work-around a certain device bug that really cannot be solved otherwise. In that case we would _know_ what problem we are working around, and that would be a limited of devices and situations. We at least could document it.
Unfortunately we can only point at the combo of devices - each for itself appears to be working OK.
OK
So, I need more info to convince me that this is the right solution. Does, for example, Linux have similar issues? If not, why is it
No, we did not observe such problems under Linux. We were not able to find out why.
There should be a difference in controlling the devices triggering the bug, and without hispeed analyser it will be extremely hard to find. We have one here, but we do not have your boards, USB hub/devices and so on. (And... neither do I have the time to debug it for you...)
working there. Does a protocol analyser show different behaviour, different timing, different data? What is different?
We don't have any such information, unfortunately.
Understood.
Kind regards,
Remy

Dear Remy Bohmer,
In message AANLkTik0BxXfe8d5+96Gy_=+Cu0H_FkEYUtfyo=cRPJR@mail.gmail.com you wrote:
As U-boot project-owner you know you have the last word in this.
This is a pretty precious resource that should be used wisely, and not without real need. This topic is clearly in your domain, and while I'm trying to explain the situation to you, I will not try to influence your decision.
How do you think how to continue from here?
I don't really know.
There should be a difference in controlling the devices triggering the bug, and without hispeed analyser it will be extremely hard to find. We have one here, but we do not have your boards, USB hub/devices and so on. (And... neither do I have the time to debug it for you...)
Nobody expects that you spend time and resources (and for free) on such a pretty exotic issue.
Best regards,
Wolfgang Denk

Hi Remy,
On Tue, 02 Nov 2010 21:46:33 +0100 Wolfgang Denk wd@denx.de wrote:
Dear Remy Bohmer,
In message AANLkTik0BxXfe8d5+96Gy_=+Cu0H_FkEYUtfyo=cRPJR@mail.gmail.com you wrote:
As U-boot project-owner you know you have the last word in this.
This is a pretty precious resource that should be used wisely, and not without real need. This topic is clearly in your domain, and while I'm trying to explain the situation to you, I will not try to influence your decision.
I just wanted to ask, what is your final decision on this patch after this discusion. Do you NACK it an we should find the real issue and fix it accordingly? Or can you accept this patch as is?
How do you think how to continue from here?
I don't really know.
There should be a difference in controlling the devices triggering the bug, and without hispeed analyser it will be extremely hard to find. We have one here, but we do not have your boards, USB hub/devices and so on. (And... neither do I have the time to debug it for you...)
Nobody expects that you spend time and resources (and for free) on such a pretty exotic issue.
Thanks, Anatolij

Dear Anatolij,
In message 20101125111729.3ce4f7c6@wker you wrote:
I just wanted to ask, what is your final decision on this patch after this discusion. Do you NACK it an we should find the real issue and fix it accordingly? Or can you accept this patch as is?
I'm afraid we have neither time nor resources to spend any significant efforts on this - from the customer's point of view the problem is solved. He is fine with the out-of-tree patch, but of course this is highly dissatisfying.
Best regards,
Wolfgang Denk

Hi,
2010/11/25 Anatolij Gustschin agust@denx.de:
Hi Remy,
On Tue, 02 Nov 2010 21:46:33 +0100 Wolfgang Denk wd@denx.de wrote:
Dear Remy Bohmer,
In message AANLkTik0BxXfe8d5+96Gy_=+Cu0H_FkEYUtfyo=cRPJR@mail.gmail.com you wrote:
As U-boot project-owner you know you have the last word in this.
This is a pretty precious resource that should be used wisely, and not without real need. This topic is clearly in your domain, and while I'm trying to explain the situation to you, I will not try to influence your decision.
I just wanted to ask, what is your final decision on this patch after this discusion. Do you NACK it an we should find the real issue and fix it accordingly? Or can you accept this patch as is?
Sorry, but I stand with my decision to NACK it. If we would allow these kind of patches, U-boot would be fluttered in no-time with all kinds of exceptions all over the place to hide bugs not really understood. What if someone else runs into the same bug in a different hardware setup and finds the real root-cause of that problem and posts a real fix. When do we know that this workaround can be removed? It would probably stay in there for years and nobody knows what to do with it...
Kind regards,
Remy
participants (3)
-
Anatolij Gustschin
-
Remy Bohmer
-
Wolfgang Denk