[PATCH] pci: Work around PCIe link training failures

Attempt to handle cases with a downstream port of a PCIe switch where link training never completes and the link continues switching between speeds indefinitely with the data link layer never reaching the active state.
It has been observed with a downstream port of the ASMedia ASM2824 Gen 3 switch wired to the upstream port of the Pericom PI7C9X2G304 Gen 2 switch, using a Delock Riser Card PCI Express x1 > 2 x PCIe x1 device, P/N 41433, wired to a SiFive HiFive Unmatched board. In this setup the switches are supposed to negotiate the link speed of preferably 5.0GT/s, falling back to 2.5GT/s.
However the link continues oscillating between the two speeds, with link training reported active ~84% of the time repeatedly, e.g.:
02:03.0 PCI bridge [0604]: ASMedia Technology Inc. ASM2824 PCIe Gen3 Packet Switch [1b21:2824] (rev 01) (prog-if 00 [Normal decode]) [...] Bus: primary=02, secondary=05, subordinate=05, sec-latency=0 [...] Capabilities: [80] Express (v2) Downstream Port (Slot+), MSI 00 [...] LnkSta: Speed 5GT/s (downgraded), Width x1 (ok) TrErr- Train+ SlotClk+ DLActive- BWMgmt+ ABWMgmt- [...] LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis+, Selectable De-emphasis: -3.5dB Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS- Compliance De-emphasis: -6dB [...]
Forcibly limiting the target link speed to 2.5GT/s with the upstream ASM2824 device makes the two switches communicate correctly however:
02:03.0 PCI bridge [0604]: ASMedia Technology Inc. ASM2824 PCIe Gen3 Packet Switch [1b21:2824] (rev 01) (prog-if 00 [Normal decode]) [...] Bus: primary=02, secondary=05, subordinate=09, sec-latency=0 [...] Capabilities: [80] Express (v2) Downstream Port (Slot+), MSI 00 [...] LnkSta: Speed 2.5GT/s (downgraded), Width x1 (ok) TrErr- Train- SlotClk+ DLActive+ BWMgmt- ABWMgmt- [...] LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis+, Selectable De-emphasis: -3.5dB Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS- Compliance De-emphasis: -6dB [...]
and then:
05:00.0 PCI bridge [0604]: Pericom Semiconductor PI7C9X2G304 EL/SL PCIe2 3-Port/4-Lane Packet Switch [12d8:2304] (rev 05) (prog-if 00 [Normal decode]) [...] Bus: primary=05, secondary=06, subordinate=09, sec-latency=0 [...] Capabilities: [c0] Express (v2) Upstream Port, MSI 00 [...] LnkSta: Speed 2.5GT/s (downgraded), Width x1 (downgraded) TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt- [...] LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis- Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS- Compliance De-emphasis: -6dB [...]
Make use of this observation then and attempt to detect the inability to negotiate the link speed automatically, and then handle it by hand. Use the Data Link Layer Link Active status flag as the primary indicator of successful link speed negotiation, but given that the flag is optional by hardware to implement (the ASM2824 does have it though), resort to checking for the mandatory Link Bandwidth Management Status flag showing that the link speed or width has been changed in an attempt to correct unreliable link operation (the ASM2824 does set it too).
If these checks indicate that link may not operate correctly, then poll the Data Link Layer Link Active status flag along with the Link Training flag for the duration of 200ms to see if the link has stabilised, that is either that the Data Link Layer Link Active status flag has been set or that Link Training has been inactive during at least the second half of the inteval.
If that has indicated failure, reduce the target speed, request a link retrain and check again if the link has stabilised. Repeat until either successful or the link speeds supported by the downstream port have been exhausted.
Signed-off-by: Maciej W. Rozycki macro@orcam.me.uk --- Hi,
I came across this issue when building the system based on my recently acquired SiFive HiFive Unmatched board. One of the project objectives was to have a pair of legacy PCI options wired and in the course of the design I came up with a solution including an intermediate Delock Riser Card PCI Express x1 > 2 x PCIe x1 device, which provides an extra PCIe slot beyond one taken by the downstream PCIe-PCI bridge device, a worthwhile addition (https://www.delock.com/produkt/41433/merkmale.html, now discontinued).
It has quickly turned out that the Delock device does not want to talk to the Unmatched board, even though both devices operate correctly in other topologies. So I have asked all the involved parties for assistance, but that did not help. ASMedia declined to answer, someone from Diodes (for the Pericom part) did reply, but quickly concluded they could only give support for silicon design, and Delock said that they do not provide such a level of support and if a device does not work, then I am on my own. The guys from SiFive were at least sympathetic (thank you!), but this level of hardware debug was beyond their skills and/or resources. So I was left on my own really.
I chose not to give up though and tried to poke at the upstream ASMedia switch based on general PCIe technical information (sadly the details of the switch's programming interface remain undocumented, unlike with the downstream Pericom part). That has revealed that the ASMedia part can be persuaded to talk to the downstream Pericom part by forcing the link speed to 2.5GT/s, e.g. by entering these commands at the U-Boot prompt:
=> pci write.w 02.03.0 0xb0 0x0061 => pci write.w 02.03.0 0x90 0x0020
This could do as a workaround, but one complication with the Unmatched is that it does not have a writable persistent environment storage, so it would have to be hardcoded with a patched U-Boot rebuild, quite a hack.
Then I thought perhaps we could do better and handle this automatically in U-Boot. Firmware is after all the place to fix up any hardware errata if possible. Not only this would relieve the user from having to know which PCIe port to apply the workaround to (the Unmatched has 3 external PCIe slots available total in a different form factor each), but it would be completely transparent regardless of the devices involved, possibly different from ones I have observed this issue with.
Looking into it I soon realised that the problematic pair of switches actually continues link negotiation indefinitely, as indicated by the Link Training flag flipping and the Current Link Speed indicator switching between 5GT/s and 2.5GT/s repeatedly. I have then used this information to make the workaround offered here. It works reliably in my case, though the polling period of 200ms is a bit of heuristics, especially as I cannot claim PCIe expertise really, so any input will be appreciated here.
Otherwise I think the solution is ready to include with U-Boot.
Questions, comments, concerns? Otherwise please apply.
NB I have bcc-ed this submission to ASMedia and Diodes in a half-hearted hope someone actually involved with the design of these devices chooses to chime in after all. I find it particularly regrettable that the switches have no failure counter or suchlike that would prevent link negotiation from oscillating between 5GT/s and 2.5GT/s indefinitely and after a number of iterations would instead make them settle on the lower speed, which appears fully reliable in this setup. The lack of documentation for the ASM2824 does not help either.
Maciej --- drivers/pci/pci_auto.c | 148 +++++++++++++++++++++++++++++++++++++++++++++++++ include/pci.h | 17 +++++ 2 files changed, 164 insertions(+), 1 deletion(-)
u-boot-pcie-manual-retrain.diff Index: u-boot/drivers/pci/pci_auto.c =================================================================== --- u-boot.orig/drivers/pci/pci_auto.c +++ u-boot/drivers/pci/pci_auto.c @@ -5,6 +5,7 @@ * Author: Matt Porter mporter@mvista.com * * Copyright 2000 MontaVista Software Inc. + * Copyright (c) 2021 Maciej W. Rozycki macro@orcam.me.uk */
#include <common.h> @@ -12,6 +13,7 @@ #include <errno.h> #include <log.h> #include <pci.h> +#include <time.h> #include "pci_internal.h"
/* the user can define CONFIG_SYS_PCI_CACHE_LINE_SIZE to avoid problems */ @@ -180,6 +182,146 @@ static void dm_pciauto_setup_device(stru dm_pci_write_config8(dev, PCI_LATENCY_TIMER, 0x80); }
+/* + * Check if the link of a downstream PCIe switch port operates correctly. + * + * For that check if the optional Data Link Layer Link Active status gets + * on within a 200ms period or failing that wait until the completion of + * that period and check if link training has shown the completed status + * continuously throughout the second half of that period. + * + * Observation with the ASMedia ASM2824 Gen 3 switch indicates it takes + * 11-44ms to indicate the Data Link Layer Link Active status at 2.5GT/s, + * though it may take a couple of link training iterations. + */ +static bool dm_pciauto_exp_link_stable(struct udevice *dev, int pcie_off) +{ + u64 loops = 0, trcount = 0, ntrcount = 0, flips = 0; + bool dllla, lnktr, plnktr; + u16 exp_lnksta; + pci_dev_t bdf; + u64 end; + + dm_pci_read_config16(dev, pcie_off + PCI_EXP_LNKSTA, &exp_lnksta); + plnktr = !!(exp_lnksta & PCI_EXP_LNKSTA_LNKTR); + + end = get_ticks() + usec_to_tick(200000); + do { + dm_pci_read_config16(dev, pcie_off + PCI_EXP_LNKSTA, + &exp_lnksta); + dllla = !!(exp_lnksta & PCI_EXP_LNKSTA_DLLLA); + lnktr = !!(exp_lnksta & PCI_EXP_LNKSTA_LNKTR); + + flips += plnktr ^ lnktr; + if (lnktr) { + ntrcount = 0; + trcount++; + } else { + ntrcount++; + } + loops++; + + plnktr = lnktr; + } while (!dllla && get_ticks() < end); + + bdf = dm_pci_get_bdf(dev); + debug("PCI Autoconfig: %02x.%02x.%02x: Fixup link: DL active: %u; " + "%3llu flips, %6llu loops of which %6llu while training, " + "final %6llu stable\n", + PCI_BUS(bdf), PCI_DEV(bdf), PCI_FUNC(bdf), + (unsigned int)dllla, + (unsigned long long)flips, (unsigned long long)loops, + (unsigned long long)trcount, (unsigned long long)ntrcount); + + return dllla || ntrcount >= loops / 2; +} + +/* + * Retrain the link of a downstream PCIe switch port by hand if necessary. + * + * This is needed at least where a downstream port of the ASMedia ASM2824 + * Gen 3 switch is wired to the upstream port of the Pericom PI7C9X2G304 + * Gen 2 switch, and observed with the Delock Riser Card PCI Express x1 > + * 2 x PCIe x1 device, P/N 41433, plugged into the SiFive HiFive Unmatched + * board. + * + * In such a configuration the switches are supposed to negotiate the link + * speed of preferably 5.0GT/s, falling back to 2.5GT/s. However the link + * continues switching between the two speeds indefinitely and the data + * link layer never reaches the active state, with link training reported + * active ~84% of the time repeatedly. Forcing the target link speed to + * 2.5GT/s with the upstream ASM2824 device makes the two switches talk to + * each other correctly however. + * + * As this can potentially happen with any devices and is cheap in the + * case of correctly functioning hardware, let's do it for all downstream + * switch ports and if automatic link training may have failed to complete, + * as indicated by the optional Data Link Layer Link Active status being + * off and the Link Bandwidth Management Status showing that hardware has + * changed the link speed or width in an attempt to correct unreliable + * link operation then try lower and lower speeds one by one until one has + * succeeded or we have run out of speeds. + * + * This requires the presence of the Link Control 2 register, so make sure + * the PCI Express Capability Version is at least 2. Also don't try, for + * obvious reasons, to limit the speed if 2.5GT/s is the only link speed + * supported. + */ +static void dm_pciauto_exp_fixup_link(struct udevice *dev, int pcie_off) +{ + u16 exp_lnksta, exp_lnkctl, exp_lnkctl2; + u16 exp_flags, exp_type, exp_version; + u32 exp_lnkcap, exp_lnkcap2; + unsigned int speed; + pci_dev_t bdf; + + dm_pci_read_config16(dev, pcie_off + PCI_EXP_FLAGS, &exp_flags); + exp_version = exp_flags & PCI_EXP_FLAGS_VERSION; + exp_type = exp_flags & PCI_EXP_FLAGS_TYPE; + if (exp_version < 2 || exp_type != PCI_EXP_TYPE_DOWN_PORT << 4) + return; + + dm_pci_read_config32(dev, pcie_off + PCI_EXP_LNKCAP, &exp_lnkcap); + if ((exp_lnkcap & PCI_EXP_LNKCAP_SLS) <= PCI_EXP_LNKCAP_SLS_2_5GB) + return; + + dm_pci_read_config16(dev, pcie_off + PCI_EXP_LNKSTA, &exp_lnksta); + if ((exp_lnksta & (PCI_EXP_LNKSTA_LBMS | PCI_EXP_LNKSTA_DLLLA)) != + PCI_EXP_LNKSTA_LBMS) + return; + + if (dm_pciauto_exp_link_stable(dev, pcie_off)) + return; + + bdf = dm_pci_get_bdf(dev); + debug("PCI Autoconfig: %02x.%02x.%02x: Downgrading speed by hand\n", + PCI_BUS(bdf), PCI_DEV(bdf), PCI_FUNC(bdf)); + + dm_pci_read_config16(dev, pcie_off + PCI_EXP_LNKCTL, &exp_lnkctl); + dm_pci_read_config16(dev, pcie_off + PCI_EXP_LNKCTL2, &exp_lnkctl2); + exp_lnkctl2 &= ~PCI_EXP_LNKCTL2_TLS; + + dm_pci_read_config32(dev, pcie_off + PCI_EXP_LNKCAP, &exp_lnkcap); + dm_pci_read_config32(dev, pcie_off + PCI_EXP_LNKCAP2, &exp_lnkcap2); + for (speed = (exp_lnkcap & PCI_EXP_LNKCAP_SLS) - 1; + speed >= PCI_EXP_LNKCAP_SLS_2_5GB; + speed--) { + if (exp_lnkcap2 && (exp_lnkcap2 & 1 << speed) == 0) + continue; + + debug("PCI Autoconfig: %02x.%02x.%02x: Trying speed: %u\n", + PCI_BUS(bdf), PCI_DEV(bdf), PCI_FUNC(bdf), speed); + + dm_pci_write_config16(dev, pcie_off + PCI_EXP_LNKCTL2, + exp_lnkctl2 | speed); + dm_pci_write_config16(dev, pcie_off + PCI_EXP_LNKCTL, + exp_lnkctl | PCI_EXP_LNKCTL_RETRLNK); + + if (dm_pciauto_exp_link_stable(dev, pcie_off)) + return; + } +} + void dm_pciauto_prescan_setup_bridge(struct udevice *dev, int sub_bus) { struct pci_region *pci_mem; @@ -189,6 +331,7 @@ void dm_pciauto_prescan_setup_bridge(str u8 io_32; struct udevice *ctlr = pci_get_controller(dev); struct pci_controller *ctlr_hose = dev_get_uclass_priv(ctlr); + int pcie_off;
pci_mem = ctlr_hose->pci_mem; pci_prefetch = ctlr_hose->pci_prefetch; @@ -267,6 +410,11 @@ void dm_pciauto_prescan_setup_bridge(str cmdstat |= PCI_COMMAND_IO; }
+ /* For PCIe devices see if we need to retrain the link by hand */ + pcie_off = dm_pci_find_capability(dev, PCI_CAP_ID_EXP); + if (pcie_off) + dm_pciauto_exp_fixup_link(dev, pcie_off); + /* Enable memory and I/O accesses, enable bus master */ dm_pci_write_config16(dev, PCI_COMMAND, cmdstat | PCI_COMMAND_MASTER); } Index: u-boot/include/pci.h =================================================================== --- u-boot.orig/include/pci.h +++ u-boot/include/pci.h @@ -5,6 +5,7 @@ * * (C) Copyright 2002 * Wolfgang Denk, DENX Software Engineering, wd@denx.de. + * Copyright (c) 2021 Maciej W. Rozycki macro@orcam.me.uk */
#ifndef _PCI_H @@ -475,16 +476,23 @@
/* PCI Express capabilities */ #define PCI_EXP_FLAGS 2 /* Capabilities register */ +#define PCI_EXP_FLAGS_VERSION 0x000f /* Capability Version */ #define PCI_EXP_FLAGS_TYPE 0x00f0 /* Device/Port type */ #define PCI_EXP_TYPE_ROOT_PORT 0x4 /* Root Port */ +#define PCI_EXP_TYPE_DOWN_PORT 0x6 /* Downstream Port */ #define PCI_EXP_DEVCAP 4 /* Device capabilities */ #define PCI_EXP_DEVCAP_FLR 0x10000000 /* Function Level Reset */ #define PCI_EXP_DEVCTL 8 /* Device Control */ #define PCI_EXP_DEVCTL_BCR_FLR 0x8000 /* Bridge Configuration Retry / FLR */ #define PCI_EXP_LNKCAP 12 /* Link Capabilities */ #define PCI_EXP_LNKCAP_SLS 0x0000000f /* Supported Link Speeds */ +#define PCI_EXP_LNKCAP_SLS_2_5GB 0x0001 /* Supported Link Speed 2.5GT/s */ +#define PCI_EXP_LNKCAP_SLS_5_0GB 0x0002 /* Supported Link Speed 5.0GT/s */ +#define PCI_EXP_LNKCAP_SLS_8_0GB 0x0003 /* Supported Link Speed 8.0GT/s */ #define PCI_EXP_LNKCAP_MLW 0x000003f0 /* Maximum Link Width */ #define PCI_EXP_LNKCAP_DLLLARC 0x00100000 /* Data Link Layer Link Active Reporting Capable */ +#define PCI_EXP_LNKCTL 16 /* Link Control */ +#define PCI_EXP_LNKCTL_RETRLNK 0x0020 /* Retrain Link */ #define PCI_EXP_LNKSTA 18 /* Link Status */ #define PCI_EXP_LNKSTA_CLS 0x000f /* Current Link Speed */ #define PCI_EXP_LNKSTA_CLS_2_5GB 0x0001 /* Current Link Speed 2.5GT/s */ @@ -492,7 +500,9 @@ #define PCI_EXP_LNKSTA_CLS_8_0GB 0x0003 /* Current Link Speed 8.0GT/s */ #define PCI_EXP_LNKSTA_NLW 0x03f0 /* Negotiated Link Width */ #define PCI_EXP_LNKSTA_NLW_SHIFT 4 /* start of NLW mask in link status */ +#define PCI_EXP_LNKSTA_LNKTR 0x0800 /* Link Training */ #define PCI_EXP_LNKSTA_DLLLA 0x2000 /* Data Link Layer Link Active */ +#define PCI_EXP_LNKSTA_LBMS 0x4000 /* Link Bandwidth Management Status */ #define PCI_EXP_SLTCAP 20 /* Slot Capabilities */ #define PCI_EXP_SLTCAP_PSN 0xfff80000 /* Physical Slot Number */ #define PCI_EXP_RTCTL 28 /* Root Control */ @@ -503,8 +513,13 @@ #define PCI_EXP_DEVCAP2_ARI 0x00000020 /* ARI Forwarding Supported */ #define PCI_EXP_DEVCTL2 40 /* Device Control 2 */ #define PCI_EXP_DEVCTL2_ARI 0x0020 /* Alternative Routing-ID */ - +#define PCI_EXP_LNKCAP2 44 /* Link Capability 2 */ +#define PCI_EXP_LNKCAP2_SLSV 0x000000fe /* Supported Link Speeds Vector */ #define PCI_EXP_LNKCTL2 48 /* Link Control 2 */ +#define PCI_EXP_LNKCTL2_TLS 0x000f /* Target Link Speed */ +#define PCI_EXP_LNKSTA_TLS_2_5GB 0x0001 /* Target Link Speed 2.5GT/s */ +#define PCI_EXP_LNKSTA_TLS_5_0GB 0x0002 /* Target Link Speed 5.0GT/s */ +#define PCI_EXP_LNKSTA_TLS_8_0GB 0x0003 /* Target Link Speed 8.0GT/s */ /* Single Root I/O Virtualization Registers */ #define PCI_SRIOV_CAP 0x04 /* SR-IOV Capabilities */ #define PCI_SRIOV_CTRL 0x08 /* SR-IOV Control */

Hello Maciej!
On Sunday 14 November 2021 15:16:47 Maciej W. Rozycki wrote:
- dm_pci_read_config32(dev, pcie_off + PCI_EXP_LNKCAP, &exp_lnkcap);
- dm_pci_read_config32(dev, pcie_off + PCI_EXP_LNKCAP2, &exp_lnkcap2);
- for (speed = (exp_lnkcap & PCI_EXP_LNKCAP_SLS) - 1;
speed >= PCI_EXP_LNKCAP_SLS_2_5GB;
speed--) {
if (exp_lnkcap2 && (exp_lnkcap2 & 1 << speed) == 0)
continue;
This check is not correct because lnkcap2 is optional also when lnkctl2 is implemented. IIRC unimplemented lnkcap2 with implemented lnkctl2 can happen only for link speeds <= 5GT/s and when lnkcap2 returns zero then software should detect speeds from lnkcap register. And I have such PCIe GEN2 HW where PCIe Root Ports implements lnkctl2, but lnkcap2 returns zeros. So it is not corner case.
debug("PCI Autoconfig: %02x.%02x.%02x: Trying speed: %u\n",
PCI_BUS(bdf), PCI_DEV(bdf), PCI_FUNC(bdf), speed);
dm_pci_write_config16(dev, pcie_off + PCI_EXP_LNKCTL2,
exp_lnkctl2 | speed);
dm_pci_write_config16(dev, pcie_off + PCI_EXP_LNKCTL,
exp_lnkctl | PCI_EXP_LNKCTL_RETRLNK);
if (dm_pciauto_exp_link_stable(dev, pcie_off))
return;
- }
+}
void dm_pciauto_prescan_setup_bridge(struct udevice *dev, int sub_bus) { struct pci_region *pci_mem; @@ -189,6 +331,7 @@ void dm_pciauto_prescan_setup_bridge(str u8 io_32; struct udevice *ctlr = pci_get_controller(dev); struct pci_controller *ctlr_hose = dev_get_uclass_priv(ctlr);
int pcie_off;
pci_mem = ctlr_hose->pci_mem; pci_prefetch = ctlr_hose->pci_prefetch;
@@ -267,6 +410,11 @@ void dm_pciauto_prescan_setup_bridge(str cmdstat |= PCI_COMMAND_IO; }
- /* For PCIe devices see if we need to retrain the link by hand */
- pcie_off = dm_pci_find_capability(dev, PCI_CAP_ID_EXP);
- if (pcie_off)
dm_pciauto_exp_fixup_link(dev, pcie_off);
- /* Enable memory and I/O accesses, enable bus master */ dm_pci_write_config16(dev, PCI_COMMAND, cmdstat | PCI_COMMAND_MASTER);
This code cannot be enabled globally for all PCIe devices. There are more PCIe cards which do not like retraining link and touching link retrain bit in downstream port break them (= power on/off is needed).
Could you bring this discussion to Linux PCI list? I guess that same issue had to be fixed also in Linux kernel and it would be great to have same fix in both projects.
For me it looks like Asmedia specific bug and so it should be applied only for affected Asmedia devices based on PCI vendor/device ids.

Hi Pali,
- dm_pci_read_config32(dev, pcie_off + PCI_EXP_LNKCAP, &exp_lnkcap);
- dm_pci_read_config32(dev, pcie_off + PCI_EXP_LNKCAP2, &exp_lnkcap2);
- for (speed = (exp_lnkcap & PCI_EXP_LNKCAP_SLS) - 1;
speed >= PCI_EXP_LNKCAP_SLS_2_5GB;
speed--) {
if (exp_lnkcap2 && (exp_lnkcap2 & 1 << speed) == 0)
continue;
This check is not correct because lnkcap2 is optional also when lnkctl2 is implemented. IIRC unimplemented lnkcap2 with implemented lnkctl2 can happen only for link speeds <= 5GT/s and when lnkcap2 returns zero then software should detect speeds from lnkcap register. And I have such PCIe GEN2 HW where PCIe Root Ports implements lnkctl2, but lnkcap2 returns zeros. So it is not corner case.
Well, this code checks for non-zero lnkcap2 first and ignores it if it's zero, so I believe it does the right thing. Have I missed anything?
@@ -267,6 +410,11 @@ void dm_pciauto_prescan_setup_bridge(str cmdstat |= PCI_COMMAND_IO; }
- /* For PCIe devices see if we need to retrain the link by hand */
- pcie_off = dm_pci_find_capability(dev, PCI_CAP_ID_EXP);
- if (pcie_off)
dm_pciauto_exp_fixup_link(dev, pcie_off);
- /* Enable memory and I/O accesses, enable bus master */ dm_pci_write_config16(dev, PCI_COMMAND, cmdstat | PCI_COMMAND_MASTER);
This code cannot be enabled globally for all PCIe devices. There are more PCIe cards which do not like retraining link and touching link retrain bit in downstream port break them (= power on/off is needed).
Sigh. It looks like plenty of breakage to choose from.
However we have a case here where the link does not work anyway, so by trying to retrain it we're not going to make it any worse. At worst it will remain in the non-working state. I have tried hard to make my code rather careful in not touching links that do work, so unless I missed a case (please point out one if I did), this is not going to affect those PCIe devices that do not like link retraining.
Could you bring this discussion to Linux PCI list? I guess that same issue had to be fixed also in Linux kernel and it would be great to have same fix in both projects.
It could be done in Linux in addition to U-Boot, although I think doing that in the firmware is more important, especially as there could be a boot device downstream such a switch. And depending on the platform Linux does not always reassign buses, so a user intervention (i.e. an explicitly added kernel parameter) would have to be required.
And these are generic PCIe switches, they could be anywhere and with the weird combinations of hardware interfaces available now (e.g. PCI-PCIe bridge adapters or M.2 to regular PCIe slot adapters) virtually any combination is possible.
E.g. I have a 1997-vintage dual Pentium MMX box (82439HX host bridge; [8086:1250]) with PCIe devices, although it does require a lot of Linux interventions to cope with its firmware limitations. NB I plan to add some NVMe storage to that box, and I believe the ASM2824 has been used in some M.2 carrier boards meant for NVMe devices.
I've cc-ed the linux-pci list in case someone there wants to chime in.
For me it looks like Asmedia specific bug and so it should be applied only for affected Asmedia devices based on PCI vendor/device ids.
It is surely one option to consider, though I'd prefer to keep this piece generic and enabled as widely as possible in a best effort to make things work seamlessly, so as to prevent end users from having to chase a similar problem with another device. It was not exactly easy to figure out what is going on here to me and I am a professional with decades of experience. So what could an ordinary user do other than concluding that things do not work? I think a seamless solution is always preferable.
As I noted ASMedia declined to comment, so it's hard to say if the cause has been nailed correctly and which devices if any, beyond [1b21:2824], are affected.
Thank you for your input.
Maciej

Hello!
On Sunday 14 November 2021 18:07:18 Maciej W. Rozycki wrote:
Hi Pali,
- dm_pci_read_config32(dev, pcie_off + PCI_EXP_LNKCAP, &exp_lnkcap);
- dm_pci_read_config32(dev, pcie_off + PCI_EXP_LNKCAP2, &exp_lnkcap2);
- for (speed = (exp_lnkcap & PCI_EXP_LNKCAP_SLS) - 1;
speed >= PCI_EXP_LNKCAP_SLS_2_5GB;
speed--) {
if (exp_lnkcap2 && (exp_lnkcap2 & 1 << speed) == 0)
continue;
This check is not correct because lnkcap2 is optional also when lnkctl2 is implemented. IIRC unimplemented lnkcap2 with implemented lnkctl2 can happen only for link speeds <= 5GT/s and when lnkcap2 returns zero then software should detect speeds from lnkcap register. And I have such PCIe GEN2 HW where PCIe Root Ports implements lnkctl2, but lnkcap2 returns zeros. So it is not corner case.
Well, this code checks for non-zero lnkcap2 first and ignores it if it's zero, so I believe it does the right thing. Have I missed anything?
I'm reading spec again and I'm not sure now. It has following section:
For software to determine the supported Link speeds for components where the Link Capabilities 2 register is either not implemented, or the value of its Supported Link Speeds Vector is 0000000b, software can read bits 3:0 of the Link Capabilities register (now defined to be the Max Link Speed field), and interpret the value as follows: 0001b 2.5 GT/s Link speed supported 0010b 5.0 GT/s and 2.5 GT/s Link speeds supported
I'm not sure how it should be interpreted, but my assumption is that empty lnkcap2 implicates that only 5GT/s or 2.5GT/s speeds are supported and therefore trying higher should not be done...
Also in spec is following note:
It is strongly encouraged that software primarily utilize the Supported Link Speeds Vector instead of the Max Link Speed field.
So based on this note, iteration should be done via lnkcap2 bits instead of lnkcap speed.
@@ -267,6 +410,11 @@ void dm_pciauto_prescan_setup_bridge(str cmdstat |= PCI_COMMAND_IO; }
- /* For PCIe devices see if we need to retrain the link by hand */
- pcie_off = dm_pci_find_capability(dev, PCI_CAP_ID_EXP);
- if (pcie_off)
dm_pciauto_exp_fixup_link(dev, pcie_off);
- /* Enable memory and I/O accesses, enable bus master */ dm_pci_write_config16(dev, PCI_COMMAND, cmdstat | PCI_COMMAND_MASTER);
This code cannot be enabled globally for all PCIe devices. There are more PCIe cards which do not like retraining link and touching link retrain bit in downstream port break them (= power on/off is needed).
Sigh. It looks like plenty of breakage to choose from.
However we have a case here where the link does not work anyway, so by trying to retrain it we're not going to make it any worse. At worst it will remain in the non-working state. I have tried hard to make my code rather careful in not touching links that do work, so unless I missed a case (please point out one if I did), this is not going to affect those PCIe devices that do not like link retraining.
It looks like it should not cause issue. This is something which I said to myself during debugging other issue and then I found those Atheros wifi cards which enters into broken state if device on other side of the PCIe link tries to retrain link. Something which I would never expect that could be broken :-(
Could you bring this discussion to Linux PCI list? I guess that same issue had to be fixed also in Linux kernel and it would be great to have same fix in both projects.
It could be done in Linux in addition to U-Boot, although I think doing that in the firmware is more important, especially as there could be a boot device downstream such a switch. And depending on the platform Linux does not always reassign buses, so a user intervention (i.e. an explicitly added kernel parameter) would have to be required.
It is important to have it in both components (U-Boot and Linux). For example native PCIe controller drivers in linux kernel as a first thing do complete reset of controller together with connected devices. So whatever do U-Boot is completely lost. And important is that Linux kernel drivers should not depend on some bootloader configuration. And note that your patch implements this workaround in CONFIG_PCI_PNP code, so if board disable this option, workaround is not applied.
And these are generic PCIe switches, they could be anywhere and with the weird combinations of hardware interfaces available now (e.g. PCI-PCIe bridge adapters or M.2 to regular PCIe slot adapters) virtually any combination is possible.
E.g. I have a 1997-vintage dual Pentium MMX box (82439HX host bridge; [8086:1250]) with PCIe devices, although it does require a lot of Linux interventions to cope with its firmware limitations. NB I plan to add some NVMe storage to that box, and I believe the ASM2824 has been used in some M.2 carrier boards meant for NVMe devices.
Hm... now thinking about your patch... and if it is general, applied to all devices, should it also be applied to any type of downstream port? Meaning also for root ports, or PCIe port of PCI/X to PCIe bridge (I guess that such old platforms with only PCI host bridge without PCIe could be affected too if you connect PCIe card via PCI/X to PCIe bridge and then this bridge to host PCI slot).
Also one important note. I have some PCIe devices which do not want to automatically negotiate maximal link speed (5 GT/s) and stay on initial 2.5 GT/s until flipping link retrain bit happen. But I have not debugged this fully (I postponed this issue to future). So maybe this something similar to your Asmedia issue: link speed is detected incorrectly and software needs to setup (maximal) link speed manually.
I've cc-ed the linux-pci list in case someone there wants to chime in.
For me it looks like Asmedia specific bug and so it should be applied only for affected Asmedia devices based on PCI vendor/device ids.
It is surely one option to consider, though I'd prefer to keep this piece generic and enabled as widely as possible in a best effort to make things work seamlessly, so as to prevent end users from having to chase a similar problem with another device. It was not exactly easy to figure out what is going on here to me and I am a professional with decades of experience. So what could an ordinary user do other than concluding that things do not work? I think a seamless solution is always preferable.
As I noted ASMedia declined to comment, so it's hard to say if the cause has been nailed correctly and which devices if any, beyond [1b21:2824], are affected.
Yea, we do not know... And because we know that there is lot of broken HW which works with current version, we need to be very careful when introducing some workaround which is called on every hardware.
For example something similar like in your patch was implemented in Marvell SerDes U-Boot driver, which controls physical layer (so on place where nobody would expect touching higher layer code). Just this code did not touched retrain link bit... And so it could not have worked and was recently removed in patch series: https://lore.kernel.org/u-boot/20210924205922.25432-1-marek.behun@nic.cz/
I'm not saying that I'm opposing this patch. Just I would like see how is this issue will be fixed in kernel as kernel general workaround would affect lot of more devices. And so solution accepted by kernel project should be perfectly fine also for smaller project like U-Boot.
Thank you for your input.
Maciej

Hi Pali,
Well, this code checks for non-zero lnkcap2 first and ignores it if it's zero, so I believe it does the right thing. Have I missed anything?
I'm reading spec again and I'm not sure now. It has following section:
For software to determine the supported Link speeds for components where the Link Capabilities 2 register is either not implemented, or the value of its Supported Link Speeds Vector is 0000000b, software can read bits 3:0 of the Link Capabilities register (now defined to be the Max Link Speed field), and interpret the value as follows: 0001b 2.5 GT/s Link speed supported 0010b 5.0 GT/s and 2.5 GT/s Link speeds supported
This is consistent with speeds defined with PCIe rev. 2.0, which is the last revision not to define the Link Capabilities 2 register.
I'm not sure how it should be interpreted, but my assumption is that empty lnkcap2 implicates that only 5GT/s or 2.5GT/s speeds are supported and therefore trying higher should not be done...
I can see no ambiguity here.
The Link Capabilities register always defines the maximum speed supported and given that there is no speed masking defined for PCIe rev. 2.0 and below all speeds up to the maximum must be always supported with hardware compliant to these specification revisions.
As from PCIe rev. 3.0 the Link Capabilities 2 register provides support for speed masking and therefore not all speeds below the maximum given by the Link Capabilities register must be supported with hardware compliant to this and higher specification revisions.
Also in spec is following note:
It is strongly encouraged that software primarily utilize the Supported Link Speeds Vector instead of the Max Link Speed field.
So based on this note, iteration should be done via lnkcap2 bits instead of lnkcap speed.
I believe my code does iterate over lnkcap2 bits already, because speed encodings that correspond to those lnkcap2 bits that are zero are skipped.
It could be done in Linux in addition to U-Boot, although I think doing that in the firmware is more important, especially as there could be a boot device downstream such a switch. And depending on the platform Linux does not always reassign buses, so a user intervention (i.e. an explicitly added kernel parameter) would have to be required.
It is important to have it in both components (U-Boot and Linux). For example native PCIe controller drivers in linux kernel as a first thing do complete reset of controller together with connected devices. So whatever do U-Boot is completely lost. And important is that Linux kernel drivers should not depend on some bootloader configuration. And note that your patch implements this workaround in CONFIG_PCI_PNP code, so if board disable this option, workaround is not applied.
Well, everything is not Linux. The lone Unmatched board can run at least FreeBSD or Haiku right now, and it is only one of the numerous machines supported by U-Boot. A piece of code in Linux will not help other OSes.
Also as I noted not all Linux platforms discard the firmware settings with PCI/e devices, let alone bring them to their power-on defaults; this certainly does not happen with the Unmatched or I couldn't have benefitted with this change of mine proposed or earlier hacks with poking at the relevant registers from the U-Boot command line. So in my scenario this fix surely qualifies as good enough. Interestingly Linux does notice the clamp imposed by my change with the offending link:
pci 0000:05:00.0: 2.000 Gb/s available PCIe bandwidth, limited by 2.5 GT/s PCIe x1 link at 0000:02:03.0 (capable of 8.000 Gb/s with 5.0 GT/s PCIe x2 link)
(the x1 vs x2 limitation is imposed by wiring one lane only in the Delock device in the first place; then I wired it to a single-lane M.2 connector of the Unmatched).
That said I'm not opposed to porting this code to Linux, I have certainly contributed infinitely more code to Linux than I did to U-Boot. It is just that my resources are limited, so I need to cautiously allocate them. I have other priorities, the Unmatched box is in my remote lab and I am only going to have physical access to it this week only. The next opportunity may be next year at the earliest. And I dare not reflashing U-Boot remotely so as not to lose access with a broken build because I need the machine for other purposes like GCC verification.
So I'd rather focus on getting the change right for U-Boot now, while I am here. If we agree on the way to move forward with this code, then I can look into porting this stuff to Linux, as I can surely add another kernel image to boot the system from remotely in a safe manner.
And these are generic PCIe switches, they could be anywhere and with the weird combinations of hardware interfaces available now (e.g. PCI-PCIe bridge adapters or M.2 to regular PCIe slot adapters) virtually any combination is possible.
E.g. I have a 1997-vintage dual Pentium MMX box (82439HX host bridge; [8086:1250]) with PCIe devices, although it does require a lot of Linux interventions to cope with its firmware limitations. NB I plan to add some NVMe storage to that box, and I believe the ASM2824 has been used in some M.2 carrier boards meant for NVMe devices.
Hm... now thinking about your patch... and if it is general, applied to all devices, should it also be applied to any type of downstream port?
Meaning also for root ports, or PCIe port of PCI/X to PCIe bridge (I guess that such old platforms with only PCI host bridge without PCIe could be affected too if you connect PCIe card via PCI/X to PCIe bridge and then this bridge to host PCI slot).
I guess so. I considered root ports, but somehow it did not occur to me that they can be wired directly to slots (any sane manufacturer would get their onboard devices sorted), which of course they can. I forgot about the PCI/PCI-X to PCI Express bridges though. Thanks for the suggestion.
As this is a trivial change to make I'll post an update, but I will wait till the end of the day or so so as to let people who do this stuff as a part of their dayjob have a chance to give feedback.
As I noted ASMedia declined to comment, so it's hard to say if the cause has been nailed correctly and which devices if any, beyond [1b21:2824], are affected.
Yea, we do not know... And because we know that there is lot of broken HW which works with current version, we need to be very careful when introducing some workaround which is called on every hardware.
For example something similar like in your patch was implemented in Marvell SerDes U-Boot driver, which controls physical layer (so on place where nobody would expect touching higher layer code). Just this code did not touched retrain link bit... And so it could not have worked and was recently removed in patch series: https://lore.kernel.org/u-boot/20210924205922.25432-1-marek.behun@nic.cz/
Sure, which is why this change tries hard to be conservative and checks multiple conditions to make sure a link is in trouble before it pokes at it. If you suspect that these checks may be insufficient, then I'm open to further suggestions.
NB I forgot to mention in the change description and I can add it in v2 that the rate of the speed oscillation/link training with the ASM2824 is 34-35 times per second. So it's quite a lengthy process.
I'm not saying that I'm opposing this patch. Just I would like see how is this issue will be fixed in kernel as kernel general workaround would affect lot of more devices. And so solution accepted by kernel project should be perfectly fine also for smaller project like U-Boot.
Well, as far as I'm concerned I'd propose a functional equivalent that has been merely mechanically adapted to Linux internal interfaces (though busy-looping for extended periods with interrupts disabled, which is what is needed here, may be problematic in Linux or indeed any OS, unlike with the firmware; maybe PCI enumeration is early enough for that not to matter in reality though).
Again, thanks for your input.
Maciej
participants (2)
-
Maciej W. Rozycki
-
Pali Rohár