
On Sat, Nov 23, 2024 at 3:40 PM Tom Rini <trini@konsulko.com> wrote:
On Wed, Nov 20, 2024 at 11:29:43AM +1300, Chris Packham wrote:
Hi U-Boot,
We've hit a weird problem at $dayjob with a board using the Marvell CN9130 SoC and using the asix88179 USB-Eth adapter.
The problem is that after enabling an unrelated feature in u-boot, the asix88179 fails to receive data (I can confirm that the link partner does see packets in the transmit direction).
=> version
U-Boot 2022.01 (Nov 08 2024 - 09:45:44 +0000)
=> usb start
starting USB...
Bus usb3@500000: Register 2000120 NbrPorts 2
Starting the controller
USB XHCI 1.00
scanning bus usb3@500000 for devices... 2 USB Device(s) found
scanning usb for storage devices... 0 Storage Device(s) found
=> ping ${serverip}
Waiting for Ethernet connection... unable to connect.
Reset Ethernet Device
Waiting for Ethernet connection... done.
Using ax88179_eth device
Rx: failed to receive: -5
Rx: failed to receive: -5
Rx: failed to receive: -5
Rx: failed to receive: -5
Rx: failed to receive: -5
Rx: failed to receive: -5
Rx: failed to receive: -5
Rx: failed to receive: -5
Rx: failed to receive: -5
Rx: failed to receive: -5
Rx: failed to receive: -5
Rx: failed to receive: -5
Rx: failed to receive: -5

Abort
ping failed; host 10.37.233.65 is not alive
=> <INTERRUPT>
Debugging a little we can see that the -EIO is actually because xhci_bulk_tx() hits a timeout from xhci_wait_for_event().
We think this is triggered by the u-boot image size crossing some boundary (the problem seems to start when .bss_end crosses 0x00000000000f0000) although I've so far been unable to find specifically why that might be. As far as I can tell u-boot is being built relocatably and nothing is overlapping. I also considered that ATF might be preventing access to something but so far I see no evidence of this.
If I turn off some features to reduce the build size, the problem goes away. That is actually how we've avoided the immediate issue, although that means the problem will likely come back at an inopportune time.
Does anyone have any ideas as to what the true root cause might be? I'm a bit stumped.
Hummmm. Since you note it seems to be when a threshold is crossed in BSS size, add something to the BSS of a variable size that you control, and after confirming that you can replicate the problem this way, grow it just past the limit and compare u-boot.map files in the works/fails cases to see just what's being moved around?
So I tried a little experiment:
diff --git a/net/net.c b/net/net.c
index b003b84b3537..a6def9785133 100644
--- a/net/net.c
+++ b/net/net.c
@@ -180,6 +180,10 @@ u32 net_boot_file_size;
 /* Boot file size in blocks as reported by the DHCP server */
 u32 net_boot_file_expected_size_in_blocks;
 
+#define DUMMY_SIZE (1 << 11)
+
+int dummy[DUMMY_SIZE] = {0};
+
 static uchar net_pkt_buf[(PKTBUFSRX+1) * PKTSIZE_ALIGN + PKTALIGN];
 /* Receive packets */
 uchar *net_rx_packets[PKTBUFSRX];
@@ -211,6 +215,7 @@ int __maybe_unused net_busy_flag;
 static int on_ipaddr(const char *name, const char *value, enum env_op op,
 	int flags)
 {
+	dummy[DUMMY_SIZE - 1] = -1;
 	if (flags & H_PROGRAMMATIC)
 		return 0;
If I make DUMMY_SIZE (1 << 10) I don't see the problem. With DUMMY_SIZE (1 << 11) I can see the problem. If I make it DUMMY_SIZE (1 << 14) then the problem goes away again.
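For reference, assuming a 4-byte int (which is what I would expect on this AArch64 target), those three DUMMY_SIZE values add roughly 4 KiB, 8 KiB and 64 KiB of zero-initialised data respectively. A minimal sketch of the arithmetic follows; dummy_bytes() is a hypothetical helper for illustration only, not something that exists in U-Boot:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not a U-Boot function): size in bytes of an array
 * declared as "int dummy[1 << shift]". The KiB figures quoted above
 * assume sizeof(int) == 4.
 */
static inline size_t dummy_bytes(unsigned int shift)
{
	assert(shift < 31);	/* keep 1 << shift representable as an int */
	return ((size_t)1 << shift) * sizeof(int);
}
```

So the failing case sits between a 4 KiB and a 64 KiB shift of whatever follows the array, which suggests the placement of something matters rather than the total size.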
The obvious things that are moving are net_rx_packet, net_rx_packet_len and net_rx_packets. I'll see if I can narrow things down to specifically which of these is being problematic.
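One way to narrow this down might be to print the runtime address of each moved symbol in the working and failing builds and check whether any of them straddles an alignment boundary. A minimal sketch, assuming the boundary of interest is a power of two; crosses_boundary() is a hypothetical helper, and any particular boundary value passed to it is a guess rather than something identified in this thread:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical debug helper (not a U-Boot function): returns true if the
 * byte range [start, start + len) spans more than one "boundary"-sized,
 * "boundary"-aligned block. "boundary" must be a power of two, and the
 * range is assumed not to wrap around the end of the address space.
 */
static inline bool crosses_boundary(uintptr_t start, size_t len,
				    uintptr_t boundary)
{
	uintptr_t last;

	assert(boundary != 0 && (boundary & (boundary - 1)) == 0);

	if (len == 0)
		return false;

	last = start + (len - 1);	/* address of the final byte */

	/* Compare the aligned block that each end of the range falls in. */
	return (start & ~(boundary - 1)) != (last & ~(boundary - 1));
}
```

For example, calling crosses_boundary((uintptr_t)net_rx_packets[i], PKTSIZE_ALIGN, 0x10000) for each receive buffer from a temporary debug print would show whether any buffer spans a 64 KiB boundary in the failing build but not in the working ones. Since U-Boot relocates itself, the addresses need to be checked at runtime rather than read straight from u-boot.map.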
-- Tom