[U-Boot] netconsole: USB Ethernet connection dropping with ping or tftpboot

I followed the instructions from Marek in 'M28 U-Boot Single-Wire Debug preview' to enable NetConsole: http://www.denx-cs.de/?q=blogm28singlewiredebug
I'm using mxsldr to flash the u-boot.sb image to RAM. This is the dmesg output from my host:
[31048.492181] usb 3-13: new high-speed USB device number 24 using xhci_hcd [31048.667617] cdc_ether 3-13:1.0 eth0: register 'cdc_ether' at usb-0000:00:14.0-13, CDC Ethernet Device, 00:19:b8:00:00:01 [31048.669101] cdc_ether 3-13:1.0 enp0s20u13: renamed from eth0 [31048.696758] cdc_ether 3-13:1.0 enp0s20u13: kevent 12 may have been dropped
I noticed the 'kevent 12 may have been dropped' message here.
Nevertheless, NetworkManager shows me a USB Ethernet connection and using './netconsole 10.0.0.2' I am able to communicate with the device, e.g. 'nand info' outputs correctly.
But if I use 'ping 10.0.0.1' or 'tftpboot u-boot.sb' the network connection drops. Both commands work fine if I switch back from netconsole to serial in-/output.
This is the output from dmesg: [31620.215354] usb 3-13: USB disconnect, device number 24 [31620.215422] cdc_ether 3-13:1.0 enp0s20u13: unregister 'cdc_ether' usb-0000:00:14.0-13, CDC Ethernet Device
Do I missed something?
Best regards Jörg Krause

On Di, 2015-02-03 at 22:44 +0100, Jörg Krause wrote:
I followed the instructions from Marek in 'M28 U-Boot Single-Wire Debug preview' to enable NetConsole: http://www.denx-cs.de/?q=blogm28singlewiredebug
I'm using mxsldr to flash the u-boot.sb image to RAM. This is the dmesg output from my host:
[31048.492181] usb 3-13: new high-speed USB device number 24 using xhci_hcd [31048.667617] cdc_ether 3-13:1.0 eth0: register 'cdc_ether' at usb-0000:00:14.0-13, CDC Ethernet Device, 00:19:b8:00:00:01 [31048.669101] cdc_ether 3-13:1.0 enp0s20u13: renamed from eth0 [31048.696758] cdc_ether 3-13:1.0 enp0s20u13: kevent 12 may have been dropped
I noticed the 'kevent 12 may have been dropped' message here.
Nevertheless, NetworkManager shows me a USB Ethernet connection and using './netconsole 10.0.0.2' I am able to communicate with the device, e.g. 'nand info' outputs correctly.
But if I use 'ping 10.0.0.1' or 'tftpboot u-boot.sb' the network connection drops. Both commands work fine if I switch back from netconsole to serial in-/output.
This is the output from dmesg: [31620.215354] usb 3-13: USB disconnect, device number 24 [31620.215422] cdc_ether 3-13:1.0 enp0s20u13: unregister 'cdc_ether' usb-0000:00:14.0-13, CDC Ethernet Device
Do I missed something?
I managed to get serial console and netconsole muxing to work, so I can see all the debug messages:
Hit any key to stop autoboot: 5 using ci_udc, OUT ep- IN ep- STATUS ep- MAC 00:19:b8:00:00:02 HOST MAC 00:19:b8:00:00:01 high speed config #1: 2 mA, Ethernet Gadget, using CDC Ethernet USB network up! 0 => tftp u-boot.sb using ci_udc, OUT ep- IN ep- STATUS ep- timeout sending packets to usb ethernet
This reminded me about an issue I had some months ago: http://lists.denx.de/pipermail/u-boot/2014-July/182885.html
I enabled debug output in arch/arm/cpu/arm926ejs/cache.c and I get error: => tftp u-boot.sb using ci_udc, OUT ep- IN ep- STATUS ep- CACHE: Misaligned operation at range [43f7b010, 43f7b070]
I removed the flush_cache() call in cmd_net.c:netboot_common() as suggested by Marek in the thread. But the error message is still there.
Any idea?

On 02/05/2015 04:21 AM, Jörg Krause wrote:
On Di, 2015-02-03 at 22:44 +0100, Jörg Krause wrote:
I followed the instructions from Marek in 'M28 U-Boot Single-Wire Debug preview' to enable NetConsole: http://www.denx-cs.de/?q=blogm28singlewiredebug
I'm using mxsldr to flash the u-boot.sb image to RAM. This is the dmesg output from my host:
[31048.492181] usb 3-13: new high-speed USB device number 24 using xhci_hcd [31048.667617] cdc_ether 3-13:1.0 eth0: register 'cdc_ether' at usb-0000:00:14.0-13, CDC Ethernet Device, 00:19:b8:00:00:01 [31048.669101] cdc_ether 3-13:1.0 enp0s20u13: renamed from eth0 [31048.696758] cdc_ether 3-13:1.0 enp0s20u13: kevent 12 may have been dropped
I noticed the 'kevent 12 may have been dropped' message here.
Nevertheless, NetworkManager shows me a USB Ethernet connection and using './netconsole 10.0.0.2' I am able to communicate with the device, e.g. 'nand info' outputs correctly.
But if I use 'ping 10.0.0.1' or 'tftpboot u-boot.sb' the network connection drops. Both commands work fine if I switch back from netconsole to serial in-/output.
This is the output from dmesg: [31620.215354] usb 3-13: USB disconnect, device number 24 [31620.215422] cdc_ether 3-13:1.0 enp0s20u13: unregister 'cdc_ether' usb-0000:00:14.0-13, CDC Ethernet Device
Do I missed something?
I managed to get serial console and netconsole muxing to work, so I can see all the debug messages:
Hit any key to stop autoboot: 5 using ci_udc, OUT ep- IN ep- STATUS ep- MAC 00:19:b8:00:00:02 HOST MAC 00:19:b8:00:00:01 high speed config #1: 2 mA, Ethernet Gadget, using CDC Ethernet USB network up! 0 => tftp u-boot.sb using ci_udc, OUT ep- IN ep- STATUS ep- timeout sending packets to usb ethernet
This reminded me about an issue I had some months ago: http://lists.denx.de/pipermail/u-boot/2014-July/182885.html
I enabled debug output in arch/arm/cpu/arm926ejs/cache.c and I get error: => tftp u-boot.sb using ci_udc, OUT ep- IN ep- STATUS ep- CACHE: Misaligned operation at range [43f7b010, 43f7b070]
I removed the flush_cache() call in cmd_net.c:netboot_common() as suggested by Marek in the thread. But the error message is still there.
Perhaps make flush_cache() a macro that also passes in the file/line number where it's called from, and print those out along with he "misaligned operation" error message?
(or find some other way to perform a stack dump from within that function).

On Do, 2015-02-05 at 08:33 -0700, Stephen Warren wrote:
On 02/05/2015 04:21 AM, Jörg Krause wrote:
On Di, 2015-02-03 at 22:44 +0100, Jörg Krause wrote:
I followed the instructions from Marek in 'M28 U-Boot Single-Wire Debug preview' to enable NetConsole: http://www.denx-cs.de/?q=blogm28singlewiredebug
I'm using mxsldr to flash the u-boot.sb image to RAM. This is the dmesg output from my host:
[31048.492181] usb 3-13: new high-speed USB device number 24 using xhci_hcd [31048.667617] cdc_ether 3-13:1.0 eth0: register 'cdc_ether' at usb-0000:00:14.0-13, CDC Ethernet Device, 00:19:b8:00:00:01 [31048.669101] cdc_ether 3-13:1.0 enp0s20u13: renamed from eth0 [31048.696758] cdc_ether 3-13:1.0 enp0s20u13: kevent 12 may have been dropped
I noticed the 'kevent 12 may have been dropped' message here.
Nevertheless, NetworkManager shows me a USB Ethernet connection and using './netconsole 10.0.0.2' I am able to communicate with the device, e.g. 'nand info' outputs correctly.
But if I use 'ping 10.0.0.1' or 'tftpboot u-boot.sb' the network connection drops. Both commands work fine if I switch back from netconsole to serial in-/output.
This is the output from dmesg: [31620.215354] usb 3-13: USB disconnect, device number 24 [31620.215422] cdc_ether 3-13:1.0 enp0s20u13: unregister 'cdc_ether' usb-0000:00:14.0-13, CDC Ethernet Device
Do I missed something?
I managed to get serial console and netconsole muxing to work, so I can see all the debug messages:
Hit any key to stop autoboot: 5 using ci_udc, OUT ep- IN ep- STATUS ep- MAC 00:19:b8:00:00:02 HOST MAC 00:19:b8:00:00:01 high speed config #1: 2 mA, Ethernet Gadget, using CDC Ethernet USB network up! 0 => tftp u-boot.sb using ci_udc, OUT ep- IN ep- STATUS ep- timeout sending packets to usb ethernet
This reminded me about an issue I had some months ago: http://lists.denx.de/pipermail/u-boot/2014-July/182885.html
I enabled debug output in arch/arm/cpu/arm926ejs/cache.c and I get error: => tftp u-boot.sb using ci_udc, OUT ep- IN ep- STATUS ep- CACHE: Misaligned operation at range [43f7b010, 43f7b070]
I removed the flush_cache() call in cmd_net.c:netboot_common() as suggested by Marek in the thread. But the error message is still there.
Perhaps make flush_cache() a macro that also passes in the file/line number where it's called from, and print those out along with he "misaligned operation" error message?
(or find some other way to perform a stack dump from within that function).
I've added in each function in ci_udc a printf statement before a cache function is called, eg:
static void ci_flush_qh(int ep_num) { [..] printf("CACHE: flush_dcache_range %s %d \n",__FUNCTION__,__LINE__); flush_dcache_range(start, end); }
I've also looked at all calling functions of flush_cache or flush_dcache_range, but I think there is nothing of relevance.
This is a snippet of the output:
Using usb_ether device CACHE: flush_dcache_range ci_bounce 356 timeout sending packets to usb ethernet CACHE: flush_dcache_range ci_bounce 356 CACHE: flush_dcache_range ci_bounce 356 timeout sending packets to usb ethernet timeout sending packets to usb ethernet CACHE: flush_dcache_range ci_bounce 356 timeout sending packets to usb ethernet CACHE: flush_dcache_range ci_bounce 356 CACHE: flush_dcache_range ci_bounce 356 timeout sending packets to usb ethernet timeout sending packets to usb ethernet CACHE: flush_dcache_range ci_bounce 356 timeout sending packets to usb ethernet CACHE: flush_dcache_range ci_flush_qh 166 CACHE: flush_dcache_range ci_bounce 356 timeout sending packets to usb ethernet CACHE: flush_dcache_range ci_flush_qh 166 CACHE: flush_dcache_range ci_bounce 356 timeout sending packets to usb ethernet CACHE: flush_dcache_range ci_flush_qh 166 CACHE: flush_dcache_range ci_bounce 356 timeout sending packets to usb ethernet CACHE: flush_dcache_range ci_flush_qh 166 CACHE: flush_dcache_range ci_bounce 356 timeout sending packets to usb ethernet CACHE: flush_dcache_range ci_flush_qh 166 CACHE: flush_dcache_range ci_bounce 356 CACHE: Misaligned operation at range [43f7b010, 43f7b070] timeout sending packets to usb ethernet ping failed; host 10.0.0.1 is not alive
I hope this helps.

On 02/05/2015 03:10 PM, Jörg Krause wrote:
On Do, 2015-02-05 at 08:33 -0700, Stephen Warren wrote:
On 02/05/2015 04:21 AM, Jörg Krause wrote:
...
This reminded me about an issue I had some months ago: http://lists.denx.de/pipermail/u-boot/2014-July/182885.html
I enabled debug output in arch/arm/cpu/arm926ejs/cache.c and I get error: => tftp u-boot.sb using ci_udc, OUT ep- IN ep- STATUS ep- CACHE: Misaligned operation at range [43f7b010, 43f7b070]
I removed the flush_cache() call in cmd_net.c:netboot_common() as suggested by Marek in the thread. But the error message is still there.
Perhaps make flush_cache() a macro that also passes in the file/line number where it's called from, and print those out along with he "misaligned operation" error message?
(or find some other way to perform a stack dump from within that function).
I've added in each function in ci_udc a printf statement before a cache function is called, eg:
static void ci_flush_qh(int ep_num) { [..] printf("CACHE: flush_dcache_range %s %d \n",__FUNCTION__,__LINE__); flush_dcache_range(start, end); }
I've also looked at all calling functions of flush_cache or flush_dcache_range, but I think there is nothing of relevance.
This is a snippet of the output:
...
CACHE: flush_dcache_range ci_bounce 356 CACHE: Misaligned operation at range [43f7b010, 43f7b070] timeout sending packets to usb ethernet ping failed; host 10.0.0.1 is not alive
Which git commit did you build, and which board?
I'm curious what values you have for ARCH_DMA_MINALIGN and CONFIG_SYS_CACHELINE_SIZE.
Are you sure flush_dcache_range() is the code printing the alignment error, not e.g. invalidate_dcache_range()?
The reason I ask most of these questions is that line 356 mentioned in your debug spew is in function ci_debounce() in the code I have; no doubt I have a different git commit than you have checked out, and the debug printfs you added changed the line numbers too.
About the only thing I can think of is that:
a) You're not using an up-to-date ci_udc.c; IIRC I fixed quite a few cache issues when I reworked it a while back.
b) In ci_bounce(), the bounce buffer is only allocated if the user-buffer is already aligned, and if a large-enough bounce buffer wasn't previously allocated. If ci_req->b_buf was uninitialized it could be non-zero (thus preventing the expected aligned allocation) yet not actually aligned enough.
c) Perhaps ARCH_DMA_MINALIGN or CONFIG_SYS_CACHELINE_SIZE aren't correct or are inconsistent.
Ah. I note that check_cache_range() in either arch/arm/cpu/arm1136/cpu.c or arch/arm/cpu/arm926ejs/cache.c uses CONFIG_SYS_CACHELINE_SIZE to check for alignment, whereas ci_udc.c uses ARCH_DMA_MINALIGN. Inconsistency between those two could be at fault.

On Do, 2015-02-05 at 15:23 -0700, Stephen Warren wrote:
On 02/05/2015 03:10 PM, Jörg Krause wrote:
On Do, 2015-02-05 at 08:33 -0700, Stephen Warren wrote:
On 02/05/2015 04:21 AM, Jörg Krause wrote:
...
This reminded me about an issue I had some months ago: http://lists.denx.de/pipermail/u-boot/2014-July/182885.html
I enabled debug output in arch/arm/cpu/arm926ejs/cache.c and I get error: => tftp u-boot.sb using ci_udc, OUT ep- IN ep- STATUS ep- CACHE: Misaligned operation at range [43f7b010, 43f7b070]
I removed the flush_cache() call in cmd_net.c:netboot_common() as suggested by Marek in the thread. But the error message is still there.
Perhaps make flush_cache() a macro that also passes in the file/line number where it's called from, and print those out along with he "misaligned operation" error message?
(or find some other way to perform a stack dump from within that function).
I've added in each function in ci_udc a printf statement before a cache function is called, eg:
static void ci_flush_qh(int ep_num) { [..] printf("CACHE: flush_dcache_range %s %d \n",__FUNCTION__,__LINE__); flush_dcache_range(start, end); }
I've also looked at all calling functions of flush_cache or flush_dcache_range, but I think there is nothing of relevance.
This is a snippet of the output:
...
CACHE: flush_dcache_range ci_bounce 356 CACHE: Misaligned operation at range [43f7b010, 43f7b070] timeout sending packets to usb ethernet ping failed; host 10.0.0.1 is not alive
Which git commit did you build, and which board?
Building tag v2015.01 for a custom i.MX28 board which is not upstream. It's configuration mainly based on mx28evk and m28evk.
I'm curious what values you have for ARCH_DMA_MINALIGN and CONFIG_SYS_CACHELINE_SIZE.
I defined CONFIG_SYS_CACHELINE_SIZE 32 in the config. ARCH_DMA_MINALIGN is also 32.
Are you sure flush_dcache_range() is the code printing the alignment error, not e.g. invalidate_dcache_range()?
I've also added a printf statement to all functions in ci_udc which calls invalidate_dcache_range(), but it is not logged when the alignment error occurs. e.g.
using ci_udc, OUT ep- IN ep- STATUS ep- MAC 00:19:b8:00:00:02 HOST MAC 00:19:b8:00:00:01 CACHE: flush_dcache_range ci_flush_qh 166 CACHE: invalidate_dcache_range ci_invalidate_qh 182 CACHE: flush_dcache_range ci_bounce 356 CACHE: flush_dcache_range ci_flush_qtd 199 CACHE: flush_dcache_range ci_flush_qh 166
The reason I ask most of these questions is that line 356 mentioned in your debug spew is in function ci_debounce() in the code I have; no doubt I have a different git commit than you have checked out, and the debug printfs you added changed the line numbers too.
You're right, the debug printfs changed the line number. It's the flush_dcache_range() call at line 348 in ci_bounce(). I checked this after removing all printfs.
About the only thing I can think of is that:
a) You're not using an up-to-date ci_udc.c; IIRC I fixed quite a few cache issues when I reworked it a while back.
Yes, we had a troubleshooting about this last summer. I had problems with timeouts using tftp. Marek and you worked on this issue.
b) In ci_bounce(), the bounce buffer is only allocated if the user-buffer is already aligned, and if a large-enough bounce buffer wasn't previously allocated. If ci_req->b_buf was uninitialized it could be non-zero (thus preventing the expected aligned allocation) yet not actually aligned enough.
Maybe, we should work on this?
c) Perhaps ARCH_DMA_MINALIGN or CONFIG_SYS_CACHELINE_SIZE aren't correct or are inconsistent.
Ah. I note that check_cache_range() in either arch/arm/cpu/arm1136/cpu.c or arch/arm/cpu/arm926ejs/cache.c uses CONFIG_SYS_CACHELINE_SIZE to check for alignment, whereas ci_udc.c uses ARCH_DMA_MINALIGN. Inconsistency between those two could be at fault.
Both are set to 32, so this should not be the problem.

On Do, 2015-02-05 at 15:23 -0700, Stephen Warren wrote:
b) In ci_bounce(), the bounce buffer is only allocated if the user-buffer is already aligned, and if a large-enough bounce buffer wasn't previously allocated. If ci_req->b_buf was uninitialized it could be non-zero (thus preventing the expected aligned allocation) yet not actually aligned enough.
I can reproduce this issue now. After some "timeout sending packets to usb ethernet" messages, the bounce buffer somehow gets corrupted. ci_bounce() is called with an unaligned input buffer length 'req->length=66', but the bounce buffer length 'ci_req->b_len=1140305940' or in hex 'ci_req->b_len=0x43f7b014'. This bounce buffer length is obviously an address, as the following misaligned error message shows: "CACHE: Misaligned operation at range [43f7b010, 43f7b070]".
Both if conditions in 'align:' are not entered.
This is a snippet from my debug output:
timeout sending packets to usb ethernet 1: 43b7e180 - 43b7e200. req->length: 66 b_len_1: 96 b_len_2: 96 5: 43b7e660 - 43b7e6c0
timeout sending packets to usb ethernet 1: 43b7e000 - 43b7e080. req->length: 66 b_len_1: 1140305940 b_len_2: 1140305940 5: 43f7b010 - 43f7b070 CACHE: Misaligned operation at range [43f7b010, 43f7b070] timeout sending packets to usb ethernet ping failed; host 10.0.0.1 is not alive
This is the corresponding code (debug number 1 and 5):
static void ci_flush_qh(int ep_num) { struct ept_queue_head *head = ci_get_qh(ep_num, 0); const uint32_t start = (uint32_t)head; const uint32_t end = start + 2 * sizeof(*head);
printf("1: %x - %x.\n", start, end); flush_dcache_range(start, end); }
[..]
static int ci_bounce(struct ci_req *ci_req, int in) { struct usb_request *req = &ci_req->req; uint32_t addr = (uint32_t)req->buf; uint32_t hwaddr; uint32_t aligned_used_len;
/* Input buffer address is not aligned. */ if (addr & (ARCH_DMA_MINALIGN - 1)) { goto align; }
/* Input buffer length is not aligned. */ if (req->length & (ARCH_DMA_MINALIGN - 1)) { printf("req->length: %d\n", req->length); goto align; }
/* The buffer is well aligned, only flush cache. */ ci_req->hw_len = req->length; ci_req->hw_buf = req->buf; goto flush;
align: printf("b_len_1: %d\n", ci_req->b_len); if (ci_req->b_buf && req->length > ci_req->b_len) { printf("A: %d - %d\n", req->length, ci_req->b_len); free(ci_req->b_buf); ci_req->b_buf = 0; } if (!ci_req->b_buf) { ci_req->b_len = roundup(req->length, ARCH_DMA_MINALIGN); ci_req->b_buf = memalign(ARCH_DMA_MINALIGN, ci_req->b_len); printf("B: %d - %d\n", req->length, ci_req->b_len); if (!ci_req->b_buf) return -ENOMEM; } ci_req->hw_len = ci_req->b_len; ci_req->hw_buf = ci_req->b_buf;
printf("b_len_2: %d\n", ci_req->b_len);
if (in) memcpy(ci_req->hw_buf, req->buf, req->length);
flush: hwaddr = (uint32_t)ci_req->hw_buf; aligned_used_len = roundup(req->length, ARCH_DMA_MINALIGN); printf("5: %x - %x\n", hwaddr, hwaddr + aligned_used_len); flush_dcache_range(hwaddr, hwaddr + aligned_used_len);
return 0; }

On 02/05/2015 06:06 PM, Jörg Krause wrote:
On Do, 2015-02-05 at 15:23 -0700, Stephen Warren wrote:
b) In ci_bounce(), the bounce buffer is only allocated if the user-buffer is already aligned, and if a large-enough bounce buffer wasn't previously allocated. If ci_req->b_buf was uninitialized it could be non-zero (thus preventing the expected aligned allocation) yet not actually aligned enough.
I can reproduce this issue now. After some "timeout sending packets to usb ethernet" messages, the bounce buffer somehow gets corrupted. ci_bounce() is called with an unaligned input buffer length 'req->length=66', but the bounce buffer length 'ci_req->b_len=1140305940' or in hex 'ci_req->b_len=0x43f7b014'. This bounce buffer length is obviously an address, as the following misaligned error message shows: "CACHE: Misaligned operation at range [43f7b010, 43f7b070]".
Ah, I hadn't realized that was [start, length] rather than [start, end].
The question is: How is ci_req->b_len getting corrupted? Is it simply never initialized, or does something trash that value later?
ci_ep_alloc_request() appears to calloc() the whole struct ci_req, so I imagine an initialization/allocating error isn't happening.
The only issue there might be some code somehow creating its own struct usb_request instead of calling into the controller's ->alloc_request() function. I vaguely recall fixing some of those, but might have missed some in protocols that I didn't test (i.e. anything other than USB Mass Storage or DFU, although I might have very briefly tested netconsole once?).
I would suggest adding a whole ton of printfs() to catch where ci_reqs are being allocated, and where ci_req->b_len is getting written in which ci_req objects, and then mapping that back to the ci_req that the cache alignment error message complains about. Sorry, this will be a bit painful.
If the ci_req is always at the same address on different boots of the code, that will make it easier, especially if you have a debugger with a data watchpoint, or can write some code to use any data watchpoint self-hosted debug capability in your CPU.

On Fr, 2015-02-06 at 11:06 -0700, Stephen Warren wrote:
On 02/05/2015 06:06 PM, Jörg Krause wrote:
On Do, 2015-02-05 at 15:23 -0700, Stephen Warren wrote:
b) In ci_bounce(), the bounce buffer is only allocated if the user-buffer is already aligned, and if a large-enough bounce buffer wasn't previously allocated. If ci_req->b_buf was uninitialized it could be non-zero (thus preventing the expected aligned allocation) yet not actually aligned enough.
I can reproduce this issue now. After some "timeout sending packets to usb ethernet" messages, the bounce buffer somehow gets corrupted. ci_bounce() is called with an unaligned input buffer length 'req->length=66', but the bounce buffer length 'ci_req->b_len=1140305940' or in hex 'ci_req->b_len=0x43f7b014'. This bounce buffer length is obviously an address, as the following misaligned error message shows: "CACHE: Misaligned operation at range [43f7b010, 43f7b070]".
Ah, I hadn't realized that was [start, length] rather than [start, end].
The question is: How is ci_req->b_len getting corrupted? Is it simply never initialized, or does something trash that value later?
ci_ep_alloc_request() appears to calloc() the whole struct ci_req, so I imagine an initialization/allocating error isn't happening.
The only issue there might be some code somehow creating its own struct usb_request instead of calling into the controller's ->alloc_request() function. I vaguely recall fixing some of those, but might have missed some in protocols that I didn't test (i.e. anything other than USB Mass Storage or DFU, although I might have very briefly tested netconsole once?).
I would suggest adding a whole ton of printfs() to catch where ci_reqs are being allocated, and where ci_req->b_len is getting written in which ci_req objects, and then mapping that back to the ci_req that the cache alignment error message complains about. Sorry, this will be a bit painful.
If the ci_req is always at the same address on different boots of the code, that will make it easier, especially if you have a debugger with a data watchpoint, or can write some code to use any data watchpoint self-hosted debug capability in your CPU.
I think I found the answer.
I used a lot of debug messages and tried to understand the involved drivers. I will try to sum up my investigations.
NetLoop is entered first for the ping protocol and then for the netconsole protocol: --- NetLoop Entry (PING) --- NetLoop Entry (NETCONSOLE)
ci_udc is probed and ci_ep_alloc_request() is called for EP0: ci_ep_alloc_request usb_ep:0x43fd0028 ci_req:0x43b83220
In eth_bind() ci_ep_alloc_request() is called for status EP: ci_ep_alloc_request usb_ep:0x43fd00a0 ci_req:0x43b7b4a8
Everything is fine: using ci_udc, OUT ep- IN ep- STATUS ep- MAC 00:19:b8:00:00:02 HOST MAC 00:19:b8:00:00:01
In eth_set_config() ci_ep_alloc_request() is invoked again for tx_req and rx_req: ci_ep_alloc_request usb_ep:0x43fd0050 ci_req:0x43b7b568 ci_ep_alloc_request usb_ep:0x43fd0078 ci_req:0x43b7b5b0
Everything is fine: high speed config #1: 2 mA, Ethernet Gadget, using CDC Ethernet USB network up!
Now the NetLoop is available for netconsole: --- NetState set to 0 (CONTINUE) --- NetLoop Init (NETCONSOLE) --- NetLoop ARP handler set (43fa3b14) [..] Got ARP REPLY, set eth addr (00:19:b8:00:00:01) --- NetState set to 2 (SUCCESS) --- NetLoop UDP handler set (00000000) --- NetLoop ARP handler set (00000000) --- NetLoop timeout handler cancelled --- NetLoop Success! (NETCONSOLE)
Now it's pings turn in the NetLoop. The if (eth_is_on_demand_init()) branch is executed. eth_halt() and later eth_disconnect() and eth_reset_config() are invoked. There the in and out EPs are freed: ci_ep_free_request ci_req:0x43b7b568 ci_ep_free_request ci_req:0x43b7b5b0
However, netconsole tries to send every printf() as an UDP packet to the host. sending UDP to 10.0.0.1/00:19:b8:00:00:01
But the usb_request [1] pointer in usb_eth_send is NULL after the free request: usb_eth_send usb_request: 00000000 length: 82
usb_eth_send() calls ci_ep_queue() which calls ci_bounce() and because of the corrupted ci_req pointer I get a misaligned cache.
So, as far as I understand, the main problem is that netconsole does not knows that the connection is disconnected by the NetLoop?
What are your suggestions to this issue? Should we add an addition check for an initialized usb_request pointer in usb_eth_send()? Or somewhere else?
Best regards Jörg Krause
[1] http://git.denx.de/?p=u-boot.git;a=blob;f=drivers/usb/gadget/ether.c;h=ba442...

On 02/08/2015 02:25 PM, Jörg Krause wrote:
On Fr, 2015-02-06 at 11:06 -0700, Stephen Warren wrote:
On 02/05/2015 06:06 PM, Jörg Krause wrote:
On Do, 2015-02-05 at 15:23 -0700, Stephen Warren wrote:
b) In ci_bounce(), the bounce buffer is only allocated if the user-buffer is already aligned, and if a large-enough bounce buffer wasn't previously allocated. If ci_req->b_buf was uninitialized it could be non-zero (thus preventing the expected aligned allocation) yet not actually aligned enough.
I can reproduce this issue now. After some "timeout sending packets to usb ethernet" messages, the bounce buffer somehow gets corrupted. ci_bounce() is called with an unaligned input buffer length 'req->length=66', but the bounce buffer length 'ci_req->b_len=1140305940' or in hex 'ci_req->b_len=0x43f7b014'. This bounce buffer length is obviously an address, as the following misaligned error message shows: "CACHE: Misaligned operation at range [43f7b010, 43f7b070]".
Ah, I hadn't realized that was [start, length] rather than [start, end].
The question is: How is ci_req->b_len getting corrupted? Is it simply never initialized, or does something trash that value later?
ci_ep_alloc_request() appears to calloc() the whole struct ci_req, so I imagine an initialization/allocating error isn't happening.
The only issue there might be some code somehow creating its own struct usb_request instead of calling into the controller's ->alloc_request() function. I vaguely recall fixing some of those, but might have missed some in protocols that I didn't test (i.e. anything other than USB Mass Storage or DFU, although I might have very briefly tested netconsole once?).
I would suggest adding a whole ton of printfs() to catch where ci_reqs are being allocated, and where ci_req->b_len is getting written in which ci_req objects, and then mapping that back to the ci_req that the cache alignment error message complains about. Sorry, this will be a bit painful.
If the ci_req is always at the same address on different boots of the code, that will make it easier, especially if you have a debugger with a data watchpoint, or can write some code to use any data watchpoint self-hosted debug capability in your CPU.
I think I found the answer.
I used a lot of debug messages and tried to understand the involved drivers. I will try to sum up my investigations.
NetLoop is entered first for the ping protocol and then for the netconsole protocol: --- NetLoop Entry (PING) --- NetLoop Entry (NETCONSOLE)
ci_udc is probed and ci_ep_alloc_request() is called for EP0: ci_ep_alloc_request usb_ep:0x43fd0028 ci_req:0x43b83220
In eth_bind() ci_ep_alloc_request() is called for status EP: ci_ep_alloc_request usb_ep:0x43fd00a0 ci_req:0x43b7b4a8
Everything is fine: using ci_udc, OUT ep- IN ep- STATUS ep- MAC 00:19:b8:00:00:02 HOST MAC 00:19:b8:00:00:01
In eth_set_config() ci_ep_alloc_request() is invoked again for tx_req and rx_req: ci_ep_alloc_request usb_ep:0x43fd0050 ci_req:0x43b7b568 ci_ep_alloc_request usb_ep:0x43fd0078 ci_req:0x43b7b5b0
Everything is fine: high speed config #1: 2 mA, Ethernet Gadget, using CDC Ethernet USB network up!
Now the NetLoop is available for netconsole: --- NetState set to 0 (CONTINUE) --- NetLoop Init (NETCONSOLE) --- NetLoop ARP handler set (43fa3b14) [..] Got ARP REPLY, set eth addr (00:19:b8:00:00:01) --- NetState set to 2 (SUCCESS) --- NetLoop UDP handler set (00000000) --- NetLoop ARP handler set (00000000) --- NetLoop timeout handler cancelled --- NetLoop Success! (NETCONSOLE)
Now it's pings turn in the NetLoop. The if (eth_is_on_demand_init()) branch is executed. eth_halt() and later eth_disconnect() and eth_reset_config() are invoked. There the in and out EPs are freed: ci_ep_free_request ci_req:0x43b7b568 ci_ep_free_request ci_req:0x43b7b5b0
That sounds like a problem.
I would assume that once netconsole is enabled, the Ethernet connection is no longer on-demand, but should be initialized once, and then never taken down.
Take a look at the definition of eth_is_on_demand_init() in include/net.h. It has a special case for CONFIG_NETCONSOLE, based on variable net_loop_last_protocol. Perhaps that isn't getting set correctly, or perhaps the variable is getting set to something other than netconsole, thus causing the connection to get taken down?
However, netconsole tries to send every printf() as an UDP packet to the host. sending UDP to 10.0.0.1/00:19:b8:00:00:01
But the usb_request [1] pointer in usb_eth_send is NULL after the free request: usb_eth_send usb_request: 00000000 length: 82
usb_eth_send() calls ci_ep_queue() which calls ci_bounce() and because of the corrupted ci_req pointer I get a misaligned cache.
So, as far as I understand, the main problem is that netconsole does not knows that the connection is disconnected by the NetLoop?
What are your suggestions to this issue? Should we add an addition check for an initialized usb_request pointer in usb_eth_send()? Or somewhere else?
It might be a useful debugging/validation aid to add a check somewhere for valid non-free struct usb_requests. However, it sounds like that's just a symptom; the underlying problem is that the Ethernet connection is taken down at all, while netconsole is active.

On Mo, 2015-02-09 at 10:38 -0700, Stephen Warren wrote:
On 02/08/2015 02:25 PM, Jörg Krause wrote:
On Fr, 2015-02-06 at 11:06 -0700, Stephen Warren wrote:
On 02/05/2015 06:06 PM, Jörg Krause wrote:
On Do, 2015-02-05 at 15:23 -0700, Stephen Warren wrote:
b) In ci_bounce(), the bounce buffer is only allocated if the user-buffer is already aligned, and if a large-enough bounce buffer wasn't previously allocated. If ci_req->b_buf was uninitialized it could be non-zero (thus preventing the expected aligned allocation) yet not actually aligned enough.
I can reproduce this issue now. After some "timeout sending packets to usb ethernet" messages, the bounce buffer somehow gets corrupted. ci_bounce() is called with an unaligned input buffer length 'req->length=66', but the bounce buffer length 'ci_req->b_len=1140305940' or in hex 'ci_req->b_len=0x43f7b014'. This bounce buffer length is obviously an address, as the following misaligned error message shows: "CACHE: Misaligned operation at range [43f7b010, 43f7b070]".
Ah, I hadn't realized that was [start, length] rather than [start, end].
The question is: How is ci_req->b_len getting corrupted? Is it simply never initialized, or does something trash that value later?
ci_ep_alloc_request() appears to calloc() the whole struct ci_req, so I imagine an initialization/allocating error isn't happening.
The only issue there might be some code somehow creating its own struct usb_request instead of calling into the controller's ->alloc_request() function. I vaguely recall fixing some of those, but might have missed some in protocols that I didn't test (i.e. anything other than USB Mass Storage or DFU, although I might have very briefly tested netconsole once?).
I would suggest adding a whole ton of printfs() to catch where ci_reqs are being allocated, and where ci_req->b_len is getting written in which ci_req objects, and then mapping that back to the ci_req that the cache alignment error message complains about. Sorry, this will be a bit painful.
If the ci_req is always at the same address on different boots of the code, that will make it easier, especially if you have a debugger with a data watchpoint, or can write some code to use any data watchpoint self-hosted debug capability in your CPU.
I think I found the answer.
I used a lot of debug messages and tried to understand the involved drivers. I will try to sum up my investigations.
NetLoop is entered first for the ping protocol and then for the netconsole protocol: --- NetLoop Entry (PING) --- NetLoop Entry (NETCONSOLE)
ci_udc is probed and ci_ep_alloc_request() is called for EP0: ci_ep_alloc_request usb_ep:0x43fd0028 ci_req:0x43b83220
In eth_bind() ci_ep_alloc_request() is called for status EP: ci_ep_alloc_request usb_ep:0x43fd00a0 ci_req:0x43b7b4a8
Everything is fine: using ci_udc, OUT ep- IN ep- STATUS ep- MAC 00:19:b8:00:00:02 HOST MAC 00:19:b8:00:00:01
In eth_set_config() ci_ep_alloc_request() is invoked again for tx_req and rx_req: ci_ep_alloc_request usb_ep:0x43fd0050 ci_req:0x43b7b568 ci_ep_alloc_request usb_ep:0x43fd0078 ci_req:0x43b7b5b0
Everything is fine: high speed config #1: 2 mA, Ethernet Gadget, using CDC Ethernet USB network up!
Now the NetLoop is available for netconsole: --- NetState set to 0 (CONTINUE) --- NetLoop Init (NETCONSOLE) --- NetLoop ARP handler set (43fa3b14) [..] Got ARP REPLY, set eth addr (00:19:b8:00:00:01) --- NetState set to 2 (SUCCESS) --- NetLoop UDP handler set (00000000) --- NetLoop ARP handler set (00000000) --- NetLoop timeout handler cancelled --- NetLoop Success! (NETCONSOLE)
Now it's pings turn in the NetLoop. The if (eth_is_on_demand_init()) branch is executed. eth_halt() and later eth_disconnect() and eth_reset_config() are invoked. There the in and out EPs are freed: ci_ep_free_request ci_req:0x43b7b568 ci_ep_free_request ci_req:0x43b7b5b0
That sounds like a problem.
I would assume that once netconsole is enabled, the Ethernet connection is no longer on-demand, but should be initialized once, and then never taken down.
I took me a while to fully understand the concept behind the net driver. The initial suppose for the net driver was to halt the underlying ethernet driver after each protocol communication. This was changed for speed-up purpose of netconsole. However, I think this broke the communication for other protocols used together with netconsole.
Take a look at the definition of eth_is_on_demand_init() in include/net.h. It has a special case for CONFIG_NETCONSOLE, based on variable net_loop_last_protocol. Perhaps that isn't getting set correctly, or perhaps the variable is getting set to something other than netconsole, thus causing the connection to get taken down?
The cause it, that after running a ping or tftp command, the ethernet connection is halt. I managed to get it working for those two protocols by including them to eth_is_on_demand_init() and some other parts in the net, but I'm not quite satisfied with the solution. I will discuss the with Joe in another mail.
However, netconsole tries to send every printf() as an UDP packet to the host. sending UDP to 10.0.0.1/00:19:b8:00:00:01
But the usb_request [1] pointer in usb_eth_send is NULL after the free request: usb_eth_send usb_request: 00000000 length: 82
usb_eth_send() calls ci_ep_queue() which calls ci_bounce() and because of the corrupted ci_req pointer I get a misaligned cache.
So, as far as I understand, the main problem is that netconsole does not knows that the connection is disconnected by the NetLoop?
What are your suggestions to this issue? Should we add an addition check for an initialized usb_request pointer in usb_eth_send()? Or somewhere else?
It might be a useful debugging/validation aid to add a check somewhere for valid non-free struct usb_requests. However, it sounds like that's just a symptom; the underlying problem is that the Ethernet connection is taken down at all, while netconsole is active.
You're right, it's just a symptom, but it would help in early detection of possible bugs in any of the other drivers. Like the cache misaligned debug message.

On Tue, Feb 3, 2015 at 3:44 PM, Jörg Krause jkrause@posteo.de wrote:
But if I use 'ping 10.0.0.1' or 'tftpboot u-boot.sb' the network connection drops. Both commands work fine if I switch back from netconsole to serial in-/output.
This is the output from dmesg: [31620.215354] usb 3-13: USB disconnect, device number 24 [31620.215422] cdc_ether 3-13:1.0 enp0s20u13: unregister 'cdc_ether' usb-0000:00:14.0-13, CDC Ethernet Device
One thing to note is that the network stack generally is designed to sit in a state of inactive (i.e. devices halted / inactive). When a network command is issued, the driver is initialized, then the command runs, then the driver is halted again.
NetConsole breaks this assumption, since the network stack needs to be up the whole time it is selected. Net console used to bring the network driver up and down for every character it sent. Naturally this was a huge problem on USB network adapters that don't come up fast or any other that doesn't. As a workaround (and its current state) when net console is used, the network stack is lied to about the state of the driver (telling it the driver is halted or inited) when the current and previous packets are net console packets. When a different type (ping or tftp, etc) of network packet is sent, the driver is actually brought down and back up to ensure the assumptions about the network stack hold true. With more work we can potentially make these better when net console is enabled.
Cheers, -Joe

Hi Joe,
On Do, 2015-02-05 at 13:20 -0600, Joe Hershberger wrote:
On Tue, Feb 3, 2015 at 3:44 PM, Jörg Krause jkrause@posteo.de wrote:
But if I use 'ping 10.0.0.1' or 'tftpboot u-boot.sb' the network connection drops. Both commands work fine if I switch back from netconsole to serial in-/output.
This is the output from dmesg: [31620.215354] usb 3-13: USB disconnect, device number 24 [31620.215422] cdc_ether 3-13:1.0 enp0s20u13: unregister 'cdc_ether' usb-0000:00:14.0-13, CDC Ethernet Device
One thing to note is that the network stack generally is designed to sit in a state of inactive (i.e. devices halted / inactive). When a network command is issued, the driver is initialized, then the command runs, then the driver is halted again.
NetConsole breaks this assumption, since the network stack needs to be up the whole time it is selected. Net console used to bring the network driver up and down for every character it sent. Naturally this was a huge problem on USB network adapters that don't come up fast or any other that doesn't. As a workaround (and its current state) when net console is used, the network stack is lied to about the state of the driver (telling it the driver is halted or inited) when the current and previous packets are net console packets. When a different type (ping or tftp, etc) of network packet is sent, the driver is actually brought down and back up to ensure the assumptions about the network stack hold true. With more work we can potentially make these better when net console is enabled.
Thank you for the explanation! What do you mean with more work? Do you have already some ideas in mind?
Best regards Jörg

On Thu, Feb 5, 2015 at 2:39 PM, Jörg Krause jkrause@posteo.de wrote:
Hi Joe,
On Do, 2015-02-05 at 13:20 -0600, Joe Hershberger wrote:
On Tue, Feb 3, 2015 at 3:44 PM, Jörg Krause jkrause@posteo.de wrote:
But if I use 'ping 10.0.0.1' or 'tftpboot u-boot.sb' the network connection drops. Both commands work fine if I switch back from netconsole to serial in-/output.
This is the output from dmesg: [31620.215354] usb 3-13: USB disconnect, device number 24 [31620.215422] cdc_ether 3-13:1.0 enp0s20u13: unregister 'cdc_ether' usb-0000:00:14.0-13, CDC Ethernet Device
One thing to note is that the network stack generally is designed to sit in a state of inactive (i.e. devices halted / inactive). When a network command is issued, the driver is initialized, then the command runs, then the driver is halted again.
NetConsole breaks this assumption, since the network stack needs to be up the whole time it is selected. Net console used to bring the network driver up and down for every character it sent. Naturally this was a huge problem on USB network adapters that don't come up fast or any other that doesn't. As a workaround (and its current state) when net console is used, the network stack is lied to about the state of the driver (telling it the driver is halted or inited) when the current and previous packets are net console packets. When a different type (ping or tftp, etc) of network packet is sent, the driver is actually brought down and back up to ensure the assumptions about the network stack hold true. With more work we can potentially make these better when net console is enabled.
Thank you for the explanation! What do you mean with more work? Do you have already some ideas in mind?
I think it might be possible to work through the reasons that these network functions assume the network interface should be down when they begin and for certain special cases not bring them down at all when net console is enabled. I think this is likely to be non-trivial, though. If you look in include/net.h you'll see eth_is_on_demand_init(). Changing the logic here would be the first step to testing.
Cheers, -Joe

On Do, 2015-02-05 at 14:48 -0600, Joe Hershberger wrote:
On Thu, Feb 5, 2015 at 2:39 PM, Jörg Krause jkrause@posteo.de wrote:
Hi Joe,
On Do, 2015-02-05 at 13:20 -0600, Joe Hershberger wrote:
On Tue, Feb 3, 2015 at 3:44 PM, Jörg Krause jkrause@posteo.de
wrote:
But if I use 'ping 10.0.0.1' or 'tftpboot u-boot.sb' the network connection drops. Both commands work fine if I switch back from netconsole to serial in-/output.
This is the output from dmesg: [31620.215354] usb 3-13: USB disconnect, device number
24
[31620.215422] cdc_ether 3-13:1.0 enp0s20u13: unregister 'cdc_ether' usb-0000:00:14.0-13, CDC Ethernet Device
One thing to note is that the network stack generally is designed
to
sit in a state of inactive (i.e. devices halted / inactive). When
a
network command is issued, the driver is initialized, then the
command
runs, then the driver is halted again.
NetConsole breaks this assumption, since the network stack needs
to be
up the whole time it is selected. Net console used to bring the network driver up and down for every character it sent. Naturally this was a huge problem on USB network adapters that don't come up fast or any other that doesn't. As a workaround (and its current state) when net console is used, the network stack is lied to
about
the state of the driver (telling it the driver is halted or
inited)
when the current and previous packets are net console packets.
When a
different type (ping or tftp, etc) of network packet is sent, the driver is actually brought down and back up to ensure the
assumptions
about the network stack hold true. With more work we can
potentially
make these better when net console is enabled.
Thank you for the explanation! What do you mean with more work? Do
you
have already some ideas in mind?
I think it might be possible to work through the reasons that these network functions assume the network interface should be down when they begin and for certain special cases not bring them down at all when net console is enabled. I think this is likely to be non-trivial, though. If you look in include/net.h you'll see eth_is_on_demand_init(). Changing the logic here would be the first step to testing.
I think I read sth about that the network interface should be down or in a well-defined state for handling over the control to the Linux kernel. There where some problems with the Linux drivers, if I remember rigthly.
I will take a closer look at eth_is_on_demand_init. Thank you!
participants (3)
-
Joe Hershberger
-
Jörg Krause
-
Stephen Warren