
Dear "Moffett, Kyle D",
In message 613C8F89-3CE5-4C28-A48E-D5C3E8143A4C@boeing.com you wrote:
On our boards, when the "reset" button is pressed in hardware, both processor modules on the board and all the attached hardware reset at the same time.
OK. So a sane design would provide a way for both of the processors to do the same, for example by toggling some GPIO or similar.
If just *one* of the 2 CPUs triggers the reset then only *some* of the attached hardware will be properly reset due to a hardware erratum, and as a result the board will sometimes hang or corrupt DMA transfers to the SSDs shortly after reset.
...
Yes, it's a royal pain, but we're stuck with this hardware for the time being, and if the board can't communicate then it might as well hang() anyways.
Do you agree that this is a highly board-specific problem (I would call it a hardware bug, but I don't insist you agree on that term), and that while you do need to work around such behaviour, there is little or no reason to do this, or anything like it, in common code?
And if there are more things that could be done to provide a "better" reset, then why should we not always do these?
If the board is in a panic() state it may well have still-running DMA transfers (such as USB URBs), or be in the middle of writing to FLASH.
The same (at least having USB or other drivers still enabled, and USB writing its SOF counters to RAM) can happen for any call to the reset() function. I see no reason to assume there would be better or worse conditions to perform a reset.
Performing a jump to early-boot code which is only ever tested when everything is OK and devices are properly initialized is a great way to cause data corruption.
If there is a software way to prevent such issues, then these steps should always be performed.
I know for a fact that our boards would rather hang forever than try to reset without cooperation from the other CPU.
As mentioned above, this is a board specific issue that should not influence common code design.
While I was going through the hooks I noticed that several of them were explicitly NOT safe if the board was in the middle of a panic() for whatever
Can you please provide some specific examples? I don't understand what you are talking about.
Ok, using the ppmc7xx board as an example:
    /* Disable and invalidate cache */
    icache_disable();
    dcache_disable();

    /* Jump to cold reset point (in RAM) */
    _start();

    /* Should never get here */
    while(1)
        ;
This board uses the EEPRO100 driver, which appears to set up statically allocated TX and RX rings which the device performs DMA to/from.
If this board starts receiving packets and then panic()s, it will disable address translation and immediately re-relocate U-Boot into RAM, then zero the BSS. If the network card tries to receive a packet after BSS is zeroed, it will read a packet buffer address of (probably) 0x0 from the RX ring and promptly overwrite part of U-Boot's memory at that address.
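The sequence described above can be reproduced in a small host-side simulation. This is only a sketch and not U-Boot or EEPRO100 code: the `sim_*` names, the sizes, and the layout are all made up for illustration. "RAM" is a byte array, descriptors hold offsets into it, and offset 0 stands in for address 0x0 where the U-Boot image is assumed to live.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RAM_SIZE  256
#define RING_SIZE 4
#define PKT_SIZE  16
#define IMAGE_PATTERN 0xAA	/* marks the bytes standing in for U-Boot's image */

/* Hypothetical RX descriptor: just the offset of a packet buffer in ram[]. */
struct rx_desc {
	uint32_t buf_off;
};

struct sim_board {
	uint8_t ram[RAM_SIZE];		/* offset 0 plays the role of address 0x0 */
	struct rx_desc ring[RING_SIZE];	/* statically allocated, i.e. in "BSS" */
	int dma_enabled;		/* may the device still write to RAM? */
};

/* Bring the simulated board up: image at offset 0, buffers well away from it. */
void sim_init(struct sim_board *b)
{
	size_t i;

	memset(b, 0, sizeof(*b));
	memset(b->ram, IMAGE_PATTERN, PKT_SIZE);
	for (i = 0; i < RING_SIZE; i++)
		b->ring[i].buf_off = 128 + (uint32_t)(i * PKT_SIZE);
	b->dma_enabled = 1;
}

/* What re-running early boot code does to the ring: every field becomes 0. */
void sim_zero_bss(struct sim_board *b)
{
	memset(b->ring, 0, sizeof(b->ring));
}

/* The step the reset path is missing: tell the device to stop DMA. */
void sim_stop_dma(struct sim_board *b)
{
	b->dma_enabled = 0;
}

/* Device receives a packet and DMAs it to wherever the descriptor points.
 * Returns 1 if RAM was written, 0 if DMA was already stopped. */
int sim_rx_packet(struct sim_board *b, unsigned int slot, uint8_t fill)
{
	if (!b->dma_enabled)
		return 0;
	memset(&b->ram[b->ring[slot % RING_SIZE].buf_off], fill, PKT_SIZE);
	return 1;
}

/* Returns 1 if the bytes standing in for the U-Boot image are untouched. */
int sim_image_intact(const struct sim_board *b)
{
	size_t i;

	for (i = 0; i < PKT_SIZE; i++)
		if (b->ram[i] != IMAGE_PATTERN)
			return 0;
	return 1;
}
```

Calling sim_zero_bss() and then sim_rx_packet() overwrites the image at offset 0, while calling sim_stop_dma() first leaves it intact.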
Agreed. So this should be fixed. One clean way to fix it would be to help improving the driver model for U-Boot (read: create one) and making sure drivers get deinitialized in such a case.
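To illustrate what "making sure drivers get deinitialized" could look like, here is a minimal sketch of a quiesce-hook list that a reset path might walk before jumping anywhere. It is an assumption for discussion, not an existing U-Boot interface: `quiesce_register()`, `quiesce_all()`, and the `fake_nic` driver are hypothetical names.

```c
#include <stddef.h>

#define MAX_QUIESCE_HOOKS 8

/* A driver-supplied callback that stops the device from touching RAM. */
typedef void (*quiesce_fn)(void *priv);

struct quiesce_hook {
	quiesce_fn fn;
	void *priv;
};

static struct quiesce_hook hooks[MAX_QUIESCE_HOOKS];
static size_t nr_hooks;

/* Called by a driver once it has enabled DMA. Returns 0 on success,
 * -1 if fn is NULL or the table is full. */
int quiesce_register(quiesce_fn fn, void *priv)
{
	if (!fn || nr_hooks >= MAX_QUIESCE_HOOKS)
		return -1;
	hooks[nr_hooks].fn = fn;
	hooks[nr_hooks].priv = priv;
	nr_hooks++;
	return 0;
}

/* Called from the reset path: run every hook once, newest first, so that
 * devices are shut down in the reverse order of their initialization.
 * Returns the number of hooks that were run. */
size_t quiesce_all(void)
{
	size_t ran = 0;

	while (nr_hooks > 0) {
		struct quiesce_hook *h = &hooks[--nr_hooks];

		h->fn(h->priv);
		ran++;
	}
	return ran;
}

/* Hypothetical driver state and its quiesce callback. */
struct fake_nic {
	int rx_enabled;
};

void fake_nic_stop(void *priv)
{
	struct fake_nic *nic = priv;

	nic->rx_enabled = 0;
}
```

A driver would call quiesce_register(fake_nic_stop, &nic) after enabling its receiver, and every reset() path, whether reached through panic() or not, would call quiesce_all() first. Because the hooks are removed as they run, a second call is a no-op.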
Since the panic() path is so infrequently used and tested, it's better to be safe and hang() on the boards which do not have a reliable hardware-level reset than it is to cause undefined behavior or potentially corrupt data.
I disagree. Instead of adding somewhat obscure alternate code paths (which get tested even less frequently) we should focus on fixing such problems where we run into them.
Best regards,
Wolfgang Denk