
Jerry Van Baren wrote:
Larry Johnson wrote:
The ECC POST reported intermittent failures running after power-up on the Korat PPC440EPx board. Even when the test passed, the debugging output occasionally reported additional unexpected ECC errors.
This refactoring had two main objectives: (1) minimize the code executed with ECC enabled during the tests, and (2) add more checking of the results so any unexpected ECC errors would cause the test to fail.
So far, the refactored test has not reported any intermittent failures. Further, synchronization instructions appear no longer to be require, so have been removed. If intermittent failures do occur in the future, the refactoring should make the causes easier to identify.
WHOOP, WHOOP, WHOOP, red alert! "[S]ynchronization instructions appear no longer to be require[d], so have been removed".
Synchronization instructions either *ARE* required or *ARE NOT* required, there is no "appear". When sync instructions appear to not be required, but actually are required, that is when really obscure bugs start happening in the dead of winter off the coast of Alaska / Siberia and your boss asks you if you have warm clothes.
I am not familiar with the 4xx family or the PowerPC core that is used in it but...
[snip]
-static int test_ecc(unsigned long ecc_addr) +static int test_ecc(uint32_t ecc_addr) {
- unsigned long value;
- volatile unsigned *const ecc_mem = (volatile unsigned *) ecc_addr;
- int pret;
- uint32_t value;
- volatile uint32_t *const ecc_mem = (volatile uint32_t *)ecc_addr; int ret = 0;
- sync();
- eieio(); WATCHDOG_RESET();
The combination of "sync" and "eieio" is a strong indication of someone sprinkling pixie dust rather than understanding the problem.
"Sync" forces all pending I/O (read/write) operations to be completed all the way to memory/hardware register before the instruction continues. Sync guarantees *WHEN* the I/O will complete: NOW. This is a big hammer: it can cause a significant performance hit because it stalls the processor BUT it is guaranteed effective (except for the places that need an both an isync and a sync combination - thankfully, I believe that is only needed in special cases when playing with the processor's control registers).
"Eieio" (enforce in-order execution of I/O) is a barrier that says all I/O that goes before it must be completed before any I/O that goes after it is started. It *DOES NOT* guarantee *WHEN* the preceding reads/writes will be completed. Theoretically, the bus interface unit (BIU) could hold the whole shootin' match for 10 minutes before it does the preceding I/O followed by the succeeding I/O. Eieio is much less draconian to the processor than sync (which is why eieios are preferred) but an eieio may or may not cause the intended synchronizing result if you are relying on a write or read causing the proper effect *NOW*. Note that eieios are NOPs to processor cores that don't reorder I/O.
Some PowerPC cores (e.g. the 74xx family) can reorder reads and writes in the bus interface unit (some cores, such as the 603e, do *not* reorder reads and writes). This is a performance enhancement... writes (generally) are non-blocking to the processor core where a read causes the processor to have to wait for the data (which cascades into pipeline stalls and performance hits). The bus is a highly oversubscribed resource (core speed / bus speed can be 8x or more). As a result, you want to get reads done ASAP (if possible) and thus it is beneficial to move a read ahead of a write.
As you should have picked up by now, a sync (forcing all I/O to complete) followed by eieio is silly - the eieio is superfluous. Seeing syncs/isyncs/eieios sprinkled in code is an indication that the author didn't understand what was going on and, as a result, kept hitting the problem with a bigger and bigger hammer until it appeared to have gone away.
Besides read/write reordering problems, the bus interface unit (BIU) can "short circuit" a read that follows a write to the same address. This is very likely to be implemented in a given core - it offers a very good speed up traded off against a modest increase in complexity to the BIU. The problem is (for instance), if you configure your EDC to store an invalid EDC flag, do a write to a test location (which gets held in the BIU because the bus is busy), followed by a read of the test location (expecting to see an EDC failure), the BIU could return the queued *but unwritten* write value.
OK, enough lecturing...
Repeated disclaimer: What I write here is applicable for more complex PowerPC implementations. It may not be applicable for the particular 4xx core you are running on. I am not familiar with the 4xx core.
The reason sync/eieio is very likely VITAL in a EDC test is that...
- The EDC is being reconfigured in a way that can cause latent EDC
faults. If a write - for instance a save of a register on the stack - gets deferred inadvertently until _after_ the EDC hardware is configured to test an error, you could end up with the register save performed with inadvertently screwed up EDC. As a result, when your code executes the return postlog (popping the register off the stack) you will get a totally unexpected EDC error.
- Even if an eieio is used properly, the (EDC reconfiguration, write,
read, EDC fixup) sequence may occur in the right order, but it may occur *way later* than you expected which could cause an EDC exception way later in the code than you expected. This would lead to very flaky results, unexpected EDC failures, etc. Hmmmmm.
While the previous scenarios are worst cases, improper sync discipline can cause test failures as well. In fact, it is actually more likely to cause problems with the test that the worst case scenario. For instance, if the BIU holds a write and short-circuits subsequent reads, you may *think* you are testing EDC but, if the BIU has the write queued and the read comes from the BIU rather than actual memory, the BIU will inadvertently short circuit your test as well.
By the way, if interrupts are enabled during this time.......... (shudders) oooh, good choice to run polled, Dan/Wolfgang!
[major snip]
HTH, gvb
Yes, it does help. Thanks, Jerry.
When I first modified the (then) LWMON5 ECC POST to run on Korat and other 440EPx boards, Stefan urged me to replace the memory accesses using volatile pointers with accesses via "in_be32()" et al. As I understood it, this should have eliminated the need for any external synchronization. However, after checking the PPC440 documetation, I now believe that there should be a "sync" (actually, an "msync", which is the same opcode) between the memory access and access to the SDRAM- controller registers.
(From what I can tell, "in_be32()" et al. do not not force completion of the storage access before returning. Is this correct?)
Stefan, please hold of on this patch, as I expect to be resubmitting it soon. :-)
Best regards, Larry