[U-Boot] AMCC 405EX Trap

All,
I am experiencing a machine check on a custom AMCC 405EX PPC board. Our board is based on the AMCC Kilauea evaluation board. We have a few of these boards that are up and running, but I am trying to track down a machine check error on a couple of them.
My question for you is this: when the registers are printed to the console, there is one called TRAP. I want to know how/where/when and with what data that gets populated. I have read through the AMCC manuals a couple of times trying to find it and have searched through the U-Boot code to no avail. All I know is that there is a data type "struct pt_regs*" that contains all that data, but nowhere can I find where it is populated.
Below is the console output. The line "!!!! PAUSE !!!!" was inserted by me after I copied the text from the console to remind me of the ~20 second pause that occurs at that point.
I am hoping that someone can point me to the bit definitions for whatever register is being displayed in TRAP. From there, I think I can trace the problem back to the specific piece of hardware and get it fixed.
Thanks!
Jonathan
U-Boot 1.3.4 (Apr 28 2009 - 16:10:06)
CPU: AMCC PowerPC 405EX Rev. C at 400 MHz (PLB=200, OPB=100, EBC=100 MHz)
Security support
Bootstrap Option C - Boot ROM Location EBC (16 bits)
16 kB I-Cache 16 kB D-Cache
Board: SDLPPC - RT PPC405EX Board
I2C: ready
DRAM: 256 MB
Reserving 16384k for kernel logbuffer at 0fffb000
Top of RAM usable for U-Boot at: 0fffb000
Reserving 306k for U-Boot at: 0ffae000
Reserving 1040k for malloc() at: 0feaa000
Reserving 124 Bytes for Board Info at: 0fea9f84
Reserving 64 Bytes for Global Data at: 0fea9f44
Stack Pointer at: 0fea9f28
New Stack Pointer is: 0fea9f28
!!!! PAUSE !!!!
Now running in RAM - U-Boot at: 0ffae000
-> Initializing logBuff pointers...
-> Calling post_output_backlog()...
-> Calling post_reloc()...
-> Sync'ing CPU...
-> Setting up trap handlers...
Bus Fault @ 0x00000000, fixup 0x00000000
Machine Check Exception.
Caused by (from msr): regs 0fea9de8 Instruction Synchronous Machine Check exception
NIP: 00000000 XER: 20000000 LR: 0FFB071C REGS: 0fea9de8 TRAP: 0200 DEAR: 00000000
MSR: 00000000 EE: 0 PR: 0 FP: 0 ME: 0 IR/DR: 00
GPR00: 0FFB039C 0FEA9ED8 0FEA9F44 0FFAE000 0FFB3D6C 00000001 00000001 00021000
GPR08: 00000600 00002098 17D78400 00000002 2BA7DEF3 FFFFFFFF 0FFF1D00 1000E000
GPR16: 775DF377 FFFFF6FF FFFFFFFF FF5FF7FF FFFFFFFF FFFFFFFF FFDFFFFF FF7FFFFF
GPR24: 0000A000 0FEA9F44 0FFAE000 0FEA9F84 0FEAE000 0FEA9F84 0FFF1ED8 00000000
Call backtrace:
0FFB3D64 0FFB06A4
machine check
-- Jonathan R. Haws Electrical Engineering Space Dynamics Laboratory
Jonathan.Haws@sdl.usu.edu (435)797-4629
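A note on the TRAP question, offered as a hedged reading rather than a confirmed answer: on PowerPC the exception entry code conventionally stores the offset of the exception vector that was taken in the trap field of struct pt_regs, so TRAP is probably best read as a vector offset rather than as a hardware register with bit definitions. On 40x cores 0x0200 is the machine check vector, which matches the "Machine Check Exception" banner. Whether U-Boot 1.3.4 fills the field exactly this way is an assumption worth confirming in the start.S for this CPU. The host-side Python sketch below decodes the two summary lines of the dump; `parse_dump`, `describe`, and the lookup tables are illustrative names, not U-Boot code.

```python
import re

# Standard PowerPC MSR bit masks for the flags the dump summarises.
MSR_BITS = {"EE": 0x8000, "PR": 0x4000, "FP": 0x2000,
            "ME": 0x1000, "IR": 0x0020, "DR": 0x0010}

# Subset of the fixed exception vector offsets on PowerPC 40x cores.
VECTORS = {
    0x0100: "critical input",
    0x0200: "machine check",
    0x0300: "data storage",
    0x0400: "instruction storage",
    0x0500: "external interrupt",
    0x0600: "alignment",
    0x0700: "program",
    0x0C00: "system call",
}

# The two register summary lines copied from the console output above.
DUMP = (
    "NIP: 00000000 XER: 20000000 LR: 0FFB071C REGS: 0fea9de8 "
    "TRAP: 0200 DEAR: 00000000\n"
    "MSR: 00000000 EE: 0 PR: 0 FP: 0 ME: 0 IR/DR: 00"
)


def parse_dump(text):
    """Return {name: int} for every 'NAME: hex' pair in the dump text."""
    pairs = re.findall(r"\b([A-Z][A-Z0-9/]*):\s*([0-9A-Fa-f]+)\b", text)
    return {name: int(value, 16) for name, value in pairs}


def describe(regs):
    """Map TRAP to a vector name and expand MSR into individual flags."""
    vector = VECTORS.get(regs["TRAP"], "unknown vector")
    flags = {name: bool(regs["MSR"] & mask) for name, mask in MSR_BITS.items()}
    return vector, flags


if __name__ == "__main__":
    registers = parse_dump(DUMP)
    vector, flags = describe(registers)
    print("TRAP 0x%04X -> %s" % (registers["TRAP"], vector))
    print("MSR flags:", flags)
    print("LR 0x%08X (return address of the last call)" % registers["LR"])
```

Read this way, NIP = 00000000 together with TRAP = 0200 says that the machine check was taken while fetching from address 0, and LR points back into relocated U-Boot code.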

On 4/29/09 12:45 PM, Jonathan Haws wrote:
> I am experiencing a machine check on a custom AMCC 405EX PPC board. Our board is based on the AMCC Kilauea evaluation board. We have a few of these boards that are up and running, but I am trying to track down a machine check error on a couple of them.
> My question for you is this: when the registers are printed to the console, there is one called TRAP. I want to know how/where/when and with what data that gets populated. I have read through the AMCC manuals a couple of times trying to find it and have searched through the U-Boot code to no avail. All I know is that there is a data type "struct pt_regs*" that contains all that data, but nowhere can I find where it is populated.
> Below is the console output. The line "!!!! PAUSE !!!!" was inserted by me after I copied the text from the console to remind me of the ~20 second pause that occurs at that point.
> I am hoping that someone can point me to the bit definitions for whatever register is being displayed in TRAP. From there, I think I can trace the problem back to the specific piece of hardware and get it fixed.
Jonathan:
Typically, machine checks such as this are latent: they are more about something that happened earlier during bootstrap and initialization than about something that happened at the time the machine check was actually realized. This is because, up until that point, exceptions have not been enabled.
The first thing to check is your u-boot board configuration file. Are all EBC settings correct? Are all SDRAM settings correct? Are you using the right addresses and chip selects for data cache bootstrapping?
Beyond that, it might be useful to single step with your BDI/GDB (or other debugger) from start.S forward, watching key exception registers after every step.
To assist with such debugging, I defined the following macro in my .gdbinit file to dump relevant registers after every single step:
.gdbinit:
define dumpexcregs
    monitor rd msr
    monitor rd esr
    monitor rd dear
    monitor rd srr0
    monitor rd srr1
    monitor rd srr2
    monitor rd srr3
    monitor rd mcsr
    monitor rd mcar
    monitor rd mcsrr0
    monitor rd mcsrr1
    monitor rd ebc_besr0
    monitor rd ebc_besr1
    monitor rd sdram_besr0
    monitor rd sdram_besr0
    monitor rd sdram_bearl
    monitor rd sdram_bearh
end
Regards,
Grant Erickson

Grant,
Thanks for the reply.
I am certain that it is a hardware failure that is causing the machine check because I can use the exact same binary on another (identical) board and have it boot just fine. That tells me that all the EBC and SDRAM settings are correct; and that I am using the right addresses and chip selects for the data cache.
Currently I am leaning toward an SDRAM problem because I get about a 20 second pause when U-Boot tries to relocate to RAM.
Again, thanks for the reply.
Jonathan

On Thursday 30 April 2009, Jonathan Haws wrote:
> I am certain that it is a hardware failure that is causing the machine check because I can use the exact same binary on another (identical) board and have it boot just fine. That tells me that all the EBC and SDRAM settings are correct;
From my experience you can't be sure that SDRAM settings are "correct" at this stage.
> and that I am using the right addresses and chip selects for the data cache.
> Currently I am leaning toward an SDRAM problem because I get about a 20 second pause when U-Boot tries to relocate to RAM.
Yes, I'm pretty sure that you have some SDRAM related problems. Either configuration is non-optimal, or even (perhaps more unlikely) a hardware problem. I suggest that you re-check the DDR2 autocalibration (method A & B).
Best regards, Stefan
===================================================================== DENX Software Engineering GmbH, MD: Wolfgang Denk & Detlev Zundel HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany Phone: +49-8142-66989-0 Fax: +49-8142-66989-80 Email: office@denx.de =====================================================================

> On Thursday 30 April 2009, Jonathan Haws wrote:
> > I am certain that it is a hardware failure that is causing the machine check because I can use the exact same binary on another (identical) board and have it boot just fine. That tells me that all the EBC and SDRAM settings are correct;
> From my experience you can't be sure that SDRAM settings are "correct" at this stage.
Would that be the case on our other 6 boards then? We have 6 boards that are up and running with the exact same U-Boot binary file. If there was a problem with the SDRAM settings on one board, would not the other board show the same symptoms? That is the reason why I have not dug deeper into the SDRAM initialization. However, I will take your advice and do so, because if there is a problem there, then the other boards may be experiencing problems, just not to the extent that this one is.
> > and that I am using the right addresses and chip selects for the data cache.
> > Currently I am leaning toward an SDRAM problem because I get about a 20 second pause when U-Boot tries to relocate to RAM.
> Yes, I'm pretty sure that you have some SDRAM related problems. Either configuration is non-optimal, or even (perhaps more unlikely) a hardware problem. I suggest that you re-check the DDR2 autocalibration (method A & B).
Thanks for confirming my initial hunch - problem lies in SDRAM. I will let you know what I find - whether it is a hardware or software problem.
One thing I may mention is that we had the SDRAM chips re-balled before they were mounted on the board. Maybe something went wrong during that process on these chips on the problem board - who knows.
Anyway, once I get this resolved I will post the solution.
Thanks all!
Jonathan

On 5/4/09 7:43 AM, Jonathan Haws wrote:
> > On Thursday 30 April 2009, Jonathan Haws wrote:
> > > I am certain that it is a hardware failure that is causing the machine check because I can use the exact same binary on another (identical) board and have it boot just fine. That tells me that all the EBC and SDRAM settings are correct;
> > From my experience you can't be sure that SDRAM settings are "correct" at this stage.
> Would that be the case on our other 6 boards then? We have 6 boards that are up and running with the exact same U-Boot binary file. If there was a problem with the SDRAM settings on one board, would not the other board show the same symptoms? That is the reason why I have not dug deeper into the SDRAM initialization. However, I will take your advice and do so, because if there is a problem there, then the other boards may be experiencing problems, just not to the extent that this one is.
Have these additional six boards passed four corners testing with an intensive and exhaustive memory diagnostic?
Regards,
Grant

On Monday 04 May 2009, Jonathan Haws wrote:
> > From my experience you can't be sure that SDRAM settings are "correct" at this stage.
> Would that be the case on our other 6 boards then? We have 6 boards that are up and running with the exact same U-Boot binary file. If there was a problem with the SDRAM settings on one board, would not the other board show the same symptoms?
Not necessarily. Some SDRAM-related problems show up only very seldom or only under specific conditions (temperature and/or component differences, etc.). This "might" explain why some boards show no problems and others do.
> That is the reason why I have not dug deeper into the SDRAM initialization. However, I will take your advice and do so, because if there is a problem there, then the other boards may be experiencing problems, just not to the extent that this one is.
> > > and that I am using the right addresses and chip selects for the data cache.
> > > Currently I am leaning toward an SDRAM problem because I get about a 20 second pause when U-Boot tries to relocate to RAM.
> > Yes, I'm pretty sure that you have some SDRAM related problems. Either configuration is non-optimal, or even (perhaps more unlikely) a hardware problem. I suggest that you re-check the DDR2 autocalibration (method A & B).
> Thanks for confirming my initial hunch - problem lies in SDRAM. I will let you know what I find - whether it is a hardware or software problem.
> One thing I may mention is that we had the SDRAM chips re-balled before they were mounted on the board. Maybe something went wrong during that process on these chips on the problem board - who knows.
I see. This could be a problem.
I suggest that you run some stress tests in a conditioning cabinet to see if the other boards don't show any problems.
Best regards, Stefan

> I suggest that you run some stress tests in a conditioning cabinet to see if the other boards don't show any problems.
That is a good idea. I haven't thought of performing those tests. Are there specific tests I can enable in the U-Boot environment for that?
We have been using a couple of these boards extensively and in some pretty loaded configurations. For example, one board has been used as a data capture system to capture gigabytes of data over a network connection. That uses RAM extensively before it actually writes it out to disk. However, we have not run any sort of extensive memory diagnostics to check all parts of RAM. That is next on my list.
Thanks!
Jonathan

On Monday 04 May 2009, Jonathan Haws wrote:
> > I suggest that you run some stress tests in a conditioning cabinet to see if the other boards don't show any problems.
> That is a good idea. I haven't thought of performing those tests. Are there specific tests I can enable in the U-Boot environment for that?
Perhaps the memory tests from the POST infrastructure. But from my experience a real-world application running under Linux is a good test. For example compiling a Linux kernel in a loop. Perhaps mounted via NFS. Something like this should fail at some time when SDRAM related problems exist.
> We have been using a couple of these boards extensively and in some pretty loaded configurations. For example, one board has been used as a data capture system to capture gigabytes of data over a network connection. That uses RAM extensively before it actually writes it out to disk.
That's good. Which OS was used here? Linux?
But Jerry's note about x-raying the problematic board is a good idea.
Best regards, Stefan

On 5/4/09 8:08 AM, Stefan Roese wrote:
> On Monday 04 May 2009, Jonathan Haws wrote:
> > > I suggest that you run some stress tests in a conditioning cabinet to see if the other boards don't show any problems.
> > That is a good idea. I haven't thought of performing those tests. Are there specific tests I can enable in the U-Boot environment for that?
> Perhaps the memory tests from the POST infrastructure. But from my experience a real-world application running under Linux is a good test. For example compiling a Linux kernel in a loop. Perhaps mounted via NFS. Something like this should fail at some time when SDRAM related problems exist.
Agreed that real-world application tests can be sufficiently abusive to surface problems.
However, a side benefit of a non-application, exhaustive diagnostic is the attendant reporting that goes with such a test: it can identify particular data patterns or addresses that fail, giving better insight into the true nature of the problem.
Regards,
Grant
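To make the point about diagnostic reporting concrete, here is a minimal sketch of a pattern-based memory test that records which address and which data bits fail. It is a host-side Python model, not code for the target: `FaultyMemory` (including its deliberately stuck data bit), `walking_ones`, `address_pattern`, and `failing_bits` are all hypothetical names introduced for illustration, and a real diagnostic would be written in C against physical SDRAM. Such a test still performs only plain read/write cycles, so it complements rather than replaces a heavy application-level load.

```python
class FaultyMemory:
    """Word-addressed RAM model with an optional data bit stuck low."""

    def __init__(self, words, stuck_low_bit=None):
        self.cells = [0] * words
        self.mask = 0xFFFFFFFF
        if stuck_low_bit is not None:
            self.mask &= ~(1 << stuck_low_bit)

    def write(self, addr, value):
        # A stuck-low bit is cleared on every write.
        self.cells[addr] = value & self.mask

    def read(self, addr):
        return self.cells[addr]


def walking_ones(mem, addr=0, width=32):
    """Write each single-bit pattern to one word; return the failures."""
    failures = []
    for bit in range(width):
        pattern = 1 << bit
        mem.write(addr, pattern)
        got = mem.read(addr)
        if got != pattern:
            failures.append((addr, pattern, got))
    return failures


def expected_value(addr):
    """Address-derived pattern so that every word holds a distinct value."""
    return (addr * 0x9E3779B1) & 0xFFFFFFFF


def address_pattern(mem, words):
    """Fill all words with address-derived values, then verify them all."""
    for addr in range(words):
        mem.write(addr, expected_value(addr))
    failures = []
    for addr in range(words):
        got = mem.read(addr)
        if got != expected_value(addr):
            failures.append((addr, expected_value(addr), got))
    return failures


def failing_bits(failures):
    """OR together the bit differences of all (addr, wrote, read) failures."""
    bits = 0
    for _addr, wrote, got in failures:
        bits |= wrote ^ got
    return bits


if __name__ == "__main__":
    bad = FaultyMemory(1024, stuck_low_bit=5)
    for addr, wrote, got in walking_ones(bad):
        print("addr %d: wrote 0x%08X, read 0x%08X" % (addr, wrote, got))
    print("failing data bits: 0x%08X" % failing_bits(address_pattern(bad, 1024)))
```

A failure report that always implicates the same data bit would suggest a single data line or solder joint, while failures tied to particular addresses would suggest an address line or one device.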

> On 5/4/09 8:08 AM, Stefan Roese wrote:
> > On Monday 04 May 2009, Jonathan Haws wrote:
> > > > I suggest that you run some stress tests in a conditioning cabinet to see if the other boards don't show any problems.
> > > That is a good idea. I haven't thought of performing those tests. Are there specific tests I can enable in the U-Boot environment for that?
> > Perhaps the memory tests from the POST infrastructure. But from my experience a real-world application running under Linux is a good test. For example compiling a Linux kernel in a loop. Perhaps mounted via NFS. Something like this should fail at some time when SDRAM related problems exist.
> Agreed that real-world application tests can be sufficiently abusive to surface problems.
> However, a side benefit of a non-application, exhaustive diagnostic is the attendant reporting that goes with such a test: it can identify particular data patterns or addresses that fail, giving better insight into the true nature of the problem.
I agree with Grant on this point. If the x-rays do not show anything, then I believe that there is something in the chips that is causing the problem - which a memory diagnostic would show well. And if these chips are having issues and the others on the other boards are from the same lot, then there could be issues not cropping up on the other boards.
Also, Stefan, to answer your question about OS - we are using VxWorks as the OS.
Thanks again!
Jonathan

In message 200905041708.59991.sr@denx.de Stefan Roese wrote:
> On Monday 04 May 2009, Jonathan Haws wrote:
> ...
> > That is a good idea. I haven't thought of performing those tests. Are there specific tests I can enable in the U-Boot environment for that?
> Perhaps the memory tests from the POST infrastructure. But from my experience a real-world application running under Linux is a good test. For example compiling a Linux kernel in a loop. Perhaps mounted via NFS. Something like this should fail at some time when SDRAM related problems exist.
Stefan is right. The memory tests in U-Boot all boil down to plain read/write cycles on the bus. This is nothing compared to the stress you put on the memory system when you have back-to-back burst mode accesses. To get these, you need a combination of cache flushes (such as in an OS when it is context-switching), cache loads (like instruction fetches when lots of different code is being executed), and DMA (like when you have heavy network traffic or another active DMA device). Booting Linux with root file system over NFS is the easiest and one of the most reliable stress tests I know of.
Best regards,
Wolfgang Denk

Jonathan Haws wrote:
> > On Thursday 30 April 2009, Jonathan Haws wrote:
> > > I am certain that it is a hardware failure that is causing the machine check because I can use the exact same binary on another (identical) board and have it boot just fine. That tells me that all the EBC and SDRAM settings are correct;
> > From my experience you can't be sure that SDRAM settings are "correct" at this stage.
> Would that be the case on our other 6 boards then? We have 6 boards that are up and running with the exact same U-Boot binary file. If there was a problem with the SDRAM settings on one board, would not the other board show the same symptoms? That is the reason why I have not dug deeper into the SDRAM initialization. However, I will take your advice and do so, because if there is a problem there, then the other boards may be experiencing problems, just not to the extent that this one is.
> > > and that I am using the right addresses and chip selects for the data cache.
> > > Currently I am leaning toward an SDRAM problem because I get about a 20 second pause when U-Boot tries to relocate to RAM.
> > Yes, I'm pretty sure that you have some SDRAM related problems. Either configuration is non-optimal, or even (perhaps more unlikely) a hardware problem. I suggest that you re-check the DDR2 autocalibration (method A & B).
> Thanks for confirming my initial hunch - problem lies in SDRAM. I will let you know what I find - whether it is a hardware or software problem.
> One thing I may mention is that we had the SDRAM chips re-balled before they were mounted on the board. Maybe something went wrong during that process on these chips on the problem board - who knows.
> Anyway, once I get this resolved I will post the solution.
> Thanks all!
> Jonathan
1) Six boards work, one board fails.
2) SDRAM chips re-balled on the failing board.
3) SDRAM failing.
That sounds like a hardware/assembly problem to me. My bet is a solder problem. Have you (can you) x-ray the chips and verify the SDRAM soldering is OK?
Best regards, gvb

> - Six boards work, one board fails.
> - SDRAM chips re-balled on the failing board.
> - SDRAM failing.
> That sounds like a hardware/assembly problem to me. My bet is a solder problem. Have you (can you) x-ray the chips and verify the SDRAM soldering is OK?
That was my initial hunch simply because of the 6 working boards. Our hardware designer took the board in for x-ray this morning to see if there is an issue there. If that is the problem, then we are set - though I still plan to run some extensive diagnostics on the memory just to be sure.
Good to hear that someone has the same hunch as I did!
Jonathan
participants (5)
- Grant Erickson
- Jerry Van Baren
- Jonathan Haws
- Stefan Roese
- Wolfgang Denk