[U-Boot] [PATCHv2] fsl_ddr: Don't use full 64-bit divides on 32-bit PowerPC

The current FreeScale MPC-8xxx DDR SPD interpreter is using full 64-bit integer divide operations to convert between nanoseconds and DDR clock cycles given arbitrary DDR clock frequencies.
Since all of the inputs to this are 32-bit (nanoseconds, clock cycles, and DDR frequencies), we can easily restructure the computation to use the "do_div()" function to perform 64-bit/32-bit divide operations.
This decreases compute time rather significantly for each conversion and avoids bringing in a very complicated function from libgcc.
It should be noted that nothing else in U-Boot or the Linux kernel seems to require a full 64-bit divide on any 32-bit PowerPC.
Build-and-boot-tested on the HWW-1U-1A board using DDR2 SPD detection.
Signed-off-by: Kyle Moffett Kyle.D.Moffett@boeing.com Cc: Andy Fleming afleming@gmail.com Cc: Kumar Gala kumar.gala@freescale.com Cc: Wolfgang Denk wd@denx.de Cc: Kim Phillips kim.phillips@freescale.com ---
Ok, so this patch touches a rather sensitive part of the fsl_ddr logic.
I spent a fair amount of time trying to verify that the resulting math is exactly the same as it was before, but the consequences of a mistake are insideous timing problems. Additional in-depth review would be much appreciated.
Cheers, Kyle Moffett
Changelog: v2: Resubmitted separately from the other HWW-1U-1A patches
arch/powerpc/cpu/mpc8xxx/ddr/util.c | 58 ++++++++++++++++++++++++---------- 1 files changed, 41 insertions(+), 17 deletions(-)
diff --git a/arch/powerpc/cpu/mpc8xxx/ddr/util.c b/arch/powerpc/cpu/mpc8xxx/ddr/util.c index 1e2d921..c545d59 100644 --- a/arch/powerpc/cpu/mpc8xxx/ddr/util.c +++ b/arch/powerpc/cpu/mpc8xxx/ddr/util.c @@ -8,11 +8,19 @@
#include <common.h> #include <asm/fsl_law.h> +#include <div64.h>
#include "ddr.h"
unsigned int fsl_ddr_get_mem_data_rate(void);
+/* To avoid 64-bit full-divides, we factor this here */ +#define ULL_2e12 2000000000000ULL +#define UL_5pow12 244140625UL +#define UL_2pow13 (1UL << 13) + +#define ULL_8Fs 0xFFFFFFFFULL + /* * Round mclk_ps to nearest 10 ps in memory controller code. * @@ -22,36 +30,52 @@ unsigned int fsl_ddr_get_mem_data_rate(void); */ unsigned int get_memory_clk_period_ps(void) { - unsigned int mclk_ps; + unsigned int data_rate = fsl_ddr_get_mem_data_rate(); + unsigned int result; + + /* Round to nearest 10ps, being careful about 64-bit multiply/divide */ + unsigned long long mclk_ps = ULL_2e12;
- mclk_ps = 2000000000000ULL / fsl_ddr_get_mem_data_rate(); - /* round to nearest 10 ps */ - return 10 * ((mclk_ps + 5) / 10); + /* Add 5*data_rate, for rounding */ + mclk_ps += 5*(unsigned long long)data_rate; + + /* Now perform the big divide, the result fits in 32-bits */ + do_div(mclk_ps, data_rate); + result = mclk_ps; + + /* We still need to round to 10ps */ + return 10 * (result/10); }
/* Convert picoseconds into DRAM clock cycles (rounding up if needed). */ unsigned int picos_to_mclk(unsigned int picos) { - const unsigned long long ULL_2e12 = 2000000000000ULL; - const unsigned long long ULL_8Fs = 0xFFFFFFFFULL; - unsigned long long clks; - unsigned long long clks_temp; + unsigned long long clks, clks_rem;
+ /* Short circuit for zero picos */ if (!picos) return 0;
- clks = fsl_ddr_get_mem_data_rate() * (unsigned long long) picos; - clks_temp = clks; - clks = clks / ULL_2e12; - if (clks_temp % ULL_2e12) { + /* First multiply the time by the data rate (32x32 => 64) */ + clks = picos * (unsigned long long)fsl_ddr_get_mem_data_rate(); + + /* + * Now divide by 5^12 and track the 32-bit remainder, then divide + * by 2*(2^12) using shifts (and updating the remainder). + */ + clks_rem = do_div(clks, UL_5pow12); + clks_rem <<= 13; + clks_rem |= clks & (UL_2pow13-1); + clks >>= 13; + + /* If we had a remainder, then round up */ + if (clks_rem) clks++; - }
- if (clks > ULL_8Fs) { + /* Clamp to the maximum representable value */ + if (clks > ULL_8Fs) clks = ULL_8Fs; - } - - return (unsigned int) clks; + return (unsigned int)clks; }
unsigned int mclk_to_picos(unsigned int mclk)

On Wed, 2011-02-23 at 11:35 -0500, Kyle Moffett wrote:
The current FreeScale MPC-8xxx DDR SPD interpreter is using full 64-bit integer divide operations to convert between nanoseconds and DDR clock cycles given arbitrary DDR clock frequencies.
Since all of the inputs to this are 32-bit (nanoseconds, clock cycles, and DDR frequencies), we can easily restructure the computation to use the "do_div()" function to perform 64-bit/32-bit divide operations.
This decreases compute time rather significantly for each conversion and avoids bringing in a very complicated function from libgcc.
It should be noted that nothing else in U-Boot or the Linux kernel seems to require a full 64-bit divide on any 32-bit PowerPC.
Build-and-boot-tested on the HWW-1U-1A board using DDR2 SPD detection.
Signed-off-by: Kyle Moffett Kyle.D.Moffett@boeing.com Cc: Andy Fleming afleming@gmail.com Cc: Kumar Gala kumar.gala@freescale.com Cc: Wolfgang Denk wd@denx.de Cc: Kim Phillips kim.phillips@freescale.com
Ok, so this patch touches a rather sensitive part of the fsl_ddr logic.
I spent a fair amount of time trying to verify that the resulting math is exactly the same as it was before, but the consequences of a mistake are insideous timing problems. Additional in-depth review would be much appreciated.
What's the gain of the changes? A smaller footprint? Going forward from 32- to 64-bit, are we going to change it back?
York

On Mar 14, 2011, at 14:19, York Sun wrote:
On Wed, 2011-02-23 at 11:35 -0500, Kyle Moffett wrote:
The current FreeScale MPC-8xxx DDR SPD interpreter is using full 64-bit integer divide operations to convert between nanoseconds and DDR clock cycles given arbitrary DDR clock frequencies.
Since all of the inputs to this are 32-bit (nanoseconds, clock cycles, and DDR frequencies), we can easily restructure the computation to use the "do_div()" function to perform 64-bit/32-bit divide operations.
This decreases compute time rather significantly for each conversion and avoids bringing in a very complicated function from libgcc.
It should be noted that nothing else in U-Boot or the Linux kernel seems to require a full 64-bit divide on any 32-bit PowerPC.
Build-and-boot-tested on the HWW-1U-1A board using DDR2 SPD detection.
Signed-off-by: Kyle Moffett Kyle.D.Moffett@boeing.com Cc: Andy Fleming afleming@gmail.com Cc: Kumar Gala kumar.gala@freescale.com Cc: Wolfgang Denk wd@denx.de Cc: Kim Phillips kim.phillips@freescale.com
Ok, so this patch touches a rather sensitive part of the fsl_ddr logic.
I spent a fair amount of time trying to verify that the resulting math is exactly the same as it was before, but the consequences of a mistake are insideous timing problems. Additional in-depth review would be much appreciated.
What's the gain of the changes? A smaller footprint? Going forward from 32- to 64-bit, are we going to change it back?
On 64-bit this change is basically a no-op, because do_div() is implemented as a literal 64-bit divide operation and the instruction scheduling works out almost the same.
On 32-bit PowerPC a fully accurate 64/64 divide (__udivdi3 in libgcc) is 1.1kb of code and hundreds or thousands of dependent cycles to compute, all of which is linked in from libgcc. Another 1.2kb of code comes in for __umoddi3.
The original reason I wrote this patch is that the native "libgcc" on my boards is hard-float and therefore generates warnings when linking to it from soft-float code. The toolchain is the native Debian powerpc (or powerpcspe), and most other native PowerPC distributions are also hard-float-only.
When I combine this patch with the other patch I posted to create a minimal internal libgcc with a few 64-bit shift functions from the linux kernel, I can successfully build U-Boot on those native PowerPC systems.
Cheers, Kyle Moffett

Kyle,
On Mon, 2011-03-14 at 14:04 -0500, Moffett, Kyle D wrote:
On 64-bit this change is basically a no-op, because do_div() is implemented as a literal 64-bit divide operation and the instruction scheduling works out almost the same.
On 32-bit PowerPC a fully accurate 64/64 divide (__udivdi3 in libgcc) is 1.1kb of code and hundreds or thousands of dependent cycles to compute, all of which is linked in from libgcc. Another 1.2kb of code comes in for __umoddi3.
The original reason I wrote this patch is that the native "libgcc" on my boards is hard-float and therefore generates warnings when linking to it from soft-float code. The toolchain is the native Debian powerpc (or powerpcspe), and most other native PowerPC distributions are also hard-float-only.
When I combine this patch with the other patch I posted to create a minimal internal libgcc with a few 64-bit shift functions from the linux kernel, I can successfully build U-Boot on those native PowerPC systems.
Points taken. Can you explain your algorithm? I see you want to do clks/5^12/2^13, but I don't see when the clks is divided by 5^12. Did I miss some lines?
York

On Mar 14, 2011, at 15:41, York Sun wrote:
Kyle,
On Mon, 2011-03-14 at 14:04 -0500, Moffett, Kyle D wrote:
On 64-bit this change is basically a no-op, because do_div() is implemented as a literal 64-bit divide operation and the instruction scheduling works out almost the same.
On 32-bit PowerPC a fully accurate 64/64 divide (__udivdi3 in libgcc) is 1.1kb of code and hundreds or thousands of dependent cycles to compute, all of which is linked in from libgcc. Another 1.2kb of code comes in for __umoddi3.
The original reason I wrote this patch is that the native "libgcc" on my boards is hard-float and therefore generates warnings when linking to it from soft-float code. The toolchain is the native Debian powerpc (or powerpcspe), and most other native PowerPC distributions are also hard-float-only.
When I combine this patch with the other patch I posted to create a minimal internal libgcc with a few 64-bit shift functions from the linux kernel, I can successfully build U-Boot on those native PowerPC systems.
Points taken. Can you explain your algorithm? I see you want to do clks/5^12/2^13, but I don't see when the clks is divided by 5^12. Did I miss some lines?
The "do_div()" macro is a bit confusing that way, it simultaneously performs an in-place division of the first argument and returns the remainder. Unfortunately that's a clumsy interface that's been around in the Linux kernel for ages.
So this line here: clks_rem = do_div(clks, UL_5pow12);
It divides clks by UL_5pow12 and saves the remainder as clks_rem.
Then the 2^^13 divide and remainder is perform with shifts and masks.
Actually, it occurs to me that the one comment isn't quite right, the remainder is a 64-bit value (not a 32-bit one).
I feel relatively confident that this is the right technical direction to go, because almost all of the Linux kernel timekeeping and scheduling code uses 64-bit values and careful application of do_div() or shift+mask for efficiency.
Cheers, Kyle Moffett

On Wed, 2011-02-23 at 11:35 -0500, Kyle Moffett wrote:
* Now divide by 5^12 and track the 32-bit remainder, then divide
* by 2*(2^12) using shifts (and updating the remainder).
*/
- clks_rem = do_div(clks, UL_5pow12);
- clks_rem <<= 13;
Shouldn't this be clks_rem >>= 13 ?
- clks_rem |= clks & (UL_2pow13-1);
- clks >>= 13;
- /* If we had a remainder, then round up */
- if (clks_rem) clks++;
- }
York

On Mar 14, 2011, at 16:22, York Sun wrote:
On Wed, 2011-02-23 at 11:35 -0500, Kyle Moffett wrote:
* Now divide by 5^12 and track the 32-bit remainder, then divide
* by 2*(2^12) using shifts (and updating the remainder).
*/
- clks_rem = do_div(clks, UL_5pow12);
- clks_rem <<= 13;
Shouldn't this be clks_rem >>= 13 ?
- clks_rem |= clks & (UL_2pow13-1);
- clks >>= 13;
- /* If we had a remainder, then round up */
- if (clks_rem) clks++;
- }
Since I'm dividing a second time, the old remainder value represents the high bits of the new remainder value and therefore needs to be left-shifted, then the now-empty low bits of the remainder are taken from the bits of "clks" which get shifted away.
Example:
Say I want to divide 1999999999999 (IE: 2*10^^12 - 1) by 2000000000000 (2*10^^12). Obviously the dividend is less than the divisor (by 1), so the result should be 0 and the remainder should be equal to the dividend.
So my initial value is: clks = 1999999999999;
Now I divide by 5^^12: clks_rem = do_div(clks, 244140625); /* This number is 5pow12 */
The results are: clks_rem == 244140624 clks == 8191
Now I shift left: clks_rem <<= 13;
And get: clks_rem == 1999999991808
Finally, I copy the low bits of clks to clks_rem and shift them out of clks: clks_rem |= clks & (UL_2pow13-1); clks >>= 13
The result is as expected: clks_rem == 1999999999999; clks == 0;
Cheers, Kyle Moffett

Kyle,
On Mon, 2011-03-14 at 16:35 -0500, Moffett, Kyle D wrote:
On Mar 14, 2011, at 16:22, York Sun wrote:
On Wed, 2011-02-23 at 11:35 -0500, Kyle Moffett wrote:
* Now divide by 5^12 and track the 32-bit remainder, then divide
* by 2*(2^12) using shifts (and updating the remainder).
*/
- clks_rem = do_div(clks, UL_5pow12);
- clks_rem <<= 13;
Shouldn't this be clks_rem >>= 13 ?
- clks_rem |= clks & (UL_2pow13-1);
- clks >>= 13;
- /* If we had a remainder, then round up */
- if (clks_rem) clks++;
- }
I found a problem with the round up. Please try
picos_to_mclk(mclk_to_picos(3))
You will notice the result is round up. I am sure it is unexpected.
York

On Jul 28, 2011, at 17:41, York Sun wrote:
On Mon, 2011-03-14 at 16:35 -0500, Moffett, Kyle D wrote:
On Mar 14, 2011, at 16:22, York Sun wrote:
On Wed, 2011-02-23 at 11:35 -0500, Kyle Moffett wrote:
* Now divide by 5^12 and track the 32-bit remainder, then divide
* by 2*(2^12) using shifts (and updating the remainder).
*/
- clks_rem = do_div(clks, UL_5pow12);
- clks_rem <<= 13;
Shouldn't this be clks_rem >>= 13 ?
- clks_rem |= clks & (UL_2pow13-1);
- clks >>= 13;
- /* If we had a remainder, then round up */
- if (clks_rem) clks++;
- }
I found a problem with the round up. Please try
picos_to_mclk(mclk_to_picos(3))
You will notice the result is round up. I am sure it is unexpected.
Hmm... what does fsl_ddr_get_mem_data_rate() return on your system? I cannot reproduce this here with my memory freq of 800MHz.
The function has obviously always done rounding, even before I made any changes. Have you retested what the math does with the patch reverted to see if it makes any difference?
Hmm, looking at it, the obvious problem case would be mclk_to_picos() using a rounded result in a subsequent multiply. Can you retest by replacing mclk_to_picos() with the following code; this should keep the error down by doing the rounding-to-10ps *after* the multiply.
unsigned int mclk_to_picos(unsigned int mclk) { unsigned int data_rate = get_ddr_freq(0); unsigned int result;
/* Round to nearest 10ps, being careful about 64-bit multiply/divide */ unsigned long long mclk_ps = ULL_2e12 * mclk;
/* Add 5*data_rate, for rounding */ mclk_ps += 5*(unsigned long long)data_rate;
/* Now perform the big divide, the result fits in 32-bits */ do_div(mclk_ps, data_rate); result = mclk_ps;
/* We still need to round to 10ps */ return 10 * (result/10); }
Obviously you could also get rid of the rounding alltogether, but I don't know why it was put in the code in the first place...
Cheers, Kyle Moffett

Kyle,
On Fri, 2011-07-29 at 12:20 -0500, Moffett, Kyle D wrote:
On Jul 28, 2011, at 17:41, York Sun wrote:
On Mon, 2011-03-14 at 16:35 -0500, Moffett, Kyle D wrote:
On Mar 14, 2011, at 16:22, York Sun wrote:
On Wed, 2011-02-23 at 11:35 -0500, Kyle Moffett wrote:
* Now divide by 5^12 and track the 32-bit remainder, then divide
* by 2*(2^12) using shifts (and updating the remainder).
*/
- clks_rem = do_div(clks, UL_5pow12);
- clks_rem <<= 13;
Shouldn't this be clks_rem >>= 13 ?
- clks_rem |= clks & (UL_2pow13-1);
- clks >>= 13;
- /* If we had a remainder, then round up */
- if (clks_rem) clks++;
- }
I found a problem with the round up. Please try
picos_to_mclk(mclk_to_picos(3))
You will notice the result is round up. I am sure it is unexpected.
Hmm... what does fsl_ddr_get_mem_data_rate() return on your system? I cannot reproduce this here with my memory freq of 800MHz.
The function has obviously always done rounding, even before I made any changes. Have you retested what the math does with the patch reverted to see if it makes any difference?
Hmm, looking at it, the obvious problem case would be mclk_to_picos() using a rounded result in a subsequent multiply. Can you retest by replacing mclk_to_picos() with the following code; this should keep the error down by doing the rounding-to-10ps *after* the multiply.
unsigned int mclk_to_picos(unsigned int mclk) { unsigned int data_rate = get_ddr_freq(0); unsigned int result;
/* Round to nearest 10ps, being careful about 64-bit multiply/divide */ unsigned long long mclk_ps = ULL_2e12 * mclk; /* Add 5*data_rate, for rounding */ mclk_ps += 5*(unsigned long long)data_rate; /* Now perform the big divide, the result fits in 32-bits */ do_div(mclk_ps, data_rate); result = mclk_ps; /* We still need to round to 10ps */ return 10 * (result/10);
}
Obviously you could also get rid of the rounding alltogether, but I don't know why it was put in the code in the first place...
I think the problem comes from the round up. But your calculation of remainder is not right. You explained to me with example of 2*10^^12-1. It seems to be right until I tried another number 264*10^^7. I think the following calculation of remainder is
clks_rem = do_div(clks, UL_5POW12); clks >>= 13; clks_rem |= (clks & (UL_2POW13-1)) * UL_5POW12;
Then we can compare the remainder with the "error" introduced by mclk_to_picos, which is 10ps.
if (clks_rem > 10 * data_rate) clk++;
What do you think?
York

On Fri, 2011-07-29 at 11:35 -0700, York Sun wrote:
Kyle,
On Fri, 2011-07-29 at 12:20 -0500, Moffett, Kyle D wrote:
On Jul 28, 2011, at 17:41, York Sun wrote:
On Mon, 2011-03-14 at 16:35 -0500, Moffett, Kyle D wrote:
On Mar 14, 2011, at 16:22, York Sun wrote:
On Wed, 2011-02-23 at 11:35 -0500, Kyle Moffett wrote:
* Now divide by 5^12 and track the 32-bit remainder, then divide
* by 2*(2^12) using shifts (and updating the remainder).
*/
- clks_rem = do_div(clks, UL_5pow12);
- clks_rem <<= 13;
Shouldn't this be clks_rem >>= 13 ?
- clks_rem |= clks & (UL_2pow13-1);
- clks >>= 13;
- /* If we had a remainder, then round up */
- if (clks_rem) clks++;
- }
I found a problem with the round up. Please try
picos_to_mclk(mclk_to_picos(3))
You will notice the result is round up. I am sure it is unexpected.
Hmm... what does fsl_ddr_get_mem_data_rate() return on your system? I cannot reproduce this here with my memory freq of 800MHz.
The function has obviously always done rounding, even before I made any changes. Have you retested what the math does with the patch reverted to see if it makes any difference?
Hmm, looking at it, the obvious problem case would be mclk_to_picos() using a rounded result in a subsequent multiply. Can you retest by replacing mclk_to_picos() with the following code; this should keep the error down by doing the rounding-to-10ps *after* the multiply.
unsigned int mclk_to_picos(unsigned int mclk) { unsigned int data_rate = get_ddr_freq(0); unsigned int result;
/* Round to nearest 10ps, being careful about 64-bit multiply/divide */ unsigned long long mclk_ps = ULL_2e12 * mclk; /* Add 5*data_rate, for rounding */ mclk_ps += 5*(unsigned long long)data_rate; /* Now perform the big divide, the result fits in 32-bits */ do_div(mclk_ps, data_rate); result = mclk_ps; /* We still need to round to 10ps */ return 10 * (result/10);
}
Obviously you could also get rid of the rounding alltogether, but I don't know why it was put in the code in the first place...
I think the problem comes from the round up. But your calculation of remainder is not right. You explained to me with example of 2*10^^12-1. It seems to be right until I tried another number 264*10^^7. I think the following calculation of remainder is
clks_rem = do_div(clks, UL_5POW12); clks >>= 13; clks_rem |= (clks & (UL_2POW13-1)) * UL_5POW12;
Then we can compare the remainder with the "error" introduced by mclk_to_picos, which is 10ps.
if (clks_rem > 10 * data_rate) clk++;
What do you think?
Maybe not a good idea. It may not round up when necessary.
York

On Jul 29, 2011, at 14:46, York Sun wrote:
On Fri, 2011-07-29 at 11:35 -0700, York Sun wrote:
On Fri, 2011-07-29 at 12:20 -0500, Moffett, Kyle D wrote:
On Jul 28, 2011, at 17:41, York Sun wrote:
I found a problem with the round up. Please try
picos_to_mclk(mclk_to_picos(3))
You will notice the result is round up. I am sure it is unexpected.
Hmm... what does fsl_ddr_get_mem_data_rate() return on your system? I cannot reproduce this here with my memory freq of 800MHz.
The function has obviously always done rounding, even before I made any changes. Have you retested what the math does with the patch reverted to see if it makes any difference?
Hmm, looking at it, the obvious problem case would be mclk_to_picos() using a rounded result in a subsequent multiply. Can you retest by replacing mclk_to_picos() with the following code; this should keep the error down by doing the rounding-to-10ps *after* the multiply.
unsigned int mclk_to_picos(unsigned int mclk) { unsigned int data_rate = get_ddr_freq(0); unsigned int result;
/* Round to nearest 10ps, being careful about 64-bit multiply/divide */ unsigned long long mclk_ps = ULL_2e12 * mclk; /* Add 5*data_rate, for rounding */ mclk_ps += 5*(unsigned long long)data_rate; /* Now perform the big divide, the result fits in 32-bits */ do_div(mclk_ps, data_rate); result = mclk_ps; /* We still need to round to 10ps */ return 10 * (result/10);
}
Obviously you could also get rid of the rounding alltogether, but I don't know why it was put in the code in the first place...
I think the problem comes from the round up. But your calculation of remainder is not right. You explained to me with example of 2*10^^12-1. It seems to be right until I tried another number 264*10^^7. I think the following calculation of remainder is
clks_rem = do_div(clks, UL_5POW12); clks >>= 13; clks_rem |= (clks & (UL_2POW13-1)) * UL_5POW12;
That code you pasted is definitely not what is in the tree. The in-tree code is: 57 /* First multiply the time by the data rate (32x32 => 64) */ 58 clks = picos * (unsigned long long)get_ddr_freq(0); 59 60 /* 61 * Now divide by 5^12 and track the 32-bit remainder, then divide 62 * by 2*(2^12) using shifts (and updating the remainder). 63 */ 64 clks_rem = do_div(clks, UL_5pow12); 65 clks_rem <<= 13; 66 clks_rem |= clks & (UL_2pow13-1); 67 clks >>= 13; 68 69 /* If we had a remainder, then round up */ 70 if (clks_rem) 71 clks++;
Can you tell me what your value of get_ddr_freq() is? I may be able to reproduce the issue here.
Also, can you try your test case with the replacement for mclk_to_picos() that I pasted above and let me know what happens?
I think the *real* error is that get_memory_clk_period_ps() (which I did not change) rounds up or down to a multiple of 10ps, and then the *result* of get_memory_clk_period_ps() is multiplied by the number of clocks. The effect is that the 5ps of rounding error that can occur is multiplied by the number of clocks, and may result in a much-too-large result. When you convert that number of picoseconds *back* to memory clocks you get a result that is too big, but that's because it was too many picoseconds!
Where did you get the number 264*10^7? That seems too big to be a number of picoseconds (that's 2ms!), but way too small to be the product of get_ddr_freq() and some number of picoseconds.
EG: For my system here with a DDR frequency of 800MHz, each clock is 2500ps, so 3 clocks times 2500ps times 800MHz is 6*10^12.
In fact if you give "picos_to_mclk()" any number of picoseconds larger than at least 1 clock, the initial value of the "clks" variable is guaranteed to be at least 2*10^12, right?
Using my number, let's run through the math:
So if you start with picos = 7499 (to put rounding to the test) and get_ddr_freq() is 800MHz: clks = 7499 * 800*1000*1000;
Now we have "clks" = 5999200000000, and we divide by 5**12: clks_rem = do_div(clks, UL_5pow12);
Since 5999200000000/(5**12) is 24572.7232, we get: clks = 24572 clks_rem = 176562500 (which is equal to 5^12 * 0.7232)
Now left-shift clks_rem by 13: clks_rem = 1446400000000
Then bitwise-OR in the lower 13 bits of clks: clks_rem |= (24572 & 8191) clks_rem = 1446400008188;
Finally, rightshift clks by 13: clks = 2
Since there is a remainder, we "round up" from 2 to 3.
These numbers empirically make sense, because I gave the system 7499ps (which is 3 clocks minus 1ps) and it rounded up properly.
If I give it a simple number like 7500ps, then the math is really easy and obvious: clks = 7500 * 800*1000*100
Then the do_div(): clks = 24576 clks_rem = 0
Shifting clks_rem does nothing, because it is zero...
Oring with the lower 13 bits of clks has no effect (it has no low bits set)
Finally, when right-shifting clks by 13: clks = 3
This is obviously the correct result.
Cheers, Kyle Moffett

On Fri, 2011-07-29 at 16:26 -0500, Moffett, Kyle D wrote:
On Jul 29, 2011, at 14:46, York Sun wrote:
On Fri, 2011-07-29 at 11:35 -0700, York Sun wrote:
On Fri, 2011-07-29 at 12:20 -0500, Moffett, Kyle D wrote:
On Jul 28, 2011, at 17:41, York Sun wrote:
I found a problem with the round up. Please try
picos_to_mclk(mclk_to_picos(3))
You will notice the result is round up. I am sure it is unexpected.
Hmm... what does fsl_ddr_get_mem_data_rate() return on your system? I cannot reproduce this here with my memory freq of 800MHz.
The function has obviously always done rounding, even before I made any changes. Have you retested what the math does with the patch reverted to see if it makes any difference?
Hmm, looking at it, the obvious problem case would be mclk_to_picos() using a rounded result in a subsequent multiply. Can you retest by replacing mclk_to_picos() with the following code; this should keep the error down by doing the rounding-to-10ps *after* the multiply.
unsigned int mclk_to_picos(unsigned int mclk) { unsigned int data_rate = get_ddr_freq(0); unsigned int result;
/* Round to nearest 10ps, being careful about 64-bit multiply/divide */ unsigned long long mclk_ps = ULL_2e12 * mclk; /* Add 5*data_rate, for rounding */ mclk_ps += 5*(unsigned long long)data_rate; /* Now perform the big divide, the result fits in 32-bits */ do_div(mclk_ps, data_rate); result = mclk_ps; /* We still need to round to 10ps */ return 10 * (result/10);
}
Obviously you could also get rid of the rounding alltogether, but I don't know why it was put in the code in the first place...
I think the problem comes from the round up. But your calculation of remainder is not right. You explained to me with example of 2*10^^12-1. It seems to be right until I tried another number 264*10^^7. I think the following calculation of remainder is
clks_rem = do_div(clks, UL_5POW12); clks >>= 13; clks_rem |= (clks & (UL_2POW13-1)) * UL_5POW12;
That code you pasted is definitely not what is in the tree. The in-tree code is: 57 /* First multiply the time by the data rate (32x32 => 64) */ 58 clks = picos * (unsigned long long)get_ddr_freq(0); 59 60 /* 61 * Now divide by 5^12 and track the 32-bit remainder, then divide 62 * by 2*(2^12) using shifts (and updating the remainder). 63 */ 64 clks_rem = do_div(clks, UL_5pow12); 65 clks_rem <<= 13; 66 clks_rem |= clks & (UL_2pow13-1); 67 clks >>= 13; 68 69 /* If we had a remainder, then round up */ 70 if (clks_rem) 71 clks++;
Can you tell me what your value of get_ddr_freq() is? I may be able to reproduce the issue here.
264000000
Also, can you try your test case with the replacement for mclk_to_picos() that I pasted above and let me know what happens?
I think the *real* error is that get_memory_clk_period_ps() (which I did not change) rounds up or down to a multiple of 10ps, and then the *result* of get_memory_clk_period_ps() is multiplied by the number of clocks. The effect is that the 5ps of rounding error that can occur is multiplied by the number of clocks, and may result in a much-too-large result. When you convert that number of picoseconds *back* to memory clocks you get a result that is too big, but that's because it was too many picoseconds!
Correct. get_memory_clk_period_ps() is causing this problem.
Where did you get the number 264*10^7? That seems too big to be a number of picoseconds (that's 2ms!), but way too small to be the product of get_ddr_freq() and some number of picoseconds.
Forget that. I was just trying.
EG: For my system here with a DDR frequency of 800MHz, each clock is 2500ps, so 3 clocks times 2500ps times 800MHz is 6*10^12.
In fact if you give "picos_to_mclk()" any number of picoseconds larger than at least 1 clock, the initial value of the "clks" variable is guaranteed to be at least 2*10^12, right?
Using my number, let's run through the math:
So if you start with picos = 7499 (to put rounding to the test) and get_ddr_freq() is 800MHz: clks = 7499 * 800*1000*1000;
Now we have "clks" = 5999200000000, and we divide by 5**12: clks_rem = do_div(clks, UL_5pow12);
Since 5999200000000/(5**12) is 24572.7232, we get: clks = 24572 clks_rem = 176562500 (which is equal to 5^12 * 0.7232)
Now left-shift clks_rem by 13: clks_rem = 1446400000000
Here I think it is wrong. I want to use the clks_rem value later.
I change it to this
clks_rem = do_div(clks, UL_5POW12); clks_rem += (clks & (UL_2POW13-1)) * UL_5POW12; clks >>= 13;
/* If we had a remainder greater than the 1ps error, then round up */ if (clks_rem > data_rate) clks++;
After fixing the get_memory_clk_period_ps() to round up to near 1ps, the calculation is more accurate (well, not as floating point). And ow clks_rem holds the correct remainder.
York

On Jul 29, 2011, at 17:46, York Sun wrote:
On Fri, 2011-07-29 at 16:26 -0500, Moffett, Kyle D wrote:
On Jul 29, 2011, at 14:46, York Sun wrote:
On Fri, 2011-07-29 at 11:35 -0700, York Sun wrote:
On Fri, 2011-07-29 at 12:20 -0500, Moffett, Kyle D wrote:
On Jul 28, 2011, at 17:41, York Sun wrote:
I found a problem with the round up. Please try
picos_to_mclk(mclk_to_picos(3))
You will notice the result is round up. I am sure it is unexpected.
Hmm... what does fsl_ddr_get_mem_data_rate() return on your system? I cannot reproduce this here with my memory freq of 800MHz.
[...snip...]
Obviously you could also get rid of the rounding alltogether, but I don't know why it was put in the code in the first place...
I think the problem comes from the round up. But your calculation of remainder is not right.
Can you tell me what your value of get_ddr_freq() is? I may be able to reproduce the issue here.
264000000
Ok, so that's a 264MHz DDR frequency, right? 264 * 10^6?
If I plug that number in to the code I can reproduce your result: picos_to_mclk(mclk_to_picos(3)) == 4
But that happens with both the new code *AND* the old code. I ran them side-by-side in the same test-harness: OLD: get_memory_clk_period_ps(): 7580 picos_to_mclk(mclk_to_picos(3)): 4 ---------------------- NEW: get_memory_clk_period_ps(): 7580 picos_to_mclk(mclk_to_picos(3)): 4
So there's a bug, but it's always been there...
Just to show the floating-point math really quick:
2 * (10^12 picosec/sec) / (264 * 10^6 cyc/sec) == 7575.757575..... picoseconds
Since the code rounds to the nearest 10ps, we get 7580ps. Then 3 mclk is 22740ps, but it should be 22727.272727.....
The picos_to_mclk() function always rounds up, because we must avoid undercutting the SPD timing margin. If your SPD specifies a minimum of a 7590ps timing window, you MUST use a 2 mclk delay (even though it's painfully close to 1 mclk), because otherwise you will get RAM errors.
Therefore mclk_to_picos() and get_memory_clk_period_ps() should always round down, even though they currently round to the closest 10ps.
Continuing the example: If we later we do picos_to_mclk(22740): 22740ps * (264 * 10^5 cyc/sec) * 0.5 / (10^12 picosec/sec) == 3.00168
When we do the integer math with remainder, we get: 3 rem 3360000000
In both cases, we correctly automatically round up to 4.
I believe I have a fix for the problem; here's my fixed test case log: get_memory_clk_period_ps(): 7575 picos_to_mclk(mclk_to_picos(3)): 3
And here is the new code (the picos_to_mclk() function is unchanged):
unsigned int mclk_to_picos(unsigned int mclk) { unsigned int data_rate = get_ddr_freq(0);
/* Be careful to avoid a 64-bit full-divide (result fits in 32) */ unsigned long long mclk_ps = ULL_2E12 * mclk; do_div(mclk_ps, data_rate); return mclk_ps; }
unsigned int get_memory_clk_period_ps(void) { return mclk_to_picos(1); }
Cheers, Kyle Moffett

On Wed, 2011-02-23 at 11:35 -0500, Kyle Moffett wrote:
The current FreeScale MPC-8xxx DDR SPD interpreter is using full 64-bit integer divide operations to convert between nanoseconds and DDR clock cycles given arbitrary DDR clock frequencies.
Since all of the inputs to this are 32-bit (nanoseconds, clock cycles, and DDR frequencies), we can easily restructure the computation to use the "do_div()" function to perform 64-bit/32-bit divide operations.
This decreases compute time rather significantly for each conversion and avoids bringing in a very complicated function from libgcc.
It should be noted that nothing else in U-Boot or the Linux kernel seems to require a full 64-bit divide on any 32-bit PowerPC.
Build-and-boot-tested on the HWW-1U-1A board using DDR2 SPD detection.
Signed-off-by: Kyle Moffett Kyle.D.Moffett@boeing.com Cc: Andy Fleming afleming@gmail.com Cc: Kumar Gala kumar.gala@freescale.com Cc: Wolfgang Denk wd@denx.de Cc: Kim Phillips kim.phillips@freescale.com
Acked-by: York Sun yorksun@freescale.com

On Feb 23, 2011, at 10:35 AM, Kyle Moffett wrote:
The current FreeScale MPC-8xxx DDR SPD interpreter is using full 64-bit integer divide operations to convert between nanoseconds and DDR clock cycles given arbitrary DDR clock frequencies.
Since all of the inputs to this are 32-bit (nanoseconds, clock cycles, and DDR frequencies), we can easily restructure the computation to use the "do_div()" function to perform 64-bit/32-bit divide operations.
This decreases compute time rather significantly for each conversion and avoids bringing in a very complicated function from libgcc.
It should be noted that nothing else in U-Boot or the Linux kernel seems to require a full 64-bit divide on any 32-bit PowerPC.
Build-and-boot-tested on the HWW-1U-1A board using DDR2 SPD detection.
Signed-off-by: Kyle Moffett Kyle.D.Moffett@boeing.com Cc: Andy Fleming afleming@gmail.com Cc: Kumar Gala kumar.gala@freescale.com Cc: Wolfgang Denk wd@denx.de Cc: Kim Phillips kim.phillips@freescale.com
Ok, so this patch touches a rather sensitive part of the fsl_ddr logic.
I spent a fair amount of time trying to verify that the resulting math is exactly the same as it was before, but the consequences of a mistake are insideous timing problems. Additional in-depth review would be much appreciated.
Cheers, Kyle Moffett
Changelog: v2: Resubmitted separately from the other HWW-1U-1A patches
arch/powerpc/cpu/mpc8xxx/ddr/util.c | 58 ++++++++++++++++++++++++---------- 1 files changed, 41 insertions(+), 17 deletions(-)
Can you rebase on 'next' branch of git://git.denx.de/u-boot-mpc85xx.git
- k

The current FreeScale MPC-8xxx DDR SPD interpreter is using full 64-bit integer divide operations to convert between nanoseconds and DDR clock cycles given arbitrary DDR clock frequencies.
Since all of the inputs to this are 32-bit (nanoseconds, clock cycles, and DDR frequencies), we can easily restructure the computation to use the "do_div()" function to perform 64-bit/32-bit divide operations.
On 64-bit this change is basically a no-op, because do_div is implemented as a literal 64-bit divide operation and the instruction scheduling works out almost the same.
On 32-bit PowerPC a fully accurate 64/64 divide (__udivdi3 in libgcc) is over 1.1kB of code and thousands of heavily dependent cycles to compute, all of which is linked from libgcc. Another 1.2kB of code comes in for the function __umoddi3.
It should be noted that nothing else in U-Boot or the Linux kernel seems to require a full 64-bit divide on my 32-bit PowerPC.
Build-and-boot-tested on the HWW-1U-1A board using DDR2 SPD detection.
Signed-off-by: Kyle Moffett Kyle.D.Moffett@boeing.com Acked-by: York Sun yorksun@freescale.com Cc: Andy Fleming afleming@gmail.com Cc: Kumar Gala kumar.gala@freescale.com
-- Changelog: v2: Resubmitted separately from the other HWW-1U-1A patches v3: Rebased on the 'next' branch of git://git.denx.de/u-boot-mpc85xx.git
arch/powerpc/cpu/mpc8xxx/ddr/util.c | 56 +++++++++++++++++++++++++---------- 1 files changed, 40 insertions(+), 16 deletions(-)
diff --git a/arch/powerpc/cpu/mpc8xxx/ddr/util.c b/arch/powerpc/cpu/mpc8xxx/ddr/util.c index b9a5a69..02908b4 100644 --- a/arch/powerpc/cpu/mpc8xxx/ddr/util.c +++ b/arch/powerpc/cpu/mpc8xxx/ddr/util.c @@ -8,9 +8,17 @@
#include <common.h> #include <asm/fsl_law.h> +#include <div64.h>
#include "ddr.h"
+/* To avoid 64-bit full-divides, we factor this here */ +#define ULL_2e12 2000000000000ULL +#define UL_5pow12 244140625UL +#define UL_2pow13 (1UL << 13) + +#define ULL_8Fs 0xFFFFFFFFULL + /* * Round mclk_ps to nearest 10 ps in memory controller code. * @@ -20,35 +28,51 @@ */ unsigned int get_memory_clk_period_ps(void) { - unsigned int mclk_ps; + unsigned int data_rate = get_ddr_freq(0); + unsigned int result; + + /* Round to nearest 10ps, being careful about 64-bit multiply/divide */ + unsigned long long mclk_ps = ULL_2e12;
- mclk_ps = 2000000000000ULL / get_ddr_freq(0); - /* round to nearest 10 ps */ - return 10 * ((mclk_ps + 5) / 10); + /* Add 5*data_rate, for rounding */ + mclk_ps += 5*(unsigned long long)data_rate; + + /* Now perform the big divide, the result fits in 32-bits */ + do_div(mclk_ps, data_rate); + result = mclk_ps; + + /* We still need to round to 10ps */ + return 10 * (result/10); }
/* Convert picoseconds into DRAM clock cycles (rounding up if needed). */ unsigned int picos_to_mclk(unsigned int picos) { - const unsigned long long ULL_2e12 = 2000000000000ULL; - const unsigned long long ULL_8Fs = 0xFFFFFFFFULL; - unsigned long long clks; - unsigned long long clks_temp; + unsigned long long clks, clks_rem;
+ /* Short circuit for zero picos */ if (!picos) return 0;
- clks = get_ddr_freq(0) * (unsigned long long) picos; - clks_temp = clks; - clks = clks / ULL_2e12; - if (clks_temp % ULL_2e12) { + /* First multiply the time by the data rate (32x32 => 64) */ + clks = picos * (unsigned long long)get_ddr_freq(0); + + /* + * Now divide by 5^12 and track the 32-bit remainder, then divide + * by 2*(2^12) using shifts (and updating the remainder). + */ + clks_rem = do_div(clks, UL_5pow12); + clks_rem <<= 13; + clks_rem |= clks & (UL_2pow13-1); + clks >>= 13; + + /* If we had a remainder, then round up */ + if (clks_rem) clks++; - }
- if (clks > ULL_8Fs) { + /* Clamp to the maximum representable value */ + if (clks > ULL_8Fs) clks = ULL_8Fs; - } - return (unsigned int) clks; }

On Mar 15, 2011, at 10:23 AM, Kyle Moffett wrote:
The current FreeScale MPC-8xxx DDR SPD interpreter is using full 64-bit integer divide operations to convert between nanoseconds and DDR clock cycles given arbitrary DDR clock frequencies.
Since all of the inputs to this are 32-bit (nanoseconds, clock cycles, and DDR frequencies), we can easily restructure the computation to use the "do_div()" function to perform 64-bit/32-bit divide operations.
On 64-bit this change is basically a no-op, because do_div is implemented as a literal 64-bit divide operation and the instruction scheduling works out almost the same.
On 32-bit PowerPC a fully accurate 64/64 divide (__udivdi3 in libgcc) is over 1.1kB of code and thousands of heavily dependent cycles to compute, all of which is linked from libgcc. Another 1.2kB of code comes in for the function __umoddi3.
It should be noted that nothing else in U-Boot or the Linux kernel seems to require a full 64-bit divide on my 32-bit PowerPC.
Build-and-boot-tested on the HWW-1U-1A board using DDR2 SPD detection.
Signed-off-by: Kyle Moffett Kyle.D.Moffett@boeing.com Acked-by: York Sun yorksun@freescale.com Cc: Andy Fleming afleming@gmail.com Cc: Kumar Gala kumar.gala@freescale.com
-- Changelog: v2: Resubmitted separately from the other HWW-1U-1A patches v3: Rebased on the 'next' branch of git://git.denx.de/u-boot-mpc85xx.git
arch/powerpc/cpu/mpc8xxx/ddr/util.c | 56 +++++++++++++++++++++++++---------- 1 files changed, 40 insertions(+), 16 deletions(-)
applied to 85xx next
- k

Dear Kyle Moffett,
In message 1300202627-7245-1-git-send-email-Kyle.D.Moffett@boeing.com you wrote:
The current FreeScale MPC-8xxx DDR SPD interpreter is using full 64-bit integer divide operations to convert between nanoseconds and DDR clock cycles given arbitrary DDR clock frequencies.
...
+/* To avoid 64-bit full-divides, we factor this here */ +#define ULL_2e12 2000000000000ULL +#define UL_5pow12 244140625UL +#define UL_2pow13 (1UL << 13)
+#define ULL_8Fs 0xFFFFFFFFULL
Unfortunately this is already in mainline. Can you please send a cleanup patch to fix the CamelCaps macro names?
Thanks.
Best regards,
Wolfgang Denk

Dear Kyle Moffett,
In message 1298478920-22044-1-git-send-email-Kyle.D.Moffett@boeing.com you wrote:
The current FreeScale MPC-8xxx DDR SPD interpreter is using full 64-bit integer divide operations to convert between nanoseconds and DDR clock cycles given arbitrary DDR clock frequencies.
...
+/* To avoid 64-bit full-divides, we factor this here */ +#define ULL_2e12 2000000000000ULL +#define UL_5pow12 244140625UL +#define UL_2pow13 (1UL << 13)
+#define ULL_8Fs 0xFFFFFFFFULL
We don't allow CamelCaps macro names. Macros are all caps, please.
Best regards,
Wolfgang Denk
participants (5)
-
Kumar Gala
-
Kyle Moffett
-
Moffett, Kyle D
-
Wolfgang Denk
-
York Sun