[U-Boot] [PATCH 0/9] CACHE: Finishing touches

Add a few finishing touches to the cache support, and ultimately enable cache support on m28.
BIG FAT NOTE: This is for -next. This may break something. Compile testing IS STILL IN PROGRESS!
Marek Vasut (9):
  COMMON: Add __stringify() function
  CACHE: Add cacheline_aligned() macro
  CACHE: ext2load: Test if start address is aligned
  CACHE: fatload: Test if start address is aligned
  CACHE: mmc read/write: Test if start address is aligned
  CACHE: nand read/write: Test if start address is aligned
  CACHE: net: Test if start address is aligned
  CACHE: net: asix: Fix asix driver to work with data cache on
  M28EVK: Enable instruction and data cache
 common/cmd_ext2.c        |  3 +++
 common/cmd_fat.c         |  4 ++++
 common/cmd_mmc.c         |  2 ++
 common/cmd_nand.c        |  9 +++++++++
 common/cmd_net.c         |  4 ++++
 drivers/usb/eth/asix.c   | 29 ++++++++++++++++-------------
 include/common.h         | 25 +++++++++++++++++++++++++
 include/configs/m28evk.h |  2 --
 8 files changed, 63 insertions(+), 15 deletions(-)
Cc: Wolfgang Denk wd@denx.de

This function converts a constant number to a string in the preprocessor. This is useful as it allows wider use of puts() in place of printf().
Signed-off-by: Marek Vasut marex@denx.de
Cc: Wolfgang Denk wd@denx.de
---
 include/common.h | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/include/common.h b/include/common.h
index 7cf15c5..322569e 100644
--- a/include/common.h
+++ b/include/common.h
@@ -268,6 +268,13 @@ typedef void (interrupt_handler_t)(void *);
 	const typeof( ((type *)0)->member ) *__mptr = (ptr);	\
 	(type *)( (char *)__mptr - offsetof(type,member) );})
 
+/**
+ * __stringify - preprocessor magic to return string from number
+ * @x: constant number
+ */
+#define __stringify_1(x...)	#x
+#define __stringify(x...)	__stringify_1(x)
+
 /*
  * Function Prototypes
  */
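[Editor's note: for illustration, a minimal sketch of why the two-level expansion is needed; FOO is just an example constant, not part of the patch.]

	#define FOO 42

	#define __stringify_1(x...)	#x
	#define __stringify(x...)	__stringify_1(x)

	/* The indirection matters: __stringify() expands its argument
	 * first, so __stringify(FOO) yields "42", while __stringify_1(FOO)
	 * alone would yield "FOO", because # stringifies the token before
	 * it is macro-expanded. */
	const char *s = __stringify(FOO);	/* s points to "42" */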

This macro returns 1 if the argument (an address) is aligned and zero otherwise. It will be used to test user-supplied addresses passed to various commands, to prevent the user from loading data to/from an unaligned address when caches are enabled.

This is implemented as a macro because macros are expanded where they are used; it can therefore easily be instrumented to report the position of the fault.
Signed-off-by: Marek Vasut marex@denx.de
Cc: Wolfgang Denk wd@denx.de
---
 include/common.h | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/include/common.h b/include/common.h
index 322569e..17c64b0 100644
--- a/include/common.h
+++ b/include/common.h
@@ -730,6 +730,24 @@ void invalidate_dcache_range(unsigned long start, unsigned long stop);
 void invalidate_dcache_all(void);
 void invalidate_icache_all(void);
 
+/* Test if address is cache-aligned. Returns 1 if it is, 0 otherwise. */
+#define cacheline_aligned(addr) \
+	({ \
+		int __ret; \
+		if (!dcache_status()) { \
+			__ret = 1; \
+		} else if ((addr) & (ARCH_DMA_MINALIGN - 1)) { \
+			puts("Align load address to " \
+				__stringify(ARCH_DMA_MINALIGN) \
+				" bytes when using caches!\n"); \
+			__ret = 0; \
+		} else { \
+			__ret = 1; \
+		} \
+		__ret; \
+	})
+
+
 /* arch/$(ARCH)/lib/ticks.S */
 unsigned long long get_ticks(void);
 void wait_ticks (unsigned long);
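[Editor's note: for context, a hedged sketch of how a command handler is expected to use this macro; the function and its argument are hypothetical, not part of this series.]

	/* Hypothetical command handler: reject a user-supplied address up
	 * front; cacheline_aligned() itself prints the alignment warning. */
	static int do_example_load(ulong addr)
	{
		if (!cacheline_aligned(addr))
			return 1;	/* abort before any DMA is set up */

		/* ... proceed with the DMA-backed transfer to 'addr' ... */
		return 0;
	}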

On 06/24/2012 07:17 PM, Marek Vasut wrote:
This macro returns 1 if the argument (an address) is aligned and zero otherwise. It will be used to test user-supplied addresses passed to various commands, to prevent the user from loading data to/from an unaligned address when caches are enabled.

This is implemented as a macro because macros are expanded where they are used; it can therefore easily be instrumented to report the position of the fault.
Signed-off-by: Marek Vasut marex@denx.de Cc: Wolfgang Denk wd@denx.de
 include/common.h | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/include/common.h b/include/common.h
index 322569e..17c64b0 100644
--- a/include/common.h
+++ b/include/common.h
@@ -730,6 +730,24 @@ void invalidate_dcache_range(unsigned long start, unsigned long stop);
 void invalidate_dcache_all(void);
 void invalidate_icache_all(void);
+/* Test if address is cache-aligned. Returns 1 if it is, 0 otherwise. */
+#define cacheline_aligned(addr) \
+	({ \
+		int __ret; \
+		if (!dcache_status()) { \
+			__ret = 1; \
+		} else if ((addr) & (ARCH_DMA_MINALIGN - 1)) { \
+			puts("Align load address to " \
+				__stringify(ARCH_DMA_MINALIGN) \
+				" bytes when using caches!\n"); \
+			__ret = 0; \
+		} else { \
+			__ret = 1; \
+		} \
+		__ret; \
+	})
What if it's a store rather than a load? If this is only supposed to be used for loads (because on a store you can flush the cache rather than invalidate), it's not labelled that way, the changelog says "to/from unaligned address", and the caller might be common code that doesn't know which direction the transfer is in. Besides, it would be awkward user interface to allow an address to be used in one direction but not the other.
What if the caller wants to try a different strategy if this returns 0, rather than print an error?
Why should the success of a command depend on whether caches are enabled? If we're going to forbid unaligned addresses in certain contexts, shouldn't it always be forbidden to ensure consistent user experience? Or if we're going to be picky about when we reject it, why don't we care whether the device in question does DMA, and whether that DMA is coherent?
-Scott

Dear Scott Wood,
On 06/24/2012 07:17 PM, Marek Vasut wrote:
This macro returns 1 if the argument (an address) is aligned and zero otherwise. It will be used to test user-supplied addresses passed to various commands, to prevent the user from loading data to/from an unaligned address when caches are enabled.

This is implemented as a macro because macros are expanded where they are used; it can therefore easily be instrumented to report the position of the fault.
Signed-off-by: Marek Vasut marex@denx.de Cc: Wolfgang Denk wd@denx.de
 include/common.h | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/include/common.h b/include/common.h
index 322569e..17c64b0 100644
--- a/include/common.h
+++ b/include/common.h
@@ -730,6 +730,24 @@ void invalidate_dcache_range(unsigned long start, unsigned long stop);
 void invalidate_dcache_all(void);
 void invalidate_icache_all(void);
+/* Test if address is cache-aligned. Returns 1 if it is, 0 otherwise. */
+#define cacheline_aligned(addr) \
+	({ \
+		int __ret; \
+		if (!dcache_status()) { \
+			__ret = 1; \
+		} else if ((addr) & (ARCH_DMA_MINALIGN - 1)) { \
+			puts("Align load address to " \
+				__stringify(ARCH_DMA_MINALIGN) \
+				" bytes when using caches!\n"); \
+			__ret = 0; \
+		} else { \
+			__ret = 1; \
+		} \
+		__ret; \
+	})
What if it's a store rather than a load?
Good point #1
If this is only supposed to be used for loads (because on a store you can flush the cache rather than invalidate), it's not labelled that way, the changelog says "to/from unaligned address", and the caller might be common code that doesn't know which direction the transfer is in.
And we're back at the flush/invalidate thing. This is what I'd really love to understand once and for all:
1) We must prevent the user from loading to address 0x.......1 (unaligned); I'll get back to this later on.

2) If the user supplied an address that starts at an aligned location but ends at an unaligned one:

Terminology:
  va     -- unrelated variable
  [****] -- aligned range

2a) LOAD from external source to local address: The cache must be invalidated over that area. I see the following issue:

[---------buffer---------][va]
[********][********][********]

So consider the scenario where:
  i) va is written in U-Boot
 ii) data arrive into the buffer via DMA at the same time
iii) the cache is invalidated over the whole buffer
This will cause variable "va" to be lost forever. The "va" variable might as well be user data, therefore in order to protect those, we should first flush the user-supplied area.
2b) STORE from local address to external source: Basically the inverted problem as in 2a), but if the driver is written properly, shouldn't be any issue.
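[Editor's note: to make 2a) concrete, a sketch of the ordering being argued for here; the DMA calls are placeholders, not a real U-Boot API.]

	/* Receiving into 'buf' when it may share cache lines with unrelated
	 * data such as 'va': flush first, so dirty lines (including the one
	 * holding 'va') are written back before the DMA starts, then
	 * invalidate afterwards so the CPU sees the DMAed data rather than
	 * stale cache contents. */
	flush_dcache_range(start, stop);	/* write back 'va' etc. */
	start_dma_receive(buf, len);		/* placeholder driver call */
	wait_for_dma_done();			/* placeholder */
	invalidate_dcache_range(start, stop);	/* drop stale lines */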
Besides, it would be awkward user interface to allow an address to be used in one direction but not the other.
Correct, it should be possible to adjust the message.
What if the caller wants to try a different strategy if this returns 0, rather than print an error?
Good point.
Why should the success of a command depend on whether caches are enabled?
On less capable architectures, flushing caches over an unaligned area causes real mayhem (and DMA depends on this), and it's really tough to debug. If the caches are off, it's all good.
The good point is that this code should probably be enabled on a per-CPU-model basis? Or be made entirely configurable in include/configs/....
If we're going to forbid unaligned addresses in certain contexts, shouldn't it always be forbidden to ensure consistent user experience?
This is a good argument ... I wonder, let's hear others opinions.
Or if we're going to be picky about when we reject it, why don't we care whether the device in question does DMA, and whether that DMA is coherent?
Correct, good point. But then this is something that should be added into the upcoming U-Boot driver model!
-Scott
Best regards, Marek Vasut

Hi Marek,
On 06/25/2012 04:30 PM, Marek Vasut wrote:
Dear Scott Wood,
On 06/24/2012 07:17 PM, Marek Vasut wrote:
This macro returns 1 if the argument (an address) is aligned and zero otherwise. It will be used to test user-supplied addresses passed to various commands, to prevent the user from loading data to/from an unaligned address when caches are enabled.

This is implemented as a macro because macros are expanded where they are used; it can therefore easily be instrumented to report the position of the fault.
Signed-off-by: Marek Vasut marex@denx.de Cc: Wolfgang Denk wd@denx.de
 include/common.h | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/include/common.h b/include/common.h
index 322569e..17c64b0 100644
--- a/include/common.h
+++ b/include/common.h
@@ -730,6 +730,24 @@ void invalidate_dcache_range(unsigned long start, unsigned long stop);
 void invalidate_dcache_all(void);
 void invalidate_icache_all(void);
+/* Test if address is cache-aligned. Returns 1 if it is, 0 otherwise. */
+#define cacheline_aligned(addr) \
+	({ \
+		int __ret; \
+		if (!dcache_status()) { \
+			__ret = 1; \
+		} else if ((addr) & (ARCH_DMA_MINALIGN - 1)) { \
+			puts("Align load address to " \
+				__stringify(ARCH_DMA_MINALIGN) \
+				" bytes when using caches!\n"); \
+			__ret = 0; \
+		} else { \
+			__ret = 1; \
+		} \
+		__ret; \
+	})
What if it's a store rather than a load?
Good point #1
If this is only supposed to be used for loads (because on a store you can flush the cache rather than invalidate), it's not labelled that way, the changelog says "to/from unaligned address", and the caller might be common code that doesn't know which direction the transfer is in.
And we're back at the flush/invalidate thing. This is what I'd really love to understand once and for all:
1) We must prevent the user from loading to address 0x.......1 (unaligned); I'll get back to this later on.

2) If the user supplied an address that starts at an aligned location but ends at an unaligned one:
Terminology:
  va     -- unrelated variable
  [****] -- aligned range

2a) LOAD from external source to local address: The cache must be invalidated over that area. I see the following issue:

[---------buffer---------][va]
[********][********][********]

So consider the scenario where:
  i) va is written in U-Boot
 ii) data arrive into the buffer via DMA at the same time
iii) the cache is invalidated over the whole buffer
This will cause variable "va" to be lost forever. The "va" variable might as well be user data, therefore in order to protect those, we should first flush the user-supplied area.
Flushing the last cache line may corrupt the data you just DMAed in. The current implementation, after a lot of discussion [1] with Albert, is this:

Invalidate only the aligned part of the buffer. So, in the above case, the first two [********] will get invalidated, not the last one. The last unaligned cache line will result in a warning.
[1] http://thread.gmane.org/gmane.comp.boot-loaders.u-boot/105113
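[Editor's note: a minimal sketch of that policy, assuming roundup()/rounddown() helpers and the invalidate_dcache_range() from the patch context; the function itself is illustrative, not the exact mainline code.]

	/* Invalidate only the fully cache-line-aligned part of a DMA buffer;
	 * warn about (and skip) any partial cache line at either end. */
	static void invalidate_aligned_part(unsigned long start, unsigned long size)
	{
		unsigned long s = roundup(start, ARCH_DMA_MINALIGN);
		unsigned long e = rounddown(start + size, ARCH_DMA_MINALIGN);

		if (s != start || e != start + size)
			puts("Warning: skipping unaligned ends of DMA buffer\n");
		if (s < e)
			invalidate_dcache_range(s, e);
	}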
2b) STORE from local address to external source: Basically the inverted problem as in 2a), but if the driver is written properly, shouldn't be any issue.
No. Store doesn't have any problem because flush doesn't harm anything.
I have laid down the invalidate/flush requirements for DMA buffers in: doc/README.arm-caches
best regards, Aneesh

This prevents the scenario where the data cache is on and the device uses DMA to deploy data. In that case, it might not be possible to flush/invalidate the data to RAM properly. The other option is to use a bounce buffer, but that involves a lot of copying and therefore degrades performance rapidly. Therefore, disallow unaligned load addresses altogether if the data cache is on.
Signed-off-by: Marek Vasut marex@denx.de
Cc: Wolfgang Denk wd@denx.de
---
 common/cmd_ext2.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/common/cmd_ext2.c b/common/cmd_ext2.c
index 79b1e2f..45b5a0c 100644
--- a/common/cmd_ext2.c
+++ b/common/cmd_ext2.c
@@ -166,6 +166,9 @@ int do_ext2load (cmd_tbl_t *cmdtp, int flag, int argc, char * const argv[])
 		return CMD_RET_USAGE;
 	}
 
+	if (!cacheline_aligned(addr))
+		return 1;
+
 	if (!filename) {
 		puts ("** No boot file defined **\n");
 		return 1;

This prevents the scenario where the data cache is on and the device uses DMA to deploy data. In that case, it might not be possible to flush/invalidate the data to RAM properly. The other option is to use a bounce buffer, but that involves a lot of copying and therefore degrades performance rapidly. Therefore, disallow unaligned load addresses altogether if the data cache is on.
Signed-off-by: Marek Vasut marex@denx.de
Cc: Wolfgang Denk wd@denx.de
---
 common/cmd_fat.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/common/cmd_fat.c b/common/cmd_fat.c
index 559a16d..951e712 100644
--- a/common/cmd_fat.c
+++ b/common/cmd_fat.c
@@ -68,7 +68,11 @@ int do_fat_fsload (cmd_tbl_t *cmdtp, int flag, int argc, char * const argv[])
 			argv[1], dev, part);
 		return 1;
 	}
+
 	offset = simple_strtoul(argv[3], NULL, 16);
+	if (!cacheline_aligned(offset))
+		return 1;
+
 	if (argc == 6)
 		count = simple_strtoul(argv[5], NULL, 16);
 	else

This prevents the scenario where the data cache is on and the device uses DMA to deploy data. In that case, it might not be possible to flush/invalidate the data to RAM properly. The other option is to use a bounce buffer, but that involves a lot of copying and therefore degrades performance rapidly. Therefore, disallow unaligned load addresses altogether if the data cache is on.
Signed-off-by: Marek Vasut marex@denx.de
Cc: Andy Fleming afleming@freescale.com
---
 common/cmd_mmc.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/common/cmd_mmc.c b/common/cmd_mmc.c
index 750509d..a47baaa 100644
--- a/common/cmd_mmc.c
+++ b/common/cmd_mmc.c
@@ -268,6 +268,8 @@ int do_mmcops(cmd_tbl_t *cmdtp, int flag, int argc, char * const argv[])
 
 		if (state != MMC_ERASE) {
 			addr = (void *)simple_strtoul(argv[idx], NULL, 16);
+			if (!cacheline_aligned((ulong)addr))
+				return 1;
 			++idx;
 		} else
 			addr = 0;

This prevents the scenario where the data cache is on and the device uses DMA to deploy data. In that case, it might not be possible to flush/invalidate the data to RAM properly. The other option is to use a bounce buffer, but that involves a lot of copying and therefore degrades performance rapidly. Therefore, disallow unaligned load addresses altogether if the data cache is on.
Signed-off-by: Marek Vasut marex@denx.de
Cc: Scott Wood scottwood@freescale.com
---
 common/cmd_nand.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/common/cmd_nand.c b/common/cmd_nand.c
index a91ccf4..122a91c 100644
--- a/common/cmd_nand.c
+++ b/common/cmd_nand.c
@@ -609,6 +609,8 @@ int do_nand(cmd_tbl_t * cmdtp, int flag, int argc, char * const argv[])
 			goto usage;
 
 		addr = (ulong)simple_strtoul(argv[2], NULL, 16);
+		if (!cacheline_aligned(addr))
+			return 1;
 
 		read = strncmp(cmd, "read", 4) == 0; /* 1 = read, 0 = write */
 		printf("\nNAND %s: ", read ? "read" : "write");
@@ -924,6 +926,9 @@ int do_nandboot(cmd_tbl_t * cmdtp, int flag, int argc, char * const argv[])
 		addr = simple_strtoul(argv[1], NULL, 16);
 	else
 		addr = CONFIG_SYS_LOAD_ADDR;
+	if (!cacheline_aligned(addr))
+		return 1;
+
 	return nand_load_image(cmdtp, &nand_info[dev->id->num],
 			       part->offset, addr, argv[0]);
 }
@@ -956,6 +961,10 @@ usage:
 		bootstage_error(BOOTSTAGE_ID_NAND_SUFFIX);
 		return CMD_RET_USAGE;
 	}
+
+	if (!cacheline_aligned(addr))
+		return 1;
+
 	bootstage_mark(BOOTSTAGE_ID_NAND_SUFFIX);
 
 	if (!boot_device) {

On 06/24/2012 07:17 PM, Marek Vasut wrote:
This prevents the scenario where data cache is on and the device uses DMA to deploy data. In that case, it might not be possible to flush/invalidate data to RAM properly. The other option is to use bounce buffer,
Or get cache coherent hardware. :-)
but that involves a lot of copying and therefore degrades performance rapidly. Therefore, disallow unaligned load addresses altogether if the data cache is on.
How about use the bounce buffer only if the address is misaligned? The corrective action a user has to take is the same as with this patch, except for an additional option of living with the slight performance penalty. How often does this actually happen? How much does it actually slow things down compared to the speed of the NAND chip?
I'm hesitant to break something -- even if it's odd (literally in this case) -- that currently works on most hardware, just because one or two drivers can't handle it. It feels kind of like changing the read() and write() system calls to require cacheline alignment. :-P
Signed-off-by: Marek Vasut marex@denx.de Cc: Scott Wood scottwood@freescale.com
 common/cmd_nand.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/common/cmd_nand.c b/common/cmd_nand.c
index a91ccf4..122a91c 100644
--- a/common/cmd_nand.c
+++ b/common/cmd_nand.c
@@ -609,6 +609,8 @@ int do_nand(cmd_tbl_t * cmdtp, int flag, int argc, char * const argv[])
 			goto usage;
addr = (ulong)simple_strtoul(argv[2], NULL, 16);
if (!cacheline_aligned(addr))
return 1;
There's no way you can just return like this without printing an error that lets the user know that the operation wasn't performed, and why.
-Scott

On Mon, Jun 25, 2012 at 11:58:10AM -0500, Scott Wood wrote:
On 06/24/2012 07:17 PM, Marek Vasut wrote:
This prevents the scenario where data cache is on and the device uses DMA to deploy data. In that case, it might not be possible to flush/invalidate data to RAM properly. The other option is to use bounce buffer,
Or get cache coherent hardware. :-)
but that involves a lot of copying and therefore degrades performance rapidly. Therefore, disallow unaligned load addresses altogether if the data cache is on.
How about use the bounce buffer only if the address is misaligned? The corrective action a user has to take is the same as with this patch, except for an additional option of living with the slight performance penalty. How often does this actually happen? How much does it actually slow things down compared to the speed of the NAND chip?
We would need to architect things such that any 'load' command would be routed through this logic. A solvable problem for sure, but a little bit larger than just for NAND :)
I'm hesitant to break something -- even if it's odd (literally in this case) -- that currently works on most hardware, just because one or two drivers can't handle it. It feels kind of like changing the read() and write() system calls to require cacheline alignment. :-P
I don't want to get into an ARM vs PowerPC argument. I think the best answer is that I'm not sure having things unaligned works totally right today as I did a silly test of loading a uImage to 0x82000001 and bootm hung inside of U-Boot not long ago. Can you try that on some cache coherent hardware and see if that works?
[snip]
@@ -609,6 +609,8 @@ int do_nand(cmd_tbl_t * cmdtp, int flag, int argc, char * const argv[])
 			goto usage;
addr = (ulong)simple_strtoul(argv[2], NULL, 16);
if (!cacheline_aligned(addr))
return 1;
There's no way you can just return like this without printing an error that lets the user know that the operation wasn't performed, and why.
Note that cacheline_aligned does the error print (patch 2 of 9).

On 06/25/2012 01:43 PM, Tom Rini wrote:
On Mon, Jun 25, 2012 at 11:58:10AM -0500, Scott Wood wrote:
On 06/24/2012 07:17 PM, Marek Vasut wrote:
This prevents the scenario where data cache is on and the device uses DMA to deploy data. In that case, it might not be possible to flush/invalidate data to RAM properly. The other option is to use bounce buffer,
Or get cache coherent hardware. :-)
but that involves a lot of copying and therefore degrades performance rapidly. Therefore, disallow unaligned load addresses altogether if the data cache is on.
How about use the bounce buffer only if the address is misaligned? The corrective action a user has to take is the same as with this patch, except for an additional option of living with the slight performance penalty. How often does this actually happen? How much does it actually slow things down compared to the speed of the NAND chip?
We would need to architect things such that any 'load' command would be routed through this logic.
It's something the driver backend should handle (possibly via a common helper library). The fact that you can't do a DMA transfer to an unaligned buffer is a hardware-specific detail, just as is the fact that you're setting up a DMA buffer in the first place.
I'm hesitant to break something -- even if it's odd (literally in this case) -- that currently works on most hardware, just because one or two drivers can't handle it. It feels kind of like changing the read() and write() system calls to require cacheline alignment. :-P
I don't want to get into an ARM vs PowerPC argument. I think the best answer is that I'm not sure having things unaligned works totally right today as I did a silly test of loading a uImage to 0x82000001 and bootm hung inside of U-Boot not long ago. Can you try that on some cache coherent hardware and see if that works?
I'm not sure what bootm has to do with nand (and the fact that some ppc is cache coherent actually doesn't matter, since we don't do DMA for NAND), but I was able to bootm from an odd RAM address, and "nand read" to an odd RAM address, on p5020ds.
[snip]
@@ -609,6 +609,8 @@ int do_nand(cmd_tbl_t * cmdtp, int flag, int argc, char * const argv[])
 			goto usage;
addr = (ulong)simple_strtoul(argv[2], NULL, 16);
if (!cacheline_aligned(addr))
return 1;
There's no way you can just return like this without printing an error that lets the user know that the operation wasn't performed, and why.
Note that cacheline_aligned does the error print (patch 2 of 9).
Ah. I saw this patch first since it was CCed to me. :-)
-Scott

On 06/25/2012 01:08 PM, Scott Wood wrote:
On 06/25/2012 01:43 PM, Tom Rini wrote:
On Mon, Jun 25, 2012 at 11:58:10AM -0500, Scott Wood wrote:
On 06/24/2012 07:17 PM, Marek Vasut wrote:
This prevents the scenario where data cache is on and the device uses DMA to deploy data. In that case, it might not be possible to flush/invalidate data to RAM properly. The other option is to use bounce buffer,
Or get cache coherent hardware. :-)
but that involves a lot of copying and therefore degrades performance rapidly. Therefore, disallow unaligned load addresses altogether if the data cache is on.
How about use the bounce buffer only if the address is misaligned? The corrective action a user has to take is the same as with this patch, except for an additional option of living with the slight performance penalty. How often does this actually happen? How much does it actually slow things down compared to the speed of the NAND chip?
We would need to architect things such that any 'load' command would be routed through this logic.
It's something the driver backend should handle (possibly via a common helper library). The fact that you can't do a DMA transfer to an unaligned buffer is a hardware-specific detail, just as is the fact that you're setting up a DMA buffer in the first place.
Right. What I'm trying to say is that it's not a NAND problem, it's an unaligned-address problem, so the solution needs to be easily used everywhere.
I'm hesitant to break something -- even if it's odd (literally in this case) -- that currently works on most hardware, just because one or two drivers can't handle it. It feels kind of like changing the read() and write() system calls to require cacheline alignment. :-P
I don't want to get into an ARM vs PowerPC argument. I think the best answer is that I'm not sure having things unaligned works totally right today as I did a silly test of loading a uImage to 0x82000001 and bootm hung inside of U-Boot not long ago. Can you try that on some cache coherent hardware and see if that works?
I'm not sure what bootm has to do with nand (and the fact that some ppc is cache coherent actually doesn't matter, since we don't do DMA for NAND), but I was able to bootm from an odd RAM address, and "nand read" to an odd RAM address, on p5020ds.
On ARM-land we have a lot of problems with unaligned addresses, even with cache off. I went to reproduce the original bootm problem and ran into fatload hanging. tftp didn't fail but bootm hangs.
-- Tom

On 06/25/2012 03:48 PM, Tom Rini wrote:
On 06/25/2012 01:08 PM, Scott Wood wrote:
On 06/25/2012 01:43 PM, Tom Rini wrote:
On Mon, Jun 25, 2012 at 11:58:10AM -0500, Scott Wood wrote:
On 06/24/2012 07:17 PM, Marek Vasut wrote:
This prevents the scenario where data cache is on and the device uses DMA to deploy data. In that case, it might not be possible to flush/invalidate data to RAM properly. The other option is to use bounce buffer,
Or get cache coherent hardware. :-)
but that involves a lot of copying and therefore degrades performance rapidly. Therefore, disallow unaligned load addresses altogether if the data cache is on.
How about use the bounce buffer only if the address is misaligned? The corrective action a user has to take is the same as with this patch, except for an additional option of living with the slight performance penalty. How often does this actually happen? How much does it actually slow things down compared to the speed of the NAND chip?
We would need to architect things such that any 'load' command would be routed through this logic.
It's something the driver backend should handle (possibly via a common helper library). The fact that you can't do a DMA transfer to an unaligned buffer is a hardware-specific detail, just as is the fact that you're setting up a DMA buffer in the first place.
Right. What I'm trying to say is that it's not a NAND problem, it's an unaligned-address problem, so the solution needs to be easily used everywhere.
OK, so fix it in each driver that has this issue. A lot of drivers are probably not so performance critical that you can't just always use a bounce buffer. A static buffer plus memcpy isn't that burdensome -- it's close to what the drivers for non-DMA hardware do. For higher performance peripherals, throw in an if-statement or two. It doesn't seem like something that needs a U-Boot-wide change.
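[Editor's note: as a rough illustration of the conditional approach being suggested; dma_read(), BOUNCE_SIZE and the surrounding names are hypothetical, not an existing U-Boot helper.]

	#define BOUNCE_SIZE	4096

	static char bounce[BOUNCE_SIZE]
		__attribute__((aligned(ARCH_DMA_MINALIGN)));

	static int driver_read(void *dst, size_t len)
	{
		char *p = dst;

		/* Fast path: aligned destination, DMA straight into it. */
		if (!((uintptr_t)p & (ARCH_DMA_MINALIGN - 1)))
			return dma_read(p, len);	/* hypothetical DMA call */

		/* Slow path: stage through the aligned bounce buffer. */
		while (len) {
			size_t chunk = len < sizeof(bounce) ? len : sizeof(bounce);
			int ret = dma_read(bounce, chunk);

			if (ret)
				return ret;
			memcpy(p, bounce, chunk);
			p += chunk;
			len -= chunk;
		}
		return 0;
	}

Only misaligned callers pay for the memcpy; aligned callers keep the fast path.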
In the specific case of NAND, how many NAND drivers use DMA at all?
I'm not sure what bootm has to do with nand (and the fact that some ppc is cache coherent actually doesn't matter, since we don't do DMA for NAND), but I was able to bootm from an odd RAM address, and "nand read" to an odd RAM address, on p5020ds.
On ARM-land we have a lot of problems with unaligned addresses, even with cache off. I went to reproduce the original bootm problem and ran into fatload hanging. tftp didn't fail but bootm hangs.
Maybe you can't take alignment exceptions during bootm? PPC doesn't normally take alignment checks, but we would have trouble with this scenario if it did, since bootm clobbers the exception vectors.
-Scott

On 06/25/2012 02:17 PM, Scott Wood wrote:
On 06/25/2012 03:48 PM, Tom Rini wrote:
On 06/25/2012 01:08 PM, Scott Wood wrote:
On 06/25/2012 01:43 PM, Tom Rini wrote:
On Mon, Jun 25, 2012 at 11:58:10AM -0500, Scott Wood wrote:
On 06/24/2012 07:17 PM, Marek Vasut wrote:
This prevents the scenario where data cache is on and the device uses DMA to deploy data. In that case, it might not be possible to flush/invalidate data to RAM properly. The other option is to use bounce buffer,
Or get cache coherent hardware. :-)
but that involves a lot of copying and therefore degrades performance rapidly. Therefore, disallow unaligned load addresses altogether if the data cache is on.
How about use the bounce buffer only if the address is misaligned? The corrective action a user has to take is the same as with this patch, except for an additional option of living with the slight performance penalty. How often does this actually happen? How much does it actually slow things down compared to the speed of the NAND chip?
We would need to architect things such that any 'load' command would be routed through this logic.
It's something the driver backend should handle (possibly via a common helper library). The fact that you can't do a DMA transfer to an unaligned buffer is a hardware-specific detail, just as is the fact that you're setting up a DMA buffer in the first place.
Right. What I'm trying to say is that it's not a NAND problem, it's an unaligned-address problem, so the solution needs to be easily used everywhere.
OK, so fix it in each driver that has this issue. A lot of drivers are probably not so performance critical that you can't just always use a bounce buffer. A static buffer plus memcpy isn't that burdensome -- it's close to what the drivers for non-DMA hardware do. For higher performance peripherals, throw in an if-statement or two. It doesn't seem like something that needs a U-Boot-wide change.
In the specific case of NAND, how many NAND drivers use DMA at all?
Off hand I'm not sure. I know there've been patches for OMAP parts before but they had some unaddressed feedback and aren't in mainline.
I'm not sure what bootm has to do with nand (and the fact that some ppc is cache coherent actually doesn't matter, since we don't do DMA for NAND), but I was able to bootm from an odd RAM address, and "nand read" to an odd RAM address, on p5020ds.
On ARM-land we have a lot of problems with unaligned addresses, even with cache off. I went to reproduce the original bootm problem and ran into fatload hanging. tftp didn't fail but bootm hangs.
Maybe you can't take alignment exceptions during bootm? PPC doesn't normally take alignment checks, but we would have trouble with this scenario if it did, since bootm clobbers the exception vectors.
Could be it. But again, fatload via mmc also chokes, so there are a few quite long-standing problems. What Marek and I are arguing for, I believe, is that it's not worth the effort to enable this corner case. Since it works on other arches, we shouldn't make this a silent change that slips in, certainly, but at the end of the day it's highly unlikely someone is using this in the wild and won't be able to work around it in their next design or field upgrade. I would hope so, at least.
-- Tom

Dear Scott Wood,
On 06/25/2012 03:48 PM, Tom Rini wrote:
On 06/25/2012 01:08 PM, Scott Wood wrote:
On 06/25/2012 01:43 PM, Tom Rini wrote:
On Mon, Jun 25, 2012 at 11:58:10AM -0500, Scott Wood wrote:
On 06/24/2012 07:17 PM, Marek Vasut wrote:
This prevents the scenario where data cache is on and the device uses DMA to deploy data. In that case, it might not be possible to flush/invalidate data to RAM properly. The other option is to use bounce buffer,
Or get cache coherent hardware. :-)
but that involves a lot of copying and therefore degrades performance rapidly. Therefore, disallow unaligned load addresses altogether if the data cache is on.
How about use the bounce buffer only if the address is misaligned? The corrective action a user has to take is the same as with this patch, except for an additional option of living with the slight performance penalty. How often does this actually happen? How much does it actually slow things down compared to the speed of the NAND chip?
We would need to architect things such that any 'load' command would be routed through this logic.
It's something the driver backend should handle (possibly via a common helper library). The fact that you can't do a DMA transfer to an unaligned buffer is a hardware-specific detail, just as is the fact that you're setting up a DMA buffer in the first place.
Right. What I'm trying to say is that it's not a NAND problem, it's an unaligned-address problem, so the solution needs to be easily used everywhere.
OK, so fix it in each driver that has this issue. A lot of drivers are probably not so performance critical that you can't just always use a bounce buffer. A static buffer plus memcpy isn't that burdensome -- it's close to what the drivers for non-DMA hardware do. For higher performance peripherals, throw in an if-statement or two. It doesn't seem like something that needs a U-Boot-wide change.
This is flat bull. I don't want bounce buffers growing all around U-Boot; see my previous email. I'm 120% firm on that.
And btw, it's not only about bounce buffers; it's also about other code (like FS code) that does unaligned accesses, and we're fixing that.
In the specific case of NAND, how many NAND drivers use DMA at all?
Many do, it's not only nand, it's all over the place.
SPI, NAND, MMC etc.
I'm not sure what bootm has to do with nand (and the fact that some ppc is cache coherent actually doesn't matter, since we don't do DMA for NAND), but I was able to bootm from an odd RAM address, and "nand read" to an odd RAM address, on p5020ds.
On ARM-land we have a lot of problems with unaligned addresses, even with cache off. I went to reproduce the original bootm problem and ran into fatload hanging. tftp didn't fail but bootm hangs.
Maybe you can't take alignment exceptions during bootm? PPC doesn't normally take alignment checks, but we would have trouble with this scenario if it did, since bootm clobbers the exception vectors.
-Scott
Best regards, Marek Vasut

On 06/25/2012 06:42 PM, Marek Vasut wrote:
Dear Scott Wood,
On 06/25/2012 03:48 PM, Tom Rini wrote:
Right. What I'm trying to say is that it's not a NAND problem, it's an unaligned-address problem, so the solution needs to be easily used everywhere.
OK, so fix it in each driver that has this issue. A lot of drivers are probably not so performance critical that you can't just always use a bounce buffer. A static buffer plus memcpy isn't that burdensome -- it's close to what the drivers for non-DMA hardware do. For higher performance peripherals, throw in an if-statement or two. It doesn't seem like something that needs a U-Boot-wide change.
This is flat bull. I don't want bounce buffers growing all around U-Boot; see my previous email. I'm 120% firm on that.
I'm 150% firm that you're going to need a better justification to change the user interface for everybody.
What's next, restricting "cp" (or even "cp.b") in case someone wants to use a DMA engine to copy the bits?
Note that in the case of "nand read.oob", depending on NAND page size and platform, there's a good chance that you're imposing an alignment restriction that is larger than the data being transferred even if the user asks to read the entire OOB.
What about "nand write.yaffs2" or multi-page "nand read.raw", which deal with arrays of interleaved main+spare? With a small page NAND chip, you'll need cache lines that are 16 bytes or smaller to avoid unaligned transactions -- and those will bypass your front-end check (unless the user is so "stupid" as to want to modify the data after a raw-read, and do a raw-write of a particular page).
And btw it's not about bounce buffers, it's also about other code (like FS code) which does unaligned accesses and we're fixing it.
Fixing the alignment of U-Boot-generated buffers is good. That doesn't affect the user interface. This is different.
In the specific case of NAND, how many NAND drivers use DMA at all?
Many do,
How many? Specifically, how many that have alignment restrictions, that would need to be fixed?
it's not only nand, it's all over the place.
This patch is about NAND.
-Scott

Dear Scott Wood,
On 06/25/2012 06:42 PM, Marek Vasut wrote:
Dear Scott Wood,
On 06/25/2012 03:48 PM, Tom Rini wrote:
Right. What I'm trying to say is that it's not a NAND problem, it's an unaligned-address problem, so the solution needs to be easily used everywhere.
OK, so fix it in each driver that has this issue. A lot of drivers are probably not so performance critical that you can't just always use a bounce buffer. A static buffer plus memcpy isn't that burdensome -- it's close to what the drivers for non-DMA hardware do. For higher performance peripherals, throw in an if-statement or two. It doesn't seem like something that needs a U-Boot-wide change.
This is flat bull. I don't want bounce buffers growing all around U-Boot; see my previous email. I'm 120% firm on that.
I'm 150% firm that you're going to need a better justification to change the user interface for everybody.
I'm trying to make it user-proof. Honestly, loading to some bullshit unaligned address is a corner case, so it won't affect many people. And that minority who'll be affected will adjust their stuff by a few bytes, no problem there.
What's next, restricting "cp" (or even "cp.b") in case someone wants to use a DMA engine to copy the bits?
Ad absurdum indeed, but it's partly a valid point. You'd need to restrict it somehow if that were the case, therefore I'll add it to the DM. Drivers should be able to specify their requirements.
Note that in the case of "nand read.oob", depending on NAND page size and platform, there's a good chance that you're imposing an alignment restriction that is larger than the data being transferred even if the user asks to read the entire OOB.
I don't think I completely follow here.
What about "nand write.yaffs2" or multi-page "nand read.raw", which deal with arrays of interleaved main+spare? With a small page NAND chip, you'll need cache lines that are 16 bytes or smaller to avoid unaligned transactions -- and those will bypass your front-end check (unless the user is so "stupid" as to want to modify the data after a raw-read, and do a raw-write of a particular page).
Ok, I think I'm very stupid now, probably since I have high fever. I'll read this after I sleep. Sorry Scott, I'm sure you're rolling out a valid point, it's just that I'm incapable of understanding it right now.
And btw it's not about bounce buffers, it's also about other code (like FS code) which does unaligned accesses and we're fixing it.
Fixing the alignment of U-Boot-generated buffers is good. That doesn't affect the user interface. This is different.
And it's the easy part.
In the specific case of NAND, how many NAND drivers use DMA at all?
Many do,
How many? Specifically, how many that have alignment restrictions, that would need to be fixed?
At least all on the ARM architecture. And more will come, since ARM is on the rise.
it's not only nand, it's all over the place.
This patch is about NAND.
Check the whole patchset ... and that still only covers a small part of it all.
-Scott
Best regards, Marek Vasut

On 06/25/2012 08:16 PM, Marek Vasut wrote:
Dear Scott Wood,
Note that in the case of "nand read.oob", depending on NAND page size and platform, there's a good chance that you're imposing an alignment restriction that is larger than the data being transferred even if the user asks to read the entire OOB.
I don't think I completely follow here.
Why should I need to have 64-byte alignment for transferring a 16-byte OOB?
What about "nand write.yaffs2" or multi-page "nand read.raw", which deal with arrays of interleaved main+spare? With a small page NAND chip, you'll need cache lines that are 16 bytes or smaller to avoid unaligned transactions -- and those will bypass your front-end check (unless the user is so "stupid" as to want to modify the data after a raw-read, and do a raw-write of a particular page).
Ok, I think I'm very stupid now, probably since I have high fever. I'll read this after I sleep. Sorry Scott, I'm sure you're rolling out a valid point, it's just that I'm incapable of understanding it right now.
Probably best to wait until you're feeling better for the rest of it, too. Hope you get well soon.
In the specific case of NAND, how many NAND drivers use DMA at all?
Many do,
How many? Specifically, how many that have alignment restrictions, that would need to be fixed?
At least all on the ARM architecture. And more will come, since ARM is on the rise.
atmel: no, doesn't use DMA
davinci: no, doesn't use DMA
kb9202: no, doesn't use DMA
kirkwood: no, doesn't use DMA
mxc: I don't think this uses DMA, but this driver scares me. :-)
mxs: Even scarier. Looks like this one does use DMA.
omap_gpmc: no, doesn't use DMA
s3c64xx: no, doesn't use DMA
spr: no, doesn't use DMA
s3c2410: no, doesn't use DMA
If this is about not wanting to touch the mxs_nand driver again, I sympathize. :-)
it's not only nand, it's all over the place.
This patch is about NAND.
Check the whole patchset ... and that still only covers a small part of it all.
Right, I think it's wrong elsewhere too when it's user interface, but I'll let the relevant custodians argue those cases.
-Scott

Dear Tom Rini,
On Mon, Jun 25, 2012 at 11:58:10AM -0500, Scott Wood wrote:
On 06/24/2012 07:17 PM, Marek Vasut wrote:
This prevents the scenario where data cache is on and the device uses DMA to deploy data. In that case, it might not be possible to flush/invalidate data to RAM properly. The other option is to use bounce buffer,
Or get cache coherent hardware. :-)
but that involves a lot of copying and therefore degrades performance rapidly. Therefore, disallow unaligned load addresses altogether if the data cache is on.
How about use the bounce buffer only if the address is misaligned? The corrective action a user has to take is the same as with this patch, except for an additional option of living with the slight performance penalty. How often does this actually happen? How much does it actually slow things down compared to the speed of the NAND chip?
We would need to architect things such that any 'load' command would be routed through this logic. A solvable problem for sure, but a little bit larger than just for NAND :)
It's not only NAND, it's the whole user interface that's broken right now.
I'm hesitant to break something -- even if it's odd (literally in this case) -- that currently works on most hardware, just because one or two drivers can't handle it. It feels kind of like changing the read() and write() system calls to require cacheline alignment. :-P
I don't want to get into an ARM vs PowerPC argument.
ARM is better anyway :-D
I think the best answer is that I'm not sure having things unaligned works totally right today as I did a silly test of loading a uImage to 0x82000001 and bootm hung inside of U-Boot not long ago. Can you try that on some cache coherent hardware and see if that works?
+1
[...]
Best regards, Marek Vasut

Dear Scott Wood,
On 06/24/2012 07:17 PM, Marek Vasut wrote:
This prevents the scenario where data cache is on and the device uses DMA to deploy data. In that case, it might not be possible to flush/invalidate data to RAM properly. The other option is to use bounce buffer,
Or get cache coherent hardware. :-)
Oh ... you mean powerpc? Or rather something like this http://cache.freescale.com/files/32bit/doc/fact_sheet/QORIQLS2FAMILYFS.pdf ? :-D
but that involves a lot of copying and therefore degrades performance rapidly. Therefore, disallow unaligned load addresses altogether if the data cache is on.
How about use the bounce buffer only if the address is misaligned?
Not happening, bounce buffer is bullshit, especially if we can prevent it by teaching user not to do retarded things.
It's like driving a car in the wrong lane. Sure, you can do it, but it'll eventually have some consequences. And using a bounce buffer is like driving a tank in the wrong lane ...
The corrective action a user has to take is the same as with this patch, except for an additional option of living with the slight performance penalty.
Slight is very weak word here.
How often does this actually happen? How much does it actually slow things down compared to the speed of the NAND chip?
If the user is dumb, always. But if you tell the user how to milk the most of the hardware, he'll be happier.
I'm hesitant to break something -- even if it's odd (literally in this case) -- that currently works on most hardware, just because one or two drivers can't handle it. It feels kind of like changing the read() and write() system calls to require cacheline alignment. :-P
That's actually almost right, we're doing a bootloader here, it might have limitations. We're not writing yet another operating system with no bounds on possibilities!
Signed-off-by: Marek Vasut marex@denx.de Cc: Scott Wood scottwood@freescale.com
 common/cmd_nand.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/common/cmd_nand.c b/common/cmd_nand.c
index a91ccf4..122a91c 100644
--- a/common/cmd_nand.c
+++ b/common/cmd_nand.c
@@ -609,6 +609,8 @@ int do_nand(cmd_tbl_t * cmdtp, int flag, int argc, char * const argv[])
goto usage;
addr = (ulong)simple_strtoul(argv[2], NULL, 16);
if (!cacheline_aligned(addr))
return 1;
There's no way you can just return like this without printing an error that lets the user know that the operation wasn't performed, and why.
-Scott
Best regards, Marek Vasut

On 06/25/2012 06:37 PM, Marek Vasut wrote:
Dear Scott Wood,
On 06/24/2012 07:17 PM, Marek Vasut wrote:
This prevents the scenario where data cache is on and the device uses DMA to deploy data. In that case, it might not be possible to flush/invalidate data to RAM properly. The other option is to use bounce buffer,
Or get cache coherent hardware. :-)
Oh ... you mean powerpc? Or rather something like this http://cache.freescale.com/files/32bit/doc/fact_sheet/QORIQLS2FAMILYFS.pdf ? :-D
The word "coherent/coherency" appears 5 times in that document by my count. :-)
I hope that applies to DMA, not just core-to-core.
but that involves a lot of copying and therefore degrades performance rapidly. Therefore, disallow unaligned load addresses altogether if the data cache is on.
How about use the bounce buffer only if the address is misaligned?
Not happening, bounce buffer is bullshit,
Hacking up the common frontend with a new limitation because you can't be bothered to fix your drivers is bullshit.
It's like driving a car in the wrong lane. Sure, you can do it, but it'll eventually have some consequences. And using a bounce buffer is like driving a tank in the wrong lane ...
Using a bounce buffer is like parking your car before going into the building, rather than insisting the building's hallways be paved.
The corrective action a user has to take is the same as with this patch, except for an additional option of living with the slight performance penalty.
Slight is very weak word here.
Prove me wrong with benchmarks.
How often does this actually happen? How much does it actually slow things down compared to the speed of the NAND chip?
If the user is dumb, always. But if you tell the user how to milk the most of the hardware, he'll be happier.
So, if you use bounce buffers conditionally (based on whether the address is misaligned), there's no impact except to "dumb" users, and for those users they would merely get a performance degradation rather than breakage. How is this "bullshit"?
I'm hesitant to break something -- even if it's odd (literally in this case) -- that currently works on most hardware, just because one or two drivers can't handle it. It feels kind of like changing the read() and write() system calls to require cacheline alignment. :-P
That's actually almost right, we're doing a bootloader here, it might have limitations. We're not writing yet another operating system with no bounds on possibilities!
We also don't need to bend over backwards to squeeze every last cycle out of the boot process, at the expense of a stable user interface (not to mention requiring the user to know the system's cache line size).
-Scott

Dear Scott Wood,
On 06/25/2012 06:37 PM, Marek Vasut wrote:
Dear Scott Wood,
On 06/24/2012 07:17 PM, Marek Vasut wrote:
This prevents the scenario where data cache is on and the device uses DMA to deploy data. In that case, it might not be possible to flush/invalidate data to RAM properly. The other option is to use bounce buffer,
Or get cache coherent hardware. :-)
Oh ... you mean powerpc? Or rather something like this http://cache.freescale.com/files/32bit/doc/fact_sheet/QORIQLS2FAMILYFS.pdf ? :-D
The word "coherent/coherency" appears 5 times in that document by my count. :-)
I hope that applies to DMA, not just core-to-core.
but that involves a lot of copying and therefore degrades performance rapidly. Therefore, disallow unaligned load addresses altogether if the data cache is on.
How about use the bounce buffer only if the address is misaligned?
Not happening, bounce buffer is bullshit,
Hacking up the common frontend with a new limitation because you can't be bothered to fix your drivers is bullshit.
The drivers are not broken, they have hardware limitations. And checking for those has to be done as early as possible. And it's not a new common frontend!
It's like driving a car in the wrong lane. Sure, you can do it, but it'll eventually have some consequences. And using a bounce buffer is like driving a tank in the wrong lane ...
Using a bounce buffer is like parking your car before going into the building, rather than insisting the building's hallways be paved.
The other is obviously faster, more comfortable and lets you carry more stuff at once. And if you drive a truck, you can dump a lot of payload instead of carrying it back and forth from the building. That's why there's a special garage for trucks possibly with cargo elevators etc.
The corrective action a user has to take is the same as with this patch, except for an additional option of living with the slight performance penalty.
Slight is very weak word here.
Prove me wrong with benchmarks.
Well, copying data back and forth is tremendous overhead. You don't need a benchmark to calculate something like this:
133MHz SDRAM (pumped) gives you what ... 133 Mb/s throughput (now if it's DDR, dual/quad pumped, that doesn't give you any more advantage since you have to: send address, read the data, send address, write the data ... this is expensive ... without data cache on, even more so)
Now consider you do it via a really dumb memcpy; what happens:

1) You need to read the data into a register
   1a) Send address
   1b) Read the data into the register
2) You need to write the data to a new location
   2a) Send address
   2b) Write the data into the memory
In the meantime, you get some refresh cycles etc. Now, if you take read and write in 1 time unit and "send address" in 0.5 time unit (this gives total 3 time units per one loop) and consider you're not doing sustained read/write, you should see you'll be able to copy at speed of about 133/3 ~= 40Mb/s
If you want to load 3MB kernel at 40Mb/s onto an unaligned address via DMA, the DMA will deploy it via sustained write, that'll be at 10MB/s, therefore in 300ms. But the subsequent copy will take another 600ms.
And now, I need someone to recalculate it. Also, read up here, it is _VERY_ good and it is certainly more accurate than my previous delirious attempt:
http://www.akkadia.org/drepper/cpumemory.pdf
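[Editor's note: a back-of-envelope recalculation under an explicit assumption of a 32-bit-wide bus, which the figures above leave unstated; refresh and latency are ignored.]

	133 MHz x 4 bytes           ~= 532 MB/s peak bandwidth
	naive copy (read + write)   ~= 266 MB/s upper bound
	copying a 3 MB image        ~= 3 / 266 s ~= 11 ms

On those assumptions the extra copy would be small next to the DMA transfer itself.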
How often does this actually happen? How much does it actually slow things down compared to the speed of the NAND chip?
If the user is dumb, always. But if you tell the user how to milk the most of the hardware, he'll be happier.
So, if you use bounce buffers conditionally (based on whether the address is misaligned), there's no impact except to "dumb" users, and for those users they would merely get a performance degradation rather than breakage. How is this "bullshit"?
Correct, but users will complain if they get subpar performance.
I'm hesitant to break something -- even if it's odd (literally in this case) -- that currently works on most hardware, just because one or two drivers can't handle it. It feels kind of like changing the read() and write() system calls to require cacheline alignment. :-P
That's actually almost right, we're doing a bootloader here, it might have limitations. We're not writing yet another operating system with no bounds on possibilities!
We also don't need to bend over backwards to squeeze every last cycle out of the boot process, at the expense of a stable user interface (not to mention requiring the user to know the system's cache line size).
But that's reported in my patch ;-)
And yes, we want to make the boot process as blazingly fast as possible. Imagine you fire a rocket into deep space and it breaks and needs a reboot; will you enjoy the wait? ;-)
-Scott
Best regards, Marek Vasut

On 06/25/2012 08:33 PM, Marek Vasut wrote:
Dear Scott Wood,
On 06/25/2012 06:37 PM, Marek Vasut wrote:
Dear Scott Wood,
On 06/24/2012 07:17 PM, Marek Vasut wrote:
but that involves a lot of copying and therefore degrades performance rapidly. Therefore, disallow unaligned load addresses altogether if the data cache is on.
How about use the bounce buffer only if the address is misaligned?
Not happening, bounce buffer is bullshit,
Hacking up the common frontend with a new limitation because you can't be bothered to fix your drivers is bullshit.
The drivers are not broken, they have hardware limitations.
They're broken because they ignore those limitations.
And checking for those has to be done as early as possible.
Why?
And it's not a new common frontend!
No, it's a compatibility-breaking change to the existing common frontend.
It's like driving a car in the wrong lane. Sure, you can do it, but it'll eventually have some consequences. And using a bounce buffer is like driving a tank in the wrong lane ...
Using a bounce buffer is like parking your car before going into the building, rather than insisting the building's hallways be paved.
The other is obviously faster, more comfortable and lets you carry more stuff at once.
Then you end up needing buildings to be many times as large to give every cubicle an adjacent parking spot, maneuvering room, etc. You'll be breathing fumes all day, and it'll be a lot less comfortable to get even across the hallway without using a car, etc. Communication between coworkers would be limited to horns and obscene gestures. :-)
And if you drive a truck, you can dump a lot of payload instead of carrying it back and forth from the building. That's why there's a special garage for trucks possibly with cargo elevators etc.
Yes, it's called targeted optimization rather than premature optimization.
The corrective action a user has to take is the same as with this patch, except for an additional option of living with the slight performance penalty.
Slight is very weak word here.
Prove me wrong with benchmarks.
Well, copying data back and forth is tremendous overhead. You don't need a benchmark to calculate something like this:
133MHz SDRAM (pumped) gives you what ... 133 Mb/s throughput
You're saying you get only a little more bandwidth from memory than you'd get from a 100 Mb/s ethernet port? Come on. Data buses are not one bit wide.
And how fast can you pull data out of a NAND chip, even with DMA?
(now if it's DDR, dual/quad pumped, that doesn't give you any more advantage
So such things were implemented for fun?
since you have to: send address, read the data, send address, write the data ...
What about bursts? I'm pretty sure you don't have to send the address separately for every single byte.
this is expensive ... without data cache on, even more so)
Why do we care about "without data cache"? You don't need the bounce buffer in that case.
Now consider you do it via a really dumb memcpy; what happens:
It looks like ARM U-Boot has an optimized memcpy.
1) You need to read the data into a register
   1a) Send address
   1b) Read the data into the register
2) You need to write the data to a new location
   2a) Send address
   2b) Write the data into the memory
In the meantime, you get some refresh cycles etc. Now, if you take read and write in 1 time unit and "send address" in 0.5 time unit (this gives total 3 time units per one loop) and consider you're not doing sustained read/write, you should see you'll be able to copy at speed of about 133/3 ~= 40Mb/s
If you want to load a 3MB kernel onto an unaligned address via DMA, the DMA will deploy it via sustained writes at 10MB/s, therefore in 300ms. But the subsequent copy, at the ~40Mb/s computed above, will take another 600ms.
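(Editorial aside, putting numbers on the bus-width objection raised above; the 16-bit bus is our illustrative assumption, not a figure from the thread. At 133 MHz even a 16-bit single-data-rate bus peaks at 133M x 2 B = 266 MB/s; at the pessimistic 3 cycles per word of the model above, a naive copy loop still moves about 266/3 ~= 88 MB/s. At that rate the extra copy of a 3MB image costs roughly 3/88 s ~= 34 ms, not 600 ms.)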
On a p5020ds, using NAND hardware that doesn't do DMA at all, I'm able to load a 3MiB image from NAND in around 300-400 ms. This is with memcpy_fromio() on an uncached hardware buffer.
Again, I'm not saying that bounce buffers are always negligible overhead -- just that I doubt NAND is fast enough that it makes a huge difference in this specific case.
How often does this actually happen? How much does it actually slow things down compared to the speed of the NAND chip?
If the user is dumb, always. But if you tell the user how to get the most out of the hardware, he'll be happier.
So, if you use bounce buffers conditionally (based on whether the address is misaligned), there's no impact except to "dumb" users, and for those users they would merely get a performance degradation rather than breakage. How is this "bullshit"?
Correct, but users will complain if they get subpar performance.
If you expend the minimal effort required to make the bounce buffer usage conditional on the address actually being misaligned, the only users that will see subpar performance are those who would see breakage with your approach. Users will complain about breakage even more than they complain about subpar performance.
If the issue is educating the user to avoid the performance hit (regardless of magnitude), and you care enough, have the driver print a warning (not error) message the first time it needs to use a bounce buffer.
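(Editorial illustration, not code from this thread: a minimal sketch of the conditional scheme Scott describes. read_dma_aligned() is a hypothetical driver callback, and a real version would also have to deal with lengths that are not multiples of the cache line size.)

	static int read_with_bounce(void *dst, size_t len,
				    int (*read_dma_aligned)(void *dst, size_t len))
	{
		static int warned;
		void *bb;
		int ret;

		/* Fast path: the caller's buffer is already DMA-safe. */
		if (!((ulong)dst & (ARCH_DMA_MINALIGN - 1)))
			return read_dma_aligned(dst, len);

		/* Warn once, as suggested, then eat the copy. */
		if (!warned) {
			puts("Warning: unaligned buffer, using bounce buffer; "
			     "align the address to avoid the extra copy\n");
			warned = 1;
		}

		/* Oversize the allocation so the DMA length can be rounded up. */
		bb = memalign(ARCH_DMA_MINALIGN, ALIGN(len, ARCH_DMA_MINALIGN));
		if (!bb)
			return -1;

		ret = read_dma_aligned(bb, len);
		if (!ret)
			memcpy(dst, bb, len);

		free(bb);
		return ret;
	}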
-Scott

Dear Scott Wood,
On 06/25/2012 08:33 PM, Marek Vasut wrote:
Dear Scott Wood,
On 06/25/2012 06:37 PM, Marek Vasut wrote:
Dear Scott Wood,
On 06/24/2012 07:17 PM, Marek Vasut wrote:
but that involves a lot of copying and therefore degrades performance rapidly. Therefore, disallow unaligned load addresses altogether when the data cache is on.
How about use the bounce buffer only if the address is misaligned?
Not happening, bounce buffer is bullshit,
Hacking up the common frontend with a new limitation because you can't be bothered to fix your drivers is bullshit.
The drivers are not broken, they have hardware limitations.
They're broken because they ignore those limitations.
And checking for those has to be done as early as possible.
Why?
Why keep the overhead?
And it's not a new common frontend!
No, it's a compatibility-breaking change to the existing common frontend.
Well, those are corner cases; it's not like people will start hitting it en masse.
I agree it should somehow be platform- or even CPU-specific.
It's like driving a car in the wrong lane. Sure, you can do it, but it'll eventually have some consequences. And using a bounce buffer is like driving a tank in the wrong lane ...
Using a bounce buffer is like parking your car before going into the building, rather than insisting the building's hallways be paved.
The other is obviously faster, more comfortable and lets you carry more stuff at once.
Then you end up needing buildings to be many times as large to give every cubicle an adjacent parking spot, maneuvering room, etc. You'll be breathing fumes all day, and it'll be a lot less comfortable to get even across the hallway without using a car, etc. Communication between coworkers would be limited to horns and obscene gestures. :-)
Ok, this has gone waaay out of hand here :-)
And if you drive a truck, you can dump a lot of payload instead of carrying it back and forth from the building. That's why there's a special garage for trucks possibly with cargo elevators etc.
Yes, it's called targeted optimization rather than premature optimization.
The corrective action a user has to take is the same as with this patch, except for an additional option of living with the slight performance penalty.
Slight is a very weak word here.
Prove me wrong with benchmarks.
Well, copying data back and forth is tremendous overhead. You don't need a benchmark to calculate something like this:
133MHz SDRAM (pumped) gives you what ... 133 Mb/s throughput
You're saying you get only a little more bandwidth from memory than you'd get from a 100 Mb/s ethernet port? Come on. Data buses are not one bit wide.
Good point, it was too late and I forgot to count that in.
And how fast can you pull data out of a NAND chip, even with DMA?
(now if it's DDR, dual/quad pumped, that doesn't give you any more advantage
So such things were implemented for fun?
since you have to: send address, read the data, send address, write the data ...
What about bursts? I'm pretty sure you don't have to send the address separately for every single byte.
If you do it via memcpy? You only have registers; sure, you can optimize it, but on Intel, for example, you don't have many of those.
this is expensive ... without data cache on, even more so)
Why do we care about "without data cache"? You don't need the bounce buffer in that case.
Correct, it's expensive in both cases though.
Now consider you do it via a really dumb memcpy, what happens:
It looks like ARM U-Boot has an optimized memcpy.
1) You need to read the data into a register
1a) Send address
1b) Read the data into the register
2) You need to write the data to a new location
2a) Send address
2b) Write the data into memory
In the meantime, you get some refresh cycles etc. Now, if you count read and write as 1 time unit each and "send address" as 0.5 time unit (3 time units total per loop), and consider you're not doing sustained reads/writes, you'll see you can copy at a rate of about 133/3 ~= 40Mb/s.
If you want to load a 3MB kernel onto an unaligned address via DMA, the DMA will deploy it via sustained writes at 10MB/s, therefore in 300ms. But the subsequent copy, at the ~40Mb/s computed above, will take another 600ms.
On a p5020ds, using NAND hardware that doesn't do DMA at all, I'm able to load a 3MiB image from NAND in around 300-400 ms. This is with memcpy_fromio() on an uncached hardware buffer.
A blazing almost-half-a-second ... but everyone these days wants a faster boot; without the memcpy it can go down to 100ms or even less! And this kind of limitation is not something that'd inconvenience anyone.
Again, I'm not saying that bounce buffers are always negligible overhead -- just that I doubt NAND is fast enough that it makes a huge difference in this specific case.
It does make a difference. Correct, thinking about it -- implementing a generic bounce buffer for cache-impotent hardware might be a better way to go.
How often does this actually happen? How much does it actually slow things down compared to the speed of the NAND chip?
If the user is dumb, always. But if you tell the user how to get the most out of the hardware, he'll be happier.
So, if you use bounce buffers conditionally (based on whether the address is misaligned), there's no impact except to "dumb" users, and for those users they would merely get a performance degradation rather than breakage. How is this "bullshit"?
Correct, but users will complain if they get subpar performance.
If you expend the minimal effort required to make the bounce buffer usage conditional on the address actually being misaligned, the only users that will see subpar performance are those who would see breakage with your approach. Users will complain about breakage even more than they complain about subpar performance.
If the issue is educating the user to avoid the performance hit (regardless of magnitude), and you care enough, have the driver print a warning (not error) message the first time it needs to use a bounce buffer.
Ok, on second thought, you're right. Let's do it the same way we did it with mmc.
-Scott
Best regards, Marek Vasut

On 06/24/2012 05:17 PM, Marek Vasut wrote:
This prevents the scenario where the data cache is on and the device uses DMA to deploy data. In that case, it might not be possible to flush/invalidate the data to RAM properly. The other option is to use a bounce buffer, but that involves a lot of copying and therefore degrades performance rapidly. Therefore, disallow unaligned load addresses altogether when the data cache is on.
Signed-off-by: Marek Vasut marex@denx.de Cc: Scott Wood scottwood@freescale.com
common/cmd_nand.c | 9 +++++++++ 1 file changed, 9 insertions(+)
diff --git a/common/cmd_nand.c b/common/cmd_nand.c index a91ccf4..122a91c 100644 --- a/common/cmd_nand.c +++ b/common/cmd_nand.c @@ -609,6 +609,8 @@ int do_nand(cmd_tbl_t * cmdtp, int flag, int argc, char * const argv[]) goto usage;
addr = (ulong)simple_strtoul(argv[2], NULL, 16);
+	if (!cacheline_aligned(addr))
+		return 1;
You need to check the end address too. Also, I agree with Scott that this is an unjustifiable restriction on cache-coherent architectures. IMO, such checks should be done in platform-specific code where the DMA is being attempted.
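(Editorial sketch of Aneesh's point, modeled on the cacheline_aligned() macro from this series; buffer_aligned() is a hypothetical name, and the transfer length would have to be known up front:)

	/* Test that both the start and the end of the transfer are
	 * cache-line aligned.  Returns non-zero if the buffer is usable. */
	static int buffer_aligned(ulong addr, ulong len)
	{
		if (!dcache_status())
			return 1;	/* caches off, any address will do */

		if ((addr | (addr + len)) & (ARCH_DMA_MINALIGN - 1)) {
			puts("Align start and end of the buffer to "
			     __stringify(ARCH_DMA_MINALIGN)
			     " bytes when using caches!\n");
			return 0;
		}

		return 1;
	}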
best regards, Aneesh

This prevents the scenario where the data cache is on and the device uses DMA to deploy data. In that case, it might not be possible to flush/invalidate the data to RAM properly. The other option is to use a bounce buffer, but that involves a lot of copying and therefore degrades performance rapidly. Therefore, disallow unaligned load addresses altogether when the data cache is on.
Signed-off-by: Marek Vasut marex@denx.de Cc: Joe Hershberger joe.hershberger@gmail.com --- common/cmd_net.c | 4 ++++ 1 file changed, 4 insertions(+)
diff --git a/common/cmd_net.c b/common/cmd_net.c index a9ade8b..917061b 100644 --- a/common/cmd_net.c +++ b/common/cmd_net.c @@ -242,6 +242,10 @@ static int netboot_common(enum proto_t proto, cmd_tbl_t *cmdtp, int argc, bootstage_error(BOOTSTAGE_ID_NET_START); return CMD_RET_USAGE; } + + if (!cacheline_aligned(load_addr)) + return 1; + bootstage_mark(BOOTSTAGE_ID_NET_START);
if ((size = NetLoop(proto)) < 0) {

Hi Marek,
On Sun, Jun 24, 2012 at 7:17 PM, Marek Vasut marex@denx.de wrote:
This prevents the scenario where the data cache is on and the device uses DMA to deploy data. In that case, it might not be possible to flush/invalidate the data to RAM properly. The other option is to use a bounce buffer, but that involves a lot of copying and therefore degrades performance rapidly. Therefore, disallow unaligned load addresses altogether when the data cache is on.
Signed-off-by: Marek Vasut marex@denx.de Cc: Joe Hershberger joe.hershberger@gmail.com
Acked-by: Joe Hershberger joe.hershberger@gmail.com

Dear Joe Hershberger,
Hi Marek,
On Sun, Jun 24, 2012 at 7:17 PM, Marek Vasut marex@denx.de wrote:
This prevents the scenario where the data cache is on and the device uses DMA to deploy data. In that case, it might not be possible to flush/invalidate the data to RAM properly. The other option is to use a bounce buffer, but that involves a lot of copying and therefore degrades performance rapidly. Therefore, disallow unaligned load addresses altogether when the data cache is on.
Signed-off-by: Marek Vasut marex@denx.de Cc: Joe Hershberger joe.hershberger@gmail.com
Acked-by: Joe Hershberger joe.hershberger@gmail.com
NAK, actually ... let's wait until the discussion is settled ;-)
Best regards, Marek Vasut

The asix driver did not align its buffers and therefore did not work with the data cache enabled. Fix this.
Signed-off-by: Marek Vasut marex@denx.de Cc: Joe Hershberger joe.hershberger@gmail.com --- drivers/usb/eth/asix.c | 29 ++++++++++++++++------------- 1 file changed, 16 insertions(+), 13 deletions(-)
diff --git a/drivers/usb/eth/asix.c b/drivers/usb/eth/asix.c index a3bf51a..c27ed51 100644 --- a/drivers/usb/eth/asix.c +++ b/drivers/usb/eth/asix.c @@ -168,27 +168,28 @@ static inline int asix_set_hw_mii(struct ueth_data *dev)
static int asix_mdio_read(struct ueth_data *dev, int phy_id, int loc) { - __le16 res; + ALLOC_CACHE_ALIGN_BUFFER(__le16, res, 1);
asix_set_sw_mii(dev); - asix_read_cmd(dev, AX_CMD_READ_MII_REG, phy_id, (__u16)loc, 2, &res); + asix_read_cmd(dev, AX_CMD_READ_MII_REG, phy_id, (__u16)loc, 2, res); asix_set_hw_mii(dev);
debug("asix_mdio_read() phy_id=0x%02x, loc=0x%02x, returns=0x%04x\n", - phy_id, loc, le16_to_cpu(res)); + phy_id, loc, le16_to_cpu(*res));
- return le16_to_cpu(res); + return le16_to_cpu(*res); }
static void asix_mdio_write(struct ueth_data *dev, int phy_id, int loc, int val) { - __le16 res = cpu_to_le16(val); + ALLOC_CACHE_ALIGN_BUFFER(__le16, res, 1); + *res = cpu_to_le16(val);
debug("asix_mdio_write() phy_id=0x%02x, loc=0x%02x, val=0x%04x\n", phy_id, loc, val); asix_set_sw_mii(dev); - asix_write_cmd(dev, AX_CMD_WRITE_MII_REG, phy_id, (__u16)loc, 2, &res); + asix_write_cmd(dev, AX_CMD_WRITE_MII_REG, phy_id, (__u16)loc, 2, res); asix_set_hw_mii(dev); }
@@ -210,7 +211,8 @@ static int asix_sw_reset(struct ueth_data *dev, u8 flags)
static inline int asix_get_phy_addr(struct ueth_data *dev) { - u8 buf[2]; + ALLOC_CACHE_ALIGN_BUFFER(u8, buf, 2); + int ret = asix_read_cmd(dev, AX_CMD_READ_PHY_ID, 0, 0, 2, buf);
debug("asix_get_phy_addr()\n"); @@ -242,13 +244,14 @@ static int asix_write_medium_mode(struct ueth_data *dev, u16 mode)
static u16 asix_read_rx_ctl(struct ueth_data *dev) { - __le16 v; - int ret = asix_read_cmd(dev, AX_CMD_READ_RX_CTL, 0, 0, 2, &v); + ALLOC_CACHE_ALIGN_BUFFER(__le16, v, 1); + + int ret = asix_read_cmd(dev, AX_CMD_READ_RX_CTL, 0, 0, 2, v);
if (ret < 0) debug("Error reading RX_CTL register: %02x\n", ret); else - ret = le16_to_cpu(v); + ret = le16_to_cpu(*v); return ret; }
@@ -313,7 +316,7 @@ static int mii_nway_restart(struct ueth_data *dev) static int asix_init(struct eth_device *eth, bd_t *bd) { int embd_phy; - unsigned char buf[ETH_ALEN]; + ALLOC_CACHE_ALIGN_BUFFER(unsigned char, buf, ETH_ALEN); u16 rx_ctl; struct ueth_data *dev = (struct ueth_data *)eth->priv; int timeout = 0; @@ -425,7 +428,7 @@ static int asix_send(struct eth_device *eth, void *packet, int length) int err; u32 packet_len; int actual_len; - unsigned char msg[PKTSIZE + sizeof(packet_len)]; + ALLOC_CACHE_ALIGN_BUFFER(unsigned char, msg, PKTSIZE + sizeof(packet_len));
debug("** %s(), len %d\n", __func__, length);
@@ -452,7 +455,7 @@ static int asix_send(struct eth_device *eth, void *packet, int length) static int asix_recv(struct eth_device *eth) { struct ueth_data *dev = (struct ueth_data *)eth->priv; - static unsigned char recv_buf[AX_RX_URB_SIZE]; + ALLOC_CACHE_ALIGN_BUFFER(unsigned char, recv_buf, AX_RX_URB_SIZE); unsigned char *buf_ptr; int err; int actual_len;
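(Editorial note: ALLOC_CACHE_ALIGN_BUFFER declares an on-stack buffer padded and aligned to the DMA alignment, so the driver no longer hands the USB stack addresses that straddle cache lines. A sketch of how such a macro can be built -- not necessarily U-Boot's exact definition:)

	#define ALLOC_CACHE_ALIGN_BUFFER(type, name, size)			\
		/* Raw storage, oversized so an aligned window always fits. */	\
		char __##name[ROUND((size) * sizeof(type), ARCH_DMA_MINALIGN) +	\
			      ARCH_DMA_MINALIGN - 1];				\
		/* The name the caller uses: first aligned address inside. */	\
		type *name = (type *)ALIGN((ulong)__##name, ARCH_DMA_MINALIGN)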

Hi Marex,
On Sun, Jun 24, 2012 at 7:17 PM, Marek Vasut marex@denx.de wrote:
The asix driver did not align its buffers and therefore did not work with the data cache enabled. Fix this.
Signed-off-by: Marek Vasut marex@denx.de Cc: Joe Hershberger joe.hershberger@gmail.com
Acked-by: Joe Hershberger joe.hershberger@gmail.com

Dear Joe Hershberger,
Hi Marex,
On Sun, Jun 24, 2012 at 7:17 PM, Marek Vasut marex@denx.de wrote:
The asix driver did not align its buffers and therefore did not work with the data cache enabled. Fix this.
Signed-off-by: Marek Vasut marex@denx.de Cc: Joe Hershberger joe.hershberger@gmail.com
Acked-by: Joe Hershberger joe.hershberger@gmail.com
Yes, this one is good, can you pick it up please?
Best regards, Marek Vasut

Dear Joe Hershberger,
Hi Marex,
On Sun, Jun 24, 2012 at 7:17 PM, Marek Vasut marex@denx.de wrote:
The asix driver did not align its buffers and therefore did not work with the data cache enabled. Fix this.
Signed-off-by: Marek Vasut marex@denx.de Cc: Joe Hershberger joe.hershberger@gmail.com
Acked-by: Joe Hershberger joe.hershberger@gmail.com
Joe, don't apply this one, I'll send updated one.
Best regards, Marek Vasut

Dear Joe Hershberger,
Hi Marex,
On Sun, Jun 24, 2012 at 7:17 PM, Marek Vasut marex@denx.de wrote:
The asix driver did not align its buffers and therefore did not work with the data cache enabled. Fix this.
Signed-off-by: Marek Vasut marex@denx.de Cc: Joe Hershberger joe.hershberger@gmail.com
Acked-by: Joe Hershberger joe.hershberger@gmail.com
Joe, don't apply this one, I'll send updated one.
Stupid me, this can be safely applied.
Best regards, Marek Vasut

Signed-off-by: Marek Vasut marex@denx.de Cc: Wolfgang Denk wd@denx.de Cc: Stefano Babic sbabic@denx.de Cc: Fabio Estevam festevam@freescale.com --- include/configs/m28evk.h | 2 -- 1 file changed, 2 deletions(-)
diff --git a/include/configs/m28evk.h b/include/configs/m28evk.h index b730b2a..347a3ea 100644 --- a/include/configs/m28evk.h +++ b/include/configs/m28evk.h @@ -37,8 +37,6 @@ #define CONFIG_MACH_TYPE MACH_TYPE_M28EVK
#define CONFIG_SYS_NO_FLASH -#define CONFIG_SYS_ICACHE_OFF -#define CONFIG_SYS_DCACHE_OFF #define CONFIG_BOARD_EARLY_INIT_F #define CONFIG_ARCH_CPU_INIT #define CONFIG_ARCH_MISC_INIT
participants (5): Aneesh V, Joe Hershberger, Marek Vasut, Scott Wood, Tom Rini