[U-Boot] [PATCH 0/5] dcache support for Raspberry Pi 1

This patchset enables dcache support for Raspberry Pi 1. First the cache support code was made similar to existing arm1136 code. The invalidate and flush functions were inprovoed to accept also non-cacheline aligned addresses. This reduces the cacheline size knowledge from generic code. Then rpi mailbox code has now dcache flush for writing the mailbox request and a dcache invalidation for receiving the mailbox answer. Finally the CONFIG_SYS_DCACHE_OFF switch got removed from rpi config.
dcache supprt increases the MMC read performance on RPI 1 from 5,4 MiB/s to 12.3 MiB/s. It doesn't seem to have any affect on RPI 2 though. I just get error messages about non-cacheline aligned address upon invalidation. The performance stucks at 1.2 MiB/s.
This was tested by the following command:
fatload mmc 0:1 ${kernel_addr_r} zImage
Alexander Stein (5): arm1176/cpu: Match cache_flush to arm1136 arm1176/cpu: Add icache and dcache support arm1176/cpu: Align cache flushing addresses to cacheline size arm/mach-bcm283x/mbox: Flush and invalidate dcache when using fw mailbox arm/rpi: Enable dcache
arch/arm/cpu/arm1176/cpu.c | 114 +++++++++++++++++++++++++++++++++++++++++-- arch/arm/mach-bcm283x/mbox.c | 6 +++ include/configs/rpi-common.h | 1 - 3 files changed, 116 insertions(+), 5 deletions(-)

This is effectively the same code but it also does a clean cache before invalidating and doing a memory barrier.
Signed-off-by: Alexander Stein alexanders83@web.de --- arch/arm/cpu/arm1176/cpu.c | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-)
diff --git a/arch/arm/cpu/arm1176/cpu.c b/arch/arm/cpu/arm1176/cpu.c index 2d81651..24b5cc7 100644 --- a/arch/arm/cpu/arm1176/cpu.c +++ b/arch/arm/cpu/arm1176/cpu.c @@ -41,11 +41,13 @@ int cleanup_before_linux (void) return 0; }
-/* flush I/D-cache */ -static void cache_flush (void) +static void cache_flush(void) { + unsigned long i = 0; + /* clean entire data cache */ + asm volatile("mcr p15, 0, %0, c7, c10, 0" : : "r" (i)); /* invalidate both caches and flush btb */ - asm ("mcr p15, 0, %0, c7, c7, 0": :"r" (0)); + asm volatile("mcr p15, 0, %0, c7, c7, 0" : : "r" (i)); /* mem barrier to sync things */ - asm ("mcr p15, 0, %0, c7, c10, 4": :"r" (0)); + asm volatile("mcr p15, 0, %0, c7, c10, 4" : : "r" (i)); }

On 07/04/2015 03:48 AM, Alexander Stein wrote:
This is effectively the same code but it also does a clean cache before invalidating and doing a memory barrier.
Is it possible to share this function with arm1136 rather than cut/pasting it? I guess it's pretty small so not a huge deal, but still.

On Friday 10 July 2015, 23:20:07 wrote Stephen Warren:
On 07/04/2015 03:48 AM, Alexander Stein wrote:
This is effectively the same code but it also does a clean cache before invalidating and doing a memory barrier.
Is it possible to share this function with arm1136 rather than cut/pasting it? I guess it's pretty small so not a huge deal, but still.
I guess this would be possible, but you would cross subdirectoy boundaries for arm1136 and arm1176. Dunno if this is acceptable or even if there are subtle differences between those.
Alexander

The code is copied 1:1 from arm1136 which uses the same cp15 registers.
Signed-off-by: Alexander Stein alexanders83@web.de --- arch/arm/cpu/arm1176/cpu.c | 97 ++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 97 insertions(+)
diff --git a/arch/arm/cpu/arm1176/cpu.c b/arch/arm/cpu/arm1176/cpu.c index 24b5cc7..2ff0e25 100644 --- a/arch/arm/cpu/arm1176/cpu.c +++ b/arch/arm/cpu/arm1176/cpu.c @@ -51,3 +51,100 @@ static void cache_flush(void) /* mem barrier to sync things */ asm volatile("mcr p15, 0, %0, c7, c10, 4" : : "r" (i)); } + +#ifndef CONFIG_SYS_DCACHE_OFF + +#ifndef CONFIG_SYS_CACHELINE_SIZE +#define CONFIG_SYS_CACHELINE_SIZE 32 +#endif + +void invalidate_dcache_all(void) +{ + asm volatile("mcr p15, 0, %0, c7, c6, 0" : : "r" (0)); +} + +void flush_dcache_all(void) +{ + asm volatile("mcr p15, 0, %0, c7, c10, 0" : : "r" (0)); + asm volatile("mcr p15, 0, %0, c7, c10, 4" : : "r" (0)); +} + +static int check_cache_range(unsigned long start, unsigned long stop) +{ + int ok = 1; + + if (start & (CONFIG_SYS_CACHELINE_SIZE - 1)) + ok = 0; + + if (stop & (CONFIG_SYS_CACHELINE_SIZE - 1)) + ok = 0; + + if (!ok) + debug("CACHE: Misaligned operation at range [%08lx, %08lx]\n", + start, stop); + + return ok; +} + +void invalidate_dcache_range(unsigned long start, unsigned long stop) +{ + if (!check_cache_range(start, stop)) + return; + + while (start < stop) { + asm volatile("mcr p15, 0, %0, c7, c6, 1" : : "r" (start)); + start += CONFIG_SYS_CACHELINE_SIZE; + } +} + +void flush_dcache_range(unsigned long start, unsigned long stop) +{ + if (!check_cache_range(start, stop)) + return; + + while (start < stop) { + asm volatile("mcr p15, 0, %0, c7, c14, 1" : : "r" (start)); + start += CONFIG_SYS_CACHELINE_SIZE; + } + + asm volatile("mcr p15, 0, %0, c7, c10, 4" : : "r" (0)); +} + +void flush_cache(unsigned long start, unsigned long size) +{ + flush_dcache_range(start, start + size); +} + +#else /* #ifndef CONFIG_SYS_DCACHE_OFF */ +void invalidate_dcache_all(void) +{ +} + +void flush_dcache_all(void) +{ +} + +void invalidate_dcache_range(unsigned long start, unsigned long stop) +{ +} + +void flush_dcache_range(unsigned long start, unsigned long stop) +{ +} + +void flush_cache(unsigned long start, unsigned long size) +{ +} +#endif /* #ifndef CONFIG_SYS_DCACHE_OFF */ + +#if !defined(CONFIG_SYS_ICACHE_OFF) || !defined(CONFIG_SYS_DCACHE_OFF) +void enable_caches(void) +{ +#ifndef CONFIG_SYS_ICACHE_OFF + icache_enable(); +#endif +#ifndef CONFIG_SYS_DCACHE_OFF + dcache_enable(); +#endif +} +#endif

On 07/04/2015 03:48 AM, Alexander Stein wrote:
The code is copied 1:1 from arm1136 which uses the same cp15 registers.
Same comment here. Perhaps create a cache-armv6.c (or whatever name is appropriate; I'm not sure if ARMv6 mandates caches work this way, or if ARM11 is a better name, or ...?

On Friday 10 July 2015, 23:21:32 wrote Stephen Warren:
On 07/04/2015 03:48 AM, Alexander Stein wrote:
The code is copied 1:1 from arm1136 which uses the same cp15 registers.
Same comment here. Perhaps create a cache-armv6.c (or whatever name is appropriate; I'm not sure if ARMv6 mandates caches work this way, or if ARM11 is a better name, or ...?
I know that cortex-m0 and cortex-m1 are also ARMv6(-M). I have no idea how cache works there or even if there is any. I think ARM11 seems more suitable for me. But where to put? Currently each CPU core has it's own directory.
Alexander

On 07/12/2015 01:26 AM, Alexander Stein wrote:
On Friday 10 July 2015, 23:21:32 wrote Stephen Warren:
On 07/04/2015 03:48 AM, Alexander Stein wrote:
The code is copied 1:1 from arm1136 which uses the same cp15 registers.
Same comment here. Perhaps create a cache-armv6.c (or whatever name is appropriate; I'm not sure if ARMv6 mandates caches work this way, or if ARM11 is a better name, or ...?
I know that cortex-m0 and cortex-m1 are also ARMv6(-M). I have no idea how cache works there or even if there is any. I think ARM11 seems more suitable for me. But where to put? Currently each CPU core has it's own directory.
Perhaps create an arch/arm/cpu/arm11 or arch/arm/cpu/armv6 directory?

cache flushing addresses must be cacheline size aligned, so mask the start and stop address appropriately.
Signed-off-by: Alexander Stein alexanders83@web.de --- arch/arm/cpu/arm1176/cpu.c | 7 +++++++ 1 file changed, 7 insertions(+)
diff --git a/arch/arm/cpu/arm1176/cpu.c b/arch/arm/cpu/arm1176/cpu.c index 2ff0e25..5ac8e79 100644 --- a/arch/arm/cpu/arm1176/cpu.c +++ b/arch/arm/cpu/arm1176/cpu.c @@ -57,6 +57,7 @@ static void cache_flush(void) #ifndef CONFIG_SYS_CACHELINE_SIZE #define CONFIG_SYS_CACHELINE_SIZE 32 #endif +#define CACHLINE_MASK (CONFIG_SYS_CACHELINE_SIZE - 1)
void invalidate_dcache_all(void) { @@ -88,6 +89,9 @@ static int check_cache_range(unsigned long start, unsigned long stop)
void invalidate_dcache_range(unsigned long start, unsigned long stop) { + stop = (stop + CACHLINE_MASK) & ~CACHLINE_MASK; + start &= ~CACHLINE_MASK; + if (!check_cache_range(start, stop)) return;
@@ -99,6 +103,9 @@ void invalidate_dcache_range(unsigned long start, unsigned long stop)
void flush_dcache_range(unsigned long start, unsigned long stop) { + stop = (stop + CACHLINE_MASK) & ~CACHLINE_MASK; + start &= ~CACHLINE_MASK; + if (!check_cache_range(start, stop)) return;

On 07/04/2015 03:48 AM, Alexander Stein wrote:
cache flushing addresses must be cacheline size aligned, so mask the start and stop address appropriately.
As mentioned elsewhere, NAK.

When using dcache the setup data for the mailbox must be actually written into memory before calling into firmware. Thus flush and invalidate the memory.
Signed-off-by: Alexander Stein alexanders83@web.de --- arch/arm/mach-bcm283x/mbox.c | 6 ++++++ 1 file changed, 6 insertions(+)
diff --git a/arch/arm/mach-bcm283x/mbox.c b/arch/arm/mach-bcm283x/mbox.c index 1af9be7..237c76c 100644 --- a/arch/arm/mach-bcm283x/mbox.c +++ b/arch/arm/mach-bcm283x/mbox.c @@ -111,6 +111,12 @@ int bcm2835_mbox_call_prop(u32 chan, struct bcm2835_mbox_hdr *buffer) dump_buf(buffer); #endif
+ flush_dcache_range((unsigned long)buffer, + (unsigned long)((void *)buffer + + buffer->buf_size)); + invalidate_dcache_range((unsigned long)buffer, + (unsigned long)((void *)buffer + + buffer->buf_size)); ret = bcm2835_mbox_call_raw(chan, phys_to_bus((u32)buffer), &rbuffer); if (ret) return ret;

On 07/04/2015 03:48 AM, Alexander Stein wrote:
When using dcache the setup data for the mailbox must be actually written into memory before calling into firmware. Thus flush and invalidate the memory.
diff --git a/arch/arm/mach-bcm283x/mbox.c b/arch/arm/mach-bcm283x/mbox.c
- flush_dcache_range((unsigned long)buffer,
(unsigned long)((void *)buffer +
buffer->buf_size));
- invalidate_dcache_range((unsigned long)buffer,
(unsigned long)((void *)buffer +
ret = bcm2835_mbox_call_raw(chan, phys_to_bus((u32)buffer), &rbuffer); if (ret) return ret;buffer->buf_size));
I'm not sure of the details of ARMv6 memory pre-fetching, so perhaps this doesn't matter. However, I think it'd be best if this code did:
flush cache execute mailbox operation invalidate cache use results
... rather than doing the invalidate before executing the mailbox operation. This change would guarantee that the invalidate happens after the VideoCore has written the response into RAM, and hence guarantees that the invalidate will cause the response to be picked up.

Now that mailbox driver supports cache flush and invalidation, we can enable dcache.
Signed-off-by: Alexander Stein alexanders83@web.de --- include/configs/rpi-common.h | 1 - 1 file changed, 1 deletion(-)
diff --git a/include/configs/rpi-common.h b/include/configs/rpi-common.h index 1012cdd..dd638c4 100644 --- a/include/configs/rpi-common.h +++ b/include/configs/rpi-common.h @@ -14,7 +14,6 @@ #define CONFIG_SYS_GENERIC_BOARD #define CONFIG_BCM2835 #define CONFIG_ARCH_CPU_INIT -#define CONFIG_SYS_DCACHE_OFF
#define CONFIG_SYS_TIMER_RATE 1000000 #define CONFIG_SYS_TIMER_COUNTER \

On 07/04/2015 03:48 AM, Alexander Stein wrote:
Now that mailbox driver supports cache flush and invalidation, we can enable dcache.
Did you also test that HDMI output doesn't show any issues after this change? I think there's already cache-flushing in the LCD console code, so it should be fine, but best to check.

On Friday 10 July 2015, 23:24:58 wrote Stephen Warren:
On 07/04/2015 03:48 AM, Alexander Stein wrote:
Now that mailbox driver supports cache flush and invalidation, we can enable dcache.
Did you also test that HDMI output doesn't show any issues after this change? I think there's already cache-flushing in the LCD console code, so it should be fine, but best to check.
I don't use HDMI on my board, just serial and SSH. But how may this affect that code if it is already flushing caches?
Alexander

On 07/12/2015 01:30 AM, Alexander Stein wrote:
On Friday 10 July 2015, 23:24:58 wrote Stephen Warren:
On 07/04/2015 03:48 AM, Alexander Stein wrote:
Now that mailbox driver supports cache flush and invalidation, we can enable dcache.
Did you also test that HDMI output doesn't show any issues after this change? I think there's already cache-flushing in the LCD console code, so it should be fine, but best to check.
I don't use HDMI on my board, just serial and SSH. But how may this affect that code if it is already flushing caches?
Well, if the cache manipulation code changed, anything that uses it should be tested. If you don't have any way to test, I guess I can test it for you; let me know.

Hello Alexander,
On Sat, 4 Jul 2015 11:48:39 +0200, Alexander Stein alexanders83@web.de wrote:
dcache supprt increases the MMC read performance on RPI 1 from 5,4 MiB/s to 12.3 MiB/s. It doesn't seem to have any affect on RPI 2 though. I just get error messages about non-cacheline aligned address upon invalidation.
Could it be that code needed to support dcache is not the same for rpi_2's bcm2836 than it is for rpi's bcm2835?
Anyway: if code properly handles unaligned addresses then it should not throw an error message about it. Can you look into why the error is thrown?
The performance stucks at 1.2 MiB/s.
This was tested by the following command:
fatload mmc 0:1 ${kernel_addr_r} zImage
Alexander Stein (5): arm1176/cpu: Match cache_flush to arm1136 arm1176/cpu: Add icache and dcache support arm1176/cpu: Align cache flushing addresses to cacheline size arm/mach-bcm283x/mbox: Flush and invalidate dcache when using fw mailbox arm/rpi: Enable dcache
arch/arm/cpu/arm1176/cpu.c | 114 +++++++++++++++++++++++++++++++++++++++++-- arch/arm/mach-bcm283x/mbox.c | 6 +++ include/configs/rpi-common.h | 1 - 3 files changed, 116 insertions(+), 5 deletions(-)
-- 2.4.5
Amicalement,

Hello Albert,
On Monday 06 July 2015, 09:39:40 wrote Albert ARIBAUD:
On Sat, 4 Jul 2015 11:48:39 +0200, Alexander Stein alexanders83@web.de wrote:
dcache supprt increases the MMC read performance on RPI 1 from 5,4 MiB/s to 12.3 MiB/s. It doesn't seem to have any affect on RPI 2 though. I just get error messages about non-cacheline aligned address upon invalidation.
Could it be that code needed to support dcache is not the same for rpi_2's bcm2836 than it is for rpi's bcm2835?
Sure, bcm2835 is a armv6 while bcm2836 is a armv7.
Anyway: if code properly handles unaligned addresses then it should not throw an error message about it. Can you look into why the error is thrown?
Apparently it does not handle non-cacheline aligned addresses transparently or silently.
Here is the part of the code:
static void v7_dcache_inval_range(u32 start, u32 stop, u32 line_len) { /* * If start address is not aligned to cache-line do not * invalidate the first cache-line */ if (start & (line_len - 1)) { printf("ERROR: %s - start address is not aligned - 0x%08x\n", __func__, start); /* move to next cache line */ start = (start + line_len - 1) & ~(line_len - 1); }
I don't know why (a) the cache invalidation is only done from the next cache line and (b) why this can't be done transparently without printing an error. But currently I'm not keen on fiddling with armv7 caches.
Best regards, Alexander

Hello Alexander,
On Mon, 06 Jul 2015 20:24:31 +0200, Alexander Stein alexanders83@web.de wrote:
Hello Albert,
On Monday 06 July 2015, 09:39:40 wrote Albert ARIBAUD:
On Sat, 4 Jul 2015 11:48:39 +0200, Alexander Stein alexanders83@web.de wrote:
dcache supprt increases the MMC read performance on RPI 1 from 5,4 MiB/s to 12.3 MiB/s. It doesn't seem to have any affect on RPI 2 though. I just get error messages about non-cacheline aligned address upon invalidation.
Could it be that code needed to support dcache is not the same for rpi_2's bcm2836 than it is for rpi's bcm2835?
Sure, bcm2835 is a armv6 while bcm2836 is a armv7.
Anyway: if code properly handles unaligned addresses then it should not throw an error message about it. Can you look into why the error is thrown?
Apparently it does not handle non-cacheline aligned addresses transparently or silently.
Here is the part of the code:
static void v7_dcache_inval_range(u32 start, u32 stop, u32 line_len) { /* * If start address is not aligned to cache-line do not * invalidate the first cache-line */ if (start & (line_len - 1)) { printf("ERROR: %s - start address is not aligned - 0x%08x\n", __func__, start); /* move to next cache line */ start = (start + line_len - 1) & ~(line_len - 1); }
I don't know why (a) the cache invalidation is only done from the next cache line and (b) why this can't be done transparently without printing an error. But currently I'm not keen on fiddling with armv7 caches.
Well, I can see why.
Let's assume were invalidating the second half of a cache line because that's where a buffer starts which we want to force-read from external memory because some device fille dthis buffer with important data.
Now, most probably the compiler and linker will have used the addresses before our buffer to map some variables which may be unrelated to the buffer.
At the time we're told to invalidate the buffer, these variables may be modified in-cache but not yet written out to external memory. If we invalidate the first cache line, then we erase these modifications -- they're lost.
Now, this is an unsolvable problem -- we can't flush these variables before invalidating, because then we would flush the whole cache line, which would overwrite and trash the buffer in external memory.
So anyway, we're doomed; there is nothing we can do -- hence the ERROR message. From the on, we can either just give up and go hang(), or we can try to save whatever can be, skip the half cache line and start invalidating at the next boundary.
(same goes for the last address: it has to be at the end of a cache line, or else we can neither invalidate nor flush.)
Best regards, Alexander
Amicalement,

Hello Albert,
On Monday 06 July 2015, 23:26:41 wrote Albert ARIBAUD:
On Mon, 06 Jul 2015 20:24:31 +0200, Alexander Stein alexanders83@web.de wrote:
Hello Albert,
On Monday 06 July 2015, 09:39:40 wrote Albert ARIBAUD:
On Sat, 4 Jul 2015 11:48:39 +0200, Alexander Stein alexanders83@web.de wrote:
dcache supprt increases the MMC read performance on RPI 1 from 5,4 MiB/s to 12.3 MiB/s. It doesn't seem to have any affect on RPI 2 though. I just get error messages about non-cacheline aligned address upon invalidation.
Could it be that code needed to support dcache is not the same for rpi_2's bcm2836 than it is for rpi's bcm2835?
Sure, bcm2835 is a armv6 while bcm2836 is a armv7.
Anyway: if code properly handles unaligned addresses then it should not throw an error message about it. Can you look into why the error is thrown?
Apparently it does not handle non-cacheline aligned addresses transparently or silently.
Here is the part of the code:
static void v7_dcache_inval_range(u32 start, u32 stop, u32 line_len) { /* * If start address is not aligned to cache-line do not * invalidate the first cache-line */ if (start & (line_len - 1)) { printf("ERROR: %s - start address is not aligned - 0x%08x\n", __func__, start); /* move to next cache line */ start = (start + line_len - 1) & ~(line_len - 1); }
I don't know why (a) the cache invalidation is only done from the next cache line and (b) why this can't be done transparently without printing an error. But currently I'm not keen on fiddling with armv7 caches.
Well, I can see why.
Let's assume were invalidating the second half of a cache line because that's where a buffer starts which we want to force-read from external memory because some device fille dthis buffer with important data.
Now, most probably the compiler and linker will have used the addresses before our buffer to map some variables which may be unrelated to the buffer.
At the time we're told to invalidate the buffer, these variables may be modified in-cache but not yet written out to external memory. If we invalidate the first cache line, then we erase these modifications -- they're lost.
Now, this is an unsolvable problem -- we can't flush these variables before invalidating, because then we would flush the whole cache line, which would overwrite and trash the buffer in external memory.
So anyway, we're doomed; there is nothing we can do -- hence the ERROR message. From the on, we can either just give up and go hang(), or we can try to save whatever can be, skip the half cache line and start invalidating at the next boundary.
(same goes for the last address: it has to be at the end of a cache line, or else we can neither invalidate nor flush.)
Thanks for thise detailed explanation. I agree this is really a problem. But how should this behandled: The raspberry pi messagebox handling sends a message which might have more or less arbitrary length. I think it might be possible to achieve in every case that those message starts at the beginning of a cacheline. But the end might be at different positions with different messages sent. You must flush your data to get the firmware actually see this and you must invalidate to eventually read the answer data which is located at the same position. I guess I might just have not hit your described problem in my board (yet).
Best regards, Alexander

Hello Alexander,
On Wed, 08 Jul 2015 20:15:42 +0200, Alexander Stein alexanders83@web.de wrote:
Hello Albert,
On Monday 06 July 2015, 23:26:41 wrote Albert ARIBAUD:
On Mon, 06 Jul 2015 20:24:31 +0200, Alexander Stein alexanders83@web.de wrote:
Hello Albert,
On Monday 06 July 2015, 09:39:40 wrote Albert ARIBAUD:
On Sat, 4 Jul 2015 11:48:39 +0200, Alexander Stein alexanders83@web.de wrote:
dcache supprt increases the MMC read performance on RPI 1 from 5,4 MiB/s to 12.3 MiB/s. It doesn't seem to have any affect on RPI 2 though. I just get error messages about non-cacheline aligned address upon invalidation.
Could it be that code needed to support dcache is not the same for rpi_2's bcm2836 than it is for rpi's bcm2835?
Sure, bcm2835 is a armv6 while bcm2836 is a armv7.
Anyway: if code properly handles unaligned addresses then it should not throw an error message about it. Can you look into why the error is thrown?
Apparently it does not handle non-cacheline aligned addresses transparently or silently.
Here is the part of the code:
static void v7_dcache_inval_range(u32 start, u32 stop, u32 line_len) { /* * If start address is not aligned to cache-line do not * invalidate the first cache-line */ if (start & (line_len - 1)) { printf("ERROR: %s - start address is not aligned - 0x%08x\n", __func__, start); /* move to next cache line */ start = (start + line_len - 1) & ~(line_len - 1); }
I don't know why (a) the cache invalidation is only done from the next cache line and (b) why this can't be done transparently without printing an error. But currently I'm not keen on fiddling with armv7 caches.
Well, I can see why.
Let's assume were invalidating the second half of a cache line because that's where a buffer starts which we want to force-read from external memory because some device fille dthis buffer with important data.
Now, most probably the compiler and linker will have used the addresses before our buffer to map some variables which may be unrelated to the buffer.
At the time we're told to invalidate the buffer, these variables may be modified in-cache but not yet written out to external memory. If we invalidate the first cache line, then we erase these modifications -- they're lost.
Now, this is an unsolvable problem -- we can't flush these variables before invalidating, because then we would flush the whole cache line, which would overwrite and trash the buffer in external memory.
So anyway, we're doomed; there is nothing we can do -- hence the ERROR message. From the on, we can either just give up and go hang(), or we can try to save whatever can be, skip the half cache line and start invalidating at the next boundary.
(same goes for the last address: it has to be at the end of a cache line, or else we can neither invalidate nor flush.)
Thanks for thise detailed explanation. I agree this is really a problem. But how should this behandled: The raspberry pi messagebox handling sends a message which might have more or less arbitrary length. I think it might be possible to achieve in every case that those message starts at the beginning of a cacheline. But the end might be at different positions with different messages sent. You must flush your data to get the firmware actually see this and you must invalidate to eventually read the answer data which is located at the same position. I guess I might just have not hit your described problem in my board (yet).
True, each message might have various and non-cacheline-aligned sizes, but there is always a point in time when the buffer for that message is allocated, and the size of buffer can be greater than the size of the message. Here, we can compute the /buffer/ size by rounding up the /message/ size to a multiple of the cache line.
Then, if we make sure that the buffer starts at a cache line boundary (using the 'align' attribute or calling memalign), it follows that the buffer also ends at a cache line boundary--IOW, the buffer occupies whole cache lines only, and cannot share a cache line with some other, unknown, variable.
Therefore, invalidating (resp. flushing) the buffer will always invalidate (resp. flush) the whole message and nothing else than the message. It will invalidate (resp.flush) a few more bytes, but these are unused so there is no risk.
Note that I've considered the cache line here, but buffers which are copied to/from DDR by some device using DMA might have stricter alignment constraints yet. This just changes what value we should align the start address and size of such buffers to.
Best regards, Alexander
Amicalement,

On 07/04/2015 03:48 AM, Alexander Stein wrote:
This patchset enables dcache support for Raspberry Pi 1. First the cache support code was made similar to existing arm1136 code. The invalidate and flush functions were inprovoed to accept also non-cacheline aligned addresses. This reduces the cacheline size knowledge from generic code.
As mentioned in some other responses, client code must be aware of the cacheline size, so the cache management code shouldn't try to handle misaligned requests, but should error out.
BTW, ALLOC_ALIGN_BUFFER() is probably what you need to fix any misbehaved client code.

On Friday 10 July 2015, 23:17:58 wrote Stephen Warren:
On 07/04/2015 03:48 AM, Alexander Stein wrote:
This patchset enables dcache support for Raspberry Pi 1. First the cache support code was made similar to existing arm1136 code. The invalidate and flush functions were inprovoed to accept also non-cacheline aligned addresses. This reduces the cacheline size knowledge from generic code.
As mentioned in some other responses, client code must be aware of the cacheline size, so the cache management code shouldn't try to handle misaligned requests, but should error out.
BTW, ALLOC_ALIGN_BUFFER() is probably what you need to fix any misbehaved client code.
It turns out this is already used everywhere, but the aligned is only 16 instead of 32. Is there some neat define for the cache size to be used everywhere without defining it in the board config? This would make no sense as according to ARM1176 TRM the cache size is fixed to 32 bytes. So CONFIG_SYS_CACHELINE_SIZE is defined in arch/arm/cpu/arm1176/cpu.c if not defined at board config. I could use ALLOC_CACHE_ALIGN_BUFFER instead, but this would use an (unnecessary) 64 byte alignment, unless CONFIG_SYS_CACHELINE_SIZE is defined.
Alexander

On 07/12/2015 02:10 AM, Alexander Stein wrote:
On Friday 10 July 2015, 23:17:58 wrote Stephen Warren:
On 07/04/2015 03:48 AM, Alexander Stein wrote:
This patchset enables dcache support for Raspberry Pi 1. First the cache support code was made similar to existing arm1136 code. The invalidate and flush functions were inprovoed to accept also non-cacheline aligned addresses. This reduces the cacheline size knowledge from generic code.
As mentioned in some other responses, client code must be aware of the cacheline size, so the cache management code shouldn't try to handle misaligned requests, but should error out.
BTW, ALLOC_ALIGN_BUFFER() is probably what you need to fix any misbehaved client code.
It turns out this is already used everywhere, but the aligned is only 16 instead of 32.
I'm not sure why; is the macro that defines the required alignment defined incorrectly?
Is there some neat define for the cache size to be used everywhere without defining it in the board config? This would make no sense as according to ARM1176 TRM the cache size is fixed to 32 bytes. So CONFIG_SYS_CACHELINE_SIZE is defined in arch/arm/cpu/arm1176/cpu.c if not defined at board config.
There's ARCH_DMA_MINALIGN, which I'd imagine is equal to the cache line size in most cases.
I could use ALLOC_CACHE_ALIGN_BUFFER instead, but this would use an (unnecessary) 64 byte alignment, unless CONFIG_SYS_CACHELINE_SIZE is defined.
Why's that? Is it simply because CONFIG_SYS_CACHELINE_SIZE isn't defined, so it defaults to 64? If so, I'd suggest simply setting CONFIG_SYS_CACHELINE_SIZE.
participants (3)
-
Albert ARIBAUD
-
Alexander Stein
-
Stephen Warren