[U-Boot] [Resend RFC PATCH 0/2] Fix armv8 cache flushing

I have been puzzled by the need to flush the external L3 cache on Freescale Layerscape SoCs. Flushing L3 requires EL3. That is the case now, but it may change in the future. Implementing an SMC call to perform this task is possible, but only worthwhile if necessary. Recent investigation shows we can flush by virtual address most of the time. The only exception is when dcache_disable() is called. I think this can be addressed by flushing the stack U-Boot is using and skipping the L3 flush entirely.
Once this is proven to work, we can drop the L3 flush altogether.
During this investigation, I found the procedure for turning off the d-cache seems wrong. Data is lost if the d-cache is turned off first. I am not sure whether this only happens on Freescale Layerscape SoCs, and I wonder why no one else has reported an issue, especially on SoCs without an L3 cache.
York Sun (2):
  armv8: Fix dcache disable function
  armv8: Fix flush_dcache_all function

 arch/arm/cpu/armv8/cache_v8.c | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

Current code turns off the d-cache first, then flushes all levels of cache. This results in data loss: as soon as the d-cache is off, dirty cache contents are discarded, according to testing on LS2080A. This issue was not seen as long as the external L3 cache was flushed to push the data to main memory. However, the external L3 cache is not guaranteed to have the data. To fix this, flush the d-cache by set/way first to make sure the cache is clean before turning it off.
Signed-off-by: York Sun <york.sun@nxp.com>
CC: David Feng <fenghua@phytium.com.cn>
---
 arch/arm/cpu/armv8/cache_v8.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/arm/cpu/armv8/cache_v8.c b/arch/arm/cpu/armv8/cache_v8.c
index cd3f6c1..92d6277 100644
--- a/arch/arm/cpu/armv8/cache_v8.c
+++ b/arch/arm/cpu/armv8/cache_v8.c
@@ -478,9 +478,9 @@ void dcache_disable(void)
 	if (!(sctlr & CR_C))
 		return;
 
+	flush_dcache_all();
 	set_sctlr(sctlr & ~(CR_C|CR_M));
-	flush_dcache_all();
 	__asm_invalidate_tlb_all();
 }

On 10/14/2016 02:17 PM, York Sun wrote:
[snip]
This one makes sense. I'll try and test it soon.

On 10/14/2016 02:17 PM, York Sun wrote:
[snip]
This patch, Tested-by: Stephen Warren <swarren@nvidia.com>

On 10/14/2016 02:17 PM, York Sun wrote:
[snip]
I talked to Mark Rutland at ARM, and I believe the current code is correct. Here's my interpretation of what he said:
The dcache must be disabled first. This prevents allocation of new entries in the cache during the flush operation, which prevents the race conditions that were mentioned in the other thread.
Then, the flush operation must be invoked. Since the cache is now disabled, this can fully flush the cache without worrying about racing with things being added to the cache.
This all implies that the implementation of dcache_disable(), set_sctlr(), flush_dcache_all(), and any code they call must not access data in DRAM at all: because the dcache is off, any DRAM access will[1] read potentially stale data from DRAM, rather than any dirty data that might be in the cache.
[1] I'm not sure if that's "will" or "may", i.e. whether this is architecturally guaranteed in ARMv8 or is implementation defined. At least the Cortex A72 TRM says "will" for that CPU; not sure about others.
Perhaps the most obvious upshot of this is that the stack can't be used. This implies to me that we need to recode all those functions purely in assembly, or just possibly play some tricks to 100% force gcc not to touch memory anywhere inside dcache_disable() or the functions it calls. We're just getting lucky here right now since everything happens to be inlined, but I don't think we're doing anything to 100% guarantee this.
What worries me here is that at least on Tegra, a "flush the entire dcache" operation requires an SMC call to the secure monitor. That will certainly access DRAM when the secure monitor runs, but perhaps this doesn't matter since that's at a different exception level, and we know the secure monitor accesses DRAM regions that are separate from U-Boot's DRAM? I suspect life isn't that convenient. I'm wondering if this all implies that, like patch 2 in this series, we need to get 100% away from flush-by-set/way, even with SoC-specific hooks to make that work reliably, and just flush everything by VA, which IIRC is architecturally guaranteed to work without SoC-specific logic. That way, we can encapsulate everything into an assembly function without worrying about calling SMCs or SoC-specific hook functions without using DRAM. Of course, how that assembly function knows which VAs to flush without decoding the page tables or other data structure is another matter:-(

On 10/19/2016 09:25 AM, Stephen Warren wrote:
On 10/14/2016 02:17 PM, York Sun wrote:
[snip]
Re: the last paragraph there:
After reading the ARMv8 ARM, I see that EL1, EL2, and EL3 all have separate cache enable bits in SCTLR_ELx. I believe U-Boot only needs to flush/... its own RAM out of the dcache, since that's all that's relevant at the EL U-Boot is running at, and doesn't care about anything EL3 might have mapped cached. So, it's safe to invoke SMCs from the cache flush code in U-Boot even if the EL3 code touches its own DRAM. There might be a corner case where this isn't true if EL3 has some EL1/EL2-owned RAM mapped, but I don't expect that to be the case here.
Re: my other series to add more cache hooks: I'll re-implement the Tegra hook in assembly so it's guaranteed not to touch RAM, retest, and resubmit.
If it turns out that dcache_disable() ever starts touching DRAM at the wrong time, we can deal with that then; it doesn't now at least.

On 10/19/2016 12:18 PM, Stephen Warren wrote:
On 10/19/2016 09:25 AM, Stephen Warren wrote:
On 10/14/2016 02:17 PM, York Sun wrote:
[snip]
Stephen,
I think one of our differences is whether the data is returned correctly when a dirty dcache is disabled. With the current code flow, the dcache is disabled, followed by flushing/invalidating by set/way, then the L3 cache is flushed. I see a correct stack after these steps. During my recent attempt to remove the L3 flush, I added code to flush the stack by VA (which I already confirmed puts the data in main memory) to replace the L3 flush. I found that as soon as the dcache is disabled, the dirty cache data is lost when accessed from the core (debugged with a JTAG tool). This makes me think the order may be wrong. By reversing the order, i.e. flushing the dcache first and then disabling it, I can see correct data in main memory. I understand the proposed new code requires not touching the stack or any data after the first flush. I would prefer not to make that change until we have an explanation of what I saw.
York

On 10/19/2016 04:32 PM, york sun wrote:
On 10/19/2016 12:18 PM, Stephen Warren wrote:
On 10/19/2016 09:25 AM, Stephen Warren wrote:
On 10/14/2016 02:17 PM, York Sun wrote:
[snip]
I think the confusion is: what does "lost" mean?
On ARMv8 at least:
Starting off with dcache enabled and containing some dirty data:
CPU reads will read the dirty data from the cache.
Now if dcache is disabled:
CPU reads will be of the uncached/uncachable type since dcache is off. The dcache will not respond. So, the CPU will receive (potentially) stale data directly from RAM.
In this state, it is not possible to access the dirty data in the caches. However, it's not (permanently) lost...
Now if the dcache is fully flushed:
All dirty data that was in the dcache will be written to RAM. After this process is complete, any reads from RAM will return the most up-to-date possible data; the temporarily "lost" dirty data is now no longer lost.
The cache disable process must proceed as above; we can't flush the dcache and then disable it. Attempting to do things in that order would introduce race conditions. At all times before the dcache is disabled, any CPU writes to memory will allocate new lines in the dcache. Also, the dcache is fully responsive to the cache coherency protocol, so some other agent on the bus (another CPU, cluster, DMA agent, ...) could cause a (clean or dirty) line to be brought into the cache. This might happen after that part of the cache would have been flushed. In other words, it's impossible to fully flush the cache (by set/way) while it's enabled.
I'm not sure that flushing by VA instead of by set/way solves anything much here. I believe flushing by VA runs less risk of interference due to the coherency protocol bringing lines back into the cache, since flush by VA will affect other caches? However this still doesn't allow use of the CPU's own stack or other DRAM writes; if we:
a) Flush by VA (this step is assumed to perform some function calls/returns, or other DRAM writes)
b) Disable cache
... then irrespective of how step (a) was performed, the DRAM writes performed in that step are likely still in the cache as dirty data, so step (b) will cause them to be "lost" until a full flush operation is performed. Thus we must always flush after disabling the cache, so there's no point also flushing before.

On 10/19/2016 06:01 PM, Stephen Warren wrote:
On 10/19/2016 04:32 PM, york sun wrote:
On 10/19/2016 12:18 PM, Stephen Warren wrote:
On 10/19/2016 09:25 AM, Stephen Warren wrote:
On 10/14/2016 02:17 PM, York Sun wrote:
[snip]
Stephen,
Sorry for the late response; I am still traveling... I understand the data in a dirty cache is not lost when the dcache is disabled; it is just not accessible. In my case, after flushing L1/L2 by set/way, the data is probably in the L3 cache. Without flushing L3, I have stale data (including the stack) in main memory.
My previous test was trying to prove I can skip flushing L3 if I flush the necessary VAs. Now when I recall it carefully, I think I made a mistake by flushing by VA _after_ flushing the cache by set/way. I might have had a positive result if I had flushed the cache by VA first. I will repeat this test when I get back to prove this theory.
That being said, do you or anyone else see any value in skipping the L3 flush? My goal was to avoid making an SMC call from EL2/EL1, and to avoid implementing an L3 flush function in U-Boot (even though we already have one for CCN-504). I understand this is different from what dcache_disable() implies.
York

On 10/19/2016 11:06 PM, york sun wrote:
On 10/19/2016 06:01 PM, Stephen Warren wrote:
On 10/19/2016 04:32 PM, york sun wrote:
On 10/19/2016 12:18 PM, Stephen Warren wrote:
On 10/19/2016 09:25 AM, Stephen Warren wrote:
On 10/14/2016 02:17 PM, York Sun wrote:
Current code turns off d-cache first, then flush all levels of cache. This results data loss. As soon as d-cache is off, the dirty cache is discarded according to the test on LS2080A. This issue was not seen as long as external L3 cache was flushed to push the data to main memory. However, external L3 cache is not guaranteed to have the data. To fix this, flush the d-cache by way/set first to make sure cache is clean before turning it off.
diff --git a/arch/arm/cpu/armv8/cache_v8.c b/arch/arm/cpu/armv8/cache_v8.c
@@ -478,9 +478,9 @@ void dcache_disable(void)
- flush_dcache_all(); set_sctlr(sctlr & ~(CR_C|CR_M));
- flush_dcache_all(); __asm_invalidate_tlb_all();
I talked to Mark Rutland at ARM, and I believe the current code is correct. Here's my interpretation of what he said:
The dcache must be disabled first. This prevents allocation of new entries in the cache during the flush operation, which prevents the race conditions that were mentioned in the other thread.
Then, the flush operation must be invoked. Since the cache is now disabled, this can fully flush the cache without worrying about racing with things being added to the cache.
This all implies that the implementation of dcache_disable(), set_sctlr(), flush_dcache_all(), and any code they call must not access data in DRAM at all: because the dcache is off, any DRAM access will[1] read potentially stale data from DRAM, rather than any dirty data that might be in the cache.
[1] I'm not sure if that's "will" or "may", i.e. whether this is architecturally guaranteed in ARMv8 or is implementation defined. At least the Cortex A72 TRM says "will" for that CPU; not sure about others.
Perhaps the most obvious upshot of this is that the stack can't be used. This implies to me that we need to recode all those functions purely in assembly, or just possibly play some tricks to 100% force gcc not to touch memory anywhere inside dcache_disable() or the functions it calls. We're just getting lucky here right now since everything happens to be inlined, but I don't think we're doing anything to 100% guarantee this.
What worries me here is that at least on Tegra, a "flush the entire dcache" operation requires an SMC call to the secure monitor. That will certainly access DRAM when the secure monitor runs, but perhaps this doesn't matter since that's at a different exception level, and we know the secure monitor accesses DRAM regions that are separate from U-Boot's DRAM? I suspect life isn't that convenient. I'm wondering if this all implies that, like patch 2 in this series, we need to get 100% away from flush-by-set/way, even with SoC-specific hooks to make that work reliably, and just flush everything by VA, which IIRC is architecturally guaranteed to work without SoC-specific logic. That way, we can encapsulate everything into an assembly function without worrying about calling SMCs or SoC-specific hook functions without using DRAM. Of course, how that assembly function knows which VAs to flush without decoding the page tables or other data structure is another matter:-(
Re: the last paragraph there:
After reading the ARMv8 ARM, I see that EL1, EL2, and EL3 all have separate cache enable bits in SCTLR_ELx. I believe U-Boot only needs to flush/... its own RAM out of the dcache, since that's all that's relevant at the EL U-Boot is running at, and doesn't care about anything EL3 might have mapped cached. So, it's safe to invoke SMCs from the cache flush code in U-Boot even if the EL3 code touches its own DRAM. There might be a corner case where this isn't true if EL3 has some EL1/EL2-owned RAM mapped, but I don't expect that to be the case here.
Re: my other series to add more cache hooks: I'll re-implement the Tegra hook in assembly so it's guaranteed not to touch RAM, retest, and resubmit.
If it turns out that dcache_disable() ever starts touching DRAM at the wrong time, we can deal with that then; it doesn't now at least.
Stephen,
I think one of our differences is whether the data is returned correctly with a dirty dcache being disabled. With the current code flow, the dcache is disabled, followed by flushing/invalidating by way/set, and then the L3 cache is flushed. I see a correct stack after these steps. During my recent attempt to remove the L3 flush, I added code to flush the stack by VA (which I already confirmed puts the data in main memory) to replace flushing L3. I found that as soon as the dcache is disabled, the dirty cache data is lost when accessed from the core (debugged with a JTAG tool). This makes me think the order may be wrong. By reversing the order, i.e. flushing the dcache first and then disabling it, I can see correct data in main memory. I understand the proposed new code requires not touching the stack or any data after the first flush. I prefer not to make this change if we have an explanation of what I saw.
I think the confusion is: what does "lost" mean?
On ARMv8 at least:
Starting off with dcache enabled and containing some dirty data:
CPU reads will read the dirty data from the cache.
Now if dcache is disabled:
CPU reads will be of the uncached/uncachable type since dcache is off. The dcache will not respond. So, the CPU will receive (potentially) stale data directly from RAM.
In this state, it is not possible to access the dirty data in the caches. However, it's not (permanently) lost...
Now if the dcache is fully flushed:
All dirty data that was in the dcache will be written to RAM. After this process is complete, any reads from RAM will return the most up-to-date possible data; the temporarily "lost" dirty data is now no longer lost.
The cache disable process must proceed as above; we can't flush the dcache and then disable it. Attempting to do things in that order would introduce race conditions. At all times before the dcache is disabled, any CPU writes to memory will allocate new lines in the dcache. Also, the dcache is fully responsive to the cache coherency protocol, so some other agent on the bus (another CPU, cluster, DMA agent, ...) could cause a (clean or dirty) line to be brought into the cache. This might happen after that part of the cache would have been flushed. In other words, it's impossible to fully flush the cache (by set/way) while it's enabled.
I'm not sure that flushing by VA instead of by set/way solves anything much here. I believe flushing by VA runs less risk of interference due to the coherency protocol bringing lines back into the cache, since flush by VA will affect other caches? However this still doesn't allow use of the CPU's own stack or other DRAM writes; if we:
a) Flush by VA (this step is assumed to perform some function calls/returns, or other DRAM writes)
b) Disable cache
... then irrespective of how step (a) was performed, the DRAM writes from that step are likely still in the cache as dirty data, so step (b) will cause them to be "lost" until a full flush operation is performed. Thus we must always flush after disabling the cache, so there's no point also flushing before.
Stephen,
Sorry for the late response. I am still traveling...
I didn't think your response was slow:-)
I understand the data in the dirty cache is not lost when the dcache is disabled. It is just not accessible. In my case, after flushing L1/L2 by way/set, the data is probably in the L3 cache. Without flushing L3, I have stalled data (including the stack) in main memory.
I assume "stale" not "stalled".
Yes, I agree: If you have only flushed L1/L2 by set/way (or attempted to do everything, but the L3 cache doesn't implement by set/way operations), then L3 can indeed still contain dirty data, and hence main memory can be stale.
My previous test was trying to prove I can skip flushing L3 if I flush the necessary VA.
It depends on whether your L3 cache is before/after the Level of Unification and/or the Level of Coherency for your HW, and whether you flush by VA to the PoU or PoC.
Looking at your "[PATCH 2/2] armv8: Fix flush_dcache_all function", it uses __asm_flush_dcache_range() which cleans and invalidates to the point of coherency (it invokes the dc civac instruction).
Now when I recall it carefully, I think I made a mistake by flushing by VA _after_ flushing the cache by way/set. I might have a positive result if I flushed the cache by VA first. I will repeat this test when I get back to prove this theory.
I'll assume your L3 cache is before the Point of Coherency. If so, then performing a clean/invalidate by VA to the PoC will affect all of the L1, L2, and L3 caches. As such, you need only perform the c/i by VA, and you can entirely skip the c/i by set/way; I believe it would be redundant, assuming that the by-VA operations cover all relevant VAs.
That being said, do you or anyone see any value by skipping flushing L3? My target was to avoid making a SMC call for EL2/EL1, and to avoid implementing flushing L3 function in U-Boot (even we already have one for CCN-504). I understand this is different from what dcache_disable() implies.
I /think/ that if we can modify U-Boot to do all cache-wide operations by VA to the PoC, then we can indeed do that instead of any by-set/way operations, and in turn that will mean U-Boot doesn't need to invoke any SMC calls or SoC-specific logic when enabling/disabling the cache. My remaining questions would be:
a) How can we determine the set of VA ranges we need to operate on?
In the case where U-Boot is invalidating the dcache prior to enabling it, I think the set of VA ranges to operate on is determined by the MMU configuration of the SW that ran before U-Boot, since that defines what dirty data might be present in the dcache. U-Boot won't be able to determine that.
Perhaps we only care about the set of VAs that U-Boot will actually enable in its own MMU configuration, since those will be the only VAs the CPU will be allowed to architecturally access. So, this might actually be a solvable problem.
Perhaps U-Boot requires that the previous SW has already cleaned all relevant parts of the dcache (the fact U-Boot attempts to invalidate the entire dcache before enabling it implies this is not the case though).
In the case where U-Boot has already programmed its own MMU configuration, then that MMU configuration can determine the set of VAs to operate on. This case is easy (easier) since we definitely have all the required configuration knowledge.
b) How can we implement the by VA code in a way that doesn't touch DRAM?
Implementing by-set/way is fairly constrained in that all the configuration data is in a few registers, and the algorithm just requires a few nested loops.
A generic by VA implementation seems like it would require walking U-Boot's struct mm_region *mem_map data structure (or the CPU's translation tables) in RAM. Perhaps that's OK since it's read-only...

On 10/20/2016 01:34 PM, Stephen Warren wrote:
I'll assume your L3 cache is before the Point of Coherency. If so, then performing a clean/invalidate by VA to the PoC will affect all of the L1, L2, and L3 caches. As such, you need only perform the c/i by VA, and you can entirely skip the c/i by set/way; I believe it would be redundant, assuming that the by-VA operations cover all relevant VAs.
I believe the PoC and PoU are before L3 for me. I can clean/invalidate by VA, but it may not cover all the cache lines. So by set/way is still needed.
That being said, do you or anyone else see any value in skipping the L3 flush? My target was to avoid making an SMC call for EL2/EL1, and to avoid implementing an L3 flushing function in U-Boot (even though we already have one for CCN-504). I understand this is different from what dcache_disable() implies.
I /think/ that if we can modify U-Boot to do all cache-wide operations by VA to the PoC, then we can indeed do that instead of any by-set/way operations, and in turn that will mean U-Boot doesn't need to invoke any SMC calls or SoC-specific logic when enabling/disabling the cache. My remaining questions would be:
a) How can we determine the set of VA ranges we need to operate on?
b) How can we implement the by VA code in a way that doesn't touch DRAM?
I agree in general with your points, but it may not always be practical to flush everything by VA. U-Boot may map a huge amount of memory. Walking through the MMU table and flushing it all would be too much. Without flushing all memory, we really cannot say we flushed the entire dcache. On the other hand, for U-Boot itself to operate, we don't really have to flush everything. I guess the key is whether we need to flush all of it. For Linux to boot, we don't. We can flush some memory (including the U-Boot stack) and turn off the MMU. As soon as the kernel boots up, it enables the dcache and everything is back. One can make the argument that if users put something in memory and it is needed before the dcache is re-enabled, then it makes sense to flush the concerned memory, unless it is not practical to do so.
At this moment, I can drop these two patches. I will still test it as I said earlier.
York

On Fri, Oct 21, 2016 at 07:31:52PM +0000, york sun wrote:
I believe the PoC and PoU are before L3 for me.
If you are using CCN, then the PoC is beyond the L3.
Were the PoC before the L3, there would be no requirement to perform maintenance on the L3. The PoC is the point at which *all* accesses (cacheable or otherwise) see the same data.
Per the ARM ARM (for ARMv8-A), maintenance to the PoC *must* affect system caches (including CCN).
I can clean/invalidate by VA, but it may not cover all the cache lines. So by set/way is still needed.
The problem is figuring out which VA ranges require maintenance.
Do we not have an idea of the set of memory banks present in the SoC? Like the memblock array in Linux?
A generic by VA implementation seems like it would require walking U-Boot's struct mm_region *mem_map data structure (or the CPU's translation tables) in RAM. Perhaps that's OK since it's read-only...
So long as you clean the structure by VA to the PoC first, you can safely access it with non-cacheable accesses.
I agree in general with your points, but it may not always be practical to flush everything by VA. U-Boot may map a huge amount of memory. Walking through the MMU table and flushing it all would be too much.
I would recommend that you benchmark that; from my own experiments, so long as you only perform maintenance on the portions of the PA space you care about (and amortize barriers), this can take surprisingly little time.
Without flushing all memory, we really cannot say we flushed the entire dcache. On the other hand, for U-Boot itself to operate, we don't really have to flush everything. I guess the key is whether we need to flush all of it. For Linux to boot, we don't.
This depends; so long as you've *only* used Normal, Inner-Shareable, Inner Write-Back, Outer Write-Back, you could omit some maintenance, but you still need to clean the Linux image to the PoC.
Any memory mapped with other attributes *must* be invalidated (and perhaps clean+invalidated) from the caches.
We can flush some memory (including the U-Boot stack) and turn off the MMU. As soon as the kernel boots up, it enables the dcache and everything is back.
If you have used memory attributes inconsistent with Linux, things will end badly from here, due to potential loss of coherency resulting from mismatched memory attributes.
Thanks, Mark.

On 10/24/2016 04:59 AM, Mark Rutland wrote:
The problem is figuring out which VA ranges require maintenance.
Do we not have an idea of the set of memory banks present in the SoC? Like the memblock array in Linux?
There are two data structures in ARM U-Boot that describe memory layout:
1) A list of RAM memory regions. U-Boot uses these to know where to relocate itself to (it relocates itself to the top of RAM at boot), and to fill in the /memory node when booting an OS using DT (and other equivalent mechanisms when not using DT.)
2) For AArch64 specifically, there's a memory map table that defines all RAM and MMIO regions, and the translation table attributes required for those regions.
Judging by your comments later in the original message, it sounds like it'd be fine to read from these structures during any dcache clean routine provided the table has already been cleaned. That makes using the tables much more tractable:-)

On 10/26/2016 12:47 PM, Stephen Warren wrote:
Judging by your comments later in the original message, it sounds like it'd be fine to read from these structures during any dcache clean routine provided the table has already been cleaned. That makes using the tables much more tractable:-)
I think we need to benchmark walking through the MMU tables. They can map a huge amount of memory; for our case, it is more than 16GB. I have been reluctant to do so because of the size. I am back now, testing a revert of _this_ patch, hoping to confirm what I learned from this discussion. After that, I will see how long it takes to flush all cached addresses by VA.
York

On 10/26/2016 01:54 PM, york sun wrote:
I think we need to benchmark walking through the MMU tables. They can map a huge amount of memory; for our case, it is more than 16GB. I have been reluctant to do so because of the size. I am back now, testing a revert of _this_ patch, hoping to confirm what I learned from this discussion. After that, I will see how long it takes to flush all cached addresses by VA.
Given the existence of the two data structures I mentioned above, I don't think we'd ever need to walk the MMU translation tables (assuming you mean the data structures U-Boot creates and configures into the CPU's MMU), rather than just walking over the tiny number of entries in those data structures?

On 10/26/2016 01:12 PM, Stephen Warren wrote:
Given the existence of the two data structures I mentioned above, I don't think we'd ever need to walk the MMU translation tables (assuming you mean the data structures U-Boot creates and configures into the CPU's MMU), rather than just walking over the tiny number of entries in those data structures?
The list of RAM used by U-Boot is different from the total memory. U-Boot relocates itself to gd->ram_top. We may have memory in a second region, or even a third region.
We also have cached regions other than RAM. I know that for Layerscape SoCs the QMan portal has 64MB of cached memory. If we consider flushing _all_ addresses, these should be included.
I think the 2nd data structure you mentioned is the MMU table for aarch64, isn't it?
York

On 10/26/2016 02:29 PM, york sun wrote:
The list of RAM used by U-Boot is different from the total memory. U-Boot relocates itself to the gd->ram_top. We may have memory in second region, or even the third region.
We also have other cached region other than RAM. I know for Layerscape SoCs, there is QMan portal has 64MB cached. If we consider to flush _all_ address, they should be included.
I think the 2nd data structure you mentioned is the MMU table for aarch64, isn't it?
The 2nd data structure I mentioned is what U-Boot uses to create the translation tables, but isn't the translation table itself.

On 10/26/2016 02:00 PM, Stephen Warren wrote:
The 2nd data structure I mentioned is what U-Boot uses to create the translation tables, but isn't the translation table itself.
I see. You were referring to struct mm_region. It does simplify walking the addresses. Your presumption is that we don't update the MMU table after creating it. That is mostly true, but the current MMU code allows updating.
York

Hi,
Sorry for joining this a bit late; apologies if the below re-treads ground already covered.
On Wed, Oct 19, 2016 at 09:25:02AM -0600, Stephen Warren wrote:
On 10/14/2016 02:17 PM, York Sun wrote:
Current code turns off the d-cache first, then flushes all levels of cache. This results in data loss: as soon as the d-cache is off, dirty cache lines are discarded, according to testing on LS2080A. This issue was not seen as long as the external L3 cache was flushed to push the data to main memory. However, the external L3 cache is not guaranteed to have the data. To fix this, flush the d-cache by set/way first to make sure the cache is clean before turning it off.
diff --git a/arch/arm/cpu/armv8/cache_v8.c b/arch/arm/cpu/armv8/cache_v8.c
@@ -478,9 +478,9 @@ void dcache_disable(void)
 	if (!(sctlr & CR_C))
 		return;

+	flush_dcache_all();
 	set_sctlr(sctlr & ~(CR_C|CR_M));

-	flush_dcache_all();
 	__asm_invalidate_tlb_all();
 }
I talked to Mark Rutland at ARM, and I believe the current code is correct.
Well, almost, but not quite. It's a long story... ;)
I gave a primer [1,2] on the details at ELC earlier this year, which may or may not be useful.
The big details are:
* Generally, "flush" is ambiguous/meaningless. Here you seem to want clean+invalidate.
* Set/Way operations are for IMPLEMENTATION DEFINED (i.e. SoC-specific) cache maintenance sequences, and are not truly portable (e.g. not affecting system caches).
I assume that an earlier boot stage initialised the caches prior to U-Boot. Given that, you *only* need to perform maintenance for the memory you have (at any point) mapped with cacheable attributes, which should be a small subset of the PA space. With ARMv8-A, broadcast maintenance to the PoC should affect all relevant caches (assuming you use the correct shareability attributes).
* You *cannot* write a dcache disable routine in C, as the compiler can perform a number of implicit memory accesses (e.g. stack, globals, GOT). For that alone, I do not believe the code above is correct.
Note that we have seen this being an issue in practice, before we got rid of Set/Way ops from arm64 Linux (see commit 5e051531447259e5).
* Your dcache disable code *must* be clean to the PoC, prior to execution, or instruction fetches could see stale data. You can first *clean* this to the PoC, which is sufficient to avoid the problems above.
* The SCTLR_ELx.{C,I} bits do not enable/disable caches; they merely activate/deactivate cacheable attributes on data/instruction fetches.
Note that cacheable instruction fetches can allocate into unified/data caches.
Also, note that the I bit is independent of the C bit, and the attributes it provides differ when the M bit is clear. Generally, I would advise that at all times M == C == I, as that leads to the least surprise.
Thanks, Mark.
[1] http://events.linuxfoundation.org/sites/events/files/slides/slides_17.pdf [2] https://www.youtube.com/watch?v=F0SlIMHRnLk

On 10/24/2016 04:44 AM, Mark Rutland wrote:
I talked to Mark Rutland at ARM, and I believe the current code is correct.
Well, almost, but not quite. It's a long story... ;)
To be clear, I was referring specifically/only to the relative ordering of the call to flush_dcache_all() and the SCTLR register modification, rather than wider aspects of the code:-)
I gave a primer [1,2] on the details at ELC earlier this year, which may or may not be useful.
The big details are:
- Generally, "flush" is ambiguous/meaningless. Here you seem to want clean+invalidate.
Yes. I think the naming of this code is driven by U-Boot's cross-architecture compatibility and history, hence why it doesn't use the expected ARM terminology.
Set/Way operations are for IMPLEMENTATION DEFINED (i.e. SoC-specific) cache maintenance sequences, and are not truly portable (e.g. not affecting system caches).
I assume that an earlier boot stage initialised the caches prior to U-Boot. Given that, you *only* need to perform maintenance for the memory you have (at any point) mapped with cacheable attributes, which should be a small subset of the PA space. With ARMv8-A, broadcast maintenance to the PoC should affect all relevant caches (assuming you use the correct shareability attributes).
There is another thread (or fork of this thread) where York suggests replacing this code with operations by VA instead.
You *cannot* write a dcache disable routine in C, as the compiler can perform a number of implicit memory accesses (e.g. stack, globals, GOT). For that alone, I do not believe the code above is correct.
Note that we have seen this being an issue in practice, before we got rid of Set/Way ops from arm64 Linux (see commit 5e051531447259e5).
I agree. In practice, at least with the compiler I happen to be using, we are currently getting lucky and there aren't any RAM accesses emitted by the compiler in this part of the code. Admittedly that won't hold true for everyone or across all of time though, so this should certainly be fixed...
- Your dcache disable code *must* be clean to the PoC, prior to execution, or instruction fetches could see stale data. You can first *clean* this to the PoC, which is sufficient to avoid the problems above.
I'm not sure if we have any documented rules/assumptions regarding cache/... state upon U-Boot entry like the Linux kernel does. It would probably be a good idea to do so...
I believe that for this particular piece of code, since it's always executed late in U-Boot's operation, the requirement is covered by U-Boot's relocation routine, since that memcpy()s .text/.data to the top of usable RAM, and hence hopefully already fully cleans the dcache and invalidates the icache afterwards.
The SCTLR_ELx.{C,I} bits do not enable/disable caches; they merely activate/deactivate cacheable attributes on data/instruction fetches.
Note that cacheable instruction fetches can allocate into unified/data caches.
Also, note that the I bit is independent of the C bit, and the attributes it provides differ when the M bit is clear. Generally, I would advise that at all times M == C == I, as that leads to the least surprise.
I think M==C==I is generally true in U-Boot, except for a limited time right before it jumps to the OS, where icache and dcache "are disabled"[1] separately, but one right after the other.
[1] to abuse the terminology that you pointed out was incorrect:-)
Thanks, Mark.
[1] http://events.linuxfoundation.org/sites/events/files/slides/slides_17.pdf [2] https://www.youtube.com/watch?v=F0SlIMHRnLk

Previously it was believed the L3 cache had to be flushed in order to guarantee data integrity in main memory. However, flushing the L3 cache may require EL3, depending on the SoC implementation. Flushing by virtual address can also push data into main memory; the trick is to find the correct address range. For U-Boot to function correctly, the stack needs to be flushed, up to the top of RAM used by U-Boot.
Signed-off-by: York Sun york.sun@nxp.com
--- Stephen Warren, you recently added L3 cache flushing for Tegra (8e5d804). Can you check if your board still works with this proposed change?
arch/arm/cpu/armv8/cache_v8.c | 11 +++++------ 1 file changed, 5 insertions(+), 6 deletions(-)
diff --git a/arch/arm/cpu/armv8/cache_v8.c b/arch/arm/cpu/armv8/cache_v8.c
index 92d6277..f5494f8 100644
--- a/arch/arm/cpu/armv8/cache_v8.c
+++ b/arch/arm/cpu/armv8/cache_v8.c
@@ -430,14 +430,13 @@ void invalidate_dcache_all(void)
  */
 inline void flush_dcache_all(void)
 {
-	int ret;
+	ulong sp;

+	asm("mov %0, sp" : "=r"(sp) : );
+	/* Flush stack to the top of ram */
+	__asm_flush_dcache_range(sp, gd->ram_top);
+	/* Flush cache by way/set */
 	__asm_flush_dcache_all();
-	ret = __asm_flush_l3_cache();
-	if (ret)
-		debug("flushing dcache returns 0x%x\n", ret);
-	else
-		debug("flushing dcache successfully.\n");
 }

 /*

On 10/14/2016 02:17 PM, York Sun wrote:
Previously it was believed the L3 cache had to be flushed in order to guarantee data integrity in main memory. However, flushing the L3 cache may require EL3, depending on the SoC implementation. Flushing by virtual address can also push data into main memory; the trick is to find the correct address range. For U-Boot to function correctly, the stack needs to be flushed, up to the top of RAM used by U-Boot.
Signed-off-by: York Sun york.sun@nxp.com
Stephen Warren, your recently added flushing L3 cache for tegra (8e5d804). Can you check if your board still works with this proposed change?
I'm not convinced this is safe. If we do flush by VA rather than any other mechanism, we should at least iterate over the entire page table and find all regions marked cacheable, rather than assuming the delta between stack pointer and ram_top is the only relevant region. That said, I'm not sure even doing that will completely guarantee that the cache is fully clean; I think by-set/-way is the only way. I think it's better to leave this part as-is.
BTW, in the very near future (I've been meaning to do this for the last couple of days), I expect to send a patch which adds SoC-/CPU-/system-specific hooks to all 3 of invalidate_dcache_all(), flush_dcache_all(), and invalidate_icache_all(). That patch will rename __asm_flush_l3_cache() for consistency with the other hooks, and remove "l3" from the function name because the function isn't only necessary for L3 flushing, but can also be required by other aspects of multi-cluster systems. As such, the following definitely won't work for Tegra:
- ret = __asm_flush_l3_cache();

On 10/14/2016 01:30 PM, Stephen Warren wrote:
I'm not convinced this is safe. If we do flush by VA rather than any other mechanism, we should at least iterate over the entire page table and find all regions marked cacheable, rather than assuming the delta between stack pointer and ram_top is the only relevant region. That said, I'm not sure even doing that will completely guarantee that the cache is fully clean; I think by-set/-way is the only way. I think it's better to leave this part as-is.
I agree this is not the same as flush_all. I had an internal discussion with our people, and I acknowledge we have other memory that potentially needs a flush. There are many possibilities where a user puts something in memory and expects the data to persist when the OS comes up. My target is to eliminate flushing the external cache.
Consider how we boot Linux. The kernel image is flushed by VA, then the d-cache is turned off, then the MMU. When the kernel comes up, it enables the d-cache. As long as the data is not accessed before that, we don't have to flush it in U-Boot. I understand this is not the whole picture; sending out this RFC patch was meant to start this discussion.
BTW, in the very near future (I've been meaning to do this for the last couple of days), I expect to send a patch which adds SoC-/CPU-/system-specific hooks to all 3 of invalidate_dcache_all(), flush_dcache_all(), and invalidate_icache_all(). That patch will rename __asm_flush_l3_cache() for consistency with the other hooks, and remove "l3" from the function name because the function isn't only necessary for L3 flushing, but can also be required by other aspects of multi-cluster systems. As such, the following definitely won't work for Tegra:
- ret = __asm_flush_l3_cache();
Let's see what the patches will be.
York
participants (4):
- Mark Rutland
- Stephen Warren
- York Sun
- york sun