[U-Boot] [PATCH] arm: Replace test for CONFIG_ARMV7 with CONFIG_CPU_V7

The arch/arm/lib/cache-cp15.c checks for CONFIG_ARMV7 and if this macro is set, it configures TTBR0 register. This register must be configured for the cache on ARMv7 to operate correctly.
The problem is that noone actually sets the CONFIG_ARMV7 macro and thus the TTBR0 is not configured at all. On SoCFPGA, this produces all sorts of minor issues which are hard to replicate, for example certain USB sticks are not detected or QSPI NOR sometimes fails to write pages completely.
The solution is to replace CONFIG_ARMV7 test with CONFIG_CPU_V7 one. This is correct because the code which added the test(s) for CONFIG_ARMV7 was added shortly after CONFIG_ARMV7 was replaced by CONFIG_CPU_V7 and this code was not adjusted correctly to reflect that change.
Signed-off-by: Marek Vasut marex@denx.de Cc: Tom Rini trini@konsulko.com Cc: Albert Aribaud albert.u.boot@aribaud.net Cc: Simon Glass sjg@chromium.org --- arch/arm/include/asm/system.h | 4 ++-- arch/arm/lib/cache-cp15.c | 2 +- 2 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/arch/arm/include/asm/system.h b/arch/arm/include/asm/system.h index 71b3108..dec83c7 100644 --- a/arch/arm/include/asm/system.h +++ b/arch/arm/include/asm/system.h @@ -220,7 +220,7 @@ static inline void set_dacr(unsigned int val) isb(); }
-#ifdef CONFIG_ARMV7 +#ifdef CONFIG_CPU_V7 /* Short-Descriptor Translation Table Level 1 Bits */ #define TTB_SECT_NS_MASK (1 << 19) #define TTB_SECT_NG_MASK (1 << 17) @@ -257,7 +257,7 @@ enum { MMU_SECTION_SIZE = 1 << MMU_SECTION_SHIFT, };
-#ifdef CONFIG_ARMV7 +#ifdef CONFIG_CPU_V7 /* TTBR0 bits */ #define TTBR0_BASE_ADDR_MASK 0xFFFFC000 #define TTBR0_RGN_NC (0 << 3) diff --git a/arch/arm/lib/cache-cp15.c b/arch/arm/lib/cache-cp15.c index c65e068..8e18538 100644 --- a/arch/arm/lib/cache-cp15.c +++ b/arch/arm/lib/cache-cp15.c @@ -96,7 +96,7 @@ static inline void mmu_setup(void) dram_bank_mmu_setup(i); }
-#ifdef CONFIG_ARMV7 +#ifdef CONFIG_CPU_V7 /* Set TTBR0 */ reg = gd->arch.tlb_addr & TTBR0_BASE_ADDR_MASK; #if defined(CONFIG_SYS_ARM_CACHE_WRITETHROUGH)

Hello Marek,
On Tue, 8 Dec 2015 14:43:42 +0100, Marek Vasut marex@denx.de wrote:
The arch/arm/lib/cache-cp15.c checks for CONFIG_ARMV7 and if this macro is set, it configures TTBR0 register. This register must be configured for the cache on ARMv7 to operate correctly.
The problem is that noone actually sets the CONFIG_ARMV7 macro and thus the TTBR0 is not configured at all. On SoCFPGA, this produces all sorts of minor issues which are hard to replicate, for example certain USB sticks are not detected or QSPI NOR sometimes fails to write pages completely.
The solution is to replace CONFIG_ARMV7 test with CONFIG_CPU_V7 one. This is correct because the code which added the test(s) for CONFIG_ARMV7 was added shortly after CONFIG_ARMV7 was replaced by CONFIG_CPU_V7 and this code was not adjusted correctly to reflect that change.
Note:
As discussed with Marek on IRC, this patch enables what is supposed to be the correct MMU settings for ARMv7, which causes a sharp Ethernet performance drop (40%) but also a strong general memory access performance hit (a copy of 4 MB is almost instantaneous without the patch and takes 2-3 seconds with it).
I would like to either fix the performance or come up with an explanation for it before I pick this patch.
Marek's analysis shows the only MMU-related effect of the patch is that the S bit gets set in first-level MMU table entries.
The S bit makes a mmu region shareable with other IPs within the silicon (DMA engines, other cores, possibly even a piece of bus or interconnect). For instance, it will cause memory writes within the region to propagate to other IPs (which [must] also have defined this region as shareable) so that these IPs know the region has been written to and update their cache state accordingly.
With the S bit clear, USB and QSPI fail to work on Marek's board, probably because writes do not propagate properly between the core and these IPs; with the S bit set, USB and QSPI work but cache performance is very reduced.
My hypothesis right now is that some other IP(s) hamper(s) the propagation of shareable region writes; for instance, some IP is off or in the wrong state, and does not properly respond, causing some stall.
At first I suspected the second core, which was off during the tests; but Marek tried with that core on and it did not improve performance.
We can keep on looking into finding IPs that might affect shareable accesses and try to turn them on or off more or less at random, but that is time-consuming, and I'm not even sure I'm on the right track.
I will welcome suggestions if anyone with more experience in MMU shareability than us -- or with brilliant insight :) -- has any.
Amicalement,

Hi All!
On 09.12.2015 09:09, Albert ARIBAUD wrote:
On Tue, 8 Dec 2015 14:43:42 +0100, Marek Vasut marex@denx.de wrote:
The arch/arm/lib/cache-cp15.c checks for CONFIG_ARMV7 and if this macro is set, it configures TTBR0 register. This register must be configured for the cache on ARMv7 to operate correctly.
The problem is that noone actually sets the CONFIG_ARMV7 macro and thus the TTBR0 is not configured at all. On SoCFPGA, this produces all sorts of minor issues which are hard to replicate, for example certain USB sticks are not detected or QSPI NOR sometimes fails to write pages completely.
The solution is to replace CONFIG_ARMV7 test with CONFIG_CPU_V7 one. This is correct because the code which added the test(s) for CONFIG_ARMV7 was added shortly after CONFIG_ARMV7 was replaced by CONFIG_CPU_V7 and this code was not adjusted correctly to reflect that change.
Note:
As discussed with Marek on IRC, this patch enables what is supposed to be the correct MMU settings for ARMv7, which causes a sharp Ethernet performance drop (40%) but also a strong general memory access performance hit (a copy of 4 MB is almost instantaneous without the patch and takes 2-3 seconds with it).
I've tested Marek's patch on my current Armada XP platform (also ARMv7). And also noticed a performance drop by about 30-40%. The dhrystone command can be used for this testing as well (CONFIG_CMD_DHRYSTONE).
So this is definitely a NAK from me to this current patch. It fixes the issue on SoCFPGA but affects other platforms in a negative way (i.MX6, Armada XP / 38x and perhaps others too). My feeling is, that SoCFPGA needs to get fixed in a way, without affecting the other platforms.
Marek, have you checked if the dcache is enabled at all on SoCFPGA with this patch applied? I remember that before my patch to really enable the dcache on SoCFPGA, the performance was so low. And I even could remove all caching functions from the ethernet driver (invalidate and flush). And the driver still worked without any issues.
I would like to either fix the performance or come up with an explanation for it before I pick this patch.
Yes. Please don't apply it quickly.
Thanks, Stefan

On Wednesday, December 09, 2015 at 02:02:03 PM, Stefan Roese wrote:
Hi All!
On 09.12.2015 09:09, Albert ARIBAUD wrote:
On Tue, 8 Dec 2015 14:43:42 +0100, Marek Vasut marex@denx.de wrote:
The arch/arm/lib/cache-cp15.c checks for CONFIG_ARMV7 and if this macro is set, it configures TTBR0 register. This register must be configured for the cache on ARMv7 to operate correctly.
The problem is that noone actually sets the CONFIG_ARMV7 macro and thus the TTBR0 is not configured at all. On SoCFPGA, this produces all sorts of minor issues which are hard to replicate, for example certain USB sticks are not detected or QSPI NOR sometimes fails to write pages completely.
The solution is to replace CONFIG_ARMV7 test with CONFIG_CPU_V7 one. This is correct because the code which added the test(s) for CONFIG_ARMV7 was added shortly after CONFIG_ARMV7 was replaced by CONFIG_CPU_V7 and this code was not adjusted correctly to reflect that change.
Note:
As discussed with Marek on IRC, this patch enables what is supposed to be the correct MMU settings for ARMv7, which causes a sharp Ethernet performance drop (40%) but also a strong general memory access performance hit (a copy of 4 MB is almost instantaneous without the patch and takes 2-3 seconds with it).
I've tested Marek's patch on my current Armada XP platform (also ARMv7). And also noticed a performance drop by about 30-40%. The dhrystone command can be used for this testing as well (CONFIG_CMD_DHRYSTONE).
So this is definitely a NAK from me to this current patch.
This is a wrong approach, this patch just fixes another patch which was broken from the getgo and the TTBR0 reg was not programmed correctly due to that. This patch must be applied, but what we need to decide is whether we need _another_ patch which removes the S-bit from the pagetable or how to deal with that part.
It fixes the issue on SoCFPGA but affects other platforms in a negative way (i.MX6, Armada XP / 38x and perhaps others too). My feeling is, that SoCFPGA needs to get fixed in a way, without affecting the other platforms.
Marek, have you checked if the dcache is enabled at all on SoCFPGA with this patch applied? I remember that before my patch to really enable the dcache on SoCFPGA, the performance was so low. And I even could remove all caching functions from the ethernet driver (invalidate and flush). And the driver still worked without any issues.
If I do dcache off, the performance drops further, so I suspect the cache is indeed enabled.
I would like to either fix the performance or come up with an explanation for it before I pick this patch.
Yes. Please don't apply it quickly.
Thanks, Stefan
Best regards, Marek Vasut

On Wed, Dec 09, 2015 at 03:09:13PM +0100, Marek Vasut wrote:
On Wednesday, December 09, 2015 at 02:02:03 PM, Stefan Roese wrote:
Hi All!
On 09.12.2015 09:09, Albert ARIBAUD wrote:
On Tue, 8 Dec 2015 14:43:42 +0100, Marek Vasut marex@denx.de wrote:
The arch/arm/lib/cache-cp15.c checks for CONFIG_ARMV7 and if this macro is set, it configures TTBR0 register. This register must be configured for the cache on ARMv7 to operate correctly.
The problem is that noone actually sets the CONFIG_ARMV7 macro and thus the TTBR0 is not configured at all. On SoCFPGA, this produces all sorts of minor issues which are hard to replicate, for example certain USB sticks are not detected or QSPI NOR sometimes fails to write pages completely.
The solution is to replace CONFIG_ARMV7 test with CONFIG_CPU_V7 one. This is correct because the code which added the test(s) for CONFIG_ARMV7 was added shortly after CONFIG_ARMV7 was replaced by CONFIG_CPU_V7 and this code was not adjusted correctly to reflect that change.
Note:
As discussed with Marek on IRC, this patch enables what is supposed to be the correct MMU settings for ARMv7, which causes a sharp Ethernet performance drop (40%) but also a strong general memory access performance hit (a copy of 4 MB is almost instantaneous without the patch and takes 2-3 seconds with it).
I've tested Marek's patch on my current Armada XP platform (also ARMv7). And also noticed a performance drop by about 30-40%. The dhrystone command can be used for this testing as well (CONFIG_CMD_DHRYSTONE).
So this is definitely a NAK from me to this current patch.
This is a wrong approach, this patch just fixes another patch which was broken from the getgo and the TTBR0 reg was not programmed correctly due to that. This patch must be applied, but what we need to decide is whether we need _another_ patch which removes the S-bit from the pagetable or how to deal with that part.
No, we don't have to apply this patch. The original patch here (97840b5) was likely not run-time tested when submitted as it was using the then-defunct CONFIG_ARMV7 symbol, and was likely a bugfix in an internal tree that was pushed upstream (which is on the whole, good). So in sum U-Boot has always been as broken in some specific regard before that patch as it is after that patch. So we have time to see what we need to do about enabling this feature correctly and even ensuring that we don't happen to say break things once Linux is up too.

On Thursday, December 10, 2015 at 03:27:08 AM, Tom Rini wrote:
On Wed, Dec 09, 2015 at 03:09:13PM +0100, Marek Vasut wrote:
On Wednesday, December 09, 2015 at 02:02:03 PM, Stefan Roese wrote:
Hi All!
On 09.12.2015 09:09, Albert ARIBAUD wrote:
On Tue, 8 Dec 2015 14:43:42 +0100, Marek Vasut marex@denx.de wrote:
The arch/arm/lib/cache-cp15.c checks for CONFIG_ARMV7 and if this macro is set, it configures TTBR0 register. This register must be configured for the cache on ARMv7 to operate correctly.
The problem is that noone actually sets the CONFIG_ARMV7 macro and thus the TTBR0 is not configured at all. On SoCFPGA, this produces all sorts of minor issues which are hard to replicate, for example certain USB sticks are not detected or QSPI NOR sometimes fails to write pages completely.
The solution is to replace CONFIG_ARMV7 test with CONFIG_CPU_V7 one. This is correct because the code which added the test(s) for CONFIG_ARMV7 was added shortly after CONFIG_ARMV7 was replaced by CONFIG_CPU_V7 and this code was not adjusted correctly to reflect that change.
Note:
As discussed with Marek on IRC, this patch enables what is supposed to be the correct MMU settings for ARMv7, which causes a sharp Ethernet performance drop (40%) but also a strong general memory access performance hit (a copy of 4 MB is almost instantaneous without the patch and takes 2-3 seconds with it).
I've tested Marek's patch on my current Armada XP platform (also ARMv7). And also noticed a performance drop by about 30-40%. The dhrystone command can be used for this testing as well (CONFIG_CMD_DHRYSTONE).
So this is definitely a NAK from me to this current patch.
This is a wrong approach, this patch just fixes another patch which was broken from the getgo and the TTBR0 reg was not programmed correctly due to that. This patch must be applied, but what we need to decide is whether we need _another_ patch which removes the S-bit from the pagetable or how to deal with that part.
No, we don't have to apply this patch. The original patch here (97840b5) was likely not run-time tested when submitted as it was using the then-defunct CONFIG_ARMV7 symbol, and was likely a bugfix in an internal tree that was pushed upstream (which is on the whole, good).
I find it surprising that such a patch, which modifies common code even, was not scrutinized enough.
So in sum U-Boot has always been as broken in some specific regard before that patch as it is after that patch. So we have time to see what we need to do about enabling this feature correctly and even ensuring that we don't happen to say break things once Linux is up too.
In that case, I am looking forward to better suggestions.
I still disagree that this patch should not be applied, it corrects code which was broken. The slowdown caused by the original patch is a separate issue and should be corrected by a separate patch. If these two patches get applied at once, that is fine.
Best regards, Marek Vasut

Hello Marek,
On Thu, 10 Dec 2015 04:53:01 +0100, Marek Vasut marex@denx.de wrote:
On Thursday, December 10, 2015 at 03:27:08 AM, Tom Rini wrote:
On Wed, Dec 09, 2015 at 03:09:13PM +0100, Marek Vasut wrote:
On Wednesday, December 09, 2015 at 02:02:03 PM, Stefan Roese wrote:
Hi All!
On 09.12.2015 09:09, Albert ARIBAUD wrote:
On Tue, 8 Dec 2015 14:43:42 +0100, Marek Vasut marex@denx.de wrote:
The arch/arm/lib/cache-cp15.c checks for CONFIG_ARMV7 and if this macro is set, it configures TTBR0 register. This register must be configured for the cache on ARMv7 to operate correctly.
The problem is that noone actually sets the CONFIG_ARMV7 macro and thus the TTBR0 is not configured at all. On SoCFPGA, this produces all sorts of minor issues which are hard to replicate, for example certain USB sticks are not detected or QSPI NOR sometimes fails to write pages completely.
The solution is to replace CONFIG_ARMV7 test with CONFIG_CPU_V7 one. This is correct because the code which added the test(s) for CONFIG_ARMV7 was added shortly after CONFIG_ARMV7 was replaced by CONFIG_CPU_V7 and this code was not adjusted correctly to reflect that change.
Note:
As discussed with Marek on IRC, this patch enables what is supposed to be the correct MMU settings for ARMv7, which causes a sharp Ethernet performance drop (40%) but also a strong general memory access performance hit (a copy of 4 MB is almost instantaneous without the patch and takes 2-3 seconds with it).
I've tested Marek's patch on my current Armada XP platform (also ARMv7). And also noticed a performance drop by about 30-40%. The dhrystone command can be used for this testing as well (CONFIG_CMD_DHRYSTONE).
So this is definitely a NAK from me to this current patch.
This is a wrong approach, this patch just fixes another patch which was broken from the getgo and the TTBR0 reg was not programmed correctly due to that. This patch must be applied, but what we need to decide is whether we need _another_ patch which removes the S-bit from the pagetable or how to deal with that part.
No, we don't have to apply this patch. The original patch here (97840b5) was likely not run-time tested when submitted as it was using the then-defunct CONFIG_ARMV7 symbol, and was likely a bugfix in an internal tree that was pushed upstream (which is on the whole, good).
I find it surprising that such a patch, which modifies common code even, was not scrutinized enough.
So in sum U-Boot has always been as broken in some specific regard before that patch as it is after that patch. So we have time to see what we need to do about enabling this feature correctly and even ensuring that we don't happen to say break things once Linux is up too.
In that case, I am looking forward to better suggestions.
I still disagree that this patch should not be applied, it corrects code which was broken. The slowdown caused by the original patch is a separate issue and should be corrected by a separate patch. If these two patches get applied at once, that is fine.
This patch has several effects:
- it selects proper ARMv7 translation table level 1 bit definitions; - it provides proper ARMv7 definitions for WT/WB/WA; - it selects proper ARMv7 settings for TTBR0.
All these are correct as per the docs I have (although I may have missed something during the readings (and cross-readings with Marek) of these last hours/days.
Now, one specific effect goes against performance, and it is the setting of bit S in all TT entries. This bit makes the corresponding region shareable, but for all I know, in U-Boot we don't have more than one core accessing the same memory or registers sets so -- at least for the major part of its execution -- there is no reason for any region to be shareable.
/That/ effect I certainly don't want.
The rest I am fine with.
So, if we apply this patch minus the inclusion of the S bit in the definition of DCACHE_OFF (and hence of all DCACHE_* enum members after it), then we get code that is more correct from a theoretical standpoint, and does not degrade performance (which is a hint, admittedly weak, to me that it is not incorrect from a practical standpoint.
It does not fix the USB and QSPI issues in the socfpga case, but I am not sure these are caused by the S bit being zero.
Best regards, Marek Vasut
Amicalement,

Hi!
This patch has several effects:
- it selects proper ARMv7 translation table level 1 bit definitions;
- it provides proper ARMv7 definitions for WT/WB/WA;
- it selects proper ARMv7 settings for TTBR0.
All these are correct as per the docs I have (although I may have missed something during the readings (and cross-readings with Marek) of these last hours/days.
Now, one specific effect goes against performance, and it is the setting of bit S in all TT entries. This bit makes the corresponding region shareable, but for all I know, in U-Boot we don't have more than one core accessing the same memory or registers sets so -- at least for the major part of its execution -- there is no reason for any region to be shareable.
Well, I'm currently working on AMP patch, which will mean two processors at the same time in u-boot.
Also... we provide memory modify operations for the user. User may be trying to communicate with second core.
/That/ effect I certainly don't want.
How big is the slowdown from S bit?
Best regards, Pavel

On Monday, December 14, 2015 at 08:48:16 AM, Pavel Machek wrote:
Hi!
This patch has several effects:
- it selects proper ARMv7 translation table level 1 bit definitions;
- it provides proper ARMv7 definitions for WT/WB/WA;
- it selects proper ARMv7 settings for TTBR0.
All these are correct as per the docs I have (although I may have missed something during the readings (and cross-readings with Marek) of these last hours/days.
Now, one specific effect goes against performance, and it is the setting of bit S in all TT entries. This bit makes the corresponding region shareable, but for all I know, in U-Boot we don't have more than one core accessing the same memory or registers sets so -- at least for the major part of its execution -- there is no reason for any region to be shareable.
Well, I'm currently working on AMP patch, which will mean two processors at the same time in u-boot.
Also... we provide memory modify operations for the user. User may be trying to communicate with second core.
You have a special usecase, so you should tweak your U-Boot for that using some sort of config option.
/That/ effect I certainly don't want.
How big is the slowdown from S bit?
30-40% .
Best regards, Marek Vasut

Hello Pavel,
On Mon, 14 Dec 2015 08:48:16 +0100, Pavel Machek pavel@denx.de wrote:
Hi!
This patch has several effects:
- it selects proper ARMv7 translation table level 1 bit definitions;
- it provides proper ARMv7 definitions for WT/WB/WA;
- it selects proper ARMv7 settings for TTBR0.
All these are correct as per the docs I have (although I may have missed something during the readings (and cross-readings with Marek) of these last hours/days.
Now, one specific effect goes against performance, and it is the setting of bit S in all TT entries. This bit makes the corresponding region shareable, but for all I know, in U-Boot we don't have more than one core accessing the same memory or registers sets so -- at least for the major part of its execution -- there is no reason for any region to be shareable.
Well, I'm currently working on AMP patch, which will mean two processors at the same time in u-boot.
Will they share memory or will they use another mechanism for sync?
Also... we provide memory modify operations for the user. User may be trying to communicate with second core.
/That/ effect I certainly don't want.
How big is the slowdown from S bit?
A 4MB memory-to-memory transfer goes from instantaneous to 2-3 seconds; Ethernet performance drops by 40%.
Best regards,
Amicalement,
participants (5)
-
Albert ARIBAUD
-
Marek Vasut
-
Pavel Machek
-
Stefan Roese
-
Tom Rini