[PATCH v2 0/2] arm: stm32mp1: activate data cache in SPL and before relocation

V2 after first feedback on the previous patch "arm: stm32mp1: activate data cache in SPL and before relocation": http://patchwork.ozlabs.org/patch/1263815/
This new series depends on the ARM cache series: http://patchwork.ozlabs.org/project/uboot/list/?series=168378
I move the TLB into the .data section and simplify the implementation by reusing the default weak function dram_bank_mmu_setup() for the MMU configuration and mmu_set_region_dcache_behaviour() to set up the specific behavior.
I also activate the data cache on DDR for SPL.
For information, the gain of the second patch is limited (a few ms) for boot from SD card: the SDMMC IP uses its internal DMA, so the data cache on DDR is not really used.
The gain should be better for other boot use cases.
Example of bootstage report on STM32MP157C-DK2, boot from SD card.
1/ For trusted boot chain with TF-A
a) Before
STM32MP> bootstage report
Timer summary in microseconds (9 records):
       Mark    Elapsed  Stage
          0          0  reset
    583,290    583,290  board_init_f
  2,348,898  1,765,608  board_init_r
  2,664,580    315,682  id=64
  2,704,027     39,447  id=65
  2,704,729        702  main_loop
  5,563,519  2,858,790  id=175

Accumulated time:
     41,696  dm_r
    615,561  dm_f
b) After the series
STM32MP> bootstage report
Timer summary in microseconds (9 records):
       Mark    Elapsed  Stage
          0          0  reset
    583,401    583,401  board_init_f
    727,725    144,324  board_init_r
  1,043,362    315,637  id=64
  1,082,806     39,444  id=65
  1,083,507        701  main_loop
  3,680,827  2,597,320  id=175

Accumulated time:
     36,047  dm_f
     41,718  dm_r
2/ And for the basic boot chain with SPL
a) Before:
STM32MP> bootstage report
Timer summary in microseconds (12 records):
       Mark    Elapsed  Stage
          0          0  reset
    195,613    195,613  SPL
    837,867    642,254  end SPL
    840,117      2,250  board_init_f
  2,739,639  1,899,522  board_init_r
  3,066,815    327,176  id=64
  3,103,377     36,562  id=65
  3,104,078        701  main_loop
  3,142,171     38,093  id=175

Accumulated time:
     38,124  dm_spl
     41,956  dm_r
    648,861  dm_f
b) After the series
STM32MP> bootstage report
Timer summary in microseconds (12 records):
       Mark    Elapsed  Stage
          0          0  reset
    195,859    195,859  SPL
    330,190    134,331  end SPL
    332,408      2,218  board_init_f
    482,688    150,280  board_init_r
    808,694    326,006  id=64
    845,029     36,335  id=65
    845,730        701  main_loop
  3,281,876  2,436,146  id=175

Accumulated time:
      3,169  dm_spl
     36,041  dm_f
     41,701  dm_r

STM32MP> bootstage report
Timer summary in microseconds (12 records):
       Mark    Elapsed  Stage
          0          0  reset
    211,036    211,036  SPL
    343,393    132,357  end SPL
    345,645      2,252  board_init_f
    496,596    150,951  board_init_r
    822,256    325,660  id=64
    858,451     36,195  id=65
    859,153        702  main_loop
  3,414,706  2,555,553  id=175

Accumulated time:
      3,132  dm_spl
     36,005  dm_f
     41,695  dm_r
Changes in v2:
- create a new function early_enable_caches
- use TLB in .init section
- use the default weak dram_bank_mmu_setup() and use mmu_set_region_dcache_behaviour() to setup the early MMU configuration
- enable data cache on DDR in SPL, after DDR controller initialization
- new

Patrick Delaunay (2):
  arm: stm32mp: activate data cache in SPL and before relocation
  arm: stm32mp: activate data cache on DDR in SPL

 arch/arm/mach-stm32mp/cpu.c | 43 ++++++++++++++++++++++++++++++++++++-
 arch/arm/mach-stm32mp/spl.c | 21 ++++++++++++++++++
 2 files changed, 63 insertions(+), 1 deletion(-)
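
The gain can be read directly from the bootstage marks. The following is a small sketch of that arithmetic, using the main_loop marks from the reports above (for the SPL chain, the first of the two "after" runs). The struct and function names are made up for this illustration and are not part of U-Boot.

```c
#include <stdint.h>

/* One bootstage mark, in microseconds, before and after the series. */
struct boot_marks {
	uint32_t before_us;
	uint32_t after_us;
};

/* main_loop marks copied from the reports in this cover letter. */
static const struct boot_marks trusted_main_loop = { 2704729u, 1083507u };
static const struct boot_marks basic_main_loop = { 3104078u, 845730u };

/* Time saved at a given mark, in microseconds. */
static uint32_t gain_us(struct boot_marks marks)
{
	return marks.before_us - marks.after_us;
}
```

With these values, gain_us() gives 1,621,222 us for the trusted boot chain and 2,258,348 us for the basic boot chain, in line with the roughly 1.6 s and 2.2 s quoted in the first patch.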

Activate the data cache in SPL and in U-Boot before relocation.
In arch_cpu_init(), the function early_enable_caches() sets up the early TLB, early_tlb[], located in the .data section, and marks as cacheable:
- for SPL, all the SYSRAM
- for U-Boot, all the DDR
After relocation, the function enable_caches() (called by board_r) reconfigures the MMU with the new TLB location (reserved in board_f.c::reserve_mmu) and re-enables the data cache.
This patch reduces the execution time, particularly:
- for the device tree parsing in the U-Boot pre-reloc stage (dm_extended_scan_fdt => dm_scan_fdt)
- for the I2C timing computation in SPL (stm32_i2c_choose_solution())
For example, the result on the STM32MP157C-DK2 board is:
- 1.6 s gain for the trusted boot chain with TF-A
- 2.2 s gain for the basic boot chain with SPL
As the TLB is added in the .data section, the binary size increases, and so does the SPL load time by the ROM code (30 ms on DK2).
Signed-off-by: Patrick Delaunay <patrick.delaunay@st.com>
---
Changes in v2:
- create a new function early_enable_caches
- use TLB in .init section
- use the default weak dram_bank_mmu_setup() and use mmu_set_region_dcache_behaviour() to setup the early MMU configuration
- enable data cache on DDR in SPL, after DDR controller initialization

 arch/arm/mach-stm32mp/cpu.c | 43 ++++++++++++++++++++++++++++++++++++-
 1 file changed, 42 insertions(+), 1 deletion(-)

diff --git a/arch/arm/mach-stm32mp/cpu.c b/arch/arm/mach-stm32mp/cpu.c
index 36a9205819..c22c1a9bbc 100644
--- a/arch/arm/mach-stm32mp/cpu.c
+++ b/arch/arm/mach-stm32mp/cpu.c
@@ -75,6 +75,12 @@
 #define PKG_SHIFT 27
 #define PKG_MASK GENMASK(2, 0)
 
+/*
+ * early TLB into the .data section so that it does not get cleared
+ * with 16kB alignment (see TTBR0_BASE_ADDR_MASK)
+ */
+u8 early_tlb[PGTABLE_SIZE] __section(".data") __aligned(0x4000);
+
 #if !defined(CONFIG_SPL) || defined(CONFIG_SPL_BUILD)
 #ifndef CONFIG_STM32MP1_TRUSTED
 static void security_init(void)
@@ -186,6 +192,32 @@ u32 get_bootmode(void)
 		    TAMP_BOOT_MODE_SHIFT;
 }
 
+/*
+ * initialize the MMU and activate cache in SPL or in U-Boot pre-reloc stage
+ * MMU/TLB is updated in enable_caches() for U-Boot after relocation
+ * or is deactivated in U-Boot entry function start.S::cpu_init_cp15
+ */
+static void early_enable_caches(void)
+{
+	/* I-cache is already enabled in start.S: cpu_init_cp15 */
+
+	if (CONFIG_IS_ENABLED(SYS_DCACHE_OFF))
+		return;
+
+	gd->arch.tlb_size = PGTABLE_SIZE;
+	gd->arch.tlb_addr = (unsigned long)&early_tlb;
+
+	dcache_enable();
+
+	if (IS_ENABLED(CONFIG_SPL_BUILD))
+		mmu_set_region_dcache_behaviour(STM32_SYSRAM_BASE,
+						STM32_SYSRAM_SIZE,
+						DCACHE_DEFAULT_OPTION);
+	else
+		mmu_set_region_dcache_behaviour(STM32_DDR_BASE, STM32_DDR_SIZE,
+						DCACHE_DEFAULT_OPTION);
+}
+
 /*
  * Early system init
  */
@@ -193,6 +225,8 @@ int arch_cpu_init(void)
 {
 	u32 boot_mode;
 
+	early_enable_caches();
+
 	/* early armv7 timer init: needed for polling */
 	timer_init();
 
@@ -225,7 +259,14 @@ int arch_cpu_init(void)
 
 void enable_caches(void)
 {
-	/* Enable D-cache. I-cache is already enabled in start.S */
+	/* I-cache is already enabled in start.S: icache_enable() not needed */
+
+	/* deactivate the data cache, early enabled in arch_cpu_init() */
+	dcache_disable();
+	/*
+	 * update MMU after relocation and enable the data cache
+	 * warning: the TLB location is updated in board_f.c::reserve_mmu
+	 */
 	dcache_enable();
 }
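
To picture what marking a region as cacheable means for the early table, the following is a minimal host-side sketch, assuming the ARMv7 short-descriptor format with 1 MiB sections. It is not U-Boot's implementation of mmu_set_region_dcache_behaviour(): the MODEL_* names and model_set_region_cacheable() are hypothetical, only the descriptor type and the C/B bits are modeled, and the access-permission and domain fields are left out.

```c
#include <stdint.h>

#define MODEL_SECTION_SHIFT	20u		/* each first-level entry maps 1 MiB */
#define MODEL_NUM_SECTIONS	4096u		/* 4 GiB / 1 MiB, 4 bytes per entry = 16 KiB */
#define MODEL_SECTION_TYPE	0x2u		/* bits[1:0] = 0b10: section descriptor */
#define MODEL_SECTION_B		(1u << 2)	/* bufferable */
#define MODEL_SECTION_C		(1u << 3)	/* cacheable */

/*
 * Hypothetical helper: rewrite every 1 MiB section that overlaps
 * [start, start + size) as a cacheable, bufferable identity mapping.
 * size must be non-zero.
 */
static void model_set_region_cacheable(uint32_t *table, uint32_t start,
				       uint32_t size)
{
	uint32_t first = start >> MODEL_SECTION_SHIFT;
	uint32_t last = (uint32_t)(((uint64_t)start + size - 1) >>
				   MODEL_SECTION_SHIFT);
	uint32_t i;

	for (i = first; i <= last; i++)
		table[i] = (i << MODEL_SECTION_SHIFT) | MODEL_SECTION_C |
			   MODEL_SECTION_B | MODEL_SECTION_TYPE;
}
```

With an assumed DDR window of 1 GiB at 0xC0000000, model_set_region_cacheable(table, 0xC0000000u, 0x40000000u) rewrites entries 0xC00 to 0xFFF and leaves every other entry unchanged.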

On 4/3/20 11:25 AM, Patrick Delaunay wrote:
[...]
diff --git a/arch/arm/mach-stm32mp/cpu.c b/arch/arm/mach-stm32mp/cpu.c
index 36a9205819..c22c1a9bbc 100644
--- a/arch/arm/mach-stm32mp/cpu.c
+++ b/arch/arm/mach-stm32mp/cpu.c
@@ -75,6 +75,12 @@
 #define PKG_SHIFT 27
 #define PKG_MASK GENMASK(2, 0)
+/*
+ * early TLB into the .data section so that it does not get cleared
+ * with 16kB alignment (see TTBR0_BASE_ADDR_MASK)
+ */
+u8 early_tlb[PGTABLE_SIZE] __section(".data") __aligned(0x4000);
Can you early-malloc this one? (Why do you need this in __section("data")?)
[...]
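
As a side note on the 16 kB alignment in the quoted declaration, here is a small host-side sketch of the constraint, assuming a 16 KiB first-level table whose base address must have its low 14 bits clear. The model_* names and MODEL_* constants are hypothetical; the non-zero initializer is only there to show one way an array ends up in initialized data rather than in .bss on a typical toolchain.

```c
#include <stdint.h>

#define MODEL_PGTABLE_SIZE	0x4000u	/* 4096 entries of 4 bytes */
#define MODEL_TTBR0_ALIGN	0x4000u	/* required base alignment, 16 KiB */

/* C11 alignment specifier; the initializer keeps the array out of .bss. */
static _Alignas(0x4000) uint8_t model_early_tlb[MODEL_PGTABLE_SIZE] = { 1 };

/* Returns 1 when the address can be used as a 16 KiB-aligned table base. */
static int model_is_ttbr0_aligned(const void *addr)
{
	return ((uintptr_t)addr & (MODEL_TTBR0_ALIGN - 1u)) == 0;
}
```

The cost of this approach is the one mentioned in the commit message: a 16 KiB initialized array is part of the loaded image, which a runtime allocation would avoid.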

Dear Marek,
From: Marek Vasut <marex@denx.de>
Sent: Friday, April 3, 2020 23:32
On 4/3/20 11:25 AM, Patrick Delaunay wrote:
[...]
diff --git a/arch/arm/mach-stm32mp/cpu.c b/arch/arm/mach-stm32mp/cpu.c
index 36a9205819..c22c1a9bbc 100644
--- a/arch/arm/mach-stm32mp/cpu.c
+++ b/arch/arm/mach-stm32mp/cpu.c
@@ -75,6 +75,12 @@
 #define PKG_SHIFT 27
 #define PKG_MASK GENMASK(2, 0)
+/*
+ * early TLB into the .data section so that it does not get cleared
+ * with 16kB alignment (see TTBR0_BASE_ADDR_MASK)
+ */
+u8 early_tlb[PGTABLE_SIZE] __section(".data") __aligned(0x4000);
Can you early-malloc this one?
I tried early malloc and it fails because my code in arch_cpu_init() is executed before the early pool initialization done in spl_common_init(), which is called by spl_early_init(). So it comes too late for my use case.
And if I initialize the MMU and the cache after this function, it is too late, as dm_init_and_scan() and the FDT parsing are also called in spl_common_init().
(Why do you need this in __section("data")?)
I tried to use .bss and it fails because the BSS is reset to 0 in SPL after board_init_f(), so the MMU table is cleared without notice.
In fact BSS is not available: board_init_f() can use only stack variables and global_data (see README:258).
When I investigated the issue, I found CONFIG_SPL_EARLY_BSS, which explains this point:
config SPL_EARLY_BSS
	depends on ARM && !ARM64
	bool "Allows initializing BSS early before entering board_init_f"
	help
	  On some platform we have sufficient memory available early on to
	  allow setting up and using a basic BSS prior to entering
	  board_init_f. Activating this option will also de-activate the
	  clearing of BSS during the SPL relocation process, thus allowing
	  to carry state from board_init_f to board_init_r by way of BSS.
So it is a compromise between a hardcoded address (end of SYSRAM) and a global variable in the .data section.
The V2 patch with .data seems more elegant to me (it avoids any assumption on the U-Boot size for the pre-reloc case).
And if you have a size issue for SPL, you can deactivate the cache for SPL only (CONFIG_SPL_SYS_DCACHE_OFF).
[...]
Regards
Patrick
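
The ordering problem described above can be sketched on a host with a toy allocator, assuming a pool that only becomes usable after an explicit init step. This is not U-Boot's early malloc code: struct model_pool, model_pool_init() and model_early_alloc() are hypothetical names used only to show why a request made before the pool setup cannot succeed.

```c
#include <stddef.h>

/* Hypothetical stand-in for an early allocation pool. */
struct model_pool {
	unsigned char *base;	/* NULL until the pool has been set up */
	size_t limit;		/* pool size in bytes */
	size_t used;		/* bytes already handed out */
};

/* Models the pool setup that, per the thread, happens in spl_common_init(). */
static void model_pool_init(struct model_pool *pool, unsigned char *mem,
			    size_t size)
{
	pool->base = mem;
	pool->limit = size;
	pool->used = 0;
}

/* Bump allocation; fails when the pool is not set up yet or is exhausted. */
static void *model_early_alloc(struct model_pool *pool, size_t bytes)
{
	void *ptr;

	if (!pool->base || bytes > pool->limit - pool->used)
		return NULL;

	ptr = pool->base + pool->used;
	pool->used += bytes;
	return ptr;
}
```

A call to model_early_alloc() on a zero-initialized pool returns NULL, which mirrors an allocation attempted from arch_cpu_init() before the pool exists; the same call after model_pool_init() succeeds.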

On 4/9/20 8:32 PM, Patrick DELAUNAY wrote:
Dear Marek,
From: Marek Vasut <marex@denx.de>
Sent: Friday, April 3, 2020 23:32
On 4/3/20 11:25 AM, Patrick Delaunay wrote:
[...]
diff --git a/arch/arm/mach-stm32mp/cpu.c b/arch/arm/mach-stm32mp/cpu.c
index 36a9205819..c22c1a9bbc 100644
--- a/arch/arm/mach-stm32mp/cpu.c
+++ b/arch/arm/mach-stm32mp/cpu.c
@@ -75,6 +75,12 @@
 #define PKG_SHIFT 27
 #define PKG_MASK GENMASK(2, 0)
+/*
+ * early TLB into the .data section so that it does not get cleared
+ * with 16kB alignment (see TTBR0_BASE_ADDR_MASK)
+ */
+u8 early_tlb[PGTABLE_SIZE] __section(".data") __aligned(0x4000);
Can you early-malloc this one?
I tried early malloc and it fails because my code in arch_cpu_init() is executed before the early pool initialization done in spl_common_init(), which is called by spl_early_init(). So it comes too late for my use case.
And if I initialize the MMU and the cache after this function, it is too late, as dm_init_and_scan() and the FDT parsing are also called in spl_common_init().
Aha, OK. Can you document it in the commit message? That's a real good piece of information.
(Why do you need this in __section("data")?)
I tried to use .bss and it fails because the BSS is reset to 0 in SPL after board_init_f(), so the MMU table is cleared without notice.
In fact BSS is not available: board_init_f() can use only stack variables and global_data (see README:258).
When I investigated the issue, I found CONFIG_SPL_EARLY_BSS, which explains this point:
config SPL_EARLY_BSS
	depends on ARM && !ARM64
	bool "Allows initializing BSS early before entering board_init_f"
	help
	  On some platform we have sufficient memory available early on to
	  allow setting up and using a basic BSS prior to entering
	  board_init_f. Activating this option will also de-activate the
	  clearing of BSS during the SPL relocation process, thus allowing
	  to carry state from board_init_f to board_init_r by way of BSS.
So it is a compromise between a hardcoded address (end of SYSRAM) and a global variable in the .data section.
The V2 patch with .data seems more elegant to me (it avoids any assumption on the U-Boot size for the pre-reloc case).
And if you have a size issue for SPL, you can deactivate the cache for SPL only (CONFIG_SPL_SYS_DCACHE_OFF).
OK

Activate the cache on DDR to improve accesses to the DDR used by SPL:
- CONFIG_SPL_BSS_START_ADDR
- CONFIG_SYS_SPL_MALLOC_START
The cache is configured only once the DDR is fully initialized, to avoid speculative accesses and issues in get_ram_size(). The data cache is deactivated at the end of SPL, to flush the data cache and the TLB.
Signed-off-by: Patrick Delaunay <patrick.delaunay@st.com>
---
Changes in v2:
- new

 arch/arm/mach-stm32mp/spl.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/arch/arm/mach-stm32mp/spl.c b/arch/arm/mach-stm32mp/spl.c
index 9cd7b418a4..279121af75 100644
--- a/arch/arm/mach-stm32mp/spl.c
+++ b/arch/arm/mach-stm32mp/spl.c
@@ -4,6 +4,7 @@
  */
 
 #include <common.h>
+#include <cpu_func.h>
 #include <dm.h>
 #include <hang.h>
 #include <spl.h>
@@ -117,4 +118,24 @@ void board_init_f(ulong dummy)
 		printf("DRAM init failed: %d\n", ret);
 		hang();
 	}
+
+	/*
+	 * activate cache on DDR only when DDR is fully initialized
+	 * to avoid speculative access and issue in get_ram_size()
+	 */
+	if (!CONFIG_IS_ENABLED(SYS_DCACHE_OFF))
+		mmu_set_region_dcache_behaviour(STM32_DDR_BASE, STM32_DDR_SIZE,
+						DCACHE_DEFAULT_OPTION);
+}
+
+void spl_board_prepare_for_boot(void)
+{
+	dcache_disable();
+	debug("SPL bye\n");
+}
+
+void spl_board_prepare_for_boot_linux(void)
+{
+	dcache_disable();
+	debug("SPL bye\n");
 }
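
The commit message does not detail the get_ram_size() issue. As background, the following is a simplified host-side model of the kind of probing a RAM size detection routine performs: it writes a distinct pattern at each power-of-two offset and looks for the first one that was overwritten by address wrap-around. The names model_ram, ram_write(), ram_read() and model_probe_size() are hypothetical, and this sketch does not save and restore memory contents as a real routine would. If the reads were served from a data cache instead of the DRAM, the wrap-around might not be observed, which is one plausible reading of why the cache is enabled only after DDR initialization.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical aliased memory: 'words' real cells (power of two), addresses wrap. */
struct model_ram {
	uint32_t *cells;
	size_t words;
};

static void ram_write(struct model_ram *ram, size_t idx, uint32_t val)
{
	ram->cells[idx % ram->words] = val;
}

static uint32_t ram_read(const struct model_ram *ram, size_t idx)
{
	return ram->cells[idx % ram->words];
}

/* Returns the detected size in words, probing up to max_words (power of two). */
static size_t model_probe_size(struct model_ram *ram, size_t max_words)
{
	size_t cnt;

	/* Write a distinct pattern at every power-of-two offset, highest first. */
	for (cnt = max_words >> 1; cnt > 0; cnt >>= 1)
		ram_write(ram, cnt, ~(uint32_t)cnt);
	ram_write(ram, 0, 0);

	/* The first offset whose pattern was overwritten marks the wrap point. */
	for (cnt = 1; cnt < max_words; cnt <<= 1)
		if (ram_read(ram, cnt) != ~(uint32_t)cnt)
			return cnt;

	return max_words;
}
```

In this model, a memory of 8 real words probed up to 32 words is reported as 8 words, because the write to offset 0 lands on the same cell as the pattern written at offset 8.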

On 4/3/20 11:25 AM, Patrick Delaunay wrote:
Activate the cache on DDR to improve accesses to the DDR used by SPL:
- CONFIG_SPL_BSS_START_ADDR
- CONFIG_SYS_SPL_MALLOC_START
Cache is configured only when DDR is fully initialized, to avoid speculative access and issue in get_ram_size(). Data cache is deactivated at the end of SPL, to flush the data cache and the TLB.
Signed-off-by: Patrick Delaunay patrick.delaunay@st.com
Changes in v2:
- new
arch/arm/mach-stm32mp/spl.c | 21 +++++++++++++++++++++ 1 file changed, 21 insertions(+)
diff --git a/arch/arm/mach-stm32mp/spl.c b/arch/arm/mach-stm32mp/spl.c
index 9cd7b418a4..279121af75 100644
--- a/arch/arm/mach-stm32mp/spl.c
+++ b/arch/arm/mach-stm32mp/spl.c
@@ -4,6 +4,7 @@
  */
 
 #include <common.h>
+#include <cpu_func.h>
 #include <dm.h>
 #include <hang.h>
 #include <spl.h>
@@ -117,4 +118,24 @@ void board_init_f(ulong dummy)
 		printf("DRAM init failed: %d\n", ret);
 		hang();
 	}
+
+	/*
+	 * activate cache on DDR only when DDR is fully initialized
+	 * to avoid speculative access and issue in get_ram_size()
+	 */
+	if (!CONFIG_IS_ENABLED(SYS_DCACHE_OFF))
+		mmu_set_region_dcache_behaviour(STM32_DDR_BASE, STM32_DDR_SIZE,
+						DCACHE_DEFAULT_OPTION);
+}
+
+void spl_board_prepare_for_boot(void)
+{
+	dcache_disable();
+	debug("SPL bye\n");
+}
+
+void spl_board_prepare_for_boot_linux(void)
+{
+	dcache_disable();
+	debug("SPL bye\n");
Is the debug() statement really needed? I think the common SPL code already has some.

Dear Marek,
From: Marek Vasut <marex@denx.de>
Sent: Friday, April 3, 2020 23:33
On 4/3/20 11:25 AM, Patrick Delaunay wrote:
Activate the cache on DDR to improve accesses to the DDR used by SPL:
- CONFIG_SPL_BSS_START_ADDR
- CONFIG_SYS_SPL_MALLOC_START
Cache is configured only when DDR is fully initialized, to avoid speculative access and issue in get_ram_size(). Data cache is deactivated at the end of SPL, to flush the data cache and the TLB.
Signed-off-by: Patrick Delaunay <patrick.delaunay@st.com>
Changes in v2:
- new
[...]
+void spl_board_prepare_for_boot(void)
+{
+	dcache_disable();
+	debug("SPL bye\n");
+}
+
+void spl_board_prepare_for_boot_linux(void)
+{
+	dcache_disable();
+	debug("SPL bye\n");
Is the debug() statement really needed? I think the common SPL code already has some.
Not needed, I will drop them in V3.
Patrick

On 4/9/20 8:10 PM, Patrick DELAUNAY wrote:
Dear Marek,
From: Marek Vasut <marex@denx.de>
Sent: Friday, April 3, 2020 23:33
On 4/3/20 11:25 AM, Patrick Delaunay wrote:
Activate the cache on DDR to improve accesses to the DDR used by SPL:
- CONFIG_SPL_BSS_START_ADDR
- CONFIG_SYS_SPL_MALLOC_START
Cache is configured only when DDR is fully initialized, to avoid speculative access and issue in get_ram_size(). Data cache is deactivated at the end of SPL, to flush the data cache and the TLB.
Signed-off-by: Patrick Delaunay <patrick.delaunay@st.com>
Changes in v2:
- new
[...]
+void spl_board_prepare_for_boot(void)
+{
+	dcache_disable();
+	debug("SPL bye\n");
+}
+
+void spl_board_prepare_for_boot_linux(void)
+{
+	dcache_disable();
+	debug("SPL bye\n");
Is the debug() statement really needed? I think the common SPL code already has some.
Not needed, I will drop them in V3.
Thanks