[U-Boot] [PATCH 0/3] arm: reduce .bss section clear time

This patchset reduces the boot time for ARM architecture, Exynos boards, and boards with DFU enabled(ARM).
For tested Trats2 device, this was done in three steps.
First was enable the arch memcpy and memset. The second step was enable memset for .bss clear. The third step for reduce this operation is to keep .bss section small as possible.
The .bss section will grow if we have a lot of static variables. This section is cleared before jump to the relocated U-Boot, and it's done word by word. To reduce the time for this step, we can enable arch memset, which uses multiple ARM registers.
For configs with DFU enabled, we can find the dfu buffer in this section, which has at least 8MB (32MB for trats2). This is a lot of useless data, which is not required for standard boot. So this buffer should be dynamic allocated.
Przemyslaw Marczak (3): exynos: config: enable arch memcpy and arch memset arm: relocation: clear .bss section with arch memset if defined dfu: mmc: file buffer: remove static allocation
arch/arm/lib/crt0.S | 10 +++++++++- drivers/dfu/dfu_mmc.c | 25 ++++++++++++++++++++++--- include/configs/exynos-common.h | 3 +++ 3 files changed, 34 insertions(+), 4 deletions(-)

This commit enables the following configs: - CONFIG_USE_ARCH_MEMCPY - CONFIG_USE_ARCH_MEMSET This increases the performance of memcpy/memset and also reduces the boot time.
This was tested on Trats2. A quick test with trace. Boot time from start to main_loop() entry: - ~1527ms - before this change (arch memset enabled for .bss clear) - ~1384ms - after this change
Signed-off-by: Przemyslaw Marczak p.marczak@samsung.com Cc: Minkyu Kang mk7.kang@samsung.com Cc: Akshay Saraswat akshay.s@samsung.com Cc: Simon Glass sjg@chromium.org Cc: Sjoerd Simons sjoerd.simons@collabora.co.uk --- include/configs/exynos-common.h | 3 +++ 1 file changed, 3 insertions(+)
diff --git a/include/configs/exynos-common.h b/include/configs/exynos-common.h index 1f3ee55..5c14c40 100644 --- a/include/configs/exynos-common.h +++ b/include/configs/exynos-common.h @@ -30,6 +30,9 @@ #define CONFIG_SKIP_LOWLEVEL_INIT #define CONFIG_BOARD_EARLY_INIT_F
+#define CONFIG_USE_ARCH_MEMCPY +#define CONFIG_USE_ARCH_MEMSET + /* Keep L2 Cache Disabled */ #define CONFIG_CMD_CACHE

For ARM architecture, enable the CONFIG_USE_ARCH_MEMSET/MEMCPY, will highly increase the memset/memcpy performance. This is able thanks to the ARM multiple register instructions.
Unfortunatelly the relocation is done without the cache enabled, so it takes some time, but zeroing the BSS memory takes much more longer, especially for the configs with big static buffers.
A quick test confirms, that the boot time improvement after using the arch memcpy for relocation has no significant meaning. The same test confirms that enable the memset for zeroing BSS, reduces the boot time.
So this patch enables the arch memset for zeroing the BSS after the relocation process. For ARM boards, this can be enabled in board configs by defining: 'CONFIG_USE_ARCH_MEMSET'.
This was tested on Trats2. A quick test with trace. Boot time from start to main_loop() entry: - ~1384ms - before this change - ~888ms - after this change
Signed-off-by: Przemyslaw Marczak p.marczak@samsung.com Cc: Albert Aribaud albert.u.boot@aribaud.net Cc: Tom Rini trini@ti.com --- arch/arm/lib/crt0.S | 10 +++++++++- 1 file changed, 9 insertions(+), 1 deletion(-)
diff --git a/arch/arm/lib/crt0.S b/arch/arm/lib/crt0.S index 22df3e5..fab3d2c 100644 --- a/arch/arm/lib/crt0.S +++ b/arch/arm/lib/crt0.S @@ -115,14 +115,22 @@ here: bl c_runtime_cpu_setup /* we still call old routine here */
ldr r0, =__bss_start /* this is auto-relocated! */ - ldr r1, =__bss_end /* this is auto-relocated! */
+#ifdef CONFIG_USE_ARCH_MEMSET + ldr r3, =__bss_end /* this is auto-relocated! */ + mov r1, #0x00000000 /* prepare zero to clear BSS */ + + subs r2, r3, r0 /* r2 = memset len */ + bl memset +#else + ldr r1, =__bss_end /* this is auto-relocated! */ mov r2, #0x00000000 /* prepare zero to clear BSS */
clbss_l:cmp r0, r1 /* while not at end of BSS */ strlo r2, [r0] /* clear 32-bit BSS word */ addlo r0, r0, #4 /* move to next */ blo clbss_l +#endif
bl coloured_LED_init bl red_led_on

Hello Przemyslaw,
On Wed, 28 Jan 2015 13:55:42 +0100, Przemyslaw Marczak p.marczak@samsung.com wrote:
For ARM architecture, enable the CONFIG_USE_ARCH_MEMSET/MEMCPY, will highly increase the memset/memcpy performance. This is able thanks to the ARM multiple register instructions.
Unfortunatelly the relocation is done without the cache enabled, so it takes some time, but zeroing the BSS memory takes much more longer, especially for the configs with big static buffers.
A quick test confirms, that the boot time improvement after using the arch memcpy for relocation has no significant meaning. The same test confirms that enable the memset for zeroing BSS, reduces the boot time.
So this patch enables the arch memset for zeroing the BSS after the relocation process. For ARM boards, this can be enabled in board configs by defining: 'CONFIG_USE_ARCH_MEMSET'.
Since the issue is that zeroing is done one word at a time, could we not simply clear r3 as well as r2 (possibly even r4 and r5 too) and do a double (possibly quadruple) write loop? That would avoid calling a libc routine from the almost sole file in U-Boot where a C environment is not necessarily granted.
Amicalement,

On Sun, Feb 01, 2015 at 03:38:42AM +0100, Albert ARIBAUD wrote:
Hello Przemyslaw,
On Wed, 28 Jan 2015 13:55:42 +0100, Przemyslaw Marczak p.marczak@samsung.com wrote:
For ARM architecture, enable the CONFIG_USE_ARCH_MEMSET/MEMCPY, will highly increase the memset/memcpy performance. This is able thanks to the ARM multiple register instructions.
Unfortunatelly the relocation is done without the cache enabled, so it takes some time, but zeroing the BSS memory takes much more longer, especially for the configs with big static buffers.
A quick test confirms, that the boot time improvement after using the arch memcpy for relocation has no significant meaning. The same test confirms that enable the memset for zeroing BSS, reduces the boot time.
So this patch enables the arch memset for zeroing the BSS after the relocation process. For ARM boards, this can be enabled in board configs by defining: 'CONFIG_USE_ARCH_MEMSET'.
Since the issue is that zeroing is done one word at a time, could we not simply clear r3 as well as r2 (possibly even r4 and r5 too) and do a double (possibly quadruple) write loop? That would avoid calling a libc routine from the almost sole file in U-Boot where a C environment is not necessarily granted.
So this brings up something I've wondered about for a long while. We have arch/arm/lib/mem{set,cpy}.S which are old copies from the linux kernel. The kernel uses them for all ARM platforms. Why do we not always use these functions? I have a very vague notion it was a size thing...

Hi Tom,
On Feb 2, 2015, at 19:25 , Tom Rini trini@ti.com wrote:
On Sun, Feb 01, 2015 at 03:38:42AM +0100, Albert ARIBAUD wrote:
Hello Przemyslaw,
On Wed, 28 Jan 2015 13:55:42 +0100, Przemyslaw Marczak p.marczak@samsung.com wrote:
For ARM architecture, enable the CONFIG_USE_ARCH_MEMSET/MEMCPY, will highly increase the memset/memcpy performance. This is able thanks to the ARM multiple register instructions.
Unfortunatelly the relocation is done without the cache enabled, so it takes some time, but zeroing the BSS memory takes much more longer, especially for the configs with big static buffers.
A quick test confirms, that the boot time improvement after using the arch memcpy for relocation has no significant meaning. The same test confirms that enable the memset for zeroing BSS, reduces the boot time.
So this patch enables the arch memset for zeroing the BSS after the relocation process. For ARM boards, this can be enabled in board configs by defining: 'CONFIG_USE_ARCH_MEMSET'.
Since the issue is that zeroing is done one word at a time, could we not simply clear r3 as well as r2 (possibly even r4 and r5 too) and do a double (possibly quadruple) write loop? That would avoid calling a libc routine from the almost sole file in U-Boot where a C environment is not necessarily granted.
So this brings up something I've wondered about for a long while. We have arch/arm/lib/mem{set,cpy}.S which are old copies from the linux kernel. The kernel uses them for all ARM platforms. Why do we not always use these functions? I have a very vague notion it was a size thing…
That is a good question. Are we being hobbled cause of MLO? If so we can use the short (and slow) methods in that case and use the fast methods in the normal case. It seems that this is warranted in this case.
However in the particular case of dfu I think it’s best to avoid the large static buffers. Or if we do use the large buffers let’s put them in a linker segment that does not get zeroed on start.
-- Tom
Regards
— Pantelis

On Mon, Feb 02, 2015 at 07:28:14PM +0200, Pantelis Antoniou wrote:
Hi Tom,
On Feb 2, 2015, at 19:25 , Tom Rini trini@ti.com wrote:
On Sun, Feb 01, 2015 at 03:38:42AM +0100, Albert ARIBAUD wrote:
Hello Przemyslaw,
On Wed, 28 Jan 2015 13:55:42 +0100, Przemyslaw Marczak p.marczak@samsung.com wrote:
For ARM architecture, enable the CONFIG_USE_ARCH_MEMSET/MEMCPY, will highly increase the memset/memcpy performance. This is able thanks to the ARM multiple register instructions.
Unfortunatelly the relocation is done without the cache enabled, so it takes some time, but zeroing the BSS memory takes much more longer, especially for the configs with big static buffers.
A quick test confirms, that the boot time improvement after using the arch memcpy for relocation has no significant meaning. The same test confirms that enable the memset for zeroing BSS, reduces the boot time.
So this patch enables the arch memset for zeroing the BSS after the relocation process. For ARM boards, this can be enabled in board configs by defining: 'CONFIG_USE_ARCH_MEMSET'.
Since the issue is that zeroing is done one word at a time, could we not simply clear r3 as well as r2 (possibly even r4 and r5 too) and do a double (possibly quadruple) write loop? That would avoid calling a libc routine from the almost sole file in U-Boot where a C environment is not necessarily granted.
So this brings up something I've wondered about for a long while. We have arch/arm/lib/mem{set,cpy}.S which are old copies from the linux kernel. The kernel uses them for all ARM platforms. Why do we not always use these functions? I have a very vague notion it was a size thing…
That is a good question. Are we being hobbled cause of MLO? If so we can use the short (and slow) methods in that case and use the fast methods in the normal case. It seems that this is warranted in this case.
I'm not sure, but I can test easily enough. But even then we may want to opt a few targets in to the current (slow) path and make the default the optimized path.
However in the particular case of dfu I think it’s best to avoid the large static buffers. Or if we do use the large buffers let’s put them in a linker segment that does not get zeroed on start.
Yes, I owe the rest of the series my attention too :)

On Sun, Feb 01, 2015 at 03:38:42AM +0100, Albert ARIBAUD wrote:
Hello Przemyslaw,
On Wed, 28 Jan 2015 13:55:42 +0100, Przemyslaw Marczak p.marczak@samsung.com wrote:
For ARM architecture, enable the CONFIG_USE_ARCH_MEMSET/MEMCPY, will highly increase the memset/memcpy performance. This is able thanks to the ARM multiple register instructions.
Unfortunatelly the relocation is done without the cache enabled, so it takes some time, but zeroing the BSS memory takes much more longer, especially for the configs with big static buffers.
A quick test confirms, that the boot time improvement after using the arch memcpy for relocation has no significant meaning. The same test confirms that enable the memset for zeroing BSS, reduces the boot time.
So this patch enables the arch memset for zeroing the BSS after the relocation process. For ARM boards, this can be enabled in board configs by defining: 'CONFIG_USE_ARCH_MEMSET'.
Since the issue is that zeroing is done one word at a time, could we not simply clear r3 as well as r2 (possibly even r4 and r5 too) and do a double (possibly quadruple) write loop? That would avoid calling a libc routine from the almost sole file in U-Boot where a C environment is not necessarily granted.
I want to jump up here again. Note that the arch memset/memcpy routines are in asm and I don't belive require a C environment. Why don't we simply use the asm versions for everyone and backport whatever we need from the kernel to re-sync there as it's not a choice there and it's a performance win too?

Hello Tom,
On 02/12/2015 04:37 PM, Tom Rini wrote:
On Sun, Feb 01, 2015 at 03:38:42AM +0100, Albert ARIBAUD wrote:
Hello Przemyslaw,
On Wed, 28 Jan 2015 13:55:42 +0100, Przemyslaw Marczak p.marczak@samsung.com wrote:
For ARM architecture, enable the CONFIG_USE_ARCH_MEMSET/MEMCPY, will highly increase the memset/memcpy performance. This is able thanks to the ARM multiple register instructions.
Unfortunatelly the relocation is done without the cache enabled, so it takes some time, but zeroing the BSS memory takes much more longer, especially for the configs with big static buffers.
A quick test confirms, that the boot time improvement after using the arch memcpy for relocation has no significant meaning. The same test confirms that enable the memset for zeroing BSS, reduces the boot time.
So this patch enables the arch memset for zeroing the BSS after the relocation process. For ARM boards, this can be enabled in board configs by defining: 'CONFIG_USE_ARCH_MEMSET'.
Since the issue is that zeroing is done one word at a time, could we not simply clear r3 as well as r2 (possibly even r4 and r5 too) and do a double (possibly quadruple) write loop? That would avoid calling a libc routine from the almost sole file in U-Boot where a C environment is not necessarily granted.
I want to jump up here again. Note that the arch memset/memcpy routines are in asm and I don't belive require a C environment. Why don't we simply use the asm versions for everyone and backport whatever we need from the kernel to re-sync there as it's not a choice there and it's a performance win too?
Right, for ARM the mentioned routines doesn't require C env. But if we could achieve some improvement in this place, then maybe it has sense to add some new code just for bss.
I will try to combine and make some timing tests on Monday.
Best regards,

For writing files, DFU implementation requires the file buffer with the len at least of file size. For big files it requires the same big buffer.
Previously the file buffer was allocated as a static variable, so it was a part of U-Boot .bss section. For 32MiB len of buffer we have 32MiB of additional space, required for this section.
The .bss needs to be cleared after the relocation. This introduces an additional boot delay at every start, but usually the dfu feature is not required at the standard boot, so the buffer should be allocated only if required.
This patch removes the static allocation of this buffer, and alloc it with memalign after first call of function: - dfu_fill_entity_mmc() and the buffer is freed on dfu_free_entity() call.
This was tested on Trats2. A quick test with trace. Boot time from start to main_loop() entry: - ~888ms - before this change (arch memset enabled for .bss clear) - ~464ms - after this change
Signed-off-by: Przemyslaw Marczak p.marczak@samsung.com Cc: Lukasz Majewski l.majewski@samsung.com Cc: Stephen Warren swarren@nvidia.com Cc: Pantelis Antoniou panto@antoniou-consulting.com Cc: Tom Rini trini@ti.com Cc: Marek Vasut marek.vasut@gmail.com --- drivers/dfu/dfu_mmc.c | 25 ++++++++++++++++++++++--- 1 file changed, 22 insertions(+), 3 deletions(-)
diff --git a/drivers/dfu/dfu_mmc.c b/drivers/dfu/dfu_mmc.c index 62d72fe..fd865e1 100644 --- a/drivers/dfu/dfu_mmc.c +++ b/drivers/dfu/dfu_mmc.c @@ -16,8 +16,7 @@ #include <fat.h> #include <mmc.h>
-static unsigned char __aligned(CONFIG_SYS_CACHELINE_SIZE) - dfu_file_buf[CONFIG_SYS_DFU_MAX_FILE_SIZE]; +static unsigned char *dfu_file_buf; static long dfu_file_buf_len;
static int mmc_access_part(struct dfu_entity *dfu, struct mmc *mmc, int part) @@ -211,7 +210,7 @@ int dfu_flush_medium_mmc(struct dfu_entity *dfu)
if (dfu->layout != DFU_RAW_ADDR) { /* Do stuff here. */ - ret = mmc_file_op(DFU_OP_WRITE, dfu, &dfu_file_buf, + ret = mmc_file_op(DFU_OP_WRITE, dfu, dfu_file_buf, &dfu_file_buf_len);
/* Now that we're done */ @@ -263,6 +262,14 @@ int dfu_read_medium_mmc(struct dfu_entity *dfu, u64 offset, void *buf, return ret; }
+void dfu_free_entity_mmc(struct dfu_entity *dfu) +{ + if (dfu_file_buf) { + free(dfu_file_buf); + dfu_file_buf = NULL; + } +} + /* * @param s Parameter string containing space-separated arguments: * 1st: @@ -370,6 +377,18 @@ int dfu_fill_entity_mmc(struct dfu_entity *dfu, char *devstr, char *s) dfu->write_medium = dfu_write_medium_mmc; dfu->flush_medium = dfu_flush_medium_mmc; dfu->inited = 0; + dfu->free_entity = dfu_free_entity_mmc; + + /* Check if file buffer is ready */ + if (!dfu_file_buf) { + dfu_file_buf = memalign(CONFIG_SYS_CACHELINE_SIZE, + CONFIG_SYS_DFU_MAX_FILE_SIZE); + if (!dfu_file_buf) { + error("Could not memalign 0x%x bytes", + CONFIG_SYS_DFU_MAX_FILE_SIZE); + return -ENOMEM; + } + }
return 0; }

Hi Przemyslaw,
On 28.01.2015 13:55, Przemyslaw Marczak wrote:
This patchset reduces the boot time for ARM architecture, Exynos boards, and boards with DFU enabled(ARM).
For tested Trats2 device, this was done in three steps.
First was enable the arch memcpy and memset. The second step was enable memset for .bss clear. The third step for reduce this operation is to keep .bss section small as possible.
The .bss section will grow if we have a lot of static variables. This section is cleared before jump to the relocated U-Boot, and it's done word by word. To reduce the time for this step, we can enable arch memset, which uses multiple ARM registers.
For configs with DFU enabled, we can find the dfu buffer in this section, which has at least 8MB (32MB for trats2). This is a lot of useless data, which is not required for standard boot. So this buffer should be dynamic allocated.
Przemyslaw Marczak (3): exynos: config: enable arch memcpy and arch memset arm: relocation: clear .bss section with arch memset if defined dfu: mmc: file buffer: remove static allocation
arch/arm/lib/crt0.S | 10 +++++++++- drivers/dfu/dfu_mmc.c | 25 ++++++++++++++++++++++--- include/configs/exynos-common.h | 3 +++ 3 files changed, 34 insertions(+), 4 deletions(-)
Looking at the commit messages of this patchset I can conclude that your overall boot time reduction is:
from ~1527ms to ~464ms
This is amazing! Congrats. :)
We really should in general make more use of the optimized functions and take care that the buffers (e.g. the DFU buffer in this case) are used in a sane way.
Thanks, Stefan

Hello Stefan,
On 01/28/2015 02:12 PM, Stefan Roese wrote:
Hi Przemyslaw,
On 28.01.2015 13:55, Przemyslaw Marczak wrote:
This patchset reduces the boot time for ARM architecture, Exynos boards, and boards with DFU enabled(ARM).
For tested Trats2 device, this was done in three steps.
First was enable the arch memcpy and memset. The second step was enable memset for .bss clear. The third step for reduce this operation is to keep .bss section small as possible.
The .bss section will grow if we have a lot of static variables. This section is cleared before jump to the relocated U-Boot, and it's done word by word. To reduce the time for this step, we can enable arch memset, which uses multiple ARM registers.
For configs with DFU enabled, we can find the dfu buffer in this section, which has at least 8MB (32MB for trats2). This is a lot of useless data, which is not required for standard boot. So this buffer should be dynamic allocated.
Przemyslaw Marczak (3): exynos: config: enable arch memcpy and arch memset arm: relocation: clear .bss section with arch memset if defined dfu: mmc: file buffer: remove static allocation
arch/arm/lib/crt0.S | 10 +++++++++- drivers/dfu/dfu_mmc.c | 25 ++++++++++++++++++++++--- include/configs/exynos-common.h | 3 +++ 3 files changed, 34 insertions(+), 4 deletions(-)
Looking at the commit messages of this patchset I can conclude that your overall boot time reduction is:
from ~1527ms to ~464ms
This is amazing! Congrats. :)
Thank you. I was also amazed.
The time results are taken with from the clock cycle counter, I think it's reliable. Some day I would like to check it using the oscilloscope.
We really should in general make more use of the optimized functions and take care that the buffers (e.g. the DFU buffer in this case) are used in a sane way.
Thanks, Stefan
Yes you're right, I thought that Exynos config has enabled arch memcpy/set lib, before I checked this...
Best regards,

Hi Przemyslaw,
On Jan 28, 2015, at 16:10 , Przemyslaw Marczak p.marczak@samsung.com wrote:
Hello Stefan,
On 01/28/2015 02:12 PM, Stefan Roese wrote:
Hi Przemyslaw,
On 28.01.2015 13:55, Przemyslaw Marczak wrote:
This patchset reduces the boot time for ARM architecture, Exynos boards, and boards with DFU enabled(ARM).
For tested Trats2 device, this was done in three steps.
First was enable the arch memcpy and memset. The second step was enable memset for .bss clear. The third step for reduce this operation is to keep .bss section small as possible.
The .bss section will grow if we have a lot of static variables. This section is cleared before jump to the relocated U-Boot, and it's done word by word. To reduce the time for this step, we can enable arch memset, which uses multiple ARM registers.
For configs with DFU enabled, we can find the dfu buffer in this section, which has at least 8MB (32MB for trats2). This is a lot of useless data, which is not required for standard boot. So this buffer should be dynamic allocated.
Przemyslaw Marczak (3): exynos: config: enable arch memcpy and arch memset arm: relocation: clear .bss section with arch memset if defined dfu: mmc: file buffer: remove static allocation
arch/arm/lib/crt0.S | 10 +++++++++- drivers/dfu/dfu_mmc.c | 25 ++++++++++++++++++++++--- include/configs/exynos-common.h | 3 +++ 3 files changed, 34 insertions(+), 4 deletions(-)
Looking at the commit messages of this patchset I can conclude that your overall boot time reduction is:
from ~1527ms to ~464ms
This is amazing! Congrats. :)
Thank you. I was also amazed.
The time results are taken with from the clock cycle counter, I think it's reliable. Some day I would like to check it using the oscilloscope.
We really should in general make more use of the optimized functions and take care that the buffers (e.g. the DFU buffer in this case) are used in a sane way.
Thanks, Stefan
Yes you're right, I thought that Exynos config has enabled arch memcpy/set lib, before I checked this…
Those numbers are indeed incredible; I suppose the caches are disabled?
Best regards,
Przemyslaw Marczak Samsung R&D Institute Poland Samsung Electronics p.marczak@samsung.com
Regards
— Pantelis

Hello,
On 01/28/2015 03:18 PM, Pantelis Antoniou wrote:
Hi Przemyslaw,
On Jan 28, 2015, at 16:10 , Przemyslaw Marczak p.marczak@samsung.com wrote:
Hello Stefan,
On 01/28/2015 02:12 PM, Stefan Roese wrote:
Hi Przemyslaw,
On 28.01.2015 13:55, Przemyslaw Marczak wrote:
This patchset reduces the boot time for ARM architecture, Exynos boards, and boards with DFU enabled(ARM).
For tested Trats2 device, this was done in three steps.
First was enable the arch memcpy and memset. The second step was enable memset for .bss clear. The third step for reduce this operation is to keep .bss section small as possible.
The .bss section will grow if we have a lot of static variables. This section is cleared before jump to the relocated U-Boot, and it's done word by word. To reduce the time for this step, we can enable arch memset, which uses multiple ARM registers.
For configs with DFU enabled, we can find the dfu buffer in this section, which has at least 8MB (32MB for trats2). This is a lot of useless data, which is not required for standard boot. So this buffer should be dynamic allocated.
Przemyslaw Marczak (3): exynos: config: enable arch memcpy and arch memset arm: relocation: clear .bss section with arch memset if defined dfu: mmc: file buffer: remove static allocation
arch/arm/lib/crt0.S | 10 +++++++++- drivers/dfu/dfu_mmc.c | 25 ++++++++++++++++++++++--- include/configs/exynos-common.h | 3 +++ 3 files changed, 34 insertions(+), 4 deletions(-)
Looking at the commit messages of this patchset I can conclude that your overall boot time reduction is:
from ~1527ms to ~464ms
This is amazing! Congrats. :)
Thank you. I was also amazed.
The time results are taken with from the clock cycle counter, I think it's reliable. Some day I would like to check it using the oscilloscope.
We really should in general make more use of the optimized functions and take care that the buffers (e.g. the DFU buffer in this case) are used in a sane way.
Thanks, Stefan
Yes you're right, I thought that Exynos config has enabled arch memcpy/set lib, before I checked this…
Those numbers are indeed incredible; I suppose the caches are disabled?
The caches are enabled after the relocation, in one of board_init_r calls.
Best regards,
Przemyslaw Marczak Samsung R&D Institute Poland Samsung Electronics p.marczak@samsung.com
Regards
— Pantelis
Best regards,

Hi Przemyslaw,
On Jan 28, 2015, at 16:30 , Przemyslaw Marczak p.marczak@samsung.com wrote:
Hello,
On 01/28/2015 03:18 PM, Pantelis Antoniou wrote:
Hi Przemyslaw,
On Jan 28, 2015, at 16:10 , Przemyslaw Marczak p.marczak@samsung.com wrote:
Hello Stefan,
On 01/28/2015 02:12 PM, Stefan Roese wrote:
Hi Przemyslaw,
On 28.01.2015 13:55, Przemyslaw Marczak wrote:
This patchset reduces the boot time for ARM architecture, Exynos boards, and boards with DFU enabled(ARM).
For tested Trats2 device, this was done in three steps.
First was enable the arch memcpy and memset. The second step was enable memset for .bss clear. The third step for reduce this operation is to keep .bss section small as possible.
The .bss section will grow if we have a lot of static variables. This section is cleared before jump to the relocated U-Boot, and it's done word by word. To reduce the time for this step, we can enable arch memset, which uses multiple ARM registers.
For configs with DFU enabled, we can find the dfu buffer in this section, which has at least 8MB (32MB for trats2). This is a lot of useless data, which is not required for standard boot. So this buffer should be dynamic allocated.
Przemyslaw Marczak (3): exynos: config: enable arch memcpy and arch memset arm: relocation: clear .bss section with arch memset if defined dfu: mmc: file buffer: remove static allocation
arch/arm/lib/crt0.S | 10 +++++++++- drivers/dfu/dfu_mmc.c | 25 ++++++++++++++++++++++--- include/configs/exynos-common.h | 3 +++ 3 files changed, 34 insertions(+), 4 deletions(-)
Looking at the commit messages of this patchset I can conclude that your overall boot time reduction is:
from ~1527ms to ~464ms
This is amazing! Congrats. :)
Thank you. I was also amazed.
The time results are taken with from the clock cycle counter, I think it's reliable. Some day I would like to check it using the oscilloscope.
We really should in general make more use of the optimized functions and take care that the buffers (e.g. the DFU buffer in this case) are used in a sane way.
Thanks, Stefan
Yes you're right, I thought that Exynos config has enabled arch memcpy/set lib, before I checked this…
Those numbers are indeed incredible; I suppose the caches are disabled?
The caches are enabled after the relocation, in one of board_init_r calls.
How big is this .bss section? We’re talking about something that takes 1.5secs to clear a few MBs of memory? This is horrible.
Even at the optimized case .5secs is too much.
Best regards,
Przemyslaw Marczak Samsung R&D Institute Poland Samsung Electronics p.marczak@samsung.com
Regards
— Pantelis
Best regards,
Przemyslaw Marczak Samsung R&D Institute Poland Samsung Electronics p.marczak@samsung.com
Regards
— Pantelis

Hello,
On 01/28/2015 03:34 PM, Pantelis Antoniou wrote:
Hi Przemyslaw,
On Jan 28, 2015, at 16:30 , Przemyslaw Marczak p.marczak@samsung.com wrote:
Hello,
On 01/28/2015 03:18 PM, Pantelis Antoniou wrote:
Hi Przemyslaw,
On Jan 28, 2015, at 16:10 , Przemyslaw Marczak p.marczak@samsung.com wrote:
Hello Stefan,
On 01/28/2015 02:12 PM, Stefan Roese wrote:
Hi Przemyslaw,
On 28.01.2015 13:55, Przemyslaw Marczak wrote:
This patchset reduces the boot time for ARM architecture, Exynos boards, and boards with DFU enabled(ARM).
For tested Trats2 device, this was done in three steps.
First was enable the arch memcpy and memset. The second step was enable memset for .bss clear. The third step for reduce this operation is to keep .bss section small as possible.
The .bss section will grow if we have a lot of static variables. This section is cleared before jump to the relocated U-Boot, and it's done word by word. To reduce the time for this step, we can enable arch memset, which uses multiple ARM registers.
For configs with DFU enabled, we can find the dfu buffer in this section, which has at least 8MB (32MB for trats2). This is a lot of useless data, which is not required for standard boot. So this buffer should be dynamic allocated.
Przemyslaw Marczak (3): exynos: config: enable arch memcpy and arch memset arm: relocation: clear .bss section with arch memset if defined dfu: mmc: file buffer: remove static allocation
arch/arm/lib/crt0.S | 10 +++++++++- drivers/dfu/dfu_mmc.c | 25 ++++++++++++++++++++++--- include/configs/exynos-common.h | 3 +++ 3 files changed, 34 insertions(+), 4 deletions(-)
Looking at the commit messages of this patchset I can conclude that your overall boot time reduction is:
from ~1527ms to ~464ms
This is amazing! Congrats. :)
Thank you. I was also amazed.
The time results are taken with from the clock cycle counter, I think it's reliable. Some day I would like to check it using the oscilloscope.
We really should in general make more use of the optimized functions and take care that the buffers (e.g. the DFU buffer in this case) are used in a sane way.
Thanks, Stefan
Yes you're right, I thought that Exynos config has enabled arch memcpy/set lib, before I checked this…
Those numbers are indeed incredible; I suppose the caches are disabled?
The caches are enabled after the relocation, in one of board_init_r calls.
How big is this .bss section? We’re talking about something that takes 1.5secs to clear a few MBs of memory? This is horrible.
Even at the optimized case .5secs is too much.
The .bss section was about 32.3 MB, where the 32MB were required for dfu file buf. With the third patch, the .bss section is about 300k. I'm not saying, that this is fast booting. But the achieved improvement is really good for this config
Best regards,
Przemyslaw Marczak Samsung R&D Institute Poland Samsung Electronics p.marczak@samsung.com
Regards
— Pantelis
Best regards,
Przemyslaw Marczak Samsung R&D Institute Poland Samsung Electronics p.marczak@samsung.com
Regards
— Pantelis
Best regards,

Hello,
On 01/28/2015 01:55 PM, Przemyslaw Marczak wrote:
This patchset reduces the boot time for ARM architecture, Exynos boards, and boards with DFU enabled(ARM).
For tested Trats2 device, this was done in three steps.
First was enable the arch memcpy and memset. The second step was enable memset for .bss clear. The third step for reduce this operation is to keep .bss section small as possible.
The .bss section will grow if we have a lot of static variables. This section is cleared before jump to the relocated U-Boot, and it's done word by word. To reduce the time for this step, we can enable arch memset, which uses multiple ARM registers.
For configs with DFU enabled, we can find the dfu buffer in this section, which has at least 8MB (32MB for trats2). This is a lot of useless data, which is not required for standard boot. So this buffer should be dynamic allocated.
Przemyslaw Marczak (3): exynos: config: enable arch memcpy and arch memset arm: relocation: clear .bss section with arch memset if defined dfu: mmc: file buffer: remove static allocation
arch/arm/lib/crt0.S | 10 +++++++++- drivers/dfu/dfu_mmc.c | 25 ++++++++++++++++++++++--- include/configs/exynos-common.h | 3 +++ 3 files changed, 34 insertions(+), 4 deletions(-)
So I made some additional tests with the oscilloscope.
Quick about the measurement: The board is Odroid X2; Exynos4412(this one have gpio header).
Time is measured between change the state of one GPIo pin.
GPIO HI - set the gpio register in "reset" label in: arch/arm/cpu/armv7/start.S GPIO LO - set gpio register with "bootcmd" with setting register by "mw.l ..."
${bootdelay}=0
odroid_defconfig = .bss ~32.3MB
I tested few changes: - 850ms - no changes: - 840ms - + CONFIG_USE_ARCH_MEMCPY/MEMSET - 540ms - .bss memset (patch 2) - 210ms - dynamic allocation dfu file buf (patch 3)
And the next is interesting. odroid_defconfig has more than 80MB for malloc (we need about 64mb for the DFU now, to be able write 32MB file).
This is the CONFIG_SYS_MALLOC_LEN. And the memory area for malloc is set to 0 in function mem_malloc_init(). So for this config that function sets more than 80MB to zero.
This is not good, because we shouldn't expect zeroed memory returned by malloc pointer. This is a job for calloc.
Especially if some command expects zeroed memory after malloc, probably after few next calls - it can crash...
For the testing purposes I changed the memset area in mem_malloc_init(). The CONFIG_SYS_MALLOC_LEN is unchanged, so the dfu can still alloc 2x32MB...
The results: - 158ms - malloc memset len: 40MB - 109ms - malloc memset len: 1MB
And a quick test for Trats2 with trace clock cycle counter: - 333ms - malloc memset len: 1MB (for the standard config it was more than 1520ms)
The malloc memset can't be removed now, because it requires check/change to calloc a lot of calls, but the board can boot if I set this to 256K.
So the final improvement which could be achieved for the odroid config is 850ms -> 109 ms. This is about 8 times faster.
And the tested boards difference: - Trats2 - 800MHz - Odroid X2 - 1000MHz - different BL1/BL2
Now I'm not so sure about the measurement reliability using the trace.
The Trats2 has no gpios header, and now I don't have time for the combinations.
So enable the DFU in the board config will increase the boot time. But the real reason is that the malloc memory area is set to zero on boot.
I think, that we should follow the malloc/calloc/realloc differences like in this description: http://man7.org/linux/man-pages/man3/malloc.3.html
Now I go for some holidays, and probably I will be unreachable until 9-th February. Sorry for troubles.
Best regards,

Dear All,
And the next is interesting. odroid_defconfig has more than 80MB for malloc (we need about 64mb for the DFU now, to be able write 32MB file).
This is the CONFIG_SYS_MALLOC_LEN. And the memory area for malloc is set to 0 in function mem_malloc_init(). So for this config that function sets more than 80MB to zero.
This is not good, because we shouldn't expect zeroed memory returned by malloc pointer. This is a job for calloc.
Especially if some command expects zeroed memory after malloc, probably after few next calls - it can crash...
I think that the above excerpt is _really_ important and should be discussed.
I've "cut" it from the original post, so it won't get lost between the lines.
It seems really strange, that malloc() area is cleared after relocation. Which means that all "first" malloc'ed buffers get implicitly zeroed.
Przemek is right here that this zeroing shouldn't be performed.
I'm also concerned about potential bugs, which show up (or even worse - won't show up soon) after this change.
Hence, I would like to ask directly the community about the possible solutions.
Please look at: ./common/dlmalloc.c mem_alloc_init() function [1].
On the one hand removing memset() at [1] speeds up booting time and makes malloc() doing what is is supposed to do.
On the other hand there might be in space some boards, which rely on this memset and without it some wired things may start to happening.

Hi Lukasz,
On 2 February 2015 at 01:46, Lukasz Majewski l.majewski@samsung.com wrote:
Dear All,
And the next is interesting. odroid_defconfig has more than 80MB for malloc (we need about 64mb for the DFU now, to be able write 32MB file).
This is the CONFIG_SYS_MALLOC_LEN. And the memory area for malloc is set to 0 in function mem_malloc_init(). So for this config that function sets more than 80MB to zero.
This is not good, because we shouldn't expect zeroed memory returned by malloc pointer. This is a job for calloc.
Especially if some command expects zeroed memory after malloc, probably after few next calls - it can crash...
I think that the above excerpt is _really_ important and should be discussed.
I've "cut" it from the original post, so it won't get lost between the lines.
It seems really strange, that malloc() area is cleared after relocation. Which means that all "first" malloc'ed buffers get implicitly zeroed.
Przemek is right here that this zeroing shouldn't be performed.
I'm also concerned about potential bugs, which show up (or even worse - won't show up soon) after this change.
Hence, I would like to ask directly the community about the possible solutions.
Please look at: ./common/dlmalloc.c mem_alloc_init() function [1].
On the one hand removing memset() at [1] speeds up booting time and makes malloc() doing what is is supposed to do.
On the other hand there might be in space some boards, which rely on this memset and without it some wired things may start to happening.
I think removing it is a good idea. It was one optimisation that I did for boot time in the Chromium tree. If you do it now (and Tom agrees) then there is plenty of time to test for this release cycle. You could go further and add a test CONFIG which fills it with some other non-zero value.
Regards, Simon

Hi Simon,
Hi Lukasz,
On 2 February 2015 at 01:46, Lukasz Majewski l.majewski@samsung.com wrote:
Dear All,
And the next is interesting. odroid_defconfig has more than 80MB for malloc (we need about 64mb for the DFU now, to be able write 32MB file).
This is the CONFIG_SYS_MALLOC_LEN. And the memory area for malloc is set to 0 in function mem_malloc_init(). So for this config that function sets more than 80MB to zero.
This is not good, because we shouldn't expect zeroed memory returned by malloc pointer. This is a job for calloc.
Especially if some command expects zeroed memory after malloc, probably after few next calls - it can crash...
I think that the above excerpt is _really_ important and should be discussed.
I've "cut" it from the original post, so it won't get lost between the lines.
It seems really strange, that malloc() area is cleared after relocation. Which means that all "first" malloc'ed buffers get implicitly zeroed.
Przemek is right here that this zeroing shouldn't be performed.
I'm also concerned about potential bugs, which show up (or even worse - won't show up soon) after this change.
Hence, I would like to ask directly the community about the possible solutions.
Please look at: ./common/dlmalloc.c mem_alloc_init() function [1].
On the one hand removing memset() at [1] speeds up booting time and makes malloc() doing what is is supposed to do.
On the other hand there might be in space some boards, which rely on this memset and without it some wired things may start to happening.
I think removing it is a good idea. It was one optimisation that I did for boot time in the Chromium tree. If you do it now (and Tom agrees) then there is plenty of time to test for this release cycle. You could go further and add a test CONFIG which fills it with some other non-zero value.
Tom, is such approach acceptable for you?
Regards, Simon

On Thu, Feb 05, 2015 at 10:51:00AM +0100, Lukasz Majewski wrote:
Hi Simon,
Hi Lukasz,
On 2 February 2015 at 01:46, Lukasz Majewski l.majewski@samsung.com wrote:
Dear All,
And the next is interesting. odroid_defconfig has more than 80MB for malloc (we need about 64mb for the DFU now, to be able write 32MB file).
This is the CONFIG_SYS_MALLOC_LEN. And the memory area for malloc is set to 0 in function mem_malloc_init(). So for this config that function sets more than 80MB to zero.
This is not good, because we shouldn't expect zeroed memory returned by malloc pointer. This is a job for calloc.
Especially if some command expects zeroed memory after malloc, probably after few next calls - it can crash...
I think that the above excerpt is _really_ important and should be discussed.
I've "cut" it from the original post, so it won't get lost between the lines.
It seems really strange, that malloc() area is cleared after relocation. Which means that all "first" malloc'ed buffers get implicitly zeroed.
Przemek is right here that this zeroing shouldn't be performed.
I'm also concerned about potential bugs, which show up (or even worse - won't show up soon) after this change.
Hence, I would like to ask directly the community about the possible solutions.
Please look at: ./common/dlmalloc.c mem_alloc_init() function [1].
On the one hand removing memset() at [1] speeds up booting time and makes malloc() doing what is is supposed to do.
On the other hand there might be in space some boards, which rely on this memset and without it some wired things may start to happening.
I think removing it is a good idea. It was one optimisation that I did for boot time in the Chromium tree. If you do it now (and Tom agrees) then there is plenty of time to test for this release cycle. You could go further and add a test CONFIG which fills it with some other non-zero value.
Tom, is such approach acceptable for you?
I was thinking at first we should default to a poisoned value. But given what we're seeing with generic board updates (lots of boards aren't even build-tested at every release which isn't really a surprise), I think the "funky" boards which may exist are probably not going to be seen for a while anyhow so we'd have to default to a poison for a long while. So yes, lets just add a CONFIG option (and Kconfig line) to optionally do it and default to no memset.
... but I just audited everyone doing "malloc (" and found a few things to fixup so we really do want to take a poke around.

Hello,
On 02/12/2015 05:07 PM, Tom Rini wrote:
On Thu, Feb 05, 2015 at 10:51:00AM +0100, Lukasz Majewski wrote:
Hi Simon,
Hi Lukasz,
On 2 February 2015 at 01:46, Lukasz Majewski l.majewski@samsung.com wrote:
Dear All,
And the next is interesting. odroid_defconfig has more than 80MB for malloc (we need about 64mb for the DFU now, to be able write 32MB file).
This is the CONFIG_SYS_MALLOC_LEN. And the memory area for malloc is set to 0 in function mem_malloc_init(). So for this config that function sets more than 80MB to zero.
This is not good, because we shouldn't expect zeroed memory returned by malloc pointer. This is a job for calloc.
Especially if some command expects zeroed memory after malloc, probably after few next calls - it can crash...
I think that the above excerpt is _really_ important and should be discussed.
I've "cut" it from the original post, so it won't get lost between the lines.
It seems really strange, that malloc() area is cleared after relocation. Which means that all "first" malloc'ed buffers get implicitly zeroed.
Przemek is right here that this zeroing shouldn't be performed.
I'm also concerned about potential bugs, which show up (or even worse - won't show up soon) after this change.
Hence, I would like to ask directly the community about the possible solutions.
Please look at: ./common/dlmalloc.c mem_alloc_init() function [1].
On the one hand removing memset() at [1] speeds up booting time and makes malloc() doing what is is supposed to do.
On the other hand there might be in space some boards, which rely on this memset and without it some wired things may start to happening.
I think removing it is a good idea. It was one optimisation that I did for boot time in the Chromium tree. If you do it now (and Tom agrees) then there is plenty of time to test for this release cycle. You could go further and add a test CONFIG which fills it with some other non-zero value.
Tom, is such approach acceptable for you?
I was thinking at first we should default to a poisoned value. But given what we're seeing with generic board updates (lots of boards aren't even build-tested at every release which isn't really a surprise), I think the "funky" boards which may exist are probably not going to be seen for a while anyhow so we'd have to default to a poison for a long while. So yes, lets just add a CONFIG option (and Kconfig line) to optionally do it and default to no memset.
... but I just audited everyone doing "malloc (" and found a few things to fixup so we really do want to take a poke around.
The present malloc implementation, which includes the memset at init, could be little confusing. Because of this memset. the "few" first calls of malloc will always return a pointer to the zeroed memory region. Of course, the only calloc do the right job and set with zeros the memory region, which was used by any previous malloc call.
So, maybe some code expects zeroed memory also when using malloc.
In this case, I would like to skip this memset as an option.
This also requires skip some checking in calloc code.
I will send the patch with such option for Kconfig.
Best regards,

On Fri, Feb 13, 2015 at 04:48:15PM +0100, Przemyslaw Marczak wrote:
Hello,
On 02/12/2015 05:07 PM, Tom Rini wrote:
On Thu, Feb 05, 2015 at 10:51:00AM +0100, Lukasz Majewski wrote:
Hi Simon,
Hi Lukasz,
On 2 February 2015 at 01:46, Lukasz Majewski l.majewski@samsung.com wrote:
Dear All,
And the next is interesting. odroid_defconfig has more than 80MB for malloc (we need about 64mb for the DFU now, to be able write 32MB file).
This is the CONFIG_SYS_MALLOC_LEN. And the memory area for malloc is set to 0 in function mem_malloc_init(). So for this config that function sets more than 80MB to zero.
This is not good, because we shouldn't expect zeroed memory returned by malloc pointer. This is a job for calloc.
Especially if some command expects zeroed memory after malloc, probably after few next calls - it can crash...
I think that the above excerpt is _really_ important and should be discussed.
I've "cut" it from the original post, so it won't get lost between the lines.
It seems really strange, that malloc() area is cleared after relocation. Which means that all "first" malloc'ed buffers get implicitly zeroed.
Przemek is right here that this zeroing shouldn't be performed.
I'm also concerned about potential bugs, which show up (or even worse - won't show up soon) after this change.
Hence, I would like to ask directly the community about the possible solutions.
Please look at: ./common/dlmalloc.c mem_alloc_init() function [1].
On the one hand removing memset() at [1] speeds up booting time and makes malloc() doing what is is supposed to do.
On the other hand there might be in space some boards, which rely on this memset and without it some wired things may start to happening.
I think removing it is a good idea. It was one optimisation that I did for boot time in the Chromium tree. If you do it now (and Tom agrees) then there is plenty of time to test for this release cycle. You could go further and add a test CONFIG which fills it with some other non-zero value.
Tom, is such approach acceptable for you?
I was thinking at first we should default to a poisoned value. But given what we're seeing with generic board updates (lots of boards aren't even build-tested at every release which isn't really a surprise), I think the "funky" boards which may exist are probably not going to be seen for a while anyhow so we'd have to default to a poison for a long while. So yes, lets just add a CONFIG option (and Kconfig line) to optionally do it and default to no memset.
... but I just audited everyone doing "malloc (" and found a few things to fixup so we really do want to take a poke around.
The present malloc implementation, which includes the memset at init, could be little confusing. Because of this memset. the "few" first calls of malloc will always return a pointer to the zeroed memory region. Of course, the only calloc do the right job and set with zeros the memory region, which was used by any previous malloc call.
So, maybe some code expects zeroed memory also when using malloc.
In this case, I would like to skip this memset as an option.
This also requires skip some checking in calloc code.
I will send the patch with such option for Kconfig.
There might be a few things which are depending on this odd behviour but they shouldn't. Most things are zeroing out twice since it's normal to malloc+memset if you care about contents being zero. The only real hang-up I see is that we're half way to release, more or less. Maybe we do this early post v2015.04 release to give more time to catch the odd cases? And spend some time now auditing.

Hello Simon,
On 02/02/2015 07:15 PM, Simon Glass wrote:
Hi Lukasz,
On 2 February 2015 at 01:46, Lukasz Majewski l.majewski@samsung.com wrote:
Dear All,
And the next is interesting. odroid_defconfig has more than 80MB for malloc (we need about 64mb for the DFU now, to be able write 32MB file).
This is the CONFIG_SYS_MALLOC_LEN. And the memory area for malloc is set to 0 in function mem_malloc_init(). So for this config that function sets more than 80MB to zero.
This is not good, because we shouldn't expect zeroed memory returned by malloc pointer. This is a job for calloc.
Especially if some command expects zeroed memory after malloc, probably after few next calls - it can crash...
I think that the above excerpt is _really_ important and should be discussed.
I've "cut" it from the original post, so it won't get lost between the lines.
It seems really strange, that malloc() area is cleared after relocation. Which means that all "first" malloc'ed buffers get implicitly zeroed.
Przemek is right here that this zeroing shouldn't be performed.
I'm also concerned about potential bugs, which show up (or even worse - won't show up soon) after this change.
Hence, I would like to ask directly the community about the possible solutions.
Please look at: ./common/dlmalloc.c mem_alloc_init() function [1].
On the one hand removing memset() at [1] speeds up booting time and makes malloc() doing what is is supposed to do.
On the other hand there might be in space some boards, which rely on this memset and without it some wired things may start to happening.
I think removing it is a good idea. It was one optimisation that I did for boot time in the Chromium tree. If you do it now (and Tom agrees) then there is plenty of time to test for this release cycle. You could go further and add a test CONFIG which fills it with some other non-zero value.
Regards, Simon
Filling the malloc area with some pattern was a good idea to find out, why my trats2 had some issue after skip the memset with zeros in malloc init.
And actually the issue was not in malloc call, but it was in calloc.
The present implementation assumes that memory reserved for malloc is zeroed at init. And the calloc do the check, how much of the allocated memory is fresh(doesn't require zeroing).
After skip this fresh memory check, the calloc works fine.
Anyway, I think that this should be optional and tested on every config, before enable.
I would like to test something and will send the updated patch set on Monday.
Best regards,
participants (7)
-
Albert ARIBAUD
-
Lukasz Majewski
-
Pantelis Antoniou
-
Przemyslaw Marczak
-
Simon Glass
-
Stefan Roese
-
Tom Rini