Re: [U-Boot] [PATCH 2/3] arm: relocation: clear .bss section with arch memset if defined

On 2 Feb 2015, bpringlemeir@nbsps.com wrote:
On 31 Jan 2015, albert.u.boot@aribaud.net wrote:
Hello Przemyslaw,
On Wed, 28 Jan 2015 13:55:42 +0100, Przemyslaw Marczak p.marczak@samsung.com wrote:
For the ARM architecture, enabling CONFIG_USE_ARCH_MEMSET/MEMCPY greatly increases memset/memcpy performance. This is possible thanks to the ARM multiple-register transfer instructions.
Unfortunately, the relocation is done with the cache disabled, so it takes some time, but zeroing the BSS memory takes much longer, especially for configs with big static buffers.
A quick test confirms that the boot-time improvement from using the arch memcpy for relocation is insignificant. The same test confirms that enabling memset for zeroing the BSS reduces the boot time.
So this patch enables the arch memset for zeroing the BSS after the relocation process. For ARM boards, this can be enabled in board configs by defining 'CONFIG_USE_ARCH_MEMSET'.
Since the issue is that zeroing is done one word at a time, could we not simply clear r3 as well as r2 (possibly even r4 and r5 too) and do a double (possibly quadruple) write loop? That would avoid calling a libc routine from almost the only file in U-Boot where a C environment is not necessarily guaranteed.
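(For illustration, such a double-write loop might look like the following; an untested sketch, assuming ARM state and a total BSS size that is a multiple of 8 bytes:)

	ldr	r0, =__bss_start	/* this is auto-relocated! */
	ldr	r1, =__bss_end		/* this is auto-relocated! */
	mov	r2, #0x00000000		/* prepare zeros to clear BSS */
	mov	r3, #0x00000000
clbss_l:cmp	r0, r1			/* while not at end of BSS */
	stmlo	r0!, {r2, r3}		/* clear two BSS words, r0 += 8 */
	blo	clbss_l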
I thought the same thing...
diff --git a/arch/arm/lib/crt0.S b/arch/arm/lib/crt0.S
index 22df3e5..fab3d2c 100644
--- a/arch/arm/lib/crt0.S
+++ b/arch/arm/lib/crt0.S
@@ -115,14 +115,22 @@ here:
 	bl	c_runtime_cpu_setup	/* we still call old routine here */
 
 	ldr	r0, =__bss_start	/* this is auto-relocated! */
-	ldr	r1, =__bss_end		/* this is auto-relocated! */
 
+#ifdef CONFIG_USE_ARCH_MEMSET
+	ldr	r3, =__bss_end		/* this is auto-relocated! */
+	mov	r1, #0x00000000		/* prepare zero to clear BSS */
+
+	subs	r2, r3, r0		/* r2 = memset len */
+	bl	memset
+#else
+	ldr	r1, =__bss_end		/* this is auto-relocated! */
 	mov	r2, #0x00000000		/* prepare zero to clear BSS */
 
 clbss_l:cmp	r0, r1			/* while not at end of BSS */
 	strlo	r2, [r0]		/* clear 32-bit BSS word */
 	addlo	r0, r0, #4		/* move to next */
 	blo	clbss_l
+#endif
This is great news (increased boot speed). Maybe this code needn't be conditional? Assuming the BSS is aligned (enforced by the LDS):
 	ldr	r0, =__bss_start	/* this is auto-relocated! */
 	ldr	r1, =__bss_end		/* this is auto-relocated! */
+	mov	r2, #0			/* prepare zero to clear BSS */
+	mov	r3, #0
+	mov	r4, #0
+	mov	r5, #0
+	mov	r6, #0
+	mov	r7, #0
+	mov	r8, #0
+	mov	lr, #0
+	b	1f
+	.align	4
 clbss_l:
+	stmia	r0!, {r2-r8, lr}	/* clear 32 bytes of BSS */
+	stmia	r0!, {r2-r8, lr}	/* easy to unroll */
+1:	cmp	r0, r1
 	blo	clbss_l
The code should work on all ARM versions, shouldn't it? Then every platform would benefit. I think the larger portion of the arch memset code is there to handle alignment issues. Sometimes the tail/head portions can be handled efficiently with 'strd', etc., which is only available on some CPUs. It is easy to set up the BSS so that both the head and tail are aligned, but I think the above code only requires the total BSS size to be a multiple of 32 bytes (it is easy to jump into the middle of an unrolled loop with 'add pc, rn<<2, #constant'). The size per iteration can easily be tweaked (8/16/32 bytes).
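For instance, head/tail alignment could be enforced with something along these lines in the linker script (a hypothetical fragment; the real per-board U-Boot .lds files differ in detail):

	.bss : {
		. = ALIGN(32);		/* align the head */
		__bss_start = .;
		*(.bss*)
		. = ALIGN(32);		/* pad the tail, so __bss_end - __bss_start
					 * is a multiple of 32 bytes */
		__bss_end = .;
	}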
At least in principle, *if* there is some size alignment on the BSS, it is fairly easy to write generic ARM code that clears the BSS just as competitively as any arch memset version. The above adds about 13 words of code.
Fwiw, Bill Pringlemeir.