
Hello Tom,
On 02/12/2015 04:37 PM, Tom Rini wrote:
On Sun, Feb 01, 2015 at 03:38:42AM +0100, Albert ARIBAUD wrote:
Hello Przemyslaw,
On Wed, 28 Jan 2015 13:55:42 +0100, Przemyslaw Marczak p.marczak@samsung.com wrote:
For the ARM architecture, enabling CONFIG_USE_ARCH_MEMSET/MEMCPY greatly improves memset/memcpy performance. This is possible thanks to the ARM multiple-register instructions.
Unfortunately, the relocation is done without the cache enabled, so it takes some time, but zeroing the BSS memory takes much longer, especially for configs with big static buffers.
A quick test confirms that using the arch memcpy for relocation gives no significant boot time improvement. The same test confirms that enabling the arch memset for zeroing the BSS reduces the boot time.
So this patch enables the arch memset for zeroing the BSS after the relocation process. For ARM boards, this can be enabled in board configs by defining: 'CONFIG_USE_ARCH_MEMSET'.
Since the issue is that zeroing is done one word at a time, could we not simply clear r3 as well as r2 (possibly even r4 and r5 too) and do a double (possibly quadruple) write loop? That would avoid calling a libc routine from the almost sole file in U-Boot where a C environment is not necessarily granted.
I want to jump in here again. Note that the arch memset/memcpy routines are in asm and I don't believe they require a C environment. Why don't we simply use the asm versions for everyone and backport whatever we need from the kernel to re-sync there, as it's not a choice there and it's a performance win too?
Right, for ARM the mentioned routines don't require a C environment. But if we could achieve some improvement in this place, then maybe it makes sense to add some new code just for the BSS.
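To make the idea concrete, here is a rough host-side C sketch of the two loop shapes being discussed: one word per store versus four words per iteration with a single-word tail. The function names are made up for illustration, and the real change would be written in ARM assembly using multiple-register stores, so this only models the loop structure, not the actual instructions or the expected speedup.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Baseline: one 32-bit store per iteration (hypothetical helper name). */
void clear_words_single(uint32_t *start, uint32_t *end)
{
	assert(start <= end);
	while (start < end)
		*start++ = 0;
}

/*
 * Unrolled variant: four stores per iteration, then a single-word
 * tail for the remaining 0..3 words (hypothetical helper name).
 */
void clear_words_quad(uint32_t *start, uint32_t *end)
{
	assert(start <= end);
	while ((size_t)(end - start) >= 4) {
		start[0] = 0;
		start[1] = 0;
		start[2] = 0;
		start[3] = 0;
		start += 4;
	}
	while (start < end)
		*start++ = 0;
}
```

Both helpers clear the half-open range [start, end), so either one can be dropped into a timing test with the same arguments.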
I will try to put something together and run some timing tests on Monday.
Best regards,