
On Monday, 27 March 2017, 09:14:47 CEST, Alexander Graf wrote:
On 27/03/2017 01:38, Simon Glass wrote:
Most of the time the optimised memset() is what we want. For extreme situations such as TPL it may be too large. For example on the 'rock' board, using a simple loop saves a useful 48 bytes. With gcc 4.9 and the rodata bug, this patch is enough to reduce the TPL image below the limit.
Signed-off-by: Simon Glass <sjg@chromium.org>
 lib/Kconfig  | 9 +++++++++
 lib/string.c | 6 ++++--
 2 files changed, 13 insertions(+), 2 deletions(-)
diff --git a/lib/Kconfig b/lib/Kconfig
index 65c01573e1..5bf512d8c0 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -52,6 +52,15 @@ config LIB_RAND
 	help
 	  This library provides pseudo-random number generator functions.
+config FAST_MEMSET
+	bool "Use an optimised memset()"
+	default y
+	help
+	  The faster memset() is the arch-specific one (if available) enabled
+	  by CONFIG_USE_ARCH_MEMSET. If that is not enabled, we can still get
+	  better performance by write a word at a time. Disable this option
+	  to reduce code size slightly at the cost of some speed.
The comment sounds slightly confused - it took me a few times of reading it until I grasped what it was trying to tell me :).
 source lib/dhry/Kconfig
 source lib/rsa/Kconfig
diff --git a/lib/string.c b/lib/string.c
index 67d5f6a421..159493ed17 100644
--- a/lib/string.c
+++ b/lib/string.c
@@ -437,8 +437,10 @@ char *strswab(const char *s)
 void * memset(void * s,int c,size_t count)
 {
 	unsigned long *sl = (unsigned long *) s;
-	unsigned long cl = 0;
 	char *s8;
+
+#ifdef CONFIG_FAST_MEMSET
+	unsigned long cl = 0;
 	int i;

 	/* do it one word at a time (32 bits or 64 bits) while possible */
@@ -452,7 +454,7 @@ void * memset(void * s,int c,size_t count)
 			count -= sizeof(*sl);
 		}
 	}
- /* fill 8 bits at a time */
+#endif /* fill 8 bits at a time */
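For readers following along without the tree at hand, the two code paths the patch selects between can be sketched roughly as below. This is a standalone illustration and not the U-Boot source: the names memset_tiny() and memset_fast() are hypothetical, and the real function is a single memset() with the word loop wrapped in #ifdef CONFIG_FAST_MEMSET. Like the original, the word path stores unsigned long values into the caller's buffer, so it assumes a build that tolerates that (U-Boot is built with -fno-strict-aliasing).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical name: the byte loop that remains when the option is off. */
static void *memset_tiny(void *s, int c, size_t count)
{
	unsigned char *s8 = s;

	/* fill 8 bits at a time */
	while (count--)
		*s8++ = (unsigned char)c;

	return s;
}

/* Hypothetical name: word loop first, then the same byte loop for the tail. */
static void *memset_fast(void *s, int c, size_t count)
{
	unsigned char *s8 = s;

	/* Take the word path only when the start address is word-aligned. */
	if (((uintptr_t)s & (sizeof(unsigned long) - 1)) == 0) {
		unsigned long *sl = s;
		unsigned long cl = 0;
		size_t i;

		/* Replicate the fill byte into every byte of one word. */
		for (i = 0; i < sizeof(*sl); i++) {
			cl <<= 8;
			cl |= (unsigned long)(c & 0xff);
		}

		/* do it one word at a time (32 bits or 64 bits) while possible */
		while (count >= sizeof(*sl)) {
			*sl++ = cl;
			count -= sizeof(*sl);
		}
		s8 = (unsigned char *)sl;
	}

	/* fill the remaining bytes 8 bits at a time */
	while (count--)
		*s8++ = (unsigned char)c;

	return s;
}
```

Both variants produce the same result; the size difference comes entirely from the alignment check, the byte-replication loop and the word loop.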
So while this is all neat, a few ideas:
- Would having memset in a header improve things even more? After all,
each external function call clobbers registers that you need to save/restore...
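To make the header idea concrete, here is a minimal sketch of what it could look like. Everything in it is an assumption for illustration: tiny_memset(), struct demo_state and demo_reset() are hypothetical names and do not exist in U-Boot. Whether inlining actually saves space depends on the number of callers and on the compiler, which may also recognise the loop and emit a call to memset() anyway.

```c
#include <stddef.h>

/* Hypothetical header-style helper: a byte loop the compiler may inline. */
static inline void *tiny_memset(void *s, int c, size_t count)
{
	unsigned char *s8 = s;

	while (count--)
		*s8++ = (unsigned char)c;

	return s;
}

/* Hypothetical state structure, standing in for whatever the caller clears. */
struct demo_state {
	int a;
	int b;
	unsigned char buf[4];
};

/* Hypothetical single caller, in the spirit of an early init function. */
static void demo_reset(struct demo_state *st)
{
	tiny_memset(st, 0, sizeof(*st));
}
```

With a single call site the compiler can fold the loop into the caller and skip the call and its register save/restore; with several call sites each one gets its own copy of the loop, which can cost more than it saves.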
I'd guess it really depends on the size constraints. The regular libgeneric memset compiles on my rk3188 TPL to a total of 64 bytes on both gcc-4.9 and gcc-6.3, while the memset from Simon's patch comes down to 14 bytes on my rk3188.
On the rk3188 the only memset user is board_init_f, so memset is called only once and no registers need to be saved. I'd guess that if an implementation really is so size-constrained that 50 bytes matter, this one caller will probably always be the only one?
- How much would GOLD save you? Have you tried? U-Boot is small enough
of a code base that global optimizations should be able to give significant size savings.
I think the point of this patch is to allow more toolchains to be used, so that rebuilding a lot of boards at once after a change still works with whatever toolchain happens to be in use.
gcc-6.3 already produces way smaller results (well within the size constraints the rk3188 has) than for example the gcc-4.9 used by buildman as baseline toolchain.