[U-Boot] [PATCH V2 0/3] make memcpy and memset faster

I've added 32-bit lcd support to the Nomadik (not submitted yet), and I found scrolling to be very slow, as the screen is big.
Instead of activating the "#if 0" stanza for the 32-bit scroll in lcd.c, I'd rather have a faster memcpy/memset globally. So this patch set adds ulong-wide memcpy and memset, then removes the "#if 0" part in the scroll function. For me, scrolling is 4 times faster on a 32-bit system.
V2: I incorporated most of the comments, but I didn't change the for loops to help the compiler optimize them, since nowadays gcc already compiles the loops its own way irrespective of what I write.
Similarly, I'm not interested in "4 bytes at a time, then 1 at a time" as it's quite a corner case. If such optimizations are really useful, then we'd better have hand-crafted assembly for each arch, possibly lifted from glibc.
Alessandro Rubini (3):
  memcpy: copy one word at a time if possible
  memset: fill one word at a time if possible
  lcd: remove '#if 0' 32-bit scroll, now memcpy does it

 common/lcd.c         |   21 ---------------------
 lib_generic/string.c |   34 +++++++++++++++++++++++++++++-----
 2 files changed, 29 insertions(+), 26 deletions(-)

[U-Boot] [PATCH V2 1/3] memcpy: copy one word at a time if possible

From: Alessandro Rubini <rubini@unipv.it>

Signed-off-by: Alessandro Rubini <rubini@unipv.it>
Acked-by: Andrea Gallo <andrea.gallo@stericsson.com>
---
 lib_generic/string.c |   17 +++++++++++++----
 1 files changed, 13 insertions(+), 4 deletions(-)

diff --git a/lib_generic/string.c b/lib_generic/string.c
index 181eda6..9911941 100644
--- a/lib_generic/string.c
+++ b/lib_generic/string.c
@@ -446,12 +446,21 @@ char * bcopy(const char * src, char * dest, int count)
  * You should not use this function to access IO space, use memcpy_toio()
  * or memcpy_fromio() instead.
  */
-void * memcpy(void * dest,const void *src,size_t count)
+void * memcpy(void *dest, const void *src, size_t count)
 {
-	char *tmp = (char *) dest, *s = (char *) src;
+	char *d8 = (char *)dest, *s8 = (char *)src;
+	unsigned long *dl = (unsigned long *)dest, *sl = (unsigned long *)src;
 
+	/* if all data is aligned (common case), copy a word at a time */
+	if ( (((int)dest | (int)src | count) & (sizeof(long) - 1)) == 0) {
+		count /= sizeof(unsigned long);
+		while (count--)
+			*dl++ = *sl++;
+		return dest;
+	}
+	/* else, use 1-byte copy */
 	while (count--)
-		*tmp++ = *s++;
+		*d8++ = *s8++;
 
 	return dest;
 }
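For readers puzzling over the alignment test above: OR-ing the two addresses and the byte count means any low bit set in any of the three survives into the mask, so the result is zero only when all of them are word-aligned. Below is a minimal standalone sketch of that check, not part of the patch; the helper name and the use of uintptr_t instead of the patch's (int) casts are purely illustrative.

#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical helper mirroring the patch's test: returns 1 when dest,
 * src and count are all multiples of the word size, i.e. the
 * word-at-a-time path is safe to take. */
static int all_word_aligned(const void *dest, const void *src, size_t count)
{
	return (((uintptr_t)dest | (uintptr_t)src | count)
		& (sizeof(long) - 1)) == 0;
}

int main(void)
{
	static long a[4], b[4];

	printf("%d\n", all_word_aligned(a, b, 16));              /* 1: all aligned */
	printf("%d\n", all_word_aligned(a, b, 15));              /* 0: odd count */
	printf("%d\n", all_word_aligned((char *)a + 1, b, 16));  /* 0: misaligned dest */
	return 0;
}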

[U-Boot] [PATCH V2 2/3] memset: fill one word at a time if possible

From: Alessandro Rubini <rubini@unipv.it>

Signed-off-by: Alessandro Rubini <rubini@unipv.it>
Acked-by: Andrea Gallo <andrea.gallo@stericsson.com>
---
 lib_generic/string.c |   17 ++++++++++++++++-
 1 files changed, 16 insertions(+), 1 deletions(-)

diff --git a/lib_generic/string.c b/lib_generic/string.c
index 9911941..5f7aff9 100644
--- a/lib_generic/string.c
+++ b/lib_generic/string.c
@@ -404,7 +404,22 @@ char *strswab(const char *s)
 void * memset(void * s,int c,size_t count)
 {
 	char *xs = (char *) s;
-
+	unsigned long *sl = (unsigned long *) s;
+	unsigned long cl = 0;
+	int i;
+
+	/* do it one word at a time (32 bits or 64 bits) if possible */
+	if ( ((count | (int)s) & (sizeof(long) - 1)) == 0) {
+		count /= sizeof(long);
+		for (i=0; i<sizeof(long); ++i) {
+			cl <<= 8;
+			cl |= c & 0xff;
+		}
+		while (count--)
+			*sl++ = cl;
+		return s;
+	}
+	/* else, fill 8 bits at a time */
 	while (count--)
 		*xs++ = c;
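A side note on the pattern-building loop above: it shifts the accumulator left by one byte and ORs in the fill byte once per byte of a long, so e.g. c = 0xab becomes 0xabababab with 32-bit longs. A standalone sketch, not taken from the patch (the function name is made up):

#include <stdio.h>

/* Replicates the fill byte across every byte of an unsigned long,
 * the same way the patch builds 'cl' before the word-wide store loop. */
static unsigned long fill_pattern(int c)
{
	unsigned long cl = 0;
	unsigned int i;

	for (i = 0; i < sizeof(long); ++i) {
		cl <<= 8;
		cl |= c & 0xff;
	}
	return cl;
}

int main(void)
{
	/* Prints abababab with 32-bit longs, abababababababab with 64-bit. */
	printf("%lx\n", fill_pattern(0xab));
	return 0;
}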

[U-Boot] [PATCH V2 3/3] lcd: remove '#if 0' 32-bit scroll, now memcpy does it

From: Alessandro Rubini <rubini@unipv.it>

Signed-off-by: Alessandro Rubini <rubini@unipv.it>
Acked-by: Andrea Gallo <andrea.gallo@stericsson.com>
---
 common/lcd.c |   21 ---------------------
 1 files changed, 0 insertions(+), 21 deletions(-)

diff --git a/common/lcd.c b/common/lcd.c
index dc8fea6..4e31618 100644
--- a/common/lcd.c
+++ b/common/lcd.c
@@ -99,32 +99,11 @@ static int lcd_getfgcolor (void);
 
 static void console_scrollup (void)
 {
-#if 1
 	/* Copy up rows ignoring the first one */
 	memcpy (CONSOLE_ROW_FIRST, CONSOLE_ROW_SECOND, CONSOLE_SCROLL_SIZE);
 
 	/* Clear the last one */
 	memset (CONSOLE_ROW_LAST, COLOR_MASK(lcd_color_bg), CONSOLE_ROW_SIZE);
-#else
-	/*
-	 * Poor attempt to optimize speed by moving "long"s.
-	 * But the code is ugly, and not a bit faster :-(
-	 */
-	ulong *t = (ulong *)CONSOLE_ROW_FIRST;
-	ulong *s = (ulong *)CONSOLE_ROW_SECOND;
-	ulong l = CONSOLE_SCROLL_SIZE / sizeof(ulong);
-	uchar c = lcd_color_bg & 0xFF;
-	ulong val= (c<<24) | (c<<16) | (c<<8) | c;
-
-	while (l--)
-		*t++ = *s++;
-
-	t = (ulong *)CONSOLE_ROW_LAST;
-	l = CONSOLE_ROW_SIZE / sizeof(ulong);
-
-	while (l-- > 0)
-		*t++ = val;
-#endif
 }
 
 /*----------------------------------------------------------------------*/

Dear Alessandro Rubini,
In message <cover.1255000877.git.rubini@unipv.it> you wrote:

> Similarly, I'm not interested in "4 bytes at a time, then 1 at a time"
> as it's quite a corner case. If such optimizations are really useful,
> then we'd better have hand-crafted assembly for each arch, possibly
> lifted from glibc.
I disagree here, especially as the change is actually trivial to implement and probably even results in smaller code size.
Best regards,
Wolfgang Denk

On Thursday 08 October 2009 07:29:51 Alessandro Rubini wrote:
> Similarly, I'm not interested in "4 bytes at a time, then 1 at a time"
> as it's quite a corner case. If such optimizations are really useful,
> then we'd better have hand-crafted assembly for each arch, possibly
> lifted from glibc.
Why? It's trivial to implement with little code impact: have your code run while the len is larger than 4 (sizeof, whatever), then fall through to the loop that runs while the len is larger than 0, instead of returning immediately.
-mike
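For what it's worth, here is a rough sketch of the shape Mike describes, applied to memcpy: copy whole words while at least one word remains, then fall through to the byte loop for the tail instead of returning early. This is an untested illustration of the suggestion, not code from the patch series, and the function name is made up.

#include <stddef.h>

/* Sketch of the fall-through structure: when both pointers are
 * word-aligned, copy word-sized chunks while a full word remains,
 * then let the trailing bytes drop into the ordinary byte loop. */
void *memcpy_sketch(void *dest, const void *src, size_t count)
{
	char *d8 = dest;
	const char *s8 = src;

	if ((((unsigned long)dest | (unsigned long)src)
	     & (sizeof(long) - 1)) == 0) {
		unsigned long *dl = (unsigned long *)d8;
		const unsigned long *sl = (const unsigned long *)s8;

		while (count >= sizeof(long)) {
			*dl++ = *sl++;
			count -= sizeof(long);
		}
		d8 = (char *)dl;
		s8 = (const char *)sl;
	}

	/* Fall through: copy any remaining bytes one at a time. */
	while (count--)
		*d8++ = *s8++;

	return dest;
}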