Re: [U-Boot] [PATCH V2 1/3] memcpy: copy one word at a time if possible

On Thu, 2009-10-08 at 13:30 +0200, Alessandro Rubini wrote:
From: Alessandro Rubini rubini@unipv.it
Signed-off-by: Alessandro Rubini rubini@unipv.it Acked-by: Andrea Gallo andrea.gallo@stericsson.com
lib_generic/string.c | 17 +++++++++++++---- 1 files changed, 13 insertions(+), 4 deletions(-)
diff --git a/lib_generic/string.c b/lib_generic/string.c index 181eda6..9911941 100644 --- a/lib_generic/string.c +++ b/lib_generic/string.c @@ -446,12 +446,21 @@ char * bcopy(const char * src, char * dest, int count)
- You should not use this function to access IO space, use memcpy_toio()
- or memcpy_fromio() instead.
*/ -void * memcpy(void * dest,const void *src,size_t count) +void * memcpy(void *dest, const void *src, size_t count) {
- char *tmp = (char *) dest, *s = (char *) src;
char *d8 = (char *)dest, *s8 = (char *)src;
unsigned long *dl = (unsigned long *)dest, *sl = (unsigned long *)src;
/* if all data is aligned (common case), copy a word at a time */
if ( (((int)dest | (int)src | count) & (sizeof(long) - 1)) == 0) {
count /= sizeof(unsigned long);
while (count--)
*dl++ = *sl++;
return dest;
}
/* else, use 1-byte copy */ while (count--)
*tmp++ = *s++;
*d8++ = *s8++;
return dest;
}
Hi Alessandro, No interest in the suggestion to not require count to be an exact multiple of 4/8? I don't think it would be that hard to update the logic accordingly and this would let your code be utilized much more often, especially if/when we run on a 64-bit machine. Conceptually it seems like a cleaner implementation too, but that's probably just my preference:)
Best, Peter

No interest in the suggestion to not require count to be an exact multiple of 4/8?
Actually, I wrote about that in my patch 0/3.
I don't think it would be that hard to update the logic accordingly and this would let your code be utilized much more often, especially if/when we run on a 64-bit machine.
That's true, but I think the most important case is lcd scrolling, where it's usually a big power of two -- that's where we had the #ifdef, so the problem was known, I suppose.
Currently, I don't even know if this is going to be picked up, so I don't want to go too far -- and in that case I would like to measure things and be able to test for stupid bugs, so it takes time. Scrolling the video is an easy test for memcpy.
If there's interest, there's memmov to fix; it's used pretty often and it's the same idea as memcpy (well, scrolling should use memmove, in theory).
/alessandro

On Thu, 2009-10-08 at 18:00 +0200, Alessandro Rubini wrote:
No interest in the suggestion to not require count to be an exact multiple of 4/8?
Actually, I wrote about that in my patch 0/3.
Sorry, I should have read more thoroughly.
I don't think it would be that hard to update the logic accordingly and this would let your code be utilized much more often, especially if/when we run on a 64-bit machine.
That's true, but I think the most important case is lcd scrolling, where it's usually a big power of two -- that's where we had the #ifdef, so the problem was known, I suppose.
I think the most important case for *you* is lcd scrolling, but for 99% of everyone else, it isn't at all:) memcpy() and memset() are used 100 times more often in non-lcd related code and most boards don't even have LCDs. In my opinion, we shouldn't be fixing/bloating common code for 1 outlying scenario, we should be trying to improve it for common cases.
Currently, I don't even know if this is going to be picked up, so I don't want to go too far -- and in that case I would like to measure things and be able to test for stupid bugs, so it takes time.
Sure, I understand where you're coming from. FWIW, if people don't want to pick this up as it affects lots of people, you could always add an architecture-specific memcpy/memset implementation.
If there's interest, there's memmov to fix; it's used pretty often and it's the same idea as memcpy (well, scrolling should use memmove, in theory).
I'll quit pestering, just wanted to put my $.02 in:)
Best, Peter

That's true, but I think the most important case is lcd scrolling, where it's usually a big power of two -- that's where we had the #ifdef, so the problem was known, I suppose.
I think the most important case for *you* is lcd scrolling, but for 99% of everyone else, it isn't at all:)
Well, its a big memcpy, and it has direct effect on the user. Every other copy is smaller, or has no interactive value.
memcpy() and memset() are used 100 times more often in non-lcd related code and most boards don't even have LCDs.
That's true. But it's only a boot loader (I just looked at what Nicolas Pitre did in the kernel for ARM strcpy and, well....).
So I made some measures (it's one of Pike's rules of programming:
* Rule 2. Measure. Don't tune for speed until you've measured, and even then don't unless one part of the code overwhelms the rest.
)
I booted in u-boot, typed "setenv stdout serial" then "boot", which goes over the ethernet. Stopped the system after u-boot gave over control to the kernel. Result: 10412 memcopies so divided (number, length):
3941 4 1583 6 772 20 1 46 1 47 3 60 1024 64 1 815 1 888 770 1148 1543 1480 1 2283 1 3836 770 4096
So I dare say non-power-of-4 is a minority anyways: 1587 calls, 12689 bytes. i.e. 15.2% of the calls and 0.2% of the data.
Data collected in memory with patch below, used with following line:
od -An -t d4 logfile | awk '{print $4}' | sort -n | uniq -c
diff --git a/include/configs/nhk8815.h b/include/configs/nhk8815.h index edd698e..a390f28 100644 --- a/include/configs/nhk8815.h +++ b/include/configs/nhk8815.h @@ -28,6 +28,8 @@
#include <nomadik.h>
+#define CONFIG_MCLOGSIZE (16*1024) + #define CONFIG_ARM926EJS #define CONFIG_NOMADIK #define CONFIG_NOMADIK_8815 /* cpu variant */ diff --git a/lib_generic/string.c b/lib_generic/string.c index 5f7aff9..5afa11e 100644 --- a/lib_generic/string.c +++ b/lib_generic/string.c @@ -19,6 +19,7 @@ #include <linux/string.h> #include <linux/ctype.h> #include <malloc.h> +#include <common.h>
#if 0 /* not used - was: #ifndef __HAVE_ARCH_STRNICMP */ @@ -461,11 +462,29 @@ char * bcopy(const char * src, char * dest, int count) * You should not use this function to access IO space, use memcpy_toio() * or memcpy_fromio() instead. */ + +#ifndef CONFIG_MCLOGSIZE /* if you want to log the memcpy calls, define it */ +#define CONFIG_MCLOGSIZE 0 +#endif +struct mclog {int idx; void *dst; const void *src; int cnt;}; +static struct mclog mclog[CONFIG_MCLOGSIZE]; + void * memcpy(void *dest, const void *src, size_t count) { char *d8 = (char *)dest, *s8 = (char *)src; unsigned long *dl = (unsigned long *)dest, *sl = (unsigned long *)src;
+ if (CONFIG_MCLOGSIZE) { + static int idx; + struct mclog *p = mclog + (idx % (CONFIG_MCLOGSIZE ?: 1)); + if (!idx) printf("memcpy log at %p, size 0x%x\n", + mclog, sizeof(mclog)); + p->idx = idx++; + p->dst = dest; + p->src = src; + p->cnt = count; + } + /* if all data is aligned (common case), copy a word at a time */ if ( (((int)dest | (int)src | count) & (sizeof(long) - 1)) == 0) { count /= sizeof(unsigned long);

On Thu, 2009-10-08 at 20:23 +0200, Alessandro Rubini wrote:
That's true, but I think the most important case is lcd scrolling, where it's usually a big power of two -- that's where we had the #ifdef, so the problem was known, I suppose.
I think the most important case for *you* is lcd scrolling, but for 99% of everyone else, it isn't at all:)
Well, its a big memcpy, and it has direct effect on the user. Every other copy is smaller, or has no interactive value.
memcpy() and memset() are used 100 times more often in non-lcd related code and most boards don't even have LCDs.
That's true. But it's only a boot loader (I just looked at what Nicolas Pitre did in the kernel for ARM strcpy and, well....).
So I made some measures (it's one of Pike's rules of programming:
* Rule 2. Measure. Don't tune for speed until you've measured, and even then don't unless one part of the code overwhelms the rest.
)
I booted in u-boot, typed "setenv stdout serial" then "boot", which goes over the ethernet. Stopped the system after u-boot gave over control to the kernel. Result: 10412 memcopies so divided (number, length):
3941 4 1583 6 772 20 1 46 1 47 3 60 1024 64 1 815 1 888 770 1148 1543 1480 1 2283 1 3836 770 4096
So I dare say non-power-of-4 is a minority anyways: 1587 calls, 12689 bytes. i.e. 15.2% of the calls and 0.2% of the data.
The statistics are going to be very different for different scenarios. For example, network operations seem to be the majority of your large memcpys, this isn't the case for everyone. If/when U-Boot runs on 64-bit hardware, the stats will change too.
In any case, my only suggestion would be that if we're improving memcpy()/memset(), do the extra 10% of effort required to make them a little better. That 10% of effort will improve 15.2% of all memcpy() calls for the foreseeable future:)
I honestly don't care much what implementation you choose as I only currently use PPC, which has their own memcpy()/memset(). I'm only trying to be helpful, feel free to proceed however you see fit, I promise I won't comment on future patches:)
Best, Peter

The statistics are going to be very different for different scenarios.
Yes, I know.
For example, network operations seem to be the majority of your large memcpys, this isn't the case for everyone.
True. I noticed it after sending -- although I expected it.
In any case, my only suggestion would be that if we're improving memcpy()/memset(), do the extra 10% of effort required to make them a little better. That 10% of effort will improve 15.2% of all memcpy() calls for the foreseeable future:)
It mainly depends on Wolfgang but, hey, it's not 10% of effort.
I promise I won't comment on future patches:)
No problem at all. And I apologize if my tone looked rude, it wasn't meant to. Thank you for your comments.
/alessandro, who didn't notice ppc has an asm implementation of its own

Dear Alessandro Rubini,
In message 20091008191734.GA13161@mail.gnudd.com you wrote:
In any case, my only suggestion would be that if we're improving memcpy()/memset(), do the extra 10% of effort required to make them a little better. That 10% of effort will improve 15.2% of all memcpy() calls for the foreseeable future:)
It mainly depends on Wolfgang but, hey, it's not 10% of effort.
It's less, right ? :-)
I think now we have this part of code in focus, we should to the Right Thing :-)
Best regards,
Wolfgang Denk

Dear Alessandro Rubini,
In message 20091008160026.GA9077@mail.gnudd.com you wrote:
No interest in the suggestion to not require count to be an exact multiple of 4/8?
Actually, I wrote about that in my patch 0/3.
Hm. Your argument was not exctly convincing, though.
Currently, I don't even know if this is going to be picked up, so I don't want to go too far -- and in that case I would like to measure things and be able to test for stupid bugs, so it takes time. Scrolling the video is an easy test for memcpy.
Sure this is going to ber picked uyp. And actually the implementation can be made even simpler doing both.
If there's interest, there's memmov to fix; it's used pretty often and it's the same idea as memcpy (well, scrolling should use memmove, in theory).
Volunteers welcome :-)
Best regards,
Wolfgang Denk
participants (4)
-
Alessandro Rubini
-
Alessandro Rubini
-
Peter Tyser
-
Wolfgang Denk