[PATCH] string: Use memcpy() within memmove() when we can

A common use of memmove() can be handled by memcpy(). Also memcpy() includes an optimization for large sizes: it copies a word at a time. So we can get a speed-up by calling memcpy() to handle our move in this case.
Update memmove() to call also memcpy() if the source don't overlap the destination (src + count <= dest).
Signed-off-by: Patrick Delaunay patrick.delaunay@st.com --- This patch allows to save 38ms for Kernel Image extraction (7327624 Bytes) from FIT loaded at 0xC2000000 for ARMV7 board STM32MP157C-EV1, and with kernel destination = Load Address: 0xc4000000, located after the FIT without overlap, compared with destination = Load Address: 0xc0008000.
-> 14,332 us vs 52,239 in bootstage report
In this case the memmove funtion is called in common/image.c::memmove_wd() to handle overlap.
lib/string.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/lib/string.c b/lib/string.c index ae7835f600..ef8ead976c 100644 --- a/lib/string.c +++ b/lib/string.c @@ -567,7 +567,7 @@ void * memmove(void * dest,const void *src,size_t count) { char *tmp, *s;
- if (dest <= src) { + if (dest <= src || (src + count) <= dest) { memcpy(dest, src, count); } else { tmp = (char *) dest + count;

On 25/11/2020 12.26, Patrick Delaunay wrote:
A common use of memmove() can be handled by memcpy(). Also memcpy() includes an optimization for large sizes: it copies a word at a time. So we can get a speed-up by calling memcpy() to handle our move in this case.
Update memmove() to call also memcpy() if the source don't overlap the destination (src + count <= dest).
Signed-off-by: Patrick Delaunay patrick.delaunay@st.com
This patch allows to save 38ms for Kernel Image extraction (7327624 Bytes) from FIT loaded at 0xC2000000 for ARMV7 board STM32MP157C-EV1, and with kernel destination = Load Address: 0xc4000000, located after the FIT without overlap, compared with destination = Load Address: 0xc0008000.
-> 14,332 us vs 52,239 in bootstage report
In this case the memmove funtion is called in common/image.c::memmove_wd() to handle overlap.
lib/string.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/lib/string.c b/lib/string.c index ae7835f600..ef8ead976c 100644 --- a/lib/string.c +++ b/lib/string.c @@ -567,7 +567,7 @@ void * memmove(void * dest,const void *src,size_t count) { char *tmp, *s;
- if (dest <= src) {
- if (dest <= src || (src + count) <= dest) { memcpy(dest, src, count);
Hm. So the condition you add (src + count <= dest) implies no overlap at all, so that's ok. So this doesn't really have anything to do with your patch per se.
The existing condition relies on memcpy doing forward-copying. That's the case if the implementation from lib/string.c is in use, i.e. if __HAVE_ARCH_MEMCPY is not defined. And if an arch defines __HAVE_ARCH_MEMMOVE, this memmove() is not used.
But AFAICT, there's a potential problem for the case where __HAVE_ARCH_MEMCPY is defined but __HAVE_ARCH_MEMMOVE is not, and e.g. arch/arm/include/asm/string.h does
#if CONFIG_IS_ENABLED(USE_ARCH_MEMCPY) #define __HAVE_ARCH_MEMCPY #endif #undef __HAVE_ARCH_MEMMOVE
Of course, the arch-specific implementation _may_ also do forward-copying (I haven't tried to decipher it), but that seems to be a rather fragile assumption. At the very least, some comments would be in order.
Rasmus

On Wed, Nov 25, 2020 at 01:07:43PM +0100, Rasmus Villemoes wrote:
On 25/11/2020 12.26, Patrick Delaunay wrote:
A common use of memmove() can be handled by memcpy(). Also memcpy() includes an optimization for large sizes: it copies a word at a time. So we can get a speed-up by calling memcpy() to handle our move in this case.
Update memmove() to call also memcpy() if the source don't overlap the destination (src + count <= dest).
Signed-off-by: Patrick Delaunay patrick.delaunay@st.com
This patch allows to save 38ms for Kernel Image extraction (7327624 Bytes) from FIT loaded at 0xC2000000 for ARMV7 board STM32MP157C-EV1, and with kernel destination = Load Address: 0xc4000000, located after the FIT without overlap, compared with destination = Load Address: 0xc0008000.
-> 14,332 us vs 52,239 in bootstage report
In this case the memmove funtion is called in common/image.c::memmove_wd() to handle overlap.
lib/string.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/lib/string.c b/lib/string.c index ae7835f600..ef8ead976c 100644 --- a/lib/string.c +++ b/lib/string.c @@ -567,7 +567,7 @@ void * memmove(void * dest,const void *src,size_t count) { char *tmp, *s;
- if (dest <= src) {
- if (dest <= src || (src + count) <= dest) { memcpy(dest, src, count);
Hm. So the condition you add (src + count <= dest) implies no overlap at all, so that's ok. So this doesn't really have anything to do with your patch per se.
The existing condition relies on memcpy doing forward-copying. That's the case if the implementation from lib/string.c is in use, i.e. if __HAVE_ARCH_MEMCPY is not defined. And if an arch defines __HAVE_ARCH_MEMMOVE, this memmove() is not used.
But AFAICT, there's a potential problem for the case where __HAVE_ARCH_MEMCPY is defined but __HAVE_ARCH_MEMMOVE is not, and e.g. arch/arm/include/asm/string.h does
#if CONFIG_IS_ENABLED(USE_ARCH_MEMCPY) #define __HAVE_ARCH_MEMCPY #endif #undef __HAVE_ARCH_MEMMOVE
Of course, the arch-specific implementation _may_ also do forward-copying (I haven't tried to decipher it), but that seems to be a rather fragile assumption. At the very least, some comments would be in order.
Looking at this deeper, today, ARM (non-64bit) can and usually but not always does use the asm optimized memcpy / memset. No other optimized functions were copied from Linux, and no other arches seem to use them today either. I think in sum then, Patrick can you please do a v2 that adds a comment here, in case we get more optimizations in the future? And thanks for the review here Rasmus!
participants (3)
-
Patrick Delaunay
-
Rasmus Villemoes
-
Tom Rini