
For better performance I added a 32-byte-aligned memcpy32. Please check it.
Did you measure how much performance this gains? Is it really worth the effort? In any case, the file needs a GPL license header.
We can feel that it's faster than before. OK, I added the GPL license.
How much? 5%? 20%? 50%?
The architecture-optimized copy is about 3 times faster on the OMAP 16xx series (ARM9 core) in U-Boot.
Here are the test results for memory copy (unit: nsec):

                               96 MHz               192 MHz
unsigned long *                189,440 ~ 192,000    158,720 ~ 160,000
architecture optimized copy     56,320 ~  58,880     53,760 ~  55,040

where the increment time is 2,560 nsec at 96 MHz and 1,280 nsec at 192 MHz.
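
To put the "about 3 times" figure next to the measurements, here is a small sketch that divides the plain `unsigned long *` copy time by the optimized copy time. The helper names `speedup` and `print_speedup_range` are made up for this illustration, and pairing the low end of one range with the low end of the other (and high with high) is an assumption; the thread does not say which runs correspond.

```c
#include <assert.h>
#include <stdio.h>

/* Speedup factor: baseline time divided by optimized time (same units). */
double speedup(double baseline_ns, double optimized_ns)
{
    assert(optimized_ns > 0.0);
    return baseline_ns / optimized_ns;
}

/*
 * Print the speedup range for one clock rate.
 * Assumption: low-end measurements are paired with each other, and
 * high-end measurements are paired with each other.
 */
void print_speedup_range(const char *clock,
                         double base_lo, double base_hi,
                         double opt_lo, double opt_hi)
{
    printf("%s: %.2fx to %.2fx\n", clock,
           speedup(base_hi, opt_hi), speedup(base_lo, opt_lo));
}
```

Calling `print_speedup_range("96 MHz", 189440, 192000, 56320, 58880)` and `print_speedup_range("192 MHz", 158720, 160000, 53760, 55040)` gives roughly 3.3x at 96 MHz and roughly 2.9x at 192 MHz, which is consistent with the "about 3 times" estimate.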
On the issue of architecture-specific memcpy()s: there appears to be an issue with the ppc-optimized memcpy() in that it assumes alignment of both source and destination buffers. Our solution was to disable the architecture-specific memcpy() routines. We did, however, need the performance of a long-word copy, so we made our own memcpy() routine that we use during ELF load. This isn't very elegant, however.
It would be useful to have a generic C version of memcpy()/memset()/etc. for 8-, 16-, and 32-bit (maybe 64-bit) architectures. Maybe a macro with a SIZE parameter would be suitable for generating the functions. The implementation would likely be very similar to your memcpy32(), and it would be quite a bit easier to add support for unaligned pointers to a C version of memcpy() than to modify the asm code for each architecture.
Chris
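
For illustration, a macro-generated copy routine along the lines suggested above might look roughly like the sketch below. It is not the memcpy32() from the patch. The names `DEFINE_COPY`, `copy8` through `copy64`, and `copy_selfcheck` are hypothetical, and the macro takes a word type rather than a numeric SIZE. The sketch assumes GCC or Clang (for the `may_alias` attribute, which keeps the wide accesses legal under strict aliasing) and, like memcpy(), non-overlapping buffers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * DEFINE_COPY(name, type) generates a copy routine that moves data in
 * units of `type` when source and destination can be aligned together,
 * and falls back to a plain byte loop otherwise.
 * Assumption: GCC/Clang, for __attribute__((may_alias)).
 */
#define DEFINE_COPY(name, type)                                            \
typedef type name##_word __attribute__((may_alias));                       \
void *name(void *dest, const void *src, size_t n)                          \
{                                                                          \
    unsigned char *d = dest;                                               \
    const unsigned char *s = src;                                          \
    /* Wide path only if both pointers have the same misalignment. */      \
    if ((((uintptr_t)d ^ (uintptr_t)s) % sizeof(type)) == 0) {             \
        /* Copy leading bytes until d (and therefore s) is aligned. */     \
        while (n > 0 && ((uintptr_t)d % sizeof(type)) != 0) {              \
            *d++ = *s++;                                                   \
            n--;                                                           \
        }                                                                  \
        /* Bulk copy, one word at a time. */                               \
        name##_word *dw = (name##_word *)d;                                \
        const name##_word *sw = (const name##_word *)s;                    \
        while (n >= sizeof(type)) {                                        \
            *dw++ = *sw++;                                                 \
            n -= sizeof(type);                                             \
        }                                                                  \
        d = (unsigned char *)dw;                                           \
        s = (const unsigned char *)sw;                                     \
    }                                                                      \
    /* Trailing bytes, or the whole buffer on the fallback path. */        \
    while (n > 0) {                                                        \
        *d++ = *s++;                                                       \
        n--;                                                               \
    }                                                                      \
    return dest;                                                           \
}

DEFINE_COPY(copy8, uint8_t)
DEFINE_COPY(copy16, uint16_t)
DEFINE_COPY(copy32, uint32_t)
DEFINE_COPY(copy64, uint64_t)

/* Small self-check that also serves as a usage example. Returns 1. */
int copy_selfcheck(void)
{
    unsigned char src[64], dst[64];
    size_t i;

    for (i = 0; i < sizeof(src); i++)
        src[i] = (unsigned char)i;

    memset(dst, 0, sizeof(dst));
    copy32(dst + 1, src + 1, 50);   /* same misalignment: word path */
    assert(memcmp(dst + 1, src + 1, 50) == 0);
    assert(dst[0] == 0 && dst[51] == 0);

    memset(dst, 0, sizeof(dst));
    copy32(dst + 2, src + 1, 40);   /* different misalignment: byte path */
    assert(memcmp(dst + 2, src + 1, 40) == 0);
    return 1;
}
```

When the two pointers have different misalignments the sketch simply copies byte by byte. A faster unaligned path (for example shifting and merging words) would be possible but is left out to keep the example short.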