
From: Tom Rini trini@konsulko.com On Mon, Jan 02, 2023 at 04:02:06PM +0000, Maier, Brandon L
I understand Linux replaced their custom zstd code with the Facebook zstd
code, so the changes are widespread. The new zstd does support many optional performance features as documented in zstd's lib/README[1], which I have already added a KConfig to disable all of them in this patch. For reference, if I enable the performance options the build size increases as so:
aarch64: (for 2/2 boards) all +27432.0 rodata +320.0 text +27112.0 arm: (for 5/5 boards) all +21801.6 rodata +256.0 text +21545.6 sandbox: (for 2/2 boards) all +59004.0 bss +16.0 rodata +800.0 text
+58188.0
I haven't had any luck finding another way to decrease build size.
[1]
https://github.com/facebook/zstd/blob/dev/lib/README.md#modular-build
After skimming over that, did you see if not disabling inlining, and letting the compiler figure it out is a big size penalty?
That is one of the settings I am already enabling. To clarify from that README, the settings I am enabling in U-Boot are HUF_FORCE_DECOMPRESS_X1, ZSTD_FORCE_DECOMPRESS_SEQUENCES_SHORT, ZSTD_NO_INLINE, and ZSTD_STRIP_ERROR_STRINGS. Which is what upstream zstd uses for their own ZSTD_LIB_MINIFY option in lib/libzstd.mk.
If I drop the ZSTD_NO_INLINE, the build sizes increase as follows:
aarch64: (for 2/2 boards) all +6212.0 rodata +8.0 text +6204.0 arm: (for 5/5 boards) all +7408.0 text +7408.0 sandbox: (for 2/2 boards) all +21928.0 bss -16.0 data +16.0 rodata +224.0 text +21704.0
Also, while the arch-level details are good, a specific platform set of numbers instead might give some ideas on where maybe reductions could come from.
I looked over each platform, but the size deltas are consistent within each architecture.
What stands out though is that the old ZSTD ARM code is about 4.6kb smaller (25%) than the old ZSTD AARCH64 code. The newer ZSTD doesn't do that, both ARM and AARCH64 are about the same in size. That is why there is a 4k jump in size for just the arm boards. Comparing the functions list between the two, both archs appear similar except ZSTD_decompressBlock_internal() grows significantly on aarch64 for some reason. I haven't seen any obvious root cause though.
Here is the full breakdown of platform size deltas
aarch64: (for 2/2 boards) all +535.5 rodata +965.5 text -430.0 turris_mox : all +572 rodata +968 text -396 mvebu_espressobin-88f3720: all +499 rodata +963 text -464 arm: (for 5/5 boards) all +4489.6 rodata +940.0 text +3549.6 turris_omnia : all +4608 rodata +940 text +3668 m53menlo : all +4460 rodata +940 text +3520 dh_imx6 : all +4460 rodata +940 text +3520 stm32mp15_dhcom_basic: all +4460 rodata +940 text +3520 stm32mp15_dhcor_basic: all +4460 rodata +940 text +3520 sandbox: (for 2/2 boards) all +8.0 rodata +784.0 text -776.0 sandbox : all +528 bss -32 rodata +768 text -208 sandbox64 : all -512 bss +32 rodata +800 text -1344
Finally, what is the whole list of platforms that grow?
All of them except sandbox64 grow by some amount.
It's all not ideal, yes, but it looks like BTRFS is the main user, right now, which isn't widely enabled. So maybe we can look towards improving upstream a bit here, if motivated.
I can't speak for the BTRFS code as I don't have a suitable platform to test it on. But the zstd code does appear to be broken on master. Running the zstd compression test in sandbox or sandbox64 both segfault. And I have similar problems trying to decompress FIT images on our AARCH64 platform. The primary motivation to this patch is that the new version does work. And this seems like the appropriate fix given Linux has abandoned the original version of this code.
-- Tom