
On Wed, Aug 14, 2024 at 01:49:43PM -0400, Raymond Mao wrote:
Hi Tom,
On Wed, 14 Aug 2024 at 11:07, Tom Rini trini@konsulko.com wrote:
On Wed, Aug 14, 2024 at 09:42:08AM -0400, Raymond Mao wrote:
Hi Tom,
On Fri, 2 Aug 2024 at 11:34, Raymond Mao raymond.mao@linaro.org wrote:
Hi Tom,
On Thu, 1 Aug 2024 at 16:46, Tom Rini trini@konsulko.com wrote:
On Wed, Jul 31, 2024 at 10:25:10AM -0700, Raymond Mao wrote:
Integrate MbedTLS v3.6 LTS (currently v3.6.0) with U-Boot.
Motivations:
- MbedTLS is well maintained with LTS versions.
- LWIP is integrated with MbedTLS and easily to enable HTTPS.
- MbedTLS recently switched license back to GPLv2.
Prerequisite:
This patch series requires mbedtls git repo to be added as a subtree to the main U-Boot repo via: $ git subtree add --prefix lib/mbedtls/external/mbedtls \ https://github.com/Mbed-TLS/mbedtls.git \ v3.6.0 --squash Moreover, due to the Windows-style files from mbedtls git repo, we need to convert the CRLF endings to LF and do a commit manually: $ git add --renormalize . $ git commit
New Kconfig options:
`MBEDTLS_LIB` is for MbedTLS general switch. `MBEDTLS_LIB_CRYPTO` is for replacing original digest and crypto
libs
with
MbedTLS. `MBEDTLS_LIB_X509` is for replacing original X509, PKCS7, MSCode,
ASN1,
and Pubkey parser with MbedTLS. `LEGACY_CRYPTO` is introduced as a main switch for legacy crypto
library.
`LEGACY_CRYPTO_BASIC` is for the basic crypto functionalities and `LEGACY_CRYPTO_CERT` is for the certificate related functionalities. For each of the algorithm, a pair of `<alg>_LEGACY` and
`<alg>_MBEDTLS`
Kconfig options are introduced. Meanwhile, `SPL_` Kconfig options
are
introduced.
In this patch set, MBEDTLS_LIB, MBEDTLS_LIB_CRYPTO and
MBEDTLS_LIB_X509
are by default enabled in qemu_arm64_defconfig and sandbox_defconfig for testing purpose.
Patches for external MbedTLS project:
Since U-Boot uses Microsoft Authentication Code to verify PE/COFFs executables which is not supported by MbedTLS at the moment, addtional patches for MbedTLS are created to adapt with the EFI
loader:
- Decoding of Microsoft Authentication Code.
- Decoding of PKCS#9 Authenticate Attributes.
- Extending MbedTLS PKCS#7 lib to support multiple signer's
certificates.
- MbedTLS native test suites for PKCS#7 signer's info.
All above 4 patches (tagged with `mbedtls/external`) are submitted
to
MbedTLS project and being reviewed, eventually they should be part
of
MbedTLS LTS release. But before that, please merge them into U-Boot, otherwise the
building
will be broken when MBEDTLS_LIB_X509 is enabled.
See below PR link for the reference: https://github.com/Mbed-TLS/mbedtls/pull/9001
Miscellaneous:
Optimized MbedTLS library size by tailoring the config file and disabling all unnecessary features for EFI loader. From v2, original libs (rsa, asn1_decoder, rsa_helper, md5, sha1,
sha256,
sha512) are completely replaced when MbedTLS is enabled. From v3, the size-growth is slightly reduced by refactoring Hash
functions.
Target(QEMU arm64) size-growth when enabling MbedTLS: v1: 6.03% v2: 4.66% From v3: 4.55%
Please see the latest output from buildman for size-growth on QEMU
arm64,
Sandbox and Nanopi A64. [1]
Let us inline the growth on qemu_arm64 for a moment: aarch64: (for 1/1 boards) all +6916.0 bss -32.0 data -64.0 rodata +200.0 text +6812.0 qemu_arm64 : all +6916 bss -32 data -64 rodata +200
text
+6812 u-boot: add: 28/-17, grow: 12/-16 bytes: 15492/-8304
(7188)
function old
new
delta mbedtls_internal_sha1_process -
4540
+4540 mbedtls_internal_md5_process -
2928
+2928 mbedtls_internal_sha256_process -
2052
+2052 mbedtls_internal_sha512_process -
1056
+1056 K -
896
+896 mbedtls_sha512_finish -
556
+556 mbedtls_sha256_finish -
484
+484 mbedtls_sha1_finish -
420
+420 mbedtls_sha512_starts -
340
+340 mbedtls_md5_finish -
336
+336 mbedtls_sha512_update -
264
+264 mbedtls_sha256_update -
252
+252 mbedtls_sha1_update -
236
+236 mbedtls_md5_update -
236
+236 mbedtls_sha512 -
148
+148 mbedtls_sha256_starts -
124
+124 hash_init_sha512 52
128
+76 hash_init_sha256 52
128
+76 mbedtls_sha1_starts -
72
+72 mbedtls_md5_starts -
60
+60 hash_init_sha1 52
112
+60 mbedtls_platform_zeroize -
56
+56 mbedtls_sha512_free -
16
+16 mbedtls_sha256_free -
16
+16 mbedtls_sha1_free -
16
+16 mbedtls_md5_free -
16
+16 hash_finish_sha512 72
88
+16 hash_finish_sha256 72
88
+16 hash_finish_sha1 72
88
+16 sha512_csum_wd 68
80
+12 sha256_csum_wd 68
80
+12 sha1_csum_wd 68
80
+12 md5_wd 68
80
+12 mbedtls_sha512_init -
12
+12 mbedtls_sha256_init -
12
+12 mbedtls_sha1_init -
12
+12 mbedtls_md5_init -
12
+12 memset_func -
8
+8 sha512_update 4
8
+4 sha384_update 4
8
+4 sha256_update 12
8
-4 sha1_update 12
8
-4 sha256_process 16
-16 sha1_process 16
-16 hash_update_sha512 36
16
-20 hash_update_sha256 36
16
-20 hash_update_sha1 36
16
-20 MD5Init 56
36
-20 sha1_starts 60
36
-24 hash_update_sha384 36
-36 hash_init_sha384 52
-52 sha384_csum_wd 68
12
-56 sha256_starts 104
40
-64 sha256_padding 64
-64 sha1_padding 64
-64 hash_finish_sha384 72
-72 sha512_finish 152
36
-116 sha512_starts 168
40
-128 sha384_starts 168
40
-128 sha384_finish 152
4
-148 MD5Final 196
44
-152 sha512_base_do_finalize 160
-160 static.sha256_update 228
-228 static.sha1_update 240
-240 sha512_base_do_update 244
-244 MD5Update 260
-260 sha1_finish 300
36
-264 sha256_finish 404
36
-368 sha256_armv8_ce_process 428
-428 sha1_armv8_ce_process 484
-484 sha512_K 640
-640 sha512_block_fn 1212
-1212 MD5Transform 2552
-2552
And to start with, that's not bad. In fact, tossing LTO in before
mbedTLS
only changes the top-line a little: aarch64: (for 1/1 boards) all +5120.0 bss -16.0 data -64.0 rodata +200.0 text +5000.0 qemu_arm64 : all +5120 bss -16 data -64 rodata +200
text
+5000 u-boot: add: 19/-18, grow: 11/-7 bytes: 14696/-7884
(6812)
But, is there something we can do still? mbedTLS is a more robust solution and I'm accepting there will be growth. But still the process/start/finish is much larger. Is there something configurable there?
I have investigated all those MbedTLS native functions with big-size
(_process/_update/_finish). For MD5 and SHA1, we don't have turnable configs. For SHA256 and SHA512, there are a few configs:
- Performance configs only for Armv8/a64. I didn't turn that on, which might affect the target size as well.
- Smaller implementation with lower size (only for non-Armv8/a64) at
the
expense of losing performance. I didn't enable both, as #1 is more for performance and might
potentially
increase target size; #2 compromises the performance and only for non-Armv8/a64. Looks like that both don't help in reducing the size of qemu_arm64. But I will try #1 on qemu_arm64 and #2 on sandbox and let you know the size impact soon.
The smaller footprint implementation for SHA256/512 can reduce the
target
size significantly on those "<hash>_process()" functions. Please see below output from
buildman:
aarch64: (for 2/2 boards) all -1468.0 bss +16.0 data -64.0 rodata
+200.0
text -1620.0 qemu_arm64 : all +4608 bss +80 data -64 rodata +200 text +4392 u-boot: add: 29/-17, grow: 12/-16 bytes: 13072/-8304
(4768)
nanopi_a64 : all -7544 bss -48 data -64 rodata +200 text
-7632 u-boot: add: 21/-8, grow: 4/-8 bytes: 10692/-4364 (6328) sandbox: (for 1/1 boards) all +19312.0 data +1440.0 rodata -4128.0
text
+22000.0 sandbox : all +19312 data +1440 rodata -4128 text
+22000
u-boot: add: 258/-206, grow: 122/-59 bytes: 90286/-76286
(14000)
Since this is a trade-off between size and performance, I will add one
more
kconfig to allow the user to turn it on/off. What are your thoughts?
On the other hand, the "Armv8/a64 only" options depend on NEON instructions, so I will keep them off.
Yes, we should have it as a Kconfig option. Can you perhaps do a real-world performance test (so some not-qemu platform) where you sha256 a big hunk of memory? That might help inform the defaults.
Below are a few tests I did on my Ubuntu PC, leveraging the benchmark tool. I measured two key factors, throughput and the number of CPU cycles per byte. Each process was done with a 1KB data block. The throughput was measured in an average data process (KiB/s) in 10 seconds. The number of CPU cycles per byte is an average number of processing 4096 times with the same data block size.
See below result:
Original: SHA-256 : 87972 KiB/s, 23 cycles/byte SHA-512 : 107328 KiB/s, 20 cycles/byte
Smaller implementations: SHA-256 : 65872 KiB/s, 36 cycles/byte SHA-512 : 94059 KiB/s, 23 cycles/byte
Please note that this is not very precise as it depends on many other factors, but we still can take this as a reference. As we can see here, the "smaller" one is not too bad. I would put the kconfig for the "smaller" one as a default config when MbedTLS is selected for qemu64 and sandbox. Other platforms can decide what they want depends on the real performance on each platform. Is this fine for you? Please feel free to advise.
Thanks. Yes, we should just default to "smaller" for everyone.
Also, sorry, I don't understand your comment about NEON instructions. Is it the kind of thing where there's too much core variability in terms of who has what? If so, it too should be an option, but not enabled by default.
The Armv8/A64 accelerating config depends on some of the NEON instructions
for example, vaddq_u32() and vaddq_u64(). Please see this link for the reference: https://developer.arm.com/documentation/dht0002/a/Introducing-NEON/Developin... I think we are missing this in U-Boot. I would suggest ignoring it at the moment (keep them all off).
Ah OK, follow-up work then.