[RFC PATCH v2 0/3] Coroutines

This series introduces a simple coroutines framework and uses it to speed up the efi_init_obj_list() function. It is just an example to trigger discussions; hopefully other places in U-Boot can benefit from a similar treatment. Suggestions are welcome.
I came up with this idea after analyzing some profiling data (ftrace) taken during the execution of the "printenv -e" command on a Kria KV260 board. When the command is first invoked, efi_init_obj_list() is called which takes a significant amount of time (2.872 seconds). This time can be split into 0.570 s for efi_disks_register(), 0.811 s for efi_tcg2_register() and 1.418 s for efi_tcg2_do_initial_measurement(). All the other child functions are much quicker. Another interesting observation is that a large part of the time is actually spent waiting in __udelay(). More precisely: - For efi_disks_register(): 421 ms / 570 ms = 73.8 % spent in __udelay() - For efi_tcg2_register(): 805 ms / 811 ms = 99.1 % spent in __udelay() - For efi_tcg2_do_initial_measurement(): 1395.025 ms / 1418.372 ms = 98.3 % spent in __udelay()
Given the above data, it is reasonable to think that a nice speedup could be obtained if these functions could somehow be run in parallel. efi_tcg2_do_initial_measurement() unfortunately needs to be excluded because it depends on the first two. But efi_disks_register() and efi_tcg2_register() are clearly independant and are therefore good candidates for concurrent execution.
So, I considered two options:
- Spin up a secondary core to take care of one function while the other one runs on the main core - Introduce some kind of software scheduling
I quickly ruled out the first one for several reasons: initializing a secondary core is typically quite hardware-specific, it would not scale well if more functions than available cores would need to be run in parallel, it would make debugging harder etc. Software scheduling however can be accomplished quite easily, especially since we don't need to consider preemptive multitasking. Coroutines [1] for example can perfectly do the job. They provide a way to save and restore the execution context (registers and stack). Here is how it look like:
static void do_some_work(int v) { int i;
for (i = 0; i < 5; i++) { printf("%d", v); /* Save context and resume main thread */ co_yield(); } }
static void co1(void *arg) { do_some_work(1); /* Mark coroutine as "done" and resume main thread */ co_exit(); }
static void co2(void *arg) { do_some_work(2); co_exit(); }
void main_thread(void) { struct co *co1, *co2;
co1 = co_create(co1, ...); co2 = co_create(co2, ...);
do { printf("A"); if (!co1->done) { /* Invoke or resume first coroutine */ co_resume(co1); } printf("B"); if (!cor21->done) { /* Invoke or resume second coroutine */ co_resume(co2); } } while (!(co1->done && co2->done));
/* At this point, co1 and co2 have both called co_exit() */ }
The above example would print: A1B2A1B2A1B2A1B2A1B2.
- The first commit introduces the coroutine framework, loosely based on libaco [2]. The code was simplified and reformatted to better suit U-Boot. - The second commit modifies efi_init_obj_list() in order to turn efi_disks_register() and efi_tcg2_register() into coroutines when COROUTINES in enabled. On a KV260 board with a SD card inserted, this saves about .6 second (2.2 s instead of 2.8 s). - The third commit applies coroutines to usb_init(), which can significantly reduce the time it takes to scan multiple buses. Tested on arm64 QEMU with 4 XHCI buses: the USB scan takes 2.2 s instead of 5.8 s.
[1] https://en.wikipedia.org/wiki/Coroutine [2] https://github.com/hnes/libaco/
Changes in v2
- Remove x86 and x86_64 arch code since it is untested - Add missing SPDX license tag to arch/arm/cpu/armv8/co_switch.S - Change Apache-2.0 SPDX license tag to "Apache-2.0 OR GPL-2.0-or-later" - Apply coroutines to the USB bus scan in usb_init() Jerome Forissier (3): Introduce coroutines framework efi_loader: optimize efi_init_obj_list() with coroutines usb: scan multiple buses simultaneously with coroutines
arch/arm/cpu/armv8/Makefile | 1 + arch/arm/cpu/armv8/co_switch.S | 36 +++++++ drivers/usb/host/usb-uclass.c | 152 +++++++++++++++++++++++++++++- include/coroutines.h | 130 ++++++++++++++++++++++++++ lib/Kconfig | 10 ++ lib/Makefile | 2 + lib/coroutines.c | 165 +++++++++++++++++++++++++++++++++ lib/efi_loader/efi_setup.c | 113 +++++++++++++++++++++- lib/time.c | 14 ++- 9 files changed, 615 insertions(+), 8 deletions(-) create mode 100644 arch/arm/cpu/armv8/co_switch.S create mode 100644 include/coroutines.h create mode 100644 lib/coroutines.c

Adds the COROUTINES Kconfig symbol which introduces a new internal API for coroutines support. As explained in the Kconfig file, this is meant to provide some kind of cooperative multi-tasking with the goal to improve performance by overlapping lengthy operations.
The API as well as the implementation is very much inspired from libaco [1]. The reference implementation is simplified to remove all things not needed in U-Boot, the coding style is updated, and the aco_ prefix is replaced by co_.
I believe the stack handling could be simplified: the stack of the main coroutine could probably probably be used by the secondary coroutines instead of allocating a new stack dynamically.
Only i386, x86_64 and aarch64 are supported at the moment. Other architectures need to provide a _co_switch() function in assembly.
Only aarch64 has been tested.
[1] https://github.com/hnes/libaco/
Signed-off-by: Jerome Forissier jerome.forissier@linaro.org --- arch/arm/cpu/armv8/Makefile | 1 + arch/arm/cpu/armv8/co_switch.S | 36 +++++++ include/coroutines.h | 130 ++++++++++++++++++++++++++ lib/Kconfig | 10 ++ lib/Makefile | 2 + lib/coroutines.c | 165 +++++++++++++++++++++++++++++++++ 6 files changed, 344 insertions(+) create mode 100644 arch/arm/cpu/armv8/co_switch.S create mode 100644 include/coroutines.h create mode 100644 lib/coroutines.c
diff --git a/arch/arm/cpu/armv8/Makefile b/arch/arm/cpu/armv8/Makefile index 2e71ff2dc97..6d07b6aa9f9 100644 --- a/arch/arm/cpu/armv8/Makefile +++ b/arch/arm/cpu/armv8/Makefile @@ -46,3 +46,4 @@ obj-$(CONFIG_TARGET_BCMNS3) += bcmns3/ obj-$(CONFIG_XEN) += xen/ obj-$(CONFIG_ARMV8_CE_SHA1) += sha1_ce_glue.o sha1_ce_core.o obj-$(CONFIG_ARMV8_CE_SHA256) += sha256_ce_glue.o sha256_ce_core.o +obj-$(CONFIG_COROUTINES) += co_switch.o diff --git a/arch/arm/cpu/armv8/co_switch.S b/arch/arm/cpu/armv8/co_switch.S new file mode 100644 index 00000000000..4405e89ec56 --- /dev/null +++ b/arch/arm/cpu/armv8/co_switch.S @@ -0,0 +1,36 @@ +/* SPDX-License-Identifier: GPL-2.0+ */ +/* void _co_switch(struct uco *from_co, struct uco *to_co); */ +.text +.globl _co_switch +.type _co_switch, @function +_co_switch: + // x0: from_co + // x1: to_co + // from_co and to_co layout: { pc, sp, x19-x29 } + + // Save context to from_co (x0) + // AAPCS64 says "A subroutine invocation must preserve the contents of the + // registers r19-r29 and SP" + adr x2, 1f // pc we should use to resume after this function + mov x3, sp + stp x2, x3, [x0, #0] // pc, sp + stp x19, x20, [x0, #16] + stp x21, x22, [x0, #32] + stp x23, x24, [x0, #48] + stp x25, x26, [x0, #64] + stp x27, x28, [x0, #80] + stp x29, x30, [x0, #96] + + // Load new context from to_co (x1) + ldp x2, x3, [x1, #0] // pc, sp + ldp x19, x20, [x1, #16] + ldp x21, x22, [x1, #32] + ldp x23, x24, [x1, #48] + ldp x25, x26, [x1, #64] + ldp x27, x28, [x1, #80] + ldp x29, x30, [x1, #96] + mov sp, x3 + br x2 + +1: // Return to the caller + ret diff --git a/include/coroutines.h b/include/coroutines.h new file mode 100644 index 00000000000..b85b656127c --- /dev/null +++ b/include/coroutines.h @@ -0,0 +1,130 @@ +/* SPDX-License-Identifier: Apache-2.0 OR GPL-2.0-or-later */ +/* + * Copyright 2018 Sen Han 00hnes@gmail.com + * Copyright 2025 Linaro Limited + */ + +#ifndef _COROUTINES_H_ +#define _COROUTINES_H_ + +#ifndef CONFIG_COROUTINES + +static inline void co_yield(void) {} +static inline void co_exit(void) {} + +#else + +#ifdef __UBOOT__ +#include <log.h> +#else +#include <assert.h> +#endif +#include <limits.h> +#include <stdbool.h> +#include <stdint.h> +#include <stdio.h> +#include <stdlib.h> +#include <string.h> +#include <time.h> + +#ifdef __aarch64__ +#define CO_REG_IDX_RETADDR 0 +#define CO_REG_IDX_SP 1 +#else +#error Architecture no supported +#endif + +struct co_save_stack { + void* ptr; + size_t sz; + size_t valid_sz; + size_t max_cpsz; /* max copy size in bytes */ +}; + +struct co_stack { + void *ptr; + size_t sz; + void *align_highptr; + void *align_retptr; + size_t align_validsz; + size_t align_limit; + struct co *owner; + void *real_ptr; + size_t real_sz; +}; + +struct co { + /* CPU state: callee-saved registers plus SP and PC */ + void *reg[14]; // pc, sp, x19-x29, x30 (lr) + + struct co *main_co; + void *arg; + bool done; + + void (*fp)(void); + + struct co_save_stack save_stack; + struct co_stack *stack; +}; + +extern struct co *current_co; + +static inline struct co *co_get_co(void) +{ + return current_co; +} + +static inline void *co_get_arg(void) +{ + return co_get_co()->arg; +} + +struct co_stack *co_stack_new(size_t sz); + +void co_stack_destroy(struct co_stack *s); + +struct co *co_create(struct co *main_co, + struct co_stack *stack, + size_t save_stack_sz, void (*fp)(void), + void *arg); + +void co_resume(struct co *resume_co); + +void co_destroy(struct co *co); + +void *_co_switch(struct co *from_co, struct co *to_co); + +static inline void _co_yield_to_main_co(struct co *yield_co) +{ + assert(yield_co); + assert(yield_co->main_co); + _co_switch(yield_co, yield_co->main_co); +} + +static inline void co_yield(void) +{ + if (current_co) + _co_yield_to_main_co(current_co); +} + +static inline bool co_is_main_co(struct co *co) +{ + return !co->main_co; +} + +static inline void co_exit(void) +{ + struct co *co = co_get_co(); + + if (!co) + return; + co->done = true; + assert(co->stack->owner == co); + co->stack->owner = NULL; + co->stack->align_validsz = 0; + _co_yield_to_main_co(co); + assert(false); +} + +#endif /* CONFIG_COROUTINES */ +#endif /* _COROUTINES_H_ */ diff --git a/lib/Kconfig b/lib/Kconfig index 8f1a96d98c4..b6c1380b927 100644 --- a/lib/Kconfig +++ b/lib/Kconfig @@ -1226,6 +1226,16 @@ config PHANDLE_CHECK_SEQ enable this config option to distinguish them using phandles in fdtdec_get_alias_seq() function.
+config COROUTINES + bool "Enable coroutine support" + help + Coroutines allow to implement a simple form of cooperative + multi-tasking. The main thread of execution registers one or + more functions as coroutine entry points, then it schedules one + of them. At any point the scheduled coroutine may yield, that is, + suspend its execution and return back to the main thread. At this + point another coroutine may be scheduled and so on until all the + registered coroutines are done. endmenu
source "lib/fwu_updates/Kconfig" diff --git a/lib/Makefile b/lib/Makefile index 5cb3278d2ef..7b809151f5a 100644 --- a/lib/Makefile +++ b/lib/Makefile @@ -159,6 +159,8 @@ obj-$(CONFIG_LIB_ELF) += elf.o
obj-$(CONFIG_$(PHASE_)SEMIHOSTING) += semihosting.o
+obj-$(CONFIG_COROUTINES) += coroutines.o + # # Build a fast OID lookup registry from include/linux/oid_registry.h # diff --git a/lib/coroutines.c b/lib/coroutines.c new file mode 100644 index 00000000000..20c5aba5510 --- /dev/null +++ b/lib/coroutines.c @@ -0,0 +1,165 @@ +// SPDX-License-Identifier: Apache-2.0 OR GPL-2.0-or-later + +// Copyright 2018 Sen Han 00hnes@gmail.com +// Copyright 2025 Linaro Limited + +#include <coroutines.h> +#include <stdio.h> +#include <stdint.h> + + +/* Current co-routine */ +struct co *current_co; + +struct co_stack *co_stack_new(size_t sz) +{ + struct co_stack *p = calloc(1, sizeof(*p)); + uintptr_t u_p; + + if (!p) + return NULL; + + if (sz < 4096) + sz = 4096; + + p->sz = sz; + p->ptr = malloc(sz); + if (!p->ptr) { + free(p); + return NULL; + } + + p->owner = NULL; + u_p = (uintptr_t)(p->sz - (sizeof(void*) << 1) + (uintptr_t)p->ptr); + u_p = (u_p >> 4) << 4; + p->align_highptr = (void*)u_p; + p->align_retptr = (void*)(u_p - sizeof(void*)); + assert(p->sz > (16 + (sizeof(void*) << 1) + sizeof(void*))); + p->align_limit = p->sz - 16 - (sizeof(void*) << 1); + + return p; +} + +void co_stack_destroy(struct co_stack *s){ + if (!s) + return; + free(s->ptr); + free(s); +} + +struct co *co_create(struct co *main_co, + struct co_stack *stack, + size_t save_stack_sz, + void (*fp)(void), void *arg) +{ + struct co *p = malloc(sizeof(*p)); + assert(p); + memset(p, 0, sizeof(*p)); + + if (main_co) { + assert(stack); + p->stack = stack; + p->reg[CO_REG_IDX_RETADDR] = (void *)fp; + // FIXME original code uses align_retptr; causes a crash + p->reg[CO_REG_IDX_SP] = p->stack->align_highptr; + p->main_co = main_co; + p->arg = arg; + p->fp = fp; + if (!save_stack_sz) + save_stack_sz = 64; + p->save_stack.ptr = malloc(save_stack_sz); + assert(p->save_stack.ptr); + p->save_stack.sz = save_stack_sz; + p->save_stack.valid_sz = 0; + } else { + p->main_co = NULL; + p->arg = arg; + p->fp = fp; + p->stack = NULL; + p->save_stack.ptr = NULL; + } + return p; +} + +static void grab_stack(struct co *resume_co) +{ + struct co *owner_co = resume_co->stack->owner; + + if (owner_co) { + assert(owner_co->stack == resume_co->stack); + assert((uintptr_t)(owner_co->stack->align_retptr) >= + (uintptr_t)(owner_co->reg[CO_REG_IDX_SP])); + assert((uintptr_t)owner_co->stack->align_highptr - + (uintptr_t)owner_co->stack->align_limit + <= (uintptr_t)owner_co->reg[CO_REG_IDX_SP]); + owner_co->save_stack.valid_sz = + (uintptr_t)owner_co->stack->align_retptr - + (uintptr_t)owner_co->reg[CO_REG_IDX_SP]; + if (owner_co->save_stack.sz < owner_co->save_stack.valid_sz) { + free(owner_co->save_stack.ptr); + owner_co->save_stack.ptr = NULL; + do { + owner_co->save_stack.sz <<= 1; + assert(owner_co->save_stack.sz > 0); + } while (owner_co->save_stack.sz < + owner_co->save_stack.valid_sz); + owner_co->save_stack.ptr = + malloc(owner_co->save_stack.sz); + assert(owner_co->save_stack.ptr); + } + if (owner_co->save_stack.valid_sz > 0) + memcpy(owner_co->save_stack.ptr, + owner_co->reg[CO_REG_IDX_SP], + owner_co->save_stack.valid_sz); + if (owner_co->save_stack.valid_sz > + owner_co->save_stack.max_cpsz) + owner_co->save_stack.max_cpsz = + owner_co->save_stack.valid_sz; + owner_co->stack->owner = NULL; + owner_co->stack->align_validsz = 0; + } + assert(!resume_co->stack->owner); + assert(resume_co->save_stack.valid_sz <= + resume_co->stack->align_limit - sizeof(void *)); + if (resume_co->save_stack.valid_sz > 0) + memcpy((void*) + (uintptr_t)(resume_co->stack->align_retptr) - + resume_co->save_stack.valid_sz, + resume_co->save_stack.ptr, + resume_co->save_stack.valid_sz); + if (resume_co->save_stack.valid_sz > resume_co->save_stack.max_cpsz) + resume_co->save_stack.max_cpsz = resume_co->save_stack.valid_sz; + resume_co->stack->align_validsz = + resume_co->save_stack.valid_sz + sizeof(void *); + resume_co->stack->owner = resume_co; +} + +void co_resume(struct co *resume_co) +{ + assert(resume_co && resume_co->main_co && !resume_co->done); + + if (resume_co->stack->owner != resume_co) + grab_stack(resume_co); + + current_co = resume_co; + _co_switch(resume_co->main_co, resume_co); + current_co = resume_co->main_co; +} + +void co_destroy(struct co *co){ + if (!co) + return; + + if(co_is_main_co(co)){ + free(co); + current_co = NULL; + } else { + if(co->stack->owner == co){ + co->stack->owner = NULL; + co->stack->align_validsz = 0; + } + free(co->save_stack.ptr); + co->save_stack.ptr = NULL; + free(co); + } +}

On 1/28/25 11:19, Jerome Forissier wrote:
Adds the COROUTINES Kconfig symbol which introduces a new internal API for coroutines support. As explained in the Kconfig file, this is meant to provide some kind of cooperative multi-tasking with the goal to improve performance by overlapping lengthy operations.
The API as well as the implementation is very much inspired from libaco [1]. The reference implementation is simplified to remove all things not needed in U-Boot, the coding style is updated, and the aco_ prefix is replaced by co_.
I believe the stack handling could be simplified: the stack of the main
Please use imperative mood.
coroutine could probably probably be used by the secondary coroutines instead of allocating a new stack dynamically.
Only i386, x86_64 and aarch64 are supported at the moment. Other architectures need to provide a _co_switch() function in assembly.
Likely should be updated.
Only aarch64 has been tested.
[1] https://github.com/hnes/libaco/
Signed-off-by: Jerome Forissier jerome.forissier@linaro.org
arch/arm/cpu/armv8/Makefile | 1 + arch/arm/cpu/armv8/co_switch.S | 36 +++++++ include/coroutines.h | 130 ++++++++++++++++++++++++++ lib/Kconfig | 10 ++ lib/Makefile | 2 + lib/coroutines.c | 165 +++++++++++++++++++++++++++++++++
Checkpatch is reporting some issues. Please fix them.
total: 17 errors, 19 warnings, 1 checks, 359 lines checked
6 files changed, 344 insertions(+) create mode 100644 arch/arm/cpu/armv8/co_switch.S create mode 100644 include/coroutines.h create mode 100644 lib/coroutines.c
diff --git a/arch/arm/cpu/armv8/Makefile b/arch/arm/cpu/armv8/Makefile index 2e71ff2dc97..6d07b6aa9f9 100644 --- a/arch/arm/cpu/armv8/Makefile +++ b/arch/arm/cpu/armv8/Makefile @@ -46,3 +46,4 @@ obj-$(CONFIG_TARGET_BCMNS3) += bcmns3/ obj-$(CONFIG_XEN) += xen/ obj-$(CONFIG_ARMV8_CE_SHA1) += sha1_ce_glue.o sha1_ce_core.o obj-$(CONFIG_ARMV8_CE_SHA256) += sha256_ce_glue.o sha256_ce_core.o +obj-$(CONFIG_COROUTINES) += co_switch.o diff --git a/arch/arm/cpu/armv8/co_switch.S b/arch/arm/cpu/armv8/co_switch.S new file mode 100644 index 00000000000..4405e89ec56 --- /dev/null +++ b/arch/arm/cpu/armv8/co_switch.S @@ -0,0 +1,36 @@ +/* SPDX-License-Identifier: GPL-2.0+ */ +/* void _co_switch(struct uco *from_co, struct uco *to_co); */ +.text +.globl _co_switch +.type _co_switch, @function +_co_switch:
- // x0: from_co
- // x1: to_co
- // from_co and to_co layout: { pc, sp, x19-x29 }
- // Save context to from_co (x0)
- // AAPCS64 says "A subroutine invocation must preserve the contents of the
- // registers r19-r29 and SP"
- adr x2, 1f // pc we should use to resume after this function
- mov x3, sp
- stp x2, x3, [x0, #0] // pc, sp
- stp x19, x20, [x0, #16]
- stp x21, x22, [x0, #32]
- stp x23, x24, [x0, #48]
- stp x25, x26, [x0, #64]
- stp x27, x28, [x0, #80]
- stp x29, x30, [x0, #96]
- // Load new context from to_co (x1)
- ldp x2, x3, [x1, #0] // pc, sp
- ldp x19, x20, [x1, #16]
- ldp x21, x22, [x1, #32]
- ldp x23, x24, [x1, #48]
- ldp x25, x26, [x1, #64]
- ldp x27, x28, [x1, #80]
- ldp x29, x30, [x1, #96]
- mov sp, x3
- br x2
+1: // Return to the caller
- ret
diff --git a/include/coroutines.h b/include/coroutines.h new file mode 100644 index 00000000000..b85b656127c --- /dev/null +++ b/include/coroutines.h @@ -0,0 +1,130 @@ +/* SPDX-License-Identifier: Apache-2.0 OR GPL-2.0-or-later */ +/*
- Copyright 2018 Sen Han 00hnes@gmail.com
- Copyright 2025 Linaro Limited
- */
+#ifndef _COROUTINES_H_ +#define _COROUTINES_H_
+#ifndef CONFIG_COROUTINES
+static inline void co_yield(void) {} +static inline void co_exit(void) {}
+#else
+#ifdef __UBOOT__ +#include <log.h> +#else +#include <assert.h> +#endif
Do we need ifdef at all? Are you testing it out of u-boot too? Or is this just untested code?
+#include <limits.h> +#include <stdbool.h> +#include <stdint.h> +#include <stdio.h> +#include <stdlib.h> +#include <string.h> +#include <time.h>
+#ifdef __aarch64__ +#define CO_REG_IDX_RETADDR 0 +#define CO_REG_IDX_SP 1 +#else +#error Architecture no supported +#endif
+struct co_save_stack {
- void* ptr;
- size_t sz;
- size_t valid_sz;
- size_t max_cpsz; /* max copy size in bytes */
+};
+struct co_stack {
- void *ptr;
- size_t sz;
- void *align_highptr;
- void *align_retptr;
- size_t align_validsz;
- size_t align_limit;
- struct co *owner;
- void *real_ptr;
- size_t real_sz;
+};
+struct co {
- /* CPU state: callee-saved registers plus SP and PC */
- void *reg[14]; // pc, sp, x19-x29, x30 (lr)
- struct co *main_co;
- void *arg;
- bool done;
- void (*fp)(void);
- struct co_save_stack save_stack;
- struct co_stack *stack;
+};
+extern struct co *current_co;
+static inline struct co *co_get_co(void) +{
- return current_co;
+}
+static inline void *co_get_arg(void) +{
- return co_get_co()->arg;
+}
+struct co_stack *co_stack_new(size_t sz);
+void co_stack_destroy(struct co_stack *s);
+struct co *co_create(struct co *main_co,
struct co_stack *stack,
size_t save_stack_sz, void (*fp)(void),
void *arg);
+void co_resume(struct co *resume_co);
+void co_destroy(struct co *co);
+void *_co_switch(struct co *from_co, struct co *to_co);
+static inline void _co_yield_to_main_co(struct co *yield_co) +{
- assert(yield_co);
- assert(yield_co->main_co);
- _co_switch(yield_co, yield_co->main_co);
+}
+static inline void co_yield(void) +{
- if (current_co)
_co_yield_to_main_co(current_co);
+}
+static inline bool co_is_main_co(struct co *co) +{
- return !co->main_co;
+}
+static inline void co_exit(void) +{
- struct co *co = co_get_co();
- if (!co)
return;
- co->done = true;
- assert(co->stack->owner == co);
- co->stack->owner = NULL;
- co->stack->align_validsz = 0;
- _co_yield_to_main_co(co);
- assert(false);
+}
+#endif /* CONFIG_COROUTINES */ +#endif /* _COROUTINES_H_ */ diff --git a/lib/Kconfig b/lib/Kconfig index 8f1a96d98c4..b6c1380b927 100644 --- a/lib/Kconfig +++ b/lib/Kconfig @@ -1226,6 +1226,16 @@ config PHANDLE_CHECK_SEQ enable this config option to distinguish them using phandles in fdtdec_get_alias_seq() function.
+config COROUTINES
- bool "Enable coroutine support"
- help
Coroutines allow to implement a simple form of cooperative
multi-tasking. The main thread of execution registers one or
more functions as coroutine entry points, then it schedules one
of them. At any point the scheduled coroutine may yield, that is,
suspend its execution and return back to the main thread. At this
point another coroutine may be scheduled and so on until all the
registered coroutines are done.
newline
endmenu
source "lib/fwu_updates/Kconfig" diff --git a/lib/Makefile b/lib/Makefile index 5cb3278d2ef..7b809151f5a 100644 --- a/lib/Makefile +++ b/lib/Makefile @@ -159,6 +159,8 @@ obj-$(CONFIG_LIB_ELF) += elf.o
obj-$(CONFIG_$(PHASE_)SEMIHOSTING) += semihosting.o
+obj-$(CONFIG_COROUTINES) += coroutines.o
- # # Build a fast OID lookup registry from include/linux/oid_registry.h #
diff --git a/lib/coroutines.c b/lib/coroutines.c new file mode 100644 index 00000000000..20c5aba5510 --- /dev/null +++ b/lib/coroutines.c @@ -0,0 +1,165 @@ +// SPDX-License-Identifier: Apache-2.0 OR GPL-2.0-or-later
+// Copyright 2018 Sen Han 00hnes@gmail.com +// Copyright 2025 Linaro Limited
+#include <coroutines.h> +#include <stdio.h> +#include <stdint.h>
+/* Current co-routine */ +struct co *current_co;
+struct co_stack *co_stack_new(size_t sz) +{
- struct co_stack *p = calloc(1, sizeof(*p));
- uintptr_t u_p;
- if (!p)
return NULL;
- if (sz < 4096)
sz = 4096;
- p->sz = sz;
- p->ptr = malloc(sz);
- if (!p->ptr) {
free(p);
return NULL;
- }
- p->owner = NULL;
- u_p = (uintptr_t)(p->sz - (sizeof(void*) << 1) + (uintptr_t)p->ptr);
- u_p = (u_p >> 4) << 4;
- p->align_highptr = (void*)u_p;
- p->align_retptr = (void*)(u_p - sizeof(void*));
- assert(p->sz > (16 + (sizeof(void*) << 1) + sizeof(void*)));
- p->align_limit = p->sz - 16 - (sizeof(void*) << 1);
- return p;
+}
+void co_stack_destroy(struct co_stack *s){
- if (!s)
return;
- free(s->ptr);
- free(s);
+}
+struct co *co_create(struct co *main_co,
struct co_stack *stack,
size_t save_stack_sz,
void (*fp)(void), void *arg)
+{
- struct co *p = malloc(sizeof(*p));
- assert(p);
- memset(p, 0, sizeof(*p));
- if (main_co) {
assert(stack);
p->stack = stack;
p->reg[CO_REG_IDX_RETADDR] = (void *)fp;
// FIXME original code uses align_retptr; causes a crash
p->reg[CO_REG_IDX_SP] = p->stack->align_highptr;
p->main_co = main_co;
p->arg = arg;
p->fp = fp;
if (!save_stack_sz)
save_stack_sz = 64;
p->save_stack.ptr = malloc(save_stack_sz);
assert(p->save_stack.ptr);
p->save_stack.sz = save_stack_sz;
p->save_stack.valid_sz = 0;
- } else {
p->main_co = NULL;
p->arg = arg;
p->fp = fp;
p->stack = NULL;
p->save_stack.ptr = NULL;
- }
- return p;
+}
+static void grab_stack(struct co *resume_co) +{
- struct co *owner_co = resume_co->stack->owner;
- if (owner_co) {
assert(owner_co->stack == resume_co->stack);
assert((uintptr_t)(owner_co->stack->align_retptr) >=
(uintptr_t)(owner_co->reg[CO_REG_IDX_SP]));
assert((uintptr_t)owner_co->stack->align_highptr -
(uintptr_t)owner_co->stack->align_limit
<= (uintptr_t)owner_co->reg[CO_REG_IDX_SP]);
owner_co->save_stack.valid_sz =
(uintptr_t)owner_co->stack->align_retptr -
(uintptr_t)owner_co->reg[CO_REG_IDX_SP];
if (owner_co->save_stack.sz < owner_co->save_stack.valid_sz) {
free(owner_co->save_stack.ptr);
owner_co->save_stack.ptr = NULL;
do {
owner_co->save_stack.sz <<= 1;
assert(owner_co->save_stack.sz > 0);
} while (owner_co->save_stack.sz <
owner_co->save_stack.valid_sz);
owner_co->save_stack.ptr =
malloc(owner_co->save_stack.sz);
assert(owner_co->save_stack.ptr);
}
if (owner_co->save_stack.valid_sz > 0)
memcpy(owner_co->save_stack.ptr,
owner_co->reg[CO_REG_IDX_SP],
owner_co->save_stack.valid_sz);
if (owner_co->save_stack.valid_sz >
owner_co->save_stack.max_cpsz)
owner_co->save_stack.max_cpsz =
owner_co->save_stack.valid_sz;
owner_co->stack->owner = NULL;
owner_co->stack->align_validsz = 0;
- }
- assert(!resume_co->stack->owner);
- assert(resume_co->save_stack.valid_sz <=
resume_co->stack->align_limit - sizeof(void *));
- if (resume_co->save_stack.valid_sz > 0)
memcpy((void*)
(uintptr_t)(resume_co->stack->align_retptr) -
resume_co->save_stack.valid_sz,
resume_co->save_stack.ptr,
resume_co->save_stack.valid_sz);
- if (resume_co->save_stack.valid_sz > resume_co->save_stack.max_cpsz)
resume_co->save_stack.max_cpsz = resume_co->save_stack.valid_sz;
- resume_co->stack->align_validsz =
resume_co->save_stack.valid_sz + sizeof(void *);
- resume_co->stack->owner = resume_co;
+}
+void co_resume(struct co *resume_co) +{
- assert(resume_co && resume_co->main_co && !resume_co->done);
- if (resume_co->stack->owner != resume_co)
grab_stack(resume_co);
- current_co = resume_co;
- _co_switch(resume_co->main_co, resume_co);
- current_co = resume_co->main_co;
+}
+void co_destroy(struct co *co){
obviously this is not u-boot coding style. I expect checkpatch reports this too.
- if (!co)
return;
- if(co_is_main_co(co)){
free(co);
current_co = NULL;
- } else {
if(co->stack->owner == co){
co->stack->owner = NULL;
co->stack->align_validsz = 0;
}
free(co->save_stack.ptr);
co->save_stack.ptr = NULL;
free(co);
- }
+}
Thanks, Michal

On 1/28/25 12:09, Michal Simek wrote:
On 1/28/25 11:19, Jerome Forissier wrote:
Adds the COROUTINES Kconfig symbol which introduces a new internal API for coroutines support. As explained in the Kconfig file, this is meant to provide some kind of cooperative multi-tasking with the goal to improve performance by overlapping lengthy operations.
The API as well as the implementation is very much inspired from libaco [1]. The reference implementation is simplified to remove all things not needed in U-Boot, the coding style is updated, and the aco_ prefix is replaced by co_.
I believe the stack handling could be simplified: the stack of the main
Please use imperative mood.
OK. I will drop the remark about stack handling simplification because it is not very helpful here.
coroutine could probably probably be used by the secondary coroutines instead of allocating a new stack dynamically.
Only i386, x86_64 and aarch64 are supported at the moment. Other architectures need to provide a _co_switch() function in assembly.
Likely should be updated.
OK
Only aarch64 has been tested.
[1] https://github.com/hnes/libaco/
Signed-off-by: Jerome Forissier jerome.forissier@linaro.org
arch/arm/cpu/armv8/Makefile | 1 + arch/arm/cpu/armv8/co_switch.S | 36 +++++++ include/coroutines.h | 130 ++++++++++++++++++++++++++ lib/Kconfig | 10 ++ lib/Makefile | 2 + lib/coroutines.c | 165 +++++++++++++++++++++++++++++++++
Checkpatch is reporting some issues. Please fix them.
total: 17 errors, 19 warnings, 1 checks, 359 lines checked
All the checkpatch issues will be fixed in v3.
6 files changed, 344 insertions(+) create mode 100644 arch/arm/cpu/armv8/co_switch.S create mode 100644 include/coroutines.h create mode 100644 lib/coroutines.c
diff --git a/arch/arm/cpu/armv8/Makefile b/arch/arm/cpu/armv8/Makefile index 2e71ff2dc97..6d07b6aa9f9 100644 --- a/arch/arm/cpu/armv8/Makefile +++ b/arch/arm/cpu/armv8/Makefile @@ -46,3 +46,4 @@ obj-$(CONFIG_TARGET_BCMNS3) += bcmns3/ obj-$(CONFIG_XEN) += xen/ obj-$(CONFIG_ARMV8_CE_SHA1) += sha1_ce_glue.o sha1_ce_core.o obj-$(CONFIG_ARMV8_CE_SHA256) += sha256_ce_glue.o sha256_ce_core.o +obj-$(CONFIG_COROUTINES) += co_switch.o diff --git a/arch/arm/cpu/armv8/co_switch.S b/arch/arm/cpu/armv8/co_switch.S new file mode 100644 index 00000000000..4405e89ec56 --- /dev/null +++ b/arch/arm/cpu/armv8/co_switch.S @@ -0,0 +1,36 @@ +/* SPDX-License-Identifier: GPL-2.0+ */ +/* void _co_switch(struct uco *from_co, struct uco *to_co); */ +.text +.globl _co_switch +.type _co_switch, @function +_co_switch: + // x0: from_co + // x1: to_co + // from_co and to_co layout: { pc, sp, x19-x29 }
+ // Save context to from_co (x0) + // AAPCS64 says "A subroutine invocation must preserve the contents of the + // registers r19-r29 and SP" + adr x2, 1f // pc we should use to resume after this function + mov x3, sp + stp x2, x3, [x0, #0] // pc, sp + stp x19, x20, [x0, #16] + stp x21, x22, [x0, #32] + stp x23, x24, [x0, #48] + stp x25, x26, [x0, #64] + stp x27, x28, [x0, #80] + stp x29, x30, [x0, #96]
+ // Load new context from to_co (x1) + ldp x2, x3, [x1, #0] // pc, sp + ldp x19, x20, [x1, #16] + ldp x21, x22, [x1, #32] + ldp x23, x24, [x1, #48] + ldp x25, x26, [x1, #64] + ldp x27, x28, [x1, #80] + ldp x29, x30, [x1, #96] + mov sp, x3 + br x2
+1: // Return to the caller + ret diff --git a/include/coroutines.h b/include/coroutines.h new file mode 100644 index 00000000000..b85b656127c --- /dev/null +++ b/include/coroutines.h @@ -0,0 +1,130 @@ +/* SPDX-License-Identifier: Apache-2.0 OR GPL-2.0-or-later */ +/*
- Copyright 2018 Sen Han 00hnes@gmail.com
- Copyright 2025 Linaro Limited
- */
+#ifndef _COROUTINES_H_ +#define _COROUTINES_H_
+#ifndef CONFIG_COROUTINES
+static inline void co_yield(void) {} +static inline void co_exit(void) {}
+#else
+#ifdef __UBOOT__ +#include <log.h> +#else +#include <assert.h> +#endif
Do we need ifdef at all? Are you testing it out of u-boot too? Or is this just untested code?
It is a leftover from when I used to test outside U-Boot (in a regular Linux application) but that is not useful anymore. I'll remove the conditional.
+#include <limits.h> +#include <stdbool.h> +#include <stdint.h> +#include <stdio.h> +#include <stdlib.h> +#include <string.h> +#include <time.h>
+#ifdef __aarch64__ +#define CO_REG_IDX_RETADDR 0 +#define CO_REG_IDX_SP 1 +#else +#error Architecture no supported +#endif
+struct co_save_stack { + void* ptr; + size_t sz; + size_t valid_sz; + size_t max_cpsz; /* max copy size in bytes */ +};
+struct co_stack { + void *ptr; + size_t sz; + void *align_highptr; + void *align_retptr; + size_t align_validsz; + size_t align_limit; + struct co *owner; + void *real_ptr; + size_t real_sz; +};
+struct co { + /* CPU state: callee-saved registers plus SP and PC */ + void *reg[14]; // pc, sp, x19-x29, x30 (lr)
+ struct co *main_co; + void *arg; + bool done;
+ void (*fp)(void);
+ struct co_save_stack save_stack; + struct co_stack *stack; +};
+extern struct co *current_co;
+static inline struct co *co_get_co(void) +{ + return current_co; +}
+static inline void *co_get_arg(void) +{ + return co_get_co()->arg; +}
+struct co_stack *co_stack_new(size_t sz);
+void co_stack_destroy(struct co_stack *s);
+struct co *co_create(struct co *main_co, + struct co_stack *stack, + size_t save_stack_sz, void (*fp)(void), + void *arg);
+void co_resume(struct co *resume_co);
+void co_destroy(struct co *co);
+void *_co_switch(struct co *from_co, struct co *to_co);
+static inline void _co_yield_to_main_co(struct co *yield_co) +{ + assert(yield_co); + assert(yield_co->main_co); + _co_switch(yield_co, yield_co->main_co); +}
+static inline void co_yield(void) +{ + if (current_co) + _co_yield_to_main_co(current_co); +}
+static inline bool co_is_main_co(struct co *co) +{ + return !co->main_co; +}
+static inline void co_exit(void) +{ + struct co *co = co_get_co();
+ if (!co) + return; + co->done = true; + assert(co->stack->owner == co); + co->stack->owner = NULL; + co->stack->align_validsz = 0; + _co_yield_to_main_co(co); + assert(false); +}
+#endif /* CONFIG_COROUTINES */ +#endif /* _COROUTINES_H_ */ diff --git a/lib/Kconfig b/lib/Kconfig index 8f1a96d98c4..b6c1380b927 100644 --- a/lib/Kconfig +++ b/lib/Kconfig @@ -1226,6 +1226,16 @@ config PHANDLE_CHECK_SEQ enable this config option to distinguish them using phandles in fdtdec_get_alias_seq() function. +config COROUTINES + bool "Enable coroutine support" + help + Coroutines allow to implement a simple form of cooperative + multi-tasking. The main thread of execution registers one or + more functions as coroutine entry points, then it schedules one + of them. At any point the scheduled coroutine may yield, that is, + suspend its execution and return back to the main thread. At this + point another coroutine may be scheduled and so on until all the + registered coroutines are done.
newline
endmenu source "lib/fwu_updates/Kconfig" diff --git a/lib/Makefile b/lib/Makefile index 5cb3278d2ef..7b809151f5a 100644 --- a/lib/Makefile +++ b/lib/Makefile @@ -159,6 +159,8 @@ obj-$(CONFIG_LIB_ELF) += elf.o obj-$(CONFIG_$(PHASE_)SEMIHOSTING) += semihosting.o +obj-$(CONFIG_COROUTINES) += coroutines.o
# # Build a fast OID lookup registry from include/linux/oid_registry.h # diff --git a/lib/coroutines.c b/lib/coroutines.c new file mode 100644 index 00000000000..20c5aba5510 --- /dev/null +++ b/lib/coroutines.c @@ -0,0 +1,165 @@ +// SPDX-License-Identifier: Apache-2.0 OR GPL-2.0-or-later
+// Copyright 2018 Sen Han 00hnes@gmail.com +// Copyright 2025 Linaro Limited
+#include <coroutines.h> +#include <stdio.h> +#include <stdint.h>
+/* Current co-routine */ +struct co *current_co;
+struct co_stack *co_stack_new(size_t sz) +{ + struct co_stack *p = calloc(1, sizeof(*p)); + uintptr_t u_p;
+ if (!p) + return NULL;
+ if (sz < 4096) + sz = 4096;
+ p->sz = sz; + p->ptr = malloc(sz); + if (!p->ptr) { + free(p); + return NULL; + }
+ p->owner = NULL; + u_p = (uintptr_t)(p->sz - (sizeof(void*) << 1) + (uintptr_t)p->ptr); + u_p = (u_p >> 4) << 4; + p->align_highptr = (void*)u_p; + p->align_retptr = (void*)(u_p - sizeof(void*)); + assert(p->sz > (16 + (sizeof(void*) << 1) + sizeof(void*))); + p->align_limit = p->sz - 16 - (sizeof(void*) << 1);
+ return p; +}
+void co_stack_destroy(struct co_stack *s){ + if (!s) + return; + free(s->ptr); + free(s); +}
+struct co *co_create(struct co *main_co, + struct co_stack *stack, + size_t save_stack_sz, + void (*fp)(void), void *arg) +{ + struct co *p = malloc(sizeof(*p)); + assert(p); + memset(p, 0, sizeof(*p));
+ if (main_co) { + assert(stack); + p->stack = stack; + p->reg[CO_REG_IDX_RETADDR] = (void *)fp; + // FIXME original code uses align_retptr; causes a crash + p->reg[CO_REG_IDX_SP] = p->stack->align_highptr; + p->main_co = main_co; + p->arg = arg; + p->fp = fp; + if (!save_stack_sz) + save_stack_sz = 64; + p->save_stack.ptr = malloc(save_stack_sz); + assert(p->save_stack.ptr); + p->save_stack.sz = save_stack_sz; + p->save_stack.valid_sz = 0; + } else { + p->main_co = NULL; + p->arg = arg; + p->fp = fp; + p->stack = NULL; + p->save_stack.ptr = NULL; + } + return p; +}
+static void grab_stack(struct co *resume_co) +{ + struct co *owner_co = resume_co->stack->owner;
+ if (owner_co) { + assert(owner_co->stack == resume_co->stack); + assert((uintptr_t)(owner_co->stack->align_retptr) >= + (uintptr_t)(owner_co->reg[CO_REG_IDX_SP])); + assert((uintptr_t)owner_co->stack->align_highptr - + (uintptr_t)owner_co->stack->align_limit + <= (uintptr_t)owner_co->reg[CO_REG_IDX_SP]); + owner_co->save_stack.valid_sz = + (uintptr_t)owner_co->stack->align_retptr - + (uintptr_t)owner_co->reg[CO_REG_IDX_SP]; + if (owner_co->save_stack.sz < owner_co->save_stack.valid_sz) { + free(owner_co->save_stack.ptr); + owner_co->save_stack.ptr = NULL; + do { + owner_co->save_stack.sz <<= 1; + assert(owner_co->save_stack.sz > 0); + } while (owner_co->save_stack.sz < + owner_co->save_stack.valid_sz); + owner_co->save_stack.ptr = + malloc(owner_co->save_stack.sz); + assert(owner_co->save_stack.ptr); + } + if (owner_co->save_stack.valid_sz > 0) + memcpy(owner_co->save_stack.ptr, + owner_co->reg[CO_REG_IDX_SP], + owner_co->save_stack.valid_sz); + if (owner_co->save_stack.valid_sz > + owner_co->save_stack.max_cpsz) + owner_co->save_stack.max_cpsz = + owner_co->save_stack.valid_sz; + owner_co->stack->owner = NULL; + owner_co->stack->align_validsz = 0; + } + assert(!resume_co->stack->owner); + assert(resume_co->save_stack.valid_sz <= + resume_co->stack->align_limit - sizeof(void *)); + if (resume_co->save_stack.valid_sz > 0) + memcpy((void*) + (uintptr_t)(resume_co->stack->align_retptr) - + resume_co->save_stack.valid_sz, + resume_co->save_stack.ptr, + resume_co->save_stack.valid_sz); + if (resume_co->save_stack.valid_sz > resume_co->save_stack.max_cpsz) + resume_co->save_stack.max_cpsz = resume_co->save_stack.valid_sz; + resume_co->stack->align_validsz = + resume_co->save_stack.valid_sz + sizeof(void *); + resume_co->stack->owner = resume_co; +}
+void co_resume(struct co *resume_co) +{ + assert(resume_co && resume_co->main_co && !resume_co->done);
+ if (resume_co->stack->owner != resume_co) + grab_stack(resume_co);
+ current_co = resume_co; + _co_switch(resume_co->main_co, resume_co); + current_co = resume_co->main_co; +}
+void co_destroy(struct co *co){
obviously this is not u-boot coding style. I expect checkpatch reports this too.
Yep. Addressed in v3.
+ if (!co) + return;
+ if(co_is_main_co(co)){ + free(co); + current_co = NULL; + } else { + if(co->stack->owner == co){ + co->stack->owner = NULL; + co->stack->align_validsz = 0; + } + free(co->save_stack.ptr); + co->save_stack.ptr = NULL; + free(co); + } +}
Thanks, Michal
Thanks for the review.

When COROUTINES is enabled, schedule efi_disks_register() and efi_tcg2_register() as two coroutines in efi_init_obj_list() instead of invoking them sequentially. The voluntary yield point is introduced inside udelay() which is called frequently as a result of each function polling the hardware. This allows the two coroutines to make progress simultaneously and reduce the wall clock time required by efi_init_obj_list().
Tested on Kria KV260 with a microSD card inserted with the "printenv -e" command. With COROUTINES disabled, efi_init_obj_list() completes in 2821 ms on average (2825, 2822, 2817). With COROUTINES enabled, it takes 2265 ms (2262, 2260, 2272). That is a reduction of 556 ms which is not bad at all considering that measured separately, efi_tcg2_register() takes ~825 ms and efi_disks_register() needs ~600 ms, so assuming they would overlap perfectly one can expect a 600 ms improvement at best.
The code size penalty for this improvement is 1340 bytes.
Signed-off-by: Jerome Forissier jerome.forissier@linaro.org --- lib/efi_loader/efi_setup.c | 113 +++++++++++++++++++++++++++++++++++-- lib/time.c | 14 ++++- 2 files changed, 122 insertions(+), 5 deletions(-)
diff --git a/lib/efi_loader/efi_setup.c b/lib/efi_loader/efi_setup.c index aa59bc7779d..94160f4bd86 100644 --- a/lib/efi_loader/efi_setup.c +++ b/lib/efi_loader/efi_setup.c @@ -7,10 +7,12 @@
#define LOG_CATEGORY LOGC_EFI
+#include <coroutines.h> #include <efi_loader.h> #include <efi_variable.h> #include <log.h> #include <asm-generic/unaligned.h> +#include <stdlib.h>
#define OBJ_LIST_NOT_INITIALIZED 1
@@ -208,6 +210,46 @@ out: return -1; }
+#if CONFIG_IS_ENABLED(COROUTINES) + +static void efi_disks_register_co(void) +{ + efi_status_t ret; + + if (efi_obj_list_initialized != OBJ_LIST_NOT_INITIALIZED) + goto out; + + /* + * Probe block devices to find the ESP. + * efi_disks_register() must be called before efi_init_variables(). + */ + ret = efi_disks_register(); + if (ret != EFI_SUCCESS) + efi_obj_list_initialized = ret; +out: + co_exit(); +} + +static void efi_tcg2_register_co(void) +{ + efi_status_t ret = EFI_SUCCESS; + + if (efi_obj_list_initialized != OBJ_LIST_NOT_INITIALIZED) + goto out; + + if (IS_ENABLED(CONFIG_EFI_TCG2_PROTOCOL)) { + ret = efi_tcg2_register(); + if (ret != EFI_SUCCESS) + efi_obj_list_initialized = ret; + } +out: + co_exit(); +} + +extern int udelay_yield; + +#endif /* COROUTINES */ + /** * efi_init_obj_list() - Initialize and populate EFI object list * @@ -216,6 +258,12 @@ out: efi_status_t efi_init_obj_list(void) { efi_status_t ret = EFI_SUCCESS; +#if CONFIG_IS_ENABLED(COROUTINES) + struct co_stack *stk = NULL; + struct co *main_co = NULL; + struct co *co1 = NULL; + struct co *co2 = NULL; +#endif
/* Initialize once only */ if (efi_obj_list_initialized != OBJ_LIST_NOT_INITIALIZED) @@ -224,6 +272,53 @@ efi_status_t efi_init_obj_list(void) /* Set up console modes */ efi_setup_console_size();
+#if CONFIG_IS_ENABLED(COROUTINES) + main_co = co_create(NULL, NULL, 0, NULL, NULL); + if (!main_co) { + ret = EFI_OUT_OF_RESOURCES; + goto out; + } + + stk = co_stack_new(8192); + if (!stk) { + ret = EFI_OUT_OF_RESOURCES; + goto out; + } + + co1 = co_create(main_co, stk, 0, efi_disks_register_co, NULL); + if (!co1) { + ret = EFI_OUT_OF_RESOURCES; + goto out; + } + + co2 = co_create(main_co, stk, 0, efi_tcg2_register_co, NULL); + if (!co2) { + ret = EFI_OUT_OF_RESOURCES; + goto out; + } + + udelay_yield = 0xCAFEDECA; + do { + if (!co1->done) + co_resume(co1); + if (!co2->done) + co_resume(co2); + } while (!(co1->done && co2->done)); + udelay_yield = 0; + + co_stack_destroy(stk); + co_destroy(main_co); + co_destroy(co1); + co_destroy(co2); + stk = NULL; + main_co = co1 = co2 = NULL; + + if (efi_obj_list_initialized != OBJ_LIST_NOT_INITIALIZED) { + /* Some kind of error was saved by a coroutine */ + ret = efi_obj_list_initialized; + goto out; + } +#else /* * Probe block devices to find the ESP. * efi_disks_register() must be called before efi_init_variables(). @@ -232,6 +327,13 @@ efi_status_t efi_init_obj_list(void) if (ret != EFI_SUCCESS) goto out;
+ if (IS_ENABLED(CONFIG_EFI_TCG2_PROTOCOL)) { + ret = efi_tcg2_register(); + if (ret != EFI_SUCCESS) + efi_obj_list_initialized = ret; + } +#endif + /* Initialize variable services */ ret = efi_init_variables(); if (ret != EFI_SUCCESS) @@ -272,10 +374,6 @@ efi_status_t efi_init_obj_list(void) }
if (IS_ENABLED(CONFIG_EFI_TCG2_PROTOCOL)) { - ret = efi_tcg2_register(); - if (ret != EFI_SUCCESS) - goto out; - ret = efi_tcg2_do_initial_measurement(); if (ret == EFI_SECURITY_VIOLATION) goto out; @@ -350,6 +448,13 @@ efi_status_t efi_init_obj_list(void) !IS_ENABLED(CONFIG_EFI_CAPSULE_ON_DISK_EARLY)) ret = efi_launch_capsules(); out: +#if CONFIG_IS_ENABLED(COROUTINES) + co_stack_destroy(stk); + co_destroy(main_co); + co_destroy(co1); + co_destroy(co2); efi_obj_list_initialized = ret; +#endif + return ret; } diff --git a/lib/time.c b/lib/time.c index d88edafb196..c11288102fe 100644 --- a/lib/time.c +++ b/lib/time.c @@ -17,6 +17,7 @@ #include <asm/global_data.h> #include <asm/io.h> #include <linux/delay.h> +#include <coroutines.h>
#ifndef CFG_WD_PERIOD # define CFG_WD_PERIOD (10 * 1000 * 1000) /* 10 seconds default */ @@ -190,6 +191,8 @@ void __weak __udelay(unsigned long usec)
/* ------------------------------------------------------------------------- */
+int udelay_yield; + void udelay(unsigned long usec) { ulong kv; @@ -197,7 +200,16 @@ void udelay(unsigned long usec) do { schedule(); kv = usec > CFG_WD_PERIOD ? CFG_WD_PERIOD : usec; - __udelay(kv); + if (CONFIG_IS_ENABLED(COROUTINES) && + udelay_yield == 0xCAFEDECA) { + ulong t0 = timer_get_us(); + do { + co_yield(); + __udelay(10); + } while (timer_get_us() < t0 + kv); + } else { + __udelay(kv); + } usec -= kv; } while(usec); }

Use the coroutines framework to scan USB buses in parallel for better performance. Tested on arm64 QEMU on a somewhat contrived example (4 USB buses, each with one audio device, one keyboard, one mouse and one tablet).
$ make qemu_arm64_defconfig $ make -j$(nproc) CROSS_COMPILE="ccache aarch64-linux-gnu-" $ qemu-system-aarch64 -M virt -nographic -cpu max -bios u-boot.bin \ $(for i in {1..4}; do echo -device qemu-xhci,id=xhci$i \ -device\ usb-{audio,kbd,mouse,tablet},bus=xhci$i.0; \ done)
The time spent in usb_init() is reported on the console and shows a significant improvement with COROUTINES enabled.
** Without COROUTINES
Bus xhci_pci: Register 8001040 NbrPorts 8 Starting the controller USB XHCI 1.00 Bus xhci_pci: Register 8001040 NbrPorts 8 Starting the controller USB XHCI 1.00 Bus xhci_pci: Register 8001040 NbrPorts 8 Starting the controller USB XHCI 1.00 Bus xhci_pci: Register 8001040 NbrPorts 8 Starting the controller USB XHCI 1.00 scanning bus xhci_pci for devices... 6 USB Device(s) found scanning bus xhci_pci for devices... 6 USB Device(s) found scanning bus xhci_pci for devices... 6 USB Device(s) found scanning bus xhci_pci for devices... 6 USB Device(s) found USB: 4 bus(es) scanned in 5873 ms
** With COROUTINES
Bus xhci_pci: Register 8001040 NbrPorts 8 Starting the controller USB XHCI 1.00 Bus xhci_pci: Register 8001040 NbrPorts 8 Starting the controller USB XHCI 1.00 Bus xhci_pci: Register 8001040 NbrPorts 8 Starting the controller USB XHCI 1.00 Bus xhci_pci: Register 8001040 NbrPorts 8 Starting the controller USB XHCI 1.00 Scanning 4 USB bus(es)... done Bus xhci_pci: 6 USB device(s) found Bus xhci_pci: 6 USB device(s) found Bus xhci_pci: 6 USB device(s) found Bus xhci_pci: 6 USB device(s) found USB: 4 bus(es) scanned in 2213 ms
Signed-off-by: Jerome Forissier jerome.forissier@linaro.org --- drivers/usb/host/usb-uclass.c | 152 +++++++++++++++++++++++++++++++++- 1 file changed, 149 insertions(+), 3 deletions(-)
diff --git a/drivers/usb/host/usb-uclass.c b/drivers/usb/host/usb-uclass.c index bfec303e7af..3104efe7f9e 100644 --- a/drivers/usb/host/usb-uclass.c +++ b/drivers/usb/host/usb-uclass.c @@ -9,6 +9,7 @@ #define LOG_CATEGORY UCLASS_USB
#include <bootdev.h> +#include <coroutines.h> #include <dm.h> #include <errno.h> #include <log.h> @@ -18,6 +19,8 @@ #include <dm/lists.h> #include <dm/uclass-internal.h>
+#include <time.h> + static bool asynch_allowed;
struct usb_uclass_priv { @@ -221,6 +224,40 @@ int usb_stop(void) return err; }
+static int nbus; + +#if CONFIG_IS_ENABLED(COROUTINES) +static void usb_scan_bus(struct udevice *bus, bool recurse) +{ + struct usb_bus_priv *priv; + struct udevice *dev; + int ret; + + priv = dev_get_uclass_priv(bus); + + assert(recurse); /* TODO: Support non-recusive */ + + debug("\n"); + ret = usb_scan_device(bus, 0, USB_SPEED_FULL, &dev); + if (ret) + printf("Scanning bus %s failed, error %d\n", bus->name, ret); +} + +static void usb_report_devices(struct uclass *uc) +{ + struct usb_bus_priv *priv; + struct udevice *bus; + + uclass_foreach_dev(bus, uc) { + priv = dev_get_uclass_priv(bus); + printf("Bus %s: ", bus->name); + if (priv->next_addr == 0) + printf("No USB device found\n"); + else + printf("%d USB device(s) found\n", priv->next_addr); + } +} +#else static void usb_scan_bus(struct udevice *bus, bool recurse) { struct usb_bus_priv *priv; @@ -240,7 +277,81 @@ static void usb_scan_bus(struct udevice *bus, bool recurse) printf("No USB Device found\n"); else printf("%d USB Device(s) found\n", priv->next_addr); + nbus++; } +#endif + +#if CONFIG_IS_ENABLED(COROUTINES) +extern int udelay_yield; + +static void usb_scan_bus_co(void) +{ + usb_scan_bus((struct udevice *)co_get_arg(), true); + co_exit(); +} + +static struct co_stack *stk; +static struct co *main_co; +static struct co **co; +static int co_sz = 8; + +static int add_usb_scan_bus_co(struct udevice *bus) +{ + if (!co) { + co = malloc(co_sz * sizeof(*co)); + if (!co) + return -ENOMEM; + } + if (nbus == co_sz) { + struct co **nco; + + co_sz *= 2; + nco = realloc(co, co_sz * sizeof(*co)); + if (!nco) + return -ENOMEM; + co = nco; + } + if (!main_co) { + main_co = co_create(NULL, NULL, 0, NULL, NULL); + if (!main_co) + return -ENOMEM; + } + if (!stk) { + stk = co_stack_new(32768); + if (!stk) + return -ENOMEM; + } + co[nbus] = co_create(main_co, stk, 0, usb_scan_bus_co, bus); + if (!co[nbus]) + return -ENOMEM; + nbus++; + return 0; +} + +static void usb_scan_cleanup(void) +{ + int i; + + for (i = 0; i < nbus; i++) { + co_destroy(co[i]); + co[i] = NULL; + } + nbus = 0; + co_destroy(main_co); + main_co = NULL; + co_stack_destroy(stk); + stk = NULL; +} +#else +static int add_usb_scan_bus_co(struct udevice *bus) +{ + return 0; +} + +static void usb_scan_cleanup(void) +{ +} +#endif
static void remove_inactive_children(struct uclass *uc, struct udevice *bus) { @@ -289,6 +400,7 @@ static int usb_probe_companion(struct udevice *bus)
int usb_init(void) { + unsigned long t0 = timer_get_us(); int controllers_initialized = 0; struct usb_uclass_priv *uc_priv; struct usb_bus_priv *priv; @@ -355,10 +467,40 @@ int usb_init(void) continue;
priv = dev_get_uclass_priv(bus); - if (!priv->companion) - usb_scan_bus(bus, true); + if (!priv->companion) { + if (CONFIG_IS_ENABLED(COROUTINES)) { + ret = add_usb_scan_bus_co(bus); + if (ret) + goto out; + } else { + usb_scan_bus(bus, true); + } + } }
+#if CONFIG_IS_ENABLED(COROUTINES) + { + bool done; + int i; + + printf("Scanning %d USB bus(es)... ", nbus); + udelay_yield = 0xCAFEDECA; + do { + done = true; + for (i = 0; i < nbus; i++) { + if (!co[i]->done) { + done = false; + co_resume(co[i]); + } + } + } while (!done); + udelay_yield = 0; + printf("done\n"); + + usb_report_devices(uc); + } +#endif + /* * Now that the primary controllers have been scanned and have handed * over any devices they do not understand to their companions, scan @@ -388,7 +530,11 @@ int usb_init(void) /* if we were not able to find at least one working bus, bail out */ if (controllers_initialized == 0) printf("No USB controllers found\n"); - +out: + if (nbus) + printf("USB: %d bus(es) scanned in %ld ms\n", nbus, + (timer_get_us() - t0) / 1000); + usb_scan_cleanup(); return usb_started ? 0 : -ENOENT; }

Hi,
út 28. 1. 2025 v 11:20 odesílatel Jerome Forissier jerome.forissier@linaro.org napsal:
Use the coroutines framework to scan USB buses in parallel for better performance. Tested on arm64 QEMU on a somewhat contrived example (4 USB buses, each with one audio device, one keyboard, one mouse and one tablet).
$ make qemu_arm64_defconfig $ make -j$(nproc) CROSS_COMPILE="ccache aarch64-linux-gnu-" $ qemu-system-aarch64 -M virt -nographic -cpu max -bios u-boot.bin \ $(for i in {1..4}; do echo -device qemu-xhci,id=xhci$i \ -device\ usb-{audio,kbd,mouse,tablet},bus=xhci$i.0; \ done)
The time spent in usb_init() is reported on the console and shows a significant improvement with COROUTINES enabled.
** Without COROUTINES
Bus xhci_pci: Register 8001040 NbrPorts 8 Starting the controller USB XHCI 1.00 Bus xhci_pci: Register 8001040 NbrPorts 8 Starting the controller USB XHCI 1.00 Bus xhci_pci: Register 8001040 NbrPorts 8 Starting the controller USB XHCI 1.00 Bus xhci_pci: Register 8001040 NbrPorts 8 Starting the controller USB XHCI 1.00 scanning bus xhci_pci for devices... 6 USB Device(s) found scanning bus xhci_pci for devices... 6 USB Device(s) found scanning bus xhci_pci for devices... 6 USB Device(s) found scanning bus xhci_pci for devices... 6 USB Device(s) found USB: 4 bus(es) scanned in 5873 ms
** With COROUTINES
Bus xhci_pci: Register 8001040 NbrPorts 8 Starting the controller USB XHCI 1.00 Bus xhci_pci: Register 8001040 NbrPorts 8 Starting the controller USB XHCI 1.00 Bus xhci_pci: Register 8001040 NbrPorts 8 Starting the controller USB XHCI 1.00 Bus xhci_pci: Register 8001040 NbrPorts 8 Starting the controller USB XHCI 1.00 Scanning 4 USB bus(es)... done Bus xhci_pci: 6 USB device(s) found Bus xhci_pci: 6 USB device(s) found Bus xhci_pci: 6 USB device(s) found Bus xhci_pci: 6 USB device(s) found USB: 4 bus(es) scanned in 2213 ms
Signed-off-by: Jerome Forissier jerome.forissier@linaro.org
drivers/usb/host/usb-uclass.c | 152 +++++++++++++++++++++++++++++++++- 1 file changed, 149 insertions(+), 3 deletions(-)
diff --git a/drivers/usb/host/usb-uclass.c b/drivers/usb/host/usb-uclass.c index bfec303e7af..3104efe7f9e 100644 --- a/drivers/usb/host/usb-uclass.c +++ b/drivers/usb/host/usb-uclass.c @@ -9,6 +9,7 @@ #define LOG_CATEGORY UCLASS_USB
#include <bootdev.h> +#include <coroutines.h> #include <dm.h> #include <errno.h> #include <log.h> @@ -18,6 +19,8 @@ #include <dm/lists.h> #include <dm/uclass-internal.h>
+#include <time.h>
static bool asynch_allowed;
struct usb_uclass_priv { @@ -221,6 +224,40 @@ int usb_stop(void) return err; }
+static int nbus;
+#if CONFIG_IS_ENABLED(COROUTINES) +static void usb_scan_bus(struct udevice *bus, bool recurse) +{
struct usb_bus_priv *priv;
struct udevice *dev;
int ret;
priv = dev_get_uclass_priv(bus);
assert(recurse); /* TODO: Support non-recusive */
debug("\n");
ret = usb_scan_device(bus, 0, USB_SPEED_FULL, &dev);
if (ret)
printf("Scanning bus %s failed, error %d\n", bus->name, ret);
+}
+static void usb_report_devices(struct uclass *uc) +{
struct usb_bus_priv *priv;
struct udevice *bus;
uclass_foreach_dev(bus, uc) {
priv = dev_get_uclass_priv(bus);
printf("Bus %s: ", bus->name);
if (priv->next_addr == 0)
printf("No USB device found\n");
else
printf("%d USB device(s) found\n", priv->next_addr);
}
+} +#else static void usb_scan_bus(struct udevice *bus, bool recurse) { struct usb_bus_priv *priv; @@ -240,7 +277,81 @@ static void usb_scan_bus(struct udevice *bus, bool recurse) printf("No USB Device found\n"); else printf("%d USB Device(s) found\n", priv->next_addr);
nbus++;
} +#endif
+#if CONFIG_IS_ENABLED(COROUTINES) +extern int udelay_yield;
+static void usb_scan_bus_co(void) +{
usb_scan_bus((struct udevice *)co_get_arg(), true);
co_exit();
+}
+static struct co_stack *stk; +static struct co *main_co; +static struct co **co; +static int co_sz = 8;
+static int add_usb_scan_bus_co(struct udevice *bus) +{
if (!co) {
co = malloc(co_sz * sizeof(*co));
if (!co)
return -ENOMEM;
}
if (nbus == co_sz) {
struct co **nco;
co_sz *= 2;
nco = realloc(co, co_sz * sizeof(*co));
if (!nco)
return -ENOMEM;
co = nco;
}
if (!main_co) {
main_co = co_create(NULL, NULL, 0, NULL, NULL);
if (!main_co)
return -ENOMEM;
}
if (!stk) {
stk = co_stack_new(32768);
if (!stk)
return -ENOMEM;
}
co[nbus] = co_create(main_co, stk, 0, usb_scan_bus_co, bus);
if (!co[nbus])
return -ENOMEM;
nbus++;
return 0;
+}
+static void usb_scan_cleanup(void) +{
int i;
for (i = 0; i < nbus; i++) {
co_destroy(co[i]);
co[i] = NULL;
}
nbus = 0;
co_destroy(main_co);
main_co = NULL;
co_stack_destroy(stk);
stk = NULL;
+} +#else +static int add_usb_scan_bus_co(struct udevice *bus) +{
return 0;
+}
+static void usb_scan_cleanup(void) +{ +} +#endif
static void remove_inactive_children(struct uclass *uc, struct udevice *bus) { @@ -289,6 +400,7 @@ static int usb_probe_companion(struct udevice *bus)
int usb_init(void) {
unsigned long t0 = timer_get_us(); int controllers_initialized = 0; struct usb_uclass_priv *uc_priv; struct usb_bus_priv *priv;
@@ -355,10 +467,40 @@ int usb_init(void) continue;
priv = dev_get_uclass_priv(bus);
if (!priv->companion)
usb_scan_bus(bus, true);
if (!priv->companion) {
if (CONFIG_IS_ENABLED(COROUTINES)) {
ret = add_usb_scan_bus_co(bus);
if (ret)
goto out;
} else {
usb_scan_bus(bus, true);
}
} }
+#if CONFIG_IS_ENABLED(COROUTINES)
{
bool done;
int i;
printf("Scanning %d USB bus(es)... ", nbus);
udelay_yield = 0xCAFEDECA;
do {
done = true;
for (i = 0; i < nbus; i++) {
if (!co[i]->done) {
done = false;
co_resume(co[i]);
}
}
} while (!done);
udelay_yield = 0;
printf("done\n");
usb_report_devices(uc);
}
+#endif
/* * Now that the primary controllers have been scanned and have handed * over any devices they do not understand to their companions, scan
@@ -388,7 +530,11 @@ int usb_init(void) /* if we were not able to find at least one working bus, bail out */ if (controllers_initialized == 0) printf("No USB controllers found\n");
+out:
if (nbus)
printf("USB: %d bus(es) scanned in %ld ms\n", nbus,
(timer_get_us() - t0) / 1000);
usb_scan_cleanup(); return usb_started ? 0 : -ENOENT;
}
-- 2.43.0
I have tested it on kr260 which is using 2 usb interfaces and there is an issue with usb hub initialization. That board has two hubs connected over i2c and only one of them is initialized over i2c. It means there is some work which needs to happen and likely some locking should be in place.
Thanks, Michal

On 1/28/25 12:58, Michal Simek wrote:
Hi,
út 28. 1. 2025 v 11:20 odesílatel Jerome Forissier jerome.forissier@linaro.org napsal:
Use the coroutines framework to scan USB buses in parallel for better performance. Tested on arm64 QEMU on a somewhat contrived example (4 USB buses, each with one audio device, one keyboard, one mouse and one tablet).
$ make qemu_arm64_defconfig $ make -j$(nproc) CROSS_COMPILE="ccache aarch64-linux-gnu-" $ qemu-system-aarch64 -M virt -nographic -cpu max -bios u-boot.bin \ $(for i in {1..4}; do echo -device qemu-xhci,id=xhci$i \ -device\ usb-{audio,kbd,mouse,tablet},bus=xhci$i.0; \ done)
The time spent in usb_init() is reported on the console and shows a significant improvement with COROUTINES enabled.
** Without COROUTINES
Bus xhci_pci: Register 8001040 NbrPorts 8 Starting the controller USB XHCI 1.00 Bus xhci_pci: Register 8001040 NbrPorts 8 Starting the controller USB XHCI 1.00 Bus xhci_pci: Register 8001040 NbrPorts 8 Starting the controller USB XHCI 1.00 Bus xhci_pci: Register 8001040 NbrPorts 8 Starting the controller USB XHCI 1.00 scanning bus xhci_pci for devices... 6 USB Device(s) found scanning bus xhci_pci for devices... 6 USB Device(s) found scanning bus xhci_pci for devices... 6 USB Device(s) found scanning bus xhci_pci for devices... 6 USB Device(s) found USB: 4 bus(es) scanned in 5873 ms
** With COROUTINES
Bus xhci_pci: Register 8001040 NbrPorts 8 Starting the controller USB XHCI 1.00 Bus xhci_pci: Register 8001040 NbrPorts 8 Starting the controller USB XHCI 1.00 Bus xhci_pci: Register 8001040 NbrPorts 8 Starting the controller USB XHCI 1.00 Bus xhci_pci: Register 8001040 NbrPorts 8 Starting the controller USB XHCI 1.00 Scanning 4 USB bus(es)... done Bus xhci_pci: 6 USB device(s) found Bus xhci_pci: 6 USB device(s) found Bus xhci_pci: 6 USB device(s) found Bus xhci_pci: 6 USB device(s) found USB: 4 bus(es) scanned in 2213 ms
Signed-off-by: Jerome Forissier jerome.forissier@linaro.org
drivers/usb/host/usb-uclass.c | 152 +++++++++++++++++++++++++++++++++- 1 file changed, 149 insertions(+), 3 deletions(-)
diff --git a/drivers/usb/host/usb-uclass.c b/drivers/usb/host/usb-uclass.c index bfec303e7af..3104efe7f9e 100644 --- a/drivers/usb/host/usb-uclass.c +++ b/drivers/usb/host/usb-uclass.c @@ -9,6 +9,7 @@ #define LOG_CATEGORY UCLASS_USB
#include <bootdev.h> +#include <coroutines.h> #include <dm.h> #include <errno.h> #include <log.h> @@ -18,6 +19,8 @@ #include <dm/lists.h> #include <dm/uclass-internal.h>
+#include <time.h>
static bool asynch_allowed;
struct usb_uclass_priv { @@ -221,6 +224,40 @@ int usb_stop(void) return err; }
+static int nbus;
+#if CONFIG_IS_ENABLED(COROUTINES) +static void usb_scan_bus(struct udevice *bus, bool recurse) +{
struct usb_bus_priv *priv;
struct udevice *dev;
int ret;
priv = dev_get_uclass_priv(bus);
assert(recurse); /* TODO: Support non-recusive */
debug("\n");
ret = usb_scan_device(bus, 0, USB_SPEED_FULL, &dev);
if (ret)
printf("Scanning bus %s failed, error %d\n", bus->name, ret);
+}
+static void usb_report_devices(struct uclass *uc) +{
struct usb_bus_priv *priv;
struct udevice *bus;
uclass_foreach_dev(bus, uc) {
priv = dev_get_uclass_priv(bus);
printf("Bus %s: ", bus->name);
if (priv->next_addr == 0)
printf("No USB device found\n");
else
printf("%d USB device(s) found\n", priv->next_addr);
}
+} +#else static void usb_scan_bus(struct udevice *bus, bool recurse) { struct usb_bus_priv *priv; @@ -240,7 +277,81 @@ static void usb_scan_bus(struct udevice *bus, bool recurse) printf("No USB Device found\n"); else printf("%d USB Device(s) found\n", priv->next_addr);
nbus++;
} +#endif
+#if CONFIG_IS_ENABLED(COROUTINES) +extern int udelay_yield;
+static void usb_scan_bus_co(void) +{
usb_scan_bus((struct udevice *)co_get_arg(), true);
co_exit();
+}
+static struct co_stack *stk; +static struct co *main_co; +static struct co **co; +static int co_sz = 8;
+static int add_usb_scan_bus_co(struct udevice *bus) +{
if (!co) {
co = malloc(co_sz * sizeof(*co));
if (!co)
return -ENOMEM;
}
if (nbus == co_sz) {
struct co **nco;
co_sz *= 2;
nco = realloc(co, co_sz * sizeof(*co));
if (!nco)
return -ENOMEM;
co = nco;
}
if (!main_co) {
main_co = co_create(NULL, NULL, 0, NULL, NULL);
if (!main_co)
return -ENOMEM;
}
if (!stk) {
stk = co_stack_new(32768);
if (!stk)
return -ENOMEM;
}
co[nbus] = co_create(main_co, stk, 0, usb_scan_bus_co, bus);
if (!co[nbus])
return -ENOMEM;
nbus++;
return 0;
+}
+static void usb_scan_cleanup(void) +{
int i;
for (i = 0; i < nbus; i++) {
co_destroy(co[i]);
co[i] = NULL;
}
nbus = 0;
co_destroy(main_co);
main_co = NULL;
co_stack_destroy(stk);
stk = NULL;
+} +#else +static int add_usb_scan_bus_co(struct udevice *bus) +{
return 0;
+}
+static void usb_scan_cleanup(void) +{ +} +#endif
static void remove_inactive_children(struct uclass *uc, struct udevice *bus) { @@ -289,6 +400,7 @@ static int usb_probe_companion(struct udevice *bus)
int usb_init(void) {
unsigned long t0 = timer_get_us(); int controllers_initialized = 0; struct usb_uclass_priv *uc_priv; struct usb_bus_priv *priv;
@@ -355,10 +467,40 @@ int usb_init(void) continue;
priv = dev_get_uclass_priv(bus);
if (!priv->companion)
usb_scan_bus(bus, true);
if (!priv->companion) {
if (CONFIG_IS_ENABLED(COROUTINES)) {
ret = add_usb_scan_bus_co(bus);
if (ret)
goto out;
} else {
usb_scan_bus(bus, true);
}
} }
+#if CONFIG_IS_ENABLED(COROUTINES)
{
bool done;
int i;
printf("Scanning %d USB bus(es)... ", nbus);
udelay_yield = 0xCAFEDECA;
do {
done = true;
for (i = 0; i < nbus; i++) {
if (!co[i]->done) {
done = false;
co_resume(co[i]);
}
}
} while (!done);
udelay_yield = 0;
printf("done\n");
usb_report_devices(uc);
}
+#endif
/* * Now that the primary controllers have been scanned and have handed * over any devices they do not understand to their companions, scan
@@ -388,7 +530,11 @@ int usb_init(void) /* if we were not able to find at least one working bus, bail out */ if (controllers_initialized == 0) printf("No USB controllers found\n");
+out:
if (nbus)
printf("USB: %d bus(es) scanned in %ld ms\n", nbus,
(timer_get_us() - t0) / 1000);
usb_scan_cleanup(); return usb_started ? 0 : -ENOENT;
}
-- 2.43.0
I have tested it on kr260 which is using 2 usb interfaces and there is an issue with usb hub initialization. That board has two hubs connected over i2c and only one of them is initialized over i2c. It means there is some work which needs to happen and likely some locking should be in place.
Hmmmm... I was kind of expecting that :-/ Do you think the kv260 has a similar problem? If so I can use mine to troubleshoot.
Thanks,

On 1/28/25 11:19 AM, Jerome Forissier wrote:
Use the coroutines framework to scan USB buses in parallel for better performance. Tested on arm64 QEMU on a somewhat contrived example (4 USB buses, each with one audio device, one keyboard, one mouse and one tablet).
$ make qemu_arm64_defconfig $ make -j$(nproc) CROSS_COMPILE="ccache aarch64-linux-gnu-" $ qemu-system-aarch64 -M virt -nographic -cpu max -bios u-boot.bin \ $(for i in {1..4}; do echo -device qemu-xhci,id=xhci$i \ -device\ usb-{audio,kbd,mouse,tablet},bus=xhci$i.0; \ done)
The time spent in usb_init() is reported on the console and shows a significant improvement with COROUTINES enabled.
Have you considered using the cyclic framework ( cyclic_register() and co. ) to fully offload USB operations away from the main thread, i.e. to make the U-Boot shell and e.g. 'usb start' run fully in parallel ? That could then be extended to block transfers, which could run in background, and ... we would also get a concept of shell pipes to move data around like we do in Linux.

Hi Jerome,
On Tue, 28 Jan 2025 at 03:20, Jerome Forissier jerome.forissier@linaro.org wrote:
This series introduces a simple coroutines framework and uses it to speed up the efi_init_obj_list() function. It is just an example to trigger discussions; hopefully other places in U-Boot can benefit from a similar treatment. Suggestions are welcome.
I came up with this idea after analyzing some profiling data (ftrace) taken during the execution of the "printenv -e" command on a Kria KV260 board. When the command is first invoked, efi_init_obj_list() is called which takes a significant amount of time (2.872 seconds). This time can be split into 0.570 s for efi_disks_register(), 0.811 s for efi_tcg2_register() and 1.418 s for efi_tcg2_do_initial_measurement(). All the other child functions are much quicker. Another interesting observation is that a large part of the time is actually spent waiting in __udelay(). More precisely:
- For efi_disks_register(): 421 ms / 570 ms = 73.8 % spent in __udelay()
- For efi_tcg2_register(): 805 ms / 811 ms = 99.1 % spent in __udelay()
- For efi_tcg2_do_initial_measurement(): 1395.025 ms / 1418.372 ms = 98.3 % spent in __udelay()
Given the above data, it is reasonable to think that a nice speedup could be obtained if these functions could somehow be run in parallel. efi_tcg2_do_initial_measurement() unfortunately needs to be excluded because it depends on the first two. But efi_disks_register() and efi_tcg2_register() are clearly independant and are therefore good candidates for concurrent execution.
So, I considered two options:
- Spin up a secondary core to take care of one function while the other
one runs on the main core
- Introduce some kind of software scheduling
I quickly ruled out the first one for several reasons: initializing a secondary core is typically quite hardware-specific, it would not scale well if more functions than available cores would need to be run in parallel, it would make debugging harder etc.
Yes this is fairly brittle IMO. Aide: Years ago I did a prototype of using a second core to speed up loading from MMC. It turned out that just 'kicking' the mmc controller at the start was enough to speed things up. I suspect that these days the controllers are faster and there is less of a delay getting going.
Software scheduling however can be accomplished quite easily, especially since we don't need to consider preemptive multitasking. Coroutines [1] for example can perfectly do the job. They provide a way to save and restore the execution context (registers and stack). Here is how it look like:
static void do_some_work(int v) { int i;
for (i = 0; i < 5; i++) { printf("%d", v); /* Save context and resume main thread */ co_yield(); }
}
static void co1(void *arg) { do_some_work(1); /* Mark coroutine as "done" and resume main thread */ co_exit(); }
static void co2(void *arg) { do_some_work(2); co_exit(); }
void main_thread(void) { struct co *co1, *co2;
co1 = co_create(co1, ...); co2 = co_create(co2, ...); do { printf("A"); if (!co1->done) { /* Invoke or resume first coroutine */ co_resume(co1); } printf("B"); if (!cor21->done) { /* Invoke or resume second coroutine */ co_resume(co2); } } while (!(co1->done && co2->done)); /* At this point, co1 and co2 have both called co_exit() */
}
The above example would print: A1B2A1B2A1B2A1B2A1B2.
- The first commit introduces the coroutine framework, loosely based on
libaco [2]. The code was simplified and reformatted to better suit U-Boot.
- The second commit modifies efi_init_obj_list() in order to turn
efi_disks_register() and efi_tcg2_register() into coroutines when COROUTINES in enabled. On a KV260 board with a SD card inserted, this saves about .6 second (2.2 s instead of 2.8 s).
- The third commit applies coroutines to usb_init(), which can
significantly reduce the time it takes to scan multiple buses. Tested on arm64 QEMU with 4 XHCI buses: the USB scan takes 2.2 s instead of 5.8 s.
[1] https://en.wikipedia.org/wiki/Coroutine [2] https://github.com/hnes/libaco/
Changes in v2
- Remove x86 and x86_64 arch code since it is untested
- Add missing SPDX license tag to arch/arm/cpu/armv8/co_switch.S
- Change Apache-2.0 SPDX license tag to "Apache-2.0 OR GPL-2.0-or-later"
- Apply coroutines to the USB bus scan in usb_init()
Jerome Forissier (3): Introduce coroutines framework efi_loader: optimize efi_init_obj_list() with coroutines usb: scan multiple buses simultaneously with coroutines
arch/arm/cpu/armv8/Makefile | 1 + arch/arm/cpu/armv8/co_switch.S | 36 +++++++ drivers/usb/host/usb-uclass.c | 152 +++++++++++++++++++++++++++++- include/coroutines.h | 130 ++++++++++++++++++++++++++ lib/Kconfig | 10 ++ lib/Makefile | 2 + lib/coroutines.c | 165 +++++++++++++++++++++++++++++++++ lib/efi_loader/efi_setup.c | 113 +++++++++++++++++++++- lib/time.c | 14 ++- 9 files changed, 615 insertions(+), 8 deletions(-) create mode 100644 arch/arm/cpu/armv8/co_switch.S create mode 100644 include/coroutines.h create mode 100644 lib/coroutines.c
-- 2.43.0
Another idea would be to use the cyclic framework.
To use cyclic for USB, we would need a struct that holds the state of the USB scanning. This is likely just a few index variables. The nice thing about cyclic is that we are a bit more in control of what is going on and we know at any point what state things are in. We have a way of aborting when we want to. Arguably many of the high-level functions in U-Boot should be written as:
int do_more(struct my_state *state)
and just go until they run out of things to immediately do*
It would collect the state for each subsystem in one place.
The other nice thing is that we can progress the boot with multiple subsystems at the same time, with less risk of strange behaviour (I suspect?). In other words, start USB scanning 'in the background' while continuing to allow the user to type stuff at the cmdline.
I believe that we would very quickly want to get progress info from the operation, but it's hard to be sure when it is safe to read it, if the coroutines are doing their own thing on their own time with limited visibility from the 'main' process.
You can see this sort of API with libfdt (fdt_first/next_subnode()), driver model (uclass_first/next_device() and bootstd (bootflow_scan_first/next()).
If we are going to have coroutines, then we should have sandbox support, at least. How does it look inside the debugger?
Regards, Simon
* BTW, with EFI, do the disks and TPM actually all need to be probed right at the start? I'm not quite convinced that anyone has seriously looked at this.
participants (5)
-
Jerome Forissier
-
Marek Vasut
-
Michal Simek
-
Michal Simek
-
Simon Glass