
On Wed, Jun 19, 2024 at 10:21:51AM +0200, Rasmus Villemoes wrote:
> On 18/06/2024 23.03, Tim Harvey wrote:
>> On Tue, Jun 18, 2024 at 7:32 AM Tom Rini <trini@konsulko.com> wrote:
>>
>> Stefan and Tom,
>>
>> I'm seeing CI issues here even with 5000us [1]:
>>
>>   host bind 0 /tmp/sandbox/persistent-cyclic function wdt-gpio-level took too long: 5368us vs 5000us max
>
> Yes, 5ms is way too little when you're not the only thing running on
> the CPU, which is why I went with 100ms.
>
> Random thoughts and questions:
>
> (1) Do we have any way to programmatically grab all the logs from
> Azure CI, so we can get some kind of objective statistics on the
> number after "took too long:"? Clicking through the web interface and
> randomly searching is too painful.
>
> It would also be helpful to know what percentage of CI runs have
> failed due to that, versus due to some genuine error.

I don't think we can easily grab logs via the API. And at least for
https://dev.azure.com/u-boot/u-boot/_build?definitionId=2&_a=summary a
lot of the failures have rolled off already, because we only get so
many logs and I try to mark full release logs as "keep forever". But
anecdotally I can say 75%+ of the Azure runs fail at least once due to
this, and most fail often enough to fail the pipeline (we get 2 tries
per job).

> (2) I considered a patch that just added a
>
>   default $something big if SANDBOX
>
> to config CYCLIC_MAX_CPU_TIME_US, but since the problem also hit
> qemu, I dropped that. But if my patch is too ugly (and I might tend
> to think that myself...), perhaps this would at least be an added
> improvement over the generic bump to 5000us.

I was fine with your approach really; maybe with a bit bigger of a
comment, and a note under doc/ as well explaining why we do it?
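
Something along these lines is what I'd picture (just a sketch; the
100000 is only an illustrative stand-in for your "$something big", and
the exact prompt/placement would need to match the existing
CYCLIC_MAX_CPU_TIME_US entry):

  config CYCLIC_MAX_CPU_TIME_US
          int "Max allowed time for a cyclic function in us"
          depends on CYCLIC
          default 100000 if SANDBOX
          default 5000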

> (3) I also thought that perhaps, for sandbox, we should simply
> measure the time using clock_gettime(CLOCK_PROCESS_CPUTIME_ID)
> instead of wallclock time. But it's a little ugly to implement, since
> the "now" variable is used both to decide if it's time to run the
> callback and as the starting point for measuring cpu time, and we
> probably still want the "is it time" check done on wallclock time,
> not on however much cpu time the U-Boot process has been given. Or
> maybe we don't, and CLOCK_PROCESS_CPUTIME_ID would simply be a better
> backend for os_get_nsec(). Sure, time in the sandbox would progress
> more slowly than on the host, but does that actually matter?
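>
> To make that concrete, something like the below is what I have in
> mind (os_get_cpu_nsec() is a made-up name, not something that exists
> in os.c today):
>
>   #include <stdint.h>
>   #include <time.h>
>
>   /*
>    * Like os_get_nsec(), but returns the CPU time consumed by the
>    * U-Boot process rather than wallclock time, so the cyclic core
>    * could charge a callback only for the cycles it actually got.
>    */
>   uint64_t os_get_cpu_nsec(void)
>   {
>           struct timespec tp;
>
>           if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &tp))
>                   return 0;
>
>           return tp.tv_sec * 1000000000ULL + tp.tv_nsec;
>   }
>
> The "is it time to run" decision could then stay on wallclock time,
> with only the "took too long" accounting using this.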
>
> (4) Btw, what kind of clock tick do we even get when running under
> qemu? I don't have much experience with qemu, but from some quick
> googling it seems that -icount would be interesting. Also see
> https://github.com/zephyrproject-rtos/zephyr/issues/14173 . From a
> quick read it seems there were some issues back in 2019, but that
> today it mostly works for them, except for some SMP issues (which are
> certainly not relevant to U-Boot).

That could be interesting too, yeah. I do worry it might open up its
own set of problems to figure out, though.