
On Wed, Jun 19, 2024 at 10:21:51AM +0200, Rasmus Villemoes wrote:
> On 18/06/2024 23.03, Tim Harvey wrote:
>> On Tue, Jun 18, 2024 at 7:32 AM Tom Rini <trini@konsulko.com> wrote:
>>
>> Stefan and Tom,
>>
>> I'm seeing CI issues here even with 5000us [1]:
>>
>>   host bind 0 /tmp/sandbox/persistent-cyclic function wdt-gpio-level took too long: 5368us vs 5000us max
>
> Yes, 5ms is way too little when you're not the only thing running on
> the CPU, which is why I went with 100ms.
>
> Random thoughts and questions:
>
> (1) Do we have any way to programmatically grab all the logs from
> Azure CI, so we can get some kind of objective statistics on the
> number after "took too long:"? Clicking through the web interface and
> randomly searching is too painful.
>
> It would also be helpful to know what percentage of CI runs have
> failed due to that, versus due to some genuine error.

I don't think we can easily grab logs via the API. And at least for
https://dev.azure.com/u-boot/u-boot/_build?definitionId=2&_a=summary a
lot of the failures have rolled off already, because we only get so
many logs and I try to mark full release logs as "keep forever". But
anecdotally I can say 75%+ of the Azure runs fail at least once due to
this, and most fail often enough to fail the pipeline (we get 2 tries
per job).

> (2) I considered a patch that just added a
>
>   default $something big if SANDBOX
>
> to config CYCLIC_MAX_CPU_TIME_US, but since the problem also hit
> qemu, I dropped that. But if my patch is too ugly (and I might tend
> to think that myself...), perhaps this would at least be an added
> improvement over the generic bump to 5000us.

I was fine with your approach really; maybe with a bit bigger of a
comment, and a note under doc/ as well explaining why we do it?
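
Something along these lines is what I'd picture (just a sketch; the
100000 is only an illustrative stand-in for your "$something big", and
the exact prompt/placement would need to match the existing
CYCLIC_MAX_CPU_TIME_US entry):

  config CYCLIC_MAX_CPU_TIME_US
          int "Max allowed time for a cyclic function in us"
          depends on CYCLIC
          default 100000 if SANDBOX
          default 5000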

> (3) I also thought that perhaps, for sandbox, we should simply
> measure the time using clock_gettime(CLOCK_PROCESS_CPUTIME_ID)
> instead of wallclock time. But it's a little ugly to implement, since
> the "now" variable is used both to decide if it's time to run the
> callback and as the starting point for measuring cpu time, and we
> probably still want the "is it time" check done on wallclock time,
> not on however much cpu time the U-Boot process has been given. Or
> maybe we don't, and CLOCK_PROCESS_CPUTIME_ID would simply be a better
> backend for os_get_nsec(). Sure, time in the sandbox would progress
> more slowly than on the host, but does that actually matter?
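>
> To make that concrete, something like the below is what I have in
> mind (os_get_cpu_nsec() is a made-up name, not something that exists
> in os.c today):
>
>   #include <stdint.h>
>   #include <time.h>
>
>   /*
>    * Like os_get_nsec(), but returns the CPU time consumed by the
>    * U-Boot process rather than wallclock time, so the cyclic core
>    * could charge a callback only for the cycles it actually got.
>    */
>   uint64_t os_get_cpu_nsec(void)
>   {
>           struct timespec tp;
>
>           if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &tp))
>                   return 0;
>
>           return tp.tv_sec * 1000000000ULL + tp.tv_nsec;
>   }
>
> The "is it time to run" decision could then stay on wallclock time,
> with only the "took too long" accounting using this.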
>
> (4) Btw, what kind of clock tick do we even get when running under
> qemu? I don't have much experience with qemu, but from some quick
> googling it seems that -icount would be interesting. Also see
> https://github.com/zephyrproject-rtos/zephyr/issues/14173 . From a
> quick read it seems there were some issues back in 2019, but that
> today it mostly works for them, except for some SMP issues (which are
> certainly not relevant to U-Boot).

That could be interesting too, yeah. I do worry it might open up its
own set of problems to figure out, though.