Re: [PATCH] CI: Add automatic retry for test.py jobs

13 Jul 2023


Hi Tom,
On Wed, 12 Jul 2023 at 14:38, Tom Rini <trini@konsulko.com> wrote:
On Wed, Jul 12, 2023 at 02:32:18PM -0600, Simon Glass wrote:
Hi Tom,
On Wed, 12 Jul 2023 at 11:09, Tom Rini <trini@konsulko.com> wrote:
On Wed, Jul 12, 2023 at 08:00:23AM -0600, Simon Glass wrote:
Hi Tom,
On Tue, 11 Jul 2023 at 20:33, Tom Rini <trini@konsulko.com> wrote:
It is not uncommon for some of the QEMU-based jobs to fail not because
of a code issue but rather because of a timing issue or similar problem
that is out of our control. Make use of the keywords that Azure and
GitLab provide so that we will automatically re-run these jobs up to
2 times when they fail. If they still fail after that, it is likely we
have found a real issue to investigate.
Signed-off-by: Tom Rini <trini@konsulko.com>
---
 .azure-pipelines.yml | 1 +
 .gitlab-ci.yml       | 1 +
 2 files changed, 2 insertions(+)
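
For reference, a minimal sketch of how the retry keywords described in the commit message might be used. The job name, step name, and `./run-tests.sh` command are hypothetical placeholders, not the contents of the actual patch; only the `retry` keyword (GitLab CI) and the `retryCountOnTaskFailure` property (Azure Pipelines) are taken from the respective CI systems.

```yaml
# .gitlab-ci.yml (hypothetical job; only the retry keyword matters here)
example test.py job:
  script:
    - ./run-tests.sh            # placeholder command, an assumption
  retry: 2                      # GitLab re-runs a failed job up to 2 more times

---
# .azure-pipelines.yml (hypothetical step)
steps:
  - script: ./run-tests.sh      # placeholder command, an assumption
    displayName: 'example test.py step'
    retryCountOnTaskFailure: 2  # Azure retries a failed task up to 2 times
```

Either way, a job is only reported as failed once the retries are exhausted, which matches the intent stated above: a job that still fails after the retries probably points at a real issue.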
This seems like a slippery slope. Do we know why things fail? I wonder
if we should disable the tests / builders instead, until it can be
corrected?
It happens in Azure, so it's not just the broken runner problem we have
in GitLab. And the problem is timing, as I said in the commit.
Sometimes we still get the RTC test failing. Other times we don't get
QEMU + U-Boot spawned in time (most often m68k, but sometimes x86).
How do we keep this list from growing?
Do we need to? The problem, in essence, is that we rely on free
resources, so sometimes some heavy lifts take longer. That's what this
flag is for.
I'm fairly sure the RTC thing could be made deterministic.
The spawning thing... is there a timeout for that? What actually fails?
I'll note that we don't have this problem with sandbox tests.
OK, but that's not relevant?
It is relevant to the discussion about using QEMU instead of sandbox,
e.g. with the TPM. I recall a discussion with Ilias a while back.
I'm sure we could make sandbox take too long to start as well, if enough
other things are going on with the system. And sandbox has its own set
of super frustrating issues instead, so I don't think this is a great
argument to have right here (I have to run it in Docker to get around
some application version requirements, and exclude the event_dump,
bootmgr, abootimg and gpt tests, which could otherwise run but fail for me).
I haven't heard about this before. Is there anything that could be done?
Regards,
Simon