
Dear Tom,
In message 20210812134833.GU858@bill-the-cat you wrote:
Alright, lets take a look at what kind of area of the code we're talking about. uclass_get is a pretty fundamental thing. If that fails, your system is on fire. Things are massively corrupt.
Full agreement here.
So yes, return codes need to be checked and passed. But no, not every single error path needs to print to the user along every part of an error path either.
So if "the system is on fire" is one of the cases where an error message should be omitted to save maybe 50 or 100 bytes of image size? This sounds wrong to me.
When a system crashes or hangs, it is extremely helpful to gen an indication of what happened.
Printing messages only with debug enabled is pretty worthless, as in the real world the failures will happen in the field when running the production image with debug not enabled.
And as soon as you do enable debug, you will have a different image, which may or may not show the problem. We have all been there before.
That would be the next patch in the series where the BSP author isn't currently checking the return value, and this series doesn't change that. Perhaps it should, and CC the maintainer.
Indeed.
But I think has been said a few times over the course of this series, what exactly is one going to do about the failure? Getting specific for a moment, if you're in the case of "shutdown the watchdog" and the watchdog doesn't shutdown like you want it to, do you hang and hope the watchdog is alive to kick things still? hang and lock the system? Figure the system is on fire anyhow but add another message to the failure spew?
Adding a message about the _cause_ of the failure, i. e. at least information about the place where it was first detected, is what will be helpful to the engineer who is supposed to analyze the problem.
And yes, if such a fundamental operation fails, it is wise to abort the operation of this device - either by hard resetting it or by hanging it, depending of what the chances are that a reboot might fix the problem.
IMO it is better to fail a broken device in a reliable mode instead of letting it run and having more spectacular crashes (likely with more serious consequences) happen later.
Again, I think the change that's needed to this patch is to make it return the error code to the caller. Let the caller decide. And make sure to CC the board maintainer on the next go-round so they can chime in about that.
If we agree that in the disputed case "the system is on fire", then there is actually very little to decide. There should be only one possible action: stop here, before more damage happens.
Best regards,
Wolfgang Denk