Broken watchdog in u-boot master branch

Hello! Watchdog code seems to be broken in u-boot master branch. On Nokia N900 I'm getting following message in qemu:
cyclic function rx51_watchdog took too long: 10000us vs 1000us max, disabling
Seems that watchdog core code is not prepared for "slower" watchdogs which communicate over slower i2c bus, like it is the case for N900.
Disabling slower watchdog is a bad idea as it would result in reboot loop instead of slower - but working code.

On Sun, Oct 09, 2022 at 09:12:25PM +0200, Pali Rohár wrote:
Hello! Watchdog code seems to be broken in u-boot master branch. On Nokia N900 I'm getting following message in qemu:
cyclic function rx51_watchdog took too long: 10000us vs 1000us max, disabling
Seems that watchdog core code is not prepared for "slower" watchdogs which communicate over slower i2c bus, like it is the case for N900.
Disabling slower watchdog is a bad idea as it would result in reboot loop instead of slower - but working code.
Adding the watchdog custodian. Have you bisected this behavior yet? Thanks.

On Monday 10 October 2022 09:55:12 Tom Rini wrote:
On Sun, Oct 09, 2022 at 09:12:25PM +0200, Pali Rohár wrote:
Hello! Watchdog code seems to be broken in u-boot master branch. On Nokia N900 I'm getting following message in qemu:
cyclic function rx51_watchdog took too long: 10000us vs 1000us max, disabling
Seems that watchdog core code is not prepared for "slower" watchdogs which communicate over slower i2c bus, like it is the case for N900.
Disabling slower watchdog is a bad idea as it would result in reboot loop instead of slower - but working code.
Adding the watchdog custodian. Have you bisected this behavior yet? Thanks.
No, I have not bisected it yet, I have not had time for it as I did other thing (asm compiler error, HW runtime issues, ...) mentioned in other threads.

On 10/10/2022 15.55, Tom Rini wrote:
On Sun, Oct 09, 2022 at 09:12:25PM +0200, Pali Rohár wrote:
Hello! Watchdog code seems to be broken in u-boot master branch. On Nokia N900 I'm getting following message in qemu:
cyclic function rx51_watchdog took too long: 10000us vs 1000us max, disabling
Seems that watchdog core code is not prepared for "slower" watchdogs which communicate over slower i2c bus, like it is the case for N900.
Disabling slower watchdog is a bad idea as it would result in reboot loop instead of slower - but working code.
So, a few thoughts.
First, I assume that that board has a very coarse-grained tick, probably just 1000Hz. Otherwise it would be pretty amazing for cpu_time to come out as 10ms exactly. That's not the board's fault, of course, just an observation, but it is something we need to bear in mind. If the resolution is merely 100Hz, so 10ms is simply the granularity, we cannot really meaningfully compare the cpu_time to anything less than that, because every once in a while it _will_ happen that we sample "now" just before the tick, run the function, then sample again just after, and it may only have taken 17us, yet the diff comes out as 10ms.
Second, perhaps the threshold should not be a compile-time constant, but instead a fraction of the requested call frequency (say 1.5%, 1/64). I.e., if we've registered a function to be called every 10 seconds, we'd check if its runtime exceeded (10000000 >> 6) us. Preferably per above that bound is rounded up to a multiple of the timer's granularity (we can get that, right?)
Third, perhaps we shouldn't disable it, but just print a (one-time) warning. Adding a "already-warned" field to struct cyclic_info is certainly simple enough.
Rasmus

On Tuesday 11 October 2022 09:18:56 Rasmus Villemoes wrote:
On 10/10/2022 15.55, Tom Rini wrote:
On Sun, Oct 09, 2022 at 09:12:25PM +0200, Pali Rohár wrote:
Hello! Watchdog code seems to be broken in u-boot master branch. On Nokia N900 I'm getting following message in qemu:
cyclic function rx51_watchdog took too long: 10000us vs 1000us max, disabling
Seems that watchdog core code is not prepared for "slower" watchdogs which communicate over slower i2c bus, like it is the case for N900.
Disabling slower watchdog is a bad idea as it would result in reboot loop instead of slower - but working code.
So, a few thoughts.
First, I assume that that board has a very coarse-grained tick, probably just 1000Hz.
IIRC i2c is running at 100kHz. But u-boot i2c driver contains more udelay() calls and more busy loops. And I was told that there are also some hw erratas. So at the end if some other driver calls u-boot dm function for i2c transfer, it can spend more time than expected...
Also there is another issue is that udelay() itself calls watchdog reset function. So if watchdog is on i2c which driver calls udelay() it means that watchdog reset function may be called in infinite loop. For this reason n900 watchdog driver has lock which prevents and eliminate this loop (if somebody calls reset during active lock then reset does nothing).
Otherwise it would be pretty amazing for cpu_time to come out as 10ms exactly. That's not the board's fault, of course, just an observation, but it is something we need to bear in mind. If the resolution is merely 100Hz, so 10ms is simply the granularity, we cannot really meaningfully compare the cpu_time to anything less than that, because every once in a while it _will_ happen that we sample "now" just before the tick, run the function, then sample again just after, and it may only have taken 17us, yet the diff comes out as 10ms.
Second, perhaps the threshold should not be a compile-time constant, but instead a fraction of the requested call frequency (say 1.5%, 1/64). I.e., if we've registered a function to be called every 10 seconds, we'd check if its runtime exceeded (10000000 >> 6) us. Preferably per above that bound is rounded up to a multiple of the timer's granularity (we can get that, right?)
Third, perhaps we shouldn't disable it, but just print a (one-time) warning. Adding a "already-warned" field to struct cyclic_info is certainly simple enough.
Rasmus

On 11.10.22 09:18, Rasmus Villemoes wrote:
On 10/10/2022 15.55, Tom Rini wrote:
On Sun, Oct 09, 2022 at 09:12:25PM +0200, Pali Rohár wrote:
Hello! Watchdog code seems to be broken in u-boot master branch. On Nokia N900 I'm getting following message in qemu:
cyclic function rx51_watchdog took too long: 10000us vs 1000us max, disabling
Seems that watchdog core code is not prepared for "slower" watchdogs which communicate over slower i2c bus, like it is the case for N900.
Disabling slower watchdog is a bad idea as it would result in reboot loop instead of slower - but working code.
So, a few thoughts.
First, I assume that that board has a very coarse-grained tick, probably just 1000Hz. Otherwise it would be pretty amazing for cpu_time to come out as 10ms exactly. That's not the board's fault, of course, just an observation, but it is something we need to bear in mind. If the resolution is merely 100Hz, so 10ms is simply the granularity, we cannot really meaningfully compare the cpu_time to anything less than that, because every once in a while it _will_ happen that we sample "now" just before the tick, run the function, then sample again just after, and it may only have taken 17us, yet the diff comes out as 10ms.
Second, perhaps the threshold should not be a compile-time constant, but instead a fraction of the requested call frequency (say 1.5%, 1/64). I.e., if we've registered a function to be called every 10 seconds, we'd check if its runtime exceeded (10000000 >> 6) us.
In general I think this is a good idea. Now that we have multiple users of the cyclic IF in one board it shows, that this threshold might better be defined as per-cyclic-function. Or ...
Preferably per above that bound is rounded up to a multiple of the timer's granularity (we can get that, right?)
Third, perhaps we shouldn't disable it, but just print a (one-time) warning. Adding a "already-warned" field to struct cyclic_info is certainly simple enough.
... just print this warning once. As it's pretty simple, I'll send a patch implementing this change later today. We can work on other improvements later, once we've fixed this watchdog breakage.
Thanks, Stefan

On Sun, Oct 09, 2022 at 09:12:25PM +0200, Pali Rohár wrote:
Hello! Watchdog code seems to be broken in u-boot master branch. On Nokia N900 I'm getting following message in qemu:
cyclic function rx51_watchdog took too long: 10000us vs 1000us max, disabling
Seems that watchdog core code is not prepared for "slower" watchdogs which communicate over slower i2c bus, like it is the case for N900.
Disabling slower watchdog is a bad idea as it would result in reboot loop instead of slower - but working code.
So, looking at this in more detail, we have CONFIG_CYCLIC_MAX_CPU_TIME_US as a configuration option (which is where the too long comes from). And picking a random CI run: https://source.denx.de/u-boot/u-boot/-/jobs/511177 I do see we hit this in CI once, but not every time, QEMU runs here. Is that the max time is configurable enough to satisfy your concerns here?

On Monday 10 October 2022 12:28:18 Tom Rini wrote:
On Sun, Oct 09, 2022 at 09:12:25PM +0200, Pali Rohár wrote:
Hello! Watchdog code seems to be broken in u-boot master branch. On Nokia N900 I'm getting following message in qemu:
cyclic function rx51_watchdog took too long: 10000us vs 1000us max, disabling
Seems that watchdog core code is not prepared for "slower" watchdogs which communicate over slower i2c bus, like it is the case for N900.
Disabling slower watchdog is a bad idea as it would result in reboot loop instead of slower - but working code.
So, looking at this in more detail, we have CONFIG_CYCLIC_MAX_CPU_TIME_US as a configuration option (which is where the too long comes from). And picking a random CI run: https://source.denx.de/u-boot/u-boot/-/jobs/511177 I do see we hit this in CI once, but not every time, QEMU runs here. Is that the max time is configurable enough to satisfy your concerns here?
-- Tom
It is needed to investigate, how to _properly_ fix this issue, not just workarounded it. Probably other boards may be affected.

On Mon, Oct 10, 2022 at 07:22:56PM +0200, Pali Rohár wrote:
On Monday 10 October 2022 12:28:18 Tom Rini wrote:
On Sun, Oct 09, 2022 at 09:12:25PM +0200, Pali Rohár wrote:
Hello! Watchdog code seems to be broken in u-boot master branch. On Nokia N900 I'm getting following message in qemu:
cyclic function rx51_watchdog took too long: 10000us vs 1000us max, disabling
Seems that watchdog core code is not prepared for "slower" watchdogs which communicate over slower i2c bus, like it is the case for N900.
Disabling slower watchdog is a bad idea as it would result in reboot loop instead of slower - but working code.
So, looking at this in more detail, we have CONFIG_CYCLIC_MAX_CPU_TIME_US as a configuration option (which is where the too long comes from). And picking a random CI run: https://source.denx.de/u-boot/u-boot/-/jobs/511177 I do see we hit this in CI once, but not every time, QEMU runs here. Is that the max time is configurable enough to satisfy your concerns here?
It is needed to investigate, how to _properly_ fix this issue, not just workarounded it. Probably other boards may be affected.
So it's the cyclic watchdog code, which we merged as early as possible that's the reason here. And it was merged as early as we could to see if there's problems. Are there problems? We're seeing "system too slow, disabling" on QEMU, sometimes, and the value of too slow is configurable. I know you reported other problems with n900 HW, so we can't see if it's failing there, and I don't have any omap3 HW setup in my lab atm, just newer generation boards and don't see the problem there. Which is why I'm asking, is being able to configure the "too slow" value enough? Or is there something else that needs to be done?

On Monday 10 October 2022 13:40:38 Tom Rini wrote:
On Mon, Oct 10, 2022 at 07:22:56PM +0200, Pali Rohár wrote:
On Monday 10 October 2022 12:28:18 Tom Rini wrote:
On Sun, Oct 09, 2022 at 09:12:25PM +0200, Pali Rohár wrote:
Hello! Watchdog code seems to be broken in u-boot master branch. On Nokia N900 I'm getting following message in qemu:
cyclic function rx51_watchdog took too long: 10000us vs 1000us max, disabling
Seems that watchdog core code is not prepared for "slower" watchdogs which communicate over slower i2c bus, like it is the case for N900.
Disabling slower watchdog is a bad idea as it would result in reboot loop instead of slower - but working code.
So, looking at this in more detail, we have CONFIG_CYCLIC_MAX_CPU_TIME_US as a configuration option (which is where the too long comes from). And picking a random CI run: https://source.denx.de/u-boot/u-boot/-/jobs/511177 I do see we hit this in CI once, but not every time, QEMU runs here. Is that the max time is configurable enough to satisfy your concerns here?
It is needed to investigate, how to _properly_ fix this issue, not just workarounded it. Probably other boards may be affected.
So it's the cyclic watchdog code, which we merged as early as possible that's the reason here. And it was merged as early as we could to see if there's problems. Are there problems? We're seeing "system too slow, disabling" on QEMU, sometimes, and the value of too slow is configurable. I know you reported other problems with n900 HW, so we can't see if it's failing there
I was tested it with older asm code (as described in that other email, via git checkout commit -- file) on n900 HW and watchdog problem is there too. Phone reboots in about 20 seconds. But as I do not have serial console, I do not know if that "disabling" message is printed there too (but I guess it is).
and I don't have any omap3 HW setup in my lab atm, just newer generation boards and don't see the problem there. Which is why I'm asking, is being able to configure the "too slow" value enough? Or is there something else that needs to be done?

On Mon, Oct 10, 2022 at 07:44:05PM +0200, Pali Rohár wrote:
On Monday 10 October 2022 13:40:38 Tom Rini wrote:
On Mon, Oct 10, 2022 at 07:22:56PM +0200, Pali Rohár wrote:
On Monday 10 October 2022 12:28:18 Tom Rini wrote:
On Sun, Oct 09, 2022 at 09:12:25PM +0200, Pali Rohár wrote:
Hello! Watchdog code seems to be broken in u-boot master branch. On Nokia N900 I'm getting following message in qemu:
cyclic function rx51_watchdog took too long: 10000us vs 1000us max, disabling
Seems that watchdog core code is not prepared for "slower" watchdogs which communicate over slower i2c bus, like it is the case for N900.
Disabling slower watchdog is a bad idea as it would result in reboot loop instead of slower - but working code.
So, looking at this in more detail, we have CONFIG_CYCLIC_MAX_CPU_TIME_US as a configuration option (which is where the too long comes from). And picking a random CI run: https://source.denx.de/u-boot/u-boot/-/jobs/511177 I do see we hit this in CI once, but not every time, QEMU runs here. Is that the max time is configurable enough to satisfy your concerns here?
It is needed to investigate, how to _properly_ fix this issue, not just workarounded it. Probably other boards may be affected.
So it's the cyclic watchdog code, which we merged as early as possible that's the reason here. And it was merged as early as we could to see if there's problems. Are there problems? We're seeing "system too slow, disabling" on QEMU, sometimes, and the value of too slow is configurable. I know you reported other problems with n900 HW, so we can't see if it's failing there
I was tested it with older asm code (as described in that other email, via git checkout commit -- file) on n900 HW and watchdog problem is there too. Phone reboots in about 20 seconds. But as I do not have serial console, I do not know if that "disabling" message is printed there too (but I guess it is).
I think I'm a bit baffled at this point, honestly. The watchdog timeout is 60 seconds. If you're confident in it being about 20 seconds, consistently, changing WATCHDOG_TIMEOUT_MSECS to say 10000 (so, 10 seconds) should let you see if U-Boot has configured the watchdog and it's being tripped, or if it's still at the prior stage value.
I would have expected that QEMU would see problems that real HW doesn't (the value in your log is much higher than the one in CI), but I could be wrong here.

On Monday 10 October 2022 13:56:10 Tom Rini wrote:
On Mon, Oct 10, 2022 at 07:44:05PM +0200, Pali Rohár wrote:
On Monday 10 October 2022 13:40:38 Tom Rini wrote:
On Mon, Oct 10, 2022 at 07:22:56PM +0200, Pali Rohár wrote:
On Monday 10 October 2022 12:28:18 Tom Rini wrote:
On Sun, Oct 09, 2022 at 09:12:25PM +0200, Pali Rohár wrote:
Hello! Watchdog code seems to be broken in u-boot master branch. On Nokia N900 I'm getting following message in qemu:
cyclic function rx51_watchdog took too long: 10000us vs 1000us max, disabling
Seems that watchdog core code is not prepared for "slower" watchdogs which communicate over slower i2c bus, like it is the case for N900.
Disabling slower watchdog is a bad idea as it would result in reboot loop instead of slower - but working code.
So, looking at this in more detail, we have CONFIG_CYCLIC_MAX_CPU_TIME_US as a configuration option (which is where the too long comes from). And picking a random CI run: https://source.denx.de/u-boot/u-boot/-/jobs/511177 I do see we hit this in CI once, but not every time, QEMU runs here. Is that the max time is configurable enough to satisfy your concerns here?
It is needed to investigate, how to _properly_ fix this issue, not just workarounded it. Probably other boards may be affected.
So it's the cyclic watchdog code, which we merged as early as possible that's the reason here. And it was merged as early as we could to see if there's problems. Are there problems? We're seeing "system too slow, disabling" on QEMU, sometimes, and the value of too slow is configurable. I know you reported other problems with n900 HW, so we can't see if it's failing there
I was tested it with older asm code (as described in that other email, via git checkout commit -- file) on n900 HW and watchdog problem is there too. Phone reboots in about 20 seconds. But as I do not have serial console, I do not know if that "disabling" message is printed there too (but I guess it is).
I think I'm a bit baffled at this point, honestly. The watchdog timeout is 60 seconds. If you're confident in it being about 20 seconds, consistently, changing WATCHDOG_TIMEOUT_MSECS to say 10000 (so, 10 seconds) should let you see if U-Boot has configured the watchdog and it's being tripped, or if it's still at the prior stage value.
$ git grep CONFIG_WATCHDOG_TIMEOUT_MSECS configs/nokia_rx51_defconfig configs/nokia_rx51_defconfig:CONFIG_WATCHDOG_TIMEOUT_MSECS=31000
Also watchdog is started by NOLO (which loads and execute U-Boot) so there can be some smaller timeout.
So I have feeling that on the real HW is same issue. cyclic code disabled watchdog kicking and then watchdog restarted phone.
I do not remember exact time (if it is 20s or 25s; I have not measured it precisely), but it sounds plausible.
I would have expected that QEMU would see problems that real HW doesn't (the value in your log is much higher than the one in CI), but I could be wrong here.
-- Tom

On Mon, Oct 10, 2022 at 08:01:23PM +0200, Pali Rohár wrote:
On Monday 10 October 2022 13:56:10 Tom Rini wrote:
On Mon, Oct 10, 2022 at 07:44:05PM +0200, Pali Rohár wrote:
On Monday 10 October 2022 13:40:38 Tom Rini wrote:
On Mon, Oct 10, 2022 at 07:22:56PM +0200, Pali Rohár wrote:
On Monday 10 October 2022 12:28:18 Tom Rini wrote:
On Sun, Oct 09, 2022 at 09:12:25PM +0200, Pali Rohár wrote: > Hello! Watchdog code seems to be broken in u-boot master branch. > On Nokia N900 I'm getting following message in qemu: > > cyclic function rx51_watchdog took too long: 10000us vs 1000us max, disabling > > Seems that watchdog core code is not prepared for "slower" watchdogs > which communicate over slower i2c bus, like it is the case for N900. > > Disabling slower watchdog is a bad idea as it would result in reboot > loop instead of slower - but working code.
So, looking at this in more detail, we have CONFIG_CYCLIC_MAX_CPU_TIME_US as a configuration option (which is where the too long comes from). And picking a random CI run: https://source.denx.de/u-boot/u-boot/-/jobs/511177 I do see we hit this in CI once, but not every time, QEMU runs here. Is that the max time is configurable enough to satisfy your concerns here?
It is needed to investigate, how to _properly_ fix this issue, not just workarounded it. Probably other boards may be affected.
So it's the cyclic watchdog code, which we merged as early as possible that's the reason here. And it was merged as early as we could to see if there's problems. Are there problems? We're seeing "system too slow, disabling" on QEMU, sometimes, and the value of too slow is configurable. I know you reported other problems with n900 HW, so we can't see if it's failing there
I was tested it with older asm code (as described in that other email, via git checkout commit -- file) on n900 HW and watchdog problem is there too. Phone reboots in about 20 seconds. But as I do not have serial console, I do not know if that "disabling" message is printed there too (but I guess it is).
I think I'm a bit baffled at this point, honestly. The watchdog timeout is 60 seconds. If you're confident in it being about 20 seconds, consistently, changing WATCHDOG_TIMEOUT_MSECS to say 10000 (so, 10 seconds) should let you see if U-Boot has configured the watchdog and it's being tripped, or if it's still at the prior stage value.
$ git grep CONFIG_WATCHDOG_TIMEOUT_MSECS configs/nokia_rx51_defconfig configs/nokia_rx51_defconfig:CONFIG_WATCHDOG_TIMEOUT_MSECS=31000
Also watchdog is started by NOLO (which loads and execute U-Boot) so there can be some smaller timeout.
So I have feeling that on the real HW is same issue. cyclic code disabled watchdog kicking and then watchdog restarted phone.
I do not remember exact time (if it is 20s or 25s; I have not measured it precisely), but it sounds plausible.
OK, so what happens if you increase CONFIG_CYCLIC_MAX_CPU_TIME_US to something very high (so we should still enable the watchdog and configure the timeout) along with CONFIG_WATCHDOG_TIMEOUT_MSECS being high too (so if we can't service it in time really it's so long as to be noticeable) ? Or CONFIG_WATCHDOG_TIMEOUT_MSECS to something much lower (so that if the device is resetting quicker we're crashing elsewhere) ?

On Mon, Oct 10, 2022 at 02:14:25PM -0400, Tom Rini wrote:
On Mon, Oct 10, 2022 at 08:01:23PM +0200, Pali Rohár wrote:
On Monday 10 October 2022 13:56:10 Tom Rini wrote:
On Mon, Oct 10, 2022 at 07:44:05PM +0200, Pali Rohár wrote:
On Monday 10 October 2022 13:40:38 Tom Rini wrote:
On Mon, Oct 10, 2022 at 07:22:56PM +0200, Pali Rohár wrote:
On Monday 10 October 2022 12:28:18 Tom Rini wrote: > On Sun, Oct 09, 2022 at 09:12:25PM +0200, Pali Rohár wrote: > > Hello! Watchdog code seems to be broken in u-boot master branch. > > On Nokia N900 I'm getting following message in qemu: > > > > cyclic function rx51_watchdog took too long: 10000us vs 1000us max, disabling > > > > Seems that watchdog core code is not prepared for "slower" watchdogs > > which communicate over slower i2c bus, like it is the case for N900. > > > > Disabling slower watchdog is a bad idea as it would result in reboot > > loop instead of slower - but working code. > > So, looking at this in more detail, we have > CONFIG_CYCLIC_MAX_CPU_TIME_US as a configuration option (which is where > the too long comes from). And picking a random CI run: > https://source.denx.de/u-boot/u-boot/-/jobs/511177 > I do see we hit this in CI once, but not every time, QEMU runs here. Is > that the max time is configurable enough to satisfy your concerns here?
It is needed to investigate, how to _properly_ fix this issue, not just workarounded it. Probably other boards may be affected.
So it's the cyclic watchdog code, which we merged as early as possible that's the reason here. And it was merged as early as we could to see if there's problems. Are there problems? We're seeing "system too slow, disabling" on QEMU, sometimes, and the value of too slow is configurable. I know you reported other problems with n900 HW, so we can't see if it's failing there
I was tested it with older asm code (as described in that other email, via git checkout commit -- file) on n900 HW and watchdog problem is there too. Phone reboots in about 20 seconds. But as I do not have serial console, I do not know if that "disabling" message is printed there too (but I guess it is).
I think I'm a bit baffled at this point, honestly. The watchdog timeout is 60 seconds. If you're confident in it being about 20 seconds, consistently, changing WATCHDOG_TIMEOUT_MSECS to say 10000 (so, 10 seconds) should let you see if U-Boot has configured the watchdog and it's being tripped, or if it's still at the prior stage value.
$ git grep CONFIG_WATCHDOG_TIMEOUT_MSECS configs/nokia_rx51_defconfig configs/nokia_rx51_defconfig:CONFIG_WATCHDOG_TIMEOUT_MSECS=31000
Also watchdog is started by NOLO (which loads and execute U-Boot) so there can be some smaller timeout.
So I have feeling that on the real HW is same issue. cyclic code disabled watchdog kicking and then watchdog restarted phone.
I do not remember exact time (if it is 20s or 25s; I have not measured it precisely), but it sounds plausible.
OK, so what happens if you increase CONFIG_CYCLIC_MAX_CPU_TIME_US to something very high (so we should still enable the watchdog and configure the timeout) along with CONFIG_WATCHDOG_TIMEOUT_MSECS being high too (so if we can't service it in time really it's so long as to be noticeable) ? Or CONFIG_WATCHDOG_TIMEOUT_MSECS to something much lower (so that if the device is resetting quicker we're crashing elsewhere) ?
OK, on my beagleboard xM with a small change: diff --git a/drivers/watchdog/omap_wdt.c b/drivers/watchdog/omap_wdt.c index ca2bc7cfb59e..f0e57b4f7286 100644 --- a/drivers/watchdog/omap_wdt.c +++ b/drivers/watchdog/omap_wdt.c @@ -39,7 +39,7 @@ #include <common.h> #include <log.h> #include <watchdog.h> -#include <asm/arch/hardware.h> +#include <asm/ti-common/omap_wdt.h> #include <asm/io.h> #include <asm/processor.h> #include <asm/arch/cpu.h>
On my beagleboard xM I now see: U-Boot SPL 2022.10-00459-g73e741b8ee46-dirty (Oct 10 2022 - 15:18:38 -0400) Trying to boot from MMC1
U-Boot 2022.10-00459-g73e741b8ee46-dirty (Oct 10 2022 - 15:18:38 -0400)
OMAP3630/3730-GP ES1.1, CPU-OPP2, L3-200MHz, Max CPU Clock 800 MHz Model: TI OMAP3 BeagleBoard OMAP3 Beagle board + LPDDR/NAND I2C: ready DRAM: 256 MiB Core: 45 devices, 19 uclasses, devicetree: separate WDT: Started wdt@48314000 without servicing (60s timeout) NAND: 0 MiB MMC: OMAP SD/MMC: 0 Loading Environment from NAND... *** Warning - readenv() failed, using default environment
Beagle xM Rev A/B No EEPROM on expansion board OMAP die ID: 6e5e00211ff00000015739eb08031024 Net: No ethernet found. Hit any key to stop autoboot: 0
So, this is as close as I can get to testing on n900 HW, and it's fine here.

On Monday 10 October 2022 15:24:13 Tom Rini wrote:
On Mon, Oct 10, 2022 at 02:14:25PM -0400, Tom Rini wrote:
On Mon, Oct 10, 2022 at 08:01:23PM +0200, Pali Rohár wrote:
On Monday 10 October 2022 13:56:10 Tom Rini wrote:
On Mon, Oct 10, 2022 at 07:44:05PM +0200, Pali Rohár wrote:
On Monday 10 October 2022 13:40:38 Tom Rini wrote:
On Mon, Oct 10, 2022 at 07:22:56PM +0200, Pali Rohár wrote: > On Monday 10 October 2022 12:28:18 Tom Rini wrote: > > On Sun, Oct 09, 2022 at 09:12:25PM +0200, Pali Rohár wrote: > > > Hello! Watchdog code seems to be broken in u-boot master branch. > > > On Nokia N900 I'm getting following message in qemu: > > > > > > cyclic function rx51_watchdog took too long: 10000us vs 1000us max, disabling > > > > > > Seems that watchdog core code is not prepared for "slower" watchdogs > > > which communicate over slower i2c bus, like it is the case for N900. > > > > > > Disabling slower watchdog is a bad idea as it would result in reboot > > > loop instead of slower - but working code. > > > > So, looking at this in more detail, we have > > CONFIG_CYCLIC_MAX_CPU_TIME_US as a configuration option (which is where > > the too long comes from). And picking a random CI run: > > https://source.denx.de/u-boot/u-boot/-/jobs/511177 > > I do see we hit this in CI once, but not every time, QEMU runs here. Is > > that the max time is configurable enough to satisfy your concerns here? > > It is needed to investigate, how to _properly_ fix this issue, not just > workarounded it. Probably other boards may be affected.
So it's the cyclic watchdog code, which we merged as early as possible that's the reason here. And it was merged as early as we could to see if there's problems. Are there problems? We're seeing "system too slow, disabling" on QEMU, sometimes, and the value of too slow is configurable. I know you reported other problems with n900 HW, so we can't see if it's failing there
I was tested it with older asm code (as described in that other email, via git checkout commit -- file) on n900 HW and watchdog problem is there too. Phone reboots in about 20 seconds. But as I do not have serial console, I do not know if that "disabling" message is printed there too (but I guess it is).
I think I'm a bit baffled at this point, honestly. The watchdog timeout is 60 seconds. If you're confident in it being about 20 seconds, consistently, changing WATCHDOG_TIMEOUT_MSECS to say 10000 (so, 10 seconds) should let you see if U-Boot has configured the watchdog and it's being tripped, or if it's still at the prior stage value.
$ git grep CONFIG_WATCHDOG_TIMEOUT_MSECS configs/nokia_rx51_defconfig configs/nokia_rx51_defconfig:CONFIG_WATCHDOG_TIMEOUT_MSECS=31000
Also watchdog is started by NOLO (which loads and execute U-Boot) so there can be some smaller timeout.
So I have feeling that on the real HW is same issue. cyclic code disabled watchdog kicking and then watchdog restarted phone.
I do not remember exact time (if it is 20s or 25s; I have not measured it precisely), but it sounds plausible.
OK, so what happens if you increase CONFIG_CYCLIC_MAX_CPU_TIME_US to something very high (so we should still enable the watchdog and configure the timeout) along with CONFIG_WATCHDOG_TIMEOUT_MSECS being high too (so if we can't service it in time really it's so long as to be noticeable) ? Or CONFIG_WATCHDOG_TIMEOUT_MSECS to something much lower (so that if the device is resetting quicker we're crashing elsewhere) ?
OK, on my beagleboard xM with a small change: diff --git a/drivers/watchdog/omap_wdt.c b/drivers/watchdog/omap_wdt.c index ca2bc7cfb59e..f0e57b4f7286 100644 --- a/drivers/watchdog/omap_wdt.c +++ b/drivers/watchdog/omap_wdt.c @@ -39,7 +39,7 @@ #include <common.h> #include <log.h> #include <watchdog.h> -#include <asm/arch/hardware.h> +#include <asm/ti-common/omap_wdt.h> #include <asm/io.h> #include <asm/processor.h> #include <asm/arch/cpu.h>
On my beagleboard xM I now see: U-Boot SPL 2022.10-00459-g73e741b8ee46-dirty (Oct 10 2022 - 15:18:38 -0400) Trying to boot from MMC1
U-Boot 2022.10-00459-g73e741b8ee46-dirty (Oct 10 2022 - 15:18:38 -0400)
OMAP3630/3730-GP ES1.1, CPU-OPP2, L3-200MHz, Max CPU Clock 800 MHz Model: TI OMAP3 BeagleBoard OMAP3 Beagle board + LPDDR/NAND I2C: ready DRAM: 256 MiB Core: 45 devices, 19 uclasses, devicetree: separate WDT: Started wdt@48314000 without servicing (60s timeout) NAND: 0 MiB MMC: OMAP SD/MMC: 0 Loading Environment from NAND... *** Warning - readenv() failed, using default environment
Beagle xM Rev A/B No EEPROM on expansion board OMAP die ID: 6e5e00211ff00000015739eb08031024 Net: No ethernet found. Hit any key to stop autoboot: 0
So, this is as close as I can get to testing on n900 HW, and it's fine here.
-- Tom
This is omap watchdog which has registers directly on the processor.
With N900 is issue that uses watchdog connected via i2c (not that omap watchdog) and there is some issue in new u-boot that i2c does not work at high speeds. So i2c is configured at lower speed now (when it works correctly). u-boot i2c driver has more udelay() calls during i2c transfer and I think that this is the reason why n900 watchdog kick spend much more time... (just guessing). If you have watchdog registers directly mapped to MMIO (and even part of processor) then any ldr or str instruction is executed immediately and not with ms delay timeouts.
It was easier to lower i2c speed than debugging it as nobody knows why since some u-boot version, i2c transfers started failing at higher speeds.
I will try to play with those config options later to see what happens.
participants (4)
-
Pali Rohár
-
Rasmus Villemoes
-
Stefan Roese
-
Tom Rini