[PATCH next RFC 0/5] rockchip: panic() when DRAM init fails

I am the unfortunate owner of an RK3399 Puma that is failing DRAM init every now and then with the following cryptic error message: """ pctl_start: Failed to init pctl channel 0 """
When it fails, the current logic is implemented in such a way that the board enters an infinite while loop. This is not ideal for embedded systems that are not easily accessible and also an issue when having devices in a CI lab for example.
Therefore, let's simply panic if the DRAM fails to init. This was tested only on RK3399 Puma, similar changes may be required for other SoCs to properly propagate the errors.
While this is all but a work-around instead of fixing the DRAM init sequence, I believe it is important for resilience of devices in the field.
Marking this as RFC because: 1) panic()s for all Rockchip SoCs but only tested on RK3399 2) unsure about error return values (ENODEV and the likes), not sure what would be better than that though :/
Note that one can still hang() if they want by setting PANIC_HANG symbol.
Signed-off-by: Quentin Schulz quentin.schulz@cherry.de --- Quentin Schulz (5): ram: rk3399: allow to fail DRAM init if pctl_start fails ram: rk3399: fail probe if DRAM init failed rockchip: spl/tpl: panic when DRAM init failed ram: rk3399: merge two consecutive ifs with same condition ram: rk3399: fail DRAM init when pctl channel init fails instead of hanging
arch/arm/mach-rockchip/spl.c | 7 +++---- arch/arm/mach-rockchip/tpl.c | 6 ++---- drivers/ram/rockchip/sdram_rk3399.c | 24 +++++++++++------------- 3 files changed, 16 insertions(+), 21 deletions(-) --- base-commit: 56accc56b9aab87ef4809ccc588e1257969cd271 change-id: 20241105-rk3399-dram-init-207325c2d9ca
Best regards,

From: Quentin Schulz quentin.schulz@cherry.de
pctl_start() can fail but how this is handled currently is by calling hang() directly in the driver instead of propagating the error and letting other code handling it.
This prepares for a future commit which removes the hang() in favor of letting Rockchip's arch code handle it in a common place.
No change in behavior expected in this commit.
Signed-off-by: Quentin Schulz quentin.schulz@cherry.de --- drivers/ram/rockchip/sdram_rk3399.c | 11 +++++++---- 1 file changed, 7 insertions(+), 4 deletions(-)
diff --git a/drivers/ram/rockchip/sdram_rk3399.c b/drivers/ram/rockchip/sdram_rk3399.c index 6fa8f268770d1c6d69431857e115264957d75703..63d7852555cea3296b7cd9eed422c4d5d1c9776b 100644 --- a/drivers/ram/rockchip/sdram_rk3399.c +++ b/drivers/ram/rockchip/sdram_rk3399.c @@ -791,9 +791,9 @@ static void set_ds_odt(const struct chan_info *chan, phy_io_config(denali_phy, denali_ctl, params, mr5); }
-static void pctl_start(struct dram_info *dram, - struct rk3399_sdram_params *params, - u32 channel_mask) +static int pctl_start(struct dram_info *dram, + struct rk3399_sdram_params *params, + u32 channel_mask) { const struct chan_info *chan_0 = &dram->chan[0]; const struct chan_info *chan_1 = &dram->chan[1]; @@ -892,6 +892,8 @@ static void pctl_start(struct dram_info *dram, params->phy_regs.denali_phy[937] & 0xFF); } + + return 0; }
static int pctl_cfg(struct dram_info *dram, const struct chan_info *chan, @@ -2940,7 +2942,8 @@ static int sdram_init(struct dram_info *dram, }
/* start to trigger initialization */ - pctl_start(dram, params, 3); + if (pctl_start(dram, params, 3)) + return -EINVAL;
/* LPDDR2/LPDDR3 need to wait DAI complete, max 10us */ if (dramtype == LPDDR3)

From: Quentin Schulz quentin.schulz@cherry.de
We should fail the probe of the DRAM device if the DRAM init failed so let's return a negative error-code if rk3399_dmc_init returns non-zero.
Signed-off-by: Quentin Schulz quentin.schulz@cherry.de --- drivers/ram/rockchip/sdram_rk3399.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/ram/rockchip/sdram_rk3399.c b/drivers/ram/rockchip/sdram_rk3399.c index 63d7852555cea3296b7cd9eed422c4d5d1c9776b..5a82c7ccbdf161526f0d1450dde6881441e59543 100644 --- a/drivers/ram/rockchip/sdram_rk3399.c +++ b/drivers/ram/rockchip/sdram_rk3399.c @@ -3157,7 +3157,7 @@ static int rk3399_dmc_probe(struct udevice *dev) priv->pmugrf = syscon_get_first_range(ROCKCHIP_SYSCON_PMUGRF); debug("%s: pmugrf = %p\n", __func__, priv->pmugrf); if (phase_sdram_init() && rk3399_dmc_init(dev)) - return 0; + return -EINVAL;
/* * There is no point in checking the SDRAM size in TPL as it is not

From: Quentin Schulz quentin.schulz@cherry.de
A failed DRAM init is synonym with a broken board, so let's panic instead of silently skipping the rest of the xPL stage and making it fail later in some other code parts.
This also has the benefit of (by default) resetting the CPU which could help recover the device without human intervention were the DRAM init issue related to not-100%-reproducibly fail the DRAM init.
Signed-off-by: Quentin Schulz quentin.schulz@cherry.de --- arch/arm/mach-rockchip/spl.c | 7 +++---- arch/arm/mach-rockchip/tpl.c | 6 ++---- 2 files changed, 5 insertions(+), 8 deletions(-)
diff --git a/arch/arm/mach-rockchip/spl.c b/arch/arm/mach-rockchip/spl.c index f4d29bbdd17e8b57dc0b2fdd90a8600fba6ee866..9b2d7023fb862bf696ad376b6f0b30e834aa41be 100644 --- a/arch/arm/mach-rockchip/spl.c +++ b/arch/arm/mach-rockchip/spl.c @@ -112,10 +112,9 @@ void board_init_f(ulong dummy) #if !defined(CONFIG_TPL) || defined(CONFIG_SPL_RAM) debug("\nspl:init dram\n"); ret = dram_init(); - if (ret) { - printf("DRAM init failed: %d\n", ret); - return; - } + if (ret) + panic("DRAM init failed: %d\n", ret); + gd->ram_top = gd->ram_base + get_effective_memsize(); gd->ram_top = board_get_usable_ram_top(gd->ram_size);
diff --git a/arch/arm/mach-rockchip/tpl.c b/arch/arm/mach-rockchip/tpl.c index bbb9329e725af79ea4c4049aa7890a4a143e7df5..5da23913c147cab18377f09b566f28b0bee4e935 100644 --- a/arch/arm/mach-rockchip/tpl.c +++ b/arch/arm/mach-rockchip/tpl.c @@ -55,10 +55,8 @@ void board_init_f(ulong dummy) timer_init();
ret = uclass_get_device(UCLASS_RAM, 0, &dev); - if (ret) { - printf("DRAM init failed: %d\n", ret); - return; - } + if (ret) + panic("DRAM init failed: %d\n", ret); }
int board_return_to_bootrom(struct spl_image_info *spl_image,

From: Quentin Schulz quentin.schulz@cherry.de
Nothing changes channel_mask in the first if block of the same condition so it is safe to merge the two if blocks with the same condition into one.
No intended change in behavior.
Signed-off-by: Quentin Schulz quentin.schulz@cherry.de --- drivers/ram/rockchip/sdram_rk3399.c | 5 +---- 1 file changed, 1 insertion(+), 4 deletions(-)
diff --git a/drivers/ram/rockchip/sdram_rk3399.c b/drivers/ram/rockchip/sdram_rk3399.c index 5a82c7ccbdf161526f0d1450dde6881441e59543..d953dda13cd3daa689fa362dc511c372db454064 100644 --- a/drivers/ram/rockchip/sdram_rk3399.c +++ b/drivers/ram/rockchip/sdram_rk3399.c @@ -811,9 +811,7 @@ static int pctl_start(struct dram_info *dram, if (channel_mask & 1) { writel(0x01000000, &ddrc0_con_0); clrsetbits_le32(&denali_phy_0[957], 0x3 << 24, 0x2 << 24); - }
- if (channel_mask & 1) { count = 0; while (!(readl(&denali_ctl_0[203]) & (1 << 3))) { if (count > 1000) { @@ -849,8 +847,7 @@ static int pctl_start(struct dram_info *dram, if (channel_mask & 2) { writel(0x01000000, &ddrc1_con_0); clrsetbits_le32(&denali_phy_1[957], 0x3 << 24, 0x2 << 24); - } - if (channel_mask & 2) { + count = 0; while (!(readl(&denali_ctl_1[203]) & (1 << 3))) { if (count > 1000) {

From: Quentin Schulz quentin.schulz@cherry.de
Instead of hanging via an infinite while loop, propagate the fail to the caller and let it handle the fail. For RK3399, this means that panic() will be called, (by default) resetting the CPU and giving another chance at doing a DRAM init.
Signed-off-by: Quentin Schulz quentin.schulz@cherry.de --- drivers/ram/rockchip/sdram_rk3399.c | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-)
diff --git a/drivers/ram/rockchip/sdram_rk3399.c b/drivers/ram/rockchip/sdram_rk3399.c index d953dda13cd3daa689fa362dc511c372db454064..591e1469afb25ba606d8f7e44f63e50905b4ab31 100644 --- a/drivers/ram/rockchip/sdram_rk3399.c +++ b/drivers/ram/rockchip/sdram_rk3399.c @@ -817,8 +817,7 @@ static int pctl_start(struct dram_info *dram, if (count > 1000) { printf("%s: Failed to init pctl channel 0\n", __func__); - while (1) - ; + return -1; } udelay(1); count++; @@ -853,8 +852,7 @@ static int pctl_start(struct dram_info *dram, if (count > 1000) { printf("%s: Failed to init pctl channel 1\n", __func__); - while (1) - ; + return -2; } udelay(1); count++;

Hi Quentin,
On 2024/11/6 01:21, Quentin Schulz wrote:
From: Quentin Schulz quentin.schulz@cherry.de
Instead of hanging via an infinite while loop, propagate the fail to the caller and let it handle the fail. For RK3399, this means that panic() will be called, (by default) resetting the CPU and giving another chance at doing a DRAM init.
Signed-off-by: Quentin Schulz quentin.schulz@cherry.de
drivers/ram/rockchip/sdram_rk3399.c | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-)
diff --git a/drivers/ram/rockchip/sdram_rk3399.c b/drivers/ram/rockchip/sdram_rk3399.c index d953dda13cd3daa689fa362dc511c372db454064..591e1469afb25ba606d8f7e44f63e50905b4ab31 100644 --- a/drivers/ram/rockchip/sdram_rk3399.c +++ b/drivers/ram/rockchip/sdram_rk3399.c @@ -817,8 +817,7 @@ static int pctl_start(struct dram_info *dram, if (count > 1000) {
Could you try with increase the timeout count from 1000 to 1000000?
Thanks,
- Kever
printf("%s: Failed to init pctl channel 0\n", __func__);
while (1)
;
return -1; } udelay(1); count++;
@@ -853,8 +852,7 @@ static int pctl_start(struct dram_info *dram, if (count > 1000) { printf("%s: Failed to init pctl channel 1\n", __func__);
while (1)
;
return -2; } udelay(1); count++;

Hi Kever,
On 11/8/24 4:19 AM, Kever Yang wrote:
Hi Quentin,
On 2024/11/6 01:21, Quentin Schulz wrote:
From: Quentin Schulz quentin.schulz@cherry.de
Instead of hanging via an infinite while loop, propagate the fail to the caller and let it handle the fail. For RK3399, this means that panic() will be called, (by default) resetting the CPU and giving another chance at doing a DRAM init.
Signed-off-by: Quentin Schulz quentin.schulz@cherry.de
drivers/ram/rockchip/sdram_rk3399.c | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-)
diff --git a/drivers/ram/rockchip/sdram_rk3399.c b/drivers/ram/ rockchip/sdram_rk3399.c index d953dda13cd3daa689fa362dc511c372db454064..591e1469afb25ba606d8f7e44f63e50905b4ab31 100644 --- a/drivers/ram/rockchip/sdram_rk3399.c +++ b/drivers/ram/rockchip/sdram_rk3399.c @@ -817,8 +817,7 @@ static int pctl_start(struct dram_info *dram, if (count > 1000) {
Could you try with increase the timeout count from 1000 to 1000000?
Just tested, didn't help, failed at the same place.
Cheers, Quentin

Hi Quentin,
On 2024/11/6 01:21, Quentin Schulz wrote:
I am the unfortunate owner of an RK3399 Puma that is failing DRAM init every now and then with the following cryptic error message: """ pctl_start: Failed to init pctl channel 0 """
When it fails, the current logic is implemented in such a way that the board enters an infinite while loop. This is not ideal for embedded systems that are not easily accessible and also an issue when having devices in a CI lab for example.
It's fine for this patch to make the choice in panic to hang or to reset the system.
But I don't understand how this help you, does it may success to init the dram after
the reset in your case? I think in most case the board will keep reset because this is
more likely hardware issue and not able to became success after reset.
Thanks,
- Kever
Therefore, let's simply panic if the DRAM fails to init. This was tested only on RK3399 Puma, similar changes may be required for other SoCs to properly propagate the errors.
While this is all but a work-around instead of fixing the DRAM init sequence, I believe it is important for resilience of devices in the field.
Marking this as RFC because:
- panic()s for all Rockchip SoCs but only tested on RK3399
- unsure about error return values (ENODEV and the likes), not sure what would be better than that though :/
Note that one can still hang() if they want by setting PANIC_HANG symbol.
Signed-off-by: Quentin Schulz quentin.schulz@cherry.de
Quentin Schulz (5): ram: rk3399: allow to fail DRAM init if pctl_start fails ram: rk3399: fail probe if DRAM init failed rockchip: spl/tpl: panic when DRAM init failed ram: rk3399: merge two consecutive ifs with same condition ram: rk3399: fail DRAM init when pctl channel init fails instead of hanging
arch/arm/mach-rockchip/spl.c | 7 +++---- arch/arm/mach-rockchip/tpl.c | 6 ++---- drivers/ram/rockchip/sdram_rk3399.c | 24 +++++++++++------------- 3 files changed, 16 insertions(+), 21 deletions(-)
base-commit: 56accc56b9aab87ef4809ccc588e1257969cd271 change-id: 20241105-rk3399-dram-init-207325c2d9ca
Best regards,

Hi Kever,
On 11/7/24 4:35 AM, Kever Yang wrote:
Hi Quentin,
On 2024/11/6 01:21, Quentin Schulz wrote:
I am the unfortunate owner of an RK3399 Puma that is failing DRAM init every now and then with the following cryptic error message: """ pctl_start: Failed to init pctl channel 0 """
When it fails, the current logic is implemented in such a way that the board enters an infinite while loop. This is not ideal for embedded systems that are not easily accessible and also an issue when having devices in a CI lab for example.
It's fine for this patch to make the choice in panic to hang or to reset the system.
But I don't understand how this help you, does it may success to init the dram after
the reset in your case? I think in most case the board will keep reset because this is
more likely hardware issue and not able to became success after reset.
Exactly. After a reset (manual by pressing the reset button, or by panic()) the board works again. I don't have exact numbers for the probability for this to happen, but I had a board doing boot-into-Linux tests in a loop and it failed multiple times in the same day. Gut feeling is around ~10-20 times for ~2200 reboots.
I am NOT saying this is not related to a HW issue though, but I've been using that RK3399 Puma for a while, and sometimes I have this issue. The rest of the time it seems to "work fine" (haven't run memtest or extensive testing on that unit though).
I am NOT sure BUT I think applying https://github.com/rockchip-linux/u-boot/commit/e1652d3918b182e107addd5e6f45... on upstream U-Boot helped lower the probability (it still happened though!). Could be totally unrelated though and just flukes.
Cheers, Quentin

Hi Kever,
On 11/5/24 6:21 PM, Quentin Schulz wrote:
I am the unfortunate owner of an RK3399 Puma that is failing DRAM init every now and then with the following cryptic error message: """ pctl_start: Failed to init pctl channel 0 """
When it fails, the current logic is implemented in such a way that the board enters an infinite while loop. This is not ideal for embedded systems that are not easily accessible and also an issue when having devices in a CI lab for example.
Therefore, let's simply panic if the DRAM fails to init. This was tested only on RK3399 Puma, similar changes may be required for other SoCs to properly propagate the errors.
While this is all but a work-around instead of fixing the DRAM init sequence, I believe it is important for resilience of devices in the field.
Marking this as RFC because:
- panic()s for all Rockchip SoCs but only tested on RK3399
- unsure about error return values (ENODEV and the likes), not sure what would be better than that though :/
Note that one can still hang() if they want by setting PANIC_HANG symbol.
Signed-off-by: Quentin Schulz quentin.schulz@cherry.de
What should we do with those patches?
Additionally, is there something I could do to help fix the DDR init on RK3399? Increasing the timeout didn't help sadly :/
Cheers, Quentin
participants (3)
-
Kever Yang
-
Quentin Schulz
-
Quentin Schulz