Re: [PATCH 1/1] riscv: don't jump to 0x0 in handle_ipi()

newer
[PATCH v2 0/3] binman: Make FIT...

older
from...

Rick Chen

11 Aug 2020 11 Aug '20

8:20 a.m.

Hi Heinrich

...

From: Heinrich Schuchardt [mailto:xypron.glpk@gmx.de] Sent: Tuesday, August 11, 2020 11:57 AM To: Rick Jian-Zhi Chen(陳建志) Cc: Sean Anderson; Lukas Auer; Simon Glass; Anup Patel; Daniel Schwierzeck; u-boot@lists.denx.de; Heinrich Schuchardt Subject: [PATCH 1/1] riscv: don't jump to 0x0 in handle_ipi()

At least on the Kendryte K210:

gd->arch.available_harts= 0x0000000000000003 arch.ipi[0].addr= gd->0x0000000000000000 arch.ipi[0].arg0= 0x0000000000000000 gd->arch.ipi[0].arg1= 0x0000000000000000 arch.ipi[1].addr= gd->0x0000000000000000 arch.ipi[1].arg0= 0x0000000000000000 gd->arch.ipi[1].arg1= 0x0000000000000000

We should not jump to 0x0 to handle an interrupt.

Can you explain why K210 be affected by it ?

...

Signed-off-by: Heinrich Schuchardt xypron.glpk@gmx.de

arch/riscv/lib/smp.c | 2 ++ 1 file changed, 2 insertions(+)

diff --git a/arch/riscv/lib/smp.c b/arch/riscv/lib/smp.c index ac22136314..d725fa32e8 100644 --- a/arch/riscv/lib/smp.c +++ b/arch/riscv/lib/smp.c @@ -96,6 +96,8 @@ void handle_ipi(ulong hart) return; }
  if (!smp_function)
          return;
  smp_function(hart, gd->arch.ipi[hart].arg0, gd->arch.ipi[hart].arg1);  }

I remember Sean add this check in [v10,14/21] riscv: Clean up IPI initialization code . And I ask him to remove. https://patchwork.ozlabs.org/project/uboot/patch/20200503024637.327733-15-se...

Thanks, Rick

...

-- 2.28.0

Show replies by date

Heinrich Schuchardt

11 Aug 11 Aug

9:50 a.m.

New subject: [PATCH 1/1] riscv: don't jump to 0x0 in handle_ipi()

On 11.08.20 08:20, Rick Chen wrote:

...

Hi Heinrich

...
From: Heinrich Schuchardt [mailto:xypron.glpk@gmx.de] Sent: Tuesday, August 11, 2020 11:57 AM To: Rick Jian-Zhi Chen(陳建志) Cc: Sean Anderson; Lukas Auer; Simon Glass; Anup Patel; Daniel Schwierzeck; u-boot@lists.denx.de; Heinrich Schuchardt Subject: [PATCH 1/1] riscv: don't jump to 0x0 in handle_ipi()

At least on the Kendryte K210:

gd->arch.available_harts= 0x0000000000000003 arch.ipi[0].addr= gd->0x0000000000000000 arch.ipi[0].arg0= 0x0000000000000000 gd->arch.ipi[0].arg1= 0x0000000000000000 arch.ipi[1].addr= gd->0x0000000000000000 arch.ipi[1].arg0= 0x0000000000000000 gd->arch.ipi[1].arg1= 0x0000000000000000

We should not jump to 0x0 to handle an interrupt.

Can you explain why K210 be affected by it ?

I have been running U-Boot on the MaixDuino.

Without this patch secondary_hart_loop() is reached only once. With the patch it is reached several thousand times.

I would not expect NULL to contain any code that should be executed by the secondary hart. See doc/board/sipeed/maix.rst:

Address Size Description ========== ========= =========== 0x00000000 0x1000 debug 0x00001000 0x1000 rom 0x02000000 0xC000 clint

...

...
Signed-off-by: Heinrich Schuchardt xypron.glpk@gmx.de

arch/riscv/lib/smp.c | 2 ++ 1 file changed, 2 insertions(+)

diff --git a/arch/riscv/lib/smp.c b/arch/riscv/lib/smp.c index ac22136314..d725fa32e8 100644 --- a/arch/riscv/lib/smp.c +++ b/arch/riscv/lib/smp.c @@ -96,6 +96,8 @@ void handle_ipi(ulong hart) return; }
  if (!smp_function)
          return;
  smp_function(hart, gd->arch.ipi[hart].arg0, gd->arch.ipi[hart].arg1);  }
I remember Sean add this check in [v10,14/21] riscv: Clean up IPI initialization code . And I ask him to remove. https://patchwork.ozlabs.org/project/uboot/patch/20200503024637.327733-15-se...

Your comment was: "Please remove the sanity check. If zero address is the intended jump point, then system will work abnormal."

The only place where gd->arch.ipi[hart].addr is set is:

arch/riscv/lib/smp.c:53 send_ipi_many(): gd->arch.ipi[reg].addr = ipi->addr;

send_ipi_many() is only called in smp_call_function().

So the line

smp_function(hart, gd->arch.ipi[hart].arg0, gd->arch.ipi[hart].arg1);

can only work if smp_function() has been called before this point at any time. The start code only calls it in spl_secondary_hart_stack_gd_setup().

Why do we call handle_ipi() on the secondary hart at all?

Where do you expect it to jump if U-Boot is the first binary loaded?

Even if spl_secondary_hart_stack_gd_setup() has initialized the jump address to secondary_hart_relocate(), do we want the secondary hart to execute it again and again?

Best regards

Heinrich

Sean Anderson

12:32 p.m.

New subject: [PATCH 1/1] riscv: don't jump to 0x0 in handle_ipi()

On 8/11/20 3:50 AM, Heinrich Schuchardt wrote:

...

On 11.08.20 08:20, Rick Chen wrote:

...
Hi Heinrich

...
From: Heinrich Schuchardt [mailto:xypron.glpk@gmx.de] Sent: Tuesday, August 11, 2020 11:57 AM To: Rick Jian-Zhi Chen(陳建志) Cc: Sean Anderson; Lukas Auer; Simon Glass; Anup Patel; Daniel Schwierzeck; u-boot@lists.denx.de; Heinrich Schuchardt Subject: [PATCH 1/1] riscv: don't jump to 0x0 in handle_ipi()

At least on the Kendryte K210:

gd->arch.available_harts= 0x0000000000000003 arch.ipi[0].addr= gd->0x0000000000000000 arch.ipi[0].arg0= 0x0000000000000000 gd->arch.ipi[0].arg1= 0x0000000000000000 arch.ipi[1].addr= gd->0x0000000000000000 arch.ipi[1].arg0= 0x0000000000000000 gd->arch.ipi[1].arg1= 0x0000000000000000

We should not jump to 0x0 to handle an interrupt.

Can you explain why K210 be affected by it ?

I have been running U-Boot on the MaixDuino.

Without this patch secondary_hart_loop() is reached only once. With the patch it is reached several thousand times.

Hm, interesting. To me, this is a symptom of something else going terribly wrong. I originally had this check in place so that it would be easier to detect these sorts of errors. I don't think this is the correct fix, however. We should really try and find the root cause of the bug.

...

I would not expect NULL to contain any code that should be executed by the secondary hart. See doc/board/sipeed/maix.rst:

Address Size Description ========== ========= =========== 0x00000000 0x1000 debug 0x00001000 0x1000 rom 0x02000000 0xC000 clint

...
...
Signed-off-by: Heinrich Schuchardt xypron.glpk@gmx.de

arch/riscv/lib/smp.c | 2 ++ 1 file changed, 2 insertions(+)

diff --git a/arch/riscv/lib/smp.c b/arch/riscv/lib/smp.c index ac22136314..d725fa32e8 100644 --- a/arch/riscv/lib/smp.c +++ b/arch/riscv/lib/smp.c @@ -96,6 +96,8 @@ void handle_ipi(ulong hart) return; }
  if (!smp_function)
          return;
  smp_function(hart, gd->arch.ipi[hart].arg0, gd->arch.ipi[hart].arg1);  }
I remember Sean add this check in [v10,14/21] riscv: Clean up IPI initialization code . And I ask him to remove. https://patchwork.ozlabs.org/project/uboot/patch/20200503024637.327733-15-se...
Your comment was: "Please remove the sanity check. If zero address is the intended jump point, then system will work abnormal."

The only place where gd->arch.ipi[hart].addr is set is:

arch/riscv/lib/smp.c:53 send_ipi_many(): gd->arch.ipi[reg].addr = ipi->addr;

send_ipi_many() is only called in smp_call_function().

So the line

smp_function(hart, gd->arch.ipi[hart].arg0, gd->arch.ipi[hart].arg1);

can only work if smp_function() has been called before this point at any time. The start code only calls it in spl_secondary_hart_stack_gd_setup().

Can you retry with [1]? I think Bin Meng removed a call to sbi_init_ipi in [2] as part of a larger revert. The actual revert-of-revert is in [3], though it really should be split out into its own patch. The original commit making these changes is [4].

Note that the situation before [4] was that the IPI got initialized by the first hart to call any IPI function. If that hart was not the boot hart, Bad Things started to happen (e.g. race conditions, memory corruption, etc). In that patch, I moved the initialization to its own function so we would not have any race conditions and instead have (easier-to-debug imo) calls to handle_ipi with bogus arguments. Of course, when everything is working properly, we should get neither of these.

[1] https://patchwork.ozlabs.org/project/uboot/list/?series=193047 [2] https://patchwork.ozlabs.org/project/uboot/patch/1595225827-23674-1-git-send... [3] https://patchwork.ozlabs.org/project/uboot/patch/20200729095636.1077054-5-se... [4] https://patchwork.ozlabs.org/project/uboot/patch/20200521161503.384823-15-se...

...

Why do we call handle_ipi() on the secondary hart at all?

Presumably to handle the IPI it got sent? Sorry, I'm confused as to what you're getting at.

...

Where do you expect it to jump if U-Boot is the first binary loaded?

Even if spl_secondary_hart_stack_gd_setup() has initialized the jump address to secondary_hart_relocate(), do we want the secondary hart to execute it again and again?

--Sean

Heinrich Schuchardt

18 Aug 18 Aug

11 a.m.

New subject: [PATCH 1/1] riscv: don't jump to 0x0 in handle_ipi()

On 11.08.20 12:32, Sean Anderson wrote:

...

On 8/11/20 3:50 AM, Heinrich Schuchardt wrote:

...
On 11.08.20 08:20, Rick Chen wrote:

...
Hi Heinrich

...
From: Heinrich Schuchardt [mailto:xypron.glpk@gmx.de] Sent: Tuesday, August 11, 2020 11:57 AM To: Rick Jian-Zhi Chen(陳建志) Cc: Sean Anderson; Lukas Auer; Simon Glass; Anup Patel; Daniel Schwierzeck; u-boot@lists.denx.de; Heinrich Schuchardt Subject: [PATCH 1/1] riscv: don't jump to 0x0 in handle_ipi()

At least on the Kendryte K210:

gd->arch.available_harts= 0x0000000000000003 arch.ipi[0].addr= gd->0x0000000000000000 arch.ipi[0].arg0= 0x0000000000000000 gd->arch.ipi[0].arg1= 0x0000000000000000 arch.ipi[1].addr= gd->0x0000000000000000 arch.ipi[1].arg0= 0x0000000000000000 gd->arch.ipi[1].arg1= 0x0000000000000000

We should not jump to 0x0 to handle an interrupt.

Can you explain why K210 be affected by it ?

I have been running U-Boot on the MaixDuino.

Without this patch secondary_hart_loop() is reached only once. With the patch it is reached several thousand times.

Hm, interesting. To me, this is a symptom of something else going terribly wrong. I originally had this check in place so that it would be easier to detect these sorts of errors. I don't think this is the correct fix, however. We should really try and find the root cause of the bug.

...
I would not expect NULL to contain any code that should be executed by the secondary hart. See doc/board/sipeed/maix.rst:

Address Size Description ========== ========= =========== 0x00000000 0x1000 debug 0x00001000 0x1000 rom 0x02000000 0xC000 clint

...
...
Signed-off-by: Heinrich Schuchardt xypron.glpk@gmx.de

arch/riscv/lib/smp.c | 2 ++ 1 file changed, 2 insertions(+)

diff --git a/arch/riscv/lib/smp.c b/arch/riscv/lib/smp.c index ac22136314..d725fa32e8 100644 --- a/arch/riscv/lib/smp.c +++ b/arch/riscv/lib/smp.c @@ -96,6 +96,8 @@ void handle_ipi(ulong hart) return; }
  if (!smp_function)
          return;
  smp_function(hart, gd->arch.ipi[hart].arg0, gd->arch.ipi[hart].arg1);  }
I remember Sean add this check in [v10,14/21] riscv: Clean up IPI initialization code . And I ask him to remove. https://patchwork.ozlabs.org/project/uboot/patch/20200503024637.327733-15-se...
Your comment was: "Please remove the sanity check. If zero address is the intended jump point, then system will work abnormal."

The only place where gd->arch.ipi[hart].addr is set is:

arch/riscv/lib/smp.c:53 send_ipi_many(): gd->arch.ipi[reg].addr = ipi->addr;

send_ipi_many() is only called in smp_call_function().

So the line

smp_function(hart, gd->arch.ipi[hart].arg0, gd->arch.ipi[hart].arg1);

can only work if smp_function() has been called before this point at any time. The start code only calls it in spl_secondary_hart_stack_gd_setup().
Can you retry with [1]? I think Bin Meng removed a call to sbi_init_ipi in [2] as part of a larger revert. The actual revert-of-revert is in [3], though it really should be split out into its own patch. The original commit making these changes is [4].

Patch series [1] makes not difference. You still have

gd->arch.ipi[0].addr= 0x0000000000000000 gd->arch.ipi[1].addr= 0x0000000000000000

and the secondary hart jumps to NULL and never returns.

...

Note that the situation before [4] was that the IPI got initialized by the first hart to call any IPI function. If that hart was not the boot hart, Bad Things started to happen (e.g. race conditions, memory corruption, etc). In that patch, I moved the initialization to its own function so we would not have any race conditions and instead have (easier-to-debug imo) calls to handle_ipi with bogus arguments. Of course, when everything is working properly, we should get neither of these.

[1] https://patchwork.ozlabs.org/project/uboot/list/?series=193047 [2] https://patchwork.ozlabs.org/project/uboot/patch/1595225827-23674-1-git-send... [3] https://patchwork.ozlabs.org/project/uboot/patch/20200729095636.1077054-5-se... [4] https://patchwork.ozlabs.org/project/uboot/patch/20200521161503.384823-15-se...

...
Why do we call handle_ipi() on the secondary hart at all?

Presumably to handle the IPI it got sent? Sorry, I'm confused as to what you're getting at.

I cannot see anything to be done by a secondary hart in case of a software interrupt.

Best regards

Heinrich

...

...
Where do you expect it to jump if U-Boot is the first binary loaded?

Even if spl_secondary_hart_stack_gd_setup() has initialized the jump address to secondary_hart_relocate(), do we want the secondary hart to execute it again and again?

--Sean

Sean Anderson

12:26 p.m.

New subject: [PATCH 1/1] riscv: don't jump to 0x0 in handle_ipi()

On 8/18/20 5:00 AM, Heinrich Schuchardt wrote:

...

On 11.08.20 12:32, Sean Anderson wrote:

...
On 8/11/20 3:50 AM, Heinrich Schuchardt wrote:

...
On 11.08.20 08:20, Rick Chen wrote:

...
Hi Heinrich

...
From: Heinrich Schuchardt [mailto:xypron.glpk@gmx.de] Sent: Tuesday, August 11, 2020 11:57 AM To: Rick Jian-Zhi Chen(陳建志) Cc: Sean Anderson; Lukas Auer; Simon Glass; Anup Patel; Daniel Schwierzeck; u-boot@lists.denx.de; Heinrich Schuchardt Subject: [PATCH 1/1] riscv: don't jump to 0x0 in handle_ipi()

At least on the Kendryte K210:

gd->arch.available_harts= 0x0000000000000003 arch.ipi[0].addr= gd->0x0000000000000000 arch.ipi[0].arg0= 0x0000000000000000 gd->arch.ipi[0].arg1= 0x0000000000000000 arch.ipi[1].addr= gd->0x0000000000000000 arch.ipi[1].arg0= 0x0000000000000000 gd->arch.ipi[1].arg1= 0x0000000000000000

We should not jump to 0x0 to handle an interrupt.

Can you explain why K210 be affected by it ?

I have been running U-Boot on the MaixDuino.

Without this patch secondary_hart_loop() is reached only once. With the patch it is reached several thousand times.

Hm, interesting. To me, this is a symptom of something else going terribly wrong. I originally had this check in place so that it would be easier to detect these sorts of errors. I don't think this is the correct fix, however. We should really try and find the root cause of the bug.

...
I would not expect NULL to contain any code that should be executed by the secondary hart. See doc/board/sipeed/maix.rst:

Address Size Description ========== ========= =========== 0x00000000 0x1000 debug 0x00001000 0x1000 rom 0x02000000 0xC000 clint

...
...
Signed-off-by: Heinrich Schuchardt xypron.glpk@gmx.de

arch/riscv/lib/smp.c | 2 ++ 1 file changed, 2 insertions(+)

diff --git a/arch/riscv/lib/smp.c b/arch/riscv/lib/smp.c index ac22136314..d725fa32e8 100644 --- a/arch/riscv/lib/smp.c +++ b/arch/riscv/lib/smp.c @@ -96,6 +96,8 @@ void handle_ipi(ulong hart) return; }
  if (!smp_function)
          return;
  smp_function(hart, gd->arch.ipi[hart].arg0, gd->arch.ipi[hart].arg1);  }
I remember Sean add this check in [v10,14/21] riscv: Clean up IPI initialization code . And I ask him to remove. https://patchwork.ozlabs.org/project/uboot/patch/20200503024637.327733-15-se...
Your comment was: "Please remove the sanity check. If zero address is the intended jump point, then system will work abnormal."

The only place where gd->arch.ipi[hart].addr is set is:

arch/riscv/lib/smp.c:53 send_ipi_many(): gd->arch.ipi[reg].addr = ipi->addr;

send_ipi_many() is only called in smp_call_function().

So the line

smp_function(hart, gd->arch.ipi[hart].arg0, gd->arch.ipi[hart].arg1);

can only work if smp_function() has been called before this point at any time. The start code only calls it in spl_secondary_hart_stack_gd_setup().
Can you retry with [1]? I think Bin Meng removed a call to sbi_init_ipi in [2] as part of a larger revert. The actual revert-of-revert is in [3], though it really should be split out into its own patch. The original commit making these changes is [4].
Patch series [1] makes not difference. You still have

gd->arch.ipi[0].addr= 0x0000000000000000 gd->arch.ipi[1].addr= 0x0000000000000000

and the secondary hart jumps to NULL and never returns.

Hm, then there is a code path where an IPI gets triggered by something other than opensbi.

...

...
Note that the situation before [4] was that the IPI got initialized by the first hart to call any IPI function. If that hart was not the boot hart, Bad Things started to happen (e.g. race conditions, memory corruption, etc). In that patch, I moved the initialization to its own function so we would not have any race conditions and instead have (easier-to-debug imo) calls to handle_ipi with bogus arguments. Of course, when everything is working properly, we should get neither of these.

[1] https://patchwork.ozlabs.org/project/uboot/list/?series=193047 [2] https://patchwork.ozlabs.org/project/uboot/patch/1595225827-23674-1-git-send... [3] https://patchwork.ozlabs.org/project/uboot/patch/20200729095636.1077054-5-se... [4] https://patchwork.ozlabs.org/project/uboot/patch/20200521161503.384823-15-se...

...
Why do we call handle_ipi() on the secondary hart at all?

Presumably to handle the IPI it got sent? Sorry, I'm confused as to what you're getting at.

I cannot see anything to be done by a secondary hart in case of a software interrupt.

Isn't it supposed to run the function which the boot hart sent to it?

--Sean

Heinrich Schuchardt

1 Sep 1 Sep

12:51 p.m.

New subject: [PATCH 1/1] riscv: don't jump to 0x0 in handle_ipi()

On 8/18/20 11:00 AM, Heinrich Schuchardt wrote:

...

On 11.08.20 12:32, Sean Anderson wrote:

...
On 8/11/20 3:50 AM, Heinrich Schuchardt wrote:

...
On 11.08.20 08:20, Rick Chen wrote:

...
Hi Heinrich

...
From: Heinrich Schuchardt [mailto:xypron.glpk@gmx.de] Sent: Tuesday, August 11, 2020 11:57 AM To: Rick Jian-Zhi Chen(陳建志) Cc: Sean Anderson; Lukas Auer; Simon Glass; Anup Patel; Daniel Schwierzeck; u-boot@lists.denx.de; Heinrich Schuchardt Subject: [PATCH 1/1] riscv: don't jump to 0x0 in handle_ipi()

At least on the Kendryte K210:

gd->arch.available_harts= 0x0000000000000003 arch.ipi[0].addr= gd->0x0000000000000000 arch.ipi[0].arg0= 0x0000000000000000 gd->arch.ipi[0].arg1= 0x0000000000000000 arch.ipi[1].addr= gd->0x0000000000000000 arch.ipi[1].arg0= 0x0000000000000000 gd->arch.ipi[1].arg1= 0x0000000000000000

We should not jump to 0x0 to handle an interrupt.

Can you explain why K210 be affected by it ?

I have been running U-Boot on the MaixDuino.

Without this patch secondary_hart_loop() is reached only once. With the patch it is reached several thousand times.

Hm, interesting. To me, this is a symptom of something else going terribly wrong. I originally had this check in place so that it would be easier to detect these sorts of errors. I don't think this is the correct fix, however. We should really try and find the root cause of the bug.

...
I would not expect NULL to contain any code that should be executed by the secondary hart. See doc/board/sipeed/maix.rst:

Address Size Description ========== ========= =========== 0x00000000 0x1000 debug 0x00001000 0x1000 rom 0x02000000 0xC000 clint

...
...
Signed-off-by: Heinrich Schuchardt xypron.glpk@gmx.de

arch/riscv/lib/smp.c | 2 ++ 1 file changed, 2 insertions(+)

diff --git a/arch/riscv/lib/smp.c b/arch/riscv/lib/smp.c index ac22136314..d725fa32e8 100644 --- a/arch/riscv/lib/smp.c +++ b/arch/riscv/lib/smp.c @@ -96,6 +96,8 @@ void handle_ipi(ulong hart) return; }
  if (!smp_function)
          return;
   smp_function(hart, gd->arch.ipi[hart].arg0, gd->arch.ipi[hart].arg1);  }
I remember Sean add this check in [v10,14/21] riscv: Clean up IPI initialization code . And I ask him to remove. https://patchwork.ozlabs.org/project/uboot/patch/20200503024637.327733-15-se...
Your comment was: "Please remove the sanity check. If zero address is the intended jump point, then system will work abnormal."

The only place where gd->arch.ipi[hart].addr is set is:

arch/riscv/lib/smp.c:53 send_ipi_many(): gd->arch.ipi[reg].addr = ipi->addr;

send_ipi_many() is only called in smp_call_function().

So the line

smp_function(hart, gd->arch.ipi[hart].arg0, gd->arch.ipi[hart].arg1);

can only work if smp_function() has been called before this point at any time. The start code only calls it in spl_secondary_hart_stack_gd_setup().
Can you retry with [1]? I think Bin Meng removed a call to sbi_init_ipi in [2] as part of a larger revert. The actual revert-of-revert is in [3], though it really should be split out into its own patch. The original commit making these changes is [4].
Patch series [1] makes not difference. You still have

gd->arch.ipi[0].addr= 0x0000000000000000 gd->arch.ipi[1].addr= 0x0000000000000000

and the secondary hart jumps to NULL and never returns.

...
Note that the situation before [4] was that the IPI got initialized by the first hart to call any IPI function. If that hart was not the boot hart, Bad Things started to happen (e.g. race conditions, memory corruption, etc). In that patch, I moved the initialization to its own function so we would not have any race conditions and instead have (easier-to-debug imo) calls to handle_ipi with bogus arguments. Of course, when everything is working properly, we should get neither of these.

[1] https://patchwork.ozlabs.org/project/uboot/list/?series=193047 [2] https://patchwork.ozlabs.org/project/uboot/patch/1595225827-23674-1-git-send... [3] https://patchwork.ozlabs.org/project/uboot/patch/20200729095636.1077054-5-se... [4] https://patchwork.ozlabs.org/project/uboot/patch/20200521161503.384823-15-se...

...
Why do we call handle_ipi() on the secondary hart at all?

Presumably to handle the IPI it got sent? Sorry, I'm confused as to what you're getting at.

I cannot see anything to be done by a secondary hart in case of a software interrupt.

Best regards

Heinrich

When resetting the sipeed_maix_bitm_defconfig without the patch I see a crash:

=> reset resetting ... Unhandled exception: Illegal instruction EPC: 0000000000000000 RA: 000000008000021a TVAL: 00000002d947c509 ### ERROR ### Please RESET the board ###

Objdump shows that it is related to the secondary_hart_loop:

j secondary_hart_loop 8000021a: b7ed j 80000204 <secondary_hart_loop>

This crash is avoided with this patch.

How do we get forward?

Best regards

Heinrich

...

...
...
Where do you expect it to jump if U-Boot is the first binary loaded?

Even if spl_secondary_hart_stack_gd_setup() has initialized the jump address to secondary_hart_relocate(), do we want the secondary hart to execute it again and again?

--Sean

Sean Anderson

1:14 p.m.

New subject: [PATCH 1/1] riscv: don't jump to 0x0 in handle_ipi()

On 9/1/20 6:51 AM, Heinrich Schuchardt wrote:

...

On 8/18/20 11:00 AM, Heinrich Schuchardt wrote:

...
On 11.08.20 12:32, Sean Anderson wrote:

...
On 8/11/20 3:50 AM, Heinrich Schuchardt wrote:

...
On 11.08.20 08:20, Rick Chen wrote:

...
Hi Heinrich

...
From: Heinrich Schuchardt [mailto:xypron.glpk@gmx.de] Sent: Tuesday, August 11, 2020 11:57 AM To: Rick Jian-Zhi Chen(陳建志) Cc: Sean Anderson; Lukas Auer; Simon Glass; Anup Patel; Daniel Schwierzeck; u-boot@lists.denx.de; Heinrich Schuchardt Subject: [PATCH 1/1] riscv: don't jump to 0x0 in handle_ipi()

At least on the Kendryte K210:

gd->arch.available_harts= 0x0000000000000003 arch.ipi[0].addr= gd->0x0000000000000000 arch.ipi[0].arg0= 0x0000000000000000 gd->arch.ipi[0].arg1= 0x0000000000000000 arch.ipi[1].addr= gd->0x0000000000000000 arch.ipi[1].arg0= 0x0000000000000000 gd->arch.ipi[1].arg1= 0x0000000000000000

We should not jump to 0x0 to handle an interrupt.

Can you explain why K210 be affected by it ?

I have been running U-Boot on the MaixDuino.

Without this patch secondary_hart_loop() is reached only once. With the patch it is reached several thousand times.

Hm, interesting. To me, this is a symptom of something else going terribly wrong. I originally had this check in place so that it would be easier to detect these sorts of errors. I don't think this is the correct fix, however. We should really try and find the root cause of the bug.

...
I would not expect NULL to contain any code that should be executed by the secondary hart. See doc/board/sipeed/maix.rst:

Address    Size      Description ========== ========= =========== 0x00000000 0x1000    debug 0x00001000 0x1000    rom 0x02000000 0xC000    clint

...
...
Signed-off-by: Heinrich Schuchardt xypron.glpk@gmx.de

arch/riscv/lib/smp.c | 2 ++ 1 file changed, 2 insertions(+)

diff --git a/arch/riscv/lib/smp.c b/arch/riscv/lib/smp.c index ac22136314..d725fa32e8 100644 --- a/arch/riscv/lib/smp.c +++ b/arch/riscv/lib/smp.c @@ -96,6 +96,8 @@ void handle_ipi(ulong hart)                  return;          }

+       if (!smp_function) +               return;          smp_function(hart, gd->arch.ipi[hart].arg0, gd->arch.ipi[hart].arg1); }

I remember Sean add this check in [v10,14/21] riscv: Clean up IPI initialization code . And I ask him to remove. https://patchwork.ozlabs.org/project/uboot/patch/20200503024637.327733-15-se...

Your comment was: "Please remove the sanity check. If zero address is the intended jump point, then system will work abnormal."

The only place where gd->arch.ipi[hart].addr is set is:

arch/riscv/lib/smp.c:53 send_ipi_many(): gd->arch.ipi[reg].addr = ipi->addr;

send_ipi_many() is only called in smp_call_function().

So the line

smp_function(hart, gd->arch.ipi[hart].arg0, gd->arch.ipi[hart].arg1);

can only work if smp_function() has been called before this point at any time. The start code only calls it in spl_secondary_hart_stack_gd_setup().

Can you retry with [1]? I think Bin Meng removed a call to sbi_init_ipi in [2] as part of a larger revert. The actual revert-of-revert is in [3], though it really should be split out into its own patch. The original commit making these changes is [4].

Patch series [1] makes not difference. You still have

gd->arch.ipi[0].addr= 0x0000000000000000 gd->arch.ipi[1].addr= 0x0000000000000000

and the secondary hart jumps to NULL and never returns.

...
Note that the situation before [4] was that the IPI got initialized by the first hart to call any IPI function. If that hart was not the boot hart, Bad Things started to happen (e.g. race conditions, memory corruption, etc). In that patch, I moved the initialization to its own function so we would not have any race conditions and instead have (easier-to-debug imo) calls to handle_ipi with bogus arguments. Of course, when everything is working properly, we should get neither of these.

[1] https://patchwork.ozlabs.org/project/uboot/list/?series=193047 [2] https://patchwork.ozlabs.org/project/uboot/patch/1595225827-23674-1-git-send... [3] https://patchwork.ozlabs.org/project/uboot/patch/20200729095636.1077054-5-se... [4] https://patchwork.ozlabs.org/project/uboot/patch/20200521161503.384823-15-se...

...
Why do we call handle_ipi() on the secondary hart at all?

Presumably to handle the IPI it got sent? Sorry, I'm confused as to what you're getting at.

I cannot see anything to be done by a secondary hart in case of a software interrupt.

Best regards

Heinrich

When resetting the sipeed_maix_bitm_defconfig without the patch I see a crash:

=> reset resetting ... Unhandled exception: Illegal instruction EPC: 0000000000000000 RA: 000000008000021a TVAL: 00000002d947c509 ### ERROR ### Please RESET the board ###

Hm, interesting. I don't see that error. What commit are you testing with?

...

Objdump shows that it is related to the secondary_hart_loop:

j secondary_hart_loop 8000021a: b7ed j 80000204

<secondary_hart_loop>

This crash is avoided with this patch.

How do we get forward?

Best regards

Heinrich

...
...
...
Where do you expect it to jump if U-Boot is the first binary loaded?

Even if spl_secondary_hart_stack_gd_setup() has initialized the jump address to secondary_hart_relocate(), do we want the secondary hart to execute it again and again?

--Sean

Heinrich Schuchardt

1:23 p.m.

New subject: [PATCH 1/1] riscv: don't jump to 0x0 in handle_ipi()

On 01.09.20 13:14, Sean Anderson wrote:

...

On 9/1/20 6:51 AM, Heinrich Schuchardt wrote:

...
On 8/18/20 11:00 AM, Heinrich Schuchardt wrote:

...
On 11.08.20 12:32, Sean Anderson wrote:

...
On 8/11/20 3:50 AM, Heinrich Schuchardt wrote:

...
On 11.08.20 08:20, Rick Chen wrote:

...
Hi Heinrich

> From: Heinrich Schuchardt [mailto:xypron.glpk@gmx.de] > Sent: Tuesday, August 11, 2020 11:57 AM > To: Rick Jian-Zhi Chen(陳建志) > Cc: Sean Anderson; Lukas Auer; Simon Glass; Anup Patel; Daniel Schwierzeck; u-boot@lists.denx.de; Heinrich Schuchardt > Subject: [PATCH 1/1] riscv: don't jump to 0x0 in handle_ipi() > > At least on the Kendryte K210: > > gd->arch.available_harts= 0x0000000000000003 arch.ipi[0].addr= > gd->0x0000000000000000 arch.ipi[0].arg0= 0x0000000000000000 > gd->arch.ipi[0].arg1= 0x0000000000000000 arch.ipi[1].addr= > gd->0x0000000000000000 arch.ipi[1].arg0= 0x0000000000000000 > gd->arch.ipi[1].arg1= 0x0000000000000000 > > We should not jump to 0x0 to handle an interrupt.

Can you explain why K210 be affected by it ?

I have been running U-Boot on the MaixDuino.

Without this patch secondary_hart_loop() is reached only once. With the patch it is reached several thousand times.

Hm, interesting. To me, this is a symptom of something else going terribly wrong. I originally had this check in place so that it would be easier to detect these sorts of errors. I don't think this is the correct fix, however. We should really try and find the root cause of the bug.

...
I would not expect NULL to contain any code that should be executed by the secondary hart. See doc/board/sipeed/maix.rst:

Address Size Description ========== ========= =========== 0x00000000 0x1000 debug 0x00001000 0x1000 rom 0x02000000 0xC000 clint

...
> > Signed-off-by: Heinrich Schuchardt xypron.glpk@gmx.de > --- > arch/riscv/lib/smp.c | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/arch/riscv/lib/smp.c b/arch/riscv/lib/smp.c index ac22136314..d725fa32e8 100644 > --- a/arch/riscv/lib/smp.c > +++ b/arch/riscv/lib/smp.c > @@ -96,6 +96,8 @@ void handle_ipi(ulong hart) > return; > } > > + if (!smp_function) > + return; > smp_function(hart, gd->arch.ipi[hart].arg0, gd->arch.ipi[hart].arg1); }

I remember Sean add this check in [v10,14/21] riscv: Clean up IPI initialization code . And I ask him to remove. https://patchwork.ozlabs.org/project/uboot/patch/20200503024637.327733-15-se...

Your comment was: "Please remove the sanity check. If zero address is the intended jump point, then system will work abnormal."

The only place where gd->arch.ipi[hart].addr is set is:

arch/riscv/lib/smp.c:53 send_ipi_many(): gd->arch.ipi[reg].addr = ipi->addr;

send_ipi_many() is only called in smp_call_function().

So the line

smp_function(hart, gd->arch.ipi[hart].arg0, gd->arch.ipi[hart].arg1);

can only work if smp_function() has been called before this point at any time. The start code only calls it in spl_secondary_hart_stack_gd_setup().

Can you retry with [1]? I think Bin Meng removed a call to sbi_init_ipi in [2] as part of a larger revert. The actual revert-of-revert is in [3], though it really should be split out into its own patch. The original commit making these changes is [4].

Patch series [1] makes not difference. You still have

gd->arch.ipi[0].addr= 0x0000000000000000 gd->arch.ipi[1].addr= 0x0000000000000000

and the secondary hart jumps to NULL and never returns.

...
Note that the situation before [4] was that the IPI got initialized by the first hart to call any IPI function. If that hart was not the boot hart, Bad Things started to happen (e.g. race conditions, memory corruption, etc). In that patch, I moved the initialization to its own function so we would not have any race conditions and instead have (easier-to-debug imo) calls to handle_ipi with bogus arguments. Of course, when everything is working properly, we should get neither of these.

[1] https://patchwork.ozlabs.org/project/uboot/list/?series=193047 [2] https://patchwork.ozlabs.org/project/uboot/patch/1595225827-23674-1-git-send... [3] https://patchwork.ozlabs.org/project/uboot/patch/20200729095636.1077054-5-se... [4] https://patchwork.ozlabs.org/project/uboot/patch/20200521161503.384823-15-se...

...
Why do we call handle_ipi() on the secondary hart at all?

Presumably to handle the IPI it got sent? Sorry, I'm confused as to what you're getting at.

I cannot see anything to be done by a secondary hart in case of a software interrupt.

Best regards

Heinrich

When resetting the sipeed_maix_bitm_defconfig without the patch I see a crash:

=> reset resetting ... Unhandled exception: Illegal instruction EPC: 0000000000000000 RA: 000000008000021a TVAL: 00000002d947c509 ### ERROR ### Please RESET the board ###

Hm, interesting. I don't see that error. What commit are you testing with?

origin/master 23e333a5c083a000d0cabc

sipeed_maix_bitm_defconfig plus

CONFIG_DEBUG_UART=y CONFIG_DEBUG_UART_SIFIVE=y CONFIG_DEBUG_UART_BASE=0x38000000 CONFIG_DEBUG_UART_CLOCK=390000000

on a Maixduino.

Best regards

Heinrich

...

...
Objdump shows that it is related to the secondary_hart_loop:

j secondary_hart_loop 8000021a: b7ed j 80000204

<secondary_hart_loop>

This crash is avoided with this patch.

How do we get forward?

Best regards

Heinrich

...
...
...
Where do you expect it to jump if U-Boot is the first binary loaded?

Even if spl_secondary_hart_stack_gd_setup() has initialized the jump address to secondary_hart_relocate(), do we want the secondary hart to execute it again and again?

--Sean

Sean Anderson

4:38 p.m.

New subject: [PATCH 1/1] riscv: don't jump to 0x0 in handle_ipi()

On 9/1/20 7:23 AM, Heinrich Schuchardt wrote:

...

On 01.09.20 13:14, Sean Anderson wrote:

...
On 9/1/20 6:51 AM, Heinrich Schuchardt wrote:

...
When resetting the sipeed_maix_bitm_defconfig without the patch I see a crash:

=> reset resetting ... Unhandled exception: Illegal instruction EPC: 0000000000000000 RA: 000000008000021a TVAL: 00000002d947c509 ### ERROR ### Please RESET the board ###

Hm, interesting. I don't see that error. What commit are you testing with?

origin/master 23e333a5c083a000d0cabc

sipeed_maix_bitm_defconfig plus
CONFIG_DEBUG_UART=y
CONFIG_DEBUG_UART_SIFIVE=y
CONFIG_DEBUG_UART_BASE=0x38000000
CONFIG_DEBUG_UART_CLOCK=390000000
on a Maixduino.

Ok, I can reproduce that crash, but only with the debug uart enabled. I suspect it happens every time, but we only get an error with the debug uart. I enabled CONFIG_SHOW_REGS, but there was some interference between the output and the (next) instance of U-Boot.

...

Best regards

Heinrich

...
...
Objdump shows that it is related to the secondary_hart_loop:
    j       secondary_hart_loop
8000021a:   b7ed                    j       80000204
<secondary_hart_loop>

It is actually related to the previous instruction

80000216: 56e000ef jal ra,80000784 <handle_ipi>

secondary_hart_loop is never reached because handle_ipi jumps to 0x0.

I decided to do a bit more experimentation with the following patch

diff --git i/arch/riscv/lib/smp.c w/arch/riscv/lib/smp.c index ac22136314..8be320f6ae 100644 --- i/arch/riscv/lib/smp.c +++ w/arch/riscv/lib/smp.c @@ -54,6 +54,10 @@ static int send_ipi_many(struct ipi_data *ipi, int wait) gd->arch.ipi[reg].arg0 = ipi->arg0; gd->arch.ipi[reg].arg1 = ipi->arg1;

+ printf("sending %lx(%lx, %lx) to hart %u\n", ipi->addr, + ipi->arg0, ipi->arg1, reg); + + __smp_mb(); ret = riscv_send_ipi(reg); if (ret) { pr_err("Cannot send IPI to hart %d\n", reg); @@ -77,15 +81,23 @@ void handle_ipi(ulong hart) { int ret; void (*smp_function)(ulong hart, ulong arg0, ulong arg1); + int ipi;

if (hart >= CONFIG_NR_CPUS) return; + riscv_get_ipi(hart, &ipi);

__smp_mb();

smp_function = (void (*)(ulong, ulong, ulong))gd->arch.ipi[hart].addr; invalidate_icache_all();

+ printf("calling %p(%lx, %lx) on hart %lu, ipi = %d\n", smp_function, + gd->arch.ipi[hart].arg0, gd->arch.ipi[hart].arg1, hart, ipi); + + if (!ipi) + return; + /* * Clear the IPI to acknowledge the request before jumping to the * requested function.

With this patch I get the following output

=> reset resetting ... calling 0000000000000000(0, 0) on hart 1, ipi = 0 calling 0000000000000000(0, 0) on hart 1, ipi = 0 calling 0000000000000000(0, 0) on hart 1, ipi = 0 calling 0000000000000000(0, 0) on hart 1, ipi = 0 calling 0000000000000000(0, 0) on hart 1, ipi = 1

U-Boot 2020.10-rc3-00033-g23e333a5c0-dirty (Sep 01 2020 - 09:55:43 -0400)

DRAM: 8 MiB sending 805cc1fa(80589040, 8058cdd0) to hart 1 In: serial@38000000 Out: serial@38000000 Err: serial@38000000 =>

There is no "calling" line after the "sending" line because hart 1 is hung from jumping to 0x0.

...

From this log, it is clear that something other than smp_call_function

is sending an ipi, and, further, that there is a delay between when the IRQ_M_SOFT bit is asserted and when the clint has a pending interrupt.

To further investigate this, I enabled debug output. This causes some garbling in the console, as both harts try and write at the same time. To alleviate this, I applied the following patch which adds locking to puts and putc (NB this only works on K210)

diff --git i/common/console.c w/common/console.c index 8e50af1f9d..31622df9b3 100644 --- i/common/console.c +++ w/common/console.c @@ -515,7 +515,35 @@ static inline void pre_console_puts(const char *s) {} static inline void print_pre_console_buffer(int flushpoint) {} #endif

-void putc(const char c) +/* Static location which will survive across resets */ +u32 *console_spinlock = (void *)0x80400000; + +void console_lock(void) +{ + u32 tmp = 1; + + /* + * We only ever write 1 to the lock, so if anything else is there then + * this is the first access to the spinlock, and we have thus acquired + * it. This will not work if the spinlock is ever (un)initialized to + * exactly 1, but I'll take my chances :) + */ + while (tmp == 1) + asm volatile ("amoswap.w.aq %0, %0, %1" + : "+r" (tmp), "+A" (console_spinlock) + : + : "memory"); +} + +void console_unlock(void) +{ + asm volatile ("amoswap.w.rl x0, x0, %0" + : "+A" (console_spinlock) + : + : "memory"); +} + +static void _putc(const char c) { #ifdef CONFIG_SANDBOX /* sandbox can send characters to stdout before it has a console */ @@ -563,7 +591,14 @@ void putc(const char c) } }

-void puts(const char *s) +void putc(const char c) +{ + console_lock(); + _putc(c); + console_unlock(); +} + +static void _puts(const char *s) { #ifdef CONFIG_SANDBOX /* sandbox can send characters to stdout before it has a console */ @@ -614,6 +649,13 @@ void puts(const char *s) } }

+void puts(const char *s) +{ + console_lock(); + _puts(s); + console_unlock(); +} + #ifdef CONFIG_CONSOLE_RECORD int console_record_init(void) {

With this patch applied the output is [1] (and also far too long for this email). Note that hart 1 first recieves a software interrupt on line 46, which is also the first output after reset. The next line (initcall: 0000000080005cc0) is from initcall_run_list calling initf_bootstage, the first initcall after log_init. This indicates to me that the pending interrupt eithe was triggered by start.S, or was never cleared. Perhaps line 69 (csrc MODE_PREFIX(ip), t0) is a no-op.

The clint only indicates an interrupt is present on line 723. Line 722 (initcall: 000000008001ce3a) indicates a call to timer_init. This is the initcall immediately following arch_cpu_init_dm. Although this function initializes the IPI, afact it does not make any writes to it. This does suggest that some kind of effect does happen.

--Sean

[1] https://gist.github.com/Forty-Bot/e010480c34f31f0ecc150bbacd552ede

1709

Age (days ago)

1730

Last active (days ago)

List overview

Download

8 comments

4 participants

tags (0)

participants (4)

Heinrich Schuchardt
Heinrich Schuchardt
Rick Chen
Sean Anderson