Re: [PATCH 1/1] riscv: don't jump to 0x0 in handle_ipi()

1 Sep 2020

On 9/1/20 7:23 AM, Heinrich Schuchardt wrote:
...
On 01.09.20 13:14, Sean Anderson wrote:
...
On 9/1/20 6:51 AM, Heinrich Schuchardt wrote:
...
When resetting the sipeed_maix_bitm_defconfig without the patch I see a
crash:
=> reset
resetting ...
Unhandled exception: Illegal instruction
EPC: 0000000000000000 RA: 000000008000021a TVAL: 00000002d947c509
### ERROR ### Please RESET the board ###
Hm, interesting. I don't see that error. What commit are you testing with?
origin/master 23e333a5c083a000d0cabc
sipeed_maix_bitm_defconfig plus
CONFIG_DEBUG_UART=y
CONFIG_DEBUG_UART_SIFIVE=y
CONFIG_DEBUG_UART_BASE=0x38000000
CONFIG_DEBUG_UART_CLOCK=390000000


on a Maixduino.
Ok, I can reproduce that crash, but only with the debug uart enabled. I
suspect it happens every time, but we only get an error with the debug
uart. I enabled CONFIG_SHOW_REGS, but there was some interference
between the output and the (next) instance of U-Boot.
...
Best regards
Heinrich
...
...
Objdump shows that it is related to the secondary_hart_loop:
    j       secondary_hart_loop
8000021a:   b7ed                    j       80000204

<secondary_hart_loop>
It is actually related to the previous instruction
80000216:   56e000ef                jal     ra,80000784 <handle_ipi>
secondary_hart_loop is never reached because handle_ipi jumps to 0x0.
I decided to do a bit more experimentation with the following patch

diff --git i/arch/riscv/lib/smp.c w/arch/riscv/lib/smp.c
index ac22136314..8be320f6ae 100644
--- i/arch/riscv/lib/smp.c
+++ w/arch/riscv/lib/smp.c
@@ -54,6 +54,10 @@ static int send_ipi_many(struct ipi_data *ipi, int wait)
                gd->arch.ipi[reg].arg0 = ipi->arg0;
                gd->arch.ipi[reg].arg1 = ipi->arg1;
+               printf("sending %lx(%lx, %lx) to hart %u\n", ipi->addr,
+                      ipi->arg0, ipi->arg1, reg);
+
+               __smp_mb();
                ret = riscv_send_ipi(reg);
                if (ret) {
                        pr_err("Cannot send IPI to hart %d\n", reg);
@@ -77,15 +81,23 @@ void handle_ipi(ulong hart)
 {
        int ret;
        void (*smp_function)(ulong hart, ulong arg0, ulong arg1);
+       int ipi;
if (hart >= CONFIG_NR_CPUS)
                return;
+       riscv_get_ipi(hart, &ipi);
__smp_mb();
smp_function = (void (*)(ulong, ulong, ulong))gd->arch.ipi[hart].addr;
        invalidate_icache_all();
+       printf("calling %p(%lx, %lx) on hart %lu, ipi = %d\n", smp_function,
+              gd->arch.ipi[hart].arg0, gd->arch.ipi[hart].arg1, hart, ipi);
+
+       if (!ipi)
+               return;
+
        /*
         * Clear the IPI to acknowledge the request before jumping to the
         * requested function.
With this patch I get the following output
=> reset
resetting ...
calling 0000000000000000(0, 0) on hart 1, ipi = 0
calling 0000000000000000(0, 0) on hart 1, ipi = 0
calling 0000000000000000(0, 0) on hart 1, ipi = 0
calling 0000000000000000(0, 0) on hart 1, ipi = 0
calling 0000000000000000(0, 0) on hart 1, ipi = 1
U-Boot 2020.10-rc3-00033-g23e333a5c0-dirty (Sep 01 2020 - 09:55:43 -0400)
DRAM:  8 MiB
sending 805cc1fa(80589040, 8058cdd0) to hart 1
In:    serial@38000000
Out:   serial@38000000
Err:   serial@38000000
=>
There is no "calling" line after the "sending" line because hart 1 is
hung from jumping to 0x0.
...
From this log, it is clear that something other than smp_call_function
is sending an ipi, and, further, that there is a delay between when the
IRQ_M_SOFT bit is asserted and when the clint has a pending interrupt.
To further investigate this, I enabled debug output. This causes some
garbling in the console, as both harts try and write at the same time.
To alleviate this, I applied the following patch which adds locking to
puts and putc (NB this only works on K210)
diff --git i/common/console.c w/common/console.c
index 8e50af1f9d..31622df9b3 100644
--- i/common/console.c
+++ w/common/console.c
@@ -515,7 +515,35 @@ static inline void pre_console_puts(const char *s) {}
 static inline void print_pre_console_buffer(int flushpoint) {}
 #endif
-void putc(const char c)
+/* Static location which will survive across resets */
+u32 *console_spinlock = (void *)0x80400000;
+
+void console_lock(void)
+{
+       u32 tmp = 1;
+
+       /*
+        * We only ever write 1 to the lock, so if anything else is there then
+        * this is the first access to the spinlock, and we have thus acquired
+        * it. This will not work if the spinlock is ever (un)initialized to
+        * exactly 1, but I'll take my chances :)
+        */
+       while (tmp == 1)
+               asm volatile ("amoswap.w.aq %0, %0, %1"
+                             : "+r" (tmp), "+A" (console_spinlock)
+                             :
+                             : "memory");
+}
+
+void console_unlock(void)
+{
+       asm volatile ("amoswap.w.rl x0, x0, %0"
+                     : "+A" (console_spinlock)
+                     :
+                     : "memory");
+}
+
+static void _putc(const char c)
 {
 #ifdef CONFIG_SANDBOX
        /* sandbox can send characters to stdout before it has a console */
@@ -563,7 +591,14 @@ void putc(const char c)
        }
 }
-void puts(const char *s)
+void putc(const char c)
+{
+       console_lock();
+       _putc(c);
+       console_unlock();
+}
+
+static void _puts(const char *s)
 {
 #ifdef CONFIG_SANDBOX
        /* sandbox can send characters to stdout before it has a console */
@@ -614,6 +649,13 @@ void puts(const char *s)
        }
 }
+void puts(const char *s)
+{
+       console_lock();
+       _puts(s);
+       console_unlock();
+}
+
 #ifdef CONFIG_CONSOLE_RECORD
 int console_record_init(void)
 {
With this patch applied the output is [1] (and also far too long for
this email). Note that hart 1 first recieves a software interrupt on
line 46, which is also the first output after reset. The next line
(initcall: 0000000080005cc0) is from initcall_run_list calling
initf_bootstage, the first initcall after log_init. This indicates to me
that the pending interrupt eithe was triggered by start.S, or was never
cleared. Perhaps line 69 (csrc    MODE_PREFIX(ip), t0) is a no-op.
The clint only indicates an interrupt is present on line 723. Line 722
(initcall: 000000008001ce3a) indicates a call to timer_init. This is the
initcall immediately following arch_cpu_init_dm. Although this function
initializes the IPI, afact it does not make any writes to it. This does
suggest that some kind of effect does happen.
--Sean
[1] https://gist.github.com/Forty-Bot/e010480c34f31f0ecc150bbacd552ede