
Hello Marek,
On Tue, 8 Dec 2015 14:43:42 +0100, Marek Vasut marex@denx.de wrote:
arch/arm/lib/cache-cp15.c checks for CONFIG_ARMV7 and, if this macro is set, configures the TTBR0 register. This register must be configured for the cache on ARMv7 to operate correctly.
The problem is that no one actually sets the CONFIG_ARMV7 macro, and thus TTBR0 is not configured at all. On SoCFPGA, this produces all sorts of minor issues which are hard to replicate; for example, certain USB sticks are not detected, or QSPI NOR sometimes fails to write pages completely.
The solution is to replace the CONFIG_ARMV7 test with a CONFIG_CPU_V7 one. This is correct because the code containing the test(s) for CONFIG_ARMV7 was added shortly after CONFIG_ARMV7 was replaced by CONFIG_CPU_V7, and it was never adjusted to reflect that change.
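To make the failure mode concrete, here is a minimal host-side sketch (not the actual cache-cp15.c code) of how a test on a macro that nothing defines silently compiles its block out. The two function names are hypothetical, and the #define merely stands in for a build configuration where only CONFIG_CPU_V7 is set:

```c
#include <stdbool.h>

/* Assumption: stand-in for the build configuration after the rename,
 * where only CONFIG_CPU_V7 is defined and CONFIG_ARMV7 is not. */
#define CONFIG_CPU_V7 1

/* Hypothetical: reports whether code guarded by the old test is built. */
static bool ttbr0_setup_with_old_test(void)
{
#ifdef CONFIG_ARMV7
	return true;	/* the TTBR0 configuration would be compiled in */
#else
	return false;	/* silently dropped: no warning, no error */
#endif
}

/* Hypothetical: reports whether code guarded by the new test is built. */
static bool ttbr0_setup_with_new_test(void)
{
#ifdef CONFIG_CPU_V7
	return true;
#else
	return false;
#endif
}
```

With this configuration the old test yields false and the new one yields true, which matches the description above: the guarded code was simply never built.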
Note:
As discussed with Marek on IRC, this patch enables what are supposed to be the correct MMU settings for ARMv7. It causes not only a sharp Ethernet performance drop (40%) but also a strong general memory access performance hit (a copy of 4 MB is almost instantaneous without the patch and takes 2-3 seconds with it).
I would like to either fix the performance or come up with an explanation for it before I pick this patch.
Marek's analysis shows the only MMU-related effect of the patch is that the S bit gets set in first-level MMU table entries.
The S bit makes an MMU region shareable with other IPs within the silicon (DMA engines, other cores, possibly even a piece of bus or interconnect). For instance, it will cause memory writes within the region to propagate to other IPs (which [must] also have defined this region as shareable) so that these IPs know the region has been written to and update their cache state accordingly.
With the S bit clear, USB and QSPI fail to work on Marek's board, probably because writes do not propagate properly between the core and these IPs; with the S bit set, USB and QSPI work but cache performance is greatly reduced.
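For reference, here is a small host-side sketch of where the S bit sits in an ARMv7 short-descriptor first-level section entry. The macro names and make_section_entry() are hypothetical, not U-Boot's own definitions, and the attribute choice (cacheable, bufferable, full access) is only an assumption for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Bit fields of an ARMv7 short-descriptor first-level section entry. */
#define SECT_TYPE	(2u << 0)	/* bits [1:0] = 0b10: section */
#define SECT_B		(1u << 2)	/* B: bufferable */
#define SECT_C		(1u << 3)	/* C: cacheable */
#define SECT_AP_RW	(3u << 10)	/* AP[1:0] = 0b11: full access */
#define SECT_S		(1u << 16)	/* S: shareable */

/* Hypothetical helper: build the entry for one 1 MiB section. */
static uint32_t make_section_entry(uint32_t phys_base, int shareable)
{
	uint32_t entry;

	/* A section maps 1 MiB, so the base must be 1 MiB aligned. */
	assert((phys_base & 0x000fffffu) == 0);

	entry = phys_base | SECT_AP_RW | SECT_C | SECT_B | SECT_TYPE;
	if (shareable)
		entry |= SECT_S;	/* the only bit that differs */

	return entry;
}
```

Under these assumptions, the entries for the same section built with and without shareability differ only in bit 16 (for example 0x00110c0e versus 0x00100c0e for base 0x00100000).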
My hypothesis right now is that some other IP(s) hamper(s) the propagation of shareable region writes; for instance, some IP is off or in the wrong state, and does not properly respond, causing some stall.
At first I suspected the second core, which was off during the tests; but Marek tried with that core on and it did not improve performance.
We can keep looking for IPs that might affect shareable accesses and try turning them on or off more or less at random, but that is time-consuming, and I'm not even sure I'm on the right track.
I will welcome suggestions if anyone with more experience in MMU shareability than us -- or with brilliant insight :) -- has any.
Kind regards,