
Hi,
From: Simon Glass sjg@chromium.org Sent: jeudi 26 mars 2020 17:20
Hi Patrick,
On Wed, 25 Mar 2020 at 09:57, Patrick DELAUNAY patrick.delaunay@st.com wrote:
Hi,
From: Marek Vasut marex@denx.de Sent: mercredi 25 mars 2020 00:39
Hi,
I was looking at the STM32MP1 boot time and I noticed it takes about 2 seconds to get to U-Boot.
Thanks for the feedback.
To be clear, the SPL is not the ST priority as we have many limitation (mainly on power management) for the SPL boot chain
(stm32mp15_basic_defconfig):
Rom code => SPL => U-Boot
The preconized boot chain for STM32MP1 is Rom code => TF-A => U-Boot (stm32mp15_trusted_defconfg).
One problem is the insane I2C timing calculation in stm32f7 i2c driver, which is almost a mallocator and CPU stress test and takes about 1 second to complete in SPL -- we need some simpler replacement for that, possibly the one in DWC I2C driver might do?
Our first idea to manage this I2C settings (prescaler/timings setting) was to set this values in device tree, but this binding was refused so this function stm32_i2c_choose_solution()
Was the binding refused in linux? Could we add something U-Boot-specific then? I think having 'early' timings, etc. is very handy. We are doing this on x86.
https://patchwork.ozlabs.org/patch/740214/ st,i2c-timing : A 32-bit I2C timing register value
Of course it has traditionally been impossible to convince Linux people to add this sort of thing. Still, I think we should do it. Our U-Boot-specific files allow this.
provided the better settings for any input clock and I2C frequency (called for
each probe).
Yes.... it is one possible solution. I already propose it internally.
But it is brutal and not optimum solution: try all the solution to found the better
one.
And the performance problem of this loop (shared code between Linux / U-Boot/TF-A drivers) had be already see/checked on ST side in TF-A context.
We should be able to calculate it, like with dw-i2c.
I checked drivers/i2c/designware_i2c.c... Nothing obviously applicable on ST IP.
In fact today I also challenge the I2C responsible for the need of this loop to found optimum parameter in bootloader.
I think that 'good enough' register value could be found with few operation (as in designware_i2c.c).
And moreover it seems tuning isn't really needed if we limit the I2C speed at 400kHz.
I still waiting internal feedback but with COVID-19 it is more difficult here.
We try to improve the solution, without success, but finally the performance issue was solved by dcache activation in TF-A before to execute
this loop.
I would like to see patches to enable the cache. We did this some years ago in a Chromebook and it made a big difference. It is not that hard.
Yes I am working on this patch, an today it is already functional.
https://gitlab.denx.de/u-boot/custodians/u-boot-stm/-/commit/3399fb37c3b7db6...
Updated bootstage report are available in the commit message.
I just need to cross check if the TLB and the cache is correctly managed if I only activate/ deactivate cache with CP15 function.
And don't want miss something for the sensible point.
But as in SPL the data cache is not activated, this loop has terrible performance.
We need to ding again of this topic for U-Boot point of view (SPL & also in U-Boot, before relocation and after relocation) .
And I had shared this issue with the ST owner of this code.
For information, I add some trace and I get for same code execution on DK2
board.
- 440ms in SPL (dcache OFF)
- 36ms in U-Boot (dcache ON)
Another item I found is that, in U-Boot, initf_dm() takes about half a second and so does serial_init(). I didn't dig into it to find out why, but I suspect it has to do with the massive amount of UCLASSes the DM has to traverse OR with the CPU being slow at that point, as the clock
driver didn't get probed just yet.
Thoughts ?
Yes, it is the first parsing of device tree, and it is really slow... directly linked to device tree size and libfdt.
I wonder if we can improve this. There was a change to how the drivers were bound (changing the ordering). We could perhaps revert that for SPL.
I no issue in SPL as the reduced device tree is short.
The issue in the U-Boot pre-reloc stage (with full device tree but without cache).
And because it is done before relocation (before dache enable).
Measurement on DK2 = 649ms
It is a other topic in my TODO list.
I want to explore livetree activation to reduce the DT parsing time.
Not in SPL though I suspect.
No 😊
In U-Boot proper (it is in my TODO list)
Patrick