Re: U-Boot 2023.10 does not boot from uSD on RPi4

27 Oct 2023


Apologies for the late public reply.
On Thu, Oct 19, 2023 at 03:06:48PM +0100, Peter Robinson wrote:
...
On Fri, Oct 13, 2023 at 6:48 PM Tom Rini trini@konsulko.com wrote:
...
On Fri, Oct 13, 2023 at 05:22:03PM +0100, Peter Robinson wrote:
...
On Fri, Oct 13, 2023 at 5:09 PM Peter Robinson pbrobinson@gmail.com wrote:
...
On Tue, Oct 10, 2023 at 3:58 PM Simon Glass sjg@chromium.org wrote:
...
Hi,
On Tue, 10 Oct 2023 at 04:39, Guillaume Gardet Guillaume.Gardet@arm.com wrote:
...
> -----Original Message-----
> From: Peter Robinson pbrobinson@gmail.com
> Sent: Tuesday, October 10, 2023 12:22 PM
> To: Guillaume Gardet Guillaume.Gardet@arm.com
> Cc: mbrugger@suse.com; Ivan Ivanov ivan.ivanov@suse.com; Simon Glass
> sjg@chromium.org; u-boot@lists.denx.de
> Subject: Re: U-Boot 2023.10 does not boot from uSD on RPi4
>
> On Tue, Oct 10, 2023 at 10:26 AM Guillaume Gardet
> Guillaume.Gardet@arm.com wrote:
> >
> > Hi,
> >
> > U-Boot 2023.10 does not boot from uSD on RPi4.
> > This has been found on openSUSE Tumbleweed. The only diff we need is:
> >   -CONFIG_OF_EMBED=y
> >   +CONFIG_OF_BOARD=y
> > To use firmware provided Device Tree. But that should not affect the mmc
> behavior too much, I think.
>
> I've been booting Fedora fine on a RPi4 BUT there's issues with the display
> turning off [1] when the accelerated display modules load
> (vc4) as a result of this patch set. Can you confirm if that's the same problem
> you're seeing?
No, that's not my problem. My issue is grub was not loaded by u-boot from uSD.
It seems more like Simon's problem: https://lists.denx.de/pipermail/u-boot/2023-October/533162.html
@Simon, can you check if the patch below fixes your boot problem on RPi4, please?
This has been reported at least twice before. There is a fix [2] which is in my queue to apply.
Looking at that patch it scans the first 3 devices, how does it handle
non storage devices like SDIO WiFi modules? It shouldn't be trying to
scan those.
And in the case of the RPi the other enabled SDHCI interface is the
WiFi, why would we even be trying to boot off a non storage interface,
something here just feels broken/wrong in general.
The patch does make it work with pure upstream, and I'm not sure why
the Fedora build boots it fine out of the box, but the patch just
feels like it's hacking around some other underlying problem with
bootstd, we didn't have this problem with the previous method and
trying to boot off non storage interfaces feels like it could cause
other problems.
I think the answer here is that we're doing the best we can given that
we just don't know until run time what we have. In the case where sdhci
Well that's not entirely true in the case of mmc/sdhci, we know what
devices could be storage, such as when a device is a mSD or eMMC or a
wifi interface, those don't change from boot to boot, a SDHCI
interface on one boot is not mysteriously going to become a emmc
storage unit the next boot.
Getting in to specifics here, I believe one of the issues is that RPi 3
uses SDHCI #1 for WiFi and #0 for micro SD card and RPi 4 is the other
way. So rpi_arm64 U-Boot binaries have to talk at both devices to see
what's there. We _should_ be doing this in such a way that we discover
both quickly enough and safely enough that we don't have a block device
and stop. It's possible there's some quirk handling code needed from the
upstream sdhci drivers for these chipsets, but I don't know them well
enough.
...
...
is something other than storage, we get as far as asking "are you a
block device?" which then fails when sdhci is a WiFi and not an eMMC.
This does mean the user could notice "Card did not respond to voltage
select! : -110" being printed, and I don't know if we should do
anything about that (it's a handy message when your uSD isn't fully
inserted, etc).  But since we (can) support everything on a single
build, we just have to figure it out at run time.
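
As a rough illustration of the run-time discovery described above, here is
a small Python model (not U-Boot code). The Controller class, the
probe_storage() helper and the "attached" labels are hypothetical names
made up for this sketch; the only details taken from the thread are the
RPi 3 / RPi 4 controller ordering and the -110 error value. It assumes a
non-storage device simply fails the storage probe and the scan moves on.

```python
from dataclasses import dataclass
from typing import List

ETIMEDOUT = 110  # errno behind the "-110" in the message quoted in this thread


@dataclass
class Controller:
    """Hypothetical stand-in for one SDHCI controller; not a U-Boot structure."""
    index: int
    attached: str  # "sd", "emmc", "sdio-wifi" or "empty"


def probe_storage(ctrl: Controller) -> int:
    """Return 0 if a storage card answers, otherwise -ETIMEDOUT."""
    if ctrl.attached in ("sd", "emmc"):
        return 0
    return -ETIMEDOUT


def find_boot_devices(controllers: List[Controller]) -> List[int]:
    """Probe every controller and keep scanning after a failed probe."""
    found = []
    for ctrl in controllers:
        if probe_storage(ctrl) == 0:
            found.append(ctrl.index)
    return found


# Layouts as described in this thread: RPi 3 has the uSD on #0 and WiFi on
# #1, and RPi 4 is the other way around.
RPI3 = [Controller(0, "sd"), Controller(1, "sdio-wifi")]
RPI4 = [Controller(0, "sdio-wifi"), Controller(1, "sd")]
```

With this model, find_boot_devices(RPI3) and find_boot_devices(RPI4) each
report the one controller that holds the card, which is the behaviour a
single rpi_arm64 binary needs on both boards.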
It has caused issues and it causes bug reports from users which is an
issue for me as a maintainer as it wastes my time. In short it's not a
great user experience.
Are we talking about the "Card did not ..." messages? If so, maybe we
should lower the priority there from pr_err. If the probe itself leads
to further errors, I would _really_ love to see the reports and how to
reproduce it. As best I can see through the code, we're doing things
safely and the command/response is "this is not a block device" and
stop.
...
Overall the last few U-Boot releases have been a nightmare from my
PoV, I have spent *all* my available time for upstream U-Boot dealing
with regressions.
First, I appreciate your time here. We've all been testing things to
the best of our time and resources, but that's everyone's constraint too.
And some of the regressions I believe you've had to deal with are
unfortunate and part of how other components we have to use are / aren't
documented. I hope we've gotten the Rockchip side of things sorted out
now but I think one or two of those cases truly confused most of us.
...
In the case of the RPi I currently have 3 issues, 1) display 2) mSD 3)
USB (that Ivan has also mentioned). The 3 of these together make
things very hard to bisect and I am struggling. I also have 3 other
devices with issues I'm trying to debug for the Fedora release, and
the asahi people have also reported [1] regressions in their fork. I
honestly regret applying the bootstd patches.
For the record, Peter and I worked on these off-list. The display one (and
some other Fedora issues) came down to the expected device tree not
being the one that was passed over to Linux. That wasn't bootstd's fault
either. We didn't sync up on the mSD issue. The USB one that Ivan
mentioned has been fixed, and was a bug (and missing test) that Simon
has addressed in the thread where that was fixed.
Looking at the USB XHCI error, I think this is yet another case of
U-Boot being in a "stuck" position as Marek has been asking for someone
to please work on re-syncing the driver with the kernel portion but
simply not having the time to do it. I believe Eugen Hristev has
volunteered to start re-syncing (incrementally rather than all the way
up to current).
...
When even Simon [2] is losing track of things I think we need to
change approach, the problems here upstream are nearly breaking me and
for Fedora I am now considering just forking U-Boot and cherry picking
the patches from upstream we need for particular devices and features.
It's absolutely not something I want to do but I feel it's getting to
the point I need to do it for the Fedora users and my sanity.
I want to say that it seems that one high level issue was that U-Boot's
device tree was the one being passed to the Fedora kernel and not the
device tree from the kernel and so issues that arose once Linux was
booted were from that. And that was not a bootstd issue but the
combination of using the bootmenu and efi bootmgr, which did not end up
loading the expected dtb.
...
I like the concept of bootstd and other features but the quantity of
patches, and sometimes other series of change for changes sake, where
the testing is clearly either not there, or is relying on "it works on
CI" [3] (and other examples) and is clearly not tested on real HW
makes some of the churn hugely problematic,
Here's one problem we all have, and I'm not sure how we can fully
address it. Today, I put all of my merges through our pytest tests, on
real hardware, on a few platforms. I also try and remember to at least
towards the end of the release fire up the console on them and let
whatever OS I had installed autoboot and come up to a prompt. But since
it's manual testing, it doesn't happen consistently. I'm working towards
getting a lab setup here (Konsulko) and using lab manager to manage the
devices. Then do what needs doing so that kernelci can run the U-Boot
pytests. And then have at least some of the quick kernelci tests
themselves fire off. But this is a thing I want, not a business case, so
work progresses when our engineers have spare time.
We also have some companies doing upstream testing at their own
frequency and availability. I know in the past NVIDIA folks have
mentioned they
monitor my tree and run pytest on their hardware and speak up when it
breaks, but I haven't heard from them in a while. I also know TI does
regular testing of upstream on a large number of their platforms. And
Toradex has recently given a talk about how they test a combination of
upstream components (U-Boot, Kernel, OE) regularly and report
regressions (and just did re some iMX changes). Oh, and I know Linaro is
doing some specific kinds of testing too, but I don't recall the details
(and Ilias has offered to walk me through it). I don't know what else
is being tested out there, with regularity.
In some ways it's understandable that we don't have as much hardware
testing going on compared with the Linux Kernel in that you either need
a platform that lets you load firmware via USB/UART/etc or you need some
type of sdmux board which isn't commercially available (but are
made+sold by people). I'd love more hardware testing. And I want to get
more progress on making our pytest suite be able to be triggered by
kernelci, or at least lab manager or something as that might reduce the
overhead for other groups with labs to turn this testing on. But it's
all volunteer, and it depends on what people have available to them.
Since you've mentioned bootstd, I know Simon tested what he has
available with a number of combinations of distributions, and that's how
we've gotten a lot of issues addressed before it came to trying it out
on Pi and so getting picked up by Fedora/SuSE/others.
...
similarly the applying of
patches when there's been opposition and push back for the sake of it
(eg NFSv1 patch) as is things like force enabling people's pet
projects (looking at VBE here) where there's no actual real world
users and real security ramifications (alternate unaudited boot
methods of devices) also adds to my thought process for forking.
So this is where things are a lot trickier I think. One person's
pet-project is another person's production use-case. I don't get the
NFSv1 use case myself, but someone who is using it for production work,
and has been for a while, contributed it upstream. Yay for new
contributors, that's how we grow. And it didn't impact the overall build
size by much (which is a common concern when adding a new feature). In
hindsight, yes, we also should have stopped enabling NFS by default
since it (and especially the forms we support) is a legacy protocol.
VBE is a chicken-and-egg. Is it widely used right now? No. But it also
builds off of (iirc) the ChromeOS way of doing secure boot and some of
the lessons learned there over the years, and leverages old and well
tested at this point technologies like signed FIT images to solve the
problems that people are trying to figure out how exactly to solve
instead with UKI, on the EFI side. And not everyone is in agreement
that the EFI path is the best path forward in every case for modern
chips. So is Fedora right to disable VBE by default? Sure. But also,
personally, I'm tired of "security" as a reason. We let users modify
arbitrary locations in memory with arbitrary values by default and load
and execute arbitrary payloads. Do we protect ourselves at runtime now?
Yes, sure. Could someone work-around that? Yes. To be clear, I do see
"make a secure U-Boot that users can Trust" is a good and valid use
case. And yes, I see "companies want to Trust their deployed platform"
is a valid use case too. So if you have an end goal of "Fedora ships
U-Boot that users can Trust", disabling VBE isn't the first step, but
working with Simon on the things he's doing so that you can't drop down
to a prompt and start modifying memory should be on that list.
One of those things I do for every pull request / merge of my branches
is do a world build before/after, and see what's growing where
size-wise, and for what platforms. I try and keep global behavior from
changing without reason, be it bug fix or pretty important new feature.
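
A minimal Python sketch of that kind of before/after comparison follows.
It is not the actual tooling used for these merges; the size_growth()
function and the idea of feeding it two "board name -> image size in
bytes" dictionaries are assumptions made for illustration only.

```python
from typing import Dict, List, Tuple


def size_growth(before: Dict[str, int],
                after: Dict[str, int]) -> List[Tuple[str, int]]:
    """Return (board, delta_bytes) for boards whose size changed.

    Only boards present in both builds are compared, and the result is
    ordered with the largest growth first.
    """
    deltas = []
    for board in sorted(before.keys() & after.keys()):
        delta = after[board] - before[board]
        if delta != 0:
            deltas.append((board, delta))
    # Stable sort: boards with equal deltas stay in alphabetical order.
    deltas.sort(key=lambda item: item[1], reverse=True)
    return deltas
```

Given hypothetical inputs such as {"board_a": 100, "board_b": 200} before
and {"board_a": 120, "board_b": 200} after, only board_a is reported,
with a growth of 20 bytes, so unchanged platforms drop out of the report.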
...
I feel we as a project need to have a proper discussion about these
things.
Yes, we should all talk more. Maybe we're long enough in to COVID now
that some of the virtual meeting fatigue has subsided, and we take a
page from OpenEmbedded and set up a regular time-rolling video/audio
chat.
And building off of something I had mentioned to you, yes, I do need to
reach out to more people, more often, myself. So this is also an
invitation to anyone else reading along and saying to themselves that
I've missed something or I'm wrong about something or just need to tell
me something, send me an email, and if you want to talk, we can schedule
something. And I should email a number of people directly too, with
that message.
-- 
Tom