32-bit DMA limit for devices (and drivers)

Hi,
We now see the first Allwinner devices [1] having DRAM located above 4GB in address space (4GB DRAM starting at 1GB). After one fix [2] this works somewhat fine, but the sun8i-emac network device is still limited to 32-bit DMA addresses. With U-Boot relocating itself (plus stack and heap) to the end of DRAM, it now runs completely beyond 4GB on those machines, so buffers no longer get pure 32-bit addresses. In Linux we handle this easily by just keeping the default DMA mask at 32 bits, and letting the DMA framework deal with the nasty details.
I was wondering how this should be handled in U-Boot? The straightforward solution would be:
- Let the driver allocate the RX and TX buffers separately, placing them below 4GB in the address space (using lmb_reserve(), I guess?)
- Use those RX buffers and hand the addresses back to the upper layers.
- We already copy TX packets, so this would also be covered in this situation. Other drivers might need to introduce copying.
This sounds like a common problem, so I was wondering if there is a more generic solution to this? Maybe there are already platforms or devices affected? Or should the whole heap and stack be moved below 4GB (if this is easily possible)? In our case we make the buffers part of our priv struct, so should there be an option to let the priv_auto allocation come from below 4GB?
Grateful for any input on this!
Thanks! Andre
[1] https://linux-sunxi.org/X96_Mate
[2] https://lists.denx.de/pipermail/u-boot/2021-April/448327.html
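
To make the constraint described above concrete, here is a minimal sketch of the check a 32-bit DMA engine effectively imposes on a buffer. The helper name dma_range_ok() and the DMA_MASK_32BIT macro are made up for illustration and are not existing U-Boot or Linux APIs.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Highest address a 32-bit DMA engine can express (illustrative name). */
#define DMA_MASK_32BIT 0xffffffffULL

/*
 * Hypothetical helper: returns true if every byte of the buffer
 * [addr, addr + len) can be addressed using the given DMA mask.
 */
static bool dma_range_ok(uint64_t addr, size_t len, uint64_t mask)
{
	if (len == 0)
		return addr <= mask;
	/* Reject ranges whose last byte would wrap around 64 bits. */
	if (addr > UINT64_MAX - (uint64_t)(len - 1))
		return false;
	/* The last byte of the buffer must still be within the mask. */
	return addr + (uint64_t)(len - 1) <= mask;
}
```

With DRAM starting at 1GB, a buffer at 0x40000000 passes this check, while one near the end of a 4GB DRAM (for example at 0x13ff00000) does not, which is the situation U-Boot ends up in after relocating to the end of DRAM.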

Date: Fri, 30 Apr 2021 12:21:21 +0100 From: Andre Przywara andre.przywara@arm.com
Hi,
We now see the first Allwinner devices [1] having DRAM located above 4GB in address space (4GB DRAM starting at 1GB). After one fix[2] this works somewhat fine, but the sun8i-emac network device is still limited to 32-bit DMA addresses. With U-Boot relocating itself (plus stack and heap) to the end of DRAM, it now runs completely beyond 4GB on those machines, so not giving pure 32-bit addresses for buffers anymore. In Linux we handle this easily by just keeping the default DMA mask at 32 bits, and letting the DMA framework deal with the nasty details.
I was wondering how this should be handled in U-Boot? The straight forward solution would be:
- Let the driver allocate the RX and TX buffers separately, placing them below 4GB in the address space (using lmb_reserve(), I guess?)
- Use those RX buffers and hand the addresses back to the upper layers.
- We already copy TX packets, so this would also be covered, in this situation. Other drivers might need to introduce copying.
What you describe here is called a bounce buffer approach. I believe Linux developers also refer to this as swiotlb.
This sounds like a common problem, so I was wondering if there is a more generic solution to this? Maybe there are already platforms or devices affected? Or should the whole heap and stack be moved below 4GB (if this is easily possible)? In our case we make the buffers part of our priv struct, so should there be an option to let the priv_auto allocation come from below 4GB?
Grateful for any input on this!
I looked into this a bit when I was trying to figure out what to do on Apple M1 systems where I have a somewhat related issue. These systems have an IOMMU that can't be bypassed. Since I don't want to add IOMMU infrastructure to U-Boot, I set up the IOMMU to map a fixed block of physical memory and make sure that all allocations of memory come from that block of memory. In this case this is fairly easy to achieve. U-Boot allocates memory from the top of usable memory, so as long as I let the IOMMU map that high memory, things work. U-Boot doesn't need a lot of memory, so a block of 512MB is more than sufficient.
In your case this means that as long as you set the top of usable memory to an address < 4G, U-Boot itself should be fine and no bounce buffers are needed. You have to make sure the addresses in the U-Boot environment for loading things like the kernel and the FDT are set to an address < 4G as well.
For EFI things are different though. You want to expose all physical memory in the EFI memory map. This means that an EFI application (such as an OS loader) may pick memory > 4G and use it to do I/O. For this purpose U-Boot already implements bounce buffers. See the CONFIG_EFI_LOADER_BOUNCE_BUFFER option.
Hope that helps!

On Fri, 30 Apr 2021 14:02:52 +0200 (CEST) Mark Kettenis mark.kettenis@xs4all.nl wrote:
Hi Mark,
thanks for the reply!
(CC:ing Alex and Heinrich for the UEFI questions below)
Date: Fri, 30 Apr 2021 12:21:21 +0100 From: Andre Przywara andre.przywara@arm.com
Hi,
We now see the first Allwinner devices [1] having DRAM located above 4GB in address space (4GB DRAM starting at 1GB). After one fix[2] this works somewhat fine, but the sun8i-emac network device is still limited to 32-bit DMA addresses. With U-Boot relocating itself (plus stack and heap) to the end of DRAM, it now runs completely beyond 4GB on those machines, so not giving pure 32-bit addresses for buffers anymore. In Linux we handle this easily by just keeping the default DMA mask at 32 bits, and letting the DMA framework deal with the nasty details.
I was wondering how this should be handled in U-Boot? The straight forward solution would be:
- Let the driver allocate the RX and TX buffers separately, placing them below 4GB in the address space (using lmb_reserve(), I guess?)
- Use those RX buffers and hand the addresses back to the upper layers.
- We already copy TX packets, so this would also be covered, in this situation. Other drivers might need to introduce copying.
What you describe here is called a bounce buffer approach. I believe Linux developers also refer to this as swiotlb.
Yes, but it's not entirely the same as bounce buffering in Linux, since we allocate the buffers ourselves, in the driver, so we have full control over it. The problem I face is that malloc() works on the heap (which is high), or we use the automatic priv_alloc mechanism, which uses the heap as well, IIUC.
This sounds like a common problem, so I was wondering if there is a more generic solution to this? Maybe there are already platforms or devices affected? Or should the whole heap and stack be moved below 4GB (if this is easily possible)? In our case we make the buffers part of our priv struct, so should there be an option to let the priv_auto allocation come from below 4GB?
Grateful for any input on this!
I looked into this a bit when I was trying to figure out what to do on Apple M1 systems where I have a somewhat related issue. These systems have an IOMMU that can't be bypassed. Since I don't want to add IOMMU infrastructure to U-Boot, I set up the IOMMU to map a fixed block of physical memory and make sure that all allocations of memory come from that block of memory. In this case this is fairly easy to achieve. U-Boot allocates memory from the top of usable memory, so as long as I let the IOMMU map that high memory, things work. U-Boot doesn't need a lot of memory, so a block of 512MB is more than sufficient.
I'd rather not play around with the visible memory size (see below). And while technically there is a (scatter/gather) IOMMU in the SoC, it would be too big guns for that small problem.
In your case this means that as long as you set the top of usable memory to an address < 4G, U-Boot itself should be fine and no bounce buffers are needed. You have to make sure the addresses in the U-Boot environment for loading things like the kernel and the FDT are set to an address < 4G as well.
For EFI things are different though. You want to expose all physical memory in the EFI memory map.
Not only for UEFI, since U-Boot populates the DT memory node even for booti/bootm, in arch/arm/lib/bootm-fdt.c:arch_fixup_fdt(). So limiting the memory is not an option, since this would be passed on to the OS.
This means that an EFI application (such as an OS loader) may pick memory > 4G and use it to do I/O.
I think we should be safe here, as the driver has full control over the buffers: For TX we copy already, to use "fire-and-forget", so we just start the DMA and return. And for RX U-Boot network drivers return the buffer address, so it's our own buffer again. So wherever higher layers put the packets, we should be good (given our own buffers are).
So I guess my question boils down to: How can I best allocate buffers from "low" memory? And do those buffer carveouts make it into the UEFI memory map, as reserved regions? Or can UEFI differentiate between boot services and runtime services allocations? The buffers would be needed during boot services, for the UEFI network protocol. But later on they can be abandoned.
For this purpose U-Boot already implements bounce buffers. See the CONFIG_EFI_LOADER_BOUNCE_BUFFER option.
Interesting, thanks, I will have a look at that. Maybe that contains some useful traces to other code.
Cheers, Andre
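
The driver-owned buffer scheme discussed in this message can be sketched roughly as follows. This is only a model under stated assumptions: struct emac_model, model_send() and model_recv() are hypothetical names, not the sun8i-emac driver code, and the arrays here are ordinary memory. In a real driver the two buffers would have to be placed below 4GB, and the send path would program the DMA descriptor with the address of tx_buf.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MODEL_BUF_SIZE 2048

/* Hypothetical driver state: both buffers are owned by the driver. */
struct emac_model {
	uint8_t tx_buf[MODEL_BUF_SIZE];	/* would need to live below 4GB */
	size_t tx_len;
	uint8_t rx_buf[MODEL_BUF_SIZE];	/* would need to live below 4GB */
	size_t rx_len;
};

/*
 * TX: copy the caller's packet into the driver's own buffer, so the
 * caller's buffer may sit anywhere in memory ("fire-and-forget").
 */
static int model_send(struct emac_model *priv, const void *packet, size_t len)
{
	if (len > MODEL_BUF_SIZE)
		return -1;
	memcpy(priv->tx_buf, packet, len);
	priv->tx_len = len;
	/* A real driver would start the DMA from tx_buf here. */
	return 0;
}

/*
 * RX: hand the driver's own buffer back to the upper layers, so the
 * device only ever writes into memory the driver controls.
 */
static size_t model_recv(struct emac_model *priv, uint8_t **packetp)
{
	*packetp = priv->rx_buf;
	return priv->rx_len;
}
```

The point of the sketch is that neither path ever passes a caller-supplied address to the device, so only the placement of the two driver buffers matters.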

Hi!
Dne petek, 30. april 2021 ob 15:34:28 CEST je Andre Przywara napisal(a):
On Fri, 30 Apr 2021 14:02:52 +0200 (CEST) Mark Kettenis mark.kettenis@xs4all.nl wrote:
Hi Mark,
thanks for the reply!
(CC:ing Alex and Heinrich for the UEFI questions below)
Date: Fri, 30 Apr 2021 12:21:21 +0100 From: Andre Przywara andre.przywara@arm.com
Hi,
We now see the first Allwinner devices [1] having DRAM located above 4GB in address space (4GB DRAM starting at 1GB). After one fix[2] this works somewhat fine, but the sun8i-emac network device is still limited to 32-bit DMA addresses. With U-Boot relocating itself (plus stack and heap) to the end of DRAM, it now runs completely beyond 4GB on those machines, so not giving pure 32-bit addresses for buffers anymore. In Linux we handle this easily by just keeping the default DMA mask at 32 bits, and letting the DMA framework deal with the nasty details.
I was wondering how this should be handled in U-Boot? The straight forward solution would be:
- Let the driver allocate the RX and TX buffers separately, placing them below 4GB in the address space (using lmb_reserve(), I guess?)
- Use those RX buffers and hand the addresses back to the upper layers.
- We already copy TX packets, so this would also be covered, in this situation. Other drivers might need to introduce copying.
What you describe here is called a bounce buffer approach. I believe Linux developers also refer to this as swiotlb.
Yes, but it's not entirely the same as bounce buffering in Linux, since we allocate the buffers ourselves, in the driver, so we have full control over it. The problem I face is that malloc() works on the heap (which is high), or we use the automatic priv_alloc mechanism, which uses the heap as well, IIUC.
This sounds like a common problem, so I was wondering if there is a more generic solution to this? Maybe there are already platforms or devices affected? Or should the whole heap and stack be moved below 4GB (if this is easily possible)? In our case we make the buffers part of our priv struct, so should there be an option to let the priv_auto allocation come from below 4GB?
Grateful for any input on this!
I looked into this a bit when I was trying to figure out what to do on Apple M1 systems where I have a somewhat related issue. These systems have an IOMMU that can't be bypassed. Since I don't want to add IOMMU infrastructure to U-Boot, I set up the IOMMU to map a fixed block of physical memory and make sure that all allocations of memory come from that block of memory. In this case this is fairly easy to achieve. U-Boot allocates memory from the top of usable memory, so as long as I let the IOMMU map that high memory, things work. U-Boot doesn't need a lot of memory, so a block of 512MB is more than sufficient.
I'd rather not play around with the visible memory size (see below). And while technically there is a (scatter/gather) IOMMU in the SoC, it would be too big guns for that small problem.
IOMMU is connected only to video related cores, so it's not an option here.
Best regards, Jernej
In your case this means that as long as you set the top of usable memory to an address < 4G, U-Boot itself should be fine and no bounce buffers are needed. You have to make sure the addresses in the U-Boot environment for loading things like the kernel and the FDT are set to an address < 4G as well.
For EFI things are different though. You want to expose all physical memory in the EFI memory map.
Not only for UEFI, since U-Boot populates the DT memory node even for booti/bootm, in arch/arm/lib/bootm-fdt.c:arch_fixup_fdt(). So limiting the memory is not an option, since this would be passed on to the OS.
This means that an EFI application (such as an OS loader) may pick memory > 4G and use it to do I/O.
I think we should be safe here, as the driver has full control over the buffers: For TX we copy already, to use "fire-and-forget", so we just start the DMA and return. And for RX U-Boot network drivers return the buffer address, so it's our own buffer again. So wherever higher layers put the packets, we should be good (given our own buffers are).
So I guess my question boils down to: How can I best allocate buffers from "low" memory? And do those buffers carveouts make it into the UEFI memory map, as reserved regions? Or can UEFI differentiate between boot services and runtime services allocations? The buffers would be needed during boot services, for the UEFI network protocol. But later on they can be abandoned.
For this purpose U-Boot already implements bounce buffers. See the CONFIG_EFI_LOADER_BOUNCE_BUFFER option.
Interesting, thanks, I will have a look at that. Maybe that contains some useful traces to other code.
Cheers, Andre

On Fri, Apr 30, 2021 at 7:22 PM Andre Przywara andre.przywara@arm.com wrote:
Hi,
We now see the first Allwinner devices [1] having DRAM located above 4GB in address space (4GB DRAM starting at 1GB). After one fix[2] this works somewhat fine, but the sun8i-emac network device is still limited to 32-bit DMA addresses. With U-Boot relocating itself (plus stack and heap) to the end of DRAM, it now runs completely beyond 4GB on those machines, so not giving pure 32-bit addresses for buffers anymore. In Linux we handle this easily by just keeping the default DMA mask at 32 bits, and letting the DMA framework deal with the nasty details.
I was wondering how this should be handled in U-Boot? The straight forward solution would be:
- Let the driver allocate the RX and TX buffers separately, placing them below 4GB in the address space (using lmb_reserve(), I guess?)
- Use those RX buffers and hand the addresses back to the upper layers.
- We already copy TX packets, so this would also be covered, in this situation. Other drivers might need to introduce copying.
This sounds like a common problem, so I was wondering if there is a more generic solution to this? Maybe there are already platforms or devices affected? Or should the whole heap and stack be moved below 4GB (if this is easily possible)?
My understanding is that the relocated address of U-Boot should be below 4GB, so that there is no problem for 32-bit DMA. I thought this was a rule to be followed by every board, but is this not the case on your board?
In our case we make the buffers part of our priv struct, so should there be an option to let the priv_auto allocation come from below 4GB?
Grateful for any input on this!
Regards, Bin

From: Bin Meng bmeng.cn@gmail.com Date: Sat, 1 May 2021 19:45:02 +0800
On Fri, Apr 30, 2021 at 7:22 PM Andre Przywara andre.przywara@arm.com wrote:
Hi,
We now see the first Allwinner devices [1] having DRAM located above 4GB in address space (4GB DRAM starting at 1GB). After one fix[2] this works somewhat fine, but the sun8i-emac network device is still limited to 32-bit DMA addresses. With U-Boot relocating itself (plus stack and heap) to the end of DRAM, it now runs completely beyond 4GB on those machines, so not giving pure 32-bit addresses for buffers anymore. In Linux we handle this easily by just keeping the default DMA mask at 32 bits, and letting the DMA framework deal with the nasty details.
I was wondering how this should be handled in U-Boot? The straight forward solution would be:
- Let the driver allocate the RX and TX buffers separately, placing them below 4GB in the address space (using lmb_reserve(), I guess?)
- Use those RX buffers and hand the addresses back to the upper layers.
- We already copy TX packets, so this would also be covered, in this situation. Other drivers might need to introduce copying.
This sounds like a common problem, so I was wondering if there is a more generic solution to this? Maybe there are already platforms or devices affected? Or should the whole heap and stack be moved below 4GB (if this is easily possible)?
My understanding is that the relocated address of U-Boot should be below 4GB then there is no problem for the 32-bit DMA. I thought this is a rule to be followed by every board, but this is not the case on your board?
Yes, that was my impression as well. And I think that would work fine on this board as there is plenty of DRAM below 4GB. And this can be achieved by implementing the board_get_usable_ram_top() function.
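
As a rough illustration of what such a board hook would compute, here is a standalone model of the clamping logic. The function name clamp_ram_top() and its parameters are made up for this sketch; the real board_get_usable_ram_top() hook has its own signature, which should be taken from the U-Boot tree.

```c
#include <stdint.h>

#define SZ_4G 0x100000000ULL

/*
 * Hypothetical model of the clamp: return the end of DRAM, but never
 * an address above the 4GB boundary that a 32-bit DMA master can reach.
 */
static uint64_t clamp_ram_top(uint64_t ram_base, uint64_t ram_size)
{
	uint64_t top = ram_base + ram_size;

	return top > SZ_4G ? SZ_4G : top;
}
```

For the board in question (4GB of DRAM starting at 1GB) the end of DRAM is 0x140000000, so the clamped value is 0x100000000, leaving 3GB of usable memory for U-Boot itself.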
As I indicated in my reply, some care is needed in the EFI subsystem, but there already is a solution for that. There is CONFIG_EFI_LOADER_BOUNCE_BUFFER, but that might not actually be needed in this case. By default the EFI subsystem will mark all conventional memory above "ram_top" as EFI_BOOT_SERVICES_DATA. So EFI applications such as OS loaders will not allocate that memory until they've called ExitBootServices() at which point U-Boot will be completely out of the picture.
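
The effect on the EFI memory map can be modelled like this. It is a simplified sketch, not the EFI subsystem's actual code: the enum, struct mem_region and split_at_ram_top() are hypothetical names, and a real memory map contains many more region types.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for the two EFI memory types involved. */
enum mem_type {
	MEM_CONVENTIONAL,		/* free for EFI applications to allocate */
	MEM_BOOT_SERVICES_DATA,	/* off limits until ExitBootServices() */
};

struct mem_region {
	uint64_t start;	/* inclusive */
	uint64_t end;	/* exclusive */
	enum mem_type type;
};

/*
 * Split the DRAM range [start, end) at ram_top: the part below stays
 * conventional memory, the part above is marked as boot services data.
 * Writes up to two regions into out[] and returns how many were written.
 */
static size_t split_at_ram_top(uint64_t start, uint64_t end,
			       uint64_t ram_top, struct mem_region out[2])
{
	size_t n = 0;

	if (start < ram_top) {
		out[n].start = start;
		out[n].end = end < ram_top ? end : ram_top;
		out[n].type = MEM_CONVENTIONAL;
		n++;
	}
	if (end > ram_top) {
		out[n].start = start > ram_top ? start : ram_top;
		out[n].end = end;
		out[n].type = MEM_BOOT_SERVICES_DATA;
		n++;
	}
	return n;
}
```

With DRAM from 0x40000000 to 0x140000000 and ram_top at 0x100000000, this yields one conventional region below 4GB and one boot services data region above it, so an OS loader would not pick I/O buffers from the upper part while U-Boot drivers are still in use.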
In our case we make the buffers part of our priv struct, so should there be an option to let the priv_auto allocation come from below 4GB?
Grateful for any input on this!
Regards, Bin

On Sat, 1 May 2021 14:23:32 +0200 (CEST) Mark Kettenis mark.kettenis@xs4all.nl wrote:
Hi,
From: Bin Meng bmeng.cn@gmail.com Date: Sat, 1 May 2021 19:45:02 +0800
On Fri, Apr 30, 2021 at 7:22 PM Andre Przywara andre.przywara@arm.com wrote:
Hi,
We now see the first Allwinner devices [1] having DRAM located above 4GB in address space (4GB DRAM starting at 1GB). After one fix[2] this works somewhat fine, but the sun8i-emac network device is still limited to 32-bit DMA addresses. With U-Boot relocating itself (plus stack and heap) to the end of DRAM, it now runs completely beyond 4GB on those machines, so not giving pure 32-bit addresses for buffers anymore. In Linux we handle this easily by just keeping the default DMA mask at 32 bits, and letting the DMA framework deal with the nasty details.
I was wondering how this should be handled in U-Boot? The straight forward solution would be:
- Let the driver allocate the RX and TX buffers separately, placing them below 4GB in the address space (using lmb_reserve(), I guess?)
- Use those RX buffers and hand the addresses back to the upper layers.
- We already copy TX packets, so this would also be covered, in this situation. Other drivers might need to introduce copying.
This sounds like a common problem, so I was wondering if there is a more generic solution to this? Maybe there are already platforms or devices affected? Or should the whole heap and stack be moved below 4GB (if this is easily possible)?
My understanding is that the relocated address of U-Boot should be below 4GB then there is no problem for the 32-bit DMA. I thought this is a rule to be followed by every board, but this is not the case on your board?
Bin, interesting, where is this coming from? Was this originally for 32-bit CPUs with some address extension (PAE/LPAE)? I think on *sane* 64-bit systems there would be no need for this restriction, except maybe for this 32-bit DMA limitation (which is more of a device problem).
Yes, that was my impression as well. And I think that would work fine on this board as there is plenty of DRAM below 4GB. And this can be achieved by implementing the board_get_usable_ram_top() function.
Ah, I think this is the thing I missed and was looking for: So we *can* restrict everything *U-Boot* to 32 bits and save us a lot of hassle.
Thanks for that hint!
As I indicated in my reply, some care is needed in the EFI subsystem, but there already is a solution for that. There is CONFIG_EFI_LOADER_BOUNCE_BUFFER, but that might not actually be needed in this case. By default the EFI subsystem will mark all conventional memory above "ram_top" as EFI_BOOT_SERVICES_DATA. So EFI applications such as OS loaders will not allocate that memory until they've called ExitBootServices() at which point U-Boot will be completely out of the picture.
Oh nice, this looks like what I need. So EFI apps would never use this memory for I/O buffers.
So I gave this a try and this solves my problem quite neatly: Linux sees the full DRAM, but U-Boot never touches anything beyond 4GB. Briefly tested Linux with both EFI and booti. Will include the board_get_usable_ram_top() implementation in the v2 of my 4GB enablement patch.
Thanks again!
Cheers, Andre
In our case we make the buffers part of our priv struct, so should there be an option to let the priv_auto allocation come from below 4GB?
Grateful for any input on this!
Regards, Bin

Hi Andre,
On Sun, May 2, 2021 at 8:22 AM Andre Przywara andre.przywara@arm.com wrote:
On Sat, 1 May 2021 14:23:32 +0200 (CEST) Mark Kettenis mark.kettenis@xs4all.nl wrote:
Hi,
From: Bin Meng bmeng.cn@gmail.com Date: Sat, 1 May 2021 19:45:02 +0800
On Fri, Apr 30, 2021 at 7:22 PM Andre Przywara andre.przywara@arm.com wrote:
Hi,
We now see the first Allwinner devices [1] having DRAM located above 4GB in address space (4GB DRAM starting at 1GB). After one fix[2] this works somewhat fine, but the sun8i-emac network device is still limited to 32-bit DMA addresses. With U-Boot relocating itself (plus stack and heap) to the end of DRAM, it now runs completely beyond 4GB on those machines, so not giving pure 32-bit addresses for buffers anymore. In Linux we handle this easily by just keeping the default DMA mask at 32 bits, and letting the DMA framework deal with the nasty details.
I was wondering how this should be handled in U-Boot? The straight forward solution would be:
- Let the driver allocate the RX and TX buffers separately, placing them below 4GB in the address space (using lmb_reserve(), I guess?)
- Use those RX buffers and hand the addresses back to the upper layers.
- We already copy TX packets, so this would also be covered, in this situation. Other drivers might need to introduce copying.
This sounds like a common problem, so I was wondering if there is a more generic solution to this? Maybe there are already platforms or devices affected? Or should the whole heap and stack be moved below 4GB (if this is easily possible)?
My understanding is that the relocated address of U-Boot should be below 4GB then there is no problem for the 32-bit DMA. I thought this is a rule to be followed by every board, but this is not the case on your board?
Bin, interesting, where is this coming from? Was this originally for 32-bit CPUs with some address extension (PAE/LPAE)? I think on *sane* 64-bit systems there would be no need for this restriction, except maybe for this 32-bit DMA limitation (which is more of a device problem).
Please have a look at the x86 and RISC-V implementations of board_get_usable_ram_top(), which limit the relocated address to below 4G. As far as I remember, the U-Boot shell does not support parsing 64-bit numbers either.
Yes, that was my impression as well. And I think that would work fine on this board as there is plenty of DRAM below 4GB. And this can be achieved by implementing the board_get_usable_ram_top() function.
Ah, I think this is the thing I missed and was looking for: So we *can* restrict everything *U-Boot* to 32 bits and save us a lot of hassle.
Thanks for that hint!
As I indicated in my reply, some care is needed in the EFI subsystem, but there already is a solution for that. There is CONFIG_EFI_LOADER_BOUNCE_BUFFER, but that might not actually be needed in this case. By default the EFI subsystem will mark all conventional memory above "ram_top" as EFI_BOOT_SERVICES_DATA. So EFI applications such as OS loaders will not allocate that memory until they've called ExitBootServices() at which point U-Boot will be completely out of the picture.
Oh nice, this looks like what I need. So EFI apps would never use this memory for I/O buffers.
So I gave this a try and this solves my problem quite neatly: Linux sees the full DRAM, but U-Boot never touches anything beyond 4GB. Briefly tested Linux with both EFI and booti. Will include the board_get_usable_ram_top() implementation in the v2 of my 4GB enablement patch.
Thanks again!
Regards, Bin
Participants (4): Andre Przywara, Bin Meng, Jernej Škrabec, Mark Kettenis