[U-Boot] [PATCH] ARM: support for cache coherent allocations

This is a draft implementation of a cache coherent memory allocator. This simple implementation just reserves a memory area below the malloc space and leaves it uncached even if the data cache is enabled. Allocations are even simpler: the code just verifies that we have enough space and increments the offset counter. No deallocations are supported for now. In future versions we could probably use the dlmalloc allocator to get space out of the coherent pool.
Signed-off-by: Ilya Yanok <ilya.yanok@cogentembedded.com>
---
 arch/arm/include/asm/dma-mapping.h |  4 ++++
 arch/arm/include/asm/global_data.h |  4 ++++
 arch/arm/lib/Makefile              |  1 +
 arch/arm/lib/board.c               |  8 ++++++++
 arch/arm/lib/cache-cp15.c          |  5 +++++
 arch/arm/lib/dma-coherent.c        | 37 ++++++++++++++++++++++++++++++++++++
 6 files changed, 59 insertions(+)
 create mode 100644 arch/arm/lib/dma-coherent.c
diff --git a/arch/arm/include/asm/dma-mapping.h b/arch/arm/include/asm/dma-mapping.h
index 5bbb0a0..a2145fc 100644
--- a/arch/arm/include/asm/dma-mapping.h
+++ b/arch/arm/include/asm/dma-mapping.h
@@ -30,11 +30,15 @@ enum dma_data_direction {
 	DMA_FROM_DEVICE = 2,
 };
+#ifndef CONFIG_DMA_COHERENT
 static void *dma_alloc_coherent(size_t len, unsigned long *handle)
 {
 	*handle = (unsigned long)malloc(len);
 	return (void *)*handle;
 }
+#else
+void *dma_alloc_coherent(size_t len, unsigned long *handle);
+#endif
 
 static inline unsigned long dma_map_single(volatile void *vaddr, size_t len,
 					   enum dma_data_direction dir)
diff --git a/arch/arm/include/asm/global_data.h b/arch/arm/include/asm/global_data.h
index c3ff789..4655035 100644
--- a/arch/arm/include/asm/global_data.h
+++ b/arch/arm/include/asm/global_data.h
@@ -76,6 +76,10 @@ typedef struct global_data {
 #if !(defined(CONFIG_SYS_ICACHE_OFF) && defined(CONFIG_SYS_DCACHE_OFF))
 	unsigned long	tlb_addr;
 #endif
+#ifdef CONFIG_DMA_COHERENT
+	unsigned long	coherent_base;	/* Start address of coherent space */
+	unsigned long	coherent_size;	/* Size of coherent space */
+#endif
 	const void	*fdt_blob;	/* Our device tree, NULL if none */
 	void		**jt;		/* jump table */
 	char		env_buf[32];	/* buffer for getenv() before reloc. */
diff --git a/arch/arm/lib/Makefile b/arch/arm/lib/Makefile
index 39a9550..e91dcd0 100644
--- a/arch/arm/lib/Makefile
+++ b/arch/arm/lib/Makefile
@@ -40,6 +40,7 @@ GLCOBJS	+= div0.o
 COBJS-y	+= board.o
 COBJS-y	+= bootm.o
 COBJS-$(CONFIG_SYS_L2_PL310) += cache-pl310.o
+COBJS-$(CONFIG_DMA_COHERENT) += dma-coherent.o
 COBJS-y	+= interrupts.o
 COBJS-y	+= reset.o
 SOBJS-$(CONFIG_USE_ARCH_MEMSET) += memset.o
diff --git a/arch/arm/lib/board.c b/arch/arm/lib/board.c
index 5270c11..6541a49 100644
--- a/arch/arm/lib/board.c
+++ b/arch/arm/lib/board.c
@@ -400,6 +400,14 @@ void board_init_f(ulong bootflag)
 	debug("Reserving %zu Bytes for Global Data at: %08lx\n",
 			sizeof (gd_t), addr_sp);
+#ifdef CONFIG_DMA_COHERENT
+	/* reserve space for cache coherent allocations */
+	gd->coherent_size = ALIGN(CONFIG_DMA_COHERENT_SIZE, 1 << 20);
+	addr_sp &= ~((1 << 20) - 1);
+	addr_sp -= gd->coherent_size;
+	gd->coherent_base = addr_sp;
+#endif
+
 	/* setup stackpointer for exeptions */
 	gd->irq_sp = addr_sp;
 #ifdef CONFIG_USE_IRQ
diff --git a/arch/arm/lib/cache-cp15.c b/arch/arm/lib/cache-cp15.c
index e6c3eae..c11e871 100644
--- a/arch/arm/lib/cache-cp15.c
+++ b/arch/arm/lib/cache-cp15.c
@@ -60,6 +60,11 @@ static inline void dram_bank_mmu_setup(int bank)
 	for (i = bd->bi_dram[bank].start >> 20;
 	     i < (bd->bi_dram[bank].start + bd->bi_dram[bank].size) >> 20;
 	     i++) {
+#ifdef CONFIG_DMA_COHERENT
+		if ((i >= gd->coherent_base >> 20) &&
+		    (i < (gd->coherent_base + gd->coherent_size) >> 20))
+			continue;
+#endif
 		page_table[i] = i << 20 | (3 << 10) | CACHE_SETUP;
 	}
 }
diff --git a/arch/arm/lib/dma-coherent.c b/arch/arm/lib/dma-coherent.c
new file mode 100644
index 0000000..30fa893
--- /dev/null
+++ b/arch/arm/lib/dma-coherent.c
@@ -0,0 +1,37 @@
+/*
+ * (C) Copyright 2012
+ * Ilya Yanok, ilya.yanok@gmail.com
+ *
+ * See file CREDITS for list of people who contributed to this
+ * project.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation; either version 2 of
+ * the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ */
+
+#include <common.h>
+
+DECLARE_GLOBAL_DATA_PTR;
+
+size_t offset;
+
+void *dma_alloc_coherent(size_t size, unsigned long *handle)
+{
+	if (size + offset > gd->coherent_size)
+		return NULL;
+
+	*handle = gd->coherent_base + offset;
+	offset += size;
+
+	return (void *)(*handle);
+}

Hi All,
On Thu, May 31, 2012 at 1:41 AM, Ilya Yanok ilya.yanok@cogentembedded.com wrote:
This is a draft implementation of a cache coherent memory allocator. This simple implementation just reserves a memory area below the malloc space and leaves it uncached even if the data cache is enabled. Allocations are even simpler: the code just verifies that we have enough space and increments the offset counter. No deallocations are supported for now. In future versions we could probably use the dlmalloc allocator to get space out of the coherent pool.
Any comments on this?
Regards, Ilya.

On Thu, Jun 14, 2012 at 8:13 AM, Ilya Yanok ilya.yanok@cogentembedded.com wrote:
Hi All,
On Thu, May 31, 2012 at 1:41 AM, Ilya Yanok ilya.yanok@cogentembedded.com wrote:
This is a draft implementation of a cache coherent memory allocator. This simple implementation just reserves a memory area below the malloc space and leaves it uncached even if the data cache is enabled. Allocations are even simpler: the code just verifies that we have enough space and increments the offset counter. No deallocations are supported for now. In future versions we could probably use the dlmalloc allocator to get space out of the coherent pool.
Any comments on this?
Albert?

Dear Ilya Yanok,
This is a draft implementation of a cache coherent memory allocator. This simple implementation just reserves a memory area below the malloc space and leaves it uncached even if the data cache is enabled. Allocations are even simpler: the code just verifies that we have enough space and increments the offset counter. No deallocations are supported for now. In future versions we could probably use the dlmalloc allocator to get space out of the coherent pool.
Signed-off-by: Ilya Yanok <ilya.yanok@cogentembedded.com>
Hm, can't we just punch a hole in the MMU table at runtime instead of preallocating it like this?
Also, what is this for? Can we not simply flush/invalidate the caches?
Best regards, Marek Vasut

Hi Marek,
[sorry for copying, forgot to CC the list]
On Sat, Jun 16, 2012 at 2:29 AM, Marek Vasut marek.vasut@gmail.com wrote:
Hm, can't we just punch a hole in the MMU table at runtime instead of preallocating it like this?
It's allocated at runtime now; do you mean allocate it on demand? Good point. Probably we can malloc a big enough block and make it uncached directly from dma_alloc_coherent(). Is that what you suggest?
Also, what is this for? Can we not simply flush/invalidate the caches?
flush/invalidate can be racy for some hardware. Sometimes we need to write one field of a DMA descriptor and then read another one. And because one cannot flush/invalidate individual bytes, a write/flush can destroy a field updated by the hardware. (Well, we can invalidate/read before write/flush, but that introduces a race.)
Regards, Ilya.

Dear Ilya Yanok,
Hi Marek,
[sorry for copying, forgot to CC the list]
On Sat, Jun 16, 2012 at 2:29 AM, Marek Vasut marek.vasut@gmail.com wrote:
Hm, can't we just punch a hole in the MMU table at runtime instead of preallocating it like this?
It's allocated at runtime now; do you mean allocate it on demand? Good point. Probably we can malloc a big enough block and make it uncached directly from dma_alloc_coherent(). Is that what you suggest?
Kind of ... I mean rather insert an entry into MMU table at runtime that says "this region is uncached". But that'd need some hack in the mallocator now that I think about it. It might not be as simple as I thought at first.
On the other hand, most MMUs allow you to allocate stuff with 4k density, which should be ok.
Also, what is this for? Can we not simply flush/invalidate the caches?
flush/invalidate can be racy for some hardware. Sometimes we need to write one field of a DMA descriptor and then read another one. And because one cannot flush/invalidate individual bytes, a write/flush can destroy a field updated by the hardware. (Well, we can invalidate/read before write/flush, but that introduces a race.)
But that's shitty hardware. Do you really need to do it? Where? I fixed similar issue in fec_mxc.c recently.
Regards, Ilya.
Best regards, Marek Vasut

Hi Marek,
On Tue, Jun 19, 2012 at 3:37 AM, Marek Vasut marek.vasut@gmail.com wrote:
Kind of ... I mean rather insert an entry into MMU table at runtime that says "this region is uncached". But that'd need some hack in the mallocator now that I think about it. It might not be as simple as I thought at first.
Hm. I don't quite understand. Do you mean patching the malloc code to handle turning caching off for the region? I doubt that's a good approach. I would make the code that calls malloc handle the caching. How about something like this (pseudocode):
	if (/* have enough room in space allocated previously */)
		/* update existing allocation data and return */

	newp = malloc_align(ALIGN(size, min_size), min_size);
	/*
	 * Allocate a block with both size and offset aligned.
	 * min_size is the minimal block size for which cache can be turned
	 * off; on ARM it's currently 1MB.
	 * malloc_align will return a block starting on an aligned address;
	 * I don't think the current allocator has any support for this, so
	 * effectively we will have to allocate
	 * ALIGN(size, min_size) + min_size - 1 bytes.
	 */

	/* Patch the page table (on ARM) or do whatever is needed to make
	 * the allocated block uncached */
	/* Save data about unused space in the block (in case we allocated
	 * more than was requested) */

	return newp;
The existing allocation data could be as simple as a pair of static variables. This is not optimal in some scenarios but very simple.
On the other hand, most MMUs allow you to allocate stuff with 4k density, which should be ok.
Hm, I don't really feel like rebuilding page tables on the fly in U-Boot is a good idea. Currently U-Boot on ARM uses a page table with 1MB granularity.
Also, what is this for? Can we not simply flush/invalidate the caches?
flush/invalidate can be racy for some hardware. Sometimes we need to write some field to a DMA descriptor and then read another one. And because one cannot flush/invalidate individual bytes, a write/flush can destroy the field updated by hardware. (Well, we can invalidate/read before write/flush but that introduces a race).
But that's shitty hardware. Do you really need to do it? Where? I fixed similar issue in fec_mxc.c recently.
CPSW switch on TI AM33x. Here is the code:
	/* not the first packet - enqueue at the tail */
	prev = chan->tail;
	desc_write(prev, hw_next, desc);
	chan->tail = desc;

	/* next check if EOQ has been triggered already */
	if (desc_read(prev, hw_mode) & CPDMA_DESC_EOQ)
		chan_write(chan, hdp, desc);
desc_{read,write} are just readl/writel on fields of the desc structure. Note that one has to flush for the desc_write() to reach the hardware, but as the fields of the DMA descriptor are going to be in the same cache line, this flush will harm other fields that are possibly updated by the hardware. Well, I have to say I have never seen a manifestation of this problem in the wild, and probably the hardware somehow takes care of this situation, but we can't be sure.
Regards, Ilya.
participants (3): Ilya Yanok, Marek Vasut, Tom Rini