[U-Boot] [PATCH 0/9] arm64: Unify MMU code

Howdy,
Currently, MMU support and page tables on arm64 are a big mess. Each board does its own little thing, and the generic code is so limited that nobody actually uses it.
This patch set tries to clean that up. After this series is applied, all boards except for the FSL Layerscape ones are converted to the new generic page table logic and have icache+dcache enabled.
The new code always uses a 4k page size. It dynamically allocates 1G or 2M pages for ranges that fit. When a dcache attribute request comes in that requires a smaller granularity than the existing allocation can provide, the affected pages are split automatically.
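For example, a driver can ask for a single 4k page to become uncached and the covering 1G/2M block is split transparently; a call like the following (addresses made up purely for illustration) is all it takes:

	mmu_set_region_dcache_behaviour(0x80200000, 0x1000, DCACHE_OFF);

The block that currently maps 0x80200000 gets broken up into a next-level table and only the requested 4k page loses its cacheable attribute.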
I have tested and verified that the code works on HiKey (bare metal), vexpress64 (Foundation Model) and zynqmp (QEMU). The TX1 target is untested, but given the simplicity of its maps I doubt it will break. ThunderX should also work in theory, but I haven't tested it. I would be very happy if people with access to those systems could give the patch set a try.
With this we're a big step closer to a good baseline for EFI payload support, since we can now simply require that all boards always have dcache enabled.
I would also be incredibly happy if some Freescale people could look at their MMU code and try to unify it with the now cleaned-up generic code. I don't think we're far off here.
Alex
Alexander Graf (9):
  thunderx: Calculate TCR dynamically
  arm64: Make full va map code more dynamic
  zynqmp: Replace home grown mmu code with generic table approach
  tegra: Replace home grown mmu code with generic table approach
  vexpress64: Add MMU tables
  dwmmc: Increase retry timeout
  hikey: Add MMU tables
  arm64: Remove non-full-va map code
  arm64: Only allow dcache disabled in SPL builds
 arch/arm/cpu/armv8/cache_v8.c                  | 444 ++++++++++++++++++-------
 arch/arm/cpu/armv8/fsl-layerscape/cpu.c        |  29 +-
 arch/arm/cpu/armv8/zynqmp/cpu.c                | 169 ----------
 arch/arm/include/asm/arch-fsl-layerscape/cpu.h |  94 +++---
 arch/arm/include/asm/armv8/mmu.h               | 120 ++-----
 arch/arm/include/asm/global_data.h             |   6 +-
 arch/arm/include/asm/system.h                  |   7 +-
 arch/arm/mach-tegra/Makefile                   |   1 -
 arch/arm/mach-tegra/arm64-mmu.c                | 131 --------
 doc/README.arm64                               |  20 --
 drivers/mmc/dw_mmc.c                           |   2 +-
 include/configs/hikey.h                        |  18 +-
 include/configs/tegra210-common.h              |  15 +
 include/configs/thunderx_88xx.h                |  31 +-
 include/configs/vexpress_aemv8a.h              |  19 +-
 include/configs/xilinx_zynqmp.h                |  43 +++
 16 files changed, 526 insertions(+), 623 deletions(-)
 delete mode 100644 arch/arm/mach-tegra/arm64-mmu.c

Based on the memory map we can determine many of the hard-coded TCR fields, such as the maximum VA and maximum PA we want to support. Calculate those dynamically to reduce the chance of pitfalls.
Signed-off-by: Alexander Graf <agraf@suse.de>
---
 arch/arm/cpu/armv8/cache_v8.c    | 59 +++++++++++++++++++++++++++++++++++++++-
 arch/arm/include/asm/armv8/mmu.h |  6 +---
 include/configs/thunderx_88xx.h  |  3 --
 3 files changed, 59 insertions(+), 9 deletions(-)
diff --git a/arch/arm/cpu/armv8/cache_v8.c b/arch/arm/cpu/armv8/cache_v8.c
index 71f0020..9229532 100644
--- a/arch/arm/cpu/armv8/cache_v8.c
+++ b/arch/arm/cpu/armv8/cache_v8.c
@@ -38,6 +38,58 @@ static struct mm_region mem_map[] = CONFIG_SYS_MEM_MAP;
 #define PTL1_ENTRIES CONFIG_SYS_PTL1_ENTRIES
 #define PTL2_ENTRIES CONFIG_SYS_PTL2_ENTRIES

+static u64 get_tcr(int el, u64 *pips, u64 *pva_bits)
+{
+	u64 max_addr = 0;
+	u64 ips, va_bits;
+	u64 tcr;
+	int i;
+
+	/* Find the largest address we need to support */
+	for (i = 0; i < ARRAY_SIZE(mem_map); i++)
+		max_addr = max(max_addr, mem_map[i].base + mem_map[i].size);
+
+	/* Calculate the maximum physical (and thus virtual) address */
+	if (max_addr > (1ULL << 44)) {
+		ips = 5;
+		va_bits = 48;
+	} else if (max_addr > (1ULL << 42)) {
+		ips = 4;
+		va_bits = 44;
+	} else if (max_addr > (1ULL << 40)) {
+		ips = 3;
+		va_bits = 42;
+	} else if (max_addr > (1ULL << 36)) {
+		ips = 2;
+		va_bits = 40;
+	} else if (max_addr > (1ULL << 32)) {
+		ips = 1;
+		va_bits = 36;
+	} else {
+		ips = 0;
+		va_bits = 32;
+	}
+
+	if (el == 1) {
+		tcr = TCR_EL1_RSVD | (ips << 32);
+	} else if (el == 2) {
+		tcr = TCR_EL2_RSVD | (ips << 16);
+	} else {
+		tcr = TCR_EL3_RSVD | (ips << 16);
+	}
+
+	/* PTWs cacheable, inner/outer WBWA and inner shareable */
+	tcr |= TCR_TG0_64K | TCR_SHARED_INNER | TCR_ORGN_WBWA | TCR_IRGN_WBWA;
+	tcr |= TCR_T0SZ(VA_BITS);
+
+	if (pips)
+		*pips = ips;
+	if (pva_bits)
+		*pva_bits = va_bits;
+
+	return tcr;
+}
+
 static void setup_pgtables(void)
 {
 	int l1_e, l2_e;
@@ -110,6 +162,10 @@ __weak void mmu_setup(void)
 	/* Set up page tables only on BSP */
 	if (coreid == BSP_COREID)
 		setup_pgtables();
+
+	el = current_el();
+	set_ttbr_tcr_mair(el, gd->arch.tlb_addr, get_tcr(el, NULL, NULL),
+			  MEMORY_ATTRIBUTES);
 #else
 	/* Setup an identity-mapping for all spaces */
 	for (i = 0; i < (PGTABLE_SIZE >> 3); i++) {
@@ -128,7 +184,6 @@ __weak void mmu_setup(void)
 		}
 	}

-#endif
 	/* load TTBR0 */
 	el = current_el();
 	if (el == 1) {
@@ -144,6 +199,8 @@ __weak void mmu_setup(void)
 			      TCR_EL3_RSVD | TCR_FLAGS | TCR_EL3_IPS_BITS,
 			      MEMORY_ATTRIBUTES);
 	}
+#endif
+
 	/* enable the mmu */
 	set_sctlr(get_sctlr() | CR_M);
 }
diff --git a/arch/arm/include/asm/armv8/mmu.h b/arch/arm/include/asm/armv8/mmu.h
index 897f010..39ff745 100644
--- a/arch/arm/include/asm/armv8/mmu.h
+++ b/arch/arm/include/asm/armv8/mmu.h
@@ -159,11 +159,6 @@
 #define TCR_EL1_IPS_BITS	(UL(3) << 32)	/* 42 bits physical address */
 #define TCR_EL2_IPS_BITS	(3 << 16)	/* 42 bits physical address */
 #define TCR_EL3_IPS_BITS	(3 << 16)	/* 42 bits physical address */
-#else
-#define TCR_EL1_IPS_BITS	CONFIG_SYS_TCR_EL1_IPS_BITS
-#define TCR_EL2_IPS_BITS	CONFIG_SYS_TCR_EL2_IPS_BITS
-#define TCR_EL3_IPS_BITS	CONFIG_SYS_TCR_EL3_IPS_BITS
-#endif

 /* PTWs cacheable, inner/outer WBWA and inner shareable */
 #define TCR_FLAGS		(TCR_TG0_64K | \
@@ -171,6 +166,7 @@
 				TCR_ORGN_WBWA | \
 				TCR_IRGN_WBWA | \
 				TCR_T0SZ(VA_BITS))
+#endif

 #define TCR_EL1_RSVD		(1 << 31)
 #define TCR_EL2_RSVD		(1 << 31 | 1 << 23)
diff --git a/include/configs/thunderx_88xx.h b/include/configs/thunderx_88xx.h
index cece4dd..b9f93ad 100644
--- a/include/configs/thunderx_88xx.h
+++ b/include/configs/thunderx_88xx.h
@@ -50,9 +50,6 @@
 #define CONFIG_SYS_PGTABLE_SIZE		\
 				((CONFIG_SYS_PTL1_ENTRIES + \
 				CONFIG_SYS_MEM_MAP_SIZE * CONFIG_SYS_PTL2_ENTRIES) * 8)
-#define CONFIG_SYS_TCR_EL1_IPS_BITS	(5UL << 32)
-#define CONFIG_SYS_TCR_EL2_IPS_BITS	(5 << 16)
-#define CONFIG_SYS_TCR_EL3_IPS_BITS	(5 << 16)

 /* Link Definitions */
 #define CONFIG_SYS_TEXT_BASE	0x00500000

The idea to generate our page tables from an array of memory ranges is very sound. However, instead of hard-coding the creation of up to 2 levels of 64k granule page tables, we really should just create normal 4k page tables that allow us to set caching attributes at the 2M or 4k level later on.
So this patch moves the full_va mapping code to 4k page size and makes it fully flexible to dynamically create as many levels as necessary for a map (including dynamic 1G/2M pages). It also adds support to dynamically split a large map into smaller ones when some code wants to set dcache attributes.
With all this in place, there is very little reason to create your own page tables in board specific files.
Signed-off-by: Alexander Graf <agraf@suse.de>
---
 arch/arm/cpu/armv8/cache_v8.c      | 346 +++++++++++++++++++++++++++++++------
 arch/arm/include/asm/armv8/mmu.h   |  68 ++++----
 arch/arm/include/asm/global_data.h |   4 +-
 arch/arm/include/asm/system.h      |   3 +-
 include/configs/thunderx_88xx.h    |  14 +-
 5 files changed, 332 insertions(+), 103 deletions(-)
diff --git a/arch/arm/cpu/armv8/cache_v8.c b/arch/arm/cpu/armv8/cache_v8.c
index 9229532..4369a83 100644
--- a/arch/arm/cpu/armv8/cache_v8.c
+++ b/arch/arm/cpu/armv8/cache_v8.c
@@ -2,6 +2,9 @@
  * (C) Copyright 2013
  * David Feng <fenghua@phytium.com.cn>
  *
+ * (C) Copyright 2016
+ * Alexander Graf <agraf@suse.de>
+ *
  * SPDX-License-Identifier: GPL-2.0+
  */

@@ -9,35 +12,40 @@
 #include <asm/system.h>
 #include <asm/armv8/mmu.h>

-DECLARE_GLOBAL_DATA_PTR;
-
-#ifndef CONFIG_SYS_DCACHE_OFF
+/* #define DEBUG_MMU */

-#ifdef CONFIG_SYS_FULL_VA
-static void set_ptl1_entry(u64 index, u64 ptl2_entry)
-{
-	u64 *pgd = (u64 *)gd->arch.tlb_addr;
-	u64 value;
+#ifdef DEBUG_MMU
+#define DPRINTF(a, ...) printf("%s:%d: " a, __func__, __LINE__, __VA_ARGS__)
+#else
+#define DPRINTF(a, ...) do { } while(0)
+#endif

-	value = ptl2_entry | PTL1_TYPE_TABLE;
-	pgd[index] = value;
-}
+DECLARE_GLOBAL_DATA_PTR;

-static void set_ptl2_block(u64 ptl1, u64 bfn, u64 address, u64 memory_attrs)
-{
-	u64 *pmd = (u64 *)ptl1;
-	u64 value;
+#ifndef CONFIG_SYS_DCACHE_OFF

-	value = address | PTL2_TYPE_BLOCK | PTL2_BLOCK_AF;
-	value |= memory_attrs;
-	pmd[bfn] = value;
-}
+/*
+ * With 4k page granule, a virtual address is split into 4 lookup parts
+ * spanning 9 bits each:
+ *
+ *   _______________________________________________
+ *  |       |       |       |       |       |       |
+ *  |   0   |  Lv0  |  Lv1  |  Lv2  |  Lv3  |  off  |
+ *  |_______|_______|_______|_______|_______|_______|
+ *    63-48   47-39   38-30   29-21   20-12   11-00
+ *
+ *             mask        page size
+ *
+ * Lv0: FF8000000000       --
+ * Lv1:   7FC0000000       1G
+ * Lv2:     3FE00000       2M
+ * Lv3:       1FF000       4K
+ * off:          FFF
+ */

+#ifdef CONFIG_SYS_FULL_VA
 static struct mm_region mem_map[] = CONFIG_SYS_MEM_MAP;

-#define PTL1_ENTRIES CONFIG_SYS_PTL1_ENTRIES
-#define PTL2_ENTRIES CONFIG_SYS_PTL2_ENTRIES
-
 static u64 get_tcr(int el, u64 *pips, u64 *pva_bits)
 {
 	u64 max_addr = 0;
@@ -79,8 +87,8 @@ static u64 get_tcr(int el, u64 *pips, u64 *pva_bits)
 	}

 	/* PTWs cacheable, inner/outer WBWA and inner shareable */
-	tcr |= TCR_TG0_64K | TCR_SHARED_INNER | TCR_ORGN_WBWA | TCR_IRGN_WBWA;
-	tcr |= TCR_T0SZ(VA_BITS);
+	tcr |= TCR_TG0_4K | TCR_SHARED_INNER | TCR_ORGN_WBWA | TCR_IRGN_WBWA;
+	tcr |= TCR_T0SZ(va_bits);

 	if (pips)
 		*pips = ips;
@@ -90,39 +98,196 @@ static u64 get_tcr(int el, u64 *pips, u64 *pva_bits)
 	return tcr;
 }
-static void setup_pgtables(void)
+#define MAX_PTE_ENTRIES 512
+
+static int pte_type(u64 *pte)
+{
+	return *pte & PTE_TYPE_MASK;
+}
+
+/* Returns the LSB number for a PTE on level <level> */
+static int level2shift(int level)
 {
-	int l1_e, l2_e;
-	unsigned long pmd = 0;
-	unsigned long address;
-
-	/* Setup the PMD pointers */
-	for (l1_e = 0; l1_e < CONFIG_SYS_MEM_MAP_SIZE; l1_e++) {
-		gd->arch.pmd_addr[l1_e] = gd->arch.tlb_addr +
-						PTL1_ENTRIES * sizeof(u64);
-		gd->arch.pmd_addr[l1_e] += PTL2_ENTRIES * sizeof(u64) * l1_e;
-		gd->arch.pmd_addr[l1_e] = ALIGN(gd->arch.pmd_addr[l1_e],
-						0x10000UL);
+	/* Page is 12 bits wide, every level translates 9 bits */
+	return (12 + 9 * (3 - level));
+}
+
+static u64 *find_pte(u64 addr, int level)
+{
+	int start_level = 0;
+	u64 *pte;
+	u64 idx;
+	u64 va_bits;
+	int i;
+
+	DPRINTF("addr=%llx level=%d\n", addr, level);
+
+	get_tcr(0, NULL, &va_bits);
+	if (va_bits < 39)
+		start_level = 1;
+
+	if (level < start_level)
+		return NULL;
+
+	/* Walk through all page table levels to find our PTE */
+	pte = (u64*)gd->arch.tlb_addr;
+	for (i = start_level; i < 4; i++) {
+		idx = (addr >> level2shift(i)) & 0x1FF;
+		pte += idx;
+		DPRINTF("idx=%llx PTE %p at level %d: %llx\n", idx, pte, i, *pte);
+
+		/* Found it */
+		if (i == level)
+			return pte;
+		/* PTE is no table (either invalid or block), can't traverse */
+		if (pte_type(pte) != PTE_TYPE_TABLE)
+			return NULL;
+		/* Off to the next level */
+		pte = (u64*)(*pte & 0x0000fffffffff000ULL);
 	}

-	/* Setup the page tables */
-	for (l1_e = 0; l1_e < PTL1_ENTRIES; l1_e++) {
-		if (mem_map[pmd].base ==
-		    (uintptr_t)l1_e << PTL2_BITS) {
-			set_ptl1_entry(l1_e, gd->arch.pmd_addr[pmd]);
-
-			for (l2_e = 0; l2_e < PTL2_ENTRIES; l2_e++) {
-				address = mem_map[pmd].base
-					+ (uintptr_t)l2_e * BLOCK_SIZE;
-				set_ptl2_block(gd->arch.pmd_addr[pmd], l2_e,
-					       address, mem_map[pmd].attrs);
-			}
+	/* Should never reach here */
+	return NULL;
+}
+
+/* Creates a new full table (512 entries) and sets *pte to refer to it */
+static u64 *create_table(void)
+{
+	u64 *new_table = (u64*)gd->arch.tlb_fillptr;
+	u64 pt_len = MAX_PTE_ENTRIES * sizeof(u64);
+
+	/* Allocate MAX_PTE_ENTRIES pte entries */
+	gd->arch.tlb_fillptr += pt_len;
+
+	if (gd->arch.tlb_fillptr - gd->arch.tlb_addr > gd->arch.tlb_size)
+		panic("Insufficient RAM for page table: 0x%lx > 0x%lx",
+		      gd->arch.tlb_fillptr - gd->arch.tlb_addr,
+		      gd->arch.tlb_size);
+
+	/* Mark all entries as invalid */
+	memset(new_table, 0, pt_len);

-			pmd++;
-		} else {
-			set_ptl1_entry(l1_e, 0);
+	return new_table;
+}
+
+static void set_pte_table(u64 *pte, u64 *table)
+{
+	/* Point *pte to the new table */
+	DPRINTF("Setting %p to addr=%p\n", pte, table);
+	*pte = PTE_TYPE_TABLE | (ulong)table;
+}
+
+/* Add one mm_region map entry to the page tables */
+static void add_map(struct mm_region *map)
+{
+	u64 *pte;
+	u64 addr = map->base;
+	u64 size = map->size;
+	u64 attrs = map->attrs | PTE_TYPE_BLOCK | PTE_BLOCK_AF;
+	u64 blocksize;
+	int level;
+	u64 *new_table;
+
+	while (size) {
+		pte = find_pte(addr, 0);
+		if (pte && (pte_type(pte) == PTE_TYPE_FAULT)) {
+			DPRINTF("Creating table for addr 0x%llx\n", addr);
+			new_table = create_table();
+			set_pte_table(pte, new_table);
 		}
+
+		for (level = 1; level < 4; level++) {
+			pte = find_pte(addr, level);
+			blocksize = 1ULL << level2shift(level);
+			DPRINTF("Checking if pte fits for addr=%llx size=%llx "
+				"blocksize=%llx\n", addr, size, blocksize);
+			if (size >= blocksize && !(addr & (blocksize - 1))) {
+				/* Page fits, create block PTE */
+				DPRINTF("Setting PTE %p to block addr=%llx\n",
+					pte, addr);
+				*pte = addr | attrs;
+				addr += blocksize;
+				size -= blocksize;
+				break;
+			} else if ((pte_type(pte) == PTE_TYPE_FAULT)) {
+				/* Page doesn't fit, create subpages */
+				DPRINTF("Creating subtable for addr 0x%llx "
+					"blksize=%llx\n", addr, blocksize);
+				new_table = create_table();
+				set_pte_table(pte, new_table);
+			}
+		}
+	}
+}
+
+/* Splits a block PTE into table with subpages spanning the old block */
+static void split_block(u64 *pte, int level)
+{
+	u64 old_pte = *pte;
+	u64 *new_table;
+	u64 i = 0;
+	/* level describes the parent level, we need the child ones */
+	int levelshift = level2shift(level + 1);
+
+	if (pte_type(pte) != PTE_TYPE_BLOCK)
+		panic("PTE %p (%llx) is not a block", pte, old_pte);
+
+	new_table = create_table();
+	DPRINTF("Splitting pte %p (%llx) into %p\n", pte, old_pte, new_table);
+
+	for (i = 0; i < MAX_PTE_ENTRIES; i++) {
+		new_table[i] = old_pte | (i << levelshift);
+		DPRINTF("Setting new_table[%lld] = %llx\n", i, new_table[i]);
 	}
+
+	/* Set the new table into effect */
+	set_pte_table(pte, new_table);
+}
+
+/* Returns the estimated required size of all page tables */
+u64 get_page_table_size(void)
+{
+	int i;
+	u64 one_pt = MAX_PTE_ENTRIES * sizeof(u64);
+	u64 size = 0;
+
+	/* root page table */
+	size += one_pt;
+
+	for (i = 0; i < ARRAY_SIZE(mem_map); i++) {
+		struct mm_region *map = &mem_map[i];
+
+		/* Account for Lv0 page tables */
+		size += one_pt * ((map->size >> 39) + 1);
+
+		/* 1GB aligned pages fit already, so count the others */
+		if (map->size & 0x3fffffffULL)
+			size += one_pt;
+		if (map->base & 0x3fffffffULL)
+			size += one_pt;
+	}
+
+	/* Assume we may have to split up to 4 more page tables off */
+	size += one_pt * 4;
+
+	return size;
+}
+
+static void setup_pgtables(void)
+{
+	int i;
+
+	/*
+	 * Allocate the first level we're on with invalidate entries.
+	 * If the starting level is 0 (va_bits >= 39), then this is our
+	 * Lv0 page table, otherwise it's the entry Lv1 page table.
+	 */
+	gd->arch.tlb_fillptr = gd->arch.tlb_addr;
+	create_table();
+
+	/* Now add all MMU table entries one after another to the table */
+	for (i = 0; i < ARRAY_SIZE(mem_map); i++)
+		add_map(&mem_map[i]);
 }
 #else
@@ -157,10 +322,8 @@ __weak void mmu_setup(void)
 	int el;

 #ifdef CONFIG_SYS_FULL_VA
-	unsigned long coreid = read_mpidr() & CONFIG_COREID_MASK;
-
-	/* Set up page tables only on BSP */
-	if (coreid == BSP_COREID)
+	/* Set up page tables only once */
+	if (!gd->arch.tlb_fillptr)
 		setup_pgtables();

 	el = current_el();
@@ -311,6 +474,79 @@ void mmu_set_region_dcache_behaviour(phys_addr_t start, size_t size,
 	flush_dcache_range(start, end);
 	asm volatile("dsb sy");
 }
+#else
+static bool is_aligned(u64 addr, u64 size, u64 align)
+{
+	return !(addr & (align - 1)) && !(size & (align - 1));
+}
+
+static u64 set_one_region(u64 start, u64 size, u64 attrs, int level)
+{
+	int levelshift = level2shift(level);
+	u64 levelsize = 1ULL << levelshift;
+	u64 *pte = find_pte(start, level);
+
+	/* Can we can just modify the current level block PTE? */
+	if (is_aligned(start, size, levelsize)) {
+		*pte &= ~PMD_ATTRINDX_MASK;
+		*pte |= attrs;
+		DPRINTF("Set attrs=%llx pte=%p level=%d\n", attrs, pte, level);
+
+		return levelsize;
+	}
+
+	/* Unaligned or doesn't fit, maybe split block into table */
+	DPRINTF("addr=%llx level=%d pte=%p (%llx)\n", start, level, pte, *pte);
+
+	/* Maybe we need to split the block into a table */
+	if (pte_type(pte) == PTE_TYPE_BLOCK)
+		split_block(pte, level);
+
+	/* And then double-check it became a table or already is one */
+	if (pte_type(pte) != PTE_TYPE_TABLE)
+		panic("PTE %p (%llx) for addr=%llx should be a table",
+		      pte, *pte, start);
+
+	/* Roll on to the next page table level */
+	return 0;
+}
+
+void mmu_set_region_dcache_behaviour(phys_addr_t start, size_t size,
+				     enum dcache_option option)
+{
+	u64 attrs = PMD_ATTRINDX(option);
+	u64 real_start = start;
+	u64 real_size = size;
+
+	DPRINTF("start=%lx size=%lx\n", (ulong)start, (ulong)size);
+
+	/*
+	 * Loop through the address range until we find a page granule that fits
+	 * our alignment constraints, then set it to the new cache attributes
+	 */
+	while (size > 0) {
+		int level;
+		u64 r;
+
+		for (level = 1; level < 4; level++) {
+			r = set_one_region(start, size, attrs, level);
+			if (r) {
+				/* PTE successfully replaced */
+				size -= r;
+				start += r;
+				break;
+			}
+		}
+
+	}
+
+	asm volatile("dsb sy");
+	__asm_invalidate_tlb_all();
+	asm volatile("dsb sy");
+	asm volatile("isb");
+	flush_dcache_range(real_start, real_start + real_size);
+	asm volatile("dsb sy");
+}
 #endif
 #else	/* CONFIG_SYS_DCACHE_OFF */
diff --git a/arch/arm/include/asm/armv8/mmu.h b/arch/arm/include/asm/armv8/mmu.h
index 39ff745..1711433 100644
--- a/arch/arm/include/asm/armv8/mmu.h
+++ b/arch/arm/include/asm/armv8/mmu.h
@@ -26,15 +26,9 @@
 #define VA_BITS			(42)	/* 42 bits virtual address */
 #else
 #define VA_BITS			CONFIG_SYS_VA_BITS
-#define PTL2_BITS		CONFIG_SYS_PTL2_BITS
+#define PTE_BLOCK_BITS		CONFIG_SYS_PTL2_BITS
 #endif

-/* PAGE_SHIFT determines the page size */
-#undef PAGE_SIZE
-#define PAGE_SHIFT		16
-#define PAGE_SIZE		(1 << PAGE_SHIFT)
-#define PAGE_MASK		(~(PAGE_SIZE-1))
-
 /*
  * block/section address mask and size definitions.
  */
@@ -42,10 +36,21 @@
 #define SECTION_SHIFT		29
 #define SECTION_SIZE		(UL(1) << SECTION_SHIFT)
 #define SECTION_MASK		(~(SECTION_SIZE-1))
+
+/* PAGE_SHIFT determines the page size */
+#undef PAGE_SIZE
+#define PAGE_SHIFT		16
+#define PAGE_SIZE		(1 << PAGE_SHIFT)
+#define PAGE_MASK		(~(PAGE_SIZE-1))
+
 #else
-#define BLOCK_SHIFT		CONFIG_SYS_BLOCK_SHIFT
-#define BLOCK_SIZE		(UL(1) << BLOCK_SHIFT)
-#define BLOCK_MASK		(~(BLOCK_SIZE-1))
+
+/* PAGE_SHIFT determines the page size */
+#undef PAGE_SIZE
+#define PAGE_SHIFT		12
+#define PAGE_SIZE		(1 << PAGE_SHIFT)
+#define PAGE_MASK		(~(PAGE_SIZE-1))
+
 #endif

 /***************************************************************/
@@ -71,39 +76,28 @@
  */

 #ifdef CONFIG_SYS_FULL_VA
-/*
- * Level 1 descriptor (PGD).
- */

-#define PTL1_TYPE_MASK		(3 << 0)
-#define PTL1_TYPE_TABLE		(3 << 0)
-
-#define PTL1_TABLE_PXN		(1UL << 59)
-#define PTL1_TABLE_XN		(1UL << 60)
-#define PTL1_TABLE_AP		(1UL << 61)
-#define PTL1_TABLE_NS		(1UL << 63)
-
-
-/*
- * Level 2 descriptor (PMD).
- */
+#define PTE_TYPE_MASK		(3 << 0)
+#define PTE_TYPE_FAULT		(0 << 0)
+#define PTE_TYPE_TABLE		(3 << 0)
+#define PTE_TYPE_BLOCK		(1 << 0)

-#define PTL2_TYPE_MASK		(3 << 0)
-#define PTL2_TYPE_FAULT		(0 << 0)
-#define PTL2_TYPE_TABLE		(3 << 0)
-#define PTL2_TYPE_BLOCK		(1 << 0)
+#define PTE_TABLE_PXN		(1UL << 59)
+#define PTE_TABLE_XN		(1UL << 60)
+#define PTE_TABLE_AP		(1UL << 61)
+#define PTE_TABLE_NS		(1UL << 63)

 /*
  * Block
  */
-#define PTL2_MEMTYPE(x)		((x) << 2)
-#define PTL2_BLOCK_NON_SHARE	(0 << 8)
-#define PTL2_BLOCK_OUTER_SHARE	(2 << 8)
-#define PTL2_BLOCK_INNER_SHARE	(3 << 8)
-#define PTL2_BLOCK_AF		(1 << 10)
-#define PTL2_BLOCK_NG		(1 << 11)
-#define PTL2_BLOCK_PXN		(UL(1) << 53)
-#define PTL2_BLOCK_UXN		(UL(1) << 54)
+#define PTE_BLOCK_MEMTYPE(x)	((x) << 2)
+#define PTE_BLOCK_NON_SHARE	(0 << 8)
+#define PTE_BLOCK_OUTER_SHARE	(2 << 8)
+#define PTE_BLOCK_INNER_SHARE	(3 << 8)
+#define PTE_BLOCK_AF		(1 << 10)
+#define PTE_BLOCK_NG		(1 << 11)
+#define PTE_BLOCK_PXN		(UL(1) << 53)
+#define PTE_BLOCK_UXN		(UL(1) << 54)

 #else
 /*
diff --git a/arch/arm/include/asm/global_data.h b/arch/arm/include/asm/global_data.h
index dcfa098..3dec1db 100644
--- a/arch/arm/include/asm/global_data.h
+++ b/arch/arm/include/asm/global_data.h
@@ -38,10 +38,10 @@ struct arch_global_data {
 	unsigned long long timer_reset_value;
 #if !(defined(CONFIG_SYS_ICACHE_OFF) && defined(CONFIG_SYS_DCACHE_OFF))
 	unsigned long tlb_addr;
+	unsigned long tlb_size;
 #if defined(CONFIG_SYS_FULL_VA)
-	unsigned long pmd_addr[CONFIG_SYS_PTL1_ENTRIES];
+	unsigned long tlb_fillptr;
 #endif
-	unsigned long tlb_size;
 #endif

 #ifdef CONFIG_OMAP_COMMON
diff --git a/arch/arm/include/asm/system.h b/arch/arm/include/asm/system.h
index 026e7ef..ffd6fe5 100644
--- a/arch/arm/include/asm/system.h
+++ b/arch/arm/include/asm/system.h
@@ -20,7 +20,8 @@
 #ifndef CONFIG_SYS_FULL_VA
 #define PGTABLE_SIZE	(0x10000)
 #else
-#define PGTABLE_SIZE	CONFIG_SYS_PGTABLE_SIZE
+u64 get_page_table_size(void);
+#define PGTABLE_SIZE	get_page_table_size()
 #endif
 /* 2MB granularity */
diff --git a/include/configs/thunderx_88xx.h b/include/configs/thunderx_88xx.h
index b9f93ad..20b25f7 100644
--- a/include/configs/thunderx_88xx.h
+++ b/include/configs/thunderx_88xx.h
@@ -22,21 +22,19 @@

 #define MEM_BASE			0x00500000

-#define CONFIG_COREID_MASK		0xffffff
-
 #define CONFIG_SYS_FULL_VA

 #define CONFIG_SYS_LOWMEM_BASE		MEM_BASE

 #define CONFIG_SYS_MEM_MAP	{{0x000000000000UL, 0x40000000000UL,	\
-				 PTL2_MEMTYPE(MT_NORMAL) |		\
-				 PTL2_BLOCK_NON_SHARE},			\
+				 PTE_BLOCK_MEMTYPE(MT_NORMAL) |		\
+				 PTE_BLOCK_NON_SHARE},			\
 				{0x800000000000UL, 0x40000000000UL,	\
-				 PTL2_MEMTYPE(MT_DEVICE_NGNRNE) |	\
-				 PTL2_BLOCK_NON_SHARE},			\
+				 PTE_BLOCK_MEMTYPE(MT_DEVICE_NGNRNE) |	\
+				 PTE_BLOCK_NON_SHARE},			\
 				{0x840000000000UL, 0x40000000000UL,	\
-				 PTL2_MEMTYPE(MT_DEVICE_NGNRNE) |	\
-				 PTL2_BLOCK_NON_SHARE},			\
+				 PTE_BLOCK_MEMTYPE(MT_DEVICE_NGNRNE) |	\
+				 PTE_BLOCK_NON_SHARE},			\
 				}

 #define CONFIG_SYS_MEM_MAP_SIZE	3

On 02/21/2016 06:57 PM, Alexander Graf wrote:
The idea to generate our pages tables from an array of memory ranges is very sound. However, instead of hard coding the code to create up to 2 levels of 64k granule page tables, we really should just create normal 4k page tables that allow us to set caching attributes on 2M or 4k level later on.
So this patch moves the full_va mapping code to 4k page size and makes it fully flexible to dynamically create as many levels as necessary for a map (including dynamic 1G/2M pages). It also adds support to dynamically split a large map into smaller ones when some code wants to set dcache attributes.
With all this in place, there is very little reason to create your own page tables in board specific files.
diff --git a/arch/arm/cpu/armv8/cache_v8.c b/arch/arm/cpu/armv8/cache_v8.c
+/* Creates a new full table (512 entries) and sets *pte to refer to it */ +static u64 *create_table(void)
I think that comment is stale (there's no *pte; I assume it should say "returns").
+/* Returns the estimated required size of all page tables */ +u64 get_page_table_size(void) +{
- int i;
- u64 one_pt = MAX_PTE_ENTRIES * sizeof(u64);
- u64 size = 0;
- /* root page table */
- size += one_pt;
Isn't the root page table level 0? So, that accounts for the page table that's indexed by VA bits 47:39.
- for (i = 0; i < ARRAY_SIZE(mem_map); i++) {
struct mm_region *map = &mem_map[i];
/* Account for Lv0 page tables */
size += one_pt * ((map->size >> 39) + 1);
So here, isn't the code accounting for level 1 tables, not level 0 tables? That is, the tables indexed by VA bits 38:30.
(As an aside for myself when I come back and read this later, the shift is 39, since we need to allocate as many tables as there are values for bits 39 and above).
/* 1GB aligned pages fit already, so count the others */
if (map->size & 0x3fffffffULL)
size += one_pt;
if (map->base & 0x3fffffffULL)
size += one_pt;
Here, I believe we're accounting for any required level 2 tables (which are indexed by VA bits 29:21).
We seem to be missing code inside the loop that accounts for any required level 3 tables (indexed by VA bits 20:12).
- }
- /* Assume we may have to split up to 4 more page tables off */
- size += one_pt * 4;
Is this meant to be accounting for level 3 tables? If so, I'm not sure where the value "4" comes from. I would have expected "2" instead (for misaligned start/end) *and* for this calculation to be inside the loop rather than outside. Doesn't putting the calculation outside the loop make assumptions about how many mem_map[] entries are mis-aligned relative to 2MB sections?
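To make the accounting concrete, here is what the function as posted computes for a hypothetical two-entry map (the map itself is invented, one_pt = 512 * 8 bytes = 4 KiB):

	mem_map = { { base 0x0,          size 0x80000000    /* 2 GiB */ },
	            { base 0x8000000000, size 0x40000000000 /* 4 TiB */ } }

	root table:                                1 * 4 KiB
	entry 0: (0x80000000 >> 39) + 1 = 1   ->   1 * 4 KiB, base/size 1 GiB aligned -> +0
	entry 1: (2^42 >> 39) + 1       = 9   ->   9 * 4 KiB, aligned                 -> +0
	split reserve:                             4 * 4 KiB
	total:                                    15 * 4 KiB = 60 KiB

This is only meant to show which table level each of the three terms is trying to cover.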
diff --git a/arch/arm/include/asm/armv8/mmu.h b/arch/arm/include/asm/armv8/mmu.h
 #define VA_BITS			(42)	/* 42 bits virtual address */
 #else
 #define VA_BITS			CONFIG_SYS_VA_BITS
-#define PTL2_BITS		CONFIG_SYS_PTL2_BITS
+#define PTE_BLOCK_BITS		CONFIG_SYS_PTL2_BITS
 #endif
When I "wrote" the Tegra ARMv8 MMU code (i.e. cut/pasted it from other ARMv8 MMU code), I recall finding some inconsistencies between the value of VA_BITS and *SECTION_SHIFT between different header files. I think that was e.g.:
arch/arm/include/asm/armv8/mmu.h:26:#define VA_BITS (42) /* 42 bits virtual address */
arch/arm/mach-tegra/arm64-mmu.c:25:#define TEGRA_VA_BITS 40
arch/arm/cpu/armv8/zynqmp/cpu.c:59:#define ZYNQMO_VA_BITS 40
asm/armv8/mmu.h's value for SECTION_SHIFT and asm/system.h's value for MMU_SECTION_SHIFT.
(Also see the related parts of the description of git commit 376cb1a45315 "ARM: tegra: add custom MMU setup on ARMv8")
Does this patch entirely address those discrepancies?

On Feb 22, 2016, at 7:18 PM, Stephen Warren swarren@wwwdotorg.org wrote:
On 02/21/2016 06:57 PM, Alexander Graf wrote:
The idea to generate our pages tables from an array of memory ranges is very sound. However, instead of hard coding the code to create up to 2 levels of 64k granule page tables, we really should just create normal 4k page tables that allow us to set caching attributes on 2M or 4k level later on.
So this patch moves the full_va mapping code to 4k page size and makes it fully flexible to dynamically create as many levels as necessary for a map (including dynamic 1G/2M pages). It also adds support to dynamically split a large map into smaller ones when some code wants to set dcache attributes.
With all this in place, there is very little reason to create your own page tables in board specific files.
diff --git a/arch/arm/cpu/armv8/cache_v8.c b/arch/arm/cpu/armv8/cache_v8.c
+/* Creates a new full table (512 entries) and sets *pte to refer to it */ +static u64 *create_table(void)
I think that comment is stale (there's no *pte; I assume it should say "returns").
Oops, yes. I split the pte setting into a separate function and forgot to update the comment. Nice catch.
+/* Returns the estimated required size of all page tables */ +u64 get_page_table_size(void) +{
- int i;
- u64 one_pt = MAX_PTE_ENTRIES * sizeof(u64);
- u64 size = 0;
- /* root page table */
- size += one_pt;
Isn't the root page table level 0? So, that accounts for the page table that's indexed by VA bits 47:39.
Yes, or - if your va_bits < 39 it actually accounts for level 1 because the page table starts at Lv1.
- for (i = 0; i < ARRAY_SIZE(mem_map); i++) {
struct mm_region *map = &mem_map[i];
/* Account for Lv0 page tables */
size += one_pt * ((map->size >> 39) + 1);
So here, isn't the code accounting for level 1 tables, not level 0 tables? That is, the tables indexed by VA bits 38:30.
(As an aside for myself when I come back and read this later, the shift is 39, since we need to allocate as many tables as there are values for bits 39 and above).
I definitely should use the level2shift helper here, yes.
/* 1GB aligned pages fit already, so count the others */
if (map->size & 0x3fffffffULL)
size += one_pt;
if (map->base & 0x3fffffffULL)
size += one_pt;
Here, I believe we're accounting for any required level 2 tables (which are indexed by VA bits 29:21).
We seem to be missing code inside the loop that accounts for any required level 3 tables (indexed by VA bits 20:12).
The reason I didn't account for level 3 is that *usually* we shouldn't have those around. I guess the size estimation really could use some more thought...
- }
- /* Assume we may have to split up to 4 more page tables off */
- size += one_pt * 4;
Is this meant to be accounting for level 3 tables? If so, I'm not sure where the value "4" comes from. I would have expected "2" instead (for misaligned start/end) *and* for this calculation to be inside the loop rather than outside. Doesn't putting the calculation outside the loop make assumptions about how many mem_map[] entries are mis-aligned relative to 2MB sections?
The 4 is a random pick to give room for page tables that we need to split to make room for dcache attributes.
diff --git a/arch/arm/include/asm/armv8/mmu.h b/arch/arm/include/asm/armv8/mmu.h
 #define VA_BITS			(42)	/* 42 bits virtual address */
 #else
 #define VA_BITS			CONFIG_SYS_VA_BITS
-#define PTL2_BITS		CONFIG_SYS_PTL2_BITS
+#define PTE_BLOCK_BITS		CONFIG_SYS_PTL2_BITS
 #endif
When I "wrote" the Tegra ARMv8 MMU code (i.e. cut/pasted it from other ARMv8 MMU code), I recall finding some inconsistencies between the value of VA_BITS and *SECTION_SHIFT between different header files. I think that was e.g.:
arch/arm/include/asm/armv8/mmu.h:26:#define VA_BITS (42) /* 42 bits virtual address */
arch/arm/mach-tegra/arm64-mmu.c:25:#define TEGRA_VA_BITS 40
arch/arm/cpu/armv8/zynqmp/cpu.c:59:#define ZYNQMO_VA_BITS 40
asm/armv8/mmu.h's value for SECTION_SHIFT and asm/system.h's value for MMU_SECTION_SHIFT.
(Also see the related parts of the description of git commit 376cb1a45315 "ARM: tegra: add custom MMU setup on ARMv8")
Does this patch entirely address those discrepancies?
I've seen the comment and I think the problem was that MMU_SECTION_SHIFT indicated 2MB pages while the PTEs were actually on 512MB pages. That should be all solved now FWIW, yes. We now support real 2MB sections.
Alex

On 02/22/2016 11:37 AM, Alexander Graf wrote:
On Feb 22, 2016, at 7:18 PM, Stephen Warren swarren@wwwdotorg.org wrote:
On 02/21/2016 06:57 PM, Alexander Graf wrote:
The idea to generate our pages tables from an array of memory ranges is very sound. However, instead of hard coding the code to create up to 2 levels of 64k granule page tables, we really should just create normal 4k page tables that allow us to set caching attributes on 2M or 4k level later on.
So this patch moves the full_va mapping code to 4k page size and makes it fully flexible to dynamically create as many levels as necessary for a map (including dynamic 1G/2M pages). It also adds support to dynamically split a large map into smaller ones when some code wants to set dcache attributes.
With all this in place, there is very little reason to create your own page tables in board specific files.
diff --git a/arch/arm/cpu/armv8/cache_v8.c b/arch/arm/cpu/armv8/cache_v8.c
+/* Creates a new full table (512 entries) and sets *pte to refer to it */ +static u64 *create_table(void)
I think that comment is stale (there's no *pte; I assume it should say "returns").
Oops, yes. I split the pte setting into a separate function and forgot to update the comment. Nice catch.
+/* Returns the estimated required size of all page tables */ +u64 get_page_table_size(void) +{
- int i;
- u64 one_pt = MAX_PTE_ENTRIES * sizeof(u64);
- u64 size = 0;
- /* root page table */
- size += one_pt;
Isn't the root page table level 0? So, that accounts for the page table that's indexed by VA bits 47:39.
Yes, or - if your va_bits < 39 it actually accounts for level 1 because the page table starts at Lv1.
- for (i = 0; i < ARRAY_SIZE(mem_map); i++) {
struct mm_region *map = &mem_map[i];
/* Account for Lv0 page tables */
size += one_pt * ((map->size >> 39) + 1);
So here, isn't the code accounting for level 1 tables, not level 0 tables? That is, the tables indexed by VA bits 38:30.
(As an aside for myself when I come back and read this later, the shift is 39, since we need to allocate as many tables as there are values for bits 39 and above).
I definitely should use the level2shift helper here, yes.
/* 1GB aligned pages fit already, so count the others */
if (map->size & 0x3fffffffULL)
size += one_pt;
if (map->base & 0x3fffffffULL)
size += one_pt;
Here, I believe we're accounting for any required level 2 tables (which are indexed by VA bits 29:21).
We seem to be missing code inside the loop that accounts for any required level 3 tables (indexed by VA bits 20:12).
The reason I didn't account for level 3 is that *usually* we shouldn't have those around. I guess the size estimation really could use some more thought...
As you mention, it's unlikely we'd need level 3 in practice.
However, we should still either account for it now, or explicitly fail if the code for that isn't written yet. I'd like at least an assert()/panic()/... somewhere (here seems best?) if the mem_map[] entries are not 2MB-aligned.
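A minimal sketch of the kind of guard being asked for (where to put it and how to word it is only an assumption, e.g. at the top of get_page_table_size()):

	for (i = 0; i < ARRAY_SIZE(mem_map); i++) {
		struct mm_region *map = &mem_map[i];

		/* 2MB is the level 2 block size; reject anything finer for now */
		if ((map->base | map->size) & 0x1fffffULL)
			panic("mem_map[%d] not 2MB aligned (base=0x%llx size=0x%llx)",
			      i, map->base, map->size);
	}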

On 22.02.16 19:45, Stephen Warren wrote:
On 02/22/2016 11:37 AM, Alexander Graf wrote:
On Feb 22, 2016, at 7:18 PM, Stephen Warren swarren@wwwdotorg.org wrote:
On 02/21/2016 06:57 PM, Alexander Graf wrote:
The idea to generate our pages tables from an array of memory ranges is very sound. However, instead of hard coding the code to create up to 2 levels of 64k granule page tables, we really should just create normal 4k page tables that allow us to set caching attributes on 2M or 4k level later on.
So this patch moves the full_va mapping code to 4k page size and makes it fully flexible to dynamically create as many levels as necessary for a map (including dynamic 1G/2M pages). It also adds support to dynamically split a large map into smaller ones when some code wants to set dcache attributes.
With all this in place, there is very little reason to create your own page tables in board specific files.
diff --git a/arch/arm/cpu/armv8/cache_v8.c b/arch/arm/cpu/armv8/cache_v8.c
+/* Creates a new full table (512 entries) and sets *pte to refer to it */ +static u64 *create_table(void)
I think that comment is stale (there's no *pte; I assume it should say "returns").
Oops, yes. I split the pte setting into a separate function and forgot to update the comment. Nice catch.
+/* Returns the estimated required size of all page tables */ +u64 get_page_table_size(void) +{
- int i;
- u64 one_pt = MAX_PTE_ENTRIES * sizeof(u64);
- u64 size = 0;
- /* root page table */
- size += one_pt;
Isn't the root page table level 0? So, that accounts for the page table that's indexed by VA bits 47:39.
Yes, or - if your va_bits < 39 it actually accounts for level 1 because the page table starts at Lv1.
- for (i = 0; i < ARRAY_SIZE(mem_map); i++) {
struct mm_region *map = &mem_map[i];
/* Account for Lv0 page tables */
size += one_pt * ((map->size >> 39) + 1);
So here, isn't the code accounting for level 1 tables, not level 0 tables? That is, the tables indexed by VA bits 38:30.
(As an aside for myself when I come back and read this later, the shift is 39, since we need to allocate as many tables as there are values for bits 39 and above).
I definitely should use the level2shift helper here, yes.
/* 1GB aligned pages fit already, so count the others */
if (map->size & 0x3fffffffULL)
size += one_pt;
if (map->base & 0x3fffffffULL)
size += one_pt;
Here, I believe we're accounting for any required level 2 tables (which are indexed by VA bits 29:21).
We seem to be missing code inside the loop that accounts for any required level 3 tables (indexed by VA bits 20:12).
The reason I didn't account for level 3 is that *usually* we shouldn't have those around. I guess the size estimation really could use some more thought...
As you mention, it's unlikely we'd need level 3 in practice.
However, we should still either account for it now, or explicitly fail if the code for that isn't written yet. I'd like at least an assert()/panic()/... somewhere (here seems best?) if the mem_map[] entries are not 2MB-aligned.
I've reworked the code to properly account for 4k pages. I guess that also solves it? :)
Alex

On 02/21/2016 06:57 PM, Alexander Graf wrote:
The idea to generate our pages tables from an array of memory ranges is very sound. However, instead of hard coding the code to create up to 2 levels of 64k granule page tables, we really should just create normal 4k page tables that allow us to set caching attributes on 2M or 4k level later on.
So this patch moves the full_va mapping code to 4k page size and makes it fully flexible to dynamically create as many levels as necessary for a map (including dynamic 1G/2M pages). It also adds support to dynamically split a large map into smaller ones when some code wants to set dcache attributes.
With all this in place, there is very little reason to create your own page tables in board specific files.
+/* Returns the estimated required size of all page tables */ +u64 get_page_table_size(void) +{
- int i;
- u64 one_pt = MAX_PTE_ENTRIES * sizeof(u64);
- u64 size = 0;
- /* root page table */
- size += one_pt;
- for (i = 0; i < ARRAY_SIZE(mem_map); i++) {
struct mm_region *map = &mem_map[i];
/* Account for Lv0 page tables */
size += one_pt * ((map->size >> 39) + 1);
/* 1GB aligned pages fit already, so count the others */
if (map->size & 0x3fffffffULL)
size += one_pt;
if (map->base & 0x3fffffffULL)
size += one_pt;
One more comment here: The two conditions should check start and end, not start and size. The reason is that for a region with aligned size but unaligned start, the end is not aligned and hence needs a next-level page table allocated to correctly represent that. That won't be accounted for in the current code, but would be if end (==start+size) was used instead.
- }
- /* Assume we may have to split up to 4 more page tables off */
- size += one_pt * 4;
- return size;
+}
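A concrete case of the start/end point above: base = 0x00200000 (2MB aligned, not 1GB aligned), size = 0x40000000 (exactly 1GB). The size test adds nothing and the base test adds one table, yet both the first partial gigabyte and the last one need their own next-level table, because the end (0x40200000) is unaligned as well. One way to express the suggested fix:

	if (map->base & 0x3fffffffULL)
		size += one_pt;
	if ((map->base + map->size) & 0x3fffffffULL)
		size += one_pt;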

Hi Alex,
On 21 February 2016 at 18:57, Alexander Graf agraf@suse.de wrote:
The idea to generate our pages tables from an array of memory ranges is very sound. However, instead of hard coding the code to create up to 2 levels of 64k granule page tables, we really should just create normal 4k page tables that allow us to set caching attributes on 2M or 4k level later on.
So this patch moves the full_va mapping code to 4k page size and makes it fully flexible to dynamically create as many levels as necessary for a map (including dynamic 1G/2M pages). It also adds support to dynamically split a large map into smaller ones when some code wants to set dcache attributes.
With all this in place, there is very little reason to create your own page tables in board specific files.
Signed-off-by: Alexander Graf agraf@suse.de
 arch/arm/cpu/armv8/cache_v8.c      | 346 +++++++++++++++++++++++++++++++------
 arch/arm/include/asm/armv8/mmu.h   |  68 ++++----
 arch/arm/include/asm/global_data.h |   4 +-
 arch/arm/include/asm/system.h      |   3 +-
 include/configs/thunderx_88xx.h    |  14 +-
 5 files changed, 332 insertions(+), 103 deletions(-)
Should the change to the thunderx file go in a separate patch?
diff --git a/arch/arm/cpu/armv8/cache_v8.c b/arch/arm/cpu/armv8/cache_v8.c index 9229532..4369a83 100644 --- a/arch/arm/cpu/armv8/cache_v8.c +++ b/arch/arm/cpu/armv8/cache_v8.c @@ -2,6 +2,9 @@
- (C) Copyright 2013
- David Feng fenghua@phytium.com.cn
- (C) Copyright 2016
- Alexander Graf agraf@suse.de
*/
- SPDX-License-Identifier: GPL-2.0+
@@ -9,35 +12,40 @@ #include <asm/system.h> #include <asm/armv8/mmu.h>
-DECLARE_GLOBAL_DATA_PTR;
-#ifndef CONFIG_SYS_DCACHE_OFF +/* #define DEBUG_MMU */
-#ifdef CONFIG_SYS_FULL_VA -static void set_ptl1_entry(u64 index, u64 ptl2_entry) -{
u64 *pgd = (u64 *)gd->arch.tlb_addr;
u64 value;
+#ifdef DEBUG_MMU +#define DPRINTF(a, ...) printf("%s:%d: " a, __func__, __LINE__, __VA_ARGS__) +#else +#define DPRINTF(a, ...) do { } while(0) +#endif
Can you use the normal DEBUG and debug()?
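For reference, the usual pattern would be to drop DPRINTF and rely on U-Boot's standard debug() from <common.h>, which only prints when DEBUG is defined before the include:

	#define DEBUG	/* enable while debugging this file */
	#include <common.h>
	...
	debug("%s: addr=%llx level=%d\n", __func__, addr, level);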
value = ptl2_entry | PTL1_TYPE_TABLE;
pgd[index] = value;
-} +DECLARE_GLOBAL_DATA_PTR;
-static void set_ptl2_block(u64 ptl1, u64 bfn, u64 address, u64 memory_attrs) -{
u64 *pmd = (u64 *)ptl1;
u64 value;
+#ifndef CONFIG_SYS_DCACHE_OFF
value = address | PTL2_TYPE_BLOCK | PTL2_BLOCK_AF;
value |= memory_attrs;
pmd[bfn] = value;
-} +/*
- With 4k page granule, a virtual address is split into 4 lookup parts
- spanning 9 bits each:
- | | | | | | |
- | 0 | Lv0 | Lv1 | Lv2 | Lv3 | off |
- |_______|_______|_______|_______|_______|_______|
63-48 47-39 38-30 29-21 20-12 11-00
mask page size
- Lv0: FF8000000000 --
- Lv1: 7FC0000000 1G
- Lv2: 3FE00000 2M
- Lv3: 1FF000 4K
- off: FFF
- */
+#ifdef CONFIG_SYS_FULL_VA static struct mm_region mem_map[] = CONFIG_SYS_MEM_MAP;
I am not keen on the idea of using a big #define table on these boards. Is there not a device-tree binding for this that we can use? It is just a data table. We are moving to Kconfig and eventually want to drop the config files.
-#define PTL1_ENTRIES CONFIG_SYS_PTL1_ENTRIES -#define PTL2_ENTRIES CONFIG_SYS_PTL2_ENTRIES
static u64 get_tcr(int el, u64 *pips, u64 *pva_bits) { u64 max_addr = 0; @@ -79,8 +87,8 @@ static u64 get_tcr(int el, u64 *pips, u64 *pva_bits) }
/* PTWs cacheable, inner/outer WBWA and inner shareable */
tcr |= TCR_TG0_64K | TCR_SHARED_INNER | TCR_ORGN_WBWA | TCR_IRGN_WBWA;
tcr |= TCR_T0SZ(VA_BITS);
tcr |= TCR_TG0_4K | TCR_SHARED_INNER | TCR_ORGN_WBWA | TCR_IRGN_WBWA;
tcr |= TCR_T0SZ(va_bits); if (pips) *pips = ips;
@@ -90,39 +98,196 @@ static u64 get_tcr(int el, u64 *pips, u64 *pva_bits) return tcr; }
-static void setup_pgtables(void) +#define MAX_PTE_ENTRIES 512
+static int pte_type(u64 *pte) +{
return *pte & PTE_TYPE_MASK;
+}
+/* Returns the LSB number for a PTE on level <level> */ +static int level2shift(int level) {
int l1_e, l2_e;
unsigned long pmd = 0;
unsigned long address;
/* Setup the PMD pointers */
for (l1_e = 0; l1_e < CONFIG_SYS_MEM_MAP_SIZE; l1_e++) {
gd->arch.pmd_addr[l1_e] = gd->arch.tlb_addr +
PTL1_ENTRIES * sizeof(u64);
gd->arch.pmd_addr[l1_e] += PTL2_ENTRIES * sizeof(u64) * l1_e;
gd->arch.pmd_addr[l1_e] = ALIGN(gd->arch.pmd_addr[l1_e],
0x10000UL);
/* Page is 12 bits wide, every level translates 9 bits */
return (12 + 9 * (3 - level));
+}
+static u64 *find_pte(u64 addr, int level) +{
int start_level = 0;
u64 *pte;
u64 idx;
u64 va_bits;
int i;
DPRINTF("addr=%llx level=%d\n", addr, level);
get_tcr(0, NULL, &va_bits);
if (va_bits < 39)
start_level = 1;
if (level < start_level)
return NULL;
/* Walk through all page table levels to find our PTE */
pte = (u64*)gd->arch.tlb_addr;
for (i = start_level; i < 4; i++) {
idx = (addr >> level2shift(i)) & 0x1FF;
pte += idx;
DPRINTF("idx=%llx PTE %p at level %d: %llx\n", idx, pte, i, *pte);
/* Found it */
if (i == level)
return pte;
/* PTE is no table (either invalid or block), can't traverse */
if (pte_type(pte) != PTE_TYPE_TABLE)
return NULL;
/* Off to the next level */
pte = (u64*)(*pte & 0x0000fffffffff000ULL); }
/* Setup the page tables */
for (l1_e = 0; l1_e < PTL1_ENTRIES; l1_e++) {
if (mem_map[pmd].base ==
(uintptr_t)l1_e << PTL2_BITS) {
set_ptl1_entry(l1_e, gd->arch.pmd_addr[pmd]);
for (l2_e = 0; l2_e < PTL2_ENTRIES; l2_e++) {
address = mem_map[pmd].base
+ (uintptr_t)l2_e * BLOCK_SIZE;
set_ptl2_block(gd->arch.pmd_addr[pmd], l2_e,
address, mem_map[pmd].attrs);
}
/* Should never reach here */
return NULL;
+}
+/* Creates a new full table (512 entries) and sets *pte to refer to it */ +static u64 *create_table(void) +{
u64 *new_table = (u64*)gd->arch.tlb_fillptr;
u64 pt_len = MAX_PTE_ENTRIES * sizeof(u64);
/* Allocate MAX_PTE_ENTRIES pte entries */
gd->arch.tlb_fillptr += pt_len;
if (gd->arch.tlb_fillptr - gd->arch.tlb_addr > gd->arch.tlb_size)
panic("Insufficient RAM for page table: 0x%lx > 0x%lx",
gd->arch.tlb_fillptr - gd->arch.tlb_addr,
gd->arch.tlb_size);
For each of these panic() calls can you please add a comment as to what the user should do? It needs to be very clear what action should be taken to resolve the problem.
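For example, something along these lines (the wording is only a suggestion):

	/*
	 * A panic here means the memory map needs more page tables than
	 * get_page_table_size() reserved; grow the estimate there (or align
	 * the mem_map entries more coarsely).
	 */
	if (gd->arch.tlb_fillptr - gd->arch.tlb_addr > gd->arch.tlb_size)
		panic("Insufficient RAM for page table: 0x%lx > 0x%lx",
		      gd->arch.tlb_fillptr - gd->arch.tlb_addr,
		      gd->arch.tlb_size);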
/* Mark all entries as invalid */
memset(new_table, 0, pt_len);
pmd++;
} else {
set_ptl1_entry(l1_e, 0);
return new_table;
+}
+static void set_pte_table(u64 *pte, u64 *table) +{
/* Point *pte to the new table */
DPRINTF("Setting %p to addr=%p\n", pte, table);
*pte = PTE_TYPE_TABLE | (ulong)table;
+}
+/* Add one mm_region map entry to the page tables */ +static void add_map(struct mm_region *map) +{
u64 *pte;
u64 addr = map->base;
u64 size = map->size;
u64 attrs = map->attrs | PTE_TYPE_BLOCK | PTE_BLOCK_AF;
u64 blocksize;
int level;
u64 *new_table;
while (size) {
pte = find_pte(addr, 0);
if (pte && (pte_type(pte) == PTE_TYPE_FAULT)) {
DPRINTF("Creating table for addr 0x%llx\n", addr);
new_table = create_table();
set_pte_table(pte, new_table); }
for (level = 1; level < 4; level++) {
pte = find_pte(addr, level);
blocksize = 1ULL << level2shift(level);
DPRINTF("Checking if pte fits for addr=%llx size=%llx "
"blocksize=%llx\n", addr, size, blocksize);
if (size >= blocksize && !(addr & (blocksize - 1))) {
/* Page fits, create block PTE */
DPRINTF("Setting PTE %p to block addr=%llx\n",
pte, addr);
*pte = addr | attrs;
addr += blocksize;
size -= blocksize;
break;
} else if ((pte_type(pte) == PTE_TYPE_FAULT)) {
/* Page doesn't fit, create subpages */
DPRINTF("Creating subtable for addr 0x%llx "
"blksize=%llx\n", addr, blocksize);
new_table = create_table();
set_pte_table(pte, new_table);
}
}
}
+}
+/* Splits a block PTE into table with subpages spanning the old block */ +static void split_block(u64 *pte, int level) +{
u64 old_pte = *pte;
u64 *new_table;
u64 i = 0;
/* level describes the parent level, we need the child ones */
int levelshift = level2shift(level + 1);
if (pte_type(pte) != PTE_TYPE_BLOCK)
panic("PTE %p (%llx) is not a block", pte, old_pte);
new_table = create_table();
DPRINTF("Splitting pte %p (%llx) into %p\n", pte, old_pte, new_table);
for (i = 0; i < MAX_PTE_ENTRIES; i++) {
new_table[i] = old_pte | (i << levelshift);
DPRINTF("Setting new_table[%lld] = %llx\n", i, new_table[i]); }
/* Set the new table into effect */
set_pte_table(pte, new_table);
+}
+/* Returns the estimated required size of all page tables */ +u64 get_page_table_size(void) +{
int i;
u64 one_pt = MAX_PTE_ENTRIES * sizeof(u64);
u64 size = 0;
/* root page table */
size += one_pt;
for (i = 0; i < ARRAY_SIZE(mem_map); i++) {
struct mm_region *map = &mem_map[i];
/* Account for Lv0 page tables */
size += one_pt * ((map->size >> 39) + 1);
/* 1GB aligned pages fit already, so count the others */
if (map->size & 0x3fffffffULL)
size += one_pt;
if (map->base & 0x3fffffffULL)
size += one_pt;
}
/* Assume we may have to split up to 4 more page tables off */
size += one_pt * 4;
I suspect this is a better idea than just allocating a fixed size for the whole table (like 1MB). But the error you get when this fails should point to here so people know how to fix it.
return size;
+}
+static void setup_pgtables(void) +{
int i;
/*
* Allocate the first level we're on with invalidate entries.
* If the starting level is 0 (va_bits >= 39), then this is our
* Lv0 page table, otherwise it's the entry Lv1 page table.
*/
gd->arch.tlb_fillptr = gd->arch.tlb_addr;
create_table();
/* Now add all MMU table entries one after another to the table */
for (i = 0; i < ARRAY_SIZE(mem_map); i++)
add_map(&mem_map[i]);
}
#else @@ -157,10 +322,8 @@ __weak void mmu_setup(void) int el;
#ifdef CONFIG_SYS_FULL_VA
unsigned long coreid = read_mpidr() & CONFIG_COREID_MASK;
/* Set up page tables only on BSP */
if (coreid == BSP_COREID)
/* Set up page tables only once */
if (!gd->arch.tlb_fillptr) setup_pgtables(); el = current_el();
@@ -311,6 +474,79 @@ void mmu_set_region_dcache_behaviour(phys_addr_t start, size_t size, flush_dcache_range(start, end); asm volatile("dsb sy"); } +#else +static bool is_aligned(u64 addr, u64 size, u64 align) +{
return !(addr & (align - 1)) && !(size & (align - 1));
+}
+static u64 set_one_region(u64 start, u64 size, u64 attrs, int level) +{
int levelshift = level2shift(level);
u64 levelsize = 1ULL << levelshift;
u64 *pte = find_pte(start, level);
/* Can we can just modify the current level block PTE? */
if (is_aligned(start, size, levelsize)) {
*pte &= ~PMD_ATTRINDX_MASK;
*pte |= attrs;
DPRINTF("Set attrs=%llx pte=%p level=%d\n", attrs, pte, level);
return levelsize;
}
/* Unaligned or doesn't fit, maybe split block into table */
DPRINTF("addr=%llx level=%d pte=%p (%llx)\n", start, level, pte, *pte);
/* Maybe we need to split the block into a table */
if (pte_type(pte) == PTE_TYPE_BLOCK)
split_block(pte, level);
/* And then double-check it became a table or already is one */
if (pte_type(pte) != PTE_TYPE_TABLE)
panic("PTE %p (%llx) for addr=%llx should be a table",
pte, *pte, start);
/* Roll on to the next page table level */
return 0;
+}
+void mmu_set_region_dcache_behaviour(phys_addr_t start, size_t size,
enum dcache_option option)
+{
u64 attrs = PMD_ATTRINDX(option);
u64 real_start = start;
u64 real_size = size;
DPRINTF("start=%lx size=%lx\n", (ulong)start, (ulong)size);
/*
* Loop through the address range until we find a page granule that fits
* our alignment constraints, then set it to the new cache attributes
*/
while (size > 0) {
int level;
u64 r;
for (level = 1; level < 4; level++) {
r = set_one_region(start, size, attrs, level);
if (r) {
/* PTE successfully replaced */
size -= r;
start += r;
break;
}
}
}
asm volatile("dsb sy");
__asm_invalidate_tlb_all();
asm volatile("dsb sy");
asm volatile("isb");
flush_dcache_range(real_start, real_start + real_size);
asm volatile("dsb sy");
+} #endif
#else /* CONFIG_SYS_DCACHE_OFF */
[snip]
Regards, Simon

On 02/23/2016 06:17 AM, Simon Glass wrote:
Hi Alex,
On 21 February 2016 at 18:57, Alexander Graf agraf@suse.de wrote:
The idea to generate our pages tables from an array of memory ranges is very sound. However, instead of hard coding the code to create up to 2 levels of 64k granule page tables, we really should just create normal 4k page tables that allow us to set caching attributes on 2M or 4k level later on.
So this patch moves the full_va mapping code to 4k page size and makes it fully flexible to dynamically create as many levels as necessary for a map (including dynamic 1G/2M pages). It also adds support to dynamically split a large map into smaller ones when some code wants to set dcache attributes.
With all this in place, there is very little reason to create your own page tables in board specific files.
static struct mm_region mem_map[] = CONFIG_SYS_MEM_MAP;
I am not ken on the idea of using a big #define table on these boards. Is there not a device-tree binding for this that we can use? It is just a data table, We are moving to Kconfig and eventually want to drop the config files.
I would strongly object to making the MMU setup depend on device tree parsing. This is low-level system code that should be handled purely by simple standalone C code.
Having some CPU-/SoC-/board-specific code define the table, rather than having it be a #define, seems fine though, if the code in the current patch needs to change.

Hi Stephen,
On 23 February 2016 at 10:21, Stephen Warren swarren@wwwdotorg.org wrote:
On 02/23/2016 06:17 AM, Simon Glass wrote:
Hi Alex,
On 21 February 2016 at 18:57, Alexander Graf agraf@suse.de wrote:
The idea to generate our pages tables from an array of memory ranges is very sound. However, instead of hard coding the code to create up to 2 levels of 64k granule page tables, we really should just create normal 4k page tables that allow us to set caching attributes on 2M or 4k level later on.
So this patch moves the full_va mapping code to 4k page size and makes it fully flexible to dynamically create as many levels as necessary for a map (including dynamic 1G/2M pages). It also adds support to dynamically split a large map into smaller ones when some code wants to set dcache attributes.
With all this in place, there is very little reason to create your own page tables in board specific files.
static struct mm_region mem_map[] = CONFIG_SYS_MEM_MAP;
I am not ken on the idea of using a big #define table on these boards. Is there not a device-tree binding for this that we can use? It is just a data table, We are moving to Kconfig and eventually want to drop the config files.
I would strongly object to making the MMU setup depend on device tree parsing. This is low-level system code that should be handled purely by simple standalone C code.
Because...?
Having some CPU-/SoC-/board-specific code define the table, rather than having it be a #define, seems fine though, if the code in the current patch needs to change.
That seems OK. Perhaps a table in board-specific code and a way to add new entries on top. Bear in mind that this happens before relocation but quite late in the pre-reloc init sequence.
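A rough sketch of that shape (file and variable names here are hypothetical, not the API of this series, which still takes the table from CONFIG_SYS_MEM_MAP):

	/* board/<vendor>/<board>/board.c */
	static struct mm_region board_mem_map[] = {
		{ 0x00000000UL, 0x80000000UL,
		  PTE_BLOCK_MEMTYPE(MT_DEVICE_NGNRNE) | PTE_BLOCK_NON_SHARE },
		{ 0x80000000UL, 0x80000000UL,
		  PTE_BLOCK_MEMTYPE(MT_NORMAL) | PTE_BLOCK_INNER_SHARE },
	};

	/* the generic MMU code would then pick this up instead of a #define */
	struct mm_region *mem_map = board_mem_map;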
Regards, Simon

On 02/23/2016 10:30 AM, Simon Glass wrote:
Hi Stephen,
On 23 February 2016 at 10:21, Stephen Warren swarren@wwwdotorg.org wrote:
On 02/23/2016 06:17 AM, Simon Glass wrote:
Hi Alex,
On 21 February 2016 at 18:57, Alexander Graf agraf@suse.de wrote:
The idea to generate our pages tables from an array of memory ranges is very sound. However, instead of hard coding the code to create up to 2 levels of 64k granule page tables, we really should just create normal 4k page tables that allow us to set caching attributes on 2M or 4k level later on.
So this patch moves the full_va mapping code to 4k page size and makes it fully flexible to dynamically create as many levels as necessary for a map (including dynamic 1G/2M pages). It also adds support to dynamically split a large map into smaller ones when some code wants to set dcache attributes.
With all this in place, there is very little reason to create your own page tables in board specific files.
static struct mm_region mem_map[] = CONFIG_SYS_MEM_MAP;
I am not ken on the idea of using a big #define table on these boards. Is there not a device-tree binding for this that we can use? It is just a data table, We are moving to Kconfig and eventually want to drop the config files.
I would strongly object to making the MMU setup depend on device tree parsing. This is low-level system code that should be handled purely by simple standalone C code.
Because...?
There is literally zero benefit from putting the exact same content into DT, and hence having to run significantly more code to parse DT and get back exactly the same hard-coded table.
DT is not a goal in-and-of-itself. In some cases there are benefits to placing configuration data outside a binary, and in those cases DT is an acceptable mechanism to do that. However, any benefit from doing so derives from arguments for separating the data out of the code, not because "use DT" is itself a benefit.

Hi Stephen,
On 23 February 2016 at 10:40, Stephen Warren swarren@wwwdotorg.org wrote:
On 02/23/2016 10:30 AM, Simon Glass wrote:
Hi Stephen,
On 23 February 2016 at 10:21, Stephen Warren swarren@wwwdotorg.org wrote:
On 02/23/2016 06:17 AM, Simon Glass wrote:
Hi Alex,
On 21 February 2016 at 18:57, Alexander Graf agraf@suse.de wrote:
The idea to generate our pages tables from an array of memory ranges is very sound. However, instead of hard coding the code to create up to 2 levels of 64k granule page tables, we really should just create normal 4k page tables that allow us to set caching attributes on 2M or 4k level later on.
So this patch moves the full_va mapping code to 4k page size and makes it fully flexible to dynamically create as many levels as necessary for a map (including dynamic 1G/2M pages). It also adds support to dynamically split a large map into smaller ones when some code wants to set dcache attributes.
With all this in place, there is very little reason to create your own page tables in board specific files.
static struct mm_region mem_map[] = CONFIG_SYS_MEM_MAP;
I am not ken on the idea of using a big #define table on these boards. Is there not a device-tree binding for this that we can use? It is just a data table, We are moving to Kconfig and eventually want to drop the config files.
I would strongly object to making the MMU setup depend on device tree parsing. This is low-level system code that should be handled purely by simple standalone C code.
Because...?
There is literally zero benefit from putting the exact same content into DT, and hence having to run significantly more code to parse DT and get back exactly the same hard-coded table.
We do this so that board-specific variations can be described in one place. In the board-specific case, there are benefits.
DT is not a goal in-and-of-itself. In some cases there are benefits to placing configuration data outside a binary, and in those cases DT is an acceptable mechanism to do that. However, any benefit from doing so derives from arguments for separating the data out of the code, not because "use DT" is itself a benefit.
That's fine as far as it goes.
The config file is not an acceptable means of providing per-board or per-arch configuration. If it is arch-specific and/or SoC-specific, but NOT board-specific then we can have it in a C table in a source file (not the config header) that is built into the binary. If it is board-specific, it must use the device tree.
What category are we talking about here? Unfortunately it's not entirely clear from the patches and I lack the knowledge/background to figure it out.
Regards, Simon

On 02/23/2016 01:00 PM, Simon Glass wrote:
Hi Stephen,
On 23 February 2016 at 10:40, Stephen Warren swarren@wwwdotorg.org wrote:
On 02/23/2016 10:30 AM, Simon Glass wrote:
Hi Stephen,
On 23 February 2016 at 10:21, Stephen Warren swarren@wwwdotorg.org wrote:
On 02/23/2016 06:17 AM, Simon Glass wrote:
Hi Alex,
On 21 February 2016 at 18:57, Alexander Graf agraf@suse.de wrote:
The idea to generate our page tables from an array of memory ranges is very sound. However, instead of hard coding the code to create up to 2 levels of 64k granule page tables, we really should just create normal 4k page tables that allow us to set caching attributes on 2M or 4k level later on.
So this patch moves the full_va mapping code to 4k page size and makes it fully flexible to dynamically create as many levels as necessary for a map (including dynamic 1G/2M pages). It also adds support to dynamically split a large map into smaller ones when some code wants to set dcache attributes.
With all this in place, there is very little reason to create your own page tables in board specific files.
static struct mm_region mem_map[] = CONFIG_SYS_MEM_MAP;
I am not keen on the idea of using a big #define table on these boards. Is there not a device-tree binding for this that we can use? It is just a data table. We are moving to Kconfig and eventually want to drop the config files.
I would strongly object to making the MMU setup depend on device tree parsing. This is low-level system code that should be handled purely by simple standalone C code.
Because...?
There is literally zero benefit from putting the exact same content into DT, and hence having to run significantly more code to parse DT and get back exactly the same hard-coded table.
We do this so that board-specific variations can be described in one place. In the board-specific case, there are benefits.
I'd like to see an explicit enumeration of the benefits; I'm not aware of any (either benefits, or such an enumeration). Board-specific data can just as easily (actually, more easily due to lack of need for parsing code) be stored in C data structures vs. stored in DT.
Or put another way, the simple fact that some data is board-specific does not in-and-of-itself mean there's a benefit to putting it into DT. To move something into DT, we should be able to enumerate some other benefit, such as:
- Speeds up boot time.
- Allows code to be simpler.
- Simplifies editing the data.
(Note that I don't believe any of those example potential benefits are actually true, but in fact are the opposite of the truth).
DT is not a goal in-and-of-itself. In some cases there are benefits to placing configuration data outside a binary, and in those cases DT is an acceptable mechanism to do that. However, any benefit from doing so derives from arguments for separating the data out of the code, not because "use DT" is itself a benefit.
That's fine as far as it goes.
The config file is not an acceptable means of providing per-board or per-arch configuration. If it is arch-specific and/or SoC-specific, but NOT board-specific then we can have it in a C table in a source file (not the config header) that is built into the binary. If it is board-specific, it must use the device tree.
What category are we talking about here? Unfortunately it's not entirely clear from the patches and I lack the knowledge/background to figure it out.
I expect this data is SoC-specific. At least for Tegra in the codebase, that's certainly true. I believe it's true for other SoCs in the current codebase too. I don't expect this to change going forward, at the very least for Tegra.

Hi Stephen,
On 23 February 2016 at 13:33, Stephen Warren swarren@wwwdotorg.org wrote:
On 02/23/2016 01:00 PM, Simon Glass wrote:
Hi Stephen,
On 23 February 2016 at 10:40, Stephen Warren swarren@wwwdotorg.org wrote:
On 02/23/2016 10:30 AM, Simon Glass wrote:
Hi Stephen,
On 23 February 2016 at 10:21, Stephen Warren swarren@wwwdotorg.org wrote:
On 02/23/2016 06:17 AM, Simon Glass wrote:
Hi Alex,
On 21 February 2016 at 18:57, Alexander Graf agraf@suse.de wrote:
The idea to generate our page tables from an array of memory ranges is very sound. However, instead of hard coding the code to create up to 2 levels of 64k granule page tables, we really should just create normal 4k page tables that allow us to set caching attributes on 2M or 4k level later on.
So this patch moves the full_va mapping code to 4k page size and makes it fully flexible to dynamically create as many levels as necessary for a map (including dynamic 1G/2M pages). It also adds support to dynamically split a large map into smaller ones when some code wants to set dcache attributes.
With all this in place, there is very little reason to create your own page tables in board specific files.
static struct mm_region mem_map[] = CONFIG_SYS_MEM_MAP;
I am not keen on the idea of using a big #define table on these boards. Is there not a device-tree binding for this that we can use? It is just a data table. We are moving to Kconfig and eventually want to drop the config files.
I would strongly object to making the MMU setup depend on device tree parsing. This is low-level system code that should be handled purely by simple standalone C code.
Because...?
There is literally zero benefit from putting the exact same content into DT, and hence having to run significantly more code to parse DT and get back exactly the same hard-coded table.
We do this so that board-specific variations can be described in one place. In the board-specific case, there are benefits.
I'd like to see an explicit enumeration of the benefits; I'm not aware of any (either benefits, or such an enumeration). Board-specific data can just as easily (actually, more easily due to lack of need for parsing code) be stored in C data structures vs. stored in DT.
Or put another way, the simple fact that some data is board-specific does not in-and-of-itself mean there's a benefit to putting it into DT. To move something into DT, we should be able to enumerate some other benefit, such as:
- Speeds up boot time.
- Allows code to be simpler.
- Simplifies editing the data.
(Note that I don't believe any of those example potential benefits are actually true, but in fact are the opposite of the truth).
Didn't this get discussed to death on the Linux mailing list, with the result that platform data was abolished in favour of device tree? From my perspective:
- the relevant configuration is mostly in one place
- we can share it with Linux
- it is easier to maintain a few text files than dispersed platform data
- it permits easy run-time configuration, avoiding the need for multiple builds for trivial differences
- it converts to platform data fairly easily at run-time, so most of the code can still deal with that
- it is easy to have base SoC data that is expanded/overridden by board data
- the configuration can be listed and queried easily, by U-Boot at run-time, or by build systems
- device tree is a well-understood format with robust tools
I suspect others have done a much more thoughtful and persuasive analysis.
If you want to pick up on these points I suggest starting a new thread!
DT is not a goal in-and-of-itself. In some cases there are benefits to placing configuration data outside a binary, and in those cases DT is an acceptable mechanism to do that. However, any benefit from doing so derives from arguments for separating the data out of the code, not because "use DT" is itself a benefit.
That's fine as far as it goes.
The config file is not an acceptable means of providing per-board or per-arch configuration. If it is arch-specific and/or SoC-specific, but NOT board-specific then we can have it in a C table in a source file (not the config header) that is built into the binary. If it is board-specific, it must use the device tree.
What category are we talking about here? Unfortunately it's not entirely clear from the patches and I lack the knowledge/background to figure it out.
I expect this data is SoC-specific. At least for Tegra in the codebase, that's certainly true. I believe it's true for other SoCs in the current codebase too. I don't expect this to change going forward, at the very least for Tegra.
In that case a C structure seems fine to me.
Regards, Simon

On 02/23/2016 09:42 PM, Simon Glass wrote:
Hi Stephen,
On 23 February 2016 at 13:33, Stephen Warren swarren@wwwdotorg.org wrote:
On 02/23/2016 01:00 PM, Simon Glass wrote:
Hi Stephen,
On 23 February 2016 at 10:40, Stephen Warren swarren@wwwdotorg.org wrote:
On 02/23/2016 10:30 AM, Simon Glass wrote:
Hi Stephen,
On 23 February 2016 at 10:21, Stephen Warren swarren@wwwdotorg.org wrote:
On 02/23/2016 06:17 AM, Simon Glass wrote:
Hi Alex,
On 21 February 2016 at 18:57, Alexander Graf agraf@suse.de wrote:
The idea to generate our page tables from an array of memory ranges is very sound. However, instead of hard coding the code to create up to 2 levels of 64k granule page tables, we really should just create normal 4k page tables that allow us to set caching attributes on 2M or 4k level later on.
So this patch moves the full_va mapping code to 4k page size and makes it fully flexible to dynamically create as many levels as necessary for a map (including dynamic 1G/2M pages). It also adds support to dynamically split a large map into smaller ones when some code wants to set dcache attributes.
With all this in place, there is very little reason to create your own page tables in board specific files.
static struct mm_region mem_map[] = CONFIG_SYS_MEM_MAP;
I am not keen on the idea of using a big #define table on these boards. Is there not a device-tree binding for this that we can use? It is just a data table. We are moving to Kconfig and eventually want to drop the config files.
I would strongly object to making the MMU setup depend on device tree parsing. This is low-level system code that should be handled purely by simple standalone C code.
Because...?
There is literally zero benefit from putting the exact same content into DT, and hence having to run significantly more code to parse DT and get back exactly the same hard-coded table.
We do this so that board-specific variations can be described in one place. In the board-specific case, there are benefits.
I'd like to see an explicit enumeration of the benefits; I'm not aware of any (either benefits, or such an enumeration). Board-specific data can just as easily (actually, more easily due to lack of need for parsing code) be stored in C data structures vs. stored in DT.
Or put another way, the simple fact that some data is board-specific does not in-and-of-itself mean there's a benefit to putting it into DT. To move something into DT, we should be able to enumerate some other benefit, such as:
- Speeds up boot time.
- Allows code to be simpler.
- Simplifies editing the data.
(Note that I don't believe any of those example potential benefits are actually true, but in fact are the opposite of the truth).
Didn't this get discussed to death on the Linux mailing list, with the result that platform data was abolished in favour of device tree? From my perspective:
I was not aware that decisions within the Linux kernel applied to U-Boot unilaterally. If that is true, there are many other decisions that should be carried over but aren't.
- the relevant configuration is mostly in one place
- we can share it with Linux
In some cases that is true. However, I find it extremely unlikely that the Linux kernel is going to be modified to parse its MMU configuration from DT, especially as Linux doesn't use the same VA layout that U-Boot does currently.
If your argument holds water for this specific case, then the DT binding for MMU configuration needs to be proposed and reviewed on the DT mailing list.
- it is easier to maintain a few text files than dispersed platform data
- it permits easy run-time configuration, avoiding the need for
multiple builds for trivial differences
- it converts to platform data fairly easily at run-time, so most of
the code can still deal with that
- it is easy to have base SoC data that is expanded/overridden by board data
- the configuration can be listed and queried easily, by U-Boot at
run-time, or by build systems
- device tree is a well-understood format with robust tools
I suspect others have done a much more thoughtful and persuasive analysis.
If you want to pick up on these points I suggest starting a new thread!
To the other points, I largely disagree with them. All of the points can be easily argued against, but since I've done so repeatedly in detail in the past I won't bother repeating myself yet again, except simply to mention this fact.

On 23.02.16 14:17, Simon Glass wrote:
Hi Alex,
On 21 February 2016 at 18:57, Alexander Graf agraf@suse.de wrote:
The idea to generate our page tables from an array of memory ranges is very sound. However, instead of hard coding the code to create up to 2 levels of 64k granule page tables, we really should just create normal 4k page tables that allow us to set caching attributes on 2M or 4k level later on.
So this patch moves the full_va mapping code to 4k page size and makes it fully flexible to dynamically create as many levels as necessary for a map (including dynamic 1G/2M pages). It also adds support to dynamically split a large map into smaller ones when some code wants to set dcache attributes.
With all this in place, there is very little reason to create your own page tables in board specific files.
Signed-off-by: Alexander Graf agraf@suse.de
arch/arm/cpu/armv8/cache_v8.c | 346 +++++++++++++++++++++++++++++++------ arch/arm/include/asm/armv8/mmu.h | 68 ++++---- arch/arm/include/asm/global_data.h | 4 +- arch/arm/include/asm/system.h | 3 +- include/configs/thunderx_88xx.h | 14 +- 5 files changed, 332 insertions(+), 103 deletions(-)
Should the change to the thunderx file go in a separate patch?
We're changing semantics for some defines from "this define acts in L1/L2 page table entries" to "this define is for level/block type PTEs". So it's tied to this patch :).
diff --git a/arch/arm/cpu/armv8/cache_v8.c b/arch/arm/cpu/armv8/cache_v8.c index 9229532..4369a83 100644 --- a/arch/arm/cpu/armv8/cache_v8.c +++ b/arch/arm/cpu/armv8/cache_v8.c @@ -2,6 +2,9 @@
- (C) Copyright 2013
- David Feng fenghua@phytium.com.cn
- (C) Copyright 2016
- Alexander Graf agraf@suse.de
*/
- SPDX-License-Identifier: GPL-2.0+
@@ -9,35 +12,40 @@ #include <asm/system.h> #include <asm/armv8/mmu.h>
-DECLARE_GLOBAL_DATA_PTR;
-#ifndef CONFIG_SYS_DCACHE_OFF +/* #define DEBUG_MMU */
-#ifdef CONFIG_SYS_FULL_VA -static void set_ptl1_entry(u64 index, u64 ptl2_entry) -{
u64 *pgd = (u64 *)gd->arch.tlb_addr;
u64 value;
+#ifdef DEBUG_MMU +#define DPRINTF(a, ...) printf("%s:%d: " a, __func__, __LINE__, __VA_ARGS__) +#else +#define DPRINTF(a, ...) do { } while(0) +#endif
Can you use the normal DEBUG and debug()?
Uh, I guess so, yeah.
value = ptl2_entry | PTL1_TYPE_TABLE;
pgd[index] = value;
-} +DECLARE_GLOBAL_DATA_PTR;
-static void set_ptl2_block(u64 ptl1, u64 bfn, u64 address, u64 memory_attrs) -{
u64 *pmd = (u64 *)ptl1;
u64 value;
+#ifndef CONFIG_SYS_DCACHE_OFF
value = address | PTL2_TYPE_BLOCK | PTL2_BLOCK_AF;
value |= memory_attrs;
pmd[bfn] = value;
-} +/*
 * With 4k page granule, a virtual address is split into 4 lookup parts
 * spanning 9 bits each:
 *
 *   |       |       |       |       |       |       |
 *   |   0   |  Lv0  |  Lv1  |  Lv2  |  Lv3  |  off  |
 *   |_______|_______|_______|_______|_______|_______|
 *     63-48   47-39   38-30   29-21   20-12   11-00
 *
 *            mask        page size
 *
 * Lv0: FF8000000000       --
 * Lv1:   7FC0000000       1G
 * Lv2:     3FE00000       2M
 * Lv3:       1FF000       4K
 * off:          FFF
 */
+#ifdef CONFIG_SYS_FULL_VA static struct mm_region mem_map[] = CONFIG_SYS_MEM_MAP;
I am not keen on the idea of using a big #define table on these boards. Is there not a device-tree binding for this that we can use? It is just a data table. We are moving to Kconfig and eventually want to drop the config files.
I'll move this into board files which then can do whatever they like with it - take it from a #define in a header, populate it dynamically from device tree, do something halfway in between, whatever fits your needs best :).
Btw, if you want to use dt for it, you don't need to add any new bindings at all. Simply take all of your device regs/ranges and memory ranges and gobble them into the memory map.
-#define PTL1_ENTRIES CONFIG_SYS_PTL1_ENTRIES -#define PTL2_ENTRIES CONFIG_SYS_PTL2_ENTRIES
static u64 get_tcr(int el, u64 *pips, u64 *pva_bits) { u64 max_addr = 0; @@ -79,8 +87,8 @@ static u64 get_tcr(int el, u64 *pips, u64 *pva_bits) }
/* PTWs cacheable, inner/outer WBWA and inner shareable */
tcr |= TCR_TG0_64K | TCR_SHARED_INNER | TCR_ORGN_WBWA | TCR_IRGN_WBWA;
tcr |= TCR_T0SZ(VA_BITS);
tcr |= TCR_TG0_4K | TCR_SHARED_INNER | TCR_ORGN_WBWA | TCR_IRGN_WBWA;
tcr |= TCR_T0SZ(va_bits); if (pips) *pips = ips;
@@ -90,39 +98,196 @@ static u64 get_tcr(int el, u64 *pips, u64 *pva_bits) return tcr; }
-static void setup_pgtables(void) +#define MAX_PTE_ENTRIES 512
+static int pte_type(u64 *pte) +{
return *pte & PTE_TYPE_MASK;
+}
+/* Returns the LSB number for a PTE on level <level> */ +static int level2shift(int level) {
int l1_e, l2_e;
unsigned long pmd = 0;
unsigned long address;
/* Setup the PMD pointers */
for (l1_e = 0; l1_e < CONFIG_SYS_MEM_MAP_SIZE; l1_e++) {
gd->arch.pmd_addr[l1_e] = gd->arch.tlb_addr +
PTL1_ENTRIES * sizeof(u64);
gd->arch.pmd_addr[l1_e] += PTL2_ENTRIES * sizeof(u64) * l1_e;
gd->arch.pmd_addr[l1_e] = ALIGN(gd->arch.pmd_addr[l1_e],
0x10000UL);
/* Page is 12 bits wide, every level translates 9 bits */
return (12 + 9 * (3 - level));
+}
+static u64 *find_pte(u64 addr, int level) +{
int start_level = 0;
u64 *pte;
u64 idx;
u64 va_bits;
int i;
DPRINTF("addr=%llx level=%d\n", addr, level);
get_tcr(0, NULL, &va_bits);
if (va_bits < 39)
start_level = 1;
if (level < start_level)
return NULL;
/* Walk through all page table levels to find our PTE */
pte = (u64*)gd->arch.tlb_addr;
for (i = start_level; i < 4; i++) {
idx = (addr >> level2shift(i)) & 0x1FF;
pte += idx;
DPRINTF("idx=%llx PTE %p at level %d: %llx\n", idx, pte, i, *pte);
/* Found it */
if (i == level)
return pte;
/* PTE is no table (either invalid or block), can't traverse */
if (pte_type(pte) != PTE_TYPE_TABLE)
return NULL;
/* Off to the next level */
pte = (u64*)(*pte & 0x0000fffffffff000ULL); }
/* Setup the page tables */
for (l1_e = 0; l1_e < PTL1_ENTRIES; l1_e++) {
if (mem_map[pmd].base ==
(uintptr_t)l1_e << PTL2_BITS) {
set_ptl1_entry(l1_e, gd->arch.pmd_addr[pmd]);
for (l2_e = 0; l2_e < PTL2_ENTRIES; l2_e++) {
address = mem_map[pmd].base
+ (uintptr_t)l2_e * BLOCK_SIZE;
set_ptl2_block(gd->arch.pmd_addr[pmd], l2_e,
address, mem_map[pmd].attrs);
}
/* Should never reach here */
return NULL;
+}
+/* Creates a new full table (512 entries) and sets *pte to refer to it */ +static u64 *create_table(void) +{
u64 *new_table = (u64*)gd->arch.tlb_fillptr;
u64 pt_len = MAX_PTE_ENTRIES * sizeof(u64);
/* Allocate MAX_PTE_ENTRIES pte entries */
gd->arch.tlb_fillptr += pt_len;
if (gd->arch.tlb_fillptr - gd->arch.tlb_addr > gd->arch.tlb_size)
panic("Insufficient RAM for page table: 0x%lx > 0x%lx",
gd->arch.tlb_fillptr - gd->arch.tlb_addr,
gd->arch.tlb_size);
For each of these panic() calls can you please add a comment as to what the user should do? It needs to be very clear what action should be taken to resolve the problem.
Good idea. The only case where I can't think of a good text is at
if (pte_type(pte) != PTE_TYPE_TABLE) panic("PTE %p (%llx) for addr=%llx should be a table", pte, *pte, start);
because this really is more of an assert(). We should never reach it.
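For the other panic()s, e.g. the pool overflow one, something along these lines should work (the knob named in the message is only a placeholder here; it depends on where the pool size ends up being defined):

	if (gd->arch.tlb_fillptr - gd->arch.tlb_addr > gd->arch.tlb_size)
		/*
		 * The page table pool is too small for this memory map;
		 * tell the user which knob to turn.
		 */
		panic("Insufficient RAM for page table: 0x%lx > 0x%lx. "
		      "Please increase the page table pool size (e.g. the "
		      "board's PGTABLE_SIZE)",
		      gd->arch.tlb_fillptr - gd->arch.tlb_addr,
		      gd->arch.tlb_size);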
Thanks a bunch for the reviews,
Alex

On 02/24/2016 03:55 AM, Alexander Graf wrote:
On 23.02.16 14:17, Simon Glass wrote:
Hi Alex,
On 21 February 2016 at 18:57, Alexander Graf agraf@suse.de wrote:
The idea to generate our page tables from an array of memory ranges is very sound. However, instead of hard coding the code to create up to 2 levels of 64k granule page tables, we really should just create normal 4k page tables that allow us to set caching attributes on 2M or 4k level later on.
So this patch moves the full_va mapping code to 4k page size and makes it fully flexible to dynamically create as many levels as necessary for a map (including dynamic 1G/2M pages). It also adds support to dynamically split a large map into smaller ones when some code wants to set dcache attributes.
With all this in place, there is very little reason to create your own page tables in board specific files.
Signed-off-by: Alexander Graf agraf@suse.de
arch/arm/cpu/armv8/cache_v8.c | 346 +++++++++++++++++++++++++++++++------ arch/arm/include/asm/armv8/mmu.h | 68 ++++---- arch/arm/include/asm/global_data.h | 4 +- arch/arm/include/asm/system.h | 3 +- include/configs/thunderx_88xx.h | 14 +- 5 files changed, 332 insertions(+), 103 deletions(-)
Should the change to the thunderx file go in a separate patch?
We're changing semantics for some defines from "this define acts in L1/L2 page table entries" to "this define is for level/block type PTEs". So it's tied to this patch :).
diff --git a/arch/arm/cpu/armv8/cache_v8.c b/arch/arm/cpu/armv8/cache_v8.c index 9229532..4369a83 100644 --- a/arch/arm/cpu/armv8/cache_v8.c +++ b/arch/arm/cpu/armv8/cache_v8.c @@ -2,6 +2,9 @@
- (C) Copyright 2013
- David Feng fenghua@phytium.com.cn
- (C) Copyright 2016
- Alexander Graf agraf@suse.de
*/
- SPDX-License-Identifier: GPL-2.0+
@@ -9,35 +12,40 @@ #include <asm/system.h> #include <asm/armv8/mmu.h>
-DECLARE_GLOBAL_DATA_PTR;
-#ifndef CONFIG_SYS_DCACHE_OFF +/* #define DEBUG_MMU */
-#ifdef CONFIG_SYS_FULL_VA -static void set_ptl1_entry(u64 index, u64 ptl2_entry) -{
u64 *pgd = (u64 *)gd->arch.tlb_addr;
u64 value;
+#ifdef DEBUG_MMU +#define DPRINTF(a, ...) printf("%s:%d: " a, __func__, __LINE__, __VA_ARGS__) +#else +#define DPRINTF(a, ...) do { } while(0) +#endif
Can you use the normal DEBUG and debug()?
Uh, I guess so, yeah.
value = ptl2_entry | PTL1_TYPE_TABLE;
pgd[index] = value;
-} +DECLARE_GLOBAL_DATA_PTR;
-static void set_ptl2_block(u64 ptl1, u64 bfn, u64 address, u64 memory_attrs) -{
u64 *pmd = (u64 *)ptl1;
u64 value;
+#ifndef CONFIG_SYS_DCACHE_OFF
value = address | PTL2_TYPE_BLOCK | PTL2_BLOCK_AF;
value |= memory_attrs;
pmd[bfn] = value;
-} +/*
 * With 4k page granule, a virtual address is split into 4 lookup parts
 * spanning 9 bits each:
 *
 *   |       |       |       |       |       |       |
 *   |   0   |  Lv0  |  Lv1  |  Lv2  |  Lv3  |  off  |
 *   |_______|_______|_______|_______|_______|_______|
 *     63-48   47-39   38-30   29-21   20-12   11-00
 *
 *            mask        page size
 *
 * Lv0: FF8000000000       --
 * Lv1:   7FC0000000       1G
 * Lv2:     3FE00000       2M
 * Lv3:       1FF000       4K
 * off:          FFF
 */
+#ifdef CONFIG_SYS_FULL_VA static struct mm_region mem_map[] = CONFIG_SYS_MEM_MAP;
I am not keen on the idea of using a big #define table on these boards. Is there not a device-tree binding for this that we can use? It is just a data table. We are moving to Kconfig and eventually want to drop the config files.
I'll move this into board files which then can do whatever they like with it - take it from a #define in a header, populate it dynamically from device tree, do something halfway in between, whatever fits your needs best :).
Btw, if you want to use dt for it, you don't need to add any new bindings at all. Simply take all of your device regs/ranges and memory ranges and gobble them into the memory map.
That's only true if every single MMIO region that the code accesses is already represented in DT, or if any DT conversion patches also add the missing devices. It's quite unlikely that, even on heavily DT-dependent platforms, /all/ in-use MMIO regions are already represented in DT. Such a conversion would need quite significant testing.

On 02/24/2016 06:01 PM, Stephen Warren wrote:
On 02/24/2016 03:55 AM, Alexander Graf wrote:
On 23.02.16 14:17, Simon Glass wrote:
Hi Alex,
On 21 February 2016 at 18:57, Alexander Graf agraf@suse.de wrote:
The idea to generate our page tables from an array of memory ranges is very sound. However, instead of hard coding the code to create up to 2 levels of 64k granule page tables, we really should just create normal 4k page tables that allow us to set caching attributes on 2M or 4k level later on.
So this patch moves the full_va mapping code to 4k page size and makes it fully flexible to dynamically create as many levels as necessary for a map (including dynamic 1G/2M pages). It also adds support to dynamically split a large map into smaller ones when some code wants to set dcache attributes.
With all this in place, there is very little reason to create your own page tables in board specific files.
Signed-off-by: Alexander Graf agraf@suse.de
arch/arm/cpu/armv8/cache_v8.c | 346 +++++++++++++++++++++++++++++++------ arch/arm/include/asm/armv8/mmu.h | 68 ++++---- arch/arm/include/asm/global_data.h | 4 +- arch/arm/include/asm/system.h | 3 +- include/configs/thunderx_88xx.h | 14 +- 5 files changed, 332 insertions(+), 103 deletions(-)
Should the change to the thunderx file go in a separate patch?
We're changing semantics for some defines from "this define acts in L1/L2 page table entries" to "this define is for level/block type PTEs". So it's tied to this patch :).
diff --git a/arch/arm/cpu/armv8/cache_v8.c b/arch/arm/cpu/armv8/cache_v8.c index 9229532..4369a83 100644 --- a/arch/arm/cpu/armv8/cache_v8.c +++ b/arch/arm/cpu/armv8/cache_v8.c @@ -2,6 +2,9 @@
- (C) Copyright 2013
- David Feng fenghua@phytium.com.cn
- (C) Copyright 2016
- Alexander Graf agraf@suse.de
*/
- SPDX-License-Identifier: GPL-2.0+
@@ -9,35 +12,40 @@ #include <asm/system.h> #include <asm/armv8/mmu.h>
-DECLARE_GLOBAL_DATA_PTR;
-#ifndef CONFIG_SYS_DCACHE_OFF +/* #define DEBUG_MMU */
-#ifdef CONFIG_SYS_FULL_VA -static void set_ptl1_entry(u64 index, u64 ptl2_entry) -{
u64 *pgd = (u64 *)gd->arch.tlb_addr;
u64 value;
+#ifdef DEBUG_MMU +#define DPRINTF(a, ...) printf("%s:%d: " a, __func__, __LINE__, __VA_ARGS__) +#else +#define DPRINTF(a, ...) do { } while(0) +#endif
Can you use the normal DEBUG and debug()?
Uh, I guess so, yeah.
value = ptl2_entry | PTL1_TYPE_TABLE;
pgd[index] = value;
-} +DECLARE_GLOBAL_DATA_PTR;
-static void set_ptl2_block(u64 ptl1, u64 bfn, u64 address, u64 memory_attrs) -{
u64 *pmd = (u64 *)ptl1;
u64 value;
+#ifndef CONFIG_SYS_DCACHE_OFF
value = address | PTL2_TYPE_BLOCK | PTL2_BLOCK_AF;
value |= memory_attrs;
pmd[bfn] = value;
-} +/*
 * With 4k page granule, a virtual address is split into 4 lookup parts
 * spanning 9 bits each:
 *
 *   |       |       |       |       |       |       |
 *   |   0   |  Lv0  |  Lv1  |  Lv2  |  Lv3  |  off  |
 *   |_______|_______|_______|_______|_______|_______|
 *     63-48   47-39   38-30   29-21   20-12   11-00
 *
 *            mask        page size
 *
 * Lv0: FF8000000000       --
 * Lv1:   7FC0000000       1G
 * Lv2:     3FE00000       2M
 * Lv3:       1FF000       4K
 * off:          FFF
 */
+#ifdef CONFIG_SYS_FULL_VA static struct mm_region mem_map[] = CONFIG_SYS_MEM_MAP;
I am not keen on the idea of using a big #define table on these boards. Is there not a device-tree binding for this that we can use? It is just a data table. We are moving to Kconfig and eventually want to drop the config files.
I'll move this into board files which then can do whatever they like with it - take it from a #define in a header, populate it dynamically from device tree, do something halfway in between, whatever fits your needs best :).
Btw, if you want to use dt for it, you don't need to add any new bindings at all. Simply take all of your device regs/ranges and memory ranges and gobble them into the memory map.
That's only true if every single MMIO region that the code accesses is already represented in DT, or if any DT conversion patches also add the missing devices. It's quite unlikely that, even on heavily DT-dependent platforms, /all/ in-use MMIO regions are already represented in DT. Such a conversion would need quite significant testing.
Well, it's something I'd leave to the SoC / board maintainer really. With v2 you can now fiddle with the map as much as you like inside your code. The important bit to me is that we have a standardized path on how we map the mem_map table into the MMU. I don't care how we fetch the table.
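For example, a board that wants to go the DT route could fill its entries with something along these lines (sketch only; it assumes a single /memory node with a single reg entry and only appends one normal-memory region):

#include <fdtdec.h>
#include <asm/armv8/mmu.h>

/* Sketch: append the DRAM range described in DT to a board-owned map. */
static int add_dram_from_dt(const void *blob, struct mm_region *map, int idx)
{
	int node = fdt_path_offset(blob, "/memory");
	fdt_addr_t base;
	fdt_size_t size;

	if (node < 0)
		return idx;

	base = fdtdec_get_addr_size(blob, node, "reg", &size);
	if (base == FDT_ADDR_T_NONE)
		return idx;

	map[idx].base = base;
	map[idx].size = size;
	map[idx].attrs = PTE_BLOCK_MEMTYPE(MT_NORMAL) | PTE_BLOCK_INNER_SHARE;

	return idx + 1;
}

Whether that is worth the extra parsing is exactly the per-board trade-off discussed above.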
Alex

Now that we have nice table-driven page table creation code that gives us everything we need, move to that.
Signed-off-by: Alexander Graf agraf@suse.de --- arch/arm/cpu/armv8/zynqmp/cpu.c | 169 ---------------------------------------- include/configs/xilinx_zynqmp.h | 44 +++++++++++ 2 files changed, 44 insertions(+), 169 deletions(-)
diff --git a/arch/arm/cpu/armv8/zynqmp/cpu.c b/arch/arm/cpu/armv8/zynqmp/cpu.c index c71f291..3f661a9 100644 --- a/arch/arm/cpu/armv8/zynqmp/cpu.c +++ b/arch/arm/cpu/armv8/zynqmp/cpu.c @@ -44,172 +44,3 @@ unsigned int zynqmp_get_silicon_version(void)
return ZYNQMP_CSU_VERSION_SILICON; } - -#ifndef CONFIG_SYS_DCACHE_OFF -#include <asm/armv8/mmu.h> - -#define SECTION_SHIFT_L1 30UL -#define SECTION_SHIFT_L2 21UL -#define BLOCK_SIZE_L0 0x8000000000UL -#define BLOCK_SIZE_L1 (1 << SECTION_SHIFT_L1) -#define BLOCK_SIZE_L2 (1 << SECTION_SHIFT_L2) - -#define TCR_TG1_4K (1 << 31) -#define TCR_EPD1_DISABLE (1 << 23) -#define ZYNQMO_VA_BITS 40 -#define ZYNQMP_TCR TCR_TG1_4K | \ - TCR_EPD1_DISABLE | \ - TCR_SHARED_OUTER | \ - TCR_SHARED_INNER | \ - TCR_IRGN_WBWA | \ - TCR_ORGN_WBWA | \ - TCR_T0SZ(ZYNQMO_VA_BITS) - -#define MEMORY_ATTR PMD_SECT_AF | PMD_SECT_INNER_SHARE | \ - PMD_ATTRINDX(MT_NORMAL) | \ - PMD_TYPE_SECT -#define DEVICE_ATTR PMD_SECT_AF | PMD_SECT_PXN | \ - PMD_SECT_UXN | PMD_ATTRINDX(MT_DEVICE_NGNRNE) | \ - PMD_TYPE_SECT - -/* 4K size is required to place 512 entries in each level */ -#define TLB_TABLE_SIZE 0x1000 - -struct attr_tbl { - u32 num; - u64 attr; -}; - -static struct attr_tbl attr_tbll1t0[4] = { {16, 0x0}, - {8, DEVICE_ATTR}, - {32, MEMORY_ATTR}, - {456, DEVICE_ATTR} - }; -static struct attr_tbl attr_tbll2t3[4] = { {0x180, DEVICE_ATTR}, - {0x40, 0x0}, - {0x3F, DEVICE_ATTR}, - {0x1, MEMORY_ATTR} - }; - -/* - * This mmu table looks as below - * Level 0 table contains two entries to 512GB sizes. One is Level1 Table 0 - * and other Level1 Table1. - * Level1 Table0 contains entries for each 1GB from 0 to 511GB. - * Level1 Table1 contains entries for each 1GB from 512GB to 1TB. - * Level2 Table0, Level2 Table1, Level2 Table2 and Level2 Table3 contains - * entries for each 2MB starting from 0GB, 1GB, 2GB and 3GB respectively. - */ -static void zynqmp_mmu_setup(void) -{ - int el; - u32 index_attr; - u64 i, section_l1t0, section_l1t1; - u64 section_l2t0, section_l2t1, section_l2t2, section_l2t3; - u64 *level0_table = (u64 *)gd->arch.tlb_addr; - u64 *level1_table_0 = (u64 *)(gd->arch.tlb_addr + TLB_TABLE_SIZE); - u64 *level1_table_1 = (u64 *)(gd->arch.tlb_addr + (2 * TLB_TABLE_SIZE)); - u64 *level2_table_0 = (u64 *)(gd->arch.tlb_addr + (3 * TLB_TABLE_SIZE)); - u64 *level2_table_1 = (u64 *)(gd->arch.tlb_addr + (4 * TLB_TABLE_SIZE)); - u64 *level2_table_2 = (u64 *)(gd->arch.tlb_addr + (5 * TLB_TABLE_SIZE)); - u64 *level2_table_3 = (u64 *)(gd->arch.tlb_addr + (6 * TLB_TABLE_SIZE)); - - level0_table[0] = - (u64)level1_table_0 | PMD_TYPE_TABLE; - level0_table[1] = - (u64)level1_table_1 | PMD_TYPE_TABLE; - - /* - * set level 1 table 0, covering 0 to 512GB - * set level 1 table 1, covering 512GB to 1TB - */ - section_l1t0 = 0; - section_l1t1 = BLOCK_SIZE_L0; - - index_attr = 0; - for (i = 0; i < 512; i++) { - level1_table_0[i] = section_l1t0; - level1_table_0[i] |= attr_tbll1t0[index_attr].attr; - attr_tbll1t0[index_attr].num--; - if (attr_tbll1t0[index_attr].num == 0) - index_attr++; - level1_table_1[i] = section_l1t1; - level1_table_1[i] |= DEVICE_ATTR; - section_l1t0 += BLOCK_SIZE_L1; - section_l1t1 += BLOCK_SIZE_L1; - } - - level1_table_0[0] = - (u64)level2_table_0 | PMD_TYPE_TABLE; - level1_table_0[1] = - (u64)level2_table_1 | PMD_TYPE_TABLE; - level1_table_0[2] = - (u64)level2_table_2 | PMD_TYPE_TABLE; - level1_table_0[3] = - (u64)level2_table_3 | PMD_TYPE_TABLE; - - section_l2t0 = 0; - section_l2t1 = section_l2t0 + BLOCK_SIZE_L1; /* 1GB */ - section_l2t2 = section_l2t1 + BLOCK_SIZE_L1; /* 2GB */ - section_l2t3 = section_l2t2 + BLOCK_SIZE_L1; /* 3GB */ - - index_attr = 0; - - for (i = 0; i < 512; i++) { - level2_table_0[i] = section_l2t0 | MEMORY_ATTR; - level2_table_1[i] = section_l2t1 | MEMORY_ATTR; - level2_table_2[i] = 
section_l2t2 | DEVICE_ATTR; - level2_table_3[i] = section_l2t3 | - attr_tbll2t3[index_attr].attr; - attr_tbll2t3[index_attr].num--; - if (attr_tbll2t3[index_attr].num == 0) - index_attr++; - section_l2t0 += BLOCK_SIZE_L2; - section_l2t1 += BLOCK_SIZE_L2; - section_l2t2 += BLOCK_SIZE_L2; - section_l2t3 += BLOCK_SIZE_L2; - } - - /* flush new MMU table */ - flush_dcache_range(gd->arch.tlb_addr, - gd->arch.tlb_addr + gd->arch.tlb_size); - - /* point TTBR to the new table */ - el = current_el(); - set_ttbr_tcr_mair(el, gd->arch.tlb_addr, - ZYNQMP_TCR, MEMORY_ATTRIBUTES); - - set_sctlr(get_sctlr() | CR_M); -} - -int arch_cpu_init(void) -{ - icache_enable(); - __asm_invalidate_dcache_all(); - __asm_invalidate_tlb_all(); - return 0; -} - -/* - * This function is called from lib/board.c. - * It recreates MMU table in main memory. MMU and d-cache are enabled earlier. - * There is no need to disable d-cache for this operation. - */ -void enable_caches(void) -{ - /* The data cache is not active unless the mmu is enabled */ - if (!(get_sctlr() & CR_M)) { - invalidate_dcache_all(); - __asm_invalidate_tlb_all(); - zynqmp_mmu_setup(); - } - puts("Enabling Caches...\n"); - - set_sctlr(get_sctlr() | CR_C); -} - -u64 *arch_get_page_table(void) -{ - return (u64 *)(gd->arch.tlb_addr + 0x3000); -} -#endif diff --git a/include/configs/xilinx_zynqmp.h b/include/configs/xilinx_zynqmp.h index 28622de..439f063 100644 --- a/include/configs/xilinx_zynqmp.h +++ b/include/configs/xilinx_zynqmp.h @@ -29,6 +29,50 @@ #define CONFIG_SYS_MEMTEST_START CONFIG_SYS_SDRAM_BASE #define CONFIG_SYS_MEMTEST_END CONFIG_SYS_SDRAM_SIZE
+#define CONFIG_SYS_FULL_VA +#define CONFIG_SYS_MEM_MAP { \ + { \ + .base = 0x0UL, \ + .size = 0x80000000UL, \ + .attrs = PTE_BLOCK_MEMTYPE(MT_NORMAL) | \ + PTE_BLOCK_INNER_SHARE \ + }, { \ + .base = 0x80000000UL, \ + .size = 0x70000000UL, \ + .attrs = PTE_BLOCK_MEMTYPE(MT_DEVICE_NGNRNE) | \ + PTE_BLOCK_NON_SHARE | \ + PTE_BLOCK_PXN | PTE_BLOCK_UXN \ + }, { \ + .base = 0xf8000000UL, \ + .size = 0x07e00000UL, \ + .attrs = PTE_BLOCK_MEMTYPE(MT_DEVICE_NGNRNE) | \ + PTE_BLOCK_NON_SHARE | \ + PTE_BLOCK_PXN | PTE_BLOCK_UXN \ + }, { \ + .base = 0xffe00000UL, \ + .size = 0x00200000UL, \ + .attrs = PTE_BLOCK_MEMTYPE(MT_NORMAL) | \ + PTE_BLOCK_INNER_SHARE \ + }, { \ + .base = 0x400000000UL, \ + .size = 0x200000000UL, \ + .attrs = PTE_BLOCK_MEMTYPE(MT_DEVICE_NGNRNE) | \ + PTE_BLOCK_NON_SHARE | \ + PTE_BLOCK_PXN | PTE_BLOCK_UXN \ + }, { \ + .base = 0x600000000UL, \ + .size = 0x800000000UL, \ + .attrs = PTE_BLOCK_MEMTYPE(MT_NORMAL) | \ + PTE_BLOCK_INNER_SHARE \ + }, { \ + .base = 0xe00000000UL, \ + .size = 0xf200000000UL, \ + .attrs = PTE_BLOCK_MEMTYPE(MT_DEVICE_NGNRNE) | \ + PTE_BLOCK_NON_SHARE | \ + PTE_BLOCK_PXN | PTE_BLOCK_UXN \ + }, \ + } + /* Have release address at the end of 256MB for now */ #define CPU_RELEASE_ADDR 0xFFFFFF0

On 22.2.2016 02:57, Alexander Graf wrote:
Now that we have nice table-driven page table creation code that gives us everything we need, move to that.
Signed-off-by: Alexander Graf agraf@suse.de
arch/arm/cpu/armv8/zynqmp/cpu.c | 169 ---------------------------------------- include/configs/xilinx_zynqmp.h | 44 +++++++++++ 2 files changed, 44 insertions(+), 169 deletions(-)
diff --git a/arch/arm/cpu/armv8/zynqmp/cpu.c b/arch/arm/cpu/armv8/zynqmp/cpu.c index c71f291..3f661a9 100644 --- a/arch/arm/cpu/armv8/zynqmp/cpu.c +++ b/arch/arm/cpu/armv8/zynqmp/cpu.c @@ -44,172 +44,3 @@ unsigned int zynqmp_get_silicon_version(void)
return ZYNQMP_CSU_VERSION_SILICON; }
-#ifndef CONFIG_SYS_DCACHE_OFF -#include <asm/armv8/mmu.h>
-#define SECTION_SHIFT_L1 30UL -#define SECTION_SHIFT_L2 21UL -#define BLOCK_SIZE_L0 0x8000000000UL -#define BLOCK_SIZE_L1 (1 << SECTION_SHIFT_L1) -#define BLOCK_SIZE_L2 (1 << SECTION_SHIFT_L2)
-#define TCR_TG1_4K (1 << 31) -#define TCR_EPD1_DISABLE (1 << 23) -#define ZYNQMO_VA_BITS 40 -#define ZYNQMP_TCR TCR_TG1_4K | \
TCR_EPD1_DISABLE | \
TCR_SHARED_OUTER | \
TCR_SHARED_INNER | \
TCR_IRGN_WBWA | \
TCR_ORGN_WBWA | \
TCR_T0SZ(ZYNQMO_VA_BITS)
-#define MEMORY_ATTR PMD_SECT_AF | PMD_SECT_INNER_SHARE | \
PMD_ATTRINDX(MT_NORMAL) | \
PMD_TYPE_SECT
-#define DEVICE_ATTR PMD_SECT_AF | PMD_SECT_PXN | \
PMD_SECT_UXN | PMD_ATTRINDX(MT_DEVICE_NGNRNE) | \
PMD_TYPE_SECT
-/* 4K size is required to place 512 entries in each level */ -#define TLB_TABLE_SIZE 0x1000
-struct attr_tbl {
- u32 num;
- u64 attr;
-};
-static struct attr_tbl attr_tbll1t0[4] = { {16, 0x0},
{8, DEVICE_ATTR},
{32, MEMORY_ATTR},
{456, DEVICE_ATTR}
};
-static struct attr_tbl attr_tbll2t3[4] = { {0x180, DEVICE_ATTR},
{0x40, 0x0},
{0x3F, DEVICE_ATTR},
{0x1, MEMORY_ATTR}
};
-/*
- This mmu table looks as below
- Level 0 table contains two entries to 512GB sizes. One is Level1 Table 0
- and other Level1 Table1.
- Level1 Table0 contains entries for each 1GB from 0 to 511GB.
- Level1 Table1 contains entries for each 1GB from 512GB to 1TB.
- Level2 Table0, Level2 Table1, Level2 Table2 and Level2 Table3 contains
- entries for each 2MB starting from 0GB, 1GB, 2GB and 3GB respectively.
- */
-static void zynqmp_mmu_setup(void) -{
- int el;
- u32 index_attr;
- u64 i, section_l1t0, section_l1t1;
- u64 section_l2t0, section_l2t1, section_l2t2, section_l2t3;
- u64 *level0_table = (u64 *)gd->arch.tlb_addr;
- u64 *level1_table_0 = (u64 *)(gd->arch.tlb_addr + TLB_TABLE_SIZE);
- u64 *level1_table_1 = (u64 *)(gd->arch.tlb_addr + (2 * TLB_TABLE_SIZE));
- u64 *level2_table_0 = (u64 *)(gd->arch.tlb_addr + (3 * TLB_TABLE_SIZE));
- u64 *level2_table_1 = (u64 *)(gd->arch.tlb_addr + (4 * TLB_TABLE_SIZE));
- u64 *level2_table_2 = (u64 *)(gd->arch.tlb_addr + (5 * TLB_TABLE_SIZE));
- u64 *level2_table_3 = (u64 *)(gd->arch.tlb_addr + (6 * TLB_TABLE_SIZE));
- level0_table[0] =
(u64)level1_table_0 | PMD_TYPE_TABLE;
- level0_table[1] =
(u64)level1_table_1 | PMD_TYPE_TABLE;
- /*
* set level 1 table 0, covering 0 to 512GB
* set level 1 table 1, covering 512GB to 1TB
*/
- section_l1t0 = 0;
- section_l1t1 = BLOCK_SIZE_L0;
- index_attr = 0;
- for (i = 0; i < 512; i++) {
level1_table_0[i] = section_l1t0;
level1_table_0[i] |= attr_tbll1t0[index_attr].attr;
attr_tbll1t0[index_attr].num--;
if (attr_tbll1t0[index_attr].num == 0)
index_attr++;
level1_table_1[i] = section_l1t1;
level1_table_1[i] |= DEVICE_ATTR;
section_l1t0 += BLOCK_SIZE_L1;
section_l1t1 += BLOCK_SIZE_L1;
- }
- level1_table_0[0] =
(u64)level2_table_0 | PMD_TYPE_TABLE;
- level1_table_0[1] =
(u64)level2_table_1 | PMD_TYPE_TABLE;
- level1_table_0[2] =
(u64)level2_table_2 | PMD_TYPE_TABLE;
- level1_table_0[3] =
(u64)level2_table_3 | PMD_TYPE_TABLE;
- section_l2t0 = 0;
- section_l2t1 = section_l2t0 + BLOCK_SIZE_L1; /* 1GB */
- section_l2t2 = section_l2t1 + BLOCK_SIZE_L1; /* 2GB */
- section_l2t3 = section_l2t2 + BLOCK_SIZE_L1; /* 3GB */
- index_attr = 0;
- for (i = 0; i < 512; i++) {
level2_table_0[i] = section_l2t0 | MEMORY_ATTR;
level2_table_1[i] = section_l2t1 | MEMORY_ATTR;
level2_table_2[i] = section_l2t2 | DEVICE_ATTR;
level2_table_3[i] = section_l2t3 |
attr_tbll2t3[index_attr].attr;
attr_tbll2t3[index_attr].num--;
if (attr_tbll2t3[index_attr].num == 0)
index_attr++;
section_l2t0 += BLOCK_SIZE_L2;
section_l2t1 += BLOCK_SIZE_L2;
section_l2t2 += BLOCK_SIZE_L2;
section_l2t3 += BLOCK_SIZE_L2;
- }
- /* flush new MMU table */
- flush_dcache_range(gd->arch.tlb_addr,
gd->arch.tlb_addr + gd->arch.tlb_size);
- /* point TTBR to the new table */
- el = current_el();
- set_ttbr_tcr_mair(el, gd->arch.tlb_addr,
ZYNQMP_TCR, MEMORY_ATTRIBUTES);
- set_sctlr(get_sctlr() | CR_M);
-}
-int arch_cpu_init(void) -{
- icache_enable();
- __asm_invalidate_dcache_all();
- __asm_invalidate_tlb_all();
- return 0;
-}
-/*
- This function is called from lib/board.c.
- It recreates MMU table in main memory. MMU and d-cache are enabled earlier.
- There is no need to disable d-cache for this operation.
- */
-void enable_caches(void) -{
- /* The data cache is not active unless the mmu is enabled */
- if (!(get_sctlr() & CR_M)) {
invalidate_dcache_all();
__asm_invalidate_tlb_all();
zynqmp_mmu_setup();
- }
- puts("Enabling Caches...\n");
- set_sctlr(get_sctlr() | CR_C);
-}
-u64 *arch_get_page_table(void) -{
- return (u64 *)(gd->arch.tlb_addr + 0x3000);
-} -#endif diff --git a/include/configs/xilinx_zynqmp.h b/include/configs/xilinx_zynqmp.h index 28622de..439f063 100644 --- a/include/configs/xilinx_zynqmp.h +++ b/include/configs/xilinx_zynqmp.h @@ -29,6 +29,50 @@ #define CONFIG_SYS_MEMTEST_START CONFIG_SYS_SDRAM_BASE #define CONFIG_SYS_MEMTEST_END CONFIG_SYS_SDRAM_SIZE
+#define CONFIG_SYS_FULL_VA +#define CONFIG_SYS_MEM_MAP { \
- { \
.base = 0x0UL, \
.size = 0x80000000UL, \
.attrs = PTE_BLOCK_MEMTYPE(MT_NORMAL) | \
PTE_BLOCK_INNER_SHARE \
- }, { \
.base = 0x80000000UL, \
.size = 0x70000000UL, \
.attrs = PTE_BLOCK_MEMTYPE(MT_DEVICE_NGNRNE) | \
PTE_BLOCK_NON_SHARE | \
PTE_BLOCK_PXN | PTE_BLOCK_UXN \
- }, { \
.base = 0xf8000000UL, \
.size = 0x07e00000UL, \
.attrs = PTE_BLOCK_MEMTYPE(MT_DEVICE_NGNRNE) | \
PTE_BLOCK_NON_SHARE | \
PTE_BLOCK_PXN | PTE_BLOCK_UXN \
- }, { \
.base = 0xffe00000UL, \
.size = 0x00200000UL, \
.attrs = PTE_BLOCK_MEMTYPE(MT_NORMAL) | \
PTE_BLOCK_INNER_SHARE \
- }, { \
.base = 0x400000000UL, \
.size = 0x200000000UL, \
.attrs = PTE_BLOCK_MEMTYPE(MT_DEVICE_NGNRNE) | \
PTE_BLOCK_NON_SHARE | \
PTE_BLOCK_PXN | PTE_BLOCK_UXN \
- }, { \
.base = 0x600000000UL, \
.size = 0x800000000UL, \
.attrs = PTE_BLOCK_MEMTYPE(MT_NORMAL) | \
PTE_BLOCK_INNER_SHARE \
- }, { \
.base = 0xe00000000UL, \
.size = 0xf200000000UL, \
.attrs = PTE_BLOCK_MEMTYPE(MT_DEVICE_NGNRNE) | \
PTE_BLOCK_NON_SHARE | \
PTE_BLOCK_PXN | PTE_BLOCK_UXN \
- }, \
- }
/* Have release address at the end of 256MB for now */ #define CPU_RELEASE_ADDR 0xFFFFFF0
No problem with this default map. I didn't check every entry. Siva will look at it. My problem is that this is static. You should add an option to generate this table dynamically. The reason is that systems can run just with PL DDR, not PS, and you can't map memory which is not there, which can end up in a system lock. We want to change the current code to set up the MMU table for memories at run time based on DT.
Thanks, Michal

On 23.02.16 12:04, Michal Simek wrote:
On 22.2.2016 02:57, Alexander Graf wrote:
Now that we have nice table-driven page table creation code that gives us everything we need, move to that.
Signed-off-by: Alexander Graf agraf@suse.de
arch/arm/cpu/armv8/zynqmp/cpu.c | 169 ---------------------------------------- include/configs/xilinx_zynqmp.h | 44 +++++++++++ 2 files changed, 44 insertions(+), 169 deletions(-)
diff --git a/arch/arm/cpu/armv8/zynqmp/cpu.c b/arch/arm/cpu/armv8/zynqmp/cpu.c index c71f291..3f661a9 100644 --- a/arch/arm/cpu/armv8/zynqmp/cpu.c +++ b/arch/arm/cpu/armv8/zynqmp/cpu.c @@ -44,172 +44,3 @@ unsigned int zynqmp_get_silicon_version(void)
return ZYNQMP_CSU_VERSION_SILICON; }
-#ifndef CONFIG_SYS_DCACHE_OFF -#include <asm/armv8/mmu.h>
-#define SECTION_SHIFT_L1 30UL -#define SECTION_SHIFT_L2 21UL -#define BLOCK_SIZE_L0 0x8000000000UL -#define BLOCK_SIZE_L1 (1 << SECTION_SHIFT_L1) -#define BLOCK_SIZE_L2 (1 << SECTION_SHIFT_L2)
-#define TCR_TG1_4K (1 << 31) -#define TCR_EPD1_DISABLE (1 << 23) -#define ZYNQMO_VA_BITS 40 -#define ZYNQMP_TCR TCR_TG1_4K | \
TCR_EPD1_DISABLE | \
TCR_SHARED_OUTER | \
TCR_SHARED_INNER | \
TCR_IRGN_WBWA | \
TCR_ORGN_WBWA | \
TCR_T0SZ(ZYNQMO_VA_BITS)
-#define MEMORY_ATTR PMD_SECT_AF | PMD_SECT_INNER_SHARE | \
PMD_ATTRINDX(MT_NORMAL) | \
PMD_TYPE_SECT
-#define DEVICE_ATTR PMD_SECT_AF | PMD_SECT_PXN | \
PMD_SECT_UXN | PMD_ATTRINDX(MT_DEVICE_NGNRNE) | \
PMD_TYPE_SECT
-/* 4K size is required to place 512 entries in each level */ -#define TLB_TABLE_SIZE 0x1000
-struct attr_tbl {
- u32 num;
- u64 attr;
-};
-static struct attr_tbl attr_tbll1t0[4] = { {16, 0x0},
{8, DEVICE_ATTR},
{32, MEMORY_ATTR},
{456, DEVICE_ATTR}
};
-static struct attr_tbl attr_tbll2t3[4] = { {0x180, DEVICE_ATTR},
{0x40, 0x0},
{0x3F, DEVICE_ATTR},
{0x1, MEMORY_ATTR}
};
-/*
- This mmu table looks as below
- Level 0 table contains two entries to 512GB sizes. One is Level1 Table 0
- and other Level1 Table1.
- Level1 Table0 contains entries for each 1GB from 0 to 511GB.
- Level1 Table1 contains entries for each 1GB from 512GB to 1TB.
- Level2 Table0, Level2 Table1, Level2 Table2 and Level2 Table3 contains
- entries for each 2MB starting from 0GB, 1GB, 2GB and 3GB respectively.
- */
-static void zynqmp_mmu_setup(void) -{
- int el;
- u32 index_attr;
- u64 i, section_l1t0, section_l1t1;
- u64 section_l2t0, section_l2t1, section_l2t2, section_l2t3;
- u64 *level0_table = (u64 *)gd->arch.tlb_addr;
- u64 *level1_table_0 = (u64 *)(gd->arch.tlb_addr + TLB_TABLE_SIZE);
- u64 *level1_table_1 = (u64 *)(gd->arch.tlb_addr + (2 * TLB_TABLE_SIZE));
- u64 *level2_table_0 = (u64 *)(gd->arch.tlb_addr + (3 * TLB_TABLE_SIZE));
- u64 *level2_table_1 = (u64 *)(gd->arch.tlb_addr + (4 * TLB_TABLE_SIZE));
- u64 *level2_table_2 = (u64 *)(gd->arch.tlb_addr + (5 * TLB_TABLE_SIZE));
- u64 *level2_table_3 = (u64 *)(gd->arch.tlb_addr + (6 * TLB_TABLE_SIZE));
- level0_table[0] =
(u64)level1_table_0 | PMD_TYPE_TABLE;
- level0_table[1] =
(u64)level1_table_1 | PMD_TYPE_TABLE;
- /*
* set level 1 table 0, covering 0 to 512GB
* set level 1 table 1, covering 512GB to 1TB
*/
- section_l1t0 = 0;
- section_l1t1 = BLOCK_SIZE_L0;
- index_attr = 0;
- for (i = 0; i < 512; i++) {
level1_table_0[i] = section_l1t0;
level1_table_0[i] |= attr_tbll1t0[index_attr].attr;
attr_tbll1t0[index_attr].num--;
if (attr_tbll1t0[index_attr].num == 0)
index_attr++;
level1_table_1[i] = section_l1t1;
level1_table_1[i] |= DEVICE_ATTR;
section_l1t0 += BLOCK_SIZE_L1;
section_l1t1 += BLOCK_SIZE_L1;
- }
- level1_table_0[0] =
(u64)level2_table_0 | PMD_TYPE_TABLE;
- level1_table_0[1] =
(u64)level2_table_1 | PMD_TYPE_TABLE;
- level1_table_0[2] =
(u64)level2_table_2 | PMD_TYPE_TABLE;
- level1_table_0[3] =
(u64)level2_table_3 | PMD_TYPE_TABLE;
- section_l2t0 = 0;
- section_l2t1 = section_l2t0 + BLOCK_SIZE_L1; /* 1GB */
- section_l2t2 = section_l2t1 + BLOCK_SIZE_L1; /* 2GB */
- section_l2t3 = section_l2t2 + BLOCK_SIZE_L1; /* 3GB */
- index_attr = 0;
- for (i = 0; i < 512; i++) {
level2_table_0[i] = section_l2t0 | MEMORY_ATTR;
level2_table_1[i] = section_l2t1 | MEMORY_ATTR;
level2_table_2[i] = section_l2t2 | DEVICE_ATTR;
level2_table_3[i] = section_l2t3 |
attr_tbll2t3[index_attr].attr;
attr_tbll2t3[index_attr].num--;
if (attr_tbll2t3[index_attr].num == 0)
index_attr++;
section_l2t0 += BLOCK_SIZE_L2;
section_l2t1 += BLOCK_SIZE_L2;
section_l2t2 += BLOCK_SIZE_L2;
section_l2t3 += BLOCK_SIZE_L2;
- }
- /* flush new MMU table */
- flush_dcache_range(gd->arch.tlb_addr,
gd->arch.tlb_addr + gd->arch.tlb_size);
- /* point TTBR to the new table */
- el = current_el();
- set_ttbr_tcr_mair(el, gd->arch.tlb_addr,
ZYNQMP_TCR, MEMORY_ATTRIBUTES);
- set_sctlr(get_sctlr() | CR_M);
-}
-int arch_cpu_init(void) -{
- icache_enable();
- __asm_invalidate_dcache_all();
- __asm_invalidate_tlb_all();
- return 0;
-}
-/*
- This function is called from lib/board.c.
- It recreates MMU table in main memory. MMU and d-cache are enabled earlier.
- There is no need to disable d-cache for this operation.
- */
-void enable_caches(void) -{
- /* The data cache is not active unless the mmu is enabled */
- if (!(get_sctlr() & CR_M)) {
invalidate_dcache_all();
__asm_invalidate_tlb_all();
zynqmp_mmu_setup();
- }
- puts("Enabling Caches...\n");
- set_sctlr(get_sctlr() | CR_C);
-}
-u64 *arch_get_page_table(void) -{
- return (u64 *)(gd->arch.tlb_addr + 0x3000);
-} -#endif diff --git a/include/configs/xilinx_zynqmp.h b/include/configs/xilinx_zynqmp.h index 28622de..439f063 100644 --- a/include/configs/xilinx_zynqmp.h +++ b/include/configs/xilinx_zynqmp.h @@ -29,6 +29,50 @@ #define CONFIG_SYS_MEMTEST_START CONFIG_SYS_SDRAM_BASE #define CONFIG_SYS_MEMTEST_END CONFIG_SYS_SDRAM_SIZE
+#define CONFIG_SYS_FULL_VA +#define CONFIG_SYS_MEM_MAP { \
- { \
.base = 0x0UL, \
.size = 0x80000000UL, \
.attrs = PTE_BLOCK_MEMTYPE(MT_NORMAL) | \
PTE_BLOCK_INNER_SHARE \
- }, { \
.base = 0x80000000UL, \
.size = 0x70000000UL, \
.attrs = PTE_BLOCK_MEMTYPE(MT_DEVICE_NGNRNE) | \
PTE_BLOCK_NON_SHARE | \
PTE_BLOCK_PXN | PTE_BLOCK_UXN \
- }, { \
.base = 0xf8000000UL, \
.size = 0x07e00000UL, \
.attrs = PTE_BLOCK_MEMTYPE(MT_DEVICE_NGNRNE) | \
PTE_BLOCK_NON_SHARE | \
PTE_BLOCK_PXN | PTE_BLOCK_UXN \
- }, { \
.base = 0xffe00000UL, \
.size = 0x00200000UL, \
.attrs = PTE_BLOCK_MEMTYPE(MT_NORMAL) | \
PTE_BLOCK_INNER_SHARE \
- }, { \
.base = 0x400000000UL, \
.size = 0x200000000UL, \
.attrs = PTE_BLOCK_MEMTYPE(MT_DEVICE_NGNRNE) | \
PTE_BLOCK_NON_SHARE | \
PTE_BLOCK_PXN | PTE_BLOCK_UXN \
- }, { \
.base = 0x600000000UL, \
.size = 0x800000000UL, \
.attrs = PTE_BLOCK_MEMTYPE(MT_NORMAL) | \
PTE_BLOCK_INNER_SHARE \
- }, { \
.base = 0xe00000000UL, \
.size = 0xf200000000UL, \
.attrs = PTE_BLOCK_MEMTYPE(MT_DEVICE_NGNRNE) | \
PTE_BLOCK_NON_SHARE | \
PTE_BLOCK_PXN | PTE_BLOCK_UXN \
- }, \
- }
/* Have release address at the end of 256MB for now */ #define CPU_RELEASE_ADDR 0xFFFFFF0
No problem with this default map. I didn't check every entry. Siva will look at it. My problem is that this is static. You should add an option to generate this table dynamically. The reason is that systems can run just
The way the code works today, I could easily just use a 0-size element as array terminator and use an external pointer rather than the local memory map array.
However, that conflicts with a runtime memory-constrained approach to determining the page table size. The only good path I could come up with here is to generate the page table at build time and then save how big it was. We couldn't do that with tables that change at runtime.
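Just to make the terminator idea concrete (sketch only; add_region() stands in for whatever routine ends up creating the 1G/2M/4K entries for one range):

/* Board code owns the table; a zero-size entry terminates it. */
extern struct mm_region *mem_map;

static void setup_all_regions(void)
{
	struct mm_region *r;

	for (r = mem_map; r->size; r++)
		add_region(r->base, r->size, r->attrs);
}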
with PL DDR, not PS, and you can't map memory which is not there, which can end up in a system lock. We want to change the current code to set up the MMU table for memories at run time based on DT.
So in your use case, the regions would still be there, just their size changes. That means we could still have static tables that then get modified by runtime code later on to just span less?
Of course, with a full device tree, we could just generate the mmu tables completely from dt. But then we'd need a good answer to how we determine the page table pool size...
Alex

Hi,
before I comment on the rest: you also need to fix the gem driver, because it is using a 1MB mapping.
diff --git a/drivers/net/zynq_gem.c b/drivers/net/zynq_gem.c index b3821c31a91d..cf1376ce1bd7 100644 --- a/drivers/net/zynq_gem.c +++ b/drivers/net/zynq_gem.c @@ -150,10 +150,10 @@ struct emac_bd { };
#define RX_BUF 32 -/* Page table entries are set to 1MB, or multiples of 1MB - * (not < 1MB). driver uses less bd's so use 1MB bdspace. +/* Page table entries are set to 2MB, or multiples of 2MB + * (not < 2MB). driver uses less bd's so use 2MB bdspace. */ -#define BD_SPACE 0x100000 +#define BD_SPACE 0x200000 /* BD separation space */ #define BD_SEPRN_SPACE (RX_BUF * sizeof(struct emac_bd))
On 23.2.2016 12:33, Alexander Graf wrote:
On 23.02.16 12:04, Michal Simek wrote:
On 22.2.2016 02:57, Alexander Graf wrote:
Now that we have nice table-driven page table creation code that gives us everything we need, move to that.
Signed-off-by: Alexander Graf agraf@suse.de
arch/arm/cpu/armv8/zynqmp/cpu.c | 169 ---------------------------------------- include/configs/xilinx_zynqmp.h | 44 +++++++++++ 2 files changed, 44 insertions(+), 169 deletions(-)
diff --git a/arch/arm/cpu/armv8/zynqmp/cpu.c b/arch/arm/cpu/armv8/zynqmp/cpu.c index c71f291..3f661a9 100644 --- a/arch/arm/cpu/armv8/zynqmp/cpu.c +++ b/arch/arm/cpu/armv8/zynqmp/cpu.c @@ -44,172 +44,3 @@ unsigned int zynqmp_get_silicon_version(void)
return ZYNQMP_CSU_VERSION_SILICON; }
-#ifndef CONFIG_SYS_DCACHE_OFF -#include <asm/armv8/mmu.h>
-#define SECTION_SHIFT_L1 30UL -#define SECTION_SHIFT_L2 21UL -#define BLOCK_SIZE_L0 0x8000000000UL -#define BLOCK_SIZE_L1 (1 << SECTION_SHIFT_L1) -#define BLOCK_SIZE_L2 (1 << SECTION_SHIFT_L2)
-#define TCR_TG1_4K (1 << 31) -#define TCR_EPD1_DISABLE (1 << 23) -#define ZYNQMO_VA_BITS 40 -#define ZYNQMP_TCR TCR_TG1_4K | \
TCR_EPD1_DISABLE | \
TCR_SHARED_OUTER | \
TCR_SHARED_INNER | \
TCR_IRGN_WBWA | \
TCR_ORGN_WBWA | \
TCR_T0SZ(ZYNQMO_VA_BITS)
-#define MEMORY_ATTR PMD_SECT_AF | PMD_SECT_INNER_SHARE | \
PMD_ATTRINDX(MT_NORMAL) | \
PMD_TYPE_SECT
-#define DEVICE_ATTR PMD_SECT_AF | PMD_SECT_PXN | \
PMD_SECT_UXN | PMD_ATTRINDX(MT_DEVICE_NGNRNE) | \
PMD_TYPE_SECT
-/* 4K size is required to place 512 entries in each level */ -#define TLB_TABLE_SIZE 0x1000
-struct attr_tbl {
- u32 num;
- u64 attr;
-};
-static struct attr_tbl attr_tbll1t0[4] = { {16, 0x0},
{8, DEVICE_ATTR},
{32, MEMORY_ATTR},
{456, DEVICE_ATTR}
};
-static struct attr_tbl attr_tbll2t3[4] = { {0x180, DEVICE_ATTR},
{0x40, 0x0},
{0x3F, DEVICE_ATTR},
{0x1, MEMORY_ATTR}
};
-/*
- This mmu table looks as below
- Level 0 table contains two entries to 512GB sizes. One is Level1 Table 0
- and other Level1 Table1.
- Level1 Table0 contains entries for each 1GB from 0 to 511GB.
- Level1 Table1 contains entries for each 1GB from 512GB to 1TB.
- Level2 Table0, Level2 Table1, Level2 Table2 and Level2 Table3 contains
- entries for each 2MB starting from 0GB, 1GB, 2GB and 3GB respectively.
- */
-static void zynqmp_mmu_setup(void) -{
- int el;
- u32 index_attr;
- u64 i, section_l1t0, section_l1t1;
- u64 section_l2t0, section_l2t1, section_l2t2, section_l2t3;
- u64 *level0_table = (u64 *)gd->arch.tlb_addr;
- u64 *level1_table_0 = (u64 *)(gd->arch.tlb_addr + TLB_TABLE_SIZE);
- u64 *level1_table_1 = (u64 *)(gd->arch.tlb_addr + (2 * TLB_TABLE_SIZE));
- u64 *level2_table_0 = (u64 *)(gd->arch.tlb_addr + (3 * TLB_TABLE_SIZE));
- u64 *level2_table_1 = (u64 *)(gd->arch.tlb_addr + (4 * TLB_TABLE_SIZE));
- u64 *level2_table_2 = (u64 *)(gd->arch.tlb_addr + (5 * TLB_TABLE_SIZE));
- u64 *level2_table_3 = (u64 *)(gd->arch.tlb_addr + (6 * TLB_TABLE_SIZE));
- level0_table[0] =
(u64)level1_table_0 | PMD_TYPE_TABLE;
- level0_table[1] =
(u64)level1_table_1 | PMD_TYPE_TABLE;
- /*
* set level 1 table 0, covering 0 to 512GB
* set level 1 table 1, covering 512GB to 1TB
*/
- section_l1t0 = 0;
- section_l1t1 = BLOCK_SIZE_L0;
- index_attr = 0;
- for (i = 0; i < 512; i++) {
level1_table_0[i] = section_l1t0;
level1_table_0[i] |= attr_tbll1t0[index_attr].attr;
attr_tbll1t0[index_attr].num--;
if (attr_tbll1t0[index_attr].num == 0)
index_attr++;
level1_table_1[i] = section_l1t1;
level1_table_1[i] |= DEVICE_ATTR;
section_l1t0 += BLOCK_SIZE_L1;
section_l1t1 += BLOCK_SIZE_L1;
- }
- level1_table_0[0] =
(u64)level2_table_0 | PMD_TYPE_TABLE;
- level1_table_0[1] =
(u64)level2_table_1 | PMD_TYPE_TABLE;
- level1_table_0[2] =
(u64)level2_table_2 | PMD_TYPE_TABLE;
- level1_table_0[3] =
(u64)level2_table_3 | PMD_TYPE_TABLE;
- section_l2t0 = 0;
- section_l2t1 = section_l2t0 + BLOCK_SIZE_L1; /* 1GB */
- section_l2t2 = section_l2t1 + BLOCK_SIZE_L1; /* 2GB */
- section_l2t3 = section_l2t2 + BLOCK_SIZE_L1; /* 3GB */
- index_attr = 0;
- for (i = 0; i < 512; i++) {
level2_table_0[i] = section_l2t0 | MEMORY_ATTR;
level2_table_1[i] = section_l2t1 | MEMORY_ATTR;
level2_table_2[i] = section_l2t2 | DEVICE_ATTR;
level2_table_3[i] = section_l2t3 |
attr_tbll2t3[index_attr].attr;
attr_tbll2t3[index_attr].num--;
if (attr_tbll2t3[index_attr].num == 0)
index_attr++;
section_l2t0 += BLOCK_SIZE_L2;
section_l2t1 += BLOCK_SIZE_L2;
section_l2t2 += BLOCK_SIZE_L2;
section_l2t3 += BLOCK_SIZE_L2;
- }
- /* flush new MMU table */
- flush_dcache_range(gd->arch.tlb_addr,
gd->arch.tlb_addr + gd->arch.tlb_size);
- /* point TTBR to the new table */
- el = current_el();
- set_ttbr_tcr_mair(el, gd->arch.tlb_addr,
ZYNQMP_TCR, MEMORY_ATTRIBUTES);
- set_sctlr(get_sctlr() | CR_M);
-}
-int arch_cpu_init(void) -{
- icache_enable();
- __asm_invalidate_dcache_all();
- __asm_invalidate_tlb_all();
- return 0;
-}
-/*
- This function is called from lib/board.c.
- It recreates MMU table in main memory. MMU and d-cache are enabled earlier.
- There is no need to disable d-cache for this operation.
- */
-void enable_caches(void) -{
- /* The data cache is not active unless the mmu is enabled */
- if (!(get_sctlr() & CR_M)) {
invalidate_dcache_all();
__asm_invalidate_tlb_all();
zynqmp_mmu_setup();
- }
- puts("Enabling Caches...\n");
- set_sctlr(get_sctlr() | CR_C);
-}
-u64 *arch_get_page_table(void) -{
- return (u64 *)(gd->arch.tlb_addr + 0x3000);
-} -#endif diff --git a/include/configs/xilinx_zynqmp.h b/include/configs/xilinx_zynqmp.h index 28622de..439f063 100644 --- a/include/configs/xilinx_zynqmp.h +++ b/include/configs/xilinx_zynqmp.h @@ -29,6 +29,50 @@ #define CONFIG_SYS_MEMTEST_START CONFIG_SYS_SDRAM_BASE #define CONFIG_SYS_MEMTEST_END CONFIG_SYS_SDRAM_SIZE
+#define CONFIG_SYS_FULL_VA +#define CONFIG_SYS_MEM_MAP { \
- { \
.base = 0x0UL, \
.size = 0x80000000UL, \
.attrs = PTE_BLOCK_MEMTYPE(MT_NORMAL) | \
PTE_BLOCK_INNER_SHARE \
- }, { \
.base = 0x80000000UL, \
.size = 0x70000000UL, \
.attrs = PTE_BLOCK_MEMTYPE(MT_DEVICE_NGNRNE) | \
PTE_BLOCK_NON_SHARE | \
PTE_BLOCK_PXN | PTE_BLOCK_UXN \
- }, { \
.base = 0xf8000000UL, \
.size = 0x07e00000UL, \
.attrs = PTE_BLOCK_MEMTYPE(MT_DEVICE_NGNRNE) | \
PTE_BLOCK_NON_SHARE | \
PTE_BLOCK_PXN | PTE_BLOCK_UXN \
- }, { \
.base = 0xffe00000UL, \
.size = 0x00200000UL, \
.attrs = PTE_BLOCK_MEMTYPE(MT_NORMAL) | \
PTE_BLOCK_INNER_SHARE \
- }, { \
.base = 0x400000000UL, \
.size = 0x200000000UL, \
.attrs = PTE_BLOCK_MEMTYPE(MT_DEVICE_NGNRNE) | \
PTE_BLOCK_NON_SHARE | \
PTE_BLOCK_PXN | PTE_BLOCK_UXN \
- }, { \
.base = 0x600000000UL, \
.size = 0x800000000UL, \
.attrs = PTE_BLOCK_MEMTYPE(MT_NORMAL) | \
PTE_BLOCK_INNER_SHARE \
- }, { \
.base = 0xe00000000UL, \
.size = 0xf200000000UL, \
.attrs = PTE_BLOCK_MEMTYPE(MT_DEVICE_NGNRNE) | \
PTE_BLOCK_NON_SHARE | \
PTE_BLOCK_PXN | PTE_BLOCK_UXN \
- }, \
- }
/* Have release address at the end of 256MB for now */ #define CPU_RELEASE_ADDR 0xFFFFFF0
No problem with this default map. I didn't check every entry; Siva will look at it. My problem is that this is static. You should add an option to generate this table dynamically. The reason is that systems can run just
The way the code works today, I could easily just use a 0-size element as array terminator and use an external pointer rather than the local memory map array.
However, that conflicts with a runtime memory-constrained approach to determining the page table size. The only good path I could come up with here is to generate the page table at build time and then record how big it was. We couldn't do that with tables that change at runtime.
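To make the terminator idea concrete, here is a minimal sketch. struct mm_region and add_map() are the ones introduced by this series; the example regions, the attribute values and the setup_all_pgtables() name are purely illustrative assumptions:

/* Board file: map terminated by a zero-size entry, exported through a
 * pointer instead of the local CONFIG_SYS_MEM_MAP array. */
static struct mm_region board_mem_map[] = {
	{ .base = 0x0UL,        .size = 0x80000000UL, .attrs = 0 /* RAM attrs */ },
	{ .base = 0x80000000UL, .size = 0x70000000UL, .attrs = 0 /* MMIO attrs */ },
	{ .base = 0, .size = 0, .attrs = 0 },	/* zero size ends the list */
};

struct mm_region *mem_map = board_mem_map;

/* Generic code would then walk the list until it hits the terminator. */
static void setup_all_pgtables(void)
{
	int i;

	for (i = 0; mem_map[i].size; i++)
		add_map(&mem_map[i]);
}

The open question above still stands, though: with such a runtime-defined list, the build can no longer know how big the page table pool has to be.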
A 0-size element is not enough. The problem is the base addresses too. The ZynqMP memory map is just this:
1. 0-2GB PS memory up to 2GB
2. 2GB-3GB PL part
3. 3GB-4GB PCIe + device + small memories OCM/TCM
4. 4GB-16GB reserved
5. 16-24GB PL part
6. 24-32GB PCIe
7. 32-64GB DDR
8. 64-512GB PL
9. 512-768GB PCIe
10. 768GB-1TB DDR
11. 1TB-16TB PL
PL regions 2, 5 and 11 can also contain memories, but you can find those via DT for a particular HW design, and their start addresses can be anywhere in those ranges. In the Xilinx tree I have patches for reading the memory configuration from DT and saving it to gd->bd->bi_dram based on the CONFIG_NR_DRAM_BANKS setting, and I think this should be wired up. It means you should only set up memory attributes for memories which U-Boot is aware of. For the rest of the address space it is not that sensitive.
with PL DDR rather than PS DDR, and you can't map memory which is not there, which can end up in a system lock. We want to change the current code to set up the MMU table for memories at run time, based on DT.
So in your use case, the regions would still be there, just their size changes. That means we could still have static tables that then get modified by runtime code later on to just span less?
Not entirely. Part of the table can definitely be static, but I would like the memory mapping in particular to be more dynamic.
Of course, with a full device tree, we could just generate the mmu tables completely from dt. But then we'd need a good answer to how we determine the page table pool size...
I don't think it is required, but at least the memory mapping should be more dynamic. As you can see, we are moving towards a U-Boot which is fully configured from DT; nobody will want to keep hand-editing the MMU table, which is why this should be more flexible.
Thanks, Michal

On 23.02.16 14:07, Michal Simek wrote:
Hi,
Before I comment on the rest: you also need to fix the gem driver, because it is using a 1MB mapping.
diff --git a/drivers/net/zynq_gem.c b/drivers/net/zynq_gem.c index b3821c31a91d..cf1376ce1bd7 100644 --- a/drivers/net/zynq_gem.c +++ b/drivers/net/zynq_gem.c @@ -150,10 +150,10 @@ struct emac_bd { };
 #define RX_BUF 32
-/* Page table entries are set to 1MB, or multiples of 1MB
- * (not < 1MB). driver uses less bd's so use 1MB bdspace.
+/* Page table entries are set to 2MB, or multiples of 2MB
+ * (not < 2MB). driver uses less bd's so use 2MB bdspace.
  */
-#define BD_SPACE	0x100000
+#define BD_SPACE	0x200000
 /* BD separation space */
 #define BD_SEPRN_SPACE	(RX_BUF * sizeof(struct emac_bd))
Looks like I didn't reply to this before. The change above makes a lot of sense, but is not required for this patch set. We simply break the page table down to 4k maps.
So to keep the bits I touch in this patch set as small as possible, I think we should (if we really want to) move this into a separate, follow-up patch.
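For readers following the "break the page table down to 4k maps" remark above, here is a deliberately simplified sketch of what such a split amounts to. The real add_map()/split logic in cache_v8.c also handles 1G blocks, allocation from the page table pool, cache flushes and TLB maintenance; the descriptor masks below are spelled out locally rather than taken from mmu.h:

/* Replace one 2 MiB block descriptor with a level-3 table of 512
 * contiguous 4 KiB page descriptors carrying the same attributes, so
 * that a single page inside the range can later get different cache
 * attributes. */
#define BLOCK_OA_MASK	0x0000ffffffe00000ULL	/* 2 MiB block output address */
#define DESC_TYPE_MASK	0x3ULL
#define DESC_TYPE_TABLE	0x3ULL			/* table descriptor */
#define DESC_TYPE_PAGE	0x3ULL			/* level-3 page descriptor */

static void split_2m_block(u64 *block_entry, u64 *new_l3_table)
{
	u64 base = *block_entry & BLOCK_OA_MASK;
	u64 attrs = *block_entry & ~(BLOCK_OA_MASK | DESC_TYPE_MASK);
	int i;

	for (i = 0; i < 512; i++)
		new_l3_table[i] = (base + (u64)i * 4096) | attrs | DESC_TYPE_PAGE;

	*block_entry = (u64)new_l3_table | DESC_TYPE_TABLE;
}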
On 23.2.2016 12:33, Alexander Graf wrote:
On 23.02.16 12:04, Michal Simek wrote:
On 22.2.2016 02:57, Alexander Graf wrote:
Now that we have nice table driven page table creating code that gives us everything we need, move to that.
Signed-off-by: Alexander Graf agraf@suse.de
arch/arm/cpu/armv8/zynqmp/cpu.c | 169 ---------------------------------------- include/configs/xilinx_zynqmp.h | 44 +++++++++++ 2 files changed, 44 insertions(+), 169 deletions(-)
diff --git a/arch/arm/cpu/armv8/zynqmp/cpu.c b/arch/arm/cpu/armv8/zynqmp/cpu.c index c71f291..3f661a9 100644 --- a/arch/arm/cpu/armv8/zynqmp/cpu.c +++ b/arch/arm/cpu/armv8/zynqmp/cpu.c @@ -44,172 +44,3 @@ unsigned int zynqmp_get_silicon_version(void)
return ZYNQMP_CSU_VERSION_SILICON; }
-#ifndef CONFIG_SYS_DCACHE_OFF -#include <asm/armv8/mmu.h>
-#define SECTION_SHIFT_L1 30UL -#define SECTION_SHIFT_L2 21UL -#define BLOCK_SIZE_L0 0x8000000000UL -#define BLOCK_SIZE_L1 (1 << SECTION_SHIFT_L1) -#define BLOCK_SIZE_L2 (1 << SECTION_SHIFT_L2)
-#define TCR_TG1_4K (1 << 31) -#define TCR_EPD1_DISABLE (1 << 23) -#define ZYNQMO_VA_BITS 40 -#define ZYNQMP_TCR TCR_TG1_4K | \
TCR_EPD1_DISABLE | \
TCR_SHARED_OUTER | \
TCR_SHARED_INNER | \
TCR_IRGN_WBWA | \
TCR_ORGN_WBWA | \
TCR_T0SZ(ZYNQMO_VA_BITS)
-#define MEMORY_ATTR PMD_SECT_AF | PMD_SECT_INNER_SHARE | \
PMD_ATTRINDX(MT_NORMAL) | \
PMD_TYPE_SECT
-#define DEVICE_ATTR PMD_SECT_AF | PMD_SECT_PXN | \
PMD_SECT_UXN | PMD_ATTRINDX(MT_DEVICE_NGNRNE) | \
PMD_TYPE_SECT
-/* 4K size is required to place 512 entries in each level */ -#define TLB_TABLE_SIZE 0x1000
-struct attr_tbl {
- u32 num;
- u64 attr;
-};
-static struct attr_tbl attr_tbll1t0[4] = { {16, 0x0},
{8, DEVICE_ATTR},
{32, MEMORY_ATTR},
{456, DEVICE_ATTR}
};
-static struct attr_tbl attr_tbll2t3[4] = { {0x180, DEVICE_ATTR},
{0x40, 0x0},
{0x3F, DEVICE_ATTR},
{0x1, MEMORY_ATTR}
};
-/*
- This mmu table looks as below
- Level 0 table contains two entries to 512GB sizes. One is Level1 Table 0
- and other Level1 Table1.
- Level1 Table0 contains entries for each 1GB from 0 to 511GB.
- Level1 Table1 contains entries for each 1GB from 512GB to 1TB.
- Level2 Table0, Level2 Table1, Level2 Table2 and Level2 Table3 contains
- entries for each 2MB starting from 0GB, 1GB, 2GB and 3GB respectively.
- */
-static void zynqmp_mmu_setup(void) -{
- int el;
- u32 index_attr;
- u64 i, section_l1t0, section_l1t1;
- u64 section_l2t0, section_l2t1, section_l2t2, section_l2t3;
- u64 *level0_table = (u64 *)gd->arch.tlb_addr;
- u64 *level1_table_0 = (u64 *)(gd->arch.tlb_addr + TLB_TABLE_SIZE);
- u64 *level1_table_1 = (u64 *)(gd->arch.tlb_addr + (2 * TLB_TABLE_SIZE));
- u64 *level2_table_0 = (u64 *)(gd->arch.tlb_addr + (3 * TLB_TABLE_SIZE));
- u64 *level2_table_1 = (u64 *)(gd->arch.tlb_addr + (4 * TLB_TABLE_SIZE));
- u64 *level2_table_2 = (u64 *)(gd->arch.tlb_addr + (5 * TLB_TABLE_SIZE));
- u64 *level2_table_3 = (u64 *)(gd->arch.tlb_addr + (6 * TLB_TABLE_SIZE));
- level0_table[0] =
(u64)level1_table_0 | PMD_TYPE_TABLE;
- level0_table[1] =
(u64)level1_table_1 | PMD_TYPE_TABLE;
- /*
* set level 1 table 0, covering 0 to 512GB
* set level 1 table 1, covering 512GB to 1TB
*/
- section_l1t0 = 0;
- section_l1t1 = BLOCK_SIZE_L0;
- index_attr = 0;
- for (i = 0; i < 512; i++) {
level1_table_0[i] = section_l1t0;
level1_table_0[i] |= attr_tbll1t0[index_attr].attr;
attr_tbll1t0[index_attr].num--;
if (attr_tbll1t0[index_attr].num == 0)
index_attr++;
level1_table_1[i] = section_l1t1;
level1_table_1[i] |= DEVICE_ATTR;
section_l1t0 += BLOCK_SIZE_L1;
section_l1t1 += BLOCK_SIZE_L1;
- }
- level1_table_0[0] =
(u64)level2_table_0 | PMD_TYPE_TABLE;
- level1_table_0[1] =
(u64)level2_table_1 | PMD_TYPE_TABLE;
- level1_table_0[2] =
(u64)level2_table_2 | PMD_TYPE_TABLE;
- level1_table_0[3] =
(u64)level2_table_3 | PMD_TYPE_TABLE;
- section_l2t0 = 0;
- section_l2t1 = section_l2t0 + BLOCK_SIZE_L1; /* 1GB */
- section_l2t2 = section_l2t1 + BLOCK_SIZE_L1; /* 2GB */
- section_l2t3 = section_l2t2 + BLOCK_SIZE_L1; /* 3GB */
- index_attr = 0;
- for (i = 0; i < 512; i++) {
level2_table_0[i] = section_l2t0 | MEMORY_ATTR;
level2_table_1[i] = section_l2t1 | MEMORY_ATTR;
level2_table_2[i] = section_l2t2 | DEVICE_ATTR;
level2_table_3[i] = section_l2t3 |
attr_tbll2t3[index_attr].attr;
attr_tbll2t3[index_attr].num--;
if (attr_tbll2t3[index_attr].num == 0)
index_attr++;
section_l2t0 += BLOCK_SIZE_L2;
section_l2t1 += BLOCK_SIZE_L2;
section_l2t2 += BLOCK_SIZE_L2;
section_l2t3 += BLOCK_SIZE_L2;
- }
- /* flush new MMU table */
- flush_dcache_range(gd->arch.tlb_addr,
gd->arch.tlb_addr + gd->arch.tlb_size);
- /* point TTBR to the new table */
- el = current_el();
- set_ttbr_tcr_mair(el, gd->arch.tlb_addr,
ZYNQMP_TCR, MEMORY_ATTRIBUTES);
- set_sctlr(get_sctlr() | CR_M);
-}
-int arch_cpu_init(void) -{
- icache_enable();
- __asm_invalidate_dcache_all();
- __asm_invalidate_tlb_all();
- return 0;
-}
-/*
- This function is called from lib/board.c.
- It recreates MMU table in main memory. MMU and d-cache are enabled earlier.
- There is no need to disable d-cache for this operation.
- */
-void enable_caches(void) -{
- /* The data cache is not active unless the mmu is enabled */
- if (!(get_sctlr() & CR_M)) {
invalidate_dcache_all();
__asm_invalidate_tlb_all();
zynqmp_mmu_setup();
- }
- puts("Enabling Caches...\n");
- set_sctlr(get_sctlr() | CR_C);
-}
-u64 *arch_get_page_table(void) -{
- return (u64 *)(gd->arch.tlb_addr + 0x3000);
-} -#endif diff --git a/include/configs/xilinx_zynqmp.h b/include/configs/xilinx_zynqmp.h index 28622de..439f063 100644 --- a/include/configs/xilinx_zynqmp.h +++ b/include/configs/xilinx_zynqmp.h @@ -29,6 +29,50 @@ #define CONFIG_SYS_MEMTEST_START CONFIG_SYS_SDRAM_BASE #define CONFIG_SYS_MEMTEST_END CONFIG_SYS_SDRAM_SIZE
+#define CONFIG_SYS_FULL_VA +#define CONFIG_SYS_MEM_MAP { \
- { \
.base = 0x0UL, \
.size = 0x80000000UL, \
.attrs = PTE_BLOCK_MEMTYPE(MT_NORMAL) | \
PTE_BLOCK_INNER_SHARE \
- }, { \
.base = 0x80000000UL, \
.size = 0x70000000UL, \
.attrs = PTE_BLOCK_MEMTYPE(MT_DEVICE_NGNRNE) | \
PTE_BLOCK_NON_SHARE | \
PTE_BLOCK_PXN | PTE_BLOCK_UXN \
- }, { \
.base = 0xf8000000UL, \
.size = 0x07e00000UL, \
.attrs = PTE_BLOCK_MEMTYPE(MT_DEVICE_NGNRNE) | \
PTE_BLOCK_NON_SHARE | \
PTE_BLOCK_PXN | PTE_BLOCK_UXN \
- }, { \
.base = 0xffe00000UL, \
.size = 0x00200000UL, \
.attrs = PTE_BLOCK_MEMTYPE(MT_NORMAL) | \
PTE_BLOCK_INNER_SHARE \
- }, { \
.base = 0x400000000UL, \
.size = 0x200000000UL, \
.attrs = PTE_BLOCK_MEMTYPE(MT_DEVICE_NGNRNE) | \
PTE_BLOCK_NON_SHARE | \
PTE_BLOCK_PXN | PTE_BLOCK_UXN \
- }, { \
.base = 0x600000000UL, \
.size = 0x800000000UL, \
.attrs = PTE_BLOCK_MEMTYPE(MT_NORMAL) | \
PTE_BLOCK_INNER_SHARE \
- }, { \
.base = 0xe00000000UL, \
.size = 0xf200000000UL, \
.attrs = PTE_BLOCK_MEMTYPE(MT_DEVICE_NGNRNE) | \
PTE_BLOCK_NON_SHARE | \
PTE_BLOCK_PXN | PTE_BLOCK_UXN \
- }, \
- }
/* Have release address at the end of 256MB for now */ #define CPU_RELEASE_ADDR 0xFFFFFF0
No problem with this default map. I didn't check every entry; Siva will look at it. My problem is that this is static. You should add an option to generate this table dynamically. The reason is that systems can run just
The way the code works today, I could easily just use a 0-size element as array terminator and use an external pointer rather than the local memory map array.
However, that conflicts with a runtime memory-constrained approach to determining the page table size. The only good path I could come up with here is to generate the page table at build time and then record how big it was. We couldn't do that with tables that change at runtime.
A 0-size element is not enough. The problem is the base addresses too. The ZynqMP memory map is just this:
1. 0-2GB PS memory up to 2GB
2. 2GB-3GB PL part
3. 3GB-4GB PCIe + device + small memories OCM/TCM
4. 4GB-16GB reserved
5. 16-24GB PL part
6. 24-32GB PCIe
7. 32-64GB DDR
8. 64-512GB PL
9. 512-768GB PCIe
10. 768GB-1TB DDR
11. 1TB-16TB PL
PL regions 2, 5 and 11 can also contain memories, but you can find those via DT for a particular HW design, and their start addresses can be anywhere in those ranges. In the Xilinx tree I have patches for reading the memory configuration from DT and saving it to gd->bd->bi_dram based on the CONFIG_NR_DRAM_BANKS setting, and I think this should be wired up. It means you should only set up memory attributes for memories which U-Boot is aware of. For the rest of the address space it is not that sensitive.
I think that makes a lot of sense. The current patch set simply moves the logic as it stands today into tables rather than open-coded functions. Modifying the tables dynamically as a next step sounds like a good idea to me.
with PL DDR rather than PS DDR, and you can't map memory which is not there, which can end up in a system lock. We want to change the current code to set up the MMU table for memories at run time, based on DT.
So in your use case, the regions would still be there, just their size changes. That means we could still have static tables that then get modified by runtime code later on to just span less?
Not entirely. Part of the table can definitely be static, but I would like the memory mapping in particular to be more dynamic.
I guess the way the table is now a struct in a .c file works for you? It means you can modify it to whatever extent you like, maybe even generate it completely dynamically if you wanted to.
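Purely as a hypothetical illustration of that flexibility (the hook name, the DRAM entry index and exporting mem_map as a global are my assumptions, not part of this series), a board could patch its DRAM entry from the DT-provided bank information before the page tables get built:

#include <common.h>
#include <asm/armv8/mmu.h>

DECLARE_GLOBAL_DATA_PTR;

extern struct mm_region mem_map[];	/* assumes the map is exported */

/* Hypothetical board hook: shrink the DRAM region to what was read
 * from DT into gd->bd->bi_dram, so nothing outside real memory gets
 * mapped as normal RAM. */
void board_fixup_mem_map(void)
{
	/* entry 1 is assumed to be this board's (first) DRAM region */
	mem_map[1].base = gd->bd->bi_dram[0].start;
	mem_map[1].size = gd->bd->bi_dram[0].size;
}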
Alex

On 26.2.2016 01:49, Alexander Graf wrote:
On 23.02.16 14:07, Michal Simek wrote:
Hi,
Before I comment on the rest: you also need to fix the gem driver, because it is using a 1MB mapping.
diff --git a/drivers/net/zynq_gem.c b/drivers/net/zynq_gem.c index b3821c31a91d..cf1376ce1bd7 100644 --- a/drivers/net/zynq_gem.c +++ b/drivers/net/zynq_gem.c @@ -150,10 +150,10 @@ struct emac_bd { };
 #define RX_BUF 32
-/* Page table entries are set to 1MB, or multiples of 1MB
- * (not < 1MB). driver uses less bd's so use 1MB bdspace.
+/* Page table entries are set to 2MB, or multiples of 2MB
+ * (not < 2MB). driver uses less bd's so use 2MB bdspace.
  */
-#define BD_SPACE	0x100000
+#define BD_SPACE	0x200000
 /* BD separation space */
 #define BD_SEPRN_SPACE	(RX_BUF * sizeof(struct emac_bd))
Looks like I didn't reply to this before. The change above makes a lot of sense, but is not required for this patch set. We simply break the page table down to 4k maps.
Unfortunately, even though 1MB can be divided into 4k pages, this driver stopped working with the v1 series I was testing.
So to keep the bits I touch in this patch set as small as possible, I think we should (if we really want to) move this into a separate, follow-up patch.
It means this has to be solved before this series is applied, because nobody wants to break any driver.
On 23.2.2016 12:33, Alexander Graf wrote:
On 23.02.16 12:04, Michal Simek wrote:
On 22.2.2016 02:57, Alexander Graf wrote:
Now that we have nice table driven page table creating code that gives us everything we need, move to that.
Signed-off-by: Alexander Graf agraf@suse.de
arch/arm/cpu/armv8/zynqmp/cpu.c | 169 ---------------------------------------- include/configs/xilinx_zynqmp.h | 44 +++++++++++ 2 files changed, 44 insertions(+), 169 deletions(-)
diff --git a/arch/arm/cpu/armv8/zynqmp/cpu.c b/arch/arm/cpu/armv8/zynqmp/cpu.c index c71f291..3f661a9 100644 --- a/arch/arm/cpu/armv8/zynqmp/cpu.c +++ b/arch/arm/cpu/armv8/zynqmp/cpu.c @@ -44,172 +44,3 @@ unsigned int zynqmp_get_silicon_version(void)
return ZYNQMP_CSU_VERSION_SILICON; }
-#ifndef CONFIG_SYS_DCACHE_OFF -#include <asm/armv8/mmu.h>
-#define SECTION_SHIFT_L1 30UL -#define SECTION_SHIFT_L2 21UL -#define BLOCK_SIZE_L0 0x8000000000UL -#define BLOCK_SIZE_L1 (1 << SECTION_SHIFT_L1) -#define BLOCK_SIZE_L2 (1 << SECTION_SHIFT_L2)
-#define TCR_TG1_4K (1 << 31) -#define TCR_EPD1_DISABLE (1 << 23) -#define ZYNQMO_VA_BITS 40 -#define ZYNQMP_TCR TCR_TG1_4K | \
TCR_EPD1_DISABLE | \
TCR_SHARED_OUTER | \
TCR_SHARED_INNER | \
TCR_IRGN_WBWA | \
TCR_ORGN_WBWA | \
TCR_T0SZ(ZYNQMO_VA_BITS)
-#define MEMORY_ATTR PMD_SECT_AF | PMD_SECT_INNER_SHARE | \
PMD_ATTRINDX(MT_NORMAL) | \
PMD_TYPE_SECT
-#define DEVICE_ATTR PMD_SECT_AF | PMD_SECT_PXN | \
PMD_SECT_UXN | PMD_ATTRINDX(MT_DEVICE_NGNRNE) | \
PMD_TYPE_SECT
-/* 4K size is required to place 512 entries in each level */ -#define TLB_TABLE_SIZE 0x1000
-struct attr_tbl {
- u32 num;
- u64 attr;
-};
-static struct attr_tbl attr_tbll1t0[4] = { {16, 0x0},
{8, DEVICE_ATTR},
{32, MEMORY_ATTR},
{456, DEVICE_ATTR}
};
-static struct attr_tbl attr_tbll2t3[4] = { {0x180, DEVICE_ATTR},
{0x40, 0x0},
{0x3F, DEVICE_ATTR},
{0x1, MEMORY_ATTR}
};
-/*
- This mmu table looks as below
- Level 0 table contains two entries to 512GB sizes. One is Level1 Table 0
- and other Level1 Table1.
- Level1 Table0 contains entries for each 1GB from 0 to 511GB.
- Level1 Table1 contains entries for each 1GB from 512GB to 1TB.
- Level2 Table0, Level2 Table1, Level2 Table2 and Level2 Table3 contains
- entries for each 2MB starting from 0GB, 1GB, 2GB and 3GB respectively.
- */
-static void zynqmp_mmu_setup(void) -{
- int el;
- u32 index_attr;
- u64 i, section_l1t0, section_l1t1;
- u64 section_l2t0, section_l2t1, section_l2t2, section_l2t3;
- u64 *level0_table = (u64 *)gd->arch.tlb_addr;
- u64 *level1_table_0 = (u64 *)(gd->arch.tlb_addr + TLB_TABLE_SIZE);
- u64 *level1_table_1 = (u64 *)(gd->arch.tlb_addr + (2 * TLB_TABLE_SIZE));
- u64 *level2_table_0 = (u64 *)(gd->arch.tlb_addr + (3 * TLB_TABLE_SIZE));
- u64 *level2_table_1 = (u64 *)(gd->arch.tlb_addr + (4 * TLB_TABLE_SIZE));
- u64 *level2_table_2 = (u64 *)(gd->arch.tlb_addr + (5 * TLB_TABLE_SIZE));
- u64 *level2_table_3 = (u64 *)(gd->arch.tlb_addr + (6 * TLB_TABLE_SIZE));
- level0_table[0] =
(u64)level1_table_0 | PMD_TYPE_TABLE;
- level0_table[1] =
(u64)level1_table_1 | PMD_TYPE_TABLE;
- /*
* set level 1 table 0, covering 0 to 512GB
* set level 1 table 1, covering 512GB to 1TB
*/
- section_l1t0 = 0;
- section_l1t1 = BLOCK_SIZE_L0;
- index_attr = 0;
- for (i = 0; i < 512; i++) {
level1_table_0[i] = section_l1t0;
level1_table_0[i] |= attr_tbll1t0[index_attr].attr;
attr_tbll1t0[index_attr].num--;
if (attr_tbll1t0[index_attr].num == 0)
index_attr++;
level1_table_1[i] = section_l1t1;
level1_table_1[i] |= DEVICE_ATTR;
section_l1t0 += BLOCK_SIZE_L1;
section_l1t1 += BLOCK_SIZE_L1;
- }
- level1_table_0[0] =
(u64)level2_table_0 | PMD_TYPE_TABLE;
- level1_table_0[1] =
(u64)level2_table_1 | PMD_TYPE_TABLE;
- level1_table_0[2] =
(u64)level2_table_2 | PMD_TYPE_TABLE;
- level1_table_0[3] =
(u64)level2_table_3 | PMD_TYPE_TABLE;
- section_l2t0 = 0;
- section_l2t1 = section_l2t0 + BLOCK_SIZE_L1; /* 1GB */
- section_l2t2 = section_l2t1 + BLOCK_SIZE_L1; /* 2GB */
- section_l2t3 = section_l2t2 + BLOCK_SIZE_L1; /* 3GB */
- index_attr = 0;
- for (i = 0; i < 512; i++) {
level2_table_0[i] = section_l2t0 | MEMORY_ATTR;
level2_table_1[i] = section_l2t1 | MEMORY_ATTR;
level2_table_2[i] = section_l2t2 | DEVICE_ATTR;
level2_table_3[i] = section_l2t3 |
attr_tbll2t3[index_attr].attr;
attr_tbll2t3[index_attr].num--;
if (attr_tbll2t3[index_attr].num == 0)
index_attr++;
section_l2t0 += BLOCK_SIZE_L2;
section_l2t1 += BLOCK_SIZE_L2;
section_l2t2 += BLOCK_SIZE_L2;
section_l2t3 += BLOCK_SIZE_L2;
- }
- /* flush new MMU table */
- flush_dcache_range(gd->arch.tlb_addr,
gd->arch.tlb_addr + gd->arch.tlb_size);
- /* point TTBR to the new table */
- el = current_el();
- set_ttbr_tcr_mair(el, gd->arch.tlb_addr,
ZYNQMP_TCR, MEMORY_ATTRIBUTES);
- set_sctlr(get_sctlr() | CR_M);
-}
-int arch_cpu_init(void) -{
- icache_enable();
- __asm_invalidate_dcache_all();
- __asm_invalidate_tlb_all();
- return 0;
-}
-/*
- This function is called from lib/board.c.
- It recreates MMU table in main memory. MMU and d-cache are enabled earlier.
- There is no need to disable d-cache for this operation.
- */
-void enable_caches(void) -{
- /* The data cache is not active unless the mmu is enabled */
- if (!(get_sctlr() & CR_M)) {
invalidate_dcache_all();
__asm_invalidate_tlb_all();
zynqmp_mmu_setup();
- }
- puts("Enabling Caches...\n");
- set_sctlr(get_sctlr() | CR_C);
-}
-u64 *arch_get_page_table(void) -{
- return (u64 *)(gd->arch.tlb_addr + 0x3000);
-} -#endif diff --git a/include/configs/xilinx_zynqmp.h b/include/configs/xilinx_zynqmp.h index 28622de..439f063 100644 --- a/include/configs/xilinx_zynqmp.h +++ b/include/configs/xilinx_zynqmp.h @@ -29,6 +29,50 @@ #define CONFIG_SYS_MEMTEST_START CONFIG_SYS_SDRAM_BASE #define CONFIG_SYS_MEMTEST_END CONFIG_SYS_SDRAM_SIZE
+#define CONFIG_SYS_FULL_VA +#define CONFIG_SYS_MEM_MAP { \
- { \
.base = 0x0UL, \
.size = 0x80000000UL, \
.attrs = PTE_BLOCK_MEMTYPE(MT_NORMAL) | \
PTE_BLOCK_INNER_SHARE \
- }, { \
.base = 0x80000000UL, \
.size = 0x70000000UL, \
.attrs = PTE_BLOCK_MEMTYPE(MT_DEVICE_NGNRNE) | \
PTE_BLOCK_NON_SHARE | \
PTE_BLOCK_PXN | PTE_BLOCK_UXN \
- }, { \
.base = 0xf8000000UL, \
.size = 0x07e00000UL, \
.attrs = PTE_BLOCK_MEMTYPE(MT_DEVICE_NGNRNE) | \
PTE_BLOCK_NON_SHARE | \
PTE_BLOCK_PXN | PTE_BLOCK_UXN \
- }, { \
.base = 0xffe00000UL, \
.size = 0x00200000UL, \
.attrs = PTE_BLOCK_MEMTYPE(MT_NORMAL) | \
PTE_BLOCK_INNER_SHARE \
- }, { \
.base = 0x400000000UL, \
.size = 0x200000000UL, \
.attrs = PTE_BLOCK_MEMTYPE(MT_DEVICE_NGNRNE) | \
PTE_BLOCK_NON_SHARE | \
PTE_BLOCK_PXN | PTE_BLOCK_UXN \
- }, { \
.base = 0x600000000UL, \
.size = 0x800000000UL, \
.attrs = PTE_BLOCK_MEMTYPE(MT_NORMAL) | \
PTE_BLOCK_INNER_SHARE \
- }, { \
.base = 0xe00000000UL, \
.size = 0xf200000000UL, \
.attrs = PTE_BLOCK_MEMTYPE(MT_DEVICE_NGNRNE) | \
PTE_BLOCK_NON_SHARE | \
PTE_BLOCK_PXN | PTE_BLOCK_UXN \
- }, \
- }
/* Have release address at the end of 256MB for now */ #define CPU_RELEASE_ADDR 0xFFFFFF0
No problem with this default map. I didn't check every entry; Siva will look at it. My problem is that this is static. You should add an option to generate this table dynamically. The reason is that systems can run just
The way the code works today, I could easily just use a 0-size element as array terminator and use an external pointer rather than the local memory map array.
However, that conflicts with a runtime memory-constrained approach to determining the page table size. The only good path I could come up with here is to generate the page table at build time and then record how big it was. We couldn't do that with tables that change at runtime.
A 0-size element is not enough. The problem is the base addresses too. The ZynqMP memory map is just this:
1. 0-2GB PS memory up to 2GB
2. 2GB-3GB PL part
3. 3GB-4GB PCIe + device + small memories OCM/TCM
4. 4GB-16GB reserved
5. 16-24GB PL part
6. 24-32GB PCIe
7. 32-64GB DDR
8. 64-512GB PL
9. 512-768GB PCIe
10. 768GB-1TB DDR
11. 1TB-16TB PL
PL regions 2, 5 and 11 can also contain memories, but you can find those via DT for a particular HW design, and their start addresses can be anywhere in those ranges. In the Xilinx tree I have patches for reading the memory configuration from DT and saving it to gd->bd->bi_dram based on the CONFIG_NR_DRAM_BANKS setting, and I think this should be wired up. It means you should only set up memory attributes for memories which U-Boot is aware of. For the rest of the address space it is not that sensitive.
I think that makes a lot of sense. The current patch set simply moves the logic as it stands today into tables rather than open-coded functions. Modifying the tables dynamically as a next step sounds like a good idea to me.
ok
with PL DDR rather than PS DDR, and you can't map memory which is not there, which can end up in a system lock. We want to change the current code to set up the MMU table for memories at run time, based on DT.
So in your use case, the regions would still be there, just their size changes. That means we could still have static tables that then get modified by runtime code later on to just span less?
Not entirely. Part of the table can definitely be static, but I would like the memory mapping in particular to be more dynamic.
I guess the way the table is now a struct in a .c file works for you? It means you can modify it to whatever extent you like, maybe even generate it completely dynamically if you wanted to.
I have seen that and will look at it again when I start to work with HW.
Thanks, Michal

Am 26.02.2016 um 09:29 schrieb Michal Simek michal.simek@xilinx.com:
On 26.2.2016 01:49, Alexander Graf wrote:
On 23.02.16 14:07, Michal Simek wrote: Hi,
Before I comment on the rest: you also need to fix the gem driver, because it is using a 1MB mapping.
diff --git a/drivers/net/zynq_gem.c b/drivers/net/zynq_gem.c index b3821c31a91d..cf1376ce1bd7 100644 --- a/drivers/net/zynq_gem.c +++ b/drivers/net/zynq_gem.c @@ -150,10 +150,10 @@ struct emac_bd { };
 #define RX_BUF 32
-/* Page table entries are set to 1MB, or multiples of 1MB
- * (not < 1MB). driver uses less bd's so use 1MB bdspace.
+/* Page table entries are set to 2MB, or multiples of 2MB
+ * (not < 2MB). driver uses less bd's so use 2MB bdspace.
  */
-#define BD_SPACE	0x100000
+#define BD_SPACE	0x200000
 /* BD separation space */
 #define BD_SEPRN_SPACE	(RX_BUF * sizeof(struct emac_bd))
Looks like I didn't reply to this before. The change above makes a lot of sense, but is not required for this patch set. We simply break the page table down to 4k maps.
Unfortunately, even though 1MB can be divided into 4k pages, this driver stopped working with the v1 series I was testing.
Could you please give v4 a try? I fixed a bunch of bugs in the splitting code in the meantime.
Alex

I'm trying to boot on a Zynq Ultrascale+ (ARM Cortex-A53 64-bit), and u-boot is panicking in create_table with "Insufficient RAM for page table", along with the recommendation "Please increase the size in get_page_table_size()".
What is the likely root cause, and what's the best way to increase the size?

Now that we have nice table driven page table creating code that gives us everything we need, move to that.
Signed-off-by: Alexander Graf agraf@suse.de --- arch/arm/mach-tegra/Makefile | 1 - arch/arm/mach-tegra/arm64-mmu.c | 131 -------------------------------------- include/configs/tegra210-common.h | 16 +++++ 3 files changed, 16 insertions(+), 132 deletions(-) delete mode 100644 arch/arm/mach-tegra/arm64-mmu.c
diff --git a/arch/arm/mach-tegra/Makefile b/arch/arm/mach-tegra/Makefile index b2dbc69..31dd526 100644 --- a/arch/arm/mach-tegra/Makefile +++ b/arch/arm/mach-tegra/Makefile @@ -14,7 +14,6 @@ else obj-$(CONFIG_CMD_ENTERRCM) += cmd_enterrcm.o endif
-obj-$(CONFIG_ARM64) += arm64-mmu.o obj-y += ap.o obj-y += board.o board2.o obj-y += cache.o diff --git a/arch/arm/mach-tegra/arm64-mmu.c b/arch/arm/mach-tegra/arm64-mmu.c deleted file mode 100644 index c227652..0000000 --- a/arch/arm/mach-tegra/arm64-mmu.c +++ /dev/null @@ -1,131 +0,0 @@ -/* - * (C) Copyright 2014 - 2015 Xilinx, Inc. - * Michal Simek michal.simek@xilinx.com - * (This file derived from arch/arm/cpu/armv8/zynqmp/cpu.c) - * - * Copyright (c) 2015, NVIDIA CORPORATION. All rights reserved. - * - * SPDX-License-Identifier: GPL-2.0+ - */ - -#include <common.h> -#include <asm/system.h> -#include <asm/armv8/mmu.h> - -DECLARE_GLOBAL_DATA_PTR; - -#define SECTION_SHIFT_L1 30UL -#define SECTION_SHIFT_L2 21UL -#define BLOCK_SIZE_L0 0x8000000000UL -#define BLOCK_SIZE_L1 (1 << SECTION_SHIFT_L1) -#define BLOCK_SIZE_L2 (1 << SECTION_SHIFT_L2) - -#define TCR_TG1_4K (1 << 31) -#define TCR_EPD1_DISABLE (1 << 23) -#define TEGRA_VA_BITS 40 -#define TEGRA_TCR TCR_TG1_4K | \ - TCR_EPD1_DISABLE | \ - TCR_SHARED_OUTER | \ - TCR_SHARED_INNER | \ - TCR_IRGN_WBWA | \ - TCR_ORGN_WBWA | \ - TCR_T0SZ(TEGRA_VA_BITS) - -#define MEMORY_ATTR PMD_SECT_AF | PMD_SECT_INNER_SHARE | \ - PMD_ATTRINDX(MT_NORMAL) | \ - PMD_TYPE_SECT -#define DEVICE_ATTR PMD_SECT_AF | PMD_SECT_PXN | \ - PMD_SECT_UXN | PMD_ATTRINDX(MT_DEVICE_NGNRNE) | \ - PMD_TYPE_SECT - -/* 4K size is required to place 512 entries in each level */ -#define TLB_TABLE_SIZE 0x1000 - -/* - * This mmu table looks as below - * Level 0 table contains two entries to 512GB sizes. One is Level1 Table 0 - * and other Level1 Table1. - * Level1 Table0 contains entries for each 1GB from 0 to 511GB. - * Level1 Table1 contains entries for each 1GB from 512GB to 1TB. - * Level2 Table0, Level2 Table1, Level2 Table2 and Level2 Table3 contains - * entries for each 2MB starting from 0GB, 1GB, 2GB and 3GB respectively. 
- */ -void mmu_setup(void) -{ - int el; - u64 i, section_l1t0, section_l1t1; - u64 section_l2t0, section_l2t1, section_l2t2, section_l2t3; - u64 *level0_table = (u64 *)gd->arch.tlb_addr; - u64 *level1_table_0 = (u64 *)(gd->arch.tlb_addr + TLB_TABLE_SIZE); - u64 *level1_table_1 = (u64 *)(gd->arch.tlb_addr + (2 * TLB_TABLE_SIZE)); - u64 *level2_table_0 = (u64 *)(gd->arch.tlb_addr + (3 * TLB_TABLE_SIZE)); - u64 *level2_table_1 = (u64 *)(gd->arch.tlb_addr + (4 * TLB_TABLE_SIZE)); - u64 *level2_table_2 = (u64 *)(gd->arch.tlb_addr + (5 * TLB_TABLE_SIZE)); - u64 *level2_table_3 = (u64 *)(gd->arch.tlb_addr + (6 * TLB_TABLE_SIZE)); - - /* Invalidate all table entries */ - memset(level0_table, 0, PGTABLE_SIZE); - - level0_table[0] = - (u64)level1_table_0 | PMD_TYPE_TABLE; - level0_table[1] = - (u64)level1_table_1 | PMD_TYPE_TABLE; - - /* - * set level 1 table 0, covering 0 to 512GB - * set level 1 table 1, covering 512GB to 1TB - */ - section_l1t0 = 0; - section_l1t1 = BLOCK_SIZE_L0; - - for (i = 0; i < 512; i++) { - level1_table_0[i] = section_l1t0; - if (i >= 4) - level1_table_0[i] |= MEMORY_ATTR; - level1_table_1[i] = section_l1t1; - level1_table_1[i] |= MEMORY_ATTR; - section_l1t0 += BLOCK_SIZE_L1; - section_l1t1 += BLOCK_SIZE_L1; - } - - level1_table_0[0] = - (u64)level2_table_0 | PMD_TYPE_TABLE; - level1_table_0[1] = - (u64)level2_table_1 | PMD_TYPE_TABLE; - level1_table_0[2] = - (u64)level2_table_2 | PMD_TYPE_TABLE; - level1_table_0[3] = - (u64)level2_table_3 | PMD_TYPE_TABLE; - - section_l2t0 = 0; - section_l2t1 = section_l2t0 + BLOCK_SIZE_L1; /* 1GB */ - section_l2t2 = section_l2t1 + BLOCK_SIZE_L1; /* 2GB */ - section_l2t3 = section_l2t2 + BLOCK_SIZE_L1; /* 3GB */ - - for (i = 0; i < 512; i++) { - level2_table_0[i] = section_l2t0 | DEVICE_ATTR; - level2_table_1[i] = section_l2t1 | DEVICE_ATTR; - level2_table_2[i] = section_l2t2 | MEMORY_ATTR; - level2_table_3[i] = section_l2t3 | MEMORY_ATTR; - section_l2t0 += BLOCK_SIZE_L2; - section_l2t1 += BLOCK_SIZE_L2; - section_l2t2 += BLOCK_SIZE_L2; - section_l2t3 += BLOCK_SIZE_L2; - } - - /* flush new MMU table */ - flush_dcache_range(gd->arch.tlb_addr, - gd->arch.tlb_addr + gd->arch.tlb_size); - - /* point TTBR to the new table */ - el = current_el(); - set_ttbr_tcr_mair(el, gd->arch.tlb_addr, - TEGRA_TCR, MEMORY_ATTRIBUTES); - - set_sctlr(get_sctlr() | CR_M); -} - -u64 *arch_get_page_table(void) -{ - return (u64 *)(gd->arch.tlb_addr + (3 * TLB_TABLE_SIZE)); -} diff --git a/include/configs/tegra210-common.h b/include/configs/tegra210-common.h index 8f35a7b..5a664b3 100644 --- a/include/configs/tegra210-common.h +++ b/include/configs/tegra210-common.h @@ -13,6 +13,22 @@ /* Cortex-A57 uses a cache line size of 64 bytes */ #define CONFIG_SYS_CACHELINE_SIZE 64
+#define CONFIG_SYS_FULL_VA +#define CONFIG_SYS_MEM_MAP { \ + { \ + .base = 0x0UL, \ + .size = 0x80000000UL, \ + .attrs = PTE_BLOCK_MEMTYPE(MT_DEVICE_NGNRNE) | \ + PTE_BLOCK_NON_SHARE | \ + PTE_BLOCK_PXN | PTE_BLOCK_UXN \ + }, { \ + .base = 0x80000000UL, \ + .size = 0xff80000000UL, \ + .attrs = PTE_BLOCK_MEMTYPE(MT_NORMAL) | \ + PTE_BLOCK_INNER_SHARE \ + }, \ + } + /* * NS16550 Configuration */

On 02/21/2016 06:57 PM, Alexander Graf wrote:
Now that we have nice table driven page table creating code that gives us everything we need, move to that.
diff --git a/include/configs/tegra210-common.h b/include/configs/tegra210-common.h
+#define CONFIG_SYS_FULL_VA +#define CONFIG_SYS_MEM_MAP { \
- { \
.base = 0x0UL, \
.size = 0x80000000UL, \
.attrs = PTE_BLOCK_MEMTYPE(MT_DEVICE_NGNRNE) | \
PTE_BLOCK_NON_SHARE | \
PTE_BLOCK_PXN | PTE_BLOCK_UXN \
- }, { \
.base = 0x80000000UL, \
.size = 0xff80000000UL, \
.attrs = PTE_BLOCK_MEMTYPE(MT_NORMAL) | \
PTE_BLOCK_INNER_SHARE \
- }, \
- }
I'd prefer a layout that didn't align the closing } for different nesting levels in the same column. To avoid indenting everything a lot, it seems simplest to pull the final } back into the first column.
I believe the .size field of the second entry in the array only needs to be 0x80000000. Testing with a PCIe Ethernet card on p2371-2180 (the driver for which sets up noncached entries in the page tables, hence should exercise all this code) confirms that.
While recent Tegra systems do support more than 2GB of RAM, U-Boot will itself only use the first 2GB, so that PAs over 4GB are not used. See board_get_usable_ram_top() in arch/arm/mach-tegra/board2.c. That's because some peripherals can only access 32-bit PAs, and the simplest way to accommodate that is to ignore any RAM above the 32-bit limit.
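As a rough illustration of that policy (this is not the actual Tegra code from board2.c, just the shape of the idea, with the 4 GiB boundary written out as a constant):

#include <common.h>

DECLARE_GLOBAL_DATA_PTR;

/* Report a usable RAM top that keeps all of U-Boot's own allocations
 * below the 32-bit PA boundary, even when more RAM is present. */
ulong board_get_usable_ram_top(ulong total_size)
{
	u64 top = (u64)CONFIG_SYS_SDRAM_BASE + gd->ram_size;

	if (top > 0x100000000ULL)
		top = 0x100000000ULL;

	return top;
}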

On 22.2.2016 19:28, Stephen Warren wrote:
On 02/21/2016 06:57 PM, Alexander Graf wrote:
Now that we have nice table driven page table creating code that gives us everything we need, move to that.
diff --git a/include/configs/tegra210-common.h b/include/configs/tegra210-common.h
+#define CONFIG_SYS_FULL_VA +#define CONFIG_SYS_MEM_MAP { \
- { \
.base = 0x0UL, \
.size = 0x80000000UL, \
.attrs = PTE_BLOCK_MEMTYPE(MT_DEVICE_NGNRNE) | \
PTE_BLOCK_NON_SHARE | \
PTE_BLOCK_PXN | PTE_BLOCK_UXN \
- }, { \
.base = 0x80000000UL, \
.size = 0xff80000000UL, \
.attrs = PTE_BLOCK_MEMTYPE(MT_NORMAL) | \
PTE_BLOCK_INNER_SHARE \
- }, \
- }
I'd prefer a layout that didn't align the closing } for different nesting levels in the same column. To avoid indenting everything a lot, it seems simplest to pull the final } back into the first column.
I believe the .size field of the second entry in the array only needs to be 0x80000000. Testing with a PCIe Ethernet card on p2371-2180 (the driver for which sets up noncached entries in the page tables, hence should exercise all this code) confirms that.
While recent Tegra systems do support more than 2GB of RAM, U-Boot will itself only use the first 2GB, so that PAs over 4GB are not used. See board_get_usable_ram_top() in arch/arm/mach-tegra/board2.c. That's because some peripherals can only access 32-bit PAs, and the simplest way to accommodate that is to ignore any RAM above the 32-bit limit.
Didn't you use mtest to test memory above 2GB?
Thanks, Michal

On 02/23/2016 03:37 AM, Michal Simek wrote:
On 22.2.2016 19:28, Stephen Warren wrote:
On 02/21/2016 06:57 PM, Alexander Graf wrote:
Now that we have nice table driven page table creating code that gives us everything we need, move to that.
diff --git a/include/configs/tegra210-common.h b/include/configs/tegra210-common.h
+#define CONFIG_SYS_FULL_VA +#define CONFIG_SYS_MEM_MAP { \
- { \
.base = 0x0UL, \
.size = 0x80000000UL, \
.attrs = PTE_BLOCK_MEMTYPE(MT_DEVICE_NGNRNE) | \
PTE_BLOCK_NON_SHARE | \
PTE_BLOCK_PXN | PTE_BLOCK_UXN \
- }, { \
.base = 0x80000000UL, \
.size = 0xff80000000UL, \
.attrs = PTE_BLOCK_MEMTYPE(MT_NORMAL) | \
PTE_BLOCK_INNER_SHARE \
- }, \
- }
I'd prefer a layout that didn't align the closing } for different nesting levels in the same column. To avoid indenting everything a lot, it seems simplest to pull the final } back into the first column.
I believe the .size field of the second entry in the array only needs to be 0x80000000. Testing with a PCIe Ethernet card on p2371-2180 (the driver for which sets up noncached entries in the page tables, hence should exercise all this code) confirms that.
While recent Tegra systems do support more than 2GB of RAM, U-Boot will itself only use the first 2GB, so that PAs over 4GB are not used. See board_get_usable_ram_top() in arch/arm/mach-tegra/board2.c. That's because some peripherals can only access 32-bit PAs, and the simplest way to accommodate that is to ignore any RAM above the 32-bit limit.
Didn't you use mtest to test memory above 2GB?
It looks like we don't have mtest enabled.
However, I was able to use itest to confirm that RAM > 4GB PA does work with this patch. I suppose we may as well leave it enabled then.

On 22.02.16 19:28, Stephen Warren wrote:
On 02/21/2016 06:57 PM, Alexander Graf wrote:
Now that we have nice table driven page table creating code that gives us everything we need, move to that.
diff --git a/include/configs/tegra210-common.h b/include/configs/tegra210-common.h
+#define CONFIG_SYS_FULL_VA +#define CONFIG_SYS_MEM_MAP { \
- { \
.base = 0x0UL, \
.size = 0x80000000UL, \
.attrs = PTE_BLOCK_MEMTYPE(MT_DEVICE_NGNRNE) | \
PTE_BLOCK_NON_SHARE | \
PTE_BLOCK_PXN | PTE_BLOCK_UXN \
- }, { \
.base = 0x80000000UL, \
.size = 0xff80000000UL, \
.attrs = PTE_BLOCK_MEMTYPE(MT_NORMAL) | \
PTE_BLOCK_INNER_SHARE \
- }, \
- }
I'd prefer a layout that didn't align the closing } for different nesting levels in the same column. To avoid indenting everything a lot, it seems simplest to pull the final } back into the first column.
Seems like people want this as structs in board files rather than a #define in a header anyway, to enable more flexible code and to make sure the table can be built based on DT. So there we get natural } alignment again :).
I believe the .size field of the second entry in the array only needs to be 0x80000000. Testing with a PCIe Ethernet card on p2371-2180 (the driver for which sets up noncached entries in the page tables, hence should exercise all this code) confirms that.
I was surprised to see the full map in your code too, but I wanted to make this patch with as little behavioral change as possible (for bisect reasons). So the page table that gets constructed before and after should be almost identical.
If we want to change behavior later on, I'd much rather like to see that in a follow-up patch independent of this set.
Alex
While recent Tegra systems do support more than 2GB of RAM, U-Boot will itself only use the first 2GB, so that PAs over 4GB are not used. See board_get_usable_ram_top() in arch/arm/mach-tegra/board2.c. That's because some peripherals can only access 32-bit PAs, and the simplest way to accommodate that is to ignore any RAM above the 32-bit limit.

There's no good excuse for running with caches disabled on AArch64, so let's just move the vexpress64 target to enable the MMU and run with caches on.
Signed-off-by: Alexander Graf agraf@suse.de --- include/configs/vexpress_aemv8a.h | 20 +++++++++++++++++--- 1 file changed, 17 insertions(+), 3 deletions(-)
diff --git a/include/configs/vexpress_aemv8a.h b/include/configs/vexpress_aemv8a.h index 133041b..9689231 100644 --- a/include/configs/vexpress_aemv8a.h +++ b/include/configs/vexpress_aemv8a.h @@ -19,9 +19,23 @@
#define CONFIG_SUPPORT_RAW_INITRD
-/* Cache Definitions */ -#define CONFIG_SYS_DCACHE_OFF -#define CONFIG_SYS_ICACHE_OFF +/* MMU Definitions */ +#define CONFIG_SYS_CACHELINE_SIZE 64 +#define CONFIG_SYS_FULL_VA +#define CONFIG_SYS_MEM_MAP { \ + { \ + .base = 0x0UL, \ + .size = 0x80000000UL, \ + .attrs = PTE_BLOCK_MEMTYPE(MT_DEVICE_NGNRNE) | \ + PTE_BLOCK_NON_SHARE | \ + PTE_BLOCK_PXN | PTE_BLOCK_UXN \ + }, { \ + .base = 0x80000000UL, \ + .size = 0xff80000000UL, \ + .attrs = PTE_BLOCK_MEMTYPE(MT_NORMAL) | \ + PTE_BLOCK_INNER_SHARE \ + }, \ + }
#define CONFIG_IDENT_STRING " vexpress_aemv8a" #define CONFIG_BOOTP_VCI_STRING "U-Boot.armv8.vexpress_aemv8a"

When enabling dcache on HiKey, we run into MMC command timeouts because our retry loop is now faster than the eMMC (or an external SD card) can answer.
Increase the retry count to match the timeout value used for status reports.
The real fix is obviously to not base this whole thing on a cycle counter but on real wall time, but that would be slightly more intrusive.
Signed-off-by: Alexander Graf agraf@suse.de --- drivers/mmc/dw_mmc.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/mmc/dw_mmc.c b/drivers/mmc/dw_mmc.c index 909e3ca..7329f40 100644 --- a/drivers/mmc/dw_mmc.c +++ b/drivers/mmc/dw_mmc.c @@ -189,7 +189,7 @@ static int dwmci_send_cmd(struct mmc *mmc, struct mmc_cmd *cmd, data ? DIV_ROUND_UP(data->blocks, 8) : 0); int ret = 0, flags = 0, i; unsigned int timeout = 100000; - u32 retry = 10000; + u32 retry = 100000; u32 mask, ctrl; ulong start = get_timer(0); struct bounce_buffer bbstate;
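For reference, a small sketch of what a wall-time based wait could look like; only get_timer() is taken from U-Boot proper, the poll_for_ms() helper and its callback are illustrative rather than a proposed change to dw_mmc.c:

#include <common.h>
#include <errno.h>

/* Poll a completion predicate for a fixed number of milliseconds of
 * wall time instead of a fixed iteration count, so the timeout no
 * longer depends on CPU speed or whether caches are enabled. */
static int poll_for_ms(int (*done)(void *priv), void *priv, ulong timeout_ms)
{
	ulong start = get_timer(0);

	while (!done(priv)) {
		if (get_timer(start) > timeout_ms)
			return -ETIMEDOUT;
	}

	return 0;
}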

The hikey runs with dcache disabled today. There really should be no reason not to use caches on AArch64, so let's add MMU definitions and enable the dcache.
Signed-off-by: Alexander Graf agraf@suse.de --- include/configs/hikey.h | 19 +++++++++++++++++-- 1 file changed, 17 insertions(+), 2 deletions(-)
diff --git a/include/configs/hikey.h b/include/configs/hikey.h index 796861e..ed0336c 100644 --- a/include/configs/hikey.h +++ b/include/configs/hikey.h @@ -21,8 +21,23 @@
#define CONFIG_SUPPORT_RAW_INITRD
-/* Cache Definitions */ -#define CONFIG_SYS_DCACHE_OFF +/* MMU Definitions */ +#define CONFIG_SYS_CACHELINE_SIZE 64 +#define CONFIG_SYS_FULL_VA +#define CONFIG_SYS_MEM_MAP { \ + { \ + .base = 0x0UL, \ + .size = 0x80000000UL, \ + .attrs = PTE_BLOCK_MEMTYPE(MT_NORMAL) | \ + PTE_BLOCK_INNER_SHARE \ + }, { \ + .base = 0x80000000UL, \ + .size = 0x80000000UL, \ + .attrs = PTE_BLOCK_MEMTYPE(MT_DEVICE_NGNRNE) | \ + PTE_BLOCK_NON_SHARE | \ + PTE_BLOCK_PXN | PTE_BLOCK_UXN \ + }, \ + }
#define CONFIG_IDENT_STRING "hikey"

By now the code to only have a single page table level with 64k page size and 42 bit address space is no longer used by any board in tree, so we can safely remove it.
To clean up code, move the layerscape mmu code to the new defines, removing redundant field definitions.
Signed-off-by: Alexander Graf agraf@suse.de --- arch/arm/cpu/armv8/cache_v8.c | 94 ++------------------------ arch/arm/cpu/armv8/fsl-layerscape/cpu.c | 29 ++++++-- arch/arm/include/asm/arch-fsl-layerscape/cpu.h | 94 +++++++++++++------------- arch/arm/include/asm/armv8/mmu.h | 66 +----------------- arch/arm/include/asm/global_data.h | 2 +- arch/arm/include/asm/system.h | 4 -- doc/README.arm64 | 20 ------ include/configs/hikey.h | 1 - include/configs/tegra210-common.h | 1 - include/configs/thunderx_88xx.h | 14 ---- include/configs/vexpress_aemv8a.h | 1 - include/configs/xilinx_zynqmp.h | 1 - 12 files changed, 77 insertions(+), 250 deletions(-)
diff --git a/arch/arm/cpu/armv8/cache_v8.c b/arch/arm/cpu/armv8/cache_v8.c index 4369a83..c199ad1 100644 --- a/arch/arm/cpu/armv8/cache_v8.c +++ b/arch/arm/cpu/armv8/cache_v8.c @@ -43,8 +43,11 @@ DECLARE_GLOBAL_DATA_PTR; * off: FFF */
-#ifdef CONFIG_SYS_FULL_VA +#ifdef CONFIG_SYS_MEM_MAP static struct mm_region mem_map[] = CONFIG_SYS_MEM_MAP; +#else +static struct mm_region mem_map[] = { }; +#endif
static u64 get_tcr(int el, u64 *pips, u64 *pva_bits) { @@ -290,38 +293,11 @@ static void setup_pgtables(void) add_map(&mem_map[i]); }
-#else - -inline void set_pgtable_section(u64 *page_table, u64 index, u64 section, - u64 memory_type, u64 attribute) -{ - u64 value; - - value = section | PMD_TYPE_SECT | PMD_SECT_AF; - value |= PMD_ATTRINDX(memory_type); - value |= attribute; - page_table[index] = value; -} - -inline void set_pgtable_table(u64 *page_table, u64 index, u64 *table_addr) -{ - u64 value; - - value = (u64)table_addr | PMD_TYPE_TABLE; - page_table[index] = value; -} -#endif - /* to activate the MMU we need to set up virtual memory */ __weak void mmu_setup(void) { -#ifndef CONFIG_SYS_FULL_VA - bd_t *bd = gd->bd; - u64 *page_table = (u64 *)gd->arch.tlb_addr, i, j; -#endif int el;
-#ifdef CONFIG_SYS_FULL_VA /* Set up page tables only once */ if (!gd->arch.tlb_fillptr) setup_pgtables(); @@ -329,40 +305,6 @@ __weak void mmu_setup(void) el = current_el(); set_ttbr_tcr_mair(el, gd->arch.tlb_addr, get_tcr(el, NULL, NULL), MEMORY_ATTRIBUTES); -#else - /* Setup an identity-mapping for all spaces */ - for (i = 0; i < (PGTABLE_SIZE >> 3); i++) { - set_pgtable_section(page_table, i, i << SECTION_SHIFT, - MT_DEVICE_NGNRNE, PMD_SECT_NON_SHARE); - } - - /* Setup an identity-mapping for all RAM space */ - for (i = 0; i < CONFIG_NR_DRAM_BANKS; i++) { - ulong start = bd->bi_dram[i].start; - ulong end = bd->bi_dram[i].start + bd->bi_dram[i].size; - for (j = start >> SECTION_SHIFT; - j < end >> SECTION_SHIFT; j++) { - set_pgtable_section(page_table, j, j << SECTION_SHIFT, - MT_NORMAL, PMD_SECT_NON_SHARE); - } - } - - /* load TTBR0 */ - el = current_el(); - if (el == 1) { - set_ttbr_tcr_mair(el, gd->arch.tlb_addr, - TCR_EL1_RSVD | TCR_FLAGS | TCR_EL1_IPS_BITS, - MEMORY_ATTRIBUTES); - } else if (el == 2) { - set_ttbr_tcr_mair(el, gd->arch.tlb_addr, - TCR_EL2_RSVD | TCR_FLAGS | TCR_EL2_IPS_BITS, - MEMORY_ATTRIBUTES); - } else { - set_ttbr_tcr_mair(el, gd->arch.tlb_addr, - TCR_EL3_RSVD | TCR_FLAGS | TCR_EL3_IPS_BITS, - MEMORY_ATTRIBUTES); - } -#endif
/* enable the mmu */ set_sctlr(get_sctlr() | CR_M); @@ -448,33 +390,6 @@ u64 *__weak arch_get_page_table(void) { return NULL; }
-#ifndef CONFIG_SYS_FULL_VA -void mmu_set_region_dcache_behaviour(phys_addr_t start, size_t size, - enum dcache_option option) -{ - u64 *page_table = arch_get_page_table(); - u64 upto, end; - - if (page_table == NULL) - return; - - end = ALIGN(start + size, (1 << MMU_SECTION_SHIFT)) >> - MMU_SECTION_SHIFT; - start = start >> MMU_SECTION_SHIFT; - for (upto = start; upto < end; upto++) { - page_table[upto] &= ~PMD_ATTRINDX_MASK; - page_table[upto] |= PMD_ATTRINDX(option); - } - asm volatile("dsb sy"); - __asm_invalidate_tlb_all(); - asm volatile("dsb sy"); - asm volatile("isb"); - start = start << MMU_SECTION_SHIFT; - end = end << MMU_SECTION_SHIFT; - flush_dcache_range(start, end); - asm volatile("dsb sy"); -} -#else static bool is_aligned(u64 addr, u64 size, u64 align) { return !(addr & (align - 1)) && !(size & (align - 1)); @@ -547,7 +462,6 @@ void mmu_set_region_dcache_behaviour(phys_addr_t start, size_t size, flush_dcache_range(real_start, real_start + real_size); asm volatile("dsb sy"); } -#endif
#else /* CONFIG_SYS_DCACHE_OFF */
diff --git a/arch/arm/cpu/armv8/fsl-layerscape/cpu.c b/arch/arm/cpu/armv8/fsl-layerscape/cpu.c index 6ea28ed..1dd33e0 100644 --- a/arch/arm/cpu/armv8/fsl-layerscape/cpu.c +++ b/arch/arm/cpu/armv8/fsl-layerscape/cpu.c @@ -48,6 +48,25 @@ void cpu_name(char *name) }
#ifndef CONFIG_SYS_DCACHE_OFF +static void set_pgtable_section(u64 *page_table, u64 index, u64 section, + u64 memory_type, u64 attribute) +{ + u64 value; + + value = section | PTE_TYPE_BLOCK | PTE_BLOCK_AF; + value |= PMD_ATTRINDX(memory_type); + value |= attribute; + page_table[index] = value; +} + +static void set_pgtable_table(u64 *page_table, u64 index, u64 *table_addr) +{ + u64 value; + + value = (u64)table_addr | PTE_TYPE_TABLE; + page_table[index] = value; +} + /* * Set the block entries according to the information of the table. */ @@ -114,10 +133,10 @@ static int find_table(const struct sys_mmu_table *list,
temp_base -= block_size;
- if ((level_table[index - 1] & PMD_TYPE_MASK) == - PMD_TYPE_TABLE) { + if ((level_table[index - 1] & PTE_TYPE_MASK) == + PTE_TYPE_TABLE) { level_table = (u64 *)(level_table[index - 1] & - ~PMD_TYPE_MASK); + ~PTE_TYPE_MASK); level++; continue; } else { @@ -220,7 +239,7 @@ static inline int final_secure_ddr(u64 *level0_table, struct table_info table = {}; struct sys_mmu_table ddr_entry = { 0, 0, BLOCK_SIZE_L1, MT_NORMAL, - PMD_SECT_OUTER_SHARE | PMD_SECT_NS + PTE_BLOCK_OUTER_SHARE | PTE_BLOCK_NS }; u64 index;
@@ -243,7 +262,7 @@ static inline int final_secure_ddr(u64 *level0_table, ddr_entry.virt_addr = phys_addr; ddr_entry.phys_addr = phys_addr; ddr_entry.size = CONFIG_SYS_MEM_RESERVE_SECURE; - ddr_entry.attribute = PMD_SECT_OUTER_SHARE; + ddr_entry.attribute = PTE_BLOCK_OUTER_SHARE; ret = find_table(&ddr_entry, &table, level0_table); if (ret) { printf("MMU error: could not find secure ddr table\n"); diff --git a/arch/arm/include/asm/arch-fsl-layerscape/cpu.h b/arch/arm/include/asm/arch-fsl-layerscape/cpu.h index 15ade84..93bbda3 100644 --- a/arch/arm/include/asm/arch-fsl-layerscape/cpu.h +++ b/arch/arm/include/asm/arch-fsl-layerscape/cpu.h @@ -117,48 +117,48 @@ static const struct sys_mmu_table early_mmu_table[] = { #ifdef CONFIG_FSL_LSCH3 { CONFIG_SYS_FSL_CCSR_BASE, CONFIG_SYS_FSL_CCSR_BASE, CONFIG_SYS_FSL_CCSR_SIZE, MT_DEVICE_NGNRNE, - PMD_SECT_NON_SHARE | PMD_SECT_PXN | PMD_SECT_UXN }, + PTE_BLOCK_NON_SHARE | PTE_BLOCK_PXN | PTE_BLOCK_UXN }, { CONFIG_SYS_FSL_OCRAM_BASE, CONFIG_SYS_FSL_OCRAM_BASE, - CONFIG_SYS_FSL_OCRAM_SIZE, MT_NORMAL, PMD_SECT_NON_SHARE }, + CONFIG_SYS_FSL_OCRAM_SIZE, MT_NORMAL, PTE_BLOCK_NON_SHARE }, /* For IFC Region #1, only the first 4MB is cache-enabled */ { CONFIG_SYS_FSL_IFC_BASE1, CONFIG_SYS_FSL_IFC_BASE1, - CONFIG_SYS_FSL_IFC_SIZE1_1, MT_NORMAL, PMD_SECT_NON_SHARE }, + CONFIG_SYS_FSL_IFC_SIZE1_1, MT_NORMAL, PTE_BLOCK_NON_SHARE }, { CONFIG_SYS_FSL_IFC_BASE1 + CONFIG_SYS_FSL_IFC_SIZE1_1, CONFIG_SYS_FSL_IFC_BASE1 + CONFIG_SYS_FSL_IFC_SIZE1_1, CONFIG_SYS_FSL_IFC_SIZE1 - CONFIG_SYS_FSL_IFC_SIZE1_1, - MT_DEVICE_NGNRNE, PMD_SECT_NON_SHARE }, + MT_DEVICE_NGNRNE, PTE_BLOCK_NON_SHARE }, { CONFIG_SYS_FLASH_BASE, CONFIG_SYS_FSL_IFC_BASE1, - CONFIG_SYS_FSL_IFC_SIZE1, MT_DEVICE_NGNRNE, PMD_SECT_NON_SHARE }, + CONFIG_SYS_FSL_IFC_SIZE1, MT_DEVICE_NGNRNE, PTE_BLOCK_NON_SHARE }, { CONFIG_SYS_FSL_DRAM_BASE1, CONFIG_SYS_FSL_DRAM_BASE1, CONFIG_SYS_FSL_DRAM_SIZE1, MT_NORMAL, - PMD_SECT_OUTER_SHARE | PMD_SECT_NS }, + PTE_BLOCK_OUTER_SHARE | PTE_BLOCK_NS }, /* Map IFC region #2 up to CONFIG_SYS_FLASH_BASE for NAND boot */ { CONFIG_SYS_FSL_IFC_BASE2, CONFIG_SYS_FSL_IFC_BASE2, CONFIG_SYS_FLASH_BASE - CONFIG_SYS_FSL_IFC_BASE2, - MT_DEVICE_NGNRNE, PMD_SECT_NON_SHARE }, + MT_DEVICE_NGNRNE, PTE_BLOCK_NON_SHARE }, { CONFIG_SYS_FSL_DCSR_BASE, CONFIG_SYS_FSL_DCSR_BASE, CONFIG_SYS_FSL_DCSR_SIZE, MT_DEVICE_NGNRNE, - PMD_SECT_NON_SHARE | PMD_SECT_PXN | PMD_SECT_UXN }, + PTE_BLOCK_NON_SHARE | PTE_BLOCK_PXN | PTE_BLOCK_UXN }, { CONFIG_SYS_FSL_DRAM_BASE2, CONFIG_SYS_FSL_DRAM_BASE2, CONFIG_SYS_FSL_DRAM_SIZE2, MT_NORMAL, - PMD_SECT_OUTER_SHARE | PMD_SECT_NS }, + PTE_BLOCK_OUTER_SHARE | PTE_BLOCK_NS }, #elif defined(CONFIG_FSL_LSCH2) { CONFIG_SYS_FSL_CCSR_BASE, CONFIG_SYS_FSL_CCSR_BASE, CONFIG_SYS_FSL_CCSR_SIZE, MT_DEVICE_NGNRNE, - PMD_SECT_NON_SHARE | PMD_SECT_PXN | PMD_SECT_UXN }, + PTE_BLOCK_NON_SHARE | PTE_BLOCK_PXN | PTE_BLOCK_UXN }, { CONFIG_SYS_FSL_OCRAM_BASE, CONFIG_SYS_FSL_OCRAM_BASE, - CONFIG_SYS_FSL_OCRAM_SIZE, MT_NORMAL, PMD_SECT_NON_SHARE }, + CONFIG_SYS_FSL_OCRAM_SIZE, MT_NORMAL, PTE_BLOCK_NON_SHARE }, { CONFIG_SYS_FSL_DCSR_BASE, CONFIG_SYS_FSL_DCSR_BASE, CONFIG_SYS_FSL_DCSR_SIZE, MT_DEVICE_NGNRNE, - PMD_SECT_NON_SHARE | PMD_SECT_PXN | PMD_SECT_UXN }, + PTE_BLOCK_NON_SHARE | PTE_BLOCK_PXN | PTE_BLOCK_UXN }, { CONFIG_SYS_FSL_QSPI_BASE, CONFIG_SYS_FSL_QSPI_BASE, - CONFIG_SYS_FSL_QSPI_SIZE, MT_DEVICE_NGNRNE, PMD_SECT_NON_SHARE }, + CONFIG_SYS_FSL_QSPI_SIZE, MT_DEVICE_NGNRNE, PTE_BLOCK_NON_SHARE }, { CONFIG_SYS_FSL_IFC_BASE, CONFIG_SYS_FSL_IFC_BASE, - CONFIG_SYS_FSL_IFC_SIZE, MT_DEVICE_NGNRNE, 
PMD_SECT_NON_SHARE }, + CONFIG_SYS_FSL_IFC_SIZE, MT_DEVICE_NGNRNE, PTE_BLOCK_NON_SHARE }, { CONFIG_SYS_FSL_DRAM_BASE1, CONFIG_SYS_FSL_DRAM_BASE1, - CONFIG_SYS_FSL_DRAM_SIZE1, MT_NORMAL, PMD_SECT_OUTER_SHARE }, + CONFIG_SYS_FSL_DRAM_SIZE1, MT_NORMAL, PTE_BLOCK_OUTER_SHARE }, { CONFIG_SYS_FSL_DRAM_BASE2, CONFIG_SYS_FSL_DRAM_BASE2, - CONFIG_SYS_FSL_DRAM_SIZE2, MT_NORMAL, PMD_SECT_OUTER_SHARE }, + CONFIG_SYS_FSL_DRAM_SIZE2, MT_NORMAL, PTE_BLOCK_OUTER_SHARE }, #endif };
@@ -166,96 +166,96 @@ static const struct sys_mmu_table final_mmu_table[] = { #ifdef CONFIG_FSL_LSCH3 { CONFIG_SYS_FSL_CCSR_BASE, CONFIG_SYS_FSL_CCSR_BASE, CONFIG_SYS_FSL_CCSR_SIZE, MT_DEVICE_NGNRNE, - PMD_SECT_NON_SHARE | PMD_SECT_PXN | PMD_SECT_UXN }, + PTE_BLOCK_NON_SHARE | PTE_BLOCK_PXN | PTE_BLOCK_UXN }, { CONFIG_SYS_FSL_OCRAM_BASE, CONFIG_SYS_FSL_OCRAM_BASE, - CONFIG_SYS_FSL_OCRAM_SIZE, MT_NORMAL, PMD_SECT_NON_SHARE }, + CONFIG_SYS_FSL_OCRAM_SIZE, MT_NORMAL, PTE_BLOCK_NON_SHARE }, { CONFIG_SYS_FSL_DRAM_BASE1, CONFIG_SYS_FSL_DRAM_BASE1, CONFIG_SYS_FSL_DRAM_SIZE1, MT_NORMAL, - PMD_SECT_OUTER_SHARE | PMD_SECT_NS }, + PTE_BLOCK_OUTER_SHARE | PTE_BLOCK_NS }, { CONFIG_SYS_FSL_QSPI_BASE2, CONFIG_SYS_FSL_QSPI_BASE2, CONFIG_SYS_FSL_QSPI_SIZE2, MT_DEVICE_NGNRNE, - PMD_SECT_NON_SHARE | PMD_SECT_PXN | PMD_SECT_UXN }, + PTE_BLOCK_NON_SHARE | PTE_BLOCK_PXN | PTE_BLOCK_UXN }, { CONFIG_SYS_FSL_IFC_BASE2, CONFIG_SYS_FSL_IFC_BASE2, - CONFIG_SYS_FSL_IFC_SIZE2, MT_DEVICE_NGNRNE, PMD_SECT_NON_SHARE }, + CONFIG_SYS_FSL_IFC_SIZE2, MT_DEVICE_NGNRNE, PTE_BLOCK_NON_SHARE }, { CONFIG_SYS_FSL_DCSR_BASE, CONFIG_SYS_FSL_DCSR_BASE, CONFIG_SYS_FSL_DCSR_SIZE, MT_DEVICE_NGNRNE, - PMD_SECT_NON_SHARE | PMD_SECT_PXN | PMD_SECT_UXN }, + PTE_BLOCK_NON_SHARE | PTE_BLOCK_PXN | PTE_BLOCK_UXN }, { CONFIG_SYS_FSL_MC_BASE, CONFIG_SYS_FSL_MC_BASE, CONFIG_SYS_FSL_MC_SIZE, MT_DEVICE_NGNRNE, - PMD_SECT_NON_SHARE | PMD_SECT_PXN | PMD_SECT_UXN }, + PTE_BLOCK_NON_SHARE | PTE_BLOCK_PXN | PTE_BLOCK_UXN }, { CONFIG_SYS_FSL_NI_BASE, CONFIG_SYS_FSL_NI_BASE, CONFIG_SYS_FSL_NI_SIZE, MT_DEVICE_NGNRNE, - PMD_SECT_NON_SHARE | PMD_SECT_PXN | PMD_SECT_UXN }, + PTE_BLOCK_NON_SHARE | PTE_BLOCK_PXN | PTE_BLOCK_UXN }, /* For QBMAN portal, only the first 64MB is cache-enabled */ { CONFIG_SYS_FSL_QBMAN_BASE, CONFIG_SYS_FSL_QBMAN_BASE, CONFIG_SYS_FSL_QBMAN_SIZE_1, MT_NORMAL, - PMD_SECT_NON_SHARE | PMD_SECT_PXN | PMD_SECT_UXN | PMD_SECT_NS }, + PTE_BLOCK_NON_SHARE | PTE_BLOCK_PXN | PTE_BLOCK_UXN | PTE_BLOCK_NS }, { CONFIG_SYS_FSL_QBMAN_BASE + CONFIG_SYS_FSL_QBMAN_SIZE_1, CONFIG_SYS_FSL_QBMAN_BASE + CONFIG_SYS_FSL_QBMAN_SIZE_1, CONFIG_SYS_FSL_QBMAN_SIZE - CONFIG_SYS_FSL_QBMAN_SIZE_1, - MT_DEVICE_NGNRNE, PMD_SECT_NON_SHARE | PMD_SECT_PXN | PMD_SECT_UXN }, + MT_DEVICE_NGNRNE, PTE_BLOCK_NON_SHARE | PTE_BLOCK_PXN | PTE_BLOCK_UXN }, { CONFIG_SYS_PCIE1_PHYS_ADDR, CONFIG_SYS_PCIE1_PHYS_ADDR, CONFIG_SYS_PCIE1_PHYS_SIZE, MT_DEVICE_NGNRNE, - PMD_SECT_NON_SHARE | PMD_SECT_PXN | PMD_SECT_UXN }, + PTE_BLOCK_NON_SHARE | PTE_BLOCK_PXN | PTE_BLOCK_UXN }, { CONFIG_SYS_PCIE2_PHYS_ADDR, CONFIG_SYS_PCIE2_PHYS_ADDR, CONFIG_SYS_PCIE2_PHYS_SIZE, MT_DEVICE_NGNRNE, - PMD_SECT_NON_SHARE | PMD_SECT_PXN | PMD_SECT_UXN }, + PTE_BLOCK_NON_SHARE | PTE_BLOCK_PXN | PTE_BLOCK_UXN }, { CONFIG_SYS_PCIE3_PHYS_ADDR, CONFIG_SYS_PCIE3_PHYS_ADDR, CONFIG_SYS_PCIE3_PHYS_SIZE, MT_DEVICE_NGNRNE, - PMD_SECT_NON_SHARE | PMD_SECT_PXN | PMD_SECT_UXN }, + PTE_BLOCK_NON_SHARE | PTE_BLOCK_PXN | PTE_BLOCK_UXN }, #if defined(CONFIG_LS2080A) || defined(CONFIG_LS2085A) { CONFIG_SYS_PCIE4_PHYS_ADDR, CONFIG_SYS_PCIE4_PHYS_ADDR, CONFIG_SYS_PCIE4_PHYS_SIZE, MT_DEVICE_NGNRNE, - PMD_SECT_NON_SHARE | PMD_SECT_PXN | PMD_SECT_UXN }, + PTE_BLOCK_NON_SHARE | PTE_BLOCK_PXN | PTE_BLOCK_UXN }, #endif { CONFIG_SYS_FSL_WRIOP1_BASE, CONFIG_SYS_FSL_WRIOP1_BASE, CONFIG_SYS_FSL_WRIOP1_SIZE, MT_DEVICE_NGNRNE, - PMD_SECT_NON_SHARE | PMD_SECT_PXN | PMD_SECT_UXN }, + PTE_BLOCK_NON_SHARE | PTE_BLOCK_PXN | PTE_BLOCK_UXN }, { CONFIG_SYS_FSL_AIOP1_BASE, CONFIG_SYS_FSL_AIOP1_BASE, CONFIG_SYS_FSL_AIOP1_SIZE, MT_DEVICE_NGNRNE, - 
PMD_SECT_NON_SHARE | PMD_SECT_PXN | PMD_SECT_UXN }, + PTE_BLOCK_NON_SHARE | PTE_BLOCK_PXN | PTE_BLOCK_UXN }, { CONFIG_SYS_FSL_PEBUF_BASE, CONFIG_SYS_FSL_PEBUF_BASE, CONFIG_SYS_FSL_PEBUF_SIZE, MT_DEVICE_NGNRNE, - PMD_SECT_NON_SHARE | PMD_SECT_PXN | PMD_SECT_UXN }, + PTE_BLOCK_NON_SHARE | PTE_BLOCK_PXN | PTE_BLOCK_UXN }, { CONFIG_SYS_FSL_DRAM_BASE2, CONFIG_SYS_FSL_DRAM_BASE2, CONFIG_SYS_FSL_DRAM_SIZE2, MT_NORMAL, - PMD_SECT_OUTER_SHARE | PMD_SECT_NS }, + PTE_BLOCK_OUTER_SHARE | PTE_BLOCK_NS }, #elif defined(CONFIG_FSL_LSCH2) { CONFIG_SYS_FSL_BOOTROM_BASE, CONFIG_SYS_FSL_BOOTROM_BASE, CONFIG_SYS_FSL_BOOTROM_SIZE, MT_DEVICE_NGNRNE, - PMD_SECT_NON_SHARE | PMD_SECT_PXN | PMD_SECT_UXN }, + PTE_BLOCK_NON_SHARE | PTE_BLOCK_PXN | PTE_BLOCK_UXN }, { CONFIG_SYS_FSL_CCSR_BASE, CONFIG_SYS_FSL_CCSR_BASE, CONFIG_SYS_FSL_CCSR_SIZE, MT_DEVICE_NGNRNE, - PMD_SECT_NON_SHARE | PMD_SECT_PXN | PMD_SECT_UXN }, + PTE_BLOCK_NON_SHARE | PTE_BLOCK_PXN | PTE_BLOCK_UXN }, { CONFIG_SYS_FSL_OCRAM_BASE, CONFIG_SYS_FSL_OCRAM_BASE, - CONFIG_SYS_FSL_OCRAM_SIZE, MT_NORMAL, PMD_SECT_NON_SHARE }, + CONFIG_SYS_FSL_OCRAM_SIZE, MT_NORMAL, PTE_BLOCK_NON_SHARE }, { CONFIG_SYS_FSL_DCSR_BASE, CONFIG_SYS_FSL_DCSR_BASE, CONFIG_SYS_FSL_DCSR_SIZE, MT_DEVICE_NGNRNE, - PMD_SECT_NON_SHARE | PMD_SECT_PXN | PMD_SECT_UXN }, + PTE_BLOCK_NON_SHARE | PTE_BLOCK_PXN | PTE_BLOCK_UXN }, { CONFIG_SYS_FSL_QSPI_BASE, CONFIG_SYS_FSL_QSPI_BASE, CONFIG_SYS_FSL_QSPI_SIZE, MT_DEVICE_NGNRNE, - PMD_SECT_NON_SHARE | PMD_SECT_PXN | PMD_SECT_UXN }, + PTE_BLOCK_NON_SHARE | PTE_BLOCK_PXN | PTE_BLOCK_UXN }, { CONFIG_SYS_FSL_IFC_BASE, CONFIG_SYS_FSL_IFC_BASE, - CONFIG_SYS_FSL_IFC_SIZE, MT_DEVICE_NGNRNE, PMD_SECT_NON_SHARE }, + CONFIG_SYS_FSL_IFC_SIZE, MT_DEVICE_NGNRNE, PTE_BLOCK_NON_SHARE }, { CONFIG_SYS_FSL_DRAM_BASE1, CONFIG_SYS_FSL_DRAM_BASE1, CONFIG_SYS_FSL_DRAM_SIZE1, MT_NORMAL, - PMD_SECT_OUTER_SHARE | PMD_SECT_NS }, + PTE_BLOCK_OUTER_SHARE | PTE_BLOCK_NS }, { CONFIG_SYS_FSL_QBMAN_BASE, CONFIG_SYS_FSL_QBMAN_BASE, CONFIG_SYS_FSL_QBMAN_SIZE, MT_DEVICE_NGNRNE, - PMD_SECT_NON_SHARE | PMD_SECT_PXN | PMD_SECT_UXN }, + PTE_BLOCK_NON_SHARE | PTE_BLOCK_PXN | PTE_BLOCK_UXN }, { CONFIG_SYS_FSL_DRAM_BASE2, CONFIG_SYS_FSL_DRAM_BASE2, - CONFIG_SYS_FSL_DRAM_SIZE2, MT_NORMAL, PMD_SECT_OUTER_SHARE }, + CONFIG_SYS_FSL_DRAM_SIZE2, MT_NORMAL, PTE_BLOCK_OUTER_SHARE }, { CONFIG_SYS_PCIE1_PHYS_ADDR, CONFIG_SYS_PCIE1_PHYS_ADDR, CONFIG_SYS_PCIE1_PHYS_SIZE, MT_DEVICE_NGNRNE, - PMD_SECT_NON_SHARE | PMD_SECT_PXN | PMD_SECT_UXN }, + PTE_BLOCK_NON_SHARE | PTE_BLOCK_PXN | PTE_BLOCK_UXN }, { CONFIG_SYS_PCIE2_PHYS_ADDR, CONFIG_SYS_PCIE2_PHYS_ADDR, CONFIG_SYS_PCIE2_PHYS_SIZE, MT_DEVICE_NGNRNE, - PMD_SECT_NON_SHARE | PMD_SECT_PXN | PMD_SECT_UXN }, + PTE_BLOCK_NON_SHARE | PTE_BLOCK_PXN | PTE_BLOCK_UXN }, { CONFIG_SYS_PCIE3_PHYS_ADDR, CONFIG_SYS_PCIE3_PHYS_ADDR, CONFIG_SYS_PCIE3_PHYS_SIZE, MT_DEVICE_NGNRNE, - PMD_SECT_NON_SHARE | PMD_SECT_PXN | PMD_SECT_UXN }, + PTE_BLOCK_NON_SHARE | PTE_BLOCK_PXN | PTE_BLOCK_UXN }, { CONFIG_SYS_FSL_DRAM_BASE3, CONFIG_SYS_FSL_DRAM_BASE3, - CONFIG_SYS_FSL_DRAM_SIZE3, MT_NORMAL, PMD_SECT_OUTER_SHARE }, + CONFIG_SYS_FSL_DRAM_SIZE3, MT_NORMAL, PTE_BLOCK_OUTER_SHARE }, #endif }; #endif diff --git a/arch/arm/include/asm/armv8/mmu.h b/arch/arm/include/asm/armv8/mmu.h index 1711433..4873e5a 100644 --- a/arch/arm/include/asm/armv8/mmu.h +++ b/arch/arm/include/asm/armv8/mmu.h @@ -22,28 +22,12 @@ * calculated specifically. */
-#ifndef CONFIG_SYS_FULL_VA -#define VA_BITS (42) /* 42 bits virtual address */ -#else #define VA_BITS CONFIG_SYS_VA_BITS #define PTE_BLOCK_BITS CONFIG_SYS_PTL2_BITS -#endif
/* * block/section address mask and size definitions. */ -#ifndef CONFIG_SYS_FULL_VA -#define SECTION_SHIFT 29 -#define SECTION_SIZE (UL(1) << SECTION_SHIFT) -#define SECTION_MASK (~(SECTION_SIZE-1)) - -/* PAGE_SHIFT determines the page size */ -#undef PAGE_SIZE -#define PAGE_SHIFT 16 -#define PAGE_SIZE (1 << PAGE_SHIFT) -#define PAGE_MASK (~(PAGE_SIZE-1)) - -#else
/* PAGE_SHIFT determines the page size */ #undef PAGE_SIZE @@ -51,8 +35,6 @@ #define PAGE_SIZE (1 << PAGE_SHIFT) #define PAGE_MASK (~(PAGE_SIZE-1))
-#endif - /***************************************************************/
/* @@ -75,8 +57,6 @@ * */
-#ifdef CONFIG_SYS_FULL_VA - #define PTE_TYPE_MASK (3 << 0) #define PTE_TYPE_FAULT (0 << 0) #define PTE_TYPE_TABLE (3 << 0) @@ -91,6 +71,7 @@ * Block */ #define PTE_BLOCK_MEMTYPE(x) ((x) << 2) +#define PTE_BLOCK_NS (1 << 5) #define PTE_BLOCK_NON_SHARE (0 << 8) #define PTE_BLOCK_OUTER_SHARE (2 << 8) #define PTE_BLOCK_INNER_SHARE (3 << 8) @@ -99,29 +80,6 @@ #define PTE_BLOCK_PXN (UL(1) << 53) #define PTE_BLOCK_UXN (UL(1) << 54)
-#else -/* - * Level 2 descriptor (PMD). - */ -#define PMD_TYPE_MASK (3 << 0) -#define PMD_TYPE_FAULT (0 << 0) -#define PMD_TYPE_TABLE (3 << 0) -#define PMD_TYPE_SECT (1 << 0) - -/* - * Section - */ -#define PMD_SECT_NS (1 << 5) -#define PMD_SECT_NON_SHARE (0 << 8) -#define PMD_SECT_OUTER_SHARE (2 << 8) -#define PMD_SECT_INNER_SHARE (3 << 8) -#define PMD_SECT_AF (1 << 10) -#define PMD_SECT_NG (1 << 11) -#define PMD_SECT_PXN (UL(1) << 53) -#define PMD_SECT_UXN (UL(1) << 54) - -#endif - /* * AttrIndx[2:0] */ @@ -149,33 +107,11 @@ #define TCR_TG0_64K (1 << 14) #define TCR_TG0_16K (2 << 14)
-#ifndef CONFIG_SYS_FULL_VA -#define TCR_EL1_IPS_BITS (UL(3) << 32) /* 42 bits physical address */ -#define TCR_EL2_IPS_BITS (3 << 16) /* 42 bits physical address */ -#define TCR_EL3_IPS_BITS (3 << 16) /* 42 bits physical address */ - -/* PTWs cacheable, inner/outer WBWA and inner shareable */ -#define TCR_FLAGS (TCR_TG0_64K | \ - TCR_SHARED_INNER | \ - TCR_ORGN_WBWA | \ - TCR_IRGN_WBWA | \ - TCR_T0SZ(VA_BITS)) -#endif - #define TCR_EL1_RSVD (1 << 31) #define TCR_EL2_RSVD (1 << 31 | 1 << 23) #define TCR_EL3_RSVD (1 << 31 | 1 << 23)
#ifndef __ASSEMBLY__ -#ifndef CONFIG_SYS_FULL_VA - -void set_pgtable_section(u64 *page_table, u64 index, - u64 section, u64 memory_type, - u64 attribute); -void set_pgtable_table(u64 *page_table, u64 index, - u64 *table_addr); - -#endif static inline void set_ttbr_tcr_mair(int el, u64 table, u64 tcr, u64 attr) { asm volatile("dsb sy"); diff --git a/arch/arm/include/asm/global_data.h b/arch/arm/include/asm/global_data.h index 3dec1db..e982d9e 100644 --- a/arch/arm/include/asm/global_data.h +++ b/arch/arm/include/asm/global_data.h @@ -39,7 +39,7 @@ struct arch_global_data { #if !(defined(CONFIG_SYS_ICACHE_OFF) && defined(CONFIG_SYS_DCACHE_OFF)) unsigned long tlb_addr; unsigned long tlb_size; -#if defined(CONFIG_SYS_FULL_VA) +#if defined(CONFIG_ARM64) unsigned long tlb_fillptr; #endif #endif diff --git a/arch/arm/include/asm/system.h b/arch/arm/include/asm/system.h index ffd6fe5..5663d22 100644 --- a/arch/arm/include/asm/system.h +++ b/arch/arm/include/asm/system.h @@ -17,12 +17,8 @@ #define CR_WXN (1 << 19) /* Write Permision Imply XN */ #define CR_EE (1 << 25) /* Exception (Big) Endian */
-#ifndef CONFIG_SYS_FULL_VA -#define PGTABLE_SIZE (0x10000) -#else u64 get_page_table_size(void); #define PGTABLE_SIZE get_page_table_size() -#endif
/* 2MB granularity */ #define MMU_SECTION_SHIFT 21 diff --git a/doc/README.arm64 b/doc/README.arm64 index de669cb..f658fa2 100644 --- a/doc/README.arm64 +++ b/doc/README.arm64 @@ -36,26 +36,6 @@ Notes 6. CONFIG_ARM64 instead of CONFIG_ARMV8 is used to distinguish aarch64 and aarch32 specific codes.
-7. CONFIG_SYS_FULL_VA is used to enable 2-level page tables. For cores - supporting 64k pages it allows usage of full 48+ virtual/physical addresses - - Enabling this option requires the following ones to be defined: - - CONFIG_SYS_MEM_MAP - an array of 'struct mm_region' describing the - system memory map (start, length, attributes) - - CONFIG_SYS_MEM_MAP_SIZE - number of entries in CONFIG_SYS_MEM_MAP - - CONFIG_SYS_PTL1_ENTRIES - number of 1st level page table entries - - CONFIG_SYS_PTL2_ENTRIES - number of 1nd level page table entries - for the largest CONFIG_SYS_MEM_MAP entry - - CONFIG_COREID_MASK - the mask value used to get the core from the - MPIDR_EL1 register - - CONFIG_SYS_PTL2_BITS - number of bits addressed by the 2nd level - page tables - - CONFIG_SYS_BLOCK_SHIFT - number of bits addressed by a single block - entry from L2 page tables - - CONFIG_SYS_PGTABLE_SIZE - total size of the page table - - CONFIG_SYS_TCR_EL{1,2,3}_IPS_BITS - the IPS field of the TCR_EL{1,2,3} - -
Contributor diff --git a/include/configs/hikey.h b/include/configs/hikey.h index ed0336c..fc4f368 100644 --- a/include/configs/hikey.h +++ b/include/configs/hikey.h @@ -23,7 +23,6 @@
/* MMU Definitions */ #define CONFIG_SYS_CACHELINE_SIZE 64 -#define CONFIG_SYS_FULL_VA #define CONFIG_SYS_MEM_MAP { \ { \ .base = 0x0UL, \ diff --git a/include/configs/tegra210-common.h b/include/configs/tegra210-common.h index 5a664b3..ff9942c 100644 --- a/include/configs/tegra210-common.h +++ b/include/configs/tegra210-common.h @@ -13,7 +13,6 @@ /* Cortex-A57 uses a cache line size of 64 bytes */ #define CONFIG_SYS_CACHELINE_SIZE 64
-#define CONFIG_SYS_FULL_VA #define CONFIG_SYS_MEM_MAP { \ { \ .base = 0x0UL, \ diff --git a/include/configs/thunderx_88xx.h b/include/configs/thunderx_88xx.h index 20b25f7..7c35f84 100644 --- a/include/configs/thunderx_88xx.h +++ b/include/configs/thunderx_88xx.h @@ -22,8 +22,6 @@
#define MEM_BASE 0x00500000
-#define CONFIG_SYS_FULL_VA - #define CONFIG_SYS_LOWMEM_BASE MEM_BASE
#define CONFIG_SYS_MEM_MAP {{0x000000000000UL, 0x40000000000UL, \ @@ -37,18 +35,6 @@ PTE_BLOCK_NON_SHARE}, \ }
-#define CONFIG_SYS_MEM_MAP_SIZE 3 - -#define CONFIG_SYS_VA_BITS 48 -#define CONFIG_SYS_PTL2_BITS 42 -#define CONFIG_SYS_BLOCK_SHIFT 29 -#define CONFIG_SYS_PTL1_ENTRIES 64 -#define CONFIG_SYS_PTL2_ENTRIES 8192 - -#define CONFIG_SYS_PGTABLE_SIZE \ - ((CONFIG_SYS_PTL1_ENTRIES + \ - CONFIG_SYS_MEM_MAP_SIZE * CONFIG_SYS_PTL2_ENTRIES) * 8) - /* Link Definitions */ #define CONFIG_SYS_TEXT_BASE 0x00500000 #define CONFIG_SYS_INIT_SP_ADDR (CONFIG_SYS_SDRAM_BASE + 0x7fff0) diff --git a/include/configs/vexpress_aemv8a.h b/include/configs/vexpress_aemv8a.h index 9689231..5974633 100644 --- a/include/configs/vexpress_aemv8a.h +++ b/include/configs/vexpress_aemv8a.h @@ -21,7 +21,6 @@
/* MMU Definitions */ #define CONFIG_SYS_CACHELINE_SIZE 64 -#define CONFIG_SYS_FULL_VA #define CONFIG_SYS_MEM_MAP { \ { \ .base = 0x0UL, \ diff --git a/include/configs/xilinx_zynqmp.h b/include/configs/xilinx_zynqmp.h index 439f063..85eabc9 100644 --- a/include/configs/xilinx_zynqmp.h +++ b/include/configs/xilinx_zynqmp.h @@ -29,7 +29,6 @@ #define CONFIG_SYS_MEMTEST_START CONFIG_SYS_SDRAM_BASE #define CONFIG_SYS_MEMTEST_END CONFIG_SYS_SDRAM_SIZE
-#define CONFIG_SYS_FULL_VA #define CONFIG_SYS_MEM_MAP { \ { \ .base = 0x0UL, \

Now that we have an easy way to describe memory regions and enable the MMU, there really shouldn't be anything holding people back from running with caches enabled on AArch64. To make sure people catch it early when they're missing out on the caching fun, give them a compile error.
Signed-off-by: Alexander Graf agraf@suse.de --- arch/arm/cpu/armv8/cache_v8.c | 9 +++++++++ 1 file changed, 9 insertions(+)
diff --git a/arch/arm/cpu/armv8/cache_v8.c b/arch/arm/cpu/armv8/cache_v8.c index c199ad1..9c62cf4 100644 --- a/arch/arm/cpu/armv8/cache_v8.c +++ b/arch/arm/cpu/armv8/cache_v8.c @@ -465,6 +465,15 @@ void mmu_set_region_dcache_behaviour(phys_addr_t start, size_t size,
#else /* CONFIG_SYS_DCACHE_OFF */
+/* + * For SPL builds, we may want to not have dcache enabled. Any real U-Boot + * running however really wants to have dcache and the MMU active. Check that + * everything is sane and give the developer a hint if it isn't. + */ +#ifndef CONFIG_SPL_BUILD +#error Please describe your MMU layout in CONFIG_SYS_MEM_MAP and enable dcache. +#endif + void invalidate_dcache_all(void) { }
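As a concrete illustration of what the error message asks a board maintainer to provide, a board-side memory map definition in this scheme looks roughly like the sketch below. It is modeled on the hikey/vexpress style maps touched by this series; the addresses, sizes and the two-region split are made up, and the attribute macros are the PTE_BLOCK_*/MT_* definitions from asm/armv8/mmu.h.

/* Hypothetical board config header snippet (addresses are illustrative only) */
#define CONFIG_SYS_MEM_MAP { \
	{ \
		/* 2 GiB of DRAM, cacheable */ \
		.base = 0x00000000UL, \
		.size = 0x80000000UL, \
		.attrs = PTE_BLOCK_MEMTYPE(MT_NORMAL) | \
			 PTE_BLOCK_INNER_SHARE, \
	}, { \
		/* 2 GiB of peripheral space, device memory, no execution */ \
		.base = 0x80000000UL, \
		.size = 0x80000000UL, \
		.attrs = PTE_BLOCK_MEMTYPE(MT_DEVICE_NGNRNE) | \
			 PTE_BLOCK_NON_SHARE | \
			 PTE_BLOCK_PXN | PTE_BLOCK_UXN, \
	}, \
}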

On 02/21/2016 05:57 PM, Alexander Graf wrote:
Howdy,
Currently on arm64 there is a big pile of mess when it comes to MMU support and page tables. Each board does its own little thing and the generic code is pretty dumb and nobody actually uses it.
This patch set tries to clean that up. After this series is applied, all boards except for the FSL Layerscape ones are converted to the new generic page table logic and have icache+dcache enabled.
The new code always uses 4k page size. It dynamically allocates 1G or 2M pages for ranges that fit. When a dcache attribute request comes in that requires a smaller granularity than our previous allocation could fulfill, pages get automatically split.
I have tested and verified the code works on HiKey (bare metal), vexpress64 (Foundation Model) and zynqmp (QEMU). The TX1 target is untested, but given the simplicity of the maps I doubt it'll break. ThunderX in theory should also work, but I haven't tested it. I would be very happy if people with access to those system could give the patch set a try.
With this we're a big step closer to a good base line for EFI payload support, since we can now just require that all boards always have dcache enabled.
I would also be incredibly happy if some Freescale people could look at their MMU code and try to unify it into the now cleaned up generic code. I don't think we're far off here.
Alex,
Unified MMU will be great for all of us. The reason we started with our own MMU table was size and performance. I don't know much about other ARMv8 SoCs. For our use, we enable cache very early to speed up running, especially for pre-silicon development on emulators. We don't have DDR to use for the early stage and we have very limited on-chip SRAM. I believe we can use the unified structure for our 2nd stage MMU when DDR is up.
York

Am 22.02.2016 um 18:37 schrieb york sun york.sun@nxp.com:
On 02/21/2016 05:57 PM, Alexander Graf wrote: Howdy,
Currently on arm64 there is a big pile of mess when it comes to MMU support and page tables. Each board does its own little thing and the generic code is pretty dumb and nobody actually uses it.
This patch set tries to clean that up. After this series is applied, all boards except for the FSL Layerscape ones are converted to the new generic page table logic and have icache+dcache enabled.
The new code always uses 4k page size. It dynamically allocates 1G or 2M pages for ranges that fit. When a dcache attribute request comes in that requires a smaller granularity than our previous allocation could fulfill, pages get automatically split.
I have tested and verified the code works on HiKey (bare metal), vexpress64 (Foundation Model) and zynqmp (QEMU). The TX1 target is untested, but given the simplicity of the maps I doubt it'll break. ThunderX in theory should also work, but I haven't tested it. I would be very happy if people with access to those system could give the patch set a try.
With this we're a big step closer to a good base line for EFI payload support, since we can now just require that all boards always have dcache enabled.
I would also be incredibly happy if some Freescale people could look at their MMU code and try to unify it into the now cleaned up generic code. I don't think we're far off here.
Alex,
Unified MMU will be great for all of us. The reason we started with our own MMU table was size and performance. I don't know much about other ARMv8 SoCs. For our use, we enable cache very early to speed up running, especially for pre-silicon development on emulators. We don't have DDR to use for the early stage and we have very limited on-chip SRAM. I believe we can use the unified structure for our 2nd stage MMU when DDR is up.
Yup, and I think it should be fairly doable to move the early generation into the same table format - maybe even fully reuse the generic code.
The thing that I tripped over while attempting conversion was that you don't always map phys==virt, unlike other boards, and I didn't fully understand why.
Alex

On 02/22/2016 10:02 AM, Alexander Graf wrote:
Am 22.02.2016 um 18:37 schrieb york sun york.sun@nxp.com:
On 02/21/2016 05:57 PM, Alexander Graf wrote: Howdy,
Currently on arm64 there is a big pile of mess when it comes to MMU support and page tables. Each board does its own little thing and the generic code is pretty dumb and nobody actually uses it.
This patch set tries to clean that up. After this series is applied, all boards except for the FSL Layerscape ones are converted to the new generic page table logic and have icache+dcache enabled.
The new code always uses 4k page size. It dynamically allocates 1G or 2M pages for ranges that fit. When a dcache attribute request comes in that requires a smaller granularity than our previous allocation could fulfill, pages get automatically split.
I have tested and verified the code works on HiKey (bare metal), vexpress64 (Foundation Model) and zynqmp (QEMU). The TX1 target is untested, but given the simplicity of the maps I doubt it'll break. ThunderX in theory should also work, but I haven't tested it. I would be very happy if people with access to those system could give the patch set a try.
With this we're a big step closer to a good base line for EFI payload support, since we can now just require that all boards always have dcache enabled.
I would also be incredibly happy if some Freescale people could look at their MMU code and try to unify it into the now cleaned up generic code. I don't think we're far off here.
Alex,
Unified MMU will be great for all of us. The reason we started with our own MMU table was size and performance. I don't know much about other ARMv8 SoCs. For our use, we enable cache very early to speed up running, especially for pre-silicon development on emulators. We don't have DDR to use for the early stage and we have very limited on-chip SRAM. I believe we can use the unified structure for our 2nd stage MMU when DDR is up.
Yup, and I think it should be fairly doable to move the early generation into the same table format - maybe even fully reuse the generic code.
What's the size for the MMU tables? I think it may be simpler to use static tables for our early stage.
The thing that I tripped over while attempting conversion was that you don't always map phys==virt, unless other boards, and I didn't fully understand why.
True. We have some complications in the address mapping. For compatibility, each device is mapped (partially) under 32-bit space. If the device is too large to fit, the rest is mapped to high regions. I remember one particular case off the top of my head. It is the NOR flash we use for environmental variables. U-boot uses that address for saving, but also uses that for loading during booting. For our case, the NOR flash doesn't fit well in the low region, so it is remapped to the high region after booting. To make the environmental variables accessible during boot, we map the high region phys with a different virt, so u-boot doesn't have to know the low region address.
York

On Feb 22, 2016, at 7:12 PM, york sun york.sun@nxp.com wrote:
On 02/22/2016 10:02 AM, Alexander Graf wrote:
Am 22.02.2016 um 18:37 schrieb york sun york.sun@nxp.com:
On 02/21/2016 05:57 PM, Alexander Graf wrote: Howdy,
Currently on arm64 there is a big pile of mess when it comes to MMU support and page tables. Each board does its own little thing and the generic code is pretty dumb and nobody actually uses it.
This patch set tries to clean that up. After this series is applied, all boards except for the FSL Layerscape ones are converted to the new generic page table logic and have icache+dcache enabled.
The new code always uses 4k page size. It dynamically allocates 1G or 2M pages for ranges that fit. When a dcache attribute request comes in that requires a smaller granularity than our previous allocation could fulfill, pages get automatically split.
I have tested and verified the code works on HiKey (bare metal), vexpress64 (Foundation Model) and zynqmp (QEMU). The TX1 target is untested, but given the simplicity of the maps I doubt it'll break. ThunderX in theory should also work, but I haven't tested it. I would be very happy if people with access to those system could give the patch set a try.
With this we're a big step closer to a good base line for EFI payload support, since we can now just require that all boards always have dcache enabled.
I would also be incredibly happy if some Freescale people could look at their MMU code and try to unify it into the now cleaned up generic code. I don't think we're far off here.
Alex,
Unified MMU will be great for all of us. The reason we started with our own MMU table was size and performance. I don't know much about other ARMv8 SoCs. For our use, we enable cache very early to speed up running, especially for pre-silicon development on emulators. We don't have DDR to use for the early stage and we have very limited on-chip SRAM. I believe we can use the unified structure for our 2nd stage MMU when DDR is up.
Yup, and I think it should be fairly doable to move the early generation into the same table format - maybe even fully reuse the generic code.
What's the size for the MMU tables? I think it may be simpler to use static tables for our early stage.
The size is determined dynamically from the memory map using some code that (as Steven found) is not 100% sound, but works well enough so far :).
The thing that I tripped over while attempting conversion was that you don't always map phys==virt, unless other boards, and I didn't fully understand why.
True. We have some complication on the address mapping. For compatibility, each device is mapped (partially) under 32-bit space. If the device is too large to
Compatibility with what? Do we really need this in an AArch64 world?
For 32bit code I can definitely understand why you'd want to have phys != virt. But in a pure 64bit world (which this target really is, no?) I see little benefit in it.
fit, the rest is mapped to high regions. I remember one particular case on top of my head. It is the NOR flash we use for environmental variables. U-boot uses that address for saving, but also uses that for loading during booting. For our case, the NOR flash doesn't fit well in the low region, so it is remapped to high region after booting. To make the environmental variables accessible during boot, we mapped the high region phys with different virt, so u-boot doesn't have to know the low region address.
I might be missing the obvious, but why can't the environmental variables live in high regions?
Alex

On 02/22/2016 10:31 AM, Alexander Graf wrote:
On Feb 22, 2016, at 7:12 PM, york sun york.sun@nxp.com wrote:
On 02/22/2016 10:02 AM, Alexander Graf wrote:
Am 22.02.2016 um 18:37 schrieb york sun york.sun@nxp.com:
On 02/21/2016 05:57 PM, Alexander Graf wrote: Howdy,
Currently on arm64 there is a big pile of mess when it comes to MMU support and page tables. Each board does its own little thing and the generic code is pretty dumb and nobody actually uses it.
This patch set tries to clean that up. After this series is applied, all boards except for the FSL Layerscape ones are converted to the new generic page table logic and have icache+dcache enabled.
The new code always uses 4k page size. It dynamically allocates 1G or 2M pages for ranges that fit. When a dcache attribute request comes in that requires a smaller granularity than our previous allocation could fulfill, pages get automatically split.
I have tested and verified the code works on HiKey (bare metal), vexpress64 (Foundation Model) and zynqmp (QEMU). The TX1 target is untested, but given the simplicity of the maps I doubt it'll break. ThunderX in theory should also work, but I haven't tested it. I would be very happy if people with access to those system could give the patch set a try.
With this we're a big step closer to a good base line for EFI payload support, since we can now just require that all boards always have dcache enabled.
I would also be incredibly happy if some Freescale people could look at their MMU code and try to unify it into the now cleaned up generic code. I don't think we're far off here.
Alex,
Unified MMU will be great for all of us. The reason we started with our own MMU table was size and performance. I don't know much about other ARMv8 SoCs. For our use, we enable cache very early to speed up running, especially for pre-silicon development on emulators. We don't have DDR to use for the early stage and we have very limited on-chip SRAM. I believe we can use the unified structure for our 2nd stage MMU when DDR is up.
Yup, and I think it should be fairly doable to move the early generation into the same table format - maybe even fully reuse the generic code.
What's the size for the MMU tables? I think it may be simpler to use static tables for our early stage.
The size is determined dynamically from the memory map using some code that (as Steven found) is not 100% sound, but works well enough so far :).
That's the part I can't live with. Since we have very limited on-chip RAM, we have to know and limit the size. But again, I do see the benefit of using the unified structure for the 2nd stage.
The thing that I tripped over while attempting conversion was that you don't always map phys==virt, unless other boards, and I didn't fully understand why.
True. We have some complication on the address mapping. For compatibility, each device is mapped (partially) under 32-bit space. If the device is too large to
Compatibility with what? Do we really need this in an AArch64 world?
It's not up to me. The SoC was designed this way. By the way, this SoC can work in AArch32 mode.
For 32bit code I can definitely understand why you'd want to have phys != virt. But in a pure 64bit world (which this target really is, no?) I see little benefit on it.
fit, the rest is mapped to high regions. I remember one particular case on top of my head. It is the NOR flash we use for environmental variables. U-boot uses that address for saving, but also uses that for loading during booting. For our case, the NOR flash doesn't fit well in the low region, so it is remapped to high region after booting. To make the environmental variables accessible during boot, we mapped the high region phys with different virt, so u-boot doesn't have to know the low region address.
I might be missing the obvious, but why can't the environmental variables live in high regions?
It is in the high region. But as I tried to explain, the default physical mapping of the NOR flash (not the MMU mapping) is in the low region out of reset.
York

On 22.02.16 19:39, york sun wrote:
On 02/22/2016 10:31 AM, Alexander Graf wrote:
On Feb 22, 2016, at 7:12 PM, york sun york.sun@nxp.com wrote:
On 02/22/2016 10:02 AM, Alexander Graf wrote:
Am 22.02.2016 um 18:37 schrieb york sun york.sun@nxp.com:
On 02/21/2016 05:57 PM, Alexander Graf wrote: Howdy,
Currently on arm64 there is a big pile of mess when it comes to MMU support and page tables. Each board does its own little thing and the generic code is pretty dumb and nobody actually uses it.
This patch set tries to clean that up. After this series is applied, all boards except for the FSL Layerscape ones are converted to the new generic page table logic and have icache+dcache enabled.
The new code always uses 4k page size. It dynamically allocates 1G or 2M pages for ranges that fit. When a dcache attribute request comes in that requires a smaller granularity than our previous allocation could fulfill, pages get automatically split.
I have tested and verified the code works on HiKey (bare metal), vexpress64 (Foundation Model) and zynqmp (QEMU). The TX1 target is untested, but given the simplicity of the maps I doubt it'll break. ThunderX in theory should also work, but I haven't tested it. I would be very happy if people with access to those system could give the patch set a try.
With this we're a big step closer to a good base line for EFI payload support, since we can now just require that all boards always have dcache enabled.
I would also be incredibly happy if some Freescale people could look at their MMU code and try to unify it into the now cleaned up generic code. I don't think we're far off here.
Alex,
Unified MMU will be great for all of us. The reason we started with our own MMU table was size and performance. I don't know much about other ARMv8 SoCs. For our use, we enable cache very early to speed up running, especially for pre-silicon development on emulators. We don't have DDR to use for the early stage and we have very limited on-chip SRAM. I believe we can use the unified structure for our 2nd stage MMU when DDR is up.
Yup, and I think it should be fairly doable to move the early generation into the same table format - maybe even fully reuse the generic code.
What's the size for the MMU tables? I think it may be simpler to use static tables for our early stage.
The size is determined dynamically from the memory map using some code that (as Steven found) is not 100% sound, but works well enough so far :).
That's the part I can't live with. Since we have very limited on-chip RAM, we have to know limit the size. But again, I do see the benefit to use unified structure for the 2nd stage.
I'm not quite sure I see how your current code works any differently. While the code to determine the page table pool size is dynamic, the outcome is static depending on your memory map. So the same memory map always means the same page table pool size.
We could also just hard code the size for the early phase for you I guess.
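To make the "same map, same size" point concrete, here is a rough sketch of how a worst-case pool size could be derived once from a fixed memory map and then treated as a constant for an SRAM-constrained early stage. This is an editorial illustration under assumptions, not the calculation used in this series; the mm_region layout mirrors the maps used by the converted boards, and all other names and constants are made up.

#include <stdint.h>
#include <stddef.h>

struct mm_region {
	uint64_t base;
	uint64_t size;
	uint64_t attrs;
};

#define L1_SHIFT	39	/* one L1 table covers 512 GiB (4k granule) */
#define L2_SHIFT	30	/* one L2 table covers 1 GiB */
#define TABLE_SIZE	(512 * sizeof(uint64_t))	/* 4 KiB per table */

static size_t pgtable_pool_upper_bound(const struct mm_region *map, size_t n)
{
	size_t tables = 1;	/* the single level-0 table */

	for (size_t i = 0; i < n; i++) {
		uint64_t start = map[i].base;
		uint64_t last = map[i].base + map[i].size - 1;

		/* one L1 table per 512 GiB window the region touches */
		tables += (last >> L1_SHIFT) - (start >> L1_SHIFT) + 1;
		/* worst case: one L2 table per 1 GiB window touched,
		 * needed when the region cannot use 1G blocks */
		tables += (last >> L2_SHIFT) - (start >> L2_SHIFT) + 1;
	}

	/* ignores later 2M -> 4k splits done for dcache attribute changes */
	return tables * TABLE_SIZE;
}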
The thing that I tripped over while attempting conversion was that you don't always map phys==virt, unless other boards, and I didn't fully understand why.
True. We have some complication on the address mapping. For compatibility, each device is mapped (partially) under 32-bit space. If the device is too large to
Compatibility with what? Do we really need this in an AArch64 world?
It's not up to me. The SoC was designed this way. By the way, this SoC can work in AArch32 mode.
I think I'm slowly grasping what the problem is.
The fact that the SoC can run in AArch32 mode doesn't actually make a difference here though, since we're talking about U-Boot internal memory maps. The only reason to keep things mapped reachable from 32bits is if you want to run 32bit code with the U-Boot maps. I don't think you'd want to do that, no? :)
For 32bit code I can definitely understand why you'd want to have phys != virt. But in a pure 64bit world (which this target really is, no?) I see little benefit on it.
fit, the rest is mapped to high regions. I remember one particular case on top of my head. It is the NOR flash we use for environmental variables. U-boot uses that address for saving, but also uses that for loading during booting. For our case, the NOR flash doesn't fit well in the low region, so it is remapped to high region after booting. To make the environmental variables accessible during boot, we mapped the high region phys with different virt, so u-boot doesn't have to know the low region address.
I might be missing the obvious, but why can't the environmental variables live in high regions?
It is in high region. But as I tried to explain, the default physical mapping of NOR flash (not MMU) is in low region out of reset.
I see. So the problem is during the transitioning phase from uncached to MMU enabled, where we'd end up at a different address.
Could we just configure NOR to be in high memory in early asm init code, then always use the high physical NOR address range and jump to it from asm very early on? Then we could ignore the 32bit map and everything could just stay 1:1 mapped.
Alex

On 02/22/2016 11:42 AM, Alexander Graf wrote:
On 22.02.16 19:39, york sun wrote:
On 02/22/2016 10:31 AM, Alexander Graf wrote:
On Feb 22, 2016, at 7:12 PM, york sun york.sun@nxp.com wrote:
On 02/22/2016 10:02 AM, Alexander Graf wrote:
Am 22.02.2016 um 18:37 schrieb york sun york.sun@nxp.com:
> On 02/21/2016 05:57 PM, Alexander Graf wrote: > Howdy, > > Currently on arm64 there is a big pile of mess when it comes to MMU > support and page tables. Each board does its own little thing and the > generic code is pretty dumb and nobody actually uses it. > > This patch set tries to clean that up. After this series is applied, > all boards except for the FSL Layerscape ones are converted to the > new generic page table logic and have icache+dcache enabled. > > The new code always uses 4k page size. It dynamically allocates 1G or > 2M pages for ranges that fit. When a dcache attribute request comes in > that requires a smaller granularity than our previous allocation could > fulfill, pages get automatically split. > > I have tested and verified the code works on HiKey (bare metal), > vexpress64 (Foundation Model) and zynqmp (QEMU). The TX1 target is > untested, but given the simplicity of the maps I doubt it'll break. > ThunderX in theory should also work, but I haven't tested it. I would > be very happy if people with access to those system could give the patch > set a try. > > With this we're a big step closer to a good base line for EFI payload > support, since we can now just require that all boards always have dcache > enabled. > > I would also be incredibly happy if some Freescale people could look > at their MMU code and try to unify it into the now cleaned up generic > code. I don't think we're far off here.
Alex,
Unified MMU will be great for all of us. The reason we started with our own MMU table was size and performance. I don't know much about other ARMv8 SoCs. For our use, we enable cache very early to speed up running, especially for pre-silicon development on emulators. We don't have DDR to use for the early stage and we have very limited on-chip SRAM. I believe we can use the unified structure for our 2nd stage MMU when DDR is up.
Yup, and I think it should be fairly doable to move the early generation into the same table format - maybe even fully reuse the generic code.
What's the size for the MMU tables? I think it may be simpler to use static tables for our early stage.
The size is determined dynamically from the memory map using some code that (as Steven found) is not 100% sound, but works well enough so far :).
That's the part I can't live with. Since we have very limited on-chip RAM, we have to know limit the size. But again, I do see the benefit to use unified structure for the 2nd stage.
I'm not quite sure I see how your current code works any differently. While the code to determine the page table pool size is dynamic, the outcome is static depending on your memory map. So the same memory map always means the same page table pool size.
We could also just hard code the size for the early phase for you I guess.
We can definitely try.
The thing that I tripped over while attempting conversion was that you don't always map phys==virt, unless other boards, and I didn't fully understand why.
True. We have some complication on the address mapping. For compatibility, each device is mapped (partially) under 32-bit space. If the device is too large to
Compatibility with what? Do we really need this in an AArch64 world?
It's not up to me. The SoC was designed this way. By the way, this SoC can work in AArch32 mode.
I think I'm slowly grasping what the problem is.
The fact that the SoC can run in AArch32 mode doesn't actually make a difference here though, since we're talking about U-Boot internal memory maps. The only reason to keep things mapped reachable from 32bits is if you want to run 32bit code with the U-Boot maps. I don't think you'd want to do that, no? :)
I don't really want to run 32-bit code. My point is the SoC was designed that way. We have DDR under the 32-bit space and in the high region. The same goes for the flash controller where the NOR is connected, as explained below.
For 32bit code I can definitely understand why you'd want to have phys != virt. But in a pure 64bit world (which this target really is, no?) I see little benefit on it.
fit, the rest is mapped to high regions. I remember one particular case on top of my head. It is the NOR flash we use for environmental variables. U-boot uses that address for saving, but also uses that for loading during booting. For our case, the NOR flash doesn't fit well in the low region, so it is remapped to high region after booting. To make the environmental variables accessible during boot, we mapped the high region phys with different virt, so u-boot doesn't have to know the low region address.
I might be missing the obvious, but why can't the environmental variables live in high regions?
It is in high region. But as I tried to explain, the default physical mapping of NOR flash (not MMU) is in low region out of reset.
I see. So the problem is during the transitioning phase from uncached to MMU enabled, where we'd end up at a different address.
Not exactly. We enable the cache very early for a performance boost on the emulator. It may sound trivial, but it makes a big difference when debugging software on emulators. Since we still use emulators for new products, I am not ready to drop the early MMU approach.
But you get the idea: the difference is before versus after relocation. After u-boot relocates itself into DDR, we remap the flash controller's physical address to the high region.
Could we just configure NOR to be in high memory in early asm init code, then always use the high physical NOR address range and jump to it from asm very early on? Then we could ignore the 32bit map and everything could just stay 1:1 mapped.
Out of reset, if booting from NOR flash, the flash controller is pre-configured to use the low region address. We can only reprogram the controller when u-boot is not running from it.
I see you are trying to maintain the 1:1 mapping for the MMU. Why so? I think the framework should allow a different mapping.
York

On 22.02.16 20:52, york sun wrote:
On 02/22/2016 11:42 AM, Alexander Graf wrote:
On 22.02.16 19:39, york sun wrote:
On 02/22/2016 10:31 AM, Alexander Graf wrote:
On Feb 22, 2016, at 7:12 PM, york sun york.sun@nxp.com wrote:
On 02/22/2016 10:02 AM, Alexander Graf wrote:
> Am 22.02.2016 um 18:37 schrieb york sun york.sun@nxp.com: > >> On 02/21/2016 05:57 PM, Alexander Graf wrote: >> Howdy, >> >> Currently on arm64 there is a big pile of mess when it comes to MMU >> support and page tables. Each board does its own little thing and the >> generic code is pretty dumb and nobody actually uses it. >> >> This patch set tries to clean that up. After this series is applied, >> all boards except for the FSL Layerscape ones are converted to the >> new generic page table logic and have icache+dcache enabled. >> >> The new code always uses 4k page size. It dynamically allocates 1G or >> 2M pages for ranges that fit. When a dcache attribute request comes in >> that requires a smaller granularity than our previous allocation could >> fulfill, pages get automatically split. >> >> I have tested and verified the code works on HiKey (bare metal), >> vexpress64 (Foundation Model) and zynqmp (QEMU). The TX1 target is >> untested, but given the simplicity of the maps I doubt it'll break. >> ThunderX in theory should also work, but I haven't tested it. I would >> be very happy if people with access to those system could give the patch >> set a try. >> >> With this we're a big step closer to a good base line for EFI payload >> support, since we can now just require that all boards always have dcache >> enabled. >> >> I would also be incredibly happy if some Freescale people could look >> at their MMU code and try to unify it into the now cleaned up generic >> code. I don't think we're far off here. > > Alex, > > Unified MMU will be great for all of us. The reason we started with our own MMU > table was size and performance. I don't know much about other ARMv8 SoCs. For > our use, we enable cache very early to speed up running, especially for > pre-silicon development on emulators. We don't have DDR to use for the early > stage and we have very limited on-chip SRAM. I believe we can use the unified > structure for our 2nd stage MMU when DDR is up.
Yup, and I think it should be fairly doable to move the early generation into the same table format - maybe even fully reuse the generic code.
What's the size for the MMU tables? I think it may be simpler to use static tables for our early stage.
The size is determined dynamically from the memory map using some code that (as Steven found) is not 100% sound, but works well enough so far :).
That's the part I can't live with. Since we have very limited on-chip RAM, we have to know limit the size. But again, I do see the benefit to use unified structure for the 2nd stage.
I'm not quite sure I see how your current code works any differently. While the code to determine the page table pool size is dynamic, the outcome is static depending on your memory map. So the same memory map always means the same page table pool size.
We could also just hard code the size for the early phase for you I guess.
We can definitely try.
The thing that I tripped over while attempting conversion was that you don't always map phys==virt, unless other boards, and I didn't fully understand why.
True. We have some complication on the address mapping. For compatibility, each device is mapped (partially) under 32-bit space. If the device is too large to
Compatibility with what? Do we really need this in an AArch64 world?
It's not up to me. The SoC was designed this way. By the way, this SoC can work in AArch32 mode.
I think I'm slowly grasping what the problem is.
The fact that the SoC can run in AArch32 mode doesn't actually make a difference here though, since we're talking about U-Boot internal memory maps. The only reason to keep things mapped reachable from 32bits is if you want to run 32bit code with the U-Boot maps. I don't think you'd want to do that, no? :)
I don't really want to run 32-bit code. My point is the SoC was designed that way. We have DDR under 32-bit space, and in high region. We have the same for flash controller where NOR is connected. Explained later below.
For 32bit code I can definitely understand why you'd want to have phys != virt. But in a pure 64bit world (which this target really is, no?) I see little benefit on it.
fit, the rest is mapped to high regions. I remember one particular case on top of my head. It is the NOR flash we use for environmental variables. U-boot uses that address for saving, but also uses that for loading during booting. For our case, the NOR flash doesn't fit well in the low region, so it is remapped to high region after booting. To make the environmental variables accessible during boot, we mapped the high region phys with different virt, so u-boot doesn't have to know the low region address.
I might be missing the obvious, but why can't the environmental variables live in high regions?
It is in high region. But as I tried to explain, the default physical mapping of NOR flash (not MMU) is in low region out of reset.
I see. So the problem is during the transitioning phase from uncached to MMU enabled, where we'd end up at a different address.
Not exactly. We enable cache very early for performance boost on emulator. It may sound trivial but it makes big difference when debugging software on emulators. Since we still use emulators for new product, I am not ready to drop the early MMU approach.
I'm surprised it is that slow for you. Running the Foundation model (which doesn't do early mmu FWIW) seemed to be fast enough.
But you get the idea, the difference is before and after relocation. After u-boot relocates itself into DDR, we remap flash controller physical address to high region.
Could we just configure NOR to be in high memory in early asm init code, then always use the high physical NOR address range and jump to it from asm very early on? Then we could ignore the 32bit map and everything could just stay 1:1 mapped.
Out of reset, if booting from NOR flash, the flash controller is pre-configured to use low region address. We can only reprogram the controller when u-boot is not running on it.
I see, so you keep the low map alive until you make the switch-over to DDR. Makes a lot of sense.
I guess I can give the conversion another stab now whenever I get a free night :). If I understand you correctly we'd only need to do non-1:1 maps for the early code, right?
I see you are trying to maintain the 1:1 mapping for MMU. Why so? I think the framework should allow different mapping.
Mostly for the sake of simplicity. It wouldn't be very different to extend the logic to support setting of va != pa, but I find code vastly easier to debug and understand if the address I see is the address I access.
Alex

On 02/22/2016 12:09 PM, Alexander Graf wrote:
On 22.02.16 20:52, york sun wrote:
On 02/22/2016 11:42 AM, Alexander Graf wrote:
On 22.02.16 19:39, york sun wrote:
On 02/22/2016 10:31 AM, Alexander Graf wrote:
On Feb 22, 2016, at 7:12 PM, york sun york.sun@nxp.com wrote:
On 02/22/2016 10:02 AM, Alexander Graf wrote: > > >> Am 22.02.2016 um 18:37 schrieb york sun york.sun@nxp.com: >> >>> On 02/21/2016 05:57 PM, Alexander Graf wrote: >>> Howdy, >>> >>> Currently on arm64 there is a big pile of mess when it comes to MMU >>> support and page tables. Each board does its own little thing and the >>> generic code is pretty dumb and nobody actually uses it. >>> >>> This patch set tries to clean that up. After this series is applied, >>> all boards except for the FSL Layerscape ones are converted to the >>> new generic page table logic and have icache+dcache enabled. >>> >>> The new code always uses 4k page size. It dynamically allocates 1G or >>> 2M pages for ranges that fit. When a dcache attribute request comes in >>> that requires a smaller granularity than our previous allocation could >>> fulfill, pages get automatically split. >>> >>> I have tested and verified the code works on HiKey (bare metal), >>> vexpress64 (Foundation Model) and zynqmp (QEMU). The TX1 target is >>> untested, but given the simplicity of the maps I doubt it'll break. >>> ThunderX in theory should also work, but I haven't tested it. I would >>> be very happy if people with access to those system could give the patch >>> set a try. >>> >>> With this we're a big step closer to a good base line for EFI payload >>> support, since we can now just require that all boards always have dcache >>> enabled. >>> >>> I would also be incredibly happy if some Freescale people could look >>> at their MMU code and try to unify it into the now cleaned up generic >>> code. I don't think we're far off here. >> >> Alex, >> >> Unified MMU will be great for all of us. The reason we started with our own MMU >> table was size and performance. I don't know much about other ARMv8 SoCs. For >> our use, we enable cache very early to speed up running, especially for >> pre-silicon development on emulators. We don't have DDR to use for the early >> stage and we have very limited on-chip SRAM. I believe we can use the unified >> structure for our 2nd stage MMU when DDR is up. > > Yup, and I think it should be fairly doable to move the early generation into the same table format - maybe even fully reuse the generic code.
What's the size for the MMU tables? I think it may be simpler to use static tables for our early stage.
The size is determined dynamically from the memory map using some code that (as Steven found) is not 100% sound, but works well enough so far :).
That's the part I can't live with. Since we have very limited on-chip RAM, we have to know limit the size. But again, I do see the benefit to use unified structure for the 2nd stage.
I'm not quite sure I see how your current code works any differently. While the code to determine the page table pool size is dynamic, the outcome is static depending on your memory map. So the same memory map always means the same page table pool size.
We could also just hard code the size for the early phase for you I guess.
We can definitely try.
> > The thing that I tripped over while attempting conversion was that you don't always map phys==virt, unless other boards, and I didn't fully understand why. > True. We have some complication on the address mapping. For compatibility, each device is mapped (partially) under 32-bit space. If the device is too large to
Compatibility with what? Do we really need this in an AArch64 world?
It's not up to me. The SoC was designed this way. By the way, this SoC can work in AArch32 mode.
I think I'm slowly grasping what the problem is.
The fact that the SoC can run in AArch32 mode doesn't actually make a difference here though, since we're talking about U-Boot internal memory maps. The only reason to keep things mapped reachable from 32bits is if you want to run 32bit code with the U-Boot maps. I don't think you'd want to do that, no? :)
I don't really want to run 32-bit code. My point is the SoC was designed that way. We have DDR under 32-bit space, and in high region. We have the same for flash controller where NOR is connected. Explained later below.
For 32bit code I can definitely understand why you'd want to have phys != virt. But in a pure 64bit world (which this target really is, no?) I see little benefit on it.
fit, the rest is mapped to high regions. I remember one particular case on top of my head. It is the NOR flash we use for environmental variables. U-boot uses that address for saving, but also uses that for loading during booting. For our case, the NOR flash doesn't fit well in the low region, so it is remapped to high region after booting. To make the environmental variables accessible during boot, we mapped the high region phys with different virt, so u-boot doesn't have to know the low region address.
I might be missing the obvious, but why can't the environmental variables live in high regions?
It is in high region. But as I tried to explain, the default physical mapping of NOR flash (not MMU) is in low region out of reset.
I see. So the problem is during the transitioning phase from uncached to MMU enabled, where we'd end up at a different address.
Not exactly. We enable cache very early for performance boost on emulator. It may sound trivial but it makes big difference when debugging software on emulators. Since we still use emulators for new product, I am not ready to drop the early MMU approach.
I'm surprised it is that slow for you. Running the Foundation model (which doesn't do early mmu FWIW) seemed to be fast enough.
The Foundation model is a simulator, not an emulator. Our emulator runs on hardware. It is much, much slower than a simulator, but more accurate at a lower level.
But you get the idea, the difference is before and after relocation. After u-boot relocates itself into DDR, we remap flash controller physical address to high region.
Could we just configure NOR to be in high memory in early asm init code, then always use the high physical NOR address range and jump to it from asm very early on? Then we could ignore the 32bit map and everything could just stay 1:1 mapped.
Out of reset, if booting from NOR flash, the flash controller is pre-configured to use low region address. We can only reprogram the controller when u-boot is not running on it.
I see, so you keep the low map alive until you make the switch-over to DDR. Makes a lot of sense.
I guess I can give the conversion another stab now whenever I get a free night :). If I understand you correctly we'd only need to do non-1:1 maps for the early code, right?
So far, yes. But we don't want to block ourselves from using non-1:1 mapping down the road, do we?
I see you are trying to maintain the 1:1 mapping for MMU. Why so? I think the framework should allow different mapping.
Mostly for the sake of simplicity. It wouldn't be very different to extend the logic to support setting of va != pa, but I find code vastly easier to debug and understand if the address I see is the address I access.
Agreed.
York

On 22.02.16 21:15, york sun wrote:
On 02/22/2016 12:09 PM, Alexander Graf wrote:
On 22.02.16 20:52, york sun wrote:
On 02/22/2016 11:42 AM, Alexander Graf wrote:
On 22.02.16 19:39, york sun wrote:
On 02/22/2016 10:31 AM, Alexander Graf wrote:
On Feb 22, 2016, at 7:12 PM, york sun york.sun@nxp.com wrote:
> On 02/22/2016 10:02 AM, Alexander Graf wrote: >> >> >>> Am 22.02.2016 um 18:37 schrieb york sun york.sun@nxp.com: >>> >>>> On 02/21/2016 05:57 PM, Alexander Graf wrote: >>>> Howdy, >>>> >>>> Currently on arm64 there is a big pile of mess when it comes to MMU >>>> support and page tables. Each board does its own little thing and the >>>> generic code is pretty dumb and nobody actually uses it. >>>> >>>> This patch set tries to clean that up. After this series is applied, >>>> all boards except for the FSL Layerscape ones are converted to the >>>> new generic page table logic and have icache+dcache enabled. >>>> >>>> The new code always uses 4k page size. It dynamically allocates 1G or >>>> 2M pages for ranges that fit. When a dcache attribute request comes in >>>> that requires a smaller granularity than our previous allocation could >>>> fulfill, pages get automatically split. >>>> >>>> I have tested and verified the code works on HiKey (bare metal), >>>> vexpress64 (Foundation Model) and zynqmp (QEMU). The TX1 target is >>>> untested, but given the simplicity of the maps I doubt it'll break. >>>> ThunderX in theory should also work, but I haven't tested it. I would >>>> be very happy if people with access to those system could give the patch >>>> set a try. >>>> >>>> With this we're a big step closer to a good base line for EFI payload >>>> support, since we can now just require that all boards always have dcache >>>> enabled. >>>> >>>> I would also be incredibly happy if some Freescale people could look >>>> at their MMU code and try to unify it into the now cleaned up generic >>>> code. I don't think we're far off here. >>> >>> Alex, >>> >>> Unified MMU will be great for all of us. The reason we started with our own MMU >>> table was size and performance. I don't know much about other ARMv8 SoCs. For >>> our use, we enable cache very early to speed up running, especially for >>> pre-silicon development on emulators. We don't have DDR to use for the early >>> stage and we have very limited on-chip SRAM. I believe we can use the unified >>> structure for our 2nd stage MMU when DDR is up. >> >> Yup, and I think it should be fairly doable to move the early generation into the same table format - maybe even fully reuse the generic code. > > What's the size for the MMU tables? I think it may be simpler to use static > tables for our early stage.
The size is determined dynamically from the memory map using some code that (as Steven found) is not 100% sound, but works well enough so far :).
That's the part I can't live with. Since we have very limited on-chip RAM, we have to know limit the size. But again, I do see the benefit to use unified structure for the 2nd stage.
I'm not quite sure I see how your current code works any differently. While the code to determine the page table pool size is dynamic, the outcome is static depending on your memory map. So the same memory map always means the same page table pool size.
We could also just hard code the size for the early phase for you I guess.
We can definitely try.
> >> >> The thing that I tripped over while attempting conversion was that you don't always map phys==virt, unless other boards, and I didn't fully understand why. >> > True. We have some complication on the address mapping. For compatibility, each > device is mapped (partially) under 32-bit space. If the device is too large to
Compatibility with what? Do we really need this in an AArch64 world?
It's not up to me. The SoC was designed this way. By the way, this SoC can work in AArch32 mode.
I think I'm slowly grasping what the problem is.
The fact that the SoC can run in AArch32 mode doesn't actually make a difference here though, since we're talking about U-Boot internal memory maps. The only reason to keep things mapped reachable from 32bits is if you want to run 32bit code with the U-Boot maps. I don't think you'd want to do that, no? :)
I don't really want to run 32-bit code. My point is the SoC was designed that way. We have DDR under 32-bit space, and in high region. We have the same for flash controller where NOR is connected. Explained later below.
For 32bit code I can definitely understand why you'd want to have phys != virt. But in a pure 64bit world (which this target really is, no?) I see little benefit on it.
> fit, the rest is mapped to high regions. I remember one particular case on top > of my head. It is the NOR flash we use for environmental variables. U-boot uses > that address for saving, but also uses that for loading during booting. For our > case, the NOR flash doesn't fit well in the low region, so it is remapped to > high region after booting. To make the environmental variables accessible during > boot, we mapped the high region phys with different virt, so u-boot doesn't have > to know the low region address.
I might be missing the obvious, but why can't the environmental variables live in high regions?
It is in high region. But as I tried to explain, the default physical mapping of NOR flash (not MMU) is in low region out of reset.
I see. So the problem is during the transitioning phase from uncached to MMU enabled, where we'd end up at a different address.
Not exactly. We enable cache very early for performance boost on emulator. It may sound trivial but it makes big difference when debugging software on emulators. Since we still use emulators for new product, I am not ready to drop the early MMU approach.
I'm surprised it is that slow for you. Running the Foundation model (which doesn't do early mmu FWIW) seemed to be fast enough.
Foundation model is a simulator, not an emulator. Our emulator runs on hardware. It is much much slower than simulator, but more accurate on lower level.
Ah, I remember the confusion in terminology from the PPC times :).
But you get the idea, the difference is before and after relocation. After u-boot relocates itself into DDR, we remap flash controller physical address to high region.
Could we just configure NOR to be in high memory in early asm init code, then always use the high physical NOR address range and jump to it from asm very early on? Then we could ignore the 32bit map and everything could just stay 1:1 mapped.
Out of reset, if booting from NOR flash, the flash controller is pre-configured to use low region address. We can only reprogram the controller when u-boot is not running on it.
I see, so you keep the low map alive until you make the switch-over to DDR. Makes a lot of sense.
I guess I can give the conversion another stab now whenever I get a free night :). If I understand you correctly we'd only need to do non-1:1 maps for the early code, right?
So far, yes. But we don't want to block ourselves from using non-1:1 mapping down the road, do we?
We're not blocking ourselves at all if we stick to the verbose struct definition. We can just add a va field later on and default to 1:1 if it's not set.
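A minimal sketch of that "add a va field later" idea (an assumption about a possible extension, not code from this series): the region struct grows an optional virtual base, and lookups fall back to a 1:1 mapping when it is left unset, so existing maps keep working unchanged.

#include <stdint.h>

struct mm_region {
	uint64_t base;	/* physical base address */
	uint64_t virt;	/* virtual base; 0 means "map 1:1 at base" */
	uint64_t size;
	uint64_t attrs;
};

static inline uint64_t region_virt_base(const struct mm_region *r)
{
	/* simplification: a region that really wants virt == 0 with a
	 * non-zero phys would need an explicit flag instead */
	return r->virt ? r->virt : r->base;
}

Existing board maps that never set the new field would then keep today's 1:1 behaviour, which matches the default described here.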
I've also reworked the page table pool size calculation now, so it can properly determine the required size without much RAM overhead, at the expense of a few cycles. If it's too slow, you can always override it in your machine file with a constant value.
Alex

On 02/24/2016 03:19 AM, Alexander Graf wrote:
On 22.02.16 21:15, york sun wrote:
On 02/22/2016 12:09 PM, Alexander Graf wrote:
On 22.02.16 20:52, york sun wrote:
On 02/22/2016 11:42 AM, Alexander Graf wrote:
On 22.02.16 19:39, york sun wrote:
On 02/22/2016 10:31 AM, Alexander Graf wrote: > > On Feb 22, 2016, at 7:12 PM, york sun york.sun@nxp.com wrote: > >> On 02/22/2016 10:02 AM, Alexander Graf wrote: >>> >>> >>>> Am 22.02.2016 um 18:37 schrieb york sun york.sun@nxp.com: >>>> >>>>> On 02/21/2016 05:57 PM, Alexander Graf wrote: >>>>> Howdy, >>>>> >>>>> Currently on arm64 there is a big pile of mess when it comes to MMU >>>>> support and page tables. Each board does its own little thing and the >>>>> generic code is pretty dumb and nobody actually uses it. >>>>> >>>>> This patch set tries to clean that up. After this series is applied, >>>>> all boards except for the FSL Layerscape ones are converted to the >>>>> new generic page table logic and have icache+dcache enabled. >>>>> >>>>> The new code always uses 4k page size. It dynamically allocates 1G or >>>>> 2M pages for ranges that fit. When a dcache attribute request comes in >>>>> that requires a smaller granularity than our previous allocation could >>>>> fulfill, pages get automatically split. >>>>> >>>>> I have tested and verified the code works on HiKey (bare metal), >>>>> vexpress64 (Foundation Model) and zynqmp (QEMU). The TX1 target is >>>>> untested, but given the simplicity of the maps I doubt it'll break. >>>>> ThunderX in theory should also work, but I haven't tested it. I would >>>>> be very happy if people with access to those system could give the patch >>>>> set a try. >>>>> >>>>> With this we're a big step closer to a good base line for EFI payload >>>>> support, since we can now just require that all boards always have dcache >>>>> enabled. >>>>> >>>>> I would also be incredibly happy if some Freescale people could look >>>>> at their MMU code and try to unify it into the now cleaned up generic >>>>> code. I don't think we're far off here. >>>> >>>> Alex, >>>> >>>> Unified MMU will be great for all of us. The reason we started with our own MMU >>>> table was size and performance. I don't know much about other ARMv8 SoCs. For >>>> our use, we enable cache very early to speed up running, especially for >>>> pre-silicon development on emulators. We don't have DDR to use for the early >>>> stage and we have very limited on-chip SRAM. I believe we can use the unified >>>> structure for our 2nd stage MMU when DDR is up. >>> >>> Yup, and I think it should be fairly doable to move the early generation into the same table format - maybe even fully reuse the generic code. >> >> What's the size for the MMU tables? I think it may be simpler to use static >> tables for our early stage. > > The size is determined dynamically from the memory map using some code that (as Steven found) is not 100% sound, but works well enough so far :).
That's the part I can't live with. Since we have very limited on-chip RAM, we have to know and limit the size. But again, I do see the benefit of using a unified structure for the 2nd stage.
I'm not quite sure I see how your current code works any differently. While the code that determines the page table pool size is dynamic, the outcome is static for a given memory map. So the same memory map always means the same page table pool size.
We could also just hard code the size for the early phase for you I guess.
We can definitely try.
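[Editor's note: concretely, "hard coding the size for the early phase" could be as simple as a fixed reservation in the board config header. The symbol name below is hypothetical, shown only to illustrate the shape of such an override.]

```c
/*
 * Hypothetical override: skip the dynamic estimate and reserve a fixed
 * 64 KiB page table pool for the early, SRAM-only stage.
 */
#define CONFIG_SYS_EARLY_PGTABLE_SIZE	(64 * 1024)
```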
The thing that I tripped over while attempting the conversion was that you don't always map phys==virt, unlike other boards, and I didn't fully understand why.
True. We have some complication in the address mapping. For compatibility, each device is mapped (partially) under the 32-bit space.
Compatibility with what? Do we really need this in an AArch64 world?
It's not up to me. The SoC was designed this way. By the way, this SoC can work in AArch32 mode.
I think I'm slowly grasping what the problem is.
The fact that the SoC can run in AArch32 mode doesn't actually make a difference here though, since we're talking about U-Boot internal memory maps. The only reason to keep things mapped so that they are reachable from 32 bits is if you want to run 32-bit code with the U-Boot maps. I don't think you'd want to do that, no? :)
I don't really want to run 32-bit code. My point is that the SoC was designed that way. We have DDR under the 32-bit space and in the high region. We have the same for the flash controller where the NOR is connected. This is explained further below.
For 32-bit code I can definitely understand why you'd want to have phys != virt. But in a pure 64-bit world (which this target really is, no?) I see little benefit in it.
If the device is too large to fit, the rest is mapped to high regions. I remember one particular case off the top of my head. It is the NOR flash we use for environment variables. U-Boot uses that address for saving, but also uses that for loading during booting. For our case, the NOR flash doesn't fit well in the low region, so it is remapped to the high region after booting. To make the environment variables accessible during boot, we mapped the high region phys with a different virt, so U-Boot doesn't have to know the low region address.
I might be missing the obvious, but why can't the environment variables live in high regions?
It is in the high region. But as I tried to explain, the default physical mapping of the NOR flash (not the MMU mapping) is in the low region out of reset.
I see. So the problem is the transition phase from uncached to MMU-enabled, where we'd end up at a different address.
Not exactly. We enable cache very early for a performance boost on the emulator. It may sound trivial, but it makes a big difference when debugging software on emulators. Since we still use emulators for new products, I am not ready to drop the early MMU approach.
I'm surprised it is that slow for you. Running the Foundation model (which doesn't do early mmu FWIW) seemed to be fast enough.
The Foundation model is a simulator, not an emulator. Our emulator runs on hardware. It is much, much slower than a simulator, but more accurate at a lower level.
Ah, I remember the confusion in terminology from the PPC times :).
But you get the idea: the difference is before and after relocation. After U-Boot relocates itself into DDR, we remap the flash controller's physical address to the high region.
Could we just configure NOR to be in high memory in early asm init code, then always use the high physical NOR address range and jump to it from asm very early on? Then we could ignore the 32bit map and everything could just stay 1:1 mapped.
Out of reset, if booting from NOR flash, the flash controller is pre-configured to use the low region address. We can only reprogram the controller when U-Boot is not running from it.
I see, so you keep the low map alive until you make the switch-over to DDR. Makes a lot of sense.
I guess I can give the conversion another stab whenever I get a free night :). If I understand you correctly we'd only need to do non-1:1 maps for the early code, right?
So far, yes. But we don't want to block ourselves from using non-1:1 mapping down the road, do we?
We're not blocking ourselves at all if we stick to the verbose struct definition. We can just add a va field later on and default to 1:1 if it's not set.
Well, that rather precludes a VA of 0 being valid. Still, we should be able to easily find all instances of the table, and simply edit them to set the VA field to the current PA value, rather than relying on comparing the VA field against 0.
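[Editor's note: to make the suggestion concrete, here is a sketch, not the actual series code, of what a mem_map table with an explicit va member could look like. Following Stephen's remark, every entry fills in va explicitly (identity-mapped regions simply set it to base), so nothing has to treat a zero field as "1:1". Field names other than base/size, the attribute macros, and the Layerscape-style addresses, including the low-region NOR alias discussed above, are all illustrative.]

```c
/*
 * Sketch of the proposal only. Each entry names both addresses
 * explicitly, so identity-mapped regions set va == base and no
 * "va == 0 means 1:1" convention is needed.
 */
struct mm_region {
	u64 va;		/* virtual base of the region */
	u64 base;	/* physical base of the region */
	u64 size;	/* region size in bytes */
	u64 attrs;	/* memory type / shareability attributes */
};

static struct mm_region example_mem_map[] = {
	{
		/* DDR: identity mapped, so va is simply set to base */
		.va	= 0x80000000UL,
		.base	= 0x80000000UL,
		.size	= 0x80000000UL,
		.attrs	= PTE_BLOCK_MEMTYPE(MT_NORMAL) |
			  PTE_BLOCK_INNER_SHARE,
	}, {
		/* NOR flash: 32-bit-reachable alias of the high-region phys */
		.va	= 0x20000000UL,
		.base	= 0x500000000UL,
		.size	= 0x10000000UL,
		.attrs	= PTE_BLOCK_MEMTYPE(MT_DEVICE_NGNRNE) |
			  PTE_BLOCK_NON_SHARE,
	},
};
```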

On 02/21/2016 06:57 PM, Alexander Graf wrote:
The series, Tested-by: Stephen Warren swarren@nvidia.com
I tested the p2371-2180 (Jetson TX1) ARMv8 development board, including PCIe-based RTL8169 Ethernet, which makes use of the APIs to allocate/convert memory as/to uncached.
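[Editor's note: for readers wondering what "allocate/convert memory as/to uncached" refers to, the sketch below shows the conversion side: a driver takes a buffer it already owns and flips its cache attributes, which with this series transparently splits a covering 2M/1G block mapping into 4k pages when the region is smaller than the existing granularity. This is a hedged sketch under the assumption that memalign(), ALIGN() and mmu_set_region_dcache_behaviour() behave as in mainline U-Boot; the helper name is made up and exact prototypes may differ between versions.]

```c
#include <common.h>
#include <malloc.h>		/* memalign() */
#include <asm/system.h>		/* mmu_set_region_dcache_behaviour() */

/*
 * Sketch (assumed helper, not taken from the rtl8169 driver): allocate
 * a descriptor ring and convert it to uncached memory, so device and
 * CPU agree on its contents without explicit cache maintenance. The
 * size is rounded up to a whole 4k page because that is the smallest
 * granularity the page tables can describe.
 */
static void *alloc_uncached_ring(size_t len)
{
	void *ring;

	len = ALIGN(len, 4096);
	ring = memalign(4096, len);
	if (!ring)
		return NULL;

	mmu_set_region_dcache_behaviour((phys_addr_t)(unsigned long)ring,
					len, DCACHE_OFF);

	return ring;
}
```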

On 22.02.16 19:34, Stephen Warren wrote:
On 02/21/2016 06:57 PM, Alexander Graf wrote:
The series, Tested-by: Stephen Warren swarren@nvidia.com
I tested the p2371-2180 (Jetson TX1) ARMv8 development board, including PCIe-based RTL8169 Ethernet, which makes use of the APIs to allocate/convert memory as/to uncached.
Thanks a bunch for testing! I won't include the tag in v4, since there are enough changes that could break something. I would greatly appreciate it if you could test it again then :).
Alex
participants (6)
- Alexander Graf
- brettstahlman
- Michal Simek
- Simon Glass
- Stephen Warren
- york sun