[U-Boot] [PATCH 0/9] net: rtl8169: Fix cache maintenance issues

From: Thierry Reding <treding@nvidia.com>
This series attempts to fix a long-standing problem in the rtl8169 driver (though the same problem may exist in other drivers as well). Let me first explain what exactly the issue is:
The rtl8169 driver provides a set of RX and TX descriptors for the device to use. Once they're set up, the device is told about their location so that it can fetch the descriptors using DMA. The device will also write packet state back into these descriptors using DMA. For this to work properly, whenever a driver needs to access these descriptors it needs to invalidate the D-cache line(s) associated with them. Similarly when changes to the descriptor have been made by the driver, the cache lines need to be flushed to make sure the changes are visible to the device.
The descriptors are 16 bytes in size. This causes problems on CPUs whose cache-line size is larger than 16 bytes. One example is the NVIDIA Tegra124, which has 64-byte cache-lines, so that 4 descriptors fit into a single cache-line. Whenever the driver flushes a cache-line it can therefore discard changes made to another descriptor by the DMA device. One typical symptom is that large transfers over TFTP often hang somewhere midway and never complete, because the device marked a packet as received but the driver flushed the cache and caused the packet to be lost.
Since the descriptors need to be consecutive in memory, I don't see a way to fix this other than to use uncached memory. Therefore the solution proposed in this patch series is to introduce a mechanism in U-Boot that allows a driver to allocate from a pool of uncached memory. Currently an implementation is provided only for ARM v7. The idea is that a region (of user-definable size) immediately below the malloc() area (taking into account architecture-specific alignment restrictions) is mapped uncacheable in the MMU. A driver can use the new noncached_alloc() function to dynamically allocate a chunk of memory from this pool for buffers that it can't, or doesn't want to, do any explicit cache maintenance on, yet needs to share with DMA devices.
Patches 1-3 are minor preparatory work. Patch 1 cleans up some coding style issues in the ARM v7 cache code and patch 2 uses more future-proof types for the mmu_set_region_dcache_behaviour() function arguments. Patch 3 is purely for debugging purposes. It will print out the region used by malloc() when DEBUG is enabled. This can be useful to see where the malloc() region is in the memory map (compared to the noncached region introduced in a later patch for example).
Patch 4 implements the noncached API for ARM v7. It obtains the start of the malloc() area and places the noncached region immediately below it so that noncached_alloc() can allocate from it. During boot, the noncached area is set up immediately after the malloc() area has been initialized.
Patch 5 enables noncached memory for all Tegra boards. It uses a 1 MiB chunk which should be plenty (it's also the minimum on ARM v7 because it matches the MMU section size and therefore the granularity at which U-Boot can set the cacheable attributes).
Patch 6 is not really related but just something I stumbled across when going through the code. According to the top-level README file, network drivers are supposed to respect the CONFIG_SYS_RX_ETH_BUFFER setting. rtl8169 doesn't currently do that, so this patch fixes it.
Patch 7 is the result of earlier rework that still aimed at solving the problem using explicit cache maintenance. rtl8169 hardware requires buffers to be aligned to 256 byte boundaries. The rtl8169 driver used to employ some trickery to make that work, but nowadays there are macros that can be used to the same effect, so this patch uses them and gets rid of the custom trickery. This patch also prints out a warning if it detects a potential caching issue (i.e. ARCH_DMA_MINALIGN > sizeof(struct RxDesc)).
Patch 8 finally adds optional support for non-cached memory. When available the driver will now use the noncached API to obtain uncached buffers for the RX and TX descriptor rings. At the same time the cache-maintenance functions for the RX and TX descriptors become no-ops so that the code can work with or without the noncached API available.
With all of the above in place, patch 9 adds support for RTL-8168/8111g as found on the NVIDIA Jetson TK1 board (which has a Tegra124 SoC).
Note that this series also fixes the sporadic hangs of large TFTP transfers for earlier SoC generations of Tegra (Tegra20 and Tegra30), though they were less frequent there, probably caused by the cache-lines being 32 bytes rather than 64.
Thierry
Thierry Reding (9):
  ARM: cache_v7: Various minor cleanups
  ARM: cache-cp15: Use unsigned long for address and size
  malloc: Output region when debugging
  ARM: Implement non-cached memory support
  ARM: tegra: Enable non-cached memory
  net: rtl8169: Honor CONFIG_SYS_RX_ETH_BUFFER
  net: rtl8169: Properly align buffers
  net: rtl8169: Use non-cached memory if available
  net: rtl8169: Add support for RTL-8168/8111g
 README                         |  16 ++++
 arch/arm/cpu/armv7/cache_v7.c  |  14 +++---
 arch/arm/include/asm/system.h  |   7 ++-
 arch/arm/lib/cache-cp15.c      |   6 +--
 arch/arm/lib/cache.c           |  41 +++++++++++++++
 common/board_r.c               |  11 +++++
 common/dlmalloc.c              |   3 ++
 drivers/net/rtl8169.c          | 110 ++++++++++++++++++++++++++++++-----------
 include/configs/tegra-common.h |   1 +
 9 files changed, 168 insertions(+), 41 deletions(-)

From: Thierry Reding <treding@nvidia.com>
Remove two gratuitous blank lines, use u32 (instead of int) as the type for values that will be written to a register, move the beginning of the variable declaration section to a separate line (rather than the one with the opening brace) and keep function signatures on a single line where possible.
Signed-off-by: Thierry Reding <treding@nvidia.com>
---
 arch/arm/cpu/armv7/cache_v7.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)
diff --git a/arch/arm/cpu/armv7/cache_v7.c b/arch/arm/cpu/armv7/cache_v7.c
index a2c4032fed8c..0f9d8377ed5a 100644
--- a/arch/arm/cpu/armv7/cache_v7.c
+++ b/arch/arm/cpu/armv7/cache_v7.c
@@ -21,7 +21,8 @@
  * to get size details from Current Cache Size ID Register(CCSIDR)
  */
 static void set_csselr(u32 level, u32 type)
-{	u32 csselr = level << 1 | type;
+{
+	u32 csselr = level << 1 | type;

 	/* Write to Cache Size Selection Register(CSSELR) */
 	asm volatile ("mcr p15, 2, %0, c0, c0, 0" : : "r" (csselr));
@@ -49,7 +50,8 @@ static void v7_inval_dcache_level_setway(u32 level, u32 num_sets,
 					 u32 num_ways, u32 way_shift,
 					 u32 log2_line_len)
 {
-	int way, set, setway;
+	int way, set;
+	u32 setway;

 	/*
 	 * For optimal assembly code:
@@ -73,7 +75,8 @@ static void v7_clean_inval_dcache_level_setway(u32 level, u32 num_sets,
 					       u32 num_ways, u32 way_shift,
 					       u32 log2_line_len)
 {
-	int way, set, setway;
+	int way, set;
+	u32 setway;

 	/*
 	 * For optimal assembly code:
@@ -134,7 +137,6 @@ static void v7_maint_dcache_level_setway(u32 level, u32 operation)
 static void v7_maint_dcache_all(u32 operation)
 {
 	u32 level, cache_type, level_start_bit = 0;
-
 	u32 clidr = get_clidr();

 	for (level = 0; level < 7; level++) {
@@ -147,8 +149,7 @@ static void v7_maint_dcache_all(u32 operation)
 	}
 }

-static void v7_dcache_clean_inval_range(u32 start,
-					u32 stop, u32 line_len)
+static void v7_dcache_clean_inval_range(u32 start, u32 stop, u32 line_len)
 {
 	u32 mva;

@@ -256,7 +257,6 @@ void flush_dcache_all(void)
  */
 void invalidate_dcache_range(unsigned long start, unsigned long stop)
 {
-
 	v7_dcache_maint_range(start, stop, ARMV7_DCACHE_INVAL_RANGE);

 	v7_outer_cache_inval_range(start, stop);

From: Thierry Reding <treding@nvidia.com>
size is always non-negative, so it should be unsigned, and both the address and the size can be larger than 32 bits on 64-bit architectures. Change mmu_set_region_dcache_behaviour() to use these types in anticipation of making the API available on other architectures.
Signed-off-by: Thierry Reding <treding@nvidia.com>
---
 arch/arm/include/asm/system.h | 2 +-
 arch/arm/lib/cache-cp15.c     | 6 +++---
 2 files changed, 4 insertions(+), 4 deletions(-)
diff --git a/arch/arm/include/asm/system.h b/arch/arm/include/asm/system.h
index d51ba668f323..4b771c261a8b 100644
--- a/arch/arm/include/asm/system.h
+++ b/arch/arm/include/asm/system.h
@@ -200,7 +200,7 @@ enum {
  * \param size	size of memory region to change
  * \param option	dcache option to select
  */
-void mmu_set_region_dcache_behaviour(u32 start, int size,
+void mmu_set_region_dcache_behaviour(unsigned long start, unsigned long size,
 				     enum dcache_option option);

 /**
diff --git a/arch/arm/lib/cache-cp15.c b/arch/arm/lib/cache-cp15.c
index 5fdfdbfca541..b52576dc64bd 100644
--- a/arch/arm/lib/cache-cp15.c
+++ b/arch/arm/lib/cache-cp15.c
@@ -47,15 +47,15 @@ __weak void mmu_page_table_flush(unsigned long start, unsigned long stop)
 	debug("%s: Warning: not implemented\n", __func__);
 }

-void mmu_set_region_dcache_behaviour(u32 start, int size,
+void mmu_set_region_dcache_behaviour(unsigned long start, unsigned long size,
 				     enum dcache_option option)
 {
 	u32 *page_table = (u32 *)gd->arch.tlb_addr;
-	u32 upto, end;
+	unsigned long upto, end;

 	end = ALIGN(start + size, MMU_SECTION_SIZE) >> MMU_SECTION_SHIFT;
 	start = start >> MMU_SECTION_SHIFT;
-	debug("%s: start=%x, size=%x, option=%d\n", __func__, start, size,
+	debug("%s: start=%lx, size=%lx, option=%d\n", __func__, start, size,
 	      option);
 	for (upto = start; upto < end; upto++)
 		set_section_dcache(upto, option);

On 08/18/2014 02:00 AM, Thierry Reding wrote:
From: Thierry Reding <treding@nvidia.com>
size is always non-negative, so it should be unsigned, and both the address and the size can be larger than 32 bits on 64-bit architectures. Change mmu_set_region_dcache_behaviour() to use these types in anticipation of making the API available on other architectures.
diff --git a/arch/arm/include/asm/system.h b/arch/arm/include/asm/system.h
-void mmu_set_region_dcache_behaviour(u32 start, int size,
+void mmu_set_region_dcache_behaviour(unsigned long start, unsigned long size,
 				     enum dcache_option option);
If we were to use LPAE on a 32-bit system, physical addresses could be more than 32-bit. That would imply we should create a physaddr_t type rather than relying on unsigned long. Still, I suppose since U-Boot just maps RAM (and everything else) 1:1, we'd never use RAM beyond 4GiB, so LPAE actually isn't that interesting...

On Wed, Aug 20, 2014 at 01:15:15PM -0600, Stephen Warren wrote:
On 08/18/2014 02:00 AM, Thierry Reding wrote:
From: Thierry Reding <treding@nvidia.com>
size is always non-negative, so it should be unsigned, and both the address and the size can be larger than 32 bits on 64-bit architectures. Change mmu_set_region_dcache_behaviour() to use these types in anticipation of making the API available on other architectures.
diff --git a/arch/arm/include/asm/system.h b/arch/arm/include/asm/system.h
-void mmu_set_region_dcache_behaviour(u32 start, int size,
+void mmu_set_region_dcache_behaviour(unsigned long start, unsigned long size,
 				     enum dcache_option option);
If we were to use LPAE on a 32-bit system, physical addresses could be more than 32-bit. That would imply we should create a physaddr_t type rather than relying on unsigned long. Still, I suppose since U-Boot just maps RAM (and everything else) 1:1, we'd never use RAM beyond 4GiB, so LPAE actually isn't that interesting...
Interestingly, there is already a phys_addr_t type defined (unlike in Linux, it's defined in arch/*/include/asm/types.h). It doesn't currently take things like LPAE into account, but then again there's no standard configuration option to select that (Linux has CONFIG_PHYS_ADDR_T_64BIT for this). CONFIG_PHYS_64BIT seems to come close, but it's currently only defined by true 64-bit architectures.
I took a stab at unifying the various definitions in linux/types.h, but that resulted in all kinds of weird build errors all over the place, so at some point I gave up and kept the definitions as they are. Still, for this series I've tried to use the existing phys_addr_t where appropriate, so it should just be a matter of properly redefining it for LPAE if support for that ever gets added.
I've refrained from using phys_size_t since I don't see why it would be useful and instead used size_t consistently where a size is involved. If anyone feels strongly about using phys_size_t I can use it instead, though.
Thierry

From: Thierry Reding <treding@nvidia.com>
When DEBUG is set, output memory region used for malloc().
Signed-off-by: Thierry Reding <treding@nvidia.com>
---
 common/dlmalloc.c | 3 +++
 1 file changed, 3 insertions(+)
diff --git a/common/dlmalloc.c b/common/dlmalloc.c
index f9873393c183..3d6391e60acf 100644
--- a/common/dlmalloc.c
+++ b/common/dlmalloc.c
@@ -1533,6 +1533,9 @@ void mem_malloc_init(ulong start, ulong size)
 	mem_malloc_end = start + size;
 	mem_malloc_brk = start;

+	debug("using memory %#lx-%#lx for malloc()\n", mem_malloc_start,
+	      mem_malloc_end);
+
 	memset((void *)mem_malloc_start, 0, size);

 	malloc_bin_reloc();

From: Thierry Reding <treding@nvidia.com>
Implement an API that can be used by drivers to allocate memory from a poll that is mapped uncached. This is useful if drivers would otherwise need to do extensive cache maintenance (or explicitly maintaining the cache isn't safe).
The API is protected using the new CONFIG_SYS_NONCACHED_MEMORY setting. Boards can set this to the size to be used for the non-cached area. The area will typically be right below the malloc() area, but architectures should take care of aligning the beginning and end of the area to honor any mapping restrictions. Architectures must also ensure that mappings established for this area do not overlap with the malloc() area (which should remain cached for improved performance).
While the API is currently only implemented for ARM v7, it should be generic enough to allow other architectures to implement it as well.
Signed-off-by: Thierry Reding <treding@nvidia.com>
---
 README                        | 16 ++++++++++++++++
 arch/arm/include/asm/system.h |  5 +++++
 arch/arm/lib/cache.c          | 41 +++++++++++++++++++++++++++++++++++++++++
 common/board_r.c              | 11 +++++++++++
 4 files changed, 73 insertions(+)
diff --git a/README b/README
index 1d713596ec6f..807b2d0ef0f9 100644
--- a/README
+++ b/README
@@ -3759,6 +3759,22 @@ Configuration Settings:
 		Pre-relocation malloc() is only supported on ARM at present
 		but is fairly easy to enable for other archs.

+- CONFIG_SYS_NONCACHED_MEMORY:
+		Size of non-cached memory area. This area of memory will be
+		typically located right below the malloc() area and mapped
+		uncached in the MMU. This is useful for drivers that would
+		otherwise require a lot of explicit cache maintenance. For
+		some drivers it's also impossible to properly maintain the
+		cache. For example if the regions that need to be flushed
+		are not a multiple of the cache-line size, then a flush of
+		one region may result in overwriting data that hardware has
+		written to another region in the same cache-line. This can
+		happen for example in network drivers where descriptors for
+		buffers are typically smaller than the CPU cache-line (e.g.
+		16 bytes vs. 32 or 64 bytes).
+
+		Non-cached memory is only supported on 32-bit ARM at present.
+
 - CONFIG_SYS_BOOTM_LEN:
 		Normally compressed uImages are limited to an
 		uncompressed size of 8 MBytes. If this is not enough,
diff --git a/arch/arm/include/asm/system.h b/arch/arm/include/asm/system.h
index 4b771c261a8b..74ba6912bb81 100644
--- a/arch/arm/include/asm/system.h
+++ b/arch/arm/include/asm/system.h
@@ -211,6 +211,11 @@ void mmu_set_region_dcache_behaviour(unsigned long start, unsigned long size,
  */
 void mmu_page_table_flush(unsigned long start, unsigned long stop);

+#ifdef CONFIG_SYS_NONCACHED_MEMORY
+void noncached_init(void);
+unsigned long noncached_alloc(size_t size, size_t align);
+#endif /* CONFIG_SYS_NONCACHED_MEMORY */
+
 #endif /* __ASSEMBLY__ */

 #define arch_align_stack(x) (x)
diff --git a/arch/arm/lib/cache.c b/arch/arm/lib/cache.c
index 4e597a4c1d16..cd0b5ef1c3f4 100644
--- a/arch/arm/lib/cache.c
+++ b/arch/arm/lib/cache.c
@@ -8,6 +8,7 @@
 /* for now: just dummy functions to satisfy the linker */

 #include <common.h>
+#include <malloc.h>

 __weak void flush_cache(unsigned long start, unsigned long size)
 {
@@ -49,3 +50,43 @@ __weak void enable_caches(void)
 {
 	puts("WARNING: Caches not enabled\n");
 }
+
+#ifdef CONFIG_SYS_NONCACHED_MEMORY
+/*
+ * Reserve one MMU section worth of address space below the malloc() area that
+ * will be mapped uncached.
+ */
+static unsigned long noncached_start;
+static unsigned long noncached_end;
+static unsigned long noncached_next;
+
+void noncached_init(void)
+{
+	unsigned long start, end, size;
+
+	end = ALIGN(mem_malloc_start, MMU_SECTION_SIZE) - MMU_SECTION_SIZE;
+	size = ALIGN(CONFIG_SYS_NONCACHED_MEMORY, MMU_SECTION_SIZE);
+	start = end - size;
+
+	debug("mapping memory %#lx-%#lx non-cached\n", start, end);
+
+	noncached_start = start;
+	noncached_end = end;
+	noncached_next = start;
+
+	mmu_set_region_dcache_behaviour(noncached_start, size, DCACHE_OFF);
+}
+
+unsigned long noncached_alloc(size_t size, size_t align)
+{
+	unsigned long next = ALIGN(noncached_next, align);
+
+	if (next >= noncached_end || (next + size) >= noncached_end)
+		return 0;
+
+	debug("allocated %zu bytes of uncached memory @%#lx\n", size, next);
+	noncached_next = next + size;
+
+	return next;
+}
+#endif /* CONFIG_SYS_NONCACHED_MEMORY */
diff --git a/common/board_r.c b/common/board_r.c
index ba9a68dc6691..6b8e62bf032b 100644
--- a/common/board_r.c
+++ b/common/board_r.c
@@ -270,6 +270,14 @@ static int initr_malloc(void)
 	return 0;
 }

+#ifdef CONFIG_SYS_NONCACHED_MEMORY
+static int initr_noncached(void)
+{
+	noncached_init();
+	return 0;
+}
+#endif
+
 #ifdef CONFIG_DM
 static int initr_dm(void)
 {
@@ -765,6 +773,9 @@ init_fnc_t init_sequence_r[] = {
 #endif
 	initr_barrier,
 	initr_malloc,
+#ifdef CONFIG_SYS_NONCACHED_MEMORY
+	initr_noncached,
+#endif
 	bootstage_relocate,
 #ifdef CONFIG_DM
 	initr_dm,

On 08/18/2014 02:00 AM, Thierry Reding wrote:
From: Thierry Reding <treding@nvidia.com>
Implement an API that can be used by drivers to allocate memory from a poll that is mapped uncached. This is useful if drivers would otherwise
s/poll/pool/
need to do extensive cache maintenance (or explicitly maintaining the cache isn't safe).
The API is protected using the new CONFIG_SYS_NONCACHED_MEMORY setting. Boards can set this to the size to be used for the non-cached area. The area will typically be right below the malloc() area, but architectures should take care of aligning the beginning and end of the area to honor any mapping restrictions. Architectures must also ensure that mappings established for this area do not overlap with the malloc() area (which should remain cached for improved performance).
While the API is currently only implemented for ARM v7, it should be generic enough to allow other architectures to implement it as well.
diff --git a/README b/README
+- CONFIG_SYS_NONCACHED_MEMORY:
Size of non-cached memory area. This area of memory will be
typically located right below the malloc() area and mapped
uncached in the MMU. This is useful for drivers that would
otherwise require a lot of explicit cache maintenance. For
some drivers it's also impossible to properly maintain the
cache. For example if the regions that need to be flushed
are not a multiple of the cache-line size,
I would add:
*and* padding cannot be allocated between the regions to align them (i.e. if the HW requires a contiguous array of regions, and the size of each region is not cache-aligned),
then a flush of
one region may result in overwriting data that hardware has
written to another region in the same cache-line. This can
happen for example in network drivers where descriptors for
buffers are typically smaller than the CPU cache-line (e.g.
16 bytes vs. 32 or 64 bytes).
Non-cached memory is only supported on 32-bit ARM at present.
diff --git a/arch/arm/include/asm/system.h b/arch/arm/include/asm/system.h
+#ifdef CONFIG_SYS_NONCACHED_MEMORY
+void noncached_init(void);
+unsigned long noncached_alloc(size_t size, size_t align);
s/size_t/unsigned long/ would match the arguments in the earlier patch to mmu_set_region_dcache_behaviour()'s prototype.
diff --git a/arch/arm/lib/cache.c b/arch/arm/lib/cache.c
+void noncached_init(void)
+{
- unsigned long start, end, size;
- end = ALIGN(mem_malloc_start, MMU_SECTION_SIZE) - MMU_SECTION_SIZE;
Not really "end" (which implies it's inside the region) but "first address outside the region". That said, I'm not sure how to express that succinctly, so perhaps "end" is fine!
+unsigned long noncached_alloc(size_t size, size_t align)
+{
- unsigned long next = ALIGN(noncached_next, align);
- if (next >= noncached_end || (next + size) >= noncached_end)
return 0;
If size is quite large, and next is near to U32_MAX, there's a chance of wrap-around there. Perhaps calculate the size left in the region (noncached_end - next), and validate that's >= size.

On Wed, Aug 20, 2014 at 01:23:13PM -0600, Stephen Warren wrote:
On 08/18/2014 02:00 AM, Thierry Reding wrote:
From: Thierry Reding <treding@nvidia.com>
Implement an API that can be used by drivers to allocate memory from a poll that is mapped uncached. This is useful if drivers would otherwise
s/poll/pool/
Done.
need to do extensive cache maintenance (or explicitly maintaining the cache isn't safe).
The API is protected using the new CONFIG_SYS_NONCACHED_MEMORY setting. Boards can set this to the size to be used for the non-cached area. The area will typically be right below the malloc() area, but architectures should take care of aligning the beginning and end of the area to honor any mapping restrictions. Architectures must also ensure that mappings established for this area do not overlap with the malloc() area (which should remain cached for improved performance).
While the API is currently only implemented for ARM v7, it should be generic enough to allow other architectures to implement it as well.
diff --git a/README b/README
+- CONFIG_SYS_NONCACHED_MEMORY:
Size of non-cached memory area. This area of memory will be
typically located right below the malloc() area and mapped
uncached in the MMU. This is useful for drivers that would
otherwise require a lot of explicit cache maintenance. For
some drivers it's also impossible to properly maintain the
cache. For example if the regions that need to be flushed
are not a multiple of the cache-line size,
I would add:
*and* padding cannot be allocated between the regions to align them (i.e. if the HW requires a contiguous array of regions, and the size of each region is not cache-aligned),
Sounds good.
one region may result in overwriting data that hardware has
written to another region in the same cache-line. This can
happen for example in network drivers where descriptors for
buffers are typically smaller than the CPU cache-line (e.g.
16 bytes vs. 32 or 64 bytes).
Non-cached memory is only supported on 32-bit ARM at present.
diff --git a/arch/arm/include/asm/system.h b/arch/arm/include/asm/system.h
+#ifdef CONFIG_SYS_NONCACHED_MEMORY
+void noncached_init(void);
+unsigned long noncached_alloc(size_t size, size_t align);
s/size_t/unsigned long/ would match the arguments in the earlier patch to mmu_set_region_dcache_behaviour()'s prototype.
diff --git a/arch/arm/lib/cache.c b/arch/arm/lib/cache.c
+void noncached_init(void)
+{
- unsigned long start, end, size;
- end = ALIGN(mem_malloc_start, MMU_SECTION_SIZE) - MMU_SECTION_SIZE;
Not really "end" (which implies it's inside the region) but "first address outside the region". That said, I'm not sure how to express that succinctly, so perhaps "end" is fine!
I think I've seen "limit" used rather commonly in that case.
+unsigned long noncached_alloc(size_t size, size_t align)
+{
- unsigned long next = ALIGN(noncached_next, align);
- if (next >= noncached_end || (next + size) >= noncached_end)
return 0;
If size is quite large, and next is near to U32_MAX, there's a chance of wrap-around there. Perhaps calculate the size left in the region (noncached_end - next), and validate that's >= size.
Yes, that sounds a lot safer.
Thierry

On Wed, Aug 20, 2014 at 01:23:13PM -0600, Stephen Warren wrote:
On 08/18/2014 02:00 AM, Thierry Reding wrote:
[...]
diff --git a/arch/arm/include/asm/system.h b/arch/arm/include/asm/system.h
+#ifdef CONFIG_SYS_NONCACHED_MEMORY +void noncached_init(void); +unsigned long noncached_alloc(size_t size, size_t align);
s/size_t/unsigned long/ would match the arguments in the earlier patch to mmu_set_region_dcache_behaviour()'s prototype.
I've replaced the unsigned long in the earlier patch with size_t for consistency. Let me know if you have any objections to that.
Thierry

From: Thierry Reding <treding@nvidia.com>
Some boards, most notably those with a PCIe ethernet NIC, require this to avoid cache coherency problems. Since the option adds very little code and overhead, enable it across all Tegra generations. Other drivers may also start supporting this functionality at some point, so enabling it now will automatically reap the benefits later on.
Signed-off-by: Thierry Reding <treding@nvidia.com>
---
 include/configs/tegra-common.h | 1 +
 1 file changed, 1 insertion(+)
diff --git a/include/configs/tegra-common.h b/include/configs/tegra-common.h
index 717cd61bd69a..16f45f5def9b 100644
--- a/include/configs/tegra-common.h
+++ b/include/configs/tegra-common.h
@@ -41,6 +41,7 @@
  * Size of malloc() pool
  */
 #define CONFIG_SYS_MALLOC_LEN		(4 << 20)	/* 4MB */
+#define CONFIG_SYS_NONCACHED_MEMORY	(1 << 20)	/* 1 MiB */

 /*
  * NS16550 Configuration

On 08/18/2014 02:00 AM, Thierry Reding wrote:
From: Thierry Reding <treding@nvidia.com>
Some boards, most notably those with a PCIe ethernet NIC, require this to avoid cache coherency problems. Since the option adds very little code and overhead, enable it across all Tegra generations. Other drivers may also start supporting this functionality at some point, so enabling it now will automatically reap the benefits later on.
Acked-by: Stephen Warren <swarren@nvidia.com>

From: Thierry Reding <treding@nvidia.com>
According to the top-level README file, this configuration setting can be used to override the number of receive buffers that an ethernet NIC uses.
Signed-off-by: Thierry Reding <treding@nvidia.com>
---
 drivers/net/rtl8169.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/drivers/net/rtl8169.c b/drivers/net/rtl8169.c
index d040ab171bf5..c3eb474f0fba 100644
--- a/drivers/net/rtl8169.c
+++ b/drivers/net/rtl8169.c
@@ -79,7 +79,11 @@ static int media[MAX_UNITS] = { -1, -1, -1, -1, -1, -1, -1, -1 };
 #define InterFrameGap	0x03	/* 3 means InterFrameGap = the shortest one */

 #define NUM_TX_DESC	1	/* Number of Tx descriptor registers */
-#define NUM_RX_DESC	4	/* Number of Rx descriptor registers */
+#ifdef CONFIG_SYS_RX_ETH_BUFFER
+  #define NUM_RX_DESC	CONFIG_SYS_RX_ETH_BUFFER
+#else
+  #define NUM_RX_DESC	4	/* Number of Rx descriptor registers */
+#endif
 #define RX_BUF_SIZE	1536	/* Rx Buffer size */
 #define RX_BUF_LEN	8192

From: Thierry Reding <treding@nvidia.com>
RX and TX descriptor rings should be aligned to 256-byte boundaries. Use the DEFINE_ALIGN_BUFFER() macro to define the buffers so that they don't have to be manually aligned later on. Also make sure that the buffers align to cache-line boundaries in case the cache-line size is larger than the 256-byte alignment requirement of the NIC.
Also add a warning if the cache-line size is larger than the descriptor size, because the driver may discard changes to descriptors made by the hardware when requeuing RX buffers.
Signed-off-by: Thierry Reding <treding@nvidia.com>
---
 drivers/net/rtl8169.c | 60 ++++++++++++++++++++++++++-------------------------
 1 file changed, 31 insertions(+), 29 deletions(-)
diff --git a/drivers/net/rtl8169.c b/drivers/net/rtl8169.c
index c3eb474f0fba..c53666134e06 100644
--- a/drivers/net/rtl8169.c
+++ b/drivers/net/rtl8169.c
@@ -277,23 +277,29 @@ struct RxDesc {
 	u32 buf_Haddr;
 };

-/* Define the TX Descriptor */
-static u8 tx_ring[NUM_TX_DESC * sizeof(struct TxDesc) + 256];
-/* __attribute__ ((aligned(256))); */
+#if ARCH_DMA_MINALIGN > 256
+# define RTL8169_ALIGN ARCH_DMA_MINALIGN
+#else
+# define RTL8169_ALIGN 256
+#endif

-/* Create a static buffer of size RX_BUF_SZ for each
-TX Descriptor. All descriptors point to a
-part of this buffer */
-static unsigned char txb[NUM_TX_DESC * RX_BUF_SIZE];
+/* Define the TX Descriptor */
+DEFINE_ALIGN_BUFFER(struct TxDesc, tx_ring, NUM_TX_DESC, RTL8169_ALIGN);

 /* Define the RX Descriptor */
-static u8 rx_ring[NUM_RX_DESC * sizeof(struct TxDesc) + 256];
-  /* __attribute__ ((aligned(256))); */
+DEFINE_ALIGN_BUFFER(struct RxDesc, rx_ring, NUM_RX_DESC, RTL8169_ALIGN);
+
+/*
+ * Create a static buffer of size RX_BUF_SZ for each TX Descriptor. All
+ * descriptors point to a part of this buffer.
+ */
+DEFINE_ALIGN_BUFFER(u8, txb, NUM_TX_DESC * RX_BUF_SIZE, RTL8169_ALIGN);

-/* Create a static buffer of size RX_BUF_SZ for each
-RX Descriptor All descriptors point to a
-part of this buffer */
-static unsigned char rxb[NUM_RX_DESC * RX_BUF_SIZE];
+/*
+ * Create a static buffer of size RX_BUF_SZ for each RX Descriptor. All
+ * descriptors point to a part of this buffer.
+ */
+DEFINE_ALIGN_BUFFER(u8, rxb, NUM_RX_DESC * RX_BUF_SIZE, RTL8169_ALIGN);

 struct rtl8169_private {
 	void *mmio_addr;	/* memory map physical address */
@@ -301,8 +307,6 @@ struct rtl8169_private {
 	unsigned long cur_rx; /* Index into the Rx descriptor buffer of next Rx pkt. */
 	unsigned long cur_tx; /* Index into the Tx descriptor buffer of next Rx pkt. */
 	unsigned long dirty_tx;
-	unsigned char *TxDescArrays;	/* Index of Tx Descriptor buffer */
-	unsigned char *RxDescArrays;	/* Index of Rx Descriptor buffer */
 	struct TxDesc *TxDescArray;	/* Index of 256-alignment Tx Descriptor buffer */
 	struct RxDesc *RxDescArray;	/* Index of 256-alignment Rx Descriptor buffer */
 	unsigned char *RxBufferRings;	/* Index of Rx Buffer */
@@ -710,16 +714,6 @@ static int rtl_reset(struct eth_device *dev, bd_t *bis)
 	printf ("%s\n", __FUNCTION__);
 #endif

-	tpc->TxDescArrays = tx_ring;
-	/* Tx Desscriptor needs 256 bytes alignment; */
-	tpc->TxDescArray = (struct TxDesc *) ((unsigned long)(tpc->TxDescArrays +
-							      255) & ~255);
-
-	tpc->RxDescArrays = rx_ring;
-	/* Rx Desscriptor needs 256 bytes alignment; */
-	tpc->RxDescArray = (struct RxDesc *) ((unsigned long)(tpc->RxDescArrays +
-							      255) & ~255);
-
 	rtl8169_init_ring(dev);
 	rtl8169_hw_start(dev);
 	/* Construct a perfect filter frame with the mac address as first match
@@ -761,10 +755,6 @@ static void rtl_halt(struct eth_device *dev)

 	RTL_W32(RxMissed, 0);

-	tpc->TxDescArrays = NULL;
-	tpc->RxDescArrays = NULL;
-	tpc->TxDescArray = NULL;
-	tpc->RxDescArray = NULL;
 	for (i = 0; i < NUM_RX_DESC; i++) {
 		tpc->RxBufferRing[i] = NULL;
 	}
@@ -909,6 +899,18 @@ static int rtl_init(struct eth_device *dev, bd_t *bis)
 #endif
 	}

+	/*
+	 * Warn if the cache-line size is larger than the descriptor size. In
+	 * such cases the driver will likely fail because the CPU needs to
+	 * flush the cache when requeuing RX buffers, therefore descriptors
+	 * written by the hardware may be discarded.
+	 */
+	if (ARCH_DMA_MINALIGN > sizeof(struct RxDesc))
+		printf("WARNING: cache-line size is larger than descriptor size\n");
+
+	tpc->TxDescArray = tx_ring;
+	tpc->RxDescArray = rx_ring;
+
 	return 1;
 }

On 08/18/2014 02:00 AM, Thierry Reding wrote:
From: Thierry Reding <treding@nvidia.com>
RX and TX descriptor rings should be aligned to 256-byte boundaries. Use the DEFINE_ALIGN_BUFFER() macro to define the buffers so that they don't have to be manually aligned later on. Also make sure that the buffers align to cache-line boundaries in case the cache-line size is larger than the 256-byte alignment requirement of the NIC.
Also add a warning if the cache-line size is larger than the descriptor size, because the driver may discard changes to descriptors made by the hardware when requeuing RX buffers.
@@ -909,6 +899,18 @@ static int rtl_init(struct eth_device *dev, bd_t *bis)
+	/*
+	 * Warn if the cache-line size is larger than the descriptor size. In
+	 * such cases the driver will likely fail because the CPU needs to
+	 * flush the cache when requeuing RX buffers, therefore descriptors
+	 * written by the hardware may be discarded.
+	 */
+	if (ARCH_DMA_MINALIGN > sizeof(struct RxDesc))
+		printf("WARNING: cache-line size is larger than descriptor size\n");
I'd be tempted to make that a compile-time #error (or perhaps just a #warning). Perhaps #error would break compilation of existing boards though?
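There is a C wrinkle in the trade-off suggested here: sizeof() is not evaluated by the preprocessor, so an #if/#warning version of the check would need a hard-coded descriptor size, while a C11 _Static_assert may use sizeof but can only fail the build outright, like #error. A sketch with illustrative names standing in for ARCH_DMA_MINALIGN and the real descriptor:

```c
#include <stdint.h>

/* Models the driver's 16-byte descriptor. */
struct demo_desc {
	uint32_t status;
	uint32_t vlan_tag;
	uint64_t buf_addr;
};

/* Stand-in for the arch-provided ARCH_DMA_MINALIGN. */
#define DEMO_DMA_MINALIGN 16

/* The preprocessor cannot evaluate sizeof(struct demo_desc), so the
 * descriptor size has to appear as a literal in an #if check... */
#if DEMO_DMA_MINALIGN > 16
#warning "cache-line size is larger than descriptor size"
#endif

/* ...whereas a C11 static assertion can use sizeof directly, but only
 * as a hard build failure (#error semantics, not #warning). */
_Static_assert(sizeof(struct demo_desc) == 16,
	       "descriptor layout changed; revisit cache-line size check");
```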

On Wed, Aug 20, 2014 at 01:29:57PM -0600, Stephen Warren wrote:
I'd be tempted to make that a compile-time #error (or perhaps just a #warning). Perhaps #error would break compilation of existing boards though?
There are two SH4 boards that use the rtl8169 (r7780mp and sh7785lcr) for which this condition is true, so #error would break them (well, technically not r7780mp since it comments out CONFIG_RTL8169 in the configuration). I'll make it a #warning instead.
Thierry

On 22 August 2014 03:15, Thierry Reding thierry.reding@gmail.com wrote:
There are two SH4 boards that use the rtl8169 (r7780mp and sh7785lcr) for which this condition is true, so #error would break them (well, technically not r7780mp since it comments out CONFIG_RTL8169 in the configuration). I'll make it a #warning instead.
Adding the maintainer for comment.
Regards, Simon

Hi!
I comment to latest patch.
Best regards, Nobuhiro

Hi Nobuhiro,
On 12 November 2014 16:38, Nobuhiro Iwamatsu iwamatsu@nigauri.org wrote:
Hi!
I comment to latest patch.
Best regards, Nobuhiro
For me this one gives a warning, but I suppose the warning is correct.
Regards, Simon

From: Thierry Reding treding@nvidia.com
To work around potential issues with explicit cache maintenance of the RX and TX descriptor rings, allocate them from a pool of uncached memory if the architecture supports it.
Signed-off-by: Thierry Reding treding@nvidia.com
---
 drivers/net/rtl8169.c | 43 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 43 insertions(+)
diff --git a/drivers/net/rtl8169.c b/drivers/net/rtl8169.c
index c53666134e06..1c04946d7691 100644
--- a/drivers/net/rtl8169.c
+++ b/drivers/net/rtl8169.c
@@ -283,11 +283,31 @@ struct RxDesc {
 # define RTL8169_ALIGN 256
 #endif

+/*
+ * TX and RX descriptors are 16 bytes. This causes problems with the cache
+ * maintenance on CPUs where the cache-line size exceeds the size of these
+ * descriptors. What will happen is that when the driver receives a packet
+ * it will be immediately requeued for the hardware to reuse. The CPU will
+ * therefore need to flush the cache-line containing the descriptor, which
+ * will cause all other descriptors in the same cache-line to be flushed
+ * along with it. If one of those descriptors had been written to by the
+ * device those changes (and the associated packet) will be lost.
+ *
+ * To work around this, we make use of non-cached memory if available. If
+ * descriptors are mapped uncached there's no need to manually flush them
+ * or invalidate them.
+ *
+ * Note that this only applies to descriptors. The packet data buffers do
+ * not have the same constraints since they are 1536 bytes large, so they
+ * are unlikely to share cache-lines.
+ */
+#ifndef CONFIG_SYS_NONCACHED_MEMORY
 /* Define the TX Descriptor */
 DEFINE_ALIGN_BUFFER(struct TxDesc, tx_ring, NUM_TX_DESC, RTL8169_ALIGN);

 /* Define the RX Descriptor */
 DEFINE_ALIGN_BUFFER(struct RxDesc, rx_ring, NUM_RX_DESC, RTL8169_ALIGN);
+#endif

 /*
  * Create a static buffer of size RX_BUF_SZ for each TX Descriptor. All
@@ -412,28 +432,36 @@ match:

 static void rtl_inval_rx_desc(struct RxDesc *desc)
 {
+#ifndef CONFIG_SYS_NONCACHED_MEMORY
 	unsigned long start = (unsigned long)desc & ~(ARCH_DMA_MINALIGN - 1);
 	unsigned long end = ALIGN(start + sizeof(*desc), ARCH_DMA_MINALIGN);

 	invalidate_dcache_range(start, end);
+#endif
 }

 static void rtl_flush_rx_desc(struct RxDesc *desc)
 {
+#ifndef CONFIG_SYS_NONCACHED_MEMORY
 	flush_cache((unsigned long)desc, sizeof(*desc));
+#endif
 }

 static void rtl_inval_tx_desc(struct TxDesc *desc)
 {
+#ifndef CONFIG_SYS_NONCACHED_MEMORY
 	unsigned long start = (unsigned long)desc & ~(ARCH_DMA_MINALIGN - 1);
 	unsigned long end = ALIGN(start + sizeof(*desc), ARCH_DMA_MINALIGN);

 	invalidate_dcache_range(start, end);
+#endif
 }

 static void rtl_flush_tx_desc(struct TxDesc *desc)
 {
+#ifndef CONFIG_SYS_NONCACHED_MEMORY
 	flush_cache((unsigned long)desc, sizeof(*desc));
+#endif
 }

 static void rtl_inval_buffer(void *buf, size_t size)
@@ -769,6 +797,9 @@ INIT - Look for an adapter, this routine's visible to the outside
 static int rtl_init(struct eth_device *dev, bd_t *bis)
 {
 	static int board_idx = -1;
+#ifdef CONFIG_SYS_NONCACHED_MEMORY
+	size_t size;
+#endif
 	int i, rc;
 	int option = -1, Cap10_100 = 0, Cap1000 = 0;

@@ -899,6 +930,7 @@ static int rtl_init(struct eth_device *dev, bd_t *bis)
 #endif
 }

+#ifndef CONFIG_SYS_NONCACHED_MEMORY
 	/*
 	 * Warn if the cache-line size is larger than the descriptor size. In
 	 * such cases the driver will likely fail because the CPU needs to
@@ -910,6 +942,17 @@ static int rtl_init(struct eth_device *dev, bd_t *bis)
 	tpc->TxDescArray = tx_ring;
 	tpc->RxDescArray = rx_ring;
+#else
+	/*
+	 * When non-cached memory is available, allocate the descriptors from
+	 * an uncached memory region to avoid any problems caused by caching.
+	 */
+	size = NUM_TX_DESC * sizeof(struct TxDesc);
+	tpc->TxDescArray = (struct TxDesc *)noncached_alloc(size, 256);
+
+	size = NUM_RX_DESC * sizeof(struct RxDesc);
+	tpc->RxDescArray = (struct RxDesc *)noncached_alloc(size, 256);
+#endif

 	return 1;
 }
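Conceptually, noncached_alloc() behaves like an aligned bump allocator over a fixed region that has been mapped uncacheable. The sketch below models that behaviour with an ordinary static buffer (all demo_* names are hypothetical; the real allocator lives in the ARM cache code and hands out addresses from the region reserved below the malloc() area):

```c
#include <stddef.h>
#include <stdint.h>

/* A 4 KiB stand-in for the uncached pool (the real pool is carved out
 * below the malloc() area and mapped uncacheable in the MMU). */
static uint8_t demo_pool[4096] __attribute__((aligned(256)));
static size_t demo_pool_used;

/* Aligned bump allocation modeling noncached_alloc(size, align);
 * align must be a power of two. Returns 0 when the pool is exhausted. */
static uintptr_t demo_noncached_alloc(size_t size, size_t align)
{
	uintptr_t base = (uintptr_t)demo_pool + demo_pool_used;
	uintptr_t addr = (base + align - 1) & ~(uintptr_t)(align - 1);

	if (addr + size > (uintptr_t)demo_pool + sizeof(demo_pool))
		return 0;	/* out of uncached memory */

	demo_pool_used = (addr + size) - (uintptr_t)demo_pool;
	return addr;
}
```

The driver's two calls then amount to asking for NUM_TX_DESC * sizeof(struct TxDesc) and NUM_RX_DESC * sizeof(struct RxDesc) bytes, each with the NIC's 256-byte ring alignment.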

On 08/18/2014 02:00 AM, Thierry Reding wrote:
From: Thierry Reding treding@nvidia.com
To work around potential issues with explicit cache maintenance of the RX and TX descriptor rings, allocate them from a pool of uncached memory if the architecture supports it.
diff --git a/drivers/net/rtl8169.c b/drivers/net/rtl8169.c
+#ifndef CONFIG_SYS_NONCACHED_MEMORY /* Define the TX Descriptor */ DEFINE_ALIGN_BUFFER(struct TxDesc, tx_ring, NUM_TX_DESC, RTL8169_ALIGN);
/* Define the RX Descriptor */ DEFINE_ALIGN_BUFFER(struct RxDesc, rx_ring, NUM_RX_DESC, RTL8169_ALIGN); +#endif
It feels slightly inconsistent to have one case allocate this memory in BSS at compile-time, and in the other to allocate it dynamically.
Would it be better to remove this global, and simply call different APIs (regular aligned malloc, v.s. non-cached allocate) during rtl_init()?

On Wed, Aug 20, 2014 at 01:33:06PM -0600, Stephen Warren wrote:
It feels slightly inconsistent to have one case allocate this memory in BSS at compile-time, and in the other to allocate it dynamically.
Would it be better to remove this global, and simply call different APIs (regular aligned malloc, v.s. non-cached allocate) during rtl_init()?
I've reworked this a bit to use memalign() when non-cached memory isn't available and factored out the descriptor allocation. I liked the code slightly better before, but I guess consistently using dynamic memory allocations is worth it. One potential downside is that memalign() allocates from the malloc() pool, therefore devices using this driver will now use more memory. I suppose they could even run out of memory depending on how large the number of receive buffers becomes. Although it's only the descriptors that are being dynamically allocated and they are 16 bytes each. And it's not a problem that couldn't easily be solved by increasing the malloc() size.
Thierry
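The rework Thierry describes presumably ends up with a shape like the following. This is a sketch, not the actual commit: U-Boot's memalign() comes from its bundled dlmalloc, while the standalone version below substitutes C11 aligned_alloc(), which additionally requires the size to be a multiple of the alignment.

```c
#include <stdlib.h>
#include <stdint.h>

#define DEMO_ALIGN 256	/* the NIC's descriptor-ring alignment */

/* Sketch of a factored-out descriptor allocation: uncached memory when
 * the platform provides it, an aligned heap allocation otherwise. The
 * uncached branch is not compiled here since it needs MMU support. */
static void *demo_alloc_descs(size_t size)
{
#ifdef CONFIG_SYS_NONCACHED_MEMORY
	return (void *)noncached_alloc(size, DEMO_ALIGN);
#else
	/* aligned_alloc() requires size to be a multiple of alignment. */
	size_t padded = (size + DEMO_ALIGN - 1) & ~(size_t)(DEMO_ALIGN - 1);

	return aligned_alloc(DEMO_ALIGN, padded);
#endif
}
```

As the thread notes, the fallback path draws from the malloc() pool, but only for the descriptors themselves (16 bytes each), so the extra footprint is small.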

From: Thierry Reding treding@nvidia.com
This network interface card is found on the NVIDIA Jetson TK1.
Signed-off-by: Thierry Reding treding@nvidia.com
---
 drivers/net/rtl8169.c | 1 +
 1 file changed, 1 insertion(+)
diff --git a/drivers/net/rtl8169.c b/drivers/net/rtl8169.c
index 1c04946d7691..a02968fa4a12 100644
--- a/drivers/net/rtl8169.c
+++ b/drivers/net/rtl8169.c
@@ -252,6 +252,7 @@ static struct {
 	{"RTL-8168b/8111sb", 0x38, 0xff7e1880,},
 	{"RTL-8168d/8111d", 0x28, 0xff7e1880,},
 	{"RTL-8168evl/8111evl", 0x2e, 0xff7e1880,},
+	{"RTL-8168/8111g", 0x4c, 0xff7e1880,},
 	{"RTL-8101e", 0x34, 0xff7e1880,},
 	{"RTL-8100e", 0x32, 0xff7e1880,},
 };
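For illustration, a table like this supports a probe-time linear scan against the chip version the hardware reports. The sketch below models only the name/version lookup; it is a simplification, not the driver's exact matching code (the third column in the real table is a mask involved in reading the version from the hardware, which is omitted here):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified model of the driver's chip table (versions mirror the
 * entries shown in the diff; the mask column is dropped). */
struct demo_chip {
	const char *name;
	uint8_t version;
};

static const struct demo_chip demo_chips[] = {
	{ "RTL-8168b/8111sb", 0x38 },
	{ "RTL-8168d/8111d", 0x28 },
	{ "RTL-8168evl/8111evl", 0x2e },
	{ "RTL-8168/8111g", 0x4c },	/* the entry this patch adds */
	{ "RTL-8101e", 0x34 },
	{ "RTL-8100e", 0x32 },
};

/* Return the marketing name for a reported chip version, or NULL. */
static const char *demo_identify(uint8_t version)
{
	for (size_t i = 0; i < sizeof(demo_chips) / sizeof(demo_chips[0]); i++)
		if (demo_chips[i].version == version)
			return demo_chips[i].name;

	return NULL;
}
```

Without the new 0x4c entry, the scan falls through and the Jetson TK1's NIC would go unrecognized.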

On 08/18/2014 02:00 AM, Thierry Reding wrote:
From: Thierry Reding treding@nvidia.com
Patches 1-3 are minor preparatory work. Patch 1 cleans up some coding style issues in the ARM v7 cache code and patch 2 uses more future-proof types for the mmu_set_region_dcache_behaviour() function arguments. Patch 3 is purely for debugging purposes. It will print out the region used by malloc() when DEBUG is enabled. This can be useful to see where the malloc() region is in the memory map (compared to the noncached region introduced in a later patch for example).
Patch 4 implements the noncached API for ARM v7. It obtains the start of the malloc() area and places the noncached region immediately below it so that noncached_alloc() can allocate from it. During boot, the noncached area will be set up immediately after malloc().
Patch 5 enables noncached memory for all Tegra boards. It uses a 1 MiB chunk which should be plenty (it's also the minimum on ARM v7 because it matches the MMU section size and therefore the granularity at which U-Boot can set the cacheable attributes).
If LPAE were to be enabled, the minimum would be 2MiB, but I suppose we can deal with that if/when the time comes.

On Wed, Aug 20, 2014 at 01:12:20PM -0600, Stephen Warren wrote:
If LPAE were to be enabled, the minimum would be 2MiB, but I suppose we can deal with that if/when the time comes.
The code that sets up the noncached memory region will pad the size to a multiple of the section size, and if LPAE were enabled I'd expect the section size to be defined appropriately, too. It would still mean that Tegra configured it to be 1 MiB but the code actually reserving 2 MiB of addresses, but that's explicitly allowed in the documentation.
Thierry
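The padding behaviour described above is plain alignment arithmetic; a sketch with assumed section sizes (1 MiB sections for the ARMv7 short-descriptor translation format, 2 MiB blocks under LPAE — the constants are illustrative stand-ins, not U-Boot's macros):

```c
#include <stddef.h>

#define DEMO_SECTION_SZ		(1024 * 1024)		/* ARMv7 short-descriptor section */
#define DEMO_LPAE_SECTION_SZ	(2 * 1024 * 1024)	/* LPAE block size */

/* Round a requested noncached-region size up to a whole number of MMU
 * sections, since cacheability attributes can only be applied at
 * section granularity; section must be a power of two. */
static size_t demo_round_to_sections(size_t size, size_t section)
{
	return (size + section - 1) & ~(section - 1);
}
```

So a board configuring 1 MiB of noncached memory would still see 2 MiB of address space reserved under LPAE, as the paragraph above allows.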
participants (4):
- Nobuhiro Iwamatsu
- Simon Glass
- Stephen Warren
- Thierry Reding