Re: [U-Boot] i.MX51: FEC: Cache coherency problem?

On 7/20/2011 7:35 AM, Albert ARIBAUD wrote:
On 20/07/2011 16:01, J. William Campbell wrote:
On 7/20/2011 6:02 AM, Albert ARIBAUD wrote:
On 19/07/2011 22:11, J. William Campbell wrote:
If this is true, then it means that the cache is of type write-back (as opposed to write-through). From a (very brief) look at the ARMv7 manuals, it appears that both types of cache may be present in the CPU. Do you know how this operates?
Usually, copy-back (rather than write-back) and write-through are modes of operation, not cache types.
Hi Albert, On some CPUs both cache modes are available. On many other CPUs (I would guess most), you have one fixed mode available, but not both. I have always seen the two modes described as write-back and write-through, but I am sure we are talking about the same things.
We are. Copy-back is another name for write-back, not used by ARM but by some others.
The examples I am familiar with that have both modes treat the mode as a "global" setting. It is not controlled by bits in the TLB or anything like that. How does it work on ARM? Is it fixed, globally controlled, or controlled by memory management?
Well, it's a bit complicated, because it depends on the architecture version *and* implementation -- ARM themselves do not mandate things, and it is up to the SoC designer to specify what cache they want and what mode it supports, both at L1 and L2, in their specific instance of ARM cores. And yes, you can have memory areas that are write-back and others that are write-through in the same system.
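For illustration only -- and assuming a helper along the lines of mmu_set_region_dcache_behaviour() with DCACHE_* options, which not every tree has, so treat all of the names below as placeholders -- marking regions differently would look roughly like this:

    /* Bulk of DRAM: write-back for CPU work (relocation, checksums, ...). */
    mmu_set_region_dcache_behaviour(CONFIG_SYS_SDRAM_BASE, 256 << 20,
                                    DCACHE_WRITEBACK);
    /* A small window reserved for DMA descriptors and buffers: write-through,
     * or DCACHE_OFF to drop cache maintenance in the driver entirely. */
    mmu_set_region_dcache_behaviour(dma_window_base, 1 << 20,
                                    DCACHE_WRITETHROUGH);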
If it is controlled by memory management, it looks to me like lots of problems could be avoided by operating with input-type buffers set as write-through. One probably isn't going to be writing to input buffers much under program control anyway, so the performance loss should be minimal. This gets rid of the alignment restrictions on these buffers, but not the invalidate/flush requirements.
There's not much you can do about alignment issues except align to cache line boundaries.
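Concretely, it boils down to making sure no descriptor or buffer ever shares a cache line with unrelated data. A minimal sketch, assuming ARCH_DMA_MINALIGN is defined to (at least) the line size, and with made-up FEC_* names and a made-up struct fec_bd:

    /* Start addresses and sizes are all multiples of the cache line size,
     * so a flush or invalidate on one object can never touch another. */
    #define FEC_RX_RING_SIZE  8
    #define FEC_RX_BUF_SIZE   ALIGN(1536, ARCH_DMA_MINALIGN)

    static struct fec_bd rx_ring[FEC_RX_RING_SIZE]
            __attribute__((aligned(ARCH_DMA_MINALIGN)));
    static u8 rx_buf[FEC_RX_RING_SIZE][FEC_RX_BUF_SIZE]
            __attribute__((aligned(ARCH_DMA_MINALIGN)));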
However, if memory management is required to set the cache mode, it might be best to operate with the buffers and descriptors un-cached. That gets rid of the flush/invalidate requirement at the expense of slowing down copying from read buffers.
That makes 'best' a subjective choice, doesn't it? :)
Hi All, Yes, it probably depends on the usage.
Probably a reasonable price to pay for the associated simplicity.
Others would say that spending some time setting up alignments and flushes and invalidates is a reasonable price to pay for increased performance... That's an open debate where no solution is The Right One(tm).
For instance, consider the TFTP image reading. People would like the image to end up in cached memory because we'll do some checksumming on it before we give it control, and having it cached makes this step quite a bit faster; but we lose that if we put it in non-cached memory because it comes through the Ethernet controller's DMA; and it would be worse to receive packets in non-cached memory only to move their contents into cached memory later on.
I think properly aligning descriptors and buffers is enough to avoid the mixed flush/invalidate line issue, and wisely putting instruction barriers should be enough to get the added performance of cache without too much of the hassle of memory management.
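In the send and receive paths that amounts to something like the sketch below, using the usual flush_dcache_range()/invalidate_dcache_range() helpers; the buffer, descriptor and fec_* names are placeholders, and one descriptor is assumed to fit in a single cache line:

    /* TX: make the packet and its descriptor visible to the DMA engine. */
    memcpy(txbuf, packet, length);
    flush_dcache_range((ulong)txbuf,
                       (ulong)txbuf + ALIGN(length, ARCH_DMA_MINALIGN));
    flush_dcache_range((ulong)txbd, (ulong)txbd + ARCH_DMA_MINALIGN);
    fec_kick_tx();                          /* placeholder: start transmission */

    /* RX: discard stale cached copies before the CPU reads DMA'd data. */
    invalidate_dcache_range((ulong)rxbd, (ulong)rxbd + ARCH_DMA_MINALIGN);
    if (fec_rxbd_ready(rxbd)) {             /* placeholder: owned by CPU? */
            int len = fec_rxbd_len(rxbd);   /* placeholder */
            invalidate_dcache_range((ulong)rxbuf,
                                    (ulong)rxbuf + ALIGN(len, ARCH_DMA_MINALIGN));
            NetReceive(rxbuf, len);         /* hand the packet to the net layer */
    }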
I am pretty sure that all the drivers read the input data into intermediate buffers in all cases. There is no practical way to be sure the next packet received is the "right one" for the TFTP transfer. Plus there are headers involved, and furthermore there is no way to ensure that a TFTP destination is located on a sector boundary. In short, you are going to copy from an input buffer to a destination. However, it is still correct that copying from a non-cached area is slower than from cached areas, because of burst reads vs. individual reads. That said, I doubt that the U-Boot user can tell the difference, as the network latency will far exceed the difference in copy time. The question is which is easier to do, and that is probably a matter of opinion. However, it is safe to say that so far a cached solution has eluded us. That may be changing, but it would still be nice to know how to allocate a section of un-cached RAM in the ARM processor, in so far as the question has a single answer! That would allow easy portability of drivers that do not know about caches, of which there seem to be many.
Best Regards, Bill Campbell

On Wed, 20 Jul 2011 08:36:12 -0700 "J. William Campbell" jwilliamcampbell@comcast.net wrote:
I am pretty sure that all the drivers read the input data into intermediate buffers in all cases. There is no practical way to be sure the next packet received is the "right one" for the TFTP transfer. Plus there are headers involved, and furthermore there is no way to ensure that a TFTP destination is located on a sector boundary. In short, you are going to copy from an input buffer to a destination. However, it is still correct that copying from a non-cached area is slower than from cached areas, because of burst reads vs. individual reads. That said, I doubt that the U-Boot user can tell the difference, as the network latency will far exceed the difference in copy time. The question is which is easier to do, and that is probably a matter of opinion. However, it is safe to say that so far a cached solution has eluded us. That may be changing, but it would still be nice to know how to allocate a section of un-cached RAM in the ARM processor, in so far as the question has a single answer! That would allow easy portability of drivers that do not know about caches, of which there seem to be many.
I agree. Unfortunately, my time is up for now, and I can't go on with trying to fix this driver. Maybe I'll pick it up again after my vacation. For now I have settled for the ugly solution of keeping the dcache disabled while Ethernet is in use :-( IMHO, doing cache maintenance all over the driver is not an easy or nice solution. Implementing a non-cached memory pool in the MMU and a corresponding dma_malloc() sounds much more universally applicable to any driver.
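To make it concrete, what I have in mind is something along these lines -- purely a sketch with made-up names, assuming the MMU setup code can map one region of DRAM non-cacheable and tell the allocator where it is (ALIGN()/ARCH_DMA_MINALIGN used as in the other sketches in this thread):

    /* Hand out buffers from a region the MMU maps as non-cacheable,
     * so drivers need no flush/invalidate at all. */
    static ulong dma_pool_next;     /* filled in by the MMU setup code */
    static ulong dma_pool_end;

    void *dma_malloc(size_t size)
    {
            ulong p = ALIGN(dma_pool_next, ARCH_DMA_MINALIGN);

            if (p + size > dma_pool_end)
                    return NULL;    /* pool exhausted */
            dma_pool_next = p + size;
            return (void *)p;
    }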
Best regards,

On 21/07/2011 08:48, David Jander wrote:
However, it is still correct that copying from a non-cached area is slower than from cached areas, because of burst reads vs. individual reads. That said, I doubt that the U-Boot user can tell the difference, as the network latency will far exceed the difference in copy time.
That's assuming cache is only for networking. There can be DMA engines in a lot of other peripherals which do not have the same latency as network (and then, even for networking, TFTP can be done from a very nearby server, possibly even on the same Ethernet segment).
The question is which is easier to do, and that is probably a matter of opinion. However, it is safe to say that so far a cached solution has eluded us. That may be changing, but it would still be nice to know how to allocate a section of un-cached RAM in the ARM processor, in so far as the question has a single answer! That would allow easy portability of drivers that do not know about caches, of which there seem to be many.
That is one approach, which I think prevents cache from being used beyond caching pure CPU-used DRAM.
I agree. Unfortunately, my time is up for now, and I can't go on with trying to fix this driver. Maybe I'll pick it up again after my vacation. For now I have settled for the ugly solution of keeping the dcache disabled while Ethernet is in use :-(
Make sure you flush before disabling. :)
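i.e. something like:

    flush_dcache_all();     /* push out any dirty lines first... */
    dcache_disable();       /* ...then turn the D-cache off */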
IMHO, doing cache maintenance all over the driver is not an easy or nice solution. Implementing a non-cached memory pool in the MMU and a corresponding dma_malloc() sounds much more universally applicable to any driver.
I think cache maintenance is feasible if one makes sure the cached areas used by the driver are properly aligned, which simplifies things a lot: you don't have to worry about flush-invalidate or just-in-time invalidate, you just have to flush before sending and invalidate before reading.
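A cheap way to keep a driver honest about the alignment is to check it right where the maintenance is done -- a sketch, with a made-up fec_invalidate() wrapper:

    static void fec_invalidate(void *buf, size_t size)
    {
            ulong start = (ulong)buf;
            ulong stop = start + ALIGN(size, ARCH_DMA_MINALIGN);

            /* A buffer that does not start on a line boundary shares its
             * first line with someone else's data -- complain loudly. */
            if (start % ARCH_DMA_MINALIGN)
                    printf("FEC: misaligned DMA buffer at %p\n", buf);
            invalidate_dcache_range(start, stop);
    }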
Kind regards,

On 7/23/2011 6:04 AM, Albert ARIBAUD wrote:
On 21/07/2011 08:48, David Jander wrote:
However, it is still correct that copying from a non-cached area is slower than from cached areas, because of burst reads vs. individual reads. That said, I doubt that the U-Boot user can tell the difference, as the network latency will far exceed the difference in copy time.
That's assuming cache is only for networking. There can be DMA engines in a lot of other peripherals which do not have the same latency as network (and then, even for networking, TFTP can be done from a very nearby server, possibly even on the same Ethernet segment).
Hi All, Yes, there are other uses of DMA. Unless you have a Gigabit network, your memory access speed is at least an order of magnitude faster than the network, probably more. Plus, there is latency due to sending the ack and the request for the next record, which undoubtedly swamps any reduction in memory speed due to the single copy that takes place. In the case of other devices, for example disks, the percentage effect is probably greater, but these devices are so fast anyway that the human-perceived speed reduction is essentially nil. If we were talking about a CPU running Linux and doing all kinds of I/O all day long, the reduction in throughput might be 10% and that might matter. In a boot loader that does I/O mostly to read in a program to replace itself, I would argue that nobody will notice the difference between cached and un-cached buffers. Counter-examples are welcome, however.
The question is which is easier to do, and that is probably a matter of opinion. However, it is safe to say that so far a cached solution has eluded us. That may be changing, but it would still be nice to know how to allocate a section of un-cached RAM in the ARM processor, in so far as the question has a single answer! That would allow easy portability of drivers that do not know about caches, of which there seem to be many.
That is one approach, which I think prevents cache from being used beyond caching pure CPU-used DRAM.
You are certainly correct there. However, I think the pure CPU-used RAM case is the one that matters most. Uncompressing and checksumming of input data are typical U-Boot functions that take significant time. The performance increase due to cache hits in these cases is huge, and easily perceptible by the user.
I agree. Unfortunately, my time is up for now, and I can't go on with trying to fix this driver. Maybe I'll pick it up again after my vacation. For now I have settled for the ugly solution of keeping the dcache disabled while Ethernet is in use :-(
Make sure you flush before disabling. :)
IMHO, doing cache maintenance all over the driver is not an easy or nice solution. Implementing a non-cached memory pool in the MMU and a corresponding dma_malloc() sounds much more universally applicable to any driver.
I think cache maintenance is feasible if one makes sure the cached areas used by the driver are properly aligned, which simplifies things a lot: you don't have to worry about flush-invalidate or just-in-time invalidate, you just have to flush before sending and invalidate before reading.
I do agree it can be done. However, most (I think?) of the CPUs to which U-Boot has been ported have cache-coherent DMA. As a result, cache issues for these CPUs are not addressed in the driver at all. Often this means that cache support is done after the fact by somebody other than the original author, who may not fully understand the original driver. If DMA buffers were always allocated from cache-coherent memory, either because the memory is un-cached or because the CPU is DMA cache coherent, no changes would be necessary to get the driver working correctly. If performance ever became an issue in the un-cached case, then more work would be required, but in most cases I expect nobody will notice.
Best Regards, Bill Campbell
participants (3):
- Albert ARIBAUD
- David Jander
- J. William Campbell