Re: [U-Boot] [PATCH V2] net: fec_mxc: allow use with cache enabled

6 Mar 2012

Dear Eric Nelson,
...
On 03/05/2012 01:06 PM, Marek Vasut wrote:
...
Dear Eric Nelson,
...
ensure that transmit and receive buffers are cache-line aligned
     invalidate cache after each packet received
     flush cache before transmitting


Original patch by Marek:
   	http://lists.denx.de/pipermail/u-boot/2012-February/117695.html
Would be cool to Cc me :-p
Sorry about that.
It's ok, don't worry about it ;-)
[...]
...
...
I think this "writel()" call is bogus and should be removed in subsequent
patch and replaced with simple assignment. It was here probably due to
cache issues on PPC?
The RBD has me puzzled. Do we treat them like registers and use
readx/writex or like in-RAM data structures...
I'd go for the in-RAM data structures, but such conversion should happen in a 
separate patch only AFTER the cache support is in.
[...]
...
...
...

if (!fec->rbd_base) {


	ret = -ENOMEM;


	goto err2;


}


memset(fec->rbd_base, 0, size);


}

We might want to flush the descriptors to memory after they have been
inited?
Again, good catch.
On this topic (initialization of RBD), I had a bit of a concern
regarding the initialization of things.
In fec_open, the receive buffer descriptors are initialized and the
last one set is to 'wrap'. If this code were to execute when the
controller is live, bad things would surely happen.
I traced through all of the paths I can see, and I believe that
we're safe. It appears that fec_halt() will be called prior to
any call to fec_init() and fec_open().
Yes, this will only happen if something went wrong.
...
In fec_open() a number of calls to fec_rbd_clean() are made and
a single flush_dcache() is made afterwards.
While this works and causes less thrashing than per-RBD flushes,
I think the code would be simpler if fec_rbd_clean just did the
flush itself. This would also remove the need for a separate
flush in fec_recv.
Not really, rbd might be (and likely is) smaller than cache line, therefore you 
won't be able to flush single rbd, unless it's cacheline aligned, which wastes 
space.
[...]
...
...
...

invalidate_dcache_range(addr, addr + size);


The idea here is the following (demo uses 32byte long cachelines):
[DESC1][DESC2][DESC3][DESC4][DESC5][DESC6]
`------- cacheline --------'
We want to start retrieving contents of DESC3, therefore "addr" points to
DESC1, "size" is size of cacheline (I hope there's no hardware with 8
byte cachelines, but this should be ok here).
NOTE[5]: Here we can invalidate the whole cacheline, because the
descriptors so far were modified only be hardware, not by us. We are not
writing anything there so we won't loose any information.
NOTE[6]: This invalidation ensures that we always have a fresh copy of
the cacheline containing all the descriptors, therefore we always have a
fresh status of the descriptors we are about to pick. Since this is a
sequential execution, the cache eviction should not kick in here (or
might it?).
Another way to look at this is this:
After fec_open(), the hardware owns the rbd, and all we should do
   is read it. In order to make sure we don't have a stale copy, we
   need to call invalidate() before looking at the values.
Tracing the code to find out whether this is true, the only write I see
is within fec_recv() when the last descriptor is full, when the driver
takes ownership of **all** of the descriptors, calling fec_rbd_clean()
on each.
The only thing that looks funky is this:
size = (CONFIG_FEC_ALIGN / sizeof(struct fec_bd)) - 1;
if ((fec->rbd_index & size) == size) {


Wouldn't a test of rbd_index against FEC_RBD_NUM be more appropriate?
i.e.
   	if (fec->rbd_index == FEC_RBD_NUM-1) {
I believe the FEC doesn't always start from rbd_index == 0, and if you were to 
receive more than 64 rbds between open() and close(), this implementation works, 
your would fail.
...
...
...
bd_status = readw(&rbd->status);
   debug("fec_recv: status 0x%x\n", bd_status);
@@ -751,6 +851,13 @@ static int fec_recv(struct eth_device *dev)
	frame = (struct nbuf *)readl(&rbd->data_pointer);
	frame_length = readw(&rbd->data_length) - 4;

NOTE[7]: We received a frame, we know how long it is. The frame is loaded
into the rbd->data_pointer, which IS CACHE ALIGNED.
I detect a problem here that frame_length might overflow, but that's not
related to caches and might be just false accusation.
...
	/*



	 * Invalidate data cache over the buffer


	 */


	addr = (uint32_t)frame;


	size = roundup(frame_length, CONFIG_FEC_DATA_ALIGNMENT);


	invalidate_dcache_range(addr, addr + size);



NOTE[8]: The roundup here is a valid operation, we can flush up to the
size of rbd->data_pointer, which is cache aligned and considering
frame_length is less or equal to the memory allocated for
rbd->data_pointer, the invalidation of cache here IS SAFE.
...


	/*

 *  Fill the buffer and pass it to upper layers
 */


#ifdef	CONFIG_FEC_MXC_SWAP_PACKET


@@ -765,11 +872,27 @@ static int fec_recv(struct eth_device *dev)
				(ulong)rbd->data_pointer,
				bd_status);

}

/*



 * free the current buffer, restart the engine


 * and move forward to the next buffer




 * Free the current buffer, restart the engine and move forward


 * to the next buffer. Here we check if the whole cacheline of


 * descriptors was already processed and if so, we mark it free


 * as whole.


*/


NOTE[9]: This is the most problematic part, handling the marking of RX
descriptors with a flag that they are ready again. Obviously, directly
writing to the desciptor won't work, because:

There are multiple descriptors on a cacheline, therefore by writing

one, we'd need to flush the whole cacheline back into DRAM immediatelly
so the hardware knows about it. But that'd mean we can loose some
information from the hardware, refer to demo before NOTE[5]:
We modify DESC2 and hardware is concurently changing DESC3. We flush
DESC2 and the whole cacheline, we loose part of DESC3.

Cache eviction algorithm might flush data for us, therefore causing

loss of information as well.
The solution is the following:

Compute how many descriptors are per-cache line
Make sure FEC_RBD_NUM * sizeof(struct fec_bd) is at least 2 *

CONFIG_FEC_DATA_ALIGNMENT in size, see NOTE[11].
3) Once the last RX buffer in the cacheline is processed, mark them all
clean and flush them all, see NOTE[10].
NOTE[10]: This is legal, because the hardware won't use RX descriptors
that it marked dirty (which means not picked up by software yet). We
clean the desciptors in an order the hardware would pick them up again
so there's no problem with race condition either. The only possible
issue here is if there was hardware with cacheline size smaller than
descriptor size (we should add a check for this at the begining of the
file).
NOTE[11]: This is because we want the FEC to overwrite descriptors below
the other cacheline while we're marking the one containing retrieved
descriptors clean.
Ahah! Now I see what the size calculation is doing.
A well-named constant, maybe "RXDESC_PER_CACHELINE" would be useful here.
Yes
...
#define RXDESC_PER_CACHELINE	(CONFIG_FEC_ALIGN/sizeof(struct fec_bd))
...
...

fec_rbd_clean(fec->rbd_index == (FEC_RBD_NUM - 1) ? 1 : 0, rbd);




size = (CONFIG_FEC_DATA_ALIGNMENT / sizeof(struct fec_bd)) - 1;



size = RXDESC_PER_CACHELINE-1;


...
...

if ((fec->rbd_index&  size) == size) {



The line above only works if RXDESC_PER_CACHELINE is a multiple of 2, which
is likely to work because sizeof(struct fec_bd) == 8.
Adding such a comment (and maybe CPP check) won't hurt.
...
...
...

	i = fec->rbd_index - size;


	addr = (uint32_t)&fec->rbd_base[i];


	for (; i<= fec->rbd_index + size; i++) {



This flushes too many descriptors! This should be:
   		for (; i<= fec->rbd_index; i++) {
Agreed
...
...
...

		if (i == (FEC_RBD_NUM - 1))


			fec_rbd_clean(1,&fec->rbd_base[i]);


		else


			fec_rbd_clean(0,&fec->rbd_base[i]);


	}


	flush_dcache_range(addr,


		addr + CONFIG_FEC_DATA_ALIGNMENT);


}


fec_rx_task_enable(fec);
  fec->rbd_index = (fec->rbd_index + 1) % FEC_RBD_NUM;
}

diff --git a/drivers/net/fec_mxc.h b/drivers/net/fec_mxc.h
index 2eb7803..852b2e0 100644
--- a/drivers/net/fec_mxc.h
+++ b/drivers/net/fec_mxc.h
@@ -234,22 +234,6 @@ struct ethernet_regs {
#endif
/**


@brief Descriptor buffer alignment







i.MX27 requires a 16 byte alignment (but for the first element only)


*/

-#define DB_ALIGNMENT		16



-/**


@brief Data buffer alignment







i.MX27 requires a four byte alignment for transmit and 16 bits



alignment for receive so take 16



Note: Valid for member data_pointer in struct buffer_descriptor


*/

-#define DB_DATA_ALIGNMENT	16



-/**

@brief Receive&  Transmit Buffer Descriptor definitions

Note: The first BD must be aligned (see DB_ALIGNMENT)

@@ -282,8 +266,7 @@ struct fec_priv {
struct fec_bd *tbd_base;	/* TBD ring */
   int tbd_index;			/* next transmit BD to write */
   bd_t *bd;

void *rdb_ptr;
void *base_ptr;


uint8_t *tdb_ptr;
int dev_id;
int phy_id;
struct mii_dev *bus;


Uh, this was tough.
How bad do we want cache?
We're almost there, why do you ask? :-)