[U-Boot] [PATCH/RFC] net: defragment IP packets

This patch add a quick and dirty defrag step in IP reception. This allows to increase the TFTP block size and get more performance in slow links (but at that point it should be made configurable).
The overhead is negligible, verified with an ARM9 CPU and 10MB data file, changing the server MTU from 1500 to 800 and then 550. However, on a LAN connection, I didn't see advantes with using a 4k block size with default MTU.
Signed-off-by: Alessandro Rubini rubini@gnudd.com ---
This patch is over mainline, without the (much appreciated) cleanup patch that reached the list these days.
net/net.c | 46 ++++++++++++++++++++++++++++++++++++++++++---- 1 files changed, 42 insertions(+), 4 deletions(-)
diff --git a/net/net.c b/net/net.c index 641c37c..5034a2e 100644 --- a/net/net.c +++ b/net/net.c @@ -1117,6 +1117,46 @@ static void CDPStart(void) } #endif
+/* This only reassembles fragments that come in proper order */ +static inline IP_t *NetDefragment(IP_t *ip, int *lenp) +{ + static uchar pkt_buff[16384]; /*temporary arbitrary limit */ + static int next_fragment; + static ushort pkt_id; + + #define IPSZ 20 + uchar *pkt = (uchar *)ip; + ushort ip_off; + int offset, len = *lenp -2; + + ip_off = ntohs(ip->ip_off); + if (!(ip_off & (IP_OFFS | IP_FLAGS_MFRAG))) + return ip; + + offset = (ip_off & IP_OFFS) * 8; + if (!offset) { /* new packet begins, discard any we might have */ + pkt_id = ip->ip_id; + memcpy(pkt_buff, ip, len); + next_fragment = len; + return NULL; + } + + /* further fragment: discard IP header */ + offset += IPSZ; len -= IPSZ; pkt += IPSZ; + + if (ip->ip_id != pkt_id || offset != next_fragment) + return NULL; /* out of order */ + + /* further fragment: skip ip header (we miss offset_of...) */ + memcpy(pkt_buff + next_fragment, pkt, len); + next_fragment += len; + + if (ip_off & IP_FLAGS_MFRAG) + return NULL; /* more expected */ + + *lenp = next_fragment; + return (IP_t *)pkt_buff; +}
void NetReceive(volatile uchar * inpkt, int len) @@ -1360,6 +1400,8 @@ NetReceive(volatile uchar * inpkt, int len) break;
case PROT_IP: + if (!(ip = NetDefragment(ip, &len))) + return; #ifdef ET_DEBUG puts ("Got IP\n"); #endif @@ -1378,10 +1420,6 @@ NetReceive(volatile uchar * inpkt, int len) if ((ip->ip_hl_v & 0xf0) != 0x40) { return; } - /* Can't deal with fragments */ - if (ip->ip_off & htons(IP_OFFS | IP_FLAGS_MFRAG)) { - return; - } /* can't deal with headers > 20 bytes */ if ((ip->ip_hl_v & 0x0f) > 0x05) { return;

On Fri 24 Jul 2009 04:04, Alessandro Rubini pondered:
This patch add a quick and dirty defrag step in IP reception. This allows to increase the TFTP block size and get more performance in slow links (but at that point it should be made configurable).
The overhead is negligible, verified with an ARM9 CPU and 10MB data file, changing the server MTU from 1500 to 800 and then 550. However, on a LAN connection, I didn't see advantes with using a 4k block size with default MTU.
I do...
Filesize : 4613551 bytes (4.3Mbytes)
tftp modified, so it doesn't print out hashes (have have things slow down for the UART printing) - to make sure we are getting a true network measurement...
+#ifndef CONFIG_TFTP_QUIET if (((TftpBlock - 1) % 10) == 0) { putc ('#'); } else if ((TftpBlock % (10 * HASHES_PER_LINE)) == 0) puts ("\n\t "); +#endif
block size seconds 512 9.43 144.78% (+2.92 seconds) 1468 6.52 100.00% (default) 2048 6.27 96.20% (-0.25 seconds) 3072 5.69 87.40% (-0.82 seconds) 4096 5.46 83.81% (-1.05 seconds) 5120 5.26 80.76% (-1.25 seconds) 6144 5.13 78.79% (-1.38 seconds) 7168 5.09 78.13% (-1.42 seconds) 8192 5.01 76.92% (-1.50 seconds) 10240 4.94 75.83% (-1.58 seconds) 12288 4.87 74.74% (-1.65 seconds) 14336 4.85 74.49% (-1.66 seconds) 16384 4.82 73.95% (-1.70 seconds)
Saving that 1.7 seconds is worth it to me...
On a slower connection - I would guess that the difference is going to be more dramatic...
I needed to modify your patch a little bit to get it working on my platform.
If Ben/Wolfgang are interested in taking this - I'll fix up my mods, and send it back.
-Robin

Robin Getz wrote:
On Fri 24 Jul 2009 04:04, Alessandro Rubini pondered:
This patch add a quick and dirty defrag step in IP reception. This allows to increase the TFTP block size and get more performance in slow links (but at that point it should be made configurable).
The overhead is negligible, verified with an ARM9 CPU and 10MB data file, changing the server MTU from 1500 to 800 and then 550. However, on a LAN connection, I didn't see advantes with using a 4k block size with default MTU.
[snip]
I needed to modify your patch a little bit to get it working on my platform.
If Ben/Wolfgang are interested in taking this - I'll fix up my mods, and send it back.
-Robin
FWIIW, RFC815 describes a reassembly algorithm that handles out-of-order reassembly directly in the receive buffer by keeping the "holes" bookkeeping data in the holes themselves. http://www.faqs.org/rfcs/rfc815.html
Best regards, gvb

On Sat 25 Jul 2009 22:02, Jerry Van Baren pondered:
Robin Getz wrote:
On Fri 24 Jul 2009 04:04, Alessandro Rubini pondered:
This patch add a quick and dirty defrag step in IP reception.
I needed to modify your patch a little bit to get it working on my platform.
If Ben/Wolfgang are interested in taking this - I'll fix up my mods, and send it back.
FWIIW, RFC815 describes a reassembly algorithm that handles out-of-order
reassembly directly in the receive buffer by keeping the "holes" bookkeeping data in the holes themselves. http://www.faqs.org/rfcs/rfc815.html
Yeah, I had seen this - but didn't want to duplicate something that Alessandro might already working on...
Alessandro - were you going to add out of order packets? Since we are only really interested in TFTP (which still requires an ACK for a complete block), we should not need to handle more than one packet ID, so it should be mostly trivial...
To make your host send out of order/delayed packets, which should be more "real world/long haul" try something like: # modprobe sch_netem (if it's not compiled into your kernel) # tc qdisc change dev eth0 root netem reorder 50% delay 10ms
-Robin

Yeah, I had seen this - but didn't want to duplicate something that Alessandro might already working on...
Alessandro - were you going to add out of order packets?
If the code has chances to go mainline, I'll be happy to complete this task. So unless I get a nak earlier, I'm going to find a time slot in the next few days (with your fixes, I suppose, or should they remain separate patches?)
To make your host send out of order/delayed packets, which should be more "real world/long haul" try something like: # modprobe sch_netem (if it's not compiled into your kernel) # tc qdisc change dev eth0 root netem reorder 50% delay 10ms
Thanks a lot, I was missing that. /alessandro

On Sun 26 Jul 2009 16:23, Alessandro Rubini pondered:
Yeah, I had seen this - but didn't want to duplicate something that Alessandro might already working on...
Alessandro - were you going to add out of order packets?
If the code has chances to go mainline, I'll be happy to complete this task.
In the past - Wolfgang has normally said that as long as it doesn't negatively effect his platforms (i.e. is a compile option that doesn't effect the size of the normal build) he is mostly OK with anything (within reason).
So unless I get a nak earlier, I'm going to find a time slot in the next few days (with your fixes, I suppose, or should they remain separate patches?)
Nah - roll them all together...
I'll send some comments to your earlier patch.
To make your host send out of order/delayed packets, which should be more "real world/long haul" try something like: # modprobe sch_netem (if it's not compiled into your kernel) # tc qdisc change dev eth0 root netem reorder 50% delay 10ms
Thanks a lot, I was missing that.
http://www.linuxfoundation.org/en/Net:Netem#Packet_re-ordering
Some of the examples do not work, and the tc errors are pretty much meaningless - the man page is pretty thin, but the command line
"tc qdisc change dev eth0 root netem help"
might get you what you need to test things.
-Robin

Hi Guys,
Alessandro Rubini wrote:
Yeah, I had seen this - but didn't want to duplicate something that Alessandro might already working on...
Alessandro - were you going to add out of order packets?
If the code has chances to go mainline, I'll be happy to complete this task. So unless I get a nak earlier, I'm going to find a time slot in the next few days (with your fixes, I suppose, or should they remain separate patches?)
To make your host send out of order/delayed packets, which should be more "real world/long haul" try something like: # modprobe sch_netem (if it's not compiled into your kernel) # tc qdisc change dev eth0 root netem reorder 50% delay 10ms
Thanks a lot, I was missing that. /alessandro
This is great work. Thanks! If you follow these guidelines, I'll pull it into the net repo:
1. Configurable block size (via a well-named CONFIG). Choose a good default value. 2. Handle out-of-order fragments, and some test results showing that it works. 3. Make the feature configurable 4. Test with a TFTP server that doesn't have blksize feature enabled
I'm not sure about how to handle the configurability of this feature. I can see this being the default configuration in the future, but for now it should probably be opt-in. Let's see how it goes.
regards, Ben

On Mon 27 Jul 2009 01:08, Ben Warren pondered:
Hi Guys,
This is great work. Thanks! If you follow these guidelines, I'll pull it into the net repo:
- Configurable block size (via a well-named CONFIG). Choose a good default value.
- Handle out-of-order fragments, and some test results showing that it works.
Are you OK with something that only has been tested with TFTP?
Looks like nfs.h also has:
/* Block size used for NFS read accesses. A RPC reply packet (including all * headers) must fit within a single Ethernet frame to avoid fragmentation. * Chosen to be a power of two, as most NFS servers are optimized for this. */ #define NFS_READ_SIZE 1024
- Make the feature configurable
- Test with a TFTP server that doesn't have blksize feature enabled
It looks like the logic to do this is already there today - U-Boot uses a non-standard Block size (1468), and then falls back to 512 if it fails, or if the server doesn't ACK the option...
I'm not sure about how to handle the configurability of this feature.
I think there are two issues: - core networks stuff (CONFIG_NET_FRAG_SIZE defines the size of a reassembled packet, if it is not set, no reassembly code..) - default block size (if CONFIG_NET_FRAG_SIZE is defined, just read it from an env var?) or do you want a separate fixed CONFIG (and checks to make sure it is going to fit?)
I can see this being the default configuration in the future, but for now it should probably be opt-in. Let's see how it goes.

On Fri 24 Jul 2009 04:04, Alessandro Rubini pondered:
This patch add a quick and dirty defrag step in IP reception. This allows to increase the TFTP block size and get more performance in slow links (but at that point it should be made configurable).
The overhead is negligible, verified with an ARM9 CPU and 10MB data file, changing the server MTU from 1500 to 800 and then 550. However, on a LAN connection, I didn't see advantes with using a 4k block size with default MTU.
Signed-off-by: Alessandro Rubini rubini@gnudd.com
This patch is over mainline, without the (much appreciated) cleanup patch that reached the list these days.
net/net.c | 46 ++++++++++++++++++++++++++++++++++++++++++---- 1 files changed, 42 insertions(+), 4 deletions(-)
diff --git a/net/net.c b/net/net.c index 641c37c..5034a2e 100644 --- a/net/net.c +++ b/net/net.c @@ -1117,6 +1117,46 @@ static void CDPStart(void) } #endif
Should have a CONFIG_ something - to make this conditional.
+/* This only reassembles fragments that come in proper order */ +static inline IP_t *NetDefragment(IP_t *ip, int *lenp) +{
- static uchar pkt_buff[16384]; /*temporary arbitrary limit */
- static int next_fragment;
- static ushort pkt_id;
pkt_buff needs to be aligned - in case you want to start poking in it (which you are going to want to do...) just add a:
__attribute__((aligned(PKTALIGN)));
- #define IPSZ 20
I think what you want is "IP_HDR_SIZE_NO_UDP" - from include/net.h
- uchar *pkt = (uchar *)ip;
- ushort ip_off;
- int offset, len = *lenp -2;
Some ethernet drivers are messed up, and provide bogus lengths (add a few bytes that shouldn't be there).
It's better to get the length from the IP header. Adjusted to dump the header:
len = ntohs(ip->ip_len) - IP_HDR_SIZE_NO_UDP;
- ip_off = ntohs(ip->ip_off);
- if (!(ip_off & (IP_OFFS | IP_FLAGS_MFRAG)))
return ip;
Why not do this in NetReceive()?
JMP/RTS are much more expensive on most archs than a if()
- offset = (ip_off & IP_OFFS) * 8;
- if (!offset) { /* new packet begins, discard any we might have
*/
pkt_id = ip->ip_id;
memcpy(pkt_buff, ip, len);
next_fragment = len;
return NULL;
- }
- /* further fragment: discard IP header */
- offset += IPSZ; len -= IPSZ; pkt += IPSZ;
- if (ip->ip_id != pkt_id || offset != next_fragment)
return NULL; /* out of order */
We should check more than packet id - in case it is coming from a different host...
if (ip->ip_id != (IP_t *)pkt_buff->ip_id || /* check the packet ID */ ip->ip_p != (IP_t *)pkt_buff->ip_p || /* check the protocol */ ip->ip_src != (IP_t *)pkt_buff->ip_src || /* check the source */ ip->ip_dst != (IP_t *)pkt_buff->ip_dst ) /* check the dest */
- /* further fragment: skip ip header (we miss offset_of...) */
- memcpy(pkt_buff + next_fragment, pkt, len);
- next_fragment += len;
- if (ip_off & IP_FLAGS_MFRAG)
return NULL; /* more expected */
- *lenp = next_fragment;
- return (IP_t *)pkt_buff;
+}
void NetReceive(volatile uchar * inpkt, int len) @@ -1360,6 +1400,8 @@ NetReceive(volatile uchar * inpkt, int len) break;
case PROT_IP:
if (!(ip = NetDefragment(ip, &len)))
return;
#ifdef ET_DEBUG puts ("Got IP\n"); #endif
I guess this is includes some generic cleanup - but at least the reassembly should happen _after_ the sanity checks are done. Something like:
case PROT_IP: #ifdef ET_DEBUG puts ("Got IP\n"); #endif /* Before we start poking the header, make sure it is there */ if (len < IP_HDR_SIZE) { debug ("len bad %d < %lu\n", len, (ulong)IP_HDR_SIZE); return; } /* can't deal with anything except IPv4 */ if ((ip->ip_hl_v & 0xf0) != 0x40) { debug("I only understand IPv4\n"); return; } /* can't deal with IP options (headers != 20 bytes) */ if ((ip->ip_hl_v & 0x0f) * 4 != IP_HDR_SIZE_NO_UDP) { debug("Can't deal with IP options\n"); return; } /* Check the Checksum of the header */ if (!NetCksumOk((uchar *)ip, IP_HDR_SIZE_NO_UDP / 2)) { puts ("IP header checksum bad\n"); return; } /* Check the packet length */ if (len < ntohs(ip->ip_len)) { printf("len bad %d < %d\n", len, ntohs(ip->ip_len)); return; } /* If it is not for us, ignore it */ tmp = NetReadIP(&ip->ip_dst); if (NetOurIP && tmp != NetOurIP && tmp != 0xFFFFFFFF) { #ifdef CONFIG_MCAST_TFTP if (Mcast_addr != tmp) #endif return; } #ifdef CONFIG_NET_FRAGMENT /* If it is a IP fragment, try to combine it with it */ if (ntohs(ip->ip_off) & (IP_OFFS | IP_FLAGS_MFRAG)) { debug("received fragmented packet\n"); ip = NetDefragment(ip); if (!ip) return; } #endif len = ntohs(ip->ip_len); #ifdef ET_DEBUG printf("len=%d, v=%02x\n", len, ip->ip_hl_v & 0xff); #endif
@@ -1378,10 +1420,6 @@ NetReceive(volatile uchar * inpkt, int len) if ((ip->ip_hl_v & 0xf0) != 0x40) { return; }
/* Can't deal with fragments */
if (ip->ip_off & htons(IP_OFFS | IP_FLAGS_MFRAG)) {
return;
/* can't deal with headers > 20 bytes */ if ((ip->ip_hl_v & 0x0f) > 0x05) { return;}
I also had (for easier testing)
@@ -478,8 +497,15 @@ TftpStart (void) #ifdef CONFIG_TFTP_PORT char *ep; /* Environment pointer */ #endif
TftpServerIP = NetServerIP; +#ifdef CONFIG_NET_FRAGMENT + { + char *tmp; + tmp = getenv("tftpblocksize") ; + if (tmp) + TftpBlkSizeOption = simple_strtol(tmp, NULL, 10); + else + TftpBlkSizeOption = TFTP_MTU_BLOCKSIZE; + debug("using TftpBlkSizeOption = %d\n", TftpBlkSizeOption); + } +#endif if (BootFile[0] == '\0') { sprintf(default_filename, "%02lX%02lX%02lX%02lX.img", NetOurIP & 0xFF,
and I was doing md5 or sha1 on things to make sure that things came over properly...
-Robin

Thanks for your comments.
Should have a CONFIG_ something - to make this conditional.
This has been asked by Ben too. Will do, although I'm not very happy about all those conditionals for every few lines of code.
Some of your remarks are just symptoms of this being a quick hack, like the memcpy not checked, the missed getenv and so on.
and I was doing md5 or sha1 on things to make sure that things came over properly...
Yes, me too. Besides booting the stuff I got.
/alessandro

On Mon 27 Jul 2009 08:13, Alessandro Rubini pondered:
Thanks for your comments.
Should have a CONFIG_ something - to make this conditional.
This has been asked by Ben too. Will do, although I'm not very happy about all those conditionals for every few lines of code.
There are 22,250 #ifdef in the code base already - a few more isn't going to make thing any uglier than it already is. :)
Some of your remarks are just symptoms of this being a quick hack, like the memcpy not checked, the missed getenv and so on.
Yeah, comments were for completeness, not a comment on the (nice & quick) function you put together...
and I was doing md5 or sha1 on things to make sure that things came over properly...
Yes, me too. Besides booting the stuff I got.
/alessandro

Dear Robin Getz,
In message 200907262059.34188.rgetz@blackfin.uclinux.org you wrote:
...
and I was doing md5 or sha1 on things to make sure that things came over properly...
Are you using FIT images, or reinventing the wheel?
Best regards,
Wolfgang Denk

On Mon 27 Jul 2009 08:41, Wolfgang Denk pondered:
Dear Robin Getz,
In message 200907262059.34188.rgetz@blackfin.uclinux.org you wrote:
...
and I was doing md5 or sha1 on things to make sure that things came over properly...
Are you using FIT images, or reinventing the wheel?
Neither - Blackfin does not support FIT yet - just doing a command line "sha1sum $(loadaddr) $(filesize)" (via the patch I sent yesterday).

On Fri 24 Jul 2009 04:04, Alessandro Rubini pondered: [snip]
+/* This only reassembles fragments that come in proper order */ +static inline IP_t *NetDefragment(IP_t *ip, int *lenp) +{
- static uchar pkt_buff[16384]; /*temporary arbitrary limit */
- static int next_fragment;
- static ushort pkt_id;
[snip]
- /* further fragment: skip ip header (we miss offset_of...) */
- memcpy(pkt_buff + next_fragment, pkt, len);
- next_fragment += len;
Also (forgot last night) - we need to make sure the length of the packet fits in the reassembly buffer before you do the memcpy(). Setting the tftp block size to 16384 is bad if the buffer is also set to 16384.. (since it has the IP header on it)...
-Robin
participants (5)
-
Alessandro Rubini
-
Ben Warren
-
Jerry Van Baren
-
Robin Getz
-
Wolfgang Denk