
On 31.07.2012 03:06, Benoît Thébaudeau wrote:
Dear Marek Vasut,
On Tue, Jul 31, 2012 at 12:38:54 AM, Marek Vasut wrote:
[...]
Can you explain where this gain would come from? In both cases, the data in USB transfers would be organized in the same way, and it would be accessed in memory also in the same way (regarding bursts). The only difference would be the fetch time of a little bit more qTDs, which is extremely fast and insignificant compared to the transfer time of the payload, which remains unchanged.
You are right, the speed difference will be minimal; only the memory usage will be lower.
If your point is only the memory gain, I agree. With your suggestion, there are roughly 25% less qTDs used in the "(max wMaxPacketSize)-aligned but not page-aligned" case since the number of qTDs is about (total transfer size) / 5 instead of (total transfer size) / 4. But this is still small compared to usual heap sizes (at least on the kind of hardware I use).
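To make the ~25% figure above concrete, here is a small illustrative sketch (not the actual U-Boot code; the helper function is hypothetical). An EHCI qTD has QT_BUFFER_CNT = 5 buffer pointers of one 4 KiB page each, so a page-aligned buffer fills 5 pages per qTD, while a buffer aligned only to wMaxPacketSize effectively fills 4:

```c
#include <assert.h>
#include <stdint.h>

/* Per the EHCI spec: 5 buffer pointers per qTD, one 4 KiB page each. */
#define QT_BUFFER_CNT 5

/* Hypothetical helper: qTDs needed when every qTD carries a fixed
 * payload of pages_per_qtd full 4 KiB pages. */
static uint32_t qtds_needed(uint32_t length, uint32_t pages_per_qtd)
{
	uint32_t per_qtd = pages_per_qtd * 4096;

	return (length + per_qtd - 1) / per_qtd;
}
```

For a 1 MiB transfer, `qtds_needed(1048576, QT_BUFFER_CNT)` gives 52 qTDs for the page-aligned case, while `qtds_needed(1048576, QT_BUFFER_CNT - 1)` gives 64 for the merely 1 kB-aligned case, i.e. roughly 25% more.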
Ok, I see the point. I understand it's not really a bug, just an improvement.
Exactly.
Maybe we can do a subsequent patch on top of these from Benoit and see how it fares?
If you wish. I'll do that.
Moreover, in your use case, if you are e.g. using FAT, on the one hand, the buffers in fat.c are never aligned to more than the DMA min alignment, and on the other hand, if you can align your user buffers to 512 bytes, you can also align them directly to 4 kB.
The user buffer is aligned to 4 kB, but this doesn't matter, since a file load from a storage device (e.g. fatload) can be segmented into partial USB transfers. This can lead to any block-aligned buffer for a partial transfer.
What do you mean by "partial USB transfers"? As seen from EHCI users like the MSC driver (usb_storage.c), USB transfers either succeed or fail, but they cannot be "segmented".
Segmented -- like multiple transfers being issued with small payloads?
Right.
You can not put these together at the USB-level, since it's the issuing code that has to be fixed.
If the segmentation comes from the file system handling we can not avoid this.
On its side, the MSC driver will only segment the FAT layer requests if they are larger than 65535 blocks, so still not what you describe.
As to the FAT stack, it will only read whole clusters while accessing file payload, and the most usual cluster sizes are by default a multiple of 4 kiB (see http://support.microsoft.com/kb/140365).
512 B is the minimum, and it's quite often used.
OK.
In my example I use a FAT partition of 128 MB with 1 kB clusters. The file is read in two segments: the first transfer starts 4 kB aligned but stops 1 kB aligned (not 4 kB aligned), which leads to an unaligned second transfer.
So I don't see "segmentation" anywhere, and for usual cluster sizes, the EHCI buffer alignment is fully determined by the applicative buffer alignment and the file position corresponding to the beginning of the applicative buffer. But there are indeed some unusual use cases (e.g. smaller clusters) for which only a block-aligned buffer will reach EHCI despite a page-aligned applicative buffer.
I don't quite get this one.
I meant that 512 bytes (most usual storage block size) is what we should aim at to optimize the number of qTDs.
Right.
My suggestion would be to truncate the xfr_bytes with the max wMaxPacketSize (1024) and for the qtd_count use:
if ((uint32_t)buffer & 1023) /* wMaxPacketSize unaligned */
	qtd_count += DIV_ROUND_UP(((uint32_t)buffer & 4095) + length,
				  (QT_BUFFER_CNT - 1) * 4096);
else /* wMaxPacketSize aligned */
	qtd_count += DIV_ROUND_UP(((uint32_t)buffer & 4095) + length,
				  QT_BUFFER_CNT * 4096);
This allows 50% of the unaligned block data (512-byte-aligned buffers) to be transferred with the minimum number of qTDs.
That would also require a realignment-to-page stage. This is specific code for specific buffer alignment from the upper layers. We could also skip the realignment to page and always keep the same qTD transfer size except for the last one, by adding as many packets as possible for the buffer alignment.
What do you mean by a realignment-to-page stage?
I mean that the alignment of the transfer to 1024 instead of 4096 can make the first qTD transfer larger than the following ones, which guarantees that the following qTD transfers are page-aligned, even if the first one was only aligned to 1024. For the 1024-aligned case, this results in the change that you suggest, but it also changes things for the unaligned case, which makes this part of the code inaccurate. See below.
You are right. It maximises the first transfer. All other transfers are 5 * 4 kB (aligned case) or 4 * 4 kB (unaligned case) long.
But I still don't see a significant reason to complicate code to do that.
I don't understand where you expect to complicate the code.
You limit the size of one transfer (xfr_bytes) to (QT_BUFFER_CNT - 1) * 4 kB for unaligned buffers. But you only need to limit it to a multiple of the maximal possible wMaxPacketSize (1 kB) to make sure that you always send full packets.
I only suggest to replace the causeless 4 kB alignment with a reasonable 1 kB alignment and to adapt the qtd_count calculation.
int xfr_bytes = min(left_length,
		    (QT_BUFFER_CNT * 4096 -
		     ((uint32_t)buf_ptr & 4095)) &
-		    ~4095);
+		    ~1023);
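The effect of switching the mask can be sketched as follows (a standalone illustration, not the driver code; the mask is made a parameter purely for comparison):

```c
#include <assert.h>
#include <stdint.h>

#define QT_BUFFER_CNT 5
#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* Per-qTD transfer size truncation as discussed above:
 * mask == ~4095 is the current page alignment,
 * mask == ~1023 is the proposed max-wMaxPacketSize alignment. */
static uint32_t xfr_bytes(uint32_t buf, uint32_t left_length, uint32_t mask)
{
	return MIN(left_length,
		   (QT_BUFFER_CNT * 4096 - (buf & 4095)) & mask);
}
```

For a buffer at offset 0x400 within a page and a large remaining length, the ~4095 mask yields 16384 bytes for the first qTD, whereas the ~1023 mask yields 19456 bytes; and since 0x400 + 19456 is a multiple of 4096, all following qTD transfers start page-aligned, which is the realignment-to-page effect described above.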
I agree for this part of the code. But for the allocation part, your suggestion is already a little bit more complicated than my original code, while still incomplete. Besides that, the "((uint32_t)buffer & 4095) +" for the page-unaligned case in my code was actually useless, which emphasizes the difference, even if it's a light complication.
For the allocation part, the appropriate code for your suggestion would be:
if ((uint32_t)buffer & 1023) /* max-wMaxPacketSize-unaligned */
	qtd_count += DIV_ROUND_UP(max(length > 0,
				      length - ((4096 -
						 ((uint32_t)buf_ptr & 4095)) &
						~1023)),
				  (QT_BUFFER_CNT - 1) * 4096);
Ok, I now think I understand what's going on here. I still have to wonder how much the compiler would optimize of this if you "decompressed" your code to make it easier to understand.
I wouldn't go for this complicated version since it's not really useful compared to the simpler yet less accurate solution I gave below.
You are right, but your complicated code only saves one qTD.
else /* max-wMaxPacketSize-aligned */
	qtd_count += DIV_ROUND_UP(((uint32_t)buffer & 4095) + length,
				  QT_BUFFER_CNT * 4096);
This code allocates exactly the required number of qTDs, no less, no more. It's clearly more complicated than the 4096 version.
A test should also be added to make sure that qtd_count is never 0. Otherwise, ZLPs are broken (this applies to my original code too).
If we want to compromise accuracy for simplicity, we can change that to:
qtd_count += 2 + length / ((QT_BUFFER_CNT - !!((uint32_t)buffer & 1023)) * 4096);
It's this solution I'd like to use to optimize the number of qTDs (with 1023 or something else).
This sounds reasonable.
This code allocates enough qTDs for all cases, with at worst 2 extra qTDs (i.e. a total of 128 bytes) that will be left unused. It also handles intrinsically ZLPs.
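The simple variant and its properties can be checked with a short standalone sketch (illustrative only; the function mirrors the one-liner above):

```c
#include <assert.h>
#include <stdint.h>

#define QT_BUFFER_CNT 5

/* Simple over-allocating estimate from the discussion: never returns
 * 0 (so a zero-length packet still gets a qTD), at the cost of at
 * most 2 unused qTDs (i.e. about 128 bytes of heap). */
static uint32_t qtd_count_simple(uint32_t buffer, uint32_t length)
{
	return 2 + length / ((QT_BUFFER_CNT - !!(buffer & 1023)) * 4096);
}
```

For example, a ZLP gives 2 qTDs; a 1 MiB transfer from a page-aligned buffer gives 2 + 1048576 / 20480 = 53 qTDs; the same transfer from a 512-byte-aligned buffer gives 2 + 1048576 / 16384 = 66.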
Now, let's consider the possible values of wMaxPacketSize:
- control endpoints:
- LS: 8,
- FS: 8, 16, 32 or 64,
- HS: 64,
- isochronous endpoints: not supported by ehci-hcd.c,
- interrupt endpoints:
- LS: <= 8,
- FS: <= 64,
- HS: <= 1024 (1x to 3x for high-bandwidth),
- bulk endpoints:
- LS: N/A,
- FS: 8, 16, 32 or 64,
- HS: 512.
My code assumes that wMaxPacketSize is a power of 2. This is not always true for interrupt endpoints. Let's talk about these.
Their handling is currently broken in U-Boot since their transfers are made asynchronous instead of periodic. Devices shouldn't care too much about that, as long as transfers do not exceed wMaxPacketSize, in which case my code still works because wMaxPacketSize always fits in a single qTD. Interrupt transfers larger than wMaxPacketSize do not seem to be used by U-Boot. If they were used, the current code in U-Boot would have a timing issue because the asynchronous scheme would break the interval requested by devices, which could at worst make them fail in some way. So the only solution would be that such transfers be split by the caller of submit_int_msg, in which case my code still works.
What would you think about failing with an error message in submit_int_msg if length is larger than wMaxPacketSize? Marek, what do you think?
Let's do that ... I think the interrupt endpoint is only used for keyboard and if someone needs it for something else, the code will be there, just needing improvement. Comment and error message are OK.
OK. I have thought of another solution for this. You'll tell me which one you prefer.
The ehci_submit_async code currently in U-Boot checks through ehci_td_buffer that length fits in the single qTD reserved for data payload only after work has begun, possibly after a SETUP transfer. With my series, this is checked at the very beginning, before the allocation. We could detect that wMaxPacketSize is not a power of 2 (e.g. with __builtin_popcount), in which case the allocation for the data payload would be restricted to 1 qTD like now, and there would be a check at the very beginning to test if length fits in this qTD. In that way, there could be several packets per interrupt transfer as long as it fits in a single qTD, just like now, contrary to the limitation imposed by the error in submit_int_msg. But I'm not sure it's a good idea to allow this behavior.
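The power-of-2 check mentioned above could be sketched like this (a hypothetical helper, not existing U-Boot code; __builtin_popcount is a GCC/Clang built-in):

```c
#include <assert.h>

/* Sketch: detect a wMaxPacketSize that is not a power of 2 (possible
 * for interrupt endpoints), in which case the data payload allocation
 * would be restricted to a single qTD as described above. */
static int wmax_is_pow2(unsigned int wmaxpacketsize)
{
	return wmaxpacketsize && __builtin_popcount(wmaxpacketsize) == 1;
}
```

All control and bulk wMaxPacketSize values (8, 16, 32, 64, 512) pass this check; an interrupt endpoint could legally report, say, 1023, which would not.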
I think this is not needed, as there is only one user (keyboard) with max size of 8 byte.
For all other cases, wMaxPacketSize is a power of 2, so everything is fine, except that in those cases wMaxPacketSize is at most 512, which means that with the suggested limitation applied to submit_int_msg, your suggested 1024 could be replaced with 512, which is good news since this is also the most common storage sector size.
We could even use usb_maxpacket(dev, pipe) instead of 512, with some restrictions. If we don't want to alter the misalignment check in ehci_td_buffer, max(usb_maxpacket(dev, pipe), ARCH_DMA_MINALIGN) would actually have to be used. This misalignment check could be limited to the first qTD transfer of a USB transfer, but that would complicate things, all the more the corresponding call to flush_dcache_range would have to be modified to fix alignments.
So we have to make two choices:
- between 4096, 1024, 512 and max(usb_maxpacket(dev, pipe), ARCH_DMA_MINALIGN),
- between the accurate and simple allocations.
That makes a total of 8 working possibilities. What do you guys think we should choose? On my side, I like max(usb_maxpacket(dev, pipe), ARCH_DMA_MINALIGN)
Won't maxpacket fall below 512 on occasions,
Sure.
I would go with 512, as it is also the most common storage sector size.
which might cause trouble?
Why?
with the simple allocation. It's efficient in terms of code speed, size and readability, as well as RAM usage.
For now, I'd go for the safest, easiest and dumbest solution and see how it fares. Subsequent patches can be submitted to improve that, and measurements made.
"We should forget about small efficiencies, say about 97% of the time; premature optimization is the root of all evil" -- Donald E. Knuth, Structured Programming with go to Statements [...]
OK, so I'll stick to my original series, rework it lightly as we said, add Jim's patch, and add a further patch for these optimizations.
Okay.
So we could perhaps issue a #error in ehci-hcd or in usb_storage if CONFIG_SYS_MALLOC_LEN is not large enough, but I don't think it's a good idea because:
- the threshold value would have to depend on runtime block sizes or something, which could lead to a huge worst case that would almost never happen in real life, so giving such an unrealistic heap size constraint would be cumbersome,
#warning then?
With which limit if so?
I would expect more than 128kB as this is a common worst case (512 B block size).
- reaching the top sizes would mean reading a huge file or something to a large buffer (much larger than the qTDs this transfer requires), which would very likely be heap-allocated (except for commands like fatload), so CONFIG_SYS_MALLOC_LEN would already have to be large for the application,
- for command line operations like fatload, transfers of uncontrolled lengths could simply crash U-Boot if they go too far in memory
Why, because they overwrite it?
Yes. U-Boot expands down its allocation during startup, so it's often located at the end of the embedded RAM, which means that fatload will very likely use the beginning of the RAM.
, which means that users of such commands need to know what they are doing anyway, so they have to control transfer sizes,
- there is already a runtime error displayed in case of allocation failure.
Ok
So #warning or not besides this?
Marek, what do you think?
Had a good evening with the EHCI r1.0 spec; hope I answered most of your questions.
Yes, thanks.
Regards, Stefan