[U-Boot] [RFC V2 PATCH 0/3] Add cache for block devices

Here's a more full-featured implementation of a cache for block devices that uses a small linked list of cache blocks.
Experimentation loading a 4.5 MiB kernel from the root directory of a FAT filesystem shows that a single cache entry of a single block is the only
Loading the same from the /boot directory of an ext4 filesystem shows a benefit with 4 cache entries, though the single biggest benefit is also with the first cache entry:
=> for n in 0 1 2 4 8 ; do
blkc 1 $n ; blkc c ; blkc i ; load mmc 0:2 10008000 /boot/zImage ; done
changed to max of 0 entries of 1 blocks each 4955304 bytes read in 503 ms (9.4 MiB/s) changed to max of 1 entries of 1 blocks each 4955304 bytes read in 284 ms (16.6 MiB/s) changed to max of 2 entries of 1 blocks each 4955304 bytes read in 284 ms (16.6 MiB/s) changed to max of 4 entries of 1 blocks each 4955304 bytes read in 255 ms (18.5 MiB/s) changed to max of 8 entries of 1 blocks each 4955304 bytes read in 255 ms (18.5 MiB/s)
As mentioned earlier in this thread, the modification to the mmc layer should probably be simpler and easier to apply to other block subsystems.
Eric Nelson (3): drivers: block: add block device cache block: add Kconfig options for BLOCK_CACHE, CMD_BLOCK_CACHE mmc: add support for block device cache
drivers/block/Kconfig | 19 ++++ drivers/block/Makefile | 1 + drivers/block/cache_block.c | 240 ++++++++++++++++++++++++++++++++++++++++++++ drivers/mmc/mmc.c | 10 +- drivers/mmc/mmc_write.c | 7 ++ include/part.h | 69 +++++++++++++ 6 files changed, 345 insertions(+), 1 deletion(-) create mode 100644 drivers/block/cache_block.c

Add a block device cache to speed up repeated reads of block devices by various filesystems.
This small amount of cache can dramatically speed up filesystem operations by skipping repeated reads of common areas of a block device (typically directory structures).
This has shown to have some benefit on FAT filesystem operations of loading a kernel and RAM disk, but more dramatic benefits on ext4 filesystems when the kernel and/or RAM disk are spread across multiple extent header structures as described in commit fc0fc50.
The cache is implemented through a minimal list (block_cache) maintained in most-recently-used order and count of the current number of entries (cache_count). It uses a maximum block count setting to prevent copies of large block reads and an upper bound on the number of cached areas.
The maximum number of entries in the cache defaults to 4 and the maximum number of blocks per cache entry has a default of 1, which matches the current implementation of at least FAT and ext4 that only read a single block at a time.
The 'blkcache' command (enabled through CONFIG_CMD_BLOCK_CACHE) allows changing these values and can be used to tune for a particular filesystem layout.
Signed-off-by: Eric Nelson eric@nelint.com --- drivers/block/Makefile | 1 + drivers/block/cache_block.c | 240 ++++++++++++++++++++++++++++++++++++++++++++ include/part.h | 69 +++++++++++++ 3 files changed, 310 insertions(+) create mode 100644 drivers/block/cache_block.c
diff --git a/drivers/block/Makefile b/drivers/block/Makefile index b5c7ae1..056a48b 100644 --- a/drivers/block/Makefile +++ b/drivers/block/Makefile @@ -24,3 +24,4 @@ obj-$(CONFIG_IDE_SIL680) += sil680.o obj-$(CONFIG_SANDBOX) += sandbox.o obj-$(CONFIG_SCSI_SYM53C8XX) += sym53c8xx.o obj-$(CONFIG_SYSTEMACE) += systemace.o +obj-$(CONFIG_BLOCK_CACHE) += cache_block.o diff --git a/drivers/block/cache_block.c b/drivers/block/cache_block.c new file mode 100644 index 0000000..e4ebeda --- /dev/null +++ b/drivers/block/cache_block.c @@ -0,0 +1,240 @@ +/* + * Copyright (C) Nelson Integration, LLC 2016 + * Author: Eric Nelsoneric@nelint.com + * + * SPDX-License-Identifier: GPL-2.0+ + * + */ +#include <config.h> +#include <common.h> +#include <malloc.h> +#include <part.h> +#include <linux/ctype.h> +#include <linux/list.h> + +struct block_cache_node { + struct list_head lh; + int iftype; + int devnum; + lbaint_t start; + lbaint_t blkcnt; + unsigned long blksz; + char *cache; +}; + +static LIST_HEAD(block_cache); +static unsigned cache_count; + +static unsigned long max_blocks_per_entry = 1; +static unsigned long max_cache_entries = 4; + +static unsigned block_cache_misses; +static unsigned block_cache_hits; + +static int trace; + +/* + * search for a set of blocks in the cache + * + * remove and return the node if found so it can be + * added back at the start of the list and maintain MRU order) + */ +static struct block_cache_node *cache_find + (int iftype, int devnum, + lbaint_t start, lbaint_t blkcnt, + unsigned long blksz) +{ + struct block_cache_node *node; + + list_for_each_entry(node, &block_cache, lh) + if ((node->iftype == iftype) && + (node->devnum == devnum) && + (node->blksz == blksz) && + (node->start <= start) && + (node->start + node->blkcnt >= start + blkcnt)) { + list_del(&node->lh); + return node; + } + return 0; +} + +int cache_block_read(int iftype, int devnum, + lbaint_t start, lbaint_t blkcnt, + unsigned long blksz, void *buffer) +{ + struct block_cache_node *node = cache_find(iftype, devnum, start, + blkcnt, blksz); + if (node) { + const char *src = node->cache + (start - node->start) * blksz; + list_add(&node->lh, &block_cache); + memcpy(buffer, src, blksz*blkcnt); + if (trace) + printf("hit: start " LBAF ", count " LBAFU "\n", + start, blkcnt); + ++block_cache_hits; + return 1; + } + + if (trace) + printf("miss: start " LBAF ", count " LBAFU "\n", + start, blkcnt); + ++block_cache_misses; + return 0; +} + +void cache_block_fill(int iftype, int devnum, + lbaint_t start, lbaint_t blkcnt, + unsigned long blksz, void const *buffer) +{ + lbaint_t bytes; + struct block_cache_node *node; + + /* don't cache big stuff */ + if (blkcnt > max_blocks_per_entry) + return; + + if (max_cache_entries == 0) + return; + + bytes = blksz * blkcnt; + if (max_cache_entries <= cache_count) { + /* pop LRU */ + node = (struct block_cache_node *)block_cache.prev; + list_del(&node->lh); + cache_count--; + if (node->blkcnt * node->blksz < bytes) { + free(node->cache); + node->cache = 0; + } + } else { + node = malloc(sizeof(*node)); + if (!node) + return; + node->cache = 0; + } + + if (node->cache == 0) { + node->cache = malloc(bytes); + if (!node->cache) { + free(node); + return; + } + } + + if (trace) + printf("fill: start " LBAF ", count " LBAFU "\n", + start, blkcnt); + + node->iftype = iftype; + node->devnum = devnum; + node->start = start; + node->blkcnt = blkcnt; + node->blksz = blksz; + memcpy(node->cache, buffer, bytes); + list_add(&node->lh, &block_cache); + cache_count++; +} + +void cache_block_invalidate(int iftype, int devnum) +{ + struct list_head *entry, *n; + struct block_cache_node *node; + + list_for_each_safe(entry, n, &block_cache) { + node = (struct block_cache_node *)entry; + if ((node->iftype == iftype) && + (node->devnum == devnum)) { + list_del(entry); + free(node->cache); + free(node); + --cache_count; + } + } +} + +void cache_block_invalidate_all(void) +{ + struct list_head *entry, *n; + struct block_cache_node *node; + + list_for_each_safe(entry, n, &block_cache) { + node = (struct block_cache_node *)entry; + list_del(entry); + free(node->cache); + free(node); + } + + cache_count = 0; +} + +#ifdef CONFIG_CMD_BLOCK_CACHE + +static void dump_block_cache(void) +{ + struct list_head *entry, *n; + struct block_cache_node *node; + int i = 0; + + list_for_each_safe(entry, n, &block_cache) { + node = (struct block_cache_node *)entry; + printf("----- cache entry[%d]\n", i++); + printf("iftype %d\n", node->iftype); + printf("devnum %d\n", node->devnum); + printf("blksize " LBAFU "\n", node->blksz); + printf("start " LBAF "\n", node->start); + printf("count " LBAFU "\n", node->blkcnt); + } +} + +static int do_blkcache(cmd_tbl_t *cmdtp, int flag, + int argc, char * const argv[]) +{ + if ((argc == 1) || !strcmp("show", argv[1])) { + printf("block cache:\n" + "%u\thits\n" + "%u\tmisses\n" + "%u\tentries in cache\n" + "trace %s\n" + "max blocks/entry %lu\n" + "max entries %lu\n", + block_cache_hits, block_cache_misses, cache_count, + trace ? "on" : "off", + max_blocks_per_entry, max_cache_entries); + } else if ((argc == 2) && ('c' == argv[1][0])) { + block_cache_hits = 0; + block_cache_misses = 0; + } else if ((argc == 2) && ('d' == argv[1][0])) { + dump_block_cache(); + } else if ((argc == 2) && ('i' == argv[1][0])) { + cache_block_invalidate_all(); + } else if ((argc >= 2) && ('t' == argv[1][0])) { + if ((argc == 3) && !strcmp("off", argv[2])) + trace = 0; + else + trace = 1; + } else if ((argc == 3) && isdigit(argv[1][0]) && isdigit(argv[2][0])) { + max_blocks_per_entry = simple_strtoul(argv[1], 0, 0); + max_cache_entries = simple_strtoul(argv[2], 0, 0); + cache_block_invalidate_all(); + printf("changed to max of %lu entries of %lu blocks each\n", + max_cache_entries, max_blocks_per_entry); + } else { + return CMD_RET_USAGE; + } + + return 0; +} + +U_BOOT_CMD( + blkcache, 3, 0, do_blkcache, + "block cache control/statistics", + "[show] - show statistics\n" + "blkcache c[lear] - clear statistics\n" + "blkcache d[ump] - dump cache entries\n" + "blkcache i[nvalidate] - invalidate cache\n" + "blkcache #maxblocks #maxentries\n" + "\tset maximum blocks per cache entry, maximum entries\n" + "blkcache t[race] [off] - enable (disable) tracing" +); + +#endif diff --git a/include/part.h b/include/part.h index dc8e72e..1ac73dcc 100644 --- a/include/part.h +++ b/include/part.h @@ -376,4 +376,73 @@ int gpt_verify_partitions(struct blk_desc *dev_desc, gpt_header *gpt_head, gpt_entry **gpt_pte); #endif
+#ifdef CONFIG_BLOCK_CACHE +/** + * cache_block_read() - attempt to read a set of blocks from cache + * + * @param iftype - IF_TYPE_x for type of device + * @param dev - device index of particular type + * @param start - starting block number + * @param blkcnt - number of blocks to read + * @param blksz - size in bytes of each block + * @param buf - buffer to contain cached data + * + * @return - '1' if block returned from cache, '0' otherwise. + */ +int cache_block_read + (int iftype, int dev, + lbaint_t start, lbaint_t blkcnt, + unsigned long blksz, void *buffer); + +/** + * cache_block_fill() - make data read from a block device available + * to the block cache + * + * @param iftype - IF_TYPE_x for type of device + * @param dev - device index of particular type + * @param start - starting block number + * @param blkcnt - number of blocks available + * @param blksz - size in bytes of each block + * @param buf - buffer containing data to cache + * + */ +void cache_block_fill + (int iftype, int dev, + lbaint_t start, lbaint_t blkcnt, + unsigned long blksz, void const *buffer); + +/** + * cache_block_invalidate() - discard the cache for a set of blocks + * because of a write or device (re)initialization. + * + * @param iftype - IF_TYPE_x for type of device + * @param dev - device index of particular type + */ +void cache_block_invalidate + (int iftype, int dev); + +void cache_block_invalidate_all(void); + +#else + +static inline int cache_block_read + (int iftype, int dev, + lbaint_t start, lbaint_t blkcnt, + unsigned long blksz, void *buffer) +{ + return 0; +} + +static inline void cache_block_fill + (int iftype, int dev, + lbaint_t start, lbaint_t blkcnt, + unsigned long blksz, void const *buffer) {} + +static inline void cache_block_invalidate + (int iftype, int dev) {} + +static inline void cache_block_invalidate_all(void){} + +#endif + #endif /* _PART_H */

On 03/20/2016 06:45 PM, Eric Nelson wrote:
Add a block device cache to speed up repeated reads of block devices by various filesystems.
...
Signed-off-by: Eric Nelson eric@nelint.com
drivers/block/Makefile | 1 + drivers/block/cache_block.c | 240 ++++++++++++++++++++++++++++++++++++++++++++ include/part.h | 69 +++++++++++++ 3 files changed, 310 insertions(+) create mode 100644 drivers/block/cache_block.c
...
diff --git a/include/part.h b/include/part.h index dc8e72e..1ac73dcc 100644 --- a/include/part.h +++ b/include/part.h @@ -376,4 +376,73 @@ int gpt_verify_partitions(struct blk_desc *dev_desc, gpt_header *gpt_head, gpt_entry **gpt_pte); #endif
I think this stuff now belongs in blk.h instead of part.h:
+#ifdef CONFIG_BLOCK_CACHE +/**
- cache_block_read() - attempt to read a set of blocks from cache
- @param iftype - IF_TYPE_x for type of device

On 03/20/2016 07:45 PM, Eric Nelson wrote:
Add a block device cache to speed up repeated reads of block devices by various filesystems.
This small amount of cache can dramatically speed up filesystem operations by skipping repeated reads of common areas of a block device (typically directory structures).
This has shown to have some benefit on FAT filesystem operations of loading a kernel and RAM disk, but more dramatic benefits on ext4 filesystems when the kernel and/or RAM disk are spread across multiple extent header structures as described in commit fc0fc50.
The cache is implemented through a minimal list (block_cache) maintained in most-recently-used order and count of the current number of entries (cache_count). It uses a maximum block count setting to prevent copies of large block reads and an upper bound on the number of cached areas.
The maximum number of entries in the cache defaults to 4 and the maximum number of blocks per cache entry has a default of 1, which matches the current implementation of at least FAT and ext4 that only read a single block at a time.
If the maximum size of the cache is fixed and small (which judging by the description it is), why not use an array rather than a linked list. That would be far simpler to manage.
The 'blkcache' command (enabled through CONFIG_CMD_BLOCK_CACHE) allows changing these values and can be used to tune for a particular filesystem layout.
Even with this dynamic adjustment, using an array still feels simpler, although I have't looked at the code yet.
diff --git a/drivers/block/cache_block.c b/drivers/block/cache_block.c
+/*
- search for a set of blocks in the cache
- remove and return the node if found so it can be
- added back at the start of the list and maintain MRU order)
- */
Splitting the remove/add feels a bit dangerous, since forgetting to re-do the add (or e.g. some error causing that to be skipped) could cause cache_count to become inconsistent.
The function name "find" is a bit misleading. cache_find_and_remove() would make its semantics more obvious. Or, simply put the list_del() somewhere else; it doesn't feel like part of a "find" operation to me. Or, put the entire move-to-front operation inside this function so it isn't split up - if so, cache_find_and_move_to_head().
+static struct block_cache_node *cache_find
- (int iftype, int devnum,
Nit: the ( and first n arguments shouldn't be wrapped.
+int cache_block_read(int iftype, int devnum,
lbaint_t start, lbaint_t blkcnt,
unsigned long blksz, void *buffer)
memcpy(buffer, src, blksz*blkcnt);
Nit: Space around operators. checkpatch should catch this.
if (trace)
printf("hit: start " LBAF ", count " LBAFU "\n",
start, blkcnt);
I guess I'll find that trace can be adjusted at run-time somewhere later in this patch, but it's more typical to use debug() without the if statement for this. It would be awesome if debug() could be adjusted at run-time...
+void cache_block_fill(int iftype, int devnum,
...
- if (node->cache == 0) {
node->cache = malloc(bytes);
if (!node->cache) {
Nit: may as well be consistent with those NULL checks.
+void cache_block_invalidate(int iftype, int devnum)
...
+void cache_block_invalidate_all(void)
There's no invalidate_blocks(); I imagine that writing-to/erasing (some blocks of) a device must call cache_block_invalidate() which will wipe out even data that wasn't written.
+static void dump_block_cache(void)
...
- list_for_each_safe(entry, n, &block_cache) {
Nit: This doesn't need to be _safe since the list is not being modified.
+static int do_blkcache(cmd_tbl_t *cmdtp, int flag,
int argc, char * const argv[])
+{
- if ((argc == 1) || !strcmp("show", argv[1])) {
printf("block cache:\n"
"%u\thits\n"
"%u\tmisses\n"
"%u\tentries in cache\n"
"trace %s\n"
"max blocks/entry %lu\n"
"max entries %lu\n",
block_cache_hits, block_cache_misses, cache_count,
trace ? "on" : "off",
max_blocks_per_entry, max_cache_entries);
Nit: Let's print the value and "label" in a consistent order. I would suggest everything being formatted as:
" description: %u"
so that it's indented, has a delimiter between description and value, and so that irrespective of the length of the converted value, the indentation of the other parts of the string don't change (\t doesn't guarantee this after a certain number length).
I would rather have expected "show" to dump the entries too, but I suppose it's fine that they're separate.
- } else if ((argc == 2) && ('c' == argv[1][0])) {
Nit: the value of 'c' is always 'c', so it's pointless to test whether it's equal to something. The value being compared is arv[1][0], so the parameters to == should be swapped. I'm aware that == is mathematically commutative, but typical English reading of the operator is "is equal to" which has non-commutative implications re: what is being tested. I'm also aware that this construct is used to trigger compiler warnings if someone types = instead of ==. However, (a) you didn't and (b) compilers warn about this these days, so there's no need to write unusual code to get that feature anymore.
I worry that being imprecise in command-line parsing (i.e. treating both "crud" and "clear" as the same thing) will lead to problems expanding the command-line format in the future; we won't be able to ever add another option starting with "c" and maintain the same abbreviation semantics. I'd suggest requiring the full option name always.
- } else if ((argc == 3) && isdigit(argv[1][0]) && isdigit(argv[2][0])) {
Let's make this "blkcache set" or "blkcache configure" so that other sub-commands can take numeric parameters in the future, and have consistent command-line syntax.
+U_BOOT_CMD(
- blkcache, 3, 0, do_blkcache,
- "block cache control/statistics",
- "[show] - show statistics\n"
- "blkcache c[lear] - clear statistics\n"
- "blkcache d[ump] - dump cache entries\n"
- "blkcache i[nvalidate] - invalidate cache\n"
- "blkcache #maxblocks #maxentries\n"
- "\tset maximum blocks per cache entry, maximum entries\n"
- "blkcache t[race] [off] - enable (disable) tracing"
+);
BTW, isn't there some support in U-Boot for sub-commands already, so you can write a separate function per sub-command and skip writing all the strcmp logic to select between them? I'm pretty sure I saw that somewhere.
diff --git a/include/part.h b/include/part.h
+static inline void cache_block_invalidate_all(void){}
Is that useful outside of the debug commands?

Hi Stephen,
Thanks again for the detailed review.
On 03/23/2016 10:22 AM, Stephen Warren wrote:
On 03/20/2016 07:45 PM, Eric Nelson wrote:
Add a block device cache to speed up repeated reads of block devices by various filesystems.
This small amount of cache can dramatically speed up filesystem operations by skipping repeated reads of common areas of a block device (typically directory structures).
This has shown to have some benefit on FAT filesystem operations of loading a kernel and RAM disk, but more dramatic benefits on ext4 filesystems when the kernel and/or RAM disk are spread across multiple extent header structures as described in commit fc0fc50.
The cache is implemented through a minimal list (block_cache) maintained in most-recently-used order and count of the current number of entries (cache_count). It uses a maximum block count setting to prevent copies of large block reads and an upper bound on the number of cached areas.
The maximum number of entries in the cache defaults to 4 and the maximum number of blocks per cache entry has a default of 1, which matches the current implementation of at least FAT and ext4 that only read a single block at a time.
If the maximum size of the cache is fixed and small (which judging by the description it is), why not use an array rather than a linked list. That would be far simpler to manage.
You seem to agree with Marek, and I must be dain-bramaged because I just don't see it.
The 'blkcache' command (enabled through CONFIG_CMD_BLOCK_CACHE) allows changing these values and can be used to tune for a particular filesystem layout.
Even with this dynamic adjustment, using an array still feels simpler, although I have't looked at the code yet.
diff --git a/drivers/block/cache_block.c b/drivers/block/cache_block.c
+/*
- search for a set of blocks in the cache
- remove and return the node if found so it can be
- added back at the start of the list and maintain MRU order)
- */
Splitting the remove/add feels a bit dangerous, since forgetting to re-do the add (or e.g. some error causing that to be skipped) could cause cache_count to become inconsistent.
The function name "find" is a bit misleading. cache_find_and_remove() would make its semantics more obvious. Or, simply put the list_del() somewhere else; it doesn't feel like part of a "find" operation to me. Or, put the entire move-to-front operation inside this function so it isn't split up - if so, cache_find_and_move_to_head().
You're right. I'll look at that for a V3 (hopefully non-RFC) patch set.
+static struct block_cache_node *cache_find
- (int iftype, int devnum,
Nit: the ( and first n arguments shouldn't be wrapped.
+int cache_block_read(int iftype, int devnum,
lbaint_t start, lbaint_t blkcnt,
unsigned long blksz, void *buffer)
memcpy(buffer, src, blksz*blkcnt);
Nit: Space around operators. checkpatch should catch this.
Thanks.
if (trace)
printf("hit: start " LBAF ", count " LBAFU "\n",
start, blkcnt);
I guess I'll find that trace can be adjusted at run-time somewhere later in this patch, but it's more typical to use debug() without the if statement for this. It would be awesome if debug() could be adjusted at run-time...
I wanted to keep this as a run-time thing because it's useful in tuning things and I'm not yet certain if/when users might need to update the sizes.
I could wrap these with some #ifdef stuff but I'm not certain which is the right thing to do.
+void cache_block_fill(int iftype, int devnum,
...
- if (node->cache == 0) {
node->cache = malloc(bytes);
if (!node->cache) {
Nit: may as well be consistent with those NULL checks.
Yep.
+void cache_block_invalidate(int iftype, int devnum)
...
+void cache_block_invalidate_all(void)
There's no invalidate_blocks(); I imagine that writing-to/erasing (some blocks of) a device must call cache_block_invalidate() which will wipe out even data that wasn't written.
Right. I figured that this wasn't worth extra code.
The 99.99% use case for U-Boot is read-only, so optimizing this just seems like bloat.
+static void dump_block_cache(void)
...
- list_for_each_safe(entry, n, &block_cache) {
Nit: This doesn't need to be _safe since the list is not being modified.
Good catch.
+static int do_blkcache(cmd_tbl_t *cmdtp, int flag,
int argc, char * const argv[])
+{
- if ((argc == 1) || !strcmp("show", argv[1])) {
printf("block cache:\n"
"%u\thits\n"
"%u\tmisses\n"
"%u\tentries in cache\n"
"trace %s\n"
"max blocks/entry %lu\n"
"max entries %lu\n",
block_cache_hits, block_cache_misses, cache_count,
trace ? "on" : "off",
max_blocks_per_entry, max_cache_entries);
Nit: Let's print the value and "label" in a consistent order. I would suggest everything being formatted as:
" description: %u"
so that it's indented, has a delimiter between description and value, and so that irrespective of the length of the converted value, the indentation of the other parts of the string don't change (\t doesn't guarantee this after a certain number length).
Thanks.
I would rather have expected "show" to dump the entries too, but I suppose it's fine that they're separate.
While testing, I found myself mostly using 'show' to see results.
I added dump as a debug tool and almost removed it, but figured this was still an RFC, so I left it in.
- } else if ((argc == 2) && ('c' == argv[1][0])) {
Nit: the value of 'c' is always 'c', so it's pointless to test whether it's equal to something. The value being compared is arv[1][0], so the parameters to == should be swapped. I'm aware that == is mathematically commutative, but typical English reading of the operator is "is equal to" which has non-commutative implications re: what is being tested. I'm also aware that this construct is used to trigger compiler warnings if someone types = instead of ==. However, (a) you didn't and (b) compilers warn about this these days, so there's no need to write unusual code to get that feature anymore.
:) muscle-memory again.
It's usually Marek that catches my yoda-expressions, which are usually tests against zero.
I worry that being imprecise in command-line parsing (i.e. treating both "crud" and "clear" as the same thing) will lead to problems expanding the command-line format in the future; we won't be able to ever add another option starting with "c" and maintain the same abbreviation semantics. I'd suggest requiring the full option name always.
- } else if ((argc == 3) && isdigit(argv[1][0]) &&
isdigit(argv[2][0])) {
Let's make this "blkcache set" or "blkcache configure" so that other sub-commands can take numeric parameters in the future, and have consistent command-line syntax.
Sounds good to me.
+U_BOOT_CMD(
- blkcache, 3, 0, do_blkcache,
- "block cache control/statistics",
- "[show] - show statistics\n"
- "blkcache c[lear] - clear statistics\n"
- "blkcache d[ump] - dump cache entries\n"
- "blkcache i[nvalidate] - invalidate cache\n"
- "blkcache #maxblocks #maxentries\n"
- "\tset maximum blocks per cache entry, maximum entries\n"
- "blkcache t[race] [off] - enable (disable) tracing"
+);
BTW, isn't there some support in U-Boot for sub-commands already, so you can write a separate function per sub-command and skip writing all the strcmp logic to select between them? I'm pretty sure I saw that somewhere.
There is, but I haven't (yet) used it. I'll take a look at this in V3.
diff --git a/include/part.h b/include/part.h
+static inline void cache_block_invalidate_all(void){}
Is that useful outside of the debug commands?
Not at the moment, and it probably won't be needed when I figure out how this glues to the block layer and/or drivers in a cleaner way.
Regards,
Eric

Allow the selection of CONFIG_BLOCK_CACHE, CONFIG_CMD_BLOCK_CACHE using menuconfig.
Signed-off-by: Eric Nelson eric@nelint.com --- drivers/block/Kconfig | 19 +++++++++++++++++++ 1 file changed, 19 insertions(+)
diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig index f35c4d4..6529efb 100644 --- a/drivers/block/Kconfig +++ b/drivers/block/Kconfig @@ -18,3 +18,22 @@ config DISK types can use this, such as AHCI/SATA. It does not provide any standard operations at present. The block device interface has not been converted to driver model. + +config BLOCK_CACHE + bool "Use block device cache" + default n + help + This option enables a disk-block cache for all block devices. + This is most useful when accessing filesystems under U-Boot since + it will prevent repeated reads from directory structures. + +config CMD_BLOCK_CACHE + bool "Include block device cache control command (blkcache)" + depends on BLOCK_CACHE + default y if BLOCK_CACHE + help + Enable the blkcache command, which can be used to control the + operation of the cache functions. + This is most useful when fine-tuning the operation of the cache + during development, but also allows the cache to be disabled when + it might hurt performance (e.g. when using the ums command).

On 03/23/2016 10:24 AM, Stephen Warren wrote:
On 03/20/2016 07:45 PM, Eric Nelson wrote:
Allow the selection of CONFIG_BLOCK_CACHE, CONFIG_CMD_BLOCK_CACHE using menuconfig.
I think this should be part of patch 1.
Works for me, especially as it adds some documentation in the form of help text.

Signed-off-by: Eric Nelson eric@nelint.com --- drivers/mmc/mmc.c | 10 +++++++++- drivers/mmc/mmc_write.c | 7 +++++++ 2 files changed, 16 insertions(+), 1 deletion(-)
diff --git a/drivers/mmc/mmc.c b/drivers/mmc/mmc.c index 8b2e606..956f4e1 100644 --- a/drivers/mmc/mmc.c +++ b/drivers/mmc/mmc.c @@ -6,7 +6,6 @@ * * SPDX-License-Identifier: GPL-2.0+ */ - #include <config.h> #include <common.h> #include <command.h> @@ -240,6 +239,8 @@ static ulong mmc_bread(struct blk_desc *block_dev, lbaint_t start, int dev_num = block_dev->devnum; int err; lbaint_t cur, blocks_todo = blkcnt; + void *outbuf = dst; + lbaint_t outblk = start;
if (blkcnt == 0) return 0; @@ -260,6 +261,10 @@ static ulong mmc_bread(struct blk_desc *block_dev, lbaint_t start, return 0; }
+ if (cache_block_read(IF_TYPE_MMC, dev_num, start, blkcnt, + mmc->read_bl_len, dst)) + return blkcnt; + if (mmc_set_blocklen(mmc, mmc->read_bl_len)) { debug("%s: Failed to set blocklen\n", __func__); return 0; @@ -277,6 +282,9 @@ static ulong mmc_bread(struct blk_desc *block_dev, lbaint_t start, dst += cur * mmc->read_bl_len; } while (blocks_todo > 0);
+ cache_block_fill(IF_TYPE_MMC, dev_num, outblk, blkcnt, + mmc->read_bl_len, outbuf); + return blkcnt; }
diff --git a/drivers/mmc/mmc_write.c b/drivers/mmc/mmc_write.c index 7b186f8..1c96d29 100644 --- a/drivers/mmc/mmc_write.c +++ b/drivers/mmc/mmc_write.c @@ -12,6 +12,7 @@ #include <part.h> #include <div64.h> #include <linux/math64.h> +#include <part.h> #include "mmc_private.h"
static ulong mmc_erase_t(struct mmc *mmc, ulong start, lbaint_t blkcnt) @@ -20,6 +21,8 @@ static ulong mmc_erase_t(struct mmc *mmc, ulong start, lbaint_t blkcnt) ulong end; int err, start_cmd, end_cmd;
+ cache_block_invalidate(IF_TYPE_MMC, mmc->block_dev.devnum); + if (mmc->high_capacity) { end = start + blkcnt - 1; } else { @@ -82,6 +85,8 @@ unsigned long mmc_berase(struct blk_desc *block_dev, lbaint_t start, if (err < 0) return -1;
+ cache_block_invalidate(IF_TYPE_MMC, dev_num); + /* * We want to see if the requested start or total block count are * unaligned. We discard the whole numbers and only care about the @@ -186,6 +191,8 @@ ulong mmc_bwrite(struct blk_desc *block_dev, lbaint_t start, lbaint_t blkcnt, if (err < 0) return 0;
+ cache_block_invalidate(IF_TYPE_MMC, dev_num); + if (mmc_set_blocklen(mmc, mmc->write_bl_len)) return 0;

On 03/20/2016 07:45 PM, Eric Nelson wrote:
diff --git a/drivers/mmc/mmc.c b/drivers/mmc/mmc.c
@@ -6,7 +6,6 @@
- SPDX-License-Identifier: GPL-2.0+
*/
- #include <config.h>
Unrelated change?
I don't see any cache invalidation call when the SD device is re-initialized.
I think I mentioned these two points last time.

Hi Stephen,
On 03/23/2016 10:27 AM, Stephen Warren wrote:
On 03/20/2016 07:45 PM, Eric Nelson wrote:
diff --git a/drivers/mmc/mmc.c b/drivers/mmc/mmc.c
@@ -6,7 +6,6 @@
- SPDX-License-Identifier: GPL-2.0+
*/
- #include <config.h>
Unrelated change?
Yes. Sorry about that.
I don't see any cache invalidation call when the SD device is re-initialized.
I think I mentioned these two points last time.
You did and I missed it.
Regards,
Eric

On 03/21/2016 02:45 AM, Eric Nelson wrote:
Here's a more full-featured implementation of a cache for block devices that uses a small linked list of cache blocks.
Why do you use linked list ? You have four entries, you can as well use fixed array. Maybe you should implement an adaptive cache would would use the unpopulated malloc area and hash the sector number(s) into that area ?
Experimentation loading a 4.5 MiB kernel from the root directory of a FAT filesystem shows that a single cache entry of a single block is the only
only ... what ? This is where things started to be interesting, but you leave us hanging :)
Loading the same from the /boot directory of an ext4 filesystem shows a benefit with 4 cache entries, though the single biggest benefit is also with the first cache entry:
=> for n in 0 1 2 4 8 ; do
blkc 1 $n ; blkc c ; blkc i ; load mmc 0:2 10008000 /boot/zImage ; done
changed to max of 0 entries of 1 blocks each 4955304 bytes read in 503 ms (9.4 MiB/s) changed to max of 1 entries of 1 blocks each 4955304 bytes read in 284 ms (16.6 MiB/s) changed to max of 2 entries of 1 blocks each 4955304 bytes read in 284 ms (16.6 MiB/s) changed to max of 4 entries of 1 blocks each 4955304 bytes read in 255 ms (18.5 MiB/s) changed to max of 8 entries of 1 blocks each 4955304 bytes read in 255 ms (18.5 MiB/s)
As mentioned earlier in this thread, the modification to the mmc layer should probably be simpler and easier to apply to other block subsystems.
Eric Nelson (3): drivers: block: add block device cache block: add Kconfig options for BLOCK_CACHE, CMD_BLOCK_CACHE mmc: add support for block device cache
drivers/block/Kconfig | 19 ++++ drivers/block/Makefile | 1 + drivers/block/cache_block.c | 240 ++++++++++++++++++++++++++++++++++++++++++++ drivers/mmc/mmc.c | 10 +- drivers/mmc/mmc_write.c | 7 ++ include/part.h | 69 +++++++++++++ 6 files changed, 345 insertions(+), 1 deletion(-) create mode 100644 drivers/block/cache_block.c

Hi Marek,
On 03/20/2016 06:59 PM, Marek Vasut wrote:
On 03/21/2016 02:45 AM, Eric Nelson wrote:
Here's a more full-featured implementation of a cache for block devices that uses a small linked list of cache blocks.
Why do you use linked list ? You have four entries, you can as well use fixed array. Maybe you should implement an adaptive cache would would use the unpopulated malloc area and hash the sector number(s) into that area ?
I was looking for a simple implementation that would allow tweaking of the max entries/size per entry.
We could get higher performance through hashing, but with such a small cache, it's probably not worth extra code.
Using an array and re-allocating on changes to the max entries variable is feasible, but I think it would be slightly more code.
Experimentation loading a 4.5 MiB kernel from the root directory of a FAT filesystem shows that a single cache entry of a single block is the only
only ... what ? This is where things started to be interesting, but you leave us hanging :)
Oops.
... I was planning on re-wording that.
My testing showed no gain in performance (additional cache hits) past a single entry of a single block. This was done on a small (32MiB) partition with a small number of files (~10) and only a single read is skipped.
=> blkc c ; blkc i ; blkc 0 0 ; changed to max of 0 entries of 0 blocks each => load mmc 0 10008000 /zImage reading /zImage 4955304 bytes read in 247 ms (19.1 MiB/s) => blkc block cache: 0 hits 7 misses 0 entries in cache trace off max blocks/entry 0 max entries 0 => blkc c ; blkc i ; blkc 1 1 ; changed to max of 1 entries of 1 blocks each => load mmc 0 10008000 /zImage reading /zImage 4955304 bytes read in 243 ms (19.4 MiB/s) => blkc block cache: 1 hits 6 misses 1 entries in cache trace off max blocks/entry 1 max entries 1
I don't believe that enabling the cache is worth the extra code for this use case.
By comparison, a load of 150 MiB compressed disk image from ext4 showed a 30x speedup with the V1 patch (single block, single entry) from ~150s to 5s.
Without some form of cache, the 150s was long enough to make a user (me) think something is broken.
Regards,
Eric

On 03/21/2016 02:48 PM, Eric Nelson wrote:
Hi Marek,
On 03/20/2016 06:59 PM, Marek Vasut wrote:
On 03/21/2016 02:45 AM, Eric Nelson wrote:
Here's a more full-featured implementation of a cache for block devices that uses a small linked list of cache blocks.
Why do you use linked list ? You have four entries, you can as well use fixed array. Maybe you should implement an adaptive cache would would use the unpopulated malloc area and hash the sector number(s) into that area ?
I was looking for a simple implementation that would allow tweaking of the max entries/size per entry.
We could get higher performance through hashing, but with such a small cache, it's probably not worth extra code.
The hashing function can be a simple modulo on sector number ;-) That'd be less code than linked lists.
Using an array and re-allocating on changes to the max entries variable is feasible, but I think it would be slightly more code.
That would indeed be more code.
Experimentation loading a 4.5 MiB kernel from the root directory of a FAT filesystem shows that a single cache entry of a single block is the only
only ... what ? This is where things started to be interesting, but you leave us hanging :)
Oops.
... I was planning on re-wording that.
My testing showed no gain in performance (additional cache hits) past a single entry of a single block. This was done on a small (32MiB) partition with a small number of files (~10) and only a single read is skipped.
I'd kinda expect that indeed.
=> blkc c ; blkc i ; blkc 0 0 ; changed to max of 0 entries of 0 blocks each => load mmc 0 10008000 /zImage reading /zImage 4955304 bytes read in 247 ms (19.1 MiB/s) => blkc block cache: 0 hits 7 misses 0 entries in cache trace off max blocks/entry 0 max entries 0 => blkc c ; blkc i ; blkc 1 1 ; changed to max of 1 entries of 1 blocks each => load mmc 0 10008000 /zImage reading /zImage 4955304 bytes read in 243 ms (19.4 MiB/s) => blkc block cache: 1 hits 6 misses 1 entries in cache trace off max blocks/entry 1 max entries 1
I don't believe that enabling the cache is worth the extra code for this use case.
By comparison, a load of 150 MiB compressed disk image from ext4 showed a 30x speedup with the V1 patch (single block, single entry) from ~150s to 5s.
Without some form of cache, the 150s was long enough to make a user (me) think something is broken.
I'm obviously loving this improvement.
Best regards, Marek Vasut

Hi Marek,
On 03/21/2016 09:49 AM, Marek Vasut wrote:
On 03/21/2016 02:48 PM, Eric Nelson wrote:
On 03/20/2016 06:59 PM, Marek Vasut wrote:
On 03/21/2016 02:45 AM, Eric Nelson wrote:
Here's a more full-featured implementation of a cache for block devices that uses a small linked list of cache blocks.
Why do you use linked list ? You have four entries, you can as well use fixed array. Maybe you should implement an adaptive cache would would use the unpopulated malloc area and hash the sector number(s) into that area ?
I was looking for a simple implementation that would allow tweaking of the max entries/size per entry.
We could get higher performance through hashing, but with such a small cache, it's probably not worth extra code.
The hashing function can be a simple modulo on sector number ;-) That'd be less code than linked lists.
I'm not seeing how.
I'm going look first at a better way to integrate than the approach taken in patch 3.
Using an array and re-allocating on changes to the max entries variable is feasible, but I think it would be slightly more code.
That would indeed be more code.
Experimentation loading a 4.5 MiB kernel from the root directory of a FAT filesystem shows that a single cache entry of a single block is the only
only ... what ? This is where things started to be interesting, but you leave us hanging :)
Oops.
... I was planning on re-wording that.
My testing showed no gain in performance (additional cache hits) past a single entry of a single block. This was done on a small (32MiB) partition with a small number of files (~10) and only a single read is skipped.
I'd kinda expect that indeed.
Yeah, and the single-digit-ms improvement isn't worth much.
=> blkc c ; blkc i ; blkc 0 0 ; changed to max of 0 entries of 0 blocks each => load mmc 0 10008000 /zImage reading /zImage 4955304 bytes read in 247 ms (19.1 MiB/s) => blkc block cache: 0 hits 7 misses 0 entries in cache trace off max blocks/entry 0 max entries 0 => blkc c ; blkc i ; blkc 1 1 ; changed to max of 1 entries of 1 blocks each => load mmc 0 10008000 /zImage reading /zImage 4955304 bytes read in 243 ms (19.4 MiB/s) => blkc block cache: 1 hits 6 misses 1 entries in cache trace off max blocks/entry 1 max entries 1
I don't believe that enabling the cache is worth the extra code for this use case.
By comparison, a load of 150 MiB compressed disk image from ext4 showed a 30x speedup with the V1 patch (single block, single entry) from ~150s to 5s.
Without some form of cache, the 150s was long enough to make a user (me) think something is broken.
I'm obviously loving this improvement.
Glad to hear it.

On 03/21/2016 06:56 PM, Eric Nelson wrote:
Hi Marek,
Hi!
On 03/21/2016 09:49 AM, Marek Vasut wrote:
On 03/21/2016 02:48 PM, Eric Nelson wrote:
On 03/20/2016 06:59 PM, Marek Vasut wrote:
On 03/21/2016 02:45 AM, Eric Nelson wrote:
Here's a more full-featured implementation of a cache for block devices that uses a small linked list of cache blocks.
Why do you use linked list ? You have four entries, you can as well use fixed array. Maybe you should implement an adaptive cache would would use the unpopulated malloc area and hash the sector number(s) into that area ?
I was looking for a simple implementation that would allow tweaking of the max entries/size per entry.
We could get higher performance through hashing, but with such a small cache, it's probably not worth extra code.
The hashing function can be a simple modulo on sector number ;-) That'd be less code than linked lists.
I'm not seeing how.
I'm going look first at a better way to integrate than the approach taken in patch 3.
I will dive in the code itself and comment if applicable.
Using an array and re-allocating on changes to the max entries variable is feasible, but I think it would be slightly more code.
That would indeed be more code.
Experimentation loading a 4.5 MiB kernel from the root directory of a FAT filesystem shows that a single cache entry of a single block is the only
only ... what ? This is where things started to be interesting, but you leave us hanging :)
Oops.
... I was planning on re-wording that.
My testing showed no gain in performance (additional cache hits) past a single entry of a single block. This was done on a small (32MiB) partition with a small number of files (~10) and only a single read is skipped.
I'd kinda expect that indeed.
Yeah, and the single-digit-ms improvement isn't worth much.
=> blkc c ; blkc i ; blkc 0 0 ; changed to max of 0 entries of 0 blocks each => load mmc 0 10008000 /zImage reading /zImage 4955304 bytes read in 247 ms (19.1 MiB/s) => blkc block cache: 0 hits 7 misses 0 entries in cache trace off max blocks/entry 0 max entries 0 => blkc c ; blkc i ; blkc 1 1 ; changed to max of 1 entries of 1 blocks each => load mmc 0 10008000 /zImage reading /zImage 4955304 bytes read in 243 ms (19.4 MiB/s) => blkc block cache: 1 hits 6 misses 1 entries in cache trace off max blocks/entry 1 max entries 1
I don't believe that enabling the cache is worth the extra code for this use case.
By comparison, a load of 150 MiB compressed disk image from ext4 showed a 30x speedup with the V1 patch (single block, single entry) from ~150s to 5s.
Without some form of cache, the 150s was long enough to make a user (me) think something is broken.
I'm obviously loving this improvement.
Glad to hear it.
:)

Patch 1 is the block cache implementation.
Patches 2 and 3 update the mmc and sata commands to use the new blk_dread/dwrite/derase routines and can be applied separately.
Notable changes since the V2 RFC patch set include: - renaming of files and routines to use "blkcache" rather than "cache_block" to make things consistent with the command name.
- implemented through the blk_dread/dwrite/derase routines in blk.h instead of drivers/mmc.
- added detection of device (re)initialization in disk/part.c
- changed the default max blocks/entry and entries to 2 and 32 respectively based on further testing.
- changed to sub-command implementation as suggested by Stephen Warren.
- changed interface to cache_find per Stephen's suggestion
I tested with and without DM and compile-tested with CONFIG_BLK enabled, but only on MMC and USB devices.
As a teaser, here are some performance numbers loading a 5MiB file from SD card.
With cache: => blkc max 2 32 changed to max of 32 entries of 2 blocks each => load mmc 0:2 10008000 /usr/lib/libGAL.so 5042083 bytes read in 251 ms (19.2 MiB/s)
Without cache: => blkc max 0 0 changed to max of 0 entries of 0 blocks each => load mmc 0:2 10008000 /usr/lib/libGAL.so 5042083 bytes read in 4638 ms (1 MiB/s)
Eric Nelson (3): drivers: block: add block device cache mmc: force mmc command to go through block layer sata: force sata reads and writes to go through block layer
cmd/mmc.c | 7 +- cmd/sata.c | 6 +- disk/part.c | 2 + drivers/block/Kconfig | 20 ++++ drivers/block/Makefile | 1 + drivers/block/blk-uclass.c | 13 +- drivers/block/blkcache.c | 293 +++++++++++++++++++++++++++++++++++++++++++++ include/blk.h | 79 +++++++++++- 8 files changed, 414 insertions(+), 7 deletions(-) create mode 100644 drivers/block/blkcache.c

Add a block device cache to speed up repeated reads of block devices by various filesystems.
This small amount of cache can dramatically speed up filesystem operations by skipping repeated reads of common areas of a block device (typically directory structures).
This has shown to have some benefit on FAT filesystem operations of loading a kernel and RAM disk, but more dramatic benefits on ext4 filesystems when the kernel and/or RAM disk are spread across multiple extent header structures as described in commit fc0fc50.
The cache is implemented through a minimal list (block_cache) maintained in most-recently-used order and count of the current number of entries (cache_count). It uses a maximum block count setting to prevent copies of large block reads and an upper bound on the number of cached areas.
The maximum number of entries in the cache defaults to 32 and the maximum number of blocks per cache entry has a default of 2, which has shown to produce the best results on testing of ext4 and FAT filesystems.
The 'blkcache' command (enabled through CONFIG_CMD_BLOCK_CACHE) allows changing these values and can be used to tune for a particular filesystem layout.
Signed-off-by: Eric Nelson eric@nelint.com --- disk/part.c | 2 + drivers/block/Kconfig | 20 ++++ drivers/block/Makefile | 1 + drivers/block/blk-uclass.c | 13 +- drivers/block/blkcache.c | 293 +++++++++++++++++++++++++++++++++++++++++++++ include/blk.h | 79 +++++++++++- 6 files changed, 406 insertions(+), 2 deletions(-) create mode 100644 drivers/block/blkcache.c
diff --git a/disk/part.c b/disk/part.c index 67d98fe..0aff954 100644 --- a/disk/part.c +++ b/disk/part.c @@ -268,6 +268,8 @@ void part_init(struct blk_desc *dev_desc) const int n_ents = ll_entry_count(struct part_driver, part_driver); struct part_driver *entry;
+ blkcache_invalidate(dev_desc->if_type, dev_desc->devnum); + dev_desc->part_type = PART_TYPE_UNKNOWN; for (entry = drv; entry != drv + n_ents; entry++) { int ret; diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig index f35c4d4..0209b95 100644 --- a/drivers/block/Kconfig +++ b/drivers/block/Kconfig @@ -18,3 +18,23 @@ config DISK types can use this, such as AHCI/SATA. It does not provide any standard operations at present. The block device interface has not been converted to driver model. + +config BLOCK_CACHE + bool "Use block device cache" + default n + help + This option enables a disk-block cache for all block devices. + This is most useful when accessing filesystems under U-Boot since + it will prevent repeated reads from directory structures and other + filesystem data structures. + +config CMD_BLOCK_CACHE + bool "Include block device cache control command (blkcache)" + depends on BLOCK_CACHE + default y if BLOCK_CACHE + help + Enable the blkcache command, which can be used to control the + operation of the cache functions. + This is most useful when fine-tuning the operation of the cache + during development, but also allows the cache to be disabled when + it might hurt performance (e.g. when using the ums command). diff --git a/drivers/block/Makefile b/drivers/block/Makefile index b5c7ae1..b4cbb09 100644 --- a/drivers/block/Makefile +++ b/drivers/block/Makefile @@ -24,3 +24,4 @@ obj-$(CONFIG_IDE_SIL680) += sil680.o obj-$(CONFIG_SANDBOX) += sandbox.o obj-$(CONFIG_SCSI_SYM53C8XX) += sym53c8xx.o obj-$(CONFIG_SYSTEMACE) += systemace.o +obj-$(CONFIG_BLOCK_CACHE) += blkcache.o diff --git a/drivers/block/blk-uclass.c b/drivers/block/blk-uclass.c index 49df2a6..617db22 100644 --- a/drivers/block/blk-uclass.c +++ b/drivers/block/blk-uclass.c @@ -80,11 +80,20 @@ unsigned long blk_dread(struct blk_desc *block_dev, lbaint_t start, { struct udevice *dev = block_dev->bdev; const struct blk_ops *ops = blk_get_ops(dev); + ulong blks_read;
if (!ops->read) return -ENOSYS;
- return ops->read(dev, start, blkcnt, buffer); + if (blkcache_read(block_dev->if_type, block_dev->devnum, + start, blkcnt, block_dev->blksz, buffer)) + return blkcnt; + blks_read = ops->read(dev, start, blkcnt, buffer); + if (blks_read == blkcnt) + blkcache_fill(block_dev->if_type, block_dev->devnum, + start, blkcnt, block_dev->blksz, buffer); + + return blks_read; }
unsigned long blk_dwrite(struct blk_desc *block_dev, lbaint_t start, @@ -96,6 +105,7 @@ unsigned long blk_dwrite(struct blk_desc *block_dev, lbaint_t start, if (!ops->write) return -ENOSYS;
+ blkcache_invalidate(block_dev->if_type, block_dev->devnum); return ops->write(dev, start, blkcnt, buffer); }
@@ -108,6 +118,7 @@ unsigned long blk_derase(struct blk_desc *block_dev, lbaint_t start, if (!ops->erase) return -ENOSYS;
+ blkcache_invalidate(block_dev->if_type, block_dev->devnum); return ops->erase(dev, start, blkcnt); }
diff --git a/drivers/block/blkcache.c b/drivers/block/blkcache.c new file mode 100644 index 0000000..125e1e3 --- /dev/null +++ b/drivers/block/blkcache.c @@ -0,0 +1,293 @@ +/* + * Copyright (C) Nelson Integration, LLC 2016 + * Author: Eric Nelsoneric@nelint.com + * + * SPDX-License-Identifier: GPL-2.0+ + * + */ +#include <config.h> +#include <common.h> +#include <malloc.h> +#include <part.h> +#include <linux/ctype.h> +#include <linux/list.h> + +struct block_cache_node { + struct list_head lh; + int iftype; + int devnum; + lbaint_t start; + lbaint_t blkcnt; + unsigned long blksz; + char *cache; +}; + +static LIST_HEAD(block_cache); +static unsigned cache_count; + +static unsigned long max_blocks_per_entry = 2; +static unsigned long max_cache_entries = 32; + +static unsigned block_cache_misses; +static unsigned block_cache_hits; + +static int trace; + +static struct block_cache_node *cache_find(int iftype, int devnum, + lbaint_t start, lbaint_t blkcnt, + unsigned long blksz) +{ + struct block_cache_node *node; + + list_for_each_entry(node, &block_cache, lh) + if ((node->iftype == iftype) && + (node->devnum == devnum) && + (node->blksz == blksz) && + (node->start <= start) && + (node->start + node->blkcnt >= start + blkcnt)) { + if (block_cache.next != &node->lh) { + /* maintain MRU ordering */ + list_del(&node->lh); + list_add(&node->lh, &block_cache); + } + return node; + } + return 0; +} + +int blkcache_read(int iftype, int devnum, + lbaint_t start, lbaint_t blkcnt, + unsigned long blksz, void *buffer) +{ + struct block_cache_node *node = cache_find(iftype, devnum, start, + blkcnt, blksz); + if (node) { + const char *src = node->cache + (start - node->start) * blksz; + memcpy(buffer, src, blksz * blkcnt); + if (trace) + printf("hit: start " LBAF ", count " LBAFU "\n", + start, blkcnt); + ++block_cache_hits; + return 1; + } + + if (trace) + printf("miss: start " LBAF ", count " LBAFU "\n", + start, blkcnt); + ++block_cache_misses; + return 0; +} + +void blkcache_fill(int iftype, int devnum, + lbaint_t start, lbaint_t blkcnt, + unsigned long blksz, void const *buffer) +{ + lbaint_t bytes; + struct block_cache_node *node; + + /* don't cache big stuff */ + if (blkcnt > max_blocks_per_entry) + return; + + if (max_cache_entries == 0) + return; + + bytes = blksz * blkcnt; + if (max_cache_entries <= cache_count) { + /* pop LRU */ + node = (struct block_cache_node *)block_cache.prev; + list_del(&node->lh); + cache_count--; + if (trace) + printf("drop: start " LBAF ", count " LBAFU "\n", + node->start, node->blkcnt); + if (node->blkcnt * node->blksz < bytes) { + free(node->cache); + node->cache = 0; + } + } else { + node = malloc(sizeof(*node)); + if (!node) + return; + node->cache = 0; + } + + if (!node->cache) { + node->cache = malloc(bytes); + if (!node->cache) { + free(node); + return; + } + } + + if (trace) + printf("fill: start " LBAF ", count " LBAFU "\n", + start, blkcnt); + + node->iftype = iftype; + node->devnum = devnum; + node->start = start; + node->blkcnt = blkcnt; + node->blksz = blksz; + memcpy(node->cache, buffer, bytes); + list_add(&node->lh, &block_cache); + cache_count++; +} + +void blkcache_invalidate(int iftype, int devnum) +{ + struct list_head *entry, *n; + struct block_cache_node *node; + + list_for_each_safe(entry, n, &block_cache) { + node = (struct block_cache_node *)entry; + if ((node->iftype == iftype) && + (node->devnum == devnum)) { + list_del(entry); + free(node->cache); + free(node); + --cache_count; + } + } +} + +#ifdef CONFIG_CMD_BLOCK_CACHE + +static int blkc_show(cmd_tbl_t *cmdtp, int flag, + int argc, char * const argv[]) +{ + printf(" hits: %u\n" + " misses: %u\n" + " entries: %u\n" + " trace: %s\n" + " max blocks/entry: %lu\n" + " max cache entries: %lu\n", + block_cache_hits, block_cache_misses, cache_count, + trace ? "on" : "off", + max_blocks_per_entry, max_cache_entries); + return 0; +} + +static int blkc_clear(cmd_tbl_t *cmdtp, int flag, + int argc, char * const argv[]) +{ + block_cache_hits = 0; + block_cache_misses = 0; + return 0; +} + +static int blkc_dump(cmd_tbl_t *cmdtp, int flag, + int argc, char * const argv[]) +{ + struct block_cache_node *node; + int i = 0; + + list_for_each_entry(node, &block_cache, lh) { + printf("----- cache entry[%d]\n", i++); + printf("iftype: %d\n", node->iftype); + printf("devnum: %d\n", node->devnum); + printf("blksize: " LBAFU "\n", node->blksz); + printf("start: " LBAF "\n", node->start); + printf("count: " LBAFU "\n", node->blkcnt); + } + return 0; +} + +static int blkc_invalidate(cmd_tbl_t *cmdtp, int flag, + int argc, char * const argv[]) +{ + struct list_head *entry, *n; + struct block_cache_node *node; + + list_for_each_safe(entry, n, &block_cache) { + node = (struct block_cache_node *)entry; + list_del(entry); + free(node->cache); + free(node); + } + + cache_count = 0; + + return 0; +} + +static int blkc_max(cmd_tbl_t *cmdtp, int flag, + int argc, char * const argv[]) +{ + if (argc != 3) + return CMD_RET_USAGE; + + max_blocks_per_entry = simple_strtoul(argv[1], 0, 0); + max_cache_entries = simple_strtoul(argv[2], 0, 0); + blkc_invalidate(cmdtp, flag, argc, argv); + printf("changed to max of %lu entries of %lu blocks each\n", + max_cache_entries, max_blocks_per_entry); + return 0; +} + +static int blkc_trace(cmd_tbl_t *cmdtp, int flag, + int argc, char * const argv[]) +{ + if ((argc == 2) && !strcmp("off", argv[1])) + trace = 0; + else + trace = 1; + return 0; +} + +static cmd_tbl_t cmd_blkc_sub[] = { + U_BOOT_CMD_MKENT(show, 0, 0, blkc_show, "", ""), + U_BOOT_CMD_MKENT(clear, 0, 0, blkc_clear, "", ""), + U_BOOT_CMD_MKENT(dump, 0, 0, blkc_dump, "", ""), + U_BOOT_CMD_MKENT(invalidate, 0, 0, blkc_invalidate, "", ""), + U_BOOT_CMD_MKENT(max, 3, 0, blkc_max, "", ""), + U_BOOT_CMD_MKENT(trace, 2, 0, blkc_trace, "", ""), +}; + +static __maybe_unused void blkc_reloc(void) +{ + static int relocated; + + if (!relocated) { + fixup_cmdtable(cmd_blkc_sub, ARRAY_SIZE(cmd_blkc_sub)); + relocated = 1; + }; +} + +static int do_blkcache(cmd_tbl_t *cmdtp, int flag, + int argc, char * const argv[]) +{ + cmd_tbl_t *c; + +#ifdef CONFIG_NEEDS_MANUAL_RELOC + blkc_reloc(); +#endif + if (argc < 2) + return CMD_RET_USAGE; + + /* Strip off leading 'i2c' command argument */ + argc--; + argv++; + + c = find_cmd_tbl(argv[0], &cmd_blkc_sub[0], ARRAY_SIZE(cmd_blkc_sub)); + + if (c) + return c->cmd(cmdtp, flag, argc, argv); + else + return CMD_RET_USAGE; + + return 0; +} + +U_BOOT_CMD( + blkcache, 4, 0, do_blkcache, + "block cache diagnostics and control", + "show - show statistics\n" + "blkcache clear - clear statistics\n" + "blkcache invalidate - invalidate cache\n" + "blkcache max blocks entries - set maximums\n" + "blkcache dump - dump cache entries\n" + "blkcache trace [off] - enable (disable) tracing" +); + +#endif diff --git a/include/blk.h b/include/blk.h index e83c144..aa70f72 100644 --- a/include/blk.h +++ b/include/blk.h @@ -83,6 +83,71 @@ struct blk_desc { #define PAD_TO_BLOCKSIZE(size, blk_desc) \ (PAD_SIZE(size, blk_desc->blksz))
+#ifdef CONFIG_BLOCK_CACHE +/** + * blkcache_read() - attempt to read a set of blocks from cache + * + * @param iftype - IF_TYPE_x for type of device + * @param dev - device index of particular type + * @param start - starting block number + * @param blkcnt - number of blocks to read + * @param blksz - size in bytes of each block + * @param buf - buffer to contain cached data + * + * @return - '1' if block returned from cache, '0' otherwise. + */ +int blkcache_read + (int iftype, int dev, + lbaint_t start, lbaint_t blkcnt, + unsigned long blksz, void *buffer); + +/** + * blkcache_fill() - make data read from a block device available + * to the block cache + * + * @param iftype - IF_TYPE_x for type of device + * @param dev - device index of particular type + * @param start - starting block number + * @param blkcnt - number of blocks available + * @param blksz - size in bytes of each block + * @param buf - buffer containing data to cache + * + */ +void blkcache_fill + (int iftype, int dev, + lbaint_t start, lbaint_t blkcnt, + unsigned long blksz, void const *buffer); + +/** + * blkcache_invalidate() - discard the cache for a set of blocks + * because of a write or device (re)initialization. + * + * @param iftype - IF_TYPE_x for type of device + * @param dev - device index of particular type + */ +void blkcache_invalidate + (int iftype, int dev); + +#else + +static inline int blkcache_read + (int iftype, int dev, + lbaint_t start, lbaint_t blkcnt, + unsigned long blksz, void *buffer) +{ + return 0; +} + +static inline void blkcache_fill + (int iftype, int dev, + lbaint_t start, lbaint_t blkcnt, + unsigned long blksz, void const *buffer) {} + +static inline void blkcache_invalidate + (int iftype, int dev) {} + +#endif + #ifdef CONFIG_BLK struct udevice;
@@ -224,23 +289,35 @@ int blk_unbind_all(int if_type); static inline ulong blk_dread(struct blk_desc *block_dev, lbaint_t start, lbaint_t blkcnt, void *buffer) { + ulong blks_read; + if (blkcache_read(block_dev->if_type, block_dev->devnum, + start, blkcnt, block_dev->blksz, buffer)) + return blkcnt; + /* * We could check if block_read is NULL and return -ENOSYS. But this * bloats the code slightly (cause some board to fail to build), and * it would be an error to try an operation that does not exist. */ - return block_dev->block_read(block_dev, start, blkcnt, buffer); + blks_read = block_dev->block_read(block_dev, start, blkcnt, buffer); + if (blks_read == blkcnt) + blkcache_fill(block_dev->if_type, block_dev->devnum, + start, blkcnt, block_dev->blksz, buffer); + + return blks_read; }
static inline ulong blk_dwrite(struct blk_desc *block_dev, lbaint_t start, lbaint_t blkcnt, const void *buffer) { + blkcache_invalidate(block_dev->if_type, block_dev->devnum); return block_dev->block_write(block_dev, start, blkcnt, buffer); }
static inline ulong blk_derase(struct blk_desc *block_dev, lbaint_t start, lbaint_t blkcnt) { + blkcache_invalidate(block_dev->if_type, block_dev->devnum); return block_dev->block_erase(block_dev, start, blkcnt); } #endif /* !CONFIG_BLK */

On Sun, Mar 27, 2016 at 12:00:13PM -0700, Eric Nelson wrote:
+++ b/drivers/block/blkcache.c
[snip]
+static int trace;
I see where you can toggle this at run-time. But do we really think that this will be useful outside of debug? My first reaction is that we should move the trace stuff into debug() statements instead.
[snip]
+#ifdef CONFIG_CMD_BLOCK_CACHE
Please split the command code into cmd/blkcache.c. And yes, this might require thinking harder about what to expose or making an API for some of it. I'm also not sure that's a bad thing as tuning the cache seems useful long term but dumping the stats seems more like debug work.
[snip]
- /* Strip off leading 'i2c' command argument */
- argc--;
- argv++;
I see you borrowed from i2c here :)

Thanks Tom,
On 03/28/2016 07:16 AM, Tom Rini wrote:
On Sun, Mar 27, 2016 at 12:00:13PM -0700, Eric Nelson wrote:
+++ b/drivers/block/blkcache.c
[snip]
+static int trace;
I see where you can toggle this at run-time. But do we really think that this will be useful outside of debug? My first reaction is that we should move the trace stuff into debug() statements instead.
Will do.
Stephen had the same comment.
[snip]
+#ifdef CONFIG_CMD_BLOCK_CACHE
Please split the command code into cmd/blkcache.c. And yes, this might require thinking harder about what to expose or making an API for some of it. I'm also not sure that's a bad thing as tuning the cache seems useful long term but dumping the stats seems more like debug work.
Okay. I started to do that but stopped when I looked at the number of implementation details needed by the commands themselves.
[snip]
- /* Strip off leading 'i2c' command argument */
- argc--;
- argv++;
I see you borrowed from i2c here :)
Yep. Oops.
Regards,
Eric

Add a block device cache to speed up repeated reads of block devices by various filesystems.
This small amount of cache can dramatically speed up filesystem operations by skipping repeated reads of common areas of a block device (typically directory structures).
This has shown to have some benefit on FAT filesystem operations of loading a kernel and RAM disk, but more dramatic benefits on ext4 filesystems when the kernel and/or RAM disk are spread across multiple extent header structures as described in commit fc0fc50.
The cache is implemented through a minimal list (block_cache) maintained in most-recently-used order and count of the current number of entries (cache_count). It uses a maximum block count setting to prevent copies of large block reads and an upper bound on the number of cached areas.
The maximum number of entries in the cache defaults to 32 and the maximum number of blocks per cache entry has a default of 2, which has shown to produce the best results on testing of ext4 and FAT filesystems.
The 'blkcache' command (enabled through CONFIG_CMD_BLOCK_CACHE) allows changing these values and can be used to tune for a particular filesystem layout.
Signed-off-by: Eric Nelson eric@nelint.com --- Changes in V2: - moved blkcache command into cmd/blkcache (Misc commands) - removed invalidate sub-command - done in max command - removed clear sub-command - done in show command - removed dump command (was only for debug) - removed run-time trace control in favor of debug() messages - use struct block_cache_stats instead of separate static vars to streamline block_cache_ - rename max sub-command to configure to match API - change show sub-command to auto-clear statistics
cmd/Kconfig | 11 +++ cmd/Makefile | 1 + cmd/blkcache.c | 90 +++++++++++++++++++++++ disk/part.c | 2 + drivers/block/Kconfig | 9 +++ drivers/block/Makefile | 1 + drivers/block/blk-uclass.c | 13 +++- drivers/block/blkcache.c | 173 +++++++++++++++++++++++++++++++++++++++++++++ include/blk.h | 105 ++++++++++++++++++++++++++- 9 files changed, 403 insertions(+), 2 deletions(-) create mode 100644 cmd/blkcache.c create mode 100644 drivers/block/blkcache.c
diff --git a/cmd/Kconfig b/cmd/Kconfig index 7cdff04..11106af 100644 --- a/cmd/Kconfig +++ b/cmd/Kconfig @@ -485,6 +485,17 @@ config SYS_AMBAPP_PRINT_ON_STARTUP help Show AMBA Plug-n-Play information on startup.
+config CMD_BLOCK_CACHE + bool "blkcache - control and stats for block cache" + depends on BLOCK_CACHE + default y if BLOCK_CACHE + help + Enable the blkcache command, which can be used to control the + operation of the cache functions. + This is most useful when fine-tuning the operation of the cache + during development, but also allows the cache to be disabled when + it might hurt performance (e.g. when using the ums command). + config CMD_TIME bool "time" help diff --git a/cmd/Makefile b/cmd/Makefile index 7604621..ba04197 100644 --- a/cmd/Makefile +++ b/cmd/Makefile @@ -20,6 +20,7 @@ obj-$(CONFIG_SOURCE) += source.o obj-$(CONFIG_CMD_SOURCE) += source.o obj-$(CONFIG_CMD_BDI) += bdinfo.o obj-$(CONFIG_CMD_BEDBUG) += bedbug.o +obj-$(CONFIG_CMD_BLOCK_CACHE) += blkcache.o obj-$(CONFIG_CMD_BMP) += bmp.o obj-$(CONFIG_CMD_BOOTEFI) += bootefi.o obj-$(CONFIG_CMD_BOOTMENU) += bootmenu.o diff --git a/cmd/blkcache.c b/cmd/blkcache.c new file mode 100644 index 0000000..5c2a859 --- /dev/null +++ b/cmd/blkcache.c @@ -0,0 +1,90 @@ +/* + * Copyright (C) Nelson Integration, LLC 2016 + * Author: Eric Nelsoneric@nelint.com + * + * SPDX-License-Identifier: GPL-2.0+ + * + */ +#include <config.h> +#include <common.h> +#include <malloc.h> +#include <part.h> + +static int blkc_show(cmd_tbl_t *cmdtp, int flag, + int argc, char * const argv[]) +{ + struct block_cache_stats stats; + blkcache_stats(&stats); + + printf(" hits: %u\n" + " misses: %u\n" + " entries: %u\n" + " max blocks/entry: %u\n" + " max cache entries: %u\n", + stats.hits, stats.misses, stats.entries, + stats.max_blocks_per_entry, stats.max_entries); + return 0; +} + +static int blkc_configure(cmd_tbl_t *cmdtp, int flag, + int argc, char * const argv[]) +{ + unsigned blocks_per_entry, max_entries; + if (argc != 3) + return CMD_RET_USAGE; + + blocks_per_entry = simple_strtoul(argv[1], 0, 0); + max_entries = simple_strtoul(argv[2], 0, 0); + blkcache_configure(blocks_per_entry, max_entries); + printf("changed to max of %u entries of %u blocks each\n", + max_entries, blocks_per_entry); + return 0; +} + +static cmd_tbl_t cmd_blkc_sub[] = { + U_BOOT_CMD_MKENT(show, 0, 0, blkc_show, "", ""), + U_BOOT_CMD_MKENT(configure, 3, 0, blkc_configure, "", ""), +}; + +static __maybe_unused void blkc_reloc(void) +{ + static int relocated; + + if (!relocated) { + fixup_cmdtable(cmd_blkc_sub, ARRAY_SIZE(cmd_blkc_sub)); + relocated = 1; + }; +} + +static int do_blkcache(cmd_tbl_t *cmdtp, int flag, + int argc, char * const argv[]) +{ + cmd_tbl_t *c; + +#ifdef CONFIG_NEEDS_MANUAL_RELOC + blkc_reloc(); +#endif + if (argc < 2) + return CMD_RET_USAGE; + + /* Strip off leading argument */ + argc--; + argv++; + + c = find_cmd_tbl(argv[0], &cmd_blkc_sub[0], ARRAY_SIZE(cmd_blkc_sub)); + + if (c) + return c->cmd(cmdtp, flag, argc, argv); + else + return CMD_RET_USAGE; + + return 0; +} + +U_BOOT_CMD( + blkcache, 4, 0, do_blkcache, + "block cache diagnostics and control", + "show - show and reset statistics\n" + "blkcache max blocks entries - set maximums\n" +); + diff --git a/disk/part.c b/disk/part.c index 67d98fe..0aff954 100644 --- a/disk/part.c +++ b/disk/part.c @@ -268,6 +268,8 @@ void part_init(struct blk_desc *dev_desc) const int n_ents = ll_entry_count(struct part_driver, part_driver); struct part_driver *entry;
+ blkcache_invalidate(dev_desc->if_type, dev_desc->devnum); + dev_desc->part_type = PART_TYPE_UNKNOWN; for (entry = drv; entry != drv + n_ents; entry++) { int ret; diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig index f35c4d4..fcc9ccd 100644 --- a/drivers/block/Kconfig +++ b/drivers/block/Kconfig @@ -18,3 +18,12 @@ config DISK types can use this, such as AHCI/SATA. It does not provide any standard operations at present. The block device interface has not been converted to driver model. + +config BLOCK_CACHE + bool "Use block device cache" + default n + help + This option enables a disk-block cache for all block devices. + This is most useful when accessing filesystems under U-Boot since + it will prevent repeated reads from directory structures and other + filesystem data structures. diff --git a/drivers/block/Makefile b/drivers/block/Makefile index b5c7ae1..b4cbb09 100644 --- a/drivers/block/Makefile +++ b/drivers/block/Makefile @@ -24,3 +24,4 @@ obj-$(CONFIG_IDE_SIL680) += sil680.o obj-$(CONFIG_SANDBOX) += sandbox.o obj-$(CONFIG_SCSI_SYM53C8XX) += sym53c8xx.o obj-$(CONFIG_SYSTEMACE) += systemace.o +obj-$(CONFIG_BLOCK_CACHE) += blkcache.o diff --git a/drivers/block/blk-uclass.c b/drivers/block/blk-uclass.c index 49df2a6..617db22 100644 --- a/drivers/block/blk-uclass.c +++ b/drivers/block/blk-uclass.c @@ -80,11 +80,20 @@ unsigned long blk_dread(struct blk_desc *block_dev, lbaint_t start, { struct udevice *dev = block_dev->bdev; const struct blk_ops *ops = blk_get_ops(dev); + ulong blks_read;
if (!ops->read) return -ENOSYS;
- return ops->read(dev, start, blkcnt, buffer); + if (blkcache_read(block_dev->if_type, block_dev->devnum, + start, blkcnt, block_dev->blksz, buffer)) + return blkcnt; + blks_read = ops->read(dev, start, blkcnt, buffer); + if (blks_read == blkcnt) + blkcache_fill(block_dev->if_type, block_dev->devnum, + start, blkcnt, block_dev->blksz, buffer); + + return blks_read; }
unsigned long blk_dwrite(struct blk_desc *block_dev, lbaint_t start, @@ -96,6 +105,7 @@ unsigned long blk_dwrite(struct blk_desc *block_dev, lbaint_t start, if (!ops->write) return -ENOSYS;
+ blkcache_invalidate(block_dev->if_type, block_dev->devnum); return ops->write(dev, start, blkcnt, buffer); }
@@ -108,6 +118,7 @@ unsigned long blk_derase(struct blk_desc *block_dev, lbaint_t start, if (!ops->erase) return -ENOSYS;
+ blkcache_invalidate(block_dev->if_type, block_dev->devnum); return ops->erase(dev, start, blkcnt); }
diff --git a/drivers/block/blkcache.c b/drivers/block/blkcache.c new file mode 100644 index 0000000..46a6059 --- /dev/null +++ b/drivers/block/blkcache.c @@ -0,0 +1,173 @@ +/* + * Copyright (C) Nelson Integration, LLC 2016 + * Author: Eric Nelsoneric@nelint.com + * + * SPDX-License-Identifier: GPL-2.0+ + * + */ +#include <config.h> +#include <common.h> +#include <malloc.h> +#include <part.h> +#include <linux/ctype.h> +#include <linux/list.h> + +struct block_cache_node { + struct list_head lh; + int iftype; + int devnum; + lbaint_t start; + lbaint_t blkcnt; + unsigned long blksz; + char *cache; +}; + +static LIST_HEAD(block_cache); + +static struct block_cache_stats _stats = { + .max_blocks_per_entry = 2, + .max_entries = 32 +}; + +static struct block_cache_node *cache_find(int iftype, int devnum, + lbaint_t start, lbaint_t blkcnt, + unsigned long blksz) +{ + struct block_cache_node *node; + + list_for_each_entry(node, &block_cache, lh) + if ((node->iftype == iftype) && + (node->devnum == devnum) && + (node->blksz == blksz) && + (node->start <= start) && + (node->start + node->blkcnt >= start + blkcnt)) { + if (block_cache.next != &node->lh) { + /* maintain MRU ordering */ + list_del(&node->lh); + list_add(&node->lh, &block_cache); + } + return node; + } + return 0; +} + +int blkcache_read(int iftype, int devnum, + lbaint_t start, lbaint_t blkcnt, + unsigned long blksz, void *buffer) +{ + struct block_cache_node *node = cache_find(iftype, devnum, start, + blkcnt, blksz); + if (node) { + const char *src = node->cache + (start - node->start) * blksz; + memcpy(buffer, src, blksz * blkcnt); + debug("hit: start " LBAF ", count " LBAFU "\n", + start, blkcnt); + ++_stats.hits; + return 1; + } + + debug("miss: start " LBAF ", count " LBAFU "\n", + start, blkcnt); + ++_stats.misses; + return 0; +} + +void blkcache_fill(int iftype, int devnum, + lbaint_t start, lbaint_t blkcnt, + unsigned long blksz, void const *buffer) +{ + lbaint_t bytes; + struct block_cache_node *node; + + /* don't cache big stuff */ + if (blkcnt > _stats.max_blocks_per_entry) + return; + + if (_stats.max_entries == 0) + return; + + bytes = blksz * blkcnt; + if (_stats.max_entries <= _stats.entries) { + /* pop LRU */ + node = (struct block_cache_node *)block_cache.prev; + list_del(&node->lh); + _stats.entries--; + debug("drop: start " LBAF ", count " LBAFU "\n", + node->start, node->blkcnt); + if (node->blkcnt * node->blksz < bytes) { + free(node->cache); + node->cache = 0; + } + } else { + node = malloc(sizeof(*node)); + if (!node) + return; + node->cache = 0; + } + + if (!node->cache) { + node->cache = malloc(bytes); + if (!node->cache) { + free(node); + return; + } + } + + debug("fill: start " LBAF ", count " LBAFU "\n", + start, blkcnt); + + node->iftype = iftype; + node->devnum = devnum; + node->start = start; + node->blkcnt = blkcnt; + node->blksz = blksz; + memcpy(node->cache, buffer, bytes); + list_add(&node->lh, &block_cache); + _stats.entries++; +} + +void blkcache_invalidate(int iftype, int devnum) +{ + struct list_head *entry, *n; + struct block_cache_node *node; + + list_for_each_safe(entry, n, &block_cache) { + node = (struct block_cache_node *)entry; + if ((node->iftype == iftype) && + (node->devnum == devnum)) { + list_del(entry); + free(node->cache); + free(node); + --_stats.entries; + } + } +} + +void blkcache_configure(unsigned blocks, unsigned entries) +{ + struct block_cache_node *node; + if ((blocks != _stats.max_blocks_per_entry) || + (entries != _stats.max_entries)) { + /* invalidate cache */ + while (!list_empty(&block_cache)) { + node = (struct block_cache_node *)block_cache.next; + list_del(&node->lh); + free(node->cache); + free(node); + } + _stats.entries = 0; + } + + _stats.max_blocks_per_entry = blocks; + _stats.max_entries = entries; + + _stats.hits = 0; + _stats.misses = 0; +} + +void blkcache_stats(struct block_cache_stats *stats) +{ + memcpy(stats, &_stats, sizeof(*stats)); + _stats.hits = 0; + _stats.misses = 0; +} diff --git a/include/blk.h b/include/blk.h index e83c144..263a791 100644 --- a/include/blk.h +++ b/include/blk.h @@ -83,6 +83,97 @@ struct blk_desc { #define PAD_TO_BLOCKSIZE(size, blk_desc) \ (PAD_SIZE(size, blk_desc->blksz))
+#ifdef CONFIG_BLOCK_CACHE +/** + * blkcache_read() - attempt to read a set of blocks from cache + * + * @param iftype - IF_TYPE_x for type of device + * @param dev - device index of particular type + * @param start - starting block number + * @param blkcnt - number of blocks to read + * @param blksz - size in bytes of each block + * @param buf - buffer to contain cached data + * + * @return - '1' if block returned from cache, '0' otherwise. + */ +int blkcache_read + (int iftype, int dev, + lbaint_t start, lbaint_t blkcnt, + unsigned long blksz, void *buffer); + +/** + * blkcache_fill() - make data read from a block device available + * to the block cache + * + * @param iftype - IF_TYPE_x for type of device + * @param dev - device index of particular type + * @param start - starting block number + * @param blkcnt - number of blocks available + * @param blksz - size in bytes of each block + * @param buf - buffer containing data to cache + * + */ +void blkcache_fill + (int iftype, int dev, + lbaint_t start, lbaint_t blkcnt, + unsigned long blksz, void const *buffer); + +/** + * blkcache_invalidate() - discard the cache for a set of blocks + * because of a write or device (re)initialization. + * + * @param iftype - IF_TYPE_x for type of device + * @param dev - device index of particular type + */ +void blkcache_invalidate + (int iftype, int dev); + +/** + * blkcache_configure() - configure block cache + * + * @param blocks - maximum blocks per entry + * @param entries - maximum entries in cache + */ +void blkcache_configure(unsigned blocks, unsigned entries); + +/* + * statistics of the block cache + */ +struct block_cache_stats { + unsigned hits; + unsigned misses; + unsigned entries; /* current entry count */ + unsigned max_blocks_per_entry; + unsigned max_entries; +}; + +/** + * get_blkcache_stats() - return statistics and reset + * + * @param stats - statistics are copied here + */ +void blkcache_stats(struct block_cache_stats *stats); + +#else + +static inline int blkcache_read + (int iftype, int dev, + lbaint_t start, lbaint_t blkcnt, + unsigned long blksz, void *buffer) +{ + return 0; +} + +static inline void blkcache_fill + (int iftype, int dev, + lbaint_t start, lbaint_t blkcnt, + unsigned long blksz, void const *buffer) {} + +static inline void blkcache_invalidate + (int iftype, int dev) {} + +#endif + #ifdef CONFIG_BLK struct udevice;
@@ -224,23 +315,35 @@ int blk_unbind_all(int if_type); static inline ulong blk_dread(struct blk_desc *block_dev, lbaint_t start, lbaint_t blkcnt, void *buffer) { + ulong blks_read; + if (blkcache_read(block_dev->if_type, block_dev->devnum, + start, blkcnt, block_dev->blksz, buffer)) + return blkcnt; + /* * We could check if block_read is NULL and return -ENOSYS. But this * bloats the code slightly (cause some board to fail to build), and * it would be an error to try an operation that does not exist. */ - return block_dev->block_read(block_dev, start, blkcnt, buffer); + blks_read = block_dev->block_read(block_dev, start, blkcnt, buffer); + if (blks_read == blkcnt) + blkcache_fill(block_dev->if_type, block_dev->devnum, + start, blkcnt, block_dev->blksz, buffer); + + return blks_read; }
static inline ulong blk_dwrite(struct blk_desc *block_dev, lbaint_t start, lbaint_t blkcnt, const void *buffer) { + blkcache_invalidate(block_dev->if_type, block_dev->devnum); return block_dev->block_write(block_dev, start, blkcnt, buffer); }
static inline ulong blk_derase(struct blk_desc *block_dev, lbaint_t start, lbaint_t blkcnt) { + blkcache_invalidate(block_dev->if_type, block_dev->devnum); return block_dev->block_erase(block_dev, start, blkcnt); } #endif /* !CONFIG_BLK */

Add a block device cache to speed up repeated reads of block devices by various filesystems.
This small amount of cache can dramatically speed up filesystem operations by skipping repeated reads of common areas of a block device (typically directory structures).
This has shown to have some benefit on FAT filesystem operations of loading a kernel and RAM disk, but more dramatic benefits on ext4 filesystems when the kernel and/or RAM disk are spread across multiple extent header structures as described in commit fc0fc50.
The cache is implemented through a minimal list (block_cache) maintained in most-recently-used order and count of the current number of entries (cache_count). It uses a maximum block count setting to prevent copies of large block reads and an upper bound on the number of cached areas.
The maximum number of entries in the cache defaults to 32 and the maximum number of blocks per cache entry has a default of 2, which has shown to produce the best results on testing of ext4 and FAT filesystems.
The 'blkcache' command (enabled through CONFIG_CMD_BLOCK_CACHE) allows changing these values and can be used to tune for a particular filesystem layout.
Signed-off-by: Eric Nelson eric@nelint.com --- Changes in V3: - replace remnant of "blkcache max" from sub-command name change
Changes in V2: - moved blkcache command into cmd/blkcache (Misc commands) - removed invalidate sub-command - done in max command - removed clear sub-command - done in show command - removed dump command (was only for debug) - removed run-time trace control in favor of debug() messages - use struct block_cache_stats instead of separate static vars to streamline block_cache_ - rename max sub-command to configure to match API - change show sub-command to auto-clear statistics
cmd/Kconfig | 11 +++ cmd/Makefile | 1 + cmd/blkcache.c | 90 +++++++++++++++++++++++ disk/part.c | 2 + drivers/block/Kconfig | 9 +++ drivers/block/Makefile | 1 + drivers/block/blk-uclass.c | 13 +++- drivers/block/blkcache.c | 173 +++++++++++++++++++++++++++++++++++++++++++++ include/blk.h | 105 ++++++++++++++++++++++++++- 9 files changed, 403 insertions(+), 2 deletions(-) create mode 100644 cmd/blkcache.c create mode 100644 drivers/block/blkcache.c
diff --git a/cmd/Kconfig b/cmd/Kconfig index 7cdff04..11106af 100644 --- a/cmd/Kconfig +++ b/cmd/Kconfig @@ -485,6 +485,17 @@ config SYS_AMBAPP_PRINT_ON_STARTUP help Show AMBA Plug-n-Play information on startup.
+config CMD_BLOCK_CACHE + bool "blkcache - control and stats for block cache" + depends on BLOCK_CACHE + default y if BLOCK_CACHE + help + Enable the blkcache command, which can be used to control the + operation of the cache functions. + This is most useful when fine-tuning the operation of the cache + during development, but also allows the cache to be disabled when + it might hurt performance (e.g. when using the ums command). + config CMD_TIME bool "time" help diff --git a/cmd/Makefile b/cmd/Makefile index 7604621..ba04197 100644 --- a/cmd/Makefile +++ b/cmd/Makefile @@ -20,6 +20,7 @@ obj-$(CONFIG_SOURCE) += source.o obj-$(CONFIG_CMD_SOURCE) += source.o obj-$(CONFIG_CMD_BDI) += bdinfo.o obj-$(CONFIG_CMD_BEDBUG) += bedbug.o +obj-$(CONFIG_CMD_BLOCK_CACHE) += blkcache.o obj-$(CONFIG_CMD_BMP) += bmp.o obj-$(CONFIG_CMD_BOOTEFI) += bootefi.o obj-$(CONFIG_CMD_BOOTMENU) += bootmenu.o diff --git a/cmd/blkcache.c b/cmd/blkcache.c new file mode 100644 index 0000000..725163a --- /dev/null +++ b/cmd/blkcache.c @@ -0,0 +1,90 @@ +/* + * Copyright (C) Nelson Integration, LLC 2016 + * Author: Eric Nelsoneric@nelint.com + * + * SPDX-License-Identifier: GPL-2.0+ + * + */ +#include <config.h> +#include <common.h> +#include <malloc.h> +#include <part.h> + +static int blkc_show(cmd_tbl_t *cmdtp, int flag, + int argc, char * const argv[]) +{ + struct block_cache_stats stats; + blkcache_stats(&stats); + + printf(" hits: %u\n" + " misses: %u\n" + " entries: %u\n" + " max blocks/entry: %u\n" + " max cache entries: %u\n", + stats.hits, stats.misses, stats.entries, + stats.max_blocks_per_entry, stats.max_entries); + return 0; +} + +static int blkc_configure(cmd_tbl_t *cmdtp, int flag, + int argc, char * const argv[]) +{ + unsigned blocks_per_entry, max_entries; + if (argc != 3) + return CMD_RET_USAGE; + + blocks_per_entry = simple_strtoul(argv[1], 0, 0); + max_entries = simple_strtoul(argv[2], 0, 0); + blkcache_configure(blocks_per_entry, max_entries); + printf("changed to max of %u entries of %u blocks each\n", + max_entries, blocks_per_entry); + return 0; +} + +static cmd_tbl_t cmd_blkc_sub[] = { + U_BOOT_CMD_MKENT(show, 0, 0, blkc_show, "", ""), + U_BOOT_CMD_MKENT(configure, 3, 0, blkc_configure, "", ""), +}; + +static __maybe_unused void blkc_reloc(void) +{ + static int relocated; + + if (!relocated) { + fixup_cmdtable(cmd_blkc_sub, ARRAY_SIZE(cmd_blkc_sub)); + relocated = 1; + }; +} + +static int do_blkcache(cmd_tbl_t *cmdtp, int flag, + int argc, char * const argv[]) +{ + cmd_tbl_t *c; + +#ifdef CONFIG_NEEDS_MANUAL_RELOC + blkc_reloc(); +#endif + if (argc < 2) + return CMD_RET_USAGE; + + /* Strip off leading argument */ + argc--; + argv++; + + c = find_cmd_tbl(argv[0], &cmd_blkc_sub[0], ARRAY_SIZE(cmd_blkc_sub)); + + if (c) + return c->cmd(cmdtp, flag, argc, argv); + else + return CMD_RET_USAGE; + + return 0; +} + +U_BOOT_CMD( + blkcache, 4, 0, do_blkcache, + "block cache diagnostics and control", + "show - show and reset statistics\n" + "blkcache configure blocks entries\n" +); + diff --git a/disk/part.c b/disk/part.c index 67d98fe..0aff954 100644 --- a/disk/part.c +++ b/disk/part.c @@ -268,6 +268,8 @@ void part_init(struct blk_desc *dev_desc) const int n_ents = ll_entry_count(struct part_driver, part_driver); struct part_driver *entry;
+ blkcache_invalidate(dev_desc->if_type, dev_desc->devnum); + dev_desc->part_type = PART_TYPE_UNKNOWN; for (entry = drv; entry != drv + n_ents; entry++) { int ret; diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig index f35c4d4..fcc9ccd 100644 --- a/drivers/block/Kconfig +++ b/drivers/block/Kconfig @@ -18,3 +18,12 @@ config DISK types can use this, such as AHCI/SATA. It does not provide any standard operations at present. The block device interface has not been converted to driver model. + +config BLOCK_CACHE + bool "Use block device cache" + default n + help + This option enables a disk-block cache for all block devices. + This is most useful when accessing filesystems under U-Boot since + it will prevent repeated reads from directory structures and other + filesystem data structures. diff --git a/drivers/block/Makefile b/drivers/block/Makefile index b5c7ae1..b4cbb09 100644 --- a/drivers/block/Makefile +++ b/drivers/block/Makefile @@ -24,3 +24,4 @@ obj-$(CONFIG_IDE_SIL680) += sil680.o obj-$(CONFIG_SANDBOX) += sandbox.o obj-$(CONFIG_SCSI_SYM53C8XX) += sym53c8xx.o obj-$(CONFIG_SYSTEMACE) += systemace.o +obj-$(CONFIG_BLOCK_CACHE) += blkcache.o diff --git a/drivers/block/blk-uclass.c b/drivers/block/blk-uclass.c index 49df2a6..617db22 100644 --- a/drivers/block/blk-uclass.c +++ b/drivers/block/blk-uclass.c @@ -80,11 +80,20 @@ unsigned long blk_dread(struct blk_desc *block_dev, lbaint_t start, { struct udevice *dev = block_dev->bdev; const struct blk_ops *ops = blk_get_ops(dev); + ulong blks_read;
if (!ops->read) return -ENOSYS;
- return ops->read(dev, start, blkcnt, buffer); + if (blkcache_read(block_dev->if_type, block_dev->devnum, + start, blkcnt, block_dev->blksz, buffer)) + return blkcnt; + blks_read = ops->read(dev, start, blkcnt, buffer); + if (blks_read == blkcnt) + blkcache_fill(block_dev->if_type, block_dev->devnum, + start, blkcnt, block_dev->blksz, buffer); + + return blks_read; }
unsigned long blk_dwrite(struct blk_desc *block_dev, lbaint_t start, @@ -96,6 +105,7 @@ unsigned long blk_dwrite(struct blk_desc *block_dev, lbaint_t start, if (!ops->write) return -ENOSYS;
+ blkcache_invalidate(block_dev->if_type, block_dev->devnum); return ops->write(dev, start, blkcnt, buffer); }
@@ -108,6 +118,7 @@ unsigned long blk_derase(struct blk_desc *block_dev, lbaint_t start, if (!ops->erase) return -ENOSYS;
+ blkcache_invalidate(block_dev->if_type, block_dev->devnum); return ops->erase(dev, start, blkcnt); }
diff --git a/drivers/block/blkcache.c b/drivers/block/blkcache.c new file mode 100644 index 0000000..46a6059 --- /dev/null +++ b/drivers/block/blkcache.c @@ -0,0 +1,173 @@ +/* + * Copyright (C) Nelson Integration, LLC 2016 + * Author: Eric Nelsoneric@nelint.com + * + * SPDX-License-Identifier: GPL-2.0+ + * + */ +#include <config.h> +#include <common.h> +#include <malloc.h> +#include <part.h> +#include <linux/ctype.h> +#include <linux/list.h> + +struct block_cache_node { + struct list_head lh; + int iftype; + int devnum; + lbaint_t start; + lbaint_t blkcnt; + unsigned long blksz; + char *cache; +}; + +static LIST_HEAD(block_cache); + +static struct block_cache_stats _stats = { + .max_blocks_per_entry = 2, + .max_entries = 32 +}; + +static struct block_cache_node *cache_find(int iftype, int devnum, + lbaint_t start, lbaint_t blkcnt, + unsigned long blksz) +{ + struct block_cache_node *node; + + list_for_each_entry(node, &block_cache, lh) + if ((node->iftype == iftype) && + (node->devnum == devnum) && + (node->blksz == blksz) && + (node->start <= start) && + (node->start + node->blkcnt >= start + blkcnt)) { + if (block_cache.next != &node->lh) { + /* maintain MRU ordering */ + list_del(&node->lh); + list_add(&node->lh, &block_cache); + } + return node; + } + return 0; +} + +int blkcache_read(int iftype, int devnum, + lbaint_t start, lbaint_t blkcnt, + unsigned long blksz, void *buffer) +{ + struct block_cache_node *node = cache_find(iftype, devnum, start, + blkcnt, blksz); + if (node) { + const char *src = node->cache + (start - node->start) * blksz; + memcpy(buffer, src, blksz * blkcnt); + debug("hit: start " LBAF ", count " LBAFU "\n", + start, blkcnt); + ++_stats.hits; + return 1; + } + + debug("miss: start " LBAF ", count " LBAFU "\n", + start, blkcnt); + ++_stats.misses; + return 0; +} + +void blkcache_fill(int iftype, int devnum, + lbaint_t start, lbaint_t blkcnt, + unsigned long blksz, void const *buffer) +{ + lbaint_t bytes; + struct block_cache_node *node; + + /* don't cache big stuff */ + if (blkcnt > _stats.max_blocks_per_entry) + return; + + if (_stats.max_entries == 0) + return; + + bytes = blksz * blkcnt; + if (_stats.max_entries <= _stats.entries) { + /* pop LRU */ + node = (struct block_cache_node *)block_cache.prev; + list_del(&node->lh); + _stats.entries--; + debug("drop: start " LBAF ", count " LBAFU "\n", + node->start, node->blkcnt); + if (node->blkcnt * node->blksz < bytes) { + free(node->cache); + node->cache = 0; + } + } else { + node = malloc(sizeof(*node)); + if (!node) + return; + node->cache = 0; + } + + if (!node->cache) { + node->cache = malloc(bytes); + if (!node->cache) { + free(node); + return; + } + } + + debug("fill: start " LBAF ", count " LBAFU "\n", + start, blkcnt); + + node->iftype = iftype; + node->devnum = devnum; + node->start = start; + node->blkcnt = blkcnt; + node->blksz = blksz; + memcpy(node->cache, buffer, bytes); + list_add(&node->lh, &block_cache); + _stats.entries++; +} + +void blkcache_invalidate(int iftype, int devnum) +{ + struct list_head *entry, *n; + struct block_cache_node *node; + + list_for_each_safe(entry, n, &block_cache) { + node = (struct block_cache_node *)entry; + if ((node->iftype == iftype) && + (node->devnum == devnum)) { + list_del(entry); + free(node->cache); + free(node); + --_stats.entries; + } + } +} + +void blkcache_configure(unsigned blocks, unsigned entries) +{ + struct block_cache_node *node; + if ((blocks != _stats.max_blocks_per_entry) || + (entries != _stats.max_entries)) { + /* invalidate cache */ + while (!list_empty(&block_cache)) { + node = (struct block_cache_node *)block_cache.next; + list_del(&node->lh); + free(node->cache); + free(node); + } + _stats.entries = 0; + } + + _stats.max_blocks_per_entry = blocks; + _stats.max_entries = entries; + + _stats.hits = 0; + _stats.misses = 0; +} + +void blkcache_stats(struct block_cache_stats *stats) +{ + memcpy(stats, &_stats, sizeof(*stats)); + _stats.hits = 0; + _stats.misses = 0; +} diff --git a/include/blk.h b/include/blk.h index e83c144..263a791 100644 --- a/include/blk.h +++ b/include/blk.h @@ -83,6 +83,97 @@ struct blk_desc { #define PAD_TO_BLOCKSIZE(size, blk_desc) \ (PAD_SIZE(size, blk_desc->blksz))
+#ifdef CONFIG_BLOCK_CACHE +/** + * blkcache_read() - attempt to read a set of blocks from cache + * + * @param iftype - IF_TYPE_x for type of device + * @param dev - device index of particular type + * @param start - starting block number + * @param blkcnt - number of blocks to read + * @param blksz - size in bytes of each block + * @param buf - buffer to contain cached data + * + * @return - '1' if block returned from cache, '0' otherwise. + */ +int blkcache_read + (int iftype, int dev, + lbaint_t start, lbaint_t blkcnt, + unsigned long blksz, void *buffer); + +/** + * blkcache_fill() - make data read from a block device available + * to the block cache + * + * @param iftype - IF_TYPE_x for type of device + * @param dev - device index of particular type + * @param start - starting block number + * @param blkcnt - number of blocks available + * @param blksz - size in bytes of each block + * @param buf - buffer containing data to cache + * + */ +void blkcache_fill + (int iftype, int dev, + lbaint_t start, lbaint_t blkcnt, + unsigned long blksz, void const *buffer); + +/** + * blkcache_invalidate() - discard the cache for a set of blocks + * because of a write or device (re)initialization. + * + * @param iftype - IF_TYPE_x for type of device + * @param dev - device index of particular type + */ +void blkcache_invalidate + (int iftype, int dev); + +/** + * blkcache_configure() - configure block cache + * + * @param blocks - maximum blocks per entry + * @param entries - maximum entries in cache + */ +void blkcache_configure(unsigned blocks, unsigned entries); + +/* + * statistics of the block cache + */ +struct block_cache_stats { + unsigned hits; + unsigned misses; + unsigned entries; /* current entry count */ + unsigned max_blocks_per_entry; + unsigned max_entries; +}; + +/** + * get_blkcache_stats() - return statistics and reset + * + * @param stats - statistics are copied here + */ +void blkcache_stats(struct block_cache_stats *stats); + +#else + +static inline int blkcache_read + (int iftype, int dev, + lbaint_t start, lbaint_t blkcnt, + unsigned long blksz, void *buffer) +{ + return 0; +} + +static inline void blkcache_fill + (int iftype, int dev, + lbaint_t start, lbaint_t blkcnt, + unsigned long blksz, void const *buffer) {} + +static inline void blkcache_invalidate + (int iftype, int dev) {} + +#endif + #ifdef CONFIG_BLK struct udevice;
@@ -224,23 +315,35 @@ int blk_unbind_all(int if_type); static inline ulong blk_dread(struct blk_desc *block_dev, lbaint_t start, lbaint_t blkcnt, void *buffer) { + ulong blks_read; + if (blkcache_read(block_dev->if_type, block_dev->devnum, + start, blkcnt, block_dev->blksz, buffer)) + return blkcnt; + /* * We could check if block_read is NULL and return -ENOSYS. But this * bloats the code slightly (cause some board to fail to build), and * it would be an error to try an operation that does not exist. */ - return block_dev->block_read(block_dev, start, blkcnt, buffer); + blks_read = block_dev->block_read(block_dev, start, blkcnt, buffer); + if (blks_read == blkcnt) + blkcache_fill(block_dev->if_type, block_dev->devnum, + start, blkcnt, block_dev->blksz, buffer); + + return blks_read; }
static inline ulong blk_dwrite(struct blk_desc *block_dev, lbaint_t start, lbaint_t blkcnt, const void *buffer) { + blkcache_invalidate(block_dev->if_type, block_dev->devnum); return block_dev->block_write(block_dev, start, blkcnt, buffer); }
static inline ulong blk_derase(struct blk_desc *block_dev, lbaint_t start, lbaint_t blkcnt) { + blkcache_invalidate(block_dev->if_type, block_dev->devnum); return block_dev->block_erase(block_dev, start, blkcnt); } #endif /* !CONFIG_BLK */

On 03/28/2016 11:05 AM, Eric Nelson wrote:
Add a block device cache to speed up repeated reads of block devices by various filesystems.
This small amount of cache can dramatically speed up filesystem operations by skipping repeated reads of common areas of a block device (typically directory structures).
This has shown to have some benefit on FAT filesystem operations of loading a kernel and RAM disk, but more dramatic benefits on ext4 filesystems when the kernel and/or RAM disk are spread across multiple extent header structures as described in commit fc0fc50.
The cache is implemented through a minimal list (block_cache) maintained in most-recently-used order and count of the current number of entries (cache_count). It uses a maximum block count setting to prevent copies of large block reads and an upper bound on the number of cached areas.
The maximum number of entries in the cache defaults to 32 and the maximum number of blocks per cache entry has a default of 2, which has shown to produce the best results on testing of ext4 and FAT filesystems.
The 'blkcache' command (enabled through CONFIG_CMD_BLOCK_CACHE) allows changing these values and can be used to tune for a particular filesystem layout.
diff --git a/cmd/blkcache.c b/cmd/blkcache.c
+static int blkc_show(cmd_tbl_t *cmdtp, int flag,
int argc, char * const argv[])
+{
- struct block_cache_stats stats;
- blkcache_stats(&stats);
- printf(" hits: %u\n"
" misses: %u\n"
" entries: %u\n"
" max blocks/entry: %u\n"
" max cache entries: %u\n",
Nit: I'm not sure why all that command output is indented. Perhaps it made sense before if some kind of title was printed before that text, but I don't think it is any more.
+static int do_blkcache(cmd_tbl_t *cmdtp, int flag,
- c = find_cmd_tbl(argv[0], &cmd_blkc_sub[0], ARRAY_SIZE(cmd_blkc_sub));
- if (c)
return c->cmd(cmdtp, flag, argc, argv);
- else
return CMD_RET_USAGE;
- return 0;
Nit: I'd prefer to see the error handling immediately follow the function call whose result is being tested, and the non-failure-case code be not indented, i.e.:
c = find_cmd_tbl(argv[0], &cmd_blkc_sub[0], ARRAY_SIZE(cmd_blkc_sub)); if (!c) return CMD_RET_USAGE;
return c->cmd(cmdtp, flag, argc, argv);
diff --git a/disk/part.c b/disk/part.c
@@ -268,6 +268,8 @@ void part_init(struct blk_desc *dev_desc) const int n_ents = ll_entry_count(struct part_driver, part_driver); struct part_driver *entry;
- blkcache_invalidate(dev_desc->if_type, dev_desc->devnum);
Doesn't this invalidate the cache far too often? I expect that function is called for command the user executes from the command-line, whereas it'd be nice if the cache persisted across commands. I suppose this is a reasonable (and very safe) first implementation though, and saves having to go through each storage provider type and find out the right place to detect media changes.
diff --git a/drivers/block/blkcache.c b/drivers/block/blkcache.c
+struct block_cache_node {
- struct list_head lh;
- int iftype;
- int devnum;
- lbaint_t start;
- lbaint_t blkcnt;
- unsigned long blksz;
- char *cache;
+};
+static LIST_HEAD(block_cache);
+static struct block_cache_stats _stats = {
- .max_blocks_per_entry = 2,
- .max_entries = 32
+};
Now is a good time to mention another reason why I don't like using a dynamically allocated linked list for this: Memory fragmentation. By dynamically allocating the cache, we could easily run into a situation where the user runs a command that allocates memory and also adds to the block cache, then most of that memory gets freed when U-Boot returns to the command prompt, then the user runs the command again but it fails since it can't allocate the memory due to fragmentation of the heap. This is a real problem I've seen e.g. with the "ums" and "dfu" commands, since they might initialize the USB controller the first time they're run, which allocates some new memory. Statically allocation would avoid this.
diff --git a/include/blk.h b/include/blk.h
+int blkcache_read
- (int iftype, int dev,
lbaint_t start, lbaint_t blkcnt,
unsigned long blksz, void *buffer);
Nit: The opening ( and first n arguments should be wrapped onto the same line as the function name.

On Wed, Mar 30, 2016 at 08:36:21AM -0600, Stephen Warren wrote:
On 03/28/2016 11:05 AM, Eric Nelson wrote:
Add a block device cache to speed up repeated reads of block devices by various filesystems.
This small amount of cache can dramatically speed up filesystem operations by skipping repeated reads of common areas of a block device (typically directory structures).
This has shown to have some benefit on FAT filesystem operations of loading a kernel and RAM disk, but more dramatic benefits on ext4 filesystems when the kernel and/or RAM disk are spread across multiple extent header structures as described in commit fc0fc50.
The cache is implemented through a minimal list (block_cache) maintained in most-recently-used order and count of the current number of entries (cache_count). It uses a maximum block count setting to prevent copies of large block reads and an upper bound on the number of cached areas.
The maximum number of entries in the cache defaults to 32 and the maximum number of blocks per cache entry has a default of 2, which has shown to produce the best results on testing of ext4 and FAT filesystems.
The 'blkcache' command (enabled through CONFIG_CMD_BLOCK_CACHE) allows changing these values and can be used to tune for a particular filesystem layout.
[snip]
diff --git a/disk/part.c b/disk/part.c
@@ -268,6 +268,8 @@ void part_init(struct blk_desc *dev_desc) const int n_ents = ll_entry_count(struct part_driver, part_driver); struct part_driver *entry;
- blkcache_invalidate(dev_desc->if_type, dev_desc->devnum);
Doesn't this invalidate the cache far too often? I expect that function is called for command the user executes from the command-line, whereas it'd be nice if the cache persisted across commands. I suppose this is a reasonable (and very safe) first implementation though, and saves having to go through each storage provider type and find out the right place to detect media changes.
My initial reaction here is that we should stay conservative and invalidate the cache more often rather than too infrequent. I mean, what's the worst case here, an extra read? A few extra reads? We want to make sure we keep the complexity to functionality ratio right here, if we make the recovery/flashing/factory cases a whole lot better but are leaving 1 second of wall clock time on the table when we've just gained a minute, we're OK.
diff --git a/drivers/block/blkcache.c b/drivers/block/blkcache.c
+struct block_cache_node {
- struct list_head lh;
- int iftype;
- int devnum;
- lbaint_t start;
- lbaint_t blkcnt;
- unsigned long blksz;
- char *cache;
+};
+static LIST_HEAD(block_cache);
+static struct block_cache_stats _stats = {
- .max_blocks_per_entry = 2,
- .max_entries = 32
+};
Now is a good time to mention another reason why I don't like using a dynamically allocated linked list for this: Memory fragmentation. By dynamically allocating the cache, we could easily run into a situation where the user runs a command that allocates memory and also adds to the block cache, then most of that memory gets freed when U-Boot returns to the command prompt, then the user runs the command again but it fails since it can't allocate the memory due to fragmentation of the heap. This is a real problem I've seen e.g. with the "ums" and "dfu" commands, since they might initialize the USB controller the first time they're run, which allocates some new memory. Statically allocation would avoid this.
That is a good point. But how would you hit this? The problem in ums/dfu was that it was several megabytes, yes? My quick read over the code right now has me thinking this is something measured in kilobytes.

On 03/30/2016 09:19 AM, Tom Rini wrote:
On Wed, Mar 30, 2016 at 08:36:21AM -0600, Stephen Warren wrote:
On 03/28/2016 11:05 AM, Eric Nelson wrote:
Add a block device cache to speed up repeated reads of block devices by various filesystems.
This small amount of cache can dramatically speed up filesystem operations by skipping repeated reads of common areas of a block device (typically directory structures).
This has shown to have some benefit on FAT filesystem operations of loading a kernel and RAM disk, but more dramatic benefits on ext4 filesystems when the kernel and/or RAM disk are spread across multiple extent header structures as described in commit fc0fc50.
The cache is implemented through a minimal list (block_cache) maintained in most-recently-used order and count of the current number of entries (cache_count). It uses a maximum block count setting to prevent copies of large block reads and an upper bound on the number of cached areas.
The maximum number of entries in the cache defaults to 32 and the maximum number of blocks per cache entry has a default of 2, which has shown to produce the best results on testing of ext4 and FAT filesystems.
The 'blkcache' command (enabled through CONFIG_CMD_BLOCK_CACHE) allows changing these values and can be used to tune for a particular filesystem layout.
[snip]
diff --git a/disk/part.c b/disk/part.c
@@ -268,6 +268,8 @@ void part_init(struct blk_desc *dev_desc) const int n_ents = ll_entry_count(struct part_driver, part_driver); struct part_driver *entry;
- blkcache_invalidate(dev_desc->if_type, dev_desc->devnum);
Doesn't this invalidate the cache far too often? I expect that function is called for command the user executes from the command-line, whereas it'd be nice if the cache persisted across commands. I suppose this is a reasonable (and very safe) first implementation though, and saves having to go through each storage provider type and find out the right place to detect media changes.
My initial reaction here is that we should stay conservative and invalidate the cache more often rather than too infrequent. I mean, what's the worst case here, an extra read? A few extra reads? We want to make sure we keep the complexity to functionality ratio right here, if we make the recovery/flashing/factory cases a whole lot better but are leaving 1 second of wall clock time on the table when we've just gained a minute, we're OK.
diff --git a/drivers/block/blkcache.c b/drivers/block/blkcache.c
+struct block_cache_node {
- struct list_head lh;
- int iftype;
- int devnum;
- lbaint_t start;
- lbaint_t blkcnt;
- unsigned long blksz;
- char *cache;
+};
+static LIST_HEAD(block_cache);
+static struct block_cache_stats _stats = {
- .max_blocks_per_entry = 2,
- .max_entries = 32
+};
Now is a good time to mention another reason why I don't like using a dynamically allocated linked list for this: Memory fragmentation. By dynamically allocating the cache, we could easily run into a situation where the user runs a command that allocates memory and also adds to the block cache, then most of that memory gets freed when U-Boot returns to the command prompt, then the user runs the command again but it fails since it can't allocate the memory due to fragmentation of the heap. This is a real problem I've seen e.g. with the "ums" and "dfu" commands, since they might initialize the USB controller the first time they're run, which allocates some new memory. Statically allocation would avoid this.
That is a good point. But how would you hit this? The problem in ums/dfu was that it was several megabytes, yes? My quick read over the code right now has me thinking this is something measured in kilobytes.
The allocation that failed was a large multi-megabyte buffer. However, the allocations that caused the fragmentation/failure were small (probably tens/hundreds of bytes, but I didn't check).

Hi Tom,
On 03/30/2016 08:19 AM, Tom Rini wrote:
On Wed, Mar 30, 2016 at 08:36:21AM -0600, Stephen Warren wrote:
On 03/28/2016 11:05 AM, Eric Nelson wrote:
Add a block device cache to speed up repeated reads of block devices by various filesystems.
[snip]
@@ -268,6 +268,8 @@ void part_init(struct blk_desc *dev_desc) const int n_ents = ll_entry_count(struct part_driver, part_driver); struct part_driver *entry;
- blkcache_invalidate(dev_desc->if_type, dev_desc->devnum);
Doesn't this invalidate the cache far too often? I expect that function is called for command the user executes from the command-line, whereas it'd be nice if the cache persisted across commands. I suppose this is a reasonable (and very safe) first implementation though, and saves having to go through each storage provider type and find out the right place to detect media changes.
My initial reaction here is that we should stay conservative and invalidate the cache more often rather than too infrequent. I mean, what's the worst case here, an extra read? A few extra reads? We want to make sure we keep the complexity to functionality ratio right here, if we make the recovery/flashing/factory cases a whole lot better but are leaving 1 second of wall clock time on the table when we've just gained a minute, we're OK.
I don't think this is called during every command, at least not with mmc.
diff --git a/drivers/block/blkcache.c b/drivers/block/blkcache.c
+struct block_cache_node {
- struct list_head lh;
- int iftype;
- int devnum;
- lbaint_t start;
- lbaint_t blkcnt;
- unsigned long blksz;
- char *cache;
+};
+static LIST_HEAD(block_cache);
+static struct block_cache_stats _stats = {
- .max_blocks_per_entry = 2,
- .max_entries = 32
+};
Now is a good time to mention another reason why I don't like using a dynamically allocated linked list for this: Memory fragmentation. By dynamically allocating the cache, we could easily run into a situation where the user runs a command that allocates memory and also adds to the block cache, then most of that memory gets freed when U-Boot returns to the command prompt, then the user runs the command again but it fails since it can't allocate the memory due to fragmentation of the heap. This is a real problem I've seen e.g. with the "ums" and "dfu" commands, since they might initialize the USB controller the first time they're run, which allocates some new memory. Statically allocation would avoid this.
That is a good point. But how would you hit this? The problem in ums/dfu was that it was several megabytes, yes? My quick read over the code right now has me thinking this is something measured in kilobytes.
36 bytes for the node, plus 512 bytes or 1k for the block data unless tuned through the blkcache command.
And the block data is going to be allocated at the same time.
Regards,
Eric

Thanks again for the detailed review, Stephen.
On 03/30/2016 07:36 AM, Stephen Warren wrote:
On 03/28/2016 11:05 AM, Eric Nelson wrote:
Add a block device cache to speed up repeated reads of block devices by various filesystems.
<snip>
- printf(" hits: %u\n"
" misses: %u\n"
" entries: %u\n"
" max blocks/entry: %u\n"
" max cache entries: %u\n",
Nit: I'm not sure why all that command output is indented. Perhaps it made sense before if some kind of title was printed before that text, but I don't think it is any more.
Okay.
- if (c)
return c->cmd(cmdtp, flag, argc, argv);
- else
return CMD_RET_USAGE;
- return 0;
Nit: I'd prefer to see the error handling immediately follow the function call whose result is being tested, and the non-failure-case code be not indented, i.e.:
c = find_cmd_tbl(argv[0], &cmd_blkc_sub[0], ARRAY_SIZE(cmd_blkc_sub)); if (!c) return CMD_RET_USAGE;
return c->cmd(cmdtp, flag, argc, argv);
Okay. As Tom pointed out, I copied this verbatim from i2c.c.
diff --git a/disk/part.c b/disk/part.c
@@ -268,6 +268,8 @@ void part_init(struct blk_desc *dev_desc) const int n_ents = ll_entry_count(struct part_driver, part_driver); struct part_driver *entry;
- blkcache_invalidate(dev_desc->if_type, dev_desc->devnum);
Doesn't this invalidate the cache far too often? I expect that function is called for command the user executes from the command-line, whereas it'd be nice if the cache persisted across commands. I suppose this is a reasonable (and very safe) first implementation though, and saves having to go through each storage provider type and find out the right place to detect media changes.
I'm not sure it does. I traced through the mmc initialization and it's only called when the card itself is initialized.
IOW, at the point where we'd recognize a card swap.
diff --git a/drivers/block/blkcache.c b/drivers/block/blkcache.c
+struct block_cache_node {
- struct list_head lh;
- int iftype;
- int devnum;
- lbaint_t start;
- lbaint_t blkcnt;
- unsigned long blksz;
- char *cache;
+};
+static LIST_HEAD(block_cache);
+static struct block_cache_stats _stats = {
- .max_blocks_per_entry = 2,
- .max_entries = 32
+};
Now is a good time to mention another reason why I don't like using a dynamically allocated linked list for this: Memory fragmentation. By dynamically allocating the cache, we could easily run into a situation where the user runs a command that allocates memory and also adds to the block cache, then most of that memory gets freed when U-Boot returns to the command prompt, then the user runs the command again but it fails since it can't allocate the memory due to fragmentation of the heap. This is a real problem I've seen e.g. with the "ums" and "dfu" commands, since they might initialize the USB controller the first time they're run, which allocates some new memory. Statically allocation would avoid this.
We're going to allocate a block or set of blocks every time we allocate a new node for the list, so having the list in an array doesn't fix the problem.
While re-working the code, I also thought more about using an array and still don't see how the implementation doesn't get more complex.
The key bit is that the list is implemented in MRU order so invalidating the oldest is trivial.
Doing this in an array would require keeping the array sorted (a bad idea) or keeping an access sequence in each node of the array and searching for oldest when invalidation is needed (when the max entries limit is hit).
... unless I'm still missing something and you or Marek have something else in mind.
I could also allocate an array of "free" nodes once the first time around (as a buffer pool), but I don't think this is warranted because of the block allocations I mentioned above.
diff --git a/include/blk.h b/include/blk.h
+int blkcache_read
- (int iftype, int dev,
lbaint_t start, lbaint_t blkcnt,
unsigned long blksz, void *buffer);
Nit: The opening ( and first n arguments should be wrapped onto the same line as the function name.
Aargh. You've now pointed this out several times and I keep missing it.
Will fix in V4.
Regards,
Eric

On 03/30/2016 11:34 AM, Eric Nelson wrote:
Thanks again for the detailed review, Stephen.
On 03/30/2016 07:36 AM, Stephen Warren wrote:
On 03/28/2016 11:05 AM, Eric Nelson wrote:
Add a block device cache to speed up repeated reads of block devices by various filesystems.
diff --git a/disk/part.c b/disk/part.c
@@ -268,6 +268,8 @@ void part_init(struct blk_desc *dev_desc) const int n_ents = ll_entry_count(struct part_driver, part_driver); struct part_driver *entry;
- blkcache_invalidate(dev_desc->if_type, dev_desc->devnum);
Doesn't this invalidate the cache far too often? I expect that function is called for command the user executes from the command-line, whereas it'd be nice if the cache persisted across commands. I suppose this is a reasonable (and very safe) first implementation though, and saves having to go through each storage provider type and find out the right place to detect media changes.
I'm not sure it does. I traced through the mmc initialization and it's only called when the card itself is initialized.
I don't believe U-Boot caches the partition structure across user commands. Doesn't each user command (e.g. part list, ls, load, save) first look up the block device, then scan the partition table, then "mount" the filesystem, then perform its action, then throw all that state away? Conversely, "mmc rescan" only happens under explicit user control. Still as I said, the current implementation is probably fine to start with, and at least is safe.
diff --git a/drivers/block/blkcache.c b/drivers/block/blkcache.c
+struct block_cache_node {
- struct list_head lh;
- int iftype;
- int devnum;
- lbaint_t start;
- lbaint_t blkcnt;
- unsigned long blksz;
- char *cache;
+};
+static LIST_HEAD(block_cache);
+static struct block_cache_stats _stats = {
- .max_blocks_per_entry = 2,
- .max_entries = 32
+};
Now is a good time to mention another reason why I don't like using a dynamically allocated linked list for this: Memory fragmentation. By dynamically allocating the cache, we could easily run into a situation where the user runs a command that allocates memory and also adds to the block cache, then most of that memory gets freed when U-Boot returns to the command prompt, then the user runs the command again but it fails since it can't allocate the memory due to fragmentation of the heap. This is a real problem I've seen e.g. with the "ums" and "dfu" commands, since they might initialize the USB controller the first time they're run, which allocates some new memory. Statically allocation would avoid this.
We're going to allocate a block or set of blocks every time we allocate a new node for the list, so having the list in an array doesn't fix the problem.
We could allocate the data storage for the block cache at the top of RAM before relocation, like many other things are allocated, and hence not use malloc() for that.
While re-working the code, I also thought more about using an array and still don't see how the implementation doesn't get more complex.
The key bit is that the list is implemented in MRU order so invalidating the oldest is trivial.
Yes, the MRU logic would make it more complex. Is that particularly useful, i.e. is it an intrinsic part of the speedup?

Hi Stephen,
On 03/30/2016 02:57 PM, Stephen Warren wrote:
On 03/30/2016 11:34 AM, Eric Nelson wrote:
Thanks again for the detailed review, Stephen.
On 03/30/2016 07:36 AM, Stephen Warren wrote:
On 03/28/2016 11:05 AM, Eric Nelson wrote:
Add a block device cache to speed up repeated reads of block devices by various filesystems. diff --git a/disk/part.c b/disk/part.c
@@ -268,6 +268,8 @@ void part_init(struct blk_desc *dev_desc) const int n_ents = ll_entry_count(struct part_driver, part_driver); struct part_driver *entry;
- blkcache_invalidate(dev_desc->if_type, dev_desc->devnum);
Doesn't this invalidate the cache far too often? I expect that function is called for command the user executes from the command-line, whereas it'd be nice if the cache persisted across commands. I suppose this is a reasonable (and very safe) first implementation though, and saves having to go through each storage provider type and find out the right place to detect media changes.
I'm not sure it does. I traced through the mmc initialization and it's only called when the card itself is initialized.
I don't believe U-Boot caches the partition structure across user commands. Doesn't each user command (e.g. part list, ls, load, save) first look up the block device, then scan the partition table, then "mount" the filesystem, then perform its action, then throw all that state away? Conversely, "mmc rescan" only happens under explicit user control. Still as I said, the current implementation is probably fine to start with, and at least is safe.
At least for MMC, this isn't the case. Various filesystem commands operate without calling part_init.
diff --git a/drivers/block/blkcache.c b/drivers/block/blkcache.c
+struct block_cache_node {
- struct list_head lh;
- int iftype;
- int devnum;
- lbaint_t start;
- lbaint_t blkcnt;
- unsigned long blksz;
- char *cache;
+};
+static LIST_HEAD(block_cache);
+static struct block_cache_stats _stats = {
- .max_blocks_per_entry = 2,
- .max_entries = 32
+};
Now is a good time to mention another reason why I don't like using a dynamically allocated linked list for this: Memory fragmentation. By dynamically allocating the cache, we could easily run into a situation where the user runs a command that allocates memory and also adds to the block cache, then most of that memory gets freed when U-Boot returns to the command prompt, then the user runs the command again but it fails since it can't allocate the memory due to fragmentation of the heap. This is a real problem I've seen e.g. with the "ums" and "dfu" commands, since they might initialize the USB controller the first time they're run, which allocates some new memory. Statically allocation would avoid this.
We're going to allocate a block or set of blocks every time we allocate a new node for the list, so having the list in an array doesn't fix the problem.
We could allocate the data storage for the block cache at the top of RAM before relocation, like many other things are allocated, and hence not use malloc() for that.
Hmmm. We seem to have gone from a discussion about data structures to type of allocation.
I'm interested in seeing how that works. Can you provide hints about what's doing this now?
While re-working the code, I also thought more about using an array and still don't see how the implementation doesn't get more complex.
The key bit is that the list is implemented in MRU order so invalidating the oldest is trivial.
Yes, the MRU logic would make it more complex. Is that particularly useful, i.e. is it an intrinsic part of the speedup?
It's not a question of speed with small numbers of entries. The code to handle eviction would just be more complex.
Given that the command "blkcache configure 0 0" will discard all cache and since both dfu and ums should properly have the cache disabled, I'd like to proceed as-is with the list and heap approach.
A follow-up change to use another form of allocation is unlikely to change the primary interfaces, though I can't be sure until I understand how these allocation(s) would occur.
I have a V3 prepped that addresses your other comments.
To reiterate the impact of this code, I have use cases where file loading takes minutes when it should take seconds and suspect that others have been seeing the same for quite some time.
Let me know your thoughts.
Regards,
Eric

On 03/31/2016 02:24 PM, Eric Nelson wrote:
Hi Stephen,
On 03/30/2016 02:57 PM, Stephen Warren wrote:
On 03/30/2016 11:34 AM, Eric Nelson wrote:
Thanks again for the detailed review, Stephen.
On 03/30/2016 07:36 AM, Stephen Warren wrote:
On 03/28/2016 11:05 AM, Eric Nelson wrote:
Add a block device cache to speed up repeated reads of block devices by various filesystems. diff --git a/disk/part.c b/disk/part.c
@@ -268,6 +268,8 @@ void part_init(struct blk_desc *dev_desc) const int n_ents = ll_entry_count(struct part_driver, part_driver); struct part_driver *entry;
- blkcache_invalidate(dev_desc->if_type, dev_desc->devnum);
Doesn't this invalidate the cache far too often? I expect that function is called for command the user executes from the command-line, whereas it'd be nice if the cache persisted across commands. I suppose this is a reasonable (and very safe) first implementation though, and saves having to go through each storage provider type and find out the right place to detect media changes.
I'm not sure it does. I traced through the mmc initialization and it's only called when the card itself is initialized.
I don't believe U-Boot caches the partition structure across user commands. Doesn't each user command (e.g. part list, ls, load, save) first look up the block device, then scan the partition table, then "mount" the filesystem, then perform its action, then throw all that state away? Conversely, "mmc rescan" only happens under explicit user control. Still as I said, the current implementation is probably fine to start with, and at least is safe.
At least for MMC, this isn't the case. Various filesystem commands operate without calling part_init.
Interesting; that step is indeed only performed when the device is first probed for MMC and USB.
diff --git a/drivers/block/blkcache.c b/drivers/block/blkcache.c
+struct block_cache_node {
- struct list_head lh;
- int iftype;
- int devnum;
- lbaint_t start;
- lbaint_t blkcnt;
- unsigned long blksz;
- char *cache;
+};
+static LIST_HEAD(block_cache);
+static struct block_cache_stats _stats = {
- .max_blocks_per_entry = 2,
- .max_entries = 32
+};
Now is a good time to mention another reason why I don't like using a dynamically allocated linked list for this: Memory fragmentation. By dynamically allocating the cache, we could easily run into a situation where the user runs a command that allocates memory and also adds to the block cache, then most of that memory gets freed when U-Boot returns to the command prompt, then the user runs the command again but it fails since it can't allocate the memory due to fragmentation of the heap. This is a real problem I've seen e.g. with the "ums" and "dfu" commands, since they might initialize the USB controller the first time they're run, which allocates some new memory. Statically allocation would avoid this.
We're going to allocate a block or set of blocks every time we allocate a new node for the list, so having the list in an array doesn't fix the problem.
We could allocate the data storage for the block cache at the top of RAM before relocation, like many other things are allocated, and hence not use malloc() for that.
Hmmm. We seem to have gone from a discussion about data structures to type of allocation.
I'm interested in seeing how that works. Can you provide hints about what's doing this now?
Something like common/board_f.c:reserve_mmu() and many other functions there. relocaddr starts at approximately the top of RAM, continually gets adjusted down as many static allocations are reserved, and eventually becomes the address that U-Boot is relocated to. Simply adding another entry into init_sequence_f[] for the disk cache might work.
While re-working the code, I also thought more about using an array and still don't see how the implementation doesn't get more complex.
The key bit is that the list is implemented in MRU order so invalidating the oldest is trivial.
Yes, the MRU logic would make it more complex. Is that particularly useful, i.e. is it an intrinsic part of the speedup?
It's not a question of speed with small numbers of entries. The code to handle eviction would just be more complex.
My thought was that if the eviction algorithm wasn't important (i.e. most of the speedup comes from have some (any) kind of cache, but the eviction algorithm makes little difference to the gain from having the cache), we could just drop MRU completely. If that's not possible, then indeed a list would make implementing MRU easier.
You could still do a list with a statically allocated set of list nodes, especially since the length of the list is bounded.
Given that the command "blkcache configure 0 0" will discard all cache and since both dfu and ums should properly have the cache disabled, I'd like to proceed as-is with the list and heap approach.
I don't understand "since both dfu and ums should properly have the cache disabled"; I didn't see anything that did that. Perhaps you're referring to the fact that writes invalidate the cache?
Eventually it seems better to keep the cache enabled for at least DFU to a filesystem (rather than raw block device) since presumably parsing the directory structure to write to a file for DFU would benefit from the cache just like anything else.

Hi Stephen,
On 04/01/2016 03:57 PM, Stephen Warren wrote:
On 03/31/2016 02:24 PM, Eric Nelson wrote:
On 03/30/2016 02:57 PM, Stephen Warren wrote:
On 03/30/2016 11:34 AM, Eric Nelson wrote:
On 03/30/2016 07:36 AM, Stephen Warren wrote:
On 03/28/2016 11:05 AM, Eric Nelson wrote:
<snip>
We could allocate the data storage for the block cache at the top of RAM before relocation, like many other things are allocated, and hence not use malloc() for that.
Hmmm. We seem to have gone from a discussion about data structures to type of allocation.
I'm interested in seeing how that works. Can you provide hints about what's doing this now?
Something like common/board_f.c:reserve_mmu() and many other functions there. relocaddr starts at approximately the top of RAM, continually gets adjusted down as many static allocations are reserved, and eventually becomes the address that U-Boot is relocated to. Simply adding another entry into init_sequence_f[] for the disk cache might work.
Thanks for the pointer. I'll review that when time permits.
This would remove the opportunity to re-configure the cache though, right?
I'm not sure whether how important this feature is, and I think only time and use will tell.
I'd prefer to keep that ability at least for a cycle or two so that I and others can test.
While re-working the code, I also thought more about using an array and still don't see how the implementation doesn't get more complex.
The key bit is that the list is implemented in MRU order so invalidating the oldest is trivial.
Yes, the MRU logic would make it more complex. Is that particularly useful, i.e. is it an intrinsic part of the speedup?
It's not a question of speed with small numbers of entries. The code to handle eviction would just be more complex.
My thought was that if the eviction algorithm wasn't important (i.e. most of the speedup comes from have some (any) kind of cache, but the eviction algorithm makes little difference to the gain from having the cache), we could just drop MRU completely. If that's not possible, then indeed a list would make implementing MRU easier.
How would we decide which block to discard? I haven't traced enough to know what algorithm(s) might be best, but I can say that there's a preponderance of repeated accesses to the last-accessed block, especially in ext4.
You could still do a list with a statically allocated set of list nodes, especially since the length of the list is bounded.
Sure. A pooled allocator (pool of free nodes) works well with array-based allocation.
Having a fixed upper limit on the number of blocks would require additional checking unless we just sized it for (max entries * max blocks/entry).
Given that the command "blkcache configure 0 0" will discard all cache and since both dfu and ums should properly have the cache disabled, I'd like to proceed as-is with the list and heap approach.
I don't understand "since both dfu and ums should properly have the cache disabled"; I didn't see anything that did that. Perhaps you're referring to the fact that writes invalidate the cache?
Yes, but also that the host will cache blocks in the ums case, so having the cache enabled will only slow things down slightly by lots of memcpy's to cached blocks that won't be helpful.
I think I was a bit flippant by including dfu in this statement, since I haven't used it to access anything except SPI-NOR.
Eventually it seems better to keep the cache enabled for at least DFU to a filesystem (rather than raw block device) since presumably parsing the directory structure to write to a file for DFU would benefit from the cache just like anything else.
I'm not in a position to comment about dfu.
Regards,
Eric

On Fri, Apr 01, 2016 at 04:16:42PM -0700, Eric Nelson wrote:
Hi Stephen,
On 04/01/2016 03:57 PM, Stephen Warren wrote:
On 03/31/2016 02:24 PM, Eric Nelson wrote:
On 03/30/2016 02:57 PM, Stephen Warren wrote:
On 03/30/2016 11:34 AM, Eric Nelson wrote:
On 03/30/2016 07:36 AM, Stephen Warren wrote:
On 03/28/2016 11:05 AM, Eric Nelson wrote:
<snip>
We could allocate the data storage for the block cache at the top of RAM before relocation, like many other things are allocated, and hence not use malloc() for that.
Hmmm. We seem to have gone from a discussion about data structures to type of allocation.
I'm interested in seeing how that works. Can you provide hints about what's doing this now?
Something like common/board_f.c:reserve_mmu() and many other functions there. relocaddr starts at approximately the top of RAM, continually gets adjusted down as many static allocations are reserved, and eventually becomes the address that U-Boot is relocated to. Simply adding another entry into init_sequence_f[] for the disk cache might work.
Thanks for the pointer. I'll review that when time permits.
This would remove the opportunity to re-configure the cache though, right?
I'm not sure whether how important this feature is, and I think only time and use will tell.
I'd prefer to keep that ability at least for a cycle or two so that I and others can test.
While re-working the code, I also thought more about using an array and still don't see how the implementation doesn't get more complex.
The key bit is that the list is implemented in MRU order so invalidating the oldest is trivial.
Yes, the MRU logic would make it more complex. Is that particularly useful, i.e. is it an intrinsic part of the speedup?
It's not a question of speed with small numbers of entries. The code to handle eviction would just be more complex.
My thought was that if the eviction algorithm wasn't important (i.e. most of the speedup comes from have some (any) kind of cache, but the eviction algorithm makes little difference to the gain from having the cache), we could just drop MRU completely. If that's not possible, then indeed a list would make implementing MRU easier.
How would we decide which block to discard? I haven't traced enough to know what algorithm(s) might be best, but I can say that there's a preponderance of repeated accesses to the last-accessed block, especially in ext4.
You could still do a list with a statically allocated set of list nodes, especially since the length of the list is bounded.
Sure. A pooled allocator (pool of free nodes) works well with array-based allocation.
Having a fixed upper limit on the number of blocks would require additional checking unless we just sized it for (max entries * max blocks/entry).
Given that the command "blkcache configure 0 0" will discard all cache and since both dfu and ums should properly have the cache disabled, I'd like to proceed as-is with the list and heap approach.
I don't understand "since both dfu and ums should properly have the cache disabled"; I didn't see anything that did that. Perhaps you're referring to the fact that writes invalidate the cache?
Yes, but also that the host will cache blocks in the ums case, so having the cache enabled will only slow things down slightly by lots of memcpy's to cached blocks that won't be helpful.
I think I was a bit flippant by including dfu in this statement, since I haven't used it to access anything except SPI-NOR.
Eventually it seems better to keep the cache enabled for at least DFU to a filesystem (rather than raw block device) since presumably parsing the directory structure to write to a file for DFU would benefit from the cache just like anything else.
I'm not in a position to comment about dfu.
For the record, I think this discussion is good and important, but not a blocker for getting the initial functionality in.

On 04/01/2016 04:41 PM, Tom Rini wrote:
On Fri, Apr 01, 2016 at 04:16:42PM -0700, Eric Nelson wrote:
Hi Stephen,
On 04/01/2016 03:57 PM, Stephen Warren wrote:
On 03/31/2016 02:24 PM, Eric Nelson wrote:
On 03/30/2016 02:57 PM, Stephen Warren wrote:
On 03/30/2016 11:34 AM, Eric Nelson wrote:
On 03/30/2016 07:36 AM, Stephen Warren wrote: > On 03/28/2016 11:05 AM, Eric Nelson wrote:
<snip>
Yes, but also that the host will cache blocks in the ums case, so having the cache enabled will only slow things down slightly by lots of memcpy's to cached blocks that won't be helpful.
I think I was a bit flippant by including dfu in this statement, since I haven't used it to access anything except SPI-NOR.
Eventually it seems better to keep the cache enabled for at least DFU to a filesystem (rather than raw block device) since presumably parsing the directory structure to write to a file for DFU would benefit from the cache just like anything else.
I'm not in a position to comment about dfu.
For the record, I think this discussion is good and important, but not a blocker for getting the initial functionality in.
I wholeheartedly agree.
I'm learning a lot and the code's better because of Stephen's feedback.
Regards,
Eric

On 04/01/2016 05:16 PM, Eric Nelson wrote:
Hi Stephen,
On 04/01/2016 03:57 PM, Stephen Warren wrote:
On 03/31/2016 02:24 PM, Eric Nelson wrote:
On 03/30/2016 02:57 PM, Stephen Warren wrote:
On 03/30/2016 11:34 AM, Eric Nelson wrote:
On 03/30/2016 07:36 AM, Stephen Warren wrote:
On 03/28/2016 11:05 AM, Eric Nelson wrote:
<snip>
We could allocate the data storage for the block cache at the top of RAM before relocation, like many other things are allocated, and hence not use malloc() for that.
Hmmm. We seem to have gone from a discussion about data structures to type of allocation.
I'm interested in seeing how that works. Can you provide hints about what's doing this now?
Something like common/board_f.c:reserve_mmu() and many other functions there. relocaddr starts at approximately the top of RAM, continually gets adjusted down as many static allocations are reserved, and eventually becomes the address that U-Boot is relocated to. Simply adding another entry into init_sequence_f[] for the disk cache might work.
Thanks for the pointer. I'll review that when time permits.
This would remove the opportunity to re-configure the cache though, right?
Well, it would make it impossible to use less RAM. One could use more by having a mix of the initial static allocation plus some additional dynamic allocation, but that might get a bit painful to manage.
It might be interesting to use the MMU more and allow de-fragmentation of VA space. That is, assuming there's much more VA space than RAM, such as is true on current 64-bit architectures. Then I wouldn't dislike dynamic allocation so much:-)
I'm not sure whether how important this feature is, and I think only time and use will tell.
I'd prefer to keep that ability at least for a cycle or two so that I and others can test.
While re-working the code, I also thought more about using an array and still don't see how the implementation doesn't get more complex.
The key bit is that the list is implemented in MRU order so invalidating the oldest is trivial.
Yes, the MRU logic would make it more complex. Is that particularly useful, i.e. is it an intrinsic part of the speedup?
It's not a question of speed with small numbers of entries. The code to handle eviction would just be more complex.
My thought was that if the eviction algorithm wasn't important (i.e. most of the speedup comes from have some (any) kind of cache, but the eviction algorithm makes little difference to the gain from having the cache), we could just drop MRU completely. If that's not possible, then indeed a list would make implementing MRU easier.
How would we decide which block to discard? I haven't traced enough to know what algorithm(s) might be best, but I can say that there's a preponderance of repeated accesses to the last-accessed block, especially in ext4.
Perhaps just keep an index into the array, use that index any time something is written to the cache, and then increment it each time. Probably not anywhere near as optimal as MRU/LRU though.

Hi Stephen,
On 04/01/2016 07:07 PM, Stephen Warren wrote:
On 04/01/2016 05:16 PM, Eric Nelson wrote:
On 04/01/2016 03:57 PM, Stephen Warren wrote:
On 03/31/2016 02:24 PM, Eric Nelson wrote:
On 03/30/2016 02:57 PM, Stephen Warren wrote:
On 03/30/2016 11:34 AM, Eric Nelson wrote:
On 03/30/2016 07:36 AM, Stephen Warren wrote: > On 03/28/2016 11:05 AM, Eric Nelson wrote:
<snip>
We could allocate the data storage for the block cache at the top of RAM before relocation, like many other things are allocated, and hence not use malloc() for that.
Hmmm. We seem to have gone from a discussion about data structures to type of allocation.
I'm interested in seeing how that works. Can you provide hints about what's doing this now?
Something like common/board_f.c:reserve_mmu() and many other functions there. relocaddr starts at approximately the top of RAM, continually gets adjusted down as many static allocations are reserved, and eventually becomes the address that U-Boot is relocated to. Simply adding another entry into init_sequence_f[] for the disk cache might work.
Thanks for the pointer. I'll review that when time permits.
This would remove the opportunity to re-configure the cache though, right?
Well, it would make it impossible to use less RAM. One could use more by having a mix of the initial static allocation plus some additional dynamic allocation, but that might get a bit painful to manage.
This might not be too bad though. Even if we allocated 4x the current defaults, we're only at ~64k.
It might be interesting to use the MMU more and allow de-fragmentation of VA space. That is, assuming there's much more VA space than RAM, such as is true on current 64-bit architectures. Then I wouldn't dislike dynamic allocation so much:-)
That's interesting, but probably more invasive than this patch set.
I'm not sure whether how important this feature is, and I think only time and use will tell.
I'd prefer to keep that ability at least for a cycle or two so that I and others can test.
While re-working the code, I also thought more about using an array and still don't see how the implementation doesn't get more complex.
The key bit is that the list is implemented in MRU order so invalidating the oldest is trivial.
Yes, the MRU logic would make it more complex. Is that particularly useful, i.e. is it an intrinsic part of the speedup?
It's not a question of speed with small numbers of entries. The code to handle eviction would just be more complex.
My thought was that if the eviction algorithm wasn't important (i.e. most of the speedup comes from have some (any) kind of cache, but the eviction algorithm makes little difference to the gain from having the cache), we could just drop MRU completely. If that's not possible, then indeed a list would make implementing MRU easier.
How would we decide which block to discard? I haven't traced enough to know what algorithm(s) might be best, but I can say that there's a preponderance of repeated accesses to the last-accessed block, especially in ext4.
Perhaps just keep an index into the array, use that index any time something is written to the cache, and then increment it each time. Probably not anywhere near as optimal as MRU/LRU though.
I see that Tom just applied V3, so I'd be interested in seeing patches on top of that.
Regards,
Eric

On Mon, Mar 28, 2016 at 10:05:44AM -0700, Eric Nelson wrote:
Add a block device cache to speed up repeated reads of block devices by various filesystems.
This small amount of cache can dramatically speed up filesystem operations by skipping repeated reads of common areas of a block device (typically directory structures).
This has shown to have some benefit on FAT filesystem operations of loading a kernel and RAM disk, but more dramatic benefits on ext4 filesystems when the kernel and/or RAM disk are spread across multiple extent header structures as described in commit fc0fc50.
The cache is implemented through a minimal list (block_cache) maintained in most-recently-used order and count of the current number of entries (cache_count). It uses a maximum block count setting to prevent copies of large block reads and an upper bound on the number of cached areas.
The maximum number of entries in the cache defaults to 32 and the maximum number of blocks per cache entry has a default of 2, which has shown to produce the best results on testing of ext4 and FAT filesystems.
The 'blkcache' command (enabled through CONFIG_CMD_BLOCK_CACHE) allows changing these values and can be used to tune for a particular filesystem layout.
Signed-off-by: Eric Nelson eric@nelint.com
Applied to u-boot/master, thanks!

Hi Tom,
On 04/01/2016 06:59 PM, Tom Rini wrote:
On Mon, Mar 28, 2016 at 10:05:44AM -0700, Eric Nelson wrote:
Add a block device cache to speed up repeated reads of block devices by various filesystems.
This small amount of cache can dramatically speed up filesystem operations by skipping repeated reads of common areas of a block device (typically directory structures).
This has shown to have some benefit on FAT filesystem operations of loading a kernel and RAM disk, but more dramatic benefits on ext4 filesystems when the kernel and/or RAM disk are spread across multiple extent header structures as described in commit fc0fc50.
The cache is implemented through a minimal list (block_cache) maintained in most-recently-used order and count of the current number of entries (cache_count). It uses a maximum block count setting to prevent copies of large block reads and an upper bound on the number of cached areas.
The maximum number of entries in the cache defaults to 32 and the maximum number of blocks per cache entry has a default of 2, which has shown to produce the best results on testing of ext4 and FAT filesystems.
The 'blkcache' command (enabled through CONFIG_CMD_BLOCK_CACHE) allows changing these values and can be used to tune for a particular filesystem layout.
Signed-off-by: Eric Nelson eric@nelint.com
Applied to u-boot/master, thanks!
Whoops. I have a couple of minor updates from Stephen's last review that I'll forward shortly.

This set of updates addresses a number of small coding style issues caught by Stephen Warren's review of the block cache patch set.
Eric Nelson (3): cmd: blkcache: remove indentation from output of 'show' cmd: blkcache: simplify sub-command handling drivers: block: formatting: fix placement of function parameters
cmd/blkcache.c | 16 +++++++--------- include/blk.h | 34 ++++++++++++++-------------------- 2 files changed, 21 insertions(+), 29 deletions(-)

Signed-off-by: Eric Nelson eric@nelint.com --- cmd/blkcache.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/cmd/blkcache.c b/cmd/blkcache.c index 725163a..d97bed5 100644 --- a/cmd/blkcache.c +++ b/cmd/blkcache.c @@ -16,11 +16,11 @@ static int blkc_show(cmd_tbl_t *cmdtp, int flag, struct block_cache_stats stats; blkcache_stats(&stats);
- printf(" hits: %u\n" - " misses: %u\n" - " entries: %u\n" - " max blocks/entry: %u\n" - " max cache entries: %u\n", + printf("hits: %u\n" + "misses: %u\n" + "entries: %u\n" + "max blocks/entry: %u\n" + "max cache entries: %u\n", stats.hits, stats.misses, stats.entries, stats.max_blocks_per_entry, stats.max_entries); return 0;

On Sat, Apr 02, 2016 at 07:37:12AM -0700, Eric Nelson wrote:
Signed-off-by: Eric Nelson eric@nelint.com
Applied to u-boot/master, thanks!

Signed-off-by: Eric Nelson eric@nelint.com --- cmd/blkcache.c | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-)
diff --git a/cmd/blkcache.c b/cmd/blkcache.c index d97bed5..1338dbd 100644 --- a/cmd/blkcache.c +++ b/cmd/blkcache.c @@ -73,12 +73,10 @@ static int do_blkcache(cmd_tbl_t *cmdtp, int flag,
c = find_cmd_tbl(argv[0], &cmd_blkc_sub[0], ARRAY_SIZE(cmd_blkc_sub));
- if (c) - return c->cmd(cmdtp, flag, argc, argv); - else + if (!c) return CMD_RET_USAGE;
- return 0; + return c->cmd(cmdtp, flag, argc, argv); }
U_BOOT_CMD(

On 04/02/2016 08:37 AM, Eric Nelson wrote:
The series, Acked-by: Stephen Warren swarren@nvidia.com
One nit below:
diff --git a/cmd/blkcache.c b/cmd/blkcache.c
@@ -73,12 +73,10 @@ static int do_blkcache(cmd_tbl_t *cmdtp, int flag,
c = find_cmd_tbl(argv[0], &cmd_blkc_sub[0], ARRAY_SIZE(cmd_blkc_sub));
- if (c)
I'd suggest removing that blank line too.

On Sat, Apr 02, 2016 at 07:37:13AM -0700, Eric Nelson wrote:
Signed-off-by: Eric Nelson eric@nelint.com Acked-by: Stephen Warren swarren@nvidia.com
Applied to u-boot/master, thanks!

Signed-off-by: Eric Nelson eric@nelint.com --- include/blk.h | 34 ++++++++++++++-------------------- 1 file changed, 14 insertions(+), 20 deletions(-)
diff --git a/include/blk.h b/include/blk.h index 263a791..f624671 100644 --- a/include/blk.h +++ b/include/blk.h @@ -96,10 +96,9 @@ struct blk_desc { * * @return - '1' if block returned from cache, '0' otherwise. */ -int blkcache_read - (int iftype, int dev, - lbaint_t start, lbaint_t blkcnt, - unsigned long blksz, void *buffer); +int blkcache_read(int iftype, int dev, + lbaint_t start, lbaint_t blkcnt, + unsigned long blksz, void *buffer);
/** * blkcache_fill() - make data read from a block device available @@ -113,10 +112,9 @@ int blkcache_read * @param buf - buffer containing data to cache * */ -void blkcache_fill - (int iftype, int dev, - lbaint_t start, lbaint_t blkcnt, - unsigned long blksz, void const *buffer); +void blkcache_fill(int iftype, int dev, + lbaint_t start, lbaint_t blkcnt, + unsigned long blksz, void const *buffer);
/** * blkcache_invalidate() - discard the cache for a set of blocks @@ -125,8 +123,7 @@ void blkcache_fill * @param iftype - IF_TYPE_x for type of device * @param dev - device index of particular type */ -void blkcache_invalidate - (int iftype, int dev); +void blkcache_invalidate(int iftype, int dev);
/** * blkcache_configure() - configure block cache @@ -156,21 +153,18 @@ void blkcache_stats(struct block_cache_stats *stats);
#else
-static inline int blkcache_read - (int iftype, int dev, - lbaint_t start, lbaint_t blkcnt, - unsigned long blksz, void *buffer) +static inline int blkcache_read(int iftype, int dev, + lbaint_t start, lbaint_t blkcnt, + unsigned long blksz, void *buffer) { return 0; }
-static inline void blkcache_fill - (int iftype, int dev, - lbaint_t start, lbaint_t blkcnt, - unsigned long blksz, void const *buffer) {} +static inline void blkcache_fill(int iftype, int dev, + lbaint_t start, lbaint_t blkcnt, + unsigned long blksz, void const *buffer) {}
-static inline void blkcache_invalidate - (int iftype, int dev) {} +static inline void blkcache_invalidate(int iftype, int dev) {}
#endif

On Sat, Apr 02, 2016 at 07:37:14AM -0700, Eric Nelson wrote:
Signed-off-by: Eric Nelson eric@nelint.com
Applied to u-boot/master, thanks!

Call blk_dread, blk_dwrite, blk_derase to ensure that the block cache is used if enabled and to remove build breakage when CONFIG_BLK is enabled.
Signed-off-by: Eric Nelson eric@nelint.com --- cmd/mmc.c | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/cmd/mmc.c b/cmd/mmc.c index fb4382e..39ef072 100644 --- a/cmd/mmc.c +++ b/cmd/mmc.c @@ -150,6 +150,7 @@ static struct mmc *init_mmc_device(int dev, bool force_init) printf("no mmc device at slot %x\n", dev); return NULL; } + if (force_init) mmc->has_init = 0; if (mmc_init(mmc)) @@ -345,7 +346,7 @@ static int do_mmc_read(cmd_tbl_t *cmdtp, int flag, printf("\nMMC read: dev # %d, block # %d, count %d ... ", curr_device, blk, cnt);
- n = mmc->block_dev.block_read(&mmc->block_dev, blk, cnt, addr); + n = blk_dread(&mmc->block_dev, blk, cnt, addr); /* flush cache after read */ flush_cache((ulong)addr, cnt * 512); /* FIXME */ printf("%d blocks read: %s\n", n, (n == cnt) ? "OK" : "ERROR"); @@ -377,7 +378,7 @@ static int do_mmc_write(cmd_tbl_t *cmdtp, int flag, printf("Error: card is write protected!\n"); return CMD_RET_FAILURE; } - n = mmc->block_dev.block_write(&mmc->block_dev, blk, cnt, addr); + n = blk_dwrite(&mmc->block_dev, blk, cnt, addr); printf("%d blocks written: %s\n", n, (n == cnt) ? "OK" : "ERROR");
return (n == cnt) ? CMD_RET_SUCCESS : CMD_RET_FAILURE; @@ -405,7 +406,7 @@ static int do_mmc_erase(cmd_tbl_t *cmdtp, int flag, printf("Error: card is write protected!\n"); return CMD_RET_FAILURE; } - n = mmc->block_dev.block_erase(&mmc->block_dev, blk, cnt); + n = blk_derase(&mmc->block_dev, blk, cnt); printf("%d blocks erased: %s\n", n, (n == cnt) ? "OK" : "ERROR");
return (n == cnt) ? CMD_RET_SUCCESS : CMD_RET_FAILURE;

On Sun, Mar 27, 2016 at 12:00:14PM -0700, Eric Nelson wrote:
Call blk_dread, blk_dwrite, blk_derase to ensure that the block cache is used if enabled and to remove build breakage when CONFIG_BLK is enabled.
Signed-off-by: Eric Nelson eric@nelint.com
Reviewed-by: Tom Rini trini@konsulko.com

On Sun, Mar 27, 2016 at 12:00:14PM -0700, Eric Nelson wrote:
Call blk_dread, blk_dwrite, blk_derase to ensure that the block cache is used if enabled and to remove build breakage when CONFIG_BLK is enabled.
Signed-off-by: Eric Nelson eric@nelint.com Reviewed-by: Tom Rini trini@konsulko.com
Applied to u-boot/master, thanks!

Call blk_dread, blk_dwrite, blk_derase to ensure that the block cache is used if enabled and to remove build breakage when CONFIG_BLK is enabled.
Signed-off-by: Eric Nelson eric@nelint.com --- cmd/sata.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/cmd/sata.c b/cmd/sata.c index c8de9a3..8748cce 100644 --- a/cmd/sata.c +++ b/cmd/sata.c @@ -183,7 +183,8 @@ static int do_sata(cmd_tbl_t *cmdtp, int flag, int argc, char * const argv[]) printf("\nSATA read: device %d block # %ld, count %ld ... ", sata_curr_device, blk, cnt);
- n = sata_read(sata_curr_device, blk, cnt, (u32 *)addr); + n = blk_dread(&sata_dev_desc[sata_curr_device], + blk, cnt, (u32 *)addr);
/* flush cache after read */ flush_cache(addr, cnt * sata_dev_desc[sata_curr_device].blksz); @@ -201,7 +202,8 @@ static int do_sata(cmd_tbl_t *cmdtp, int flag, int argc, char * const argv[]) printf("\nSATA write: device %d block # %ld, count %ld ... ", sata_curr_device, blk, cnt);
- n = sata_write(sata_curr_device, blk, cnt, (u32 *)addr); + n = blk_dwrite(&sata_dev_desc[sata_curr_device], + blk, cnt, (u32 *)addr);
printf("%ld blocks written: %s\n", n, (n == cnt) ? "OK" : "ERROR");

On Sun, Mar 27, 2016 at 12:00:15PM -0700, Eric Nelson wrote:
Call blk_dread, blk_dwrite, blk_derase to ensure that the block cache is used if enabled and to remove build breakage when CONFIG_BLK is enabled.
Signed-off-by: Eric Nelson eric@nelint.com
Reviewed-by: Tom Rini trini@konsulko.com

On Sun, Mar 27, 2016 at 12:00:15PM -0700, Eric Nelson wrote:
Call blk_dread, blk_dwrite, blk_derase to ensure that the block cache is used if enabled and to remove build breakage when CONFIG_BLK is enabled.
Signed-off-by: Eric Nelson eric@nelint.com Reviewed-by: Tom Rini trini@konsulko.com
Applied to u-boot/master, thanks!
participants (5)
-
Eric Nelson
-
Eric Nelson
-
Marek Vasut
-
Stephen Warren
-
Tom Rini