[PATCH U-BOOT 0/3] btrfs: fix and improve read code

I'm posting this as a series but this is really three independant patches. The first is a real bug that has made our board unable to boot as it was corrupting the kernel and image CRC validation failed (yay for checksums -- any plan of enabling validation of data checksums too? I see the code is there for metadata already, so it's probably not missing that much, but unfortunately as far as I understand it would only have seen that the extent data was correct and still corrupted the final buffer as it's just an offset error...)
The second patch is an opportunistic optimization as I noticed the code behaved differently than before, and there doesn't seem to be a reason the reads couldn't be coalesced.
The last patch is just something that looked odd and looks "obviously better" but I'll leave that up to you.
Thanks!
Link: https://lore.kernel.org/linux-btrfs/20200525063257.46757-1-wqu@suse.com/ [1] Signed-off-by: Dominique Martinet dominique.martinet@atmark-techno.com --- Dominique Martinet (3): btrfs: fix offset within btrfs_read_extent_reg() btrfs: btrfs_file_read: allow opportunistic read until the end btrfs: btfs_file_read: zero trailing data if no extent was found
fs/btrfs/inode.c | 32 ++++++++++++++++++-------------- 1 file changed, 18 insertions(+), 14 deletions(-) --- base-commit: 5db4972a5bbdbf9e3af48ffc9bc4fec73b7b6a79 change-id: 20230418-btrfs-extent-reads-e2df6e329ad4
Best regards,

From: Dominique Martinet dominique.martinet@atmark-techno.com
btrfs_read_extent_reg correctly computed the extent offset in the BTRFS_COMPRESS_NONE case, but did not account for the 'offset - key.offset' part correctly in the compressed case, making the function read incorrect data.
In the case I examined, the last 4k of a file was corrupted and contained data from a few blocks prior: btrfs_file_read() -> btrfs_read_extent_reg (aligned part loop from 0x40480000 to 0x40ba0000, 128KB at a time) last read had 0x4000 bytes in extent, but aligned_end - cur limited the read to 0x3000 (offset 0x720000) -> read_and_truncate_page -> btrfs_read_extent_reg reading the last 0x1000 bytes from offset 0x73000 (end of the previous 0x4000) into 0x40ba3000 here key.offset = 0x70000 so we need to use it to recover the 0x3000 offset.
Confirmed by checking data, before patch: u-boot=> mmc load 1:1 $loadaddr boot/uImage u-boot=> md 0x40ba0000 40ba0000: c0232ff8 c0c722cb 030e94be bf10000c ./#..".......... u-boot=> md 0x40ba3000 40ba3000: c0232ff8 c0c722cb 030e94be bf10000c ./#.."..........
After patch (also confirmed with data read from linux): u-boot=> md 0x40ba3000 40ba3000: 64cc9f03 81142256 6910000c 0c03483c ...dV".....i<H..
Note that the code previously (before commit e3427184f38a ("fs: btrfs: Implement btrfs_file_read()")) did not split that read in two, so this is a regression.
Fixes: a26a6bedafcf ("fs: btrfs: Introduce btrfs_read_extent_inline() and btrfs_read_extent_reg()") Signed-off-by: Dominique Martinet dominique.martinet@atmark-techno.com --- fs/btrfs/inode.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 40025662f250..3d6e39e6544d 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -443,6 +443,8 @@ int btrfs_read_extent_reg(struct btrfs_path *path, IS_ALIGNED(len, fs_info->sectorsize)); ASSERT(offset >= key.offset && offset + len <= key.offset + extent_num_bytes); + /* offset is now offset within extent */ + offset = btrfs_file_extent_offset(leaf, fi) + offset - key.offset;
/* Preallocated or hole , fill @dest with zero */ if (btrfs_file_extent_type(leaf, fi) == BTRFS_FILE_EXTENT_PREALLOC || @@ -454,9 +456,7 @@ int btrfs_read_extent_reg(struct btrfs_path *path, if (btrfs_file_extent_compression(leaf, fi) == BTRFS_COMPRESS_NONE) { u64 logical;
- logical = btrfs_file_extent_disk_bytenr(leaf, fi) + - btrfs_file_extent_offset(leaf, fi) + - offset - key.offset; + logical = btrfs_file_extent_disk_bytenr(leaf, fi) + offset; read = len;
num_copies = btrfs_num_copies(fs_info, logical, len); @@ -511,7 +511,7 @@ int btrfs_read_extent_reg(struct btrfs_path *path, if (ret < dsize) memset(dbuf + ret, 0, dsize - ret); /* Then copy the needed part */ - memcpy(dest, dbuf + btrfs_file_extent_offset(leaf, fi), len); + memcpy(dest, dbuf + offset, len); ret = len; out: free(cbuf);

On 2023/4/18 09:17, Dominique Martinet wrote:
From: Dominique Martinet dominique.martinet@atmark-techno.com
The subject can be changed to "btrfs: fix offset when reading compressed extents". The original one is a little too generic.
btrfs_read_extent_reg correctly computed the extent offset in the BTRFS_COMPRESS_NONE case, but did not account for the 'offset - key.offset' part correctly in the compressed case, making the function read incorrect data.
In the case I examined, the last 4k of a file was corrupted and contained data from a few blocks prior: btrfs_file_read() -> btrfs_read_extent_reg (aligned part loop from 0x40480000 to 0x40ba0000, 128KB at a time) last read had 0x4000 bytes in extent, but aligned_end - cur limited the read to 0x3000 (offset 0x720000) -> read_and_truncate_page -> btrfs_read_extent_reg reading the last 0x1000 bytes from offset 0x73000 (end of the previous 0x4000) into 0x40ba3000 here key.offset = 0x70000 so we need to use it to recover the 0x3000 offset.
You can use "btrfs ins dump-tree" to provide a much easier to read on-disk data layout.
And you can also add a smaller reproducer.
Confirmed by checking data, before patch: u-boot=> mmc load 1:1 $loadaddr boot/uImage u-boot=> md 0x40ba0000 40ba0000: c0232ff8 c0c722cb 030e94be bf10000c ./#..".......... u-boot=> md 0x40ba3000 40ba3000: c0232ff8 c0c722cb 030e94be bf10000c ./#.."..........
After patch (also confirmed with data read from linux): u-boot=> md 0x40ba3000 40ba3000: 64cc9f03 81142256 6910000c 0c03483c ...dV".....i<H..
Note that the code previously (before commit e3427184f38a ("fs: btrfs: Implement btrfs_file_read()")) did not split that read in two, so this is a regression.
Fixes: a26a6bedafcf ("fs: btrfs: Introduce btrfs_read_extent_inline() and btrfs_read_extent_reg()") Signed-off-by: Dominique Martinet dominique.martinet@atmark-techno.com
fs/btrfs/inode.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 40025662f250..3d6e39e6544d 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -443,6 +443,8 @@ int btrfs_read_extent_reg(struct btrfs_path *path, IS_ALIGNED(len, fs_info->sectorsize)); ASSERT(offset >= key.offset && offset + len <= key.offset + extent_num_bytes);
- /* offset is now offset within extent */
- offset = btrfs_file_extent_offset(leaf, fi) + offset - key.offset;
I prefer not to use the @offset, explained later.
/* Preallocated or hole , fill @dest with zero */ if (btrfs_file_extent_type(leaf, fi) == BTRFS_FILE_EXTENT_PREALLOC || @@ -454,9 +456,7 @@ int btrfs_read_extent_reg(struct btrfs_path *path, if (btrfs_file_extent_compression(leaf, fi) == BTRFS_COMPRESS_NONE) { u64 logical;
logical = btrfs_file_extent_disk_bytenr(leaf, fi) +
btrfs_file_extent_offset(leaf, fi) +
offset - key.offset;
logical = btrfs_file_extent_disk_bytenr(leaf, fi) + offset;
This is touching non-compressed path, which is very weird as your commit message said this part is correct.
I prefer the original one to show everything we need to take into consideration.
Otherwise looks good to me.
Thanks, Qu
read = len; num_copies = btrfs_num_copies(fs_info, logical, len);
@@ -511,7 +511,7 @@ int btrfs_read_extent_reg(struct btrfs_path *path, if (ret < dsize) memset(dbuf + ret, 0, dsize - ret); /* Then copy the needed part */
- memcpy(dest, dbuf + btrfs_file_extent_offset(leaf, fi), len);
- memcpy(dest, dbuf + offset, len); ret = len; out: free(cbuf);

Qu Wenruo wrote on Tue, Apr 18, 2023 at 09:58:37AM +0800:
The subject can be changed to "btrfs: fix offset when reading compressed extents". The original one is a little too generic.
Ok.
btrfs_file_read() -> btrfs_read_extent_reg (aligned part loop from 0x40480000 to 0x40ba0000, 128KB at a time) last read had 0x4000 bytes in extent, but aligned_end - cur limited the read to 0x3000 (offset 0x720000) -> read_and_truncate_page -> btrfs_read_extent_reg reading the last 0x1000 bytes from offset 0x73000 (end of the previous 0x4000) into 0x40ba3000 here key.offset = 0x70000 so we need to use it to recover the 0x3000 offset.
You can use "btrfs ins dump-tree" to provide a much easier to read on-disk data layout.
key.offset is the offset within the read call, not the offset on disk. The file on disk looks perfectly normal, the condition to trigger the bug is to have a file which size is not sector-aligned and where the last extent is bigger than a sector.
I had a look at dump-tree early on and couldn't actually find my file in there, now the problem is understood it should be easy to make a reproducer so I'll add this info and commands needed to reproduce (+ a link to a fs image just in case) in v2
/* Preallocated or hole , fill @dest with zero */ if (btrfs_file_extent_type(leaf, fi) == BTRFS_FILE_EXTENT_PREALLOC || @@ -454,9 +456,7 @@ int btrfs_read_extent_reg(struct btrfs_path *path, if (btrfs_file_extent_compression(leaf, fi) == BTRFS_COMPRESS_NONE) { u64 logical;
logical = btrfs_file_extent_disk_bytenr(leaf, fi) +
btrfs_file_extent_offset(leaf, fi) +
offset - key.offset;
logical = btrfs_file_extent_disk_bytenr(leaf, fi) + offset;
This is touching non-compressed path, which is very weird as your commit message said this part is correct.
my rationale is that since this was considered once but forgotten the other time it'll be easy to add yet another code path that forgets it later, but I guess it won't change much anyway -- I'll fix the patch making it explicit again.
Thanks,

On 2023/4/18 10:05, Dominique Martinet wrote:
Qu Wenruo wrote on Tue, Apr 18, 2023 at 09:58:37AM +0800:
The subject can be changed to "btrfs: fix offset when reading compressed extents". The original one is a little too generic.
Ok.
btrfs_file_read() -> btrfs_read_extent_reg (aligned part loop from 0x40480000 to 0x40ba0000, 128KB at a time) last read had 0x4000 bytes in extent, but aligned_end - cur limited the read to 0x3000 (offset 0x720000) -> read_and_truncate_page -> btrfs_read_extent_reg reading the last 0x1000 bytes from offset 0x73000 (end of the previous 0x4000) into 0x40ba3000 here key.offset = 0x70000 so we need to use it to recover the 0x3000 offset.
You can use "btrfs ins dump-tree" to provide a much easier to read on-disk data layout.
key.offset is the offset within the read call, not the offset on disk. The file on disk looks perfectly normal, the condition to trigger the bug is to have a file which size is not sector-aligned and where the last extent is bigger than a sector.
Yes, I understand the runtime offset is involved.
But you're still trying to explain the situation with involved on-disk file extent, right?
In that case it's way easier to go something like this:
We can have a compressed file extent like this:
item 6 key (257 EXTENT_DATA 0) itemoff 15816 itemsize 53 generation 7 type 1 (regular) extent data disk byte 13631488 nr 4096 extent data offset 0 nr 32768 ram 32768 extent compression 1 (zlib)
Then if we try to read file range [4K, 8K) of inode 257 in Uboot, then we got corrupted data.
At least that's the preferred way to explain in btrfs community (with on-disk thus static part, then runtime part).
I had a look at dump-tree early on and couldn't actually find my file in there, now the problem is understood it should be easy to make a reproducer so I'll add this info and commands needed to reproduce (+ a link to a fs image just in case) in v2
The reproducer is super easy.
# mkfs.btrfs -f $dev # mount $dev -o compress $mnt # xfs_io -f -c "pwrite -i /dev/urandom 0 16K" $mnt/file # umount $mnt
Then in uboot # load host 0 $addr file 0x1000 0x1000 # md $addr
And compare to regular xxd from kernel.
/* Preallocated or hole , fill @dest with zero */ if (btrfs_file_extent_type(leaf, fi) == BTRFS_FILE_EXTENT_PREALLOC ||
@@ -454,9 +456,7 @@ int btrfs_read_extent_reg(struct btrfs_path *path, if (btrfs_file_extent_compression(leaf, fi) == BTRFS_COMPRESS_NONE) { u64 logical;
logical = btrfs_file_extent_disk_bytenr(leaf, fi) +
btrfs_file_extent_offset(leaf, fi) +
offset - key.offset;
logical = btrfs_file_extent_disk_bytenr(leaf, fi) + offset;
This is touching non-compressed path, which is very weird as your commit message said this part is correct.
my rationale is that since this was considered once but forgotten the other time it'll be easy to add yet another code path that forgets it later, but I guess it won't change much anyway -- I'll fix the patch making it explicit again.
OK, that makes sense now.
Thanks, Qu
Thanks,

On 2023/4/18 10:05, Dominique Martinet wrote:
Qu Wenruo wrote on Tue, Apr 18, 2023 at 09:58:37AM +0800:
The subject can be changed to "btrfs: fix offset when reading compressed extents". The original one is a little too generic.
Ok.
btrfs_file_read() -> btrfs_read_extent_reg (aligned part loop from 0x40480000 to 0x40ba0000, 128KB at a time) last read had 0x4000 bytes in extent, but aligned_end - cur limited the read to 0x3000 (offset 0x720000) -> read_and_truncate_page -> btrfs_read_extent_reg reading the last 0x1000 bytes from offset 0x73000 (end of the previous 0x4000) into 0x40ba3000 here key.offset = 0x70000 so we need to use it to recover the 0x3000 offset.
You can use "btrfs ins dump-tree" to provide a much easier to read on-disk data layout.
key.offset is the offset within the read call, not the offset on disk. The file on disk looks perfectly normal, the condition to trigger the bug is to have a file which size is not sector-aligned and where the last extent is bigger than a sector.
Forgot to mention, this bug does not only affect the mentioned case, it affects all partial read of an extent.
Even if it's sector aligned.
I had a look at dump-tree early on and couldn't actually find my file in there, now the problem is understood it should be easy to make a reproducer so I'll add this info and commands needed to reproduce (+ a link to a fs image just in case) in v2
/* Preallocated or hole , fill @dest with zero */ if (btrfs_file_extent_type(leaf, fi) == BTRFS_FILE_EXTENT_PREALLOC ||
@@ -454,9 +456,7 @@ int btrfs_read_extent_reg(struct btrfs_path *path, if (btrfs_file_extent_compression(leaf, fi) == BTRFS_COMPRESS_NONE) { u64 logical;
logical = btrfs_file_extent_disk_bytenr(leaf, fi) +
btrfs_file_extent_offset(leaf, fi) +
offset - key.offset;
logical = btrfs_file_extent_disk_bytenr(leaf, fi) + offset;
This is touching non-compressed path, which is very weird as your commit message said this part is correct.
my rationale is that since this was considered once but forgotten the other time it'll be easy to add yet another code path that forgets it later, but I guess it won't change much anyway -- I'll fix the patch making it explicit again.
Thanks,

From: Dominique Martinet dominique.martinet@atmark-techno.com
btrfs_file_read main 'aligned read' loop would limit the last read to the aligned end even if the data is present in the extent: there is no reason not to read the data that we know will be presented correctly.
If that somehow fails cur only advances up to the aligned_end anyway and the file tail is read through the old code unchanged; this could be required if e.g. the end data is inlined.
Signed-off-by: Dominique Martinet dominique.martinet@atmark-techno.com --- fs/btrfs/inode.c | 17 +++++++++-------- 1 file changed, 9 insertions(+), 8 deletions(-)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 3d6e39e6544d..efffec0f2e68 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -663,7 +663,8 @@ int btrfs_file_read(struct btrfs_root *root, u64 ino, u64 file_offset, u64 len, struct btrfs_path path; struct btrfs_key key; u64 aligned_start = round_down(file_offset, fs_info->sectorsize); - u64 aligned_end = round_down(file_offset + len, fs_info->sectorsize); + u64 end = file_offset + len; + u64 aligned_end = round_down(end, fs_info->sectorsize); u64 next_offset; u64 cur = aligned_start; int ret = 0; @@ -743,26 +744,26 @@ int btrfs_file_read(struct btrfs_root *root, u64 ino, u64 file_offset, u64 len, extent_num_bytes = btrfs_file_extent_num_bytes(path.nodes[0], fi); ret = btrfs_read_extent_reg(&path, fi, cur, - min(extent_num_bytes, aligned_end - cur), + min(extent_num_bytes, end - cur), dest + cur - file_offset); if (ret < 0) goto out; - cur += min(extent_num_bytes, aligned_end - cur); + cur += min(extent_num_bytes, end - cur); }
/* Read the tailing unaligned part*/ - if (file_offset + len != aligned_end) { + if (file_offset + len != cur) { btrfs_release_path(&path); - ret = lookup_data_extent(root, &path, ino, aligned_end, + ret = lookup_data_extent(root, &path, ino, cur, &next_offset); /* <0 is error, >0 means no extent */ if (ret) goto out; fi = btrfs_item_ptr(path.nodes[0], path.slots[0], struct btrfs_file_extent_item); - ret = read_and_truncate_page(&path, fi, aligned_end, - file_offset + len - aligned_end, - dest + aligned_end - file_offset); + ret = read_and_truncate_page(&path, fi, cur, + file_offset + len - cur, + dest + cur - file_offset); } out: btrfs_release_path(&path);

On 2023/4/18 09:17, Dominique Martinet wrote:
From: Dominique Martinet dominique.martinet@atmark-techno.com
btrfs_file_read main 'aligned read' loop would limit the last read to the aligned end even if the data is present in the extent: there is no reason not to read the data that we know will be presented correctly.
If that somehow fails cur only advances up to the aligned_end anyway and the file tail is read through the old code unchanged; this could be required if e.g. the end data is inlined.
Signed-off-by: Dominique Martinet dominique.martinet@atmark-techno.com
fs/btrfs/inode.c | 17 +++++++++-------- 1 file changed, 9 insertions(+), 8 deletions(-)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 3d6e39e6544d..efffec0f2e68 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -663,7 +663,8 @@ int btrfs_file_read(struct btrfs_root *root, u64 ino, u64 file_offset, u64 len, struct btrfs_path path; struct btrfs_key key; u64 aligned_start = round_down(file_offset, fs_info->sectorsize);
- u64 aligned_end = round_down(file_offset + len, fs_info->sectorsize);
- u64 end = file_offset + len;
- u64 aligned_end = round_down(end, fs_info->sectorsize); u64 next_offset; u64 cur = aligned_start; int ret = 0;
@@ -743,26 +744,26 @@ int btrfs_file_read(struct btrfs_root *root, u64 ino, u64 file_offset, u64 len, extent_num_bytes = btrfs_file_extent_num_bytes(path.nodes[0], fi); ret = btrfs_read_extent_reg(&path, fi, cur,
min(extent_num_bytes, aligned_end - cur),
if (ret < 0) goto out;min(extent_num_bytes, end - cur), dest + cur - file_offset);
cur += min(extent_num_bytes, aligned_end - cur);
cur += min(extent_num_bytes, end - cur);
}
/* Read the tailing unaligned part*/
Can we remove this part completely?
IIRC if we read until the target end, the unaligned end part can be completely removed then.
Thanks, Qu
- if (file_offset + len != aligned_end) {
- if (file_offset + len != cur) { btrfs_release_path(&path);
ret = lookup_data_extent(root, &path, ino, aligned_end,
/* <0 is error, >0 means no extent */ if (ret) goto out; fi = btrfs_item_ptr(path.nodes[0], path.slots[0], struct btrfs_file_extent_item);ret = lookup_data_extent(root, &path, ino, cur, &next_offset);
ret = read_and_truncate_page(&path, fi, aligned_end,
file_offset + len - aligned_end,
dest + aligned_end - file_offset);
ret = read_and_truncate_page(&path, fi, cur,
file_offset + len - cur,
} out: btrfs_release_path(&path);dest + cur - file_offset);

Qu Wenruo wrote on Tue, Apr 18, 2023 at 10:02:00AM +0800:
/* Read the tailing unaligned part*/
Can we remove this part completely?
IIRC if we read until the target end, the unaligned end part can be completely removed then.
The "Read the aligned part" loop stops at aligned_end:
while (cur < aligned_end)
So it should be possible that the last aligned extent we consider does not contain data until the end, e.g. an offset that ends with the aligned end:
0 4096 4123 [extent1-----|extent2]
In this case the main loop will only read extent1 and we need the "trailing unaligned part" if for extent2.
I have a feeling the loop could just be updated to go to the end `while (cur < end)` as it doesn't seem to care about the end alignment... Should I update v2 to do that instead?
This made me look at the "Read out the leading unaligned part" initial part, and its check only looks at the sector size but it probably has the same problem -- we want to make sure we read any leftover from a previous extent e.g. this file: 0 4096 8192 [extent1------------------|extent2] and reading from 4k with a sectorsize of 4k is probably bad will enter the aligned main loop right away... And I think that'll fail?... Actually not quite sure what's expecting what to be aligned in there, but I just tried some partial reads from non-zero offsets and my board resets almost all the time so I guess I've found something else to dig into. This isn't a priority for me right now but I'll look a bit more when I have more time if you haven't first.

On 2023/4/18 10:41, Dominique Martinet wrote:
Qu Wenruo wrote on Tue, Apr 18, 2023 at 10:02:00AM +0800:
/* Read the tailing unaligned part*/
Can we remove this part completely?
IIRC if we read until the target end, the unaligned end part can be completely removed then.
The "Read the aligned part" loop stops at aligned_end:
while (cur < aligned_end)
So it should be possible that the last aligned extent we consider does not contain data until the end, e.g. an offset that ends with the aligned end:
0 4096 4123 [extent1-----|extent2]
In this case the main loop will only read extent1 and we need the "trailing unaligned part" if for extent2.
I have a feeling the loop could just be updated to go to the end `while (cur < end)` as it doesn't seem to care about the end alignment... Should I update v2 to do that instead?
Yeah, it would be very awesome if you can remove the tailing part completely.
This made me look at the "Read out the leading unaligned part" initial part, and its check only looks at the sector size but it probably has the same problem -- we want to make sure we read any leftover from a previous extent e.g. this file:
If you're talking about the same problem mentioned in patch 1, then yes, any read in the middle of an extent would cause problems.
No matter if it's aligned or not, as btrfs would convert any unaligned part into aligned read.
But if we fix the bug you mentioned, I see nothing can going wrong.
Unaligned read is converted into aligned one (then copy the target range into the dest buf), and aligned part (including the tailing unaligned part) would be properly handled then.
0 4096 8192 [extent1------------------|extent2] and reading from 4k with a sectorsize of 4k is probably bad will enter the aligned main loop right away... And I think that'll fail?... Actually not quite sure what's expecting what to be aligned in there, but I just tried some partial reads from non-zero offsets and my board resets almost all the time so I guess I've found something else to dig into. This isn't a priority for me right now but I'll look a bit more when I have more time if you haven't first.
My initial plan is to merge the tailing part into the aligned loop, but since you're already working in this part, feel free to do it.
And I'm very happy to provide any help on this.
Thanks, Qu

Qu Wenruo wrote on Tue, Apr 18, 2023 at 10:53:41AM +0800:
I have a feeling the loop could just be updated to go to the end `while (cur < end)` as it doesn't seem to care about the end alignment... Should I update v2 to do that instead?
Yeah, it would be very awesome if you can remove the tailing part completely.
Ok, will give it a try. I'll want to test this a bit so might take a day or two as I have other work to finish first.
This made me look at the "Read out the leading unaligned part" initial part, and its check only looks at the sector size but it probably has the same problem -- we want to make sure we read any leftover from a previous extent e.g. this file:
If you're talking about the same problem mentioned in patch 1, then yes, any read in the middle of an extent would cause problems.
No, was just thinking the leading part being a separate loop doesn't seem to make sense either as the code shouldn't care about sector size alignemnt but about full extents. If the main loop handles everything correctly then the leading if can also be removed along with "read_and_truncate_page" that would no longer be used.
I'll give this a try as well and report back.
No matter if it's aligned or not, as btrfs would convert any unaligned part into aligned read.
Yes I don't see where it would fail (except that my board does crash), I guess that at this point I should spend some time building a qemu u-boot and hooking up gdb will be faster..
My initial plan is to merge the tailing part into the aligned loop, but since you're already working in this part, feel free to do it.
Yes, sure -- removing the if is easy, I'd just rather not make it fail for someone else :)

On 2023/4/18 11:07, Dominique Martinet wrote:
Qu Wenruo wrote on Tue, Apr 18, 2023 at 10:53:41AM +0800:
I have a feeling the loop could just be updated to go to the end `while (cur < end)` as it doesn't seem to care about the end alignment... Should I update v2 to do that instead?
Yeah, it would be very awesome if you can remove the tailing part completely.
Ok, will give it a try. I'll want to test this a bit so might take a day or two as I have other work to finish first.
This made me look at the "Read out the leading unaligned part" initial part, and its check only looks at the sector size but it probably has the same problem -- we want to make sure we read any leftover from a previous extent e.g. this file:
If you're talking about the same problem mentioned in patch 1, then yes, any read in the middle of an extent would cause problems.
No, was just thinking the leading part being a separate loop doesn't seem to make sense either as the code shouldn't care about sector size alignemnt but about full extents.
The main concern related to the leading unaligned part is, we need to skip something unaligned from the beginning , while all other situations never need to skip such case (they at most skip the tailing part).
Maybe we can do some extra calculation to handle it, but I'm not in a hurry to address the leading part.
Another thing is, for all other fs-ish interface (kernel/fuse etc), they all read/write in certain block size, then handle unaligned part from a higher layer. Thus I prefer to have most things handled in an aligned fashion, and only handle the unaligned part specially.
But if you can find an elegant way to handle all cases, that would be really awesome!
Thanks, Qu
If the main loop handles everything correctly then the leading if can also be removed along with "read_and_truncate_page" that would no longer be used.
I'll give this a try as well and report back.
No matter if it's aligned or not, as btrfs would convert any unaligned part into aligned read.
Yes I don't see where it would fail (except that my board does crash), I guess that at this point I should spend some time building a qemu u-boot and hooking up gdb will be faster..
My initial plan is to merge the tailing part into the aligned loop, but since you're already working in this part, feel free to do it.
Yes, sure -- removing the if is easy, I'd just rather not make it fail for someone else :)

Qu Wenruo wrote on Tue, Apr 18, 2023 at 11:21:00AM +0800:
No, was just thinking the leading part being a separate loop doesn't seem to make sense either as the code shouldn't care about sector size alignemnt but about full extents.
The main concern related to the leading unaligned part is, we need to skip something unaligned from the beginning , while all other situations never need to skip such case (they at most skip the tailing part).
Ok, there is one exception for inline extents apparently.. But I'm not still not convinced the `aligned_start != file_offset` check is enough for that either; I'd say it's unlikely but the inline part can be compressed, so we could have a file which has > 4k (sectorsize) of expanded data, so a read from the 4k offset would skip the special handling and fail (reading the whole extent in dest)
Even if that's not possible, reading just the first 10 bytes of an inline extent will be aligned and go through the main loop which just reads the whole extent, so it'll need the same handling as the regular btrfs_read_extent_reg handling at which point it might just as well handle start offset too.
That aside taking the loop in order: - lookup_data_extent doesn't care (used in heading/tail) - skipping holes don't care as they explicitely put cursor at start of next extent (or bail out if nothing next) - inline needs fixing anyway as said above - PREALLOC or nothing on disk also goes straight to next and is ok - ah, I see what you meant now, we need to substract the current position within the extent to extent_num_bytes... That's also already a problem, though; to take the same example: 0 8k 16k [extent1 | extent2 ... ] reading from 4k onwards will try to read min(extent_num_bytes, end-cur) = min(8k, 12k) = 8k from the 4k offset which goes over the end of the extent.
That could actually be my resets from the previous mail.
So either the first check should just lookup the extent and check that extent start matches current offset instead of checking for sectorsize alignment, or we can just fix the loop and remove the first if.

Dominique Martinet wrote on Tue, Apr 18, 2023 at 12:53:35PM +0900:
Ok, there is one exception for inline extents apparently.. But I'm not still not convinced the `aligned_start != file_offset` check is enough for that either; I'd say it's unlikely but the inline part can be compressed, so we could have a file which has > 4k (sectorsize) of expanded data, so a read from the 4k offset would skip the special handling and fail (reading the whole extent in dest)
(Wasn't able to create such a file, I assume that means the uncompressed data must fits in a page -- if we deal with arm machines with 16k or 64k page size that'll probably change things again but I'll just pretend sector size will magically match in this case..)
Even if that's not possible, reading just the first 10 bytes of an inline extent will be aligned and go through the main loop which just reads the whole extent, so it'll need the same handling as the regular btrfs_read_extent_reg handling at which point it might just as well handle start offset too.
Question regarding inline extent: the main loop goes straight out after reading an inline extent, assuming it is always alone (or last) -- is that correct? By playing with `filefrag -v` I could sometimes see files that temporarily have inline + a delalloc extent, but wasn't able to make a file that kept the inline extent after appending more data, so I guess it is also sane enough... Just making it continue and letting the loop end seems just as simple now there is no trailing if, but if inline extents cannot be mixed I'll be happy to keep it going out.
back to btrfs_read_extent_reg: merging the tail back into the main loop breaks the first assert (IS_ALIGNED(len, fs_info->sectorsize) in particular). The old loop invocation made sure it was aligned and read_and_truncate_page used to take care of calling it with a bigger buffer when it was not.
I was only looking at the compressed path and that does not care about 'len' alignment because it makes an intermediate copy for decompression, but the BTRFS_COMPRESS_NONE's read_extent_data might care? I didn't see anything that actually requires alignment there (len in particular should be ok, but even offset->logical seems to properly be used as "being part of a range" in lookups so alignment doesn't actually matter), but if this isn't tested I can understand wanting to be more careful there.
Ok so this is rightfully less obvious than I had first assumed -- sorry for rushing in. Let's do baby steps: - I'll resend just the first patch shortly, it's a real fix and I'll be using it right away. - I'll double check 'len' doesn't need to be aligned in btrfs_read_extent_reg and just rework the main loop to allow removing the tail exception, that'll avoid a double read in many cases and I think it's worth doing. Might take a bit more time as I want to finish some other work, let's say a few days. - That leaves the two issues I brought up with the main loop: * inline case ignoring end; this is minor enough but easy to fix * btrfs_read_extent_reg() assuming start of extent in its length agument; I believe it'll be easier to stop trying to set a upper limit in the main loop and just have btrfs_read_extent_reg() do it itself, we can use its return value to know how much was actually read. (Probably just substract offset - key.offset to extent_num_bytes to have a new cap, but I still don't understand btrfs_file_extent_offset() in all of this as it's always 0 when I looked) Will try to do that over the next few weeks, but if you want to look at it feel free to do this before me. - At this point it might be worth considering removing the initial check as that also makes an extra small read in the unaligned case, but it's not a bug and can wait.
What do you think?

On 2023/4/18 11:53, Dominique Martinet wrote:
Qu Wenruo wrote on Tue, Apr 18, 2023 at 11:21:00AM +0800:
No, was just thinking the leading part being a separate loop doesn't seem to make sense either as the code shouldn't care about sector size alignemnt but about full extents.
The main concern related to the leading unaligned part is, we need to skip something unaligned from the beginning , while all other situations never need to skip such case (they at most skip the tailing part).
Ok, there is one exception for inline extents apparently.. But I'm not still not convinced the `aligned_start != file_offset` check is enough for that either; I'd say it's unlikely but the inline part can be compressed, so we could have a file which has > 4k (sectorsize) of expanded data, so a read from the 4k offset would skip the special handling and fail (reading the whole extent in dest)
Btrfs inline has a limit to sectorsize.
That means, inlined compressed extent can at most be 4K sized (if 4K is our sector size).
So that won't be a problem.
Even if that's not possible, reading just the first 10 bytes of an inline extent will be aligned and go through the main loop which just reads the whole extent, so it'll need the same handling as the regular btrfs_read_extent_reg handling at which point it might just as well handle start offset too.
If we just read 10 bytes, the aligned part should handle it well.
My real concern is what if we read 10 bytes at offset 10 bytes.
If this can be handled in the same way of aligned read (and still be reasonable readable), then it would be awesome.
That aside taking the loop in order:
- lookup_data_extent doesn't care (used in heading/tail)
- skipping holes don't care as they explicitely put cursor at start of
next extent (or bail out if nothing next)
- inline needs fixing anyway as said above
- PREALLOC or nothing on disk also goes straight to next and is ok
- ah, I see what you meant now, we need to substract the current
position within the extent to extent_num_bytes... That's also already a problem, though; to take the same example: 0 8k 16k [extent1 | extent2 ... ] reading from 4k onwards will try to read min(extent_num_bytes, end-cur) = min(8k, 12k) = 8k from the 4k offset which goes over the end of the extent.
That's indeed a problem.
As most of the Uboot fs drivers only care read the whole file, never really utilize the ability to read part of the file, that path is not properly tested. (Some driver, IIRC ubifs?, doesn't even allow read with non-zero offset)
Thanks, Qu
That could actually be my resets from the previous mail.
So either the first check should just lookup the extent and check that extent start matches current offset instead of checking for sectorsize alignment, or we can just fix the loop and remove the first if.

From: Dominique Martinet dominique.martinet@atmark-techno.com
btfs_file_read's truncate path has a comment noting '>0 means no extent' and bailing out immediately, but the buffer has not been written so probably needs zeroing out.
This is a theorical fix only and hasn't been tested on a file that actually runs this code path.
Signed-off-by: Dominique Martinet dominique.martinet@atmark-techno.com --- fs/btrfs/inode.c | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index efffec0f2e68..23c006c98c3b 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -756,9 +756,12 @@ int btrfs_file_read(struct btrfs_root *root, u64 ino, u64 file_offset, u64 len, btrfs_release_path(&path); ret = lookup_data_extent(root, &path, ino, cur, &next_offset); - /* <0 is error, >0 means no extent */ - if (ret) + /* <0 is error, >0 means no extent: zero end of buffer */ + if (ret) { + if (ret > 0) + memset(dest + cur, 0, end - cur); goto out; + } fi = btrfs_item_ptr(path.nodes[0], path.slots[0], struct btrfs_file_extent_item); ret = read_and_truncate_page(&path, fi, cur,

On 2023/4/18 09:17, Dominique Martinet wrote:
From: Dominique Martinet dominique.martinet@atmark-techno.com
btfs_file_read's truncate path has a comment noting '>0 means no extent' and bailing out immediately, but the buffer has not been written so probably needs zeroing out.
This is a theorical fix only and hasn't been tested on a file that actually runs this code path.
IIRC there is a memset() at the very beginning of btrfs_file_read() to set the whole dest memory to zero.
This is to handle cases like NO_HOLE cases, which we can skip hole extents.
Thanks, Qu
Signed-off-by: Dominique Martinet dominique.martinet@atmark-techno.com
fs/btrfs/inode.c | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index efffec0f2e68..23c006c98c3b 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -756,9 +756,12 @@ int btrfs_file_read(struct btrfs_root *root, u64 ino, u64 file_offset, u64 len, btrfs_release_path(&path); ret = lookup_data_extent(root, &path, ino, cur, &next_offset);
/* <0 is error, >0 means no extent */
if (ret)
/* <0 is error, >0 means no extent: zero end of buffer */
if (ret) {
if (ret > 0)
memset(dest + cur, 0, end - cur); goto out;
fi = btrfs_item_ptr(path.nodes[0], path.slots[0], struct btrfs_file_extent_item); ret = read_and_truncate_page(&path, fi, cur,}

Qu Wenruo wrote on Tue, Apr 18, 2023 at 10:04:56AM +0800:
This is a theorical fix only and hasn't been tested on a file that actually runs this code path.
IIRC there is a memset() at the very beginning of btrfs_file_read() to set the whole dest memory to zero.
Right, sorry.
I'll drop this and send a new patch that removes the duplicate memset (at "The whole file is a hole" comment) instead as seeing multiple memsets is what made me think it'd be necessary without thinking; that can be done even if we rework the function a bit in later cleanups...
participants (3)
-
Dominique Martinet
-
Dominique Martinet
-
Qu Wenruo