[U-Boot] Bricked when trying to attach UBI

Luca Ceresoli

19 Dec 2012

12:28 p.m.

Hi all,

I am facing a problem with some boards that stop booting after some weeks or months of normal usage because they are unable to attach UBI. They do not boot even after a power cycle; in other words, they are totally bricked. I don't know exactly what problem UBI has, but Linux can recover from it while U-Boot apparently cannot.

The boards are DIG297 (dig297 board in mainline U-Boot), based on OMAP3530 and equipped with a NAND flash (Micron MT29F2G16ABBEAHC) as their only permanent storage.

U-Boot v2012.04.01 starts correctly. The bootcmd tries to load the kernel from UBI, starting with the following commands:

    echo Booting from nand ...
    setenv bootargs console=ttyO2,115200n8 mtdparts=omap2-nand.0:768k(uboot),128k(reserved),128k(uboot-env),-(ubi) ubi.mtd=3 root=ubi0:rootfs ro rootfstype=ubifs ip=....
    ubi part nand0,3
    ...

On "bricked" devices the output of the "ubi part nand0,3" command is:

    Creating 1 MTD partitions on "nand0":
    0x000000100000-0x000010000000 : "mtd=3"
    UBI: attaching mtd1 to ubi0
    UBI: physical eraseblock size: 131072 bytes (128 KiB)
    UBI: logical eraseblock size: 129024 bytes
    UBI: smallest flash I/O unit: 2048
    UBI: sub-page size: 512
    UBI: VID header offset: 512 (aligned 512)
    UBI: data offset: 2048
    UBI error: ubi_wl_init_scan: no enough physical eraseblocks (0, need 1)

Now the device is totally blocked, and power cycling does not change the result.
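As a cross-check of the numbers in the bootargs and in the log above, here is a minimal sketch of the flash geometry arithmetic. The helper names are made up for illustration and are not part of U-Boot or Linux; the only inputs are values printed above.

```c
#include <stdint.h>

enum { KIB = 1024 };

/* Start of the 4th mtdparts entry: the three fixed-size partitions before it. */
static uint32_t ubi_part_offset(void)
{
    return (768 + 128 + 128) * KIB;
}

/* Number of physical eraseblocks (PEBs) in the range [start, end). */
static uint32_t peb_count(uint32_t start, uint32_t end, uint32_t peb_size)
{
    return (end - start) / peb_size;
}

/* Logical eraseblock size: what is left of a PEB after the UBI headers. */
static uint32_t leb_size(uint32_t peb_size, uint32_t data_offset)
{
    return peb_size - data_offset;
}
```

With the values from the log, ubi_part_offset() gives 0x100000 (the start of the "mtd=3" range), peb_count(0x100000, 0x10000000, 131072) gives 2040, and leb_size(131072, 2048) gives the 129024 bytes reported by UBI.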

The interesting thing is that if I load Linux (2.6.37 + OMAP patches + board support patches) via TFTP and boot it with bootm, it correctly attaches UBI (fixing any problem it may have) and boots correctly. After that the board is unbricked: U-Boot can boot again normally from NAND.

Without the ambition of understanding all the UBI internals, I visually inspected the UBI code around the line where the error is produced and compared it to the corresponding Linux sources. They looked extremely similar, so I have no obvious hint as to why U-Boot and Linux produce different results.

I also tried with an updated U-Boot master, but the error is still there.

Obviously I have changed nothing in the UBI and MTD code, both in U-Boot and in Linux.

Can you suggest a proper way to track the root of the problem, or to bypass it?

Big thanks in advance,

Luca

Andreas Bießmann

19 Dec

4:24 p.m.

Dear Luca Ceresoli,

On 19.12.2012 12:28, Luca Ceresoli wrote:

...

Have you tried to increase the malloc arena in U-Boot (CONFIG_SYS_MALLOC_LEN)? We have seen errors like this before ([1], [2], [3], and maybe others), although with a different error message, so please give it a try. We know UBI recovery needs some RAM, and 1 MiB may not be enough.

...

The interesting thing is that if I load Linux (2.6.37 + OMAP patches + board support patches) via TFTP and boot it with bootm, it correctly attaches UBI (fixing any problem it may have) and boots correctly. After that the board is unbricked: U-Boot can boot again normally from NAND.

The fact that Linux can recover with a quite old version points, for me, towards 'environment constraints' such as too little memory in U-Boot. Unfortunately the error messages in U-Boot's UBI code sometimes lack such details (like -ENOMEM as in [1]).

Best regards

Andreas Bießmann

[1] http://thread.gmane.org/gmane.comp.boot-loaders.u-boot/124769
[2] http://thread.gmane.org/gmane.comp.boot-loaders.u-boot/145526
[3] http://thread.gmane.org/gmane.comp.boot-loaders.u-boot/145655

Luca Ceresoli

4:56 p.m.

Hi Andreas,

Andreas Bießmann wrote: ...

Have you tried to increase the malloc arena in U-Boot (CONFIG_SYS_MALLOC_LEN)? We have seen errors like this before ([1], [2], [3], and maybe others), although with a different error message, so please give it a try. We know UBI recovery needs some RAM, and 1 MiB may not be enough.

Thanks for your suggestion.

Unfortunately this does not seem to be the cause of my problem: I tried increasing my CONFIG_SYS_MALLOC_LEN in include/configs/dig297.h from (1024 << 10) to both (1024 << 12) and (1024 << 14), but without any difference.
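For readers who do not decode shift expressions at a glance, the three values tried above work out as follows. This is only a sketch of the arithmetic; malloc_len_mib() is a hypothetical helper, not something from the U-Boot tree.

```c
/* Convert a CONFIG_SYS_MALLOC_LEN-style byte count to MiB. */
static unsigned long malloc_len_mib(unsigned long len_bytes)
{
    return len_bytes >> 20;
}
```

So (1024 << 10) is 1 MiB, (1024 << 12) is 4 MiB and (1024 << 14) is 16 MiB, meaning the arena was grown 16-fold without any change in behaviour.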

Luca

Andreas Bießmann

5:09 p.m.

Hi Luca,

On 19.12.2012 16:56, Luca Ceresoli wrote:

...

Thanks for your suggestion.

Unfortunately this does not seem to be the cause of my problem: I tried increasing my CONFIG_SYS_MALLOC_LEN in include/configs/dig297.h from (1024 << 10) to both (1024 << 12) and (1024 << 14), but without any difference.

Well, ok ... The malloc arena is always my first thought when I read about problems with UBI in U-Boot. Have you looked at the differences in drivers/mtd/ubi/ between your U-Boot and Linux trees? Maybe you can see something obviously different in ubi_wl_init_scan()?

Best regards

Andreas Bießmann

Luca Ceresoli

6:37 p.m.

Hi Andreas,

Andreas Bießmann wrote:

...

Well, ok ... The malloc arena is always my first thought when I read about problems with UBI in U-Boot. Have you looked at the differences in drivers/mtd/ubi/ between your U-Boot and Linux trees? Maybe you can see something obviously different in ubi_wl_init_scan()?

I did some days ago, but I double-checked now as you suggested. Indeed there is an important difference: attach_by_scanning() (build.c) calls ubi_wl_init_scan() and ubi_eba_init_scan() just like Linux does, but in swapped order!

This swap dates back to:

commit d63894654df72b010de2abb4b3f07d0d755f65b6
Author: Holger Brunck <holger.brunck@keymile.com>
Date: Mon Oct 10 13:08:19 2011 +0200

    UBI: init eba tables before wl when attaching a device

    This fixes that u-boot gets stuck when a bitflip was detected
    during "ubi part <ubi_device>". If a bitflip was detected UBI tries
    to copy the PEB to a different place. This needs that the eba table
    are initialized, but this was done after the wear levelling worker
    detects the bitflip. So changes the initialisation of these two
    tasks in u-boot.

    This is a u-boot specific patch and not needed in the linux layer,
    because due to commit 1b1f9a9d00447d
    UBI: Ensure that "background thread" operations are really executed
    we schedule these tasks in place and not as in linux after the inital
    task which schedule this new task is finished.

    Signed-off-by: Holger Brunck <holger.brunck@keymile.com>
    cc: Stefan Roese <sr@denx.de>
    Signed-off-by: Stefan Roese <sr@denx.de>

I tried reverting that commit and... surprise! U-Boot can now attach UBI and boot properly!

But the cited commit actually fixed a bug that bit our board a few months back, so it should not be reverted without thinking twice. Now it has apparently introduced another bug. :-(

I'm Cc:ing the commit author for comments.

Nonetheless, I have evidence of a different behaviour between U-Boot and Linux even before the two swapped functions are called.

What attach_by_scanning() does in Linux is (abbreviated):

static int attach_by_scanning(struct ubi_device *ubi)
{
    si = ubi_scan(ubi);
    ...fill ubi->some_fields...;
    err = ubi_read_volume_table(ubi, si);
    /* MARK */
    err = ubi_eba_init_scan(ubi, si); /* swapped in U-Boot */
    err = ubi_wl_init_scan(ubi, si);  /* swapped in U-Boot */
    ubi_scan_destroy_si(si);
    return 0;
}

See the two swapped calls.

At MARK, I printed some of the PEB counters in *ubi, and I got different results for ubi->avail_pebs between U-Boot and Linux:

    U-Boot: UBI: POST_TBL: rsvd=2018, avail=21, beb_rsvd_{pebs,level}=0,0
    Linux:  UBI: POST_TBL: rsvd=2018, avail=22, beb_rsvd_{pebs,level}=0,0

The printed values were equal before calling ubi_read_volume_table(). I have no idea where this difference comes from, nor whether it can cause my troubles. I will investigate further tomorrow by looking into ubi_read_volume_table().

Luca

Holger Brunck

20 Dec

1:44 p.m.

Hi Luca,

On 12/19/2012 06:37 PM, Luca Ceresoli wrote:

...

I tried reverting that commit and... surprise! U-Boot can now attach UBI and boot properly!

:-(

...

But the cited commit actually fixed a bug that bite our board a few months back, so it should not be reverted without thinking twice. Now it apparently introduced another bug. :-(

Yes, definitely.

I didn't read the whole thread, so I don't know what your exact problem is. On my boards the UBI layer seems to work fine with the latest U-Boot. But I see a general problem with the UBI layer in U-Boot. Let me try to summarize my view:

The UBI layer was initially copied from the Linux implementation. But the Linux implementation relies on a background thread for some tasks, e.g. fixing correctable errors. Because U-Boot is single-threaded, there was one commit that makes sure these background tasks are really executed (CC-ing the author):

    commit 1b1f9a9d00
    UBI: Ensure that "background thread" operations are really executed

U-Boot executes these background tasks immediately, whereas the Linux implementation executes them later with the help of some synchronisation mechanism. Therefore the two execute these tasks in a different order. My fix changed the initialisation order of the EBA tables and the wear-leveling thread to address my problem. But now it seems to cause a new problem on your side.

So the synchronisation mechanism in U-Boot for the UBI tasks that run in the background on Linux is incorrect. How this could be fixed needs some deeper analysis.
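The ordering issue described above can be illustrated with a toy model. This is only a sketch: none of these names exist in the UBI sources, and the model assumes just two things taken from the explanation, namely that the "move PEB" job needs the EBA tables and that WL init is what schedules it when a bitflip is seen.

```c
#include <stdbool.h>

enum first_step { INIT_WL_FIRST, INIT_EBA_FIRST };

struct toy_ubi {
    bool eba_ready;    /* EBA tables are initialized */
    bool pending_move; /* a "move PEB" job is queued for later */
    bool move_failed;  /* a move ran before the EBA tables existed */
};

/* The job that copies a PEB elsewhere after a bitflip; it needs the EBA tables. */
static void run_move(struct toy_ubi *u)
{
    if (!u->eba_ready)
        u->move_failed = true;
}

/* WL init sees a bitflip and schedules a move; 'immediate' runs it on the spot. */
static void init_wl(struct toy_ubi *u, bool immediate)
{
    if (immediate)
        run_move(u);
    else
        u->pending_move = true;
}

static void init_eba(struct toy_ubi *u)
{
    u->eba_ready = true;
}

/* Returns true if the attach finished without a move running too early. */
static bool toy_attach(enum first_step first, bool immediate)
{
    struct toy_ubi u = { false, false, false };

    if (first == INIT_WL_FIRST) {
        init_wl(&u, immediate);
        init_eba(&u);
    } else {
        init_eba(&u);
        init_wl(&u, immediate);
    }
    if (u.pending_move)
        run_move(&u); /* deferred work runs after both inits */
    return !u.move_failed;
}
```

In this model toy_attach(INIT_WL_FIRST, false) succeeds (deferred execution, as in Linux), toy_attach(INIT_WL_FIRST, true) fails (immediate execution with the original order), and toy_attach(INIT_EBA_FIRST, true) succeeds (immediate execution with the swapped order). It says nothing about the PEB accounting side effect of the swap.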

Regards Holger

Luca Ceresoli

5:02 p.m.

Hi,

I'm Cc'ing the linux-mtd list as well as the authors of the Linux commits cited below.

For these new readers: I reported a problem with U-Boot 2012.04.01 not being able to attach a UBI partition in NAND, while Linux (2.6.37) can attach and repair it.

It looks like a U-Boot bug, but I discovered strange things around the chip->badblockbits variable (in the NAND code) by comparing the relevant code in U-Boot and Linux.

Sorry for Cc'ing so many people, but following this issue I was led from one subsystem to another (and from U-Boot to Linux).

Previous discussion is here: http://thread.gmane.org/gmane.comp.boot-loaders.u-boot/149624

Luca Ceresoli wrote:

...

After half a day of debugging and an insane amount of printk()s added to both U-Boot and Linux, I have some more hints to understand the problem.

The two different results quoted above show that U-Boot counted 21 available eraseblocks, while Linux counts 22. I am not sure if this can cause my problem, but it's the first visible difference between U-Boot and Linux.

This originates from ubi_scan() (scan.c): in U-Boot it sets si->bad_peb_count to 1, in Linux to 0. U-Boot's ubi_scan() is very similar to Linux's, and the differences do not seem to be relevant in my case. So let's dig down...

- ubi_scan() (scan.c) calls process_eb() (scan.c) for each EB
- process_eb() calls ubi_io_is_bad() (io.c), and if it returns >0 it increments si->bad_peb_count, which is what is happening to my board when executing U-Boot
- ubi_io_is_bad() calls mtd->block_isbad(), which points to nand_block_isbad() (nand_base.c)
- nand_block_isbad() is a wrapper to nand_block_checkbad() (nand_base.c)
- nand_block_checkbad() differs from the Linux code in something related to lazy bad block scanning (commit fb49454b1b6c7c6, Feb 2012), but this does not seem to change the behaviour I observe;
- nand_block_checkbad() calls either chip->block_bad() or nand_isbad_bbt(); I tracked only into the former, but I suspect the latter produces the same effects with regard to the problem I'm facing
- chip->block_bad() points to nand_block_bad() (nand_base.c)

nand_block_bad() (nand_base.c) does the following:

static int nand_block_bad(struct mtd_info *mtd, loff_t ofs, int getchip)
{
    ...

    if (likely(chip->badblockbits == 8))
            res = bad != 0xFF;
    else
            res = hweight8(bad) < chip->badblockbits;

    if (getchip)
            nand_release_device(mtd);

    return res;
}

I don't understand the algorithm, but the relevant variables have these values:

    U-Boot: nand_block_bad: chip->badblockbits=8, bad=0000, hweight8(bad)=0
    Linux:  nand_block_bad: chip->badblockbits=0, bad=0000, hweight8(bad)=0
                                               ^

Obviously U-Boot and Linux produce different return values. This propagates up to ubi->bad_peb_count in ubi_scan(), and from there it changes the behaviour of the following code, leading to a blocked boot in U-Boot and a successful attach in Linux.

chip->badblockbits in current Linux master is described as "minimum number of set bits in a good block's bad block marker position; i.e., BBM == 11110111b is not bad when badblockbits == 7".
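To make the comparison concrete, here is a small standalone sketch of the same decision. It is not the kernel or U-Boot code: popcount8() stands in for hweight8(), and marker_says_bad() is a hypothetical name for the final test in nand_block_bad() quoted above.

```c
#include <stdint.h>

/* Count the set bits in one byte (stand-in for hweight8()). */
static int popcount8(uint8_t v)
{
    int n = 0;

    while (v) {
        n += v & 1;
        v >>= 1;
    }
    return n;
}

/* Same test as the tail of nand_block_bad(): 1 means "block is bad". */
static int marker_says_bad(uint8_t bad, int badblockbits)
{
    if (badblockbits == 8)
        return bad != 0xFF;
    return popcount8(bad) < badblockbits;
}
```

With the values printed above, marker_says_bad(0x00, 8) returns 1 (the U-Boot result) and marker_says_bad(0x00, 0) returns 0 (the Linux 2.6.37 result). Note that with badblockbits == 0 the comparison can never be true, so this check reports no block as bad whatever the marker byte contains. The example from the comment also holds: marker_says_bad(0xF7, 7) returns 0.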

Still a bit obscure to me because I don't have a general picture. Anyway, here's how its value comes to be different between U-Boot (2012.04.01) and Linux (2.6.37).

Linux:

a) commit e0b58d0a7005, Feb 2010:
       mtd: nand: add ->badblockbits for minimum number of set bits in bad block byte
   declared the new variable and introduced in nand_get_flash_type() (nand_base.c) the following line:
       chip->badblockbits = 8;
b) commit c7b28e25cb9, Jul 2010:
       mtd: nand: refactor BB marker detection
   removed from nand_get_flash_type() (nand_base.c) the same line:
       chip->badblockbits = 8;
c) commit 26d9be11485e, Apr 2011:
       mtd: return badblockbits back
   restored in nand_get_flash_type() (nand_base.c) the following line:
       chip->badblockbits = 8;
   claiming it had been accidentally removed in commit b).

The version of Linux I'm using (2.6.37) contains commits a) and b), so it has chip->badblockbits equal to 0. According to the log message of commit c), this should be wrong, but the resulting kernel works!

The version of U-Boot (2012.04.01) contains the result of all 3 commits, since

commit 2a8e0fc8b3dc31a3c571e439fbf04b882c8986be
Author: Christian Hitz <christian.hitz@aizo.com>
Date: Wed Oct 12 09:32:02 2011 +0200

    nand: Merge changes from Linux nand driver

    [backport from linux commit 02f8c6aee8df3cdc935e9bdd4f2d020306035dbe]

    This patch synchronizes the nand driver with the Linux 3.0 state.

This looks like an improvement, but it bricks my board!

I could not resist, and without even trying to understand what I was doing, I did in U-Boot's nand_get_flash_type() (nand_base.c):

    -       chip->badblockbits = 8;
    +       chip->badblockbits = 0;

and guess what? U-Boot attached UBI, loaded Linux from it and booted successfully!

No, I don't think changing lines here and there without any real understanding is a way to produce reliable software. But I'm unable to understand why the software that should work better actually bricks the board while the other one runs fine. And how do I know what the correct value for chip->badblockbits should be?

And last but most important: how can I properly fix U-Boot?

Thanks, Luca

Luca Ceresoli

2 Jan

3:37 p.m.

Luca Ceresoli wrote:

...

Hi,

I'm Cc'ing the linux-mtd list as well as the authors of the Linux commits cited below.

For these new readers: I reported a problem with U-Boot 2012.04.01 not being able to attach an UBI partition in NAND, while Linux (2.6.37) can attach and repair it.

It looks like an U-Boot bug, but I discovered strange things around the chip->badblockbits variable (in the NAND code) by comparing the relevant code in U-Boot and Linux.

Sorry for Cc'ing so many people, but following this issue I was lead from one subsystem to another (and from U-Boot to Linux).

Previous discussion is here: http://thread.gmane.org/gmane.comp.boot-loaders.u-boot/149624

Luca Ceresoli wrote:

...
Hi Andreas,

Andreas Bießmann wrote:

...
Hi Luca,

On 19.12.2012 16:56, Luca Ceresoli wrote:

...
Hi Andreas,

Andreas Bießmann wrote: ...

...
...
Creating 1 MTD partitions on "nand0": 0x000000100000-0x000010000000 : "mtd=3" UBI: attaching mtd1 to ubi0 UBI: physical eraseblock size: 131072 bytes (128 KiB) UBI: logical eraseblock size: 129024 bytes UBI: smallest flash I/O unit: 2048 UBI: sub-page size: 512 UBI: VID header offset: 512 (aligned 512) UBI: data offset: 2048 UBI error: ubi_wl_init_scan: no enough physical eraseblocks (0, need 1)

Now the device is totally blocked, and power cycling does not change the result.

have you tried to increase the malloc arena in u-boot (CONIG_SYS_MALLOC_LEN)? We had errors like this before [1],[2] and [3], maybe others - apparently with another error message, but please give it a try. We know ubi recovery needs some ram and 1MiB may be not enough.

Thanks for your suggestion.

Unfortunately this does not seem to be the cause of my problem: I tried increasing my CONFIG_SYS_MALLOC_LEN in include/configs/dig297.h from (1024 << 10) to both (1024 << 12) and (1024 << 14), but without any difference.

Well, ok ... Malloc arena is always my first thought if I read about problems with ubi in u-boot. Have you looked up the differences in drivers/mtd/ubi/ in your u-boot and linux tree? Maybe you can see something obviously different in the ubi_wl_init_scan()?

I had some days ago, but I double-checked now as you suggested. Indeed there is an important difference: attach_by_scanning() (build.c) calls ubi_wl_init_scan() and ubi_eba_init_scan() just like Linux does, but in a swapped order!

This swap dates back to:

commit d63894654df72b010de2abb4b3f07d0d755f65b6 Author: Holger Brunck holger.brunck@keymile.com Date: Mon Oct 10 13:08:19 2011 +0200
 UBI: init eba tables before wl when attaching a device

 This fixes that u-boot gets stuck when a bitflip was detected
 during "ubi part <ubi_device>". If a bitflip was detected UBI tries
 to copy the PEB to a different place. This needs that the eba table
 are initialized, but this was done after the wear levelling worker
 detects the bitflip. So changes the initialisation of these two
 tasks in u-boot.

 This is a u-boot specific patch and not needed in the linux layer,
 because due to commit 1b1f9a9d00447d
 UBI: Ensure that "background thread" operations are really executed
 we schedule these tasks in place and not as in linux after the 
inital task which schedule this new task is finished.
 Signed-off-by: Holger Brunck <holger.brunck@keymile.com>
 cc: Stefan Roese <sr@denx.de>
 Signed-off-by: Stefan Roese <sr@denx.de>
I tried reverting that commit and... surprise! U-Boot can now attach UBI and boot properly!

But the cited commit actually fixed a bug that bite our board a few months back, so it should not be reverted without thinking twice. Now it apparently introduced another bug. :-(

I'm Cc:ing the commit author for comments.

Nonetheless, I have evidence of a different behaviour between U-Boot and Linux even before the two swapped functions are called.

What attach_by_scanning() does in Linux is (abbreviated):

static int attach_by_scanning(struct ubi_device *ubi) { si = ubi_scan(ubi); ...fill ubi->some_fields...; err = ubi_read_volume_table(ubi, si); /* MARK */ err = ubi_eba_init_scan(ubi, si); /* swapped in U-Boot */ err = ubi_wl_init_scan(ubi, si); /* swapped in U-Boot */ ubi_scan_destroy_si(si); return 0; }

See the two swapped calls.

At MARK, I printed some of the peb counters in *ubi, and I got different results for ubi->avail_pebs between U-Boot and Linux: U-Boot: UBI: POST_TBL: rsvd=2018, avail=21, beb_rsvd_{pebs,level}=0,0 Linux: UBI: POST_TBL: rsvd=2018, avail=22, beb_rsvd_{pebs,level}=0,0

The printed values were equal before calling ubi_read_volume_table(). I have no idea about where this difference comes from, nor if this difference can cause my troubles. I will better investigate tomorrow looking into ubi_read_volume_table().
After half a day of debugging and an insane amount of printk()s added to both U-Boot and Linux, I have some more hints to understand the problem.

The two different results quoted above show that U-Boot counted 21 available eraseblocks, while Linux counts 22. I am not sure if this can cause my problem, but it's the first visible difference between U-Boot and Linux.

This originates from ubi_scan() (scan.c): in U-Boot, it sets si->bad_peb_count to 1, in Linux to 0. U-Boot's ubi_scan() is very similar to Linux's, and the differences do not seem to relevant in my case. So let's dig down...

ubi_scan() (scan.c) calls process_eb() (scan.c) for each EB

process_eb() calls ubi_io_is_bad() (io.c), and if it returns >0 it increments si->bad_peb_count, which is what is happening to my board when executing U-Boot

ubi_io_is_bad() calls mtd->block_isbad(), which points to nand_block_isbad() (nand_base.c)

nand_block_isbad() is a wrapper to nand_block_checkbad() (nand_base.c)

nand_block_checkbad() differs from the Linux code in something related to lazy bad block scanning (commit fb49454b1b6c7c6, Feb 2012), but this does not seem to change the behaviour I observe;

nand_block_checkbad() calls either chip->block_bad() or nand_isbad_bbt(); I tracked only into the former, but I suspect the latter produces the same effects with regard to the problem I'm facing

chip->block_bad() points to nand_block_bad() (nand_base.c)

nand_block_bad() (nand_base.c) does the following: static int nand_block_bad(struct mtd_info *mtd, loff_t ofs, int getchip) { ...
    if (likely(chip->badblockbits == 8))
            res = bad != 0xFF;
    else
            res = hweight8(bad) < chip->badblockbits;

    if (getchip)
            nand_release_device(mtd);

    return res;
}

I don't understand the algorithm, but the relevant variables have these values: U-Boot: nand_block_bad: chip->badblockbits=8, bad=0000, hweight8(bad)=0 Linux: nand_block_bad: chip->badblockbits=0, bad=0000, hweight8(bad)=0 ^

Obviously the U-Boot and Linux produce a different return value. This propagates up to ubi->bad_peb_count in ubi_scan(), and from there it changes the behaviour of the following code, leading to a block in U-Boot and a successful attach in Linux.

chip->badblockbits in current Linux master is described as "minimum number of set bits in a good block's bad block marker position; i.e., BBM == 11110111b is not bad when badblockbits == 7".

Still a bit obscure to me because I don't have a general picture. Anyway, here's how its value comes to be different between U-Boot (2012.04.01) and Linux (2.6.37).

Linux: a) commit e0b58d0a7005, Feb 2010: mtd: nand: add ->badblockbits for minimum number of set bits in bad block byte declared the new variable and introduced in nand_get_flash_type() (nand_base.c) the following line: chip->badblockbits = 8; b) commit c7b28e25cb9, Jul 2010: mtd: nand: refactor BB marker detection removed from nand_get_flash_type() (nand_base.c) the same line: chip->badblockbits = 8; c) commit 26d9be11485e, Apr 2011: mtd: return badblockbits back restored in nand_get_flash_type() (nand_base.c) the following line: chip->badblockbits = 8; claiming it had been accidentally removed in commit b).

The version of Linux I'm using (2.6.37), contains commits a) and b), so it has chip->badblockbits equal to 0. According to the log message of commit c), this should be wrong, but the resulting kernel works!

The version of U-Boot (2012.04.01) contains the result of all 3 commits, since

commit 2a8e0fc8b3dc31a3c571e439fbf04b882c8986be
Author: Christian Hitz <christian.hitz@aizo.com>
Date:   Wed Oct 12 09:32:02 2011 +0200

    nand: Merge changes from Linux nand driver

    [backport from linux commit
    02f8c6aee8df3cdc935e9bdd4f2d020306035dbe]

    This patch synchronizes the nand driver with the Linux 3.0 state.

This looks like an improvement, but it bricks my board!

I could not resist, and without even trying to understand what I was doing, I made this change in U-Boot's nand_get_flash_type() (nand_base.c):
-  chip->badblockbits = 8;
+  chip->badblockbits = 0;
and guess what? U-Boot attached UBI, loaded Linux from it and booted successfully!

No, I don't think changing lines here and there without any real understanding is a way to produce reliable software. But I'm unable to understand why the software that should work better actually bricks the board while the other one runs fine. And how do I know what the correct value for chip->badblockbits should be?

And last but most important: how can I properly fix U-Boot?

I had another look at the commit that swapped the calls to ubi_eba_init_scan() and ubi_wl_init_scan(), and I noticed that it changed the computation of the available PEB count.

In the original (pre-swap) code, running on a working board:

static int attach_by_scanning(struct ubi_device *ubi)
{
    si = ubi_scan(ubi);
    ...fill ubi->some_fields...;
    err = ubi_read_volume_table(ubi, si);

    /* here rsvd=2018, avail=22, beb_rsvd_{pebs,level}=0,0 */

    err = ubi_wl_init_scan(ubi, si); /* swapped in U-Boot */

    /* here rsvd=2019, avail=21, beb_rsvd_{pebs,level}=0,0 ***** */

    err = ubi_eba_init_scan(ubi, si); /* swapped in U-Boot */
    ubi_scan_destroy_si(si);
    return 0;
}

In the current (post-swap) code, running on the same board:

static int attach_by_scanning(struct ubi_device *ubi)
{
    si = ubi_scan(ubi);
    ...fill ubi->some_fields...;
    err = ubi_read_volume_table(ubi, si);

    /* here rsvd=2018, avail=22, beb_rsvd_{pebs,level}=0,0 */

    err = ubi_eba_init_scan(ubi, si); /* swapped in U-Boot */

    /* here rsvd=2039, avail=1, beb_rsvd_{pebs,level}=20,20 ***** */

    err = ubi_wl_init_scan(ubi, si); /* swapped in U-Boot */
    ubi_scan_destroy_si(si);
    return 0;
}

Notice the difference on the line marked with "*****": after the swap, the number of available PEBs changed from 21 to 1.
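The effect of the call order on the counters can be reproduced with a small stand-alone model. This is only a sketch: the struct and function names are invented for the example, and the reservation rules (1 PEB for wear-levelling, 1 PEB for EBA, then up to "level" PEBs for bad block handling) are assumptions inferred from the numbers in the two traces above, not copied from the UBI sources.

```c
#include <assert.h>

/* Simplified PEB accounting; names loosely follow the traces above. */
struct peb_counts {
    int avail;          /* available PEBs */
    int rsvd;           /* reserved PEBs */
    int beb_rsvd_pebs;  /* PEBs actually reserved for bad block handling */
    int beb_rsvd_level; /* PEBs UBI would like to reserve */
};

/* Assumed: wear-levelling init needs 1 PEB and fails without it. */
static int wl_init(struct peb_counts *c)
{
    if (c->avail < 1)
        return -1;      /* the "(0, need 1)" case */
    c->avail -= 1;
    c->rsvd += 1;
    return 0;
}

/*
 * Assumed: EBA init takes 1 PEB for itself, then reserves up to 'level'
 * PEBs for bad block handling, taking whatever is left if fewer are
 * available (a warning in UBI, not an error).
 */
static int eba_init(struct peb_counts *c, int level)
{
    assert(level >= 0);
    if (c->avail < 1)
        return -1;
    c->avail -= 1;
    c->rsvd += 1;
    c->beb_rsvd_level = level;
    c->beb_rsvd_pebs = c->avail < level ? c->avail : level;
    c->avail -= c->beb_rsvd_pebs;
    c->rsvd += c->beb_rsvd_pebs;
    return 0;
}

/* Run both init steps in the chosen order; 0 means the attach succeeded. */
static int attach(struct peb_counts *c, int level, int eba_first)
{
    if (eba_first)
        return (eba_init(c, level) || wl_init(c)) ? -1 : 0;
    return (wl_init(c) || eba_init(c, level)) ? -1 : 0;
}
```

Starting from avail=22, rsvd=2018 with a level of 20, eba_init() leaves avail=1 and rsvd=2039, as in the post-swap trace. Starting from avail=21, the EBA-first order fails in wl_init() with 0 available PEBs, while the WL-first order still attaches, with only 19 PEBs reserved for bad blocks.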

According to the docs, UBI reserves some PEBs for bad PEB handling. By default, in my 2048-PEB NAND, it reserved 20 PEBs, which is more than enough to recover from a few bad PEBs. These should be computed as part of the "available" PEBs. But current U-Boot (incorrectly?) thinks there is only 1 available PEB. On a bricked board, it thinks there are 0, so it cannot attach UBI.
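As for where the figure of 20 may come from: as far as I can tell, UBI of this generation sizes the reserve as a percentage of the good PEBs (1% by default, with a small minimum). Below is a minimal sketch under that assumption; the function name and the constants are made up for the example and should be checked against the actual sources.

```c
#include <assert.h>

/* Assumed defaults: reserve 1% of the good PEBs, but never fewer than 2. */
#define BEB_RESERVE_PERCENT 1
#define MIN_RESERVED_PEBS   2

/* Hypothetical helper: PEBs UBI would like to keep for bad PEB handling. */
static int beb_reserve_level(int good_peb_count)
{
    int level;

    assert(good_peb_count >= 0);
    /* Integer division first, so 2040 PEBs give 20, not 20.4. */
    level = good_peb_count / 100 * BEB_RESERVE_PERCENT;
    if (level < MIN_RESERVED_PEBS)
        level = MIN_RESERVED_PEBS;
    return level;
}
```

With the 2040 PEBs of the UBI partition (rsvd + avail in the traces above), this gives 20, which matches the reported beb_rsvd_level.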

I have no fix for this, but I tried a simple workaround: instead of using all the available space for my logical volumes, I created them with a smaller size, leaving 32 unused PEBs. Now, in attach_by_scanning(), I got:

pre-swap:  rsvd=1987, avail=53, beb_rsvd_{pebs,level}=0,0
post-swap: rsvd=2007, avail=33, beb_rsvd_{pebs,level}=20,20

The computed number of available PEBs is exactly 32 units bigger than it used to be. This means that, even after the swap, U-Boot thinks there are plenty of available PEBs.

To try to simulate a board that has bad blocks, I then marked some blocks as bad using 'nand markbad' in U-Boot. The number of available PEBs decreases accordingly, but is still >0 and U-Boot can attach UBI and boot.

So, it seems that leaving some unused PEBs is a workaround to this problem! I'm not 100% sure this is ok and will go on to better understand the problem. Any comments are welcome.

Luca

Vikram Narayanan

19 Dec

6:32 p.m.

On 12/19/2012 4:58 PM, Luca Ceresoli wrote:

> Hi all,

<snip>

> On "bricked" devices the output of the "ubi part nand0,3" command is:
>
> Creating 1 MTD partitions on "nand0":
> 0x000000100000-0x000010000000 : "mtd=3"
> UBI: attaching mtd1 to ubi0
> UBI: physical eraseblock size: 131072 bytes (128 KiB)
> UBI: logical eraseblock size: 129024 bytes
> UBI: smallest flash I/O unit: 2048
> UBI: sub-page size: 512
> UBI: VID header offset: 512 (aligned 512)
> UBI: data offset: 2048
> UBI error: ubi_wl_init_scan: no enough physical eraseblocks (0, need 1)

Just curious: what does the above command say when you try to attach an empty partition? Does it result in the same error?

> Now the device is totally blocked, and power cycling does not change the result.
>
> The interesting thing is that if I load Linux (2.6.37 + OMAP patches + board support patches) via TFTP and boot it with bootm, it correctly attaches UBI (fixing any problem it may have) and boots correctly. After that the board is unbricked: U-Boot can boot again normally from NAND.
>
> Without the ambition of understanding all UBI internals, I tried to visually inspect the UBI code around the line where the error is produced and compare it to the corresponding Linux sources. They looked extremely similar, so I haven't and obvious hint of why U-Boot and Linux produce different results.
>
> I also tried with an updated U-Boot master, but the error is still there.
>
> Obviously I have changed nothing in the UBI and MTD code, both in U-Boot and in Linux.
>
> Can you suggest a proper way to track the root of the problem, or to bypass it?

I think it's the right time to sync the UBI code with the current kernel tree. But it seems like a huge amount of work. Any suggestions?

Regards, Vikram

Stefan Roese

7:22 p.m.

On 12/19/2012 06:32 PM, Vikram Narayanan wrote:

>> On "bricked" devices the output of the "ubi part nand0,3" command is:
>>
>> Creating 1 MTD partitions on "nand0":
>> 0x000000100000-0x000010000000 : "mtd=3"
>> UBI: attaching mtd1 to ubi0
>> UBI: physical eraseblock size: 131072 bytes (128 KiB)
>> UBI: logical eraseblock size: 129024 bytes
>> UBI: smallest flash I/O unit: 2048
>> UBI: sub-page size: 512
>> UBI: VID header offset: 512 (aligned 512)
>> UBI: data offset: 2048
>> UBI error: ubi_wl_init_scan: no enough physical eraseblocks (0, need 1)
>
> Just curious, What does the above command say when you try to attach an empty partition. Does it result in the same error?
>
>> Now the device is totally blocked, and power cycling does not change the result.
>>
>> The interesting thing is that if I load Linux (2.6.37 + OMAP patches + board support patches) via TFTP and boot it with bootm, it correctly attaches UBI (fixing any problem it may have) and boots correctly. After that the board is unbricked: U-Boot can boot again normally from NAND.
>>
>> Without the ambition of understanding all UBI internals, I tried to visually inspect the UBI code around the line where the error is produced and compare it to the corresponding Linux sources. They looked extremely similar, so I haven't and obvious hint of why U-Boot and Linux produce different results.
>>
>> I also tried with an updated U-Boot master, but the error is still there.
>>
>> Obviously I have changed nothing in the UBI and MTD code, both in U-Boot and in Linux.
>>
>> Can you suggest a proper way to track the root of the problem, or to bypass it?
>
> I think its the right time to sync the UBI code with the current kernel tree. But it seems like a huge work. Any suggestions?

Yes, syncing with the latest UBI/UBIFS code would be the best solution. Even so, trying an increased malloc area, as suggested by Andreas, might be worth a shot.

And yes, this re-sync with the latest-and-greatest Linux code version is of course a bigger task. It has been suggested to the celinux forum as part of the "booting from a UBI volume" task:

http://lists.celinuxforum.org/pipermail/celinux-dev/2012-April/000543.html

But nothing has happened till now. Any volunteers? But please keep in mind that intensive testing is required before the current (stable?) code version can be replaced.

Thanks, Stefan

Vikram Narayanan

7:47 p.m.

On 12/19/2012 11:52 PM, Stefan Roese wrote:

<snip>

>> I think its the right time to sync the UBI code with the current kernel tree. But it seems like a huge work. Any suggestions?
>
> Yes, syncing with the latest UBI/UBIFS code would be the best solution. Even though a try with an increased malloc area as suggested by Andreas might be a chance.
>
> And yes, this re-sync with the latest-and-greatest Linux code version is of course a bigger task. It has been suggest as part of booting from an UBI volume task to the celinux forum:
>
> http://lists.celinuxforum.org/pipermail/celinux-dev/2012-April/000543.html

Yeah. I had asked about the activity on this task some time back.

> But nothing has happened till now. Any volunteers? But please keep in mind that intensive testing is required before the current (stable?) code version can be replaced.

Looks like the MTD layer might need to be patched up as well in some places. What do you think?

Regards, Vikram

Vikram Narayanan

7:57 p.m.

On 12/20/2012 12:17 AM, Vikram Narayanan wrote:

> On 12/19/2012 11:52 PM, Stefan Roese wrote:
>
> <snip>
>>> I think its the right time to sync the UBI code with the current kernel
>>> tree. But it seems like a huge work. Any suggestions?
>>
>> Yes, syncing with the latest UBI/UBIFS code would be the best solution.
>> Even though a try with an increased malloc area as suggested by Andreas
>> might be a chance.
>>
>> And yes, this re-sync with the latest-and-greatest Linux code version is
>> of course a bigger task. It has been suggest as part of booting from an
>> UBI volume task to the celinux forum:
>>
>> http://lists.celinuxforum.org/pipermail/celinux-dev/2012-April/000543.html
>>
>
> Yeah. I had queried sometime back on the activity of this task.
>
>> But nothing has happened till now. Any volunteers? But please keep in mind that intensive testing is required before the current (stable?) code version can be replaced.
>
> Looks like the MTD layer might needs to be patched up as well at some places. What do you think?

Maybe we should start some discussion and put forth some ideas, which might eventually attract some volunteers.

What is your proposal for syncing with the latest code?
* Pick out changes from the kernel's git (pick out UBI-related commits right from the point where the current U-Boot code is)
* Compare and move the code

Both are equally complicated, and the second option leaves very little chance of figuring out why a given change was added. Ideas are welcome.

Regards, Vikram


participants (5)

Andreas Bießmann
Holger Brunck
Luca Ceresoli
Stefan Roese
Vikram Narayanan