Re: [U-Boot] Bricked when trying to attach UBI

2 Jan 2013

      Luca Ceresoli wrote:
...
Hi,
I'm Cc'ing the linux-mtd list as well as the authors of the Linux
commits cited below.
For these new readers: I reported a problem with U-Boot 2012.04.01 not
being able to attach an UBI partition in NAND, while Linux (2.6.37) can
attach and repair it.
It looks like an U-Boot bug, but I discovered strange things around the
chip->badblockbits variable (in the NAND code) by comparing the
relevant code in U-Boot and Linux.
Sorry for Cc'ing so many people, but following this issue I was lead
from one subsystem to another (and from U-Boot to Linux).
Previous discussion is here:
http://thread.gmane.org/gmane.comp.boot-loaders.u-boot/149624
Luca Ceresoli wrote:
...
Hi Andreas,
Andreas Bießmann wrote:
...
Hi Luca,
On 19.12.2012 16:56, Luca Ceresoli wrote:
...
Hi Andreas,
Andreas Bießmann wrote:
...
...
...
Creating 1 MTD partitions on "nand0":
0x000000100000-0x000010000000 : "mtd=3"
UBI: attaching mtd1 to ubi0
UBI: physical eraseblock size:   131072 bytes (128 KiB)
UBI: logical eraseblock size:    129024 bytes
UBI: smallest flash I/O unit:    2048
UBI: sub-page size:              512
UBI: VID header offset:          512 (aligned 512)
UBI: data offset:                2048
UBI error: ubi_wl_init_scan: no enough physical eraseblocks (0,
need 1)
Now the device is totally blocked, and power cycling does not change
the result.
have you tried to increase the malloc arena in u-boot
(CONIG_SYS_MALLOC_LEN)?
We had errors like this before [1],[2] and [3], maybe others -
apparently with another error message, but please give it a try. We
know
ubi recovery needs some ram and 1MiB may be not enough.
Thanks for your suggestion.
Unfortunately this does not seem to be the cause of my problem: I 
tried
increasing my CONFIG_SYS_MALLOC_LEN in include/configs/dig297.h from
(1024 << 10) to both (1024 << 12) and (1024 << 14), but without any
difference.
Well, ok ... Malloc arena is always my first thought if I read about
problems with ubi in u-boot.
Have you looked up the differences in drivers/mtd/ubi/ in your u-boot
and linux tree? Maybe you can see something obviously different in the
ubi_wl_init_scan()?
I had some days ago, but I double-checked now as you suggested. Indeed
there is an important difference: attach_by_scanning() (build.c) calls
ubi_wl_init_scan() and ubi_eba_init_scan() just like Linux does, but in
a swapped order!
This swap dates back to:
commit d63894654df72b010de2abb4b3f07d0d755f65b6
Author: Holger Brunck holger.brunck@keymile.com
Date:   Mon Oct 10 13:08:19 2011 +0200
 UBI: init eba tables before wl when attaching a device

 This fixes that u-boot gets stuck when a bitflip was detected
 during "ubi part <ubi_device>". If a bitflip was detected UBI tries
 to copy the PEB to a different place. This needs that the eba table
 are initialized, but this was done after the wear levelling worker
 detects the bitflip. So changes the initialisation of these two
 tasks in u-boot.

 This is a u-boot specific patch and not needed in the linux layer,
 because due to commit 1b1f9a9d00447d
 UBI: Ensure that "background thread" operations are really executed
 we schedule these tasks in place and not as in linux after the 

inital
     task which schedule this new task is finished.
 Signed-off-by: Holger Brunck <holger.brunck@keymile.com>
 cc: Stefan Roese <sr@denx.de>
 Signed-off-by: Stefan Roese <sr@denx.de>

I tried reverting that commit and... surprise! U-Boot can now attach UBI
and boot properly!
But the cited commit actually fixed a bug that bite our board a few
months back, so it should not be reverted without thinking twice. Now
it apparently introduced another bug. :-(
I'm Cc:ing the commit author for comments.
Nonetheless, I have evidence of a different behaviour between U-Boot
and Linux even before the two swapped functions are called.
What attach_by_scanning() does in Linux is (abbreviated):
static int attach_by_scanning(struct ubi_device *ubi)
{
         si = ubi_scan(ubi);
     ...fill ubi->some_fields...;
         err = ubi_read_volume_table(ubi, si);
     /* MARK */
         err = ubi_eba_init_scan(ubi, si); /* swapped in U-Boot */
         err = ubi_wl_init_scan(ubi, si);  /* swapped in U-Boot */
         ubi_scan_destroy_si(si);
         return 0;
}
See the two swapped calls.
At MARK, I printed some of the peb counters in *ubi, and I got
different results for ubi->avail_pebs between U-Boot and Linux:
U-Boot: UBI: POST_TBL: rsvd=2018, avail=21, beb_rsvd_{pebs,level}=0,0
Linux:  UBI: POST_TBL: rsvd=2018, avail=22, beb_rsvd_{pebs,level}=0,0
The printed values were equal before calling ubi_read_volume_table().
I have no idea about where this difference comes from, nor if this
difference can cause my troubles.
I will better investigate tomorrow looking into ubi_read_volume_table().
After half a day of debugging and an insane amount of printk()s added to
both U-Boot and Linux, I have some more hints to understand the problem.
The two different results quoted above show that U-Boot counted 21
available eraseblocks, while Linux counts 22. I am not sure if this can
cause my problem, but it's the first visible difference between U-Boot
and Linux.
This originates from ubi_scan() (scan.c): in U-Boot, it sets
si->bad_peb_count to 1, in Linux to 0. U-Boot's ubi_scan() is very
similar to Linux's, and the differences do not seem to relevant in my 
case. So let's dig down...

ubi_scan() (scan.c) calls process_eb() (scan.c) for each EB
process_eb() calls ubi_io_is_bad() (io.c), and if it returns >0 it
increments si->bad_peb_count, which is what is happening to my board
when executing U-Boot
ubi_io_is_bad() calls mtd->block_isbad(), which points to
nand_block_isbad() (nand_base.c)
nand_block_isbad() is a wrapper to nand_block_checkbad() (nand_base.c)
nand_block_checkbad() differs from the Linux code in something
related to lazy bad block scanning (commit fb49454b1b6c7c6, Feb 2012),
but this does not seem to change the behaviour I observe;
nand_block_checkbad() calls either chip->block_bad() or
nand_isbad_bbt(); I tracked only into the former, but I suspect the
latter produces the same effects with regard to the problem I'm facing
chip->block_bad() points to nand_block_bad() (nand_base.c)

nand_block_bad() (nand_base.c) does the following:
static int nand_block_bad(struct mtd_info *mtd, loff_t ofs, int getchip)
{
    ...
    if (likely(chip->badblockbits == 8))
            res = bad != 0xFF;
    else
            res = hweight8(bad) < chip->badblockbits;

    if (getchip)
            nand_release_device(mtd);

    return res;

}
I don't understand the algorithm, but the relevant variables have these
values:
U-Boot: nand_block_bad: chip->badblockbits=8, bad=0000, hweight8(bad)=0
Linux:  nand_block_bad: chip->badblockbits=0, bad=0000, hweight8(bad)=0
                                           ^
Obviously the U-Boot and Linux produce a different return value.
This propagates up to ubi->bad_peb_count in ubi_scan(), and from there
it changes the behaviour of the following code, leading to a block in
U-Boot and a successful attach in Linux.
chip->badblockbits in current Linux master is described as
"minimum number of set bits in a good block's bad block marker
position; i.e., BBM == 11110111b is not bad when badblockbits == 7".
Still a bit obscure to me because I don't have a general picture.
Anyway, here's how its value comes to be different between U-Boot
(2012.04.01) and Linux (2.6.37).
Linux:
a) commit e0b58d0a7005, Feb 2010:
   mtd: nand: add ->badblockbits for minimum number of set bits in bad
   block byte
   declared the new variable and introduced in nand_get_flash_type()
   (nand_base.c) the following line:
     chip->badblockbits = 8;
b) commit c7b28e25cb9, Jul 2010:
   mtd: nand: refactor BB marker detection
   removed from nand_get_flash_type() (nand_base.c) the same line:
     chip->badblockbits = 8;
c) commit 26d9be11485e, Apr 2011:
   mtd: return badblockbits back
   restored in nand_get_flash_type() (nand_base.c) the following line:
     chip->badblockbits = 8;
  claiming it had been accidentally removed in commit b).
The version of Linux I'm using (2.6.37), contains commits a) and b), so
it has chip->badblockbits equal to 0. According to the log message of
commit c), this should be wrong, but the resulting kernel works!
The version of U-Boot (2012.04.01) contains the result of all 3 commits,
since
commit 2a8e0fc8b3dc31a3c571e439fbf04b882c8986be
  Author: Christian Hitz christian.hitz@aizo.com
  Date:   Wed Oct 12 09:32:02 2011 +0200
  nand: Merge changes from Linux nand driver

  [backport from linux commit
      02f8c6aee8df3cdc935e9bdd4f2d020306035dbe]

  This patch synchronizes the nand driver with the Linux 3.0 state.

This looks like an improvement, but it bricks my board!
I could not resist, and without even trying to understand what I was
doing, I did in U-Boot's nand_get_flash_type() (nand_base.c):

  chip->badblockbits = 8;

  chip->badblockbits = 0;

and guess what? U-Boot attached UBI, loaded Linux from it and booted
successfully!
No, I don't think changing lines here and there without any real
understanding is a way to produce reliable software. But I'm unable
to understand why the software that should work better actually bricks
the board and the other one runs fine? And how do I know what the
correct value for chip->badblockbits should be?
And last but most important: how can I properly fix U-Boot?
I had another look at the commit that swapped the calls to
ubi_eba_init_scan() and ubi_wl_init_scan(), and I noticed that it 
changed the
computationof the available PEB count.
In the original (pre-swap) code, running on a working board:
static int attach_by_scanning(struct ubi_device *ubi)
{
          si = ubi_scan(ubi);
      ...fill ubi->some_fields...;
          err = ubi_read_volume_table(ubi, si);
/* here rsvd=2018, avail=22, beb_rsvd_{pebs,level}=0,0 */
err = ubi_wl_init_scan(ubi, si);  /* swapped in U-Boot */
/* herersvd=2019, avail=21, beb_rsvd_{pebs,level}=0,0 ***** */
err = ubi_eba_init_scan(ubi, si); /* swapped in U-Boot */
          ubi_scan_destroy_si(si);
          return 0;
}
In the current (post-swap) code, running on the same board:
static int attach_by_scanning(struct ubi_device *ubi)
{
          si = ubi_scan(ubi);
      ...fill ubi->some_fields...;
          err = ubi_read_volume_table(ubi, si);
/* here rsvd=2018, avail=22, beb_rsvd_{pebs,level}=0,0 */
err = ubi_eba_init_scan(ubi, si); /* swapped in U-Boot */
/* here rsvd=2039, avail=1, beb_rsvd_{pebs,level}=20,20***** */
err = ubi_wl_init_scan(ubi, si);  /* swapped in U-Boot */
          ubi_scan_destroy_si(si);
          return 0;
}
Notice the difference on the line marked with "*****": after the swap, the
number of available PEBs changed from 21to 1.
According to the docs, UBI reserves some PEBs for bad PEB handling. By
default, in my 2048-PEBs NAND, it reserved 20 PEBs, wihch are far enough to
recover from a few bad PEBs. These should be computed as part of the 
"available"
PEBs. But current U-Boot (incorrectly?) thinks thereis only 1 available PEB.
On a bricked board, it thinks there are 0, so it cannotattach UBI.
I have no fix for this, but I tried a simple workaround: instead of 
using all
the available space for my logical volumes, I created them with a smaller
size, leaving 32 unused PEBs. Now, in attach_by_scanning(), I got:
pre-swap:  rsvd=1987, avail=53, beb_rsvd_{pebs,level}=0,0
post-swap: rsvd=2007, avail=33, beb_rsvd_{pebs,level}=20,20
The computed number of available PEB is exactly 32 units bigger than it used
to be. This means, also after the swap, U-Boot thinks there are plenty of
available PEBs.
To try to simulate a board that has bad blocks, I then marked some blocks as
bad using 'nand markbad' in U-Boot. The number of available PEBs decreases
accordingly, but is still >0 and U-Boot can attach UBI and boot.
So, it seems that leaving some unused PEBs is a workaround to this problem!
I'm not 100% sure this is ok and will go on to better understand the 
problem.
Any comments are welcome.
Luca