[U-Boot] UBI problems on current u-bo

Hi Stefan, I address this question to you because one of your commits is connected to this problem, but other hints from other readers are also welcome ;-) .
We have a Kirkwood-based board with a Micron NAND flash. We have one UBI device created on the NAND flash, and inside the device we have one UBI volume where we store our Linux kernel. At startup we attach to the UBI device to be able to read out the kernel image. On our old U-Boot branch, which is based on v2009.08, we didn't have any problems. Now, after upgrading to the newest U-Boot version, we saw that in some rare cases U-Boot gets stuck when we try to attach:
=> ubi part ubi0
Creating 1 MTD partitions on "nand0":
0x000000000000-0x000008000000 : "mtd=0"
UBI: attaching mtd1 to ubi0
UBI: physical eraseblock size: 131072 bytes (128 KiB)
UBI: logical eraseblock size: 129024 bytes
UBI: smallest flash I/O unit: 2048
UBI: sub-page size: 512
UBI: VID header offset: 512 (aligned 512)
UBI: data offset: 2048
UBI: fixable bit-flip detected at PEB 71
After this, U-Boot hangs until the end of days and we have to force a reboot, but then U-Boot gets stuck again.
If I revert your commit:
commit 1b1f9a9d00447d9eab32ae5633f60a106196b75f
Author: Stefan Roese sr@denx.de
Date: Mon May 17 10:00:51 2010 +0200
UBI: Ensure that "background thread" operations are really executed
U-Boot doesn't get stuck:
=> ubi part ubi0
Creating 1 MTD partitions on "nand0":
0x000000000000-0x000008000000 : "mtd=0"
UBI: attaching mtd1 to ubi0
UBI: physical eraseblock size: 131072 bytes (128 KiB)
UBI: logical eraseblock size: 129024 bytes
UBI: smallest flash I/O unit: 2048
UBI: sub-page size: 512
UBI: VID header offset: 512 (aligned 512)
UBI: data offset: 2048
UBI: fixable bit-flip detected at PEB 71
UBI: attached mtd1 to ubi0
UBI: MTD device name: "mtd=0"
UBI: MTD device size: 128 MiB
UBI: number of good PEBs: 1024
UBI: number of bad PEBs: 0
UBI: max. allowed volumes: 128
UBI: wear-leveling threshold: 4096
UBI: number of internal volumes: 1
UBI: number of user volumes: 9
UBI: available PEBs: 623
UBI: total number of reserved PEBs: 401
UBI: number of PEBs reserved for bad PEB handling: 10
UBI: max/mean erase counter: 8193/3082
=>
This is the reason why our old U-Boot works: the background thread seems not to be executed, or not completely executed...
If I boot a recent Linux kernel, the kernel also reports a "fixable bit-flip detected at PEB 71", but Linux is able to actually fix this bit flip and works as expected. Afterwards even U-Boot is bootable again, because the bit flip has been corrected and is gone.
Now I could revert your commit in my local branch, and then it seems to work, but I think this is not a good solution, because I expect that the real error is somewhere in the UBI layer in U-Boot and is already fixed in current Linux. AFAIK the UBI layer was initially copied from Linux, but it seems that the bugfixes have not been backported in recent years. Any thoughts or ideas?
Best regards Holger

Hi Holger,
On Friday 02 September 2011 15:32:40 Holger Brunck wrote:
I address this question to you because one of your commits is connected to this problem, but other hints from other readers are also welcome ;-) .
I'll try to look into this later this week.
BTW: Is this problem reproducible on one of your systems?
Best regards, Stefan
-- DENX Software Engineering GmbH, MD: Wolfgang Denk & Detlev Zundel HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany Phone: (+49)-8142-66989-0 Fax: (+49)-8142-66989-80 Email: office@denx.de

Hi Stefan,
On 09/05/2011 04:37 PM, Stefan Roese wrote:
I address this question to you because one of your commits is connected to this problem, but other hints from other readers are also welcome ;-) .
I'll try to look into this later this week.
BTW: Is this problem reproducible on one of your systems?
Best regards, Stefan

Hi Stefan, sorry for the previous mail, but I hit the send button too fast ;-)
On 09/05/2011 04:37 PM, Stefan Roese wrote:
On Friday 02 September 2011 15:32:40 Holger Brunck wrote:
I address this question to you because one of your commits is connected to this problem, but other hints from other readers are also welcome ;-) .
I'll try to look into this later this week.
Thanks. I also wanted to do some further investigations, if I find the time this week.
BTW: Is this problem reproducible on one of your systems?
Yes, we found a way to reproduce the bug on one of our boards. We need a special bit pattern in one UBI PEB, then force a bitflip, and afterwards the problem is present.
Best regards Holger

Hi,
On 09/05/2011 05:57 PM, Holger Brunck wrote:
On 09/05/2011 04:37 PM, Stefan Roese wrote:
BTW: Is this problem reproducible on one of your systems?
Yes, we found a way to reproduce the bug on one of our boards. We need a special bit pattern in one UBI PEB, then force a bitflip, and afterwards the problem is present.
I have done some further investigations. It's not true that we need a special bit pattern in one UBI PEB. We only need a situation where the UBI layer in U-Boot finds a fixable bitflip in NAND, and then U-Boot gets stuck.
The loop in which U-Boot gets stuck is in drivers/mtd/ubi/wl.c:
schedule_erase <--------
      |                 |
erase_worker            |
      |                 |
ensure_wear_leveling    |
      |                 |
wear_leveling_worker  --|
And from this loop we will never return.
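The call chain above can be illustrated with a toy model. This is only a sketch and not the actual U-Boot code: the four function names are taken from the chain, while `struct toy_state`, `MAX_DEPTH`, `move_can_succeed` and `toy_cycle_hangs()` are hypothetical names made up for this illustration. It assumes that scheduled work is run synchronously and that the move in the wear-leveling worker never succeeds, which is enough to make each step re-enter the first one.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: names mirror the call chain, bodies are hypothetical. */
#define MAX_DEPTH 16              /* guard so that the model terminates */

struct toy_state {
    bool scrub_pending;           /* a PEB with a fixable bit-flip is waiting */
    bool move_can_succeed;        /* assumption: false reproduces the cycle */
    int  calls;                   /* how often schedule_erase() was entered */
};

static void erase_worker(struct toy_state *st);

/* The scheduled work is executed right away instead of being queued. */
static void schedule_erase(struct toy_state *st)
{
    if (++st->calls >= MAX_DEPTH)
        return;                   /* the modelled cycle has no such exit */
    erase_worker(st);
}

static void wear_leveling_worker(struct toy_state *st)
{
    if (st->move_can_succeed) {
        st->scrub_pending = false;    /* PEB moved, nothing left to do */
        return;
    }
    schedule_erase(st);               /* move failed: schedule another erase */
}

static void ensure_wear_leveling(struct toy_state *st)
{
    if (st->scrub_pending)
        wear_leveling_worker(st);
}

static void erase_worker(struct toy_state *st)
{
    ensure_wear_leveling(st);
}

/* Returns true if the chain only stopped because of the MAX_DEPTH guard. */
bool toy_cycle_hangs(bool move_can_succeed)
{
    struct toy_state st = {
        .scrub_pending = true,
        .move_can_succeed = move_can_succeed,
        .calls = 0,
    };

    schedule_erase(&st);
    assert(st.calls <= MAX_DEPTH);
    return st.calls >= MAX_DEPTH;
}
```

Calling `toy_cycle_hangs(false)` runs into the guard, while `toy_cycle_hangs(true)` returns after a single pass through the chain.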
I have seen in mainline kernel this fix in the ubi layer:
commit b86a2c56e512f46d140a4bcb4e35e8a7d4a99a4b
Author: Artem Bityutskiy Artem.Bityutskiy@nokia.com
Date: Sun May 24 14:13:34 2009 +0300
UBI: do not switch to R/O mode on read errors
This patch improves UBI errors handling. ATM UBI switches to R/O mode when the WL worker fails to read the source PEB. This means that the upper layers (e.g., UBIFS) has no chances to unmap the erroneous PEB and fix the error. This patch changes this behaviour and makes UBI put PEBs like this into a separate RB-tree, thus preventing the WL worker from hitting the same read errors again and again.
[...]
And this sounds like the problem I see in U-Boot. But this patch is not easy to port to U-Boot because of the changes the kernel's UBI layer had undergone before it...
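The idea described in the quoted commit message, keeping PEBs with read errors in a separate container so the WL worker does not pick them again, might be sketched as follows. This is a simplified illustration and not the kernel patch: a plain boolean array stands in for the separate RB-tree, and `struct toy_wl`, `toy_pick_source()` and `toy_wl_pass()` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_PEBS 8

struct toy_wl {
    bool erroneous[TOY_PEBS];     /* stands in for the separate RB-tree */
    bool read_fails[TOY_PEBS];    /* PEBs whose reads fail in this model */
    int  read_attempts[TOY_PEBS]; /* how often each PEB was read */
};

/* Pick the next move source, skipping PEBs already marked erroneous. */
static int toy_pick_source(const struct toy_wl *wl)
{
    for (int pnum = 0; pnum < TOY_PEBS; pnum++)
        if (!wl->erroneous[pnum])
            return pnum;
    return -1;
}

/* One wear-leveling pass: returns the PEB that was read, or -1. */
int toy_wl_pass(struct toy_wl *wl)
{
    int pnum = toy_pick_source(wl);

    if (pnum < 0)
        return -1;
    assert(pnum < TOY_PEBS);

    wl->read_attempts[pnum]++;
    if (wl->read_fails[pnum]) {
        /* Set the PEB aside instead of retrying it on every pass. */
        wl->erroneous[pnum] = true;
        return -1;
    }
    return pnum;
}
```

In this model a failing PEB is read exactly once; every later pass skips it and continues with the next candidate.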
Best regards Holger Brunck

Hi Holger,
On Monday 12 September 2011 19:16:33 Holger Brunck wrote:
I have seen in mainline kernel this fix in the ubi layer:
commit b86a2c56e512f46d140a4bcb4e35e8a7d4a99a4b Author: Artem Bityutskiy Artem.Bityutskiy@nokia.com Date: Sun May 24 14:13:34 2009 +0300
UBI: do not switch to R/O mode on read errors This patch improves UBI errors handling. ATM UBI switches to R/O mode when the WL worker fails to read the source PEB. This means that the upper layers (e.g., UBIFS) has no chances to unmap the erroneous PEB and fix the error. This patch changes this behaviour and makes UBI put PEBs like this into a separate RB-tree, thus preventing the WL worker from hitting the same read errors again and again.
[...]
And this sounds like the problem I see in u-boot.
Yes, very likely.
But this patch is not easy to port to U-Boot because of the changes the kernel's UBI layer had undergone before it...
Correct. UBI has undergone many changes since the integration into U-Boot back at the end of 2008 (nearly 3 years ago now). Perhaps the best would be to re-sync with the latest Linux UBI version. But this sounds like quite a lot of work as well...
Best regards, Stefan

Hi Stefan,
On 09/13/2011 09:39 AM, Stefan Roese wrote:
Hi Holger,
On Monday 12 September 2011 19:16:33 Holger Brunck wrote:
I have seen in mainline kernel this fix in the ubi layer:
commit b86a2c56e512f46d140a4bcb4e35e8a7d4a99a4b Author: Artem Bityutskiy Artem.Bityutskiy@nokia.com Date: Sun May 24 14:13:34 2009 +0300
UBI: do not switch to R/O mode on read errors This patch improves UBI errors handling. ATM UBI switches to R/O mode when the WL worker fails to read the source PEB. This means that the upper layers (e.g., UBIFS) has no chances to unmap the erroneous PEB and fix the error. This patch changes this behaviour and makes UBI put PEBs like this into a separate RB-tree, thus preventing the WL worker from hitting the same read errors again and again.
[...]
And this sounds like the problem I see in u-boot.
Yes, very likely.
But this patch is not easy to port to U-Boot because of the changes the kernel's UBI layer had undergone before it...
Correct. UBI has undergone many changes since the integration into U-Boot back at the end of 2008 (nearly 3 years ago now). Perhaps the best would be to re-sync with the latest Linux UBI version. But this sounds like quite a lot of work as well...
I found a way to port only this patch. But I had to make some changes to it, e.g. replace the error constants with the numeric values present in U-Boot's UBI version, and remove the setting of constants (MOVE_TARGET_RD_ERR) whose values are not evaluated in U-Boot's UBI version. This fixes my problem. Should I post this patch?
Regards Holger

Hi Holger,
On Tuesday 13 September 2011 10:32:34 Holger Brunck wrote:
But this patch is not easy to port to U-Boot because of the changes the kernel's UBI layer had undergone before it...
Correct. UBI has undergone many changes since the integration into U-Boot back at the end of 2008 (nearly 3 years ago now). Perhaps the best would be to re-sync with the latest Linux UBI version. But this sounds like quite a lot of work as well...
I found a way to port only this patch.
Good. :)
But I had to do some changes to it. E.g. exchange the constants for errors with numeric values as present in u-boots ubi version. And remove setting of the constants (MOVE_TARGET_RD_ERR) values which are not evaluated in u-boots ubi version. This fixes my problem. Should I post this patch?
Yes, please do.
Best regards, Stefan

Hi Stefan
On 09/13/2011 10:37 AM, Stefan Roese wrote:
On Tuesday 13 September 2011 10:32:34 Holger Brunck wrote:
But this patch is not easy to port to U-Boot because of the changes the kernel's UBI layer had undergone before it...
Correct. UBI has undergone many changes since the integration into U-Boot back at the end of 2008 (nearly 3 years ago now). Perhaps the best would be to re-sync with the latest Linux UBI version. But this sounds like quite a lot of work as well...
I found a way to port only this patch.
Good. :)
But I had to do some changes to it. E.g. exchange the constants for errors with numeric values as present in u-boots ubi version. And remove setting of the constants (MOVE_TARGET_RD_ERR) values which are not evaluated in u-boots ubi version. This fixes my problem. Should I post this patch?
Yes, please do.
Argh, I realized that in my working branch your patch "UBI: Ensure that "background thread" operations are really executed" was still reverted. After removing this revert, I had the same problems as before...
So there is another problem present, and I currently can't find the reason for it...
Regards Holger

On 09/13/2011 05:31 PM, Holger Brunck wrote:
On 09/13/2011 10:37 AM, Stefan Roese wrote:
On Tuesday 13 September 2011 10:32:34 Holger Brunck wrote:
But this patch is not easy to port to U-Boot because of the changes the kernel's UBI layer had undergone before it...
Correct. UBI has undergone many changes since the integration into U-Boot back at the end of 2008 (nearly 3 years ago now). Perhaps the best would be to re-sync with the latest Linux UBI version. But this sounds like quite a lot of work as well...
I found a way to port only this patch.
Good. :)
But I had to do some changes to it. E.g. exchange the constants for errors with numeric values as present in u-boots ubi version. And remove setting of the constants (MOVE_TARGET_RD_ERR) values which are not evaluated in u-boots ubi version. This fixes my problem. Should I post this patch?
Yes, please do.
Argh, I realized that in my working branch your patch "UBI: Ensure that "background thread" operations are really executed" was still reverted. After removing this revert, I had the same problems as before...
So there is another problem present, and I currently can't find the reason for it...
I have finished my investigations here. My patch mentioned above was not the solution to the problem. What really fixed my problem was reverting the commit "UBI: Ensure that "background thread" operations are really executed".
The reason is that the UBI layer assumes that these "background threads" are executed after the task that scheduled the "thread" has finished. U-Boot gets stuck while calling ubi_wl_init_scan(ubi, si). In the Linux implementation, when a bitflip is detected, a background thread is started to schedule an erase for this PEB. In the current U-Boot implementation the erase is done immediately, and this is a problem because some of the required data is not yet initialized at this stage. So reverting this commit fixes the problem for me.
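The ordering problem described above can be shown with a small model. It is a sketch under stated assumptions, not U-Boot or Linux code: `struct toy_ubi`, `toy_schedule()`, `toy_wl_init()`, `toy_run_queue()` and `toy_attach()` are hypothetical names, and "initialized" is reduced to a single `wl_ready` flag that the modelled init sets as its last step.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_WORKS 8

struct toy_ubi;
typedef int (*work_fn)(struct toy_ubi *ubi);

struct toy_ubi {
    bool    wl_ready;            /* set at the very end of the modelled init */
    bool    run_immediately;     /* true: run work as soon as it is scheduled */
    work_fn queue[MAX_WORKS];    /* pending work for the deferred case */
    size_t  queued;
};

/* Modelled erase work: it needs fully initialized wear-leveling state. */
static int toy_erase_work(struct toy_ubi *ubi)
{
    return ubi->wl_ready ? 0 : -1;   /* -1: the work ran too early */
}

static int toy_schedule(struct toy_ubi *ubi, work_fn fn)
{
    if (ubi->run_immediately)
        return fn(ubi);              /* executed in the scheduler's context */

    assert(ubi->queued < MAX_WORKS);
    ubi->queue[ubi->queued++] = fn;  /* executed later by toy_run_queue() */
    return 0;
}

/* Modelled init: a bit-flip is found and an erase is scheduled mid-way. */
static int toy_wl_init(struct toy_ubi *ubi)
{
    int err = toy_schedule(ubi, toy_erase_work);

    ubi->wl_ready = true;            /* remaining init happens only here */
    return err;
}

static int toy_run_queue(struct toy_ubi *ubi)
{
    int err = 0;

    for (size_t i = 0; i < ubi->queued; i++)
        if (ubi->queue[i](ubi) != 0)
            err = -1;
    ubi->queued = 0;
    return err;
}

/* Returns 0 if the scheduled work saw initialized state, -1 otherwise. */
int toy_attach(bool run_immediately)
{
    struct toy_ubi ubi = { .run_immediately = run_immediately };
    int err = toy_wl_init(&ubi);

    return err ? err : toy_run_queue(&ubi);
}
```

With `toy_attach(false)` the work is queued and runs after init has completed; with `toy_attach(true)` it runs inside init and sees the unfinished state.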
Best regards Holger

Part of this relates to the Linux commits in which ECC errors are treated as commonplace for MLC NAND parts, and a bitflip threshold is set up so that EUCLEAN is returned from mtdcore.c. So, the work above avoids the UBI side effect of relocating the PEB when it gets the bit error indication. However, the bit error is valid on these MLC parts, so is the idea to port the work from Linux, or just to make UBI read-only in U-Boot and not allow it to relocate PEBs on bit errors?
For reference, the Linux commits:
commit 3f91e94f7f511de74c0d2abe08672ccdbdd1961c
Author: Mike Dunn mikedunn@newsguy.com
Date: Wed Apr 25 12:06:09 2012 -0700
mtd: nand: read_page() returns max_bitflips
... snip so web post works ...
commit edbc4540e02c201bdd4f4d498ebb6ed517fd36e2
Author: Mike Dunn mikedunn@newsguy.com
Date: Wed Apr 25 12:06:11 2012 -0700
mtd: driver _read() returns max_bitflips; mtd_read() returns -EUCLEAN
... snip so web post works ...
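The threshold behaviour named in the commit titles above can be sketched roughly like this. It is a simplified model rather than the mtdcore.c source: `struct toy_mtd` and `toy_mtd_read_status()` are hypothetical, the `bitflip_threshold` field is modelled on the idea from those commits, and the fallback value for `EUCLEAN` is only used if the C library does not define it.

```c
#include <assert.h>
#include <errno.h>

#ifndef EUCLEAN
#define EUCLEAN 117   /* Linux value, defined here only if libc lacks it */
#endif

struct toy_mtd {
    unsigned int bitflip_threshold;  /* corrected flips that trigger scrubbing */
};

/*
 * driver_ret models the driver's read result: a negative errno on a hard
 * failure, otherwise the maximum number of bit-flips corrected.
 */
int toy_mtd_read_status(const struct toy_mtd *mtd, int driver_ret)
{
    assert(mtd->bitflip_threshold > 0);

    if (driver_ret < 0)
        return driver_ret;           /* hard error, passed through unchanged */

    /* Few corrected flips are treated as a clean read. */
    return (unsigned int)driver_ret >= mtd->bitflip_threshold ? -EUCLEAN : 0;
}
```

In this model a caller such as UBI would only see the bit error indication once the number of corrected flips reaches the threshold, so a single corrected flip does not by itself trigger a PEB relocation.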
Regards, Charles
participants (3): Charles Hardin, Holger Brunck, Stefan Roese