Re: [U-Boot] UBI fixable bit-flip issue

12 Jul 2018

On 12 July 2018 18:46:11 GMT+10:00, Richard Weinberger <richard@nod.at> wrote:
Mark,
On Thursday, 12 July 2018, 07:22:13 CEST, Heiko Schocher wrote:
Hello Mark,
added Richard Weinberger to cc...
On 12.07.2018 at 02:28, Mark Spieth wrote:
Hi
In the process of investigating a boot failure on one of our devices,
the "UBI: fixable bit-flip detected at PEB" message was seen with the
following behaviour during kernel load in U-Boot.
Read [2285568] bytes
UBI: fixable bit-flip detected at PEB 415
UBI: schedule PEB 415 for scrubbing
UBI: fixable bit-flip detected at PEB 415
UBI: fixable bit-flip detected at PEB 419
UBI: schedule PEB 419 for scrubbing
UBI: fixable bit-flip detected at PEB 419
UBI: fixable bit-flip detected at PEB 420
UBI: schedule PEB 420 for scrubbing
UBI: fixable bit-flip detected at PEB 420
UBI: fixable bit-flip detected at PEB 419
UBI: fixable bit-flip detected at PEB 420
UBI: fixable bit-flip detected at PEB 419
UBI: fixable bit-flip detected at PEB 420
UBI: fixable bit-flip detected at PEB 419
UBI: fixable bit-flip detected at PEB 420
UBI: fixable bit-flip detected at PEB 419
UBI: fixable bit-flip detected at PEB 420
UBI: fixable bit-flip detected at PEB 419
UBI: fixable bit-flip detected at PEB 420
UBI: fixable bit-flip detected at PEB 419
This repeats until reset.
Do you see the same symptom also on Linux?
We need to be very sure that it is actually a UBI problem.
The Linux provided has an up-to-date mtd/ubi driver, so it already has
the 75% bit-flip threshold, which hides the issue on new flash. So the
two are not the same. Untested on Linux.
...
...
...
This fix is not a root cause fix though. Investigating further led to
the following root cause solution. The following is AFAICT.
When the scrubber chooses a PEB to move the data to, it picks it from
the free balanced tree. This tree is sorted by EC (erase count) and
then by PEB number.
The find_wl_entry call uses a max parameter of WL_FREE_MAX_DIFF, which
is 8192 in this config. So the find_wl_entry function will find a PEB
that is better in error count than the current PEB EC. This
error count? You mean erase count?
Yes of course.
...
...
...
can easily cause it to find the PEB that was just moved from if it is
the lowest numbered PEB in the free tree. Waiting for EC to go above
8192 would take a long time and cause premature aging of the flash
PEBs in question.
The easy solution is to change the max parameter to this call to 0 so
it finds a PEB with a smaller EC than the one being replaced. This
means it won't use the previously discarded PEB as its first choice.
For scrubbing this might be a good idea, but not for regular
wear-leveling.
Yes only for scrubbing, not wear leveling.
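
The ping-pong described above can be reproduced in a toy model. This
is only a sketch, not the driver code: pick_free() is a hypothetical,
array-based stand-in for find_wl_entry(), PEBs 100 and 101 and all
erase counts are made up, and it assumes that every copy shows
bit-flips again on the next read.

```c
#include <stdio.h>

#define NPEBS 4

struct peb {
    int pnum; /* physical eraseblock number */
    int ec;   /* erase counter */
};

/* Order by EC first, then by PEB number, as the free tree is described. */
static int peb_less(const struct peb *a, const struct peb *b)
{
    if (a->ec != b->ec)
        return a->ec < b->ec;
    return a->pnum < b->pnum;
}

/*
 * Toy stand-in for find_wl_entry(): among all PEBs except the source,
 * return the index of the largest entry whose EC is below
 * (lowest free EC + diff). If none qualifies, return the smallest entry.
 */
static int pick_free(const struct peb *pebs, int src, int diff)
{
    int lo = -1, best = -1;

    for (int i = 0; i < NPEBS; i++) {
        if (i == src)
            continue;
        if (lo < 0 || peb_less(&pebs[i], &pebs[lo]))
            lo = i;
    }
    for (int i = 0; i < NPEBS; i++) {
        if (i == src || pebs[i].ec >= pebs[lo].ec + diff)
            continue;
        if (best < 0 || peb_less(&pebs[best], &pebs[i]))
            best = i;
    }
    return best < 0 ? lo : best;
}

/*
 * Scrub `rounds` times, assuming every copy shows bit-flips again on
 * the next read. Returns how many distinct PEBs were used as targets.
 */
static int distinct_targets(int diff, int rounds)
{
    struct peb pebs[NPEBS] = { {100, 3}, {101, 3}, {419, 3}, {420, 3} };
    int used[NPEBS] = { 0 };
    int src = 2; /* PEB 419 holds the data at the start */
    int count = 0;

    for (int r = 0; r < rounds; r++) {
        int dst = pick_free(pebs, src, diff);

        printf("diff=%d: move PEB %d -> PEB %d\n",
               diff, pebs[src].pnum, pebs[dst].pnum);
        if (!used[dst]) {
            used[dst] = 1;
            count++;
        }
        pebs[src].ec++; /* old PEB is erased and returns to the free set */
        src = dst;
    }
    return count;
}
```

In this model distinct_targets(8192, 6) returns 2, because the data
only ever bounces between PEBs 419 and 420, while
distinct_targets(0, 6) returns 4, because the erases are spread over
every free PEB.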
...
See comment in UBI:
/*
 * When a physical eraseblock is moved, the WL sub-system has to pick the target
 * physical eraseblock to move to. The simplest way would be just to pick the
 * one with the highest erase counter. But in certain workloads this could lead
 * to an unlimited wear of one or few physical eraseblock. Indeed, imagine a
 * situation when the picked physical eraseblock is constantly erased after the
 * data is written to it. So, we have a constant which limits the highest erase
 * counter of the free physical eraseblock to pick. Namely, the WL sub-system
 * does not pick eraseblocks with erase counter greater than the lowest erase
 * counter plus %WL_FREE_MAX_DIFF.
 */
#define WL_FREE_MAX_DIFF (2*UBI_WL_THRESHOLD)
So we could change the logic such that for regular wear-leveling we
keep using WL_FREE_MAX_DIFF,
but for scrubbing (which is 1:1 wear-leveling but the source PEB is
showing bit-flips) we use
a lower value. IMHO WL_FREE_MAX_DIFF/2 would be a good choice.
I'm not sure whether 0 is too extreme and might cause other
distortions.
Yes, the wear leveling threshold is still WL_FREE_MAX_DIFF and the scrubbing threshold is 0.
This is why I'm asking. Because the 2 PEBs will track each other's EC I'm not sure that will work.
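
The values under discussion can be laid out side by side. This is a
sketch only: free_max_diff() and enum scrub_policy are hypothetical
names, not anything in the UBI sources, and UBI_WL_THRESHOLD is
assumed to be 4096 because the thread states that WL_FREE_MAX_DIFF is
8192 in this config.

```c
/* Assumed value: half of the 8192 quoted for WL_FREE_MAX_DIFF above. */
#define UBI_WL_THRESHOLD 4096
#define WL_FREE_MAX_DIFF (2 * UBI_WL_THRESHOLD)

/* Hypothetical selector for the scrubbing variants being debated. */
enum scrub_policy {
    SCRUB_DIFF_FULL, /* current behaviour: same limit as wear-leveling */
    SCRUB_DIFF_HALF, /* WL_FREE_MAX_DIFF / 2, as suggested by Richard */
    SCRUB_DIFF_ZERO  /* 0, as proposed by Mark */
};

/*
 * Return the EC window to use when picking a free target PEB.
 * Regular wear-leveling always keeps WL_FREE_MAX_DIFF; only a move
 * triggered by scrubbing uses the reduced value.
 */
static int free_max_diff(int scrubbing, enum scrub_policy policy)
{
    if (!scrubbing)
        return WL_FREE_MAX_DIFF;

    switch (policy) {
    case SCRUB_DIFF_HALF:
        return WL_FREE_MAX_DIFF / 2;
    case SCRUB_DIFF_ZERO:
        return 0;
    default:
        return WL_FREE_MAX_DIFF;
    }
}
```

With these assumed numbers the scrubbing window would be 8192, 4096
or 0 erase cycles wide depending on the variant, and the regular
wear-leveling window stays at 8192 in all three.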
...
Mark, can you please file a patch and send it to linux-mtd mailing
list?
Such a change needs to go through Linux and then to u-boot.
But first we need to think about and discuss it in detail.
Will do.
...
...
I am not sure if it is so easy ...
...
This fix was implemented and fixable bit-flip errors no longer
hang/freeze the boot process! UBI erase and reformat was used between
re-tests to get consistent results.
Adding the above 75% correctable bitflip threshold is also a good
thing as less movement will ensue when the FLASH is new, but as the
flash ages, the root cause will once again be invoked causing
un-recoverable boot failures.
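
A rough sketch of what the 75% threshold changes, under stated
assumptions: both function names are hypothetical, and rounding the
threshold up is an assumption about how the driver computes it.
Without such a threshold, a single corrected bit would be enough to
report a fixable bit-flip.

```c
/* Assumed rule: the threshold is 75% of the ECC strength, rounded up. */
static unsigned int bitflip_threshold(unsigned int ecc_strength)
{
    return (ecc_strength * 3 + 3) / 4;
}

/*
 * Nonzero when a read that needed `max_bitflips` corrected bits would
 * be reported as a fixable bit-flip and so schedule the PEB for
 * scrubbing.
 */
static int needs_scrub(unsigned int max_bitflips, unsigned int ecc_strength)
{
    return max_bitflips >= bitflip_threshold(ecc_strength);
}
```

For an 8-bit ECC this gives a threshold of 6, so a read with 5
corrected bits is left alone and a read with 6 is scrubbed. As the
flash ages and more reads cross the threshold, scrubbing runs again
and the target selection problem above comes back.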
Note this fault is also in the latest kernel drivers for UBI and may
also exist in other wear leveling implementations. The kernel driver
issue may be at fault for Android devices locking up/freezing
sporadically during FLASH read when scrubbing due to a relatively full
flash and correctable errors causing ping pong PEB moves.
The question is, is my root cause solution sound or have I missed
something?
...
I have to think about it before I write nonsense, but maybe Richard
has deeper insight here.
Thanks for your input.
Mark
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.