[U-Boot] NAND ECC Error with wrong SMC ording bug

Hi All:
It seems that the ECC correction is broken on the Linux with the 4xx NDFC driver. It uses the SMC order when reading the ECC code. 2-1-3
static int ndfc_calculate_ecc(struct mtd_info *mtd, const u_char *dat, u_char *ecc_code) { struct ndfc_controller *ndfc = &ndfc_ctrl; uint32_t ecc; uint8_t *p = (uint8_t *)&ecc;
wmb(); ecc = in_be32(ndfc->ndfcbase + NDFC_ECC); /* The NDFC uses Smart Media (SMC) bytes order */ ecc_code[0] = p[2]; ecc_code[1] = p[1]; ecc_code[2] = p[3];
return 0; }
However, when in the correction function, the byte address order is again reverses causing incorrect byte location.
* performace it does not make any difference */ if (eccsize_mult == 1) byte_addr = (addressbits[b0] << 4) + addressbits[b1];
The above really should be byte_addr = (addressbits[b1] << 4) +
addressbits[b0];
else byte_addr = (addressbits[b2 & 0x3] << 8) + (addressbits[b1] << 4) + addressbits[b0]; bit_addr = addressbits[b2 >> 2]; /* flip the bit */ buf[byte_addr] ^= (1 << bit_addr); printk(KERN_INFO "Corrected b[0] 0x%x b[1]0x%x\n", b0, b1); printk(KERN_INFO "cal ecc b[0] 0x%x b[1]0x%x\n", calc_ecc[0] , calc_ecc[1]); printk(KERN_INFO "read ecc b[0] 0x%x b[1]0x%x\n", read_ecc[0] , read_ecc[1]); return 1;
I see other boards using SMC as well, can someone comment on the change I am proposing. Should I change the correction algorithm or the calculate function? If the later is preferred it would mean the change must be pushed in both U-Boot and Linux.
Feng Kan AMCC Software

On Wed, 19 Aug 2009 16:16:54 -0700 Feng Kan fkan@amcc.com wrote:
I see other boards using SMC as well, can someone comment on the change I am proposing. Should I change the correction algorithm or the calculate function? If the later is preferred it would mean the change must be pushed in both U-Boot and Linux.
Odds are the calculate function is wrong. The correction algo is used by many nand drivers, I *assume* it is correct. The calculate function was set to agree with u-boot (1.3.0).
Cheers, Sean
P.S. Yes, I know the u-boot is an ancient version :(

On Thursday 20 August 2009 06:38:51 Sean MacLennan wrote:
I see other boards using SMC as well, can someone comment on the change I am proposing. Should I change the correction algorithm or the calculate function? If the later is preferred it would mean the change must be pushed in both U-Boot and Linux.
Odds are the calculate function is wrong. The correction algo is used by many nand drivers, I *assume* it is correct. The calculate function was set to agree with u-boot (1.3.0).
Yes, it seems that you changed the order in the calculation function while reworking the NDFC driver for arch/powerpc. So we should probably change this order back to the original version. And change it in U-Boot as well.
BTW: I didn't see any problems with ECC so far with the current code. Feng, how did you spot this problem?
Cheers, Stefan
-- DENX Software Engineering GmbH, MD: Wolfgang Denk & Detlev Zundel HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany Phone: (+49)-8142-66989-0 Fax: (+49)-8142-66989-80 Email: office@denx.de

On Thu, 20 Aug 2009 07:01:21 +0200 Stefan Roese sr@denx.de wrote:
On Thursday 20 August 2009 06:38:51 Sean MacLennan wrote:
I see other boards using SMC as well, can someone comment on the change I am proposing. Should I change the correction algorithm or the calculate function? If the later is preferred it would mean the change must be pushed in both U-Boot and Linux.
Odds are the calculate function is wrong. The correction algo is used by many nand drivers, I *assume* it is correct. The calculate function was set to agree with u-boot (1.3.0).
Yes, it seems that you changed the order in the calculation function while reworking the NDFC driver for arch/powerpc. So we should probably change this order back to the original version. And change it in U-Boot as well.
BTW: I didn't see any problems with ECC so far with the current code. Feng, how did you spot this problem?
Ok, I think I have reproduced the problem programmatically. Basically, I force a one bit error with the following patch:
diff --git a/drivers/mtd/nand/nand_base.c b/drivers/mtd/nand/nand_base.c index 8c21b89..91dd5b4 100644 --- a/drivers/mtd/nand/nand_base.c +++ b/drivers/mtd/nand/nand_base.c @@ -1628,11 +1628,22 @@ static void nand_write_page_hwecc(struct mtd_info *mtd, struct nand_chip *chip, uint8_t *ecc_calc = chip->buffers->ecccalc; const uint8_t *p = buf; uint32_t *eccpos = chip->ecc.layout->eccpos; + static int count;
for (i = 0; eccsteps; eccsteps--, i += eccbytes, p += eccsize) { chip->ecc.hwctl(mtd, NAND_ECC_WRITE); - chip->write_buf(mtd, p, eccsize); - chip->ecc.calculate(mtd, p, &ecc_calc[i]); + if (count == 0) { + count = 1; + printk("Corrupt one bit: %08x => %08x\n", + *p, *p ^ 8); + *(uint8_t *)p ^= 8; + chip->write_buf(mtd, p, eccsize); + *(uint8_t *)p ^= 8; + nand_calculate_ecc(mtd, p, &ecc_calc[i]); + } else { + chip->write_buf(mtd, p, eccsize); + chip->ecc.calculate(mtd, p, &ecc_calc[i]); + } }
for (i = 0; i < chip->ecc.total; i++)
Basically I write a one bit error to the NAND, but calculate with the correct bit. This assumes nand_calculate_ecc is correct.
I then added debugs to the correction to make sure it corrected properly:
diff --git a/drivers/mtd/nand/nand_ecc.c b/drivers/mtd/nand/nand_ecc.c index c0cb87d..57dcaa1 100644 --- a/drivers/mtd/nand/nand_ecc.c +++ b/drivers/mtd/nand/nand_ecc.c @@ -483,14 +483,20 @@ int nand_correct_data(struct mtd_info *mtd, unsigned char *buf, byte_addr = (addressbits[b2 & 0x3] << 8) + (addressbits[b1] << 4) + addressbits[b0]; bit_addr = addressbits[b2 >> 2]; + + printk("Single bit error: correct %08x => %08x\n", + buf[byte_addr], buf[byte_addr] ^ (1 << bit_addr)); + /* flip the bit */ buf[byte_addr] ^= (1 << bit_addr); return 1;
} /* count nr of bits; use table lookup, faster than calculating it */ - if ((bitsperbyte[b0] + bitsperbyte[b1] + bitsperbyte[b2]) == 1) + if ((bitsperbyte[b0] + bitsperbyte[b1] + bitsperbyte[b2]) == 1) { + printk("ECC DATA BAD\n"); // SAM DBG return 1; /* error in ecc data; no action needed */ + }
printk(KERN_ERR "uncorrectable error : "); return -1;
With the current ndfc code, the error correction gets the bits wrong. Switching it back to the original way and the correction is correct.
diff --git a/drivers/mtd/nand/ndfc.c b/drivers/mtd/nand/ndfc.c index 89bf85a..497e175 100644 --- a/drivers/mtd/nand/ndfc.c +++ b/drivers/mtd/nand/ndfc.c @@ -101,9 +101,8 @@ static int ndfc_calculate_ecc(struct mtd_info *mtd,
wmb(); ecc = in_be32(ndfc->ndfcbase + NDFC_ECC); - /* The NDFC uses Smart Media (SMC) bytes order */ - ecc_code[0] = p[2]; - ecc_code[1] = p[1]; + ecc_code[0] = p[1]; + ecc_code[1] = p[2]; ecc_code[2] = p[3];
return 0;
Does anybody see a problem with my method of reproducing the bug? This bug is deadly for our customers. I don't want to make the change unless it is absolutely necessary.
Cheers, Sean

Hi Sean,
The change is necessary in both Linux and u-boot. Without this change customer are seeing the problem.
Best Regards,
Victor Gallardo
-----Original Message----- From: linuxppc-dev-bounces+vgallardo=amcc.com@lists.ozlabs.org [mailto:linuxppc-dev- bounces+vgallardo=amcc.com@lists.ozlabs.org] On Behalf Of Sean MacLennan Sent: Thursday, August 20, 2009 12:37 PM To: Stefan Roese Cc: u-boot@lists.denx.de; Feng Kan; linux-mtd@lists.infradead.org; linuxppc-dev@ozlabs.org Subject: Re: [U-Boot] NAND ECC Error with wrong SMC ording bug
On Thu, 20 Aug 2009 07:01:21 +0200 Stefan Roese sr@denx.de wrote:
On Thursday 20 August 2009 06:38:51 Sean MacLennan wrote:
I see other boards using SMC as well, can someone comment on the change I am proposing. Should I change the correction algorithm or the calculate function? If the later is preferred it would mean the change must be pushed in both U-Boot and Linux.
Odds are the calculate function is wrong. The correction algo is used by many nand drivers, I *assume* it is correct. The calculate function was set to agree with u-boot (1.3.0).
Yes, it seems that you changed the order in the calculation function while reworking the NDFC driver for arch/powerpc. So we should probably change this order back to the original version. And change it in U-Boot as well.
BTW: I didn't see any problems with ECC so far with the current code. Feng, how did you spot this problem?
Ok, I think I have reproduced the problem programmatically. Basically, I force a one bit error with the following patch:
diff --git a/drivers/mtd/nand/nand_base.c b/drivers/mtd/nand/nand_base.c index 8c21b89..91dd5b4 100644 --- a/drivers/mtd/nand/nand_base.c +++ b/drivers/mtd/nand/nand_base.c @@ -1628,11 +1628,22 @@ static void nand_write_page_hwecc(struct mtd_info *mtd, struct nand_chip *chip, uint8_t *ecc_calc = chip->buffers->ecccalc; const uint8_t *p = buf; uint32_t *eccpos = chip->ecc.layout->eccpos;
static int count;
for (i = 0; eccsteps; eccsteps--, i += eccbytes, p += eccsize) { chip->ecc.hwctl(mtd, NAND_ECC_WRITE);
chip->write_buf(mtd, p, eccsize);
chip->ecc.calculate(mtd, p, &ecc_calc[i]);
if (count == 0) {
count = 1;
printk("Corrupt one bit: %08x => %08x\n",
*p, *p ^ 8);
*(uint8_t *)p ^= 8;
chip->write_buf(mtd, p, eccsize);
*(uint8_t *)p ^= 8;
nand_calculate_ecc(mtd, p, &ecc_calc[i]);
} else {
chip->write_buf(mtd, p, eccsize);
chip->ecc.calculate(mtd, p, &ecc_calc[i]);
}
}
for (i = 0; i < chip->ecc.total; i++)
Basically I write a one bit error to the NAND, but calculate with the correct bit. This assumes nand_calculate_ecc is correct.
I then added debugs to the correction to make sure it corrected properly:
diff --git a/drivers/mtd/nand/nand_ecc.c b/drivers/mtd/nand/nand_ecc.c index c0cb87d..57dcaa1 100644 --- a/drivers/mtd/nand/nand_ecc.c +++ b/drivers/mtd/nand/nand_ecc.c @@ -483,14 +483,20 @@ int nand_correct_data(struct mtd_info *mtd, unsigned char *buf, byte_addr = (addressbits[b2 & 0x3] << 8) + (addressbits[b1] << 4) + addressbits[b0]; bit_addr = addressbits[b2 >> 2];
printk("Single bit error: correct %08x => %08x\n",
buf[byte_addr], buf[byte_addr] ^ (1 << bit_addr));
/* flip the bit */ buf[byte_addr] ^= (1 << bit_addr); return 1;
} /* count nr of bits; use table lookup, faster than calculating it */
- if ((bitsperbyte[b0] + bitsperbyte[b1] + bitsperbyte[b2]) == 1)
if ((bitsperbyte[b0] + bitsperbyte[b1] + bitsperbyte[b2]) == 1) {
printk("ECC DATA BAD\n"); // SAM DBG
return 1; /* error in ecc data; no action needed */
}
printk(KERN_ERR "uncorrectable error : "); return -1;
With the current ndfc code, the error correction gets the bits wrong. Switching it back to the original way and the correction is correct.
diff --git a/drivers/mtd/nand/ndfc.c b/drivers/mtd/nand/ndfc.c index 89bf85a..497e175 100644 --- a/drivers/mtd/nand/ndfc.c +++ b/drivers/mtd/nand/ndfc.c @@ -101,9 +101,8 @@ static int ndfc_calculate_ecc(struct mtd_info *mtd,
wmb(); ecc = in_be32(ndfc->ndfcbase + NDFC_ECC);
- /* The NDFC uses Smart Media (SMC) bytes order */
- ecc_code[0] = p[2];
- ecc_code[1] = p[1];
ecc_code[0] = p[1];
ecc_code[1] = p[2]; ecc_code[2] = p[3];
return 0;
Does anybody see a problem with my method of reproducing the bug? This bug is deadly for our customers. I don't want to make the change unless it is absolutely necessary.
Cheers, Sean _______________________________________________ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev

<snip>
With the current ndfc code, the error correction gets the bits wrong. Switching it back to the original way and the correction is correct.
diff --git a/drivers/mtd/nand/ndfc.c b/drivers/mtd/nand/ndfc.c index 89bf85a..497e175 100644 --- a/drivers/mtd/nand/ndfc.c +++ b/drivers/mtd/nand/ndfc.c @@ -101,9 +101,8 @@ static int ndfc_calculate_ecc(struct mtd_info *mtd,
wmb(); ecc = in_be32(ndfc->ndfcbase + NDFC_ECC);
- /* The NDFC uses Smart Media (SMC) bytes order */
- ecc_code[0] = p[2];
- ecc_code[1] = p[1];
- ecc_code[0] = p[1];
- ecc_code[1] = p[2];
ecc_code[2] = p[3];
return 0;
Does anybody see a problem with my method of reproducing the bug? This bug is deadly for our customers. I don't want to make the change unless it is absolutely necessary..
Just one question: did you enabled MTD_NAND_ECC_SMC in configs?
-vimal
Cheers, Sean
Linux MTD discussion mailing list http://lists.infradead.org/mailman/listinfo/linux-mtd/

On Fri, 21 Aug 2009 10:47:09 +0530 vimal singh vimal.newwork@gmail.com wrote:
Just one question: did you enabled MTD_NAND_ECC_SMC in configs?
It is automagically selected when you select the NDFC driver.
Cheers, Sean

On Friday 21 August 2009 07:17:09 vimal singh wrote:
diff --git a/drivers/mtd/nand/ndfc.c b/drivers/mtd/nand/ndfc.c index 89bf85a..497e175 100644 --- a/drivers/mtd/nand/ndfc.c +++ b/drivers/mtd/nand/ndfc.c @@ -101,9 +101,8 @@ static int ndfc_calculate_ecc(struct mtd_info *mtd,
wmb(); ecc = in_be32(ndfc->ndfcbase + NDFC_ECC);
/* The NDFC uses Smart Media (SMC) bytes order */
ecc_code[0] = p[2];
ecc_code[1] = p[1];
ecc_code[0] = p[1];
ecc_code[1] = p[2]; ecc_code[2] = p[3]; return 0;
Does anybody see a problem with my method of reproducing the bug? This bug is deadly for our customers. I don't want to make the change unless it is absolutely necessary..
Just one question: did you enabled MTD_NAND_ECC_SMC in configs?
Yes, MTD_NAND_ECC_SMC is selected via Kconfig for this driver.
Cheers, Stefan
-- DENX Software Engineering GmbH, MD: Wolfgang Denk & Detlev Zundel HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany Phone: (+49)-8142-66989-0 Fax: (+49)-8142-66989-80 Email: office@denx.de

Hi Vimal,
With the current ndfc code, the error correction gets the bits wrong. Switching it back to the original way and the correction is correct.
diff --git a/drivers/mtd/nand/ndfc.c b/drivers/mtd/nand/ndfc.c index 89bf85a..497e175 100644 --- a/drivers/mtd/nand/ndfc.c +++ b/drivers/mtd/nand/ndfc.c @@ -101,9 +101,8 @@ static int ndfc_calculate_ecc(struct mtd_info *mtd,
wmb(); ecc = in_be32(ndfc->ndfcbase + NDFC_ECC);
- /* The NDFC uses Smart Media (SMC) bytes order */
- ecc_code[0] = p[2];
- ecc_code[1] = p[1];
- ecc_code[0] = p[1];
- ecc_code[1] = p[2];
ecc_code[2] = p[3];
return 0;
Does anybody see a problem with my method of reproducing the bug? This bug is deadly for our customers. I don't want to make the change unless it is absolutely necessary..
Just one question: did you enabled MTD_NAND_ECC_SMC in configs?
Yes, it was set.
Best Regards,
Victor Gallardo

Hi Stefan:
We had a board with high number of correctable ECC errors. Which crashed the jffs when it was miss correcting the wrong byte location.
Do you want me to submit a patch for this, or do you prefer to do it. I am submitting a patch for linux right now.
Feng Kan AMCC Software
On 08/19/2009 10:01 PM, Stefan Roese wrote:
On Thursday 20 August 2009 06:38:51 Sean MacLennan wrote:
I see other boards using SMC as well, can someone comment on the change I am proposing. Should I change the correction algorithm or the calculate function? If the later is preferred it would mean the change must be pushed in both U-Boot and Linux.
Odds are the calculate function is wrong. The correction algo is used by many nand drivers, I *assume* it is correct. The calculate function was set to agree with u-boot (1.3.0).
Yes, it seems that you changed the order in the calculation function while reworking the NDFC driver for arch/powerpc. So we should probably change this order back to the original version. And change it in U-Boot as well.
BTW: I didn't see any problems with ECC so far with the current code. Feng, how did you spot this problem?
Cheers, Stefan
-- DENX Software Engineering GmbH, MD: Wolfgang Denk& Detlev Zundel HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany Phone: (+49)-8142-66989-0 Fax: (+49)-8142-66989-80 Email: office@denx.de

Hi Feng,
On Friday 21 August 2009 01:42:42 Feng Kan wrote:
We had a board with high number of correctable ECC errors. Which crashed the jffs when it was miss correcting the wrong byte location.
OK, thanks.
Do you want me to submit a patch for this, or do you prefer to do it.
Sure, please go ahead and send a patch to fix this in U-Boot as well.
Thanks.
Cheers, Stefan
-- DENX Software Engineering GmbH, MD: Wolfgang Denk & Detlev Zundel HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany Phone: (+49)-8142-66989-0 Fax: (+49)-8142-66989-80 Email: office@denx.de
participants (6)
-
Feng Kan
-
Sean MacLennan
-
Stefan Roese
-
Victor Gallardo
-
Victor Gallardo
-
vimal singh