Uncorrectable ECC error


Dimiter Dimitrov

Oct 20, 2016, 4:02:33 AM
to Alt-F
Hello, I have a D-Link DNS-320L and have been running Alt-F on it for 2 years. Now I notice these errors in dmesg:
__nand_correct_data: uncorrectable ECC error
__nand_correct_data: uncorrectable ECC error<3>end_request: I/O error, dev mtdblock5, sector 528
Buffer I/O error on device mtdblock5, logical block 66

When I try to dump all the mtd devices, there is a problem only with mtdblock5:

 mtdinfo -M /dev/mtd5
mtd5
Name:                           config
Type:                           nand
Eraseblock size:                131072 bytes, 128.0 KiB
Amount of eraseblocks:          40 (5242880 bytes, 5.0 MiB)
Minimum input/output unit size: 2048 bytes
Sub-page size:                  512 bytes
OOB size:                       64 bytes
Character device major/minor:   90:10
Bad blocks are allowed:         true
Device is writable:             true
Eraseblock map:
  0: 00000000          1: 00020000          2: 00040000    BAD   3: 00060000
  4: 00080000          5: 000a0000          6: 000c0000          7: 000e0000
  8: 00100000          9: 00120000         10: 00140000         11: 00160000
 12: 00180000         13: 001a0000         14: 001c0000         15: 001e0000
 16: 00200000         17: 00220000         18: 00240000         19: 00260000
 20: 00280000         21: 002a0000         22: 002c0000         23: 002e0000
 24: 00300000         25: 00320000         26: 00340000         27: 00360000
 28: 00380000         29: 003a0000         30: 003c0000         31: 003e0000
 32: 00400000         33: 00420000         34: 00440000         35: 00460000
 36: 00480000         37: 004a0000         38: 004c0000         39: 004e0000

 nanddump -f /mnt/sda2/Vkashti/dns-320l-nanddump-mtd5 /dev/mtd5
ECC failed: 6
ECC corrected: 0
Number of bad blocks: 1
Number of bbt blocks: 0
Block size 131072, page size 2048, OOB size 64
Dumping data starting at 0x00000000 and ending at 0x00500000...
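
To relate the failing sector in the dmesg lines to the eraseblock map above, the sector number can be converted to a byte offset. This is only a rough sketch: it assumes the block layer counts in 512-byte sectors, takes the eraseblock and page sizes from the mtdinfo output, and the helper name locate is made up for the illustration.

```python
SECTOR_SIZE = 512          # assumption: block-layer sectors are 512 bytes
ERASEBLOCK_SIZE = 131072   # "Eraseblock size" from the mtdinfo output
PAGE_SIZE = 2048           # "Minimum input/output unit size" from mtdinfo


def locate(sector):
    """Map a sector number to (byte offset, eraseblock index, page in block)."""
    offset = sector * SECTOR_SIZE
    block, within_block = divmod(offset, ERASEBLOCK_SIZE)
    return offset, block, within_block // PAGE_SIZE


offset, block, page = locate(528)
print(f"sector 528 -> offset 0x{offset:08x}, eraseblock {block}, page {page}")
```

If the 512-byte assumption holds, sector 528 sits at offset 0x00042000, which is inside eraseblock 2, the one the map above flags as BAD.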

Can I fix these errors, or is the memory of my device dying?
Can I flash back the original firmware, or will the device brick because of this error?

João Cardoso

Oct 20, 2016, 2:13:42 PM
to Alt-F


On Thursday, 20 October 2016 09:02:33 UTC+1, Dimiter Dimitrov wrote:
Hello, I have a D-Link DNS-320L and have been running Alt-F on it for 2 years. Now I notice these errors in dmesg:
__nand_correct_data: uncorrectable ECC error
__nand_correct_data: uncorrectable ECC error<3>end_request: I/O error, dev mtdblock5, sector 528
Buffer I/O error on device mtdblock5, logical block 66

There is another thread on this subject, "ecc error".

This seems to affect some, if not many, DNS-320L boxes. Mine also has those errors, but on mtd6:

blk_update_request: I/O error, dev mtdblock6, sector 3968
__nand_correct_data: uncorrectable ECC error
blk_update_request: I/O error, dev mtdblock6, sector 4080

Occasionally I have ECC errors on other NAND flash areas.
Of special concern are mtd0 (the most critical), mtd1 and mtd2 (without them the "system" does not boot), and, to a lesser extent, mtd3 and mtd5:

[root@DNS-320L]# cat /proc/mtd 
dev:    size   erasesize  name
mtd0: 00100000 00020000 "u-boot" # boot loader
mtd1: 00500000 00020000 "uImage" # linux kernel
mtd2: 00500000 00020000 "ramdisk" # linux root filesystem
mtd3: 06400000 00020000 "image" # additional software
mtd4: 00a00000 00020000 "mini firmware" # Not used by Alt-F, used by D-Link recovery
mtd5: 00500000 00020000 "config" # "save settings" area
mtd6: 00200000 00020000 "my-dlink" # not used by Alt-F, but used by D-Link
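
The size and erasesize columns above are hexadecimal byte counts. As a minimal sketch, they can be converted to MiB and eraseblock counts in Python. The helper name parse_proc_mtd is hypothetical, and the listing is pasted in as a string instead of being read from the box.

```python
import re

# The listing from above, without the trailing annotations.
PROC_MTD = '''dev:    size   erasesize  name
mtd0: 00100000 00020000 "u-boot"
mtd1: 00500000 00020000 "uImage"
mtd2: 00500000 00020000 "ramdisk"
mtd3: 06400000 00020000 "image"
mtd4: 00a00000 00020000 "mini firmware"
mtd5: 00500000 00020000 "config"
mtd6: 00200000 00020000 "my-dlink"
'''

LINE = re.compile(r'^(mtd\d+):\s+([0-9a-fA-F]+)\s+([0-9a-fA-F]+)\s+"([^"]*)"')


def parse_proc_mtd(text):
    """Return (dev, name, size_bytes, erasesize_bytes, eraseblocks) per partition."""
    partitions = []
    for line in text.splitlines():
        match = LINE.match(line)
        if not match:
            continue  # skips the header line
        dev, size_hex, erase_hex, name = match.groups()
        size = int(size_hex, 16)
        erasesize = int(erase_hex, 16)
        partitions.append((dev, name, size, erasesize, size // erasesize))
    return partitions


for dev, name, size, erasesize, blocks in parse_proc_mtd(PROC_MTD):
    print(f"{dev} {name!r}: {size // 2**20} MiB, "
          f"{blocks} eraseblocks of {erasesize // 1024} KiB")
```

For mtd5 this gives 5 MiB in 40 eraseblocks of 128 KiB, which agrees with the mtdinfo output quoted earlier; the seven partitions together add up to 128 MiB.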
 

When I try to dump all the mtd devices, there is a problem only with mtdblock5:


All NAND flash chips have errors; that in itself is of no concern. Some "sectors" (128 KB eraseblocks) are marked as BAD at the factory, and there is a certain number of spares that are used when a new bad block is (automatically) detected, just as happens on a disk drive.
If the number of bad blocks starts increasing, that would be of concern, as soon no spares would be left and writing to the flash chip would fail.

Flash read errors are worse, as they can mean that the system will not boot, or will have random problems at boot.
Just like in disk drives, the data read is checked against its ECC checksum, and depending on the number of bits in error it might or might not be corrected.

Can I fix these errors

No.
 
or is the memory of my device dying?

If the frequency of the errors or the number of bad blocks increases, it's a sure sign of a dying chip.
 
Can I flash back the original firmware

Yes, but that is not going to cure the nand flash chip -- that's a hardware problem.

The more often you write to the flash chip (flashing new firmware), the more likely it is that the issue will become fatal.
The number of NAND write cycles is limited, but it is also fairly high -- smart phones, tablets, SD cards, USB sticks, SSDs, etc. all rely on NAND chips, and they can survive several years of normal usage, so that is not of special concern (we all accept that disk drives will sooner or later fail, so it's the same question).
The issue is the "quality" of the NAND chips that manufacturers put in their equipment, and from the reports it looks like the ones in the DNS-320L are borderline.

or will the device brick because of this error?

If you search the forum or the net, you will find reports like "flashing the firmware went bad, now I have a brick" -- it happens.
I avoid flashing firmware as much as I can; all my development tests are done in RAM, and only when preparing a new release do I flash (normally 3/6 times) the new and the old firmware. And mtd0, the most critical one, is not touched at all by Alt-F, so even if a "bad flash" happens, a serial adapter might enable recovery.
 
To conclude: don't be alarmed, life is dangerous.

Dimiter Dimitrov

Oct 21, 2016, 3:35:42 AM
to Alt-F
Thank you very much for the post. I was asking because I need a reliable NAS, and if there is a chance of problems with this one I need to find a solution. This box was not very good at all, it disappointed me, and if these errors increase, maybe I will get a real server as a NAS (an HP MicroServer or something like it). For now the errors are logged only when I access some of the configs, and I do not need those very often.

João Cardoso

Oct 22, 2016, 12:18:20 PM
to Alt-F


On Friday, 21 October 2016 08:35:42 UTC+1, Dimiter Dimitrov wrote:
Thank you very much for the post. I was asking because I need a reliable NAS, and if there is a chance of problems with this one I need to find a solution. This box was not very good at all, it disappointed me, and if these errors increase, maybe I will get a real server as a NAS (an HP MicroServer or something like it). For now the errors are logged only when I access some of the configs, and I do not need those very often.

I have meanwhile done some more reading on the Linux MTD site, which is rather technical, and performed some tests on my DNS-320L rev A1 box, which also shows some NAND errors.

Summarizing: there are single-bit errors on a block that can be automatically corrected by the error-correcting code (ECC), and those can be bit-flip errors, read-disturbance errors (even on *another* "partition"), and others. Only when the number of errors reaches a certain threshold will the block be marked as bad.

On my box during boot (probably when "settings" from mtd5 are read), I get errors on mtd6:

__nand_correct_data: uncorrectable ECC error
__nand_correct_data: uncorrectable ECC error
blk_update_request: I/O error, dev mtdblock6, sector 3968
__nand_correct_data: uncorrectable ECC error
blk_update_request: I/O error, dev mtdblock6, sector 4080

Those are clearly "read-disturbance" errors, as mtd6 is never used for any purpose. The errors are also not single-bit, as they are uncorrectable: the ECC can only correct a single bit in error, and reports an error when more bits are in error.
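
To illustrate why one flipped bit can be repaired while two can only be reported, here is a minimal Python sketch of a Hamming-style position-parity code. It is a simplified stand-in, not the kernel's actual __nand_correct_data code: the function names are made up for the illustration, the check values are kept as two integers instead of packed ECC bytes, and damage to the stored check values themselves is simply reported as uncorrectable.

```python
def position_mask(nbits):
    """All-ones mask wide enough to hold any bit position below nbits."""
    return (1 << max(1, (nbits - 1).bit_length())) - 1


def compute_ecc(data):
    """XOR together the positions (and inverted positions) of all 1-bits."""
    nbits = len(data) * 8
    mask = position_mask(nbits)
    parity = 0
    parity_inv = 0
    for pos in range(nbits):
        if (data[pos // 8] >> (pos % 8)) & 1:
            parity ^= pos
            parity_inv ^= ~pos & mask
    return parity, parity_inv


def check_and_correct(data, stored_ecc):
    """Compare data (a bytearray) with its stored ECC; fix a single bitflip in place."""
    mask = position_mask(len(data) * 8)
    parity, parity_inv = compute_ecc(data)
    diff = parity ^ stored_ecc[0]
    diff_inv = parity_inv ^ stored_ecc[1]
    if diff == 0 and diff_inv == 0:
        return "ok"
    if (diff ^ diff_inv) == mask:
        # One flipped bit: diff is its position, diff_inv the inverted position.
        data[diff // 8] ^= 1 << (diff % 8)
        return "corrected"
    # Two flipped bits make diff equal diff_inv, so no position can be recovered.
    return "uncorrectable"


page = bytearray(range(256))
ecc = compute_ecc(page)

one_flip = bytearray(page)
one_flip[10] ^= 0x04

two_flips = bytearray(page)
two_flips[10] ^= 0x04
two_flips[200] ^= 0x01

print(check_and_correct(bytearray(page), ecc))
print(check_and_correct(one_flip, ecc), one_flip == page)
print(check_and_correct(two_flips, ecc))
```

With a single flip the two differences are exact bitwise complements, which both signals "one bit" and gives its position. With two flips that relation breaks, so the code can only say that the data is wrong.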

So I ran 'nandtest' several times on mtd6, and it runs fine, without any errors... The same nandtest on mtd5 (where settings are saved) didn't reveal anything special.
But 'nanddump' on mtd5/6 sometimes reports ECC-corrected errors and sometimes uncorrectable errors:

[root@DNS-320L]# nanddump -f /dev/null /dev/mtd6
ECC failed: 19
ECC corrected: 0
... 

[root@DNS-320L]# nanddump -f /dev/null /dev/mtd6
ECC failed: 19
ECC corrected: 0
...
ECC: 8 uncorrectable bitflip(s) at offset 0x00000000
ECC: 8 uncorrectable bitflip(s) at offset 0x00000800

[root@DNS-320L]# nanddump -f /dev/null /dev/mtd6
ECC failed: 35
ECC corrected: 0
...

That sequence shows that the initial nanddump report is the accumulated stats for the device.
Is ECC working? Yes, as the following tests on mtd5 show:

[root@DNS-320L]# nanddump -f /dev/null /dev/mtd5
ECC failed: 0
ECC corrected: 0
...
ECC: 1 corrected bitflip(s) at offset 0x00000000
ECC: 1 corrected bitflip(s) at offset 0x00000800
 
[root@DNS-320L]# nanddump -f /dev/null /dev/mtd5
ECC failed: 0
ECC corrected: 2
... 

So, for my box, at this moment, ECC can, as expected, correct single bitflips on mtd5, and errors in mtd6 are also induced by reading from mtd5 -- that's the so-called "read-disturbance".
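
The cumulative reading of the nanddump headers can be made explicit by differencing the counters between consecutive runs. A small sketch in Python, with the header lines copied from the runs above and hypothetical helper names:

```python
import re


def ecc_counters(header_text):
    """Extract the 'ECC failed' and 'ECC corrected' counters from a nanddump header."""
    failed = int(re.search(r"ECC failed:\s*(\d+)", header_text).group(1))
    corrected = int(re.search(r"ECC corrected:\s*(\d+)", header_text).group(1))
    return failed, corrected


def increments(values):
    """Differences between consecutive cumulative counter readings."""
    return [after - before for before, after in zip(values, values[1:])]


# Headers of the three mtd6 runs and the two mtd5 runs shown above.
mtd6_runs = [
    "ECC failed: 19\nECC corrected: 0",
    "ECC failed: 19\nECC corrected: 0",
    "ECC failed: 35\nECC corrected: 0",
]
mtd5_runs = [
    "ECC failed: 0\nECC corrected: 0",
    "ECC failed: 0\nECC corrected: 2",
]

mtd6_failed = [ecc_counters(run)[0] for run in mtd6_runs]
mtd5_corrected = [ecc_counters(run)[1] for run in mtd5_runs]

print("mtd6 new failures between runs:", increments(mtd6_failed))
print("mtd5 new corrections between runs:", increments(mtd5_corrected))
```

The increments line up with the per-offset messages printed by the preceding run: 8 + 8 uncorrectable bitflips on mtd6 account for the jump from 19 to 35, and 1 + 1 corrected bitflips on mtd5 account for the jump from 0 to 2.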

Notice that mtd1/2 (essential for booting Alt-F) present no errors, and mtd3 (used but not essential for Alt-F) has 7 uncorrectable errors (which didn't grow during the "session").

I also noticed that mtd0 (essential for booting the box) has many errors, but as it is loaded by the SoC eeprom code (the primary bootloader), it is possible that a different ECC technique is used for it -- I read somewhere that the first blocks of a NAND are specially rugged, as they usually contain the secondary bootloader (u-boot) and are absolutely essential.

I also performed 'nandtest -k -m' on mtd1/2/5/6, which not only reads the NAND but also writes to it, but oddly it showed nothing special. It restores the initial content only if '-k' is used and there is no power cut, so ***NEVER*** use it on mtd0, and only use it on mtd1/2 if you have soldered a serial adapter on the box. nandtest could possibly be used to simulate several firmware flashes, stress-testing the NAND, if that is in any way desired.

I have also created JFFS2 filesystems on mtd5/mtd6, as that represents a more real-world NAND usage, but again that didn't change things.
By the way, 'loadsave_settings -fm' does that on mtd5; just be sure to "save settings" ('loadsave_settings -sf') immediately afterwards.
The DNS-327L uses UBIFS instead of JFFS2, which seems to be more robust, but I don't think that makes a difference regarding bitflips and read-disturbances (although I seem to remember that the 327L uses a more powerful ECC algorithm, capable of correcting more than a single bit error per block).

To conclude: if your box is "mission critical", it is better to get a substitute.
In your case (and mine), only mtd5 seems to be at risk, and that is where "settings" are saved; as the box boots OK without your customized settings, if mtd5 starts to be unreliable, the only inconvenience is having to load the settings from a backup on a PC after a reboot or power cut.

Sorry for being so technical (and oversimplified), and for providing no real help other than saying "you are not alone".