Uncorrectable ECC error

Dimiter Dimitrov

Oct 20, 2016, 4:02:33 AM

to Alt-F

Hello, I have a D-Link DNS-320L and have been running Alt-F on it for two years. I now notice these errors in dmesg:

__nand_correct_data: uncorrectable ECC error

__nand_correct_data: uncorrectable ECC error<3>end_request: I/O error, dev mtdblock5, sector 528

Buffer I/O error on device mtdblock5, logical block 66

When I dump all the mtd devices, only mtdblock5 shows a problem:

mtdinfo -M /dev/mtd5

mtd5

Name: config

Type: nand

Eraseblock size: 131072 bytes, 128.0 KiB

Amount of eraseblocks: 40 (5242880 bytes, 5.0 MiB)

Minimum input/output unit size: 2048 bytes

Sub-page size: 512 bytes

OOB size: 64 bytes

Character device major/minor: 90:10

Bad blocks are allowed: true

Device is writable: true

Eraseblock map:

0: 00000000 1: 00020000 2: 00040000 BAD 3: 00060000

4: 00080000 5: 000a0000 6: 000c0000 7: 000e0000

8: 00100000 9: 00120000 10: 00140000 11: 00160000

12: 00180000 13: 001a0000 14: 001c0000 15: 001e0000

16: 00200000 17: 00220000 18: 00240000 19: 00260000

20: 00280000 21: 002a0000 22: 002c0000 23: 002e0000

24: 00300000 25: 00320000 26: 00340000 27: 00360000

28: 00380000 29: 003a0000 30: 003c0000 31: 003e0000

32: 00400000 33: 00420000 34: 00440000 35: 00460000

36: 00480000 37: 004a0000 38: 004c0000 39: 004e0000

nanddump -f /mnt/sda2/Vkashti/dns-320l-nanddump-mtd5 /dev/mtd5

ECC failed: 6

ECC corrected: 0

Number of bad blocks: 1

Number of bbt blocks: 0

Block size 131072, page size 2048, OOB size 64

Dumping data starting at 0x00000000 and ending at 0x00500000...
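
As a rough cross-check, the sector number in the dmesg line can be converted to a position in the eraseblock map above. The sketch below assumes the usual 512-byte block-layer sector and takes the eraseblock and page sizes from the mtdinfo output; `locate` is a hypothetical helper name, not part of any tool.

```python
SECTOR_SIZE = 512          # assumption: standard Linux block-layer sector size
ERASEBLOCK_SIZE = 131072   # "Eraseblock size" from the mtdinfo output
PAGE_SIZE = 2048           # "Minimum input/output unit size" from mtdinfo


def locate(sector):
    """Map a sector number to (byte offset, eraseblock index, page within block)."""
    offset = sector * SECTOR_SIZE
    eraseblock, within_block = divmod(offset, ERASEBLOCK_SIZE)
    return offset, eraseblock, within_block // PAGE_SIZE


offset, eraseblock, page = locate(528)
print(f"sector 528 -> offset 0x{offset:08x}, eraseblock {eraseblock}, page {page}")
# prints: sector 528 -> offset 0x00042000, eraseblock 2, page 4
```

Offset 0x00042000 lies inside eraseblock 2 (starting at 0x00040000), the one the map flags as BAD. Whether reading that block through mtdblock5 is what triggers the logged errors is only a guess from these numbers.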

Can I fix these errors, or is the flash memory of my device dying?

Can I flash back the original firmware, or will the device brick because of these errors?

João Cardoso

Oct 20, 2016, 2:13:42 PM

to Alt-F

On Thursday, 20 October 2016 09:02:33 UTC+1, Dimiter Dimitrov wrote:

Hello, I have a D-Link DNS-320L and have been running Alt-F on it for two years. I now notice these errors in dmesg:
__nand_correct_data: uncorrectable ECC error
__nand_correct_data: uncorrectable ECC error<3>end_request: I/O error, dev mtdblock5, sector 528
Buffer I/O error on device mtdblock5, logical block 66

There is another thread on this subject, "ecc error".

This seems to affect some, perhaps many, DNS-320L boxes. My own also has those errors, but on mtd6:

blk_update_request: I/O error, dev mtdblock6, sector 3968
__nand_correct_data: uncorrectable ECC error
blk_update_request: I/O error, dev mtdblock6, sector 4080

Occasionally I have ECC errors on other NAND flash areas.

Errors are of special concern on mtd0 (the most critical), on mtd1 and mtd2 (without them the "system" does not boot), and to a lesser extent on mtd3 and mtd5:

[root@DNS-320L]# cat /proc/mtd
dev: size erasesize name
mtd0: 00100000 00020000 "u-boot" # boot loader
mtd1: 00500000 00020000 "uImage" # linux kernel
mtd2: 00500000 00020000 "ramdisk" # linux root filesystem
mtd3: 06400000 00020000 "image" # additional software
mtd4: 00a00000 00020000 "mini firmware" # Not used by Alt-F, used by D-Link recovery
mtd5: 00500000 00020000 "config" # "save settings" area
mtd6: 00200000 00020000 "my-dlink" # not used by Alt-F, but used by D-Link
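
The size and erasesize columns are hexadecimal byte counts. The following is a minimal sketch that turns such a listing into sizes in MiB and eraseblock counts. The function name `parse_proc_mtd` is made up for illustration, and the sample text is the listing above with the explanatory comments removed.

```python
def parse_proc_mtd(text):
    """Parse /proc/mtd style text into {dev: {"name", "size", "erasesize"}}."""
    parts = {}
    for line in text.splitlines():
        if not line.startswith("mtd"):
            continue  # skip the "dev: size erasesize name" header
        dev, size, erasesize, rest = line.split(None, 3)
        parts[dev.rstrip(":")] = {
            "name": rest.split('"')[1],      # text between the first pair of quotes
            "size": int(size, 16),           # bytes, hexadecimal in the listing
            "erasesize": int(erasesize, 16),
        }
    return parts


SAMPLE = '''dev: size erasesize name
mtd0: 00100000 00020000 "u-boot"
mtd1: 00500000 00020000 "uImage"
mtd2: 00500000 00020000 "ramdisk"
mtd3: 06400000 00020000 "image"
mtd4: 00a00000 00020000 "mini firmware"
mtd5: 00500000 00020000 "config"
mtd6: 00200000 00020000 "my-dlink"
'''

parts = parse_proc_mtd(SAMPLE)
for dev, p in parts.items():
    blocks = p["size"] // p["erasesize"]
    print(f'{dev} {p["name"]:<14} {p["size"] / 2**20:6.1f} MiB {blocks:4d} eraseblocks')
print("total:", sum(p["size"] for p in parts.values()) // 2**20, "MiB")
```

For mtd5 this gives 40 eraseblocks of 128 KiB, which agrees with the mtdinfo output in the first post.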

As i try to dump all mtd devices, there is problem only with mtdblock5:

All NAND flash chips have errors; that in itself is of no concern. There are "sectors" (128 KiB eraseblocks) marked as BAD at the factory, and a certain number of spares that will be used when a new bad block is (automatically) detected, just as happens on a disk drive.

If the number of bad blocks starts increasing, that would be of concern, as soon no spares will remain available and writing to the flash chip will fail.

Flash read errors are worse, as they mean that the system might not boot, or might have random problems at boot.

Just like in disk drives, the data read is checked against its ECC checksum, and depending on the number of bits in error it might or might not be corrected.

Can I fix this errors

No.

or the memory of my device is dying?

If the frequency of the errors or the number of bad blocks increases, it's a sure sign of a dying chip.
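
One way to watch for such an increase is to record the per-partition counters from time to time and compare them. The sketch below assumes the kernel exposes `bad_blocks`, `ecc_failures` and `corrected_bits` under `/sys/class/mtd/mtdN/`; older kernels, possibly including the one on this box, may not provide them, in which case the nanddump summary lines give the same numbers. `read_counters` and `growth` are hypothetical helper names.

```python
from pathlib import Path

# Assumption: these sysfs attributes exist on the running kernel.
COUNTERS = ("bad_blocks", "ecc_failures", "corrected_bits")


def read_counters(mtd_dir):
    """Read the counters found in one mtdN sysfs directory; missing files are skipped."""
    result = {}
    for name in COUNTERS:
        path = Path(mtd_dir) / name
        if path.is_file():
            result[name] = int(path.read_text().strip())
    return result


def growth(old, new):
    """Return only the counters that increased between two snapshots."""
    return {k: new[k] - old[k] for k in new if k in old and new[k] > old[k]}


if __name__ == "__main__":
    root = Path("/sys/class/mtd")
    if root.is_dir():
        for mtd_dir in sorted(root.glob("mtd[0-9]")):
            print(mtd_dir.name, read_counters(mtd_dir))
```

Saving the printed snapshot after each boot and feeding two of them to `growth` shows at a glance whether any partition is getting worse.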

Can I flash back the original firmware

Yes, but that is not going to cure the nand flash chip -- that's a hardware problem.

The more often you write to the flash chip (flashing new firmware), the more probable it is that the issue will become fatal.

The number of NAND write cycles is limited but it is also fairly high -- smart phones, tablets, SD cards, USB sticks, SSD disks, etc, all rely on NAND chips, and they can survive several years of normal usage, so that is not of special concern (we all also accept that disk drives will sooner or later fail, so it's the same question).

The issue is the "quality" of the NAND chips that manufacturers put in their equipment, and from the reports it looks like the ones in the DNS-320L are borderline.

or the device will brick because of this error?

If you search the forum or the net, you will find that "flashing the firmware went bad, now I have brick" -- it happens.

I avoid flashing firmware as much as possible; all my development tests are done in RAM, and only when preparing a new release do I flash (normally 3/6 times) the new and the old firmware. And mtd0, the most critical one, is not touched at all by Alt-F, so that even if a "bad flash" happens a serial adapter might enable recovery.

To conclude: don't be alarmed, life is dangerous.

Dimiter Dimitrov

Oct 21, 2016, 3:35:42 AM

to Alt-F

Thank you very much for the post. I was asking because I need a reliable NAS, and if there is a chance of problems with this one I need to find a solution. This box was not very good at all, it disappointed me, and if these errors become more frequent I may get a real server to use as a NAS (an HP MicroServer or something like it). For now the errors are logged only when I access some of the configs, and I do not need those very often.

João Cardoso

Oct 22, 2016, 12:18:20 PM

to Alt-F

On Friday, 21 October 2016 08:35:42 UTC+1, Dimiter Dimitrov wrote:

Thank you very much for the post. I was asking because I need a reliable NAS, and if there is a chance of problems with this one I need to find a solution. This box was not very good at all, it disappointed me, and if these errors become more frequent I may get a real server to use as a NAS (an HP MicroServer or something like it). For now the errors are logged only when I access some of the configs, and I do not need those very often.

I have meanwhile done some more reading on the Linux MTD site, which is rather technical, and performed some tests on my DNS-320L rev A1 box, which also shows some NAND errors.

Summarizing: there are single-bit errors on a block that can be automatically corrected by the error-correcting code (ECC), and those can be bit-flip errors, read-disturbance errors (even on *another* "partition"), and others. Only when the number of errors reaches a certain threshold will the block be marked as bad.

On my box during boot (probably when "settings" from mtd5 are read), I get errors on mtd6:

__nand_correct_data: uncorrectable ECC error
__nand_correct_data: uncorrectable ECC error

blk_update_request: I/O error, dev mtdblock6, sector 3968
__nand_correct_data: uncorrectable ECC error
blk_update_request: I/O error, dev mtdblock6, sector 4080

Those are clearly "read-disturbance" errors, as mtd6 is never used for any purpose. The errors are also not single-bit ones: they are uncorrectable, and the ECC can only correct a single bit in error, reporting an error when more bits are wrong.

So I ran 'nandtest' several times on mtd6, and it ran fine, without any errors... The same nandtest on mtd5 (where settings are saved) didn't reveal anything special.
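
To illustrate the "correct one bit, report more" behaviour, here is a toy extended Hamming code in Python. It is only a sketch of the principle: the bit layout and the names `encode` and `check` are invented for illustration and are not the layout or code that `__nand_correct_data` uses in the kernel.

```python
def _syndrome(code):
    """XOR of the positions (1..n) of all set bits; 0 for a valid codeword."""
    s = 0
    for pos in range(1, len(code)):
        if code[pos]:
            s ^= pos
    return s


def encode(data_bits):
    """Build an extended Hamming codeword: index 0 is overall parity,
    power-of-two indices are parity bits, the rest carry data."""
    n, r = len(data_bits), 0
    while (1 << r) < n + r + 1:
        r += 1
    code = [0] * (n + r + 1)
    bits = iter(data_bits)
    for pos in range(1, len(code)):
        if pos & (pos - 1):          # not a power of two: data position
            code[pos] = next(bits)
    s = _syndrome(code)
    for i in range(r):
        code[1 << i] = (s >> i) & 1  # parity bits cancel the syndrome
    code[0] = sum(code) % 2          # make the total number of ones even
    return code


def check(code):
    """Return (status, codeword), status being ok, corrected or uncorrectable."""
    s = _syndrome(code)
    parity_ok = sum(code) % 2 == 0
    if s == 0 and parity_ok:
        return "ok", list(code)
    if not parity_ok and s < len(code):
        fixed = list(code)
        fixed[s] ^= 1                # odd parity: treat as one flipped bit
        return "corrected", fixed
    return "uncorrectable", list(code)


word = encode([1, 0, 1, 1])
one_flip = list(word); one_flip[5] ^= 1
two_flips = list(one_flip); two_flips[6] ^= 1
print(check(word)[0], check(one_flip)[0], check(two_flips)[0])
# prints: ok corrected uncorrectable
```

With two flipped bits the overall parity looks fine while the syndrome is non-zero, so the code can tell that something is wrong but not where.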

But 'nanddump' on mtd5/6 sometimes reports corrected ECC errors and sometimes uncorrectable errors:

[root@DNS-320L]# nanddump -f /dev/null /dev/mtd6
ECC failed: 19
ECC corrected: 0

...

[root@DNS-320L]# nanddump -f /dev/null /dev/mtd6
ECC failed: 19
ECC corrected: 0
...
ECC: 8 uncorrectable bitflip(s) at offset 0x00000000
ECC: 8 uncorrectable bitflip(s) at offset 0x00000800

[root@DNS-320L]# nanddump -f /dev/null /dev/mtd6
ECC failed: 35
ECC corrected: 0

...

That sequence shows that the nanddump initial report is the accumulated stats for the device.
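
The arithmetic behind that observation can be checked with a small sketch: the "ECC failed" header of one run should equal the previous header plus the uncorrectable bitflips reported in between. Only the lines visible in the three runs above are used (the elided "..." parts are left out), and the helper names are made up for illustration.

```python
import re


def header_failed(output):
    """Value of the 'ECC failed' summary line printed when nanddump starts."""
    return int(re.search(r"ECC failed: (\d+)", output).group(1))


def reported_uncorrectable(output):
    """Sum of the per-page 'uncorrectable bitflip(s)' counts in one run."""
    return sum(int(n) for n in re.findall(r"ECC: (\d+) uncorrectable bitflip", output))


runs = [
    "ECC failed: 19\nECC corrected: 0\n",
    "ECC failed: 19\nECC corrected: 0\n"
    "ECC: 8 uncorrectable bitflip(s) at offset 0x00000000\n"
    "ECC: 8 uncorrectable bitflip(s) at offset 0x00000800\n",
    "ECC failed: 35\nECC corrected: 0\n",
]

for prev, nxt in zip(runs, runs[1:]):
    expected = header_failed(prev) + reported_uncorrectable(prev)
    print(expected, header_failed(nxt), expected == header_failed(nxt))
# prints: 19 19 True
#         35 35 True
```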

Is ECC working? Yes, as the following tests on mtd5 show:

[root@DNS-320L]# nanddump -f /dev/null /dev/mtd5
ECC failed: 0
ECC corrected: 0
...
ECC: 1 corrected bitflip(s) at offset 0x00000000
ECC: 1 corrected bitflip(s) at offset 0x00000800

[root@DNS-320L]# nanddump -f /dev/null /dev/mtd5
ECC failed: 0
ECC corrected: 2

...

So, for my box, at this moment, ECC can as expected correct single bitflips on mtd5, and errors in mtd6 are also induced by reading from mtd5 -- that's the so-called "read-disturbance".

Notice that mtd1/2 (essential for booting Alt-F) present no errors, and mtd3 (used but not essential for Alt-F) has 7 uncorrectable errors (which didn't grow during the "session").

I also notice that mtd0 (essential for booting the box) has many errors, but as it is loaded by the SoC eeprom code (the primary bootloader), it is possible that it uses a different ECC technique -- I read somewhere that the first blocks of a NAND chip are specially rugged, as they usually contain the secondary bootloader (u-boot) and are absolutely essential.

I also performed 'nandtest -k -m' on mtd1/2/5/6, which not only reads the NAND but also writes to it (and restores its initial content if '-k' is used and there is no power cut, so ***NEVER*** use it on mtd0 and only use it on mtd1/2 if you have soldered a serial adapter on the box), but oddly it shows nothing special. nandtest could also be used to simulate several firmware flashes, stress-testing the NAND, if that is in any way desired.

I have also created JFFS2 filesystems on mtd5/mtd6, as that represents a more real world nand usage, but again that didn't change things.

By the way, 'loadsave_settings -fm' does that on mtd5, just be sure to "save settings" ('loadsave_settings -sf') immediately afterwards.

The DNS-327L uses UBIFS instead of JFFS2, which seems to be more robust, but I don't think that makes a difference regarding bitflips and read-disturbances (though I seem to remember that the 327L uses a more powerful ECC algorithm, capable of correcting more than a single bit error per block).

To conclude: if your box is "mission critical", it is better to get a substitute.

In your case (and mine), only mtd5 seems to be at risk, and that is where "settings" are saved; as the box boots OK without your customized settings, if mtd5 starts to be unreliable there is only the inconvenience of having to load settings from a backup on a PC after a reboot or power cut.

Sorry for being so technical (and oversimplifying at that) and for providing no real help other than saying "you are not alone".
